TAG Operational Resilience Charter

Review and contributions from: Rafa Brito, Mario Fahlandt, Nabarun Pal, Saiyam Pathak, Raffaele Spazzoli and Matt Young

Introduction

In today's digital landscape, the resilience and efficient operation of cloud native systems are both a technical requirement and a necessity for widespread adoption. As organizations globally adopt and deploy on cloud native architectures to deliver their services, downtime or performance degradation can have widespread ramifications if it is not addressed and duly considered in the projects being adopted for production-level use. This Technical Advisory Group (TAG) is established to address the complex challenges of maintaining system availability, performance, and reliability. By focusing on "Day 2 Operations," the period after a system is deployed, this TAG aims to provide the community with the principles, practices, and guidance needed to build and operate robust and efficient cloud native services. The TAG serves as an extension of the CNCF TOC for this domain area and supports the TOC with project due diligence, additional reviews, and requests within the scope of the TAG.

Mission Statement

To define and advance practices and standards for building, operating, adopting and managing resilient, observable, and efficient cloud native systems, applications, and architectures beyond their initial deployment.

To accomplish this in alignment with the TOC's guidance, we will adopt its specific Goals for Technical Advisory Groups:

  • Strengthen the project ecosystem to meet the needs of end users and project contributors.
  • Serve as a “bridge” between projects to surface common problems and drive alignment.
  • Identify gaps in the CNCF project portfolio. Find and attract projects to fill these gaps.
  • Reduce some project workload on TOC while retaining executive control & tonal integrity with this elected body.
  • Provide technical expertise during the evaluation of projects to join the CNCF and as they move levels.
  • Educate and inform users with unbiased, effective, and practically useful information scoped to cloud native.
  • Focus attention & resources on helping foster project maturity, systematically across CNCF projects.

Responsibilities & Deliverables

In-Scope

This TAG covers the following sub-domains and topics within its scope:

  • Observability: Monitoring, logging, and tracing to gain deep insights into system behavior and health.
  • Management: The processes and tools essential for managing the complete lifecycle of cloud native applications and infrastructure.
  • Operational Continuity: Strategies and practices to ensure that technical operations can continue in the event of a disruption.
  • Resource Optimization: The efficient use of computational, storage, and network resources to balance performance and cost.
    • Cost Efficiency: Methodologies for optimizing expenditures without compromising system reliability or performance.
    • Energy Efficiency: Focusing on and providing guidance for reducing the energy consumption of cloud native services.
  • Performance: Ensuring cloud native applications and systems execute their functions consistently and efficiently, optimizing resource use while maintaining overall reliability and stability under varying conditions.
  • Troubleshooting: The systematic process of identifying, diagnosing, and resolving issues within a distributed system.
  • Testing: Verifying the reliability and resilience of applications.
    • Reliability: The ability of a system to perform its required functions under stated conditions for a specified period of time.
    • Day 2 Ops: A focus on the post-deployment activities required to keep a system running smoothly, including maintenance, scaling, updating and upgrading.
    • Chaos Engineering: The practice of experimenting on a system to build confidence in its capability to withstand turbulent and unexpected conditions.

Out of Scope

  • Topics in scope of other TAGs (collaboration with other TAGs on such topics, where they touch on this TAG's charter, is in scope)
  • Form an umbrella organization beyond the CNCF
  • Evaluate individual company infrastructures and provide advice
  • Focus outside of cloud native technologies, according to the CNCF Cloud Native definition
  • Create certifications for conformance

Deliverables to TOC

The anticipated deliverables for TAG Operational Resilience include:

  • Subprojects: Ongoing services or programs requiring stewardship from the TAG. Examples include:
    • Production Readiness Assessments
    • Resource Optimization and Green Reviews
  • Initiatives: Time-bound, objective-focused units of work submitted to the TOC. Examples may include:
    • Specific assessments for Resource Optimization or Cost Efficiency best practices
    • Guidelines for performance testing best practices
  • Whitepapers and Guides: In-depth documents that provide frameworks and guidelines consistent with advancing practices in the TAG's scope. These serve projects, other TAGs, and the TOC.
  • Landscape Analysis: A curated overview of the projects, tools, and standards within the operational resilience ecosystem, used to identify gaps and opportunities, inform adopters of useful and novel ways of addressing use cases, and develop practices and guidance for enterprise scale.
  • Knowledge Gap Identification: Reports and presentations to the TOC on identified gaps within the cloud native ecosystem related to operational resilience.
  • Suggestions for improvements: Recommend improvements for CNCF internal processes. Examples include suggestions for:
    • New educational initiatives, such as for sustainability.
  • Project reviews and due diligence: The TAG will support the TOC with due diligence activities within the scope of the TAG.

Audiences

  • Adopters, Developers, and Operators: Adopters, developers, and system operators will benefit from educational materials, best practices, and tools that help them build and maintain more resilient systems.
  • CNCF Projects & Community: The TAG will provide project intelligence and support to the CNCF TOC and the broader community, helping to align projects and initiatives.
  • External Organizations: The TAG will engage in collaboration with external organizations, initiatives, and efforts to share knowledge and advance the state of operational resilience.

Coordination

The TAG Operational Resilience will coordinate with various stakeholders within the CNCF ecosystem:

  • CNCF Projects: The TAG provides services to projects, and its work helps align projects within the CNCF ecosystem. This coordination ensures alignment across the foundation and provides pathways for community-focused initiatives to be supported.
  • Other TAGs: Coordination is essential, as TAGs serve the needs of multiple projects and other TAGs.
  • TOC Subprojects: TAG leadership participates in TOC Subprojects like Project Reviews and Contributor Strategy.
  • Community Groups: Community Groups are encouraged to discuss initiative ideas and may submit applications for initiatives within a TAG.

Success Criteria

Success is based on the effective establishment and operation of the TAG's Subprojects and Initiatives. This includes the creation and dissemination of valuable resources, such as best practices and assessments. The TAG's ability to maintain focus on its defined scope and align with TOC requirements is crucial. Success also involves recruiting new leadership and community members to support and drive the TAG's mission, as well as making incremental progress in identifying and addressing knowledge gaps within the ecosystem.

Alignment with the CNCF TOC Charter

The TAG Operational Resilience charter is directly aligned with the CNCF TOC charter. The TOC is the technical governing body responsible for maintaining the technical vision and driving standard practices across projects. The TOC's vision is problem-centric, encouraging projects to solve challenges faced by adopters. By focusing on critical areas such as observability, reliability, and efficiency (Day 2 Ops), TAG Operational Resilience addresses significant problems faced by cloud native adopters. The TAG's work in defining best practices and frameworks and in performing assessments contributes to driving common practices and aligning projects within the ecosystem, as mandated by the TOC.