
TAG Operational Resilience

TAG Operational Resilience is dedicated to operational resilience and reliability practices in cloud native systems, covering observability, management, business continuity, resource optimization, cost efficiency, energy, performance, troubleshooting, reliability, and Day 2 operations.

Mission Statement

Observability, Management, Business Continuity, Resource Optimization, Cost Efficiency, Energy, Performance, Troubleshooting, Reliability, Day 2 Ops

Leadership

Chairs

Name              GitHub          Organization   Term
Mario Fahlandt    @mfahlandt      -              2025-07-01 to 2027-06-30
Rafael Brito      @brito-rafa     -              2025-07-01 to 2026-06-30
Saiyam Pathak     @saiyam1814     -              2025-07-01 to 2027-06-30

Tech Leads

Name                GitHub               Organization   Term
Alolita Sharma      @alolita             -              2025-07-02 to 2026-06-30
Carol Valencia      @krol3               -              2025-07-02 to 2027-06-30
Raffaele Spazzoli   @raffaelespazzoli    -              2025-07-02 to 2026-06-30
Matt Young          @halcyondude         -              2025-07-02 to 2027-06-30
Nabarun Pal         @palnabarun          -              2025-07-02 to 2027-06-30

TOC Liaison

Meetings

TAG Operational Resilience Meetings

All meetings are open to the public. No registration is required; simply join the meeting from the calendar link.

Communication Channels

Slack

Join the CNCF Slack workspace and connect with the TAG:

Mailing List

Focus Areas

TAG Operational Resilience works on several key operational domains:

Observability

  • Metrics, logging, and tracing patterns
  • Observability tooling selection and integration
  • Distributed tracing architectures
  • Signal correlation and analysis

Management

  • Lifecycle management of cloud native applications
  • Configuration management
  • Change management processes
  • Capacity planning

Business Continuity

  • Disaster recovery strategies
  • High availability patterns
  • Backup and restore procedures
  • Incident management

Resource Optimization

  • Resource allocation and scheduling
  • Auto-scaling patterns
  • Right-sizing workloads
  • Multi-tenancy optimization

Cost Efficiency (FinOps)

  • Cost visibility and attribution
  • Cost optimization strategies
  • FinOps best practices
  • Budget management and forecasting

Energy & Sustainability

  • Energy-efficient architectures
  • Carbon-aware computing
  • Sustainability metrics and reporting
  • Green cloud native practices

Performance

  • Performance testing and benchmarking
  • Performance optimization techniques
  • Latency reduction strategies
  • Throughput optimization

Troubleshooting & Reliability

  • Debugging distributed systems
  • Root cause analysis
  • Chaos engineering
  • SRE practices and SLO/SLI definitions

Day 2 Operations

  • Production operations best practices
  • Operational runbooks
  • On-call procedures
  • Post-incident reviews

Subprojects

Initiatives

View current and past initiatives:

Getting Involved

We welcome contributions from anyone interested in operational resilience and reliability:

Attend Meetings

Join our regular meetings to hear about ongoing work and participate in discussions. Check the meeting calendar for details.

Contribute to Initiatives

Browse active initiatives and volunteer to help with specific deliverables.

Share Your Experience

  • Present operational use cases or lessons learned at TAG meetings
  • Write blog posts about observability and reliability practices
  • Contribute to white papers and best-practice documents

Join the Conversation

Resources