TAG Operational Resilience

TAG Operational Resilience is dedicated to operational resilience and reliability practices in cloud native systems, covering observability, management, business continuity, resource optimization, cost efficiency, energy, performance, troubleshooting, reliability, and Day 2 operations.

Mission Statement

Observability, Management, Business Continuity, Resource Optimization, Cost Efficiency, Energy, Performance, Troubleshooting, Reliability, Day 2 Ops

Leadership

Chairs

Name            GitHub        Organization  Term
Mario Fahlandt  @mfahlandt    -             2025-07-01 to 2027-06-30
Rafael Brito    @brito-rafa   -             2025-07-01 to 2026-06-30
Saiyam Pathak   @saiyam1814   -             2025-07-01 to 2027-06-30

Tech Leads

Name               GitHub              Organization  Term
Alolita Sharma     @alolita            -             2025-07-02 to 2026-06-30
Carol Valencia     @krol3              -             2025-07-02 to 2027-06-30
Raffaele Spazzoli  @raffaelespazzoli   -             2025-07-02 to 2026-06-30
Matt Young         @halcyondude        -             2025-07-02 to 2027-06-30
Nabarun Pal        @palnabarun         -             2025-07-02 to 2027-06-30

TOC Liaison

Meetings

TAG Operational Resilience Meetings

All meetings are open to the public. No registration is required; simply join the meeting from the calendar link.

Communication Channels

Slack

Join the CNCF Slack workspace and connect with the TAG:

Mailing List

Focus Areas

TAG Operational Resilience works on several key operational domains:

Observability

  • Metrics, logging, and tracing patterns
  • Observability tooling selection and integration
  • Distributed tracing architectures
  • Signal correlation and analysis

Management

  • Lifecycle management of cloud native applications
  • Configuration management
  • Change management processes
  • Capacity planning

Business Continuity

  • Disaster recovery strategies
  • High availability patterns
  • Backup and restore procedures
  • Incident management

Resource Optimization

  • Resource allocation and scheduling
  • Auto-scaling patterns
  • Right-sizing workloads
  • Multi-tenancy optimization

Cost Efficiency (FinOps)

  • Cost visibility and attribution
  • Cost optimization strategies
  • FinOps best practices
  • Budget management and forecasting

Energy & Sustainability

  • Energy-efficient architectures
  • Carbon-aware computing
  • Sustainability metrics and reporting
  • Green cloud native practices

Performance

  • Performance testing and benchmarking
  • Performance optimization techniques
  • Latency reduction strategies
  • Throughput optimization

Troubleshooting & Reliability

  • Debugging distributed systems
  • Root cause analysis
  • Chaos engineering
  • SRE practices and SLO/SLI definitions

Day 2 Operations

  • Production operations best practices
  • Operational runbooks
  • On-call procedures
  • Post-incident reviews

Subprojects

Initiatives

View current and past initiatives:

Getting Involved

We welcome contributions from anyone interested in operational resilience and reliability:

Attend Meetings

Join our regular meetings to hear about ongoing work and participate in discussions. Check the meeting calendar for details.

Contribute to Initiatives

Browse active initiatives and volunteer to help with specific deliverables.

Share Your Experience

  • Present operational use cases or lessons learned at TAG meetings
  • Write blog posts about observability and reliability practices
  • Contribute to white papers and best practices documents

Join the Conversation

Resources