TAG Operational Resilience
TAG Operational Resilience focuses on operating cloud native systems reliably. Its scope covers observability, management, business continuity, resource optimization, cost efficiency, energy, performance, troubleshooting, reliability, and Day 2 operations.
Mission Statement
Advance practices for observability, management, business continuity, resource optimization, cost efficiency, energy, performance, troubleshooting, reliability, and Day 2 operations in cloud native systems.
Leadership
Chairs
| Name | GitHub | Organization | Term |
|---|---|---|---|
| Mario Fahlandt | @mfahlandt | - | 2025-07-01 to 2027-06-30 |
| Rafael Brito | @brito-rafa | - | 2025-07-01 to 2026-06-30 |
| Saiyam Pathak | @saiyam1814 | - | 2025-07-01 to 2027-06-30 |
Tech Leads
| Name | GitHub | Organization | Term |
|---|---|---|---|
| Alolita Sharma | @alolita | - | 2025-07-02 to 2026-06-30 |
| Carol Valencia | @krol3 | - | 2025-07-02 to 2027-06-30 |
| Raffaele Spazzoli | @raffaelespazzoli | - | 2025-07-02 to 2026-06-30 |
| Matt Young | @halcyondude | - | 2025-07-02 to 2027-06-30 |
| Nabarun Pal | @palnabarun | - | 2025-07-02 to 2027-06-30 |
TOC Liaison
- Jeremy Rickard (@jeremyrickard)
Meetings
TAG Operational Resilience Meetings
- Calendar: View and join meetings
- Recordings: YouTube Channel
All meetings are open to the public. No registration is required; join from the calendar link.
Communication Channels
Slack
Join the CNCF Slack workspace and connect with the TAG:
- Channel: #tag-operational-resilience
- CNCF Slack Invite: slack.cncf.io
Mailing List
- Subscribe: cncf-tag-operational-resilience
- Use for announcements, discussions, and coordination
Focus Areas
TAG Operational Resilience works on several key operational domains:
Observability
- Metrics, logging, and tracing patterns
- Observability tooling selection and integration
- Distributed tracing architectures
- Signal correlation and analysis
Management
- Lifecycle management of cloud native applications
- Configuration management
- Change management processes
- Capacity planning
Business Continuity
- Disaster recovery strategies
- High availability patterns
- Backup and restore procedures
- Incident management
Resource Optimization
- Resource allocation and scheduling
- Auto-scaling patterns
- Right-sizing workloads
- Multi-tenancy optimization
Cost Efficiency (FinOps)
- Cost visibility and attribution
- Cost optimization strategies
- FinOps best practices
- Budget management and forecasting
Energy & Sustainability
- Energy-efficient architectures
- Carbon-aware computing
- Sustainability metrics and reporting
- Green cloud native practices
Performance
- Performance testing and benchmarking
- Performance optimization techniques
- Latency reduction strategies
- Throughput optimization
Troubleshooting & Reliability
- Debugging distributed systems
- Root cause analysis
- Chaos engineering
- SRE practices and SLO/SLI definitions
Day 2 Operations
- Production operations best practices
- Operational runbooks
- On-call procedures
- Post-incident reviews
Subprojects
- Operational Resilience-sub-foo: Mailing List
Initiatives
View current and past initiatives.
Getting Involved
We welcome contributions from anyone interested in operational resilience and reliability. There are several ways to take part.
Attend Meetings
Join our regular meetings to hear about ongoing work and participate in discussions. Check the meeting calendar for details.
Contribute to Initiatives
Browse active initiatives and volunteer to help with specific deliverables.
Share Your Experience
- Present operational use cases or lessons learned at TAG meetings
- Write blog posts about observability and reliability practices
- Contribute to white papers and best practices documents
Join the Conversation
- Participate in Slack discussions
- Engage on the mailing list
- Comment on GitHub issues in the TOC repository
Resources
Related TAGs
- TAG Infrastructure - Infrastructure supporting resilient operations
- TAG Security and Compliance - Security aspects of operations