TAG Operational Resilience
TAG Operational Resilience is dedicated to operational resilience and reliability practices in cloud native systems, covering observability, management, business continuity, resource optimization, cost efficiency, energy, performance, troubleshooting, reliability, and Day 2 operations.
Mission Statement
Observability, Management, Business Continuity, Resource Optimization, Cost Efficiency, Energy, Performance, Troubleshooting, Reliability, Day 2 Ops
Leadership
Chairs
| Name | GitHub | Organization | Term |
|---|---|---|---|
| Mario Fahlandt | @mfahlandt | - | 2025-07-01 to 2027-06-30 |
| Rafael Brito | @brito-rafa | - | 2025-07-01 to 2026-06-30 |
| Saiyam Pathak | @saiyam1814 | - | 2025-07-01 to 2027-06-30 |
Tech Leads
| Name | GitHub | Organization | Term |
|---|---|---|---|
| Alolita Sharma | @alolita | - | 2025-07-02 to 2026-06-30 |
| Carol Valencia | @krol3 | - | 2025-07-02 to 2027-06-30 |
| Raffaele Spazzoli | @raffaelespazzoli | - | 2025-07-02 to 2026-06-30 |
| Matt Young | @halcyondude | - | 2025-07-02 to 2027-06-30 |
| Nabarun Pal | @palnabarun | - | 2025-07-02 to 2027-06-30 |
TOC Liaison
- Jeremy Rickard (@jeremyrickard)
Meetings
TAG Operational Resilience Meetings
- Calendar: View and join meetings
- Recordings: YouTube Channel
All meetings are open to the public. No registration is required; simply join the meeting from the calendar link.
Communication Channels
Slack
Join the CNCF Slack workspace and connect with the TAG:
- Channel: #tag-operational-resilience
- CNCF Slack Invite: slack.cncf.io
Mailing List
- Subscribe: cncf-tag-operational-resilience
- Use for announcements, discussions, and coordination
Focus Areas
TAG Operational Resilience works on several key operational domains:
Observability
- Metrics, logging, and tracing patterns
- Observability tooling selection and integration
- Distributed tracing architectures
- Signal correlation and analysis
Management
- Lifecycle management of cloud native applications
- Configuration management
- Change management processes
- Capacity planning
Business Continuity
- Disaster recovery strategies
- High availability patterns
- Backup and restore procedures
- Incident management
Resource Optimization
- Resource allocation and scheduling
- Auto-scaling patterns
- Right-sizing workloads
- Multi-tenancy optimization
Cost Efficiency (FinOps)
- Cost visibility and attribution
- Cost optimization strategies
- FinOps best practices
- Budget management and forecasting
Energy & Sustainability
- Energy-efficient architectures
- Carbon-aware computing
- Sustainability metrics and reporting
- Green cloud native practices
Performance
- Performance testing and benchmarking
- Performance optimization techniques
- Latency reduction strategies
- Throughput optimization
Troubleshooting & Reliability
- Debugging distributed systems
- Root cause analysis
- Chaos engineering
- SRE practices and SLO/SLI definitions
Day 2 Operations
- Production operations best practices
- Operational runbooks
- On-call procedures
- Post-incident reviews
Subprojects
- Operational Resilience-sub-foo: Mailing List
Initiatives
View current and past initiatives.
Getting Involved
We welcome contributions from anyone interested in operational resilience and reliability:
Attend Meetings
Join our regular meetings to hear about ongoing work and participate in discussions. Check the meeting calendar for details.
Contribute to Initiatives
Browse active initiatives and volunteer to help with specific deliverables.
Share Your Experience
- Present operational use cases or lessons learned at TAG meetings
- Write blog posts about observability and reliability practices
- Contribute to white papers and best practices documents
Join the Conversation
- Participate in Slack discussions
- Engage on the mailing list
- Comment on GitHub issues in the TOC repository
Resources
Related TAGs
- TAG Infrastructure - Infrastructure supporting resilient operations
- TAG Security and Compliance - Security aspects of operations