What is SRE
Site Reliability Engineering — system reliability
SRE (Site Reliability Engineering) is an engineering discipline that applies software engineering practices to operations work in order to keep systems reliable, scalable, and performant.
Core Principles
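- Error budget — the amount of unreliability an SLO permits over a given period (1 minus the SLO target)
- SLI/SLO/SLA — service level indicators (what is measured), objectives (internal targets for those indicators), and agreements (commitments made to customers)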
- Toil reduction — automating routine tasks
- Postmortem culture — incident analysis without blame
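The relationship between an SLI, an SLO, and the error budget can be illustrated with a little arithmetic. The sketch below assumes a request-based SLO (the fraction of successful requests in a window). The function name `error_budget_status` and the sample figures are hypothetical and chosen only for illustration.

```python
def error_budget_status(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Summarize error budget usage for a request-based SLO (illustrative sketch)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= failed_requests <= total_requests:
        raise ValueError("failed_requests must be between 0 and total_requests")

    # SLI: the measured fraction of successful requests.
    sli = (total_requests - failed_requests) / total_requests
    # Error budget: the number of failures the SLO tolerates in this window.
    allowed_failures = (1.0 - slo_target) * total_requests
    # Share of the budget already spent (values above 1.0 mean the SLO is breached).
    budget_consumed = failed_requests / allowed_failures

    return {
        "sli": sli,
        "allowed_failures": allowed_failures,
        "budget_consumed": budget_consumed,
        "slo_met": sli >= slo_target,
    }


# Hypothetical window: 1,000,000 requests, 400 failures, SLO target of 99.9%.
status = error_budget_status(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
print(status)
```

With these sample numbers the SLO tolerates about 1,000 failed requests, so 400 failures use roughly 40% of the budget. A team might slow down risky releases as that share approaches 100%, though the exact policy is a local decision.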
Key Practices
- Monitoring and alerting
- Incident management (on-call)
- Capacity planning
- Chaos engineering
- Release automation
SRE Metrics
- Availability — fraction of time (or of requests) in which the service is successfully serving users
- Latency — response time
- Error rate — frequency of errors
- MTTR — mean time to recovery
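The four metrics above can be computed from raw request and incident records. The following is a minimal sketch, assuming time-based availability, downtime measured from detection to recovery, and a nearest-rank percentile for latency. The record types `Request` and `Incident`, the helper names, and the sample data are all hypothetical.

```python
import math
from dataclasses import dataclass
from statistics import mean


@dataclass
class Request:
    latency_ms: float
    ok: bool


@dataclass
class Incident:
    detected_min: float   # minutes since the start of the window
    recovered_min: float  # minutes since the start of the window


def error_rate(requests: list[Request]) -> float:
    """Fraction of requests that failed."""
    return sum(1 for r in requests if not r.ok) / len(requests)


def latency_percentile(requests: list[Request], pct: int) -> float:
    """Nearest-rank percentile of request latency, in milliseconds."""
    ordered = sorted(r.latency_ms for r in requests)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def availability(window_min: float, incidents: list[Incident]) -> float:
    """Fraction of the window not covered by incidents (assumes no overlap)."""
    downtime = sum(i.recovered_min - i.detected_min for i in incidents)
    return (window_min - downtime) / window_min


def mttr(incidents: list[Incident]) -> float:
    """Mean time to recovery, in minutes, across incidents."""
    return mean(i.recovered_min - i.detected_min for i in incidents)


# Hypothetical sample data: ten requests and two incidents in a 30-day window.
latencies = [120, 80, 95, 300, 110, 90, 105, 2000, 100, 85]
requests = [Request(latency_ms=ms, ok=(ms < 1000)) for ms in latencies]
incidents = [Incident(100, 130), Incident(5000, 5012)]
window_min = 30 * 24 * 60

print("error rate:", error_rate(requests))
print("p50 latency (ms):", latency_percentile(requests, 50))
print("p90 latency (ms):", latency_percentile(requests, 90))
print("availability:", availability(window_min, incidents))
print("MTTR (min):", mttr(incidents))
```

Percentiles are used here rather than the mean because a single slow request (2000 ms in the sample) distorts an average while leaving the median almost unchanged. In practice these values would usually come from a monitoring system rather than hand-built lists.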
Tools
- Prometheus + Grafana
- PagerDuty / Opsgenie
- Kubernetes
- Terraform