Reducing MTTD for SEV
par Yury Niño
1. Step 0. Incident Classification
1.1. SEV Levels
1.1.1. SEV 0. Catastrophic service impact
1.1.2. SEV 1. Critical device impact
1.1.3. SEV 2. High service impact
1.2. SEV Timeline
1.2.1. Detection
1.2.2. Diagnosis
1.2.3. Mitigation
1.2.4. Prevention
1.2.5. Closure
1.3. TTD Timeline
1.3.1. SEV detection
1.3.2. Determining the SEV level
1.3.3. Routing to a TL who is responssable
1.3.4. Solving the SEV
2. Step 1. Critical-Service Monitoring
2.1. Four Golden Signals
2.1.1. Latency
2.1.2. Traffic
2.1.3. Errors
2.1.4. Saturation
2.2. RED Method
2.2.1. Rate
2.2.2. Error
2.2.3. Duration
2.3. KPIs
2.4. Service Dashboard
3. Step 2. Service Ownership and Metrics
3.1. Service Triage
3.2. Service Ownership
3.2.1. List services and teams
3.2.2. Build a responsible
3.2.3. Automate Ownership
3.2.4. On-call systems
4. Step 3: On-call Principles
4.1. Pareto Principle
4.1.1. 80/20 rule
4.1.2. Law of the vital few
4.1.3. Principle of factor sparsity
4.2. Rotation structure
4.3. Alert threshold maintenance
4.3.1. False Positives
4.3.2. Alert Burden
4.3.3. Alert Maintenance
4.4. Escalation practices
5. Step 4: Chaos Engineering
5.1. Principles Chaos Engineering
5.1.1. Steady state
5.1.2. Hypothesize
5.1.3. Introduce variables
5.1.4. Try to disprove the hypothesis