Reducing MTTD for SEV

1. Step 0: Incident Classification

1.1. SEV Levels

1.1.1. SEV 0. Catastrophic service impact

1.1.2. SEV 1. Critical service impact

1.1.3. SEV 2. High service impact

1.2. SEV Timeline

1.2.1. Detection

1.2.2. Diagnosis

1.2.3. Mitigation

1.2.4. Prevention

1.2.5. Closure

1.3. TTD Timeline

1.3.1. SEV detection

1.3.2. Determining the SEV level

1.3.3. Routing to the responsible TL

1.3.4. Solving the SEV
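
As a rough illustration of the TTD timeline above, mean time to detect (MTTD) can be computed as the average gap between when an incident started and when it was detected. This is a minimal sketch: the `Incident` record and its field names are hypothetical and not taken from any particular incident tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    # Hypothetical record; field names are assumptions for illustration.
    sev: int
    started_at: datetime
    detected_at: datetime


def mean_time_to_detect(incidents):
    """Average of (detected_at - started_at) across incidents."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum((i.detected_at - i.started_at for i in incidents), timedelta())
    return total / len(incidents)


incidents = [
    Incident(0, datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 4)),
    Incident(1, datetime(2024, 1, 2, 9, 0), datetime(2024, 1, 2, 9, 10)),
]
mttd = mean_time_to_detect(incidents)
print(mttd)  # 0:07:00
```

Tracking this value per SEV level, rather than as one global number, would show whether the most severe incidents are also the ones detected fastest.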

2. Step 3: On-call Principles

2.1. Pareto Principle

2.1.1. 80/20 rule

2.1.2. Law of the vital few

2.1.3. Principle of factor sparsity
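
One way to apply the Pareto principle to on-call work is to find the small set of services that accounts for most incidents. The sketch below assumes a flat list of service names, one entry per incident; the service names and the 80% cutoff are illustrative assumptions.

```python
from collections import Counter


def vital_few(incident_services, share=0.8):
    """Smallest set of top services that covers at least `share` of incidents."""
    counts = Counter(incident_services)
    total = sum(counts.values())
    covered = 0
    selected = []
    for service, n in counts.most_common():
        if covered / total >= share:
            break
        selected.append(service)
        covered += n
    return selected


# Hypothetical incident history: one service name per incident.
history = ["checkout"] * 6 + ["search"] * 2 + ["profile", "email"]
top = vital_few(history)
print(top)  # ['checkout', 'search']
```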

2.2. Rotation structure

2.3. Alert threshold maintenance

2.3.1. False Positives

2.3.2. Alert Burden

2.3.3. Alert Maintenance
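
Alert maintenance can start from a simple measurement: how often each alert fired without requiring any action. The sketch below assumes a hypothetical alert log of `(alert_name, was_actionable)` pairs and an arbitrary 50% review threshold; neither comes from a specific alerting product.

```python
from collections import Counter

# Hypothetical alert log: (alert_name, was_actionable).
alert_log = [
    ("high_latency", True),
    ("high_latency", False),
    ("disk_full", False),
    ("disk_full", False),
    ("disk_full", False),
    ("error_spike", True),
]


def false_positive_rates(log):
    """Return {alert_name: share of firings that needed no action}."""
    fired = Counter(name for name, _ in log)
    noisy = Counter(name for name, actionable in log if not actionable)
    return {name: noisy[name] / fired[name] for name in fired}


def alerts_to_review(log, max_fp_rate=0.5):
    """Alerts whose false-positive rate exceeds the assumed threshold."""
    rates = false_positive_rates(log)
    return sorted(name for name, rate in rates.items() if rate > max_fp_rate)


rates = false_positive_rates(alert_log)
review = alerts_to_review(alert_log)
print(review)  # ['disk_full']
```

Alerts that land on the review list are candidates for a threshold change or removal, which reduces alert burden on the rotation.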

2.4. Escalation practices

3. Step 1: Critical-Service Monitoring

3.1. Four Golden Signals

3.1.1. Latency

3.1.2. Traffic

3.1.3. Errors

3.1.4. Saturation

3.2. RED Method

3.2.1. Rate

3.2.2. Errors

3.2.3. Duration
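
The three RED metrics can be derived from raw request records for a single time window. This is a minimal sketch: the `Request` type, its fields, and the choice of the median as the duration summary are assumptions for illustration.

```python
import statistics
from dataclasses import dataclass


@dataclass
class Request:
    # Hypothetical request record for illustration.
    duration_ms: float
    ok: bool


def red_metrics(requests, window_seconds):
    """Compute Rate, Errors, and Duration over one time window."""
    if not requests:
        return {"rate_per_s": 0.0, "error_ratio": 0.0, "median_ms": None}
    errors = sum(1 for r in requests if not r.ok)
    return {
        "rate_per_s": len(requests) / window_seconds,
        "error_ratio": errors / len(requests),
        "median_ms": statistics.median(r.duration_ms for r in requests),
    }


window = [
    Request(100, True),
    Request(200, True),
    Request(300, False),
    Request(400, True),
]
metrics = red_metrics(window, window_seconds=2)
print(metrics)
```

In practice a high percentile such as p95 or p99 is often tracked alongside the median, since tail latency tends to reveal problems earlier.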

3.3. KPIs

3.4. Service Dashboard

4. Step 4: Chaos Engineering

4.1. Principles of Chaos Engineering

4.1.1. Steady state

4.1.2. Hypothesize

4.1.3. Introduce variables

4.1.4. Try to disprove the hypothesis
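
The four principles above can be sketched as a small experiment harness: confirm steady state, inject a variable, check whether the hypothesis still holds, and always roll back. The toy two-replica system and every function name here are hypothetical; a real experiment would target actual infrastructure through a dedicated chaos tool.

```python
def run_experiment(steady_state, inject, rollback):
    """Run one chaos experiment; return True if the hypothesis held."""
    if not steady_state():
        raise RuntimeError("system is not in steady state; aborting")
    inject()
    try:
        # Hypothesis: steady state survives the injected failure.
        return steady_state()
    finally:
        rollback()


# Toy system: two replicas behind a hypothetical load balancer.
replicas = {"a": True, "b": True}


def healthy():
    # Steady state: at least one replica is serving traffic.
    return any(replicas.values())


def kill_a():
    replicas["a"] = False


def restore_a():
    replicas["a"] = True


held = run_experiment(healthy, kill_a, restore_a)
print(held)  # True
```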

4.2. Chaos Days

4.3. Continuous chaos

5. Step 2: Service Ownership and Metrics

5.1. Service Triage

5.2. Service Ownership

5.2.1. List services and teams

5.2.2. Assign a responsible owner per service

5.2.3. Automate Ownership

5.2.4. On-call systems
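
Automated ownership can be as simple as a lookup from service to team to the current on-call engineer, with a fallback when a service has no owner. The registries, team names, and fallback role below are all made up for illustration.

```python
# Hypothetical ownership registry; all names are illustrative.
OWNERS = {"checkout": "payments-team", "search": "discovery-team"}
ON_CALL = {"payments-team": "alice", "discovery-team": "bob"}
FALLBACK = "incident-commander"


def route(service):
    """Return who should be paged for an incident on `service`."""
    team = OWNERS.get(service)
    return ON_CALL.get(team, FALLBACK)


print(route("checkout"))  # alice
print(route("unknown-service"))  # incident-commander
```

Services that fall through to the fallback are a useful signal that the ownership list is incomplete.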

6. Step 5: Metrics for self-healing system automation

6.1. Self-healing systems automate the fix before a human takes action.

6.2. Examples: Facebook’s FBAR, LinkedIn’s Nurse and Netflix’s Winston.

6.3. System automation
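
A self-healing loop can be sketched as: check health, try automated remediations in order, and escalate to a human only if none of them works. This is a generic illustration under assumed names and does not describe how FBAR, Nurse, or Winston are implemented.

```python
def self_heal(check, remediations, escalate):
    """Try automated fixes in order; escalate to a human if none works."""
    if check():
        return "healthy"
    for name, fix in remediations:
        fix()
        if check():
            return f"fixed by {name}"
    escalate()
    return "escalated"


# Toy service state; both remediations are hypothetical.
state = {"cache_ok": False, "process_up": True}
pages = []


def service_healthy():
    return state["cache_ok"] and state["process_up"]


def restart_process():
    state["process_up"] = True


def flush_cache():
    state["cache_ok"] = True


result = self_heal(
    service_healthy,
    [("restart", restart_process), ("cache flush", flush_cache)],
    escalate=lambda: pages.append("page on-call"),
)
print(result)  # fixed by cache flush
```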

7. Step 6: Listening to your people and creating a high-reliability culture

7.1. Preoccupation with failure

7.2. Reluctance to simplify interpretations

7.3. Sensitivity to operations

7.4. Commitment to resilience

7.5. Deference to expertise