Reducing MTTD for SEV


1. Step 0. Incident Classification

1.1. SEV Levels

1.1.1. SEV 0. Catastrophic service impact

1.1.2. SEV 1. Critical service impact

1.1.3. SEV 2. High service impact

1.2. SEV Timeline

1.2.1. Detection

1.2.2. Diagnosis

1.2.3. Mitigation

1.2.4. Prevention

1.2.5. Closure

1.3. TTD Timeline

1.3.1. SEV detection

1.3.2. Determining the SEV level

1.3.3. Routing to the responsible TL (team lead)

1.3.4. Solving the SEV
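
The timeline above can be turned into a number. The following is a minimal sketch, assuming time to detect (TTD) is measured from the start of impact to the moment the SEV is detected, and MTTD is the mean over incidents. The `Incident` record and its field names are hypothetical, not taken from the mind map.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    # Hypothetical record; field names are assumptions for illustration.
    sev: int               # SEV level (0, 1 or 2)
    started_at: datetime   # when the impact began
    detected_at: datetime  # when the SEV was detected


def time_to_detect(incident: Incident) -> timedelta:
    """TTD for one incident: detection time minus start of impact."""
    return incident.detected_at - incident.started_at


def mean_time_to_detect(incidents: list[Incident]) -> timedelta:
    """MTTD: arithmetic mean of the individual TTD values."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum((time_to_detect(i) for i in incidents), timedelta())
    return total / len(incidents)


# Made-up sample data: TTDs of 4 and 10 minutes.
incidents = [
    Incident(0, datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 4)),
    Incident(1, datetime(2024, 1, 2, 9, 0), datetime(2024, 1, 2, 9, 10)),
]
mttd = mean_time_to_detect(incidents)
print(mttd)  # 0:07:00
```

If the team counts classification and routing as part of detection, `detected_at` could instead be the time the right TL was paged; the arithmetic stays the same.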

2. Step 1. Critical-Service Monitoring

2.1. Four Golden Signals

2.1.1. Latency

2.1.2. Traffic

2.1.3. Errors

2.1.4. Saturation

2.2. RED Method

2.2.1. Rate

2.2.2. Errors

2.2.3. Duration
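
As an illustration of the RED method, here is a rough sketch that derives rate, errors, and duration from a window of request records. The `Request` type, the choice of HTTP 5xx as "error", and the median as the duration summary are all assumptions; real systems usually track higher percentiles as well.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Request:
    # Hypothetical request record used only for this example.
    duration_ms: float
    status: int


def red_metrics(requests: list[Request], window_seconds: float) -> dict:
    """Compute Rate, Errors and Duration for one observation window."""
    if not requests:
        raise ValueError("no requests in window")
    errors = sum(1 for r in requests if r.status >= 500)  # assumption: 5xx = error
    return {
        "rate_per_s": len(requests) / window_seconds,
        "error_ratio": errors / len(requests),
        "median_duration_ms": median(r.duration_ms for r in requests),
    }


# Made-up window: four requests over two seconds, one of them failing.
window = [
    Request(100.0, 200),
    Request(200.0, 200),
    Request(300.0, 500),
    Request(400.0, 200),
]
metrics = red_metrics(window, window_seconds=2.0)
print(metrics)
```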

2.3. KPIs

2.4. Service Dashboard

3. Step 2. Service Ownership and Metrics

3.1. Service Triage

3.2. Service Ownership

3.2.1. List services and teams

3.2.2. Assign a responsible owner for each service

3.2.3. Automate Ownership

3.2.4. On-call systems
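
One way to picture automated ownership is a lookup from service to team to the current on-call person, so an alert can be routed without a human searching for the owner. This is only a sketch: the service names, team names, and the `FALLBACK` target are invented for the example.

```python
# Hypothetical ownership data; in practice this would come from a
# service catalog and an on-call scheduling system.
SERVICE_OWNERS = {
    "checkout": "payments-team",
    "search": "discovery-team",
}
ON_CALL = {
    "payments-team": "alice",
    "discovery-team": "bob",
}
FALLBACK = "incident-manager"  # assumption: someone always catches unowned alerts


def route_alert(service: str) -> str:
    """Return who should be paged for an alert on the given service."""
    team = SERVICE_OWNERS.get(service)
    if team is None:
        return FALLBACK  # service has no registered owner
    return ON_CALL.get(team, FALLBACK)


print(route_alert("checkout"))  # alice
print(route_alert("legacy-batch"))  # incident-manager
```

Every alert that lands on the fallback is a signal that the ownership list is incomplete.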

4. Step 3. On-call Principles

4.1. Pareto Principle

4.1.1. 80/20 rule

4.1.2. Law of the vital few

4.1.3. Principle of factor sparsity
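
Applied to on-call, the Pareto principle suggests that a small set of services may account for most incidents. The sketch below, using made-up incident data, picks the smallest group of services (by incident count) that covers a chosen share of all incidents; the 80% default is the rule of thumb, not a measured value.

```python
from collections import Counter


def vital_few(incident_services: list[str], share: float = 0.8) -> list[str]:
    """Return the top services that together cover `share` of incidents."""
    counts = Counter(incident_services)
    total = sum(counts.values())
    selected, covered = [], 0
    for service, n in counts.most_common():
        if covered >= share * total:
            break  # target share already reached
        selected.append(service)
        covered += n
    return selected


# Made-up history: one entry per incident, naming the service at fault.
history = ["api"] * 6 + ["db"] * 2 + ["cache"] + ["auth"]
top = vital_few(history)
print(top)  # ['api', 'db']
```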

4.2. Rotation structure

4.3. Alert threshold maintenance

4.3.1. False Positives

4.3.2. Alert Burden

4.3.3. Alert Maintenance
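
Alert maintenance can be driven by a simple measurement: how often each alert fired without needing any action. The following sketch flags alerts whose false-positive ratio exceeds a threshold. The `AlertStats` record, the sample numbers, and the 0.5 threshold are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class AlertStats:
    # Hypothetical per-alert counters over some review period.
    name: str
    fired: int       # how many times the alert paged someone
    actionable: int  # how many of those pages needed real action


def false_positive_ratio(stats: AlertStats) -> float:
    """Share of pages that required no action."""
    if stats.fired == 0:
        return 0.0
    return (stats.fired - stats.actionable) / stats.fired


def needs_tuning(all_stats: list[AlertStats], max_ratio: float = 0.5) -> list[str]:
    """Names of alerts whose thresholds are candidates for review."""
    return [s.name for s in all_stats if false_positive_ratio(s) > max_ratio]


review = [
    AlertStats("disk_full", fired=10, actionable=9),
    AlertStats("cpu_high", fired=20, actionable=4),
]
noisy = needs_tuning(review)
print(noisy)  # ['cpu_high']
```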

4.4. Escalation practices
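
An escalation practice can be written down as a time-based policy: the longer a page goes unacknowledged, the further up it travels. The roles and the 15- and 30-minute delays below are placeholders, not recommendations from the mind map.

```python
# Hypothetical policy: (minutes unacknowledged, who gets paged), ascending.
ESCALATION_POLICY = [
    (0, "primary on-call"),
    (15, "secondary on-call"),
    (30, "team lead"),
]


def who_to_page(minutes_unacked: int) -> str:
    """Return the highest escalation level reached after the given delay."""
    target = ESCALATION_POLICY[0][1]
    for delay, role in ESCALATION_POLICY:
        if minutes_unacked >= delay:
            target = role
    return target


print(who_to_page(0))   # primary on-call
print(who_to_page(20))  # secondary on-call
print(who_to_page(45))  # team lead
```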

5. Step 4. Chaos Engineering

5.1. Principles of Chaos Engineering

5.1.1. Steady state

5.1.2. Hypothesize

5.1.3. Introduce variables

5.1.4. Try to disprove the hypothesis
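
The four principles above map onto a small experiment loop: measure the steady state, state a hypothesis as a tolerance, inject a fault, and check whether the hypothesis survives. This sketch runs against a fake in-memory service; `FakeService`, its latency numbers, and the tolerance are all invented, and a real experiment would target a real system with a controlled blast radius.

```python
class FakeService:
    """Stand-in for a real system; the latency values are made up."""

    def __init__(self):
        self.base_latency_ms = 100.0
        self.fault_active = False

    def measure_latency(self) -> float:
        # The injected fault adds 50 ms in this toy model.
        return self.base_latency_ms + (50.0 if self.fault_active else 0.0)


def run_experiment(measure, inject, rollback, tolerance: float) -> dict:
    """Hypothesis: the metric stays within `tolerance` of steady state."""
    baseline = measure()      # 1. steady state
    inject()                  # 3. introduce a variable
    try:
        observed = measure()  # 4. try to disprove the hypothesis
    finally:
        rollback()            # always remove the fault afterwards
    return {
        "baseline": baseline,
        "observed": observed,
        "hypothesis_held": abs(observed - baseline) <= tolerance,
    }


service = FakeService()
result = run_experiment(
    measure=service.measure_latency,
    inject=lambda: setattr(service, "fault_active", True),
    rollback=lambda: setattr(service, "fault_active", False),
    tolerance=20.0,  # 2. hypothesis: latency shifts by at most 20 ms
)
print(result)
```

Here the hypothesis is disproved (latency moved by 50 ms), which is the useful outcome: it exposes a weakness before a real incident does.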

5.2. Chaos Days

5.3. Continuous chaos

6. Step 5. Metrics for self-healing system automation

6.1. Self-healing systems automate the fix before a human takes action.

6.2. Examples: Facebook’s FBAR, LinkedIn’s Nurse and Netflix’s Winston.

6.3. System automation
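
To make "automate the fix before a human takes action" concrete, here is a minimal sketch of a remediation loop: check health, try an automated fix a limited number of times, and page a human only if that fails. It is not how FBAR, Nurse, or Winston work internally; the function names and the toy service state are assumptions for illustration.

```python
def self_heal(is_healthy, remediate, page_human, max_attempts: int = 3) -> str:
    """Try automated remediation first; escalate to a human as a last resort."""
    if is_healthy():
        return "healthy"
    for _ in range(max_attempts):
        remediate()
        if is_healthy():
            return "healed"
    page_human()
    return "escalated"


# Toy service state: a restart fixes it (hypothetical failure mode).
state = {"up": False, "pages": 0}


def restart():
    state["up"] = True


def page():
    state["pages"] += 1


outcome = self_heal(lambda: state["up"], restart, page)
print(outcome)  # healed
```

Counting how often the outcome is "healed" versus "escalated" gives one possible metric for how much detection and repair work the automation is absorbing.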

7. Step 6. Listening to your people and creating a high-reliability culture

7.1. Preoccupation with failure

7.2. Reluctance to simplify interpretations

7.3. Sensitivity to operations

7.4. Commitment to resilience

7.5. Deference to expertise