SRE core responsibilities

1. Engineering

1.1. Ensure a durable focus on engineering: operational work capped at 50%

1.2. DO

1.2.1. Measure how time is spent

1.2.2. Enforce a positive feedback loop

1.2.2.1. Redirect excess operational work to the service dev team when above the 50% cap

1.2.2.2. End redirection once back below 50%

1.2.3. Focus on & fix root causes

1.2.3.1. Postmortems for all significant incidents

1.2.3.2. Especially if not paged

1.2.3.3. Improve, improve, improve!

1.2.3.4. Blame-free postmortem culture

1.2.4. Aim for 1 to 2 events per shift on a regular basis

1.2.4.1. More than that?

1.2.4.1.1. no learning, as the team is overwhelmed

1.2.4.1.2. pager fatigue

1.2.4.1.3. Weak postmortem

1.2.4.1.4. Sloppy cleanup / service restoration

1.2.4.2. Fewer than that?

1.2.4.2.1. lose the ability to manage events

1.2.4.2.2. waste of time

1.2.4.2.3. challenge scope / release rhythm
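
The 50% cap and its feedback loop can be sketched as below. This is a minimal illustration that assumes time is tracked in hours per category; the function names and the sample numbers are hypothetical and do not come from the source.

```python
OPS_CAP = 0.50  # cap on operational work, as stated in the outline


def ops_fraction(ops_hours: float, engineering_hours: float) -> float:
    """Share of tracked time spent on operational work."""
    total = ops_hours + engineering_hours
    if total <= 0:
        raise ValueError("no time recorded")
    return ops_hours / total


def should_redirect(ops_hours: float, engineering_hours: float) -> bool:
    """True while operational load is above the cap, False otherwise."""
    return ops_fraction(ops_hours, engineering_hours) > OPS_CAP


# Hypothetical weeks (assumed numbers, for illustration only).
print(should_redirect(26, 14))  # True: 65% ops, redirect to the service dev team
print(should_redirect(18, 22))  # False: 45% ops, end the redirection
```

The check only works if time is measured in the first place, which is why measuring comes first in the list.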

2. Velocity

2.1. Maximize change velocity without violating SLO

2.2. DO

2.2.1. Explain why a 100% reliability target is usually wrong

2.2.1.1. users cannot see the difference between 100% and 99.999%

2.2.2. Bring the structural Dev / Ops conflict to the fore by implementing an ERROR BUDGET

2.2.3. Let Business / Product determine the relevant target, i.e. the Velocity / Reliability balance

2.2.4. Have strong CxO sponsorship to keep playing by error budget rules
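
The arithmetic behind an error budget is simple enough to sketch. The example below assumes an availability SLO measured over a 30-day window and a "stop releasing when the budget is spent" rule; the function names and the window length are illustrative assumptions, not something the source prescribes.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Downtime allowed by an availability SLO over the window, in minutes."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be in (0, 1]")
    return (1.0 - slo) * window_days * 24 * 60


def can_release(slo: float, downtime_minutes: float, window_days: int = 30) -> bool:
    """Releases continue only while some error budget remains."""
    return downtime_minutes < error_budget_minutes(slo, window_days)


for slo in (0.999, 0.9999, 0.99999):
    print(f"{slo:.3%} -> {error_budget_minutes(slo):.2f} min per 30 days")
```

A 99.9% target leaves about 43.2 minutes per 30 days. A 100% target leaves a budget of zero, so under this rule no change could ever ship, which is one way to explain why that target is usually wrong.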

3. Monitoring

3.1. Enable health, availability and insights

3.2. DO

3.2.1. Configure monitoring to NEVER require a human to interpret any part of the alerting domain

3.2.2. Page humans only if they need to take an action

3.2.2.1. No human action

3.2.2.1.1. just log, for possible future forensics

3.2.2.2. Human action

3.2.2.2.1. Later

3.2.2.2.2. Now!
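
The decision tree above maps onto a small routing function. This is a sketch only: the `Route` names are assumptions, and in particular "ticket" is a label chosen here for the "Later" branch, since the source does not name the mechanism.

```python
from enum import Enum


class Route(Enum):
    LOG = "log"        # no human action: keep for possible future forensics
    TICKET = "ticket"  # human action needed, but later (assumed label)
    PAGE = "page"      # human action needed now


def route_alert(needs_human_action: bool, urgent: bool) -> Route:
    """Page a human only when an action is needed right away."""
    if not needs_human_action:
        return Route.LOG
    return Route.PAGE if urgent else Route.TICKET


print(route_alert(needs_human_action=False, urgent=False))
print(route_alert(needs_human_action=True, urgent=False))
print(route_alert(needs_human_action=True, urgent=True))
```

Note that urgency is ignored when no human action is needed: such a signal is logged and never pages.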

4. Emergency

4.1. Minimize outage timespan

4.2. Context

4.2.1. MTTF: Mean Time To Failure

4.2.1.1. Architecture brings reliability by improving MTTF

4.2.2. MTTR: Mean Time To Restore

4.2.2.1. Self-healing is the best way to improve MTTR

4.2.2.2. But what if there is still an outage that self-healing does not cover?

4.3. DO

4.3.1. Measure MTTR: "the" KPI

4.3.2. Write and maintain "on-call playbooks"

4.3.2.1. 3x improvement in MTTR

4.3.2.2. Must have

4.3.3. Train yourself to restore the service

4.3.3.1. E.g. Wheel of misfortune
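
Measuring MTTR, and seeing how it combines with MTTF, can be sketched as follows. The incident timestamps are made up for illustration, the function names are hypothetical, and the availability formula MTTF / (MTTF + MTTR) is the standard steady-state approximation rather than something stated in the source.

```python
from datetime import datetime, timedelta


def mean_time_to_restore(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Average of (restored - started) over all incidents."""
    if not incidents:
        raise ValueError("no incidents recorded")
    total = sum((end - start for start, end in incidents), timedelta())
    return total / len(incidents)


def availability(mttf: timedelta, mttr: timedelta) -> float:
    """Steady-state availability: MTTF / (MTTF + MTTR)."""
    return mttf / (mttf + mttr)


# Hypothetical incidents: (outage started, service restored).
incidents = [
    (datetime(2024, 1, 3, 10, 0), datetime(2024, 1, 3, 10, 30)),
    (datetime(2024, 1, 17, 14, 0), datetime(2024, 1, 17, 15, 0)),
]
mttr = mean_time_to_restore(incidents)
mttf = timedelta(days=30)  # assumed value, for illustration
print(mttr, availability(mttf, mttr))
# Effect of a 3x MTTR improvement (the playbook figure above) on availability:
print(availability(mttf, mttr / 3))
```

With MTTF held constant, every reduction in MTTR raises availability directly, which is why playbooks and restoration drills pay off.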

5. Changes & releases

5.1. Automate progressive rollouts & secure rollbacks

5.2. Context

5.2.1. 70% of outages are caused by changes in a live system

5.3. DO

5.3.1. Automate the release pipeline including tests

5.3.1.1. Infra

5.3.1.2. Apps

5.3.2. Use progressive rollouts

5.3.2.1. Environments

5.3.2.2. Canary

5.3.2.3. Production slices post Canary

5.3.3. Ensure quick and accurate problem detection

5.3.4. Be able to roll back safely
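
A progressive rollout with rollback can be sketched as a loop over stages. Everything here is an assumption made for illustration: the stage names, the callback signatures and the simulated failure are hypothetical, and a real pipeline would plug in its own deploy, health-check and rollback steps.

```python
from typing import Callable, Sequence


def progressive_rollout(
    stages: Sequence[str],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> bool:
    """Deploy stage by stage. On the first unhealthy stage, roll back every
    stage touched so far (newest first) and stop."""
    done: list[str] = []
    for stage in stages:
        deploy(stage)
        done.append(stage)
        if not is_healthy(stage):
            for touched in reversed(done):
                rollback(touched)
            return False
    return True


# Simulated run: the first production slice fails its health check.
log: list[str] = []
ok = progressive_rollout(
    stages=["staging", "canary", "prod-slice-1", "prod-all"],
    deploy=lambda s: log.append(f"deploy {s}"),
    is_healthy=lambda s: s != "prod-slice-1",
    rollback=lambda s: log.append(f"rollback {s}"),
)
print(ok)
print(log)
```

In the simulated run the change never reaches `prod-all`: detection at a small slice limits the blast radius, and the rollback path is exercised automatically.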

6. Capacity

6.1. Take care of capacity because it is critical to reliability (squarely in SRE scope)

6.2. DO

6.2.1. Forecast demand

6.2.1.1. Organic

6.2.1.1.1. adoption

6.2.1.1.2. usage

6.2.1.2. Inorganic

6.2.1.2.1. marketing campaigns

6.2.1.2.2. launches

6.2.2. Plan capacity by using load testing
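
Forecasting and planning can be combined in a short sketch. It assumes organic demand compounds at a fixed monthly rate, inorganic demand (a campaign or launch) is added on top, and per-instance throughput comes from load testing; the function names, the 20% default headroom and all numbers are hypothetical.

```python
import math


def forecast_peak_demand(
    current_peak_qps: float,
    monthly_growth: float,
    months: int,
    inorganic_qps: float = 0.0,
) -> float:
    """Organic demand compounds monthly; inorganic demand is added on top."""
    return current_peak_qps * (1 + monthly_growth) ** months + inorganic_qps


def instances_needed(
    demand_qps: float, qps_per_instance: float, headroom: float = 0.2
) -> int:
    """Instances required to serve the demand with spare headroom.

    qps_per_instance is the throughput one instance sustained in load tests.
    """
    if qps_per_instance <= 0:
        raise ValueError("qps_per_instance must be positive")
    return math.ceil(demand_qps * (1 + headroom) / qps_per_instance)


# Assumed inputs: 1000 QPS today, 5% monthly growth, a launch adding 500 QPS.
demand = forecast_peak_demand(1000, 0.05, months=6, inorganic_qps=500)
print(round(demand), instances_needed(demand, qps_per_instance=200))
```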

7. Efficiency

7.1. Impact service efficiency

7.2. Context

7.2.1. SRE ultimately controls service provisioning

7.2.2. Utilization depends on

7.2.2.1. How the service is provisioned and how it works

7.2.3. Resource use depends on

7.2.3.1. Demand (load) and software efficiency

7.2.4. Service efficiency has a direct cost impact

7.2.5. Capacity target at a specific response speed

7.3. DO

7.3.1. Predict demand

7.3.2. Provision capacity

7.3.3. Monitor efficiency

7.3.4. Modify software
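
The relationship between demand, provisioning and software efficiency can be made concrete with a utilization figure. This sketch assumes capacity is expressed as the throughput each instance sustains at the target response speed; the function name and the numbers are hypothetical.

```python
def utilization(demand_qps: float, instances: int, qps_per_instance: float) -> float:
    """Fraction of provisioned capacity actually used by current demand."""
    capacity = instances * qps_per_instance
    if capacity <= 0:
        raise ValueError("no capacity provisioned")
    return demand_qps / capacity


# Same demand, three situations (assumed numbers):
print(utilization(600, instances=10, qps_per_instance=100))  # baseline
print(utilization(600, instances=8, qps_per_instance=100))   # tighter provisioning
print(utilization(600, instances=10, qps_per_instance=150))  # more efficient software
```

Provisioning fewer instances raises utilization, while more efficient software lowers it for the same fleet and frees capacity that can then be removed. Both levers show up directly in cost.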