Blameless postmortems

1. Under the hood

1.1. Why is blameless not the regular way?

1.2. 1 - Missing psychological safety

1.2.1. Amy Edmondson

1.2.2. TEDx 10-minute talk

1.2.3. Key takeaways

1.2.3.1. NOT psychological safety VS accountability

1.2.3.1.1. because they are NOT on the same dimension

1.2.3.1.2. Psychological safety AND accountability

1.2.3.2. When to use?

1.2.3.2.1. Uncertainty AND interdependency

1.2.3.3. What to do? (e.g. applied to postmortem)

1.2.3.3.1. Create safe environment by

1.2.3.4. Benefit

1.2.3.4.1. High performing team

1.3. 2 - Fundamental Attribution Error

1.3.1. What?

1.3.1.1. Explain behaviours

1.3.1.1.1. Charlie

1.3.1.1.2. for myself

1.3.1.1.3. for others

1.3.2. So what, applied to postmortems?

1.3.2.1. Be conscious of this bias to avoid wrongly attributing an outage to a person or a team during a postmortem

1.4. 3 - Hindsight Bias

1.4.1. What?

1.4.1.1. The inclination, after an event has occurred, to see the event as having been predictable

1.4.1.2. E.g.

1.4.1.2.1. COVID

1.4.1.2.2. Mao Kissinger 1973

1.4.2. So what, applied to postmortems?

1.4.2.1. Be conscious of this bias to avoid easy a posteriori wisdom such as "it was obvious it would crash like this"

1.5. 4 - Discharging discomfort

1.5.1. Brené Brown

1.5.2. RSA "Hi, I am a blamer!" 3-minute talk

1.5.3. Key takeaways

1.5.3.1. We love blaming as

1.5.3.1.1. it discharges discomfort and pain at a neurobiological level

1.5.3.2. Blaming vs being accountable

1.5.3.2.1. Being accountable is being vulnerable

1.5.3.2.2. Being vulnerable means using speech instead of violence to resolve conflicts

1.5.3.3. But blaming costs too much as

1.5.3.3.1. it is corrosive to relationships

1.5.3.3.2. it consumes all our energy, so we miss opportunities for empathy

2. Learning by doing

2.1. Example 1

2.1.1. Shakespeare Sonnet++ Postmortem (incident #465)

2.2. Example 2

2.2.1. Satellite machines sent to disk erase

2.2.1.1. Version A

2.2.1.2. Version B

2.3. Example 3

2.3.1. Bring your own Postmortem

3. How to do it?

3.1. Setup

3.1.1. Drive with "how"

3.1.1.1. Focus on Systems and Processes NOT on people

3.1.1.2. Ask "How did it fail?" NOT "Who broke it?" or even "Why did it fail?"

3.1.2. Have accountability AND psychological safety

3.1.2.1. What is our current approach?

3.1.2.2. DO

3.1.2.2.1. Frame the work as a learning problem, not an execution problem

3.1.2.2.2. Acknowledge your own fallibility

3.1.2.2.3. Model curiosity and ask lots of questions

3.1.3. Clearly define when to do / not do a postmortem

3.1.3.1. e.g.

3.1.3.1.1. User-visible downtime / impacted users above a threshold

3.1.3.1.2. Data loss of any kind

3.1.3.1.3. Resolution time above threshold

3.1.3.1.4. Error budget burn / freeze period
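
The example criteria above can be written down as an explicit rule so that nobody has to argue, after an incident, about whether a postmortem is due. The sketch below is one possible encoding, not a prescribed one: the `Incident` and `Thresholds` field names and every default number are hypothetical and would need to be agreed on by each team.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    user_visible_downtime_min: float  # minutes of user-visible downtime
    impacted_users: int               # number of users affected
    data_loss: bool                   # True for data loss of any kind
    resolution_time_min: float        # minutes from detection to resolution
    error_budget_burn: float          # fraction of the error budget consumed (0.0 to 1.0)


@dataclass
class Thresholds:
    # Hypothetical defaults; each team should set its own values up front.
    max_downtime_min: float = 5.0
    max_impacted_users: int = 100
    max_resolution_time_min: float = 60.0
    max_error_budget_burn: float = 0.2


def postmortem_triggers(incident: Incident, t: Thresholds) -> list[str]:
    """Return the criteria that call for a postmortem (empty list if none)."""
    reasons = []
    if incident.user_visible_downtime_min > t.max_downtime_min:
        reasons.append("user-visible downtime above threshold")
    if incident.impacted_users > t.max_impacted_users:
        reasons.append("impacted users above threshold")
    if incident.data_loss:
        reasons.append("data loss")
    if incident.resolution_time_min > t.max_resolution_time_min:
        reasons.append("resolution time above threshold")
    if incident.error_budget_burn > t.max_error_budget_burn:
        reasons.append("error budget burn above threshold")
    return reasons
```

Returning the list of matched criteria, rather than a bare yes/no, keeps the decision about the system ("which rule fired?") instead of about a person's judgment call.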

3.2. Steps

3.2.1. 1 Write

3.2.1.1. Avoid blameful / animated language

3.2.1.2. Leverage real-time collaboration

3.2.1.2.1. We simply use Google Docs as our main postmortem tool because it fosters collaboration

3.2.1.3. Focus on improvement by

3.2.1.3.1. including concrete action items

3.2.1.3.2. Adapt 5 Whys to be 5 Hows

3.2.1.4. Structure using a template

3.2.1.4.1. Postmortem template

3.2.2. 2 Review

3.2.2.1. By a rotating team of senior Ops / SRE

3.2.2.2. looking at

3.2.2.2.1. Blameless language

3.2.2.2.2. Useful

3.2.2.2.3. Accurate

3.2.2.2.4. Complete

3.2.2.2.5. Protect user info
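
Part of the "blameless language" check in the review list above can be assisted by a simple script that points reviewers at suspicious lines. This is only a rough sketch under stated assumptions: the phrase list is hypothetical, it will miss blameful wording it does not know about, and it does not replace a human review.

```python
import re

# Hypothetical phrase list; a real team would curate its own over time.
BLAMEFUL_PATTERNS = [
    r"\bshould have\b",
    r"\bfailed to\b",
    r"\bcareless(ly)?\b",
    r"\b(his|her|their) fault\b",
    r"\bhuman error\b",
]


def flag_blameful_lines(text: str) -> list[tuple[int, str]]:
    """Return (line_number, line) for each line matching a blameful pattern."""
    compiled = [re.compile(p, re.IGNORECASE) for p in BLAMEFUL_PATTERNS]
    flagged = []
    for number, line in enumerate(text.splitlines(), start=1):
        if any(p.search(line) for p in compiled):
            flagged.append((number, line.strip()))
    return flagged


draft = (
    "The on-call engineer should have checked the config.\n"
    "The deploy tool accepted an invalid config without validation."
)
for number, line in flag_blameful_lines(draft):
    print(f"line {number}: {line}")
```

In this draft only the first sentence is flagged; the second describes how the system failed, which is the wording a blameless postmortem aims for.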

3.2.3. 3 Publicize

3.2.3.1. Audience

3.2.3.1.1. Internal

3.2.3.1.2. External

3.2.3.2. Channel and tools

3.2.3.2.1. Internal repo

3.2.3.2.2. External site

4. Build a competitive advantage

4.1. Why change to blameless?

4.2. Bad news should travel fast

4.2.1. The time the service is down matters

4.2.2. Hiding early warnings or early issues may have a catastrophic impact on large distributed systems

4.3. Seizing the opportunity to improve the service against the competition

4.3.1. Every crisis is an opportunity

4.3.2. Finding the root cause and executing the action plan to fix it makes a better product

4.4. Build a safe workspace

4.4.1. by making great values and principles effective in everyday work