
1. Context
1.1. Innovation fuels the digital economy
1.2. Short time to market means
1.2.1. Frequent releases
1.2.2. of
1.2.2.1. new features
1.2.2.2. improvements
1.3. It is not easy to prioritize reliability work
2. Conviction
2.1. Reliability is a feature
2.2. It is the most important feature
3. Problem statement
3.1. How to re-prioritize reliability work (time and effort) in a sustainable manner that reduces conflict?
3.2. How to protect customers from repeated service outages?
3.3. How to avoid executive fatigue from arbitrating between agility and reliability?
4. Solution approach
4.1. Blameless
4.1.1. Freeze is not a punishment
4.2. Data driven
4.2.1. Based on the error budget burn rate
4.2.2. An error budget is the amount of unavailability the SLO tolerates
4.3. Gradual
4.3.1. How fast are we burning the error budget?
4.3.1.1. Little by little
4.3.1.2. Full speed ahead
4.4. With explicit progressive consequences
4.4.1. Investigate
4.4.2. Rollback
4.4.3. Dedicate dev resources to reliability work
4.4.4. Freeze release (unless security or reliability fix)
4.5. Agreed
4.5.1. Discuss and reach an agreement before a major incident happens
4.6. Documented
4.6.1. In the Error budget policy
4.7. Empowered
4.7.1. Executive sponsorship and policy sign-off
4.7.2. Exceptions will require a tough, visible escalation to predefined executives
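
The "Data driven" and "Gradual" points in section 4 rest on two quantities: the error budget and how fast it is being burned. The following is a minimal sketch of both, assuming a time-based availability SLO expressed as a fraction (for example 0.999) and downtime measured in minutes. The function names and the sample figures are illustrative assumptions, not part of the policy itself.

```python
def error_budget_minutes(slo: float, window_days: int) -> float:
    """Minutes of unavailability the SLO tolerates over the window."""
    return (1.0 - slo) * window_days * 24 * 60


def burn_rate(bad_minutes: float, elapsed_minutes: float, slo: float) -> float:
    """Observed unavailability divided by the unavailability the SLO allows.

    A value of 1.0 means the budget is consumed exactly at the end of the
    window ("little by little"); higher values mean it runs out sooner.
    """
    allowed_fraction = 1.0 - slo
    observed_fraction = bad_minutes / elapsed_minutes
    return observed_fraction / allowed_fraction


if __name__ == "__main__":
    # Hypothetical service: 99.9% availability SLO over 28 days.
    print(error_budget_minutes(0.999, 28))
    # Hypothetical hour with 0.54 minutes of unavailability.
    print(burn_rate(0.54, 60, 0.999))
```

With these assumed inputs, a 99.9% SLO over 28 days tolerates roughly 40 minutes of unavailability, and 0.54 bad minutes in one hour corresponds to a burn rate of about 9.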
5. Example 1
5.1. A service is
5.1.1. A collection of associated components and systems which compose a user-visible product
5.1.2. Defined by XX exec role in case of ambiguity
5.2. Freeze
5.2.1. Cause
5.2.1.1. Measurement
5.2.1.1.1. Services are performing below their availability SLO over the last 28 days
5.2.1.2. Event
5.2.1.2.1. A post mortem reveals an opportunity to soften a hard dependency
5.2.2. Consequence
5.2.2.1. Service must halt all roll-outs except cherry-picks for priority zero defects and security issues
5.2.2.2. A decision about what parts of the service need to freeze should be considered
5.2.2.3. This freeze period lasts until unavailability over the last 28 days is within the error budget
5.3. Unfreeze
5.3.1. Cause
5.3.1.1. Measurement
5.3.1.1.1. Services are performing at or above their availability SLO over the last 28 days
5.3.2. Consequence
5.3.2.1. Service may proceed with feature releases per the agreed roll-out policy
5.3.3. Exception to discuss
5.3.3.1. A hard dependency outage results in the service freezing its releases
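
The measurement-based freeze and unfreeze rules of Example 1 can be sketched as below. This is only an illustration: `ReleaseRequest`, `is_frozen` and `may_roll_out` are hypothetical names, the 28-day availability is assumed to come from an existing dashboard, and the event-based cause (5.2.1.2) and the hard-dependency exception (5.3.3) are deliberately left out because they need human judgment.

```python
from dataclasses import dataclass


@dataclass
class ReleaseRequest:
    """Hypothetical description of a roll-out candidate."""
    is_p0_fix: bool = False
    is_security_fix: bool = False


def is_frozen(availability_28d: float, slo: float) -> bool:
    """Freeze while the service is below its availability SLO over 28 days."""
    return availability_28d < slo


def may_roll_out(request: ReleaseRequest, availability_28d: float, slo: float) -> bool:
    """Outside a freeze everything may ship under the agreed roll-out policy.

    During a freeze only cherry-picks for priority zero defects and
    security issues are allowed.
    """
    if not is_frozen(availability_28d, slo):
        return True
    return request.is_p0_fix or request.is_security_fix


if __name__ == "__main__":
    slo = 0.999  # assumed SLO, for illustration only
    print(may_roll_out(ReleaseRequest(), 0.9985, slo))
    print(may_roll_out(ReleaseRequest(is_security_fix=True), 0.9985, slo))
    print(may_roll_out(ReleaseRequest(), 0.9992, slo))
```

Because the same 28-day measurement drives both rules, the unfreeze happens automatically once availability is back at or above the SLO.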
6. Example 2
6.1. Threshold 1
6.1.1. Cause
6.1.1.1. Automated alerts
6.1.2. Consequence
6.1.2.1. Notify SRE of an at-risk SLO
6.2. Threshold 2
6.2.1. Cause
6.2.1.1. SREs conclude they need help to defend the SLO
6.2.2. Consequence
6.2.2.1. Escalate to Dev team
6.3. Threshold 3
6.3.1. Cause
6.3.1.1. The 30-day error budget is exhausted and the root cause has not been found
6.3.2. Consequence
6.3.2.1. SRE blocks releases and asks for more support from the dev team
6.4. Threshold 4
6.4.1. Cause
6.4.1.1. The 90-day error budget is exhausted and the root cause has not been found
6.4.2. Consequence
6.4.2.1. SRE escalates to executive leadership to obtain more engineering time for reliability work
6.5. Threshold 5
6.5.1. Cause
6.5.1.1. One semester of error budget is exhausted, and the situation does not improve
6.5.2. Consequence
6.5.2.1. SRE gives the pager back to the Dev team
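
The five thresholds of Example 2 form an escalation ladder in which the most severe matching cause decides the consequence. One possible way to encode it is sketched below. The `SloState` fields are hypothetical flags that a team would have to derive from its own alerts, dashboards and post-mortems; nothing here is prescribed by the policy.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class SloState:
    """Hypothetical snapshot of the facts the thresholds refer to."""
    alert_firing: bool = False
    sre_needs_help: bool = False
    budget_30d_exhausted: bool = False
    budget_90d_exhausted: bool = False
    semester_budget_exhausted: bool = False
    root_cause_found: bool = False
    improving: bool = False


def highest_threshold(state: SloState) -> Optional[Tuple[int, str]]:
    """Return (threshold number, consequence), or None if nothing applies.

    Thresholds are checked from most to least severe so that the
    strongest applicable consequence is returned.
    """
    if state.semester_budget_exhausted and not state.improving:
        return 5, "SRE gives the pager back to the Dev team"
    if state.budget_90d_exhausted and not state.root_cause_found:
        return 4, "SRE escalates to executive leadership"
    if state.budget_30d_exhausted and not state.root_cause_found:
        return 3, "SRE blocks releases and asks the Dev team for more support"
    if state.sre_needs_help:
        return 2, "Escalate to Dev team"
    if state.alert_firing:
        return 1, "Notify SRE of an at-risk SLO"
    return None


if __name__ == "__main__":
    print(highest_threshold(SloState(alert_firing=True, budget_30d_exhausted=True)))
```

Keeping the ladder in one function makes the "explicit progressive consequences" easy to review alongside the written policy.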
7. Example 3
7.1. Orange
7.1.1. Cause
7.1.1.1. Burning error budget for 1 hour at 9x the target rate
7.1.1.2. Burning error budget for 12 hours at 3x the target rate
7.1.2. Consequence
7.1.2.1. Page SRE
7.2. Red
7.2.1. Cause
7.2.1.1. One outage burns One Week (25%) of Error Budget
7.2.2. Consequence
7.2.2.1. Dev team dedicates two Engineers to the action items of the post-mortem
7.3. Black
7.3.1. Cause
7.3.1.1. One outage burns Four Weeks (100%) of Error Budget
7.3.2. Consequence
7.3.2.1. Stop release unless related to reliability or security
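
The colour levels of Example 3 combine burn-rate paging with per-outage budget consumption. The sketch below shows one way to evaluate them. It assumes a 28-day budget window (inferred from "one week" being 25%), and that burn rates and the share of budget burned by an outage are already measured elsewhere; all function names are hypothetical.

```python
from typing import Optional


def budget_fraction_burned(rate: float, hours: float, window_days: int = 28) -> float:
    """Share of the window's error budget consumed by burning at
    `rate` times the target rate for `hours` hours."""
    return rate * hours / (window_days * 24)


def is_orange(rate_1h: float, rate_12h: float) -> bool:
    """Page SRE on a fast burn (9x over 1 hour) or a slower, sustained
    burn (3x over 12 hours)."""
    return rate_1h >= 9.0 or rate_12h >= 3.0


def outage_level(fraction_burned_by_outage: float) -> Optional[str]:
    """Classify a single outage by the share of budget it burned."""
    if fraction_burned_by_outage >= 1.0:
        return "black"  # stop releases unless reliability or security related
    if fraction_burned_by_outage >= 0.25:
        return "red"    # Dev team dedicates two engineers to post-mortem actions
    return None


if __name__ == "__main__":
    print(is_orange(rate_1h=10.0, rate_12h=1.0))
    print(outage_level(0.30))
    print(budget_fraction_burned(9.0, 1.0))
    print(budget_fraction_burned(3.0, 12.0))
```

Under the 28-day assumption, the two Orange conditions correspond to roughly 1.3% and 5.4% of the budget, which is why they lead to a page rather than to a release freeze.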
8. Best practices
8.1. Check policy content
8.1.1. Reference to SLO and Error budget dashboards
8.1.2. List of
8.1.2.1. Cause
8.1.2.1.1. when this is observed
8.1.2.2. Consequence
8.1.2.2.1. then that happens ...
8.1.3. Check that the list describes both
8.1.3.1. when entering a freeze
8.1.3.2. when exiting a freeze
8.1.4. How is the backlog re-prioritized?
8.1.5. Document whom to escalate disagreements to
8.1.5.1. There will be disagreements
8.1.6. Agreed upon and signed off by all parties
8.2. The service error budget is gone when the budget of at least one SLO is gone
8.3. Reuse the same policy for multiple services
8.4. Publish in a widely visible place, management included, before a major outage
8.5. Results in engineering work that improves reliability
8.6. Consequences provide benefits to Dev / Biz and Ops
8.7. Is consistently applied
8.7.1. Only a few silver bullets (one or two per year) to bypass the policy
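
The content checks in 8.1 can be turned into a small review aid. The sketch below is an assumption about how a policy could be represented as data; the `Rule` and `Policy` classes and their field names are hypothetical and only mirror the checklist items.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Rule:
    cause: str                  # "when this is observed"
    consequence: str            # "then that happens"
    enters_freeze: bool = False
    exits_freeze: bool = False


@dataclass
class Policy:
    dashboard_links: List[str] = field(default_factory=list)
    rules: List[Rule] = field(default_factory=list)
    backlog_reprioritization: str = ""
    escalation_contact: str = ""
    signed_off_by: List[str] = field(default_factory=list)


def missing_items(policy: Policy, parties: List[str]) -> List[str]:
    """List the checklist items from 8.1 that the policy does not cover."""
    missing = []
    if not policy.dashboard_links:
        missing.append("reference to SLO and error budget dashboards")
    if not policy.rules:
        missing.append("list of cause and consequence rules")
    if not any(r.enters_freeze for r in policy.rules):
        missing.append("rule for entering a freeze")
    if not any(r.exits_freeze for r in policy.rules):
        missing.append("rule for exiting a freeze")
    if not policy.backlog_reprioritization:
        missing.append("how the backlog is re-prioritized")
    if not policy.escalation_contact:
        missing.append("whom to escalate disagreements to")
    unsigned = [p for p in parties if p not in policy.signed_off_by]
    if unsigned:
        missing.append("sign-off from: " + ", ".join(unsigned))
    return missing


if __name__ == "__main__":
    for item in missing_items(Policy(), ["Dev", "SRE", "Product"]):
        print(item)
```

An empty result only means the listed items are present; whether the content is sound still has to be agreed by the parties themselves.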