Measure Efficiency, Effectiveness, and Culture to Optimize DevOps Transformations (ebook)

1. What is this ebook about?

1.1. "This paper summarizes a position on DevOps measurement that we intend to be a catalyst for accelerating more mature guidance on measuring and steering software delivery"

1.2. Focuses on internal and cultural measures

2. Principles

2.1. Why measure?

2.1.1. Remove Subjectivity

2.1.2. Improve Excellence

2.1.3. Focus on Strategy

2.1.4. Create Predictability

2.2. How to make them work

2.2.1. Use them to drive crucial conversations

2.2.2. Do not use them to evaluate performance and impact

2.2.2.1. Will be gamed

2.2.3. Team performance, not individual performance

2.3. What powers the transformation to a metrics-driven organization

2.3.1. A better work environment more accepting of new ideas

2.3.2. Desire for accountable teams

2.3.3. Continuous improvement via the scientific method

2.4. Some key principles

2.4.1. Software's most important quality is adaptiveness (ease of change)

2.4.1.1. Rate of change

2.4.1.2. Cost of change

2.4.2. Primary source of measurement is the product pipeline

2.4.2.1. Process pipelines are secondary sources

2.4.3. Track at least one measure in each of three dimensions

2.4.3.1. Internal

2.4.3.1.1. Efficiency - cost related

2.4.3.2. External

2.4.3.2.1. Effectiveness - value related

2.4.3.3. Cultural

2.4.3.3.1. Team dynamics - people related

2.4.4. Automate metric collection

2.4.4.1. Eliminates manual overhead

2.4.4.2. Eliminates subjectivity

2.4.4.3. Improves consistency
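
All three benefits above come from wiring collection to the pipeline itself. A minimal sketch of the idea, assuming the CI/CD system can report commit and deploy timestamps (the event data below is hypothetical):

```python
from datetime import datetime
from statistics import median

# Hypothetical pipeline events: (commit timestamp, production deploy timestamp).
# In practice these would be pulled automatically from the CI/CD system's API.
events = [
    (datetime(2023, 5, 1, 9, 0), datetime(2023, 5, 1, 14, 30)),
    (datetime(2023, 5, 2, 10, 15), datetime(2023, 5, 3, 11, 0)),
    (datetime(2023, 5, 4, 8, 45), datetime(2023, 5, 4, 9, 40)),
]

# Deployment lead time per change: elapsed time from commit to deploy.
lead_times = [deploy - commit for commit, deploy in events]
print("median deployment lead time:", median(lead_times))
```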

2.4.5. Metrics should be useful and transparent

2.4.5.1. To leadership

2.4.5.2. To engineering

2.4.5.3. Consistent format

2.4.6. Targets should be a distribution of probable outcomes (see the forecast sketch below)

2.4.6.1. Variance

2.4.6.2. Uncertainty
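
One way to turn a point target into a distribution of probable outcomes is to resample historical lead times Monte Carlo style and report percentiles instead of a single date. A sketch, with all numbers illustrative:

```python
import random

# Illustrative historical lead times per work item, in days.
history = [3, 5, 4, 8, 2, 6, 9, 4, 5, 7]

def forecast_days(items_remaining, trials=10_000):
    """Simulate finishing the remaining items many times by resampling
    history (assumes items are worked one after another)."""
    totals = sorted(
        sum(random.choice(history) for _ in range(items_remaining))
        for _ in range(trials)
    )
    # Report a spread of outcomes, not a single number.
    return {p: totals[int(p / 100 * (trials - 1))] for p in (50, 85, 95)}

# e.g. {50: 53, 85: 61, 95: 66} -> "85% chance of finishing within ~61 days"
print(forecast_days(items_remaining=10))
```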

2.4.7. Improve precision and fidelity over time

2.4.7.1. Not static - improve along with development processes

2.4.7.2. Best teams evolve their measures

2.5. Classes of measurements that matter most

2.5.1. Internal

2.5.1.1. Inside-out - Efficiency and resource consumption (cost/time/materials)

2.5.1.1.1. Technical progress

2.5.1.1.2. Product pipeline trends

2.5.1.1.3. Resource consumption

2.5.2. External

2.5.2.1. Outside-in - Effectiveness and value delivered

2.5.2.1.1. Quality

2.5.2.1.2. Usefulness

2.5.2.1.3. Performance

2.5.2.1.4. Business outcomes

2.5.2.2. Not covered in this book - well covered by accounting rules and industry experts

2.5.3. Cultural

2.5.3.1. Team dynamics

2.5.3.2. Process overhead

2.5.3.3. Trustworthiness

2.5.3.4. Shared objectives

2.5.3.5. Morale

2.5.3.6. Motivation

2.5.3.7. Team/product/enterprise identity

2.6. Lead time vs. process time

2.6.1. Some of the most powerful internal measures

2.6.1.1. Reveal how long it takes to deliver value

2.6.2. Help identify

2.6.2.1. Odds of delivering items within expected timeframes

2.6.2.2. Bottlenecks in the workflow

2.6.2.3. Inefficiencies in the workflow

2.6.2.3.1. Idle time

2.6.2.3.2. Non-value adding time

2.6.3. Enable continued improvement in feedback cycles

2.6.4. Definitions

2.6.4.1. Lead Time (LT)

2.6.4.1.1. The elapsed time from receiving a customer request to delivering on that request.

2.6.4.1.2. Process Time plus Wait Time.

2.6.4.2. Process Time (PT)

2.6.4.2.1. Process time begins when the work has been pulled into a doing state and ends when the work is delivered to the next downstream customer.

2.6.4.3. Wait Time (WT)

2.6.4.3.1. The time that work sits idle, not being actively worked on.
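
These definitions reduce to simple arithmetic over timestamps. A minimal sketch with hypothetical dates, including the PT-to-LT ratio as a quick read on how much of the elapsed time was idle:

```python
from datetime import datetime

# Hypothetical timestamps for one work item.
requested = datetime(2023, 5, 1, 9, 0)   # customer request received
started   = datetime(2023, 5, 3, 10, 0)  # work pulled into a doing state
delivered = datetime(2023, 5, 4, 16, 0)  # delivered to the next downstream customer

lead_time    = delivered - requested      # LT: request to delivery
process_time = delivered - started        # PT: doing to delivery
wait_time    = lead_time - process_time   # WT, since LT = PT + WT

print(f"LT={lead_time}  PT={process_time}  WT={wait_time}")
# Share of the lead time spent actively working; the rest was idle.
print(f"active share: {process_time / lead_time:.0%}")
```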

2.7. Code deployment lead time

2.7.1. Significant predictor of IT and organizational performance

2.7.1.1. Throughput and agility metrics

2.7.1.1.1. 200x faster deployment lead times

2.7.1.1.2. 30x more frequent deployments

2.7.1.2. Reliability and stability metrics

2.7.1.2.1. 60x higher change success rates

2.7.1.2.2. 168x faster mean time to restore service (MTTR)

2.7.1.3. Organizational performance metrics

2.7.1.3.1. 2x more likely to exceed productivity, market share, and profitability goals

2.7.1.3.2. 50% higher market capitalization growth over three years.

2.7.2. Two value streams

2.7.2.1. Design and Development

2.7.2.1.1. From a product owner forming a hypothesis to code committed into source control

2.7.2.1.2. Resembles lean product development

2.7.2.1.3. Long process times (days or weeks)

2.7.2.2. Testing and Operations

2.7.2.2.1. From code committed to version control to code deployed in production

2.7.2.2.2. Resembles lean manufacturing

2.7.2.2.3. Short process times (minutes or hours)

2.7.2.3. Both must operate simultaneously to achieve fast flow and high reliability - the core DevOps outcomes

2.7.2.3.1. Frequent commit of small pieces of code

2.7.2.3.2. Fast test feedback

2.7.3. Drives feedback cycle

2.7.3.1. How fast experiments can be performed

2.7.3.2. How fast customer feedback can be received

2.8. Work in progress (WIP)

2.8.1. Definition

2.8.1.1. The amount of work in a system that has been started, but not finished.

2.8.2. Leading indicator

2.8.2.1. The more items in flight, the longer it will take to complete them (see the Little's Law sketch below)

2.8.3. Causes of delay

2.8.3.1. Context switching

2.8.3.2. Dependencies
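
The leading-indicator claim is usually formalized as Little's Law: in a stable system, average lead time = average WIP / average throughput. A sketch with illustrative numbers:

```python
def average_lead_time(avg_wip, avg_throughput_per_week):
    """Little's Law for a stable system: lead time = WIP / throughput."""
    return avg_wip / avg_throughput_per_week

# Illustrative: 24 items in flight, 6 completed per week -> 4-week lead time.
print(average_lead_time(24, 6), "weeks")
# Halving WIP at the same throughput halves the expected lead time.
print(average_lead_time(12, 6), "weeks")
```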

2.9. Measuring culture

2.9.1. Types of culture

2.9.1.1. Formal

2.9.1.1.1. Mission statement

2.9.1.1.2. Goals

2.9.1.2. Informal

2.9.1.2.1. Shared values

2.9.1.2.2. Social norms

2.9.2. How to measure culture

2.9.2.1. Perceptual measure - captured with surveys

2.9.2.1.1. Based on well-tested theories and expert input

2.9.2.1.2. Use a psychometric scale, e.g. Likert

2.9.2.1.3. e.g. the Westrum organizational culture model
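
A sketch of how such a perceptual measure is typically scored, assuming Westrum-style statements rated on a 1-7 Likert scale; the statements and responses below are hypothetical:

```python
from statistics import mean

# Each inner list is one respondent's 1-7 agreement ratings for statements
# such as "On my team, information is actively sought."
responses = [
    [6, 5, 7, 6],
    [4, 5, 5, 6],
    [7, 6, 6, 7],
]

# Average within each respondent, then across the team. In the Westrum
# model, higher scores indicate a more generative culture.
team_score = mean(mean(r) for r in responses)
print(f"team culture score: {team_score:.2f} / 7")
```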

2.9.2.2. Most important measure

2.9.2.2.1. Information flow across previously separate silos

2.9.3. When to measure culture

2.9.3.1. Regular intervals

2.9.3.1.1. Quarterly or twice-yearly

2.9.4. Why measure culture

2.9.4.1. Leading indicator of challenges

3. Scenarios

3.1. Scenario 1 - Moving to a SaaS culture

3.1.1. Areas for improvement

3.1.1.1. Improving live site health

3.1.1.2. Understanding usage

3.1.1.3. Improving engineering flow and reducing engineering debt

3.1.2. Metrics

3.1.2.1. Live site health

3.1.2.1.1. Time to detect

3.1.2.1.2. Time to communicate

3.1.2.1.3. Time to escalate

3.1.2.1.4. Time to mitigate

3.1.2.1.5. Time to implement learning, repair, and prevention (at root cause)

3.1.2.1.6. Total customer minutes outside of SLA

3.1.2.1.7. Outlier performance

3.1.2.1.8. Alerting efficiency

3.1.2.2. Usage

3.1.2.2.1. Model: Funnel from customer acquisition to first-time use, then engagement, retention, and conversion

3.1.2.2.2. Goal: Run the maximum number of meaningful experiments in the least time at the least cost

3.1.2.2.3. How: form hypotheses, expose a new version to a subset of users as an experiment, analyze the results, then decide to proceed or discard.

3.1.2.2.4. Add custom instrumentation around each business scenario to measure transitions in the funnel
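
A sketch of what that instrumentation can feed, assuming each business scenario emits a (user, stage) event into a telemetry store; the events and stage names below are hypothetical:

```python
from collections import defaultdict

FUNNEL = ["acquisition", "first_use", "engagement", "retention", "conversion"]

# Hypothetical instrumentation events: (user_id, funnel_stage).
events = [
    ("u1", "acquisition"), ("u1", "first_use"), ("u1", "engagement"),
    ("u2", "acquisition"), ("u2", "first_use"),
    ("u3", "acquisition"),
]

# Count distinct users per stage, then the transition rate between stages.
users = defaultdict(set)
for user, stage in events:
    users[stage].add(user)
counts = [len(users[s]) for s in FUNNEL]
for i in range(len(FUNNEL) - 1):
    rate = counts[i + 1] / counts[i] if counts[i] else 0.0
    print(f"{FUNNEL[i]} -> {FUNNEL[i + 1]}: {counts[i + 1]}/{counts[i]} = {rate:.0%}")
```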

3.1.2.3. Engineering velocity and debt indicators

3.1.2.3.1. Velocity

3.1.2.3.2. Debt

3.1.2.3.3. Tools and techniques

3.2. Scenario 2 - High Incident Volume

3.2.1. Problem

3.2.1.1. Team is overwhelmed with alerts

3.2.2. Metrics

3.2.2.1. Signal/noise ratio

3.2.2.1.1. (total alerts - false alarms) / total alerts

3.2.2.1.2. Eliminate false alerts
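
The ratio in code, with illustrative counts:

```python
def signal_to_noise(total_alerts, false_alarms):
    """Share of alerts that were real, actionable signal."""
    if total_alerts == 0:
        return 1.0  # no alerts, no noise
    return (total_alerts - false_alarms) / total_alerts

# Illustrative: 400 alerts in a week, of which 280 were false alarms.
print(f"{signal_to_noise(400, 280):.0%} signal")  # 30% signal
```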

3.2.2.2. Number of alerting and incident management systems

3.2.2.2.1. See the whole playing field

3.2.2.3. Number of alerts by origin

3.2.2.3.1. ??

3.2.2.4. Signal/noise ratio scoped to High Severity Incidents

3.2.2.4.1. Threshold crossing alarm (e.g. X times in Y minutes)

3.2.2.4.2. Reduce noise in transient errors
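
A sketch of the "X times in Y minutes" idea, with error timestamps in seconds; a one-off transient error never crosses the threshold, so it generates no alert:

```python
from collections import deque

def make_threshold_alarm(x_times, y_minutes):
    """Fire only when x_times errors occur within a sliding y_minutes window."""
    window = deque()  # timestamps (in seconds) of recent errors

    def on_error(ts):
        window.append(ts)
        while window and ts - window[0] > y_minutes * 60:
            window.popleft()  # drop errors that aged out of the window
        return len(window) >= x_times  # True -> raise an alert

    return on_error

alarm = make_threshold_alarm(x_times=3, y_minutes=5)
print([alarm(t) for t in (0, 10, 20)])  # [False, False, True]: a burst alerts
print(alarm(1000))                      # False: the old burst has aged out
```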

3.2.2.5. Top Correlation by Incident Volume

3.2.2.5.1. Identify and fix top root causes for alerts

3.3. Scenario 3 - Complex Integration Test Problems

3.3.1. Problem

3.3.1.1. Long code integration and test times

3.3.1.2. Low deployment and release frequencies

3.3.1.3. Poor release outcomes

3.3.1.3.1. High number of post-change incidents

3.3.1.3.2. High number of major customer-impacting incidents

3.3.1.3.3. Low release success rates

3.3.2. Solutions

3.3.2.1. Re-architect into smaller components

3.3.2.2. Build and test automation

3.3.2.3. Feature flags