GCP DevOps and SRE

1. GCP

1.1. Instance types

1.1.1. Managed instance group (MIG)

1.1.1.1. Same config

1.1.1.2. Autoscaling, updating, load balancing

1.1.2. Unmanaged instance group

1.1.2.1. individual config

1.1.2.2. No autoscaling

1.2. Charging

1.2.1. Preemptible VM

1.2.1.1. Short-lived

1.2.1.2. Apps must be fault-tolerant, e.g. batch jobs

1.2.2. Committed Use discounts

1.2.2.1. at least 1 year

2. Design

2.1. Reliability

2.1.1. Availability

2.1.1.1. Single point of failure

2.1.1.2. Correlated failure

2.1.1.3. Cascading failure

2.1.2. Durability

2.1.3. Scalability

2.1.3.1. Metric based auto scaling

2.1.3.1.1. metrics: [{type: Resource, resource: {name: cpu, target: {type: Utilization, averageUtilization: 30}}}]
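
The metrics snippet above has the shape of a Kubernetes HorizontalPodAutoscaler target (average CPU utilization of 30%). As a rough illustration of what such a target implies, the sketch below applies the documented HPA scaling rule, desired = ceil(current replicas x current metric / target metric). The function name and the sample numbers are made up for illustration; the 0.1 tolerance is the Kubernetes default and is assumed here.

```python
import math


def desired_replicas(current_replicas: int,
                     current_utilization: float,
                     target_utilization: float = 30.0,
                     tolerance: float = 0.1) -> int:
    """Replica count suggested by the HPA rule for a utilization target."""
    ratio = current_utilization / target_utilization
    # No scaling when the ratio is close enough to 1.0 (assumed default tolerance).
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    # Multiply before dividing to keep the arithmetic exact for round inputs.
    return math.ceil(current_replicas * current_utilization / target_utilization)


print(desired_replicas(4, 60.0))  # load is twice the target: scale out
print(desired_replicas(4, 31.0))  # within tolerance: unchanged
print(desired_replicas(4, 15.0))  # load is half the target: scale in
```

A low target such as 30% leaves more headroom per replica at the price of running more replicas for the same load.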

2.2. Security

2.2.1. People

2.2.2. Machine

2.2.2.1. Service account: Identity and access

2.2.3. Network

2.2.4. Encryption

2.2.4.1. Cloud KMS

2.2.4.2. DLP API

3. App Maintenance and Monitoring

3.1. Versioning

3.1.1. Rolling release

3.1.2. Canary release

3.2. Cost Planning

3.2.1. Data Studio

3.3. Monitoring Dashboard

3.3.1. Latency

3.3.2. Alert for SLO

3.4. White-box / Black-box monitoring

3.4.1. Internal / External

4. Build Team

4.1. Unify Vision

4.1.1. Value

4.2. Strategy

4.2.1. Identify threats and opportunities

4.2.2. Understand resources, capabilities and practice

4.2.3. Consider strategies for addressing threats and opportunities

4.2.4. Create alignment on communicating and coordinating work process

4.2.5. Value stream

4.3. Foster Collaboration

4.3.1. Service oriented meeting

4.3.2. Team

4.3.2.1. Tech Lead

4.3.2.2. Manager

4.3.2.3. Project Manager

4.4. Knowledge Sharing

4.4.1. Cross training

4.4.2. Employee to employee network

4.4.3. Job shadowing

5. Culture

5.1. Build Psychological Safety

5.1.1. Focus on System and Process not people

5.2. Reduce organizational silos

5.3. Design Thinking

5.3.1. Creativity & Structure

5.3.2. Steps

5.3.2.1. Empathize

5.3.2.2. Define

5.3.2.3. Ideate

5.3.2.4. Prototype

5.3.2.4.1. Prototyping

5.3.2.5. Test

5.4. Trust

6. Practice

6.1. Error Budget

6.1.1. Balance innovation and reliability

6.1.2. Error budget = 100% - SLO

6.1.3. Slow burn = 12 hrs, fast burn = 5 mins
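
The error budget arithmetic above can be sketched in a few lines. This is a minimal illustration, assuming a 30-day window and a budget spent entirely on full outages; the function names are invented here, and the slow-burn/fast-burn windows are not modeled.

```python
def error_budget_percent(slo_percent: float) -> float:
    """Error budget as a percentage: 100 - SLO."""
    return 100.0 - slo_percent


def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of total outage the budget allows over the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * error_budget_percent(slo_percent) / 100.0


def burn_rate(observed_error_percent: float, slo_percent: float) -> float:
    """Budget consumption speed; 1.0 means the budget lasts exactly the window."""
    return observed_error_percent / error_budget_percent(slo_percent)


for slo in (99.0, 99.9, 99.99):
    print(f"SLO {slo}%: {allowed_downtime_minutes(slo):.1f} min per 30 days")

# A 1% error rate against a 99.9% SLO burns the budget about ten times too fast.
print(f"burn rate: {burn_rate(1.0, 99.9):.1f}")
```

Each extra 9 cuts the allowed downtime by a factor of ten, which is one way to see why tighter SLOs cost more.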

6.2. SLI

6.2.1. quantifiable measure of a single aspect of service reliability that ideally has a close linear relationship with your users’ experience with your service.

6.3. SLO

6.3.1. combines an SLI with a target reliability—that is, it’s the threshold that, if crossed, turns happy customers into unhappy customers

6.3.1.1. as low as you can get away with while keeping users happy

6.3.1.2. the higher the SLO, the higher the cost in resources and people: one more 9 = 10x cost

6.3.1.3. SLO is stricter than SLA

6.4. CICD

6.4.1. Canarying

6.5. Toil Automation

6.5.1. Toil

6.5.1.1. Manual

6.5.1.2. Repetitive

6.5.1.3. Automatable

6.5.1.4. Tactical

6.5.1.5. Without enduring value

6.5.1.6. Scales linearly as service grows

6.5.2. Value of automation

6.5.2.1. Consistency

6.5.2.2. A platform

6.5.2.3. Quicker resolution

6.5.2.4. Faster action

6.5.2.5. Save time

6.6. Psychology of change

6.6.1. Types of individual

6.6.1.1. Navigator

6.6.1.2. Critics

6.6.1.3. Victims

6.6.1.4. Bystander

6.6.2. Help people to change

6.6.2.1. Head

6.6.2.2. Heart

6.6.2.3. Feet

6.6.3. Handle resistance to change

6.7. Measure Everything

6.7.1. Why?

6.7.1.1. IT and business can understand the current status of the service

6.7.1.2. IT can analyze data and identify necessary actions to improve

6.7.1.3. Request-based: 99% of requests have latency < 100 ms

6.7.2. Reliability

6.7.3. Toil

6.7.4. Monitoring

6.7.4.1. Symptoms rather than causes

6.7.4.2. Error budget burn

6.7.4.3. What?

6.7.4.3.1. Latency

6.7.4.3.2. Traffic

6.7.4.3.3. Errors

6.7.4.3.4. Saturation
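
The four signals listed under "What?" can be computed from very little data. The sketch below is a toy illustration, not a monitoring setup: `RequestRecord`, the CPU figures, and the choice of a nearest-rank 95th percentile are all assumptions, and it expects a non-empty list of records.

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    latency_ms: float
    ok: bool


def nearest_rank_percentile(values, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    ordered = sorted(values)
    rank = max(1, (p * len(ordered) + 99) // 100)  # integer ceil(p * n / 100)
    return ordered[rank - 1]


def golden_signals(records, window_seconds, cpu_used, cpu_capacity):
    """Latency, traffic, errors and saturation for one observation window."""
    return {
        "latency_p95_ms": nearest_rank_percentile([r.latency_ms for r in records], 95),
        "traffic_rps": len(records) / window_seconds,
        "error_rate": sum(1 for r in records if not r.ok) / len(records),
        "saturation": cpu_used / cpu_capacity,
    }


# Ten requests in five seconds, two of them failed, CPU at 3 of 4 cores.
records = [RequestRecord(latency_ms=10.0 * i, ok=(i not in (3, 7))) for i in range(1, 11)]
signals = golden_signals(records, window_seconds=5, cpu_used=3.0, cpu_capacity=4.0)
print(signals)
```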

6.8. Microservices

6.8.1. Best practices: Twelve-Factor App methodology

6.8.1.1. 1. Codebase

6.8.1.2. 2. Dependencies

6.8.1.3. 3. Config

6.8.1.4. 4. Backing services

6.8.1.5. 5. Build, release, run

6.8.1.6. 6. Processes

6.8.1.7. 7. Port binding

6.8.1.8. 8. Concurrency

6.8.1.9. 9. Disposability

6.8.1.10. 10. Dev/prod parity

6.8.1.11. 11. Logs

6.8.1.12. 12. Admin processes
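
As one concrete instance of the list above, factor 3 (Config) asks that configuration live in the environment rather than in the code. A minimal sketch follows; the variable names `DATABASE_URL`, `PORT` and `DEBUG` and the `Settings` class are hypothetical, chosen only for illustration.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    database_url: str
    port: int
    debug: bool


def load_settings(env=os.environ) -> Settings:
    """Build settings from environment variables (hypothetical names)."""
    return Settings(
        database_url=env["DATABASE_URL"],                   # required: KeyError if unset
        port=int(env.get("PORT", "8080")),                  # optional, with a default
        debug=env.get("DEBUG", "false").lower() == "true",  # optional flag
    )


# The same code runs in dev and prod; only the environment differs.
dev = load_settings({"DATABASE_URL": "postgres://localhost/dev", "DEBUG": "true"})
prod = load_settings({"DATABASE_URL": "postgres://db.internal/prod", "PORT": "80"})
print(dev)
print(prod)
```

Passing the mapping as a parameter keeps the function easy to test without touching the real process environment.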

6.9. Infra as Code

6.9.1. Tools

6.9.1.1. Terraform

6.9.1.2. Chef

6.9.1.3. Ansible

6.9.1.4. Puppet

6.9.2. Disposable

6.9.2.1. Destroy and create new

6.10. Incident Management

6.10.1. Roles

6.10.1.1. Incident Commander

6.10.1.2. Operations Lead

6.10.1.3. Communications Lead

6.10.2. Strengthen trust

6.10.2.1. respond and learn quickly

6.10.2.2. transparency

6.10.2.2.1. After action review

6.10.2.3. minimize outage

7. CI/CD

7.1. CI

7.1.1. Cloud Source Repo

7.1.1.1. Private Git

7.1.1.2. One-way sync from GitHub

7.1.1.3. Code search

7.1.1.4. IAM: admin/writer/reader

7.1.2. Source - Build - Test - Report - Release

7.1.3. Artifact Repo

7.2. CD

7.2.1. Cloud Build

7.2.1.1. Access

7.2.1.1.1. User IAM

7.2.1.1.2. Service Account

7.2.1.2. Container Registry (Artifact Registry)

7.2.1.2.1. Cloud storage

7.2.1.2.2. gcr.io - default US

7.2.1.2.3. IAM: push image - Storage Admin

7.2.1.3. Best Practices

7.2.1.3.1. Leaner container

7.2.1.3.2. Use Cache

7.2.1.3.3. Customize VM

7.2.2. Mode

7.2.2.1. Blue/Green

7.2.2.1.1. Run two versions of the app

7.2.2.2. Canary

7.2.2.2.1. Small subset of users

7.2.2.3. YAML
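
The canary mode above sends a small subset of users to the new version. One common way to pick that subset, sketched here as an assumption rather than how any particular GCP product does it, is to hash the user ID into a stable bucket from 0 to 99 and compare it with the canary percentage. The function names are hypothetical.

```python
import hashlib


def bucket(user_id: str) -> int:
    """Map a user ID to a stable bucket in 0..99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def route(user_id: str, canary_percent: int) -> str:
    """Return which version serves this user at the given canary percentage."""
    return "canary" if bucket(user_id) < canary_percent else "stable"


users = [f"user-{i}" for i in range(10_000)]
canary_share = sum(route(u, 5) == "canary" for u in users) / len(users)
print(f"share routed to canary at 5%: {canary_share:.3f}")
```

Because the bucket depends only on the user ID, a user stays on the same version between requests, and raising the percentage only ever adds users to the canary group.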

7.2.3. Tools

7.2.3.1. Jenkins

7.2.3.1.1. on GKE: less overhead

7.2.3.1.2. on Compute Engine: master/agent

7.2.3.2. Travis CI

7.2.3.3. Cloud Build

7.2.3.4. Spinnaker

7.2.3.4.1. auto create staging -> test staging -> approve -> prod

7.2.3.4.2. Fallback: rerun previous deployment

7.3. Security

7.3.1. Do NOT store secrets in the image

7.3.1.1. App layer encryption by Cloud KMS

7.3.2. Vulnerability - Container Registry scan

7.3.3. Binary Authorization

7.3.3.1. Cryptographically signed attestation

8. Operation Monitoring

8.1. Workspace

8.1.1. Project by app

8.1.1.1. By Dev / PROD

8.1.2. IAM by workspace: view/edit/admin

8.2. Metric

8.2.1. GCP 1000 metrics

8.2.2. Data to create chart, group in dashboard

8.2.3. Log based metric

8.2.3.1. Event - logging - filter - metric

8.2.4. Metric kinds: Gauge, Delta

8.2.4.1. Value types: Boolean, Int, Double, String

8.3. GCP Monitoring API

8.3.1. manipulate metric

8.3.2. expose to Grafana (service account)

8.3.3. Export data to BigQuery (after 6 weeks)

8.4. Cloud Operations for GKE

8.4.1. Integrate with Prometheus

8.4.2. Cluster-Node-Pod-Container

8.5. GCP Uptime check

8.5.1. VM, App Engine, URL, AWS LB

8.6. Logging

8.6.1. IAM

8.6.1.1. Logs Viewer: all logs except Data Access; same as project Viewer

8.6.1.2. Private Logs Viewer: all logs

8.6.1.3. Logs Configuration Writer

8.6.1.4. Logging Admin = project Owner

8.6.1.5. Service Account

8.6.1.5.1. writer only, no view

8.6.1.5.2. bucket writer

8.6.2. Action

8.6.2.1. Collect

8.6.2.1.1. Agent

8.6.2.2. Analyze

8.6.2.3. Export

8.6.2.3.1. BigQuery, Cloud Storage, Pub/Sub

8.6.2.3.2. Sink = Filter + Dest

8.6.2.4. Retain

8.6.2.4.1. 400 days
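
The export step above describes a sink as a filter plus a destination. The toy model below illustrates only that idea: it is not the Cloud Logging API, the `Sink` class and the entry fields are invented, and plain lists stand in for BigQuery, Cloud Storage, or Pub/Sub.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Sink:
    name: str
    log_filter: Callable[[dict], bool]               # which entries to export
    destination: list = field(default_factory=list)  # stand-in for a real destination


def route_entries(entries, sinks):
    """Copy each entry to every sink whose filter matches it."""
    for entry in entries:
        for sink in sinks:
            if sink.log_filter(entry):
                sink.destination.append(entry)


entries = [
    {"severity": "INFO", "resource": "gce_instance", "message": "started"},
    {"severity": "ERROR", "resource": "gce_instance", "message": "disk full"},
    {"severity": "ERROR", "resource": "k8s_container", "message": "crash loop"},
]
errors_sink = Sink("errors-to-bigquery", lambda e: e["severity"] == "ERROR")
gce_sink = Sink("gce-to-storage", lambda e: e["resource"] == "gce_instance")
route_entries(entries, [errors_sink, gce_sink])
print(len(errors_sink.destination), len(gce_sink.destination))
```

An entry can match several sinks, so the same log line may be exported to more than one destination.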

8.6.3. Audit logs

8.6.3.1. Admin activity

8.6.3.2. Data access - default disabled

8.6.3.3. System event

8.6.3.4. Access Transparency - logs record the actions taken by Google personnel.

8.6.4. Optional logs

8.6.4.1. VPC flow logs - Network logs by subnet

8.6.4.2. Firewall logs by rule

8.6.5. Log-based metrics
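
A log-based metric follows the chain noted earlier: event, logging, filter, metric. The sketch below shows the counting idea on in-memory entries. It is a simplified illustration with invented field names and a hypothetical `counter_metric` helper, not the Cloud Logging implementation.

```python
from collections import Counter


def counter_metric(entries, log_filter, label_key):
    """Count entries that pass the filter, grouped by one label."""
    counts = Counter()
    for entry in entries:
        if log_filter(entry):
            counts[entry.get(label_key, "unknown")] += 1
    return counts


log_entries = [
    {"severity": "ERROR", "service": "checkout"},
    {"severity": "INFO", "service": "checkout"},
    {"severity": "ERROR", "service": "search"},
    {"severity": "ERROR", "service": "checkout"},
    {"severity": "ERROR"},
]
errors_by_service = counter_metric(log_entries, lambda e: e["severity"] == "ERROR", "service")
print(dict(errors_by_service))
```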

8.7. Error Reporting

8.7.1. Aggregate

8.7.2. Alerting Policy

8.7.2.1. Condition

8.7.2.1.1. threshold

8.7.2.1.2. duration

8.7.2.2. Notification

8.7.2.2.1. webhook: HTTP POST in JSON

8.7.2.3. Evaluation / alert parameters
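
An alerting policy, as outlined above, pairs a condition (threshold plus duration) with a notification such as a webhook that receives an HTTP POST with a JSON body. The sketch below illustrates both halves under loud assumptions: the payload fields are made up and do not reproduce the Cloud Monitoring webhook schema, the URL is a placeholder, and the request is built but never sent.

```python
import json
import urllib.request


def condition_met(values, threshold, duration_samples):
    """True when the last `duration_samples` values all exceed the threshold."""
    recent = values[-duration_samples:]
    return len(recent) == duration_samples and all(v > threshold for v in recent)


def build_webhook_request(url, policy_name, state, summary):
    """Prepare (but do not send) an HTTP POST carrying a JSON payload."""
    payload = {"policy_name": policy_name, "state": state, "summary": summary}  # hypothetical fields
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


cpu_samples = [0.55, 0.91, 0.93, 0.95]  # one sample per minute, for illustration
if condition_met(cpu_samples, threshold=0.9, duration_samples=3):
    request = build_webhook_request(
        "https://example.com/hooks/alerts",  # placeholder endpoint
        policy_name="high-cpu",
        state="open",
        summary="CPU above 90% for 3 minutes",
    )
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
    # urllib.request.urlopen(request) would actually deliver the notification.
```

Requiring the threshold to hold for the whole duration avoids paging on a single spike.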

8.8. Service Monitoring

8.8.1. Error budget burn

8.8.1.1. Performance metric and goal

8.8.2. SLO

8.8.2.1. Request-based: 99% of requests have latency < 100 ms

8.8.2.2. Window-based: 99% of 15-minute windows meet 95th-percentile latency < 100 ms
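
The two SLO styles above can give different answers for the same traffic. The sketch below computes both on invented latency data. The nearest-rank percentile and the function names are assumptions for illustration, and each inner list stands for one 15-minute window.

```python
def nearest_rank_percentile(values, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    ordered = sorted(values)
    rank = max(1, (p * len(ordered) + 99) // 100)  # integer ceil(p * n / 100)
    return ordered[rank - 1]


def request_based_compliance(windows, threshold_ms=100.0):
    """Fraction of all requests faster than the threshold."""
    latencies = [ms for window in windows for ms in window]
    return sum(1 for ms in latencies if ms < threshold_ms) / len(latencies)


def window_based_compliance(windows, threshold_ms=100.0, p=95):
    """Fraction of windows whose p-th percentile latency beats the threshold."""
    good = sum(1 for w in windows if nearest_rank_percentile(w, p) < threshold_ms)
    return good / len(windows)


windows = [
    [10.0] * 19 + [500.0],       # one slow request: p95 is still 10 ms
    [10.0] * 18 + [500.0] * 2,   # two slow requests: p95 becomes 500 ms
]
print(f"request-based: {request_based_compliance(windows):.3f}")
print(f"window-based:  {window_based_compliance(windows):.3f}")
```

Here 92.5% of requests are fast, yet only half of the windows are good, because a window-based SLO counts a bad window in full no matter how few requests missed the target.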

8.9. APM

8.9.1. Debugger

8.9.1.1. Languages

8.9.1.2. Env

8.9.1.3. Source code search

8.9.1.4. Breakpoint - Snapshot

8.9.1.5. Logpoint - insert logging into running code

8.9.2. Trace (latency)

8.9.2.1. end to end aggregated latency data

8.9.2.2. bottleneck

8.9.3. Profiler (CPU, memory)

8.9.4. Error reporting API

8.9.4.1. Import library

8.9.4.2. Init Client

8.9.4.3. Call report()
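
The three steps above (import the library, initialize a client, call report()) might look like the sketch below with the `google-cloud-error-reporting` Python package. It assumes that package is installed and that credentials are available when `make_client()` is called. The wrapper `run_reported` and the stand-in `RecordingClient` are hypothetical helpers added so the flow can be tried without a GCP project.

```python
def make_client():
    """Steps 1 and 2: import the library and initialize a client."""
    # Imported lazily so this module loads even where the package is absent.
    from google.cloud import error_reporting
    return error_reporting.Client()


def run_reported(client, task):
    """Step 3: run a task and call report() on the client if it fails."""
    try:
        return task()
    except Exception as exc:
        client.report(f"task failed: {exc!r}")
        raise


class RecordingClient:
    """Local stand-in exposing the same report() method, for dry runs."""

    def __init__(self):
        self.messages = []

    def report(self, message):
        self.messages.append(message)


client = RecordingClient()  # swap in make_client() against a real project
try:
    run_reported(client, lambda: 1 / 0)
except ZeroDivisionError:
    pass
print(client.messages)
```

Re-raising after reporting keeps normal error handling intact, so reporting stays a side effect and does not swallow the failure.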

8.10. Network

8.10.1. Network Intelligence Center

8.10.1.1. Topology

8.10.1.2. Connectivity Tests

8.10.1.3. Performance

8.10.1.4. Firewall insights

8.11. Why?

8.11.1. KISS

8.12. What

8.12.1. Specific

8.12.2. Measurable

8.12.3. Achievable

8.12.4. Relevant

8.12.5. Time-bound