AWS Well-Architected Framework


1. Compiled by Marat Levykin. July 2018. For comments and suggestions, please contact me at [email protected]

2. Intro

2.1. The AWS Well-Architected Framework helps you understand the pros and cons of decisions you make while building systems on AWS. By using the Framework you will learn architectural best practices for designing and operating reliable, secure, efficient, and cost-effective systems in the cloud. It provides a way for you to consistently measure your architectures against best practices and identify areas for improvement.

3. Five Pillars

3.1. Operational Excellence

3.1.1. The ability to run and monitor systems to deliver business value and to continually improve supporting processes and procedures.

3.2. Security

3.2.1. The ability to protect information, systems, and assets while delivering business value through risk assessments and mitigation strategies.

3.3. Reliability

3.3.1. The ability of a system to recover from infrastructure or service disruptions, dynamically acquire computing resources to meet demand, and mitigate disruptions such as misconfigurations or transient network issues.

3.4. Performance Efficiency

3.4.1. The ability to use computing resources efficiently to meet system requirements, and to maintain that efficiency as demand changes and technologies evolve.

3.5. Cost Optimization

3.5.1. The ability to run systems to deliver business value at the lowest price point.

4. Mindset

4.1. “Good intentions never work, you need good mechanisms to make anything happen.” (Jeff Bezos) This means replacing humans' best efforts with mechanisms (often automated) that check for compliance with rules or process.

5. TOGAF: The TOGAF® Standard, Version 9.2

6. General Design Principles

6.1. Stop guessing your capacity needs

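6.1.1. When you make a capacity decision before deploying a system, you can end up with expensive idle resources or with the performance problems of limited capacity. With cloud computing, these problems can go away. You can use as much or as little capacity as you need, and scale up and down automatically.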

6.2. Test systems at production scale

6.2.1. In the cloud, you can create a production-scale test environment on demand, complete your testing, and then decommission the resources

6.3. Automate to make architectural experimentation easier

6.3.1. Automation allows you to create and replicate your systems at low cost and avoid the expense of manual effort.

6.4. Allow for evolutionary architectures

6.4.1. As a business and its context continue to change, initial decisions might hinder the system’s ability to deliver changing business requirements. In the cloud, the capability to automate and test on demand lowers the risk of impact from design changes.

6.5. Drive architectures using data

6.5.1. In the cloud you can collect data on how your architectural choices affect the behavior of your workload. This lets you make fact-based decisions on how to improve your workload. Your cloud infrastructure is code, so you can use that data to inform your architecture choices and improvements over time.

6.6. Improve through game days

6.6.1. Test how your architecture and processes perform by regularly scheduling game days to simulate events in production.

7. Operational Excellence

7.1. Design Principles

7.1.1. Perform operations as code

7.1.1.1. You can define your entire workload (applications, infrastructure) as code and update it with code.

7.1.1.2. You can script your operations procedures and automate their execution by triggering them in response to events.
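
The idea of scripting procedures and triggering them from events can be sketched as follows. This is a minimal illustration, not an AWS API: the registry, the event types ("disk_almost_full", "deployment_failed"), and the event fields are all hypothetical names chosen for the example.

```python
from typing import Callable, Dict

# Hypothetical registry that maps an event type to a scripted procedure.
RUNBOOKS: Dict[str, Callable[[dict], str]] = {}


def runbook(event_type: str):
    """Register a function as the scripted procedure for one event type."""
    def register(func: Callable[[dict], str]) -> Callable[[dict], str]:
        RUNBOOKS[event_type] = func
        return func
    return register


@runbook("disk_almost_full")
def rotate_logs(event: dict) -> str:
    # A real procedure would call your own tooling; this only describes the action.
    return f"rotated logs on {event['host']}"


@runbook("deployment_failed")
def roll_back(event: dict) -> str:
    return f"rolled back {event['service']} to {event['previous_version']}"


def handle(event: dict) -> str:
    """Run the procedure registered for the event, or escalate to a person."""
    procedure = RUNBOOKS.get(event["type"])
    if procedure is None:
        return f"no runbook for {event['type']}; escalating to on-call"
    return procedure(event)


print(handle({"type": "disk_almost_full", "host": "web-1"}))
```

Keeping the procedures in a registry like this means they live in version control and can be reviewed and tested like any other code.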

7.1.2. Annotate documentation

7.1.2.1. In the cloud, you can automate the creation of annotated documentation after every build (or automatically annotate hand-crafted documentation). Annotated documentation can be used by people and systems. Use annotations as an input to your operations code.

7.1.3. Make frequent, small, reversible changes

7.1.3.1. Design workloads to allow components to be updated regularly. Make changes in small increments that can be reversed if they fail (without affecting customers when possible).

7.1.4. Refine operations procedures frequently

7.1.4.1. As you use operations procedures, look for opportunities to improve them. As you evolve your workload, evolve your procedures appropriately. Set up regular game days to review and validate that all procedures are effective and that teams are familiar with them.

7.1.5. Anticipate failure

7.1.5.1. Test your failure scenarios and validate your understanding of their impact. Test your response procedures to ensure that they are effective, and that teams are familiar with their execution.

7.1.6. Learn from all operational failures

7.1.6.1. Drive improvement through lessons learned from all operational events and failures. Share what is learned across teams and through the entire organization.

7.2. Best practice areas

7.2.1. The AWS service that is essential to Operational Excellence is AWS CloudFormation

7.2.2. 1. Prepare

7.2.2.1. Mechanisms to monitor and gain insight (application, platform, infrastructure components, customer experience and behavior)

7.2.2.2. Validate through checklists to ensure a workload meets defined standards and capture it in runbooks

7.2.2.3. Using AWS CloudFormation enables you to have consistent, templated, sandbox development, test, and production environments

7.2.2.4. Implement the minimum number of architecture standards for your workloads

7.2.2.5. Balance the cost to implement a standard against the benefit to the workload and operations

7.2.2.6. Invest in scripting operations activities to maximize the productivity of operations personnel, minimize error rates, and enable automated responses

7.2.3. 2. Operate

7.2.3.1. Operational health includes both the health of the workload and the health of operations (for example, deployment and incident response).

7.2.3.2. Communicate the operational status of workloads through dashboards and notifications that are tailored to the target audience (for example, customer, business, developers, operations)

7.2.3.3. AWS provides workload insights through logging capabilities including AWS X-Ray, CloudWatch, CloudTrail, and VPC Flow Logs

7.2.3.4. Routine operations and responses to unplanned events should be automated. Avoid manual processes for deployments, release management, changes, and rollbacks.

7.2.3.5. Releases should NOT be large batches that are done infrequently.

7.2.3.6. Have a rollback plan, and the ability to mitigate failure impacts for continuity of operations

7.2.4. 3. Evolve

7.2.4.1. Dedicate work cycles to making continuous incremental improvements.

7.2.4.2. Regularly evaluate and prioritize opportunities for improvement, including both the workload and operations procedures.

7.2.4.3. Share lessons learned across teams so that every team benefits from them.

7.2.4.4. Analyze metrics and trends within lessons learned and perform cross-team retrospective analysis

8. Security

8.1. Design Principles

8.1.1. Implement a strong identity foundation

8.1.1.1. Implement the principle of least privilege and enforce separation of duties, with appropriate authorization for each interaction with your AWS resources. Centralize privilege management and reduce or even eliminate reliance on long-term credentials.

8.1.2. Enable traceability

8.1.2.1. Monitor, alert, and audit actions and changes to your environment in real time. Integrate logs and metrics with systems to automatically respond and take action.

8.1.3. Apply security at all layers

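8.1.3.1. Rather than protecting only a single outer layer, apply a defense-in-depth approach with security controls at every layer (for example, edge network, VPC, subnet, load balancer, every instance, operating system, and application).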

8.1.4. Automate security best practices

8.1.4.1. Create secure architectures, including the implementation of controls that are defined and managed as code in version-controlled templates.
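
One way to treat a security control as code is to run an automated check against policy documents kept in version control. The sketch below assumes the standard IAM JSON policy layout (a "Statement" list with "Effect", "Action", and "Resource" fields) and flags Allow statements that grant every action or every resource. The function names and the example policy are hypothetical, and a real check would cover more cases (for example, "NotAction" or service-level wildcards such as "s3:*").

```python
import json
from typing import List


def as_list(value) -> list:
    """IAM policy fields may hold a single item or a list of items."""
    return value if isinstance(value, list) else [value]


def find_wildcard_statements(policy_json: str) -> List[int]:
    """Return indexes of Allow statements that grant "*" actions or resources."""
    policy = json.loads(policy_json)
    flagged = []
    for index, statement in enumerate(as_list(policy["Statement"])):
        if statement.get("Effect") != "Allow":
            continue
        actions = as_list(statement.get("Action", []))
        resources = as_list(statement.get("Resource", []))
        if "*" in actions or "*" in resources:
            flagged.append(index)
    return flagged


# Hypothetical policy: the first statement is scoped, the second is not.
EXAMPLE_POLICY = json.dumps({
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": "s3:GetObject",
         "Resource": "arn:aws:s3:::example-bucket/*"},
        {"Effect": "Allow", "Action": "*", "Resource": "*"},
    ],
})

print(find_wildcard_statements(EXAMPLE_POLICY))  # prints [1]
```

A check like this could run in a build pipeline so that an overly broad policy is rejected before it is deployed.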

8.1.5. Protect data in transit and at rest

8.1.5.1. Classify your data into sensitivity levels and use mechanisms, such as encryption, tokenization, and access control where appropriate.

8.1.6. Keep people away from data

8.1.6.1. Create mechanisms and tools to reduce or eliminate the need for direct access or manual processing of data

8.1.7. Prepare for security events

8.1.7.1. Have an incident management process. Run incident response simulations and use tools with automation to increase your speed for detection, investigation, and recovery.

8.2. Best practice areas (Key AWS Services)

8.2.1. 1. Identity and Access Management

8.2.1.1. IAM enables you to securely control access to AWS services and resources.

8.2.1.2. MFA (multi-factor authentication) adds an additional layer of protection on user access.

8.2.1.3. AWS Organizations lets you centrally manage and enforce policies for multiple AWS accounts.

8.2.2. 2. Detective Controls

8.2.2.1. AWS CloudTrail records AWS API calls

8.2.2.2. AWS Config provides a detailed inventory of your AWS resources and configuration.

8.2.2.3. Amazon GuardDuty is a managed threat detection service that continuously monitors for malicious or unauthorized behavior.

8.2.2.4. Amazon CloudWatch is a monitoring service for AWS resources which can trigger CloudWatch Events to automate security responses.

8.2.3. 3. Infrastructure Protection

8.2.3.1. Amazon Virtual Private Cloud (Amazon VPC) enables you to launch AWS resources into a virtual network

8.2.3.2. Amazon CloudFront is a global content delivery network (CDN) that securely delivers data, videos, applications, and APIs to your viewers, and integrates with AWS Shield for DDoS mitigation.

8.2.3.3. AWS WAF is a web application firewall that is deployed on either Amazon CloudFront or Application Load Balancer to help protect your web applications from common web exploits.

8.2.4. 4. Data Protection

8.2.4.1. Services such as ELB, Amazon Elastic Block Store (Amazon EBS), Amazon S3, and Amazon Relational Database Service (Amazon RDS) include encryption capabilities to protect your data in transit and at rest.

8.2.4.2. Amazon Macie automatically discovers, classifies and protects sensitive data

8.2.4.3. AWS Key Management Service (AWS KMS) makes it easy for you to create and control keys used for encryption

8.2.5. 5. Incident Response

8.2.5.1. IAM should be used to grant appropriate authorization to incident response teams and response tools

8.2.5.2. AWS CloudFormation can be used to create a trusted environment or clean room for conducting investigations.

8.2.5.3. Amazon CloudWatch Events allows you to create rules that trigger automated responses including AWS Lambda
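
As a rough sketch of such an automated response, the Python function below has the handler shape that AWS Lambda expects (an event and a context argument) and decides what to do with a GuardDuty finding delivered by a CloudWatch Events rule. The field names ("source", "detail", "severity", "id") follow the event format as I understand it and should be checked against the current documentation; the severity cut-off and the action names are assumptions made for the example, and the code only reports a decision rather than calling any AWS API.

```python
SEVERITY_THRESHOLD = 7.0  # assumed cut-off for taking automated action


def choose_response(event: dict) -> str:
    """Pick an action name for an incoming event (hypothetical action names)."""
    if event.get("source") != "aws.guardduty":
        return "ignore"
    severity = float(event.get("detail", {}).get("severity", 0))
    if severity >= SEVERITY_THRESHOLD:
        return "isolate"
    return "notify"


def lambda_handler(event, context):
    """Entry point in the shape AWS Lambda uses for Python functions."""
    action = choose_response(event)
    # A real function would call AWS APIs here (for example, to restrict a
    # security group); this sketch only returns the decision it made.
    return {"action": action, "finding_id": event.get("detail", {}).get("id")}


sample_event = {"source": "aws.guardduty", "detail": {"severity": 8, "id": "abc"}}
print(lambda_handler(sample_event, None))
```

Separating the decision (choose_response) from the entry point keeps the logic easy to test without deploying anything.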

9. Reliability

9.1. Design principles

9.1.1. Test recovery procedures

9.1.1.1. In the cloud, you can test how your system fails, and you can validate your recovery procedures. You can use automation to simulate different failures or to recreate scenarios that led to failures before.

9.1.2. Automatically recover from failure

9.1.2.1. Trigger automation when a threshold is breached. This allows for automatic notification and tracking of failures, and for automated recovery processes that work around or repair the failure. With more sophisticated automation, it's possible to anticipate and remediate failures before they occur.
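
The threshold-triggered automation described above can be sketched in a few lines. The class below is a hypothetical stand-in for a monitoring alarm, not a CloudWatch API: it runs a recovery callback once a metric has exceeded its threshold for a given number of consecutive datapoints. The threshold, the number of periods, and the sample error rates are made up for the example.

```python
from typing import Callable, List


class ThresholdAlarm:
    """Run a recovery action after several consecutive breaching datapoints."""

    def __init__(self, threshold: float, periods: int,
                 on_breach: Callable[[float], None]):
        self.threshold = threshold
        self.periods = periods
        self.on_breach = on_breach
        self._consecutive = 0

    def record(self, value: float) -> bool:
        """Record one datapoint; return True if the recovery action ran."""
        if value > self.threshold:
            self._consecutive += 1
        else:
            self._consecutive = 0  # a healthy datapoint resets the count
        if self._consecutive >= self.periods:
            self._consecutive = 0
            self.on_breach(value)
            return True
        return False


actions: List[str] = []
alarm = ThresholdAlarm(
    threshold=0.05, periods=3,
    on_breach=lambda rate: actions.append(f"restart after error rate {rate:.0%}"),
)
for error_rate in [0.01, 0.08, 0.09, 0.02, 0.07, 0.08, 0.10]:
    alarm.record(error_rate)

print(actions)  # prints ['restart after error rate 10%']
```

Requiring several consecutive breaches is a design choice that avoids reacting to a single noisy datapoint.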

9.1.3. Scale horizontally to increase aggregate system availability

9.1.3.1. Replace one large resource with multiple small resources to reduce the impact of a single failure on the overall system
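
A small calculation shows why this helps. Assuming the load is spread evenly across equally sized resources (a simplification), the share of capacity that survives a failure grows with the number of resources:

```python
def capacity_after_failures(resource_count: int, failed: int = 1) -> float:
    """Fraction of total capacity left when `failed` equal-sized resources are lost."""
    if resource_count <= 0:
        raise ValueError("resource_count must be positive")
    failed = min(failed, resource_count)
    return (resource_count - failed) / resource_count


for count in (1, 4, 10):
    print(count, capacity_after_failures(count))
# prints:
# 1 0.0
# 4 0.75
# 10 0.9
```

With one large resource a single failure removes all capacity, while with ten small ones the same failure removes only a tenth of it.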

9.1.4. Stop guessing capacity

9.1.4.1. In the cloud, you can monitor demand and system utilization, and automate the addition or removal of resources to maintain the optimal level to satisfy demand without over- or under-provisioning.
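
To illustrate automated addition and removal of resources, here is a simple proportional scaling rule. It is a sketch in the spirit of target-based scaling, not the algorithm any AWS service actually uses; the function name, the utilization figures, and the minimum and maximum bounds are assumptions for the example.

```python
import math


def desired_capacity(current: int, observed_utilization: float,
                     target_utilization: float,
                     minimum: int, maximum: int) -> int:
    """Scale the resource count so utilization moves toward the target."""
    if target_utilization <= 0:
        raise ValueError("target_utilization must be positive")
    needed = math.ceil(current * observed_utilization / target_utilization)
    # Keep the result inside the configured bounds.
    return max(minimum, min(maximum, needed))


print(desired_capacity(4, 90, 60, minimum=2, maximum=10))  # prints 6
print(desired_capacity(4, 15, 60, minimum=2, maximum=10))  # prints 2
```

In the first call, four resources at 90% utilization against a 60% target become six; in the second, low utilization would allow a single resource, but the minimum of two is respected.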

9.1.5. Manage change in automation

9.1.5.1. Changes to your infrastructure should be done using automation. The changes that need to be managed are changes to the automation.

9.2. Best practice areas (Key AWS Services)

9.2.1. 1. Foundations

9.2.1.1. IAM enables you to securely control access to AWS services and resources.

9.2.1.2. Amazon VPC lets you provision a private, isolated section of the AWS Cloud where you can launch AWS resources in a virtual network

9.2.1.3. AWS Trusted Advisor provides visibility into service limits

9.2.1.4. AWS Shield is a managed Distributed Denial of Service (DDoS) protection service that safeguards web applications running on AWS.

9.2.2. 2. Change Management

9.2.2.1. AWS CloudTrail records AWS API calls for your account and delivers log files to you for auditing

9.2.2.2. AWS Config provides a detailed inventory of your AWS resources and configuration, and continuously records configuration changes

9.2.2.3. Auto Scaling provides automated demand management for a deployed workload

9.2.2.4. CloudWatch provides the ability to alert on metrics, including custom metrics

9.2.2.5. CloudWatch also has a logging feature that can be used to aggregate log files from your resources

9.2.3. 3. Failure Management

9.2.3.1. AWS CloudFormation provides templates for the creation of AWS resources and provisions them in an orderly and predictable fashion

9.2.3.2. Amazon S3 provides a highly durable service to keep backups

9.2.3.3. Amazon Glacier provides highly durable archives

9.2.3.4. AWS KMS provides a reliable key management system that integrates with many AWS services

10. Performance Efficiency

10.1. Design principles

10.1.1. Democratize advanced technologies

10.1.1.1. Rather than having your IT team learn how to host and run a new technology (such as NoSQL databases, media transcoding, or machine learning), they can simply consume it as a service, focusing on product development rather than resource provisioning and management.

10.1.2. Go global in minutes

10.1.2.1. Easily deploy your system in multiple Regions around the world with just a few clicks.

10.1.3. Use serverless architectures

10.1.3.1. This not only removes the operational burden of managing servers, but also can lower transactional costs because these managed services operate at cloud scale.

10.1.4. Experiment more often

10.1.4.1. With virtual and automatable resources, you can quickly carry out comparative testing using different types of instances, storage, or configurations.

10.1.5. Mechanical sympathy

10.1.5.1. Use the technology approach that aligns best to what you are trying to achieve. For example, consider data access patterns when selecting database or storage approaches.

10.2. Best practice areas (Key AWS Services)

10.2.1. 0

10.2.1.1. Take a data-driven approach to selecting a high-performance architecture

10.2.1.2. Your architecture will likely combine a number of different architectural approaches (for example, event-driven, ETL, or pipeline).

10.2.2. 1. Selection

10.2.2.1. Four main resource types that you should consider (compute, storage, database, and network)

10.2.2.2. Compute

10.2.2.2.1. In AWS, compute is available in three forms: instances, containers, and functions

10.2.2.2.2. Instances

10.2.2.2.3. Containers

10.2.2.2.4. Functions

10.2.2.3. Storage

10.2.2.3.1. The optimal storage solution for a particular system will vary based on the following factors:

10.2.2.3.2. Access method (block, file, or object)

10.2.2.3.3. Patterns of access (random or sequential)

10.2.2.3.4. Throughput required

10.2.2.3.5. Frequency of access (online, offline, archival)

10.2.2.3.6. Frequency of update (WORM, dynamic)

10.2.2.3.7. Availability and durability constraints

10.2.2.3.8. Well-architected systems use multiple storage solutions and enable different features to improve performance.

10.2.2.4. Database

10.2.2.4.1. The optimal database solution for a particular system can vary based on requirements for availability, consistency, partition tolerance, latency, durability, scalability, and query capability

10.2.2.5. Network

10.2.2.5.1. In AWS, networking is virtualized and is available in a number of different types and configurations.

10.2.2.5.2. AWS offers product features (for example, Enhanced Networking, Amazon EBS-optimized instances, Amazon S3 transfer acceleration, dynamic Amazon CloudFront) to optimize network traffic

10.2.2.5.3. AWS also offers networking features (for example, Amazon Route 53 latency routing, Amazon VPC endpoints, and AWS Direct Connect) to reduce network distance or jitter.

10.2.3. 2. Review

10.2.3.1. Over time new technologies and approaches become available that could improve the performance of your architecture.

10.2.4. 3. Monitoring

10.2.4.1. After you have implemented your architecture you will need to monitor its performance so that you can remediate any issues before your customers are aware.

10.2.4.2. Amazon CloudWatch provides the ability to monitor and send notification alarms. You can use automation to work around performance issues by triggering actions through Amazon Kinesis, Amazon Simple Queue Service (Amazon SQS), and AWS Lambda.

10.2.5. 4. Tradeoffs

10.2.5.1. Depending on your situation you could trade consistency, durability, and space versus time or latency to deliver higher performance.

10.2.5.1.1. Using AWS, you can go global in minutes and deploy resources in multiple locations across the globe to be closer to your end users

10.2.5.1.2. You can also dynamically add read-only replicas to information stores such as database systems to reduce the load on the primary database

10.2.5.1.3. AWS also offers caching solutions such as Amazon ElastiCache, which provides an in-memory data store or cache

10.2.5.1.4. Amazon CloudFront caches copies of your static content closer to end users

10.2.5.1.5. Amazon DynamoDB Accelerator (DAX) provides a read-through/write-through distributed caching tier in front of Amazon DynamoDB, supporting the same API, but providing sub-millisecond latency for entities that are in the cache
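
The read-through/write-through pattern mentioned for DAX can be sketched generically. The class below is not the DAX client or any AWS API: a plain dictionary stands in for the database table, and there is no eviction or expiry, which a real cache would need. It shows only the pattern itself: reads fall through to the store on a miss, and writes go to both the store and the cache.

```python
from typing import Dict, Optional


class ReadWriteThroughCache:
    """Minimal read-through/write-through cache in front of a key-value store."""

    def __init__(self, store: Dict[str, str]):
        self.store = store              # stands in for the backing database
        self.cache: Dict[str, str] = {}
        self.store_reads = 0            # counts how often the store is hit

    def get(self, key: str) -> Optional[str]:
        if key in self.cache:
            return self.cache[key]      # cache hit: the store is not touched
        self.store_reads += 1
        value = self.store.get(key)     # read-through on a miss
        if value is not None:
            self.cache[key] = value
        return value

    def put(self, key: str, value: str) -> None:
        self.store[key] = value         # write-through: update the store...
        self.cache[key] = value         # ...and keep the cache in step


table = {"user#1": "Ada"}
cache = ReadWriteThroughCache(table)
cache.get("user#1")
cache.get("user#1")
cache.put("user#2", "Lin")
print(cache.get("user#2"), cache.store_reads)  # prints Lin 1
```

Only the first read of "user#1" reaches the store; the repeated read and the read after the write are served from memory, which is where the latency benefit comes from.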

10.2.5.2. Amazon ElastiCache, Amazon CloudFront, and AWS Snowball are services that allow you to improve performance. Read replicas in Amazon RDS can allow you to scale read-heavy workloads.

11. Cost Optimization

11.1. Design principles

11.1.1. Adopt a consumption model

11.1.1.1. Pay only for the computing resources that you require and increase or decrease usage depending on business requirements, not by using elaborate forecasting.

11.1.1.1.1. For example, development and test environments are typically only used for eight hours a day during the work week. You can stop these resources when they are not in use for a potential cost savings of 75% (40 hours versus 168 hours).
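
The arithmetic behind this figure can be checked directly. The sketch assumes that cost is proportional to running hours, which ignores charges that continue while resources are stopped (such as storage).

```python
HOURS_PER_WEEK = 7 * 24  # 168 hours


def savings_fraction(hours_used_per_week: float) -> float:
    """Fraction of an always-on bill saved by running only the hours used."""
    return 1 - hours_used_per_week / HOURS_PER_WEEK


# Eight hours a day for a five-day work week is 40 hours.
print(round(savings_fraction(8 * 5), 3))  # prints 0.762
```

Running 40 of 168 hours saves roughly 76%, which the example above rounds to 75%.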

11.1.2. Measure overall efficiency

11.1.2.1. Measure the business output of the system and the costs associated with delivering it.

11.1.3. Stop spending money on data center operations

11.1.3.1. AWS does the heavy lifting of racking, stacking, and powering servers, so you can focus on your customers and business projects rather than on IT infrastructure.

11.1.4. Analyze and attribute expenditure

11.1.4.1. The cloud makes it easier to accurately identify the usage and cost of systems, which then allows transparent attribution of IT costs to individual business owners. This helps measure return on investment (ROI).

11.1.5. Use managed and application level services to reduce cost of ownership

11.1.5.1. In the cloud, managed and application level services remove the operational burden of maintaining servers for tasks such as sending email or managing databases.

11.2. Best practice areas (Key AWS Services)

11.2.1. 0

11.2.1.1. There are tradeoffs to consider. For example, do you want to optimize for speed to market or for cost?

11.2.1.2. Design decisions are sometimes guided by haste as opposed to empirical data, as the temptation always exists to overcompensate “just in case” rather than spend time benchmarking for the most cost-optimal system over time.

11.2.1.3. AWS Cost Explorer allows you to view and track your usage in detail. AWS Budgets will notify you if your usage or spend exceeds actual or forecast budgeted amounts.

11.2.2. 1. Cost-Effective Resources

11.2.2.1. Using the appropriate instances and resources for your system is key to cost savings.

11.2.2.1.1. For example, a reporting process might take five hours to run on a smaller server but one hour to run on a larger server that is twice as expensive. Both servers give you the same outcome, but the smaller server incurs more cost over time.

11.2.2.1.2. For example, rather than maintaining servers to deliver email, you can use a managed service that charges on a per-message basis.
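
The reporting example above can be worked through with a hypothetical price. The hourly rate of 1.0 is an arbitrary unit chosen for the illustration; only the ratio between the two servers comes from the text.

```python
def run_cost(hours: float, hourly_rate: float) -> float:
    """Cost of one run, assuming billing proportional to running time."""
    return hours * hourly_rate


SMALL_RATE = 1.0             # hypothetical price unit per hour
LARGE_RATE = 2 * SMALL_RATE  # the larger server is twice as expensive

small_cost = run_cost(5, SMALL_RATE)  # five hours on the smaller server
large_cost = run_cost(1, LARGE_RATE)  # one hour on the larger server

print(small_cost, large_cost)  # prints 5.0 2.0
```

Each run on the smaller server costs 2.5 times as much as the same run on the larger one, even though its hourly price is lower.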

11.2.2.2. Instance types

11.2.2.2.1. On-Demand Instances allow you to pay for compute capacity by the hour, with no minimum commitments required.

11.2.2.2.2. Reserved Instances allow you to reserve capacity and offer savings of up to 75% off On-Demand pricing

11.2.2.2.3. Spot Instances let you use spare Amazon EC2 capacity at savings of up to 90% off On-Demand pricing. Individual servers can come and go dynamically.

11.2.3. 2. Matching supply and demand

11.2.3.1. In AWS, you can automatically provision resources to match demand. Auto Scaling and demand, buffer, and time-based approaches allow you to add and remove resources as needed.

11.2.3.2. Demand can be fixed or variable, requiring metrics and automation to ensure that management does not become a significant cost.

11.2.4. 3. Expenditure Awareness

11.2.4.1. Ease of use and virtually unlimited on-demand capacity may require a new way of thinking about expenditures.

11.2.4.2. The cloud eliminates the manual processes and time associated with provisioning on-premises infrastructure, including identifying hardware specifications, negotiating price quotations, managing purchase orders, scheduling shipments, and then deploying the resources.

11.2.4.3. The capability to attribute resource costs to the individual business or product owners drives efficient usage behavior and helps reduce waste.

11.2.5. 4. Optimizing Over Time

11.2.5.1. When regularly reviewing your deployments, assess how newer services can help save you money

12. The Review Process

12.1. The review of architectures needs to be done in a consistent manner, with a blame-free approach that encourages diving deep. It should be a lightweight process (hours, not days) that is a conversation and not an audit. The purpose of reviewing an architecture is to identify any critical issues that might need addressing or areas that could be improved. The outcome of the review is a set of actions that should improve the experience of a customer using the workload.