1. Key Concepts
1.1. Horizontal vs. Vertical Scaling
1.1.1. Vertical
1.1.1.1. increasing the resources of a specific node
1.1.2. Horizontal
1.1.2.1. increasing number of nodes
1.2. Load Balancer
1.2.1. software
1.2.1.1. application
1.2.1.2. DNS based
1.2.2. hardware
1.2.2.1. routers
1.3. Database Denormalization and NoSQL
1.3.1. to avoid joins
1.4. Database Partitioning (Sharding)
1.4.1. Vertical Partitioning
1.4.1.1. by feature
1.4.1.1.1. if one of these databases grows very large, it may need to be repartitioned
1.4.2. Key-Based (or Hash-Based) Partitioning
1.4.2.1. the shard is typically hash(key) mod N, so adding servers changes N and forces most of the data to be reallocated
1.4.3. Directory-Based Partitioning
1.4.3.1. lookup table can be a single point of failure
1.4.3.1.1. consider ranges and duplicating the table
1.4.3.2. accessing the table impacts performance
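The reallocation cost of key-based partitioning can be illustrated with a minimal sketch. It assumes the shard is chosen as hash(key) mod N; the helper names (shard_for, moved_fraction) and the "user:" keys are hypothetical and only serve the illustration.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index with a stable hash (MD5) mod N."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def moved_fraction(keys, old_n: int, new_n: int) -> float:
    """Fraction of keys whose shard changes when N goes from old_n to new_n."""
    moved = sum(1 for k in keys if shard_for(k, old_n) != shard_for(k, new_n))
    return moved / len(keys)


# Hypothetical key set: 10,000 user ids.
keys = [f"user:{i}" for i in range(10_000)]
fraction = moved_fraction(keys, old_n=4, new_n=5)
print(f"{fraction:.0%} of keys change shard when growing from 4 to 5 servers")
```

With this naive scheme, only keys whose hash gives the same remainder for both server counts stay in place, which is why most of the data moves.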
1.5. Caching
1.5.1. in-memory
1.5.1.1. between your app layer and your data store
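A minimal sketch of an in-memory cache sitting between the app layer and the data store. The class name CachedStore and the use of a plain dict as the "data store" are assumptions made for illustration; a real deployment would also need eviction and expiry.

```python
class CachedStore:
    """Serve reads from memory when possible; fall back to the data store."""

    def __init__(self, backing_store: dict):
        self.backing_store = backing_store  # stand-in for a database
        self.cache = {}                     # in-memory key-value cache
        self.store_reads = 0                # counts trips to the data store

    def get(self, key):
        if key in self.cache:               # cache hit: no data store access
            return self.cache[key]
        self.store_reads += 1               # cache miss: read and remember
        value = self.backing_store[key]
        self.cache[key] = value
        return value

    def put(self, key, value):
        self.backing_store[key] = value
        self.cache.pop(key, None)           # invalidate the stale cached entry


store = CachedStore({"profile:1": "Ada"})
store.get("profile:1")  # miss: reads the data store
store.get("profile:1")  # hit: served from memory
print(store.store_reads)
```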
1.6. Asynchronous Processing & Queues
1.6.1. pre-process slow operations in advance where possible
1.6.1.1. otherwise queue the work and notify the user when the process is done
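A minimal sketch of queue-based asynchronous processing with a completion notification, using Python's standard queue and threading modules. The squaring "job" and the record callback are placeholders for real slow work and a real notification channel.

```python
import queue
import threading


def worker(jobs: queue.Queue, on_done) -> None:
    """Pull jobs off the queue, process them, and notify on completion."""
    while True:
        job = jobs.get()
        if job is None:            # sentinel: no more work
            jobs.task_done()
            break
        result = job * job         # stand-in for a slow operation
        on_done(job, result)       # notify that this job is finished
        jobs.task_done()


results = {}


def record(job, result) -> None:
    """Hypothetical notification callback: just stores the result."""
    results[job] = result


jobs = queue.Queue()
thread = threading.Thread(target=worker, args=(jobs, record))
thread.start()

for n in (2, 3, 4):
    jobs.put(n)                    # the caller returns immediately
jobs.put(None)

jobs.join()                        # wait until every queued item is handled
thread.join()
print(results)
```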
1.7. Networking Metrics
1.7.1. Bandwidth (bit/s, Gbits/s)
1.7.1.1. Max amount of data that can be transferred in a unit of time
1.7.1.1.1. Under ideal conditions
1.7.2. Throughput
1.7.2.1. Actual amount of data that is transferred in a unit of time
1.7.2.1.1. Under real conditions
1.7.3. Latency
1.7.3.1. How long it takes data to go from one end to the other
1.7.3.2. Delay between the sender and receiver
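A rough worked example of how the three metrics combine. It assumes a deliberately simplified model in which transfer time is latency plus size divided by rate; the 1 GB payload, 1 Gbit/s bandwidth, 0.6 Gbit/s throughput, and 50 ms latency are made-up figures.

```python
def transfer_time_seconds(size_bits: float, rate_bits_per_s: float,
                          latency_s: float) -> float:
    """Simplified estimate: one-way delay plus time to push the bits through."""
    return latency_s + size_bits / rate_bits_per_s


size_bits = 8 * 10**9          # 1 GB payload expressed in bits
latency_s = 0.05               # assumed 50 ms delay between sender and receiver

# Bandwidth is the best case; throughput is what is actually achieved.
ideal = transfer_time_seconds(size_bits, 1.0e9, latency_s)
actual = transfer_time_seconds(size_bits, 0.6e9, latency_s)

print(f"at full bandwidth: {ideal:.2f} s, at real throughput: {actual:.2f} s")
```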
1.8. MapReduce
1.8.1. Map
1.8.1.1. Takes in some data and emits a <key,value> pair
1.8.2. Reduce
1.8.2.1. Takes a key and a set of associated values and "reduces" them, emitting a new key and value
1.8.2.1.1. The result can be reduced again
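The Map and Reduce steps above can be sketched on a single machine with a word count. The function names and the explicit shuffle step (grouping values by key, which a MapReduce framework normally does between the two phases) are illustrative assumptions, not a real framework's API.

```python
from collections import defaultdict


def map_words(document: str):
    """Map: take some data and emit <key, value> pairs, here (word, 1)."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as a framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    """Reduce: take a key and its values and emit a new key and value."""
    return key, sum(values)


def word_count(documents):
    pairs = [pair for doc in documents for pair in map_words(doc)]
    return dict(reduce_counts(k, vs) for k, vs in shuffle(pairs).items())


counts = word_count(["the cat sat", "the dog sat"])
print(counts)
```

Because reduce_counts emits the same (word, count) shape it consumes, partial counts from different machines could be fed through it again, which is the sense in which the result can be reduced again.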
2. Considerations (FARS-RW)
2.1. Failures
2.1.1. Plan for many or all of them
2.2. Availability
2.2.1. Uptime / Total time
2.2.1.1. operational availability: the percentage of time the system is operational
2.3. Reliability
2.3.1. Mean Time Between Failure (MTBF) = total time in service / number of failures
2.3.2. Failure Rate = number of failures / total time in service
2.3.3. a reliable machine has high availability but an available machine may or may not be very reliable
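A short worked example of the availability and reliability formulas above. The figures (1000 hours in service, 999 hours up, 4 failures) are invented for illustration.

```python
def availability(uptime_hours: float, total_hours: float) -> float:
    """Availability = uptime / total time."""
    return uptime_hours / total_hours


def mtbf(total_hours_in_service: float, failures: int) -> float:
    """Mean Time Between Failure = total time in service / number of failures."""
    return total_hours_in_service / failures


def failure_rate(total_hours_in_service: float, failures: int) -> float:
    """Failure rate = number of failures / total time in service."""
    return failures / total_hours_in_service


total_hours = 1000.0
uptime_hours = 999.0
failures = 4

print(f"availability: {availability(uptime_hours, total_hours):.1%}")
print(f"MTBF: {mtbf(total_hours, failures):.0f} hours")
print(f"failure rate: {failure_rate(total_hours, failures):.3f} per hour")
```

In this made-up case the system is up 99.9% of the time yet fails every 250 hours on average, which matches the point that an available machine may or may not be very reliable.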
2.4. Security
2.5. Read-heavy vs. Write-heavy
2.5.1. caching vs queuing up the writes
3. Demonstrate you can analyse the problems
4. Handling the Questions
4.1. Communicate
4.2. Go broad first
4.2.1. consider
4.2.1.1. easy to use
4.2.1.1.1. clients
4.2.1.1.2. owners
4.2.1.2. flexibility
4.2.1.3. scalability and efficiency
4.3. Use the whiteboard
4.3.1. help interviewer follow your proposed design
4.4. Acknowledge interviewer concerns
4.4.1. validate them
4.4.1.1. make changes accordingly
4.5. Be careful about assumptions
4.5.1. State them explicitly
4.6. Estimate when necessary
4.7. Drive
4.7.1. ask questions
4.7.1.1. be open about tradeoffs
4.7.1.1.1. go deeper
5. Design
5.1. Systems (SMraDIR)
5.1.1. Scope the Problem
5.1.1.1. what exactly you need to implement
5.1.1.1.1. list major features
5.1.2. Make Reasonable Assumptions
5.1.2.1. This is where your "product sense" is evaluated
5.1.3. Draw the Major Components
5.1.3.1. What the system might look like
5.1.3.2. Walk through from end-to-end to provide a flow
5.1.3.2.1. Ignore major scalability challenges for the moment
5.1.4. Identify the Key Issues
5.1.4.1. What will be the bottlenecks or major challenges?
5.1.5. Redesign for the Key Issues
5.1.5.1. Be open about limitations in your design
5.2. Algorithms (AMaGS)
5.2.1. Ask Questions
5.2.2. Make Believe
5.2.2.1. pretend that data can fit on one machine
5.2.2.2. provide the general outline for your solution
5.2.3. Get Real
5.2.3.1. How much data can fit on one machine?
5.2.3.2. What problems will occur when you split up the data?
5.2.4. Solve problems
5.2.4.1. a solution can remove or mitigate an issue
5.2.4.2. Take an iterative approach
5.2.4.2.1. Solve problems as they emerge