1. About
1.1. @rantav
1.2. @outbrain
1.3. http://prettyprint.me
2. NoSQL
2.1. "Not Only SQL", not "No SQL"
2.2. SQL is Good
2.2.1. SQL DBs implement ACID
3. Problem
3.1. At Internet Scale
3.1.1. SQL gets too expensive
3.1.2. usually Hardware costs
3.1.3. The Internet changed the scale of applications & introduced the need to scale out
3.1.3.1. previous corporate apps didn't have such scale
3.2. Social Apps - not banks
3.2.1. high read/write rates
3.2.2. frequent schema changes
3.2.3. growth is usually non-linear
3.2.4. no need for the same level of ACID
3.2.4.1. e.g.
3.2.4.1.1. Twitter feed not always consistent
3.3. Usage
3.3.1. Facebook still has sharded MySQL for main storage
3.3.1.1. but also Cassandra
3.3.2. Twitter too
3.3.2.1. but also Cassandra
3.3.3. Google's main storage is BigTable
3.3.3.1. but also some sharded MySQL
4. Scaling solutions
4.1. Replication
4.1.1. Master & slaves
4.1.1.1. scales Reads
4.1.2. consistency problems, due to replica synchronization delays
4.1.2.1. similar problem in caching
4.2. Sharding
4.2.1. also scales Writes
4.2.2. makes you "sharding slaves"
4.2.2.1. resharding all the time
4.2.2.2. causes pain - lots of maintenance
4.2.3. you lose some of SQL's features
4.2.3.1. joins, sorting, grouping &c
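A minimal sketch of why resharding hurts, assuming the simplest possible placement rule (hash of the key modulo the number of shards). The function names and the `user:N` keys are hypothetical, chosen only for illustration; real deployments may use other schemes.

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a stable hash (assumed scheme)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

def moved_keys(keys, old_shards: int, new_shards: int):
    """Return the keys whose shard changes when the shard count changes."""
    return [k for k in keys
            if shard_for(k, old_shards) != shard_for(k, new_shards)]

keys = [f"user:{i}" for i in range(1000)]
moved = moved_keys(keys, 4, 5)
print(f"{len(moved)} of {len(keys)} keys move when going from 4 to 5 shards")
```

With modulo placement most keys change shard when a single shard is added, which is one reason resharding turns into constant maintenance.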
5. Brewer's CAP theorem
5.1. Eric Brewer: Berkeley professor & co-founder of Inktomi
5.2. needed to deal with
5.2.1. consistency
5.2.2. availability
5.2.3. partition tolerance
5.2.3.1. tolerating disconnection between nodes
5.3. & found that
5.3.1. you can only choose 2
5.4. as apps get larger, partition tolerance is a must
5.4.1. with many nodes, the mean time between failures is short
5.5. so you need to choose between consistency & availability
5.6. A+C (no partition tolerance)
5.6.1. Master server
5.6.1.1. MySQL
5.7. C + P
5.7.1. not always available
5.7.1.1. unavailable some of the time
5.8. A + P
5.8.1. tolerate inconsistencies
5.8.2. e.g.
5.8.2.1. BigTable
5.8.2.2. Dynamo
5.8.2.3. Cassandra
6. Consistency levels
6.1. Strong
6.2. Weak
6.3. Eventual
6.3.1. levels
6.3.1.1. Causal
6.3.1.2. Read your writes
6.3.1.3. Monotonic
6.3.2. eventually all replicas will hold the correct data
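A toy illustration of eventual consistency, assuming replicas resolve conflicts by keeping the value with the newest timestamp. That rule, the `anti_entropy` name and the sample keys are assumptions made for this sketch, not a description of any particular store's internals.

```python
from typing import Dict, List, Tuple

Replica = Dict[str, Tuple[int, str]]  # key -> (timestamp, value)

def anti_entropy(replicas: List[Replica]) -> None:
    """Bring every replica to the newest (timestamp, value) seen per key."""
    newest: Replica = {}
    for replica in replicas:
        for key, (ts, value) in replica.items():
            if key not in newest or ts > newest[key][0]:
                newest[key] = (ts, value)
    for replica in replicas:
        replica.update(newest)

replicas: List[Replica] = [
    {"feed:user42": (2, "new post")},   # already received the latest write
    {"feed:user42": (1, "old post")},   # lagging behind
    {},                                 # has not seen the key at all
]
stale = replicas[1]["feed:user42"][1]   # a read here still returns old data
anti_entropy(replicas)
converged = all(r == replicas[0] for r in replicas)
print(stale, converged)
```

Until the sync runs, a reader hitting the second replica sees stale data; afterwards all replicas agree.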
7. NoSQL examples
7.1. BigTable
7.1.1. to some extent not A
7.2. Dynamo
7.3. Cassandra
7.4. MongoDB
7.5. CouchDB
7.5.1. scaling is awful, but nice features
7.6. HBase
7.6.1. similar to Cassandra
7.6.2. much more difficult to manage
8. Cassandra
8.1. Developed at Facebook
8.2. Data model of BigTable
8.3. Network mgmt modelled on Dynamo
8.3.1. Eventual Consistency
8.4. 2 developers who worked on Amazon's Dynamo & didn't like it there
8.4.1. moved to Facebook
8.5. Implemented in Java
8.5.1. beautiful implementation
8.6. distributed
8.6.1. tens to thousands of nodes
8.6.2. useful for lots of nodes, not few
9. "Down to earth" Consistency
9.1. N/R/W
9.1.1. N
9.1.1.1. number of replicas
9.1.1.1.1. for any data item
9.1.2. W
9.1.2.1. number of nodes a write operation blocks on
9.1.3. R
9.1.3.1. number of nodes a read operation blocks on
9.2. typical values
9.2.1. Writes
9.2.1.1. W=1
9.2.1.1.1. block until 1st node written successfully
9.2.1.2. W=N
9.2.1.2.1. blocks until all nodes written successfully
9.2.1.3. W=0
9.2.1.3.1. async writes
9.2.2. Reads
9.2.2.1. R=1
9.2.2.1.1. blocks until the 1st node returns an answer
9.2.2.2. R=N
9.2.2.2.1. block until all nodes return an answer
9.2.2.3. R=0
9.2.2.3.1. doesn't make sense
9.2.3. Quorum
9.2.3.1. R=N/2+1
9.2.3.2. W=N/2+1
9.2.3.2.1. you write to all replicas but wait for just W acks
9.2.3.3. Fully consistent, since R + W > N
9.3. N defined on server
9.4. R/W can be set differently per column family, API call, client &c
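The N/R/W rule above can be checked with a few lines. This is a sketch with hypothetical helper names: `quorum` computes N/2+1 with integer division, `overlap_guaranteed` applies the R + W > N condition, and `always_overlap` brute-forces the same claim by testing every possible set of R read nodes against every set of W write nodes.

```python
from itertools import combinations

def quorum(n: int) -> int:
    """Smallest majority of n replicas: N/2 + 1 with integer division."""
    return n // 2 + 1

def overlap_guaranteed(n: int, r: int, w: int) -> bool:
    """True when any R nodes read must include one of the W nodes written."""
    return r + w > n

def always_overlap(n: int, r: int, w: int) -> bool:
    """Brute-force check: every read set intersects every write set."""
    nodes = range(n)
    return all(set(rs) & set(ws)
               for rs in combinations(nodes, r)
               for ws in combinations(nodes, w))

n = 3
for r, w in [(1, 1), (1, n), (quorum(n), quorum(n))]:
    print(f"N={n} R={r} W={w} overlap guaranteed: {overlap_guaranteed(n, r, w)}")
```

For N=3, R=W=1 gives no guarantee, while R=1/W=N and R=W=quorum both do. The first favors fast reads, the second spreads the cost across reads and writes.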
10. Data model
10.1. Forget SQL
10.1.1. or any query language
10.1.1.1. you can integrate with text search
10.1.1.1.1. Lucandra
10.1.1.1.2. you search by text, get an id & then continue
10.1.1.1.3. in BigTable, they added indices as secondary keys
10.1.2. or any grouping/aggregation
10.2. Column-Based
10.2.1. modelled after BigTable
10.2.2. difference from Key-Value
10.2.2.1. you can get/set just some columns
10.3. scales really well
10.4. denormalization is a must
10.4.1. client responsibility to write all
10.4.1.1. not transactional
10.4.2. works well with their disk usage
10.5. Vocabulary
10.5.1. Keyspace
10.5.1.1. like namespaces for unique keys
10.5.1.2. schema
10.5.2. Column Family
10.5.2.1. very much like a Table..
10.5.2.1.1. rows & columns
10.5.2.1.2. but not quite
10.5.3. Key
10.5.3.1. a key that represents a row (of columns)
10.5.3.1.1. search is always by key
10.5.4. Column
10.5.4.1. represents a value with
10.5.4.1.1. Column name
10.5.4.1.2. Value
10.5.4.1.3. Timestamp
10.5.5. Super Column
10.5.5.1. column that holds list of columns inside
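One way to picture the vocabulary above is as nested maps. The sketch below is only a mental model built from plain Python dicts; the `Users` column family, the row keys and the sample values are all hypothetical.

```python
from typing import Dict, NamedTuple

class Column(NamedTuple):
    name: str
    value: str
    timestamp: int

# Keyspace -> column family -> row key -> column name -> Column
keyspace: Dict[str, Dict[str, Dict[str, Column]]] = {
    "Users": {
        "user42": {
            "email": Column("email", "ann@example.com", 1),
            "city": Column("city", "Springfield", 1),
        }
    }
}

# "Like a table, but not quite": rows need not share the same columns.
keyspace["Users"]["user43"] = {"email": Column("email", "bob@example.com", 2)}

# Lookup is always by row key; single columns can then be read or set.
row = keyspace["Users"]["user42"]
row["city"] = Column("city", "Shelbyville", 2)
print(row["city"].value)

# A super column adds one more level: a named group of columns.
super_row = {"address": {"street": Column("street", "1 Main St", 1)}}
print(super_row["address"]["street"].value)
```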
11. API
11.1. programmatic not declarative
11.2. supports many languages
11.3. API
11.3.1. get
11.3.2. get_slice
11.3.2.1. some columns
11.3.3. multiget
11.3.3.1. saves round-trip
11.3.4. multiget_slice
11.3.5. get_count
11.3.6. get_range_slice
11.3.6.1. slice over rows & columns
11.3.7. get_range_slices
11.3.8. insert
11.3.9. remove
11.3.10. batch_insert
11.3.11. batch_mutate
11.4. also meta-api
11.4.1. e.g.
11.4.1.1. describe ring peers
11.5. not convenient
11.5.1. to say the least
11.6. defined in Thrift
11.6.1. language/compiler developed at Facebook
11.6.2. takes an IDL (.thrift) file & generates implementations in different languages
11.6.2.1. even Erlang
11.7. Map/Reduce also supported, in an extension, not part of the API
11.7.1. using Hadoop
11.7.2. which takes the data from Cassandra
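To give a feel for a programmatic, key-based API, here is a toy in-memory stand-in for a single column family. The method names echo a few of the calls listed above, but the signatures are simplified assumptions for illustration and are not the real Thrift-generated client signatures.

```python
from typing import Dict, List, Optional, Tuple

class ToyColumnFamily:
    """In-memory stand-in for one column family; not a real client."""

    def __init__(self) -> None:
        # row key -> column name -> (value, timestamp)
        self._rows: Dict[str, Dict[str, Tuple[str, int]]] = {}

    def insert(self, key: str, column: str, value: str, timestamp: int) -> None:
        self._rows.setdefault(key, {})[column] = (value, timestamp)

    def get(self, key: str, column: str) -> Optional[str]:
        cell = self._rows.get(key, {}).get(column)
        return cell[0] if cell is not None else None

    def get_slice(self, key: str, start: str, finish: str) -> List[Tuple[str, str]]:
        """Columns of one row with names in [start, finish], sorted by name."""
        row = self._rows.get(key, {})
        return [(name, row[name][0])
                for name in sorted(row) if start <= name <= finish]

    def get_count(self, key: str) -> int:
        return len(self._rows.get(key, {}))

    def remove(self, key: str, column: str) -> None:
        self._rows.get(key, {}).pop(column, None)

cf = ToyColumnFamily()
cf.insert("user42", "email", "ann@example.com", timestamp=1)
cf.insert("user42", "name", "Ann", timestamp=1)
cf.insert("user42", "zip", "12345", timestamp=1)
print(cf.get("user42", "email"))
print(cf.get_slice("user42", "e", "o"))
```

Every call starts from a row key and there is no query language, which is what "programmatic not declarative" means in practice.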
12. Rule of thumb
12.1. If Key-Value works, use it
12.2. Else, if Column-based works, use it
12.3. Else, if Document-oriented works, use it
12.4. Else, use SQL
13. Your main consideration
13.1. How much Data will you have
13.2. Unfortunately, it's hard to know in advance..
14. SQL over NoSQL
14.1. e.g.
14.1.1. Hive
14.2. also SQL vendors offering scaling similar to NoSQL
14.2.1. e.g.
14.2.1.1. Sybase IQ
14.2.1.2. and others
15. Eventually data-store solutions will merge
15.1. RDBMS will offer NoSQL features
15.1.1. to enable scale-out
15.2. NoSQL will offer RDBMS features
15.2.1. to enable richer functionality