Intro to NoSQL & Cassandra, Ran Tavori

1. About

1.1. @rantav

1.2. @outbrain


2. NoSQL

2.1. "Not Only SQL", not "No SQL"

2.2. SQL is Good

2.2.1. SQL databases implement ACID

3. Problem

3.1. At Internet scale

3.1.1. SQL gets too expensive

3.1.2. usually because of hardware costs

3.1.3. The Internet changed the scale of applications & introduced needs that earlier corporate apps, which never reached such scale, didn't have

3.2. Social Apps - not banks

3.2.1. high read/write load

3.2.2. frequent schema changes

3.2.3. growth is usually non-linear

3.2.4. no need for the same level of ACID, e.g. a Twitter feed is not always consistent

3.3. Usage

3.3.1. Facebook still uses sharded MySQL for main storage, but also Cassandra

3.3.2. Twitter too: sharded MySQL, but also Cassandra

3.3.3. Google's main storage is BigTable, but it also uses some sharded MySQL

4. Scaling solutions

4.1. Replication

4.1.1. Master & slaves: scales reads

4.1.2. consistency problems, due to replica synchronization lag; caching has a similar problem

4.2. Sharding

4.2.1. also scales writes

4.2.2. turns you into "sharding slaves": resharding is a constant pain, with lots of maintenance

4.2.3. you lose some of SQL's features: joins, sorting, grouping &c
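
To make the resharding pain concrete, here is a minimal sketch of hash-based shard routing. It assumes plain modulo placement (`hash(key) % num_shards`), which the talk does not specify; the function names and the `user:<n>` keys are hypothetical. It counts how many keys land on a different shard when one shard is added.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a stable hash and modulo placement."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def moved_fraction(keys, old_shards: int, new_shards: int) -> float:
    """Fraction of keys whose shard changes when the shard count changes."""
    moved = sum(
        1 for k in keys if shard_for(k, old_shards) != shard_for(k, new_shards)
    )
    return moved / len(keys)


# Hypothetical keys, only for illustration.
keys = [f"user:{i}" for i in range(10_000)]
fraction = moved_fraction(keys, old_shards=4, new_shards=5)
print(f"{fraction:.0%} of keys move when going from 4 to 5 shards")
```

With modulo placement most keys change shards on every resize, so each resize means a large data migration. Consistent hashing, used by Dynamo-style systems, is one common way to reduce that movement.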

5. Brewer's CAP theorem

5.1. Eric Brewer: Berkeley professor & co-founder of Inktomi

5.2. needed to deal with

5.2.1. consistency

5.2.2. availability

5.2.3. partition tolerance: tolerating disconnection between nodes

5.3. & found that

5.3.1. you can only choose 2

5.4. as apps get larger, partition tolerance is a must

5.4.1. with many nodes, the mean time between failures gets short

5.5. so you need to choose between consistency & availability

5.6. A + C (no partition tolerance)

5.6.1. e.g. single-master MySQL

5.7. C + P

5.7.1. may be unavailable some of the time

5.8. A + P

5.8.1. tolerate inconsistencies

5.8.2. e.g. BigTable, Dynamo, Cassandra

6. Consistency levels

6.1. Strong

6.2. Weak

6.3. Eventual

6.3.1. levels: causal, read-your-writes, monotonic

6.3.2. eventually all replicas will hold the correct data

7. NoSQL examples

7.1. BigTable

7.1.1. to some extent not A

7.2. Dynamo

7.3. Cassandra

7.4. MongoDB

7.5. CouchDB

7.5.1. scaling is awful, but it has nice features

7.6. HBase

7.6.1. similar to Cassandra

7.6.2. much more difficult to manage

8. Cassandra

8.1. Developed at Facebook

8.2. Data model of BigTable

8.3. Network mgmt modelled on Dynamo

8.3.1. Eventual Consistency

8.4. 2 developers who had worked on Amazon's Dynamo & didn't like it there

8.4.1. moved to Facebook

8.5. Implemented in Java

8.5.1. beautiful implementation

8.6. distributed

8.6.1. tens to thousands of nodes

8.6.2. useful for lots of nodes, not few

9. "Down to earth" Consistency

9.1. N/R/W

9.1.1. N number of replicas for any data item

9.1.2. W: number of nodes a write operation blocks on

9.1.3. R: number of nodes a read operation blocks on

9.2. typical values

9.2.1. Writes: W=1 blocks until the 1st node is written successfully; W=N blocks until all nodes are written successfully; W=0 means async writes

9.2.2. Reads: R=1 blocks until the 1st node returns an answer; R=N blocks until all nodes return an answer; R=0 doesn't make sense

9.2.3. Quorum: R=N/2+1, W=N/2+1; you write to all replicas but wait for just W acks; fully consistent

9.3. N defined on server

9.4. different R/W setups on different column-families, API calls, clients &c
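
The N/R/W settings above can be checked with a little arithmetic. The sketch below assumes the usual rule from Dynamo-style systems that a read is guaranteed to see the latest acknowledged write when R + W > N, because the read set and the write set must then share at least one replica. The outline only says that quorum is fully consistent, so the rule and the function names here are additions for illustration.

```python
def quorum(n: int) -> int:
    """Smallest majority of n replicas: N/2 + 1, using integer division."""
    return n // 2 + 1


def reads_see_latest_write(n: int, r: int, w: int) -> bool:
    """True when any R replicas must overlap any W replicas (R + W > N)."""
    return r + w > n


n = 3
q = quorum(n)
for r, w in [(1, 1), (q, q), (1, n), (n, 1)]:
    verdict = "consistent" if reads_see_latest_write(n, r, w) else "eventual"
    print(f"N={n} R={r} W={w}: {verdict}")
```

For N=3 the quorum is 2, so R=2 and W=2 overlap in at least one replica, while R=1 and W=1 may hit different replicas and return stale data.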

10. Data model

10.1. Forget SQL

10.1.1. or any query language; you can integrate with text search (e.g. Lucandra): you search by text, get an id & then continue from there; in BigTable, they added indices, as secondary keys

10.1.2. or any grouping/aggregation

10.2. Column-Based

10.2.1. modelled after BigTable

10.2.2. difference from Key-Value: you can get/set just some of the columns

10.3. scales really well

10.4. denormalization is a must

10.4.1. it is the client's responsibility to write every copy; writes are not transactional

10.4.2. fits well with how Cassandra uses the disk

10.5. Vocabulary

10.5.1. Keyspace: like a namespace for unique keys; the schema

10.5.2. Column Family: very much like a table, with rows & columns, but not quite

10.5.3. Key: represents a row (of columns); search is always by key

10.5.4. Column: represents a value, with a column name, a value & a timestamp

10.5.5. Super Column: a column that holds a list of columns inside
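
The vocabulary above maps naturally onto nested dictionaries. The following is a toy in-memory sketch, not Cassandra code: the `Users` column family, the `alice` row key and both helper functions are hypothetical. It also assumes that the column timestamp is what decides which version of a column wins (the newest one), which the outline does not spell out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str
    value: str
    timestamp: int  # supplied by the writer; the newest version wins


def write_column(keyspace, column_family, row_key, column):
    """Store a column under keyspace -> column family -> row key -> name."""
    row = keyspace.setdefault(column_family, {}).setdefault(row_key, {})
    current = row.get(column.name)
    # Keep the write only if it is newer than what is already stored.
    if current is None or column.timestamp > current.timestamp:
        row[column.name] = column


def read_columns(keyspace, column_family, row_key, names):
    """Look up a row by key and return just the requested columns."""
    row = keyspace.get(column_family, {}).get(row_key, {})
    return {n: row[n].value for n in names if n in row}


app = {}  # one keyspace
write_column(app, "Users", "alice", Column("email", "alice@example.com", 1))
write_column(app, "Users", "alice", Column("city", "Haifa", 1))
write_column(app, "Users", "alice", Column("city", "Tel Aviv", 3))
write_column(app, "Users", "alice", Column("city", "Eilat", 2))  # stale, ignored
print(read_columns(app, "Users", "alice", ["city"]))  # {'city': 'Tel Aviv'}
```

Reading only `city` without touching `email` is the difference from a plain key-value store noted earlier. A super column would add one more level of nesting: a named entry in the row whose value is itself a dictionary of columns.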

11. API

11.1. programmatic not declarative

11.2. supports many languages

11.3. API

11.3.1. get

11.3.2. get_slice: some columns of a row

11.3.3. multiget: several rows in one call, saves round-trips

11.3.4. multiget_slice

11.3.5. get_count

11.3.6. get_range_slice: slice over rows & columns

11.3.7. get_range_slices

11.3.8. insert

11.3.9. remove

11.3.10. batch_insert

11.3.11. batch_mutate

11.4. also meta-api

11.4.1. e.g. describing the ring & its peers

11.5. not convenient

11.5.1. to say the least

11.6. written in Thrift

11.6.1. language/compiler developed at Facebook

11.6.2. takes an IDL file & generates implementations in different languages, even Erlang

11.7. Map/Reduce is also supported, as an extension, not as part of the API

11.7.1. using Hadoop

11.7.2. which takes the data from Cassandra
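
To show what "programmatic, not declarative" looks like in practice, here is a toy in-memory stand-in for a single column family. It is not the Thrift client: the class name is hypothetical and the method signatures are heavily simplified (the real calls also take parameters such as column paths and consistency levels). Only the method names follow the calls listed above.

```python
class ToyColumnFamily:
    """In-memory stand-in for one column family: row key -> {column: value}."""

    def __init__(self):
        self._rows = {}

    def insert(self, key, column, value):
        """Set a single column in a row, creating the row if needed."""
        self._rows.setdefault(key, {})[column] = value

    def get(self, key, column):
        """Return a single column value; raises KeyError if it is missing."""
        return self._rows[key][column]

    def get_slice(self, key, columns):
        """Return only the requested columns of one row."""
        row = self._rows.get(key, {})
        return {c: row[c] for c in columns if c in row}

    def multiget_slice(self, keys, columns):
        """Multi-row variant of get_slice: one call instead of one per row."""
        return {k: self.get_slice(k, columns) for k in keys}

    def get_count(self, key):
        """Number of columns stored in a row."""
        return len(self._rows.get(key, {}))

    def remove(self, key, column):
        """Delete one column from a row, if present."""
        self._rows.get(key, {}).pop(column, None)


users = ToyColumnFamily()
users.insert("alice", "email", "alice@example.com")
users.insert("alice", "city", "Haifa")
users.insert("bob", "city", "Eilat")
print(users.get_slice("alice", ["city"]))               # {'city': 'Haifa'}
print(users.multiget_slice(["alice", "bob"], ["city"]))
users.remove("alice", "email")
print(users.get_count("alice"))                         # 1
```

Every access starts from a row key and names the columns it wants. There is no query string to parse, which is why joins, grouping and ad hoc filters have to be done by the client.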

12. How you usually choose

12.1. If Key-Value works, use it

12.2. Else, if Column-based works, use it

12.3. Else, if Document-oriented works, use it

12.4. Else, use SQL

13. Your main consideration

13.1. How much data will you have?

13.2. Unfortunately, it's hard to know in advance

14. SQL over NoSQL

14.1. e.g.

14.1.1. Hive

14.2. also SQL vendors offering scaling similar to NoSQL

14.2.1. e.g., Sybase IQ

15. Eventually data-store solutions will merge

15.1. RDBMS will offer NoSQL features

15.1.1. to enable scale-out

15.2. NoSQL will offer RDBMS features

15.2.1. to enable richer functionality