Intro to NoSQL & Cassandra, Ran Tavori


1. About

1.1. @rantav

1.2. @outbrain

1.3. http://prettyprint.me

2. NoSQL

2.1. "Not Only SQL", not "No SQL"

2.2. SQL is Good

2.2.1. SQL DBs implement ACID

3. Problem

3.1. At Internet scale

3.1.1. SQL gets too expensive

3.1.2. usually because of hardware costs

3.1.3. Internet changed the scale of applications & introduced the need

3.1.3.1. previous corporate apps didn't have such scale

3.2. Social Apps - not banks

3.2.1. high read/write rates

3.2.2. frequent schema changes

3.2.3. growth is usually non-linear

3.2.4. no need for the same level of ACID

3.2.4.1. e.g.

3.2.4.1.1. Twitter feed not always consistent

3.3. Usage

3.3.1. Facebook still has sharded MySQL for main storage

3.3.1.1. but also Cassandra

3.3.2. Twitter too

3.3.2.1. but also Cassandra

3.3.3. Google's main storage is BigTable

3.3.3.1. but also some sharded MySQL

4. Scaling solutions

4.1. Replication

4.1.1. Master & slaves

4.1.1.1. scales Reads

4.1.2. consistency problems, due to lag in replica synchronization

4.1.2.1. similar problem in caching

4.2. Sharding

4.2.1. scales also Writes

4.2.2. makes you "sharding slaves"

4.2.2.1. all the time resharding

4.2.2.2. causes pain - lots of maintenance

4.2.3. you lose some of SQL's features

4.2.3.1. joins, sorting, grouping &c
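
The resharding pain noted above can be illustrated with a minimal sketch. It assumes the simplest possible scheme, hashing each key modulo the number of shards; the function names and the `user:` key format are hypothetical and not from the talk.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard with a hash that is stable across processes."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def moved_fraction(keys, old_shards: int, new_shards: int) -> float:
    """Fraction of keys that land on a different shard after resizing."""
    moved = sum(
        1 for k in keys
        if shard_for(k, old_shards) != shard_for(k, new_shards)
    )
    return moved / len(keys)


# Hypothetical keys, for illustration only.
keys = [f"user:{i}" for i in range(10_000)]

# Adding a single shard relocates most keys under modulo hashing,
# which is one reason resharding is so much maintenance work.
print(moved_fraction(keys, 4, 5))
```

With this naive scheme, growing from 4 to 5 shards moves the large majority of keys, so every resize means a bulk data migration.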

5. Brewer's CAP theorem

5.1. Berkeley professor & co-founder of Inktomi

5.2. needed to deal with

5.2.1. consistency

5.2.2. availability

5.2.3. partition tolerance

5.2.3.1. tolerating disconnection between nodes

5.3. & found that

5.3.1. you can only choose 2

5.4. as apps get larger, partition tolerance is a must

5.4.1. with many nodes, the mean time between failures is short

5.5. so you need to choose between consistency & availability

5.6. A+C (no partition tolerance)

5.6.1. Master server

5.6.1.1. MySQL

5.7. C + P

5.7.1. not available

5.7.1.1. unavailable some of the time

5.8. A + P

5.8.1. tolerate inconsistencies

5.8.2. e.g.

5.8.2.1. BigTable

5.8.2.2. Dynamo

5.8.2.3. Cassandra

6. Consistency levels

6.1. Strong

6.2. Weak

6.3. Eventual

6.3.1. levels

6.3.1.1. Causal

6.3.1.2. Read your writes

6.3.1.3. Monotonic

6.3.2. eventually all replicas will hold the correct data

7. NoSQL examples

7.1. BigTable

7.1.1. to some extent not A

7.2. Dynamo

7.3. Cassandra

7.4. MongoDB

7.5. CouchDB

7.5.1. scaling is awful, but nice features

7.6. HBase

7.6.1. similar to Cassandra

7.6.2. much more difficult to manage

8. Cassandra

8.1. Developed at Facebook

8.2. Data model of BigTable

8.3. Network mgmt modelled on Dynamo

8.3.1. Eventual Consistency

8.4. 2 developers who worked on Amazon's Dynamo, & didn't like it there

8.4.1. moved to Facebook

8.5. Implemented in Java

8.5.1. beautiful implementation

8.6. distributed

8.6.1. tens to thousands of nodes

8.6.2. useful for lots of nodes, not few

9. "Down to earth" Consistency

9.1. N/R/W

9.1.1. N

9.1.1.1. number of replicas

9.1.1.1.1. for any data item

9.1.2. W

9.1.2.1. number of nodes a write operation blocks on

9.1.3. R

9.1.3.1. number of nodes a read operation blocks on

9.2. typical values

9.2.1. Writes

9.2.1.1. W=1

9.2.1.1.1. blocks until the 1st node is written successfully

9.2.1.2. W=N

9.2.1.2.1. blocks until all nodes are written successfully

9.2.1.3. W=0

9.2.1.3.1. async writes

9.2.2. Reads

9.2.2.1. R=1

9.2.2.1.1. blocks until the 1st node returns an answer

9.2.2.2. R=N

9.2.2.2.1. blocks until all nodes return an answer

9.2.2.3. R=0

9.2.2.3.1. doesn't make sense

9.2.3. Quorum

9.2.3.1. R=N/2+1

9.2.3.2. W=N/2+1

9.2.3.2.1. you write to all replicas but wait for only W acks

9.2.3.3. Fully consistent, since R + W > N

9.3. N defined on server

9.4. different R/W setups on different column-families, API calls, clients &c
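
The N/R/W arithmetic above can be checked with a short sketch. The function names are hypothetical. The rule it tests, that a read is guaranteed to touch at least one replica holding the latest acknowledged write when R + W > N, is the standard quorum-overlap argument; the brute-force check is only there to confirm it for small N.

```python
from itertools import combinations


def quorum(n: int) -> int:
    """Smallest majority of n replicas: N/2 + 1 with integer division."""
    return n // 2 + 1


def reads_see_latest_write(n: int, r: int, w: int) -> bool:
    """True when every read set must intersect every write set."""
    return r + w > n


def always_overlap(n: int, r: int, w: int) -> bool:
    """Brute force: try every choice of r read nodes and w write nodes."""
    nodes = range(n)
    return all(
        set(read_set) & set(write_set)
        for read_set in combinations(nodes, r)
        for write_set in combinations(nodes, w)
    )


n = 3
for r, w in [(1, 1), (1, n), (n, 1), (quorum(n), quorum(n))]:
    print(f"N={n} R={r} W={w} "
          f"consistent={reads_see_latest_write(n, r, w)}")
```

R=1, W=1 gives fast operations but allows stale reads. W=N or R=N buys consistency at the cost of blocking on every replica. Quorum on both sides is the middle ground.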

10. Data model

10.1. Forget SQL

10.1.1. or any query language

10.1.1.1. you can integrate with text search

10.1.1.1.1. Lucandra

10.1.1.1.2. you search by text, get an id & then continue

10.1.1.1.3. in BigTable, they added indices, as secondary key

10.1.2. or any grouping/aggregation

10.2. Column-Based

10.2.1. modelled after BigTable

10.2.2. difference from Key-Value

10.2.2.1. you can get/set just some columns

10.3. scales really well

10.4. denormalization is a must

10.4.1. client responsibility to write all

10.4.1.1. not transactional

10.4.2. works well with their disk usage

10.5. Vocabulary

10.5.1. Keyspace

10.5.1.1. like namespaces for unique keys

10.5.1.2. schema

10.5.2. Column Family

10.5.2.1. very much like a Table..

10.5.2.1.1. rows & columns

10.5.2.1.2. but not quite

10.5.3. Key

10.5.3.1. a key that represents a row (of columns)

10.5.3.1.1. search is always by key

10.5.4. Column

10.5.4.1. represents a value with

10.5.4.1.1. Column name

10.5.4.1.2. Value

10.5.4.1.3. Timestamp

10.5.5. Super Column

10.5.5.1. column that holds list of columns inside
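
The vocabulary above maps naturally onto nested dictionaries. The following is a rough in-memory sketch, not Cassandra's storage format: the `Users` column family, the `user1` key, and the helper names are hypothetical, and resolving conflicting writes by keeping the newest timestamp is an assumption of this sketch.

```python
import time
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Column:
    name: str
    value: str
    timestamp: int  # supplied by the client with each write


# keyspace -> column family -> row key -> column name -> Column
keyspace: Dict[str, Dict[str, Dict[str, Column]]] = {"Users": {}}


def insert(cf: str, key: str, name: str, value: str,
           timestamp: Optional[int] = None) -> None:
    """Write one column; an older timestamp never overwrites a newer one."""
    ts = time.time_ns() if timestamp is None else timestamp
    row = keyspace[cf].setdefault(key, {})
    current = row.get(name)
    if current is None or ts >= current.timestamp:
        row[name] = Column(name, value, ts)


def get_columns(cf: str, key: str, names) -> Dict[str, str]:
    """Fetch only the requested columns of a row (unlike plain key-value)."""
    row = keyspace[cf].get(key, {})
    return {n: row[n].value for n in names if n in row}


insert("Users", "user1", "email", "old@example.com", timestamp=1)
insert("Users", "user1", "email", "new@example.com", timestamp=2)
insert("Users", "user1", "name", "Example User", timestamp=2)
print(get_columns("Users", "user1", ["email"]))
```

Rows in the same column family need not share the same set of columns, which is the "but not quite" in the comparison to a table.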

11. API

11.1. programmatic not declarative

11.2. supports many languages

11.3. API

11.3.1. get

11.3.2. get_slice

11.3.2.1. some columns

11.3.3. multiget

11.3.3.1. saves round-trips

11.3.4. multiget_slice

11.3.5. get_count

11.3.6. get_range_slice

11.3.6.1. slice over rows & columns

11.3.7. get_range_slices

11.3.8. insert

11.3.9. remove

11.3.10. batch_insert

11.3.11. batch_mutate

11.4. also meta-api

11.4.1. e.g.

11.4.1.1. describe ring peers

11.5. not convenient

11.5.1. to say the least

11.6. written in Thrift

11.6.1. language/compiler developed at Facebook

11.6.2. takes an IDL (.thrift) file & generates implementations in different languages

11.6.2.1. even Erlang

11.7. Map/Reduce also supported, in an extension, not part of the API

11.7.1. using Hadoop

11.7.2. which takes the data from Cassandra

12. What you usually do

12.1. If Key-Value works, use it

12.2. Else, if Column-based works, use it

12.3. Else, if Document-oriented works, use it

12.4. Else, use SQL

13. Your main consideration

13.1. How much data you will have

13.2. Unfortunately, it's hard to know in advance..

14. SQL over NoSQL

14.1. e.g.

14.1.1. Hive

14.2. also SQL vendors offering scaling similar to NoSQL

14.2.1. e.g.,

14.2.1.1. Sybase IQ

14.2.1.2. more

15. Eventually data-store solutions will merge

15.1. RDBMS will offer NoSQL features

15.1.1. to enable scale-out

15.2. NoSQL will offer RDBMS features

15.2.1. to enable richer functionality