Intro to NoSQL & Cassandra, Ran Tavori

About

@rantav

@outbrain

http://prettyprint.me

NoSQL

Not Only SQL, not No SQL

SQL is Good

SQL DBs implement ACID

Problem

At Internet scale

SQL gets too expensive

usually hardware costs

The Internet changed the scale of applications & introduced the need; previous corporate apps didn't have such scale

Social Apps - not banks

high read/write rates

frequent schema changes

growth is usually non-linear

no need for the same level of ACID, e.g., a Twitter feed is not always consistent

Usage

Facebook still has sharded MySQL for main storage, but also Cassandra

Twitter too, but also Cassandra

Google's main storage is BigTable, but also some sharded MySQL

Scaling solutions

Replication

Master & slaves, scales reads

consistency problems due to replica synchronization; caching has a similar problem
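
A minimal sketch of the replication consistency problem, assuming a single master and one replica that copies writes only when it syncs. The `Master` and `Replica` classes and the `pending` list are hypothetical names for illustration, not any database's API.

```python
class Master:
    """Accepts all writes and remembers which ones the replica has not seen yet."""

    def __init__(self):
        self.data = {}
        self.pending = []  # writes not yet shipped to the replica

    def write(self, key, value):
        self.data[key] = value
        self.pending.append((key, value))


class Replica:
    """Serves reads from its own copy, which lags until sync() runs."""

    def __init__(self):
        self.data = {}

    def read(self, key):
        return self.data.get(key)

    def sync(self, master):
        # Apply the master's outstanding writes in order, then mark them shipped.
        for key, value in master.pending:
            self.data[key] = value
        master.pending.clear()


master, replica = Master(), Replica()
master.write("status", "v1")
replica.sync(master)
master.write("status", "v2")
stale = replica.read("status")  # the replica has not synced the second write yet
replica.sync(master)
fresh = replica.read("status")
print(stale, fresh)
```

Reads scale because any replica can answer them, but a read that lands between a write and the next sync returns the old value.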

Sharding

also scales writes

makes you "sharding slaves": constantly resharding, which causes pain & lots of maintenance

you lose some of SQL's features: joins, sorting, grouping &c
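
A rough sketch of why resharding hurts, assuming the simplest scheme: a stable hash of the key modulo the shard count. The notes do not say which sharding scheme was discussed, so `shard_for`, `moved_fraction` and the `user:` keys are hypothetical.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard using a stable hash modulo the shard count."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def moved_fraction(keys, old_shards: int, new_shards: int) -> float:
    """Fraction of keys that land on a different shard after the count changes."""
    moved = sum(
        1 for k in keys if shard_for(k, old_shards) != shard_for(k, new_shards)
    )
    return moved / len(keys)


keys = [f"user:{i}" for i in range(10_000)]
fraction = moved_fraction(keys, old_shards=4, new_shards=5)
print(f"{fraction:.0%} of keys move when going from 4 to 5 shards")
```

Under this assumed scheme, adding a single shard relocates most of the data, which is the maintenance pain described above.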

Brewer's CAP theorem

Eric Brewer, Berkeley professor & co-founder of Inktomi

needed to deal with

consistency

availability

partition tolerance, tolerating disconnection between nodes

& found that

you can only choose 2

as apps get larger, partition tolerance is a must

with many nodes, the mean time between failures gets short

so you need to choose between consistency & availability

A + C (no partition tolerance)

Master server, MySQL

C + P

not always available; unavailable some of the time

A + P

tolerate inconsistencies

e.g., BigTable, Dynamo, Cassandra

Consistency levels

Strong

Weak

Eventual

levels: Causal, Read your writes, Monotonic

eventually the data will reach all replicas correctly

NoSQL examples

BigTable

to some extent not A

Dynamo

Cassandra

MongoDB

CouchDB

scaling is awful, but nice features

HBase

similar to Cassandra

much more difficult to manage

Cassandra

Developed at Facebook

Data model of BigTable

Network mgmt modelled on Dynamo

Eventual Consistency

2 developers who worked on Amazon's Dynamo & didn't like it there

moved to Facebook

Implemented in Java

beautiful implementation

distributed

tens to thousands of nodes

useful for lots of nodes, not few

"Down to earth" Consistency

N/R/W

N, number of replicas, for any data item

W, number of nodes a write operation blocks on

R, number of nodes a read operation blocks on

typical values

W: W=1 blocks until the 1st node is written successfully; W=N blocks until all nodes are written successfully; W=0, async writes

R: R=1 blocks until the 1st node returns an answer; R=N blocks until all nodes return an answer; R=0 doesn't make sense

Quorum: R=N/2+1, W=N/2+1; you write to all replicas but wait for just W acks; fully consistent, since R+W > N

N defined on server

different R/W setups on different column-families, API calls, clients &c
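
A small sketch of the N/R/W arithmetic, assuming the usual overlap rule: a read is guaranteed to touch at least one replica holding the latest acknowledged write when R + W > N. The function names are hypothetical and this is not Cassandra client code.

```python
def quorum(n: int) -> int:
    """Smallest majority of n replicas: N/2 + 1 with integer division."""
    return n // 2 + 1


def read_sees_latest_write(n: int, r: int, w: int) -> bool:
    """True when every read set must overlap every write set, i.e. R + W > N."""
    return r + w > n


n = 3
setups = [
    ("quorum reads & writes", quorum(n), quorum(n)),
    ("W=1, R=N", n, 1),
    ("W=1, R=1", 1, 1),
]
for label, r, w in setups:
    print(f"{label}: R={r}, W={w}, consistent={read_sees_latest_write(n, r, w)}")
```

With N=3 the quorum is 2, so quorum reads and writes overlap on at least one replica, while W=1 with R=1 trades that guarantee for latency.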

Data model

Forget SQL

or any query language; you can integrate with text search (Lucandra): you search by text, get an id & then continue by key; BigTable added indices as secondary keys, which will be supported in Cassandra soon

or any grouping/aggregation

Column-Based

modelled after BigTable

difference from Key-Value, you can get/set just some columns

scales really well

denormalization is a must

it's the client's responsibility to write all denormalized copies; writes are not transactional

works well with their disk usage

Vocabulary

Keyspace: like a namespace for unique keys; comparable to a schema

Column Family: very much like a table, with rows & columns, but not quite; a sparse array

Key: represents a row (of columns); search is always by key, so you must model your data according to predicted usage; if use-cases change, you're stuck, but there are some solutions

Column: represents a value with a column name, value & timestamp

Super Column: a column that holds a list of columns inside
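
The vocabulary can be pictured as nested maps. This is a toy in-memory model, not the Thrift client: the `insert` and `get_slice` functions only borrow operation names and do not follow the real signatures. The "newest timestamp wins" rule, the `Users` column family and the sample rows are assumptions for illustration.

```python
import time

# keyspace -> column family -> row key -> column name -> (value, timestamp)
keyspace = {"Users": {}}


def insert(column_family, key, name, value, timestamp=None):
    """Set one column; the write with the newest timestamp wins (assumed rule)."""
    ts = time.time_ns() if timestamp is None else timestamp
    row = keyspace[column_family].setdefault(key, {})
    current = row.get(name)
    if current is None or ts >= current[1]:
        row[name] = (value, ts)


def get_slice(column_family, key, names):
    """Read only the requested columns of one row; absent columns are skipped."""
    row = keyspace[column_family].get(key, {})
    return {n: row[n][0] for n in names if n in row}


insert("Users", "rantav", "email", "old@example.com", timestamp=1)
insert("Users", "rantav", "email", "new@example.com", timestamp=2)
insert("Users", "rantav", "twitter", "@rantav", timestamp=2)
insert("Users", "other", "email", "x@example.com", timestamp=1)  # sparse row: no twitter column

print(get_slice("Users", "rantav", ["email", "twitter"]))
```

Rows in the same column family need not share columns, and a lookup always starts from the row key, which is why the data has to be modelled around the expected queries.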

API

programmatic not declarative

supports many languages

API

get

get_slice, some columns

multiget, saves round-trips

multiget_slice

get_count

get_range_slice, slice over rows & columns

get_range_slices

insert

remove

batch_insert

batch_mutate

also meta-api

e.g., describe ring peers

not convenient

to say the least

written in Thrift

language/compiler developed at Facebook

takes an IDL file & generates implementations in different languages, even Erlang

Map/Reduce also supported, in an extension, not part of the API

using Hadoop

which takes the data from Cassandra

You usually

If Key-Value works, use it

Else, if Column-based works, use it

Else, if Document-oriented works, use it

Else, use SQL

Your main consideration

How much data will you have?

Unfortunately, it's hard to know in advance.

SQL over NoSQL

e.g.

Hive

also SQL vendors offering scaling similar to NoSQL

e.g., Sybase IQ & more

Eventually data-store solutions will merge

RDBMS will offer NoSQL features

to enable scale-out

NoSQL will offer RDBMS features

to enable better features