Cassandra by Mind Map: Cassandra

1. Snitches

1.1. Simple Snitch

1.1.1. Unaware of Topology (Rack)

1.2. Rack Inferring Snitch

1.2.1. Infers the network topology from the octets of each server's IP address (see the sketch at the end of this section)

1.2.2. x.<DC>.<rack>.<node>

1.3. Property File Snitch

1.3.1. DC and rack for each node are listed explicitly in a configuration file

1.4. EC2 Snitch

1.4.1. EC2 region = DC

1.4.2. Availability Zone (AZ) = rack
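
A minimal sketch of the rack-inferring idea above, assuming IPv4 addresses of the form x.<DC>.<rack>.<node> from item 1.2.2; the function name and the "DC"/"RACK" labels are illustrative, not Cassandra's API.

```python
# Toy rack-inferring snitch: derive DC and rack from the 2nd and 3rd IP octets.
def infer_location(ip: str) -> dict:
    """Map an IPv4 address of the form x.<DC>.<rack>.<node> to (DC, rack, node)."""
    octets = ip.split(".")
    if len(octets) != 4:
        raise ValueError("expected a dotted-quad IPv4 address")
    return {"dc": f"DC{octets[1]}", "rack": f"RACK{octets[2]}", "node": octets[3]}

if __name__ == "__main__":
    # Two servers in the same DC (2nd octet) but on different racks (3rd octet).
    print(infer_location("10.1.2.15"))   # {'dc': 'DC1', 'rack': 'RACK2', 'node': '15'}
    print(infer_location("10.1.7.43"))   # {'dc': 'DC1', 'rack': 'RACK7', 'node': '43'}
```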

2. Writes

2.1. Client sends write to one coordinator node in the cluster

2.1.1. per-key

2.1.2. per-query

2.1.3. per-client

2.2. coordinator uses the partitioner to send the write to all replica nodes responsible for the key

2.3. Hinted Handoff

2.3.1. ensures a write can always be accepted, even if some replicas are down (high write availability)

2.3.2. if one replica is down

2.3.2.1. coordinator writes to all other replicas

2.3.2.2. coordinator keeps the write locally (as a hint) and replays it when the downed replica comes back

2.3.3. if all replicas are down

2.3.3.1. coordinator buffers writes for a few hours
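
A toy coordinator illustrating the hinted-handoff behaviour in 2.3: writes to live replicas go through immediately, writes for downed replicas are held locally as hints and replayed later, and hints older than a few hours are dropped. The cluster, is_up, and send_write interfaces are hypothetical stand-ins for Cassandra's internal RPCs.

```python
import time
from collections import defaultdict

HINT_WINDOW_SECONDS = 3 * 3600  # "a few hours": hints older than this are dropped

class Coordinator:
    """Toy hinted-handoff coordinator; is_up/send_write are stand-ins for real RPCs."""

    def __init__(self, cluster):
        self.cluster = cluster              # replica_id -> node object (hypothetical)
        self.hints = defaultdict(list)      # replica_id -> [(timestamp, key, value)]

    def write(self, replicas, key, value):
        now = time.time()
        for r in replicas:
            if self.cluster[r].is_up():
                self.cluster[r].send_write(key, value, now)
            else:
                # Replica is down: keep the write locally as a hint.
                self.hints[r].append((now, key, value))

    def replay_hints(self, replica_id):
        """Called when a downed replica comes back; replay non-expired hints."""
        cutoff = time.time() - HINT_WINDOW_SECONDS
        for ts, key, value in self.hints.pop(replica_id, []):
            if ts >= cutoff:
                self.cluster[replica_id].send_write(key, value, ts)
```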

2.4. Per-DC coordinator

2.4.1. different from the query coordinator

2.4.2. elected to coordinate with other DCs

2.4.3. elected using ZooKeeper (which runs a Paxos-like consensus protocol)

2.5. Writes At A Replica Node

2.5.1. 1. Append the write to the on-disk commit log (for failure recovery)

2.5.2. 2. Make changes to appropriate memtables

2.5.2.1. Memtable

2.5.2.1.1. in-memory representation of multiple key-value pairs

2.5.2.1.2. cache that can be searched by key

2.5.2.1.3. write-back cache as opposed to write-through

2.5.2.1.4. holds the latest value written for each key at this server

2.5.3. 3. Later, when memtable is full or old, flush to disk

2.5.3.1. Data File: An SSTable

2.5.3.1.1. SSTable = Sorted String Table: an immutable file of key-value pairs sorted by key

2.5.3.2. Index file: An SSTable of (key, position in data sstable) pairs

2.5.3.3. Bloom filter (for efficient searches)

2.5.3.3.1. Bloom Filter: compact probabilistic summary of the SSTable's keys; false positives possible, false negatives never
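
A toy replica tying together steps 1-3 above: append to a commit log, update a write-back memtable, and flush to a sorted "SSTable" with a (key, position) index and a small Bloom filter that lets reads skip SSTables that cannot contain a key. All class, file, and method names here are hypothetical; real Cassandra persists SSTables, indexes, and Bloom filters as separate on-disk files.

```python
import hashlib
import json
import os
import time

class TinyBloom:
    """Very small Bloom filter: k hash probes into a bit array (false positives possible)."""
    def __init__(self, m=1024, k=3):
        self.m, self.k, self.bits = m, k, bytearray(m)
    def _probes(self, key):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:4], "big") % self.m
    def add(self, key):
        for p in self._probes(key):
            self.bits[p] = 1
    def might_contain(self, key):
        return all(self.bits[p] for p in self._probes(key))

class Replica:
    """Toy write path: 1) append to commit log, 2) update memtable, 3) flush to SSTable."""
    def __init__(self, data_dir, memtable_limit=4):
        self.memtable_limit = memtable_limit
        self.memtable = {}        # key -> (timestamp, value); latest write wins
        self.sstables = []        # list of (data, index, bloom); kept in memory for the sketch
        os.makedirs(data_dir, exist_ok=True)
        self.commit_log = open(os.path.join(data_dir, "commitlog"), "a")

    def write(self, key, value):
        ts = time.time()
        self.commit_log.write(json.dumps([ts, key, value]) + "\n")   # step 1: durability
        self.commit_log.flush()
        self.memtable[key] = (ts, value)                             # step 2: write-back cache
        if len(self.memtable) >= self.memtable_limit:                # step 3: flush when full
            self.flush()

    def flush(self):
        rows = sorted(self.memtable.items())       # SSTable = rows in sorted key order
        index, bloom, data = {}, TinyBloom(), []
        for pos, (key, (ts, value)) in enumerate(rows):
            index[key] = pos                        # (key, position) index
            bloom.add(key)
            data.append((key, ts, value))
        self.sstables.append((data, index, bloom))
        self.memtable = {}

    def get(self, key):
        """Read helper: check the memtable, then only SSTables whose Bloom filter matches."""
        if key in self.memtable:
            return self.memtable[key][1]
        for data, index, bloom in reversed(self.sstables):   # newest first
            if bloom.might_contain(key) and key in index:
                return data[index[key]][2]
        return None

if __name__ == "__main__":
    r = Replica("/tmp/toy_replica")            # writes a small commit log under /tmp
    for i in range(5):
        r.write(f"user:{i}", {"visits": i})
    print(len(r.sstables), "SSTable(s) flushed;", len(r.memtable), "key(s) still in memtable")
    print(r.get("user:2"))                     # found in the flushed SSTable
```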

2.6. Compaction

2.6.1. merge SSTables over time

2.6.2. Run periodically and locally on each server

2.7. Deletes

2.7.1. Add a Tombstone to the log

2.7.2. the tombstone and the deleted data are removed during a later compaction
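
A toy compaction pass in the spirit of 2.6 and 2.7: SSTables are modelled as in-memory lists of (key, timestamp, value), the merge keeps only the newest version of each key, and tombstoned keys are dropped. Real compaction works on disk files and keeps tombstones around for a grace period before discarding them.

```python
TOMBSTONE = object()   # marker written in place of a value when a key is deleted

def compact(sstables):
    """Merge several SSTables (lists of (key, timestamp, value) sorted by key) into
    one: keep only the newest version of each key and drop tombstoned keys."""
    latest = {}
    for table in sstables:
        for key, ts, value in table:
            if key not in latest or ts > latest[key][0]:
                latest[key] = (ts, value)
    return [(k, ts, v) for k, (ts, v) in sorted(latest.items()) if v is not TOMBSTONE]

if __name__ == "__main__":
    old = [("a", 1, "x"), ("b", 1, "y")]
    new = [("a", 2, TOMBSTONE), ("c", 2, "z")]    # 'a' was deleted after the first flush
    print(compact([old, new]))                    # [('b', 1, 'y'), ('c', 2, 'z')]
```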

3. Reads

3.1. Coordinator sends the read to the replicas that have responded quickest in the past

3.2. When X replicas respond, returns the latest timestamped value from among those X

3.3. read repair

3.3.1. checks consistency in the background

3.3.2. if replica values differ, the coordinator issues a read repair with the latest value

3.3.3. eventually brings all replicas up to date (eventual consistency)

3.4. if compaction isn't run frequently enough

3.4.1. A row may be split across multiple SSTables

3.4.2. a read may need to touch multiple SSTables

3.4.3. reads become slower than writes
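
A toy coordinator read path matching 3.1-3.3: collect X replica responses, return the latest-timestamped value, and push that value back to any replica that answered with stale data (read repair). The get/put replica interface is a stand-in for real node RPCs; real Cassandra also repairs in the background and prefers the historically fastest replicas.

```python
def read(replicas, key, x):
    """Return the newest of the first X replica answers, repairing stale replicas."""
    answers = []
    for r in replicas:
        answers.append((r, r.get(key)))        # r.get returns (timestamp, value)
        if len(answers) == x:
            break
    # Pick the latest-timestamped value among the X responses.
    _, (best_ts, best_value) = max(answers, key=lambda a: a[1][0])
    # Read repair: push the winning value to replicas that returned older data.
    for r, (ts, _) in answers:
        if ts < best_ts:
            r.put(key, best_value, best_ts)
    return best_value

class FakeReplica:
    """Stand-in replica; real replicas are remote Cassandra nodes."""
    def __init__(self, ts, value):
        self.ts, self.value = ts, value
    def get(self, key):
        return (self.ts, self.value)
    def put(self, key, value, ts):
        self.ts, self.value = ts, value

if __name__ == "__main__":
    replicas = [FakeReplica(2, "new"), FakeReplica(1, "stale"), FakeReplica(2, "new")]
    print(read(replicas, "user:42", x=2))   # "new"; the stale replica gets repaired
```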

4. DHT

4.1. any coordinator knows the placement of every key (one-hop routing)

4.2. No need for finger tables

5. Data Placement Strategies

5.1. Simple Strategy

5.1.1. Random Partitioner

5.1.1.1. Chord-like hash-based partitioning: keys are hashed onto the ring (see the sketch at the end of this section)

5.1.2. Byte Ordered Partitioner

5.1.2.1. easier for range queries

5.1.2.2. no hashing of keys involved

5.2. Network Topology Strategy

5.2.1. Rack fault tolerance

5.2.2. Ensures the configured number of replicas in each DC, spread across racks
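
A sketch of the two partitioner flavors under the Simple Strategy above, plus a Chord-style ring lookup for placement; the function names are illustrative, not Cassandra's API. The hash-based partitioner spreads keys evenly but loses key ordering, while the byte-ordered one keeps adjacent keys adjacent (good for range queries, but prone to hot spots).

```python
import hashlib

def random_partitioner_token(key: str) -> int:
    """Chord-like: hash the key onto the ring, so data spreads evenly but loses key order."""
    return int.from_bytes(hashlib.md5(key.encode()).digest(), "big")

def byte_ordered_token(key: str) -> bytes:
    """No hashing: the key's own bytes are the token, so adjacent keys stay adjacent."""
    return key.encode()

def owner(token, ring):
    """Pick the first node whose ring position is >= the token (wrapping around)."""
    for position, node in sorted(ring):
        if token <= position:
            return node
    return sorted(ring)[0][1]

if __name__ == "__main__":
    # Hypothetical 4-node ring; positions derived by hashing the node names.
    ring = [(random_partitioner_token(f"node{i}"), f"node{i}") for i in range(4)]
    print(owner(random_partitioner_token("user:42"), ring))
```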

6. Membership

6.1. Any server could be the coordinator

6.2. every server needs to maintain a list of all other servers

6.3. gossip-style membership protocol

6.4. Suspicion

6.4.1. adaptively set timeout based on underlying network and failure behaviour

6.4.2. accrual failure detector

6.4.2.1. failure detector outputs a value (Phi) representing suspicion

6.4.3. Apps set an appropriate threshold

6.4.4. PHI calculation for a member

6.4.4.1. Inter-arrival times for gossip messages

6.4.4.2. PHI(t) = -log10( P(t_now - t_last) ), where P is the probability (from the CDF of past inter-arrival times) that a heartbeat takes longer than (t_now - t_last) to arrive

6.4.4.3. Determines the detection timeout, but takes into account historical inter-arrival time variation for gossiped heartbeats

6.4.5. In practice, PHI = 5 corresponds to a detection time of about 10-15 seconds
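
A toy accrual failure detector in the spirit of 6.4: heartbeat inter-arrival times are recorded, and PHI grows the longer the current silence is relative to history. The exponential model for P is an assumption made for this sketch, and the class and method names are illustrative, not Cassandra's implementation.

```python
import math
import time

class AccrualDetector:
    """Toy accrual detector: PHI(t) = -log10(P(t_now - t_last)), where P is the probability
    (estimated from observed inter-arrival times, modelled here as exponential) that a
    heartbeat takes longer than (t_now - t_last) to arrive."""

    def __init__(self):
        self.last_heartbeat = None
        self.intervals = []            # historical inter-arrival times for this member

    def heartbeat(self):
        now = time.time()
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now=None):
        now = time.time() if now is None else now
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        elapsed = now - self.last_heartbeat
        # Exponential model: P(heartbeat arrives later than elapsed) = exp(-elapsed / mean).
        p_later = math.exp(-elapsed / mean)
        return -math.log10(max(p_later, 1e-12))     # clamp to avoid log(0)

# Usage: the application declares the member failed once phi() crosses its threshold,
# e.g. PHI >= 5 as noted above.
```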