Hadoop 2.x by Mind Map: Hadoop 2.x

1. Name Node

1.1. No longer a single point of failure

2. Passive Name Node

2.1. Used for reads (possibly)

2.2. Name nodes per group (marketing, IT, etc.)

3. YARN

3.1. Applications are submitted to the YARN resource manager

3.2. Coordinates the allocation of compute resources

3.3. Node managers (on each data node) are responsible for launching compute containers

3.4. Allows all kinds of apps to run on the cluster
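The YARN flow above can be sketched as a toy model. This is only an illustration under stated assumptions, not the YARN API: the class names `ResourceManager` and `NodeManager`, the method names, and the first-fit placement rule are all hypothetical, and memory is the only resource tracked.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    """Hypothetical stand-in for the node manager on one data node."""
    name: str
    free_memory_mb: int
    containers: list = field(default_factory=list)

    def launch(self, app_id, memory_mb):
        # Reserve memory and start a compute container for the application.
        self.free_memory_mb -= memory_mb
        container = f"{app_id}@{self.name}"
        self.containers.append(container)
        return container


class ResourceManager:
    """Hypothetical stand-in for the resource manager (first-fit placement)."""

    def __init__(self, nodes):
        self.nodes = nodes

    def submit(self, app_id, memory_mb):
        # Place one container on the first node with enough free memory.
        for node in self.nodes:
            if node.free_memory_mb >= memory_mb:
                return node.launch(app_id, memory_mb)
        return None  # no capacity; a real scheduler would queue the request


rm = ResourceManager([NodeManager("node1", 2048), NodeManager("node2", 4096)])
placements = [
    rm.submit("app1", 1536),
    rm.submit("app2", 1024),
    rm.submit("app3", 4096),
]
print(placements)
```

The resource manager only decides where a container goes; the node manager is the part that launches it, which mirrors the split of responsibilities in 3.2 and 3.3.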

4. MapReduce

4.1. Scale out architecture

4.2. Structured and unstructured

4.3. Runs on large clusters, commodity hardware

4.4. fault tolerant

4.5. Java/C#/Python....

4.6. Interoperates with Hive and Pig

4.7. Phases

4.7.1. Accepts KVP as input, outputs KVP

4.7.2. Input - place the data onto HDFS

4.7.3. Split - divide the input lines among mapper instances

4.7.4. Map - transform input into intermediate output

4.7.5. Shuffle - sort and transfer map output to reducer

4.7.6. Reducer - aggregate values from shuffle

4.7.7. Output - aggregated list from the reduce phase
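The phases above can be traced with a small in-memory word count. This is a minimal sketch in plain Python, not Hadoop code: the function names are illustrative, and the input is a string rather than a file on HDFS.

```python
from collections import defaultdict


def split(text):
    """Split: hand each input line to a mapper as a (line_number, line) pair."""
    return list(enumerate(text.splitlines()))


def map_phase(record):
    """Map: turn one (line_number, line) pair into intermediate (word, 1) pairs."""
    _, line = record
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group intermediate values by key and sort the groups by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Reduce: aggregate all values for one key into a final pair."""
    return key, sum(values)


def word_count(text):
    intermediate = [pair for record in split(text) for pair in map_phase(record)]
    return [reduce_phase(key, values) for key, values in shuffle(intermediate)]


result = word_count("the quick fox\nthe lazy dog\nthe end")
print(result)
```

Every stage consumes and produces key-value pairs, which is what lets the framework distribute each stage independently.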

4.8. Joins

4.8.1. Map-side joins: performed on the mapper; faster, but with constraints: both inputs must be sorted by the same key and have an equal number of partitions, and all records with the same key must be in the same partition

4.8.2. Reduce-side joins: performed on the reducer; less efficient because the records have to be shuffled

4.8.3. Distributed cache: can be used in a map-side join; efficient, and may eliminate the need for a reducer
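The two join styles can be compared on toy data. This is a sketch under assumptions: the sample tables and function names are made up, the in-memory dictionary stands in for a small table shipped through the distributed cache, and the grouping dictionary stands in for the shuffle.

```python
from collections import defaultdict

# Hypothetical sample data: a small lookup table and a larger fact table.
departments = {10: "marketing", 20: "IT"}
employees = [("ann", 10), ("bob", 20), ("cy", 10), ("dee", 30)]


def map_side_join(records, lookup):
    """Join inside the mapper against an in-memory copy of the small table.

    No shuffle and no reducer are needed, because every mapper holds the
    whole lookup table.
    """
    return [(name, lookup[dept_id]) for name, dept_id in records if dept_id in lookup]


def reduce_side_join(left, right):
    """Tag each record with its source, group by join key, then join per key."""
    groups = defaultdict(lambda: {"L": [], "R": []})
    for name, dept_id in left:          # map: emit (key, left-tagged value)
        groups[dept_id]["L"].append(name)
    for dept_id, dept_name in right:    # map: emit (key, right-tagged value)
        groups[dept_id]["R"].append(dept_name)
    joined = []
    for dept_id in sorted(groups):      # reduce: pair up both sides per key
        for name in groups[dept_id]["L"]:
            for dept_name in groups[dept_id]["R"]:
                joined.append((name, dept_name))
    return joined


map_joined = map_side_join(employees, departments)
reduce_joined = reduce_side_join(employees, departments.items())
print(map_joined)
print(reduce_joined)
```

Both functions produce the same rows; the reduce-side version has to move every record through the grouping step first, which is the cost noted in 4.8.2.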

4.9. Combiner

4.9.1. optional mini reducer

4.9.2. runs in memory, after the map and before the reducer
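A combiner's effect can be shown by pre-aggregating one mapper's output before it is sent on. This is a minimal sketch with hypothetical function names; it assumes the reduce operation is a sum, which is safe to apply in stages.

```python
from collections import Counter


def map_words(line):
    """Map: emit (word, 1) for every word in the line."""
    return [(word, 1) for word in line.split()]


def combine(pairs):
    """Combiner: sum the counts per key locally, on one mapper's output."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())


mapper_output = map_words("to be or not to be")
combined = combine(mapper_output)
print(len(mapper_output), "pairs before,", len(combined), "pairs after")
```

Fewer pairs leave the mapper, so less data has to be shuffled to the reducers.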

4.10. Partitioner

4.10.1. Defines how keys are assigned to reducers

4.10.2. Determines which reducer receives which KVP
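Key-to-reducer assignment can be sketched as a hash of the key modulo the number of reducers. The `partition` function and the use of CRC32 are assumptions chosen for a deterministic illustration; Hadoop's default `HashPartitioner` applies the same idea to the key's Java `hashCode()`.

```python
import zlib


def partition(key, num_reducers):
    """Return the index of the reducer that receives this key."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


pairs = [("apple", 1), ("banana", 1), ("apple", 1), ("cherry", 1)]
assignments = {}
for key, value in pairs:
    # Every pair with the same key lands on the same reducer.
    assignments.setdefault(partition(key, 3), []).append((key, value))
print(assignments)
```

Because the function depends only on the key, all values for one key meet at a single reducer, which is what makes the aggregation correct.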

4.11. Interfaces

4.11.1. Mapper interface: maps input KVPs into intermediate KVPs

4.11.2. Reducer interface: data aggregation; reduces the intermediate KVPs into final KVPs
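The two interfaces can be modelled as abstract classes. This is a Python analogy only, not the Hadoop Java API: the class names, method signatures, the `run_job` helper, and the temperature records are all hypothetical.

```python
from abc import ABC, abstractmethod
from collections import defaultdict


class Mapper(ABC):
    @abstractmethod
    def map(self, key, value):
        """Yield intermediate (key, value) pairs for one input pair."""


class Reducer(ABC):
    @abstractmethod
    def reduce(self, key, values):
        """Yield final (key, value) pairs for one key and all of its values."""


class MaxTempMapper(Mapper):
    def map(self, key, value):
        # Input value looks like "city,temperature"; the input key is unused.
        city, temp = value.split(",")
        yield city, int(temp)


class MaxTempReducer(Reducer):
    def reduce(self, key, values):
        yield key, max(values)


def run_job(mapper, reducer, records):
    """Tiny driver: map every record, group by key, then reduce each group."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper.map(key, value):
            groups[out_key].append(out_value)
    output = []
    for key in sorted(groups):
        output.extend(reducer.reduce(key, groups[key]))
    return output


records = [(0, "oslo,3"), (1, "cairo,31"), (2, "oslo,7"), (3, "cairo,28")]
job_output = run_job(MaxTempMapper(), MaxTempReducer(), records)
print(job_output)
```

A job author only writes the two subclasses; the grouping between them is supplied by the framework.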

5. Hive components

5.1. HQL statement is sent to the driver

5.2. compiler invoked by the driver

5.3. compiler translates the statement into a DAG (directed acyclic graph) of stages

5.4. Driver submits jobs to the execution engine
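The role of the DAG can be illustrated by ordering stages so that each runs only after the stages it depends on. The plan below is made up for illustration and is not real Hive compiler output; the ordering uses Python's standard `graphlib` module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Hypothetical plan: each stage maps to the set of stages it depends on.
plan = {
    "scan_orders": set(),
    "scan_customers": set(),
    "join": {"scan_orders", "scan_customers"},
    "aggregate": {"join"},
}


def execution_order(dag):
    """Return the stages in an order that respects every dependency."""
    return list(TopologicalSorter(dag).static_order())


order = execution_order(plan)
print(order)
```

The two scans have no dependency on each other, so an execution engine is free to run them in parallel before the join.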

6. Data organization

6.1. Database

6.1.1. catalog of namespaces that separate tables

6.1.2. schema can evolve

6.2. Table

6.2.1. Table types: internal/managed (stored in HDFS; Hive controls the life cycle, and the data is deleted with the table) and external (stored outside of Hive; the data is not deleted when the table is deleted)
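The difference between the two table types shows up when a table is dropped. This sketch simulates it on the local file system: the `metastore` dictionary, the `drop_table` function, and the table names are hypothetical stand-ins, not Hive APIs.

```python
import shutil
import tempfile
from pathlib import Path


def drop_table(metastore, name):
    """Remove the table's metadata; delete its data only if it is managed."""
    table = metastore.pop(name)
    if table["type"] == "managed":
        shutil.rmtree(table["location"])


root = Path(tempfile.mkdtemp())
metastore = {}
for name, table_type in [("sales", "managed"), ("logs", "external")]:
    location = root / name
    location.mkdir()
    (location / "part-00000").write_text("row1\n")
    metastore[name] = {"type": table_type, "location": location}

drop_table(metastore, "sales")
drop_table(metastore, "logs")

sales_exists = (root / "sales").exists()  # managed: data removed with the table
logs_exists = (root / "logs").exists()    # external: data left in place
print(sales_exists, logs_exists)

shutil.rmtree(root)  # clean up the temporary directory
```

In both cases the metadata entry disappears; only the managed table takes its files with it.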

6.3. Partition

6.3.1. A directory

6.3.2. Can reduce size of map stages, mappers, I/O, and time
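Partition pruning can be sketched with a directory listing. The file names and row counts are invented for illustration; the `column=value` directory naming follows Hive's convention for partition directories.

```python
# Hypothetical layout: one directory per partition value, mapped to row counts.
table_files = {
    "sales/country=US/part-0": 120,
    "sales/country=US/part-1": 80,
    "sales/country=DE/part-0": 50,
    "sales/country=FR/part-0": 40,
}


def prune(files, column, value):
    """Keep only the files under the partition directory column=value."""
    wanted = f"/{column}={value}/"
    return {path: rows for path, rows in files.items() if wanted in path}


scanned = prune(table_files, "country", "US")
print(sum(scanned.values()), "of", sum(table_files.values()), "rows read")
```

A query that filters on the partition column reads only the matching directory, which is where the savings in mappers, I/O, and time come from.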

6.4. Bucket

6.4.1. a file in a table directory

6.4.2. separates table data into more manageable parts

6.4.3. instead of creating lots of partitions
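Bucketing on a high-cardinality column can be sketched as follows. This assumes an integer column whose hash is the value itself and a hypothetical `bucket_for` helper; the point is that the number of files stays fixed, whereas partitioning on the same column would need one directory per distinct value.

```python
def bucket_for(user_id, num_buckets):
    """Assign a row to a bucket by hashing the bucketing column."""
    return user_id % num_buckets  # assumption: an integer hashes to itself


user_ids = range(1, 1001)
num_buckets = 4

buckets = {}
for user_id in user_ids:
    buckets.setdefault(bucket_for(user_id, num_buckets), []).append(user_id)

partitions_needed = len(set(user_ids))  # one directory per distinct value
print(len(buckets), "bucket files versus", partitions_needed, "partitions")
```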

6.5. View

6.5.1. logical construct

6.5.2. treated like a table

6.6. Index

6.6.1. compact

6.6.2. bitmap

6.7. Hive metastore

6.7.1. hive metadata

6.7.2. kept in a SQL database (MySQL, PostgreSQL, etc.)

6.7.3. stores data about tables: locations, partitions, schemas, and columns

7. Pig

7.1. Scripting language for analysing large data sets

7.2. Similar to SQL and does not require Java

7.3. Can run on Hadoop 1 and 2 without changes