Big Data Fundamentals

1. Big data

1.1. volume

1.1.1. amount of data

1.2. velocity

1.2.1. frequency at which data arrives

1.3. variety

1.3.1. types and formats of the data

2. Components

2.1. Apache Hadoop

2.1.1. Scalable storage and batch processing system

2.1.2. Complements existing systems by scaling out

2.1.3. HDFS

2.1.3.1. Hadoop Distributed File System

2.1.4. YARN

2.1.4.1. Cluster resource management: scheduling and executing jobs

2.1.5. Map Reduce

2.1.5.1. YARN-based system for parallel processing of large data sets on the cluster

2.1.6. How does this work?

2.1.6.1. Architecture

2.1.6.1.1. Name Nodes

2.1.6.1.2. Data Nodes

2.1.6.2. Process

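2.1.6.2.1. Client asks the name node where to write, then sends the data to a data node in blocks (64 MB by default in early Hadoop versions; 128 MB in Hadoop 2 and later)

2.1.6.2.2. The data node replicates each block to 2 other data nodes (default replication factor of 3)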
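
The write process can be sketched in a few lines of Python. This is a simplified model, not the HDFS implementation: the function names, the node names (dn1 to dn4) and the round-robin placement are assumptions made for illustration. Real HDFS placement is rack-aware, and the name node only tracks metadata while the data itself flows between data nodes.

```python
BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB chunk size, as in the outline
REPLICATION = 3                 # first copy plus 2 replicas


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes of the blocks a file of file_size bytes is cut into."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_blocks(num_blocks, data_nodes, replication=REPLICATION):
    """Hypothetical placement policy: round-robin over the data nodes.

    It only guarantees that every block lands on `replication`
    distinct nodes; it ignores racks, load and free space.
    """
    if replication > len(data_nodes):
        raise ValueError("not enough data nodes for the replication factor")
    placement = {}
    for block_id in range(num_blocks):
        start = block_id % len(data_nodes)
        placement[block_id] = [
            data_nodes[(start + i) % len(data_nodes)] for i in range(replication)
        ]
    return placement


# A 200 MB file is cut into three full 64 MB blocks and one 8 MB block.
blocks = split_into_blocks(200 * 1024 * 1024)
layout = place_blocks(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
```

Losing any single data node in this layout still leaves two copies of every block, which is the point of replicating to 2 other nodes.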

2.1.7. Big data job types

2.1.7.1. MapReduce

2.1.7.1.1. Parallel processing of large data sets

2.1.7.1.2. Map - splits the input across separate nodes, each emitting key-value pairs (KVPs)

2.1.7.1.3. Reduce - aggregates the outputs

2.1.7.1.4. Partition - decides which reducer receives each KVP from the mapper

2.1.7.1.5. Shuffle - transfers the data from the mappers to the reducers

2.1.7.1.6. Sort - data is sorted by key as it arrives at the reducers
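
The phases above can be illustrated with a single-process word count. This is a toy sketch rather than Hadoop code: all function names are made up for the example, the "nodes" are plain Python lists, and the partitioner is a simple byte-sum hash chosen only because it is deterministic.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) key-value pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def partition(key, num_reducers):
    """Partition: pick the reducer for a key (deterministic toy hash)."""
    return sum(key.encode("utf-8")) % num_reducers


def shuffle_and_sort(mapped_pairs, num_reducers):
    """Shuffle each pair to its reducer, then sort every reducer's input by key."""
    reducer_inputs = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in mapped_pairs:
        reducer_inputs[partition(key, num_reducers)][key].append(value)
    return [sorted(groups.items()) for groups in reducer_inputs]


def reduce_phase(key, values):
    """Reduce: aggregate all values seen for one key."""
    return key, sum(values)


def word_count(lines, num_reducers=2):
    """Run map, partition/shuffle/sort and reduce in sequence."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    results = {}
    for reducer_input in shuffle_and_sort(mapped, num_reducers):
        for key, values in reducer_input:
            word, total = reduce_phase(key, values)
            results[word] = total
    return results


counts = word_count(["big data is big", "data moves fast"])
```

Because the partitioner sends every occurrence of a key to the same reducer, each reducer can total its keys without talking to the others.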

2.1.7.2. Hive

2.1.7.2.1. data warehousing infrastructure

2.1.7.2.2. Hive Query Language - HiveQL (HQL)

2.1.7.2.3. Allows unstructured data to be queried as if it were structured

2.1.7.2.4. Allows ad hoc querying and analysis

2.1.7.3. Pig

2.1.7.3.1. Programming environment for data tasks

2.1.7.3.2. Pig Latin - procedural data-flow language that compiles to MapReduce jobs (in contrast to the declarative HQL)

3. Use cases

4. Database Architecture

4.1. ACID

4.1.1. Atomicity

4.1.1.1. All or nothing for transactions

4.1.2. Consistency

4.1.2.1. Only valid data is saved

4.1.3. Isolation

4.1.3.1. Concurrent transactions do not interfere with each other, such as two people purchasing the last ticket

4.1.4. Durability

4.1.4.1. Committed data survives failures, such as an AZ (Availability Zone) failure
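
The last-ticket scenario can be tried out with Python's built-in sqlite3 module. This is a minimal sketch under stated assumptions: the tickets and orders tables and the buy_ticket helper are invented for the example, and the two buyers run one after the other on a single connection, so it demonstrates atomicity and consistency rather than true concurrent isolation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Consistency rule: the number of remaining tickets may never go below zero.
conn.execute(
    "CREATE TABLE tickets ("
    "id INTEGER PRIMARY KEY, remaining INTEGER CHECK (remaining >= 0))"
)
conn.execute("CREATE TABLE orders (buyer TEXT NOT NULL)")
conn.execute("INSERT INTO tickets VALUES (1, 1)")  # exactly one ticket left
conn.commit()


def buy_ticket(conn, buyer):
    """Record the order and decrement stock as one all-or-nothing transaction."""
    try:
        with conn:  # commits on success, rolls back on any exception
            conn.execute("INSERT INTO orders (buyer) VALUES (?)", (buyer,))
            conn.execute(
                "UPDATE tickets SET remaining = remaining - 1 WHERE id = 1"
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected the update; the order row is undone too.
        return False


first = buy_ticket(conn, "alice")
second = buy_ticket(conn, "bob")
orders = [row[0] for row in conn.execute("SELECT buyer FROM orders")]
remaining = conn.execute(
    "SELECT remaining FROM tickets WHERE id = 1"
).fetchone()[0]
```

The second purchase fails as a whole: the order row inserted for the second buyer is rolled back together with the rejected stock update, so no order exists without a ticket behind it.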

5. OLTP vs OLAP

5.1. OLTP

5.1.1. Many users, constant transactions

5.1.2. Critical to the business

5.1.3. Data volumes typically in the GB range

5.2. OLAP

5.2.1. Analytics

5.2.2. Periodic large updates and complex queries

5.2.3. Data volumes typically in the TB/PB range

6. Data Warehouse

6.1. Uses ETL to ingest the data

6.2. Data Mart

6.2.1. A smaller DWH

6.2.2. Covers a subset of the data, so it is faster and easier to query
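
A small ETL run and a data mart can be sketched with sqlite3 standing in for the warehouse. Everything here is hypothetical: the raw_sales records, the table names and the cleaning rules are made up for the example, and a real warehouse would use a dedicated engine and far larger volumes.

```python
import sqlite3

# Extract: a hypothetical raw export from an operational (OLTP) system.
raw_sales = [
    {"region": " emea ", "amount": "120.50", "day": "2024-01-03"},
    {"region": "APAC", "amount": "80.00", "day": "2024-01-03"},
    {"region": "emea", "amount": "not-a-number", "day": "2024-01-04"},
]


def transform(rows):
    """Transform: normalise region names, parse amounts, drop invalid rows."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # skip records whose amount cannot be parsed
        clean.append((row["region"].strip().upper(), amount, row["day"]))
    return clean


# Load: write the cleaned rows into the warehouse table.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (region TEXT, amount REAL, day TEXT)")
warehouse.executemany("INSERT INTO sales VALUES (?, ?, ?)", transform(raw_sales))

# Data mart: a smaller table holding only the EMEA subset of the warehouse.
warehouse.execute(
    "CREATE TABLE emea_mart AS SELECT * FROM sales WHERE region = 'EMEA'"
)
sales_rows = warehouse.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
emea_total = warehouse.execute("SELECT SUM(amount) FROM emea_mart").fetchone()[0]
```

Queries against emea_mart scan only the rows that one team cares about, which is why a data mart tends to be faster and simpler to work with than the full warehouse.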