Big Data & Data Lake


1. Hadoop

1.1. The Hadoop framework takes data from different sources, divides it into smaller chunks (blocks), and stores each block on a separate machine (node) within the cluster. This storage layer is the Hadoop Distributed File System (HDFS). Hadoop is also an ecosystem of technologies that help collect, store, process, and analyze big data:
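The chunk-and-distribute idea above can be sketched in plain Python. This is only an illustration, not the HDFS API: the function names, the tiny block size, the node names, and the round-robin placement are all assumptions made for the example. Real HDFS uses much larger blocks (128 MB by default in recent versions) and also replicates each block across several nodes, which this sketch ignores.

```python
from collections import defaultdict


def split_into_blocks(data: bytes, block_size: int) -> list:
    """Cut a byte string into consecutive blocks of at most block_size bytes."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(blocks, nodes):
    """Assign block indexes to nodes round-robin (hypothetical placement policy)."""
    placement = defaultdict(list)
    for index, _ in enumerate(blocks):
        placement[nodes[index % len(nodes)]].append(index)
    return dict(placement)


data = b"0123456789" * 10  # 100 bytes of stand-in input
blocks = split_into_blocks(data, block_size=32)
placement = place_blocks(blocks, ["node-1", "node-2", "node-3"])
print(placement)
```

Joining the blocks back in index order reproduces the original data, which is the property a distributed file system has to preserve when it reads a file back.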

1.1.1. Spark

1.1.1.1. WHAT? Technology to process and aggregate data sets in parallel (one master node coordinating several worker nodes), using files stored in HDFS. Usable from Python (PySpark).
Advantages:
1. *FAST PROCESSING* Spark is built around the Resilient Distributed Dataset (RDD), an abstraction for a partitioned, fault-tolerant collection of data. Working on RDDs saves time on read and write operations, which can make Spark roughly ten to one hundred times faster than Hadoop MapReduce on some workloads.
2. *FLEXIBILITY* Spark supports multiple languages (Java, Python, Scala, R).
3. *IN-MEMORY COMPUTING* Spark keeps data in the RAM of the servers, which allows quick access and in turn speeds up analytics.
4. *REAL-TIME PROCESSING* Spark can process streaming data in near real time and so produce results almost immediately.
5. *BETTER ANALYTICS* Spark ships with SQL queries, machine learning algorithms, and other complex analytics that improve analysis performance.
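The master/worker pattern described above can be imitated with the Python standard library. This is a sketch of the idea only and does not use the Spark API: `count_words`, `distributed_word_count`, and the sample partitions are hypothetical names and data, and a thread pool stands in for the worker nodes.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_words(partition):
    """Worker task: count the words inside one partition of lines."""
    counts = Counter()
    for line in partition:
        counts.update(line.split())
    return counts


def distributed_word_count(partitions, workers=2):
    """Master: hand partitions to workers, then merge the partial counts."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_words, partitions))
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts together
    return total


# Each inner list plays the role of one partition held in memory.
partitions = [
    ["spark hive hadoop", "spark spark"],
    ["hive hadoop", "spark"],
]
result = distributed_word_count(partitions)
print(result.most_common())
```

Each partition is processed independently and only the small partial results are merged, which is why this style of computation scales with the number of workers.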

1.1.2. Hive

1.1.2.1. WHAT? Technology to query structured data with SQL, using files stored in HDFS.
Advantages:
1. *FAST* Hive uses batch processing, so it works efficiently across very large distributed data sets.
2. *FAMILIAR* Created for non-developers, with a SQL-like interface (HiveQL).
3. *EASY ACCESS* Hive keeps table metadata (schemas and file locations) in a metastore, which makes data extraction and analysis easier.
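To make the role of the metastore concrete, here is a minimal sketch in plain Python, not Hive itself. It assumes a hypothetical `orders` table whose schema and file location live in a small dictionary, while the rows stay in a raw CSV file; the schema is applied only when the file is read. All names and values are invented for the example.

```python
import csv
import io

# Hypothetical metastore: table name -> column definitions and file location.
METASTORE = {
    "orders": {
        "columns": [("order_id", int), ("city", str), ("amount", float)],
        "location": "orders.csv",
    }
}

# Stand-in for raw files stored in HDFS.
FILES = {"orders.csv": "1,Paris,12.5\n2,Lyon,8.0\n3,Paris,20.0\n"}


def read_table(name):
    """Apply the schema from the metastore while reading the raw file."""
    table = METASTORE[name]
    reader = csv.reader(io.StringIO(FILES[table["location"]]))
    for row in reader:
        yield {col: cast(value) for (col, cast), value in zip(table["columns"], row)}


def total_by_city(name):
    """Rough equivalent of: SELECT city, SUM(amount) FROM orders GROUP BY city."""
    totals = {}
    for record in read_table(name):
        totals[record["city"]] = totals.get(record["city"], 0.0) + record["amount"]
    return totals


totals = total_by_city("orders")
print(totals)
```

The point of the separation is that the query only names a table and its columns; where the files are and how to parse them is looked up in the metadata.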

2. SOURCES
- What is Hadoop: https://aws.amazon.com/emr/details/hadoop/what-is-hadoop/
- What is Spark?: https://chartio.com/learn/data-analytics/what-is-spark/
- What is Apache Hive?: https://aws.amazon.com/big-data/what-is-hive/
- What Is Big Data? Big Data Explained: https://www.tableau.com/learn/articles/what-is-big-data
- The 7 V’s of Big Data: https://impact.com/marketing-intelligence/7-vs-big-data/
- Difference between Batch Processing and Stream Processing: https://www.geeksforgeeks.org/difference-between-batch-processing-and-stream-processing/
- How do you design and implement stream processing vs batch processing architectures?: https://www.linkedin.com/advice/1/how-do-you-design-implement-stream-processing

3. Main objective is to manage the cluster of the Data Lake (DL). CLUSTER = a group of servers; each machine is a node used for *processing* and/or *storage*. Not to be confused with data clustering, a data mining technique based on machine learning that splits data into several subsets based on similarity.

4. DATA PROCESSING SYSTEMS

4.1. Batch

4.1.1. Batch processing:
- Collects a large volume of data in a Data Lake (DL) or Data Warehouse (DW) and processes it all at once within a specific timeframe
- The amount of data is finite (e.g. payroll)
- Tools: Hadoop ecosystem, ETL/ELT pipelines, Airflow
- Batches are usually processed at night or offline
- Works on historical data, so results lag behind the events (data latency)
- Batch processing applications can handle large and complex data sets
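A batch job in the sense above can be sketched in a few lines of Python, using the payroll example. The function name, employees, hours, and rates are all made up for illustration; the point is that the input is a finite set collected beforehand and processed in a single run.

```python
def run_payroll_batch(timesheets, hourly_rates):
    """Process the whole finite set of records in one pass; return pay per employee."""
    pay = {}
    for employee, hours in timesheets:
        pay[employee] = pay.get(employee, 0.0) + hours * hourly_rates[employee]
    return pay


# All records for the period are already collected before the job starts.
timesheets = [("ana", 8), ("ben", 6), ("ana", 7), ("ben", 8)]
hourly_rates = {"ana": 20.0, "ben": 15.0}

payroll = run_payroll_batch(timesheets, hourly_rates)
print(payroll)
```

Nothing is produced until the whole run finishes, which is the data latency mentioned in the list.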

4.2. Stream

4.2.1. Stream processing:
- Processes a continuous stream of data as soon as it is generated
- The amount of data is unbounded (e.g. food delivery)
- Real-time collection, distribution, and analysis of data
- Tools: Apache Kafka (for data ingestion), Flink, Spark Streaming
- Event stream processing uses a messaging system and processes each event/transaction individually
- Good for fraud detection and real-time analytics
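Per-event processing can be sketched with a Python generator. This is an illustration only and does not use Kafka, Flink, or Spark: `event_source` is a hypothetical stand-in for a message topic, and the fixed amount limit is an assumed, deliberately naive fraud rule.

```python
from typing import Iterable, Iterator


def detect_suspicious(events: Iterable[dict], limit: float) -> Iterator[dict]:
    """Inspect each event as it arrives; yield an alert as soon as one exceeds the limit."""
    for event in events:
        if event["amount"] > limit:
            yield {"alert": "suspicious", **event}


def event_source():
    """Stand-in for an unbounded feed such as a message-queue topic."""
    yield {"card": "A", "amount": 25.0}
    yield {"card": "B", "amount": 4200.0}
    yield {"card": "A", "amount": 12.0}


# The sample feed is finite so the example terminates; a real stream would not end.
alerts = list(detect_suspicious(event_source(), limit=1000.0))
print(alerts)
```

Unlike the batch case, each alert is available the moment its event is seen, without waiting for the rest of the data.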