Hadoop
by BIBIANA ORTEGON
1. Characteristics and Environment of Hadoop
1.1. Hadoop Overview: Open-source framework for distributed storage and processing of large datasets.
1.2. Core Components:
1.2.1. HDFS (Hadoop Distributed File System): Distributed storage for large datasets with fault tolerance through data replication (a small usage sketch follows this section).
1.2.2. MapReduce: Programming model for parallel batch processing of data across the cluster.
1.2.3. YARN (Yet Another Resource Negotiator): Cluster resource management and job scheduling.
1.2.4. Hadoop Common: Shared libraries and utilities needed by the other Hadoop components.
1.3. Environment: Suitable for large-scale data processing environments, from data lakes to cloud infrastructure.
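HDFS is commonly accessed through the Java FileSystem API. The sketch below writes a small file and reads it back; it assumes the Hadoop client libraries are on the classpath and that the default configuration points at a running cluster. The class name and the path /demo/hello.txt are illustrative, not part of the outline above.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.nio.charset.StandardCharsets;

public class HdfsQuickStart {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);       // handle to the configured (distributed) file system
        Path file = new Path("/demo/hello.txt");    // illustrative path

        // Write a small file; HDFS replicates its blocks across DataNodes automatically.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello, hdfs".getBytes(StandardCharsets.UTF_8));
        }

        // Read the file back.
        try (FSDataInputStream in = fs.open(file)) {
            byte[] buf = new byte[(int) fs.getFileStatus(file).getLen()];
            in.readFully(buf);
            System.out.println(new String(buf, StandardCharsets.UTF_8));
        }
    }
}
```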
2. MapReduce in Hadoop
2.1. MapReduce Model: A programming model for parallel data processing (a word-count sketch follows this section).
2.1.1. Map Phase: Input data is split into smaller chunks and processed in parallel, producing intermediate key-value pairs.
2.1.2. Reduce Phase: Intermediate results from the map tasks are grouped by key and aggregated.
2.2. Efficiency: Increases processing speed and optimizes resource use by executing tasks in parallel across the cluster.
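A minimal word-count sketch of the two phases, written against the standard org.apache.hadoop.mapreduce API. Input and output paths come from the command line; the class names follow the conventional example, so treat them as illustrative.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: each input line is split into words; emit (word, 1) per occurrence.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: all counts for the same word arrive together; sum them.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // combiner reuses the reducer to pre-aggregate map output
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Packaged in a jar, this would be submitted with something like `hadoop jar wordcount.jar WordCount <input dir> <output dir>`; YARN then schedules the map and reduce tasks across the cluster.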
3. Hadoop Advantages
3.1. Scalability: Horizontally scalable; can process large datasets across distributed clusters.
3.2. Cost-efficiency: Utilizes commodity hardware and open-source software.
3.3. Fault Tolerance: Data replication ensures system reliability even if a node fails (see the replication sketch after this list).
3.4. High-Throughput: Optimized for high-speed data processing.
3.5. Flexibility: Supports diverse types of data (structured, semi-structured, unstructured).
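As a small illustration of the replication behind 3.3, the sketch below reads and then raises the replication factor of one HDFS file through the FileSystem API. It assumes a running cluster and an existing file; the path and class name are illustrative.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationCheck {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/demo/hello.txt");   // illustrative path

        // Current replication factor for this file's blocks (the cluster default is typically 3).
        short current = fs.getFileStatus(file).getReplication();
        System.out.println("replication = " + current);

        // Raise the replication factor; the NameNode schedules the extra block copies asynchronously.
        fs.setReplication(file, (short) 5);
    }
}
```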
4. Timeline of the Development of Hadoop
4.1. 2003: Google introduces the Google File System (GFS).
4.2. 2004: Google publishes the MapReduce programming model.
4.3. 2006: Hadoop is developed by Doug Cutting and Mike Cafarella as an open-source project under Apache.
4.4. 2008: Hadoop becomes a top-level project at Apache Software Foundation.
4.5. Present: Ongoing improvements, extensive adoption in big data solutions.
5. Relation of Big Data Analytics to Machine Learning and Cloud Computing
5.1. Big Data Analytics: The process of analyzing large datasets to uncover patterns, trends, and associations.
5.1.1. Hadoop provides the infrastructure for storing and processing this data.
5.2. Machine Learning (ML): Uses insights from Big Data to build predictive models and automate decision-making processes.
5.2.1. Hadoop helps process the large datasets needed for training ML models.
5.3. Cloud Computing: Offers on-demand computing resources to handle the storage and computation requirements of big data.
5.3.1. Hadoop integrates with cloud platforms to scale resources as needed for data processing and storage.