Hadoop Ecosystem
by Ed Sarausad
1. HBase
1.1. Column store
1.2. Transactional lookups
2. Pig
2.1. Scripting for Hadoop
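A toy sketch of the column-store and row-lookup idea behind HBase. The names (`ToyColumnStore`, `put`, `get`) and the in-memory layout are invented for illustration; this is not HBase's real client API, only the shape of a sorted-row-key, column-addressed lookup.

```python
import bisect

class ToyColumnStore:
    """Illustrative only: cells are addressed by (row key, "family:qualifier"),
    and row keys are kept sorted, which is what makes single-row
    "transactional" lookups cheap."""
    def __init__(self):
        self._rows = []      # sorted row keys
        self._cells = {}     # (row_key, column) -> value

    def put(self, row_key, column, value):
        i = bisect.bisect_left(self._rows, row_key)
        if i == len(self._rows) or self._rows[i] != row_key:
            self._rows.insert(i, row_key)
        self._cells[(row_key, column)] = value

    def get(self, row_key, column):
        return self._cells.get((row_key, column))

store = ToyColumnStore()
store.put("user#42", "info:name", "Ada")
store.put("user#42", "info:email", "ada@example.com")
name = store.get("user#42", "info:name")   # single-row point lookup
```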
4. Flume
4.1. Log collector
4.2. Integrates into Hadoop
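Flume's source → channel → sink pipeline can be sketched in a few lines. Everything here (`ToyLogCollector`, the list standing in for an HDFS sink, the batch size) is invented for illustration, not Flume's actual API.

```python
class ToyLogCollector:
    """Flume-like shape, illustrative names only: events from a source are
    buffered in a channel and delivered to a sink in batches."""
    def __init__(self, sink, batch_size=3):
        self.sink = sink          # callable receiving a list of events
        self.batch_size = batch_size
        self.channel = []         # in-memory buffer, like a Flume channel

    def collect(self, event):
        self.channel.append(event)
        if len(self.channel) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.channel:
            self.sink(list(self.channel))
            self.channel.clear()

delivered = []                    # stands in for an HDFS sink
collector = ToyLogCollector(delivered.append, batch_size=2)
for line in ["GET /", "POST /login", "GET /home"]:
    collector.collect(line)
collector.flush()                 # drain the partial final batch
```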
5. Avro
5.1. Data parsing
5.2. Binary data serialization
5.3. RPC
5.4. Language-neutral
5.5. Optional codegen
5.6. Schema evolution
5.7. Untagged data
5.8. Dynamic typing
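The Avro points above — untagged binary data plus schema evolution — can be sketched as follows. This is a simplified model, not Avro's wire format (real Avro uses variable-length zigzag encoding, not fixed 4-byte ints), and the schema/default structures are invented for illustration. Because the schema travels with the data, values are written untagged in schema order, and a reader schema can add fields with defaults.

```python
import struct

WRITER_SCHEMA = [("id", "int"), ("name", "string")]
# Reader schema adds a field with a default: schema evolution.
READER_SCHEMA = [("id", "int"), ("name", "string"), ("email", "string")]
DEFAULTS = {"email": ""}

def encode(record, schema):
    out = b""
    for field, ftype in schema:
        if ftype == "int":
            out += struct.pack(">i", record[field])
        else:  # string: length-prefixed UTF-8, no per-field tags
            data = record[field].encode("utf-8")
            out += struct.pack(">i", len(data)) + data
    return out

def decode(buf, writer_schema, reader_schema, defaults):
    record, offset = {}, 0
    for field, ftype in writer_schema:
        if ftype == "int":
            value = struct.unpack_from(">i", buf, offset)[0]
            offset += 4
        else:
            n = struct.unpack_from(">i", buf, offset)[0]
            offset += 4
            value = buf[offset:offset + n].decode("utf-8")
            offset += n
        record[field] = value
    for field, _ in reader_schema:       # fill fields the writer lacked
        record.setdefault(field, defaults.get(field))
    return record

blob = encode({"id": 7, "name": "Ada"}, WRITER_SCHEMA)
rec = decode(blob, WRITER_SCHEMA, READER_SCHEMA, DEFAULTS)
```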
6. Mahout
6.1. Machine learning
6.2. Applied to MR
7. Ambari
7.1. Cluster deployment and admin
7.2. Driven by Hortonworks
8. ZooKeeper
8.1. Coordinator of shared state between apps
8.2. Naming, configuration, and synchronization services
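ZooKeeper's data model — a tree of named znodes holding small config values, with one-shot watches that fire on change — is the coordination primitive the lines above refer to. A minimal sketch with invented names (`ToyZooKeeper`, `set`, `get`); not the real client API.

```python
class ToyZooKeeper:
    """Illustrative stand-in for ZooKeeper's naming/config service."""
    def __init__(self):
        self._nodes = {}      # znode path -> value
        self._watches = {}    # znode path -> list of callbacks

    def set(self, path, value):
        self._nodes[path] = value
        # Watches are one-shot, as in ZooKeeper: fire once, then clear.
        for cb in self._watches.pop(path, []):
            cb(path, value)

    def get(self, path, watch=None):
        if watch is not None:
            self._watches.setdefault(path, []).append(watch)
        return self._nodes.get(path)

zk = ToyZooKeeper()
zk.set("/app/config/workers", 4)

seen = []
zk.get("/app/config/workers", watch=lambda p, v: seen.append((p, v)))
zk.set("/app/config/workers", 8)     # watcher is notified of the change
```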
9. YARN
9.1. Cluster management
9.2. Introduced in Hadoop 2
9.3. Resource manager
9.4. Job scheduler
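The resource-manager/scheduler idea can be sketched as a toy FIFO placement loop: jobs request containers of memory and each container lands on the first node with room. The `schedule` helper, node names, and sizes are all invented; YARN's actual schedulers (FIFO, Capacity, Fair) are far richer.

```python
def schedule(jobs, nodes):
    """Toy FIFO scheduler in the spirit of YARN's ResourceManager."""
    placements = {}                     # job -> list of host names
    free = dict(nodes)                  # host -> free memory (GB)
    for job, (containers, mem) in jobs.items():
        placements[job] = []
        for _ in range(containers):
            for host, avail in free.items():
                if avail >= mem:        # first host with capacity wins
                    free[host] -= mem
                    placements[job].append(host)
                    break
    return placements, free

jobs = {"etl": (2, 4), "report": (1, 8)}   # job -> (containers, GB each)
nodes = {"node1": 8, "node2": 8}
placements, free = schedule(jobs, nodes)
```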
10. BigTop
10.1. Packages the Hadoop ecosystem
10.2. Tests Hadoop ecosystem packages
11. Hive
11.1. SQL-like querying
11.2. Combiner can be used to optimize reducer performance
11.3. Structured data warehousing
11.4. Partition columns instead of indexes
12. Oozie
12.1. Links jobs
12.2. Workflow processing
12.3. Workflow jobs are Directed Acyclic Graphs (DAGs) specifying a sequence of actions to execute
12.4. Coordinator jobs are recurrent Oozie Workflow jobs that are triggered by time and data availability
12.5. A Bundle packages multiple coordinator and workflow jobs and manages the lifecycle of those jobs
13. Sqoop
13.1. Connects non-Hadoop stores (RDBMS)
13.2. Moves data to and from an RDBMS and Hadoop
13.3. Autogenerates Java InputFormat code for data access
14. MapReduce
14.1. Distributed compute
15. Related Apache Ecosystems
16. HDFS
16.1. Distributed storage
17. Spark
18. Impala
18.1. SQL query engine
18.2. Real time
18.3. Non-relational
18.4. Queries data stored in HDFS and HBase
18.5. Maps query onto nodes
18.6. Reduces aggregated results into answers
19. Cascading
19.1. Higher abstraction over MR
19.2. Creates a Flow that assembles Map/Reduce jobs
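The Oozie entries above describe workflow jobs as DAGs of actions, where each action runs only after its upstream dependencies finish. A minimal Python sketch of that execution model (the `run_workflow` helper and action names are invented; this is Kahn-style topological ordering, not Oozie itself):

```python
def run_workflow(actions, deps):
    """Run each action only once all of its dependencies are done.
    Raises if the dependency graph is not a DAG."""
    done, order = set(), []
    pending = dict(deps)
    while pending:
        ready = [a for a, ups in pending.items() if set(ups) <= done]
        if not ready:
            raise ValueError("cycle detected: not a DAG")
        for a in sorted(ready):
            order.append(actions[a]())   # "execute" the action
            done.add(a)
            del pending[a]
    return order

# Each action just returns its own name here, standing in for real work.
actions = {n: (lambda n=n: n) for n in ["import", "transform", "export"]}
deps = {"import": [], "transform": ["import"], "export": ["transform"]}
order = run_workflow(actions, deps)
```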
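The MapReduce and combiner entries above can be illustrated with a local word count. `map_phase`, `combine`, and `reduce_phase` are invented names, not Hadoop's Java API; the point is that the combiner pre-aggregates each mapper's output, so fewer pairs cross the shuffle to the reducer.

```python
from collections import defaultdict

def map_phase(line):
    return [(w, 1) for w in line.split()]

def combine(pairs):
    """Combiner: pre-aggregate one mapper's output locally."""
    local = defaultdict(int)
    for word, n in pairs:
        local[word] += n
    return list(local.items())

def reduce_phase(shuffled):
    totals = defaultdict(int)
    for word, n in shuffled:
        totals[word] += n
    return dict(totals)

lines = ["big data big compute", "big data"]       # one "split" per mapper
mapped = [combine(map_phase(line)) for line in lines]
shuffled = [pair for chunk in mapped for pair in chunk]
counts = reduce_phase(shuffled)
```

Without the combiner, six `(word, 1)` pairs would be shuffled; with it, only five pre-summed pairs are.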