Hadoop Ecosystem
by Ed Sarausad
1. Query data stored in HDFS and HBase
2. Column store
3. Non-relational
4. Maps query onto nodes
5. Coordinator jobs are recurrent Oozie Workflow jobs that are triggered by time and data availability.
6. Reduces aggregated results into answers
7. Links jobs
7.1. Workflow processing
8. Bundle provides a way to package multiple coordinator and workflow jobs and to manage the lifecycle of those jobs
8.1. Connects Hadoop to non-Hadoop stores (RDBMS)
8.2. Moves data between RDBMS and Hadoop
9. Workflow jobs are Directed Acyclic Graphs (DAGs) specifying a sequence of actions to execute. The Workflow job has to wait for each action to complete before the next action starts
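A Workflow DAG like the one described above is declared in a workflow.xml. A minimal sketch with a single MapReduce action (the app name, action name, and property names here are illustrative, not from the source):

```xml
<workflow-app name="demo-wf" xmlns="uri:oozie:workflow:0.5">
  <start to="mr-action"/>
  <action name="mr-action">
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
    </map-reduce>
    <!-- DAG edges: where to go on success or failure -->
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Action failed</message>
  </kill>
  <end name="end"/>
</workflow-app>
```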
10. Hive
10.1. SQL-like querying
10.2. A Combiner can be used to optimize reducer performance
10.3. Structured data warehousing
10.4. Partition columns instead of indexes
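The partitioning point above (10.4) can be shown in HiveQL; table and column names are made up for the example:

```sql
-- Partition by date instead of indexing: each dt value becomes its own HDFS directory
CREATE TABLE page_views (user_id STRING, url STRING)
PARTITIONED BY (dt STRING);

-- A filter on the partition column reads only the matching directories
SELECT url, COUNT(*)
FROM page_views
WHERE dt = '2014-06-01'
GROUP BY url;
```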
11. Pig
11.1. Scripting for Hadoop
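As a sketch of Pig's scripting layer, the canonical word count in Pig Latin (input and output paths are placeholders):

```pig
lines  = LOAD '/data/input' AS (line:chararray);
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group, COUNT(words);
STORE counts INTO '/data/wordcount';
```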
12. HBase
12.1. Transactional lookups
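A transactional-lookup sketch in the HBase shell (table and column-family names are illustrative):

```
create 'users', 'info'
put 'users', 'row1', 'info:name', 'Ada'
get 'users', 'row1'            # point lookup by row key
```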
13. Flume
13.1. Log collector
13.2. Integrates into Hadoop
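A minimal Flume agent configuration that tails a log file into HDFS (agent, source, sink, and channel names and paths are illustrative):

```properties
a1.sources  = r1
a1.channels = c1
a1.sinks    = k1

# Source: tail a local log file
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app.log
a1.sources.r1.channels = c1

# Channel: buffer events in memory
a1.channels.c1.type = memory

# Sink: write events into HDFS
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/logs
a1.sinks.k1.channel = c1
```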
14. Oozie
15. Avro
15.1. Data parsing
15.2. Binary data serialization
15.3. RPC
15.4. Language-neutral
15.5. Optional code generation
15.6. Schema evolution
15.7. Untagged data
15.8. Dynamic typing
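Untagged data and schema evolution both start from a JSON schema shipped with the data. A minimal record schema sketch (record and field names are illustrative):

```json
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "age",  "type": ["null", "int"], "default": null}
  ]
}
```

Making `age` a nullable union with a default is what lets old readers and new writers disagree about the field without breaking: the evolution story in 15.6.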
16. Mahout
16.1. Machine learning
16.2. Applied to MR
17. Sqoop
17.1. Auto-generates Java InputFormat code for data access
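A typical import, which is also when Sqoop generates the Java record code it uses for access (JDBC URL, database, and table names are placeholders):

```
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --table orders \
  --target-dir /data/orders \
  --num-mappers 4
```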
18. MapReduce
18.1. Distributed compute
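The map → shuffle → reduce flow can be sketched in plain Python. This simulates the phases on one machine; it is not the Hadoop API, and the function names are made up for the sketch:

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Simulate MapReduce: map each record, shuffle by key, reduce each group."""
    shuffled = defaultdict(list)
    for record in records:
        for key, value in mapper(record):   # map phase emits (key, value) pairs
            shuffled[key].append(value)     # shuffle phase groups values by key
    # reduce phase collapses each key's values into an answer
    return {key: reducer(key, values) for key, values in shuffled.items()}

# Word count, the canonical example
lines = ["hadoop stores data", "hadoop computes data"]
counts = run_mapreduce(
    lines,
    mapper=lambda line: [(word, 1) for word in line.split()],
    reducer=lambda word, ones: sum(ones),
)
print(counts["hadoop"])  # 2
```

On a cluster the same mapper and reducer run in parallel across nodes, with the shuffle moving data over the network.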
19. Ambari
19.1. Cluster deployment and admin
19.2. Driven by Hortonworks
20. ZooKeeper
20.1. Coordinator of shared state between apps
20.2. Naming, configuration, and synchronization services
21. YARN
21.1. Cluster management
21.2. Hadoop 2
21.3. Resource manager
21.4. Job scheduler
22. BigTop
22.1. Packages the Hadoop ecosystem
22.2. Tests Hadoop ecosystem packages
23. Related Apache Ecosystems
24. HDFS
24.1. Distributed storage
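Basic interaction with the distributed store goes through the `hdfs dfs` commands (local file name and HDFS paths are placeholders):

```
hdfs dfs -mkdir -p /data/input
hdfs dfs -put localfile.txt /data/input/
hdfs dfs -ls /data/input
hdfs dfs -cat /data/input/localfile.txt
```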
25. Spark
26. Impala
26.1. SQL query engine
26.2. Real time
27. Cascading
27.1. Higher abstraction from MR
27.2. Creates a Flow that assembles Map/Reduce jobs