BigData Mind Map

1. Spark's implementation of common machine learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, and dimensionality reduction

2. Tools

2.1. Scheduler

2.1.1. Oozie (Apache)

2.1.1.1. Schedules Hadoop jobs

2.1.1.2. Combines multiple jobs sequentially into one unit of work

2.1.1.3. Integrated with Hadoop stack

2.1.1.4. Supports jobs for MR, Pig, Hive, and Sqoop, plus system-specific jobs such as Java programs and shell scripts
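The nodes above can be made concrete with a minimal Oozie workflow definition. This is a hedged sketch: the workflow name, script name, and transitions are invented placeholders, not from the source.

```xml
<workflow-app name="daily-etl" xmlns="uri:oozie:workflow:0.5">
  <start to="clean"/>
  <!-- One action in the chain; ok/error transitions combine jobs
       sequentially into a single unit of work, as described above. -->
  <action name="clean">
    <pig>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <script>clean.pig</script>
    </pig>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Pig step failed</message>
  </kill>
  <end name="end"/>
</workflow-app>
```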

2.2. Panel

2.2.1. Hue (Cloudera)

2.2.1.1. UI for Hadoop and satellites (HDFS, MR, Hive, Oozie, Pig, Impala, Solr etc.)

2.2.1.2. Web panel

2.2.1.3. Upload files to HDFS, send Hive queries etc.

2.3. Data analysis

2.3.1. Pig (Apache)

2.3.1.1. High-level scripting language

2.3.1.2. Can be invoked from code in Java, Ruby, etc.

2.3.1.3. Can get data from files, streams or other sources

2.3.1.4. Output to HDFS

2.3.1.5. Pig scripts are translated into a series of MR jobs
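As an illustration of the high-level scripting the nodes above describe, here is the classic word-count script in Pig Latin (file paths are placeholders); each statement below compiles down to MR stages:

```pig
-- Load lines, split into words, group, and count
lines   = LOAD 'input.txt' AS (line:chararray);
words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts  = FOREACH grouped GENERATE group AS word, COUNT(words);
STORE counts INTO 'wordcount_out';  -- output lands in HDFS
```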

2.4. Data transfer

2.4.1. Sqoop (Apache)

2.4.1.1. Transfers bulk data between Apache Hadoop and structured datastores such as relational databases.

2.4.1.2. Two-way replication with both snapshots and incremental updates.

2.4.1.3. Imports between external datastores, HDFS, Hive, HBase, etc.

2.4.1.4. Works with relational databases such as: Teradata, Netezza, Oracle, MySQL, Postgres, and HSQLDB

2.5. Visualization

2.5.1. Tableau

2.5.1.1. INPUT: Can access data in Hadoop via Hive, Impala, Spark SQL, Drill, Presto or any ODBC in Hortonworks, Cloudera, DataStax, MapR distributions

2.5.1.2. OUTPUT: reports, UI web, UI client

2.5.1.3. Clustered. Nearly linear scalability

2.5.1.4. Can access traditional DB

2.5.1.5. Can explore and visualize data

2.5.1.6. SQL

2.6. Security

2.6.1. Knox (Apache)

2.6.1.1. Provides single-point of authentication and access to services in Hadoop cluster

2.7. Graph analytics

2.7.1. GraphX

2.7.1.1. Spark part

2.7.1.2. API for graphs and graph-parallel computation

3. Event Input

3.1. Kafka

3.1.1. Distributed publish-subscribe messaging system

3.1.2. High throughput

3.1.3. Persistent scalable messaging for parallel data load to Hadoop

3.1.4. Compression for performance, mirroring for HA and scalability

3.1.5. Usually used for clickstream

3.1.6. Pull output

3.1.7. Use Kafka if you need a highly reliable and scalable enterprise messaging system to connect multiple systems, one of which is Hadoop.
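The Kafka nodes above describe a pull-based, partitioned publish-subscribe log. A toy pure-Python sketch of that model (no Kafka APIs; `MiniLog` and all names are invented for illustration):

```python
# Toy sketch of a partitioned, append-only log: producers push,
# consumers PULL at their own pace and track their own offsets.
class MiniLog:
    def __init__(self, num_partitions=2):
        self.partitions = [[] for _ in range(num_partitions)]

    def produce(self, key, value):
        # Same key -> same partition, preserving per-key ordering
        p = hash(key) % len(self.partitions)
        self.partitions[p].append(value)
        return p

    def consume(self, partition, offset, max_records=10):
        # Pull model: the consumer asks for records from its last offset,
        # so a slow consumer never back-pressures producers
        records = self.partitions[partition][offset:offset + max_records]
        return records, offset + len(records)

log = MiniLog(num_partitions=2)
for i in range(5):
    p = log.produce("clickstream", f"click-{i}")

records, next_offset = log.consume(p, offset=0)
# records == ["click-0", "click-1", "click-2", "click-3", "click-4"]
```

Persistence and replication (mirroring) are what the real system adds on top of this append-only structure.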

3.2. Flume

3.2.1. Main use-case is to ingest data into Hadoop

3.2.2. Distributed system

3.2.3. Collecting data from many sources

3.2.4. Mostly log processing

3.2.5. Push output

3.2.6. A lot of pre-built collectors

3.2.7. OUTPUT: Write to HDFS, HBase, Cassandra etc

3.2.8. INPUT: Can use Kafka

3.2.9. Use Flume if you have non-relational data sources, such as log files, that you want to stream into Hadoop.

4. Machine Learning

4.1. MLlib

4.1.1. Initial contribution from AMPLab, UC Berkeley, shipped with Spark since version 0.8

4.1.2. Spark part

4.2. Mahout

4.2.1. On top of Hadoop

4.2.2. Using MR

4.2.3. Started as a Lucene subproject

5. Frameworks

5.1. Tez

5.1.1. Hadoop part

5.1.2. Framework for building YARN-based, high-performance batch/interactive apps

5.1.3. Hive and Pig can run on it

5.1.4. Provides API

5.1.5. On top of YARN

5.2. YARN

5.2.1. Hadoop part

5.2.2. Resource manager

5.2.3. Security

5.3. ZooKeeper

5.3.1. Cluster coordination

5.3.2. Used by Storm, Hadoop, HBase, Elasticsearch, etc.

5.3.3. Group messaging and shared registers with an eventing mechanism

6. Execution Engine

6.1. Spark (Apache)

6.1.1. Runs through YARN or as stand-alone

6.1.2. Can work with HDFS, HBase, Cassandra, Hive data

6.1.3. General data processing similar to MR + streaming, interactive queries, machine learning etc.

6.1.4. RDD with caching

6.1.5. Deployment: stand-alone, YARN-based, Mesos-based, or side-by-side on an existing Hadoop deployment

6.1.6. Good for iterative algorithms, interactive processing, and machine learning

6.1.7. Rich API and interactive shell

6.1.8. A lot of transformations and actions for RDDs

6.1.9. Extremely simple syntax

6.1.10. Compatible with existing Hadoop Data

6.1.11. INPUT/OUTPUT: HDFS, HBase, FS, any data source with a Hadoop InputFormat
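The "RDD with caching" and "transformations and actions" nodes above can be sketched in pure Python. This is not Spark's API, only a minimal illustration of the idea: transformations build a lazy plan, actions run it, and caching stores a computed result so iterative jobs don't recompute the lineage.

```python
# Minimal lazy-RDD sketch (invented class, not Spark's API)
class ToyRDD:
    def __init__(self, compute):
        self._compute = compute   # zero-arg function producing the data
        self._cache = None

    @classmethod
    def parallelize(cls, data):
        return cls(lambda: list(data))

    def map(self, f):             # transformation: lazy, returns a new RDD
        return ToyRDD(lambda: [f(x) for x in self._materialize()])

    def filter(self, pred):       # transformation: lazy
        return ToyRDD(lambda: [x for x in self._materialize() if pred(x)])

    def cache(self):              # keep the computed result in memory
        self._cache = self._materialize()
        return self

    def _materialize(self):
        return self._cache if self._cache is not None else self._compute()

    def collect(self):            # action: triggers the computation
        return self._materialize()

squares = ToyRDD.parallelize(range(5)).map(lambda x: x * x)
result = squares.cache().filter(lambda x: x % 2 == 0).collect()
# result == [0, 4, 16]; the cached squares are reused, not recomputed
```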

6.2. Hadoop MR

6.2.1. Distributed processing of large data sets across clusters of computers

6.2.2. MapReduce

6.2.3. High availability, fault tolerance

6.2.4. Good for batch processing

6.2.5. High latency

6.2.6. Doesn't use memory well

6.2.7. Iterative algos use a lot of IO

6.2.8. Primitive API and Key/Value IO

6.2.9. Basics like joins require a lot of code

6.2.10. The result is a lot of files

6.2.11. Hadoop part

6.2.12. INPUT/OUTPUT: HDFS, HBase
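The MapReduce model the nodes above refer to can be shown with a toy word count in pure Python (not Hadoop code; function names are invented): map emits key/value pairs, a shuffle groups them by key, and reduce aggregates each group.

```python
from collections import defaultdict
from itertools import chain

def map_phase(line):
    # Map: emit (key, value) pairs per input record
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group all values by key (Hadoop does this across the cluster)
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Reduce: aggregate each key's values
    return key, sum(values)

lines = ["big data", "big clusters process big data"]
mapped = chain.from_iterable(map_phase(l) for l in lines)
result = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
# result == {"big": 3, "data": 2, "clusters": 1, "process": 1}
```

Every phase boundary here hits disk in real MR, which is why the outline flags high latency and heavy IO for iterative algorithms.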

7. Storage

7.1. NoSQL

7.1.1. Accumulo (Apache)

7.1.1.1. BigTable, Key/Value, Distributed, Petabyte scale, fault tolerance

7.1.1.2. Sorted, schema-less, graph-support

7.1.1.3. On top of Hadoop, Zookeeper, Thrift

7.1.1.4. BTrees, bloom filters, compression

7.1.1.5. Uses Zookeeper, MR, have CLI

7.1.1.6. Clustered. Linear scalability. Failover master

7.1.1.7. Originally developed by the NSA

7.1.1.8. Per-cell ACL + table ACL

7.1.1.9. Built-in user DB and plaintext on wire

7.1.1.10. Iterators can be inserted into the read path and table-maintenance code

7.1.1.11. Advanced UI

7.1.2. HBase

7.1.2.1. On top of HDFS

7.1.2.2. BigTable, Key/Value, Distributed, Petabyte scale, fault tolerance

7.1.2.3. Clustered. Linear scalability. Failover master

7.1.2.4. Sorted, RESTful

7.1.2.5. Used by FB, eBay, Flurry, Adobe, Mozilla, TrendMicro

7.1.2.6. BTrees, bloom filters, compression

7.1.2.7. Uses Zookeeper, MR, have CLI

7.1.2.8. Column-family ACL

7.1.2.9. Advanced security and integration with Hadoop

7.1.2.10. Triggers, stored procedures, cluster management hooks

7.1.2.11. HBase is 4-5 times slower than HDFS in batch context. HBase is for random access

7.1.2.12. Stores key/value pairs in columnar fashion (columns are grouped together as column families).

7.1.2.13. Provides low latency access to small amounts of data from within a large data set.

7.1.2.14. Provides flexible data model.

7.1.2.15. HBase is not optimized for classic transactional applications or even relational analytics.

7.1.2.16. HBase is CPU and Memory intensive with sporadic large sequential I/O access

7.1.2.17. HBase supports massively parallelized processing via MapReduce for using HBase as both source and sink
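The HBase nodes above (sorted keys, column families, low-latency point reads within a large data set) can be sketched in a few lines of pure Python. This is an invented toy, not the HBase client API:

```python
from bisect import insort

class ToyColumnStore:
    """Rows sorted by key; values addressed by (column family, qualifier)."""
    def __init__(self, families):
        self.families = set(families)  # fixed, like HBase column families
        self.row_keys = []             # kept sorted, like keys in a region
        self.rows = {}                 # row key -> {(family, qual): value}

    def put(self, row, family, qualifier, value):
        if family not in self.families:
            raise KeyError(f"unknown column family {family!r}")
        if row not in self.rows:
            insort(self.row_keys, row)  # maintain sorted key order
            self.rows[row] = {}
        self.rows[row][(family, qualifier)] = value

    def get(self, row, family, qualifier):
        # Point read: low latency regardless of total table size
        return self.rows.get(row, {}).get((family, qualifier))

    def scan(self, start, stop):
        # Range scans over sorted keys are cheap, like HBase scans
        return [k for k in self.row_keys if start <= k < stop]

t = ToyColumnStore(families={"info"})
t.put("row2", "info", "city", "Oslo")
t.put("row1", "info", "city", "Lima")
# t.get("row1", "info", "city") == "Lima"
# t.scan("row1", "row3") == ["row1", "row2"]
```

The sorted key order is what makes both point gets and range scans fast, while the schema-less qualifiers give the flexible data model mentioned above.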

7.2. Distributed Filesystems

7.2.1. HDFS

7.2.1.1. Hadoop part

7.2.1.2. Optimized for streaming access of large files.

7.2.1.3. Follows write-once read-many ideology.

7.2.1.4. Doesn't support random read/write.

7.2.1.5. Good for MapReduce jobs that are primarily I/O-bound, with fixed memory and sporadic CPU.

7.2.2. Tachyon

7.2.2.1. Memory centric

7.2.2.2. Hadoop/Spark compatible without changes

7.2.2.3. High performance by leveraging lineage information and using memory aggressively

7.2.2.4. 40 contributors from over 15 institutions, including Yahoo, Intel, and Red Hat

7.2.2.5. High-throughput writes

7.2.2.6. Reads 200x and writes 300x faster than HDFS

7.2.2.7. Latency 17x lower than HDFS

7.2.2.8. Uses memory (instead of disk) and recomputation (instead of replication) to produce a distributed, fault-tolerant, and high-throughput file system.

7.2.2.9. Data sets are in the gigabytes or terabytes

8. Query Engine

8.1. Interactive

8.1.1. Drill (Apache)

8.1.1.1. Low latency

8.1.1.2. JSON-like internal data model to represent and process data

8.1.1.3. Dynamic schema discovery. Can query self-describing data formats such as JSON, NoSQL, Avro, etc.

8.1.1.4. SQL queries against a petabyte or more of data distributed across 10,000-plus servers

8.1.1.5. Drillbit instead of MapReduce

8.1.1.6. Full ANSI SQL:2003

8.1.1.7. Low-latency distributed execution engine

8.1.1.8. Unknown schema on HBase, Cassandra, MongoDB

8.1.1.9. Fault tolerant

8.1.1.10. Memory intensive

8.1.1.11. INPUT: RDBMS, NoSQL, HBase, Hive tables, JSON, Avro, Parquet, FS etc.

8.1.1.12. Shell, WebUI, ODBC, JDBC, CPP API

8.1.1.13. Not mature enough?

8.1.2. Impala (Cloudera)

8.1.2.1. Low latency

8.1.2.2. HiveQL and SQL92

8.1.2.3. INPUT: Can use HDFS or HBase without moving/transforming data. Can use Hive tables via the metastore

8.1.2.4. Integrated with Hadoop, queries Hadoop

8.1.2.5. Requires specific format for performance

8.1.2.6. Memory intensive (if join two tables, one of them need to be in cluster memory)

8.1.2.7. Claimed to be 30x faster than Hive for queries with multiple MR jobs, but that was before Stinger

8.1.2.8. No fault-tolerance

8.1.2.9. Reads Hadoop file formats, including text, LZO, SequenceFile, Avro, RCFile, and Parquet

8.1.2.10. Fine-grained, role-based authorization with Sentry. Supports Hadoop security

8.1.3. Presto (Facebook)

8.1.3.1. ANSI SQL

8.1.3.2. Real-time

8.1.3.3. Petabytes of data

8.1.3.4. INPUT: Can use Hive, HDFS, HBase, Cassandra, Scribe

8.1.3.5. Similar to Impala and Hive/Stinger

8.1.3.6. Own engine

8.1.3.7. Claimed to be more CPU-efficient than Hive/MapReduce

8.1.3.8. Recently open-sourced; the community is not yet large

8.1.3.9. Not fault-tolerant

8.1.3.10. Read-only

8.1.3.11. No UDF

8.1.3.12. Requires specific format for performance

8.1.4. Hive 0.13 with Stinger

8.1.4.1. A lot of MR micro-jobs

8.1.4.2. HQL

8.1.4.3. Can use HBase

8.1.5. Spark SQL

8.1.5.1. SQL, HQL or Scala to run on Spark

8.1.5.2. Can use Hive data also

8.1.5.3. JDBC/ODBC

8.1.5.4. Spark part

8.1.6. Shark

8.1.6.1. Replaced with Spark SQL

8.2. Non-interactive

8.2.1. Hive < 0.13

8.2.1.1. Long jobs

8.2.1.2. HQL

8.2.1.3. Based on Hadoop and MR

8.2.1.4. High latency
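Both Hive generations in this section share HQL, the SQL dialect that gets compiled into MR jobs (many long jobs before 0.13, optimized via Stinger/Tez after). A small hedged example with invented table and column names:

```sql
-- Hypothetical clickstream table; each query compiles to MR stages
CREATE TABLE clicks (user_id STRING, url STRING, ts BIGINT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- GROUP BY + ORDER BY typically become separate MR jobs on Hive < 0.13,
-- which is where the high latency noted above comes from
SELECT url, COUNT(*) AS hits
FROM clicks
GROUP BY url
ORDER BY hits DESC
LIMIT 10;
```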

9. Stream processing

9.1. Spark Streaming (Berkeley)

9.1.1. Spark part

9.1.2. INPUT: Can use Kafka, Flume, Twitter, ZeroMQ, Kinesis, TCP as a source

9.1.3. OUTPUT: FS, DB, live dashboards, HDFS

9.1.4. High-throughput, fault-tolerant on live streams

9.1.5. Machine learning, graph processing, MR, join and window algos can be applied

9.1.6. Fault-tolerance: exactly-once

9.1.7. Micro-batching, few seconds

9.1.8. Average latency

9.1.9. Can use YARN, Mesos
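The "micro-batching, few seconds" node above is the key contrast with Storm's record-at-a-time model later in this section. A toy pure-Python sketch of micro-batching (no Spark Streaming APIs; names invented):

```python
def micro_batches(stream, batch_size):
    """Group a live stream into small batches, each run as one small job."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch      # higher latency than per-record processing,
            batch = []       # but high throughput and simple exactly-once
    if batch:                # recovery: just rerun the failed batch
        yield batch

events = [f"event-{i}" for i in range(7)]
batches = list(micro_batches(events, batch_size=3))
# batches == [["event-0", "event-1", "event-2"],
#             ["event-3", "event-4", "event-5"],
#             ["event-6"]]
```

In the real system the batch boundary is a time interval (e.g. one second) rather than a count, which is why latency is "average": no result appears before its batch closes.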

9.2. Samza (LinkedIn)

9.2.1. On YARN

9.2.2. INPUT/OUTPUT: Kafka and streams

9.2.3. Asynchronous

9.2.4. Near-realtime

9.2.5. No dynamic re-balancing

9.2.6. Works with YARN and Kafka out of the box

9.2.7. On top of Hadoop 2 / YARN

9.3. Storm (Twitter)

9.3.1. Distributed

9.3.2. Record-at-time

9.3.3. INPUT: Queue: Kestrel, Kafka, Flume, RabbitMQ, JMS, Amazon Kinesis

9.3.4. OUTPUT: Kafka, HBase, HDFS since 0.9.1

9.3.5. Real-time, sub-second latency

9.3.6. Reliable

9.3.7. Large data streams

9.3.8. Dynamic re-balancing

9.3.9. Fault-tolerance: at-least-once

9.3.10. Average throughput

9.3.11. On top of Hadoop 2 / YARN

9.3.12. Can use YARN, Mesos

9.3.13. Distributed RPC