
BigData

Storage

NoSQL

Accumulo (Apache), BigTable design, Key/Value, Distributed, Petabyte scale, fault tolerance, Sorted, schema-less, graph support, On top of Hadoop, Zookeeper and Thrift, BTrees, bloom filters, compression, MR, has a CLI, Clustered with linear scalability and failover master, Originated at the NSA, Per-cell ACL + table ACL, Built-in user DB and plaintext on the wire, Iterators can be inserted into the read path and table maintenance code, Advanced UI

HBase, On top of HDFS, BigTable design, Key/Value, Distributed, Petabyte scale, fault tolerance, Clustered with linear scalability and failover master, Sorted, RESTful, Used by Facebook, eBay, Flurry, Adobe, Mozilla, TrendMicro, BTrees, bloom filters, compression, Uses Zookeeper, MR, has a CLI, Column-family ACL, Advanced security and integration with Hadoop, Triggers, stored procedures, cluster management hooks, 4-5 times slower than HDFS in a batch context: HBase is for random access, Stores key/value pairs in columnar fashion (columns are grouped into column families), Provides low-latency access to small amounts of data from within a large data set, Provides a flexible data model, Not optimized for classic transactional applications or even relational analytics, CPU and memory intensive with sporadic large sequential I/O access, Supports massively parallelized processing via MapReduce, with HBase as both source and sink
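
To make the key/value model concrete, a minimal sketch with the HBase 1.x Java client called from Scala; the "metrics" table, the "d" column family and the row key are hypothetical names for this sketch:

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get, Put}
import org.apache.hadoop.hbase.util.Bytes

// Connect using cluster settings from hbase-site.xml on the classpath.
val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
// "metrics" table and "d" column family are hypothetical.
val table = conn.getTable(TableName.valueOf("metrics"))

// Write one cell: row key -> column family : qualifier -> value.
val put = new Put(Bytes.toBytes("user42#2014-07-01"))
put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("clicks"), Bytes.toBytes(17L))
table.put(put)

// Random read of a single row -- the access pattern HBase is built for.
val row    = table.get(new Get(Bytes.toBytes("user42#2014-07-01")))
val clicks = Bytes.toLong(row.getValue(Bytes.toBytes("d"), Bytes.toBytes("clicks")))
println(s"clicks = $clicks")

table.close(); conn.close()
```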

Distributed Filesystems

HDFS, Hadoop part, Optimized for streaming access to large files, Follows a write-once read-many ideology, Doesn't support random read/write, Good for MapReduce jobs that are primarily I/O bound with fixed memory and sporadic CPU (see the sketch after this list)

Tachyon, Memory-centric, Hadoop/Spark compatible without changes, High performance by leveraging lineage information and using memory aggressively, 40 contributors from over 15 institutions, including Yahoo, Intel, and Redhat, High-throughput writes, Reads 200x and writes 300x faster than HDFS, Latency 17x lower than HDFS, Uses memory (instead of disk) and recomputation (instead of replication) to produce a distributed, fault-tolerant, and high-throughput file system, Data sets in the gigabytes or terabytes
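
A minimal sketch of HDFS's write-once/read-many pattern through Hadoop's FileSystem API; the path is hypothetical and fs.defaultFS is assumed to be configured in core-site.xml on the classpath:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Picks up fs.defaultFS from core-site.xml on the classpath.
val fs   = FileSystem.get(new Configuration())
val path = new Path("/data/events.log") // hypothetical path

// Write once...
val out = fs.create(path)
out.writeBytes("event-1\nevent-2\n")
out.close()

// ...then stream-read many times; there are no in-place random updates.
val in = fs.open(path)
scala.io.Source.fromInputStream(in).getLines().foreach(println)
in.close()
```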

Query Engine

Interactive

Drill (Apache), Low latency, JSON-like internal data model to represent and process data, Dynamic schema discovery: can query self-describing data formats such as JSON, NoSQL, Avro etc., SQL queries against a petabyte or more of data distributed across 10,000-plus servers, Drillbit instead of MapReduce, Full ANSI SQL:2003, Low-latency distributed execution engine, Handles unknown schemas on HBase, Cassandra, MongoDB, Fault tolerant, Memory intensive, INPUT: RDBMS, NoSQL, HBase, Hive tables, JSON, Avro, Parquet, FS etc., Shell, WebUI, ODBC, JDBC, C++ API, Not mature enough?

Impala (Cloudera), Low latency, HiveQL and SQL-92, INPUT: Can use HDFS or HBase without moving/transforming data, Can use Hive tables as a metastore, Integrated with Hadoop, queries Hadoop, Requires a specific format for best performance, Memory intensive (when joining two tables, one of them needs to fit in cluster memory), Claimed to be 30x faster than Hive for queries with multiple MR jobs, but that was before Stinger, No fault tolerance, Reads Hadoop file formats, including text, LZO, SequenceFile, Avro, RCFile, and Parquet, Fine-grained, role-based authorization with Sentry, Supports Hadoop security

Presto (Facebook), ANSI SQL, Real-time, Petabytes of data, INPUT: Can use Hive, HDFS, HBase, Cassandra, Scribe, Similar to Impala and Hive/Stinger, Own execution engine, Declared to be more CPU-efficient than Hive/MapReduce, Only recently open-sourced, small community so far, Not fault-tolerant, Read-only, No UDFs, Requires a specific format for best performance

Hive 0.13 with Stinger, A lot of MR micro-jobs, HQL, Can use HBase

Spark SQL, SQL, HQL or Scala to run on Spark, Can also use Hive data, JDBC/ODBC, Spark part (see the sketch after this list)

Shark, Replaced by Spark SQL
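
A minimal Spark SQL sketch, assuming Spark 1.x built with Hive support; the "logs" Hive table and the query are hypothetical:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("sql-demo").setMaster("local[*]"))
// HiveContext accepts both SQL and HQL and can read existing Hive tables.
val hiveCtx = new HiveContext(sc)

// "logs" is a hypothetical Hive table.
val top = hiveCtx.sql(
  "SELECT page, COUNT(*) AS hits FROM logs GROUP BY page ORDER BY hits DESC LIMIT 10")
top.collect().foreach(println)
```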

Non-interactive

Hive < 0.13, Long jobs, HQL, Based on Hadoop and MR, High latency
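
A minimal sketch of querying Hive from Scala over JDBC, assuming a HiveServer2 endpoint with the standard org.apache.hive.jdbc.HiveDriver on the classpath; the host, credentials and "logs" table are hypothetical. Each HQL query compiles into one or more MR jobs behind the scenes, hence the high latency:

```scala
import java.sql.DriverManager

// HiveServer2 JDBC endpoint; host/port, user and table are hypothetical.
Class.forName("org.apache.hive.jdbc.HiveDriver")
val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "user", "")
val stmt = conn.createStatement()

// This HQL statement is compiled into MR jobs -- expect seconds to minutes.
val rs = stmt.executeQuery("SELECT page, COUNT(*) FROM logs GROUP BY page")
while (rs.next()) println(s"${rs.getString(1)}\t${rs.getLong(2)}")
conn.close()
```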

Execution Engine

Spark (Apache)

Runs through YARN or as stand-alone

Can work with HDFS, HBase, Cassandra, Hive data

General data processing similar to MR + streaming, interactive queries, machine learning etc.

RDD with caching

Deployment options: stand-alone, YARN-based or Mesos-based; can run side-by-side with an existing Hadoop deployment

Good for iterative algorithms and machine learning

Rich API and interactive shell

Many transformations and actions for RDDs

Extremely simple syntax

Compatible with existing Hadoop Data

INPUT/OUTPUT: HDFS, HBase, FS, any data source with a Hadoop InputFormat
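
A minimal sketch of the RDD API, assuming Spark 1.3+: lazy transformations, a cache() call, and two actions that reuse the in-memory result. The input path is hypothetical:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("rdd-demo").setMaster("local[*]"))

// Transformations are lazy; nothing runs until an action is called.
val words = sc.textFile("hdfs:///data/events.log") // hypothetical path
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
  .cache() // keep the RDD in memory for reuse across actions

// Both actions reuse the cached RDD instead of recomputing the lineage.
println(words.count())
words.take(5).foreach(println)
```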

Hadoop MR

Distributed processing of large data sets across clusters of computers

MapReduce

High availability, fault tolerance

Good for batch processing

High latency

Doesn't use memory well

Iterative algorithms use a lot of I/O

Primitive API and key/value I/O (see the word-count sketch after this list)

Basics like join requires a lot of code

Output is spread across many files

Hadoop part

INPUT/OUTPUT: HDFS, HBase
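
To make the "primitive API" point concrete, the classic word count as a Hadoop MR job, sketched in Scala against the org.apache.hadoop.mapreduce API; even this trivial task needs a mapper class, a reducer class and a driver. Input/output paths come from the command line:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
import org.apache.hadoop.mapreduce.{Job, Mapper, Reducer}
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
import scala.collection.JavaConverters._

// Everything is typed key/value pairs in and out.
class TokenMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
  private val one = new IntWritable(1)
  override def map(key: LongWritable, line: Text,
                   ctx: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit =
    line.toString.split("\\s+").filter(_.nonEmpty)
      .foreach(w => ctx.write(new Text(w), one))
}

class SumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
  override def reduce(word: Text, counts: java.lang.Iterable[IntWritable],
                      ctx: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit =
    ctx.write(word, new IntWritable(counts.asScala.map(_.get).sum))
}

object WordCount {
  def main(args: Array[String]): Unit = {
    val job = Job.getInstance(new Configuration(), "wordcount")
    job.setJarByClass(getClass)
    job.setMapperClass(classOf[TokenMapper])
    job.setReducerClass(classOf[SumReducer])
    job.setOutputKeyClass(classOf[Text])
    job.setOutputValueClass(classOf[IntWritable])
    FileInputFormat.addInputPath(job, new Path(args(0)))
    FileOutputFormat.setOutputPath(job, new Path(args(1))) // a directory of part-r-* files
    System.exit(if (job.waitForCompletion(true)) 0 else 1)
  }
}
```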

Frameworks

Tez

Hadoop part

Framework for building YARN-based high-performance batch/interactive apps

Hive and Pig are based on it

Provides API

On top of YARN

YARN

Hadoop part

Resource manager

Security

ZooKeeper

Cluster coordination

Used by Storm, Hadoop, HBase, Elastic Search etc.

Group messaging and shared registers with an eventing mechanism
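
A minimal group-membership sketch with ZooKeeper's Java client from Scala; the /workers parent znode is hypothetical and assumed to exist. The ephemeral sequential node vanishes when the session dies, which is the usual building block for membership and leader election:

```scala
import org.apache.zookeeper.{CreateMode, WatchedEvent, Watcher, ZooDefs, ZooKeeper}

// Connect with a 3000 ms session timeout; the Watcher receives events.
val zk = new ZooKeeper("localhost:2181", 3000, new Watcher {
  override def process(e: WatchedEvent): Unit =
    println(s"zk event: ${e.getType} ${e.getPath}")
})

// Ephemeral sequential znode: removed automatically when this session dies.
// The /workers parent path is hypothetical and must already exist.
val path = zk.create("/workers/worker-", Array.emptyByteArray,
  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL)
println(s"registered as $path")

// Setting the watch flag makes the Watcher fire when membership changes.
println(s"current workers: ${zk.getChildren("/workers", true)}")
```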

Stream processing

Spark Streaming (Berkeley)

Spark part

INPUT: Can use Kafka, Flume, Twitter, ZeroMQ, Kinesis, TCP as a source

OUTPUT to FS, DB, live dashboards, HDFS

High-throughput, fault-tolerant on live streams

Machine learning, graph processing, MR, join and window algorithms can be applied

Fault-tolerance: exactly-once

Micro-batching with batches of a few seconds

Moderate latency

Can use YARN, Mesos
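
A minimal micro-batching sketch with the Spark Streaming Scala API: 2-second batches from a TCP source (Kafka/Flume/Kinesis receivers plug in the same way) and a 30-second count window. Host and port are hypothetical:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("stream-demo").setMaster("local[2]")
// Micro-batching: the stream is cut into 2-second batches.
val ssc = new StreamingContext(conf, Seconds(2))

// TCP source for the sketch; Kafka/Flume/Kinesis sources plug in the same way.
val lines = ssc.socketTextStream("localhost", 9999)
lines.flatMap(_.split("\\s+"))
  .map((_, 1))
  .reduceByKeyAndWindow(_ + _, Seconds(30)) // running counts over the last 30 s
  .print()

ssc.start()
ssc.awaitTermination()
```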

Samza (LinkedIn)

On YARN

INPUT/OUTPUT: Kafka and streams

Asynchronous

Near-realtime

No dynamic re-balancing

Works with YARN and Kafka out of the box

On top of Hadoop 2 / YARN

Storm (Twitter)

Distributed

Record-at-a-time processing

INPUT: Queue: Kestrel, Kafka, Flume, RabbitMQ, JMS, Amazon Kinesis

OUTPUT: Kafka, HBase, HDFS since 0.9.1

Real-time, sub-second latency

Reliable

Large data streams

Dynamic re-balancing

Fault-tolerance: at-least-once

Moderate throughput

On top of Hadoop 2 / YARN

Can use YARN, Mesos

Distributed RPC

Machine Learning

MLlib

Spark implementation of some common machine learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, dimensionality reduction

Initial contribution from AMPLab, UC Berkeley, shipped with Spark since version 0.8

Spark part
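
A minimal sketch of MLlib's clustering API, assuming an existing SparkContext sc; the input path and k = 3 are hypothetical. Caching matters here because k-means is iterative:

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Each input line holds space-separated feature values; the path is hypothetical.
val points = sc.textFile("hdfs:///data/features.txt")
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
  .cache() // k-means makes many passes, so caching the RDD pays off

val model = KMeans.train(points, 3, 20) // k = 3 clusters, 20 iterations
points.take(5).foreach(p => println(s"$p -> cluster ${model.predict(p)}"))
```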

Mahout

On top of Hadoop

Using MR

Started as a Lucene subproject

Event Input

Kafka

Distributed publish-subscribe messaging system

High throughput

Persistent scalable messaging for parallel data load to Hadoop

Compression for performance, mirroring for HA and scalability

Usually used for clickstream

Pull-based output (consumers pull data)

Use Kafka if you need a highly reliable and scalable enterprise messaging system to connect multiple systems, one of which is Hadoop.
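
A minimal producer sketch with the Kafka Java client (org.apache.kafka.clients.producer, available since Kafka 0.8.2); the broker address and the "clickstream" topic are hypothetical. Consumers, e.g. a Hadoop loader, pull messages at their own pace:

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092") // hypothetical broker address
props.put("key.serializer",   "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

val producer = new KafkaProducer[String, String](props)
// "clickstream" is a hypothetical topic; messages are persisted and replicated,
// and downstream consumers pull them whenever they are ready.
producer.send(new ProducerRecord[String, String]("clickstream", "user42", "viewed /pricing"))
producer.close()
```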

Flume

Main use-case is to ingest data into Hadoop

Distributed system

Collecting data from many sources

Mostly log processing

Push output

A lot of pre-built collectors

OUTPUT: Write to HDFS, HBase, Cassandra etc

INPUT: Can use Kafka

Use Flume if you have non-relational data sources such as log files that you want to stream into Hadoop.

Tools

Scheduler

Oozie (Apache), Schedules Hadoop jobs, Combines multiple jobs sequentially into one unit of work, Integrated with the Hadoop stack, Supports jobs for MR, Pig, Hive and Sqoop, plus system jobs (Java, shell)

Panel

Hue (Cloudera), UI for Hadoop and its satellites (HDFS, MR, Hive, Oozie, Pig, Impala, Solr etc.), Web panel, Upload files to HDFS, send Hive queries etc.

Data analysis

Pig (Apache), High-level scripting language, Can be invoked from Java, Ruby etc., Can read data from files, streams or other sources, Outputs to HDFS, Pig scripts are translated into a series of MR jobs

Data transfer

Sqoop (Apache), Transfers bulk data between Apache Hadoop and structured datastores such as relational databases, Two-way replication with both snapshots and incremental updates, Imports between external datastores, HDFS, Hive, HBase etc., Works with relational databases such as Teradata, Netezza, Oracle, MySQL, Postgres, and HSQLDB

Visualization

Tableau, INPUT: Can access data in Hadoop via Hive, Impala, Spark SQL, Drill, Presto or any ODBC source in Hortonworks, Cloudera, DataStax, MapR distributions, OUTPUT: reports, Web UI, desktop client, Clustered with nearly linear scalability, Can access traditional DBs, Can explore and visualize data, SQL

Security

Knox (Apache), Provides single-point of authentication and access to services in Hadoop cluster

Graph analytics

GraphX, Spark part, API for graphs and graph-parallel computation
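
A minimal sketch of graph-parallel computation with GraphX, assuming an existing SparkContext sc; the users/follows data is hypothetical:

```scala
import org.apache.spark.graphx.{Edge, Graph}

// Hypothetical social graph; vertex IDs are Longs.
val users   = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
val follows = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(3L, 2L, 1), Edge(2L, 1L, 1)))

val graph = Graph(users, follows)
// Graph-parallel computation: iterate PageRank until it converges within 0.001.
graph.pageRank(0.001).vertices.collect().foreach {
  case (id, rank) => println(f"vertex $id: $rank%.3f")
}
```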