Customer-facing reports
Popular
Initially in science, now also banks &c
Community, also many companies that support it
RPMs
stable
partial failure
data recovery
consistent
scalable
moving data is expensive
moving calculation is cheap
chatter is as little as possible
e.g., if in the shuffle stage lots of data gets transferred, you'll need a hardware & network solution
expects failure, only one single point of failure, the name node, if you lose it, your whole cluster is useless
Nodes
NameNode, coordinates storage & access, holds metadata (file names, permissions &c), shouldn't be commodity hardware, should have lots of RAM, Yahoo's 1400-node cluster has 30GB of RAM in the name node, should be backed up frequently
Data node, monitored by name node periodically, if it doesn't answer, it will be removed, holds data in opaque blocks stored on the local file system, due to replication of at least 3, you can lose 2 data nodes without losing data
Secondary name node, badly named, not a backup of the name node, needs as much RAM as the name node, periodically combines the name node's RAM state with the copy on the filesystem
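A minimal sketch of the data node monitoring described above: a name node drops data nodes that stop answering, and a block stays readable as long as one replica holder survives. This is a toy model, not Hadoop code. `LivenessTracker`, `surviving_replicas` and the 30-second timeout are assumptions made up for illustration.

```python
# Toy model only: names and the timeout value are hypothetical.
HEARTBEAT_TIMEOUT = 30.0  # seconds, assumed for illustration


class LivenessTracker:
    """Tracks when each data node last answered the name node."""

    def __init__(self):
        self.last_seen = {}  # data node id -> time of last answer

    def heartbeat(self, node_id, now):
        self.last_seen[node_id] = now

    def prune_dead(self, now):
        """Remove data nodes that have been silent longer than the timeout."""
        dead = [n for n, t in self.last_seen.items()
                if now - t > HEARTBEAT_TIMEOUT]
        for n in dead:
            del self.last_seen[n]
        return sorted(dead)


def surviving_replicas(replica_nodes, live_nodes):
    """Replica holders of one block that are still alive."""
    return [n for n in replica_nodes if n in live_nodes]
```

With replication of 3, two of the three holders can disappear and `surviving_replicas` still returns one node to read from.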
intended for commodity hardware, still enterprise servers, but not monsters
by default, replication of 3, the more, the better it will perform
optimized for large files, files smaller than 64MB (block size) won't be padded, in both file-system & name node RAM
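A small sketch of the block arithmetic, assuming the 64MB block size mentioned above and assuming one name node metadata entry per block. `split_into_blocks` is a hypothetical helper, not an HDFS API.

```python
# Toy arithmetic only: split_into_blocks is a hypothetical helper.
MB = 1024 * 1024
BLOCK_SIZE = 64 * MB  # block size from the notes


def split_into_blocks(file_size):
    """Return the sizes of the blocks a file occupies; the last one is not padded."""
    if file_size < 0:
        raise ValueError("file size must be non-negative")
    full_blocks, remainder = divmod(file_size, BLOCK_SIZE)
    return [BLOCK_SIZE] * full_blocks + ([remainder] if remainder else [])


# One 1GB file versus the same data as 1024 files of 1MB each.
one_large_file = len(split_into_blocks(1024 * MB))
many_small_files = sum(len(split_into_blocks(1 * MB)) for _ in range(1024))
```

Under these assumptions the single large file needs 16 block entries while the small files need 1024, which is one way to see why large files suit the name node's RAM better.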
less good for random access
All NameNode data stored in RAM, for fast lookup
Workflow, client asks the name node for a file, gets redirected to a specific data node
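The read workflow above can be sketched as a toy model: the name node only answers "which data nodes hold which blocks", and the bytes come from the data nodes. `ToyNameNode`, `ToyDataNode` and `read_file` are hypothetical names, not the HDFS client API.

```python
# Toy model of the read path; all class and function names are hypothetical.
class ToyNameNode:
    """Holds metadata only: path -> list of (block id, holder names)."""

    def __init__(self):
        self.block_map = {}

    def add_file(self, path, blocks):
        self.block_map[path] = blocks

    def locate(self, path):
        return self.block_map[path]


class ToyDataNode:
    """Holds opaque blocks keyed by block id."""

    def __init__(self):
        self.blocks = {}

    def store(self, block_id, payload):
        self.blocks[block_id] = payload

    def read_block(self, block_id):
        return self.blocks[block_id]


def read_file(name_node, data_nodes, path):
    """Ask the name node where the blocks are, then read from data nodes."""
    chunks = []
    for block_id, holders in name_node.locate(path):
        first_holder = data_nodes[holders[0]]  # redirect to a specific data node
        chunks.append(first_holder.read_block(block_id))
    return b"".join(chunks)
```

The point of the split is that file contents never pass through the name node, so it stays a lookup service.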
MapReduce daemons, 2 daemons
Job Tracker, master process, coordinates processing, usually running on the name node
Task Tracker, spawns child JVMs to execute work, child JVM sends heartbeats to Task Tracker, anything that fails will be restarted, if it fails too many times, it gets blacklisted
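A hedged sketch of the restart-and-blacklist behaviour, assuming it is the tracker that gets blacklisted after repeated failures. `run_with_retries` and both limits are made up for illustration and are not Hadoop settings.

```python
# Toy retry loop; function name and limits are hypothetical.
def run_with_retries(task, trackers, max_attempts=4, blacklist_after=2):
    """Run task(tracker) until it succeeds, blacklisting trackers that keep failing.

    Returns (result, blacklist). Raises RuntimeError if every attempt fails.
    """
    failures = {t: 0 for t in trackers}
    blacklist = set()
    for _ in range(max_attempts):
        candidates = [t for t in trackers if t not in blacklist]
        if not candidates:
            break
        tracker = candidates[0]  # always try the first usable tracker
        try:
            return task(tracker), blacklist
        except Exception:
            failures[tracker] += 1
            if failures[tracker] >= blacklist_after:
                blacklist.add(tracker)
    raise RuntimeError("task failed on every attempt")
```

A task that always fails on one tracker is retried there until that tracker is blacklisted, then moves on to the next one.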
MapReduce workflow, Client, Job, JobTracker, Task, TaskTracker, Task, Child JVM of Task
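To make the job and task split concrete, here is an in-process word count sketch: one map task per input line, a shuffle that groups pairs by key, and one reduce call per key. It runs in a single Python process and only imitates the shape of a MapReduce job. All function names are hypothetical.

```python
# Single-process imitation of a MapReduce job; names are hypothetical.
from collections import defaultdict


def map_task(line):
    """Map: emit (word, 1) for every word in one input split."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all values by key (this is the stage that moves data)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_task(key, values):
    """Reduce: combine all values seen for one key."""
    return key, sum(values)


def run_job(lines):
    """Job: fan out map tasks, shuffle, then fan out reduce tasks."""
    pairs = [pair for line in lines for pair in map_task(line)]
    return dict(reduce_task(k, vs) for k, vs in shuffle(pairs).items())
```

In a real cluster the map and reduce tasks would run in child JVMs on different machines, and the shuffle would be the part that crosses the network.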
Processing slots for machines, Rule of thumb, number of cores + 2
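The rule of thumb as arithmetic, in a short sketch. `processing_slots` and `cluster_slots` are hypothetical helpers and the machine sizes are invented.

```python
# Rule-of-thumb arithmetic only; helper names are hypothetical.
def processing_slots(cores):
    """Slots for one machine: number of cores + 2."""
    return cores + 2


def cluster_slots(cores_per_machine):
    """Total slots across machines with the given core counts."""
    return sum(processing_slots(c) for c in cores_per_machine)
```

For example, an 8-core machine would get 10 slots, and a cluster of two 8-core machines plus one 4-core machine would get 26.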
Nice web interface to monitor nodes, trackers & tasks
Scheduling, several schedulers, Fair scheduler, most used, developed at Facebook, all jobs get a fair share, eventually battles will start, you can set min/max requirements for tasks
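One way to picture "fair share with min/max" is the toy allocator below: give every job its minimum, then hand out the remaining slots one at a time to whichever job currently has the fewest and is still under its maximum. This is an illustrative sketch, not the Fair scheduler's actual algorithm. `fair_shares` and the per-job (min, max) shape are assumptions.

```python
# Toy allocator; not the real Fair scheduler. Names are hypothetical.
def fair_shares(total_slots, jobs):
    """jobs maps job name -> (min_slots, max_slots). Returns name -> slots."""
    shares = {name: lo for name, (lo, hi) in jobs.items()}
    remaining = total_slots - sum(shares.values())
    if remaining < 0:
        raise ValueError("minimum requirements exceed cluster capacity")
    while remaining > 0:
        open_jobs = [n for n, (lo, hi) in jobs.items() if shares[n] < hi]
        if not open_jobs:
            break  # every job is at its maximum, leave slots idle
        # Poorest job first; ties broken by name to stay deterministic.
        poorest = min(open_jobs, key=lambda n: (shares[n], n))
        shares[poorest] += 1
        remaining -= 1
    return shares
```

With 10 slots and jobs `a` capped at 2 and `b` capped at 10, `a` ends up with 2 and `b` with 8, so a cap on one job frees capacity for the others.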
Datawarehouse built on top of Hadoop
Looks like SQL, HQL interpreted into MapReduce jobs
Scripting language, Developed in Yahoo, Similar in functionality to Hive, but with more programmatic approach
Developed by Cloudera, Move data from/into your cluster
Like LinkedIn's Azkaban
Streaming log transport
Stream logs directly into Hive
Use case, Meebo needs to decide within 2 hours whether to retire an ad, processing takes longer, use Hive decorator, analyze a sample in Hive in real time to make a decision
Hue, Python based, Open source
Beeswax
best
from last Hadoop conference