
Hadoop talk, Nathan Milford, Outbrain



less good for random access

If you lose it, your whole cluster is useless

Hadoop HDFS

expects failure

only one single point of failure, the name node


NameNode, Coordinates storage & access, Holds metadata (file names, permissions &c), Shouldn't be commodity hardware, Should have lots of RAM, Yahoo's 1,400-node cluster has 30GB in the name node

Data node, monitored by name node periodically, if it doesn't answer, it will be removed, holds data in opaque blocks stored on the local file system, due to replication of at least 3, you can lose 2 data nodes without losing data

Should be backed up frequently

secondary name node, badly named, not a backup of the name node, needs as much RAM as the name node, periodically merges the in-RAM namespace image with the on-disk state

intended for commodity hardware

still enterprise servers, but not monsters

by default, replication of 3

the more, the better it will perform
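The replication guarantee above can be sketched in plain Python (a toy placement, with invented names, not HDFS's actual rack-aware policy): each block lives on three distinct nodes, so no choice of two failed data nodes can destroy every copy.

```python
import itertools
import random

def place_replicas(blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes (toy placement)."""
    return {b: set(random.sample(nodes, replication)) for b in blocks}

def survives(placement, dead_nodes):
    """True if every block keeps at least one replica on a live node."""
    return all(replicas - dead_nodes for replicas in placement.values())

nodes = [f"dn{i}" for i in range(6)]
placement = place_replicas([f"blk{i}" for i in range(20)], nodes)

# With replication 3, losing any 2 data nodes never loses data.
assert all(survives(placement, set(dead))
           for dead in itertools.combinations(nodes, 2))
```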

optimized for large files

files smaller than 64MB (the block size) won't be padded to a full block, neither on the file system nor in name node RAM

All NameNode data stored in RAM

for fast lookup
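Because all metadata lives in name node RAM, many small files cost far more than a few large ones. A rough sketch (the per-object byte figure is a commonly cited rule of thumb, not from the talk, and the helper name is invented):

```python
import math

BYTES_PER_OBJECT = 150  # rough rule of thumb for NameNode metadata, illustrative only

def namenode_ram_bytes(n_files, file_size_mb, block_mb=64):
    """Estimate NameNode metadata: one object per file plus one per block."""
    blocks_per_file = max(1, math.ceil(file_size_mb / block_mb))
    return n_files * (1 + blocks_per_file) * BYTES_PER_OBJECT

one_big   = namenode_ram_bytes(1, 1024)   # 1 GB stored as a single file
many_tiny = namenode_ram_bytes(1024, 1)   # the same 1 GB as 1024 x 1 MB files

# The same gigabyte costs ~120x more NameNode RAM as tiny files.
assert many_tiny > one_big
```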


client asks the name node for a file

gets redirected to the specific data node
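The read path can be sketched as a toy lookup (plain Python; the class and method names are invented for illustration): the client asks the name node only for block locations, then reads the bytes from the data nodes directly, so the name node never serves data itself.

```python
class NameNode:
    """Holds metadata only: which data nodes store each block of a file."""
    def __init__(self):
        self.metadata = {
            "/logs/day1.log": [("blk_1", ["dn1", "dn3", "dn4"]),
                               ("blk_2", ["dn2", "dn3", "dn5"])],
        }

    def get_block_locations(self, path):
        return self.metadata[path]

class Client:
    def __init__(self, namenode):
        self.namenode = namenode

    def read(self, path):
        # 1) ask the NameNode for metadata, 2) fetch each block from a data node
        plan = self.namenode.get_block_locations(path)
        return [(block, locations[0]) for block, locations in plan]

reads = Client(NameNode()).read("/logs/day1.log")
```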


Ops Engineer at Outbrain


15 Hadoop nodes

Used for Reports

Customer facing reports


Scales really well

Scale-up alternatives eventually hit a wall

When using sharding, you need a reliable framework to work with it

When writing scripts yourself (e.g., ETL), they usually start "eating their own tail"

Why not commercial alternatives?


Initially used in science, now also in banks &c

Community, also many companies that support it

2 distros of Hadoop

ASF (Apache) vs. CDH

ASF has the bleeding edge features

most use CDH



abstraction between ops & dev

devs treat the data as an abstract dataset, not caring where it is

ops just do the plumbing & make it work

design principles


partial failure

data recovery



"scatter & gather"

moving data is expensive

moving calculation is cheap

as little chatter as possible

share nothing

map/reduce is very simple, but can be applied to almost anything

know before you build

according to the use-case, you'll need to set up the cluster differently

e.g., if lots of data gets transferred in the shuffle stage, you'll need a suitable hardware & network solution
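The "scatter & gather" / share-nothing principles above can be sketched as the classic word count, in plain Python rather than Hadoop's Java API: mappers emit (key, value) pairs on their own shard, the shuffle groups pairs by key (the only stage where data moves), and reducers combine each group independently.

```python
from collections import defaultdict

def map_phase(shard):
    """Runs independently on each shard (share nothing): emit (word, 1)."""
    return [(word, 1) for line in shard for word in line.split()]

def shuffle(pairs):
    """Group values by key -- the only stage where data crosses nodes."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Each key is reduced independently, so reducers also share nothing."""
    return {key: sum(values) for key, values in groups.items()}

shards = [["the cat sat"], ["the dog sat", "the cat"]]
pairs = [pair for shard in shards for pair in map_phase(shard)]
counts = reduce_phase(shuffle(pairs))
# counts == {"the": 3, "cat": 2, "sat": 2, "dog": 1}
```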


Hadoop MapReduce

MapReduce daemons, 2 daemons, Job Tracker, master process, coordinates processing, usually running on the name node, Task Tracker, spawns child JVMs to execute work, child JVM sends heartbeats to the Task Tracker, anything that fails will be restarted, if it failed too many times, it gets blacklisted
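The retry-then-blacklist behaviour described above can be sketched in plain Python (the threshold and names are invented for illustration, not Hadoop's actual configuration): a failed task is restarted, and one that fails too many times is blacklisted and no longer rescheduled.

```python
MAX_FAILURES = 4  # illustrative threshold, not Hadoop's real default

def run_with_retries(task, attempt_fn, failures=None, blacklist=None):
    """Retry a failing task; blacklist it after MAX_FAILURES failures."""
    failures = failures if failures is not None else {}
    blacklist = blacklist if blacklist is not None else set()
    while task not in blacklist:
        if attempt_fn(task):            # heartbeat-style success report
            return "done"
        failures[task] = failures.get(task, 0) + 1
        if failures[task] >= MAX_FAILURES:
            blacklist.add(task)         # stop rescheduling this task
    return "blacklisted"

# A task that always fails ends up blacklisted; one that succeeds completes.
assert run_with_retries("t1", lambda t: False) == "blacklisted"
assert run_with_retries("t2", lambda t: True) == "done"
```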

MapReduce workflow, Client, Job, JobTracker, Task, TaskTracker, Task, Child JVM of Task

Processing slots per machine, Rule of thumb, number of cores + 2

Nice web interface to monitor nodes, trackers & tasks

Scheduling, Several schedulers, Fair scheduler, Most used, developed at Facebook, all jobs get a fair share, eventually battles will start, You can set min/max requirements for tasks


Hive

Data warehouse built on top of Hadoop

Looks like SQL, HQL is interpreted into MapReduce jobs


Pig

Scripting language, Developed at Yahoo, Similar in functionality to Hive, but with a more programmatic approach

Other awesome stuff used in Outbrain


Sqoop

Developed by Cloudera, Moves data from/into your cluster


Oozie

Like LinkedIn's Azkaban


Flume

Streaming log transport

Stream logs directly into Hive

Use case, Meebo needs to decide within 2 hours whether to retire an ad, Processing takes longer than that, Use Hive decorator, Analyze a sample in Hive in real-time to make a decision

More stuff from Cloudera

Hue, Python-based, Open source


Outbrain's Cluster

15 nodes


Eric Sammer's Productizing Hadoop talk


from last Hadoop conference



Replace the default Java-based compression