
Hadoop talk, Nathan Milford, Outbrain

@NathanMilford

Test nodes

less good for random access

If you lose it, your whole cluster is useless

Hadoop HDFS

expects failure

only a single point of failure: the name node

Nodes

NameNode, Coordinates storage & access, Holds metadata (file names, permissions &c), Shouldn't be commodity hardware, Should have lots of RAM, Yahoo's 1,400-node cluster has 30GB in the name node

Data node, monitored by the name node periodically, if it doesn't answer it will be removed, holds data in opaque blocks stored on the local file system, due to replication of at least 3 you can lose 2 data nodes without losing data

Should be backed up frequently

Secondary name node, badly named, not a backup of the name node, needs as much RAM as the name node, handles the periodic process of combining the in-RAM image with the on-disk log (checkpointing)

intended for commodity hardware

still enterprise servers, but not monsters

by default, replication of 3

the more replicas, the better it will perform

optimized for large files

files smaller than 64MB (the block size) won't be padded to a full block, in either the file system or name node RAM
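
For illustration, a minimal sketch of creating a file with an explicit replication factor and block size through the HDFS Java API (the path and values here are hypothetical; the create() overload is part of the standard FileSystem API):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CreateWithLayout {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            // create(path, overwrite, bufferSize, replication, blockSize):
            // 3 replicas, 64MB blocks. A file smaller than 64MB still
            // occupies only its actual size on disk -- no padding.
            FSDataOutputStream out = fs.create(
                new Path("/data/example.dat"), // hypothetical path
                true, 4096, (short) 3, 64L * 1024 * 1024);
            out.writeUTF("hello");
            out.close();
        }
    }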

All NameNode data stored in RAM

for fast lookup

Workflow

client asks the name node for a file

the name node redirects it to the specific data nodes holding the blocks
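
A minimal read sketch against the HDFS Java API, assuming a NameNode at a hypothetical hdfs://namenode:8020; the client asks the NameNode for metadata and then streams the bytes from the DataNodes directly:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class ReadFromHdfs {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.default.name", "hdfs://namenode:8020"); // assumed address
            FileSystem fs = FileSystem.get(conf);
            // open() asks the NameNode where the blocks live; the data
            // itself is read straight from the DataNodes.
            FSDataInputStream in = fs.open(new Path("/data/events.log")); // hypothetical
            IOUtils.copyBytes(in, System.out, 4096, false);
            in.close();
        }
    }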

About

Ops Engineer at Outbrain

Usage

15 Hadoop nodes

Used for Reports

Customer facing reports

Why

Scales really well

Scale-up alternatives eventually hit a wall

When using sharding, you need a reliable framework to work with it

When writing scripts yourself (e.g., ETL), they usually start "eating their own tail"

Why not commercial alternatives?

Popular

Initially in science, now also banks &c

Community, also many companies that support it

2 distros of Hadoop

ASF (Apache) vs. CDH

ASF has the bleeding edge features

most use CDH

RPMs

stable

abstraction between ops & dev

devs treat the data as an abstract dataset, not caring where it is

ops just do the plumbing & make it work

design principles

support

partial failure

data recovery

consistent

scalable

"scatter & gather"

moving data is expensive

moving calculation is cheap

as little chatter as possible

share nothing

map/reduce is very simple, but can be applied to almost anything
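
As a concrete illustration of how little a job needs (not from the talk; a sketch against the Hadoop 0.20 mapreduce API), the canonical word-count pair of map and reduce functions:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {
        // map: line of text -> (word, 1) for each word
        public static class TokenMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();
            protected void map(LongWritable key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (token.isEmpty()) continue;
                    word.set(token);
                    ctx.write(word, ONE);
                }
            }
        }
        // reduce: (word, [1, 1, ...]) -> (word, total)
        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                ctx.write(key, new IntWritable(sum));
            }
        }
    }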

know before you build

according to the use-case, you'll need to set up the cluster differently

e.g., if lots of data gets transferred in the shuffle stage, you'll need the hardware & network to match

Ecosystem

Hadoop MapReduce

MapReduce daemons, 2 daemons, Job Tracker, master process, coordinates processing, usually running on the name node, Task Tracker, spawns child JVMs to execute work, child JVM sends heartbeats to the Task Tracker, anything that fails will be restarted, if it fails too many times it will be blacklisted

MapReduce workflow: Client submits a Job to the JobTracker, the JobTracker hands Tasks to TaskTrackers, each TaskTracker runs its Task in a child JVM
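
The client side of that flow is just a driver that configures a Job and submits it; a minimal sketch reusing the hypothetical WordCount classes above (input/output paths are placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "word count");
            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(WordCount.TokenMapper.class);
            job.setReducerClass(WordCount.SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path("/in"));    // placeholder
            FileOutputFormat.setOutputPath(job, new Path("/out")); // placeholder
            // waitForCompletion submits to the JobTracker, which hands
            // tasks to TaskTrackers, which spawn the child JVMs.
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }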

Processing slots per machine, rule of thumb: number of cores + 2

Nice web interface to monitor nodes, trackers & tasks

Scheduling, several schedulers, Fair Scheduler, most used, developed at Facebook, all jobs get a fair share, eventually battles will start, you can set min/max requirements for tasks

Hive

Data warehouse built on top of Hadoop

Looks like SQL; HQL is interpreted into MapReduce jobs
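
For flavor, a sketch of running HQL from Java over the Hive JDBC driver of that era (assumes a Hive server listening on the default port 10000; the table and query are made up):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQuery {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
            Connection con = DriverManager.getConnection(
                "jdbc:hive://localhost:10000/default", "", "");
            Statement stmt = con.createStatement();
            // HQL that Hive compiles down to MapReduce jobs
            ResultSet rs = stmt.executeQuery(
                "SELECT page, COUNT(1) FROM clicks GROUP BY page"); // hypothetical table
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
            con.close();
        }
    }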

Pig

Scripting language, developed at Yahoo, similar in functionality to Hive but with a more programmatic approach
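
A comparable sketch embedding Pig from Java via PigServer, Pig's embedding API; the script and paths are made up:

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class PigExample {
        public static void main(String[] args) throws Exception {
            PigServer pig = new PigServer(ExecType.MAPREDUCE);
            // The same kind of grouping Hive would express in one HQL
            // statement, written step by step in Pig Latin.
            pig.registerQuery("clicks = LOAD '/in/clicks' AS (page:chararray);"); // hypothetical input
            pig.registerQuery("grouped = GROUP clicks BY page;");
            pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(clicks);");
            pig.store("counts", "/out/page_counts"); // placeholder output path
        }
    }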

Other awesome stuff used in Outbrain

Sqoop

Developed by Cloudera, moves data into/out of your cluster

Oozie

Like LinkedIn's Azkaban

Flume

Streaming log transport

Stream logs directly into Hive

Use case: Meebo needs to decide within 2 hours whether to retire an ad, processing takes longer, so use a Hive decorator, analyze a sample in Hive in real time to make the decision

More stuff from Cloudera

Hue, Python-based, open source

Beeswax

Outbrain's Cluster

15 nodes

Resources

Eric Sammer's "Productionizing Hadoop" talk

best

from last Hadoop conference

...

Tips

Replace the default Java-based compression
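
One common way to apply that tip, sketched per-job with the old mapred.* property names; picking LZO (from the separate hadoop-lzo package) is an assumption about which native codec you'd use:

    import org.apache.hadoop.conf.Configuration;

    public class CompressionConf {
        public static Configuration withNativeCompression() {
            Configuration conf = new Configuration();
            // Compress intermediate map output with a native codec
            // instead of the default Java-based one.
            conf.setBoolean("mapred.compress.map.output", true);
            conf.set("mapred.map.output.compression.codec",
                     "com.hadoop.compression.lzo.LzoCodec"); // assumes hadoop-lzo is installed
            return conf;
        }
    }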