Hadoop talk, Nathan Milford, Outbrain


1. Several schedulers

1.1. Fair scheduler

1.1.1. Most used

1.1.2. developed at Facebook

1.1.3. all jobs get fair share

1.1.4. eventually jobs start "battling" over their shares of the cluster
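The fair scheduler's core idea can be sketched in a few lines of pure Python (an illustrative model, not the actual Hadoop FairScheduler API; `fair_shares` is a made-up helper):

```python
# Hypothetical sketch of fair-share scheduling: every running job gets
# an equal share of the cluster's task slots, with any remainder handed
# to the first jobs in line.
def fair_shares(total_slots, jobs):
    """Split slots evenly across jobs; distribute the remainder."""
    base, extra = divmod(total_slots, len(jobs))
    return {job: base + (1 if i < extra else 0)
            for i, job in enumerate(jobs)}

shares = fair_shares(100, ["etl", "report", "adhoc"])
print(shares)  # each job gets roughly a third of the 100 slots
```

In the real scheduler, shares are recomputed as jobs arrive and finish, which is where the "battles" over slots come from.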

2. @NathanMilford

3. Test nodes

4. less good for random access

5. If you lose it, your whole cluster is useless

6. Hadoop HDFS

6.1. expects failure

6.1.1. only one single point of failure

6.1.1.1. Name node

6.2. Nodes

6.2.1. NameNode

6.2.1.1. Coordinates storage & access

6.2.1.2. Holds metadata (file names, permissions &c)

6.2.1.3. Shouldn't be commodity hardware

6.2.1.4. Should have lots of RAM

6.2.1.4.1. Yahoo's 1,400-node cluster has 30 GB of RAM in its name node

6.2.2. Data node

6.2.2.1. monitored by name node periodically

6.2.2.1.1. if it doesn't answer, it will be removed

6.2.2.2. holds data in opaque blocks stored on the local file system

6.2.2.3. due to replication of at least 3, you can lose 2 data nodes without losing data
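The heartbeat check the name node runs on data nodes can be modeled in a few lines (a minimal sketch with illustrative names and an illustrative timeout, not Hadoop's internals):

```python
# Minimal model of the name node marking data nodes dead when their
# heartbeats stop arriving for too long.
HEARTBEAT_TIMEOUT = 30.0  # seconds; illustrative, not Hadoop's default

def live_nodes(last_heartbeat, now, timeout=HEARTBEAT_TIMEOUT):
    """Return the nodes whose last heartbeat is within the timeout window."""
    return {node for node, t in last_heartbeat.items() if now - t <= timeout}

beats = {"dn1": 100.0, "dn2": 95.0, "dn3": 60.0}
print(live_nodes(beats, now=120.0))  # dn3 is dropped: silent for 60 s
```

Once a node is dropped, the name node re-replicates its blocks from the surviving replicas, which is why a replication factor of 3 tolerates losing 2 data nodes.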

6.2.3. Should be backed-up frequently

6.2.4. secondary name node

6.2.4.1. badly named

6.2.4.1.1. not a backup of name node

6.2.4.2. needs as much RAM as the name node

6.2.4.3. periodically merges the name node's in-memory state with the on-disk image (checkpointing the fsimage against the edit log)

6.3. intended for commodity hardware

6.3.1. still enterprise servers, but not monsters

6.4. by default, replication of 3

6.4.1. more replicas mean better read performance (at a storage cost)

6.5. optimized for large files

6.5.1. files smaller than the 64 MB block size aren't padded to a full block, neither on disk nor in name-node RAM
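The block accounting is simple enough to write down (a sketch; `blocks_for` is an illustrative helper, not an HDFS API):

```python
# Why small files aren't padded: a file occupies only the bytes it has,
# but every block it spans still costs one metadata entry in name-node RAM.
BLOCK_SIZE = 64 * 1024 * 1024  # HDFS default block size: 64 MB

def blocks_for(file_size):
    """Number of HDFS blocks a file of `file_size` bytes occupies."""
    return max(1, -(-file_size // BLOCK_SIZE))  # ceiling division

print(blocks_for(10 * 1024 * 1024))   # 10 MB file -> 1 block, no 64 MB padding
print(blocks_for(200 * 1024 * 1024))  # 200 MB file -> 4 blocks
```

This is also why HDFS is optimized for large files: many tiny files each cost a metadata entry while carrying little data.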

6.6. All NameNode data stored in RAM

6.6.1. for fast lookup

6.7. Workflow

6.7.1. the client asks the name node for a file

6.7.2. the name node redirects it to the data node(s) holding the blocks
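The two-step read workflow can be sketched as follows (the dict stands in for name-node metadata; `locate` and the paths are illustrative, not a real API):

```python
# Illustrative read workflow: the client asks the name node where a
# file's blocks live, then reads directly from one of those data nodes.
metadata = {
    "/logs/day1.log": ["dn1", "dn4", "dn7"],  # 3 replicas of the block
}

def locate(path):
    """Name-node side: return the data nodes holding the file's blocks."""
    return metadata[path]

# Client side: pick any replica and read from it directly.
replicas = locate("/logs/day1.log")
print("reading from", replicas[0])
```

The key point is that file data never flows through the name node, only the metadata lookup does.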

7. About

7.1. Ops Engineer at Outbrain

8. Usage

8.1. 15 Hadoop nodes

8.2. Used for Reports

8.2.1. Customer facing reports

9. Why

9.1. Scales really well

9.2. Scale-up alternatives eventually hit a wall

9.3. When using sharding, you need a reliable framework to work with it

9.4. When writing scripts yourself (e.g., ETL), they usually start "eating their own tail"

9.5. Why not commercial alternatives?

9.5.1. Popular

9.5.2. Initially used in science, now also in banking &c

9.5.3. Community

9.5.3.1. also many companies that support it

10. 2 distros of Hadoop

10.1. ASF (Apache) vs. CDH

10.2. ASF has the bleeding edge features

10.3. according to the use-case, you'll need to setup the cluster differently

10.3.1. e.g.

10.3.1.1. if the shuffle stage transfers lots of data, you'll need an appropriate hardware & network setup

10.4. most use CDH

10.4.1. RPMs

10.4.2. stable

11. abstraction between ops & dev

11.1. devs treat the data as an abstract dataset, not caring where it lives

11.2. ops just do the plumbing & make it work

12. design principles

12.1. support

12.1.1. partial failure

12.1.2. data recovery

12.1.3. consistency

12.1.4. scalability

12.2. "scatter & gather"

12.2.1. moving data is expensive

12.2.2. moving calculation is cheap

12.2.3. keep inter-node chatter to a minimum

12.3. share nothing

12.4. map/reduce is very simple, but can be applied to almost anything
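The "simple but general" shape of map/reduce fits in a short pure-Python word count (a sketch of the model only; no Hadoop involved, and the phase functions are illustrative names):

```python
# Word count as map/reduce: the map phase emits (key, value) pairs,
# the shuffle groups them by key, and the reduce phase sums per key.
from collections import defaultdict

def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield word, 1  # emit (word, 1) for every occurrence

def reduce_phase(pairs):
    grouped = defaultdict(int)
    for word, count in pairs:   # shuffle: group by key
        grouped[word] += count  # reduce: sum counts per key
    return dict(grouped)

print(reduce_phase(map_phase(["to be or", "not to be"])))
# {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

Swap the map and reduce bodies and the same skeleton covers joins, aggregations, inverted indexes, and most of what the talk calls "almost anything".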

13. know before you build

14. Ecosystem

14.1. Hadoop MapReduce

14.1.1. MapReduce daemons

14.1.1.1. 2 daemons

14.1.1.1.1. Job Tracker

14.1.1.1.2. Task Tracker

14.1.1.2. any task that fails will be restarted

14.1.1.3. if failures repeat too many times, the node gets blacklisted

14.1.2. MapReduce workflow

14.1.2.1. Client

14.1.2.1.1. Job

14.1.2.2. JobTracker

14.1.2.2.1. Task

14.1.2.3. TaskTracker

14.1.2.3.1. Task

14.1.2.4. Child JVM of Task

14.1.3. Processing slots for machines

14.1.3.1. Rule of thumb

14.1.3.1.1. number of cores + 2
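The rule of thumb above, written out (the helper name is illustrative):

```python
# Rule of thumb from the talk: processing slots per machine = cores + 2.
def task_slots(cores):
    """Suggested number of map/reduce task slots for a machine."""
    return cores + 2

print(task_slots(8))  # an 8-core machine gets 10 slots
```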

14.1.4. Scheduling

14.1.4.1. You can set min/max requirements for tasks

14.2. Hive

14.2.1. Data warehouse built on top of Hadoop

14.2.2. Looks like SQL

14.2.2.1. HQL interpreted into MapReduce jobs
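How an HQL aggregate turns into a MapReduce job can be illustrated with a pure-Python stand-in (the query, table, and variable names are made up; this is not Hive's actual execution plan):

```python
# The HQL query
#   SELECT page, COUNT(*) FROM views GROUP BY page
# compiles to a map that emits (page, 1) and a reduce that sums per key.
from collections import Counter

views = [("home",), ("about",), ("home",)]  # stand-in for the views table

mapped = [(row[0], 1) for row in views]  # map: emit (page, 1)
counts = Counter()
for page, one in mapped:                 # shuffle + reduce: sum per page
    counts[page] += one
print(dict(counts))  # {'home': 2, 'about': 1}
```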

14.3. Pig

14.3.1. Scripting language

14.3.1.1. Developed at Yahoo

14.3.1.2. Similar in functionality to Hive, but with more programmatic approach

14.3.1.3. Nice web interface to monitor nodes, trackers & tasks

15. Other awesome stuff used in Outbrain

15.1. Sqoop

15.1.1. Developed by Cloudera

15.1.1.1. Move data from/into your cluster

15.2. oozie

15.2.1. Like LinkedIn's Azkaban

15.3. Flume

15.3.1. Streaming log transport

15.3.2. Stream logs directly into Hive

15.3.3. Use case

15.3.3.1. Meebo needs to decide within 2 hours whether to retire an ad

15.3.3.2. Processing is longer

15.3.3.3. Use Hive decorator

15.3.3.4. Analyze a sample in Hive in real-time to make a decision

15.4. More stuff from Cloudera

15.4.1. Hue

15.4.1.1. Python based

15.4.1.2. Open source

15.4.2. Beeswax

16. Outbrain's Cluster

16.1. 15 nodes

17. Resources

17.1. Eric Sammer's talk on productionizing Hadoop

17.1.1. best

17.1.2. from last Hadoop conference

17.2. ...

18. Tips

18.1. Replace the default Java-based compression with native compression libraries