1. Several schedulers
1.1. Fair scheduler
1.1.1. Most used
1.1.2. developed at Facebook
1.1.3. all jobs get a fair share
1.1.4. eventually jobs battle over capacity (see the allocation-file sketch below)
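A minimal sketch of an MRv1 fair-scheduler allocation file; the pool name and numbers are hypothetical, but guaranteed minimums and weights are how those battles get contained:

    <?xml version="1.0"?>
    <allocations>
      <!-- hypothetical pool for report jobs -->
      <pool name="reports">
        <minMaps>10</minMaps>        <!-- guaranteed map slots -->
        <minReduces>5</minReduces>   <!-- guaranteed reduce slots -->
        <weight>2.0</weight>         <!-- 2x share of spare capacity -->
      </pool>
      <userMaxJobsDefault>5</userMaxJobsDefault>
    </allocations>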
2. @NathanMilford
3. Test nodes
4. not well suited to random access
5. If you lose it, your whole cluster is useless
6. Hadoop HDFS
6.1. expects failure
6.1.1. only one single point of failure
6.1.1.1. Name node
6.2. Nodes
6.2.1. NameNode
6.2.1.1. Coordinates storage & access
6.2.1.2. Holds metadata (file names, permissions &c)
6.2.1.3. Shouldn't be commodity hardware
6.2.1.4. Should have lots of RAM
6.2.1.4.1. Yahoo's 1,400-node cluster has 30GB of RAM in the name node
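As a rough cross-check (the ~150 bytes/object figure is a commonly cited rule of thumb, not from the talk): each file, directory, and block costs on the order of 150 bytes of name node heap, so

    30 GB / ~150 bytes per object ≈ 200 million namespace objects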
6.2.2. Data node
6.2.2.1. monitored by name node periodically
6.2.2.1.1. if it doesn't answer, it will be removed
6.2.2.2. holds data in opaque blocks stored on the local file system
6.2.2.3. due to replication of at least 3, you can lose 2 data nodes without losing data
6.2.3. Should be backed up frequently
6.2.4. secondary name node
6.2.4.1. badly named
6.2.4.1.1. not a backup of name node
6.2.4.2. needs as much RAM as the name node
6.2.4.3. periodically merges the name node's edit log into the on-disk filesystem image (checkpointing)
6.3. intended for commodity hardware
6.3.1. still enterprise servers, but not monsters
6.4. by default, replication of 3
6.4.1. the more replicas, the better it will perform (see the hdfs-site.xml sketch below)
6.5. optimized for large files
6.5.1. files smaller than the 64MB block size aren't padded to a full block, either on disk or in name node RAM
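A minimal hdfs-site.xml sketch for the two settings above (the values are the defaults the talk mentions; dfs.block.size is the pre-2.x property name):

    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>3</value>            <!-- copies of each block -->
      </property>
      <property>
        <name>dfs.block.size</name>
        <value>67108864</value>     <!-- 64 MB in bytes -->
      </property>
    </configuration>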
6.6. All NameNode data stored in RAM
6.6.1. for fast lookup
6.7. Workflow
6.7.1. client asks the name node for a file
6.7.2. name node redirects it to the data nodes holding the blocks
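A minimal client-side sketch of that workflow with the Java FileSystem API (host and path are hypothetical); the name node lookup and the data node reads all happen behind fs.open():

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsRead {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // assumption: name node address, normally picked up from core-site.xml
            conf.set("fs.default.name", "hdfs://namenode.example.com:8020");
            FileSystem fs = FileSystem.get(conf);

            // open() asks the name node for block locations, then the stream
            // pulls the blocks from whichever data nodes hold them
            FSDataInputStream in = fs.open(new Path("/logs/2011-06-01.log"));
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) > 0) {
                System.out.write(buf, 0, n);
            }
            in.close();
            fs.close();
        }
    }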
7. About
7.1. Ops Engineer at Outbrain
8. Usage
8.1. 15 Hadoop nodes
8.2. Used for Reports
8.2.1. Customer facing reports
9. Why
9.1. Scales really well
9.2. Scale-up alternatives eventually hit a wall
9.3. When using sharding, you need a reliable framework to work with it
9.4. Scripts you write yourself (e.g., for ETL) usually end up "eating their own tail"
9.5. Why not commercial alternatives?
9.5.1. Popular
9.5.2. Initially used in science, now also in banks &c
9.5.3. Community
9.5.3.1. also many companies that support it
10. 2 distros of Hadoop
10.1. ASF (Apache) vs. CDH
10.2. ASF has the bleeding edge features
10.3. according to the use-case, you'll need to setup the cluster differently
10.3.1. e.g.
10.3.1.1. if lots of data gets transferred in the shuffle stage, you'll need to solve it at the hardware & network level
10.4. most use CDH
10.4.1. RPMs (see the yum example below)
10.4.2. stable
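With Cloudera's yum repository configured, a worker node install is just packages (names follow CDH3's hadoop-0.20 scheme; the exact list depends on the node's role and CDH version):

    yum install hadoop-0.20 hadoop-0.20-datanode hadoop-0.20-tasktracker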
11. abstraction between ops & dev
11.1. devs treat the data as an abstract dataset, not caring where it lives
11.2. ops just do the plumbing & make it work
12. design principles
12.1. support
12.1.1. partial failure
12.1.2. data recovery
12.1.3. consistency
12.1.4. scalability
12.2. "scatter & gather"
12.2.1. moving data is expensive
12.2.2. moving computation is cheap
12.2.3. keep chatter between nodes to a minimum
12.3. share nothing
12.4. map/reduce is very simple, but can be applied to almost anything
13. know before you build
14. Ecosystem
14.1. Hadoop MapReduce
14.1.1. MapReduce daemons
14.1.1.1. 2 daemons
14.1.1.1.1. Job Tracker
14.1.1.1.2. Task Tracker
14.1.1.2. anything that fails will be restarted
14.1.1.3. if a node fails too many times, it gets blacklisted
14.1.2. MapReduce workflow
14.1.2.1. Client
14.1.2.1.1. Job
14.1.2.2. JobTracker
14.1.2.2.1. Task
14.1.2.3. TaskTracker
14.1.2.3.1. Task
14.1.2.4. Child JVM of Task
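To make the Client -> JobTracker -> TaskTracker flow concrete, a minimal word-count sketch against the classic org.apache.hadoop.mapreduce API (input/output paths are hypothetical); the client submits the Job, the JobTracker carves it into Tasks, and each TaskTracker runs its Tasks in child JVMs:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // map task: emit (word, 1) for every word in this task's input split
        public static class WordMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            protected void map(LongWritable key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (token.isEmpty()) continue;
                    word.set(token);
                    ctx.write(word, ONE);
                }
            }
        }

        // reduce task: after the shuffle, sum the counts for each word
        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                ctx.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "wordcount");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(WordMapper.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path("/in"));    // hypothetical
            FileOutputFormat.setOutputPath(job, new Path("/out")); // hypothetical
            job.waitForCompletion(true);
        }
    }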
14.1.3. Processing slots per machine
14.1.3.1. Rule of thumb
14.1.3.1.1. number of cores + 2
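In MRv1 that rule of thumb lands in mapred-site.xml as TaskTracker slot counts; e.g. a hypothetical 8-core box split into 6 map + 4 reduce slots (8 cores + 2):

    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>6</value>
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>4</value>
    </property>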
14.1.4. Scheduling
14.1.4.1. You can set min/max requirements for tasks
14.2. Hive
14.2.1. Data warehouse built on top of Hadoop
14.2.2. Looks like SQL
14.2.2.1. HQL is compiled into MapReduce jobs (see the JDBC sketch below)
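A minimal sketch of running HQL from Java over the era's Hive JDBC driver (server address, table, and query are hypothetical); the query string is what Hive turns into MapReduce jobs:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveQuery {
        public static void main(String[] args) throws Exception {
            // HiveServer1-era JDBC driver; assumes a Hive server on port 10000
            Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
            Connection con = DriverManager.getConnection(
                    "jdbc:hive://hive.example.com:10000/default", "", "");
            Statement stmt = con.createStatement();

            // looks like SQL, but runs as one or more MapReduce jobs
            ResultSet rs = stmt.executeQuery(
                    "SELECT campaign, COUNT(1) FROM clicks GROUP BY campaign");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
            con.close();
        }
    }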
14.3. Pig
14.3.1. Scripting language
14.3.1.1. Developed at Yahoo
14.3.1.2. Similar in functionality to Hive, but with a more programmatic approach (see the PigServer sketch below)
14.3.1.3. Nice web interface to monitor nodes, trackers & tasks
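A minimal sketch of that programmatic style, driving Pig Latin from Java through the PigServer API (paths and relation names are hypothetical):

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class PigJob {
        public static void main(String[] args) throws Exception {
            PigServer pig = new PigServer(ExecType.MAPREDUCE);
            // Pig Latin: load, group, count; compiled to MapReduce like Hive,
            // but expressed as an imperative, step-by-step dataflow
            pig.registerQuery("clicks = LOAD '/logs/clicks' AS (campaign:chararray);");
            pig.registerQuery("grouped = GROUP clicks BY campaign;");
            pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(clicks);");
            pig.store("counts", "/out/campaign_counts");
        }
    }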
15. Other awesome stuff used in Outbrain
15.1. Sqoop
15.1.1. Developed by Cloudera
15.1.1.1. Move data from/into your cluster
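For example, a typical table import (connection string, credentials, and table are hypothetical):

    sqoop import \
      --connect jdbc:mysql://db.example.com/reports \
      --username etl --password secret \
      --table clicks \
      --target-dir /data/clicks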
15.2. oozie
15.2.1. Workflow scheduler, like LinkedIn's Azkaban
15.3. Flume
15.3.1. Streaming log transport
15.3.2. Stream logs directly into Hive
15.3.3. Use case
15.3.3.1. Meebo needs to decide within 2 hours whether to retire an ad
15.3.3.2. Full processing takes longer than that
15.3.3.3. Use a Hive decorator
15.3.3.4. Analyze a sample in Hive in real-time to make a decision
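A sketch of a Flume agent tailing a log into HDFS. Caveat: this uses the later Flume NG properties syntax rather than the Flume OG sources/sinks/decorators of this talk's era, and all names and paths are hypothetical:

    agent.sources = r1
    agent.channels = c1
    agent.sinks = k1

    # tail the application log
    agent.sources.r1.type = exec
    agent.sources.r1.command = tail -F /var/log/app/access.log
    agent.sources.r1.channels = c1

    agent.channels.c1.type = memory

    # write events into HDFS, partitioned by day
    agent.sinks.k1.type = hdfs
    agent.sinks.k1.hdfs.path = hdfs://namenode.example.com/logs/%Y-%m-%d
    agent.sinks.k1.hdfs.useLocalTimeStamp = true
    agent.sinks.k1.channel = c1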
15.4. More stuff from Cloudera
15.4.1. Hue
15.4.1.1. Python based
15.4.1.2. Open source
15.4.2. Beeswax (Hue's Hive query UI)
16. Outbrain's Cluster
16.1. 15 nodes
17. Resources
17.1. Eric Sammer's Productizing Hadoop talk
17.1.1. best
17.1.2. from last Hadoop conference