3. Map Aggregation

Get Started. It's Free
or sign up with your email address
3. Map Aggregation by Mind Map: 3. Map Aggregation

1. Combiners

1.1. Special type of Reducer

1.1.1. Combiner extends class Reducer

1.1.2. Invoked arbitrary number of times (1...many)

1.1.2.1. MR framework invokes

1.1.2.2. Invoked just prior to spill

1.1.3. Combiner must deserialize Mapper <K, V>

1.1.4. Only combines data on one node

1.1.4.1. Not output from multiple Mappers

1.1.5. Combiner logic should be transparent

1.1.5.1. I.e. Combiner should be able to be removed w/out altering the outcome of Reducer.

1.1.6. Combiner input and output types must be the same

1.1.6.1. I.e. must match output of Mapper

1.1.7. Reducer can be used as a Combiner in some cases.

1.1.7.1. Example: if the Reducer computes a commutative and associative function like summation or max value.

1.1.8. Combiner always runs at least once per Mapper.

1.1.8.1. Always at least 1 spill file

1.2. Reduce-side combining

1.2.1. Purpose of Combiner is decrease network traffic between Mapper and Reducer.

1.2.2. Reducer can use Combiner to improve file I/O.

1.2.2.1. Should make NO difference to program logic.

1.3. Combiner Example

1.3.1. Extends org.apache.hadoop.mapreduce.Reducer class

1.3.2. Combiner <K, V> output must match Mapper output which matches Reducer input.

1.3.3. Multiple invocations won't affect algorithm

1.3.4. WordCountCombiner

1.3.4.1. Input <K, V> = output <K, V>

1.3.4.1.1. Therefore not all Reducers can be Combiners.

1.3.4.2. setCombinerClass in the run() method to configure Combiner

1.3.4.2.1. Example

1.3.4.2.2. Sometimes you can use the Reducer:

2. In-map aggregation

2.1. Also called local aggregation

2.2. Mapper stores records in memory

2.3. Won't work for large # records

2.4. Good performance gains when it works

2.4.1. Doesn't have overhead of Combiner (JVM)

2.5. Example: TopResultMapper

2.5.1. ArrayList eventually holds every word

2.5.1.1. Won't work if split too big

2.5.1.2. Doesn't hold dupes - freq ++

2.5.2. Clever trick: ArrayList put into PriorityQueue sorted by frequency.

2.5.2.1. NOTE: Only finds top 10 words per InputSplit - not entire Hadoop cluster.

3. Counters

3.1. Pre-defined

3.1.1. Job Counters

3.1.2. FilesSystemCounters

3.1.3. Many in the Map-Reduce Framework group

3.2. User-defined

3.2.1. enum-based

3.2.1.1. incrementing

3.2.2. string group and counter name