Definition
MapReduce is a programming model and an associated implementation for processing and generating large data sets. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines.
Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages. We realized that most of our computations involved applying a map operation to each logic "record" in our input in order to compute a set of intermediate key/value pairs, and then applying a reduce operation to all the values that shared the same key, in order to combine the derived data appropriately.
Hardware
- dual-processor x86 processors running Linux, with 2-4GB of memory performance
- commodity networking hardware - 100 MB/secs
- machine failures are common
- storage is in expensive IDE disk
Execution Overview
Master Data Strcutures
For each completed map task, the master stores the locations and sizes of the R intermediate file regions produced by the map task. ... The information is pushed incrementally to works that have in-progress reduce tasks.
Fault Tolerance
The master pings every worker periodically.
MapReduce is resilient to large-scale worker failures. The MapReduce master simply re-executed the work done by the unreachable worker machines.
However, given that there is only a single master, its failure is unlikely; therefore our current implementation aborts the MapReduce computation if the master failes.
Locality
The MapReduce master takes the location information of the input files into account and attempts to schedule a map taks on a machine that contains a replica of the corresponding input data.
Backup Tasks
One of the common causes that lengthens the total time taken for a MapReduce operation is a "straggler": a machine that takes an unusually long time to complete one of the last few map or reduce taks in the computation. We have a general mechanism to alleviate the problem of stragglers. When a MapReduce operation is close to completion, the master schedules backup executions of the remaining in-progress tasks.
Partitioning/Combiner Function
Remaining questions after review
- How reduce workers fetch data from different temporary files that map workers generated?
- How to explain the map/reduce function as below with more concrete examples:
map (k1,v1) -> list(k2,v2) reduce (k2,list(v2)) -> list(v2)