Wednesday, February 23, 2011

MapReduce

MapReduce is a programming model and an associated implementation for processing and generating large data sets. It is inspired by the map and reduce functions commonly used in functional programming.

The basic computation takes a set of input key/value pairs and produces a set of output key/value pairs. The work is expressed in two steps: Map and Reduce. Map takes an input pair and produces a set of intermediate key/value pairs. The MapReduce library groups together all intermediate values associated with the same intermediate key and passes them to Reduce. Reduce takes an intermediate key and the set of values for that key, and merges these values to form a possibly smaller set of values; typically just zero or one output value is produced per Reduce invocation. Under the hood, the MapReduce framework is essentially a large distributed sort, comprising an input reader, a map function, a partition function, a compare function, a reduce function, and an output writer, with data flowing through these components in that order.
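To make that data flow concrete, here is a minimal single-machine sketch of the model in Python. The function run_mapreduce and its arguments are hypothetical names for illustration, not part of any real MapReduce library; a real implementation would distribute the map, shuffle, and reduce phases across many machines.

    from collections import defaultdict

    def run_mapreduce(map_fn, reduce_fn, inputs):
        # Map phase: each input (key, value) pair emits intermediate pairs.
        intermediate = defaultdict(list)
        for key, value in inputs:
            for ikey, ivalue in map_fn(key, value):
                intermediate[ikey].append(ivalue)
        # Shuffle/reduce phase: all values grouped under the same
        # intermediate key are merged into a (usually smaller) result.
        return {ikey: reduce_fn(ikey, ivalues)
                for ikey, ivalues in intermediate.items()}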

For example, consider the problem of counting the number of occurrences of each word in a large collection of documents. The Map and Reduce functions would look like the following pseudo-code:
map(String key, String value):
    // key: document name, value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

reduce(String key, Iterator values):
    // key: a word, values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));
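As a rough, runnable translation of that pseudo-code, the word count could look like the plain Python below. The toy documents dict and the defaultdict grouping stand in for the distributed input reader and shuffle; they are illustrative assumptions, not part of the MapReduce framework itself.

    from collections import defaultdict

    def word_count_map(doc_name, contents):
        # Emit an intermediate (word, 1) pair for every word in the document.
        for word in contents.split():
            yield word, 1

    def word_count_reduce(word, counts):
        # Sum all the intermediate counts emitted for this word.
        return sum(counts)

    documents = {"doc1": "the quick brown fox", "doc2": "the lazy dog"}
    grouped = defaultdict(list)
    for name, text in documents.items():
        for word, count in word_count_map(name, text):
            grouped[word].append(count)
    counts = {w: word_count_reduce(w, vals) for w, vals in grouped.items()}
    print(counts)  # {'the': 2, 'quick': 1, 'brown': 1, ...}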

A few more problems can be expressed as MapReduce computations. For counting URL access frequency, the map function processes logs of web page requests and emits (URL, 1) pairs, and the reduce function adds together all values for the same URL and emits (URL, total count). In the reverse web-link graph problem, the map function outputs a (target, source) pair for each link to a target URL found in a page named source; the reduce function then concatenates the list of sources for each target and emits (target, list(source)), as in the sketch below.
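A sketch of the reverse web-link graph in the same Python style; the input format (each page paired with the list of links it contains) is an assumption made for illustration.

    def reverse_links_map(source_url, links_on_page):
        # For each outgoing link, emit (target, source) so the shuffle
        # groups all sources pointing at the same target together.
        for target in links_on_page:
            yield target, source_url

    def reverse_links_reduce(target_url, sources):
        # Concatenate every source page that links to this target.
        return list(sources)

    # With the run_mapreduce sketch above and toy input
    #   pages = {"a.html": ["b.html", "c.html"], "b.html": ["c.html"]}
    # run_mapreduce(reverse_links_map, reverse_links_reduce, pages.items())
    # -> {'b.html': ['a.html'], 'c.html': ['a.html', 'b.html']}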

MapReduce is easy to use, even for programmers without experience with parallel and distributed systems, since it hides the details of parallelization, fault tolerance, locality optimization, and load balancing. A large class of real-world problems is easily expressible as MapReduce computations.
MapReduce has been used at Google for many purposes, such as the web search service, sorting, data mining, and machine learning. Google's implementation of MapReduce scales to large clusters comprising thousands of machines.

