MapReduce (TM)

March 4, 2011 at 4:55 pm (FOSS)

MapReduce is a patented framework introduced by Google to support distributed computing on large data sets on clusters of computers. It was inspired by the map reduce concepts found in functional programming languages but the purpose in this particular case is different from the original forms.

To begin learning this concept you could read the research paper written by Google research fellows. This might be the most widely read computer science paper till date. Then go ahead and watch these 2 videos . After watching the first video you will have some doubts which the second video will clear up.

Hadoop is probably the most prominent free software project that implements MapReduce. It also implements a file system similar to the Google File System (GFS), called HDFS (Hadoop distributed file system). Hadoop recently won the terabyte data sort benchmark where it completed the task in 209 seconds in comparison to the previously held record of 297 seconds. Facebook, among many other companies, uses Hadoop for analysing terabytes of logs everyday.

The other project that implements MapReduce is the Disco project. While Hadoop is written in Java, the Disco project is in Python. One interesting thing about this paradigm is that to write MapReduce code using these frameworks, you do not need to know the native languages of the frameworks. Infact you can write it in any language that has the STDIN STDOUT concept. That means you can use Hadoop (Hadoop streaming actually) and write Python or Ruby code.

So lets say you have a massive problem you want to solve. How do you go about solving it in MapReduce? Lets take a sample problem.

We have millions of documents and we want to find out how many times each word occurred in the documents.

Your task as the programmer is simple: Write a map function and write a reduce function. Everything else will be taken care of by the underlying framework, such as Hadoop or Disco.

What is the map function for? What does it mean?

These millions of documents will be divided into a groups randomly and sent to different nodes in your cluster. Each node will run this map function on the documents it receives. This map function is supposed to emit a key value pair from each of these documents. All nodes will keep emitting key value pairs like this. The framework will capture these key-value pairs, sort this collection and pass it to the reduce function one key-values set at a time.

What is the reduce function for? What does it mean?

The reduce function takes these key-value pairs and sums or aggregates them before giving a final output value per key-values pair.

Lets get back to the word counting in the documents example:

Here would be the map function pseudocode:

void map(String name, String document):
  // name: document name
  // document: document contents
  for each word w in document:
    EmitIntermediate(w, "1");

For each word found this function emits a key-value pair, where the key is the word and the value is always 1. The idea is that someone (reduce function) will collect all these pairs and keep adding these 1s and calculate a sum at the end of the day. Why does it emit key-value pairs and not something else? Well, the return format has to be formalised to some extent because there are frameworks being built around it. So people just decided to settle on key-value as return types from map functions.

The reduce function would look like:

void reduce(String word, Iterator partialCounts):
  // word: a word
  // partialCounts: a list of aggregated partial counts
  int result = 0;
  for each pc in partialCounts:
    result += ParseInt(pc);
  Emit(AsString(result));

Here is a schematic for the process:

A few points that might not be clear:

1) The shuffling phase, where the keys are sorted before being passed to the reducer is taken care of by the framework like Hadoop or Disco.

2) There maybe different reducer logic for different kinds of keys. The framework can be told which reducer to be called for which key set.

This was a brief description of the concept. You can learn much more by following the links I have provided above in the post.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: