MapReduce explained

MapReduce is a high-level programming model introduced by Google in 2004 for large-scale parallel data processing. Its open-source implementation, Apache Hadoop, followed in 2008, led by researchers at Yahoo. This set off an explosion of large-scale parallel processing systems from the database community, which layered more features on top (Pig, HBase, Hive).

Some notable and lasting contributions of MapReduce were:

  • Fault tolerance- When you are working on 1,000 computers at a time, the probability that at least one of them fails is extremely high. Fault tolerance during query processing, so that a single failure does not lose all the work, was something the MapReduce and Hadoop papers really emphasized.
  • Schema on read- Relational databases are built on a fixed schema: the structure is defined first, and your data must then be made to fit it. But data almost always comes from many different sources and does not arrive with a schema attached. MapReduce lets you load the data as-is and impose structure only when you process it.
  • User-defined functions- The experience of writing, managing, and maintaining code in a traditional database with SQL queries was poor, largely because much of the logic lives in the application layer rather than the database layer. MapReduce lets you define your functions directly in the application layer.

Enough about the evolution MapReduce has brought to the database world; let's now understand how it really works.

Suppose you need to determine the count of every word in every document available on the internet. This is a humongous amount of data, far too much for a single machine to process. So we need to scale out and work on a cluster of computers.

image_1

Now MapReduce's job begins. Here, computation is taken to the data instead of data being taken to the computation: the input is divided into blocks, and each block is handed to one machine for processing. MapReduce can be broken down into three steps:

Map– This takes in a key-value pair and generates intermediate key-value pairs. For our word count example:

  • Input

Key- Document Id

Value- Bag of words in the document.

  • Output

Key- Word

Value- A count of 1, emitted once for each occurrence of the word in the document.

image_2
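A minimal sketch of such a map function in Python (the names `map_fn`, `doc_id`, and `contents` are illustrative, and whitespace tokenization is a simplifying assumption; a real implementation would also handle punctuation and casing):

```python
def map_fn(doc_id, contents):
    # Input key: the document id; input value: the document's text.
    # Emit one (word, 1) pair for every occurrence of every word.
    for word in contents.split():
        yield (word, 1)
```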

Shuffle– The nodes now redistribute data based on the output keys produced by the map() function, so that all pairs belonging to a given word end up on one node. In simple terms, it groups all pairs with the same word (the key) and passes each group to the reduce function.

image_3
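On a real cluster this grouping happens over the network, typically by hashing each key to pick the destination node; a single-machine Python sketch of the same idea, continuing the illustrative names from above:

```python
from collections import defaultdict

def shuffle(mapped_pairs):
    # Group every (word, 1) pair by its word, so that all pairs for
    # a given word land in one bucket (on a cluster: on one node).
    groups = defaultdict(list)
    for word, count in mapped_pairs:
        groups[word].append(count)
    return groups
```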


Reduce– This takes a key together with its bag of values and returns the total count for that key.

  • Input

Key- Word

Value- Bag of occurrence counts (the 1s emitted by map).

  • Output

Key- Word

Value- Total count of the word.

image_4
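Continuing the sketch, the reduce function is just a sum over the bag of values for each word:

```python
def reduce_fn(word, counts):
    # Input: a word and the bag of 1s the mappers emitted for it.
    # Output: the word paired with its total count.
    return (word, sum(counts))
```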

Hence, the total count for each word is now available at the node responsible for that word.

This pretty much describes how every problem is solved with the MapReduce approach. The reason MapReduce became so popular is the parallelism it brings to the table.
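Chaining the three sketches on a toy corpus shows the whole pipeline; the documents below are purely illustrative, and running everything in one process stands in for what a cluster does across many machines:

```python
documents = {
    "doc1": "the quick brown fox",
    "doc2": "the lazy dog",
}

# Map phase: run map_fn over every (doc_id, contents) pair.
mapped = [pair
          for doc_id, text in documents.items()
          for pair in map_fn(doc_id, text)]

# Shuffle phase: group the emitted pairs by word.
grouped = shuffle(mapped)

# Reduce phase: sum the counts for each word.
counts = dict(reduce_fn(word, values) for word, values in grouped.items())

print(counts)
# {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```

In Hadoop, each of these phases runs as many parallel tasks across the cluster, with the framework itself handling the shuffle and the fault tolerance described earlier.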
