MapReduce is a high level programming model brought by Google for large scale parallel data processing came in 2004 and its open source implementation came in 2008 named as Apache Hadoop led by researchers at Yahoo. This led to a big bang of various large scale parallel processing enterprises which started coming from various database communities that provide more features(Pig, HBase, Hive).
Some notable and permanent contributions provided by MapReduce were:-
- Fault Tolerance- When you are working on 1000 computers at a time the probability of one of them failing is extremely high. So fault tolerance during query processing so that you won’t lose the work was something MapReduce and Hadoop paper really emphasized.
- Schema on read- Relational Databases are implemented on a fixed Schema ie a Structure is made first and then you need to fix your data to that schema. But almost all times data comes from various sources so they are not provided with the Schema already. So map reduce allows you to load the data and allows you to work over it.
- User defined functions- The experience of writing, managing and maintain code in a normal database with SQL queries was not good. The reason being that a lot of people put their logic inside the application layer as opposed to database layer. MapReduce allows you to define your functions in the application layer.
Enough said about the evolution in Database world MapReduce has brought, let’s now understand how it really works?
Let us suppose you need to determine the word count for every word present in every document available on internet. Certainly this is humongous amount of data and not a cup of tea for only one machine to process. So to do this we need to scale out and work on cluster of computers.
Now MapReduce’s job begins. MapReduce can be broken down in three steps:-
Here computation is taken to data in place of data being taken to computation. Data is divided in blocks and then each block is given to one system for processing. Map reduce can be broken down in three steps:-
Map– This takes in a key value pair and generates a key value pair.
Key- Document Id
Value- Bag of words in the document.
Value- Contains the instance if the word appears in the documents.
Shuffle– Nodes now will redistribute data based on the words in output keys (produced by the “map()” function), such that all data belonging to a word is located on one node/system. In simple terms it will group all pairs with same word or key and passes that group to the reduce function.
Reduce– This will take key value pairs and returns the count.
Value- Bag of instances.
Hence, we have a word count available at each node.
This pretty much describes what happens in every problem solving with MapReduce approach. The reason why MapReduce became so popular was due to the parallelism it bring on the table.