Big Data has arguably become the most used term in the tech world this decade. Everybody talks about it, and everybody has their own understanding of it, which has made its definition quite ambiguous. Let us deconstruct the term with an example. Suppose you are creating a database of movie ratings where rows represent user IDs, columns represent movies, and each cell holds the rating (0-5) that a user gave to the corresponding movie. This data is likely to be sparse, because in practice no user rates every movie. In a real-world setting, you can imagine how sparse this matrix becomes and how much it costs to store it in full.
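To make the storage cost concrete, here is a minimal sketch of the ratings matrix stored sparsely, keeping only the cells that actually hold a rating. The user IDs, movie IDs and ratings are made-up illustration data, and `get_rating` is a hypothetical helper, not part of any library.

```python
# Hypothetical ratings, stored sparsely: (user_id, movie_id) -> rating in 0..5.
# Cells that are absent from the dict mean "this user has not rated this movie".
ratings = {
    (0, 2): 5,
    (1, 0): 3,
    (3, 1): 4,
}
n_users, n_movies = 4, 3  # assumed matrix dimensions for this toy example


def get_rating(user_id, movie_id):
    """Return the stored rating, or None if the user never rated the movie."""
    return ratings.get((user_id, movie_id))


dense_cells = n_users * n_movies           # cells a full matrix would allocate
stored_cells = len(ratings)                # cells the sparse form actually keeps
sparsity = 1 - stored_cells / dense_cells  # fraction of empty cells

print(f"dense: {dense_cells} cells, sparse: {stored_cells} cells, "
      f"sparsity: {sparsity:.0%}")
```

Using `None` for a missing entry keeps "not rated" distinct from a genuine rating of 0, which a dense matrix filled with zeros could not do.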
MapReduce is a high-level programming model for large-scale parallel data processing that Google introduced in 2004. Its open-source implementation, Apache Hadoop, was led by researchers at Yahoo and became a top-level Apache project in 2008. This set off a wave of large-scale parallel processing projects from various database communities that added more features on top (Pig, HBase, Hive).
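The programming model itself can be sketched in a few lines. The following is a single-process toy, not Hadoop's API: `run_mapreduce`, `map_fn` and `reduce_fn` are hypothetical names, and the records are invented. It computes the average rating per movie by mapping each record to a key-value pair, grouping values by key, and reducing each group.

```python
from collections import defaultdict


def map_fn(record):
    """Map step: emit (movie_id, rating) for one input record."""
    user_id, movie_id, rating = record
    yield movie_id, rating


def reduce_fn(movie_id, ratings):
    """Reduce step: collapse all ratings of one movie into their average."""
    return movie_id, sum(ratings) / len(ratings)


def run_mapreduce(records, map_fn, reduce_fn):
    """Toy driver: map every record, group values by key, reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)  # the "shuffle": collect values per key
    return dict(reduce_fn(key, values) for key, values in groups.items())


# Made-up (user_id, movie_id, rating) records.
records = [(1, "Alien", 4), (2, "Alien", 5), (1, "Heat", 3)]
averages = run_mapreduce(records, map_fn, reduce_fn)
print(averages)
```

In a real cluster the map calls and the reduce calls would run on different machines in parallel; the grouping step in the middle is what the framework handles for you.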
Some notable and lasting contributions of MapReduce were:
- Fault tolerance: When you are working on 1,000 computers at a time, the probability that at least one of them fails is extremely high. Fault tolerance during query processing, so that you don't lose the work already done, is something the MapReduce paper and Hadoop really emphasized.
- Schema on read: Relational databases are built on a fixed schema, i.e. the structure is defined first and the data must then be made to fit it. But data almost always comes from various sources and rarely arrives with a schema. MapReduce lets you load the data as it is and impose structure when you work on it.
- User-defined functions: The experience of writing, managing and maintaining code inside a conventional database with SQL queries was not good. As a result, a lot of people put their logic in the application layer rather than the database layer. MapReduce lets you define your functions in the application layer.
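The three points above can be illustrated together in a small sketch. Everything here is an assumption made for illustration: the raw line formats, the helper names (`parse_line`, `run_with_retries`, `count_high_ratings`) and the simulated failure are invented, and a real framework re-runs failed tasks on another machine rather than in a local loop.

```python
def parse_line(line):
    """Schema on read: impose (user_id, movie, rating) only when reading.

    Lines that do not fit the expected shape are skipped, not rejected at load.
    """
    parts = line.replace(";", ",").split(",")
    if len(parts) != 3:
        return None
    user_id, movie, rating = (p.strip() for p in parts)
    try:
        return {"user_id": int(user_id), "movie": movie, "rating": int(rating)}
    except ValueError:
        return None


def run_with_retries(task, chunk, max_attempts=3):
    """Fault tolerance: re-run a failed task instead of losing the whole job."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task(chunk)
        except RuntimeError:
            if attempt == max_attempts:
                raise


# Raw lines from different hypothetical sources, loaded without a schema.
raw_lines = ["1,Alien,4", "2; Alien; 5", "corrupt line", "3,Heat,three"]

failures = {"remaining": 1}  # simulate exactly one worker failure


def count_high_ratings(lines):
    """User-defined function: ordinary application code, not SQL."""
    if failures["remaining"] > 0:
        failures["remaining"] -= 1
        raise RuntimeError("simulated worker failure")
    rows = [row for row in map(parse_line, lines) if row is not None]
    return sum(1 for row in rows if row["rating"] >= 4)


result = run_with_retries(count_high_ratings, raw_lines)
print(result)
```

The first attempt fails, the retry succeeds, and the two malformed lines are dropped at read time instead of blocking the load.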