Big Data possibly now has become the most used term in the tech world for this decade. Everybody is talking about it and everybody have their own understanding towards it which has made its definition quite ambiguous. Let us deconstruct this term with a situation. Suppose you are creating a database for movie ratings where rows indicate user IDs, columns indicate movies and the values of the cells indicates rating(0-5) given by user to the corresponding movie. Now this data is likely to be sparse as you can’t have a situation where all users have rated all movies. In real world situation you can conceive the sparsity of this database and the cost it takes to store this huge database/matrix.
Hence, in data centers data is stored in sparse representation i.e (row_index,col_index,value). This way no extra memory is used to store NULL values. Now algorithms which data scientists use to develop insights from the data involve matrix multiplications. So there is a need to develop some new algorithms to perform these matrix multiplications over this humongous data as traditional algorithm doesn’t work for data stored in sparse representation.
Hence data becomes big data when traditional algorithms doesn’t work and new algorithms are needed to be devised to do the job of these traditional algorithms. These algorithms are built on Mapreduce. If you what to get a basic understanding of Mapreduce then check this article- MapReduce explained.
Let us now understand how Matrix Multiplication is implemented with MapReduce
C= A X B
A has dimensions L,M
B has dimensions M,N
Download the data set matrix.json.
- In the map phase:
– for each element (i,j) of A, emit ((i,k),[j,A(i,j)]) for k in 1…N
– for each element (j,k) of B, emit ((i,k),[j,B(j,k)]) for i in 1..L
- In the reduce phase, emit
– value= Sum(A[i,j]*B[j,k])
Essentially we are replicating all values of A for corresponding columns in B and are replicating all values of B for corresponding rows in A.
Define this MapReduce class separately, which will take the Map and Reduce functions defined above.
You will obtain the results in the same sparse representation in self.result. That’s how matrix multiplication takes place when matrices are in sparse representation.