Category: Big Data

First day Internship at Decisionstats

Today I started my internship at Decisionstats. Ajay Ohri, the founder, briefed me that this internship is going to be all about learning, and that I’ll be writing about things that are useful for others as well. During this two-month internship I’ll be assisting him in solving problems at “https://github.com/decisionstats/pythonfordatascience” and developing open source tutorials for them. He explained that the purpose of doing this is that everyone can learn from it, and that it will also help in elevating my data science brand value.

Continue reading “First day Internship at Decisionstats”

Matrix Multiplication with MapReduce

Big Data has possibly become the most used term in the tech world this decade. Everybody is talking about it and everybody has their own understanding of it, which has made its definition quite ambiguous. Let us deconstruct the term with a situation. Suppose you are creating a database for movie ratings where rows represent user IDs, columns represent movies, and the value of each cell is the rating (0-5) a user gave to the corresponding movie. This data is likely to be sparse, since you can’t have a situation where every user has rated every movie. In a real-world setting you can imagine just how sparse this database would be, and what it would cost to store such a huge matrix in full.
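
To make the storage point concrete, here is a minimal sketch of a sparse representation that keeps only the cells which actually hold a rating; the matrix dimensions and the sample ratings below are made-up numbers for illustration:

```python
# Sparse vs. dense storage for a ratings matrix (illustrative numbers only).
n_users, n_movies = 1_000_000, 50_000

# A dense matrix stores every (user, movie) cell, rated or not.
dense_bytes = n_users * n_movies * 4            # 4-byte cells
print(f"Dense matrix: ~{dense_bytes / 1e9:.0f} GB, mostly zeros")

# A sparse representation keeps only the ratings that exist,
# as (user_id, movie_id) -> rating entries.
ratings = {
    (0, 17): 5,
    (0, 42): 3,
    (2, 17): 4,
    # ... one entry per actual rating
}

def get_rating(user_id, movie_id):
    """Cells that were never rated are simply absent, not stored as zeros."""
    return ratings.get((user_id, movie_id))

print(get_rating(0, 42))   # 3
print(get_rating(1, 42))   # None -> user 1 never rated movie 42
```

(row, column, value) triples like these are also the natural input records for MapReduce-style matrix multiplication.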

Continue reading “Matrix Multiplication with MapReduce”

How Eventual Consistency works?

Amazon’s DynamoDB pioneered the idea of Eventual Consistency as a way to achieve higher availability and scalability.

A big dataset is broken into chunks and these chunks are sent to different machines. Replicas of these chunks are also sent to other machines to provide fault tolerance. So the two requirements we need to deal with here are:
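
Before getting to those requirements, here is a rough sketch of the chunk-and-replicate setup just described; the chunk size, machine names and replication factor are illustrative assumptions, not DynamoDB’s actual parameters:

```python
def split_into_chunks(records, chunk_size):
    """Break a big dataset into fixed-size chunks."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]

def assign_with_replicas(chunks, machines, replication_factor=2):
    """Place each chunk on `replication_factor` distinct machines."""
    placement = {}
    for chunk_id, _ in enumerate(chunks):
        placement[chunk_id] = [
            machines[(chunk_id + r) % len(machines)]
            for r in range(replication_factor)
        ]
    return placement

records = list(range(10))                      # stand-in for a big dataset
chunks = split_into_chunks(records, chunk_size=3)
machines = ["machine-A", "machine-B", "machine-C"]

print(assign_with_replicas(chunks, machines))
# {0: ['machine-A', 'machine-B'], 1: ['machine-B', 'machine-C'], ...}
# If machine-A dies, chunk 0 can still be served from machine-B.
```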

Continue reading “How Eventual Consistency works?”

Why GPU development has taken such a huge leap in recent years?

Suppose there is a farmer who wants to plow his farm. He has only a limited sum of money to buy livestock to get the job done: he can buy either 1024 chickens or 2 strong oxen. The smart choice would be the 1024 chickens. The essential idea is parallelism: we can solve large problems by breaking them into smaller pieces and running those pieces at the same time.
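
As a small illustration of that idea, the sketch below splits one big job into many independent pieces and hands them to a pool of workers; the plow_strip function and the numbers are made up for the example:

```python
# Split a big job into pieces and run them at the same time.
from multiprocessing import Pool

def plow_strip(strip_id):
    """Stand-in for one small, independent piece of a big job."""
    return sum(i * i for i in range(100_000))   # some busy work

if __name__ == "__main__":
    n_strips = 1024                              # the "1024 chickens"
    with Pool() as pool:                         # one worker per CPU core
        results = pool.map(plow_strip, range(n_strips))
    print(f"finished {len(results)} strips in parallel")
```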

Continue reading “Why GPU development has taken such a huge leap in recent years?”

CNN Ecosphere- A magnificent work on Data Visualization

In Data Science, a skill that distinguishes you as a Data Scientist from statisticians, programmers or database engineers is how well you can communicate your results and findings. Most of the time, the people to whom you have to explain your results have no background in analytics, so delivering mere numbers is not a good idea. Beyond that, understanding the data is imperative for a data scientist before doing any analytics on it. Hence, Data Visualization is essential not just for communicating results but also for getting insights from the data.

Continue reading “CNN Ecosphere- A magnificent work on Data Visualization”

MapReduce explained

MapReduce is a high-level programming model introduced by Google in 2004 for large-scale parallel data processing; its open source implementation, Apache Hadoop, led by researchers at Yahoo, followed in 2008. This set off a big bang of large-scale parallel processing systems from various database communities that provide more features (Pig, HBase, Hive).

Some notable and lasting contributions of MapReduce were:

  • Fault tolerance - When you are working on 1000 computers at a time, the probability that one of them fails is extremely high. Fault tolerance during query processing, so that you don’t lose your work when a machine dies, was something the MapReduce and Hadoop papers really emphasized.
  • Schema on read - Relational databases are built on a fixed schema, i.e. a structure is defined first and you then need to fit your data to that schema. But data almost always comes from various sources and does not arrive with a schema attached. MapReduce lets you load the data first and work on it as it is.
  • User-defined functions - The experience of writing, managing and maintaining code in a conventional database with SQL queries was not good, because a lot of the logic ends up in the application layer rather than the database layer. MapReduce lets you define your own functions in the application layer, as in the sketch after this list.
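
To make the last point concrete, here is a bare-bones, single-machine sketch of the programming model: the user supplies the map and reduce functions, and a toy runner does the grouping in between. This is only an illustration of the idea, not Hadoop’s actual API:

```python
from collections import defaultdict

def map_fn(document):
    """User-defined map: emit (word, 1) for every word in a document."""
    for word in document.split():
        yield word.lower(), 1

def reduce_fn(word, counts):
    """User-defined reduce: sum the counts emitted for one key."""
    return word, sum(counts)

def run_mapreduce(documents, map_fn, reduce_fn):
    # Shuffle phase: group all intermediate values by key.
    grouped = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            grouped[key].append(value)
    # Reduce phase: one call per key.
    return dict(reduce_fn(k, v) for k, v in grouped.items())

docs = ["big data is big", "map reduce handles big data"]
print(run_mapreduce(docs, map_fn, reduce_fn))
# {'big': 3, 'data': 2, 'is': 1, 'map': 1, 'reduce': 1, 'handles': 1}
```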

Continue reading “MapReduce explained”

P-value explained

The other day I was reading an answer by an interviewer on Quora, “What-is-a-typical-data-scientist-interview-like”, where he wrote: “What is P-Value? – I expect candidates to know to explain to me what a P-Value is and what P-Value means (even at 4am…)”. This pretty much justifies the importance of understanding the P-value.

There are so many definitions already available on the web, but I still always had difficulty understanding its significance. I believe many others from a non-statistical background would empathize with this. So let me give a somewhat intuitive understanding of the p-value.
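
As a taste of that intuition, here is a small simulation-based sketch; the coin-flip example and all the numbers are my own illustration, not taken from the post. If a coin lands heads 60 times out of 100, the p-value asks how often a fair coin would produce a result at least that extreme:

```python
import random

random.seed(0)
observed_heads = 60
n_flips = 100
n_simulations = 10_000

# Simulate the null hypothesis: a fair coin flipped 100 times, many times over.
extreme = 0
for _ in range(n_simulations):
    heads = sum(random.random() < 0.5 for _ in range(n_flips))
    if heads >= observed_heads:          # at least as extreme as what we observed
        extreme += 1

p_value = extreme / n_simulations
print(f"p-value is roughly {p_value:.3f}")   # around 0.03: unlikely if the coin is fair
```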

Continue reading “P-value explained”