Month: September 2015

Playing around with Box and Whisker plot.

The Box and Whisker plot, commonly known as a Box Plot, is one of the most useful and informative visualizations for showing a distribution at a glance, which makes it a natural tool for data scientists to use in their analyses.

Let us understand how to read a Box Plot. Suppose you are surveying users of a product X, asking how many times each of them used it in the past week, and then you plot the collected data as a Box Plot.
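
To make this concrete, here is a minimal sketch in R with simulated usage counts (the numbers are made up for illustration): the box spans the first to third quartile, the line inside is the median, the whiskers reach out to 1.5 times the interquartile range, and anything beyond is drawn as an outlier.

# Simulated survey responses: times each user used product X last week (made up)
set.seed(42)
usage <- c(rpois(95, lambda = 4), 25, 28, 30)  # a few heavy users as outliers

boxplot(usage,
        main = "Weekly usage of product X",
        ylab = "Times used in the past week")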

Continue reading “Playing around with Box and Whisker plot.”

Why GPU development has taken such a huge leap in recent years?

Suppose a farmer wants to plow his field. He has only a limited sum of money to buy livestock to get the job done: he can buy either 1024 chickens or 2 strong oxen. The smart choice is the 1024 chickens. The essential idea is parallelism: we can solve large problems by breaking them into smaller pieces and running those pieces at the same time.
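
The same idea can be sketched in a few lines of R with the parallel package, with CPU cores standing in for the chickens (a toy illustration, not GPU code):

library(parallel)

# One big job: sum the square roots of a million numbers
x <- sqrt(1:1e6)                          # doubles, so no integer overflow
pieces <- split(x, cut(seq_along(x), 4))  # break the job into smaller pieces

# "Two oxen": one worker grinds through the pieces in sequence
serial_result <- sum(sapply(pieces, sum))

# "1024 chickens": several workers each take a piece at the same time
# (mclapply forks workers on Unix-alikes; on Windows use parLapply instead)
parallel_result <- sum(unlist(mclapply(pieces, sum)))

all.equal(serial_result, parallel_result)  # TRUE - same answer either way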

Continue reading “Why GPU development has taken such a huge leap in recent years?”

CNN Ecosphere- A magnificent work on Data Visualization

In Data Science, the skill that distinguishes you as a Data Scientist from statisticians, programmers, or database engineers is how well you can communicate your results and findings. Most of the time, the people you explain your results to have no background in analytics, so delivering bare numbers is not a good idea. Beyond that, understanding the data is imperative before a data scientist can analyze it. Hence, Data Visualization is essential not just for communicating results but also for getting insights from the data.

Continue reading “CNN Ecosphere- A magnificent work on Data Visualization”

Frequentist vs Bayesian- A Never Ending Debate

Nineteenth-century statistics was Bayesian, while twentieth-century statistics was Frequentist, at least from the point of view of most scientific practitioners. The Bayesian-Frequentist debate reflects two different attitudes toward modeling, and both look quite legitimate.

In simple terms, Bayesian statisticians are individual researchers, or research groups, trying to use all the information they have to make the quickest possible progress, while Frequentist statisticians draw conclusions from sample data by emphasizing only the frequency or proportion of the data, with no prior knowledge about it. Hence, in the Bayesian view we bring in some prior knowledge, while in the Frequentist view we don’t. You can find a more intuitive layman’s example of the difference between the two here.
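
As a toy illustration (my own, not from the post), here is how the two camps would estimate a coin’s probability of heads after seeing 7 heads in 10 flips, assuming a Beta prior on the Bayesian side:

# Observed data: 7 heads in 10 flips
heads <- 7; flips <- 10

# Frequentist: use the sample data only - the maximum-likelihood estimate
p_freq <- heads / flips                   # 0.7

# Bayesian: combine the data with a prior. With a Beta(a, b) prior on p,
# the posterior is Beta(a + heads, b + tails) by conjugacy.
a <- 2; b <- 2                            # mild prior belief that p is near 0.5
p_bayes <- (a + heads) / (a + b + flips)  # posterior mean = 9/14, about 0.64

# The prior pulls the estimate toward 0.5; with more data the two converge
c(frequentist = p_freq, bayesian = p_bayes)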

Continue reading “Frequentist vs Bayesian- A Never Ending Debate”

Getting inside the brain of a Neural Network

Neural Networks have become so popular due to their ability to approximate almost any function through feature learning when given enough data. Features are the information you feed the network: the more features, the more information you provide. They are primarily used to solve classification problems, though research continues on making them work well for regression problems too.
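
As a minimal sketch (the post itself may use a different setup), here is a one-hidden-layer network trained on the built-in iris data with the nnet package, where the four flower measurements serve as the features:

library(nnet)  # single-hidden-layer neural networks

set.seed(1)
fit <- nnet(Species ~ ., data = iris,
            size = 5,       # 5 hidden units
            decay = 1e-3,   # weight-decay regularization
            maxit = 200,    # training iterations
            trace = FALSE)

# Classification accuracy on the training data
preds <- predict(fit, iris, type = "class")
mean(preds == iris$Species)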

Continue reading “Getting inside the brain of a Neural Network”

Review of Datacamp’s Introduction to R

Introduction to R is a good course for getting started with R. It gives a solid practical introduction to working with data in the form of Vectors, Matrices, Factors, Data Frames, and Lists (a one-line taste of each is sketched below), and every topic comes with a set of exercises wrapped in an engaging story. The course takes about four hours.
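
For a quick feel of what those structures look like, here is one line of R for each (my own examples, not the course’s):

v <- c(1, 2, 3)                                  # vector: one type, one dimension
m <- matrix(1:6, nrow = 2)                       # matrix: one type, two dimensions
f <- factor(c("low", "high", "low"))             # factor: categorical data
df <- data.frame(x = 1:3, y = c("a", "b", "c"))  # data frame: mixed-type columns
l <- list(numbers = v, table = df)               # list: anything, possibly nested
str(l)                                           # inspect the structure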

Continue reading “Review of Datacamp’s Introduction to R”

MapReduce explained

MapReduce is a high-level programming model for large-scale parallel data processing, introduced by Google in 2004; its open-source implementation, Apache Hadoop, followed in 2008, led by researchers at Yahoo. This set off a big bang of large-scale parallel processing systems from the database community, each adding more features (Pig, HBase, Hive).

Some notable and lasting contributions made by MapReduce were (a word-count sketch of the model follows the list):

  • Fault tolerance- When you are working on 1000 computers at a time, the probability that one of them fails is extremely high. Fault tolerance during query processing, so that you don’t lose your work, was something the MapReduce and Hadoop papers really emphasized.
  • Schema on read- Relational databases are built on a fixed schema, i.e., a structure is defined first and your data has to be fitted to it. But data almost always comes from various sources that don’t arrive with a schema attached. MapReduce lets you load the data as-is and work over it.
  • User-defined functions- The experience of writing, managing, and maintaining code in a conventional database with SQL queries was not good, because a lot of people keep their logic in the application layer rather than the database layer. MapReduce lets you define your functions directly in the application layer.
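
Here is the canonical word-count example sketched in plain R (no Hadoop involved) just to show the map, shuffle, and reduce steps the model is built around:

# Three tiny "documents"
docs <- c("the quick brown fox", "the lazy dog", "the quick dog")

# Map: each document emits (word, 1) pairs
mapped <- unlist(lapply(docs, function(d) {
  words <- strsplit(d, " ")[[1]]
  setNames(rep(1, length(words)), words)
}))

# Shuffle: group the pairs by key (the word)
grouped <- split(unname(mapped), names(mapped))

# Reduce: sum the counts for each key
counts <- sapply(grouped, sum)
counts  # the: 3, quick: 2, dog: 2, ...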

Continue reading “MapReduce explained”

Regression analysis with R

R is a great tool for quick visualization and for playing around with data. Here I’ll use the Boston data set, do some visualization with the ggplot2 library, and then apply one machine learning algorithm.

library(ggplot2)  # plotting
library(MASS)     # contains the Boston housing data set
data(Boston)      # load the data into the workspace

You can find a detailed description of the data set in its documentation using:

?Boston
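
The post doesn’t say here which variables or algorithm it picks, so as a stand-in sketch, here is one relationship worth plotting (median home value against lower-status population percentage) with a simple linear regression as the machine-learning step:

# Scatter plot of medv vs lstat with a fitted regression line
ggplot(Boston, aes(x = lstat, y = medv)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm") +
  labs(x = "% lower-status population", y = "Median home value ($1000s)")

# A simple linear regression as the modeling step
fit <- lm(medv ~ lstat, data = Boston)
summary(fit)$coefficients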

Continue reading “Regression analysis with R”

P-value explained

The other day I was reading an interviewer’s answer on Quora, “What-is-a-typical-data-scientist-interview-like”, where he wrote: “What is a P-Value? – I expect candidates to be able to explain to me what a P-Value is and what a P-Value means (even at 4am…)”. This pretty much justifies the importance of understanding the P-value.

There are plenty of definitions on the web already, yet I have always had difficulty grasping its significance, and I believe many others from a non-statistical background would empathize with this. So let me give a somewhat more intuitive understanding of the p-value.
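
Before diving in, here is a tiny simulated example (my own, not from the post) of where a p-value actually comes from:

# Two groups of simulated measurements with slightly different true means
set.seed(7)
a <- rnorm(30, mean = 10, sd = 2)
b <- rnorm(30, mean = 11, sd = 2)

# The p-value: the probability of seeing a difference at least this large
# if the two true means were actually equal (the null hypothesis)
t.test(a, b)$p.value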

Continue reading “P-value explained”

Machine Learning’s Evolution to Deep Learning

Right now there is big hype around Machine Learning and Big Data in the tech world. This is not surprising, as they have played a significant role in automation, business advancement, and prediction. Alongside them, Deep Learning has also become a popular term in recent times. One interesting fact about deep learning is that it was abandoned in the late 1980s, until Geoffrey Hinton brought an algorithm in 2007 that reignited research in the field.

Continue reading “Machine Learning’s Evolution to Deep Learning”