Playing around with Box and Whisker plot.

The Box and Whisker plot, commonly known as the Box Plot, is one of the most useful and informative data visualizations for showing a distribution at a glance. That makes it a good visualization for data scientists to use in their analysis.

Let us understand how to read a Box Plot. Suppose you are conducting a survey for a product X, asking users how many times they have used it in the past week. You then plot the collected data using a Box Plot.

box plot

The section above the white line represents the top 50% of the population which used product X. Those in the top 25% of product X users (above the 3rd quartile) are shown by the top “whisker” and the dots. The dots represent outliers, and if more than one outlier has the same value, the dots are placed side by side.

The white line in the Box Plot represents the median. It is an essential statistic because, when the data contains a few extreme values, it is more robust than the mean. A few people with a huge product X usage count will drag the mean upwards, which isn’t representative of most users. The median is resistant to such changes, since it marks the halfway point of all data points: half of the observations lie below it and half above, so it gives a reliable location for the distribution.
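To make the robustness argument concrete, here is a small sketch in base R. The usage counts are made-up numbers for illustration, not real survey data; the variable names are hypothetical too.

```r
# Hypothetical weekly usage counts for product X (invented for illustration).
usage <- c(1, 2, 2, 3, 3, 4, 5)

mean(usage)    # average usage of the typical users
median(usage)  # halfway point of the same users

# Add a single very heavy user and recompute both statistics.
usage_with_heavy_user <- c(usage, 100)

mean(usage_with_heavy_user)    # pulled far upwards by the one extreme value
median(usage_with_heavy_user)  # stays where most users are
```

In this toy example the median stays at 3 in both cases, while the mean jumps to 15 once the heavy user is added.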

One more parameter which can be determined is the Interquartile Range (IQR). It is a measure of how spread out the middle 50% of your data is around the median, and is calculated as the difference between the third and the first quartile of a data set (IQR = Q3 - Q1). In the plot, it is the height of the box.
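As a rough sketch, the IQR can be computed in base R with quantile() and IQR(). The data below is again invented for illustration. The 1.5 × IQR cutoff used to flag outliers is the common convention (and the default rule in ggplot2's geom_boxplot); the plot described above may have used a different rule, so treat that part as an assumption.

```r
# Hypothetical usage counts, including one heavy user (invented data).
usage <- c(1, 2, 2, 3, 3, 4, 5, 100)

# First and third quartiles (R's default quantile method).
q <- quantile(usage, c(0.25, 0.75), names = FALSE)
q1 <- q[1]
q3 <- q[2]

# Interquartile range: Q3 - Q1. IQR() computes the same quantity directly.
iqr <- q3 - q1
stopifnot(isTRUE(all.equal(iqr, IQR(usage))))

# Assumed convention: points more than 1.5 * IQR beyond the quartiles
# are drawn as individual dots (outliers).
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr
outliers <- usage[usage < lower_fence | usage > upper_fence]
outliers  # only the heavy user (100) is flagged
```

Note that box plot functions often compute the box edges as "hinges", which can differ slightly from quantile() on small samples.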

Let us create some box plots in R using the diamonds data set, which is available in the ggplot2 library.

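library(ggplot2)
data(diamonds)
# One box per cut, showing the distribution of price.
ggplot(diamonds, aes(x = cut, y = price)) +
  geom_boxplot(color = "black", fill = "brown") +
  # Zoom the y-axis without dropping data, so the box statistics are unchanged.
  coord_cartesian(ylim = c(500, 10500))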

box_plot_diamonds

To back the plot with numbers, you can use the by() function, which applies summary() to the prices within each cut.

by(diamonds$price,diamonds$cut,summary)
## diamonds$cut: Fair
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     337    2050    3282    4359    5206   18570 
## -------------------------------------------------------- 
## diamonds$cut: Good
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     327    1145    3050    3929    5028   18790 
## -------------------------------------------------------- 
## diamonds$cut: Very Good
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     336     912    2648    3982    5373   18820 
## -------------------------------------------------------- 
## diamonds$cut: Premium
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     326    1046    3185    4584    6296   18820 
## -------------------------------------------------------- 
## diamonds$cut: Ideal
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     326     878    1810    3458    4678   18810
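
The quartiles in this output are enough to work out the IQR for each cut. The sketch below simply copies the 1st Qu. and 3rd Qu. values from the summary above into two vectors (the vector names are my own) and subtracts them. Because summary() prints rounded values, the results may differ slightly from an IQR computed on the raw data.

```r
# 1st and 3rd quartiles of price, copied from the summary output above.
q1 <- c(Fair = 2050, Good = 1145, `Very Good` = 912, Premium = 1046, Ideal = 878)
q3 <- c(Fair = 5206, Good = 5028, `Very Good` = 5373, Premium = 6296, Ideal = 4678)

# IQR = Q3 - Q1 for each cut: the height of each box in the plot.
iqr_by_cut <- q3 - q1
iqr_by_cut

# Cuts with the narrowest and the widest middle 50% of prices.
names(which.min(iqr_by_cut))
names(which.max(iqr_by_cut))
```

With ggplot2 loaded, by(diamonds$price, diamonds$cut, IQR) should give the same quantities directly from the unrounded data.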