Regression analysis with R

R is a great tool for quick visualization and for exploring data. Here I’ll use the Boston data set from the MASS package, visualize it with the ggplot2 library, and then fit one machine learning algorithm to it.

library(ggplot2)
library(MASS)
data(Boston)

You can find a detailed description of the data set in its documentation:

?Boston

Now that we have the data, let’s play with it. First look at the types of the variables with str() and head(), then at their pairwise relationships with a scatterplot matrix.

str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
head(Boston)
##      crim zn indus chas   nox    rm  age    dis rad tax ptratio  black
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12
##   lstat medv
## 1  4.98 24.0
## 2  9.14 21.6
## 3  4.03 34.7
## 4  2.94 33.4
## 5  5.33 36.2
## 6  5.21 28.7
pairs(Boston,col="brown")

If you look closely at the plot, you’ll see a pattern between the lstat and medv variables.

lstat- lower status of the population (percent).

medv- median value of owner-occupied homes in $1000s.
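
The pattern spotted in the scatterplot matrix can also be checked numerically. The sketch below is not part of the original analysis; it ranks the other variables by the absolute value of their Pearson correlation with medv, and the names `cors` and `strongest` are made up for illustration. Correlation only measures linear association, so treat it as a rough screen rather than proof of a relationship.

```r
library(MASS)  # provides the Boston data frame

# Pearson correlation of every column with medv
cors <- cor(Boston)[, "medv"]

# Drop medv's correlation with itself
cors <- cors[names(cors) != "medv"]

# Order the variables by strength of association, strongest first
strongest <- sort(abs(cors), decreasing = TRUE)

# Print the signed correlations in that order
print(round(cors[names(strongest)], 2))
```

A negative sign for lstat would match what the plot suggests: home values fall as the share of lower-status population rises.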

Let’s frame a toy problem for ourselves just to get a better understanding.

Given the percentage of lower-status population (lstat), predict the median home value (medv). Before modelling, it is always good practice to start the analysis with a histogram.

ggplot(aes(x=medv),data=Boston)+
  geom_histogram(color="blue",fill="red",binwidth=1)+
  scale_x_continuous(limits=c(4,51), breaks=seq(4,51,1))+
  scale_y_continuous(breaks=seq(0,40))
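
Bar heights can be hard to read off a histogram, so it may help to tabulate the counts as well. This is a minimal sketch, not part of the original post: it groups medv into 1-unit intervals with `floor()`, which may not line up exactly with the bin edges ggplot2 chooses, and the names `bin_counts` and `n_at_max` are made up for illustration.

```r
library(MASS)  # provides the Boston data frame

# Count observations per 1-unit interval of medv: [5, 6), [6, 7), ...
bin_counts <- table(floor(Boston$medv))
print(bin_counts)

# Number of cases sitting exactly at the largest recorded value
n_at_max <- sum(Boston$medv == max(Boston$medv))
print(n_at_max)
```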

Observations: most cases have a median value between 19 and 25. The highest bin, at 50, contains 16 cases. Let’s now draw a scatter plot of lstat against medv. Setting alpha makes the points semi-transparent, which reduces overplotting and shows where the bulk of the data lies.

ggplot(aes(x=lstat,y=medv),data=Boston)+
  geom_point(alpha=1/3)

Looking at the plot, it seems a quadratic curve would fit the data. The model below regresses medv on the square of lstat only, with no linear term.

fit <- lm(medv ~ I(lstat^2), data = Boston)
plot(Boston$medv~Boston$lstat)
points(Boston$lstat,fitted(fit),col="blue",pch=20)
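
Since the earlier plots use ggplot2, the same overlay can be sketched there too. This is one possible way to do it, not the post’s own code: the fitted values are added as a column (the name `fitted_medv` is made up for illustration) and drawn with `geom_line()`, which connects the points in order of lstat.

```r
library(ggplot2)
library(MASS)  # provides the Boston data frame

# Same model as in the post: medv on lstat squared, no linear term
fit <- lm(medv ~ I(lstat^2), data = Boston)

# Attach the fitted values so ggplot can map them to an aesthetic
plot_data <- transform(Boston, fitted_medv = fitted(fit))

p <- ggplot(plot_data, aes(x = lstat, y = medv)) +
  geom_point(alpha = 1/3) +                          # raw data, semi-transparent
  geom_line(aes(y = fitted_medv), colour = "blue")   # fitted curve
print(p)
```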

predict(fit,data.frame(lstat=c(15,20,30)))
##         1         2         3 
## 22.193295 17.951219  5.831003