R PROJECT
June 9, 2021
1 PREDICTION MODEL FOR BOSTON HOUSE PRICE
1.1 Notebook Imports and Packages
[32]: # If one of the packages is missing, install it with: install.packages("package_name")
library(psych)
library(e1071)
library(ggplot2)
#library(heatmaply)
library(broom)
library(car)
library(repr)
library(reshape2)
1.2 Question
Can we predict Boston house prices using a multivariable regression model?
1.3 Gather Data
Source: the original research paper. For this project we're going to work with a pre-built dataset that the scikit-learn Python library provides (documentation: sklearn load_boston).
1.3.1 Python Code for generating csv dataset
from sklearn import datasets
import pandas as pd
data = datasets.load_boston()
df = pd.DataFrame(data=data['data'], columns = data['feature_names'])
df['PRICE']=data['target']
df.to_csv('boston.csv', sep = ',', index = False)
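Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so this snippet requires an older scikit-learn version (or simply reuse the generated boston.csv).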
1.4 Exploring the data
[33]: boston_dataset <- read.csv(file="boston.csv")
[34]: summary(boston_dataset)
str(boston_dataset)
cat("class : ",class(boston_dataset),"\n")
cat ("number of rows : ",nrow(boston_dataset),"\n")
cat ("number of features : " ,ncol(boston_dataset),"\n")
cat("dataset dimension : " ,dim(boston_dataset),"\n")
CRIM ZN INDUS CHAS
Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
NOX RM AGE DIS
Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
RAD TAX PTRATIO B
Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
Median : 5.000 Median :330.0 Median :19.05 Median :391.44
Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
LSTAT PRICE
Min. : 1.73 Min. : 5.00
1st Qu.: 6.95 1st Qu.:17.02
Median :11.36 Median :21.20
Mean :12.65 Mean :22.53
3rd Qu.:16.95 3rd Qu.:25.00
Max. :37.97 Max. :50.00
'data.frame': 506 obs. of 14 variables:
$ CRIM : num 0.00632 0.02731 0.02729 0.03237 0.06905
$ ZN : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5
$ INDUS : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87
$ CHAS : num 0 0 0 0 0 0 0 0 0 0
$ NOX : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524
$ RM : num 6.58 6.42 7.18 7 7.15
$ AGE : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9
$ DIS : num 4.09 4.97 4.97 6.06 6.06
$ RAD : num 1 2 2 3 3 3 5 5 5 5
$ TAX : num 296 242 242 222 222 222 311 311 311 311
$ PTRATIO: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2
$ B : num 397 397 393 395 397
$ LSTAT : num 4.98 9.14 4.03 2.94 5.33
$ PRICE : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9
class : data.frame
number of rows : 506
number of features : 14
dataset dimension : 506 14
[35]: # checking for missing data (anyNA scans every cell for NA values)
anyNA(boston_dataset)
FALSE
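For a finer-grained check, the NA values can be counted per column; a minimal sketch:

# count missing values in each feature; all zeros confirms a complete dataset
colSums(is.na(boston_dataset))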
1.4.1 Exploring the features
Data Set Characteristics:
:Number of Instances: 506
:Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.
:Attribute Information (in order):
- CRIM per capita crime rate by town
- ZN proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS proportion of non-retail business acres per town
- CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX nitric oxides concentration (parts per 10 million), used to measure the pollution rate
- RM average number of rooms per dwelling
- AGE proportion of owner-occupied units built prior to 1940
- DIS weighted distances to five Boston employment centres
- RAD index of accessibility to radial highways
- TAX full-value property-tax rate per $10,000
- PTRATIO pupil-teacher ratio by town
- B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT % lower status of the population
- MEDV Median value of owner-occupied homes in $1000's
:Missing Attribute Values: None
:Creator: Harrison, D. and Rubinfeld, D.L.
This is a copy of the UCI ML housing dataset: https://archive.ics.uci.edu/ml/machine-learning-databases/housing/
1.5 Visualising Data - Histograms, Distributions
[36]: #top rows
head(boston_dataset)
A data.frame: 6 × 14
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT PRICE
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98 24.0
2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14 21.6
3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03 34.7
4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94 33.4
5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33 36.2
6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21 28.7
[37]: options(repr.plot.width=14, repr.plot.height=9)
ggplot(data=boston_dataset, aes(x=PRICE)) +
geom_histogram(aes(y =..count..),
bins=30,
col="red",
fill="#2196f3",
alpha = .2) +
labs(title="Histogram for PRICE", x="Price", y="N of Houses",size=14)
[38]: skewness(boston_dataset$PRICE)
1.10153731064787
The data is positively skewed. Common transformations to normalize such data include the square root, cube root, and log; later on we're going to use a log transformation.
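To justify that choice, we can compare the skewness left by each candidate transformation; a quick sketch using the skewness function from e1071:

# skewness of PRICE under the three candidate transformations
skewness(sqrt(boston_dataset$PRICE))   # square root
skewness(boston_dataset$PRICE^(1/3))   # cube root
skewness(log(boston_dataset$PRICE))    # log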
[39]: options(repr.plot.width=14, repr.plot.height=9)
ggplot(data=boston_dataset, aes(x=RM)) +
geom_histogram(aes(y =..count..),
bins=10,
col="red",
fill="#2196f3",
alpha = .2) +
labs(title="Histogram for rooms", x="Number of rooms", y="N of Houses")
[40]: mean(boston_dataset$RM)
6.28463438735178
[41]: skewness(boston_dataset$RM)
0.401222328753187
This distribution tends toward a normal distribution; most of the houses have five to six rooms.
[42]: options(repr.plot.width=14, repr.plot.height=9)
ggplot(data=boston_dataset, aes(x=RAD)) +
geom_histogram(aes(y =..count..),
bins=24,
col="red",
fill="#2196f3",
alpha = .2) +
labs(title="Histogram for RAD", x="Accessibility to radial highways", y="N of Houses")
1.6 Descriptive statistics
[44]: mean(boston_dataset$CHAS)
median(boston_dataset$CHAS)
min(boston_dataset$PRICE)
max(boston_dataset$PRICE)
# number of occurrences
table(boston_dataset$CHAS)
0.0691699604743083
0
5
50
0 1
471 35
[45]: summary(boston_dataset)
CRIM ZN INDUS CHAS
Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
NOX RM AGE DIS
Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
RAD TAX PTRATIO B
Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
Median : 5.000 Median :330.0 Median :19.05 Median :391.44
Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
LSTAT PRICE
Min. : 1.73 Min. : 5.00
1st Qu.: 6.95 1st Qu.:17.02
Median :11.36 Median :21.20
Mean :12.65 Mean :22.53
3rd Qu.:16.95 3rd Qu.:25.00
Max. :37.97 Max. :50.00
1.6.1 KEY TAKEAWAYS
Correlation coecients are used to measure the strength of the relationship between two
variables.
Pearson correlation is the one most commonly used in statistics. This measures the strength
and direction of a linear relationship between two variables.
8
Values always range between -1 (strong negative relationship) and +1 (strong positive rela-
tionship). Values at or close to zero imply a weak or no linear relationship.
Correlation coecient values less than +0.8 or greater than -0.8 are not considered signicant.
1.7 Correlation
1.8 $\rho_{XY} = \mathrm{corr}(X, Y)$
1.9 $-1.0 \le \rho_{XY} \le +1.0$
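As a sanity check, the Pearson coefficient can be computed straight from its definition, the covariance divided by the product of the standard deviations; a minimal sketch:

# Pearson correlation from the definition: cov(X, Y) / (sd(X) * sd(Y))
x <- boston_dataset$RM
y <- boston_dataset$PRICE
cov(x, y) / (sd(x) * sd(y))   # matches cor(x, y)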
[46]: cor(boston_dataset$PRICE,boston_dataset$RM, use = "everything",
method = "pearson")
0.695359947071539
[47]: cor(boston_dataset$PRICE,boston_dataset$PTRATIO, use = "everything",
method = "pearson")
-0.507786685537562
[48]: cormat <- cor(boston_dataset,y=NULL, use = "everything",
method = "pearson")
cormat
A matrix: 14 × 14 of type dbl
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT PRICE
CRIM 1.00000000 -0.20046922 0.40658341 -0.055891582 0.42097171 -0.21924670 0.35273425 -0.37967009 0.625505145 0.58276431 0.2899456 -0.38506394 0.4556215 -0.3883046
ZN -0.20046922 1.00000000 -0.53382819 -0.042696719 -0.51660371 0.31199059 -0.56953734 0.66440822 -0.311947826 -0.31456332 -0.3916785 0.17552032 -0.4129946 0.3604453
INDUS 0.40658341 -0.53382819 1.00000000 0.062938027 0.76365145 -0.39167585 0.64477851 -0.70802699 0.595129275 0.72076018 0.3832476 -0.35697654 0.6037997 -0.4837252
CHAS -0.05589158 -0.04269672 0.06293803 1.000000000 0.09120281 0.09125123 0.08651777 -0.09917578 -0.007368241 -0.03558652 -0.1215152 0.04878848 -0.0539293 0.1752602
NOX 0.42097171 -0.51660371 0.76365145 0.091202807 1.00000000 -0.30218819 0.73147010 -0.76923011 0.611440563 0.66802320 0.1889327 -0.38005064 0.5908789 -0.4273208
RM -0.21924670 0.31199059 -0.39167585 0.091251225 -0.30218819 1.00000000 -0.24026493 0.20524621 -0.209846668 -0.29204783 -0.3555015 0.12806864 -0.6138083 0.6953599
AGE 0.35273425 -0.56953734 0.64477851 0.086517774 0.73147010 -0.24026493 1.00000000 -0.74788054 0.456022452 0.50645559 0.2615150 -0.27353398 0.6023385 -0.3769546
DIS -0.37967009 0.66440822 -0.70802699 -0.099175780 -0.76923011 0.20524621 -0.74788054 1.00000000 -0.494587930 -0.53443158 -0.2324705 0.29151167 -0.4969958 0.2499287
RAD 0.62550515 -0.31194783 0.59512927 -0.007368241 0.61144056 -0.20984667 0.45602245 -0.49458793 1.000000000 0.91022819 0.4647412 -0.44441282 0.4886763 -0.3816262
TAX 0.58276431 -0.31456332 0.72076018 -0.035586518 0.66802320 -0.29204783 0.50645559 -0.53443158 0.910228189 1.00000000 0.4608530 -0.44180801 0.5439934 -0.4685359
PTRATIO 0.28994558 -0.39167855 0.38324756 -0.121515174 0.18893268 -0.35550149 0.26151501 -0.23247054 0.464741179 0.46085304 1.0000000 -0.17738330 0.3740443 -0.5077867
B -0.38506394 0.17552032 -0.35697654 0.048788485 -0.38005064 0.12806864 -0.27353398 0.29151167 -0.444412816 -0.44180801 -0.1773833 1.00000000 -0.3660869 0.3334608
LSTAT 0.45562148 -0.41299457 0.60379972 -0.053929298 0.59087892 -0.61380827 0.60233853 -0.49699583 0.488676335 0.54399341 0.3740443 -0.36608690 1.0000000 -0.7376627
PRICE -0.38830461 0.36044534 -0.48372516 0.175260177 -0.42732077 0.69535995 -0.37695457 0.24992873 -0.381626231 -0.46853593 -0.5077867 0.33346082 -0.7376627 1.0000000
[49]: # mask for the upper triangle (including the diagonal)
mask <- upper.tri(cormat,diag=TRUE)
[50]: ### Get lower triangle of the correlation matrix
get_lower_tri<-function(cormat){
cormat[upper.tri(cormat)] <- NA
return(cormat)
}
### Get upper triangle of the correlation matrix
get_upper_tri <- function(cormat){
cormat[lower.tri(cormat)]<- NA
return(cormat)
}
upper_tri <- get_upper_tri(cormat)
upper_tri
A matrix: 14 × 14 of type dbl
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT PRICE
CRIM 1 -0.2004692 0.4065834 -0.05589158 0.42097171 -0.21924670 0.35273425 -0.37967009 0.625505145 0.58276431 0.2899456 -0.38506394 0.4556215 -0.3883046
ZN NA 1.0000000 -0.5338282 -0.04269672 -0.51660371 0.31199059 -0.56953734 0.66440822 -0.311947826 -0.31456332 -0.3916785 0.17552032 -0.4129946 0.3604453
INDUS NA NA 1.0000000 0.06293803 0.76365145 -0.39167585 0.64477851 -0.70802699 0.595129275 0.72076018 0.3832476 -0.35697654 0.6037997 -0.4837252
CHAS NA NA NA 1.00000000 0.09120281 0.09125123 0.08651777 -0.09917578 -0.007368241 -0.03558652 -0.1215152 0.04878848 -0.0539293 0.1752602
NOX NA NA NA NA 1.00000000 -0.30218819 0.73147010 -0.76923011 0.611440563 0.66802320 0.1889327 -0.38005064 0.5908789 -0.4273208
RM NA NA NA NA NA 1.00000000 -0.24026493 0.20524621 -0.209846668 -0.29204783 -0.3555015 0.12806864 -0.6138083 0.6953599
AGE NA NA NA NA NA NA 1.00000000 -0.74788054 0.456022452 0.50645559 0.2615150 -0.27353398 0.6023385 -0.3769546
DIS NA NA NA NA NA NA NA 1.00000000 -0.494587930 -0.53443158 -0.2324705 0.29151167 -0.4969958 0.2499287
RAD NA NA NA NA NA NA NA NA 1.000000000 0.91022819 0.4647412 -0.44441282 0.4886763 -0.3816262
TAX NA NA NA NA NA NA NA NA NA 1.00000000 0.4608530 -0.44180801 0.5439934 -0.4685359
PTRATIO NA NA NA NA NA NA NA NA NA NA 1.0000000 -0.17738330 0.3740443 -0.5077867
B NA NA NA NA NA NA NA NA NA NA NA 1.00000000 -0.3660869 0.3334608
LSTAT NA NA NA NA NA NA NA NA NA NA NA NA 1.0000000 -0.7376627
PRICE NA NA NA NA NA NA NA NA NA NA NA NA NA 1.0000000
[51]: melted_cormat <- melt(upper_tri, na.rm = TRUE)
options(repr.plot.width=14, repr.plot.height=9)
ggplot(data = melted_cormat, aes(Var2, Var1, fill = value ))+
geom_tile(color = "white")+
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1,1), space = "Lab",
name="Pearson\nCorrelation") +
theme_minimal()+
theme(axis.text.x = element_text(angle = 45, vjust = 1,
size = 12, hjust = 1))+
coord_fixed()
[52]: nox_dis_corr<-cor(boston_dataset$DIS,boston_dataset$NOX, use = "everything",
method ="pearson")
options(repr.plot.width=14, repr.plot.height=9)
ggplot(boston_dataset) +
aes(x = DIS, y = NOX) +
geom_point(colour = "#0c4c8a",size=5,alpha=0.6) +
theme(text = element_text(size =15),)+
labs(title=nox_dis_corr, x="DIS - Distance from employment", y="NOX - Nitric Oxide Pollution")
Interpretation: as the distance from employment centres increases, pollution decreases, which is expected.
1.9.1 Linear regression between features
[53]: model <- lm(NOX ~ DIS, data = boston_dataset)
model
Call:
lm(formula = NOX ~ DIS, data = boston_dataset)
Coefficients:
(Intercept)          DIS
    0.71534     -0.04233
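Since broom is loaded above, the same results can also be pulled into tidy data frames, which is convenient for further processing; a small sketch:

# coefficient table and model-level statistics as data frames
tidy(model)     # term, estimate, std.error, statistic, p.value
glance(model)   # r.squared, sigma, AIC, BIC, ...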
[54]: ggplot(boston_dataset) +
aes(x = DIS, y = NOX) +
geom_point(colour = "#0c4c8a",size=5,alpha=0.6) +
theme(text = element_text(size =15))+
stat_smooth(method = lm)+
labs(title=nox_dis_corr, x="DIS - Distance from employment", y="NOX - Nitric Oxide Pollution")
options(repr.plot.width=9, repr.plot.height=6)
`geom_smooth()` using formula 'y ~ x'
[55]: #correlation matrix
pairs.panels(boston_dataset,
lm = TRUE,
method = "pearson", # correlation method
hist.col = "#00AFBB",
density = TRUE, # show density plots
ellipses = FALSE # show correlation ellipses
)
1.10 Training & Test Dataset Split
[56]: set.seed(10)
## 80% of the sample size
train_size <- floor(0.8 * nrow(boston_dataset))
in_rows <- sample(c(1:nrow(boston_dataset)), size = train_size, replace = FALSE)
train <- boston_dataset[in_rows, ]
test <- boston_dataset[-in_rows, ]
dim(train)
dim(test)
summary(train)
404 14
102 14
CRIM ZN INDUS CHAS
Min. : 0.00906 Min. : 0.00 Min. : 0.740 Min. :0.00000
1st Qu.: 0.08233 1st Qu.: 0.00 1st Qu.: 5.085 1st Qu.:0.00000
Median : 0.23440 Median : 0.00 Median : 9.690 Median :0.00000
Mean : 3.39556 Mean : 11.69 Mean :11.105 Mean :0.05941
3rd Qu.: 3.84006 3rd Qu.: 12.50 3rd Qu.:18.100 3rd Qu.:0.00000
Max. :73.53410 Max. :100.00 Max. :27.740 Max. :1.00000
NOX RM AGE DIS
Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
1st Qu.:0.4487 1st Qu.:5.883 1st Qu.: 42.75 1st Qu.: 2.088
Median :0.5380 Median :6.189 Median : 77.15 Median : 3.207
Mean :0.5549 Mean :6.261 Mean : 68.20 Mean : 3.840
3rd Qu.:0.6240 3rd Qu.:6.582 3rd Qu.: 94.10 3rd Qu.: 5.287
Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
RAD TAX PTRATIO B
Min. : 1.000 Min. :187.0 Min. :13.00 Min. : 2.52
1st Qu.: 4.000 1st Qu.:277.0 1st Qu.:17.23 1st Qu.:375.78
Median : 5.000 Median :335.0 Median :19.05 Median :391.48
Mean : 9.661 Mean :410.4 Mean :18.50 Mean :359.01
3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.90
Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
LSTAT PRICE
Min. : 1.730 Min. : 5.00
1st Qu.: 6.928 1st Qu.:16.77
Median :11.675 Median :20.75
Mean :12.783 Mean :22.30
3rd Qu.:16.948 3rd Qu.:25.00
Max. :37.970 Max. :50.00
1.11 Multivariable Regression
[57]: model <- lm(PRICE~CRIM+ZN+INDUS+CHAS+NOX+RM+AGE+DIS+RAD+TAX+PTRATIO+B+LSTAT, data=train)
summary(model)
Call:
lm(formula = PRICE ~ CRIM + ZN + INDUS + CHAS + NOX + RM + AGE +
DIS + RAD + TAX + PTRATIO + B + LSTAT, data = train)
Residuals:
Min 1Q Median 3Q Max
-14.6092 -2.7220 -0.5774 1.7397 25.9586
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 38.177696 5.780878 6.604 1.31e-10 ***
CRIM -0.077279 0.052417 -1.474 0.141205
ZN 0.037210 0.015294 2.433 0.015425 *
INDUS         0.013338   0.073135   0.182 0.855383
CHAS 2.750736 1.068721 2.574 0.010425 *
NOX -18.381466 4.358511 -4.217 3.08e-05 ***
RM 3.524380 0.473022 7.451 6.03e-13 ***
AGE 0.004915 0.014832 0.331 0.740522
DIS -1.426138 0.221958 -6.425 3.84e-10 ***
RAD 0.264717 0.076744 3.449 0.000623 ***
TAX -0.011338 0.004333 -2.617 0.009223 **
PTRATIO -0.968737 0.151327 -6.402 4.42e-10 ***
B 0.010705 0.003286 3.257 0.001223 **
LSTAT -0.540427 0.058027 -9.313 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.876 on 390 degrees of freedom
Multiple R-squared: 0.7186, Adjusted R-squared: 0.7092
F-statistic: 76.6 on 13 and 390 DF, p-value: < 2.2e-16
R-squared shows how well the data fit the regression model (the goodness of fit). Our r-squared equals 0.71, so the model fits reasonably well, but we can do better with some transformations.
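The same r-squared can be recovered by hand as 1 - SSE/SST; a minimal sketch, assuming model is the fit above:

# goodness of fit by hand: 1 - residual sum of squares / total sum of squares
sse <- sum(residuals(model)^2)
sst <- sum((train$PRICE - mean(train$PRICE))^2)
1 - sse / sst   # matches summary(model)$r.squared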
1.12 Data Transformation
1.13 Measuring skew
[58]: skewness(boston_dataset$PRICE)
1.10153731064787
As mentioned before, the transformation we're going to use is the log.
[59]: y_log <- log(boston_dataset$PRICE)
tail(y_log)
2.82137888640921 3.10906095886099 3.02529107579554 3.17387845893747 3.09104245335832 2.47653840011748
[60]: options(repr.plot.width=14, repr.plot.height=9)
ggplot(data=boston_dataset, aes(x=y_log)) +
geom_histogram(aes(x=y_log),
bins=50,
col="red",
fill="#2196f3",
alpha = .2) +
labs(title="Histogram for log prices", x="Price 000s ", y="N of Houses")
[61]: skewness(y_log)
-0.328365448930113
Skewness has improved thanks to our log transformation.
[62]: options(repr.plot.width=14, repr.plot.height=9)
ggplot(boston_dataset) +
aes(x = LSTAT, y = PRICE) +
geom_point(colour = "#000000",size=5,alpha=0.6) +
theme(text = element_text(size =14),)+
stat_smooth(method = lm)+
labs(title="", x="LSTAT", y="PRICE")
ggplot(boston_dataset) +
aes(x = LSTAT, y = y_log) +
geom_point(colour = "#000000",size=5,alpha=0.6) +
theme(text = element_text(size =14),)+
stat_smooth(method = lm)+
labs(title="", x="LSTAT", y="log PRICE")
`geom_smooth()` using formula 'y ~ x'
`geom_smooth()` using formula 'y ~ x'
So far our transformations look on point, as the graphs show.
1.14 Regression using log prices
[63]: model <- lm(log(PRICE)~CRIM+ZN+INDUS+CHAS+NOX+RM+AGE+DIS+RAD+TAX+PTRATIO+B+LSTAT, data=train)
summary(model)
BIC(model)
Call:
lm(formula = log(PRICE) ~ CRIM + ZN + INDUS + CHAS + NOX + RM +
AGE + DIS + RAD + TAX + PTRATIO + B + LSTAT, data = train)
Residuals:
Min 1Q Median 3Q Max
-0.73182 -0.10628 -0.01636 0.10142 0.88769
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.1191096 0.2299692 17.912 < 2e-16 ***
CRIM -0.0104884 0.0020852 -5.030 7.50e-07 ***
ZN 0.0010208 0.0006084 1.678 0.094190 .
INDUS 0.0020346 0.0029094 0.699 0.484767
CHAS 0.1011734 0.0425148 2.380 0.017806 *
NOX -0.7602185 0.1733860 -4.385 1.50e-05 ***
RM 0.0810628 0.0188173 4.308 2.09e-05 ***
AGE 0.0002907 0.0005900 0.493 0.622453
DIS -0.0479624 0.0088297 -5.432 9.83e-08 ***
RAD 0.0138183 0.0030530 4.526 7.99e-06 ***
TAX -0.0006141 0.0001724 -3.563 0.000412 ***
PTRATIO -0.0387921 0.0060200 -6.444 3.44e-10 ***
B 0.0005175 0.0001307 3.958 8.98e-05 ***
LSTAT -0.0288432 0.0023084 -12.495 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.194 on 390 degrees of freedom
Multiple R-squared: 0.7749, Adjusted R-squared: 0.7673
F-statistic: 103.2 on 13 and 390 DF, p-value: < 2.2e-16
-102.867711770293
[64]: # Charles River Property Premium
exp(0.1011734)
1.10646848676559
Holding the other features constant, buyers pay a premium of roughly 10.6% for a house next to the river (exp(0.1011734) ≈ 1.106).
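More generally, in a log-price model a dummy coefficient translates into a premium of (exp(coefficient) - 1) * 100 percent; a quick sketch:

# percentage premium implied by the CHAS coefficient of the log-price model
chas_coef <- coef(model)["CHAS"]
(exp(chas_coef) - 1) * 100   # about 10.6% here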
1.14.1 P values & Evaluating Coefficients
[65]: p_values <- data.frame(round(summary(model)$coefficients[ ,4],3))
colnames( p_values )<-c("P_values")
p_values
A data.frame: 14 × 1
P_values
<dbl>
(Intercept) 0.000
CRIM 0.000
ZN 0.094
INDUS 0.485
CHAS 0.018
NOX 0.000
RM 0.000
AGE 0.622
DIS 0.000
RAD 0.000
TAX 0.000
PTRATIO 0.000
B 0.000
LSTAT 0.000
As we can see, the p-values of AGE and INDUS are well above the usual 0.05 threshold, so we can say those features are statistically non-significant.
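These candidates for removal can also be pulled out programmatically from the p_values frame built above; a minimal sketch:

# features whose p-values exceed the usual 0.05 threshold
rownames(p_values)[p_values$P_values > 0.05]   # "ZN" "INDUS" "AGE"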
1.15 Testing for Multicollinearity
$$TAX = \alpha_0 + \alpha_1 RM + \alpha_2 NOX + \dots + \alpha_{12} LSTAT$$

$$VIF_{TAX} = \frac{1}{1 - R^2_{TAX}}$$
1.15.1 KEY TAKEAWAYS
A variance ination factor (VIF) provides a measure of multicollinearity among the indepen-
dent variables in a multiple regression model.
Detecting multicollinearity is important because while multicollinearity does not reduce the
explanatory power of the model, it does reduce the statistical signicance of the independent
variables.
A large variance ination factor (VIF) on an independent variable indicates a highly collinear
relationship to the other variables that should be considered or adjusted for in the structure
of the model and selection of independent variables.
[66]: vif_values <- vif(model)
vif_values
CRIM 2.32004172054075   ZN 2.23669325620911   INDUS 4.25762590331925   CHAS 1.08443701335279
NOX 4.42162249218309   RM 1.85944152784013   AGE 3.02263137291517   DIS 3.91429319487761
RAD 7.68674259920848   TAX 9.19729449468695   PTRATIO 1.7279317429752   B 1.40041614593321
LSTAT 2.96287514698582
All the VIF values are below the threshold of 10, so we shouldn't worry about multicollinearity.
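The VIF of any single feature can be reproduced by hand from the auxiliary regression described above; a minimal sketch for TAX, using the same training data as the model:

# VIF for TAX by hand: regress TAX on the other twelve predictors, then 1 / (1 - R^2)
aux <- lm(TAX ~ CRIM + ZN + INDUS + CHAS + NOX + RM + AGE + DIS + RAD + PTRATIO + B + LSTAT, data = train)
1 / (1 - summary(aux)$r.squared)   # matches vif_values["TAX"], about 9.2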
1.16 Model Simplication & the BIC
The Bayesian information criterion (BIC) is a criterion for model selection among a finite set of models. It is based, in part, on the likelihood function. Candidate models can be compared through their BIC values: the lower the BIC, the better the model.
Our goal is to simplify the model by dropping some of the features that are statistically non-significant, using the p-values we computed previously, and to improve the model based on the BIC criterion.
[67]: model <- lm(log(PRICE)~CRIM+ZN+CHAS+NOX+RM+AGE+DIS+RAD+TAX+PTRATIO+B+LSTAT, data=train)
rsquared <- summary(model)$r.squared
bic <- BIC(model)
cat('BIC is', bic,"\n")
cat('r-squared is',rsquared)
BIC is -108.3628
r-squared is 0.7745711
[68]: model <- lm(log(PRICE)~CRIM+ZN+CHAS+NOX+RM+DIS+RAD+TAX+PTRATIO+B+LSTAT, data=train)
rsquared <- summary(model)$r.squared
bic <- BIC(model)
cat('BIC is', bic,"\n")
cat('r-squared is',rsquared)
BIC is -114.1037
r-squared is 0.7744257
We did improve our model (the BIC decreased) without affecting the r-squared or the fit, and we dropped two features.
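As a cross-check on this manual elimination, the built-in step() function can run a backward search with the BIC penalty (k = log(n)); a sketch, assuming the full log-price model:

# automated backward elimination; k = log(n) makes step() use BIC instead of AIC
full <- lm(log(PRICE)~CRIM+ZN+INDUS+CHAS+NOX+RM+AGE+DIS+RAD+TAX+PTRATIO+B+LSTAT, data=train)
step(full, direction = "backward", k = log(nrow(train)), trace = 0)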
1.17 Residuals & Residual Plots
Residuals: residuals measure how far data points are from the regression line. A residual is nothing but the prediction error; we can find it by subtracting the predicted value from the actual value.
[69]: # Modified model: transformed (using log prices) & simplified (dropping two features)
y_train<-log(train$PRICE)
model <- lm(y_train~CRIM+ZN+CHAS+NOX+RM+DIS+RAD+TAX+PTRATIO+B+LSTAT,data=train)
rsquared <- summary(model)$r.squared
# Residuals
# residuals = y_train - fittedvalues (actual minus predicted log prices)
fittedvalues <- fitted.values(model)
residuals <-y_train - fittedvalues
cor <- cor(y_train,fittedvalues, use = "everything",
method ="pearson")
options(repr.plot.width=14, repr.plot.height=9)
ggplot(train) +
aes(x =log(PRICE), y = fittedvalues) +
geom_point(colour = "#000000",size=5,alpha=0.6) +
theme(text = element_text(size =14),)+
stat_smooth(method = lm)+
labs(title="Actual vs Predicted log prices", x='Actual log prices y',
,y="predicted log prices y hat")
options(repr.P.width=9,repr.P.height=6)
cat("correlation : ",cor)
`geom_smooth()` using formula 'y ~ x'
correlation : 0.8800146
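Beyond the actual-versus-predicted plot, a residuals-versus-fitted plot is the standard companion check: the cloud should be centred on zero with no visible pattern. A minimal sketch reusing fittedvalues and residuals from the cell above:

# residuals vs fitted values; a flat, patternless cloud supports the model assumptions
ggplot(data.frame(fitted = fittedvalues, resid = residuals), aes(x = fitted, y = resid)) +
geom_point(colour = "#000000", size = 5, alpha = 0.6) +
geom_hline(yintercept = 0, colour = "red") +
labs(title = "Residuals vs fitted log prices", x = "Fitted log prices", y = "Residuals")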
RMSE is a measure of how spread out these residuals are. In other words, it tells you how concentrated the data is around the line of best fit.
$$MSE = \frac{1}{N}\sum_{i=1}^{N}(\hat{y}_i - y_i)^2$$

$$RMSE = \sqrt{MSE}$$
[70]: rsquared<- summary(model)$r.squared
MSE<-mean(summary(model)$residuals^2)
RMSE<- sqrt(MSE)
cat("R-squared:",rsquared,"\n")
cat("MSE:",MSE,"\n")
cat("RMSE:",RMSE,"\n")
R-squared: 0.7744257
MSE: 0.03639125
RMSE: 0.1907649
Our RMSE is pretty low, which reflects how well the model fits.
[71]: upper_bound <- log(30) + 2*RMSE
cat('The upper bound in log prices for a 95% prediction interval is ', upper_bound,"\n")
lower_bound <- log(30) - 2*RMSE
cat('The lower bound in log prices for a 95% prediction interval is ', lower_bound)
The upper bound in log prices for a 95% prediction interval is 3.782727
The lower bound in log prices for a 95% prediction interval is 3.019668
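Exponentiating converts these log bounds back into dollar prices, remembering that PRICE is expressed in thousands of (1970s) dollars; a quick sketch:

# convert the log-price interval back to dollar prices (PRICE is in $1000s)
exp(c(lower_bound, upper_bound)) * 1000   # roughly 20500 to 43900 dollars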
1.17.1 Evaluation tools
[245]: # getting ready to set up the estimation tool
log_prices <- log(boston_dataset$PRICE)
target <- data.frame(PRICE=log_prices)
# This data was collected a long time ago and dollar prices have changed since,
# so we add a scale factor based on the median Boston house price from
# https://www.zillow.com/boston-ma/home-values/
ZILLOW_MEDIAN_PRICE <- 675.921
SCALE_FACTOR <- ZILLOW_MEDIAN_PRICE / median(boston_dataset$PRICE)
features <- subset(boston_dataset, select = -c(PRICE,INDUS,AGE))
fittedvalues <- fitted.values(model)
property_stats <- data.frame(t(colMeans(features)))
predict(model,property_stats)
1: 3.03293206386117
[246]: get_log_estimate <- function(nr_rooms,
                            students_per_classroom,
                            next_to_river=FALSE,
                            high_confidence=TRUE) {
    # configure the hypothetical property's features (by column name, since R is 1-indexed)
    property_stats[, "RM"] <- nr_rooms
    property_stats[, "PTRATIO"] <- students_per_classroom
    property_stats[, "CHAS"] <- if (next_to_river) 1 else 0
    # Make prediction
    log_estimate <- predict(model, property_stats)
    # Calc Range
    if (high_confidence) {
        upper_bound <- log_estimate + 2*RMSE
        lower_bound <- log_estimate - 2*RMSE
        interval <- 95
    }
    else {
        upper_bound <- log_estimate + RMSE
        lower_bound <- log_estimate - RMSE
        interval <- 68
    }
    return(data.frame(predictions=c(log_estimate, upper_bound, lower_bound, interval)))
}
get_log_estimate(3,20,TRUE,FALSE)
get_log_estimate(3,20,TRUE,FALSE)
A data.frame: 4 × 1
predictions
<dbl>
1.518167
1.708932
1.327402
68.000000
[247]: get_dollar_estimate <- function(rm, ptratio, chas=FALSE, large_range=TRUE){
    # Estimate the price of a property in Boston.
    # Arguments:
    #   rm          -- number of rooms in the property.
    #   ptratio     -- number of students per teacher in the classroom for the school in the area.
    #   chas        -- TRUE if the property is next to the river, FALSE otherwise.
    #   large_range -- TRUE for a 95% prediction interval, FALSE for a 68% interval.
    if (rm < 1 || ptratio < 1){
        print('That is unrealistic. Try again.')
        return()
    }
    result <- get_log_estimate(rm,
                               students_per_classroom=ptratio,
                               next_to_river=chas,
                               high_confidence=large_range)
    log_est <- result[1,"predictions"]
    upper <- result[2,"predictions"]
    lower <- result[3,"predictions"]
    conf <- result[4,"predictions"]
    # Convert to today's dollars
    dollar_est <- exp(log_est) * 1000 * SCALE_FACTOR
    dollar_hi <- exp(upper) * 1000 * SCALE_FACTOR
    dollar_low <- exp(lower) * 1000 * SCALE_FACTOR
    cat("The estimated property value is ",dollar_est,"\n")
    cat("At",conf,"confidence the valuation range is","\n")
    cat("USD", dollar_low ,"at the lower end to USD", dollar_hi, "at the high end.")
}
[248]: #trying some estimation
get_dollar_estimate(2, 30, TRUE)
The estimated property value is 292779.5
At 95 confidence the valuation range is
USD 199914.6 at the lower end to USD 428782.5 at the high end.