R PROJECT
June 9, 2021
1 PREDICTION MODEL FOR BOSTON HOUSE PRICE
1.1 Notebook Imports and Packages
[32]: # If one of the packages is missing, install it with: install.packages("package_name")
library(psych)
library(e1071)
library(ggplot2)
#library(heatmaply)
library(broom)
library(car)
library(repr)
library(reshape2)
1.2 Question
Can we predict Boston house prices using a multivariable regression model?
1.3 Gather Data
Source: the original research paper. For this project we're going to work with a pre-built dataset that the scikit-learn Python library provides (documentation: sklearn load_boston).
1.3.1 Python Code for generating csv dataset
from sklearn import datasets
import pandas as pd
data = datasets.load_boston()
df = pd.DataFrame(data=data['data'], columns = data['feature_names'])
df['PRICE']=data['target']
df.to_csv('boston.csv', sep = ',', index = False)
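Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so this snippet requires an older scikit-learn version (or simply reuse the generated boston.csv).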
1.4 Exploring the data
[33]: boston_dataset <- read.csv(file="boston.csv")
[34]: summary(boston_dataset)
str(boston_dataset)
cat("class : ",class(boston_dataset),"\n")
cat ("number of rows : ",nrow(boston_dataset),"\n")
cat ("number of features : " ,ncol(boston_dataset),"\n")
cat("dataset dimension : " ,dim(boston_dataset),"\n")
CRIM ZN INDUS CHAS
Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
NOX RM AGE DIS
Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
RAD TAX PTRATIO B
Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
Median : 5.000 Median :330.0 Median :19.05 Median :391.44
Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
LSTAT PRICE
Min. : 1.73 Min. : 5.00
1st Qu.: 6.95 1st Qu.:17.02
Median :11.36 Median :21.20
Mean :12.65 Mean :22.53
3rd Qu.:16.95 3rd Qu.:25.00
Max. :37.97 Max. :50.00
'data.frame': 506 obs. of 14 variables:
$ CRIM : num 0.00632 0.02731 0.02729 0.03237 0.06905
$ ZN : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5
$ INDUS : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87
$ CHAS : num 0 0 0 0 0 0 0 0 0 0
$ NOX : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524
$ RM : num 6.58 6.42 7.18 7 7.15
$ AGE : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9
$ DIS : num 4.09 4.97 4.97 6.06 6.06
$ RAD : num 1 2 2 3 3 3 5 5 5 5
$ TAX : num 296 242 242 222 222 222 311 311 311 311
$ PTRATIO: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2
$ B : num 397 397 393 395 397
$ LSTAT : num 4.98 9.14 4.03 2.94 5.33
$ PRICE : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9
class : data.frame
number of rows : 506
number of features : 14
dataset dimension : 506 14
[35]: # checking for missing data (anyNA scans every cell for NA values)
anyNA(boston_dataset)
FALSE
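For a finer-grained check, the NA values can be counted per column; a minimal sketch:

# count missing values in each feature; all zeros confirms a complete dataset
colSums(is.na(boston_dataset))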
1.4.1 Exploring the features
Data Set Characteristics:
:Number of Instances: 506
:Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.
:Attribute Information (in order):
- CRIM per capita crime rate by town
- ZN proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS proportion of non-retail business acres per town
- CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX nitric oxides concentration (parts per 10 million), used to measure the pollution rate
- RM average number of rooms per dwelling
- AGE proportion of owner-occupied units built prior to 1940
- DIS weighted distances to five Boston employment centres
- RAD index of accessibility to radial highways
- TAX full-value property-tax rate per $10,000
- PTRATIO pupil-teacher ratio by town
- B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT % lower status of the population
- MEDV Median value of owner-occupied homes in $1000's
:Missing Attribute Values: None
:Creator: Harrison, D. and Rubinfeld, D.L.
This is a copy of the UCI ML housing dataset: https://archive.ics.uci.edu/ml/machine-learning-databases/housing/
1.5 Visualising Data - Histograms, Distributions
[36]: #top rows
head(boston_dataset)
A data.frame: 6 × 14
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT PRICE
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98 24.0
2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14 21.6
3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03 34.7
4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94 33.4
5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33 36.2
6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21 28.7
[37]: options(repr.plot.width=14, repr.plot.height=9)
ggplot(data=boston_dataset, aes(x=PRICE)) +
geom_histogram(aes(y =..count..),
bins=30,
col="red",
fill="#2196f3",
alpha = .2) +
labs(title="Histogram for PRICE", x="Price", y="N of Houses",size=14)
[38]: skewness(boston_dataset$PRICE)
1.10153731064787
The data is positively skewed. Common transformations to normalize such data include the square root, cube root, and log; later on we're going to use a log transformation.
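To justify that choice, we can compare the skewness left by each candidate transformation; a quick sketch using the skewness function from e1071:

# skewness of PRICE under the three candidate transformations
skewness(sqrt(boston_dataset$PRICE))   # square root
skewness(boston_dataset$PRICE^(1/3))   # cube root
skewness(log(boston_dataset$PRICE))    # log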
[39]: options(repr.plot.width=14, repr.plot.height=9)
ggplot(data=boston_dataset, aes(x=RM)) +
geom_histogram(aes(y =..count..),
bins=10,
col="red",
fill="#2196f3",
alpha = .2) +
labs(title="Histogram for rooms", x="Number of rooms", y="N of Houses")
[40]: mean(boston_dataset$RM)
6.28463438735178
[41]: skewness(boston_dataset$RM)
0.401222328753187
This distribution tends toward a normal distribution; most of the houses have five to six rooms.
[42]: options(repr.plot.width=14, repr.plot.height=9)
ggplot(data=boston_dataset, aes(x=RAD)) +
geom_histogram(aes(y =..count..),
bins=24,
col="red",
fill="#2196f3",
alpha = .2) +
labs(title="Histogram for RAD", x="Accessibility to radial highways", y="N of Houses")
1.6 Descriptive statistics
[44]: mean(boston_dataset$CHAS)
median(boston_dataset$CHAS)
min(boston_dataset$PRICE)
max(boston_dataset$PRICE)
# number of occurrences
table(boston_dataset$CHAS)
0.0691699604743083
0
5
50
0 1
471 35
[45]: summary(boston_dataset)
CRIM ZN INDUS CHAS
Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
NOX RM AGE DIS
Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
RAD TAX PTRATIO B
Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
Median : 5.000 Median :330.0 Median :19.05 Median :391.44
Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
LSTAT PRICE
Min. : 1.73 Min. : 5.00
1st Qu.: 6.95 1st Qu.:17.02
Median :11.36 Median :21.20
Mean :12.65 Mean :22.53
3rd Qu.:16.95 3rd Qu.:25.00
Max. :37.97 Max. :50.00
1.6.1 KEY TAKEAWAYS
Correlation coecients are used to measure the strength of the relationship between two
variables.
Pearson correlation is the one most commonly used in statistics. This measures the strength
and direction of a linear relationship between two variables.
8
Values always range between -1 (strong negative relationship) and +1 (strong positive rela-
tionship). Values at or close to zero imply a weak or no linear relationship.
Correlation coecient values less than +0.8 or greater than -0.8 are not considered signicant.
1.7 Correlation
1.8 $\rho_{XY} = \mathrm{corr}(X, Y)$
1.9 $-1.0 \le \rho_{XY} \le +1.0$
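As a sanity check, the Pearson coefficient can be computed straight from its definition, the covariance divided by the product of the standard deviations; a minimal sketch:

# Pearson correlation from the definition: cov(X, Y) / (sd(X) * sd(Y))
x <- boston_dataset$RM
y <- boston_dataset$PRICE
cov(x, y) / (sd(x) * sd(y))   # matches cor(x, y)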
[46]: cor(boston_dataset$PRICE,boston_dataset$RM, use = "everything",
method = "pearson")
0.695359947071539
[47]: cor(boston_dataset$PRICE,boston_dataset$PTRATIO, use = "everything",
method = "pearson")
-0.507786685537562
[48]: cormat <- cor(boston_dataset,y=NULL, use = "everything",
method = "pearson")
cormat
A matrix: 14 × 14 of type dbl
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT PRICE
CRIM 1.00000000 -0.20046922 0.40658341 -0.055891582 0.42097171 -0.21924670 0.35273425 -0.37967009 0.625505145 0.58276431 0.2899456 -0.38506394 0.4556215 -0.3883046
ZN -0.20046922 1.00000000 -0.53382819 -0.042696719 -0.51660371 0.31199059 -0.56953734 0.66440822 -0.311947826 -0.31456332 -0.3916785 0.17552032 -0.4129946 0.3604453
INDUS 0.40658341 -0.53382819 1.00000000 0.062938027 0.76365145 -0.39167585 0.64477851 -0.70802699 0.595129275 0.72076018 0.3832476 -0.35697654 0.6037997 -0.4837252
CHAS -0.05589158 -0.04269672 0.06293803 1.000000000 0.09120281 0.09125123 0.08651777 -0.09917578 -0.007368241 -0.03558652 -0.1215152 0.04878848 -0.0539293 0.1752602
NOX 0.42097171 -0.51660371 0.76365145 0.091202807 1.00000000 -0.30218819 0.73147010 -0.76923011 0.611440563 0.66802320 0.1889327 -0.38005064 0.5908789 -0.4273208
RM -0.21924670 0.31199059 -0.39167585 0.091251225 -0.30218819 1.00000000 -0.24026493 0.20524621 -0.209846668 -0.29204783 -0.3555015 0.12806864 -0.6138083 0.6953599
AGE 0.35273425 -0.56953734 0.64477851 0.086517774 0.73147010 -0.24026493 1.00000000 -0.74788054 0.456022452 0.50645559 0.2615150 -0.27353398 0.6023385 -0.3769546
DIS -0.37967009 0.66440822 -0.70802699 -0.099175780 -0.76923011 0.20524621 -0.74788054 1.00000000 -0.494587930 -0.53443158 -0.2324705 0.29151167 -0.4969958 0.2499287
RAD 0.62550515 -0.31194783 0.59512927 -0.007368241 0.61144056 -0.20984667 0.45602245 -0.49458793 1.000000000 0.91022819 0.4647412 -0.44441282 0.4886763 -0.3816262
TAX 0.58276431 -0.31456332 0.72076018 -0.035586518 0.66802320 -0.29204783 0.50645559 -0.53443158 0.910228189 1.00000000 0.4608530 -0.44180801 0.5439934 -0.4685359
PTRATIO 0.28994558 -0.39167855 0.38324756 -0.121515174 0.18893268 -0.35550149 0.26151501 -0.23247054 0.464741179 0.46085304 1.0000000 -0.17738330 0.3740443 -0.5077867
B -0.38506394 0.17552032 -0.35697654 0.048788485 -0.38005064 0.12806864 -0.27353398 0.29151167 -0.444412816 -0.44180801 -0.1773833 1.00000000 -0.3660869 0.3334608
LSTAT 0.45562148 -0.41299457 0.60379972 -0.053929298 0.59087892 -0.61380827 0.60233853 -0.49699583 0.488676335 0.54399341 0.3740443 -0.36608690 1.0000000 -0.7376627
PRICE -0.38830461 0.36044534 -0.48372516 0.175260177 -0.42732077 0.69535995 -0.37695457 0.24992873 -0.381626231 -0.46853593 -0.5077867 0.33346082 -0.7376627 1.0000000
[49]: # mask for the upper triangle (including the diagonal)
mask <- upper.tri(cormat,diag=TRUE)
[50]: ### Get lower triangle of the correlation matrix
get_lower_tri<-function(cormat){
cormat[upper.tri(cormat)] <- NA
return(cormat)
}
### Get upper triangle of the correlation matrix
get_upper_tri <- function(cormat){
cormat[lower.tri(cormat)]<- NA
return(cormat)
}
upper_tri <- get_upper_tri(cormat)
upper_tri
A matrix: 14 × 14 of type dbl
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT PRICE
CRIM 1 -0.2004692 0.4065834 -0.05589158 0.42097171 -0.21924670 0.35273425 -0.37967009 0.625505145 0.58276431 0.2899456 -0.38506394 0.4556215 -0.3883046
ZN NA 1.0000000 -0.5338282 -0.04269672 -0.51660371 0.31199059 -0.56953734 0.66440822 -0.311947826 -0.31456332 -0.3916785 0.17552032 -0.4129946 0.3604453
INDUS NA NA 1.0000000 0.06293803 0.76365145 -0.39167585 0.64477851 -0.70802699 0.595129275 0.72076018 0.3832476 -0.35697654 0.6037997 -0.4837252
CHAS NA NA NA 1.00000000 0.09120281 0.09125123 0.08651777 -0.09917578 -0.007368241 -0.03558652 -0.1215152 0.04878848 -0.0539293 0.1752602
NOX NA NA NA NA 1.00000000 -0.30218819 0.73147010 -0.76923011 0.611440563 0.66802320 0.1889327 -0.38005064 0.5908789 -0.4273208
RM NA NA NA NA NA 1.00000000 -0.24026493 0.20524621 -0.209846668 -0.29204783 -0.3555015 0.12806864 -0.6138083 0.6953599
AGE NA NA NA NA NA NA 1.00000000 -0.74788054 0.456022452 0.50645559 0.2615150 -0.27353398 0.6023385 -0.3769546
DIS NA NA NA NA NA NA NA 1.00000000 -0.494587930 -0.53443158 -0.2324705 0.29151167 -0.4969958 0.2499287
RAD NA NA NA NA NA NA NA NA 1.000000000 0.91022819 0.4647412 -0.44441282 0.4886763 -0.3816262
TAX NA NA NA NA NA NA NA NA NA 1.00000000 0.4608530 -0.44180801 0.5439934 -0.4685359
PTRATIO NA NA NA NA NA NA NA NA NA NA 1.0000000 -0.17738330 0.3740443 -0.5077867
B NA NA NA NA NA NA NA NA NA NA NA 1.00000000 -0.3660869 0.3334608
LSTAT NA NA NA NA NA NA NA NA NA NA NA NA 1.0000000 -0.7376627
PRICE NA NA NA NA NA NA NA NA NA NA NA NA NA 1.0000000
[51]: melted_cormat <- melt(upper_tri, na.rm = TRUE)
options(repr.plot.width=14, repr.plot.height=9)
ggplot(data = melted_cormat, aes(Var2, Var1, fill = value ))+
geom_tile(color = "white")+
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1,1), space = "Lab",
name="Pearson\nCorrelation") +
theme_minimal()+
theme(axis.text.x = element_text(angle = 45, vjust = 1,
size = 12, hjust = 1))+
coord_fixed()
[52]: nox_dis_corr<-cor(boston_dataset$DIS,boston_dataset$NOX, use = "everything",
method ="pearson")
options(repr.plot.width=14, repr.plot.height=9)
ggplot(boston_dataset) +
aes(x = DIS, y = NOX) +
geom_point(colour = "#0c4c8a",size=5,alpha=0.6) +
theme(text = element_text(size =15),)+
labs(title=nox_dis_corr, x="DIS - Distance from employment", y="NOX - Nitric Oxide Pollution")
Interpretation: as the distance from employment centres increases, pollution decreases, which is expected.
1.9.1 Linear regression between features
[53]: model <- lm(NOX ~ DIS, data = boston_dataset)
model
Call:
lm(formula = NOX ~ DIS, data = boston_dataset)
Coefficients:
(Intercept)          DIS
    0.71534     -0.04233
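Since broom is loaded above, the same results can also be pulled into tidy data frames, which is convenient for further processing; a small sketch:

# coefficient table and model-level statistics as data frames
tidy(model)     # term, estimate, std.error, statistic, p.value
glance(model)   # r.squared, sigma, AIC, BIC, ...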
[54]: ggplot(boston_dataset) +
aes(x = DIS, y = NOX) +
geom_point(colour = "#0c4c8a",size=5,alpha=0.6) +
theme(text = element_text(size =15))+
stat_smooth(method = lm)+
labs(title=nox_dis_corr, x="DIS - Distance from employment", y="NOX - Nitric Oxide Pollution")
options(repr.plot.width=9, repr.plot.height=6)
`geom_smooth()` using formula 'y ~ x'
[55]: #correlation matrix
pairs.panels(boston_dataset,
lm = TRUE,
method = "pearson", # correlation method
hist.col = "#00AFBB",
density = TRUE, # show density plots
ellipses = FALSE # show correlation ellipses
)
1.10 Training & Test Dataset Split
[56]: set.seed(10)
## 80% of the sample size
train_size <- floor(0.8 * nrow(boston_dataset))
in_rows <- sample(c(1:nrow(boston_dataset)), size = train_size, replace = FALSE)
train <- boston_dataset[in_rows, ]
test <- boston_dataset[-in_rows, ]
dim(train)
dim(test)
summary(train)
404 14
102 14
CRIM ZN INDUS CHAS
Min. : 0.00906 Min. : 0.00 Min. : 0.740 Min. :0.00000
1st Qu.: 0.08233 1st Qu.: 0.00 1st Qu.: 5.085 1st Qu.:0.00000
Median : 0.23440 Median : 0.00 Median : 9.690 Median :0.00000
Mean : 3.39556 Mean : 11.69 Mean :11.105 Mean :0.05941
3rd Qu.: 3.84006 3rd Qu.: 12.50 3rd Qu.:18.100 3rd Qu.:0.00000
Max. :73.53410 Max. :100.00 Max. :27.740 Max. :1.00000
NOX RM AGE DIS
Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
1st Qu.:0.4487 1st Qu.:5.883 1st Qu.: 42.75 1st Qu.: 2.088
Median :0.5380 Median :6.189 Median : 77.15 Median : 3.207
Mean :0.5549 Mean :6.261 Mean : 68.20 Mean : 3.840
3rd Qu.:0.6240 3rd Qu.:6.582 3rd Qu.: 94.10 3rd Qu.: 5.287
Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
RAD TAX PTRATIO B
Min. : 1.000 Min. :187.0 Min. :13.00 Min. : 2.52
1st Qu.: 4.000 1st Qu.:277.0 1st Qu.:17.23 1st Qu.:375.78
Median : 5.000 Median :335.0 Median :19.05 Median :391.48
Mean : 9.661 Mean :410.4 Mean :18.50 Mean :359.01
3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.90
Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
LSTAT PRICE
Min. : 1.730 Min. : 5.00
1st Qu.: 6.928 1st Qu.:16.77
Median :11.675 Median :20.75
Mean :12.783 Mean :22.30
3rd Qu.:16.948 3rd Qu.:25.00
Max. :37.970 Max. :50.00
1.11 Multivariable Regression
[57]: model <- lm(PRICE~CRIM+ZN+INDUS+CHAS+NOX+RM+AGE+DIS+RAD+TAX+PTRATIO+B+LSTAT, data=train)
summary(model)
Call:
lm(formula = PRICE ~ CRIM + ZN + INDUS + CHAS + NOX + RM + AGE +
DIS + RAD + TAX + PTRATIO + B + LSTAT, data = train)
Residuals:
Min 1Q Median 3Q Max
-14.6092 -2.7220 -0.5774 1.7397 25.9586
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 38.177696 5.780878 6.604 1.31e-10 ***
CRIM -0.077279 0.052417 -1.474 0.141205
ZN 0.037210 0.015294 2.433 0.015425 *
INDUS         0.013338   0.073135   0.182 0.855383
CHAS 2.750736 1.068721 2.574 0.010425 *
NOX -18.381466 4.358511 -4.217 3.08e-05 ***
RM 3.524380 0.473022 7.451 6.03e-13 ***
AGE 0.004915 0.014832 0.331 0.740522
DIS -1.426138 0.221958 -6.425 3.84e-10 ***
RAD 0.264717 0.076744 3.449 0.000623 ***
TAX -0.011338 0.004333 -2.617 0.009223 **
PTRATIO -0.968737 0.151327 -6.402 4.42e-10 ***
B 0.010705 0.003286 3.257 0.001223 **
LSTAT -0.540427 0.058027 -9.313 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.876 on 390 degrees of freedom
Multiple R-squared: 0.7186, Adjusted R-squared: 0.7092
F-statistic: 76.6 on 13 and 390 DF, p-value: < 2.2e-16
R-squared shows how well the data fit the regression model (the goodness of fit). Our r-squared equals 0.71, so the model fits reasonably well, but we can do better with some transformations.
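The same r-squared can be recovered by hand as 1 - SSE/SST; a minimal sketch, assuming model is the fit above:

# goodness of fit by hand: 1 - residual sum of squares / total sum of squares
sse <- sum(residuals(model)^2)
sst <- sum((train$PRICE - mean(train$PRICE))^2)
1 - sse / sst   # matches summary(model)$r.squared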
1.12 Data Transformation
1.13 Measuring skew
[58]: skewness(boston_dataset$PRICE)
1.10153731064787
As mentioned before, the transformation we're going to use is the log.
[59]: y_log <- log(boston_dataset$PRICE)
tail(y_log)
2.82137888640921 3.10906095886099 3.02529107579554 3.17387845893747 3.09104245335832 2.47653840011748
[60]: options(repr.plot.width=14, repr.plot.height=9)
ggplot(data=boston_dataset, aes(x=y_log)) +
geom_histogram(aes(x=y_log),
bins=50,
col="red",
fill="#2196f3",
alpha = .2) +
labs(title="Histogram for log prices", x="Price 000s ", y="N of Houses")
[61]: skewness(y_log)
-0.328365448930113
Skewness has improved thanks to our log transformation.
[62]: options(repr.plot.width=14, repr.plot.height=9)
ggplot(boston_dataset) +
aes(x = LSTAT, y = PRICE) +
geom_point(colour = "#000000",size=5,alpha=0.6) +
theme(text = element_text(size =14),)+
stat_smooth(method = lm)+
labs(title="", x="LSTAT", y="PRICE")
ggplot(boston_dataset) +
aes(x = LSTAT, y = y_log) +
geom_point(colour = "#000000",size=5,alpha=0.6) +
theme(text = element_text(size =14),)+
stat_smooth(method = lm)+
labs(title="", x="LSTAT", y="log PRICE")
`geom_smooth()` using formula 'y ~ x'
`geom_smooth()` using formula 'y ~ x'
So far our transformations look on point, as the graphs show.
1.14 Regression using log prices
[63]: model <- lm(log(PRICE)~CRIM+ZN+INDUS+CHAS+NOX+RM+AGE+DIS+RAD+TAX+PTRATIO+B+LSTAT, data=train)
summary(model)
BIC(model)
Call:
lm(formula = log(PRICE) ~ CRIM + ZN + INDUS + CHAS + NOX + RM +
AGE + DIS + RAD + TAX + PTRATIO + B + LSTAT, data = train)
Residuals:
Min 1Q Median 3Q Max
-0.73182 -0.10628 -0.01636 0.10142 0.88769
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.1191096 0.2299692 17.912 < 2e-16 ***
CRIM -0.0104884 0.0020852 -5.030 7.50e-07 ***
ZN 0.0010208 0.0006084 1.678 0.094190 .
INDUS 0.0020346 0.0029094 0.699 0.484767
CHAS 0.1011734 0.0425148 2.380 0.017806 *
NOX -0.7602185 0.1733860 -4.385 1.50e-05 ***
RM 0.0810628 0.0188173 4.308 2.09e-05 ***
AGE 0.0002907 0.0005900 0.493 0.622453
DIS -0.0479624 0.0088297 -5.432 9.83e-08 ***
RAD 0.0138183 0.0030530 4.526 7.99e-06 ***
TAX -0.0006141 0.0001724 -3.563 0.000412 ***
PTRATIO -0.0387921 0.0060200 -6.444 3.44e-10 ***
B 0.0005175 0.0001307 3.958 8.98e-05 ***
LSTAT -0.0288432 0.0023084 -12.495 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.194 on 390 degrees of freedom
Multiple R-squared: 0.7749, Adjusted R-squared: 0.7673
F-statistic: 103.2 on 13 and 390 DF, p-value: < 2.2e-16
-102.867711770293
[64]: # Charles River Property Premium
exp(0.1011734)
1.10646848676559
Holding the other features constant, buyers pay a premium of roughly 10.6% for a house next to the river (exp(0.1011734) ≈ 1.106).
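More generally, in a log-price model a dummy coefficient translates into a premium of (exp(coefficient) - 1) * 100 percent; a quick sketch:

# percentage premium implied by the CHAS coefficient of the log-price model
chas_coef <- coef(model)["CHAS"]
(exp(chas_coef) - 1) * 100   # about 10.6% here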
1.14.1 P values & Evaluating Coefficients
[65]: p_values <- data.frame(round(summary(model)$coefficients[ ,4],3))
colnames( p_values )<-c("P_values")
p_values
A data.frame: 14 × 1
P_values
<dbl>
(Intercept) 0.000
CRIM 0.000
ZN 0.094
INDUS 0.485
CHAS 0.018
NOX 0.000
RM 0.000
AGE 0.622
DIS 0.000
RAD 0.000
TAX 0.000
PTRATIO 0.000
B 0.000
LSTAT 0.000
As we can see, the p-values of AGE and INDUS are well above the usual 0.05 threshold, so we can say those features are statistically non-significant.
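These candidates for removal can also be pulled out programmatically from the p_values frame built above; a minimal sketch:

# features whose p-values exceed the usual 0.05 threshold
rownames(p_values)[p_values$P_values > 0.05]   # "ZN" "INDUS" "AGE"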
1.15 Testing for Multicollinearity
$$TAX = \alpha_0 + \alpha_1 RM + \alpha_2 NOX + \dots + \alpha_{12} LSTAT$$

$$VIF_{TAX} = \frac{1}{1 - R^2_{TAX}}$$
1.15.1 KEY TAKEAWAYS
A variance ination factor (VIF) provides a measure of multicollinearity among the indepen-
dent variables in a multiple regression model.
Detecting multicollinearity is important because while multicollinearity does not reduce the
explanatory power of the model, it does reduce the statistical signicance of the independent
variables.
A large variance ination factor (VIF) on an independent variable indicates a highly collinear
relationship to the other variables that should be considered or adjusted for in the structure
of the model and selection of independent variables.
[66]: vif_values <- vif(model)
vif_values
CRIM 2.32004172054075   ZN 2.23669325620911   INDUS 4.25762590331925   CHAS 1.08443701335279
NOX 4.42162249218309   RM 1.85944152784013   AGE 3.02263137291517   DIS 3.91429319487761
RAD 7.68674259920848   TAX 9.19729449468695   PTRATIO 1.7279317429752   B 1.40041614593321
LSTAT 2.96287514698582
All the VIF values are below the threshold of 10, so we shouldn't worry about multicollinearity.
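The VIF of any single feature can be reproduced by hand from the auxiliary regression described above; a minimal sketch for TAX, using the same training data as the model:

# VIF for TAX by hand: regress TAX on the other twelve predictors, then 1 / (1 - R^2)
aux <- lm(TAX ~ CRIM + ZN + INDUS + CHAS + NOX + RM + AGE + DIS + RAD + PTRATIO + B + LSTAT, data = train)
1 / (1 - summary(aux)$r.squared)   # matches vif_values["TAX"], about 9.2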
1.16 Model Simplication & the BIC
The Bayesian information criterion (BIC) is a criterion for model selection among a finite set of models. It is based, in part, on the likelihood function. Candidate models can be compared through their BIC values: the lower the BIC, the better the model.
Our goal is to simplify the model by dropping some of the features that are statistically non-significant, using the p-values we computed previously, and to improve the model based on the BIC criterion.
[67]: model <- lm(log(PRICE)~CRIM+ZN+CHAS+NOX+RM+AGE+DIS+RAD+TAX+PTRATIO+B+LSTAT, data=train)
rsquared <- summary(model)$r.squared
bic <- BIC(model)
cat('BIC is', bic,"\n")
cat('r-squared is',rsquared)
BIC is -108.3628
r-squared is 0.7745711
[68]: model <- lm(log(PRICE)~CRIM+ZN+CHAS+NOX+RM+DIS+RAD+TAX+PTRATIO+B+LSTAT, data=train)
rsquared <- summary(model)$r.squared
bic <- BIC(model)
cat('BIC is', bic,"\n")
cat('r-squared is',rsquared)
BIC is -114.1037
r-squared is 0.7744257
We did improve our model (the BIC decreased) without affecting the r-squared or the fit, and we dropped two features.
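As a cross-check on this manual elimination, the built-in step() function can run a backward search with the BIC penalty (k = log(n)); a sketch, assuming the full log-price model:

# automated backward elimination; k = log(n) makes step() use BIC instead of AIC
full <- lm(log(PRICE)~CRIM+ZN+INDUS+CHAS+NOX+RM+AGE+DIS+RAD+TAX+PTRATIO+B+LSTAT, data=train)
step(full, direction = "backward", k = log(nrow(train)), trace = 0)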
1.17 Residuals & Residual Plots
Residuals: residuals measure how far data points are from the regression line. A residual is nothing but the prediction error; we can find it by subtracting the predicted value from the actual value.
[69]: # Modified model: transformed (using log prices) & simplified (dropping two features)
y_train<-log(train$PRICE)
model <- lm(y_train~CRIM+ZN+CHAS+NOX+RM+DIS+RAD+TAX+PTRATIO+B+LSTAT,data=train)
rsquared <- summary(model)$r.squared
# Residuals
# residuals = y_train - fittedvalues (actual minus predicted log prices)
fittedvalues <- fitted.values(model)
residuals <-y_train - fittedvalues
cor <- cor(y_train,fittedvalues, use = "everything",
method ="pearson")
options(repr.plot.width=14, repr.plot.height=9)
ggplot(train) +
aes(x =log(PRICE), y = fittedvalues) +
geom_point(colour = "#000000",size=5,alpha=0.6) +
theme(text = element_text(size =14),)+
stat_smooth(method = lm)+
labs(title="Actual vs Predicted log prices", x='Actual log prices y',
,y="predicted log prices y hat")
options(repr.P.width=9,repr.P.height=6)
cat("correlation : ",cor)
`geom_smooth()` using formula 'y ~ x'
correlation : 0.8800146
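Beyond the actual-versus-predicted plot, a residuals-versus-fitted plot is the standard companion check: the cloud should be centred on zero with no visible pattern. A minimal sketch reusing fittedvalues and residuals from the cell above:

# residuals vs fitted values; a flat, patternless cloud supports the model assumptions
ggplot(data.frame(fitted = fittedvalues, resid = residuals), aes(x = fitted, y = resid)) +
geom_point(colour = "#000000", size = 5, alpha = 0.6) +
geom_hline(yintercept = 0, colour = "red") +
labs(title = "Residuals vs fitted log prices", x = "Fitted log prices", y = "Residuals")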
RMSE is a measure of how spread out these residuals are. In other words, it tells you how concentrated the data is around the line of best fit.
$$MSE = \frac{1}{N}\sum_{i=1}^{N}(\hat{y}_i - y_i)^2$$

$$RMSE = \sqrt{MSE}$$
[70]: rsquared<- summary(model)$r.squared
MSE<-mean(summary(model)$residuals^2)
RMSE<- sqrt(MSE)
cat("R-squared:",rsquared,"\n")
cat("MSE:",MSE,"\n")
cat("RMSE:",RMSE,"\n")
R-squared: 0.7744257
MSE: 0.03639125
RMSE: 0.1907649
Our RMSE is pretty low, which reflects how well the model fits.
[71]: upper_bound <- log(30) + 2*RMSE
cat('The upper bound in log prices for a 95% prediction interval is ', upper_bound,"\n")
lower_bound <- log(30) - 2*RMSE
cat('The lower bound in log prices for a 95% prediction interval is ', lower_bound)
The upper bound in log prices for a 95% prediction interval is 3.782727
The lower bound in log prices for a 95% prediction interval is 3.019668
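Exponentiating converts these log bounds back into dollar prices, remembering that PRICE is expressed in thousands of (1970s) dollars; a quick sketch:

# convert the log-price interval back to dollar prices (PRICE is in $1000s)
exp(c(lower_bound, upper_bound)) * 1000   # roughly 20500 to 43900 dollars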
1.17.1 Evaluation tools
[245]: # getting ready to set up the estimation tool
log_prices <- log(boston_dataset$PRICE)
target <- data.frame(PRICE=log_prices)
# This data was collected a long time ago and dollar prices have changed since,
# so we add a scale factor based on the median Boston house price from
# https://www.zillow.com/boston-ma/home-values/
ZILLOW_MEDIAN_PRICE <- 675.921
SCALE_FACTOR <- ZILLOW_MEDIAN_PRICE / median(boston_dataset$PRICE)
features <- subset(boston_dataset, select = -c(PRICE,INDUS,AGE))
fittedvalues <- fitted.values(model)
property_stats <- data.frame(t(colMeans(features)))
predict(model,property_stats)
1: 3.03293206386117
[246]: get_log_estimate <- function(nr_rooms,
                            students_per_classroom,
                            next_to_river=FALSE,
                            high_confidence=TRUE) {
    # configure the hypothetical property's features (by column name, since R is 1-indexed)
    property_stats[, "RM"] <- nr_rooms
    property_stats[, "PTRATIO"] <- students_per_classroom
    property_stats[, "CHAS"] <- if (next_to_river) 1 else 0
    # Make prediction
    log_estimate <- predict(model, property_stats)
    # Calc Range
    if (high_confidence) {
        upper_bound <- log_estimate + 2*RMSE
        lower_bound <- log_estimate - 2*RMSE
        interval <- 95
    }
    else {
        upper_bound <- log_estimate + RMSE
        lower_bound <- log_estimate - RMSE
        interval <- 68
    }
    return(data.frame(predictions=c(log_estimate, upper_bound, lower_bound, interval)))
}
get_log_estimate(3,20,TRUE,FALSE)
get_log_estimate(3,20,TRUE,FALSE)
A data.frame: 4 × 1
predictions
<dbl>
1.518167
1.708932
1.327402
68.000000
[247]: get_dollar_estimate <- function(rm, ptratio, chas=FALSE, large_range=TRUE){
    # Estimate the price of a property in Boston.
    # Arguments:
    #   rm          -- number of rooms in the property.
    #   ptratio     -- number of students per teacher in the classroom for the school in the area.
    #   chas        -- TRUE if the property is next to the river, FALSE otherwise.
    #   large_range -- TRUE for a 95% prediction interval, FALSE for a 68% interval.
    if (rm < 1 || ptratio < 1){
        print('That is unrealistic. Try again.')
        return()
    }
    result <- get_log_estimate(rm,
                               students_per_classroom=ptratio,
                               next_to_river=chas,
                               high_confidence=large_range)
    log_est <- result[1,"predictions"]
    upper <- result[2,"predictions"]
    lower <- result[3,"predictions"]
    conf <- result[4,"predictions"]
    # Convert to today's dollars
    dollar_est <- exp(log_est) * 1000 * SCALE_FACTOR
    dollar_hi <- exp(upper) * 1000 * SCALE_FACTOR
    dollar_low <- exp(lower) * 1000 * SCALE_FACTOR
    cat("The estimated property value is ",dollar_est,"\n")
    cat("At",conf,"confidence the valuation range is","\n")
    cat("USD", dollar_low ,"at the lower end to USD", dollar_hi, "at the high end.")
}
[248]: #trying some estimation
get_dollar_estimate(2, 30, TRUE)
The estimated property value is 292779.5
At 95 confidence the valuation range is
USD 199914.6 at the lower end to USD 428782.5 at the high end.