Importing the CSV data file assign1_EnergyData.csv
for analysis, and quickly check the structure of the data.
## 'data.frame': 700 obs. of 10 variables:
## $ Compactness: num 0.74 0.9 0.79 0.69 0.82 0.98 0.69 0.62 0.62 0.66 ...
## $ SurfaceArea: num 686 564 637 735 612 ...
## $ WallArea : num 245 318 343 294 318 ...
## $ RoofArea : num 220 122 147 220 147 ...
## $ Height : num 3.5 7 7 3.5 7 7 3.5 3.5 3.5 3.5 ...
## $ Orientation: int 2 5 3 5 5 4 2 4 3 3 ...
## $ GlazingArea: num 0.4 0.1 0 0.4 0.1 0.25 0.25 0.1 0.1 0.25 ...
## $ GAreaDist : int 3 4 0 3 5 5 3 5 2 1 ...
## $ HeatingLoad: num 14 29.8 29.9 14.1 23.9 ...
## $ CoolingLoad: num 16.1 29.9 31.3 17 27.7 ...
The following table summarizes the variables in the dataset.
Variable | Description | Numeric / Categorical |
---|---|---|
Compactness | Index of compactness ratio for the building | Numeric |
SurfaceArea | Surface area of the complete building | Numeric |
WallArea | Total surface area of the Walls of the building | Numeric |
RoofArea | Total surface area of the Roof(s) of the building | Numeric |
Height | Overall height of the building | Numeric |
Orientation | Orientation of the building in terms of direction | Interger |
GlazingArea | Total area of Glazing around the building | Numeric |
GAreaDist | Distribution of the Glazing Area in the building | Integer |
HeatingLoad | Total heating load required for the building | Numeric |
CoolingLoad | Total cooling load required for the building | Numeric |
Heating Load is the amount of heat energy that would need to be added to a space to maintain the temperature in an acceptable range. Cooling Load is the amount of heat energy that would need to be removed from a space (cooling) to maintain the temperature in an acceptable range. Thus, it is important in Civil Engineering to estimate the Heating and Cooling Load of a building while designing/planning the structure.
We begin by viewing some exploratory plots.
Compactness ratio seem to be mostly around 0.6-0.9 with an absence of data between 0.90-0.95. Seems that most This is not a continuous distribution.
Compactness ratio seem to be mostly around 60 with an absence of data at around 525~560, 645~655 and 755~775. This is not a continuous distribution.
Not all histograms seem to follow a normal distribtion or even a continuous distribution. Only HeatingLoad and CoolingLoad is similar to a normal distribution.
Considering the other variables as categorical variables
The following boxplots are not meaningful to plot. RoofArea as the median = 3rd quartile = 4th quartile Height as 1st quartile = 2nd quartile = median and 3rd quartile = 4th quartile Orientation as 2nd quartile = median GlazingArea 3rd quartile = 4th quartile
## Compactness SurfaceArea WallArea RoofArea
## Compactness 1.000000000 -0.991809299 -0.20812361 -0.869691184
## SurfaceArea -0.991809299 1.000000000 0.20130347 0.880997750
## WallArea -0.208123615 0.201303470 1.00000000 -0.286087266
## RoofArea -0.869691184 0.880997750 -0.28608727 1.000000000
## Height 0.828217657 -0.858340532 0.27432148 -0.972153083
## Orientation 0.002895461 -0.002339471 -0.01373944 0.004347724
## GlazingArea -0.022618635 0.020026945 -0.00447997 0.021754799
## GAreaDist 0.017067129 -0.013939406 -0.01454338 -0.006611360
## HeatingLoad 0.613965409 -0.650383231 0.45153417 -0.854318402
## CoolingLoad 0.629535257 -0.668604808 0.42161086 -0.857690073
## Height Orientation GlazingArea GAreaDist
## Compactness 0.828217657 0.002895461 -0.022618635 0.017067129
## SurfaceArea -0.858340532 -0.002339471 0.020026945 -0.013939406
## WallArea 0.274321482 -0.013739441 -0.004479970 -0.014543378
## RoofArea -0.972153083 0.004347724 0.021754799 -0.006611360
## Height 1.000000000 -0.008961975 -0.013924794 0.005742054
## Orientation -0.008961975 1.000000000 0.001529999 0.022541886
## GlazingArea -0.013924794 0.001529999 1.000000000 0.236956380
## GAreaDist 0.005742054 0.022541886 0.236956380 1.000000000
## HeatingLoad 0.884971395 -0.011043954 0.263708130 0.100780744
## CoolingLoad 0.893038477 0.001686366 0.201229083 0.061463685
## HeatingLoad CoolingLoad
## Compactness 0.61396541 0.629535257
## SurfaceArea -0.65038323 -0.668604808
## WallArea 0.45153417 0.421610863
## RoofArea -0.85431840 -0.857690073
## Height 0.88497140 0.893038477
## Orientation -0.01104395 0.001686366
## GlazingArea 0.26370813 0.201229083
## GAreaDist 0.10078074 0.061463685
## HeatingLoad 1.00000000 0.975695932
## CoolingLoad 0.97569593 1.000000000
## corrplot 0.84 loaded
The dark colored circles represent high correlation. Compactness and SurfaceArea is close to being directly related.
The inital scatterplot of all variables did not look very convincing and feels overwhelming due to the 10 variables present. We attempt to remove the variables that did not have strong relation with its other counterparts.
The second plot looks better and we can be very sure that GAreaDist is strongly related to HeatingLoad and CoolingLoad. There also appear to be a pattern between the variables and HeatingLoad and CoolingLoad. The scatter plots of all variables against HeatingLoad is almost the same as plotting them against CoolingLoad. This do suggest a strong direct relation between HeatingLoad and CoolingLoad which the plot between them also proving likewise.
We shall perform backwards Linear Regression.
lmFit1 <- lm(HeatingLoad ~. , data = energyData)
summary(lmFit1)
##
## Call:
## lm(formula = HeatingLoad ~ ., data = energyData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.1770 -0.9063 -0.0444 0.7533 6.3206
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.465829 12.374688 1.007 0.3141
## Compactness -13.312667 6.774583 -1.965 0.0498 *
## SurfaceArea -0.022008 0.011112 -1.981 0.0480 *
## WallArea 0.027552 0.004388 6.279 6.01e-10 ***
## RoofArea NA NA NA NA
## Height 1.060263 0.233924 4.533 6.87e-06 ***
## Orientation -0.088424 0.060854 -1.453 0.1467
## GlazingArea 9.042753 0.607969 14.874 < 2e-16 ***
## GAreaDist 0.195144 0.045161 4.321 1.78e-05 ***
## CoolingLoad 0.735366 0.021253 34.601 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.796 on 691 degrees of freedom
## Multiple R-squared: 0.9684, Adjusted R-squared: 0.9681
## F-statistic: 2650 on 8 and 691 DF, p-value: < 2.2e-16
Model No. | Residual standard error | Multiple R-squared | Adjusted R-squared | F-statistic |
---|---|---|---|---|
1 | 1.796 | 0.9684 | 0.9681 | 2650 |
Create a table as we perform linear regression for easier tracking of progress. From summary, it seems that RoofArea is not applicable for this regression model. This tells me that RoofArea is in fact linearly related to the other variables. We will only know if it is useful after dropping a few variables. The intercept is zero with the largest p-value of 0.05240. However removing the intercept is only in cases whereby the true intercept is zero. Which in this case, I decided against it as the p-value is the largest but not even 0.1. Hence, let us remove the variable with the 2nd largest p-value, Orientation.
lmFit2 <- update(lmFit1,~.-1, data = energyData)
summary(lmFit2)
##
## Call:
## lm(formula = HeatingLoad ~ Compactness + SurfaceArea + WallArea +
## RoofArea + Height + Orientation + GlazingArea + GAreaDist +
## CoolingLoad - 1, data = energyData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.1691 -0.8976 -0.0450 0.7366 6.3084
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## Compactness -6.598565 1.213511 -5.438 7.50e-08 ***
## SurfaceArea -0.010922 0.001535 -7.114 2.82e-12 ***
## WallArea 0.024526 0.003199 7.667 5.96e-14 ***
## RoofArea NA NA NA NA
## Height 1.206685 0.183287 6.584 9.08e-11 ***
## Orientation -0.086729 0.060832 -1.426 0.154
## GlazingArea 8.994504 0.606086 14.840 < 2e-16 ***
## GAreaDist 0.193951 0.045146 4.296 1.99e-05 ***
## CoolingLoad 0.739057 0.020935 35.303 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.796 on 692 degrees of freedom
## Multiple R-squared: 0.9946, Adjusted R-squared: 0.9946
## F-statistic: 1.604e+04 on 8 and 692 DF, p-value: < 2.2e-16
Model No. | Residual standard error | Multiple R-squared | Adjusted R-squared | F-statistic |
---|---|---|---|---|
1 | 1.796 | 0.9684 | 0.9681 | 2650 |
2 | 1.796 | 0.9946 | 0.9946 | 1.604e+04 |
Wow, Multiple R-squared, Adjusted R-squared and F-statistic have drastically increased. This means that this regression probably do have a true intercept of zero. Let us try removing Orientation and see if it improves this regression further as it has the largest p-value of being zero
lmFit3 <- update(lmFit2,~.-Orientation, data = energyData)
summary(lmFit3)
##
## Call:
## lm(formula = HeatingLoad ~ Compactness + SurfaceArea + WallArea +
## RoofArea + Height + GlazingArea + GAreaDist + CoolingLoad -
## 1, data = energyData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.3109 -0.9080 -0.0305 0.7526 6.2625
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## Compactness -6.812438 1.205101 -5.653 2.30e-08 ***
## SurfaceArea -0.011164 0.001527 -7.311 7.34e-13 ***
## WallArea 0.024586 0.003201 7.681 5.40e-14 ***
## RoofArea NA NA NA NA
## Height 1.212645 0.183376 6.613 7.53e-11 ***
## GlazingArea 9.013971 0.606383 14.865 < 2e-16 ***
## GAreaDist 0.192606 0.045170 4.264 2.29e-05 ***
## CoolingLoad 0.737893 0.020935 35.247 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.797 on 693 degrees of freedom
## Multiple R-squared: 0.9946, Adjusted R-squared: 0.9946
## F-statistic: 1.831e+04 on 7 and 693 DF, p-value: < 2.2e-16
Model No. | Residual standard error | Multiple R-squared | Adjusted R-squared | F-statistic |
---|---|---|---|---|
1 | 1.796 | 0.9684 | 0.9681 | 2650 |
2 | 1.796 | 0.9946 | 0.9946 | 1.604e+04 |
3 | 1.797 | 0.9946 | 0.9946 | 1.831e+04 |
It seems that Residual standard error and F- statistic have increased marginally with the rest remaining constant. I am happy with this regression and will stop here.
Importing the CSV data file assign1_EnergyPred.csv
for predicting.
energyPred <- read.csv("assign1_EnergyPred.csv", header = TRUE)
lmFit4 <- lm(HeatingLoad ~ Compactness + SurfaceArea + WallArea + RoofArea + Height + GlazingArea + GAreaDist, data=energyData)
newData <- energyPred[c(1, 2, 4, 6, 8, 10),]
predict(lmFit4, newdata = newData, interval = "confidence", level = 0.95)
## Warning in predict.lm(lmFit4, newdata = newData, interval = "confidence", :
## prediction from a rank-deficient fit may be misleading
## fit lwr upr
## 1 10.14791 9.748573 10.54726
## 2 17.93658 17.446524 18.42665
## 4 27.74494 27.055281 28.43460
## 6 37.61549 36.816812 38.41416
## 8 16.93565 16.407267 17.46404
## 10 32.96256 32.373312 33.55180
predict(lmFit4, newdata = newData, interval = "prediction", level = 0.95)
## Warning in predict.lm(lmFit4, newdata = newData, interval = "prediction", :
## prediction from a rank-deficient fit may be misleading
## fit lwr upr
## 1 10.14791 4.313901 15.98193
## 2 17.93658 12.095660 23.77751
## 4 27.74494 21.883892 33.60599
## 6 37.61549 31.740613 43.49036
## 8 16.93565 11.091389 22.77992
## 10 32.96256 27.112476 38.81264
It seems that prediction interval is larger than confidence interval. This is because Confidence interval estimates the expectation E(y|x) while Prediction interval estimates a single output y|x.
coef(summary(lmFit4)) # the standard errors of coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 86.16156556 20.109420395 4.284637 2.089284e-05
## Compactness -66.76395043 10.886959620 -6.132470 1.453337e-09
## SurfaceArea -0.08884746 0.018059697 -4.919654 1.084017e-06
## WallArea 0.06116209 0.007061689 8.661113 3.269055e-17
## Height 4.18175667 0.356195322 11.740066 3.729639e-29
## GlazingArea 19.87859332 0.860198511 23.109309 4.850098e-88
## GAreaDist 0.25023268 0.074481310 3.359671 8.231088e-04
confint(lmFit4) # the confidence intervals of coefficients
## 2.5 % 97.5 %
## (Intercept) 46.67886911 125.64426200
## Compactness -88.13933140 -45.38856947
## SurfaceArea -0.12430574 -0.05338918
## WallArea 0.04729722 0.07502696
## RoofArea NA NA
## Height 3.48240525 4.88110809
## GlazingArea 18.18968553 21.56750110
## GAreaDist 0.10399659 0.39646877
var(lmFit4$residuals) # variance of residuals
## [1] 8.71238
lmFit5 <- lm(CoolingLoad ~ Compactness + SurfaceArea + WallArea + RoofArea + Height + GlazingArea + GAreaDist, data=energyData)
newData <- energyPred[c(1, 2, 4, 6, 8, 10),]
predict(lmFit4, newdata = newData, interval = "confidence", level = 0.95)
## Warning in predict.lm(lmFit4, newdata = newData, interval = "confidence", :
## prediction from a rank-deficient fit may be misleading
## fit lwr upr
## 1 10.14791 9.748573 10.54726
## 2 17.93658 17.446524 18.42665
## 4 27.74494 27.055281 28.43460
## 6 37.61549 36.816812 38.41416
## 8 16.93565 16.407267 17.46404
## 10 32.96256 32.373312 33.55180
predict(lmFit4, newdata = newData, interval = "prediction", level = 0.95)
## Warning in predict.lm(lmFit4, newdata = newData, interval = "prediction", :
## prediction from a rank-deficient fit may be misleading
## fit lwr upr
## 1 10.14791 4.313901 15.98193
## 2 17.93658 12.095660 23.77751
## 4 27.74494 21.883892 33.60599
## 6 37.61549 31.740613 43.49036
## 8 16.93565 11.091389 22.77992
## 10 32.96256 27.112476 38.81264
It seems that prediction interval is larger than confidence interval. This is because Confidence interval estimates the expectation E(y|x) while Prediction interval estimates a single output y|x.
coef(summary(lmFit5)) # the standard errors of coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 101.03509886 21.786731463 4.6374601 4.216438e-06
## Compactness -72.85734862 11.795032429 -6.1769520 1.113783e-09
## SurfaceArea -0.09128750 0.019566042 -4.6656090 3.693638e-06
## WallArea 0.04585184 0.007650699 5.9931570 3.309036e-09
## Height 4.23460432 0.385905296 10.9731698 6.087681e-26
## GlazingArea 14.73176465 0.931946998 15.8075134 2.712576e-48
## GAreaDist 0.07695136 0.080693738 0.9536224 3.406073e-01
confint(lmFit5) # the confidence intervals of coefficients
## 2.5 % 97.5 %
## (Intercept) 58.25918151 143.81101622
## Compactness -96.01563344 -49.69906381
## SurfaceArea -0.12970333 -0.05287167
## WallArea 0.03083051 0.06087317
## RoofArea NA NA
## Height 3.47692054 4.99228810
## GlazingArea 12.90198638 16.56154292
## GAreaDist -0.08148217 0.23538488
var(lmFit5$residuals) # variance of residuals
## [1] 10.22638
In conclusion, it would appear that the HeatingLoad and CoolingLoad are relatively similar to one another. One would only need to predict one of them to find the value of the other. This could be due to a building’s construction. A building’s total heating load required for the building should actually be equal to the total cooling load required for the building, ceteris paribus.