Perform a detailed exploratory data analysis and regression model building based on assign1_EnergyData.csv

Analysis of Energy Efficiency

Importing the CSV data file assign1_EnergyData.csv for analysis, and quickly check the structure of the data.

## 'data.frame':    700 obs. of  10 variables:
##  $ Compactness: num  0.74 0.9 0.79 0.69 0.82 0.98 0.69 0.62 0.62 0.66 ...
##  $ SurfaceArea: num  686 564 637 735 612 ...
##  $ WallArea   : num  245 318 343 294 318 ...
##  $ RoofArea   : num  220 122 147 220 147 ...
##  $ Height     : num  3.5 7 7 3.5 7 7 3.5 3.5 3.5 3.5 ...
##  $ Orientation: int  2 5 3 5 5 4 2 4 3 3 ...
##  $ GlazingArea: num  0.4 0.1 0 0.4 0.1 0.25 0.25 0.1 0.1 0.25 ...
##  $ GAreaDist  : int  3 4 0 3 5 5 3 5 2 1 ...
##  $ HeatingLoad: num  14 29.8 29.9 14.1 23.9 ...
##  $ CoolingLoad: num  16.1 29.9 31.3 17 27.7 ...

The following table summarizes the variables in the dataset.

Variable Description Numeric / Categorical
Compactness Index of compactness ratio for the building Numeric
SurfaceArea Surface area of the complete building Numeric
WallArea Total surface area of the Walls of the building Numeric
RoofArea Total surface area of the Roof(s) of the building Numeric
Height Overall height of the building Numeric
Orientation Orientation of the building in terms of direction Interger
GlazingArea Total area of Glazing around the building Numeric
GAreaDist Distribution of the Glazing Area in the building Integer
HeatingLoad Total heating load required for the building Numeric
CoolingLoad Total cooling load required for the building Numeric

Heating Load is the amount of heat energy that would need to be added to a space to maintain the temperature in an acceptable range. Cooling Load is the amount of heat energy that would need to be removed from a space (cooling) to maintain the temperature in an acceptable range. Thus, it is important in Civil Engineering to estimate the Heating and Cooling Load of a building while designing/planning the structure.

Exploratory Data Analysis

We begin by viewing some exploratory plots.

Histograms of all individual variables:

Compactness ratio seem to be mostly around 0.6-0.9 with an absence of data between 0.90-0.95. Seems that most This is not a continuous distribution.

Compactness ratio seem to be mostly around 60 with an absence of data at around 525~560, 645~655 and 755~775. This is not a continuous distribution.

Not all histograms seem to follow a normal distribtion or even a continuous distribution. Only HeatingLoad and CoolingLoad is similar to a normal distribution.

Considering the other variables as categorical variables

Barplot of selected variables as catergorical ones:

Boxplot of all individual variables:

The following boxplots are not meaningful to plot. RoofArea as the median = 3rd quartile = 4th quartile Height as 1st quartile = 2nd quartile = median and 3rd quartile = 4th quartile Orientation as 2nd quartile = median GlazingArea 3rd quartile = 4th quartile

Find correlation between variables

##              Compactness  SurfaceArea    WallArea     RoofArea
## Compactness  1.000000000 -0.991809299 -0.20812361 -0.869691184
## SurfaceArea -0.991809299  1.000000000  0.20130347  0.880997750
## WallArea    -0.208123615  0.201303470  1.00000000 -0.286087266
## RoofArea    -0.869691184  0.880997750 -0.28608727  1.000000000
## Height       0.828217657 -0.858340532  0.27432148 -0.972153083
## Orientation  0.002895461 -0.002339471 -0.01373944  0.004347724
## GlazingArea -0.022618635  0.020026945 -0.00447997  0.021754799
## GAreaDist    0.017067129 -0.013939406 -0.01454338 -0.006611360
## HeatingLoad  0.613965409 -0.650383231  0.45153417 -0.854318402
## CoolingLoad  0.629535257 -0.668604808  0.42161086 -0.857690073
##                   Height  Orientation  GlazingArea    GAreaDist
## Compactness  0.828217657  0.002895461 -0.022618635  0.017067129
## SurfaceArea -0.858340532 -0.002339471  0.020026945 -0.013939406
## WallArea     0.274321482 -0.013739441 -0.004479970 -0.014543378
## RoofArea    -0.972153083  0.004347724  0.021754799 -0.006611360
## Height       1.000000000 -0.008961975 -0.013924794  0.005742054
## Orientation -0.008961975  1.000000000  0.001529999  0.022541886
## GlazingArea -0.013924794  0.001529999  1.000000000  0.236956380
## GAreaDist    0.005742054  0.022541886  0.236956380  1.000000000
## HeatingLoad  0.884971395 -0.011043954  0.263708130  0.100780744
## CoolingLoad  0.893038477  0.001686366  0.201229083  0.061463685
##             HeatingLoad  CoolingLoad
## Compactness  0.61396541  0.629535257
## SurfaceArea -0.65038323 -0.668604808
## WallArea     0.45153417  0.421610863
## RoofArea    -0.85431840 -0.857690073
## Height       0.88497140  0.893038477
## Orientation -0.01104395  0.001686366
## GlazingArea  0.26370813  0.201229083
## GAreaDist    0.10078074  0.061463685
## HeatingLoad  1.00000000  0.975695932
## CoolingLoad  0.97569593  1.000000000
## corrplot 0.84 loaded

The dark colored circles represent high correlation. Compactness and SurfaceArea is close to being directly related.

Plot 2d scatterplots of all pairs of variables

The inital scatterplot of all variables did not look very convincing and feels overwhelming due to the 10 variables present. We attempt to remove the variables that did not have strong relation with its other counterparts.

The second plot looks better and we can be very sure that GAreaDist is strongly related to HeatingLoad and CoolingLoad. There also appear to be a pattern between the variables and HeatingLoad and CoolingLoad. The scatter plots of all variables against HeatingLoad is almost the same as plotting them against CoolingLoad. This do suggest a strong direct relation between HeatingLoad and CoolingLoad which the plot between them also proving likewise.

Perform Linear Regression

We shall perform backwards Linear Regression.

lmFit1 <- lm(HeatingLoad ~. , data = energyData)
summary(lmFit1)
## 
## Call:
## lm(formula = HeatingLoad ~ ., data = energyData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.1770 -0.9063 -0.0444  0.7533  6.3206 
## 
## Coefficients: (1 not defined because of singularities)
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  12.465829  12.374688   1.007   0.3141    
## Compactness -13.312667   6.774583  -1.965   0.0498 *  
## SurfaceArea  -0.022008   0.011112  -1.981   0.0480 *  
## WallArea      0.027552   0.004388   6.279 6.01e-10 ***
## RoofArea            NA         NA      NA       NA    
## Height        1.060263   0.233924   4.533 6.87e-06 ***
## Orientation  -0.088424   0.060854  -1.453   0.1467    
## GlazingArea   9.042753   0.607969  14.874  < 2e-16 ***
## GAreaDist     0.195144   0.045161   4.321 1.78e-05 ***
## CoolingLoad   0.735366   0.021253  34.601  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.796 on 691 degrees of freedom
## Multiple R-squared:  0.9684, Adjusted R-squared:  0.9681 
## F-statistic:  2650 on 8 and 691 DF,  p-value: < 2.2e-16
Model No. Residual standard error Multiple R-squared Adjusted R-squared F-statistic
1 1.796 0.9684 0.9681 2650

Create a table as we perform linear regression for easier tracking of progress. From summary, it seems that RoofArea is not applicable for this regression model. This tells me that RoofArea is in fact linearly related to the other variables. We will only know if it is useful after dropping a few variables. The intercept is zero with the largest p-value of 0.05240. However removing the intercept is only in cases whereby the true intercept is zero. Which in this case, I decided against it as the p-value is the largest but not even 0.1. Hence, let us remove the variable with the 2nd largest p-value, Orientation.

lmFit2 <- update(lmFit1,~.-1, data = energyData)
summary(lmFit2)
## 
## Call:
## lm(formula = HeatingLoad ~ Compactness + SurfaceArea + WallArea + 
##     RoofArea + Height + Orientation + GlazingArea + GAreaDist + 
##     CoolingLoad - 1, data = energyData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.1691 -0.8976 -0.0450  0.7366  6.3084 
## 
## Coefficients: (1 not defined because of singularities)
##              Estimate Std. Error t value Pr(>|t|)    
## Compactness -6.598565   1.213511  -5.438 7.50e-08 ***
## SurfaceArea -0.010922   0.001535  -7.114 2.82e-12 ***
## WallArea     0.024526   0.003199   7.667 5.96e-14 ***
## RoofArea           NA         NA      NA       NA    
## Height       1.206685   0.183287   6.584 9.08e-11 ***
## Orientation -0.086729   0.060832  -1.426    0.154    
## GlazingArea  8.994504   0.606086  14.840  < 2e-16 ***
## GAreaDist    0.193951   0.045146   4.296 1.99e-05 ***
## CoolingLoad  0.739057   0.020935  35.303  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.796 on 692 degrees of freedom
## Multiple R-squared:  0.9946, Adjusted R-squared:  0.9946 
## F-statistic: 1.604e+04 on 8 and 692 DF,  p-value: < 2.2e-16
Model No. Residual standard error Multiple R-squared Adjusted R-squared F-statistic
1 1.796 0.9684 0.9681 2650
2 1.796 0.9946 0.9946 1.604e+04

Wow, Multiple R-squared, Adjusted R-squared and F-statistic have drastically increased. This means that this regression probably do have a true intercept of zero. Let us try removing Orientation and see if it improves this regression further as it has the largest p-value of being zero

lmFit3 <- update(lmFit2,~.-Orientation, data = energyData)
summary(lmFit3)
## 
## Call:
## lm(formula = HeatingLoad ~ Compactness + SurfaceArea + WallArea + 
##     RoofArea + Height + GlazingArea + GAreaDist + CoolingLoad - 
##     1, data = energyData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.3109 -0.9080 -0.0305  0.7526  6.2625 
## 
## Coefficients: (1 not defined because of singularities)
##              Estimate Std. Error t value Pr(>|t|)    
## Compactness -6.812438   1.205101  -5.653 2.30e-08 ***
## SurfaceArea -0.011164   0.001527  -7.311 7.34e-13 ***
## WallArea     0.024586   0.003201   7.681 5.40e-14 ***
## RoofArea           NA         NA      NA       NA    
## Height       1.212645   0.183376   6.613 7.53e-11 ***
## GlazingArea  9.013971   0.606383  14.865  < 2e-16 ***
## GAreaDist    0.192606   0.045170   4.264 2.29e-05 ***
## CoolingLoad  0.737893   0.020935  35.247  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.797 on 693 degrees of freedom
## Multiple R-squared:  0.9946, Adjusted R-squared:  0.9946 
## F-statistic: 1.831e+04 on 7 and 693 DF,  p-value: < 2.2e-16
Model No. Residual standard error Multiple R-squared Adjusted R-squared F-statistic
1 1.796 0.9684 0.9681 2650
2 1.796 0.9946 0.9946 1.604e+04
3 1.797 0.9946 0.9946 1.831e+04

It seems that Residual standard error and F- statistic have increased marginally with the rest remaining constant. I am happy with this regression and will stop here.

Predict the two response variables, Heating Load and Cooling Load, for the samples in assign1_EnergyPred.csv

Importing the CSV data file assign1_EnergyPred.csv for predicting.

Predicting HeatingLoad

energyPred <- read.csv("assign1_EnergyPred.csv", header = TRUE)

lmFit4 <- lm(HeatingLoad ~ Compactness + SurfaceArea + WallArea + RoofArea + Height + GlazingArea + GAreaDist, data=energyData)

newData <- energyPred[c(1, 2, 4, 6, 8, 10),]
predict(lmFit4, newdata = newData, interval = "confidence", level = 0.95)
## Warning in predict.lm(lmFit4, newdata = newData, interval = "confidence", :
## prediction from a rank-deficient fit may be misleading
##         fit       lwr      upr
## 1  10.14791  9.748573 10.54726
## 2  17.93658 17.446524 18.42665
## 4  27.74494 27.055281 28.43460
## 6  37.61549 36.816812 38.41416
## 8  16.93565 16.407267 17.46404
## 10 32.96256 32.373312 33.55180
predict(lmFit4, newdata = newData, interval = "prediction", level = 0.95)
## Warning in predict.lm(lmFit4, newdata = newData, interval = "prediction", :
## prediction from a rank-deficient fit may be misleading
##         fit       lwr      upr
## 1  10.14791  4.313901 15.98193
## 2  17.93658 12.095660 23.77751
## 4  27.74494 21.883892 33.60599
## 6  37.61549 31.740613 43.49036
## 8  16.93565 11.091389 22.77992
## 10 32.96256 27.112476 38.81264

It seems that prediction interval is larger than confidence interval. This is because Confidence interval estimates the expectation E(y|x) while Prediction interval estimates a single output y|x.

coef(summary(lmFit4))    # the standard errors of coefficients
##                 Estimate   Std. Error   t value     Pr(>|t|)
## (Intercept)  86.16156556 20.109420395  4.284637 2.089284e-05
## Compactness -66.76395043 10.886959620 -6.132470 1.453337e-09
## SurfaceArea  -0.08884746  0.018059697 -4.919654 1.084017e-06
## WallArea      0.06116209  0.007061689  8.661113 3.269055e-17
## Height        4.18175667  0.356195322 11.740066 3.729639e-29
## GlazingArea  19.87859332  0.860198511 23.109309 4.850098e-88
## GAreaDist     0.25023268  0.074481310  3.359671 8.231088e-04
confint(lmFit4)          # the confidence intervals of coefficients
##                    2.5 %       97.5 %
## (Intercept)  46.67886911 125.64426200
## Compactness -88.13933140 -45.38856947
## SurfaceArea  -0.12430574  -0.05338918
## WallArea      0.04729722   0.07502696
## RoofArea              NA           NA
## Height        3.48240525   4.88110809
## GlazingArea  18.18968553  21.56750110
## GAreaDist     0.10399659   0.39646877
var(lmFit4$residuals)    # variance of residuals
## [1] 8.71238

Predicting CoolingLoad

lmFit5 <- lm(CoolingLoad ~ Compactness + SurfaceArea + WallArea + RoofArea + Height + GlazingArea + GAreaDist, data=energyData)

newData <- energyPred[c(1, 2, 4, 6, 8, 10),]
predict(lmFit4, newdata = newData, interval = "confidence", level = 0.95)
## Warning in predict.lm(lmFit4, newdata = newData, interval = "confidence", :
## prediction from a rank-deficient fit may be misleading
##         fit       lwr      upr
## 1  10.14791  9.748573 10.54726
## 2  17.93658 17.446524 18.42665
## 4  27.74494 27.055281 28.43460
## 6  37.61549 36.816812 38.41416
## 8  16.93565 16.407267 17.46404
## 10 32.96256 32.373312 33.55180
predict(lmFit4, newdata = newData, interval = "prediction", level = 0.95)
## Warning in predict.lm(lmFit4, newdata = newData, interval = "prediction", :
## prediction from a rank-deficient fit may be misleading
##         fit       lwr      upr
## 1  10.14791  4.313901 15.98193
## 2  17.93658 12.095660 23.77751
## 4  27.74494 21.883892 33.60599
## 6  37.61549 31.740613 43.49036
## 8  16.93565 11.091389 22.77992
## 10 32.96256 27.112476 38.81264

It seems that prediction interval is larger than confidence interval. This is because Confidence interval estimates the expectation E(y|x) while Prediction interval estimates a single output y|x.

coef(summary(lmFit5))    # the standard errors of coefficients
##                 Estimate   Std. Error    t value     Pr(>|t|)
## (Intercept) 101.03509886 21.786731463  4.6374601 4.216438e-06
## Compactness -72.85734862 11.795032429 -6.1769520 1.113783e-09
## SurfaceArea  -0.09128750  0.019566042 -4.6656090 3.693638e-06
## WallArea      0.04585184  0.007650699  5.9931570 3.309036e-09
## Height        4.23460432  0.385905296 10.9731698 6.087681e-26
## GlazingArea  14.73176465  0.931946998 15.8075134 2.712576e-48
## GAreaDist     0.07695136  0.080693738  0.9536224 3.406073e-01
confint(lmFit5)          # the confidence intervals of coefficients
##                    2.5 %       97.5 %
## (Intercept)  58.25918151 143.81101622
## Compactness -96.01563344 -49.69906381
## SurfaceArea  -0.12970333  -0.05287167
## WallArea      0.03083051   0.06087317
## RoofArea              NA           NA
## Height        3.47692054   4.99228810
## GlazingArea  12.90198638  16.56154292
## GAreaDist    -0.08148217   0.23538488
var(lmFit5$residuals)    # variance of residuals
## [1] 10.22638

In conclusion, it would appear that the HeatingLoad and CoolingLoad are relatively similar to one another. One would only need to predict one of them to find the value of the other. This could be due to a building’s construction. A building’s total heating load required for the building should actually be equal to the total cooling load required for the building, ceteris paribus.