Problem: Use an individual's information to predict whether or not the person earns more than $50,000 per year.
Data: Census data for earnings, 2010
Source: UCI Machine Learning Laboratory
Install and load required packages.
Package caTools is required for splitting the data into training and testing datasets.
if(!require(caTools)){
install.packages('caTools')
}
library(caTools)
CART (Classification and regression tree) models require rpart and rpart.plot packages.
if(!require(rpart)){
install.packages('rpart')
}
library(rpart)
if(!require(rpart.plot)){
install.packages('rpart.plot')
}
library(rpart.plot)
ROCR package is required to evaluate the performance of models and to plot the Receiver Operating Characteristic (ROC) curves.
if(!require(ROCR)){
install.packages('ROCR')
}
library(ROCR)
Packages caret and e1071 are required for cross-validation.
if(!require(caret)){
install.packages('caret')
}
library(caret)
if(!require(e1071)){
install.packages('e1071')
}
library(e1071)
Random Forest models require the randomForest package.
if(!require(randomForest)){
install.packages('randomForest')
}
library(randomForest)
Set the working directory and read data into dataframe.
setwd('C:/Users/Sachin/Desktop/MyRData')
census <- read.csv('census2010.csv')
View the names of variables in the dataframe and their types.
str(census)
## 'data.frame': 31978 obs. of 14 variables:
## $ age : int 39 50 38 53 28 37 49 52 31 42 ...
## $ workclass : Factor w/ 9 levels " ?"," Federal-gov",..: 8 7 5 5 5 5 5 7 5 5 ...
## $ education : Factor w/ 16 levels " 10th"," 11th",..: 10 10 12 2 10 13 7 12 13 10 ...
## $ educationnum : int 13 13 9 7 13 14 5 9 14 13 ...
## $ maritalstatus: Factor w/ 7 levels " Divorced"," Married-AF-spouse",..: 5 3 1 3 3 3 ...
## $ occupation : Factor w/ 15 levels " ?"," Adm-clerical",..: 2 5 7 7 11 5 9 5 11 5 ...
## $ relationship : Factor w/ 6 levels " Husband"," Not-in-family",..: 2 1 2 1 6 6 2 1 ...
## $ race : Factor w/ 5 levels " Amer-Indian-Eskimo",..: 5 5 5 3 3 5 3 5 5 5 ...
## $ sex : Factor w/ 2 levels " Female"," Male": 2 2 2 2 1 1 1 2 1 2 ...
## $ capitalgain : int 2174 0 0 0 0 0 0 0 14084 5178 ...
## $ capitalloss : int 0 0 0 0 0 0 0 0 0 0 ...
## $ hoursperweek : int 40 13 40 40 40 40 16 45 50 40 ...
## $ nativecountry: Factor w/ 41 levels " Cambodia"," Canada",..: 39 39 39 39 5 39 23 ...
## $ fiftyKPlus : Factor w/ 2 levels " <=50K"," >50K": 1 1 1 1 1 1 1 2 2 2 ...
The variable educationnum is a numerical representation of the variable education. Therefore, for the
purpose of building models, I will use the variable education and will not use the variable educationnum.
Look at the summary of dataframe; notice missing values and any abnormal distributions.
summary(census)
## age workclass education
## Min. :17.00 Private :22286 HS-grad :10368
## 1st Qu.:28.00 Self-emp-not-inc: 2499 Some-college: 7187
## Median :37.00 Local-gov : 2067 Bachelors : 5210
## Mean :38.58 ? : 1809 Masters : 1674
## 3rd Qu.:48.00 State-gov : 1279 Assoc-voc : 1366
## Max. :90.00 Self-emp-inc : 1074 11th : 1167
## (Other) : 964 (Other) : 5006
## educationnum maritalstatus occupation
## Min. : 1.00 Divorced : 4394 Prof-specialty :4038
## 1st Qu.: 9.00 Married-AF-spouse : 23 Craft-repair :4030
## Median :10.00 Married-civ-spouse :14692 Exec-managerial:3992
## Mean :10.07 Married-spouse-absent: 397 Adm-clerical :3721
## 3rd Qu.:12.00 Never-married :10488 Sales :3584
## Max. :16.00 Separated : 1005 Other-service :3212
## Widowed : 979 (Other) :9401
## relationship race sex
## Husband :12947 Amer-Indian-Eskimo: 311 Female:10608
## Not-in-family : 8156 Asian-Pac-Islander: 956 Male :21370
## Other-relative: 952 Black : 3028
## Own-child : 5005 Other : 253
## Unmarried : 3384 White :27430
## Wife : 1534
## capitalgain capitalloss hoursperweek nativecountry
## Min. : 0 Min. : 0.00 Min. : 1.00 United-States:29170
## 1st Qu.: 0 1st Qu.: 0.00 1st Qu.:40.00 Mexico : 643
## Median : 0 Median : 0.00 Median :40.00 Philippines : 198
## Mean : 1064 Mean : 86.74 Mean :40.42 Germany : 137
## 3rd Qu.: 0 3rd Qu.: 0.00 3rd Qu.:45.00 Canada : 121
## Max. :99999 Max. :4356.00 Max. :99.00 Puerto-Rico : 114
## (Other) : 1595
## fiftyKPlus
## <=50K:24283
## >50K : 7695
View the first and last few rows of the dataframe.
head(census)
## age workclass education educationnum maritalstatus
## 1 39 State-gov Bachelors 13 Never-married
## 2 50 Self-emp-not-inc Bachelors 13 Married-civ-spouse
## 3 38 Private HS-grad 9 Divorced
## 4 53 Private 11th 7 Married-civ-spouse
## 5 28 Private Bachelors 13 Married-civ-spouse
## 6 37 Private Masters 14 Married-civ-spouse
## occupation relationship race sex capitalgain capitalloss
## 1 Adm-clerical Not-in-family White Male 2174 0
## 2 Exec-managerial Husband White Male 0 0
## 3 Handlers-cleaners Not-in-family White Male 0 0
## 4 Handlers-cleaners Husband Black Male 0 0
## 5 Prof-specialty Wife Black Female 0 0
## 6 Exec-managerial Wife White Female 0 0
## hoursperweek nativecountry fiftyKPlus
## 1 40 United-States <=50K
## 2 13 United-States <=50K
## 3 40 United-States <=50K
## 4 40 United-States <=50K
## 5 40 Cuba <=50K
## 6 40 United-States <=50K
tail(census)
## age workclass education educationnum maritalstatus
## 31973 22 Private Some-college 10 Never-married
## 31974 27 Private Assoc-acdm 12 Married-civ-spouse
## 31975 40 Private HS-grad 9 Married-civ-spouse
## 31976 58 Private HS-grad 9 Widowed
## 31977 22 Private HS-grad 9 Never-married
## 31978 52 Self-emp-inc HS-grad 9 Married-civ-spouse
## occupation relationship race sex capitalgain
## 31973 Protective-serv Not-in-family White Male 0
## 31974 Tech-support Wife White Female 0
## 31975 Machine-op-inspct Husband White Male 0
## 31976 Adm-clerical Unmarried White Female 0
## 31977 Adm-clerical Own-child White Male 0
## 31978 Exec-managerial Wife White Female 15024
## capitalloss hoursperweek nativecountry fiftyKPlus
## 31973 0 40 United-States <=50K
## 31974 0 38 United-States <=50K
## 31975 0 40 United-States >50K
## 31976 0 40 United-States <=50K
## 31977 0 20 United-States <=50K
## 31978 0 40 United-States >50K
Split the data into training and testing datasets.
Use 60% of the dataset for training the model and the remaining 30% for testing the model.
set.seed(5678)
split <- sample.split(census$fiftyKPlus, SplitRatio=0.6)
train <- subset(census, split==TRUE)
test <- subset(census, split==FALSE)
CLASSIFICATION AND REGRESSION TREE (CART) MODEL
Use the training dataset to build a CART model.
censusTree <- rpart(fiftyKPlus ~age + workclass + education + maritalstatus +
occupation + relationship + race + sex + capitalgain +
capitalloss + hoursperweek + nativecountry,
data = train,
method ='class')
Plot the censusTree.
prp(censusTree, type=1, compress=FALSE, box.col=c('yellow', 'pink')[censusTree$frame$yval])

Score the model using the test dataset, ie. use the CART model to predict the label using test dataset.
predictTest.CensusTree <- predict(censusTree, newdata=test, type='class')
Build a Confusion Matrix and compute the accuracy of the CART model on test dataset.
confusionMatrix.CensusTree <- table(test$fiftyKPlus, predictTest.CensusTree)
accuracy.CensusTree <- (confusionMatrix.CensusTree[1,1] + confusionMatrix.CensusTree[2,2])/sum(confusionMatrix.CensusTree)
sprintf('The accurary of the CART model is: %f', accuracy.CensusTree)
The accurary of the CART model is: 0.849582
Performance of the CART model: Compute the area under the ROC curve.
predictTest.CensusTree2 <- predict(censusTree, newdata=test)
predictTest.CensusTree3 <- predictTest.CensusTree2[,2]
censusTree.ROCR <- prediction(predictTest.CensusTree3, test$fiftyKPlus)
perf.censusTree.ROCR <- performance(censusTree.ROCR, 'tpr','fpr')
Plot Receiver Operating Characteristic (ROC) curve.
plot(perf.censusTree.ROCR, main='ROC Curve for CART Model',
colorize=TRUE,
print.cutoffs.at=seq(0,1,0.1),text.adj=c(-0.1,1.2))

Find the area under the ROC curve.
auc.censusTree <- as.numeric(performance(censusTree.ROCR, 'auc')@y.values)
sprintf('The area under the ROC curve for the CART Model is: %f',auc.censusTree)
The area under the ROC curve for the CART Model is: 0.846746
CART MODEL WITH CROSS VALIDATION
Create a 10-fold cross validation model.
fitControl <- trainControl(method='cv', number=10)
cartGrid = expand.grid(.cp=seq(0.0001,0.01,0.0002))
Train to find the optimum complexity parameter (cp).
cpTable=train(fiftyKPlus ~age + workclass + education + maritalstatus + occupation +
relationship + race + sex + capitalgain + capitalloss + hoursperweek +
nativecountry,
data = train,
method = 'rpart',
trControl = fitControl,
tuneGrid = cartGrid)
print(cpTable)
## CART
##
## 19187 samples
## 13 predictor
## 2 classes: ' <=50K', ' >50K'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 17268, 17268, 17269, 17268, 17268, 17268, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa Accuracy SD Kappa SD
## 0.0001 0.8451027 0.5521496 0.007447661 0.02237656
## 0.0003 0.8518260 0.5659783 0.005656349 0.02076003
## 0.0005 0.8527128 0.5674594 0.004759356 0.01974055
## 0.0007 0.8534945 0.5651194 0.004058885 0.01821072
## 0.0009 0.8548493 0.5705408 0.005293786 0.02616873
## 0.0011 0.8554226 0.5706891 0.005275069 0.02398613
## 0.0013 0.8557352 0.5649620 0.005583107 0.02105032
## 0.0015 0.8556310 0.5656655 0.005484671 0.01862422
## 0.0017 0.8554746 0.5654269 0.005869369 0.02030439
## 0.0019 0.8554224 0.5649776 0.006222305 0.02070911
## 0.0021 0.8547969 0.5636514 0.005754433 0.01971248
## 0.0023 0.8539110 0.5617916 0.005181031 0.01855891
## 0.0025 0.8537025 0.5616562 0.004974453 0.01823042
## 0.0027 0.8537025 0.5616562 0.004974453 0.01823042
## 0.0029 0.8535461 0.5615035 0.005229620 0.01891518
## 0.0031 0.8535461 0.5615035 0.005229620 0.01891518
## 0.0033 0.8535461 0.5615035 0.005229620 0.01891518
## 0.0035 0.8532855 0.5611465 0.005155199 0.01945739
## 0.0037 0.8529728 0.5608301 0.004935175 0.01917629
## 0.0039 0.8528686 0.5607988 0.004919927 0.01916543
## 0.0041 0.8514092 0.5587599 0.005941366 0.02106837
## 0.0043 0.8507318 0.5578820 0.005360193 0.02037283
## 0.0045 0.8505754 0.5586239 0.005215911 0.01912316
## 0.0047 0.8505754 0.5587028 0.005215911 0.01893425
## 0.0049 0.8503149 0.5578247 0.005042505 0.01856463
## 0.0051 0.8500022 0.5566787 0.005048704 0.01874286
## 0.0053 0.8491684 0.5515700 0.005634282 0.02394063
## 0.0055 0.8491684 0.5515700 0.005634282 0.02394063
## 0.0057 0.8486994 0.5519440 0.005830518 0.02378685
## 0.0059 0.8479699 0.5484015 0.005178705 0.02235647
## 0.0061 0.8479699 0.5484015 0.005178705 0.02235647
## 0.0063 0.8479699 0.5484015 0.005178705 0.02235647
## 0.0065 0.8479699 0.5484015 0.005178705 0.02235647
## 0.0067 0.8479699 0.5484015 0.005178705 0.02235647
## 0.0069 0.8479699 0.5484015 0.005178705 0.02235647
## 0.0071 0.8475007 0.5474264 0.006070675 0.02397956
## 0.0073 0.8471357 0.5466614 0.005624372 0.02315024
## 0.0075 0.8467188 0.5460389 0.005518118 0.02216232
## 0.0077 0.8467188 0.5460389 0.005518118 0.02216232
## 0.0079 0.8460932 0.5419583 0.007144320 0.03149372
## 0.0081 0.8458847 0.5414910 0.006929118 0.03091918
## 0.0083 0.8458847 0.5414910 0.006929118 0.03091918
## 0.0085 0.8457805 0.5414673 0.006913012 0.03090712
## 0.0087 0.8457805 0.5414673 0.006913012 0.03090712
## 0.0089 0.8457805 0.5414673 0.006913012 0.03090712
## 0.0091 0.8457805 0.5417572 0.006913012 0.03097151
## 0.0093 0.8457805 0.5417572 0.006913012 0.03097151
## 0.0095 0.8457805 0.5417572 0.006913012 0.03097151
## 0.0097 0.8457805 0.5417572 0.006913012 0.03097151
## 0.0099 0.8453636 0.5409112 0.007141527 0.03083063
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.0013.
Use the optimum cp value to build the CART model with Cross Validation.
censusTreeCV <- rpart(fiftyKPlus ~age + workclass + education + maritalstatus +
occupation + relationship + race + sex + capitalgain +
capitalloss + hoursperweek + nativecountry,
data = train,
method ='class',
control = rpart.control(cp=cpTable$results$cp[which.max(cpTable$results[,2])]))
Plot the censusTree.
prp(censusTreeCV, type=1, compress=FALSE, box.col=c('yellow', 'pink')[censusTree$frame$yval])

Build a Confusion Matrix and compute the accuracy of the CART model with Cross Validation on test dataset.
predict.CensusTreeCV <- predict(censusTreeCV, newdata=test, type='class')
confusionMatrix <- table(test$fiftyKPlus, predict.CensusTreeCV)
accuracy.CensusTreeCV <- (confusionMatrix[1,1] + confusionMatrix[2,2])/sum(confusionMatrix)
sprintf('The accurary of the CART model with Cross Validation is: %f', accuracy.CensusTreeCV)
The accurary of the CART model with Cross Validation is: 0.862403
Performance of the CART model with Cross Validation: Compute the area under the ROC curve.
predict.CensusTreeCV2 <- predict(censusTreeCV, newdata=test)
predict.CensusTreeCV3 <- predict.CensusTreeCV2[,2]
predict.CensusTreeCV.ROC <- prediction(predict.CensusTreeCV3, test$fiftyKPlus)
perf.CensusTreeCV.ROC <- performance(predict.CensusTreeCV.ROC, 'tpr','fpr')
plot(perf.CensusTreeCV.ROC, main='ROC Curve for CART Model with Cross Validation',
colorize=TRUE,
print.cutoffs.at=seq(0,1,0.1),text.adj=c(-0.1,1.2))
Plot Receiver Operaring Characteristic (ROC) curve.

Find the area under the ROC curve.
auc.CensusTreeCV <- as.numeric(performance(predict.CensusTreeCV.ROC, 'auc')@y.values)
sprintf('The area under the ROC curve for the CART Model with Cross Validation is: %f',auc.CensusTreeCV)
The area under the ROC curve for the CART Model with Cross Validation is: 0.871933
RANDOM FOREST MODEL
set.seed(6789)
Use the training dataset to build a Random Forest model.
censusRF <- randomForest(fiftyKPlus ~age + workclass + education + maritalstatus +
occupation + relationship + race + sex + capitalgain +
capitalloss + hoursperweek + nativecountry,
data=train)
Use the Random Forest model to predict the outcome of the test dataset.
predict.censusRF <- predict(censusRF, newdata=test)
Build Confusion Matrix and compute the accuracy of the Random Forest model on the test dataset.
confusionMatrix <- table(test$fiftyKPlus, predict.censusRF)
accuracy.censusRF <- (confusionMatrix[1,1] + confusionMatrix[2,2])/sum(confusionMatrix)
sprintf('The accurary of the RANDOM FOREST model is: %f', accuracy.censusRF)
The accurary of the RANDOM FOREST model is: 0.825580
LOGISTIC REGRESSION MODEL
Build a Logistic Regression model using the training dataset.
censusGLM <- glm(fiftyKPlus ~age + workclass + education + maritalstatus +
occupation + relationship + race + sex + capitalgain +
capitalloss + hoursperweek + nativecountry,
data=train,
family = 'binomial')
summary(censusGLM)
## Call:
## glm(formula = fiftyKPlus ~ age + workclass + education + maritalstatus +
## occupation + relationship + race + sex + capitalgain + capitalloss +
## hoursperweek + nativecountry, family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.3156 -0.4984 -0.1806 -0.0007 3.6020
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error z value
## (Intercept) -7.429e+00 9.951e-01 -7.466
## age 2.467e-02 2.136e-03 11.550
## workclass Federal-gov 1.232e+00 1.985e-01 6.205
## workclass Local-gov 5.380e-01 1.798e-01 2.993
## workclass Never-worked -1.307e+01 9.850e+02 -0.013
## workclass Private 7.284e-01 1.589e-01 4.585
## workclass Self-emp-inc 9.060e-01 1.916e-01 4.728
## workclass Self-emp-not-inc 2.890e-01 1.760e-01 1.642
## workclass State-gov 2.776e-01 1.957e-01 1.419
## workclass Without-pay -1.373e+01 5.976e+02 -0.023
## education 11th -1.379e-01 2.785e-01 -0.495
## education 12th -1.812e-01 4.071e-01 -0.445
## education 1st-4th -8.493e-01 7.243e-01 -1.173
## education 5th-6th -5.030e-01 4.564e-01 -1.102
## education 7th-8th -4.900e-01 3.069e-01 -1.596
## education 9th -4.253e-01 3.506e-01 -1.213
## education Assoc-acdm 1.237e+00 2.307e-01 5.364
## education Assoc-voc 1.427e+00 2.207e-01 6.465
## education Bachelors 1.965e+00 2.057e-01 9.551
## education Doctorate 2.905e+00 2.797e-01 10.387
## education HS-grad 8.289e-01 2.002e-01 4.140
## education Masters 2.329e+00 2.196e-01 10.606
## education Preschool -1.281e+01 3.742e+02 -0.034
## education Prof-school 2.855e+00 2.637e-01 10.827
## education Some-college 1.224e+00 2.029e-01 6.032
## maritalstatus Married-AF-spouse 2.395e+00 7.854e-01 3.049
## maritalstatus Married-civ-spouse 1.603e+00 3.824e-01 4.193
## maritalstatus Married-spouse-absent 3.472e-01 2.807e-01 1.237
## maritalstatus Never-married -5.178e-01 1.138e-01 -4.550
## maritalstatus Separated 5.013e-02 2.079e-01 0.241
## maritalstatus Widowed 9.353e-02 2.021e-01 0.463
## occupation Adm-clerical -7.306e-02 1.283e-01 -0.569
## occupation Armed-Forces -1.393e+01 9.937e+02 -0.014
## occupation Craft-repair 4.995e-02 1.092e-01 0.457
## occupation Exec-managerial 7.267e-01 1.133e-01 6.414
## occupation Farming-fishing -1.214e+00 1.851e-01 -6.558
## occupation Handlers-cleaners -7.398e-01 1.879e-01 -3.938
## occupation Machine-op-inspct -3.752e-01 1.393e-01 -2.693
## occupation Other-service -9.234e-01 1.655e-01 -5.579
## occupation Priv-house-serv -1.390e+01 2.166e+02 -0.064
## occupation Prof-specialty 3.865e-01 1.213e-01 3.186
## occupation Protective-serv 6.512e-01 1.718e-01 3.790
## occupation Sales 1.892e-01 1.170e-01 1.617
## occupation Tech-support 5.251e-01 1.532e-01 3.428
## occupation Transport-moving NA NA NA
## relationship Not-in-family -3.240e-02 3.786e-01 -0.086
## relationship Other-relative -8.494e-01 3.448e-01 -2.463
## relationship Own-child -1.321e+00 3.894e-01 -3.391
## relationship Unmarried -2.496e-01 3.981e-01 -0.627
## relationship Wife 1.340e+00 1.340e-01 9.995
## race Asian-Pac-Islander 1.057e+00 3.695e-01 2.861
## race Black 6.486e-01 3.191e-01 2.033
## race Other 5.881e-01 4.800e-01 1.225
## race White 7.755e-01 3.050e-01 2.542
## sex Male 8.798e-01 1.038e-01 8.479
## capitalgain 3.165e-04 1.349e-05 23.467
## capitalloss 6.440e-04 4.820e-05 13.359
## hoursperweek 3.092e-02 2.116e-03 14.609
## nativecountry Canada -8.524e-01 8.920e-01 -0.956
## nativecountry China -2.168e+00 9.122e-01 -2.377
## nativecountry Columbia -3.213e+00 1.390e+00 -2.311
## nativecountry Cuba -7.595e-01 9.132e-01 -0.832
## nativecountry Dominican-Republic -1.438e+01 3.437e+02 -0.042
## nativecountry Ecuador -1.454e+00 1.202e+00 -1.210
## nativecountry El-Salvador -1.185e+00 1.008e+00 -1.176
## nativecountry England -5.668e-01 9.114e-01 -0.622
## nativecountry France -4.422e-01 1.106e+00 -0.400
## nativecountry Germany -6.506e-01 8.793e-01 -0.740
## nativecountry Greece -1.763e+00 1.093e+00 -1.613
## nativecountry Guatemala -1.428e+01 3.288e+02 -0.043
## nativecountry Haiti -1.314e+00 1.273e+00 -1.032
## nativecountry Holand-Netherlands -1.357e+01 2.400e+03 -0.006
## nativecountry Honduras -2.018e+00 2.737e+00 -0.737
## nativecountry Hong -2.114e+00 1.246e+00 -1.696
## nativecountry Hungary -7.671e-01 1.297e+00 -0.592
## nativecountry India -1.460e+00 8.671e-01 -1.684
## nativecountry Iran -3.787e-01 1.029e+00 -0.368
## nativecountry Ireland -2.845e-01 1.042e+00 -0.273
## nativecountry Italy -2.959e-02 9.068e-01 -0.033
## nativecountry Jamaica -2.903e-01 9.633e-01 -0.301
## nativecountry Japan -8.212e-01 9.677e-01 -0.849
## nativecountry Laos -1.583e+01 7.962e+02 -0.020
## nativecountry Mexico -1.237e+00 8.610e-01 -1.437
## nativecountry Nicaragua -1.872e+00 1.394e+00 -1.343
## nativecountry Outlying-US(Guam-USVI-etc) -1.525e+01 7.940e+02 -0.019
## nativecountry Peru -1.903e+00 1.501e+00 -1.267
## nativecountry Philippines -7.124e-01 8.401e-01 -0.848
## nativecountry Poland -8.865e-01 9.495e-01 -0.934
## nativecountry Portugal -1.682e+00 1.354e+00 -1.243
## nativecountry Puerto-Rico -1.373e+00 9.798e-01 -1.401
## nativecountry Scotland -6.633e-01 1.176e+00 -0.564
## nativecountry South -1.900e+00 9.244e-01 -2.055
## nativecountry Taiwan -7.558e-01 9.616e-01 -0.786
## nativecountry Thailand -2.125e+00 1.231e+00 -1.726
## nativecountry Trinadad&Tobago -8.819e-01 1.226e+00 -0.719
## nativecountry United-States -7.282e-01 8.179e-01 -0.890
## nativecountry Vietnam -1.221e+00 9.880e-01 -1.236
## nativecountry Yugoslavia 3.934e-01 1.246e+00 0.316
## Pr(>|z|)
## (Intercept) 8.28e-14 ***
## age < 2e-16 ***
## workclass Federal-gov 5.48e-10 ***
## workclass Local-gov 0.002767 **
## workclass Never-worked 0.989415
## workclass Private 4.54e-06 ***
## workclass Self-emp-inc 2.26e-06 ***
## workclass Self-emp-not-inc 0.100665
## workclass State-gov 0.155982
## workclass Without-pay 0.981669
## education 11th 0.620585
## education 12th 0.656309
## education 1st-4th 0.240990
## education 5th-6th 0.270411
## education 7th-8th 0.110416
## education 9th 0.225078
## education Assoc-acdm 8.13e-08 ***
## education Assoc-voc 1.01e-10 ***
## education Bachelors < 2e-16 ***
## education Doctorate < 2e-16 ***
## education HS-grad 3.47e-05 ***
## education Masters < 2e-16 ***
## education Preschool 0.972702
## education Prof-school < 2e-16 ***
## education Some-college 1.62e-09 ***
## maritalstatus Married-AF-spouse 0.002295 **
## maritalstatus Married-civ-spouse 2.75e-05 ***
## maritalstatus Married-spouse-absent 0.216096
## maritalstatus Never-married 5.36e-06 ***
## maritalstatus Separated 0.809444
## maritalstatus Widowed 0.643481
## occupation Adm-clerical 0.569102
## occupation Armed-Forces 0.988812
## occupation Craft-repair 0.647523
## occupation Exec-managerial 1.42e-10 ***
## occupation Farming-fishing 5.46e-11 ***
## occupation Handlers-cleaners 8.22e-05 ***
## occupation Machine-op-inspct 0.007089 **
## occupation Other-service 2.42e-08 ***
## occupation Priv-house-serv 0.948837
## occupation Prof-specialty 0.001442 **
## occupation Protective-serv 0.000150 ***
## occupation Sales 0.105965
## occupation Tech-support 0.000608 ***
## occupation Transport-moving NA
## relationship Not-in-family 0.931792
## relationship Other-relative 0.013774 *
## relationship Own-child 0.000696 ***
## relationship Unmarried 0.530705
## relationship Wife < 2e-16 ***
## race Asian-Pac-Islander 0.004226 **
## race Black 0.042071 *
## race Other 0.220512
## race White 0.011010 *
## sex Male < 2e-16 ***
## capitalgain < 2e-16 ***
## capitalloss < 2e-16 ***
## hoursperweek < 2e-16 ***
## nativecountry Canada 0.339268
## nativecountry China 0.017471 *
## nativecountry Columbia 0.020846 *
## nativecountry Cuba 0.405575
## nativecountry Dominican-Republic 0.966629
## nativecountry Ecuador 0.226257
## nativecountry El-Salvador 0.239739
## nativecountry England 0.533998
## nativecountry France 0.689224
## nativecountry Germany 0.459358
## nativecountry Greece 0.106651
## nativecountry Guatemala 0.965373
## nativecountry Haiti 0.301939
## nativecountry Holand-Netherlands 0.995488
## nativecountry Honduras 0.460864
## nativecountry Hong 0.089910 .
## nativecountry Hungary 0.554123
## nativecountry India 0.092160 .
## nativecountry Iran 0.712740
## nativecountry Ireland 0.784864
## nativecountry Italy 0.973967
## nativecountry Jamaica 0.763106
## nativecountry Japan 0.396072
## nativecountry Laos 0.984139
## nativecountry Mexico 0.150722
## nativecountry Nicaragua 0.179252
## nativecountry Outlying-US(Guam-USVI-etc) 0.984673
## nativecountry Peru 0.205029
## nativecountry Philippines 0.396441
## nativecountry Poland 0.350469
## nativecountry Portugal 0.213951
## nativecountry Puerto-Rico 0.161076
## nativecountry Scotland 0.572856
## nativecountry South 0.039884 *
## nativecountry Taiwan 0.431905
## nativecountry Thailand 0.084266 .
## nativecountry Trinadad&Tobago 0.471926
## nativecountry United-States 0.373292
## nativecountry Vietnam 0.216499
## nativecountry Yugoslavia 0.752289
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 21175 on 19186 degrees of freedom
## Residual deviance: 12121 on 19090 degrees of freedom
## AIC: 12315
##
## Number of Fisher Scoring iterations: 15
Use the logistic regression model to predict the outcome of the test dataset.
predict.CensusGLM <- predict(censusGLM, newdata=test, type='response')
Build Confusion Matrix and compute the accuracy of the logistic regression model on test dataset.
confusionMatrix <- table(test$fiftyKPlus, predict.CensusGLM>=0.5)
accuracy.CensusGLM <- (confusionMatrix[1,1] + confusionMatrix[2,2])/sum(confusionMatrix)
sprintf('The accuracy of the Logistic Regression model is: %f', accuracy.CensusGLM)
The accuracy of the Logistic Regression model is: 0.850989
Compute the baseline accuracy of the model.
baselineTrainMatrix <- table(train$fiftyKPlus)
baselineTestMatrix <- table(test$fiftyKPlus)
if(baselineTrainMatrix[1] > baselineTrainMatrix[2])
{
baselineAccuracy.CensusGLM <- baselineTestMatrix[1]/sum(baselineTestMatrix)
}else
{
baselineAccuracy.CensusGLM <- baselineTrainMatrix[1]/sum(baselineTestMatrix)
}
sprintf('The baseline accuracy of the Logistic Regression Model is: %f',baselineAccuracy.CensusGLM)
The baseline accuracy of the Logistic Regression Model is: 0.759362
Performance of the Logistic Regression model: Plot the Receiver Operating Characteristic curve and compute the area under the ROC curve
predict.CensusGLM.ROC <- prediction(predict.CensusGLM, test$fiftyKPlus)
perf.CensusGLM.ROC <- performance(predict.CensusGLM.ROC, 'tpr','fpr')
plot(perf.CensusGLM.ROC, main='ROC Curve for Logistic Regression Model',
colorize=TRUE,
print.cutoffs.at=seq(0,1,0.1),text.adj=c(-0.1,1.2))

auc.CensusGLM <- as.numeric(performance(predict.CensusGLM.ROC, 'auc')@y.values)
sprintf('The area under the ROC curve for the Logistic Regression Model is: %f',auc.CensusGLM)
The area under the ROC curve for the Logistic Regression Model is: 0.905733