R Code for Census Project

Problem: Use an individual's information to predict whether or not the person earns more than $50,000 per year.


Data: Census data for earnings, 2010
Source: UCI Machine Learning Laboratory


Install and load required packages.

Package caTools is required for splitting the data into training and testing datasets.


if(!require(caTools)){
  install.packages('caTools')
}
library(caTools)

CART (Classification and regression tree) models require rpart and rpart.plot packages.


if(!require(rpart)){
  install.packages('rpart')
}
library(rpart)

if(!require(rpart.plot)){
  install.packages('rpart.plot')
}
library(rpart.plot)

ROCR package is required to evaluate the performance of models and to plot the Receiver Operating Characteristic (ROC) curves.


if(!require(ROCR)){
  install.packages('ROCR')
}
library(ROCR)

Packages caret and e1071 are required for cross-validation.


if(!require(caret)){
  install.packages('caret')
}
library(caret)

if(!require(e1071)){
  install.packages('e1071')
}
library(e1071)

Random Forest models require the randomForest package.


if(!require(randomForest)){
  install.packages('randomForest')
}
library(randomForest)

Set the working directory and read data into dataframe.


setwd('C:/Users/Sachin/Desktop/MyRData')
census <- read.csv('census2010.csv')

View the names of variables in the dataframe and their types.


str(census)
## 'data.frame':    31978 obs. of  14 variables:
##  $ age          : int  39 50 38 53 28 37 49 52 31 42 ...
##  $ workclass    : Factor w/ 9 levels " ?"," Federal-gov",..: 8 7 5 5 5 5 5 7 5 5 ...
##  $ education    : Factor w/ 16 levels " 10th"," 11th",..: 10 10 12 2 10 13 7 12 13 10 ...
##  $ educationnum : int  13 13 9 7 13 14 5 9 14 13 ...
##  $ maritalstatus: Factor w/ 7 levels " Divorced"," Married-AF-spouse",..: 5 3 1 3 3 3 ...
##  $ occupation   : Factor w/ 15 levels " ?"," Adm-clerical",..: 2 5 7 7 11 5 9 5 11 5 ...
##  $ relationship : Factor w/ 6 levels " Husband"," Not-in-family",..: 2 1 2 1 6 6 2 1 ...
##  $ race         : Factor w/ 5 levels " Amer-Indian-Eskimo",..: 5 5 5 3 3 5 3 5 5 5 ...
##  $ sex          : Factor w/ 2 levels " Female"," Male": 2 2 2 2 1 1 1 2 1 2 ...
##  $ capitalgain  : int  2174 0 0 0 0 0 0 0 14084 5178 ...
##  $ capitalloss  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ hoursperweek : int  40 13 40 40 40 40 16 45 50 40 ...
##  $ nativecountry: Factor w/ 41 levels " Cambodia"," Canada",..: 39 39 39 39 5 39 23 ...
##  $ fiftyKPlus   : Factor w/ 2 levels " <=50K"," >50K": 1 1 1 1 1 1 1 2 2 2 ...

The variable educationnum is a numerical representation of the variable education. Therefore, for the purpose of building models, I will use the variable education and will not use the variable educationnum.


Look at the summary of dataframe; notice missing values and any abnormal distributions.


summary(census)
##       age                    workclass             education    
##  Min.   :17.00    Private         :22286    HS-grad     :10368  
##  1st Qu.:28.00    Self-emp-not-inc: 2499    Some-college: 7187  
##  Median :37.00    Local-gov       : 2067    Bachelors   : 5210  
##  Mean   :38.58    ?               : 1809    Masters     : 1674  
##  3rd Qu.:48.00    State-gov       : 1279    Assoc-voc   : 1366  
##  Max.   :90.00    Self-emp-inc    : 1074    11th        : 1167  
##                  (Other)          :  964   (Other)      : 5006  

##   educationnum                  maritalstatus              occupation  
##  Min.   : 1.00    Divorced             : 4394    Prof-specialty :4038  
##  1st Qu.: 9.00    Married-AF-spouse    :   23    Craft-repair   :4030  
##  Median :10.00    Married-civ-spouse   :14692    Exec-managerial:3992  
##  Mean   :10.07    Married-spouse-absent:  397    Adm-clerical   :3721  
##  3rd Qu.:12.00    Never-married        :10488    Sales          :3584  
##  Max.   :16.00    Separated            : 1005    Other-service  :3212  
##                   Widowed              :  979   (Other)         :9401  

##           relationship                    race            sex       
##   Husband       :12947    Amer-Indian-Eskimo:  311    Female:10608  
##   Not-in-family : 8156    Asian-Pac-Islander:  956    Male  :21370  
##   Other-relative:  952    Black             : 3028                  
##   Own-child     : 5005    Other             :  253                  
##   Unmarried     : 3384    White             :27430                  
##   Wife          : 1534                                              

##   capitalgain     capitalloss       hoursperweek          nativecountry  
##  Min.   :    0   Min.   :   0.00   Min.   : 1.00    United-States:29170  
##  1st Qu.:    0   1st Qu.:   0.00   1st Qu.:40.00    Mexico       :  643  
##  Median :    0   Median :   0.00   Median :40.00    Philippines  :  198  
##  Mean   : 1064   Mean   :  86.74   Mean   :40.42    Germany      :  137  
##  3rd Qu.:    0   3rd Qu.:   0.00   3rd Qu.:45.00    Canada       :  121  
##  Max.   :99999   Max.   :4356.00   Max.   :99.00    Puerto-Rico  :  114  
##                                                    (Other)       : 1595  

##   fiftyKPlus   
##   <=50K:24283  
##   >50K : 7695  


View the first and last few rows of the dataframe.


head(census)
##   age         workclass  education educationnum       maritalstatus
## 1  39         State-gov  Bachelors           13       Never-married
## 2  50  Self-emp-not-inc  Bachelors           13  Married-civ-spouse
## 3  38           Private    HS-grad            9            Divorced
## 4  53           Private       11th            7  Married-civ-spouse
## 5  28           Private  Bachelors           13  Married-civ-spouse
## 6  37           Private    Masters           14  Married-civ-spouse

##           occupation   relationship   race     sex capitalgain capitalloss
## 1       Adm-clerical  Not-in-family  White    Male        2174           0
## 2    Exec-managerial        Husband  White    Male           0           0
## 3  Handlers-cleaners  Not-in-family  White    Male           0           0
## 4  Handlers-cleaners        Husband  Black    Male           0           0
## 5     Prof-specialty           Wife  Black  Female           0           0
## 6    Exec-managerial           Wife  White  Female           0           0

##   hoursperweek  nativecountry fiftyKPlus
## 1           40  United-States      <=50K
## 2           13  United-States      <=50K
## 3           40  United-States      <=50K
## 4           40  United-States      <=50K
## 5           40           Cuba      <=50K
## 6           40  United-States      <=50K
tail(census)
##       age     workclass     education educationnum       maritalstatus
## 31973  22       Private  Some-college           10       Never-married
## 31974  27       Private    Assoc-acdm           12  Married-civ-spouse
## 31975  40       Private       HS-grad            9  Married-civ-spouse
## 31976  58       Private       HS-grad            9             Widowed
## 31977  22       Private       HS-grad            9       Never-married
## 31978  52  Self-emp-inc       HS-grad            9  Married-civ-spouse

##               occupation   relationship   race     sex capitalgain
## 31973    Protective-serv  Not-in-family  White    Male           0
## 31974       Tech-support           Wife  White  Female           0
## 31975  Machine-op-inspct        Husband  White    Male           0
## 31976       Adm-clerical      Unmarried  White  Female           0
## 31977       Adm-clerical      Own-child  White    Male           0
## 31978    Exec-managerial           Wife  White  Female       15024

##       capitalloss hoursperweek  nativecountry fiftyKPlus
## 31973           0           40  United-States      <=50K
## 31974           0           38  United-States      <=50K
## 31975           0           40  United-States       >50K
## 31976           0           40  United-States      <=50K
## 31977           0           20  United-States      <=50K
## 31978           0           40  United-States       >50K

Split the data into training and testing datasets. Use 60% of the dataset for training the model and the remaining 30% for testing the model.


set.seed(5678)

split <- sample.split(census$fiftyKPlus, SplitRatio=0.6)

train <- subset(census, split==TRUE)
test <- subset(census, split==FALSE)


CLASSIFICATION AND REGRESSION TREE (CART) MODEL


Use the training dataset to build a CART model.


censusTree <- rpart(fiftyKPlus ~age + workclass + education + maritalstatus + 
                      occupation + relationship + race + sex + capitalgain +
                      capitalloss + hoursperweek + nativecountry,
                    data = train,
                    method ='class')

Plot the censusTree.


prp(censusTree, type=1, compress=FALSE, box.col=c('yellow', 'pink')[censusTree$frame$yval])


Score the model using the test dataset, ie. use the CART model to predict the label using test dataset.


predictTest.CensusTree <- predict(censusTree, newdata=test, type='class')

Build a Confusion Matrix and compute the accuracy of the CART model on test dataset.


confusionMatrix.CensusTree <- table(test$fiftyKPlus, predictTest.CensusTree)

accuracy.CensusTree <- (confusionMatrix.CensusTree[1,1] + confusionMatrix.CensusTree[2,2])/sum(confusionMatrix.CensusTree)

sprintf('The accurary of the CART model is: %f', accuracy.CensusTree)
The accurary of the CART model is: 0.849582

Performance of the CART model: Compute the area under the ROC curve.


predictTest.CensusTree2 <- predict(censusTree, newdata=test)

predictTest.CensusTree3 <- predictTest.CensusTree2[,2]

censusTree.ROCR <- prediction(predictTest.CensusTree3, test$fiftyKPlus)

perf.censusTree.ROCR <- performance(censusTree.ROCR, 'tpr','fpr')

Plot Receiver Operating Characteristic (ROC) curve.


plot(perf.censusTree.ROCR, main='ROC Curve for CART Model',
     colorize=TRUE,
     print.cutoffs.at=seq(0,1,0.1),text.adj=c(-0.1,1.2))


Find the area under the ROC curve.


auc.censusTree <- as.numeric(performance(censusTree.ROCR, 'auc')@y.values)

sprintf('The area under the ROC curve for the CART Model is: %f',auc.censusTree)
The area under the ROC curve for the CART Model is: 0.846746



CART MODEL WITH CROSS VALIDATION


Create a 10-fold cross validation model.


fitControl <- trainControl(method='cv', number=10)
cartGrid = expand.grid(.cp=seq(0.0001,0.01,0.0002))

Train to find the optimum complexity parameter (cp).


cpTable=train(fiftyKPlus ~age + workclass + education + maritalstatus + occupation +
              relationship + race + sex + capitalgain + capitalloss + hoursperweek +
              nativecountry,
      data = train,
      method = 'rpart',
      trControl = fitControl,
      tuneGrid = cartGrid)

print(cpTable)
## CART 
## 
## 19187 samples
##    13 predictor
##     2 classes: ' <=50K', ' >50K' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 17268, 17268, 17269, 17268, 17268, 17268, ... 
## Resampling results across tuning parameters:
## 
##   cp      Accuracy   Kappa      Accuracy SD  Kappa SD  
##   0.0001  0.8451027  0.5521496  0.007447661  0.02237656
##   0.0003  0.8518260  0.5659783  0.005656349  0.02076003
##   0.0005  0.8527128  0.5674594  0.004759356  0.01974055
##   0.0007  0.8534945  0.5651194  0.004058885  0.01821072
##   0.0009  0.8548493  0.5705408  0.005293786  0.02616873
##   0.0011  0.8554226  0.5706891  0.005275069  0.02398613
##   0.0013  0.8557352  0.5649620  0.005583107  0.02105032
##   0.0015  0.8556310  0.5656655  0.005484671  0.01862422
##   0.0017  0.8554746  0.5654269  0.005869369  0.02030439
##   0.0019  0.8554224  0.5649776  0.006222305  0.02070911
##   0.0021  0.8547969  0.5636514  0.005754433  0.01971248
##   0.0023  0.8539110  0.5617916  0.005181031  0.01855891
##   0.0025  0.8537025  0.5616562  0.004974453  0.01823042
##   0.0027  0.8537025  0.5616562  0.004974453  0.01823042
##   0.0029  0.8535461  0.5615035  0.005229620  0.01891518
##   0.0031  0.8535461  0.5615035  0.005229620  0.01891518
##   0.0033  0.8535461  0.5615035  0.005229620  0.01891518
##   0.0035  0.8532855  0.5611465  0.005155199  0.01945739
##   0.0037  0.8529728  0.5608301  0.004935175  0.01917629
##   0.0039  0.8528686  0.5607988  0.004919927  0.01916543
##   0.0041  0.8514092  0.5587599  0.005941366  0.02106837
##   0.0043  0.8507318  0.5578820  0.005360193  0.02037283
##   0.0045  0.8505754  0.5586239  0.005215911  0.01912316
##   0.0047  0.8505754  0.5587028  0.005215911  0.01893425
##   0.0049  0.8503149  0.5578247  0.005042505  0.01856463
##   0.0051  0.8500022  0.5566787  0.005048704  0.01874286
##   0.0053  0.8491684  0.5515700  0.005634282  0.02394063
##   0.0055  0.8491684  0.5515700  0.005634282  0.02394063
##   0.0057  0.8486994  0.5519440  0.005830518  0.02378685
##   0.0059  0.8479699  0.5484015  0.005178705  0.02235647
##   0.0061  0.8479699  0.5484015  0.005178705  0.02235647
##   0.0063  0.8479699  0.5484015  0.005178705  0.02235647
##   0.0065  0.8479699  0.5484015  0.005178705  0.02235647
##   0.0067  0.8479699  0.5484015  0.005178705  0.02235647
##   0.0069  0.8479699  0.5484015  0.005178705  0.02235647
##   0.0071  0.8475007  0.5474264  0.006070675  0.02397956
##   0.0073  0.8471357  0.5466614  0.005624372  0.02315024
##   0.0075  0.8467188  0.5460389  0.005518118  0.02216232
##   0.0077  0.8467188  0.5460389  0.005518118  0.02216232
##   0.0079  0.8460932  0.5419583  0.007144320  0.03149372
##   0.0081  0.8458847  0.5414910  0.006929118  0.03091918
##   0.0083  0.8458847  0.5414910  0.006929118  0.03091918
##   0.0085  0.8457805  0.5414673  0.006913012  0.03090712
##   0.0087  0.8457805  0.5414673  0.006913012  0.03090712
##   0.0089  0.8457805  0.5414673  0.006913012  0.03090712
##   0.0091  0.8457805  0.5417572  0.006913012  0.03097151
##   0.0093  0.8457805  0.5417572  0.006913012  0.03097151
##   0.0095  0.8457805  0.5417572  0.006913012  0.03097151
##   0.0097  0.8457805  0.5417572  0.006913012  0.03097151
##   0.0099  0.8453636  0.5409112  0.007141527  0.03083063
 
Accuracy was used to select the optimal model using  the largest value.

The final value used for the model was cp = 0.0013.

Use the optimum cp value to build the CART model with Cross Validation.


censusTreeCV <- rpart(fiftyKPlus ~age + workclass + education + maritalstatus + 
                        occupation + relationship + race + sex + capitalgain + 
                        capitalloss + hoursperweek + nativecountry,
                   data = train,
                   method ='class',
                   control = rpart.control(cp=cpTable$results$cp[which.max(cpTable$results[,2])]))

Plot the censusTree.


prp(censusTreeCV, type=1, compress=FALSE, box.col=c('yellow', 'pink')[censusTree$frame$yval])


Build a Confusion Matrix and compute the accuracy of the CART model with Cross Validation on test dataset.

predict.CensusTreeCV <- predict(censusTreeCV, newdata=test, type='class')
confusionMatrix <- table(test$fiftyKPlus, predict.CensusTreeCV)

accuracy.CensusTreeCV <- (confusionMatrix[1,1] + confusionMatrix[2,2])/sum(confusionMatrix)

sprintf('The accurary of the CART model with Cross Validation is: %f', accuracy.CensusTreeCV)
The accurary of the CART model with Cross Validation is: 0.862403

Performance of the CART model with Cross Validation: Compute the area under the ROC curve.


predict.CensusTreeCV2 <- predict(censusTreeCV, newdata=test)
predict.CensusTreeCV3 <- predict.CensusTreeCV2[,2]

predict.CensusTreeCV.ROC <- prediction(predict.CensusTreeCV3, test$fiftyKPlus)

perf.CensusTreeCV.ROC <- performance(predict.CensusTreeCV.ROC, 'tpr','fpr')

plot(perf.CensusTreeCV.ROC, main='ROC Curve for CART Model with Cross Validation',
     colorize=TRUE,
     print.cutoffs.at=seq(0,1,0.1),text.adj=c(-0.1,1.2))

Plot Receiver Operaring Characteristic (ROC) curve.


Find the area under the ROC curve.

auc.CensusTreeCV <- as.numeric(performance(predict.CensusTreeCV.ROC, 'auc')@y.values)

sprintf('The area under the ROC curve for the CART Model with Cross Validation is: %f',auc.CensusTreeCV)
The area under the ROC curve for the CART Model with Cross Validation is: 0.871933



RANDOM FOREST MODEL



set.seed(6789)

Use the training dataset to build a Random Forest model.


censusRF <- randomForest(fiftyKPlus ~age + workclass + education + maritalstatus +
                        occupation + relationship + race + sex + capitalgain +
                        capitalloss + hoursperweek + nativecountry,
                        data=train)

Use the Random Forest model to predict the outcome of the test dataset.


predict.censusRF <- predict(censusRF, newdata=test)

Build Confusion Matrix and compute the accuracy of the Random Forest model on the test dataset.


confusionMatrix <- table(test$fiftyKPlus, predict.censusRF)

accuracy.censusRF <- (confusionMatrix[1,1] + confusionMatrix[2,2])/sum(confusionMatrix)

sprintf('The accurary of the RANDOM FOREST model is: %f', accuracy.censusRF)
The accurary of the RANDOM FOREST model is: 0.825580



LOGISTIC REGRESSION MODEL


Build a Logistic Regression model using the training dataset.


censusGLM <- glm(fiftyKPlus ~age + workclass + education + maritalstatus + 
                    occupation + relationship + race + sex + capitalgain + 
                    capitalloss + hoursperweek + nativecountry,
                data=train, 
                family = 'binomial')
summary(censusGLM)

## Call:
## glm(formula = fiftyKPlus ~ age + workclass + education + maritalstatus + 
##     occupation + relationship + race + sex + capitalgain + capitalloss + 
##     hoursperweek + nativecountry, family = "binomial", data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.3156  -0.4984  -0.1806  -0.0007   3.6020  
## 
## Coefficients: (1 not defined because of singularities)
##                                            Estimate Std. Error z value
## (Intercept)                              -7.429e+00  9.951e-01  -7.466
## age                                       2.467e-02  2.136e-03  11.550
## workclass Federal-gov                     1.232e+00  1.985e-01   6.205
## workclass Local-gov                       5.380e-01  1.798e-01   2.993
## workclass Never-worked                   -1.307e+01  9.850e+02  -0.013
## workclass Private                         7.284e-01  1.589e-01   4.585
## workclass Self-emp-inc                    9.060e-01  1.916e-01   4.728
## workclass Self-emp-not-inc                2.890e-01  1.760e-01   1.642
## workclass State-gov                       2.776e-01  1.957e-01   1.419
## workclass Without-pay                    -1.373e+01  5.976e+02  -0.023
## education 11th                           -1.379e-01  2.785e-01  -0.495
## education 12th                           -1.812e-01  4.071e-01  -0.445
## education 1st-4th                        -8.493e-01  7.243e-01  -1.173
## education 5th-6th                        -5.030e-01  4.564e-01  -1.102
## education 7th-8th                        -4.900e-01  3.069e-01  -1.596
## education 9th                            -4.253e-01  3.506e-01  -1.213
## education Assoc-acdm                      1.237e+00  2.307e-01   5.364
## education Assoc-voc                       1.427e+00  2.207e-01   6.465
## education Bachelors                       1.965e+00  2.057e-01   9.551
## education Doctorate                       2.905e+00  2.797e-01  10.387
## education HS-grad                         8.289e-01  2.002e-01   4.140
## education Masters                         2.329e+00  2.196e-01  10.606
## education Preschool                      -1.281e+01  3.742e+02  -0.034
## education Prof-school                     2.855e+00  2.637e-01  10.827
## education Some-college                    1.224e+00  2.029e-01   6.032
## maritalstatus Married-AF-spouse           2.395e+00  7.854e-01   3.049
## maritalstatus Married-civ-spouse          1.603e+00  3.824e-01   4.193
## maritalstatus Married-spouse-absent       3.472e-01  2.807e-01   1.237
## maritalstatus Never-married              -5.178e-01  1.138e-01  -4.550
## maritalstatus Separated                   5.013e-02  2.079e-01   0.241
## maritalstatus Widowed                     9.353e-02  2.021e-01   0.463
## occupation Adm-clerical                  -7.306e-02  1.283e-01  -0.569
## occupation Armed-Forces                  -1.393e+01  9.937e+02  -0.014
## occupation Craft-repair                   4.995e-02  1.092e-01   0.457
## occupation Exec-managerial                7.267e-01  1.133e-01   6.414
## occupation Farming-fishing               -1.214e+00  1.851e-01  -6.558
## occupation Handlers-cleaners             -7.398e-01  1.879e-01  -3.938
## occupation Machine-op-inspct             -3.752e-01  1.393e-01  -2.693
## occupation Other-service                 -9.234e-01  1.655e-01  -5.579
## occupation Priv-house-serv               -1.390e+01  2.166e+02  -0.064
## occupation Prof-specialty                 3.865e-01  1.213e-01   3.186
## occupation Protective-serv                6.512e-01  1.718e-01   3.790
## occupation Sales                          1.892e-01  1.170e-01   1.617
## occupation Tech-support                   5.251e-01  1.532e-01   3.428
## occupation Transport-moving                      NA         NA      NA
## relationship Not-in-family               -3.240e-02  3.786e-01  -0.086
## relationship Other-relative              -8.494e-01  3.448e-01  -2.463
## relationship Own-child                   -1.321e+00  3.894e-01  -3.391
## relationship Unmarried                   -2.496e-01  3.981e-01  -0.627
## relationship Wife                         1.340e+00  1.340e-01   9.995
## race Asian-Pac-Islander                   1.057e+00  3.695e-01   2.861
## race Black                                6.486e-01  3.191e-01   2.033
## race Other                                5.881e-01  4.800e-01   1.225
## race White                                7.755e-01  3.050e-01   2.542
## sex Male                                  8.798e-01  1.038e-01   8.479
## capitalgain                               3.165e-04  1.349e-05  23.467
## capitalloss                               6.440e-04  4.820e-05  13.359
## hoursperweek                              3.092e-02  2.116e-03  14.609
## nativecountry Canada                     -8.524e-01  8.920e-01  -0.956
## nativecountry China                      -2.168e+00  9.122e-01  -2.377
## nativecountry Columbia                   -3.213e+00  1.390e+00  -2.311
## nativecountry Cuba                       -7.595e-01  9.132e-01  -0.832
## nativecountry Dominican-Republic         -1.438e+01  3.437e+02  -0.042
## nativecountry Ecuador                    -1.454e+00  1.202e+00  -1.210
## nativecountry El-Salvador                -1.185e+00  1.008e+00  -1.176
## nativecountry England                    -5.668e-01  9.114e-01  -0.622
## nativecountry France                     -4.422e-01  1.106e+00  -0.400
## nativecountry Germany                    -6.506e-01  8.793e-01  -0.740
## nativecountry Greece                     -1.763e+00  1.093e+00  -1.613
## nativecountry Guatemala                  -1.428e+01  3.288e+02  -0.043
## nativecountry Haiti                      -1.314e+00  1.273e+00  -1.032
## nativecountry Holand-Netherlands         -1.357e+01  2.400e+03  -0.006
## nativecountry Honduras                   -2.018e+00  2.737e+00  -0.737
## nativecountry Hong                       -2.114e+00  1.246e+00  -1.696
## nativecountry Hungary                    -7.671e-01  1.297e+00  -0.592
## nativecountry India                      -1.460e+00  8.671e-01  -1.684
## nativecountry Iran                       -3.787e-01  1.029e+00  -0.368
## nativecountry Ireland                    -2.845e-01  1.042e+00  -0.273
## nativecountry Italy                      -2.959e-02  9.068e-01  -0.033
## nativecountry Jamaica                    -2.903e-01  9.633e-01  -0.301
## nativecountry Japan                      -8.212e-01  9.677e-01  -0.849
## nativecountry Laos                       -1.583e+01  7.962e+02  -0.020
## nativecountry Mexico                     -1.237e+00  8.610e-01  -1.437
## nativecountry Nicaragua                  -1.872e+00  1.394e+00  -1.343
## nativecountry Outlying-US(Guam-USVI-etc) -1.525e+01  7.940e+02  -0.019
## nativecountry Peru                       -1.903e+00  1.501e+00  -1.267
## nativecountry Philippines                -7.124e-01  8.401e-01  -0.848
## nativecountry Poland                     -8.865e-01  9.495e-01  -0.934
## nativecountry Portugal                   -1.682e+00  1.354e+00  -1.243
## nativecountry Puerto-Rico                -1.373e+00  9.798e-01  -1.401
## nativecountry Scotland                   -6.633e-01  1.176e+00  -0.564
## nativecountry South                      -1.900e+00  9.244e-01  -2.055
## nativecountry Taiwan                     -7.558e-01  9.616e-01  -0.786
## nativecountry Thailand                   -2.125e+00  1.231e+00  -1.726
## nativecountry Trinadad&Tobago            -8.819e-01  1.226e+00  -0.719
## nativecountry United-States              -7.282e-01  8.179e-01  -0.890
## nativecountry Vietnam                    -1.221e+00  9.880e-01  -1.236
## nativecountry Yugoslavia                  3.934e-01  1.246e+00   0.316
##                                          Pr(>|z|)    
## (Intercept)                              8.28e-14 ***
## age                                       < 2e-16 ***
## workclass Federal-gov                    5.48e-10 ***
## workclass Local-gov                      0.002767 ** 
## workclass Never-worked                   0.989415    
## workclass Private                        4.54e-06 ***
## workclass Self-emp-inc                   2.26e-06 ***
## workclass Self-emp-not-inc               0.100665    
## workclass State-gov                      0.155982    
## workclass Without-pay                    0.981669    
## education 11th                           0.620585    
## education 12th                           0.656309    
## education 1st-4th                        0.240990    
## education 5th-6th                        0.270411    
## education 7th-8th                        0.110416    
## education 9th                            0.225078    
## education Assoc-acdm                     8.13e-08 ***
## education Assoc-voc                      1.01e-10 ***
## education Bachelors                       < 2e-16 ***
## education Doctorate                       < 2e-16 ***
## education HS-grad                        3.47e-05 ***
## education Masters                         < 2e-16 ***
## education Preschool                      0.972702    
## education Prof-school                     < 2e-16 ***
## education Some-college                   1.62e-09 ***
## maritalstatus Married-AF-spouse          0.002295 ** 
## maritalstatus Married-civ-spouse         2.75e-05 ***
## maritalstatus Married-spouse-absent      0.216096    
## maritalstatus Never-married              5.36e-06 ***
## maritalstatus Separated                  0.809444    
## maritalstatus Widowed                    0.643481    
## occupation Adm-clerical                  0.569102    
## occupation Armed-Forces                  0.988812    
## occupation Craft-repair                  0.647523    
## occupation Exec-managerial               1.42e-10 ***
## occupation Farming-fishing               5.46e-11 ***
## occupation Handlers-cleaners             8.22e-05 ***
## occupation Machine-op-inspct             0.007089 ** 
## occupation Other-service                 2.42e-08 ***
## occupation Priv-house-serv               0.948837    
## occupation Prof-specialty                0.001442 ** 
## occupation Protective-serv               0.000150 ***
## occupation Sales                         0.105965    
## occupation Tech-support                  0.000608 ***
## occupation Transport-moving                    NA    
## relationship Not-in-family               0.931792    
## relationship Other-relative              0.013774 *  
## relationship Own-child                   0.000696 ***
## relationship Unmarried                   0.530705    
## relationship Wife                         < 2e-16 ***
## race Asian-Pac-Islander                  0.004226 ** 
## race Black                               0.042071 *  
## race Other                               0.220512    
## race White                               0.011010 *  
## sex Male                                  < 2e-16 ***
## capitalgain                               < 2e-16 ***
## capitalloss                               < 2e-16 ***
## hoursperweek                              < 2e-16 ***
## nativecountry Canada                     0.339268    
## nativecountry China                      0.017471 *  
## nativecountry Columbia                   0.020846 *  
## nativecountry Cuba                       0.405575    
## nativecountry Dominican-Republic         0.966629    
## nativecountry Ecuador                    0.226257    
## nativecountry El-Salvador                0.239739    
## nativecountry England                    0.533998    
## nativecountry France                     0.689224    
## nativecountry Germany                    0.459358    
## nativecountry Greece                     0.106651    
## nativecountry Guatemala                  0.965373    
## nativecountry Haiti                      0.301939    
## nativecountry Holand-Netherlands         0.995488    
## nativecountry Honduras                   0.460864    
## nativecountry Hong                       0.089910 .  
## nativecountry Hungary                    0.554123    
## nativecountry India                      0.092160 .  
## nativecountry Iran                       0.712740    
## nativecountry Ireland                    0.784864    
## nativecountry Italy                      0.973967    
## nativecountry Jamaica                    0.763106    
## nativecountry Japan                      0.396072    
## nativecountry Laos                       0.984139    
## nativecountry Mexico                     0.150722    
## nativecountry Nicaragua                  0.179252    
## nativecountry Outlying-US(Guam-USVI-etc) 0.984673    
## nativecountry Peru                       0.205029    
## nativecountry Philippines                0.396441    
## nativecountry Poland                     0.350469    
## nativecountry Portugal                   0.213951    
## nativecountry Puerto-Rico                0.161076    
## nativecountry Scotland                   0.572856    
## nativecountry South                      0.039884 *  
## nativecountry Taiwan                     0.431905    
## nativecountry Thailand                   0.084266 .  
## nativecountry Trinadad&Tobago            0.471926    
## nativecountry United-States              0.373292    
## nativecountry Vietnam                    0.216499    
## nativecountry Yugoslavia                 0.752289    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 21175  on 19186  degrees of freedom
## Residual deviance: 12121  on 19090  degrees of freedom
## AIC: 12315
## 
## Number of Fisher Scoring iterations: 15

Use the logistic regression model to predict the outcome of the test dataset.


predict.CensusGLM <- predict(censusGLM, newdata=test, type='response')

Build Confusion Matrix and compute the accuracy of the logistic regression model on test dataset.


confusionMatrix <- table(test$fiftyKPlus, predict.CensusGLM>=0.5)

accuracy.CensusGLM <- (confusionMatrix[1,1] + confusionMatrix[2,2])/sum(confusionMatrix)
sprintf('The accuracy of the Logistic Regression model is: %f', accuracy.CensusGLM)
The accuracy of the Logistic Regression model is: 0.850989

Compute the baseline accuracy of the model.


baselineTrainMatrix <- table(train$fiftyKPlus)
baselineTestMatrix <- table(test$fiftyKPlus)

if(baselineTrainMatrix[1] > baselineTrainMatrix[2])
{
  baselineAccuracy.CensusGLM <- baselineTestMatrix[1]/sum(baselineTestMatrix)
}else
{
  baselineAccuracy.CensusGLM <- baselineTrainMatrix[1]/sum(baselineTestMatrix)
}

sprintf('The baseline accuracy of the Logistic Regression Model is: %f',baselineAccuracy.CensusGLM)
The baseline accuracy of the Logistic Regression Model is: 0.759362

Performance of the Logistic Regression model: Plot the Receiver Operating Characteristic curve and compute the area under the ROC curve


predict.CensusGLM.ROC <- prediction(predict.CensusGLM, test$fiftyKPlus)

perf.CensusGLM.ROC <- performance(predict.CensusGLM.ROC, 'tpr','fpr')

plot(perf.CensusGLM.ROC, main='ROC Curve for Logistic Regression Model',
     colorize=TRUE,
     print.cutoffs.at=seq(0,1,0.1),text.adj=c(-0.1,1.2))

auc.CensusGLM <- as.numeric(performance(predict.CensusGLM.ROC, 'auc')@y.values)
     
sprintf('The area under the ROC curve for the Logistic Regression Model is: %f',auc.CensusGLM)
The area under the ROC curve for the Logistic Regression Model is: 0.905733