Sachin Shrestha

R Code for Census Project

Problem: Use an individual's information to predict whether or not the person earns more than $50,000 per year.

Data: Census data for earnings, 2010
Source: UCI Machine Learning Repository

Install and load required packages.

Package caTools is required for splitting the data into training and testing datasets.


if(!require(caTools)){
  install.packages('caTools')
}
library(caTools)

CART (Classification and regression tree) models require rpart and rpart.plot packages.


if(!require(rpart)){
  install.packages('rpart')
}
library(rpart)


if(!require(rpart.plot)){
  install.packages('rpart.plot')
}
library(rpart.plot)

ROCR package is required to evaluate the performance of models and to plot the Receiver Operating Characteristic (ROC) curves.


if(!require(ROCR)){
  install.packages('ROCR')
}
library(ROCR)

Packages caret and e1071 are required for cross-validation.


if(!require(caret)){
  install.packages('caret')
}
library(caret)


if(!require(e1071)){
  install.packages('e1071')
}
library(e1071)

Random Forest models require the randomForest package.


if(!require(randomForest)){
  install.packages('randomForest')
}
library(randomForest)
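
The install-then-load pattern repeated above could be folded into one small helper. This is only a sketch: `ensure_package` is a hypothetical name, not a function from any package, and the demonstration call uses the base package stats so that nothing is downloaded.

```r
# Hypothetical helper: install a package only if it is missing, then attach it.
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  library(pkg, character.only = TRUE)
  invisible(TRUE)
}

# Demonstration with a package that ships with R, so no download is triggered.
ensure_package("stats")

# For this project the call would look like this (not run here):
# for (p in c("caTools", "rpart", "rpart.plot", "ROCR",
#             "caret", "e1071", "randomForest")) {
#   ensure_package(p)
# }
```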

Set the working directory and read data into dataframe.


setwd('C:/Users/Sachin/Desktop/MyRData')
census <- read.csv('census2010.csv')

View the names of variables in the dataframe and their types.


str(census)

## 'data.frame':    31978 obs. of  14 variables:
##  $ age          : int  39 50 38 53 28 37 49 52 31 42 ...
##  $ workclass    : Factor w/ 9 levels " ?"," Federal-gov",..: 8 7 5 5 5 5 5 7 5 5 ...
##  $ education    : Factor w/ 16 levels " 10th"," 11th",..: 10 10 12 2 10 13 7 12 13 10 ...
##  $ educationnum : int  13 13 9 7 13 14 5 9 14 13 ...
##  $ maritalstatus: Factor w/ 7 levels " Divorced"," Married-AF-spouse",..: 5 3 1 3 3 3 ...
##  $ occupation   : Factor w/ 15 levels " ?"," Adm-clerical",..: 2 5 7 7 11 5 9 5 11 5 ...
##  $ relationship : Factor w/ 6 levels " Husband"," Not-in-family",..: 2 1 2 1 6 6 2 1 ...
##  $ race         : Factor w/ 5 levels " Amer-Indian-Eskimo",..: 5 5 5 3 3 5 3 5 5 5 ...
##  $ sex          : Factor w/ 2 levels " Female"," Male": 2 2 2 2 1 1 1 2 1 2 ...
##  $ capitalgain  : int  2174 0 0 0 0 0 0 0 14084 5178 ...
##  $ capitalloss  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ hoursperweek : int  40 13 40 40 40 40 16 45 50 40 ...
##  $ nativecountry: Factor w/ 41 levels " Cambodia"," Canada",..: 39 39 39 39 5 39 23 ...
##  $ fiftyKPlus   : Factor w/ 2 levels " <=50K"," >50K": 1 1 1 1 1 1 1 2 2 2 ...

The variable educationnum is a numerical representation of the variable education. Therefore, for the purpose of building models, I will use the variable education and will not use the variable educationnum.
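
One way to check that educationnum only re-encodes education is to confirm that each education level maps to exactly one educationnum value. The sketch below does this on a small hand-typed data frame (`toy`) built from the first six values of the two columns, not on the census file itself.

```r
# Toy data frame (typed by hand) mirroring the two columns of interest.
toy <- data.frame(
  education    = c("Bachelors", "Bachelors", "HS-grad", "11th", "Bachelors", "Masters"),
  educationnum = c(13, 13, 9, 7, 13, 14)
)

# Count the distinct educationnum values seen for each education level.
values_per_level <- tapply(toy$educationnum, toy$education,
                           function(x) length(unique(x)))

# TRUE when every level has exactly one numeric code, i.e. the columns are redundant.
is_redundant <- all(values_per_level == 1)
print(values_per_level)
print(is_redundant)
```

The same two statements applied to census$education and census$educationnum would test the full dataset.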

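Look at the summary of the dataframe; note missing values (recorded as the level " ?", for example 1809 rows in workclass) and any abnormal distributions (for example, the capitalgain maximum of 99999).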


summary(census)

##       age                    workclass             education    
##  Min.   :17.00    Private         :22286    HS-grad     :10368  
##  1st Qu.:28.00    Self-emp-not-inc: 2499    Some-college: 7187  
##  Median :37.00    Local-gov       : 2067    Bachelors   : 5210  
##  Mean   :38.58    ?               : 1809    Masters     : 1674  
##  3rd Qu.:48.00    State-gov       : 1279    Assoc-voc   : 1366  
##  Max.   :90.00    Self-emp-inc    : 1074    11th        : 1167  
##                  (Other)          :  964   (Other)      : 5006  

##   educationnum                  maritalstatus              occupation  
##  Min.   : 1.00    Divorced             : 4394    Prof-specialty :4038  
##  1st Qu.: 9.00    Married-AF-spouse    :   23    Craft-repair   :4030  
##  Median :10.00    Married-civ-spouse   :14692    Exec-managerial:3992  
##  Mean   :10.07    Married-spouse-absent:  397    Adm-clerical   :3721  
##  3rd Qu.:12.00    Never-married        :10488    Sales          :3584  
##  Max.   :16.00    Separated            : 1005    Other-service  :3212  
##                   Widowed              :  979   (Other)         :9401  

##           relationship                    race            sex       
##   Husband       :12947    Amer-Indian-Eskimo:  311    Female:10608  
##   Not-in-family : 8156    Asian-Pac-Islander:  956    Male  :21370  
##   Other-relative:  952    Black             : 3028                  
##   Own-child     : 5005    Other             :  253                  
##   Unmarried     : 3384    White             :27430                  
##   Wife          : 1534                                              

##   capitalgain     capitalloss       hoursperweek          nativecountry  
##  Min.   :    0   Min.   :   0.00   Min.   : 1.00    United-States:29170  
##  1st Qu.:    0   1st Qu.:   0.00   1st Qu.:40.00    Mexico       :  643  
##  Median :    0   Median :   0.00   Median :40.00    Philippines  :  198  
##  Mean   : 1064   Mean   :  86.74   Mean   :40.42    Germany      :  137  
##  3rd Qu.:    0   3rd Qu.:   0.00   3rd Qu.:45.00    Canada       :  121  
##  Max.   :99999   Max.   :4356.00   Max.   :99.00    Puerto-Rico  :  114  
##                                                    (Other)       : 1595  

##   fiftyKPlus   
##   <=50K:24283  
##   >50K : 7695
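
The summary suggests that missing values are stored as the text level " ?" rather than as NA, so is.na() would not find them. A minimal sketch for counting them per column follows. It uses a made-up data frame (`toy`) in place of census, and `count_unknown` is a hypothetical helper.

```r
# Toy data frame (made up) with the same " ?" placeholder and leading spaces.
toy <- data.frame(
  workclass  = c(" Private", " ?", " State-gov", " ?"),
  occupation = c(" Sales", " ?", " Adm-clerical", " Craft-repair"),
  age        = c(39, 50, 38, 53),
  stringsAsFactors = TRUE
)

# Count entries equal to "?" after trimming surrounding whitespace.
count_unknown <- function(column) {
  sum(trimws(as.character(column)) == "?")
}
unknown_counts <- sapply(toy, count_unknown)
print(unknown_counts)
```

Applied to the real data, `sapply(census, count_unknown)` would give the corresponding per-column counts.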

View the first and last few rows of the dataframe.


head(census)

##   age         workclass  education educationnum       maritalstatus
## 1  39         State-gov  Bachelors           13       Never-married
## 2  50  Self-emp-not-inc  Bachelors           13  Married-civ-spouse
## 3  38           Private    HS-grad            9            Divorced
## 4  53           Private       11th            7  Married-civ-spouse
## 5  28           Private  Bachelors           13  Married-civ-spouse
## 6  37           Private    Masters           14  Married-civ-spouse

##           occupation   relationship   race     sex capitalgain capitalloss
## 1       Adm-clerical  Not-in-family  White    Male        2174           0
## 2    Exec-managerial        Husband  White    Male           0           0
## 3  Handlers-cleaners  Not-in-family  White    Male           0           0
## 4  Handlers-cleaners        Husband  Black    Male           0           0
## 5     Prof-specialty           Wife  Black  Female           0           0
## 6    Exec-managerial           Wife  White  Female           0           0

##   hoursperweek  nativecountry fiftyKPlus
## 1           40  United-States      <=50K
## 2           13  United-States      <=50K
## 3           40  United-States      <=50K
## 4           40  United-States      <=50K
## 5           40           Cuba      <=50K
## 6           40  United-States      <=50K

tail(census)

##       age     workclass     education educationnum       maritalstatus
## 31973  22       Private  Some-college           10       Never-married
## 31974  27       Private    Assoc-acdm           12  Married-civ-spouse
## 31975  40       Private       HS-grad            9  Married-civ-spouse
## 31976  58       Private       HS-grad            9             Widowed
## 31977  22       Private       HS-grad            9       Never-married
## 31978  52  Self-emp-inc       HS-grad            9  Married-civ-spouse

##               occupation   relationship   race     sex capitalgain
## 31973    Protective-serv  Not-in-family  White    Male           0
## 31974       Tech-support           Wife  White  Female           0
## 31975  Machine-op-inspct        Husband  White    Male           0
## 31976       Adm-clerical      Unmarried  White  Female           0
## 31977       Adm-clerical      Own-child  White    Male           0
## 31978    Exec-managerial           Wife  White  Female       15024

##       capitalloss hoursperweek  nativecountry fiftyKPlus
## 31973           0           40  United-States      <=50K
## 31974           0           38  United-States      <=50K
## 31975           0           40  United-States       >50K
## 31976           0           40  United-States      <=50K
## 31977           0           20  United-States      <=50K
## 31978           0           40  United-States       >50K

Split the data into training and testing datasets. Use 60% of the dataset for training the model and the remaining 40% for testing the model.


set.seed(5678)

split <- sample.split(census$fiftyKPlus, SplitRatio=0.6)

train <- subset(census, split==TRUE)
test <- subset(census, split==FALSE)
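
sample.split is given the outcome column so that both subsets keep roughly the same share of each class. The base-R sketch below imitates that idea on made-up labels. `stratified_split` is a hypothetical helper written for illustration, not the caTools implementation.

```r
# Hypothetical helper: mark about `ratio` of the rows of each class as training rows.
stratified_split <- function(labels, ratio) {
  in_train <- logical(length(labels))
  for (level in as.character(unique(labels))) {
    idx <- which(labels == level)
    n_train <- round(length(idx) * ratio)
    in_train[idx[sample.int(length(idx), n_train)]] <- TRUE
  }
  in_train
}

set.seed(5678)
# Made-up labels with roughly the class balance reported by summary(census).
labels <- factor(rep(c("<=50K", ">50K"), times = c(760, 240)))
in_train <- stratified_split(labels, ratio = 0.6)

# Class shares in each subset; both should match the overall 0.76 / 0.24 balance.
train_share <- prop.table(table(labels[in_train]))
test_share  <- prop.table(table(labels[!in_train]))
print(train_share)
print(test_share)
```

On the real data, comparing `prop.table(table(train$fiftyKPlus))` with `prop.table(table(test$fiftyKPlus))` would be the equivalent check.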

CLASSIFICATION AND REGRESSION TREE (CART) MODEL

Use the training dataset to build a CART model.


censusTree <- rpart(fiftyKPlus ~age + workclass + education + maritalstatus + 
                      occupation + relationship + race + sex + capitalgain +
                      capitalloss + hoursperweek + nativecountry,
                    data = train,
                    method ='class')

Plot the censusTree.


prp(censusTree, type=1, compress=FALSE, box.col=c('yellow', 'pink')[censusTree$frame$yval])

Score the model using the test dataset, i.e., use the CART model to predict the labels of the test dataset.


predictTest.CensusTree <- predict(censusTree, newdata=test, type='class')

Build a Confusion Matrix and compute the accuracy of the CART model on test dataset.


confusionMatrix.CensusTree <- table(test$fiftyKPlus, predictTest.CensusTree)

accuracy.CensusTree <- (confusionMatrix.CensusTree[1,1] + confusionMatrix.CensusTree[2,2])/sum(confusionMatrix.CensusTree)

sprintf('The accuracy of the CART model is: %f', accuracy.CensusTree)

The accuracy of the CART model is: 0.849582
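
Accuracy is the share of the confusion matrix that lies on its diagonal. The sketch below works through that arithmetic, plus sensitivity and specificity, on made-up labels. The vectors `actual` and `predicted` are illustrative and are not model output.

```r
# Made-up actual and predicted labels for ten people.
class_levels <- c("<=50K", ">50K")
actual    <- factor(c("<=50K", "<=50K", "<=50K", "<=50K", "<=50K", "<=50K",
                      ">50K", ">50K", ">50K", ">50K"), levels = class_levels)
predicted <- factor(c("<=50K", "<=50K", "<=50K", "<=50K", "<=50K", ">50K",
                      ">50K", ">50K", "<=50K", "<=50K"), levels = class_levels)

# Rows are actual classes, columns are predicted classes.
cm <- table(actual, predicted)

accuracy    <- sum(diag(cm)) / sum(cm)   # correct predictions / all predictions
sensitivity <- cm[2, 2] / sum(cm[2, ])   # share of actual >50K predicted as >50K
specificity <- cm[1, 1] / sum(cm[1, ])   # share of actual <=50K predicted as <=50K
print(cm)
print(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity))
```

Because the classes are unbalanced, accuracy alone can hide a low sensitivity for the smaller >50K class, which is one reason to look at the ROC curve as well.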

Performance of the CART model: Compute the area under the ROC curve.


predictTest.CensusTree2 <- predict(censusTree, newdata=test)

predictTest.CensusTree3 <- predictTest.CensusTree2[,2]

censusTree.ROCR <- prediction(predictTest.CensusTree3, test$fiftyKPlus)

perf.censusTree.ROCR <- performance(censusTree.ROCR, 'tpr','fpr')

Plot Receiver Operating Characteristic (ROC) curve.


plot(perf.censusTree.ROCR, main='ROC Curve for CART Model',
     colorize=TRUE,
     print.cutoffs.at=seq(0,1,0.1),text.adj=c(-0.1,1.2))
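
Each point on the ROC curve is the pair (false positive rate, true positive rate) obtained at one probability cutoff. A minimal base-R sketch of that calculation on made-up scores follows. `scores`, `is_positive` and `roc_point` are illustrative names, and ROCR performs the analogous calculation across all cutoffs.

```r
# Made-up predicted probabilities and true classes (TRUE = earns >50K).
scores      <- c(0.9, 0.8, 0.6, 0.4, 0.3, 0.1)
is_positive <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)

# True and false positive rates when predicting "positive" for scores >= cutoff.
roc_point <- function(cutoff) {
  predicted_positive <- scores >= cutoff
  c(cutoff = cutoff,
    fpr = sum(predicted_positive & !is_positive) / sum(!is_positive),
    tpr = sum(predicted_positive & is_positive) / sum(is_positive))
}

# One row per cutoff; lowering the cutoff never decreases either rate.
roc_points <- t(sapply(c(0.75, 0.5, 0.25), roc_point))
print(roc_points)
```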

Find the area under the ROC curve.


auc.censusTree <- as.numeric(performance(censusTree.ROCR, 'auc')@y.values)

sprintf('The area under the ROC curve for the CART Model is: %f',auc.censusTree)

The area under the ROC curve for the CART Model is: 0.846746
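
The AUC can be read as the probability that a randomly chosen person earning more than $50K receives a higher score than a randomly chosen person earning less. The sketch below computes it from ranks on made-up data. `auc_by_rank` is a hypothetical helper, shown only to illustrate the quantity that ROCR reports.

```r
# Hypothetical helper: AUC from the rank-sum (Mann-Whitney) formulation.
auc_by_rank <- function(scores, is_positive) {
  n_pos <- sum(is_positive)
  n_neg <- sum(!is_positive)
  ranks <- rank(scores)  # average ranks, so tied scores count as half
  (sum(ranks[is_positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Made-up scores: three of the four positive/negative pairs are ordered correctly.
scores      <- c(0.10, 0.40, 0.35, 0.80)
is_positive <- c(FALSE, FALSE, TRUE, TRUE)
auc <- auc_by_rank(scores, is_positive)
print(auc)  # 0.75
```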

CART MODEL WITH CROSS VALIDATION

Create a 10-fold cross validation model.


fitControl <- trainControl(method='cv', number=10)
cartGrid = expand.grid(.cp=seq(0.0001,0.01,0.0002))
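
In 10-fold cross-validation the training rows are divided into ten groups, and each group is held out once while the model is fitted on the other nine. The sketch below shows the fold bookkeeping for the 19187 rows of the training dataset. It uses plain random assignment, so it is an assumption-laden illustration and caret's own fold construction may differ in detail.

```r
set.seed(1)
n_rows  <- 19187   # size of the training dataset
n_folds <- 10

# Assign every row to one of ten folds of near-equal size, in random order.
fold_id <- sample(rep(seq_len(n_folds), length.out = n_rows))

# Rows held out in each fold, and rows left for fitting.
held_out_sizes <- as.vector(table(fold_id))
fit_sizes      <- n_rows - held_out_sizes
print(held_out_sizes)
print(fit_sizes)
```

The fitting sizes come out as 17268 or 17269, the same two values that appear in the "Summary of sample sizes" line of the caret output.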

Train to find the optimum complexity parameter (cp).


cpTable=train(fiftyKPlus ~age + workclass + education + maritalstatus + occupation +
              relationship + race + sex + capitalgain + capitalloss + hoursperweek +
              nativecountry,
      data = train,
      method = 'rpart',
      trControl = fitControl,
      tuneGrid = cartGrid)

print(cpTable)

## CART 
## 
## 19187 samples
##    13 predictor
##     2 classes: ' <=50K', ' >50K' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 17268, 17268, 17269, 17268, 17268, 17268, ... 
## Resampling results across tuning parameters:
## 
##   cp      Accuracy   Kappa      Accuracy SD  Kappa SD  
##   0.0001  0.8451027  0.5521496  0.007447661  0.02237656
##   0.0003  0.8518260  0.5659783  0.005656349  0.02076003
##   0.0005  0.8527128  0.5674594  0.004759356  0.01974055
##   0.0007  0.8534945  0.5651194  0.004058885  0.01821072
##   0.0009  0.8548493  0.5705408  0.005293786  0.02616873
##   0.0011  0.8554226  0.5706891  0.005275069  0.02398613
##   0.0013  0.8557352  0.5649620  0.005583107  0.02105032
##   0.0015  0.8556310  0.5656655  0.005484671  0.01862422
##   0.0017  0.8554746  0.5654269  0.005869369  0.02030439
##   0.0019  0.8554224  0.5649776  0.006222305  0.02070911
##   0.0021  0.8547969  0.5636514  0.005754433  0.01971248
##   0.0023  0.8539110  0.5617916  0.005181031  0.01855891
##   0.0025  0.8537025  0.5616562  0.004974453  0.01823042
##   0.0027  0.8537025  0.5616562  0.004974453  0.01823042
##   0.0029  0.8535461  0.5615035  0.005229620  0.01891518
##   0.0031  0.8535461  0.5615035  0.005229620  0.01891518
##   0.0033  0.8535461  0.5615035  0.005229620  0.01891518
##   0.0035  0.8532855  0.5611465  0.005155199  0.01945739
##   0.0037  0.8529728  0.5608301  0.004935175  0.01917629
##   0.0039  0.8528686  0.5607988  0.004919927  0.01916543
##   0.0041  0.8514092  0.5587599  0.005941366  0.02106837
##   0.0043  0.8507318  0.5578820  0.005360193  0.02037283
##   0.0045  0.8505754  0.5586239  0.005215911  0.01912316
##   0.0047  0.8505754  0.5587028  0.005215911  0.01893425
##   0.0049  0.8503149  0.5578247  0.005042505  0.01856463
##   0.0051  0.8500022  0.5566787  0.005048704  0.01874286
##   0.0053  0.8491684  0.5515700  0.005634282  0.02394063
##   0.0055  0.8491684  0.5515700  0.005634282  0.02394063
##   0.0057  0.8486994  0.5519440  0.005830518  0.02378685
##   0.0059  0.8479699  0.5484015  0.005178705  0.02235647
##   0.0061  0.8479699  0.5484015  0.005178705  0.02235647
##   0.0063  0.8479699  0.5484015  0.005178705  0.02235647
##   0.0065  0.8479699  0.5484015  0.005178705  0.02235647
##   0.0067  0.8479699  0.5484015  0.005178705  0.02235647
##   0.0069  0.8479699  0.5484015  0.005178705  0.02235647
##   0.0071  0.8475007  0.5474264  0.006070675  0.02397956
##   0.0073  0.8471357  0.5466614  0.005624372  0.02315024
##   0.0075  0.8467188  0.5460389  0.005518118  0.02216232
##   0.0077  0.8467188  0.5460389  0.005518118  0.02216232
##   0.0079  0.8460932  0.5419583  0.007144320  0.03149372
##   0.0081  0.8458847  0.5414910  0.006929118  0.03091918
##   0.0083  0.8458847  0.5414910  0.006929118  0.03091918
##   0.0085  0.8457805  0.5414673  0.006913012  0.03090712
##   0.0087  0.8457805  0.5414673  0.006913012  0.03090712
##   0.0089  0.8457805  0.5414673  0.006913012  0.03090712
##   0.0091  0.8457805  0.5417572  0.006913012  0.03097151
##   0.0093  0.8457805  0.5417572  0.006913012  0.03097151
##   0.0095  0.8457805  0.5417572  0.006913012  0.03097151
##   0.0097  0.8457805  0.5417572  0.006913012  0.03097151
##   0.0099  0.8453636  0.5409112  0.007141527  0.03083063
 
Accuracy was used to select the optimal model using  the largest value.

The final value used for the model was cp = 0.0013.
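
Selecting the optimum cp amounts to taking the row of the results table with the highest cross-validated accuracy. A small sketch using three rows copied from the table above follows. The data frame `results` is rebuilt by hand here, not taken from the caret object.

```r
# Three rows copied from the printed cross-validation results.
results <- data.frame(
  cp       = c(0.0011, 0.0013, 0.0015),
  Accuracy = c(0.8554226, 0.8557352, 0.8556310)
)

# Pick the cp whose cross-validated accuracy is largest.
best_cp <- results$cp[which.max(results$Accuracy)]
print(best_cp)  # 0.0013
```

The accuracy differences between these neighbouring cp values (about 0.0001 to 0.0003) are much smaller than the reported accuracy SD (about 0.005), so the preference for 0.0013 over its neighbours is a weak one.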

Use the optimum cp value to build the CART model with Cross Validation.


censusTreeCV <- rpart(fiftyKPlus ~age + workclass + education + maritalstatus + 
                        occupation + relationship + race + sex + capitalgain + 
                        capitalloss + hoursperweek + nativecountry,
                   data = train,
                   method ='class',
                   control = rpart.control(cp=cpTable$bestTune$cp))

Plot the censusTreeCV.


prp(censusTreeCV, type=1, compress=FALSE, box.col=c('yellow', 'pink')[censusTreeCV$frame$yval])

Build a Confusion Matrix and compute the accuracy of the CART model with Cross Validation on test dataset.

predict.CensusTreeCV <- predict(censusTreeCV, newdata=test, type='class')
confusionMatrix <- table(test$fiftyKPlus, predict.CensusTreeCV)

accuracy.CensusTreeCV <- (confusionMatrix[1,1] + confusionMatrix[2,2])/sum(confusionMatrix)

sprintf('The accuracy of the CART model with Cross Validation is: %f', accuracy.CensusTreeCV)

The accuracy of the CART model with Cross Validation is: 0.862403

Performance of the CART model with Cross Validation: Compute the area under the ROC curve.


predict.CensusTreeCV2 <- predict(censusTreeCV, newdata=test)
predict.CensusTreeCV3 <- predict.CensusTreeCV2[,2]

predict.CensusTreeCV.ROC <- prediction(predict.CensusTreeCV3, test$fiftyKPlus)

perf.CensusTreeCV.ROC <- performance(predict.CensusTreeCV.ROC, 'tpr','fpr')

Plot Receiver Operating Characteristic (ROC) curve.


plot(perf.CensusTreeCV.ROC, main='ROC Curve for CART Model with Cross Validation',
     colorize=TRUE,
     print.cutoffs.at=seq(0,1,0.1),text.adj=c(-0.1,1.2))

Find the area under the ROC curve.

auc.CensusTreeCV <- as.numeric(performance(predict.CensusTreeCV.ROC, 'auc')@y.values)

sprintf('The area under the ROC curve for the CART Model with Cross Validation is: %f',auc.CensusTreeCV)

The area under the ROC curve for the CART Model with Cross Validation is: 0.871933

RANDOM FOREST MODEL


set.seed(6789)

Use the training dataset to build a Random Forest model.


censusRF <- randomForest(fiftyKPlus ~age + workclass + education + maritalstatus +
                        occupation + relationship + race + sex + capitalgain +
                        capitalloss + hoursperweek + nativecountry,
                        data=train)

Use the Random Forest model to predict the outcome of the test dataset.


predict.censusRF <- predict(censusRF, newdata=test)

Build Confusion Matrix and compute the accuracy of the Random Forest model on the test dataset.


confusionMatrix <- table(test$fiftyKPlus, predict.censusRF)

accuracy.censusRF <- (confusionMatrix[1,1] + confusionMatrix[2,2])/sum(confusionMatrix)

sprintf('The accuracy of the RANDOM FOREST model is: %f', accuracy.censusRF)

The accuracy of the RANDOM FOREST model is: 0.825580

LOGISTIC REGRESSION MODEL

Build a Logistic Regression model using the training dataset.


censusGLM <- glm(fiftyKPlus ~age + workclass + education + maritalstatus + 
                    occupation + relationship + race + sex + capitalgain + 
                    capitalloss + hoursperweek + nativecountry,
                data=train, 
                family = 'binomial')

summary(censusGLM)


## Call:
## glm(formula = fiftyKPlus ~ age + workclass + education + maritalstatus + 
##     occupation + relationship + race + sex + capitalgain + capitalloss + 
##     hoursperweek + nativecountry, family = "binomial", data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.3156  -0.4984  -0.1806  -0.0007   3.6020  
## 
## Coefficients: (1 not defined because of singularities)
##                                            Estimate Std. Error z value
## (Intercept)                              -7.429e+00  9.951e-01  -7.466
## age                                       2.467e-02  2.136e-03  11.550
## workclass Federal-gov                     1.232e+00  1.985e-01   6.205
## workclass Local-gov                       5.380e-01  1.798e-01   2.993
## workclass Never-worked                   -1.307e+01  9.850e+02  -0.013
## workclass Private                         7.284e-01  1.589e-01   4.585
## workclass Self-emp-inc                    9.060e-01  1.916e-01   4.728
## workclass Self-emp-not-inc                2.890e-01  1.760e-01   1.642
## workclass State-gov                       2.776e-01  1.957e-01   1.419
## workclass Without-pay                    -1.373e+01  5.976e+02  -0.023
## education 11th                           -1.379e-01  2.785e-01  -0.495
## education 12th                           -1.812e-01  4.071e-01  -0.445
## education 1st-4th                        -8.493e-01  7.243e-01  -1.173
## education 5th-6th                        -5.030e-01  4.564e-01  -1.102
## education 7th-8th                        -4.900e-01  3.069e-01  -1.596
## education 9th                            -4.253e-01  3.506e-01  -1.213
## education Assoc-acdm                      1.237e+00  2.307e-01   5.364
## education Assoc-voc                       1.427e+00  2.207e-01   6.465
## education Bachelors                       1.965e+00  2.057e-01   9.551
## education Doctorate                       2.905e+00  2.797e-01  10.387
## education HS-grad                         8.289e-01  2.002e-01   4.140
## education Masters                         2.329e+00  2.196e-01  10.606
## education Preschool                      -1.281e+01  3.742e+02  -0.034
## education Prof-school                     2.855e+00  2.637e-01  10.827
## education Some-college                    1.224e+00  2.029e-01   6.032
## maritalstatus Married-AF-spouse           2.395e+00  7.854e-01   3.049
## maritalstatus Married-civ-spouse          1.603e+00  3.824e-01   4.193
## maritalstatus Married-spouse-absent       3.472e-01  2.807e-01   1.237
## maritalstatus Never-married              -5.178e-01  1.138e-01  -4.550
## maritalstatus Separated                   5.013e-02  2.079e-01   0.241
## maritalstatus Widowed                     9.353e-02  2.021e-01   0.463
## occupation Adm-clerical                  -7.306e-02  1.283e-01  -0.569
## occupation Armed-Forces                  -1.393e+01  9.937e+02  -0.014
## occupation Craft-repair                   4.995e-02  1.092e-01   0.457
## occupation Exec-managerial                7.267e-01  1.133e-01   6.414
## occupation Farming-fishing               -1.214e+00  1.851e-01  -6.558
## occupation Handlers-cleaners             -7.398e-01  1.879e-01  -3.938
## occupation Machine-op-inspct             -3.752e-01  1.393e-01  -2.693
## occupation Other-service                 -9.234e-01  1.655e-01  -5.579
## occupation Priv-house-serv               -1.390e+01  2.166e+02  -0.064
## occupation Prof-specialty                 3.865e-01  1.213e-01   3.186
## occupation Protective-serv                6.512e-01  1.718e-01   3.790
## occupation Sales                          1.892e-01  1.170e-01   1.617
## occupation Tech-support                   5.251e-01  1.532e-01   3.428
## occupation Transport-moving                      NA         NA      NA
## relationship Not-in-family               -3.240e-02  3.786e-01  -0.086
## relationship Other-relative              -8.494e-01  3.448e-01  -2.463
## relationship Own-child                   -1.321e+00  3.894e-01  -3.391
## relationship Unmarried                   -2.496e-01  3.981e-01  -0.627
## relationship Wife                         1.340e+00  1.340e-01   9.995
## race Asian-Pac-Islander                   1.057e+00  3.695e-01   2.861
## race Black                                6.486e-01  3.191e-01   2.033
## race Other                                5.881e-01  4.800e-01   1.225
## race White                                7.755e-01  3.050e-01   2.542
## sex Male                                  8.798e-01  1.038e-01   8.479
## capitalgain                               3.165e-04  1.349e-05  23.467
## capitalloss                               6.440e-04  4.820e-05  13.359
## hoursperweek                              3.092e-02  2.116e-03  14.609
## nativecountry Canada                     -8.524e-01  8.920e-01  -0.956
## nativecountry China                      -2.168e+00  9.122e-01  -2.377
## nativecountry Columbia                   -3.213e+00  1.390e+00  -2.311
## nativecountry Cuba                       -7.595e-01  9.132e-01  -0.832
## nativecountry Dominican-Republic         -1.438e+01  3.437e+02  -0.042
## nativecountry Ecuador                    -1.454e+00  1.202e+00  -1.210
## nativecountry El-Salvador                -1.185e+00  1.008e+00  -1.176
## nativecountry England                    -5.668e-01  9.114e-01  -0.622
## nativecountry France                     -4.422e-01  1.106e+00  -0.400
## nativecountry Germany                    -6.506e-01  8.793e-01  -0.740
## nativecountry Greece                     -1.763e+00  1.093e+00  -1.613
## nativecountry Guatemala                  -1.428e+01  3.288e+02  -0.043
## nativecountry Haiti                      -1.314e+00  1.273e+00  -1.032
## nativecountry Holand-Netherlands         -1.357e+01  2.400e+03  -0.006
## nativecountry Honduras                   -2.018e+00  2.737e+00  -0.737
## nativecountry Hong                       -2.114e+00  1.246e+00  -1.696
## nativecountry Hungary                    -7.671e-01  1.297e+00  -0.592
## nativecountry India                      -1.460e+00  8.671e-01  -1.684
## nativecountry Iran                       -3.787e-01  1.029e+00  -0.368
## nativecountry Ireland                    -2.845e-01  1.042e+00  -0.273
## nativecountry Italy                      -2.959e-02  9.068e-01  -0.033
## nativecountry Jamaica                    -2.903e-01  9.633e-01  -0.301
## nativecountry Japan                      -8.212e-01  9.677e-01  -0.849
## nativecountry Laos                       -1.583e+01  7.962e+02  -0.020
## nativecountry Mexico                     -1.237e+00  8.610e-01  -1.437
## nativecountry Nicaragua                  -1.872e+00  1.394e+00  -1.343
## nativecountry Outlying-US(Guam-USVI-etc) -1.525e+01  7.940e+02  -0.019
## nativecountry Peru                       -1.903e+00  1.501e+00  -1.267
## nativecountry Philippines                -7.124e-01  8.401e-01  -0.848
## nativecountry Poland                     -8.865e-01  9.495e-01  -0.934
## nativecountry Portugal                   -1.682e+00  1.354e+00  -1.243
## nativecountry Puerto-Rico                -1.373e+00  9.798e-01  -1.401
## nativecountry Scotland                   -6.633e-01  1.176e+00  -0.564
## nativecountry South                      -1.900e+00  9.244e-01  -2.055
## nativecountry Taiwan                     -7.558e-01  9.616e-01  -0.786
## nativecountry Thailand                   -2.125e+00  1.231e+00  -1.726
## nativecountry Trinadad&Tobago            -8.819e-01  1.226e+00  -0.719
## nativecountry United-States              -7.282e-01  8.179e-01  -0.890
## nativecountry Vietnam                    -1.221e+00  9.880e-01  -1.236
## nativecountry Yugoslavia                  3.934e-01  1.246e+00   0.316
##                                          Pr(>|z|)    
## (Intercept)                              8.28e-14 ***
## age                                       < 2e-16 ***
## workclass Federal-gov                    5.48e-10 ***
## workclass Local-gov                      0.002767 ** 
## workclass Never-worked                   0.989415    
## workclass Private                        4.54e-06 ***
## workclass Self-emp-inc                   2.26e-06 ***
## workclass Self-emp-not-inc               0.100665    
## workclass State-gov                      0.155982    
## workclass Without-pay                    0.981669    
## education 11th                           0.620585    
## education 12th                           0.656309    
## education 1st-4th                        0.240990    
## education 5th-6th                        0.270411    
## education 7th-8th                        0.110416    
## education 9th                            0.225078    
## education Assoc-acdm                     8.13e-08 ***
## education Assoc-voc                      1.01e-10 ***
## education Bachelors                       < 2e-16 ***
## education Doctorate                       < 2e-16 ***
## education HS-grad                        3.47e-05 ***
## education Masters                         < 2e-16 ***
## education Preschool                      0.972702    
## education Prof-school                     < 2e-16 ***
## education Some-college                   1.62e-09 ***
## maritalstatus Married-AF-spouse          0.002295 ** 
## maritalstatus Married-civ-spouse         2.75e-05 ***
## maritalstatus Married-spouse-absent      0.216096    
## maritalstatus Never-married              5.36e-06 ***
## maritalstatus Separated                  0.809444    
## maritalstatus Widowed                    0.643481    
## occupation Adm-clerical                  0.569102    
## occupation Armed-Forces                  0.988812    
## occupation Craft-repair                  0.647523    
## occupation Exec-managerial               1.42e-10 ***
## occupation Farming-fishing               5.46e-11 ***
## occupation Handlers-cleaners             8.22e-05 ***
## occupation Machine-op-inspct             0.007089 ** 
## occupation Other-service                 2.42e-08 ***
## occupation Priv-house-serv               0.948837    
## occupation Prof-specialty                0.001442 ** 
## occupation Protective-serv               0.000150 ***
## occupation Sales                         0.105965    
## occupation Tech-support                  0.000608 ***
## occupation Transport-moving                    NA    
## relationship Not-in-family               0.931792    
## relationship Other-relative              0.013774 *  
## relationship Own-child                   0.000696 ***
## relationship Unmarried                   0.530705    
## relationship Wife                         < 2e-16 ***
## race Asian-Pac-Islander                  0.004226 ** 
## race Black                               0.042071 *  
## race Other                               0.220512    
## race White                               0.011010 *  
## sex Male                                  < 2e-16 ***
## capitalgain                               < 2e-16 ***
## capitalloss                               < 2e-16 ***
## hoursperweek                              < 2e-16 ***
## nativecountry Canada                     0.339268    
## nativecountry China                      0.017471 *  
## nativecountry Columbia                   0.020846 *  
## nativecountry Cuba                       0.405575    
## nativecountry Dominican-Republic         0.966629    
## nativecountry Ecuador                    0.226257    
## nativecountry El-Salvador                0.239739    
## nativecountry England                    0.533998    
## nativecountry France                     0.689224    
## nativecountry Germany                    0.459358    
## nativecountry Greece                     0.106651    
## nativecountry Guatemala                  0.965373    
## nativecountry Haiti                      0.301939    
## nativecountry Holand-Netherlands         0.995488    
## nativecountry Honduras                   0.460864    
## nativecountry Hong                       0.089910 .  
## nativecountry Hungary                    0.554123    
## nativecountry India                      0.092160 .  
## nativecountry Iran                       0.712740    
## nativecountry Ireland                    0.784864    
## nativecountry Italy                      0.973967    
## nativecountry Jamaica                    0.763106    
## nativecountry Japan                      0.396072    
## nativecountry Laos                       0.984139    
## nativecountry Mexico                     0.150722    
## nativecountry Nicaragua                  0.179252    
## nativecountry Outlying-US(Guam-USVI-etc) 0.984673    
## nativecountry Peru                       0.205029    
## nativecountry Philippines                0.396441    
## nativecountry Poland                     0.350469    
## nativecountry Portugal                   0.213951    
## nativecountry Puerto-Rico                0.161076    
## nativecountry Scotland                   0.572856    
## nativecountry South                      0.039884 *  
## nativecountry Taiwan                     0.431905    
## nativecountry Thailand                   0.084266 .  
## nativecountry Trinadad&Tobago            0.471926    
## nativecountry United-States              0.373292    
## nativecountry Vietnam                    0.216499    
## nativecountry Yugoslavia                 0.752289    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 21175  on 19186  degrees of freedom
## Residual deviance: 12121  on 19090  degrees of freedom
## AIC: 12315
## 
## Number of Fisher Scoring iterations: 15
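
Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios. A short sketch using three estimates copied from the summary above follows. The vector is typed in by hand. With the fitted model, `exp(coef(censusGLM))` would be the equivalent call.

```r
# Estimates copied from the coefficient table (log-odds scale).
log_odds <- c(age = 2.467e-02, sexMale = 8.798e-01, hoursperweek = 3.092e-02)

# Odds ratio for a one-unit increase (or for Male vs. Female), other terms held fixed.
odds_ratios <- exp(log_odds)
print(round(odds_ratios, 3))
```

For example, each additional year of age multiplies the estimated odds of earning more than $50K by about 1.025, holding the other predictors fixed.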

Use the logistic regression model to predict the outcome of the test dataset.


predict.CensusGLM <- predict(censusGLM, newdata=test, type='response')

Build Confusion Matrix and compute the accuracy of the logistic regression model on test dataset.


confusionMatrix <- table(test$fiftyKPlus, predict.CensusGLM>=0.5)

accuracy.CensusGLM <- (confusionMatrix[1,1] + confusionMatrix[2,2])/sum(confusionMatrix)
sprintf('The accuracy of the Logistic Regression model is: %f', accuracy.CensusGLM)

The accuracy of the Logistic Regression model is: 0.850989
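
With a two-level factor outcome, glm treats the first level as failure and models the probability of the second level, so type='response' here returns the estimated probability of " >50K". The sketch below demonstrates this on simulated data (all variables are made up) and applies the same 0.5 cutoff.

```r
set.seed(42)
# Simulated data: larger x makes ">50K" more likely.
x <- rnorm(200)
y <- factor(ifelse(x + rnorm(200) > 0, ">50K", "<=50K"),
            levels = c("<=50K", ">50K"))

fit <- glm(y ~ x, family = "binomial")

# Predicted probability of the second factor level (">50K").
p <- predict(fit, type = "response")

# Classify with a 0.5 cutoff and cross-tabulate against the truth.
cm <- table(actual = y, predicted_over_50K = p >= 0.5)
print(cm)
accuracy <- sum(diag(cm)) / sum(cm)
print(accuracy)
```

The 0.5 cutoff is a convention rather than a requirement. The cutoffs printed on the ROC curve show how other choices trade true positives against false positives.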

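Compute the baseline accuracy, i.e., the accuracy on the test dataset obtained by always predicting the most frequent class in the training dataset.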


baselineTrainMatrix <- table(train$fiftyKPlus)
baselineTestMatrix <- table(test$fiftyKPlus)

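# Predict the majority class of the training data for every test row.
if(baselineTrainMatrix[1] > baselineTrainMatrix[2])
{
  baselineAccuracy.CensusGLM <- baselineTestMatrix[1]/sum(baselineTestMatrix)
}else
{
  # Second class is the majority: count its share of the test rows.
  baselineAccuracy.CensusGLM <- baselineTestMatrix[2]/sum(baselineTestMatrix)
}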

sprintf('The baseline accuracy of the Logistic Regression Model is: %f',baselineAccuracy.CensusGLM)

The baseline accuracy of the Logistic Regression Model is: 0.759362
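
The majority-class baseline can also be written without branching on the two classes. A minimal sketch on made-up labels (trainLabels and testLabels are hypothetical stand-ins for train$fiftyKPlus and test$fiftyKPlus):

```r
# Hypothetical labels (toy values only).
trainLabels <- c(0, 0, 0, 1, 0, 1, 0, 0)
testLabels  <- c(0, 1, 0, 0, 1)

# Most frequent class in the training data.
majorityClass <- names(which.max(table(trainLabels)))

# Share of test cases for which that constant prediction is correct.
baselineAccuracy <- mean(as.character(testLabels) == majorityClass)
print(baselineAccuracy)
```

This form also works unchanged when the label has more than two classes.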

Performance of the Logistic Regression model: Plot the Receiver Operating Characteristic (ROC) curve and compute the area under the ROC curve.


predict.CensusGLM.ROC <- prediction(predict.CensusGLM, test$fiftyKPlus)

perf.CensusGLM.ROC <- performance(predict.CensusGLM.ROC, 'tpr','fpr')

plot(perf.CensusGLM.ROC, main='ROC Curve for Logistic Regression Model',
     colorize=TRUE,
     print.cutoffs.at=seq(0,1,0.1),text.adj=c(-0.1,1.2))

auc.CensusGLM <- as.numeric(performance(predict.CensusGLM.ROC, 'auc')@y.values)
     
sprintf('The area under the ROC curve for the Logistic Regression Model is: %f',auc.CensusGLM)

The area under the ROC curve for the Logistic Regression Model is: 0.905733
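
The area under the ROC curve can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below checks this interpretation on made-up scores with base R only, using the rank-sum (Mann-Whitney) formula; it is an illustration, not the ROCR computation used above, and labels, scores and aucRank are hypothetical names.

```r
# Hypothetical labels and scores (toy values only, no tied scores).
labels <- c(0, 0, 1, 1, 0, 1)
scores <- c(0.10, 0.60, 0.80, 0.40, 0.20, 0.90)

nPos <- sum(labels == 1)
nNeg <- sum(labels == 0)

# Rank-sum formula: sum of the positives' ranks, minus the smallest
# possible rank sum, divided by the number of positive/negative pairs.
r <- rank(scores)
aucRank <- (sum(r[labels == 1]) - nPos * (nPos + 1) / 2) / (nPos * nNeg)

# Direct check: fraction of positive/negative pairs ranked correctly.
pairs <- outer(scores[labels == 1], scores[labels == 0], '>')
aucPairs <- mean(pairs)

print(c(aucRank = aucRank, aucPairs = aucPairs))
```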
