What determines whether someone earns more than $50K?

We keep wondering what factors play a role in a person’s annual income. We may think of hundreds of factors, but are there that many?
Here I use publicly available census data (source: UCI Machine Learning Repository) to find which factors actually determine whether someone earns more than $50K annually.
The census data frame used for this project contains 13 variables: 12 independent variables (features) and one dependent variable (label).
The variable fiftyKPlus is the dependent variable, the one to be predicted. The remaining 12 variables are used to predict it. All 13 variables are shown in the table below.
| Name of Variable | Variable Type | Values |
|---|---|---|
| age | continuous | 17 - 90 years |
| workclass | categorical | Federal-gov, Local-gov, Never-worked, Private, Self-emp-inc, Self-emp-not-inc, State-gov, Without-pay |
| education | categorical | Preschool, 1st-4th, 5th-6th, 7th-8th, 9th, 10th, 11th, 12th, HS-Grad, Assoc-acdm, Assoc-voc, Prof-school, Some-college, Bachelors, Masters, Doctorate |
| maritalstatus | categorical | Divorced, Married-AF-spouse, Married-civ-spouse, Married-spouse-absent, Never-married, Separated, Widowed |
| occupation | categorical | Adm-clerical, Armed-Forces, Craft-repair, Exec-managerial, Farming-fishing, Handlers-cleaners, Machine-op-inspct, Other-service, Priv-house-serv, Prof-specialty, Protective-serv, Sales, Tech-support, Transport-moving |
| relationship | categorical | Unmarried, Husband, Wife, Not-in-family, Other-relative |
| race | categorical | White, Black, Amer-Indian-Eskimo, Asian-Pac-Islander, Other |
| sex | categorical | Female, Male |
| capitalgain | continuous | $0 - $100,000 |
| capitalloss | continuous | $0 - $4,356 |
| hoursperweek | continuous | 1 - 99 hours |
| nativecountry | categorical | Cambodia, Canada, China, Columbia, Cuba, Dominican-Republic, Ecuador, El-Salvador, England, France, Germany, etc. |
| fiftyKPlus | categorical | Less than or equal to $50,000, Greater than $50,000 |
The dependent variable fiftyKPlus has two classes, <=50K and >50K, so this is a classification problem. I built the following models in R to predict whether or not a person earns more than $50K annually:
- A CART (Classification and Regression Tree) model
- A CART model with cross-validation
- A Random Forest model
- A Logistic Regression model
The details of the modelling process and the R code for this project are available here.
To decide which model is best suited to predict whether or not a person earns more than 50K a year, the performance of each model has to be evaluated.
Performance of the models:
| Type of Model | Baseline Accuracy | Model Accuracy | AUC |
|---|---|---|---|
| CART | | 0.849582 | 0.846746 |
| CART with cross-validation | | 0.862403 | 0.871933 |
| Random Forest | | 0.825580 | NA |
| Logistic Regression | 0.759362 | 0.850989 | 0.905733 |
Since the random forest model has the lowest accuracy of the four models, it may not be the best choice for predicting the label in this project.
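To make the three metrics in the table concrete, here is a minimal sketch of how baseline accuracy, model accuracy and AUC can be computed. It is written in Python rather than the project's R, purely for illustration; the toy labels and scores are hypothetical and are not taken from the census data, and the function names are my own.

```python
from collections import Counter

def baseline_accuracy(labels):
    """Accuracy of always predicting the most frequent class."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)

def accuracy(labels, predictions):
    """Share of cases where the predicted class equals the true class."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)

def auc(labels, scores, positive=">50K"):
    """Probability that a random positive case gets a higher score than a
    random negative case (ties count as half a win)."""
    pos = [s for y, s in zip(labels, scores) if y == positive]
    neg = [s for y, s in zip(labels, scores) if y != positive]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical toy test set: six people at or below $50K, two above.
labels = ["<=50K"] * 6 + [">50K"] * 2
# Hypothetical predicted probabilities of earning more than $50K.
scores = [0.1, 0.2, 0.3, 0.2, 0.6, 0.4, 0.7, 0.5]
# Turn probabilities into classes with a 0.5 threshold.
predictions = [">50K" if s >= 0.5 else "<=50K" for s in scores]

print("baseline:", baseline_accuracy(labels))
print("accuracy:", accuracy(labels, predictions))
print("AUC:", round(auc(labels, scores), 3))
```

A model is only useful if its accuracy beats the baseline. In the logistic regression row, for example, 0.850989 is compared against a baseline of 0.759362.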
The logistic regression model does a little better than the CART model. However, the logistic regression model is not easily interpretable. Although the model coefficients may be used to determine the significance of features, the coefficients do not offer a simple explanation of how a decision is made. For example, look at the following results for the variable occupation from the summary of the logistic regression model.
| Term | Estimate | Std. Error | z value | p-value | Signif. |
|---|---|---|---|---|---|
| occupation Adm-clerical | -7.306e-02 | 1.283e-01 | -0.569 | 0.569102 | |
| occupation Armed-Forces | -1.393e+01 | 9.937e+02 | -0.014 | 0.988812 | |
| occupation Craft-repair | 4.995e-02 | 1.092e-01 | 0.457 | 0.647523 | |
| occupation Exec-managerial | 7.267e-01 | 1.133e-01 | 6.414 | 1.42e-10 | *** |
| occupation Farming-fishing | -1.214e+00 | 1.851e-01 | -6.558 | 5.46e-11 | *** |
| occupation Handlers-cleaners | -7.398e-01 | 1.879e-01 | -3.938 | 8.22e-05 | *** |
| occupation Machine-op-inspct | -3.752e-01 | 1.393e-01 | -2.693 | 0.007089 | ** |
| occupation Other-service | -9.234e-01 | 1.655e-01 | -5.579 | 2.42e-08 | *** |
| occupation Priv-house-serv | -1.390e+01 | 2.166e+02 | -0.064 | 0.948837 | |
| occupation Prof-specialty | 3.865e-01 | 1.213e-01 | 3.186 | 0.001442 | ** |
| occupation Protective-serv | 6.512e-01 | 1.718e-01 | 3.790 | 0.000150 | *** |
| occupation Sales | 1.892e-01 | 1.170e-01 | 1.617 | 0.105965 | |
| occupation Tech-support | 5.251e-01 | 1.532e-01 | 3.428 | 0.000608 | *** |
Observe how some levels of the variable are marked as highly significant (with three asterisks) whereas others are not significant at all. A similar pattern can be seen with other variables too. This is complicated! The model is difficult to use to quickly make a prediction for a new case. In summary, the logistic regression model is not easily interpretable.
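To show what reading one of these rows involves, here is a small sketch that recomputes the z value and p-value from the estimate and standard error, and converts the estimate into an odds ratio. It uses Python rather than the project's R, only to illustrate the arithmetic. The three rows are copied from the occupation results above; the function name is my own, and the p-value assumes the usual two-sided normal (Wald) test. Note that each odds ratio is relative to whichever occupation level the model used as its reference category, which the excerpt does not show.

```python
import math

# (estimate, standard error) copied from the occupation rows of the summary.
coefficients = {
    "Exec-managerial": (0.7267, 0.1133),
    "Farming-fishing": (-1.214, 0.1851),
    "Sales": (0.1892, 0.1170),
}

def describe(estimate, std_error):
    """Return (z value, two-sided p-value, odds ratio) for one coefficient."""
    z = estimate / std_error
    # Two-sided p-value under a standard normal distribution (assumed Wald test).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    # exp(coefficient) is the multiplicative change in the odds of >50K
    # relative to the (unshown) reference occupation.
    odds_ratio = math.exp(estimate)
    return z, p_value, odds_ratio

results = {name: describe(est, se) for name, (est, se) in coefficients.items()}

for name, (z, p_value, odds_ratio) in results.items():
    print(f"{name}: z={z:.3f}, p={p_value:.3g}, odds ratio={odds_ratio:.3f}")
```

Even with this translation, a prediction for a new person requires adding up the coefficients of every feature and passing the sum through the logistic function, which is why the model is hard to apply at a glance.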
On the other hand, the CART model is more easily interpretable. A CART model is also preferable because it does not assume a linear relationship between the features and the log-odds of the outcome, as a logistic regression model does.
As seen in the plot of the CART model (see code), the features that split the tree are relationship, capitalgain and education. The CART model is more interpretable in the sense that it shows directly that these three features are the strong predictors of whether or not a person earns more than $50K annually.
The CART model with cross-validation has the highest accuracy of the four models. Although the plot of the CART model with cross-validation presents a tree that is more complex than that of the CART model without cross-validation, a closer look shows that both trees are split on the same three features: relationship, capitalgain and education.
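For readers wondering how a tree decides which feature to split on, here is a minimal sketch of one common criterion, Gini impurity. This is an assumption on my part: the text above does not say which splitting criterion the models used (Gini is the default for classification trees in R's rpart package). The sketch is in Python rather than the project's R, and the eight rows are hypothetical, not taken from the census data.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0 for a pure group, larger for a more mixed group."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_gain(rows, feature, test):
    """Reduction in Gini impurity from splitting rows on test(row[feature])."""
    parent = [r["label"] for r in rows]
    left = [r["label"] for r in rows if test(r[feature])]
    right = [r["label"] for r in rows if not test(r[feature])]
    n = len(rows)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - weighted

# Hypothetical toy rows, for illustration only.
rows = [
    {"relationship": "Husband", "capitalgain": 0, "label": ">50K"},
    {"relationship": "Husband", "capitalgain": 0, "label": ">50K"},
    {"relationship": "Husband", "capitalgain": 0, "label": "<=50K"},
    {"relationship": "Not-in-family", "capitalgain": 8000, "label": ">50K"},
    {"relationship": "Not-in-family", "capitalgain": 0, "label": "<=50K"},
    {"relationship": "Unmarried", "capitalgain": 0, "label": "<=50K"},
    {"relationship": "Unmarried", "capitalgain": 0, "label": "<=50K"},
    {"relationship": "Other-relative", "capitalgain": 0, "label": "<=50K"},
]

# Compare two candidate splits; the threshold 5000 is arbitrary.
gain_relationship = split_gain(rows, "relationship", lambda v: v == "Husband")
gain_capitalgain = split_gain(rows, "capitalgain", lambda v: v >= 5000)

print("gain from relationship split:", round(gain_relationship, 4))
print("gain from capitalgain split:", round(gain_capitalgain, 4))
```

A tree-building algorithm evaluates many such candidate splits and keeps the one with the largest gain, then repeats the process within each branch. Features that never produce a worthwhile gain simply do not appear in the tree.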
We may thus conclude that, out of the four models, the CART model with cross-validation best predicts whether or not a person earns more than $50K a year.
And which factors are the strong predictors? As we just saw, the following three factors are the strongest indicators of whether a person earns more than $50K a year:
- relationship
- capitalgain
- education
Here’s the complete R code for the Census Project.