R Code for Market Segmentation for Airlines Project

Problem: Use a clustering algorithm to segment the market of an airline's frequent flyer program.

Data source: www.dataminingbook.com


Load the data.

setwd('C:/Users/Sachin/Desktop/MyRData/Airlines')

airlines <- read.csv('FrequentFlyerProgram.csv')
str(airlines)

Have a look at the data and the data types. The data frame contains 3999 observations of 7 variables, all of which are integers.

## 'data.frame':    3999 obs. of  7 variables:
##  $ Balance        : int  28143 19244 41354 14776 97752 16420 84914 20856 443003 ...
##  $ QualMiles      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ BonusMiles     : int  174 215 4123 500 43300 0 27482 5250 1753 28426 ...
##  $ BonusTrans     : int  1 2 4 1 26 0 25 4 43 28 ...
##  $ FlightMiles    : int  0 0 0 0 2077 0 0 250 3850 1150 ...
##  $ FlightTrans    : int  0 0 0 0 4 0 0 1 12 3 ...
##  $ DaysSinceEnroll: int  7000 6968 7034 6952 6935 6942 6994 6938 6948 6931 ...

Look at the summary of the data.

summary(airlines)
##     Balance          QualMiles         BonusMiles       BonusTrans  
##  Min.   :      0   Min.   :    0.0   Min.   :     0   Min.   : 0.0  
##  1st Qu.:  18528   1st Qu.:    0.0   1st Qu.:  1250   1st Qu.: 3.0  
##  Median :  43097   Median :    0.0   Median :  7171   Median :12.0  
##  Mean   :  73601   Mean   :  144.1   Mean   : 17145   Mean   :11.6  
##  3rd Qu.:  92404   3rd Qu.:    0.0   3rd Qu.: 23801   3rd Qu.:17.0  
##  Max.   :1704838   Max.   :11148.0   Max.   :263685   Max.   :86.0  

##   FlightMiles       FlightTrans     DaysSinceEnroll
##  Min.   :    0.0   Min.   : 0.000   Min.   :   2   
##  1st Qu.:    0.0   1st Qu.: 0.000   1st Qu.:2330   
##  Median :    0.0   Median : 0.000   Median :4096   
##  Mean   :  460.1   Mean   : 1.374   Mean   :4119   
##  3rd Qu.:  311.0   3rd Qu.: 1.000   3rd Qu.:5790   
##  Max.   :30817.0   Max.   :53.000   Max.   :8296

Note that the data frame does not have any missing values. However, observe that some variables, such as Balance and BonusMiles, take much larger values on average than others, such as FlightTrans and BonusTrans.

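Therefore, before building a model, it is important to scale the data. In this case, we will normalize it: each variable is centered by subtracting its mean and then divided by its standard deviation. Normalization prevents the distance calculations used in clustering from being dominated by variables that are on a larger scale.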

To do so, we will require the caret package. If not installed, install the caret package and load it.

After normalization, the mean of every variable is zero and the standard deviation of every variable is one, as the summary and sd output below confirm.


# Install caret only if it is not already available, then load it
if (!requireNamespace("caret", quietly = TRUE)) {
  install.packages("caret")
}
library(caret)

preProc <- preProcess(airlines)
airlines.Normalized <- predict(preProc, airlines)

summary(airlines.Normalized)
##     Balance          QualMiles         BonusMiles        BonusTrans      
##  Min.   :-0.7303   Min.   :-0.1863   Min.   :-0.7099   Min.   :-1.20805  
##  1st Qu.:-0.5465   1st Qu.:-0.1863   1st Qu.:-0.6581   1st Qu.:-0.89568  
##  Median :-0.3027   Median :-0.1863   Median :-0.4130   Median : 0.04145  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.00000  
##  3rd Qu.: 0.1866   3rd Qu.:-0.1863   3rd Qu.: 0.2756   3rd Qu.: 0.56208  
##  Max.   :16.1868   Max.   :14.2231   Max.   :10.2083   Max.   : 7.74673  

##   FlightMiles       FlightTrans       DaysSinceEnroll   
##  Min.   :-0.3286   Min.   :-0.36212   Min.   :-1.99336  
##  1st Qu.:-0.3286   1st Qu.:-0.36212   1st Qu.:-0.86607  
##  Median :-0.3286   Median :-0.36212   Median :-0.01092  
##  Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.00000  
##  3rd Qu.:-0.1065   3rd Qu.:-0.09849   3rd Qu.: 0.80960  
##  Max.   :21.6803   Max.   :13.61035   Max.   : 2.02284

sapply(airlines.Normalized, sd)
##         Balance       QualMiles      BonusMiles      BonusTrans 
##               1               1               1               1 

##     FlightMiles     FlightTrans DaysSinceEnroll 
##               1               1               1
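
The normalization above is a z-score transform. As a minimal sketch, the same arithmetic can be reproduced by hand and with base R's scale(). The toy data frame reuses the first five Balance and BonusTrans values from the str output, and the names toy, zManual and zScale are illustrative only.

```r
# Toy data frame: first five Balance and BonusTrans values from the str output
toy <- data.frame(
  Balance    = c(28143, 19244, 41354, 14776, 97752),
  BonusTrans = c(1, 2, 4, 1, 26)
)

# z-score by hand: subtract each column's mean, divide by its standard deviation
zManual <- as.data.frame(lapply(toy, function(x) (x - mean(x)) / sd(x)))

# Base R's scale() performs the same centering and scaling
zScale <- as.data.frame(scale(toy))

round(colMeans(zManual), 10)   # column means are numerically zero
sapply(zManual, sd)            # column standard deviations are one

# Largest absolute difference between the two approaches (effectively zero)
max(abs(as.matrix(zManual) - as.matrix(zScale)))
```

After this transform a one-unit difference means "one standard deviation" for every variable, so Balance no longer outweighs BonusTrans in a Euclidean distance.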


HIERARCHICAL CLUSTERING

Build a hierarchical clustering model.


distances <- dist(airlines.Normalized, method = "euclidean")
airlinesCluster <- hclust(distances, method = "ward.D")

Plot the dendrogram of the cluster.


plot(airlinesCluster)


The dendrogram suggests that 2, 3, 5 or 7 could each be a reasonable number of clusters. Deciding on the number of clusters may require consultation with various stakeholders. For the purpose of this project, let us choose 5 clusters.


clusterGroups <- cutree(airlinesCluster, k = 5)
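
Reading a cut point off a dendrogram by eye can be supplemented by looking at the merge heights that hclust stores. The sketch below uses hypothetical toy data (three well-separated groups of points), not the airline data, and the names toy, toyCluster, topHeights and gaps are made up for illustration.

```r
# Toy data (hypothetical): three well-separated groups of 2-D points
set.seed(1)
toy <- rbind(
  matrix(rnorm(20, mean = 0,  sd = 0.3), ncol = 2),
  matrix(rnorm(20, mean = 5,  sd = 0.3), ncol = 2),
  matrix(rnorm(20, mean = 10, sd = 0.3), ncol = 2)
)

toyCluster <- hclust(dist(toy, method = "euclidean"), method = "ward.D")

# hclust records one height per merge (n - 1 in total);
# the last few are the merges near the top of the tree
topHeights <- rev(tail(toyCluster$height, 5))
topHeights

# Drop in height between consecutive top merges; the gap labelled "k=j"
# is the vertical room available for a cut that yields j clusters
gaps <- -diff(topHeights)
names(gaps) <- paste0("k=", 2:5)
gaps

# Cluster sizes for one candidate cut
table(cutree(toyCluster, k = 3))
```

A large gap only marks a candidate; as noted above, the final choice of k is still a judgment call.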

Look at the number of data points in each of the five clusters.


table(clusterGroups)
## clusterGroups
##    1    2    3    4    5 
##  776  519  494  868 1342

Compare the average values of each variable across the five clusters. To make the averages easier to interpret, we will use the unnormalized data, airlines.

Divide the data in airlines into the groups defined by clusterGroups by using split.


sapply(split(airlines, clusterGroups), colMeans)
##                            1            2            3            4
## Balance         5.786690e+04 1.106693e+05 1.981916e+05 52335.913594
## QualMiles       6.443299e-01 1.065983e+03 3.034615e+01     4.847926
## BonusMiles      1.036012e+04 2.288176e+04 5.579586e+04 20788.766129
## BonusTrans      1.082345e+01 1.822929e+01 1.966397e+01    17.087558
## FlightMiles     8.318428e+01 2.613418e+03 3.276761e+02   111.573733
## FlightTrans     3.028351e-01 7.402697e+00 1.068826e+00     0.344470
## DaysSinceEnroll 6.235365e+03 4.402414e+03 5.615709e+03  2840.822581

##                            5
## Balance         3.625591e+04
## QualMiles       2.511177e+00
## BonusMiles      2.264788e+03
## BonusTrans      2.973174e+00
## FlightMiles     1.193219e+02
## FlightTrans     4.388972e-01
## DaysSinceEnroll 3.060081e+03
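
To see what the split and sapply combination does step by step, here is a small sketch on hypothetical toy data. The data frame toy and the vector toyGroups are made up for illustration, and aggregate() is shown only as an alternative layout, not as the approach used in this project.

```r
# Toy data (hypothetical values) with a made-up cluster assignment
toy <- data.frame(
  Balance     = c(100, 200, 300, 400, 500, 600),
  FlightTrans = c(0, 2, 4, 6, 8, 10)
)
toyGroups <- c(1, 1, 2, 2, 3, 3)

# split() returns a list of data frames, one per group
pieces <- split(toy, toyGroups)
names(pieces)

# sapply() applies colMeans to each piece and binds the results as columns:
# rows are variables, columns are clusters
toyMeans <- sapply(pieces, colMeans)
toyMeans

# aggregate() computes the same averages with clusters as rows instead
aggregate(toy, by = list(cluster = toyGroups), FUN = mean)
```

The variables-as-rows layout produced by sapply is the one the rest of this analysis relies on.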

For readability, turn off scientific notation so that the numbers print in fixed decimal format.

options(scipen = 999)
sapply(split(airlines, clusterGroups), colMeans)
##                             1             2             3            4
## Balance         57866.9046392 110669.265896 198191.574899 52335.913594
## QualMiles           0.6443299   1065.982659     30.346154     4.847926
## BonusMiles      10360.1237113  22881.763006  55795.860324 20788.766129
## BonusTrans         10.8234536     18.229287     19.663968    17.087558
## FlightMiles        83.1842784   2613.418112    327.676113   111.573733
## FlightTrans         0.3028351      7.402697      1.068826     0.344470
## DaysSinceEnroll  6235.3646907   4402.414258   5615.708502  2840.822581

##                             5
## Balance         36255.9098361
## QualMiles           2.5111773
## BonusMiles       2264.7876304
## BonusTrans          2.9731744
## FlightMiles       119.3219076
## FlightTrans         0.4388972
## DaysSinceEnroll  3060.0812221

Find the clusters with the largest and smallest average value of each variable.

clusterTable <- sapply(split(airlines, clusterGroups), colMeans)

# Return the name of the cluster with the largest mean for variable x
findClusterGroupMax <- function(x) {
  colnames(clusterTable)[which.max(clusterTable[x, ])]
}

sapply(names(airlines), findClusterGroupMax)

Each variable has its largest average in the cluster shown below its name.

##         Balance       QualMiles      BonusMiles      BonusTrans 
##             "3"             "2"             "3"             "3" 

##     FlightMiles     FlightTrans DaysSinceEnroll 
##             "2"             "2"             "1"

# Return the name of the cluster with the smallest mean for variable x
findClusterGroupMin <- function(x) {
  colnames(clusterTable)[which.min(clusterTable[x, ])]
}

sapply(names(airlines), findClusterGroupMin)

Each variable has its smallest average in the cluster shown below its name.

##         Balance       QualMiles      BonusMiles      BonusTrans 
##             "5"             "1"             "5"             "5" 

##     FlightMiles     FlightTrans DaysSinceEnroll 
##             "1"             "1"             "4"
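
The two helper functions above can also be written as a single row-wise apply over the table of means. This is a sketch on a hypothetical toy table laid out like clusterTable (variables as rows, clusters as columns); toyTable, maxCluster and minCluster are illustrative names.

```r
# Toy table of cluster means (hypothetical values): rows are variables,
# columns are clusters, mirroring the layout of clusterTable
toyTable <- matrix(
  c(10, 30, 20,
     5,  1,  9),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("Balance", "FlightTrans"), c("1", "2", "3"))
)

# For each row, look up the column name holding the largest / smallest mean
maxCluster <- colnames(toyTable)[apply(toyTable, 1, which.max)]
minCluster <- colnames(toyTable)[apply(toyTable, 1, which.min)]

# Balance: max in "2", min in "1"; FlightTrans: max in "3", min in "2"
data.frame(variable = rownames(toyTable), max = maxCluster, min = minCluster)
```

Putting both results in one data frame makes it easier to read the extremes of each variable side by side.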


Analysis of the Clusters

Cluster 1 has the largest average DaysSinceEnroll and the smallest averages for QualMiles, FlightMiles and FlightTrans. Customers in Cluster 1 may, therefore, be described as infrequent but loyal customers.

Cluster 2 has the largest average values in QualMiles, FlightMiles and FlightTrans. This cluster also has relatively large values in BonusTrans and Balance. Customers in Cluster 2 may be described as customers who have accumulated a large number of miles and who make the most flight transactions.

Cluster 3 has the largest average values in Balance, BonusMiles and BonusTrans. It also has relatively large values in the other variables, but these are the three in which it leads. Customers in Cluster 3 may be described as customers who have accumulated a large number of miles, mostly through non-flight transactions.

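Cluster 4 does not have the largest average in any of the variables. However, it has the smallest average DaysSinceEnroll (about 2,841 days) together with fairly high BonusMiles (about 20,789) and BonusTrans (about 17.1) and very little flight activity. Cluster 4 therefore describes relatively new customers who seem to be accumulating miles, mostly through non-flight transactions.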

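Cluster 5 also does not have the largest average in any of the variables, and it has the smallest averages for Balance, BonusMiles and BonusTrans. With an average DaysSinceEnroll of about 3,060 days, the second smallest, Cluster 5 describes relatively new customers who do not use the airline very often.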



Conclusion

The hierarchical clustering used in this project divides the airline data into five clusters. This suggests that the airline's frequent flyer market may be segmented into the following five groups:
  1. Infrequent but loyal customers
  2. Customers with a large number of miles, mostly from flight transactions
  3. Customers with a large number of miles, mostly from non-flight transactions
  4. New customers accumulating miles from non-flight transactions
  5. New and infrequent customers
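
To use these segments downstream, the numeric cluster IDs can be mapped to readable labels. The sketch below assumes a hypothetical vector toyGroups standing in for clusterGroups, and the shortened segment names are illustrative wording rather than labels defined in the analysis.

```r
# Hypothetical cluster assignments for eight customers (values 1 to 5),
# standing in for the clusterGroups vector returned by cutree()
toyGroups <- c(1, 5, 3, 2, 4, 5, 5, 1)

# Short segment names (illustrative wording), in cluster order 1 to 5
segmentNames <- c(
  "Infrequent but loyal",
  "High miles, flight",
  "High miles, non-flight",
  "New, non-flight",
  "New and infrequent"
)

# factor() maps each cluster number to its label
toySegments <- factor(toyGroups, levels = 1:5, labels = segmentNames)

# Count of customers per named segment
table(toySegments)
```

The mapping depends on the cluster numbering from this particular run, so it should be rechecked against the table of cluster means whenever the clustering is rebuilt.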