R Code for Market Segmentation for Frequent Flyer Program
Problem: Use a clustering algorithm to segment the market of an airline's frequent flyer program.
Data source: www.dataminingbook.com
Load the data.
setwd('C:/Users/Sachin/Desktop/MyRData/Airlines')
airlines <- read.csv('FrequentFlyerProgram.csv')
str(airlines)
Inspect the structure of the data and the data types. The data frame contains 3999 observations of 7 variables, all of which are integers.
## 'data.frame': 3999 obs. of 7 variables:
## $ Balance : int 28143 19244 41354 14776 97752 16420 84914 20856 443003 ...
## $ QualMiles : int 0 0 0 0 0 0 0 0 0 0 ...
## $ BonusMiles : int 174 215 4123 500 43300 0 27482 5250 1753 28426 ...
## $ BonusTrans : int 1 2 4 1 26 0 25 4 43 28 ...
## $ FlightMiles : int 0 0 0 0 2077 0 0 250 3850 1150 ...
## $ FlightTrans : int 0 0 0 0 4 0 0 1 12 3 ...
## $ DaysSinceEnroll: int 7000 6968 7034 6952 6935 6942 6994 6938 6948 6931 ...
Look at the summary of the data.
summary(airlines)
## Balance QualMiles BonusMiles BonusTrans
## Min. : 0 Min. : 0.0 Min. : 0 Min. : 0.0
## 1st Qu.: 18528 1st Qu.: 0.0 1st Qu.: 1250 1st Qu.: 3.0
## Median : 43097 Median : 0.0 Median : 7171 Median :12.0
## Mean : 73601 Mean : 144.1 Mean : 17145 Mean :11.6
## 3rd Qu.: 92404 3rd Qu.: 0.0 3rd Qu.: 23801 3rd Qu.:17.0
## Max. :1704838 Max. :11148.0 Max. :263685 Max. :86.0
## FlightMiles FlightTrans DaysSinceEnroll
## Min. : 0.0 Min. : 0.000 Min. : 2
## 1st Qu.: 0.0 1st Qu.: 0.000 1st Qu.:2330
## Median : 0.0 Median : 0.000 Median :4096
## Mean : 460.1 Mean : 1.374 Mean :4119
## 3rd Qu.: 311.0 3rd Qu.: 1.000 3rd Qu.:5790
## Max. :30817.0 Max. :53.000 Max. :8296
Note that the data frame has no missing values. However, some variables, such as Balance and BonusMiles, take much larger values on average than others, such as FlightTrans and BonusTrans.
Therefore, before building a model, we need to scale the data. Here we normalize it, which prevents the clustering from being dominated by variables measured on a larger scale.
To do so, we use the caret package, installing it first if necessary.
After normalization, observe that the mean of every variable is zero and the standard deviation of every variable is one.
# Install caret only if it is not already available, then load it
if (!require(caret)) {
  install.packages("caret")
}
library(caret)
preProc <- preProcess(airlines)
airlines.Normalized <- predict(preProc, airlines)
summary(airlines.Normalized)
## Balance QualMiles BonusMiles BonusTrans
## Min. :-0.7303 Min. :-0.1863 Min. :-0.7099 Min. :-1.20805
## 1st Qu.:-0.5465 1st Qu.:-0.1863 1st Qu.:-0.6581 1st Qu.:-0.89568
## Median :-0.3027 Median :-0.1863 Median :-0.4130 Median : 0.04145
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.00000
## 3rd Qu.: 0.1866 3rd Qu.:-0.1863 3rd Qu.: 0.2756 3rd Qu.: 0.56208
## Max. :16.1868 Max. :14.2231 Max. :10.2083 Max. : 7.74673
## FlightMiles FlightTrans DaysSinceEnroll
## Min. :-0.3286 Min. :-0.36212 Min. :-1.99336
## 1st Qu.:-0.3286 1st Qu.:-0.36212 1st Qu.:-0.86607
## Median :-0.3286 Median :-0.36212 Median :-0.01092
## Mean : 0.0000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.:-0.1065 3rd Qu.:-0.09849 3rd Qu.: 0.80960
## Max. :21.6803 Max. :13.61035 Max. : 2.02284
sapply(airlines.Normalized, sd)
## Balance QualMiles BonusMiles BonusTrans
## 1 1 1 1
## FlightMiles FlightTrans DaysSinceEnroll
## 1 1 1
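The normalization above can also be reproduced without caret. The following is a minimal sketch, assuming that preProcess is run with its default method of centering and scaling, as in the call above. It uses only the first five Balance and BonusTrans values shown in the str() output; the names toy, toy.scaled and manual are illustrative and not part of the original analysis.

```r
# First five Balance and BonusTrans values from the str() output above
toy <- data.frame(
  Balance    = c(28143, 19244, 41354, 14776, 97752),
  BonusTrans = c(1, 2, 4, 1, 26)
)

# Subtract each column's mean and divide by its standard deviation
toy.scaled <- as.data.frame(scale(toy))

# Each column should now have mean (numerically) zero and sd one
round(colMeans(toy.scaled), 10)
sapply(toy.scaled, sd)

# The same z-score computed by hand for one column
manual <- (toy$Balance - mean(toy$Balance)) / sd(toy$Balance)
all.equal(manual, toy.scaled$Balance)

# Euclidean distance between rows 1 and 2, before and after scaling.
# Before scaling, the distance is almost entirely the Balance difference.
dist(toy)[1]
dist(toy.scaled)[1]
```

On the raw values, the distance between the first two customers is essentially the difference in Balance, and BonusTrans contributes almost nothing. After scaling, both variables are expressed in standard deviations and can influence the distance comparably.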
HIERARCHICAL CLUSTERING
Build a hierarchical clustering model.
distances <- dist(airlines.Normalized, method = "euclidean")
airlinesCluster <- hclust(distances, method = "ward.D")
Plot the dendrogram of the cluster.
plot(airlinesCluster)
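To see what the fitted object contains, it can help to run the same two calls on a handful of points. This is only an illustrative sketch on six made-up one-dimensional values (x, d and hc are hypothetical names), not the airline data.

```r
# Six hypothetical one-dimensional points forming two obvious groups
x <- matrix(c(1, 2, 3, 20, 21, 22), ncol = 1)

d  <- dist(x, method = "euclidean")
hc <- hclust(d, method = "ward.D")

# Agglomerative clustering on n points records n - 1 merges
nrow(hc$merge)

# One height per merge; the last (largest) one joins the two groups
hc$height
```

The heights are what the dendrogram draws on its vertical axis: a tall final merge means the two groups being joined are far apart. Note that the hclust documentation distinguishes "ward.D" from "ward.D2", with only the latter implementing Ward's original criterion, so the two options can produce different trees.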
Look at the dendrogram: a good number of clusters for this project may be 2, 3, 5 or 7. Deciding on the number of clusters may require consultation with various stakeholders. For the purpose of this project, let us choose 5 clusters.
clusterGroups <- cutree(airlinesCluster, k = 5)
Look at the number of data points in each of the five clusters.
table(clusterGroups)
## clusterGroups
## 1 2 3 4 5
## 776 519 494 868 1342
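Besides reading the dendrogram by eye, one rough aid for choosing the number of clusters is to look at the heights of the last few merges: a large drop between consecutive heights hints at a natural place to cut. The sketch below illustrates this heuristic on simulated data with three well-separated groups; the data, the seed, and the names toy, hc, last.heights and groups are all assumptions made for illustration, and the heuristic does not replace the stakeholder discussion mentioned above.

```r
set.seed(1)

# Hypothetical data: three well-separated groups of ten points each
toy <- rbind(
  matrix(rnorm(20, mean = 0,  sd = 0.3), ncol = 2),
  matrix(rnorm(20, mean = 5,  sd = 0.3), ncol = 2),
  matrix(rnorm(20, mean = 10, sd = 0.3), ncol = 2)
)

hc <- hclust(dist(toy, method = "euclidean"), method = "ward.D")

# Heights of the last five merges, largest first
last.heights <- rev(tail(hc$height, 5))
last.heights

# Drop in height from one merge to the next
-diff(last.heights)

# Cut the tree into three groups and count the members of each
groups <- cutree(hc, k = 3)
table(groups)
```

On real data such as the airline file, the drops are usually less clear-cut, which is why several values of k can look reasonable.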
Compare the average values of each variable across the five clusters. To make the results easier to interpret, we use the unnormalized data frame airlines for the comparison.
Use split to divide airlines into the groups defined by clusterGroups, then take the column means of each group.
sapply(split(airlines,clusterGroups), colMeans)
## 1 2 3 4
## Balance 5.786690e+04 1.106693e+05 1.981916e+05 52335.913594
## QualMiles 6.443299e-01 1.065983e+03 3.034615e+01 4.847926
## BonusMiles 1.036012e+04 2.288176e+04 5.579586e+04 20788.766129
## BonusTrans 1.082345e+01 1.822929e+01 1.966397e+01 17.087558
## FlightMiles 8.318428e+01 2.613418e+03 3.276761e+02 111.573733
## FlightTrans 3.028351e-01 7.402697e+00 1.068826e+00 0.344470
## DaysSinceEnroll 6.235365e+03 4.402414e+03 5.615709e+03 2840.822581
## 5
## Balance 3.625591e+04
## QualMiles 2.511177e+00
## BonusMiles 2.264788e+03
## BonusTrans 2.973174e+00
## FlightMiles 1.193219e+02
## FlightTrans 4.388972e-01
## DaysSinceEnroll 3.060081e+03
For readability, turn off scientific notation so that the numbers print in fixed decimal format.
options(scipen = 999)
sapply(split(airlines, clusterGroups), colMeans)
## 1 2 3 4
## Balance 57866.9046392 110669.265896 198191.574899 52335.913594
## QualMiles 0.6443299 1065.982659 30.346154 4.847926
## BonusMiles 10360.1237113 22881.763006 55795.860324 20788.766129
## BonusTrans 10.8234536 18.229287 19.663968 17.087558
## FlightMiles 83.1842784 2613.418112 327.676113 111.573733
## FlightTrans 0.3028351 7.402697 1.068826 0.344470
## DaysSinceEnroll 6235.3646907 4402.414258 5615.708502 2840.822581
## 5
## Balance 36255.9098361
## QualMiles 2.5111773
## BonusMiles 2264.7876304
## BonusTrans 2.9731744
## FlightMiles 119.3219076
## FlightTrans 0.4388972
## DaysSinceEnroll 3060.0812221
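The split, colMeans and sapply combination used above can be easier to follow on a tiny example. The sketch below uses made-up values and made-up cluster labels (toy and toy.groups are hypothetical stand-ins for airlines and clusterGroups), and also shows aggregate() as one possible alternative that returns one row per cluster.

```r
# Hypothetical data: six customers with two variables
toy <- data.frame(
  Balance     = c(100, 200, 300, 1000, 2000, 3000),
  FlightTrans = c(0, 1, 2, 10, 11, 12)
)

# Hypothetical cluster labels, standing in for the output of cutree()
toy.groups <- c(1, 1, 1, 2, 2, 2)

# split() gives one data frame per cluster, colMeans() averages each column,
# and sapply() binds the results into a matrix with one column per cluster
by.split <- sapply(split(toy, toy.groups), colMeans)
by.split

# aggregate() computes the same means with one row per cluster
by.aggregate <- aggregate(toy, by = list(cluster = toy.groups), FUN = mean)
by.aggregate
```

Both forms hold the same numbers; the matrix from sapply has variables in rows, which suits the row-wise lookups that follow, while the aggregate result is a data frame that is convenient for further filtering or plotting.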
Find the clusters with the maximum and minimum values for all variables.
clusterTable <- sapply(split(airlines, clusterGroups), colMeans)
# Return the name of the cluster with the largest mean for variable x
findClusterGroupMax <- function(x) {
  colnames(clusterTable)[which.max(clusterTable[x, ])]
}
sapply(names(airlines), findClusterGroupMax)
In the output below, the cluster with the highest mean for each variable is listed beneath the variable's name.
## Balance QualMiles BonusMiles BonusTrans
## "3" "2" "3" "3"
## FlightMiles FlightTrans DaysSinceEnroll
## "2" "2" "1"
# Return the name of the cluster with the smallest mean for variable x
findClusterGroupMin <- function(x) {
  colnames(clusterTable)[which.min(clusterTable[x, ])]
}
sapply(names(airlines), findClusterGroupMin)
In the output below, the cluster with the lowest mean for each variable is listed beneath the variable's name.
## Balance QualMiles BonusMiles BonusTrans
## "5" "1" "5" "5"
## FlightMiles FlightTrans DaysSinceEnroll
## "1" "1" "4"
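The two lookups above can also be written without helper functions by applying which.max and which.min to each row of the table of means. This is a sketch on a small made-up table (toy.table, max.cluster and min.cluster are hypothetical names), laid out like clusterTable with variables in rows and clusters in columns.

```r
# Hypothetical table of cluster means: variables in rows, clusters in columns
toy.table <- matrix(
  c(10, 30, 20,
     5,  1,  9),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("Balance", "FlightTrans"), c("1", "2", "3"))
)

# For each row, find the column holding the largest and the smallest mean
max.cluster <- colnames(toy.table)[apply(toy.table, 1, which.max)]
min.cluster <- colnames(toy.table)[apply(toy.table, 1, which.min)]

# Collect both results side by side
data.frame(variable = rownames(toy.table),
           max = max.cluster,
           min = min.cluster)
```

Placing the maximum and minimum in one data frame makes it easier to describe each segment, for example by spotting a cluster that is highest on several variables at once.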