Airfare Prediction Model
Essay by Amrita Singh • July 20, 2016 • Coursework • 1,407 Words (6 Pages) • 2,240 Views
1. Marketing to Frequent Fliers:
The file EastWestAirlinesCluster.xls (available on the textbook website http://dataminingbook.com/) contains information on 4000 passengers who belong to an airline’s frequent flier program. For each passenger the data include information on their mileage history and on different ways they accrued or spent miles in the last year. The goal is to try to identify clusters of passengers that have similar characteristics for the purpose of targeting different segments for different types of mileage offers.
a) Apply hierarchical clustering with Euclidean distance and Ward’s method. Make sure to
standardize the data first. How many clusters appear?
b) What would happen if the data were not standardized?
c) Compare the cluster centroids to characterize the different clusters and try to give each cluster a
label.
d) To check the stability of the clusters, remove a random 5% of the data (by taking a random
sample of 95% of the records), and repeat the analysis. Does the same picture emerge?
e) Use k-means clustering with the number of clusters that you found above in Part (a). Does the
same picture emerge? If not, how does it contrast with or validate the finding in Part (c) above?
f) Which cluster(s) would you target for offers, and what type of offers would you target to
customers in that cluster? Include proper reasoning in support of your choice of cluster(s) and the
corresponding offer(s).
2. Wine Data:
Step 1: Download the Wine data from the UCI machine learning repository
(http://archive.ics.uci.edu/ml/datasets/Wine)
Step 2: Do a Principal Components Analysis (PCA) on the data. Please include (copy-paste) the
relevant software outputs in your submission while answering the following questions.
a. Enumerate the insights you gathered during your PCA exercise. (Please do not clutter your
report with too MANY insignificant insights as it will dilute the value of your other significant
findings)
b. What are the social and business values of those insights, and how can the value of those
insights be harnessed?
Step 3: Do a cluster analysis using (i) all chemical measurements and (ii) the two most significant PC
scores. Please include (copy-paste) the relevant software outputs in your submission while answering
the following questions.
c. Any more insights you come across during the clustering exercise?
d. Are there clearly separable clusters of wines? How many clusters did you go with? How are the
clusters obtained in part (i) different from or similar to the clusters obtained in part (ii),
qualitatively?
e. Could you suggest a subset of the chemical measurements that can separate wines more
distinctly? How did you go about choosing that subset? How do the rest of the measurements
that were not included while clustering, vary across those clusters?
Question 1.
a) Apply hierarchical clustering with Euclidean distance and Ward’s method. Make sure to standardize the data first. How many clusters appear?
Solution.
Number of clusters: 3
Cluster I | 13 | 16 | 2 | 17 | 10 | 14 | 15 | 18 | 5 | 20 | 19 |
Cluster II | 3 | 12 | 21 | 1 | 8 | 9 | 4 | 16 | 22 |
Cluster III | 1 | 23 | 6 | 11 | 24 | 25 | 30 | 27 | 29 | 28 |
[pic 1]
The dendrogram is the same whether the number of clusters is fixed at 3 or left unrestricted.
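The procedure in Part (a) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the software used for the solution above: the data are synthetic (three made-up groups with a large-valued "balance" column and a small-valued "flights" column), and `standardize` and `ward_clusters` are hypothetical helper names. The clustering is a naive agglomerative loop that, at each step, merges the two clusters with the smallest Ward cost, i.e. the smallest increase in within-cluster sum of squares.

```python
import random
from statistics import mean, pstdev

def standardize(rows):
    """Z-score each column so every variable has mean 0 and unit variance."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # guard against constant columns
    return [[(x - m) / s for x, m, s in zip(r, mus, sds)] for r in rows]

def ward_clusters(rows, k):
    """Naive agglomerative clustering with Ward's merge cost, stopped at k clusters."""
    clusters = [([i], list(r)) for i, r in enumerate(rows)]  # (member ids, centroid)
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                (ma, ca), (mb, cb) = clusters[a], clusters[b]
                na, nb = len(ma), len(mb)
                d2 = sum((x - y) ** 2 for x, y in zip(ca, cb))
                cost = na * nb / (na + nb) * d2  # increase in within-cluster SS
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        (ma, ca), (mb, cb) = clusters[a], clusters[b]
        na, nb = len(ma), len(mb)
        centroid = [(na * x + nb * y) / (na + nb) for x, y in zip(ca, cb)]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append((ma + mb, centroid))
    labels = [0] * len(rows)
    for label, (members, _) in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels

# Synthetic stand-in for the airline data: (balance, flights), 20 records per group.
random.seed(0)
centers = [(20_000, 1), (20_000, 10), (120_000, 10)]
data = [[random.gauss(b, 5_000), random.gauss(f, 1)]
        for b, f in centers for _ in range(20)]

labels = ward_clusters(standardize(data), k=3)
for g in range(3):
    print("group", g, "labels:", sorted(set(labels[g * 20:(g + 1) * 20])))
```

The pairwise search is cubic in the number of records, so for the full file of roughly 4000 passengers a library implementation of Ward linkage would be the practical choice.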
b) What would happen if the data were not standardized?
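Solution: Balance, Bonus miles and Days since enrolled take much larger numeric values than the other variables, so without standardization they dominate the Euclidean distance and the clusters are skewed towards those variables.
For example, the predicted clusters would be as follows: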
[pic 2]
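The effect of skipping standardization can be made concrete with a small calculation. The sketch below uses four made-up passengers (the values and the "flights" column are assumptions for illustration, not records from the file) and reports what share of the squared Euclidean distance between the first two passengers comes from each variable, before and after z-scoring.

```python
from statistics import mean, pstdev

# Hypothetical passengers: (balance, bonus miles, flights). Illustrative values only.
names = ("balance", "bonus_miles", "flights")
passengers = [
    (28_000, 200, 0),
    (30_000, 250, 12),
    (95_000, 40_000, 1),
    (97_000, 42_000, 11),
]

def distance_shares(a, b):
    """Fraction of the squared Euclidean distance contributed by each variable."""
    squares = [(x - y) ** 2 for x, y in zip(a, b)]
    total = sum(squares)
    return [s / total for s in squares]

def standardize(rows):
    """Z-score each column (population standard deviation)."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) for c in cols]
    return [[(x - m) / s for x, m, s in zip(r, mus, sds)] for r in rows]

raw_shares = distance_shares(passengers[0], passengers[1])
z = standardize(passengers)
std_shares = distance_shares(z[0], z[1])

for name, raw, std in zip(names, raw_shares, std_shares):
    print(f"{name:12s} raw share = {raw:.4f}   standardized share = {std:.4f}")
```

The first two passengers differ mainly in how often they fly, yet on the raw scale almost all of the distance between them comes from the balance column. After standardization the flights column carries most of it.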
c) Compare the cluster centroids to characterize the different clusters and try to give each cluster a label.
Solution.
Cluster 1 | Less frequent fliers |
Cluster 2 | Frequent fliers |
Cluster 3 | Intermittent fliers: between Clusters 1 and 2, hence the customer group to target for promotions |
[pic 3]
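A cluster centroid is simply the mean of each variable over the records assigned to that cluster. A minimal sketch of that calculation follows; the `centroids` helper and the toy rows are assumptions for illustration and do not reproduce the figures behind the labels above.

```python
from collections import defaultdict
from statistics import mean

def centroids(rows, labels):
    """Return {cluster label: list of per-variable means} for the given assignment."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[label].append(row)
    return {label: [mean(col) for col in zip(*members)]
            for label, members in sorted(groups.items())}

# Toy records with two variables (say, flights per year and bonus transactions).
rows = [[1.0, 10.0], [3.0, 14.0], [10.0, 2.0], [12.0, 4.0]]
labels = [1, 1, 2, 2]

result = centroids(rows, labels)
for label, center in result.items():
    print("Cluster", label, "centroid:", center)
```

Comparing the centroids variable by variable (for instance, which cluster has the highest mean flight count) is what supports labels such as "frequent" or "less frequent" fliers.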
d) To check the stability of the clusters, remove a random 5% of the data (by taking a random sample of 95% of the records), and repeat the analysis. Does the same picture emerge?
Solution.
Part A
The structure of the dendrogram remained the same, but the composition of the clusters changed.
[pic 4]
Part B
Total | 3999 |
Less 5% | 3790 |
Removed | 209 |
Count where Cluster 1 = Cluster 2 | Count where Sub cluster 1 = Sub cluster 2 |
1368 | 274 |
3999 | 0 |
34% | IDs which remained constant after removing 5% of data |
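The stability check amounts to drawing a 95% sample, re-clustering it, and counting how many retained IDs keep the same cluster. The sketch below shows only the sampling and counting steps; `holdout_sample` and `agreement` are hypothetical helper names, the small label dictionaries are made up, and the clustering itself is not repeated here. One caveat is that cluster numbers are arbitrary, so the two runs' labels may need to be matched (for example by comparing centroids) before they are compared.

```python
import random

def holdout_sample(ids, keep_fraction, seed=0):
    """Return a sorted random sample containing keep_fraction of the IDs."""
    rng = random.Random(seed)
    return sorted(rng.sample(ids, round(len(ids) * keep_fraction)))

def agreement(full_labels, sub_labels):
    """Share of IDs present in both runs whose cluster label is unchanged."""
    common = full_labels.keys() & sub_labels.keys()
    same = sum(full_labels[i] == sub_labels[i] for i in common)
    return same / len(common)

all_ids = list(range(1, 4000))           # 3999 passenger IDs
kept = holdout_sample(all_ids, 0.95)     # 3799 IDs here; the table above reports 3790 retained
print(len(kept), "kept,", len(all_ids) - len(kept), "removed")

# Made-up assignments from a full run and a re-run on the sample.
full_run = {1: 0, 2: 1, 3: 1, 4: 2}
sub_run = {1: 0, 2: 0, 3: 1}             # ID 4 was dropped from the sample
print("agreement on toy labels:", agreement(full_run, sub_run))

# The 34% in the table is the reported count of matching IDs over the total.
share = 1368 / 3999
print(f"reported share: {share:.0%}")
```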
...
...