DATA SCIENTIST | STATISTICIAN
Airline Passenger Satisfaction
The main objective of this study is to determine the main factors driving an airline company’s passenger satisfaction levels. Such information can be useful in getting a clearer picture of the company’s strengths and shortcomings in terms of the services provided. By understanding these drivers, the company can make the right choices regarding the services it could improve and invest in. In addition, the data is used to train a machine-learning model to predict new customer satisfaction levels. The maintenance of such a model/pipeline would allow the airline to quickly obtain passenger satisfaction level predictions when new data becomes available.
The data, which consists of 23 features, comes from a large survey conducted by an undisclosed US airline. The response, `Satisfaction`, is a binary variable with two possible outcomes: ‘satisfied’ or ‘neutral or dissatisfied’. As such, predicting passengers' satisfaction levels is a supervised learning task and, more specifically, a binary classification. The data has been downloaded from the Kaggle website and has already been preprocessed for classification; it can be found through the following link.
Exploratory Data Analysis
Exploratory data analysis (EDA) is performed to deepen the understanding of the dataset. First, a pair plot is presented: a grid of scatterplots depicting the relationships between pairs of numerical variables, with the plots on the main diagonal characterising each feature's distribution. Points are coloured by the two categories of the response. Plotting the data this way provides additional insight into the features' relationship to the response.
The first subplot (on the main diagonal), representing the `Age` variable's distribution, shows that many satisfied customers are middle-aged. The distribution of this plot resembles a bell-shaped curve. The second subplot on the main diagonal, `Flight Distance`, clearly depicts more passengers going on shorter flights. This distribution is right-skewed. There seem to be slightly more satisfied customers on longer flights. The other two subplots on the main diagonal show that delays (whether on departure or arrival) tend to be short; the distribution is extremely right-skewed. The off-diagonal plots clearly show that departure and arrival delays are highly correlated. The two groups of customers, 'satisfied' and 'neutral or dissatisfied', are more or less equally spread among the age, distance and delay variables.
Correlation Heatmap
The next analysis involves correlation heatmaps. These plots help visualise the strength and direction of association between pairs of variables: the darker the colour, the stronger the relationship. A correlation coefficient close to 1 means perfect positive correlation, -1 means perfect negative correlation, and 0 means no correlation. To plot the response variable `Satisfaction` alongside the other numerical variables, its categories have been converted to `2` and `1`, corresponding to the ‘satisfied’ and ‘neutral or dissatisfied’ classes, respectively. Kendall's tau was used to calculate the associations, since most of these variables are ordinal.
The heatmap distinguishes three groups of positively correlated features. For example, `Food and Drink`, `Seat Comfort` and `Inflight Entertainment` are all positively correlated. Two features strongly correlating with the response are `Class` and `Online Boarding`. Other variables positively correlated with `Satisfaction` include `Inflight Wifi Service`, `Seat Comfort`, `On-board Service`, and `Leg Room Service`. These are all promising variables for predicting the response; these relationships will be examined further. When dealing with correlation, it is always important to remember that correlation does not imply causation. That is, even if a feature is correlated with the response, it does not mean that this feature causes the response outcome. In the case of positive correlation, it simply means that when a feature increases, the response also tends to increase.
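The Kendall correlation matrix underlying such a heatmap can be computed directly with pandas. The sketch below uses synthetic ordinal survey scores (1-5) and the 2/1 response recoding described above; the column names follow the text but are illustrative.

```python
# Sketch of the Kendall correlation matrix behind the heatmap, using
# synthetic 1-5 ordinal survey scores (column names are assumptions).
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 500
seat = rng.integers(1, 6, n)
df = pd.DataFrame({
    "Seat Comfort": seat,
    "Food and Drink": np.clip(seat + rng.integers(-1, 2, n), 1, 5),
    "Inflight Entertainment": np.clip(seat + rng.integers(-1, 2, n), 1, 5),
    "Satisfaction": np.where(seat >= 4, 2, 1),  # 2 = satisfied, 1 = neutral/dissatisfied
})

# Kendall's tau is rank-based, so it suits ordinal survey data better
# than Pearson's correlation.
corr = df.corr(method="kendall")
print(corr.round(2))
```

Passing `corr` to a heatmap function (e.g. seaborn's `heatmap`) then produces the plot discussed above.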
Countplots
Countplots are presented to gain further insights into the relationship between the ordinal variables and the response `Satisfaction`. Only the six most strongly correlated variables are included in the analysis. Count plots are simple visuals depicting the count of observations in each category of the variable of interest. Here, the plots have been separated into 'satisfied' and 'neutral or dissatisfied' customers.
In the `Class` subplot, the categories are `1`, `2`, and `3`, corresponding to the Economy, Economy Plus and Business classes. It can be seen that there are many dissatisfied passengers in Economy class (`1`). Conversely, there are many satisfied people in Business. The other five plots all share a similar pattern: the dissatisfied passengers are distributed roughly bell-shaped across the five rating categories, while most satisfied customers appear in the 4th and 5th categories. These most likely correspond to business-class services.
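The counts behind one such plot can be tabulated with a pandas cross-tabulation and drawn as a grouped bar chart. The data below is synthetic and the Business-class effect is built in by assumption, purely to mirror the pattern described above.

```python
# Sketch of one count plot: observations per `Class` category, split by
# satisfaction group (synthetic data; the Business-class effect is assumed).
import matplotlib
matplotlib.use("Agg")  # render off-screen
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 400
klass = rng.integers(1, 4, n)            # 1=Economy, 2=Economy Plus, 3=Business
p = np.where(klass == 3, 0.8, 0.3)       # Business flyers more often satisfied
sat = np.where(rng.random(n) < p, "satisfied", "neutral or dissatisfied")
df = pd.DataFrame({"Class": klass, "Satisfaction": sat})

# Same counts a seaborn countplot with hue="Satisfaction" would show.
counts = pd.crosstab(df["Class"], df["Satisfaction"])
ax = counts.plot(kind="bar")             # grouped bar chart of counts
ax.figure.savefig("countplot_class.png")
print(counts)
```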
The EDA has provided substantial insight into the features most closely associated with the response variable. It remains to be seen whether the machine learning algorithms will confirm these preliminary findings.
Data Preparation
Although the dataset's creators had already done substantial data preprocessing for classification, this project implemented further preprocessing to achieve the best possible validation accuracy scores. Preprocessing included converting features to `categorical` or `ordered categorical`, one-hot encoding, and imputation. Numerous preprocessing combinations were tested. The final preprocessed data set included ‘Non-Applicable’ entries as a separate category for the categorical variables.
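The preprocessing steps named above can be sketched as follows. The column names and the choice of median imputation are illustrative assumptions, not the project's exact pipeline; the 'Non-Applicable' handling mirrors the text.

```python
# Hedged sketch of the preprocessing: ordered categoricals, a separate
# 'Non-Applicable' category, one-hot encoding and simple imputation.
# Column names and imputation strategy are illustrative assumptions.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Class": ["Eco", "Business", "Eco Plus", "Business"],
    "Online Boarding": [0, 4, 3, 5],          # 0 = service not applicable
    "Arrival Delay": [5.0, np.nan, 12.0, 0.0],
})

# Ordered categorical for the travel class.
df["Class"] = pd.Categorical(
    df["Class"], categories=["Eco", "Eco Plus", "Business"], ordered=True)

# Keep 'Non-Applicable' (scored 0) as its own category, as in the
# best-performing preprocessed dataset.
df["Online Boarding"] = (df["Online Boarding"]
                         .replace(0, "Non-Applicable")
                         .astype(str)
                         .astype("category"))

# Impute the missing numeric delay (median shown here as one option).
df["Arrival Delay"] = df["Arrival Delay"].fillna(df["Arrival Delay"].median())

# One-hot encode the categoricals for the ML models.
encoded = pd.get_dummies(df, columns=["Class", "Online Boarding"])
print(encoded.head())
```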
Machine Learning Models
First, six machine-learning models are implemented and compared. The best-performing model is identified and set as a base model for comparison with more complex neural networks. Since the main task is classification, the metrics of interest are accuracy, precision and recall. Precision can be viewed as the accuracy of the positive predictions; it can be calculated as follows:

Precision = TP / (TP + FP)

where TP is the number of true positives, and FP is the number of false positives. This metric is usually reported alongside recall, also called sensitivity or the true positive rate (TPR). Recall is the ratio of positive instances that are correctly classified. Recall can be calculated as follows:

Recall = TP / (TP + FN)

where FN is the number of false negatives (Aurélien Géron, 2023).
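These two formulas can be verified on a small worked example, computing the counts directly from a set of illustrative labels and predictions:

```python
# Worked example of the precision and recall formulas, computed
# directly from counts on illustrative predictions (1 = positive).
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 1, 0, 1, 0, 1]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives

precision = tp / (tp + fp)   # accuracy of the positive predictions
recall = tp / (tp + fn)      # share of actual positives recovered

print(f"precision={precision:.2f}, recall={recall:.2f}")  # 0.80, 0.80
```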
The results for the six models are presented below:
The results clearly show that the Random Forest algorithm outperforms the others. It achieves an accuracy score of 0.965179, a precision of 0.958047 and a recall of 0.981519. This model is used to create the Feature Importance plot in the next section.
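Fitting and scoring such a Random Forest can be sketched as below. The data is synthetic and the hyper-parameters shown are near-defaults, not the project's tuned values.

```python
# Minimal sketch of fitting and validating a Random Forest classifier;
# synthetic ordinal features stand in for the preprocessed survey data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
X = rng.integers(1, 6, size=(600, 6)).astype(float)  # ordinal survey scores
y = (X[:, 0] + X[:, 1] > 6).astype(int)              # toy satisfaction rule

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_tr, y_tr)
acc = accuracy_score(y_val, clf.predict(X_val))
print(f"validation accuracy: {acc:.3f}")
```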
Feature Importance Plot
Machine learning models are excellent at making predictions using new, unseen data. However, understanding what features have the strongest impact in determining customer satisfaction levels is vital information for the airline to make sound data-driven decisions. This is where Feature Importance plots come in:
The three main features driving customer satisfaction are `Online Boarding`, `Inflight Wifi Service`, and `Class`. Offering online boarding and inflight wifi service strongly impacts customer satisfaction. This plot confirms the findings in the EDA that the passenger's class strongly impacts their satisfaction levels: the `Class` feature is followed by `Dummy_Business travel` and `Dummy_Personal travel`.
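A Feature Importance plot of this kind is obtained from the fitted forest's `feature_importances_` attribute. The sketch below uses synthetic data with an assumed toy rule, purely to show the mechanics; the feature names follow the text.

```python
# Sketch of extracting feature importances from a fitted Random Forest
# (synthetic data; the dominance of `Online Boarding` is assumed here).
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(4)
n = 500
X = pd.DataFrame({
    "Online Boarding": rng.integers(1, 6, n),
    "Inflight Wifi Service": rng.integers(1, 6, n),
    "Class": rng.integers(1, 4, n),
})
y = (X["Online Boarding"] >= 4).astype(int)  # toy rule: boarding dominates

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importances = (pd.Series(clf.feature_importances_, index=X.columns)
               .sort_values(ascending=False))
print(importances)  # plotting this Series as a bar chart gives the figure
```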
Deep Neural Network
To obtain even better results than those of the standard machine-learning methods, a deep neural network has been implemented using PyTorch. Extensive hyper-parameter tuning was performed, both manually and via a grid search. The best model was set up as follows:
One hundred epochs were used to train this model. The best accuracy score on the validation set was 0.96704. This model was saved to perform the final predictions. The training and validation accuracy and loss were plotted to ensure high predictive accuracy on new data. These plots can be seen below. Both the loss and accuracy scores seem to converge during training.
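The kind of network and training step described above can be sketched in PyTorch as follows. The layer sizes, activation, learning rate and input width are illustrative assumptions, not the tuned configuration found by the grid search.

```python
# Hedged sketch of a PyTorch binary classifier of the kind described;
# all sizes and optimiser settings here are illustrative assumptions.
import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Sequential(
    nn.Linear(25, 64),   # 25 one-hot/ordinal input features (assumed)
    nn.ReLU(),
    nn.Linear(64, 32),
    nn.ReLU(),
    nn.Linear(32, 1),    # single logit for the binary response
)

criterion = nn.BCEWithLogitsLoss()  # numerically stable sigmoid + BCE
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# One toy training step on random data; the real model trained for
# one hundred epochs over the preprocessed survey features.
X = torch.randn(32, 25)
y = torch.randint(0, 2, (32, 1)).float()
optimizer.zero_grad()
loss = criterion(model(X), y)
loss.backward()
optimizer.step()
print(f"loss after one step: {loss.item():.4f}")
```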
Final Predictions
The test data set was preprocessed using the same methods as the training data, and the final deep neural network was used to make predictions on it. The overall accuracy achieved was 0.9667, very close to the validation score, confirming that the model is stable and generalises well to new data. It also achieved a precision of 0.97675 and a recall of 0.94668. A confusion matrix is presented below; predictions on the main diagonal are classified correctly, while those off the main diagonal are erroneous.
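A confusion matrix of this kind is built directly from the true labels and the predictions, as in this small illustrative example (the labels are not the model's actual outputs):

```python
# Sketch of building a confusion matrix from predictions; the labels
# below are illustrative, not the final model's actual outputs.
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1]

cm = confusion_matrix(y_true, y_pred)  # rows: true class, cols: predicted
print(cm)  # [[3 1]
           #  [1 5]] -> 8 of 10 on the main diagonal are correct
```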
Another good way to present the overall performance of a classifier is by examining the area under the ROC (Receiver Operating Characteristics) curve, known as the AUC (Area Under the Curve). Ideally, the ROC curve should be very close to the upper left corner of the plot (Gareth James, 2015). The ROC curve below shows that the final classifier performs excellently. The AUC score is very high, 0.96452.
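The ROC curve and AUC are computed from the classifier's predicted probabilities rather than its hard labels, for example (illustrative scores):

```python
# Sketch of computing the ROC curve and AUC from predicted
# probabilities; the scores below are illustrative.
from sklearn.metrics import roc_auc_score, roc_curve

y_true = [0, 0, 1, 1, 0, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.7, 0.3]

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # points of the ROC curve
auc = roc_auc_score(y_true, y_score)               # area under that curve
print(f"AUC = {auc:.4f}")
```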
Results and Conclusion
The EDA section of this project showed that premium customers are generally satisfied with the airline company, while economy customers are divided. To ensure customer satisfaction, the company should improve economy-class experiences and services. The top three features driving customer satisfaction are `Online Boarding`, `Inflight Wi-Fi Service`, and `Class` (as seen in the feature importance plot).
The original data set was preprocessed in several ways, and the variants were compared to find the settings leading to the best performance. The best-performing preprocessed data encodes ‘Non-Applicable’ services as a separate category. Numerous machine learning models were then implemented, including LDA, KNN, Logistic Regression and Random Forest; however, a neural network implemented in PyTorch and extensively hyper-parameter tuned achieved the best results.
The final model was trained with one hundred epochs, and its loss and accuracy curves depicted good model performance and stability. The validation accuracy of this model was 0.96704. The final model's predictions on the test data set achieved an accuracy score of 0.9667, a precision of 0.97675, and a recall of 0.94668, which is very close to those on the validation set. The consistent results confirm that the model is stable and performs well on new data. The AUC score under the ROC curve was 0.96452, a very high result.
The extensive preprocessing and hyperparameter tuning did not substantially improve the results compared to the first Random Forest base model, which had an accuracy of 0.964429, a precision of 0.957307, and a recall of 0.980958. Further efforts could continue indefinitely; for example, removing outliers or other forms of feature engineering could be tested. Given the base model's already high accuracy and precision, one has to wonder whether the time-consuming effort would have been financially worthwhile for the airline.
References
- Learn Statistics Easily, *Kendall Tau-b vs Spearman: Which Correlation Coefficient Wins?*, Learn Statistics Easily, 4 Jan 2024. [Link]
- Minitab Support, *What are Concordant and Discordant Pairs?*, Minitab Support, 2024. [Link]
- Shaun Turney, *Chi-Square Test of Independence | Formula, Guide and Examples*, Scribbr, 22 June 2023. [Link]
- Aurélien Géron, *Hands-On Machine Learning with Scikit-Learn, Keras and TensorFlow* (3rd Edition), O'Reilly Media Inc, 20 January 2023.
- Haldun Akoglu, *User's Guide to Correlation Coefficients*, National Library of Medicine, 7 August 2018. [Link]
- Scikit-learn developers, *IterativeImputer*, Scikit-Learn, 2024. [Link]
- Kyaw Saw Htoon, *A Guide to KNN Imputation*, Medium, 3 July 2020. [Link]
- Hannah Igboke, *Iterative Imputer for Missing Values in Machine Learning*, Medium, 10 June 2024. [Link]
- Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani, *An Introduction to Statistical Learning*, Springer, New York, 2015.
- PyTorch Contributors, *BCEWithLogitsLoss*, PyTorch, 2023. [Link]
- PyTorch Contributors, *SGD*, PyTorch, 2023. [Link]