Unsupervised Learning: Bank Churn Clustering Analysis

Jun 10, 2023

A quick look into bank customers and how they relate.

Introduction:

In this project, I used a Bank Churn Dataset sourced from Kaggle. The dataset comprises 10,000 rows and 14 columns describing bank customers, both current and former account holders.

Project Objective:

The primary objective of this project was to cluster the dataset with three algorithms: K-means, DBSCAN, and GMM (Gaussian mixture model). I aimed to observe how each model grouped the customers and to extract insights from the resulting groups.

Data Visualization & Method:

To begin, I performed exploratory data analysis and visualized the dataset. This involved checking for missing or null values and addressing outliers through winsorization. Additionally, I applied label encoding to the “Gender” and “Geography” features to prepare the data for clustering.
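The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the data is synthetic, the 5%/95% clipping limits and the helper name winsorize_column are assumptions, and only the "Gender" and "Geography" column names come from the text.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Synthetic stand-in for the Kaggle data (values are made up for illustration).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "CreditScore": rng.normal(650, 95, n).round(),
    "Balance": rng.exponential(60000, n),
    "Gender": rng.choice(["Female", "Male"], n),
    "Geography": rng.choice(["France", "Germany", "Spain"], n),
})

# Count missing values per column.
missing_per_column = df.isnull().sum()

def winsorize_column(s, lower=0.05, upper=0.95):
    """Clip values outside the given quantiles (hypothetical helper, assumed limits)."""
    return s.clip(lower=s.quantile(lower), upper=s.quantile(upper))

raw_balance_max = df["Balance"].max()
for col in ["CreditScore", "Balance"]:
    df[col] = winsorize_column(df[col])

# Label-encode the two categorical features named in the text.
for col in ["Gender", "Geography"]:
    df[col] = LabelEncoder().fit_transform(df[col])

print(missing_per_column)
print(df.head())
```

Label encoding maps each category to an integer, so "Geography" ends up with an arbitrary numeric order; that is worth keeping in mind when distance-based models consume the column.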

Next, I ran K-means, DBSCAN, and GMM on the prepared data and evaluated each model. The evaluation covered the optimal number of clusters, the Adjusted Rand Index (ARI), and the Silhouette score for each model.
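As a rough sketch of this evaluation loop, the snippet below fits the three models with scikit-learn and reports the cluster count, ARI, and Silhouette score for each. The data is synthetic, and the hyperparameters (three clusters, eps=0.3, min_samples=5) are assumptions chosen for the toy data. ARI needs reference labels to compare against; the post does not say which labels were used, so the synthetic y_ref stands in for them here.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, silhouette_score
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

# Synthetic data with known reference labels (stand-in for the real dataset).
X, y_ref = make_blobs(
    n_samples=300,
    centers=[[-5, -5], [0, 5], [5, -5]],
    cluster_std=0.8,
    random_state=42,
)
X = StandardScaler().fit_transform(X)

# Fit each model and collect its cluster assignments.
labels = {
    "K-means": KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X),
    "DBSCAN": DBSCAN(eps=0.3, min_samples=5).fit_predict(X),
    "GMM": GaussianMixture(n_components=3, random_state=42).fit_predict(X),
}

scores = {}
for name, pred in labels.items():
    distinct = set(pred)
    # DBSCAN marks noise points with -1; do not count that as a cluster.
    n_clusters = len(distinct) - (1 if -1 in distinct else 0)
    # Silhouette is only defined when there are at least two distinct labels.
    sil = silhouette_score(X, pred) if len(distinct) > 1 else float("nan")
    scores[name] = {
        "n_clusters": n_clusters,
        "ari": adjusted_rand_score(y_ref, pred),
        "silhouette": sil,
    }
    print(name, scores[name])
```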

Results & Findings:

The K-means model demonstrated promising performance, with three clusters identified as the optimal number. The ARI and Silhouette scores were calculated to evaluate the quality of the clusters.
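The post does not say how the optimal number of clusters was chosen. One common approach, shown here only as an assumed example on synthetic data, is to fit K-means for a range of k values and keep the k with the highest Silhouette score.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups (illustrative only).
X, _ = make_blobs(
    n_samples=300,
    centers=[[-5, -5], [0, 5], [5, -5]],
    cluster_std=0.8,
    random_state=0,
)

# Score each candidate k by the mean Silhouette coefficient.
silhouette_by_k = {}
for k in range(2, 7):
    pred = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    silhouette_by_k[k] = silhouette_score(X, pred)

best_k = max(silhouette_by_k, key=silhouette_by_k.get)
print(silhouette_by_k)
print("best k:", best_k)
```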

Similarly, the DBSCAN model and GMM model were assessed for their respective clustering performances. The scores obtained from these models were compared to determine their effectiveness in clustering the dataset.

After a thorough evaluation, the GMM model emerged as the most effective clustering algorithm, achieving the highest score among the evaluated models. K-means closely followed GMM but was slightly less effective. On the other hand, DBSCAN underperformed in comparison to the other algorithms.

In terms of the relationships between the cluster assignments and individual features, the GMM clusters showed a moderate correlation with the “NumOfProducts” feature (0.38) and a strong correlation with the “Exited” feature (0.79). In contrast, the K-means clusters showed a moderate correlation with the “Balance” feature (0.36) and a strong negative correlation with the “Exited” feature (-0.81).
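A correlation of this kind can be computed by attaching the cluster labels to the feature table and correlating that column with every feature. The sketch below assumes Pearson correlation and uses synthetic data with two clusters; the column names mirror the ones quoted above, but the values are made up.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic features loosely shaped like the columns named in the text.
rng = np.random.default_rng(1)
n = 400
exited = rng.integers(0, 2, n)
df = pd.DataFrame({
    "Balance": rng.normal(70000, 20000, n) + 40000 * exited,
    "NumOfProducts": rng.integers(1, 5, n),
    "Exited": exited,
})

# Cluster on standardized features, then store the labels as a column.
X = StandardScaler().fit_transform(df)
df["cluster"] = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Pearson correlation of every feature with the cluster label,
# ordered by absolute strength.
corr_with_cluster = (
    df.corr()["cluster"].drop("cluster").sort_values(key=abs, ascending=False)
)
print(corr_with_cluster)
```

Cluster IDs are arbitrary integers, so the sign of such a correlation depends on how a model happens to number its clusters. The magnitude is the more informative part, which may explain why the two models report similar strengths with opposite signs for “Exited”.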

For visualization purposes, Principal Component Analysis (PCA) was chosen as the most effective technique. Utilizing PCA, the clusters formed by K-means and GMM were visualized and analyzed.
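The PCA step can be sketched as projecting the standardized features onto two principal components and coloring the points by cluster label. The example below uses synthetic six-feature data and assumes three clusters for both models; plotting is left as a comment so the snippet has no dependency beyond scikit-learn.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

# Synthetic six-feature data (stand-in for the prepared bank dataset).
X, _ = make_blobs(n_samples=300, n_features=6, centers=3, random_state=7)
X = StandardScaler().fit_transform(X)

kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
gmm_labels = GaussianMixture(n_components=3, random_state=0).fit_predict(X)

# Reduce to two components for a 2-D scatter plot.
pca = PCA(n_components=2, random_state=0)
coords = pca.fit_transform(X)
explained = pca.explained_variance_ratio_.sum()
print(f"Variance explained by 2 components: {explained:.2f}")

# With matplotlib installed, one scatter plot per model would look like:
#   plt.scatter(coords[:, 0], coords[:, 1], c=kmeans_labels)
#   plt.scatter(coords[:, 0], coords[:, 1], c=gmm_labels)
```

The explained-variance figure indicates how much of the original spread survives the projection, which helps judge how faithfully the 2-D picture represents the clusters.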

Conclusion:

In conclusion, the GMM model yielded the best clusters for the Bank Churn Dataset, as indicated by the ARI and Silhouette scores. Visualizations using PCA provided clear representations of the clusters.

Notably, both K-means and GMM clusters were strongly influenced by the “Exited” feature, highlighting its significance in customer churn. To gain further insights, future analysis should focus on investigating the factors contributing to customer exits.

This project has showcased the power of unsupervised learning techniques in clustering datasets and extracting valuable information. The findings and visualizations obtained can contribute to enhancing customer retention strategies and decision-making processes in the banking industry.