We are given a demographics data set (Udacity_CUSTOMERS_052018.csv) containing approximately 200k customers and 369 features for a mail-order company in Germany. In addition, a demographics data set covering the German population (Udacity_AZDIAS_052018.csv) is given with about 900k persons and 366 features. I will refer to the general population data set as general and the customer data set as customers. There are two goals for this analysis.
- First, we want to use unsupervised learning techniques (PCA and KMeans) to understand the customer base of a mail-order company and match the customer profiles to the general population. We are answering the question: Who are the target customers in the general population?
- Second, we want to use the learnings from this analysis to create a supervised predictive model that determines whether a person would be a good fit for the mail-order company, that is, to identify the people most likely to convert to paying customers.
- The primary goal of the exploratory analysis stage is assessing missing values in both data sets and understanding the underlying feature types (e.g., categorical, continuous, feature distribution, etc.).
- We dropped features with over 40% of missing values and ambiguous features.
- We also dropped rows with more than 50% missing values. Each row represents one user.
- In addition, during the exploratory analysis stage, we analyzed feature dtypes, prefix groupings, and skewness.
- We applied the following imputation strategies to address missing values in the data sets.
- Imputation strategy 1: All categorical and binary features were imputed by their most-frequent values.
- Imputation strategy 2: The continuous features were imputed with their mean values.
- In this analysis stage, we combine Principal Component Analysis (PCA) and KMeans: PCA reduces the feature set to its most informative components, and KMeans groups users into relevant clusters based on those components.
- PCA Step 1: First, both data sets were standardized. Using the general data set, we fitted and transformed the data using PCA.
- PCA Step 2: We applied dimensionality reduction to the feature set, and the goal was to retain the most important principal components. More specifically, we used the general data set to fit the PCA model, and then we conducted dimensionality reduction with the fitted PCA model on the customers data set.
- KMeans Step 1: We determined the number of clusters using the elbow plot. We fitted KMeans a few dozen times, each time on a 10% random sample of the PCA-transformed general data set and with a different number of clusters, and recorded the inertia. Scikit-Learn’s documentation states, “The K-means algorithm aims to choose centroids that minimize the inertia, or within-cluster sum-of-squares criterion.” The x-axis is the number of clusters, and the y-axis is the inertia score. Essentially, we are looking for the elbow shape to determine the number of clusters.
- KMeans Step 2: Once the number of clusters is determined, the KMeans model is fitted with the general PCA-transformed data set. Using the fitted model, the predict() method is applied to both general and customers data sets to generate cluster labels for each user. Next, we compared the two prediction outputs and examined which clusters were overrepresented in the customers data set using the general data set as the baseline.
- KMeans Step 3: We selected the most overrepresented cluster and examined the actual categorical labels. We used pca.inverse_transform() method to map back to the original features and then examined which labels were prominently attached to these users. This allowed us to understand the dominant categorical labels associated with these users.
- Step 1: First, we closely examined the train data with the response variable and addressed missing values and ambiguous features. We applied the same imputation techniques used earlier to preprocess the train data set.
- Step 2: The oversampling/undersampling approach comes from an ML blog post, which can be found here. In this approach, we first oversampled the minority class until the majority-to-minority ratio was 10:1 and then undersampled the majority class, removing rows until the ratio was 2:1.
- Step 3: The transformed train data set was plugged into several ML models, and the best-performing model was chosen using ROC AUC. In the end, Random Forest won out. We also leveraged Random Forest’s impurity-based feature importance method to identify important features.
- Step 1: Randomized Search CV tuned the model by optimizing Random Forest’s hyperparameters. Exhaustive GridSearchCV was also used but did not show incremental improvement in ROC AUC.
- Step 2: In the end, a fitted Random Forest model optimized with RandomizedSearchCV was used to make the predictions on the test data set.
To measure how models perform, we need metrics to help us gauge if what we observe is moving in the right direction. If not, we must implement techniques and strategies to mitigate against low-performing models.
For the unsupervised learning section, we leveraged PCA to reduce the high dimensionality of the data set so that Euclidean distance calculations are not inflated when the data is passed to KMeans. The most helpful metric for PCA is therefore explained variance, which determines the cutoff point for the number of principal components. For KMeans, the key parameter to settle is the number of clusters, chosen by inspecting inertia on the elbow plot.
For supervised learning, the metric that stood out for this particular case was ROC AUC, which is great for imbalanced data sets. Therefore, ROC AUC was used as the success metric in the supervised learning section. We also applied oversampling and undersampling techniques to generate a relatively more balanced data set.
All features in both data sets have missing values, some far more than others. In the general data set, six features are missing 65% or more of their observations. The chart in Figure 1 shows the percentage of observations missing for each feature. Although the fonts are too small to read the feature names, the point of the visualization is that all features are missing some values.
We are seeing similar patterns with the customers data set; some are missing a lot more than others. It will be important going forward to assess how these features will impact the analysis. I will soon dive into strategies and techniques to handle these features.
Drop features with over 40% of missing values and ambiguous features
At this stage, the goal is to go after the low-hanging fruit. It made sense to drop features with 40% or more missing values in both data sets. In addition, several very ambiguous and highly skewed features will most likely not add much value to our analysis.
The code snippet below lists the six features missing more than 40% of their values across both data sets. The next line captures the ambiguous and highly skewed features. What does it mean for a feature to be ambiguous? It means the feature had no description in the supplementary Excel files, leaving no real way to decipher it. At the same time, these features are skewed, meaning most observations take a single value or label.
missing_features = ['ALTER_KIND1', 'ALTER_KIND2', 'ALTER_KIND3', 'ALTER_KIND4', 'EXTSEL992', 'KK_KUNDENTYP']
ambiguous_skewed_features = ['STRUKTURTYP', 'GEMEINDETYP', 'ARBEIT', 'RELAT_AB', 'ANZ_HH_TITEL', 'KONSUMZELLE', 'FIRMENDICHTE', 'AGER_TYP', 'TITEL_KZ', 'GEBURTSJAHR', 'EINGEFUEGT_AM', 'VERDICHTUNGSRAUM', 'D19_LETZTER_KAUF_BRANCHE']
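As a rough illustration of how such lists could be derived and applied, here is a minimal sketch. The helper name `drop_sparse_features`, the toy data, and the strict "greater than 40%" comparison are assumptions made for this example, not the code used in the analysis.

```python
import numpy as np
import pandas as pd

def drop_sparse_features(df, threshold=0.40, extra_drop=()):
    """Drop columns whose share of missing values exceeds `threshold`,
    plus any explicitly listed columns that are present in the frame."""
    missing_share = df.isna().mean()  # fraction of NaN values per column
    sparse = missing_share[missing_share > threshold].index.tolist()
    to_drop = sorted(set(sparse) | (set(extra_drop) & set(df.columns)))
    return df.drop(columns=to_drop), to_drop

# Toy frame standing in for the real data (values are made up).
toy = pd.DataFrame({
    "ALTER_KIND1": [np.nan, np.nan, np.nan, 4.0, np.nan],  # 80% missing
    "KBA13_ANZAHL_PKW": [1.0, 2.0, np.nan, 4.0, 5.0],      # 20% missing
    "TITEL_KZ": [0.0, 0.0, 0.0, 0.0, 1.0],                 # dropped by name
})
reduced, dropped = drop_sparse_features(toy, extra_drop=["TITEL_KZ"])
print(dropped)
print(reduced.columns.tolist())
```

Passing the manually curated list through `extra_drop` keeps the threshold rule and the judgment-based exclusions in one place.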
Drop user rows where missing values are greater than 50%
This step dropped user rows missing 50% or more of their feature values. The share of missing (NaN) features was computed for each person, and rows at or above the threshold were removed to reduce the number of individuals with incomplete or highly sparse profiles.
Before qualified rows were dropped, the general data set rows and columns had these dimensions (891221, 349). Once this logic was applied, it reduced the number of rows by about 100k (791987, 349).
Before dropping user rows: (891221, 349)
After dropping user rows: (791987, 349)
The same logic was applied to the customers data set and reduced the number of rows by about 50k. The below code output shows where we stand at this point.
Before dropping user rows: (191652, 352)
After dropping user rows: (140899, 352)
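The row filter can be sketched in a few lines. The function name `drop_sparse_rows` and the toy frame are hypothetical, and the sketch assumes rows are removed when 50% or more of their values are missing.

```python
import numpy as np
import pandas as pd

def drop_sparse_rows(df, threshold=0.50):
    """Keep only rows whose share of missing values is below `threshold`."""
    row_missing_share = df.isna().mean(axis=1)  # fraction of NaN values per row
    return df.loc[row_missing_share < threshold]

# Toy frame: the middle row is missing 3 of its 4 values.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [1.0, np.nan, np.nan],
    "c": [1.0, np.nan, 3.0],
    "d": [1.0, 4.0, 3.0],
})
print("Before dropping user rows:", toy.shape)
kept = drop_sparse_rows(toy)
print("After dropping user rows:", kept.shape)
```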
Analyze the dtypes, prefix groupings, and skewness
At this stage, I examined the dtypes and grouped features by prefix in preparation for imputing the features with missing values. Most features contain floats and integers except four from general and six from customers. This is good to know, and we’ll address these objects later in the feature transformation step.
At this point, I want to examine the general data set more closely. I wanted to see what features were still missing values and extract the prefix to create an imputation strategy to group these features. Although the features are mostly numeric values, almost all are ordinal or nominal categorical features. The breakdown by prefix analysis shows that 235 features can be grouped into 26 prefixes in the general data set.
Number of prefixes: 26
Number of features: 235
Count of Prefixes:
[('KBA13', 116), ('KBA05', 63), ('CJT', 8), ('D19', 8), ('PLZ8', 7), ('LP', 6), ('CAMEO', 3), ('RT', 3), ('VK', 3), ('UMFELD', 2), ('ALTERSKATEGORIE', 1), ('BALLRAUM', 1), ('EWDICHTE', 1), ('GEBAEUDETYP', 1), ('GFK', 1), ('HH', 1), ('INNENSTADT', 1), ('KKK', 1), ('KONSUMNAEHE', 1), ('MOBI', 1), ('ONLINE', 1), ('ORTSGR', 1), ('REGIOTYP', 1), ('RETOURTYP', 1), ('VHN', 1), ('W', 1)]
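A prefix count like the one above could be produced roughly as follows. This sketch assumes the prefix is everything before the first underscore in the column name; the helper `prefix_counts` and the short column list are illustrative only.

```python
from collections import Counter

def prefix_counts(columns):
    """Count columns by the text before the first underscore."""
    return Counter(col.split("_")[0] for col in columns).most_common()

# A handful of illustrative column names, not the full feature list.
cols = ["KBA13_ANZAHL_PKW", "KBA13_AUDI", "KBA05_ALTER1",
        "CJT_GESAMTTYP", "W_KEIT_KIND_HH"]
counts = prefix_counts(cols)
print("Number of prefixes:", len(counts))
print("Number of features:", sum(n for _, n in counts))
print(counts)
```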
To get a sense of skewness by prefix, I created a function that generates histograms for each prefix and Pandas’ skew() function was used to calculate the skew value for each feature.
- The data are considered fairly symmetrical if the skewness is between -0.5 and 0.5.
- The data are moderately skewed if the skewness is between -1 and -0.5 or between 0.5 and 1.
- The data are highly skewed if the skewness is less than -1 or greater than 1.
Here is an example of the output for the “UMFELD” prefix, which is attached to two features. The output shows two histograms with corresponding skew values.
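The skewness rules above translate directly into code. The sketch below is one possible way to label features by prefix; `skew_label`, `skew_report`, and the toy values are assumptions for illustration, and the histogram plotting is left out.

```python
import pandas as pd

def skew_label(value):
    """Map a skew value onto the three bands described in the text."""
    magnitude = abs(value)
    if magnitude <= 0.5:
        return "fairly symmetrical"
    if magnitude <= 1.0:
        return "moderately skewed"
    return "highly skewed"

def skew_report(df, prefix):
    """Skew value and label for every column that starts with `prefix`."""
    cols = [c for c in df.columns if c.split("_")[0] == prefix]
    skews = df[cols].skew()  # pandas skips NaN values by default
    return pd.DataFrame({"skew": skews, "label": skews.map(skew_label)})

# Toy values; the real features are ordinal codes with many more rows.
toy = pd.DataFrame({
    "UMFELD_ALT": [1, 2, 3, 4, 5],
    "UMFELD_JUNG": [1, 1, 1, 1, 5],
    "KKK": [1, 2, 3, 4, 4],
})
report = skew_report(toy, "UMFELD")
print(report)
```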
At this stage of the preprocessing step, we are ready to impute the data after removing sparse features and user rows. A few features in this group had no descriptions in the provided Excel files. In these cases, I examined their histograms and identified the features as ordinal or nominal. In Figure 4, all the features are categorical except one. And most categorical features are ordinal. There are a few nominal features, one continuous, and one binary feature.
After analyzing these features, the imputation procedure involves two main approaches:
- Strategy 1: All categorical and binary features are imputed by their most-frequent values.
- Strategy 2: The continuous features will be imputed with their mean values.
I leveraged sklearn to create an imputation pipeline for features missing values. Here is the code that creates the imputation preprocessors. Once the features were imputed, they were merged with the rest of the complete features to create the final data sets.
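A minimal sketch of what such an imputation preprocessor might look like, using scikit-learn's `ColumnTransformer` and `SimpleImputer`. The assignment of columns to the categorical and continuous groups is hypothetical and only serves the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Hypothetical grouping of features by type.
categorical_cols = ["CJT_GESAMTTYP", "KKK"]
continuous_cols = ["KBA13_ANZAHL_PKW"]

imputer = ColumnTransformer(
    transformers=[
        # Strategy 1: categorical and binary features get the most frequent value.
        ("categorical", SimpleImputer(strategy="most_frequent"), categorical_cols),
        # Strategy 2: continuous features get the column mean.
        ("continuous", SimpleImputer(strategy="mean"), continuous_cols),
    ],
    remainder="drop",
)

toy = pd.DataFrame({
    "CJT_GESAMTTYP": [2.0, 2.0, np.nan, 5.0],
    "KKK": [1.0, np.nan, 3.0, 3.0],
    "KBA13_ANZAHL_PKW": [100.0, np.nan, 300.0, 500.0],
})

# ColumnTransformer returns columns in transformer order, so rebuild the labels.
imputed = pd.DataFrame(
    imputer.fit_transform(toy),
    columns=categorical_cols + continuous_cols,
    index=toy.index,
)
print(imputed)
```

The imputed frame can then be joined back to the columns that had no missing values.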
Unsupervised Learning Models: PCA + KMeans
A large portion of the customer segmentation analysis consists of two steps. First, we perform principal component analysis on the two data sets to linearly reduce the number of dimensions (fewer features) using singular value decomposition. Second, we apply the clustering technique KMeans to better understand the underlying user base.
Let’s quickly articulate the goal for using PCA and applying KMeans. We are essentially using unsupervised learning techniques to describe the relationship between the demographics of the mail-order company’s customers and the general population of Germany. In the end, we will identify a group of folks in the general population that are a good representation of the mail-order company’s customer base.
Here’s a quick recap of where we stand. The general and customers data sets have the following rows and columns:
- General: 791,987 user rows and 346 features
- Customers: 140,899 user rows and 346 features
First, both data sets need to be scaled so that they can be fitted and transformed using PCA. I’m using sklearn’s StandardScaler() class to scale the data. We then instantiate the PCA object and fit it on the standardized general data set. In the code below, the instantiated PCA object keeps only as many components as are needed to retain 95% of the variance. This is an arbitrary threshold that I have decided to use for this analysis. We fit and transform the general data set in one step. The direct result is dimensionality reduction: we had 346 features before, and after the transform step we are left with 212 principal components.
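The scaling and PCA steps might look roughly like the sketch below. It runs on synthetic data, and it assumes the scaler is fitted on the general data set and then reused for the customers data set; the variable names `general` and `customers` are placeholders.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins: 20 features, half of them near-copies of the other half.
rng = np.random.default_rng(0)
general = rng.normal(size=(500, 20))
general[:, 10:] = general[:, :10] + 0.1 * rng.normal(size=(500, 10))
customers = rng.normal(size=(100, 20))

# Assumption: fit the scaler on the general population and reuse it for customers.
scaler = StandardScaler().fit(general)

# A float n_components keeps the smallest number of components whose
# cumulative explained variance reaches that fraction.
pca = PCA(n_components=0.95, svd_solver="full").fit(scaler.transform(general))

azdias_pca = pca.transform(scaler.transform(general))
customers_pca = pca.transform(scaler.transform(customers))
print(general.shape[1], "features ->", pca.n_components_, "components")
```

Reusing the fitted scaler and PCA for both data sets keeps the two in the same coordinate system, which the later cluster comparison relies on.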
Let’s take a closer look at what is going on by visualizing the cumulative explained variance as we increase the number of principal components and the exponentially decreasing curve to observe how much each component contributes to the overall explained variance.
We often see the first few principal components capturing most of the explained variance. In this analysis, we have to go up to the 212th principal component to capture 95%. To further dissect each principal component, I’ll leverage sklearn’s pca.components_ attribute to understand which features contribute the most to principal component 1, for example.
It helps to review what the pca.components_ attribute contains and what the numbers represent. Each row of pca.components_ holds the loading scores for one principal component, that is, the coefficients paired with each original feature. Because scikit-learn computes PCA through singular value decomposition (SVD), each row is a unit-length vector (a right singular vector of the centered data, or equivalently an eigenvector of its covariance matrix), so every loading lies between -1 and 1. Regardless of sign, the larger the absolute value of a loading, the more strongly that feature influences the component.
I’m only going to take a closer look at the first principal component, which, in my analysis, contributes about 8.34% to the total explained variance. I visualized the top 30 positive loading scores and the top 30 negative loading scores.
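One way to pull out the largest and smallest loadings of a component is sketched here on random data. The helper `top_loadings` and the generic feature names are hypothetical; with the real data one would pass the actual column names and n=30.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

def top_loadings(pca, feature_names, component=0, n=30):
    """Return the n largest and the n smallest loadings of one component."""
    loadings = pd.Series(pca.components_[component], index=feature_names)
    ordered = loadings.sort_values(ascending=False)
    largest = ordered.head(n)                 # highest loadings first
    smallest = ordered.tail(n).iloc[::-1]     # lowest loadings first
    return largest, smallest

rng = np.random.default_rng(0)
feature_names = [f"feature_{i}" for i in range(6)]
X = rng.normal(size=(200, 6))
pca = PCA(n_components=3).fit(X)

top_pos, top_neg = top_loadings(pca, feature_names, component=0, n=2)
print(top_pos)
print(top_neg)
# Each component is a unit-length vector, so its loadings lie in [-1, 1].
print(np.linalg.norm(pca.components_[0]))
```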
We are ready to apply KMeans to our general pca (azdias_pca) data set. First, we generate the kmeans elbow plot to eyeball the number of clusters we want in our final analysis. To generate this plot, we take about 10% sample data from azdias_pca data set and instantiate KMeans algo object with an increasing number of n_clusters. For every loop, we calculate the inertia_, which is the sum of squared distances of samples to their closest cluster center. We are starting at n_clusters = 1 and going up to n_clusters = 40. Here is the result.
Scikit-Learn Clustering section has a great description of why we apply PCA before applying KMeans: (source)
Inertia can be recognized as a measure of how internally coherent clusters are. It suffers from various drawbacks:
* Inertia makes the assumption that clusters are convex and isotropic, which is not always the case. It responds poorly to elongated clusters, or manifolds with irregular shapes.
* Inertia is not a normalized metric: we just know that lower values are better and zero is optimal. But in very high-dimensional spaces, Euclidean distances tend to become inflated (this is an instance of the so-called “curse of dimensionality”). Running a dimensionality reduction algorithm such as Principal component analysis (PCA) prior to k-means clustering can alleviate this problem and speed up the computations.
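The elbow-plot procedure described earlier can be sketched as a loop over candidate cluster counts. The function `inertia_curve`, the synthetic data, and the small range of k are assumptions for illustration; the analysis itself went from 1 to 40 clusters on samples of azdias_pca.

```python
import numpy as np
from sklearn.cluster import KMeans

def inertia_curve(data, k_values, sample_frac=0.10, seed=0):
    """Fit KMeans for each k on a fresh random sample and collect inertia_."""
    rng = np.random.default_rng(seed)
    n_sample = max(int(len(data) * sample_frac), max(k_values))
    inertias = []
    for k in k_values:
        rows = rng.choice(len(data), size=n_sample, replace=False)
        model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(data[rows])
        inertias.append(model.inertia_)  # within-cluster sum of squared distances
    return inertias

rng = np.random.default_rng(1)
data = rng.normal(size=(2000, 5))   # stand-in for the PCA-transformed data
k_values = list(range(1, 6))
inertias = inertia_curve(data, k_values)
for k, value in zip(k_values, inertias):
    print(k, round(value, 1))
```

Plotting `k_values` against `inertias` gives the elbow plot; the bend, if one is visible, suggests a reasonable number of clusters.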
We settled on n_clusters = 15 after closely examining the elbow plot. We then fit the KMeans object with 15 clusters using azdias_pca (the general PCA data set). Once the KMeans object is fitted, we pass the general and customers PCA data sets to the predict() method to generate a cluster label for each user row. Every row is assigned one of 15 cluster labels because we went with n_clusters = 15; scikit-learn numbers them from 0 to 14.
There are eight clusters where the mail-order customers are more represented than the general population. Let’s take a closer look at these eight clusters. Cluster 12 definitely stands out from the rest and represents about 26.3% of the customer base of the mail-order company. Clusters 6 and 7 are interesting as well. We’ll dive deeper into profiling the users in cluster 12, which should give us a better picture of the mail-order company’s main customer base.
# 8 clusters where the customers are larger than general population
km_clusters['diff'] = km_clusters.cust_cluster_pct - km_clusters.azdias_cluster_pct
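For context, a table like km_clusters could be assembled along the following lines. This is a sketch on synthetic data with 5 clusters instead of 15; only the column names cust_cluster_pct, azdias_cluster_pct, and diff come from the snippet above, and everything else is assumed.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
azdias_pca = rng.normal(size=(1000, 4))               # stand-in for general data
customers_pca = rng.normal(loc=0.5, size=(300, 4))    # shifted stand-in for customers

n_clusters = 5
kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(azdias_pca)
azdias_labels = kmeans.predict(azdias_pca)
cust_labels = kmeans.predict(customers_pca)

def cluster_pct(labels, n_clusters):
    """Percentage of rows assigned to each cluster label."""
    counts = np.bincount(labels, minlength=n_clusters)
    return 100 * counts / counts.sum()

km_clusters = pd.DataFrame({
    "azdias_cluster_pct": cluster_pct(azdias_labels, n_clusters),
    "cust_cluster_pct": cluster_pct(cust_labels, n_clusters),
})
km_clusters["diff"] = km_clusters.cust_cluster_pct - km_clusters.azdias_cluster_pct

# Clusters where customers are overrepresented relative to the general population.
overrepresented = km_clusters[km_clusters["diff"] > 0].sort_values("diff", ascending=False)
print(overrepresented)
```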
To further our understanding of the customer base, we want to know the dominant features associated with users in cluster 12. We’re leveraging pca.inverse_transform() method to map back to the original features and then examine which labels are attached to these users. Keep in mind that almost all features are categorical. We’ll leverage the supplemental Excel files to look up the definitions of these categorical labels. Here are the functions we’ll use to inversely map back to the original features.
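A minimal sketch of mapping a cluster back to the original feature space follows. It assumes the cluster centroid is what gets inverted and that the scaling step is undone as well; the helper `centroid_in_original_units` and the synthetic data are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def centroid_in_original_units(kmeans, pca, scaler, feature_names, cluster):
    """Map one cluster centroid from PCA space back to the original features."""
    centroid_pca = kmeans.cluster_centers_[cluster].reshape(1, -1)
    centroid_scaled = pca.inverse_transform(centroid_pca)   # undo the PCA projection
    centroid = scaler.inverse_transform(centroid_scaled)    # undo the standardization
    return pd.Series(centroid.ravel(), index=feature_names)

rng = np.random.default_rng(0)
feature_names = [f"feature_{i}" for i in range(6)]
X = rng.normal(loc=3.0, scale=2.0, size=(300, 6))

scaler = StandardScaler().fit(X)
pca = PCA(n_components=3).fit(scaler.transform(X))
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(
    pca.transform(scaler.transform(X))
)

profile = centroid_in_original_units(kmeans, pca, scaler, feature_names, cluster=0)
print(profile.sort_values(ascending=False))
```

Because PCA discards some variance, the recovered values are approximations of the original feature scale rather than exact category codes.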
Because we are interested in cluster 12, it makes sense to examine the 10 original features with the highest positive loading scores and the 10 with the most negative loading scores.
At this point, we want to examine these 20 features closely and see if we can extract some insights about the users who largely represent the customer base of the mail-order company. We map the general cluster labels to the original general data set and isolate users attached to cluster 12. Next, we examined the distribution of the labels in each feature. To streamline this process, we took the most frequently occurring label in each feature and generalized this segment of users.
Figures 13 and 14 capture two feature distributions. For the mail-order company, 26% of the customers fell into cluster 12. In other words, for every 4 customers, one of them fell into this cluster. At a closer glance, the customers in cluster 12 are predominantly represented by these attributes:
- Mostly low-income earners
- Not versed with the online world
- Most are working-class folks
- Most are in their advanced age
- Many have multi-cultural backgrounds
- Many have low financial interests
- Do not drive expensive cars
Supervised Learning Model: Random Forest + Feature Importance
We are now venturing into supervised learning to understand what features are highly associated with converting users into customers by using the train data containing the “RESPONSE” column. The train file, Udacity_MAILOUT_052018_TRAIN.csv, contains 42,962 user rows and 367 features. Similar to the earlier (PCA and KMeans) data sets, the train data set contains many features with missing values. We’ll be applying similar imputation strategies used in the unsupervised learning section to impute the features as well as drop ambiguous features.
First, we dropped features with 40% or more missing values and dropped ambiguous/skewed features using the same criteria used earlier. Second, we impute features using sklearn’s ColumnTransformer.
Before deciding on Random Forest, the train data set was standardized and plugged into several algorithms to see which one performed the best. However, when examining the distribution of the response variable, the majority class (0) outnumbers the minority class (1), which is the class we are interested in detecting, by a ratio of about 80:1. We want to mitigate the class imbalance and improve model performance as much as possible.
The oversampling/undersampling approach comes from an ML blog post, which can be found here. In this approach, we first oversample the minority class until the majority-to-minority ratio is 10:1 and then undersample the majority class, removing rows until the ratio is 2:1. The output_model_stats function captures all the details, and it’s designed to iterate through a list of models. We measure model performance using the roc_auc metric.
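To make the two ratios concrete, here is a sketch that uses plain random resampling with pandas and NumPy. This is a stand-in technique chosen to keep the example dependency-free; it is not necessarily the resampling method used in the referenced blog post or in this analysis, and `resample_to_ratio` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

def resample_to_ratio(df, target, over_ratio=0.1, under_ratio=0.5, seed=0):
    """Randomly oversample class 1 to `over_ratio` of class 0, then randomly
    undersample class 0 until class 1 / class 0 equals `under_ratio`."""
    rng = np.random.default_rng(seed)
    majority = df[df[target] == 0]
    minority = df[df[target] == 1]

    # Oversample: duplicate minority rows (with replacement) up to the target size.
    n_min = max(len(minority), int(round(over_ratio * len(majority))))
    extra = minority.iloc[rng.integers(0, len(minority), size=n_min - len(minority))]
    minority_over = pd.concat([minority, extra])

    # Undersample: keep a random subset of majority rows.
    n_maj = min(len(majority), int(round(len(minority_over) / under_ratio)))
    majority_under = majority.iloc[rng.choice(len(majority), size=n_maj, replace=False)]

    return pd.concat([majority_under, minority_over]).sample(frac=1, random_state=seed)

# Toy data with roughly the 80:1 imbalance described in the text.
toy = pd.DataFrame({
    "x": np.arange(810, dtype=float),
    "RESPONSE": [0] * 800 + [1] * 10,
})
balanced = resample_to_ratio(toy, "RESPONSE")
class_counts = balanced["RESPONSE"].value_counts()
print(class_counts)
```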
As shown in Figure 15, Random Forest performed the best. Because Random Forest has a feature importance built-in using the impurity approach to ranking features, we are going to leverage this to examine the top 20 most important features.
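Ranking features by impurity-based importance could look like the sketch below. It runs on a synthetic classification problem with generic feature names, so the numbers have no connection to the results in this analysis.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for the preprocessed train set.
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# feature_importances_ holds the impurity-based importances, which sum to 1.
importances = pd.Series(forest.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
top_features = importances.head(20)
print(top_features)
```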
RandomizedSearchCV: Random Forest Classifier
One way to improve the roc_auc metric is to tune hyperparameters with some form of search. These hyperparameters are model-specific, and we can use RandomizedSearchCV in Scikit-Learn to tune our Random Forest’s hyperparameters. This approach is less exhaustive than a full grid search and saves time. We want to try this approach first and then possibly run a full grid search with GridSearchCV, leveraging the learnings from this iteration.
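A randomized search over Random Forest hyperparameters might be set up as in this sketch. The parameter grid, the number of iterations, and the synthetic data are assumptions chosen to keep the example fast; they are not the settings used in the analysis.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic, mildly imbalanced data standing in for the train set.
X, y = make_classification(
    n_samples=300, n_features=10, weights=[0.8, 0.2], random_state=0
)

# Hypothetical search space; each iteration samples one combination.
param_distributions = {
    "n_estimators": [25, 50, 100],
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 5, 10],
    "max_features": ["sqrt", "log2"],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,            # number of sampled combinations
    scoring="roc_auc",
    cv=3,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)
print(search.best_score_)
```

After fitting, `search.best_estimator_` is the model refitted on all the data with the best sampled combination.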
It is time to test to see if these tuned hyperparameters will make a tangible difference in our model performance. Using our optimized Random Forest model, we’ll conduct cross-validation using RepeatedStratifiedKFold to generate the new roc_auc score. The new roc_auc of 0.70165 is an improvement from the initial number of 0.662905.
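The cross-validation step can be sketched with `RepeatedStratifiedKFold` and `cross_val_score`. The model settings, fold counts, and synthetic data here are placeholders, so the resulting score is unrelated to the figures quoted above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = make_classification(
    n_samples=300, n_features=10, weights=[0.8, 0.2], random_state=0
)

# Placeholder for the tuned model returned by the hyperparameter search.
model = RandomForestClassifier(n_estimators=50, max_depth=5, random_state=0)

# Stratified folds preserve the class ratio; repeating reduces the variance
# of the estimate.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=0)
scores = cross_val_score(model, X, y, scoring="roc_auc", cv=cv)
mean_auc = scores.mean()
print(len(scores), "folds, mean roc_auc:", round(mean_auc, 4))
```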
After observing an improvement in the roc_auc score using RandomizedSearchCV, we went ahead and ran GridSearchCV based on our learnings from previous tuning work. The GridSearchCV procedure did not generate a better roc_auc metric versus the RandomizedSearchCV. For the Kaggle competition, we’ll use the randomized search fitted model to make the predictions and submit our results.
With Random Forest, we can use labeled train data to understand what features are important. In applying PCA with KMeans, we reduced the number of features and understood relevant clusters of people that represented the customer base of the mail-order company.
Two powerful approaches, supervised and unsupervised learning, identified important user attributes and revealed prevailing patterns associated with mail-order conversion. In conclusion, we provided two views. The first is a view of the customer attributes that best fit the mail-order company. The second uses past conversion data to see which features best describe the core customer base.
A few improvements can be made further to enhance the performance in unsupervised and supervised learning approaches. First, rather than dropping features with a high % of missing values, applying different imputation strategies might have helped retain more information about the data set.
Second, although ambiguous features were dropped due to lack of documentation and skewness, they could be retained and eliminated iteratively by understanding their contribution to explained variance during PCA analysis. In doing so, during KMeans clustering analysis, we could have increased our probability of seeing an actual elbow in the elbow plot.
In the supervised learning section, different imputation methods could have been adopted to retain more information and thus create a more predictive model. We could also have tried different hyperparameter settings to tune the Random Forest model further. We could have also used a different metric to select the best-performing model after applying oversampling and undersampling techniques.