Who are the Potential Customers

— a Machine Learning Approach for Customer Segmentation & Response Prediction

Customer targeting

Problem Statement

Customers are tired of generic marketing. They want brands to serve them content and offers that appeal to them. Yet 66 percent of marketers say that securing the internal resources to execute personalized marketing programs is a challenge. A Harvard Business School professor even went so far as to say that, of the 30,000 new consumer products launched each year, 95% fail because of ineffective market segmentation.

Project Overview

Customer segmentation has the potential to allow marketers to address each customer in the most effective way. Using the large amount of data available on customers (and potential customers), a customer segmentation analysis allows marketers to identify discrete groups of customers with a high degree of accuracy based on demographic, behavioral and other indicators. Machine learning customer segmentation allows advanced algorithms to surface insights and groupings that marketers might have difficulty discovering on their own.

This post is based on a real-life data science project that uses both unsupervised and supervised machine learning and includes a Kaggle InClass Competition. It is the Capstone Project of the Udacity Data Scientist Nanodegree, provided in collaboration with Bertelsmann Arvato Analytics, and aims to help the company find a more efficient way of targeting people who are likely to become customers. The purpose is to compare the demographics of the general population in Germany against the demographics of customers of a German mail-order company, explain the differences, and predict customer response to a marketing campaign.

The project comprised three major parts:

  1. Data exploration and cleaning - This is the crucial part and consists of data cleaning, feature extraction and transformation.
  2. Customer segmentation report - Unsupervised learning methods are used to analyze attributes of established customers and the general population in order to create customer segments.
  3. Supervised learning model & Kaggle competition - A machine learning model is built to predict whether or not each individual will respond to the campaign. Predictions on the campaign data were submitted to the Kaggle competition.

Data Preview

There are six data sets provided with the project:

Udacity_AZDIAS_052018.csv: Demographics data for the general population of Germany; 891 211 persons (rows) x 366 features (columns).

Udacity_CUSTOMERS_052018.csv: Demographics data for customers of a mail-order company; 191 652 persons (rows) x 369 features (columns).

Udacity_MAILOUT_052018_TRAIN.csv: Demographics data for individuals who were targets of a marketing campaign; 42 982 persons (rows) x 367 (columns).

Udacity_MAILOUT_052018_TEST.csv: Demographics data for individuals who were targets of a marketing campaign; 42 833 persons (rows) x 366 (columns).

DIAS Attributes_Values_2017.csv: Information about each column of the demographic data.

DIAS Information Levels_Attributes 2017.xlsx: Detailed explanation of the columns in the demographic data.

The data provided in this project cover the most common types of data used for customer segmentation:

Demographic - Age, gender, income, occupation, education, marital status, nationality.

Geographic - Country, land, rural or metropolitan area.

Psychographic - Social status, lifestyle, personality, attitude, values, and interests.

Behavioral - Intensity of product use, brand loyalty, user behaviors, price sensitivity, technology adoption.

Data Exploration and Data Clean

I started by checking the attribute information in the file “DIAS Attributes_Values_2017.csv” and found inconsistencies with the data sets. For example, 94 fields in Udacity_AZDIAS_052018.csv have no description, and 42 items in the description file do not appear in the data set. Instead of dropping the columns without a description, I looked for similar attributes in the description file and added a suggested description and possible unknown values, e.g. D19_HAUS_DEKO in the data files corresponds to D19_HAUS_DEKO_RZ in the information file.

The data type report shows that most columns are numerical and six are of object type. Abnormal values in some columns, e.g. X in CAMEO_INTL_2015, were converted to unknown. Fields with too many categorical values, e.g. D19_LETZTER_KAUF_BRANCHE, were put on the drop list. Other categorical columns, e.g. FINANZTYP and SHOPPER_TYP, were kept to be processed by one-hot encoding.

Timestamp columns (EINGEFUEGT_AM) were converted to the year, and two-valued columns (OST_WEST_KZ) were encoded as binary data (1, 0). Unknown values (e.g. -1, 9) specified in the attribute information file were converted to NaN.
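A minimal pandas sketch of these three conversions, on a made-up toy frame. The `unknown_codes` lookup is hypothetical (the real one would be built from the attribute information file), and mapping W to 1 and O to 0 is an assumed direction, since only the (1, 0) encoding is stated above.

```python
import pandas as pd

# Toy rows standing in for a few AZDIAS columns; all values are made up.
df = pd.DataFrame({
    "EINGEFUEGT_AM": ["1992-02-10 00:00:00", "1997-05-14 00:00:00",
                      "2005-12-01 00:00:00", None],
    "OST_WEST_KZ": ["W", "O", "W", None],
    "SHOPPER_TYP": [-1, 2, 9, 0],
})

# Hypothetical lookup of column -> codes that mean "unknown".
unknown_codes = {"SHOPPER_TYP": [-1, 9]}

# Timestamp -> year (missing timestamps stay missing).
df["EINGEFUEGT_AM"] = pd.to_datetime(df["EINGEFUEGT_AM"]).dt.year

# Two-valued column -> binary; W=1 / O=0 is an assumption for illustration.
df["OST_WEST_KZ"] = df["OST_WEST_KZ"].map({"W": 1, "O": 0})

# Unknown codes -> NaN.
for column, codes in unknown_codes.items():
    df[column] = df[column].mask(df[column].isin(codes))
```

Using `mask` with `isin` keeps the replacement explicit per column, which matters here because the same code (e.g. 9) can be a valid value in one column and "unknown" in another.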

According to the description, the field CAMEO_INTL_2015 combines two attributes. It was split into two columns: CAMEO_INTL_2015_HH represents household wealth (e.g. Wealthy, Prosperous or Comfortable Households), and CAMEO_INTL_2015_FM represents the family type (e.g. Pre-Family Couples, Families With School Age Children).
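One way to sketch the split, assuming the code is a two-digit number whose tens digit carries the household wealth and whose ones digit carries the family type. The sample values are made up.

```python
import pandas as pd

# Made-up sample of the combined code, including an abnormal "XX" entry.
cameo = pd.Series(["51", "24", "XX", None, "13"], name="CAMEO_INTL_2015")

# Non-numeric entries such as "XX" become NaN.
codes = pd.to_numeric(cameo, errors="coerce")

split = pd.DataFrame({
    "CAMEO_INTL_2015_HH": codes // 10,  # tens digit: household wealth
    "CAMEO_INTL_2015_FM": codes % 10,   # ones digit: family type
})
```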

Data Imputation

Missing Values

Real-life data always have missing values. In both the general population and the customer data sets, 67 columns have more than 35 percent missing data.

After manual checking, EXTSEL992, KBA05_BAUMAX and GEBURTSJAHR showed significant differences between customers and the general population, so these columns were kept. The remaining 64 fields were dropped.
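The drop rule can be sketched as follows. The toy frame and the columns `A` and `B` are made up; the 35 percent threshold and the three retained column names come from the text above.

```python
import numpy as np
import pandas as pd

# Toy frame; "A" and "B" are hypothetical columns, values are made up.
df = pd.DataFrame({
    "A": [1, np.nan, np.nan, 4, 5],                        # 40% missing
    "B": [1, 2, 3, 4, np.nan],                             # 20% missing
    "GEBURTSJAHR": [np.nan, np.nan, np.nan, 1970, 1985],   # 60% missing
})

threshold = 0.35
keep_anyway = {"EXTSEL992", "KBA05_BAUMAX", "GEBURTSJAHR"}

# Share of missing values per column.
missing_share = df.isna().mean()

# Drop sparse columns unless they are on the keep list.
to_drop = [c for c in df.columns
           if missing_share[c] > threshold and c not in keep_anyway]
cleaned = df.drop(columns=to_drop)
```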


The interquartile range (IQR), also called the midspread, was employed to identify outliers. It is a measure of statistical dispersion equal to the difference between the 75th and 25th percentiles (Q3 - Q1). Values higher than Q3 + 1.5 * IQR or lower than Q1 - 1.5 * IQR are treated as outliers.

By this calculation, around 89 attributes have outliers. Among them, a few features that are severely imbalanced by nature, e.g. TITEL_KZ, were kept unchanged; for the others, e.g. ALTER_HH and ANZ_STATISTISCHE_HAUSHALTE, the outlier values were converted to NaN.
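A small sketch of the IQR rule applied to one column. The helper name `mask_iqr_outliers` and the sample values are made up for illustration.

```python
import pandas as pd

def mask_iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Replace values outside [Q1 - k*IQR, Q3 + k*IQR] with NaN."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Values outside the fences (and existing NaN) come back as NaN.
    return series.where(series.between(lower, upper))

# Made-up values: Q1 = 2, Q3 = 4, so the fences are -1 and 7.
values = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 50], dtype=float)
masked = mask_iqr_outliers(values)
```

Here only the value 50 falls outside the fences and is turned into NaN, so it can later be handled by the same imputation step as any other missing value.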

One Hot Encoding & Feature Scaling

After data cleaning, categorical columns were processed by one-hot encoding, missing values were filled by mode imputation (using the most frequent value of the entire feature column), and StandardScaler was employed for feature scaling.
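The three steps could be chained roughly like this with pandas and scikit-learn. The toy values are made up, and the exact options used in the project are not stated, so the arguments below are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Toy frame with one categorical and one numeric column (made-up values).
df = pd.DataFrame({
    "FINANZTYP": [1, 2, 2, 6],
    "ALTER_HH": [10.0, np.nan, 12.0, 12.0],
})

# 1. One-hot encode the categorical column.
encoded = pd.get_dummies(df, columns=["FINANZTYP"], dtype=float)

# 2. Fill missing values with the most frequent value of each column.
imputed = SimpleImputer(strategy="most_frequent").fit_transform(encoded)

# 3. Scale every column to zero mean and unit variance.
scaled = StandardScaler().fit_transform(imputed)
features = pd.DataFrame(scaled, columns=encoded.columns)
```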

After wrangling and ETL, the number of data columns was reduced from 366 to 353, and the data is now ready for analysis.

Customer Segmentation

The goal of cluster analysis in marketing is to accurately segment customers in order to achieve more effective customer marketing via personalization. A common cluster analysis method is a mathematical algorithm known as k-means cluster analysis.

PCA in conjunction with k-means is a powerful method for visualizing high-dimensional data. Using PCA before k-means clustering reduces the number of dimensions and decreases the computation cost.

Principal Component Analysis (PCA) was used to reduce the dimensionality of the data set while retaining as much of the explained variance as possible. Plotting the explained variance against the number of components shows that around 120 components explain more than 80% of the variance.

Running the model produced by supervised learning (see the supervised learning section for details) showed that 222 features are of very low importance, e.g. CAMEO_DEUG_2015_8.0, LP_STATUS_FEIN_2.0 and FINANZTYP_4.0; these were dropped before PCA. The remaining 128 attributes, which explain >50% of the variance, were kept for further processing.

After rerunning PCA, 54 components are able to explain >80% of the variance. The decision was to use PCA with 55 components to reduce the data dimensions further for KMeans clustering.
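Picking the number of components from the cumulative explained variance can be sketched as below. The data is synthetic (five latent factors mixed into 30 columns), so the resulting component count has nothing to do with the 54 found on the real data.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the scaled feature matrix.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 5))
mixing = rng.normal(size=(5, 30))
X = latent @ mixing + 0.1 * rng.normal(size=(500, 30))

# Fit a full PCA and accumulate the explained variance ratio.
pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative share reaches 80%.
n_components = int(np.searchsorted(cumulative, 0.80) + 1)

# Refit with that many components and project the data.
X_reduced = PCA(n_components=n_components).fit_transform(X)
```

Plotting `cumulative` against the component index gives the kind of curve described above.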

Plotting the KMeans SSE value against the number of clusters, the “elbow point” (the point after which the SSE, or inertia, starts decreasing in a linear fashion) indicates that the optimal number of clusters is about 9 to 11.
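The SSE values behind such an elbow plot can be collected as in this sketch. It uses three made-up, well-separated blobs, so the elbow here sits at 3 rather than at 9 to 11; `n_init` and `random_state` are illustrative choices.

```python
import numpy as np
from sklearn.cluster import KMeans

# Three synthetic blobs standing in for the PCA-reduced data.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(size=(100, 2)) for c in centers])

# SSE (inertia) for each candidate number of clusters.
sse = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse[k] = model.inertia_
```

The SSE drops sharply up to the true number of blobs and only slowly afterwards, which is the flattening that the elbow heuristic looks for.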

A comparison of the cluster distributions with 10 clusters shows significant differences between customers and the general population. Customers are more likely to be assigned to clusters 2, 9 and 10, and much less likely to be assigned to clusters 8 and 3.
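The comparison boils down to the share of each group that lands in each cluster. A sketch with made-up labels (three clusters instead of ten):

```python
import pandas as pd

# Made-up cluster labels for the general population and for customers.
population_labels = pd.Series([0, 0, 1, 1, 2, 2, 2, 2])
customer_labels = pd.Series([2, 2, 2, 1])

# Share of each group per cluster; clusters absent from a group get 0.
shares = pd.DataFrame({
    "population": population_labels.value_counts(normalize=True),
    "customers": customer_labels.value_counts(normalize=True),
}).fillna(0.0).sort_index()

# Ratio > 1 means customers are over-represented in that cluster.
shares["ratio"] = shares["customers"] / shares["population"]
over_represented = shares.index[shares["ratio"] > 1].tolist()
```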

Cluster center values for customers and the general population are significantly different.

Findings for cluster 2, in which customers make up the majority. The feature descriptions show that:
LP_STATUS_FEIN_9.0 - Social status: customers are mostly houseowners or top earners
WOHNLAGE - Building: more customers live in a rural neighbourhood or in a new building in a rural neighbourhood
MOBI_REGIO - Moving patterns: customers have low or no mobility
KBA13_ANTG1 - A higher share of car owners between 1–2 within the PLZ8 for customers
PLZ8_ANTG1 - Customers have a higher share of 1–2 family houses in the PLZ8

Comparing the cluster center values in the plot shows the important distinctions between cluster 2 (mostly customers) and cluster 8 (mostly general population).

Supervised Learning & Response Prediction

Before PCA, I had already employed supervised learning to reduce the dimensions; here are the details of the algorithm and metrics used.

The XGBClassifier algorithm employed in this project has become the weapon of choice for many data scientists. It is a highly sophisticated algorithm, powerful enough to deal with all sorts of irregularities in data.

On an imbalanced data set, a classifier that always predicts the majority class achieves high accuracy, which makes accuracy a misleading measure. The AUC-ROC metric was therefore employed in this project to avoid the problems caused by an improper metric such as accuracy on severely imbalanced data like this.
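A tiny illustration of the point, with a made-up label vector in which 1 percent of the rows are responders (the real response rate is not stated here):

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Made-up labels: 99 non-responders and 1 responder.
y_true = np.array([0] * 99 + [1])

# A "classifier" that always predicts the majority class.
majority_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, majority_pred)
auc = roc_auc_score(y_true, majority_pred.astype(float))
```

The accuracy comes out at 0.99 even though no responder is found, while the AUC of the constant prediction is 0.5, the same as random guessing.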

To fine-tune the model, different hyperparameter values were compared. max_depth, min_child_weight, n_estimators and learning_rate are the four major parameters that were tuned. The train and test AUC values show that the highest train AUC is 0.9896, but the corresponding test AUC is only 0.6372, which indicates overfitting. A more balanced model, with a score of 0.9085 for train and 0.7709 for test, was therefore used.
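The train-versus-test comparison can be sketched as follows. To keep the example dependent on scikit-learn only, it swaps in `GradientBoostingClassifier` for XGBClassifier (so there is no `min_child_weight` here), and the data and candidate settings are made up; it is not the search used in the project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the mailout training set.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                           weights=[0.95, 0.05], flip_y=0.02, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Two hypothetical settings: a small model and a much more flexible one.
candidates = [
    {"max_depth": 2, "n_estimators": 50, "learning_rate": 0.05},
    {"max_depth": 6, "n_estimators": 200, "learning_rate": 0.3},
]

results = []
for params in candidates:
    model = GradientBoostingClassifier(random_state=0, **params)
    model.fit(X_train, y_train)
    train_auc = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
    test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    results.append({**params, "train_auc": train_auc, "test_auc": test_auc})

# Prefer the setting that generalizes best, not the one that fits best.
best = max(results, key=lambda r: r["test_auc"])
```

Looking at both AUC values side by side is what exposes a large gap between train and test scores.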

For the optimized model, the feature importance plot shows that the distinctive attributes of customers are:
LP_STATUS_GROB_5.0 - people who are top earners
KBA13_ANTG2 - people who live in a PLZ8 area with a higher share of 3–5 family houses
PLZ8_HHZ - people who live in an area with a larger number of households within the PLZ8
D19_SOZIALES - people with higher transactional activity in the product group “social”
LP_FAMILIE_FEIN_10.0 - family type of two-generational household
FINANZ_SPARER - customers are less likely to be money savers

To improve the prediction accuracy, the KMeans cluster labels were added as extra attributes to the Mailout train and test data sets. Finally, the data set with 363 dimensions was fed into the XGBClassifier model for training. The final prediction score is 0.80642.
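One way the cluster labels could be attached as features is sketched below. One-hot encoding the labels is an assumption, the PCA step is left out for brevity, and the data, column names and cluster count are made up.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-ins for the population and mailout feature matrices.
rng = np.random.default_rng(0)
population = rng.normal(size=(300, 4))
mailout = pd.DataFrame(rng.normal(size=(20, 4)),
                       columns=["f0", "f1", "f2", "f3"])

# Fit the clusters on the population, then label the mailout rows.
n_clusters = 3
kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
kmeans.fit(population)
labels = kmeans.predict(mailout.to_numpy())

# One-hot encode the labels (assumed encoding) with a fixed column set.
cluster_dummies = pd.get_dummies(
    pd.Series(labels, index=mailout.index), prefix="cluster", dtype=float)
cluster_dummies = cluster_dummies.reindex(
    columns=[f"cluster_{k}" for k in range(n_clusters)], fill_value=0.0)

# Append the cluster indicators to the original features.
extended = pd.concat([mailout, cluster_dummies], axis=1)
```

Reindexing to a fixed column list keeps the train and test sets aligned even if some cluster never occurs in one of them.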


This project aimed to help the company create a more efficient way of targeting people who are more likely to become customers. Using principal component analysis and the unsupervised technique KMeans, the original data set was condensed to 55 components and grouped into 10 clusters. A comparison of center values shows significant differences between cluster 2 and cluster 8, which made it possible to identify the parts of the population that best describe the core customer base of the German company. After that, the supervised learning tool XGBClassifier was used to identify the important customer features and to predict whether a person would respond to a marketing campaign. The final Kaggle score is above 0.80, which could help the German mail-order company target customers more accurately and significantly reduce the cost of mailout campaigns.

Future Work

There are a couple of possible improvements for the project. More time could be spent on data wrangling and feature engineering. Instead of adopting the features without descriptions directly, more work should be done with the mail-order company to understand the data. Missing data could be filled by other methods instead of uniformly using the most frequent value. Different scalers could be employed for different types of data. Furthermore, more hyperparameter tuning could improve the prediction model, and other models and algorithms could be tried to compare the results.

Finally, I would like to thank Udacity and Bertelsmann Arvato Analytics for providing this excellent project and real-life data.

For the detailed analysis and code, see my GitHub.