Customer Segmentation Report for Arvato Financial Services

Understanding customers through clusters

This blog post is part of the capstone project for Udacity’s Data Scientist Nanodegree program. The work is based on a real-life data science problem, and the data is provided by Udacity’s partners at Bertelsmann Arvato Analytics.

In this project, I analyze demographic data for customers of a mail-order sales company in Germany and compare it against demographic information for the general population.

The data has been provided by Bertelsmann Arvato Analytics and represents a real-life data science task.

The goal of this project is to predict which individuals are most likely to become customers of a mail-order sales company in Germany.

The approach to solving the problem has the following phases:

Finally, the area under the Receiver Operating Characteristic curve (ROC AUC) is used as the evaluation metric for this project. Predictions for the test set are submitted to a Kaggle competition for evaluation.

Part 0: Get to Know the Data

There are four data files associated with this project:

The AZDIAS dataset contains 891 211 persons (rows) x 366 features (columns).
The CUSTOMERS dataset contains 191 652 persons (rows) x 369 features (columns).

Step 1. Handle missing data
First, during data exploration, I collected information about missing data in the dataset. Identifying and understanding missing data at an early stage helps determine how much data is actually available for the later processing steps.

We found that many columns (features) have a high percentage of missing data.

Let’s look at the distribution of missing values. We can see that many columns have more than 70%, and some even more than 90%, missing values.

Distribution of missing values by percentage (%)
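A minimal sketch of how the per-column missing percentages might be computed with pandas. The toy frame, its column names, and the 70% drop threshold are assumptions for illustration, not the exact values used in the project.

```python
import numpy as np
import pandas as pd


def missing_percent(df: pd.DataFrame) -> pd.Series:
    """Return the percentage of missing values per column, highest first."""
    return (df.isna().mean() * 100).sort_values(ascending=False)


def drop_sparse_columns(df: pd.DataFrame, threshold: float = 70.0) -> pd.DataFrame:
    """Drop columns whose missing percentage exceeds `threshold` (assumed cutoff)."""
    pct = missing_percent(df)
    return df.drop(columns=pct[pct > threshold].index)


# Toy frame standing in for AZDIAS (hypothetical columns A, B, C).
toy = pd.DataFrame({
    "A": [1, 2, 3, 4],
    "B": [np.nan, np.nan, np.nan, 1],
    "C": [1, np.nan, 3, 4],
})
pct = missing_percent(toy)
cleaned = drop_sparse_columns(toy, threshold=70.0)
```

On the real data, plotting `pct` as a histogram gives the kind of distribution described above.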

From the above data we identified the following points:

Step 2. Fill columns that have some missing data. Replace ‘NaN’ with the most frequent value
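Filling missing values with the most frequent value can be sketched as follows. This is one possible implementation using the pandas `mode` method; the helper name `fill_with_mode` and the toy data are assumptions.

```python
import numpy as np
import pandas as pd


def fill_with_mode(df: pd.DataFrame) -> pd.DataFrame:
    """Replace missing values in each column with that column's most frequent value."""
    filled = df.copy()
    for col in filled.columns:
        modes = filled[col].mode(dropna=True)
        # A column with no observed values has no mode and is left untouched.
        if not modes.empty:
            filled[col] = filled[col].fillna(modes.iloc[0])
    return filled


# Toy frame with one numeric and one categorical column (hypothetical names).
toy = pd.DataFrame({
    "X": [1.0, 1.0, np.nan, 2.0],
    "Y": ["a", None, "b", "b"],
})
filled = fill_with_mode(toy)
```

When several values tie for most frequent, `mode` returns them sorted and this sketch takes the first.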

I used unsupervised learning techniques to describe the relationship between the demographics of the company’s existing customers and the general population of Germany. The aim of this part is to describe which parts of the general population are more likely to belong to the mail-order company’s main customer base, and which parts are less likely to.

Dimensionality reduction is needed because many features are correlated. After the pre-processing step, the general population data (azdias.csv) has 415405 rows and 283 columns. I dropped less important features and outlier data, but the data is still high-dimensional. So I use Principal Component Analysis (PCA) to reduce the dimensionality and make unsupervised learning more efficient in the following steps.

PCA

Based on the above chart, we can see that around 220 components still retain a high share of the cumulative variance.
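As a rough sketch of how the number of components could be read off the cumulative explained variance, using scikit-learn's `PCA`. The 90% target, the standard scaling step, and the synthetic data are assumptions; the post does not state which threshold or scaling was used.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def components_for_variance(X, target=0.90):
    """Return the smallest number of components reaching `target` cumulative variance."""
    X_scaled = StandardScaler().fit_transform(X)
    pca = PCA().fit(X_scaled)
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    # searchsorted gives the first index where cumulative >= target.
    n = int(np.searchsorted(cumulative, target)) + 1
    return min(n, len(cumulative)), cumulative


# Synthetic data: 20 correlated features driven by 3 latent factors plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 20))
X = latent @ mixing + 0.05 * rng.normal(size=(500, 20))

n_components, cumulative = components_for_variance(X, target=0.90)
```

Plotting `cumulative` against the component index gives a curve like the one discussed above.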

With the dimensionality reduced, let’s do the clustering. To decide on the number of clusters, we use the elbow method.

Using the elbow method, we found that the average within-cluster distance almost flattens at around 12 clusters. So we use 12 clusters for the segmentation task.

Review and analyze the clustering data. Customers are over-represented in one particular cluster (1).
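The elbow method can be sketched with scikit-learn's `KMeans`. This sketch uses the average squared distance to the nearest centroid (`inertia_` divided by the sample count) as the score, which is an assumption about the exact distance measure; the blob data is synthetic.

```python
import numpy as np
from sklearn.cluster import KMeans


def elbow_scores(X, k_values, seed=0):
    """Return the average squared distance to the nearest centroid for each k."""
    scores = []
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
        # inertia_ is the sum of squared distances of samples to their centroid.
        scores.append(model.inertia_ / X.shape[0])
    return scores


# Three well-separated synthetic blobs, so the elbow should appear at k = 3.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(size=(100, 2)) for c in centers])

k_values = list(range(1, 7))
scores = elbow_scores(X, k_values)
```

The score drops sharply up to the true number of blobs and changes little afterwards; the point where the curve flattens is the elbow.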
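One way to quantify over- and under-representation is to compare the share of each cluster among customers with its share in the general population. The sketch below assumes cluster labels are already available for both groups; the label arrays and the helper name `cluster_representation` are made up for illustration.

```python
import pandas as pd


def cluster_representation(pop_labels, cust_labels, n_clusters):
    """Compare cluster shares of customers against the general population."""
    idx = range(n_clusters)
    pop_share = pd.Series(pop_labels).value_counts(normalize=True).reindex(idx, fill_value=0.0)
    cust_share = pd.Series(cust_labels).value_counts(normalize=True).reindex(idx, fill_value=0.0)
    table = pd.DataFrame({"population": pop_share, "customers": cust_share})
    # A ratio above 1 means the cluster is over-represented among customers.
    table["ratio"] = table["customers"] / table["population"]
    return table


# Made-up cluster labels for 100 population members and 100 customers.
pop_labels = [0] * 50 + [1] * 25 + [2] * 25
cust_labels = [0] * 20 + [1] * 70 + [2] * 10

table = cluster_representation(pop_labels, cust_labels, n_clusters=3)
over_represented = table["ratio"].idxmax()
```

In this toy example cluster 1 holds 70% of customers but only 25% of the population, so it stands out in the same way as the cluster described above.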

Step 1: Clean the data using the same preprocessing pipeline built for the AZDIAS and CUSTOMERS datasets

Step 2: Split the data into training and validation sets using a sampling technique

Step 3: Evaluate and pick the best-performing algorithm. We tried three algorithms (AdaBoostRegressor, GradientBoostingRegressor and XGBRegressor)

Step 4: Fine-tune the algorithm. We picked XGBRegressor, using the ROC AUC evaluation metric to decide on the best algorithm to use and fine-tune
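Steps 2 to 4 can be sketched with scikit-learn as below. Several things here are assumptions rather than the project's actual setup: the data is synthetic, the split is a stratified 80/20 hold-out, the parameter grid is illustrative, and GradientBoostingRegressor stands in for XGBRegressor in the tuning step so that the sketch only depends on scikit-learn. If xgboost is installed, an `XGBRegressor` can be added to the `candidates` dictionary in the same way.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostRegressor, GradientBoostingRegressor
from sklearn.metrics import make_scorer, roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# Synthetic, imbalanced stand-in for the cleaned training data (not the real data).
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5,
    weights=[0.8, 0.2], random_state=0,
)

# Step 2: hold out a validation set, keeping the class ratio in both parts.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Step 3: fit each candidate and score its continuous predictions with ROC AUC.
candidates = {
    "AdaBoostRegressor": AdaBoostRegressor(random_state=0),
    "GradientBoostingRegressor": GradientBoostingRegressor(random_state=0),
}
val_scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    val_scores[name] = roc_auc_score(y_val, model.predict(X_val))
best_name = max(val_scores, key=val_scores.get)

# Step 4: tune one model over a small, illustrative grid, scored by ROC AUC.
search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [2, 3]},
    scoring=make_scorer(roc_auc_score),
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
)
search.fit(X_train, y_train)
```

After fitting, `search.best_params_` holds the selected settings and `search.best_score_` the mean cross-validated ROC AUC. Regressors are scored through `predict`, since ROC AUC only needs a ranking of the samples.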

Feature Importance

In unsupervised learning, we identified various clusters, some of which are over-represented and some under-represented among customers.

From the real-life demographic data provided by Arvato Financial Services, we were able to create a segmentation of customers and to identify key features that help identify potential customers for the company.

It was a great learning experience in how to approach a problem methodically. The most challenging part (and the biggest learning) for me was getting the data cleansed and processed without losing key information.

As a final note, I would like to point out a couple of possible improvements that can be made to the project.

Finally, I would like to thank Udacity and Arvato Analytics for providing a fantastic platform and an exciting opportunity to work on a real-life problem that helped me put my data science skills into practice.
I hope you find some of the information in this blog post useful.

Thank you for your time.
