Customer Churn Prediction

Customer attrition, also known as customer churn, customer turnover, or customer defection, is the loss of clients or customers.

Telephone service companies, Internet service providers, pay TV companies, insurance firms, and alarm monitoring services often use customer attrition analysis and customer attrition rates as key business metrics, because the cost of retaining an existing customer is far less than the cost of acquiring a new one. Companies in these sectors often have customer service branches that attempt to win back defecting clients, because recovered long-term customers can be worth much more to a company than newly recruited clients.

Companies usually distinguish between voluntary and involuntary churn. Voluntary churn occurs when the customer decides to switch to another company or service provider; involuntary churn occurs due to circumstances such as the customer's death or relocation to a long-term care facility or a distant location. In most applications, involuntary reasons for churn are excluded from the analytical models. Analysts tend to concentrate on voluntary churn, because it typically stems from aspects of the company-customer relationship that companies control, such as how billing interactions are handled or how after-sales help is provided.

Predictive analytics uses churn prediction models that estimate each customer's propensity to churn. Because these models generate a small, prioritized list of potential defectors, they are effective at focusing customer retention marketing programs on the subset of the customer base that is most vulnerable to churn.

Dataset and Features

The quality and quantity of data will directly determine how good your predictive model can be.

The data records customers who left within the last month. It is a sample dataset provided by IBM and is available on Kaggle. It contains 7043 rows (customers) and 21 columns (customer attributes). The “Churn” column is the target variable.

The dataset includes information about:

Churn – Target column, categorical (Yes/No)
customerID – Anonymized unique ID
gender – Categorical (Male/Female)
SeniorCitizen – Categorical (0/1)
Partner – Whether the customer has a partner or not (Yes, No)
Dependents – Whether the customer has dependents or not (Yes, No)
tenure – Number of months the customer has stayed with the company
PhoneService – Whether the customer has a phone service or not (Yes, No)
MultipleLines – Whether the customer has multiple lines or not (Yes, No, No phone service)
InternetService – Customer’s internet service provider (DSL, Fiber optic, No)
OnlineSecurity – Whether the customer has online security or not (Yes, No, No internet service)
OnlineBackup – Whether the customer has online backup or not (Yes, No, No internet service)
DeviceProtection – Whether the customer has device protection or not (Yes, No, No internet service)
TechSupport – Whether the customer has tech support or not (Yes, No, No internet service)
StreamingTV – Whether the customer has streaming TV or not (Yes, No, No internet service)
StreamingMovies – Whether the customer has streaming movies or not (Yes, No, No internet service)
Contract – The contract term of the customer (Month-to-month, One year, Two years)
PaperlessBilling – Whether the customer has paperless billing or not (Yes, No)
PaymentMethod – The customer’s payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic))
MonthlyCharges – The amount charged to the customer monthly
TotalCharges – The total amount charged to the customer

Exploratory Data Analysis

The dataset contains 7043 rows and 21 features.

Features: ['customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents', 'tenure', 'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod', 'MonthlyCharges', 'TotalCharges', 'Churn']

When the dataset is first loaded, no column reports missing (NaN) values.

For the "TotalCharges" feature, however, pandas did not infer a float64 dtype, so the column probably contains some non-numeric values.

# convert all TotalCharges to numeric and set the invalid parsings/errors as NaN
telcom['TotalCharges'] = pd.to_numeric(telcom['TotalCharges'], errors='coerce')

After the conversion there are 11 rows where "TotalCharges" is null, and in all of them tenure is equal to zero, so we drop these data points.
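The conversion and clean-up described above can be sketched as follows. This is a minimal illustration on a made-up four-row frame, not the real data; the blank-string values are an assumption about what the non-numeric entries look like.

```python
import pandas as pd

# Toy stand-in for the Telco data; the values are invented for illustration.
telcom = pd.DataFrame({
    "tenure": [0, 12, 0, 34],
    "TotalCharges": [" ", "350.5", " ", "1889.5"],
})

# Blank strings cannot be parsed, so errors="coerce" turns them into NaN.
telcom["TotalCharges"] = pd.to_numeric(telcom["TotalCharges"], errors="coerce")

# Inspect the affected rows before dropping them.
missing = telcom[telcom["TotalCharges"].isnull()]
print(len(missing), "rows with null TotalCharges; tenure values:",
      missing["tenure"].unique())

# Drop the rows that have no usable TotalCharges value.
telcom = telcom.dropna(subset=["TotalCharges"]).reset_index(drop=True)
print(telcom)
```

Checking the tenure values of the null rows before dropping them is what confirms that only customers with zero tenure are affected.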

The "tenure" feature has 73 unique values. We convert it into a categorical feature with 5 bins: 0-12, 12-24, 24-48, 48-60, and above 60.
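One way to do this binning is with pandas' cut function. The sketch below assumes right-closed intervals, so a tenure of exactly 12 falls into the 0-12 bin; the text does not say how the boundaries are handled, so treat that choice as an assumption.

```python
import pandas as pd

# Example tenure values in months (invented for illustration).
tenure = pd.Series([1, 12, 13, 24, 30, 48, 55, 60, 61, 72], name="tenure")

# Right-closed bins: (0, 12], (12, 24], (24, 48], (48, 60], (60, inf).
# The edge handling is an assumption; the ranges in the text share boundaries.
bins = [0, 12, 24, 48, 60, float("inf")]
labels = ["0-12", "12-24", "24-48", "48-60", "above 60"]

tenure_group = pd.cut(tenure, bins=bins, labels=labels)
print(tenure_group.value_counts(sort=False))
```

With these edges a tenure of 0 would not fall into any bin, which is harmless if the zero-tenure rows have already been dropped.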

Graphical Analysis

Numerical Variables

In this part we will look into our numerical variables, how they are distributed, how they relate to each other and how they can help us to predict the ‘Churn’ variable.

There are only three numerical columns: tenure, MonthlyCharges and TotalCharges. Their probability density distributions can be estimated with the Seaborn kdeplot function.

From the plots above, we can conclude that:

Recent clients are more likely to churn
Clients with higher MonthlyCharges are also more likely to churn
Tenure and MonthlyCharges are probably important features

Categorical variables

This dataset has 16 categorical features:

Six binary features (Yes/No)
Nine features with three unique values each (categories)
One feature with four unique values

Partner and dependents

From the plots above, we can conclude that:

Customers that don't have partners are more likely to churn
Customers without dependents are also more likely to churn

Contract and Payment

A few observations:

Customers with paperless billing are more likely to churn
The most common payment method is Electronic check, used by around 35% of customers. This method also has a very high churn rate
Short term contracts have higher churn rates
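Observations like these can be checked numerically by computing the churn rate within each category. The sketch below does this for Contract on a small invented frame; the numbers it prints describe the toy data only.

```python
import pandas as pd

# Toy data, invented for illustration.
telcom = pd.DataFrame({
    "Contract": ["Month-to-month", "Month-to-month", "One year", "Two year",
                 "Month-to-month", "Two year", "One year", "Month-to-month"],
    "Churn": ["Yes", "No", "No", "No", "Yes", "No", "Yes", "Yes"],
})

# Share of churners within each contract type, highest first.
churn_rate = (
    telcom["Churn"].eq("Yes")
    .groupby(telcom["Contract"])
    .mean()
    .sort_values(ascending=False)
)
print(churn_rate)
```

Replacing "Contract" with "PaymentMethod" or "PaperlessBilling" gives the corresponding breakdown for the other features.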

One-year and two-year contracts probably carry contractual penalties for early termination, so customers have to wait until the end of the contract to churn.

These observations matter when we design retention campaigns, because they show where to focus.

We now have a better picture of which variables matter most. For example, a month-to-month contract is a strong indicator that a client might leave soon, and so is the Electronic check payment method. Being a senior citizen, on the other hand, is also a good predictor, but senior citizens make up only a small share of the company's clients, so you might prefer to focus first on the variables that deliver the best results.

Machine Learning Models And Performance Evaluation

We will use Logistic Regression, Decision Tree, Random Forest and SVM. Before training, we set aside a test dataset.

We need to encode the categorical variables as numbers. For this we will use scikit-learn's LabelEncoder.

# categorical variable encoding
# (objects_ds is assumed to be the subset of telcom holding the object-dtype columns)
cat_vars_list = objects_ds.columns.tolist()

# Label Encoder
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
for i in cat_vars_list:
    telcom[i] = le.fit_transform(telcom[i])

Next, we divide the dataset into training and test sets, using a 70/30 split.
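A 70/30 split can be sketched with scikit-learn's train_test_split. The feature matrix and labels below are random placeholders, and the random_state and stratify arguments are assumptions added for reproducibility and class balance; the text does not say whether they were used.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Placeholder features and a roughly 27%-positive target (not the real data).
X = rng.normal(size=(1000, 5))
Y = (rng.random(1000) < 0.27).astype(int)

# test_size=0.3 keeps 70% of the rows for training and 30% for testing.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.3, random_state=42, stratify=Y
)
print(X_train.shape, X_test.shape)
```

Stratifying on the target keeps the share of churners about the same in both parts, which matters when the classes are imbalanced.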

Logistic Regression

from sklearn.linear_model import LogisticRegression

logisticReg = LogisticRegression()
result = logisticReg.fit(X_train, Y_train)
Y_pred_lr = logisticReg.predict(X_test)

The accuracy of the model on the test dataset is 0.7962085308056872, while on the training dataset it is 0.8051605038602194. Precision is 0.631915 and recall is 0.536101. The confusion matrix is as follows:
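Metrics like these can be computed with sklearn.metrics. The sketch below uses ten hand-written labels instead of the real predictions (in the article's code the second argument would be Y_pred_lr), so the values printed are for the toy labels only.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Hand-written labels for illustration: 1 = churn, 0 = no churn.
Y_test = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
Y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# For binary labels the matrix is [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(Y_test, Y_pred).ravel()

accuracy = accuracy_score(Y_test, Y_pred)    # (TP + TN) / all samples
precision = precision_score(Y_test, Y_pred)  # TP / (TP + FP)
recall = recall_score(Y_test, Y_pred)        # TP / (TP + FN)

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

Recall is the metric to watch for churn: it is the share of actual churners that the model manages to flag.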

Decision Tree

# decision tree
from sklearn import tree, metrics

dec_tree = tree.DecisionTreeClassifier()
result_tree = dec_tree.fit(X_train, Y_train)
Y_pred_dt = dec_tree.predict(X_test)

The accuracy is 0.7137440758293839 on the test set and 0.9985778138967899 on the training set, so the unrestricted tree is overfitting. Note that the test accuracy is equal to the tree's score, i.e.,

dec_tree.score(X_test, Y_test) == metrics.accuracy_score(Y_test, Y_pred_dt)

We need to regularize the decision tree to reduce overfitting. Here, we limit the max_depth of the tree to 4, which improves the test accuracy to 0.7895734597156399, with a training accuracy of 0.7927671678179602. The small gap between the two indicates that the model is no longer overfitted.
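The effect of limiting the depth can be reproduced on synthetic data. The sketch below uses make_classification with label noise as a stand-in for the churn data, so the accuracies it prints will differ from the figures reported above; the random_state values are assumptions for reproducibility.

```python
from sklearn import metrics, tree
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic, noisy stand-in for the churn data (not the real dataset).
X, Y = make_classification(
    n_samples=1000, n_features=10, n_informative=4, flip_y=0.2, random_state=0
)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.3, random_state=0
)

results = {}
for depth in (None, 4):  # None lets the tree grow until every leaf is pure
    dec_tree = tree.DecisionTreeClassifier(max_depth=depth, random_state=0)
    dec_tree.fit(X_train, Y_train)
    train_acc = dec_tree.score(X_train, Y_train)
    test_acc = metrics.accuracy_score(Y_test, dec_tree.predict(X_test))
    results[depth] = (train_acc, test_acc)
    print(f"max_depth={depth}: train={train_acc:.3f}, test={test_acc:.3f}")
```

A large gap between training and test accuracy for the unrestricted tree, and a smaller gap for the shallow one, is the pattern to look for.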

Precision is 0.631579 and recall is 0.476534. Confusion matrix:

Visualization of the decision tree:
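A fitted tree can also be inspected as plain text with scikit-learn's export_text. The sketch below fits a depth-2 tree on synthetic data; the four feature names are borrowed from the dataset purely as labels and have no relation to the generated values.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data; the feature names are labels for illustration only.
X, Y = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
feature_names = ["tenure", "MonthlyCharges", "TotalCharges", "Contract"]

dec_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, Y)

# Text rendering of the split rules, one line per node.
tree_text = export_text(dec_tree, feature_names=feature_names)
print(tree_text)
```

For a graphical rendering, sklearn.tree.plot_tree accepts the same fitted estimator and feature names.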

Random Forest

from sklearn.ensemble import RandomForestClassifier

randomForest = RandomForestClassifier()
randomForest.fit(X_train, Y_train)
Y_pred_rf = randomForest.predict(X_test)

The random forest reaches a test accuracy of 0.7748815165876777 and a training accuracy of 0.9985778138967899, so the model is overfitting. To fix this, we restrict n_estimators (default 10 in older scikit-learn releases, 100 since version 0.22) to 4 and max_depth to 4. This gives an accuracy of 0.7867298578199052 on the test set and 0.7990654205607477 on the training set.
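The restricted forest can be written as below. As before, the data is a synthetic stand-in and random_state is an assumption, so only the parameter usage carries over to the real problem, not the printed accuracies.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic, noisy stand-in for the churn data (not the real dataset).
X, Y = make_classification(
    n_samples=1000, n_features=10, n_informative=4, flip_y=0.2, random_state=0
)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.3, random_state=0
)

# Four shallow trees instead of a large forest of fully grown ones.
randomForest = RandomForestClassifier(n_estimators=4, max_depth=4, random_state=0)
randomForest.fit(X_train, Y_train)

train_acc = randomForest.score(X_train, Y_train)
test_acc = randomForest.score(X_test, Y_test)
print(f"train={train_acc:.3f}, test={test_acc:.3f}")
```

In practice max_depth usually does most of the work against overfitting; using only four trees mainly makes the model cheaper and somewhat noisier.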

Precision is 0.628079 and recall is 0.460289. Confusion matrix:

Support Vector Machine

# SVM
from sklearn.svm import SVC

svc_cl = SVC()
svc_cl.fit(X_train, Y_train)
Y_pred_svm = svc_cl.predict(X_test)

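The accuracy of the model on the test dataset is 0.7374407582938388, while on the training dataset it is 0.7328321820398213. Precision and recall are both 0, which means the model did not correctly identify a single churner in the test set. The confusion matrix is as follows: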
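To see how an accuracy above 0.7 can coexist with zero precision and recall, compare against a baseline that always predicts the majority class. The sketch below uses scikit-learn's DummyClassifier on invented labels with 27% churners; it illustrates the effect and is not a diagnosis of the SVM above.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Invented labels: 73 non-churners (0) and 27 churners (1).
Y_test = np.array([0] * 73 + [1] * 27)
X_test = np.zeros((100, 1))  # features are ignored by the dummy model

# Always predicts the most frequent class, i.e. "no churn".
baseline = DummyClassifier(strategy="most_frequent").fit(X_test, Y_test)
Y_pred = baseline.predict(X_test)

accuracy = accuracy_score(Y_test, Y_pred)
# zero_division=0 returns 0 instead of warning when nothing is predicted positive.
precision = precision_score(Y_test, Y_pred, zero_division=0)
recall = recall_score(Y_test, Y_pred)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

A model whose accuracy matches the majority-class share while its recall is 0 has learned nothing useful about churners, which is why accuracy alone is a poor yardstick on imbalanced data.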