Introduction

Cohort analysis is when you group your users based on their actions to understand what compels them to stick around for the long haul. It allows you to make informed product decisions that will reduce churn and drastically increase revenue. You could also call it customer churn analysis.

This analysis is based on a public data set that was made available by IBM and contains customer information that was captured for a telecommunications company. I’ll generate some question via the initial EDA in order to better understand our customer segments.

Data

Let’s have a closer look at the data set.

rm(list=ls())

library(tidyverse)

# reading in the data
df <- read_csv("C:/Users/driku/OneDrive/Documents/Data Science/Churn Models/Telco/Telco_Customer_Churn.csv")

# dimensions of the data
dim(df)
## [1] 7043   21

We learn that our data set contains 7043 observations and has 21 dimensions.
Let’s look at the column names and the variable types.

glimpse(df)
## Observations: 7,043
## Variables: 21
## $ customerID       <chr> "7590-VHVEG", "5575-GNVDE", "3668-QPYBK", "77...
## $ gender           <chr> "Female", "Male", "Male", "Male", "Female", "...
## $ SeniorCitizen    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Partner          <chr> "Yes", "No", "No", "No", "No", "No", "No", "N...
## $ Dependents       <chr> "No", "No", "No", "No", "No", "No", "Yes", "N...
## $ tenure           <dbl> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 5...
## $ PhoneService     <chr> "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes"...
## $ MultipleLines    <chr> "No phone service", "No", "No", "No phone ser...
## $ InternetService  <chr> "DSL", "DSL", "DSL", "DSL", "Fiber optic", "F...
## $ OnlineSecurity   <chr> "No", "Yes", "Yes", "Yes", "No", "No", "No", ...
## $ OnlineBackup     <chr> "Yes", "No", "Yes", "No", "No", "No", "Yes", ...
## $ DeviceProtection <chr> "No", "Yes", "No", "Yes", "No", "Yes", "No", ...
## $ TechSupport      <chr> "No", "No", "No", "Yes", "No", "No", "No", "N...
## $ StreamingTV      <chr> "No", "No", "No", "No", "No", "Yes", "Yes", "...
## $ StreamingMovies  <chr> "No", "No", "No", "No", "No", "Yes", "No", "N...
## $ Contract         <chr> "Month-to-month", "One year", "Month-to-month...
## $ PaperlessBilling <chr> "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes"...
## $ PaymentMethod    <chr> "Electronic check", "Mailed check", "Mailed c...
## $ MonthlyCharges   <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89....
## $ TotalCharges     <dbl> 29.85, 1889.50, 108.15, 1840.75, 151.65, 820....
## $ Churn            <chr> "No", "No", "Yes", "No", "Yes", "Yes", "No", ...

Columns appear to be well named. What does tenure stand for? The duration of time the customer stayed with the telco company? Further investigation yields 4 numerical variables and 17 categorical variables. However, the SeniorCitizen variable seems to be a categorical variable as well and I assume the number 0 indicates that the person is not a senior citizen and the number 1 that he/she is. Therefore, we’ll treat the SeniorCitizen variable as a categorical variable.
Let’s cast the categorical variables as factors.

df[1:18] <- lapply(df[1:18], factor)
df[21] <- lapply(df[21], factor)
glimpse(df)
## Observations: 7,043
## Variables: 21
## $ customerID       <fct> 7590-VHVEG, 5575-GNVDE, 3668-QPYBK, 7795-CFOC...
## $ gender           <fct> Female, Male, Male, Male, Female, Female, Mal...
## $ SeniorCitizen    <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Partner          <fct> Yes, No, No, No, No, No, No, No, Yes, No, Yes...
## $ Dependents       <fct> No, No, No, No, No, No, Yes, No, No, Yes, Yes...
## $ tenure           <fct> 1, 34, 2, 45, 2, 8, 22, 10, 28, 62, 13, 16, 5...
## $ PhoneService     <fct> No, Yes, Yes, No, Yes, Yes, Yes, No, Yes, Yes...
## $ MultipleLines    <fct> No phone service, No, No, No phone service, N...
## $ InternetService  <fct> DSL, DSL, DSL, DSL, Fiber optic, Fiber optic,...
## $ OnlineSecurity   <fct> No, Yes, Yes, Yes, No, No, No, Yes, No, Yes, ...
## $ OnlineBackup     <fct> Yes, No, Yes, No, No, No, Yes, No, No, Yes, N...
## $ DeviceProtection <fct> No, Yes, No, Yes, No, Yes, No, No, Yes, No, N...
## $ TechSupport      <fct> No, No, No, Yes, No, No, No, No, Yes, No, No,...
## $ StreamingTV      <fct> No, No, No, No, No, Yes, Yes, No, Yes, No, No...
## $ StreamingMovies  <fct> No, No, No, No, No, Yes, No, No, Yes, No, No,...
## $ Contract         <fct> Month-to-month, One year, Month-to-month, One...
## $ PaperlessBilling <fct> Yes, No, Yes, No, Yes, Yes, Yes, No, Yes, No,...
## $ PaymentMethod    <fct> Electronic check, Mailed check, Mailed check,...
## $ MonthlyCharges   <dbl> 29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89....
## $ TotalCharges     <dbl> 29.85, 1889.50, 108.15, 1840.75, 151.65, 820....
## $ Churn            <fct> No, No, Yes, No, Yes, Yes, No, No, Yes, No, N...

Check for null values.

df_nulls <- df %>%
  select(everything()) %>%  # replace to your needs
  summarise_all(funs(sum(is.na(.))))

t(df_nulls)
##                  [,1]
## customerID          0
## gender              0
## SeniorCitizen       0
## Partner             0
## Dependents          0
## tenure              0
## PhoneService        0
## MultipleLines       0
## InternetService     0
## OnlineSecurity      0
## OnlineBackup        0
## DeviceProtection    0
## TechSupport         0
## StreamingTV         0
## StreamingMovies     0
## Contract            0
## PaperlessBilling    0
## PaymentMethod       0
## MonthlyCharges      0
## TotalCharges       11
## Churn               0

The variable TotalCharges has 11 null values. I’d rather not impute those observation with mean values as it does not add any additional information to the data set. I’d rather just remove them as it is only 11 observations in a data set containing 7043 observations.

df <- df[!is.na(df$TotalCharges),]

#Check again
sum(is.na(df$TotalCharges))
## [1] 0

No more null values and we should have 11 observation less now.

dim(df)
## [1] 7032   21

Because we’re concerned about the churn rate of customers, let’s look at the broad picture first.

ggplot(df, aes(x = Churn, fill = Churn), position = "dodge") + 
  geom_bar() +
  geom_text(stat='count', aes(label=..count..), vjust=2) +
  labs(x = "Churn", y = "Number of Customers") +
  ggtitle("Churners(Yes/No)")  

From the above bar chart we can see that we have lost 1869 customers over a period of 73 months (6 years).

Let’s turn our attention to the number of different levels of the categorical variables:

#create a data.table from the data.frame
dt <- data.table(df)

dt_tbl <- dt %>%
  select_if(is.factor) %>%
  summarise_all(n_distinct)

t(dt_tbl)
##                  [,1]
## customerID       7032
## gender              2
## SeniorCitizen       2
## Partner             2
## Dependents          2
## tenure             72
## PhoneService        2
## MultipleLines       3
## InternetService     3
## OnlineSecurity      3
## OnlineBackup        3
## DeviceProtection    3
## TechSupport         3
## StreamingTV         3
## StreamingMovies     3
## Contract            3
## PaperlessBilling    2
## PaymentMethod       4
## Churn               2

Method

Clearly the customerID variable contains unique ids, which negates time series analysis from this EDA (we can’t follow a customer over time to analyse their behavior that caused them to changes their services portfolio).
From the list of variables the two numerical variables, MonthlyCharges and TotalCharges lends us an opportunity to create a scatter plot. But, unfortunately, the two variables are too heavily correlated (high monthly charges naturally leads to high total charges). All the variables with 2 levels are clearly YES/NO categories. Tenure has 72 categories and alludes to a time frame. Therefore, I’d rather have it converted back to a numerical value. This would allow us to pursue our scatter plot strategy.

#convert tenure back to a numeric
dt$tenure <- as.numeric(dt$tenure)

Plotting tenure against monthly charges reveals the following plot. And, while we are at it let’s overlay the scatter plot with the Yes/No categorical variable Churn.

ggplot(dt) +
       geom_point(aes(x=dt[,dt$MonthlyCharges,],
                      y=dt[,dt$tenure,],
                      colour = Churn)) + 
       labs(x = "Monthy Charges", y = "Months") +
       ggtitle("Scatterplot: MonthlyCharges vs. Tenure")

Let’s remove the non-churners from the plot.

The above plot reveals that there is a cluster of customers that have a disproportionate churn rate for monthly charges between $65 and $95 who churns within the first 18 months. This cohort equates to 657 customers.

Let’s see if we can find out more about these customers by starting off to inspect their payment methods.

filtered_dt <-  dt %>% 
                filter(MonthlyCharges >= 65) %>% 
                filter(MonthlyCharges <= 95) %>%
                filter(tenure <=18) %>%
                filter(Churn == "Yes") %>%  
                group_by(PaymentMethod) %>%
                summarise(Total = n()) %>%
                arrange(desc(Total))

#plot the graph
ggplot(data = filtered_dt, aes(x = reorder(PaymentMethod, Total), y = Total)) + 
  geom_bar(stat="identity") +
  labs(x = "Payment Methods", y = "Number of Customers") +
  ggtitle("Payment methods of churners") + 
  coord_flip() +
  geom_text(aes(label=Total, color = 'red'), vjust=0.3, hjust=1.2, size=5)

What type of contract do they have?

Do they have dependents?

Do they have partners?

Do they watch streaming videos?

Do they watch streaming TV?

Which internet services do they use?

Are our churners senior citizens?

Gender split?

The above list identifies with teenagers and young adults. The fiber optic use might indicate gamers or movie down loaders:
* It might also be that this is the reason why the customers churn: + Unhappy with quality of fiber optics service? + Not enough bandwidth? + Restricted on data downloads? + Too many service interruptions? + Too many drop-outs?

Understanding our bread and butter

Let’s revisit our initial scatter plot and identify three interesting areas.

  • From the above figure we can observe two marked areas:
    • RED = Bread and butter customers (1452 customers)
    • BLACK = Our “crème de la crème” customers (822 customers)

How much revenue did each of these segments earn?

Who is BLACK?

  • The customer in the BLACK cohort has the following in common:
    • They prefer to pay by automatic payment (credit card & bank transfer)
    • Two year contracts
    • More than half have dependents
    • Most have partners
    • Most watch streaming movies & TV
    • Almost 50/50 between Fibre optic and DSL
    • Most are senior citizens

The BLACK cohort is our high income couple/individual or company.

Who is RED?

  • The customer in the RED cohort has the following in common:
    • They prefer to pay by mailed check
    • Two year, Month-to-month and on year contracts
    • More than half don’t have dependents
    • Equal split have partners
    • No streaming movies or TV
    • No internet services (small number have DSL)
    • Not senior citizens
    • Almost exclusively use telephone services

The RED cohort is our pay-as-you-go mobile users.

Understanding our deficiencies

The figure below displays a sparsely highlighted area as demarcated by a GREEN rectangle. This region is of concern as it presents rather empty compared to the rest of the graph. This area contains only 375 customers for a total revenue of $407,949 . What are the reasons for not attracting customers?

Going back in time

In order to understand across which product groupings we lost most of our customers, we need to look back and try and determine when most of our customers churn across any of our product and services.

From the two plots below we can observe that over the period of six years the churn rate is very low for customers who have either utilized the Tech Support and/or Online support services. Expanding on these two service offerings might have a positive effect on the churn rate of the telco’s customers.

ggplot() +
       geom_jitter(aes(x=dt[Churn == "Yes"]$tenure, 
                       y=dt[Churn == "Yes"]$TechSupport), 
                       color = "#00BFC4") + 
       labs(x = "Months", y = "Tech Support") +
       ggtitle("Scatterplot: Tech Support (Churn = Yes)")

ggplot() +
       geom_jitter(aes(x=dt[Churn == "Yes"]$tenure, 
                       y=dt[Churn == "Yes"]$OnlineSecurity), 
                       color = "#00BFC4") + 
       labs(x = "Months", y = "Online Security") +
       ggtitle("Scatterplot: Online Security (Churn = Yes)")

Conclusion

From the above analysis we have gained deeper insight into the Telco’s customer behavior. We have discovered that the largest customer churn occurs early on in a customer’s tenure. We’ve identified the cohort who is most likely to churn as well as the “bread and butter” cohort and the “creme de la creme” cohort. Additionally, we have added some monetary values to these latter two cohorts and now understand the importance of maintaining and further expanding both of them.
The next step would be to look at possible classification models that could predict the churn rate of customers and what the relative monetary effects of that would be.