# How Engage Affects Retention

This analysis is a work in progress. We’ll attempt to analyze the effect that using Engage has on retention for Business customers.

The dataset we’ll use contains around 11.8K Business customers that started subscriptions in or after January 2019. We’ll separate the customers into two groups defined by whether or not they replied to a comment in Engage.

## Preliminary Findings

The data suggests that Business customers that replied to a comment in Engage churn at significantly lower rates than those that did not use Engage. This outcome is likely affected by sampling bias – those customers that used Engage likely have Instagram accounts with more followers and engagement, which would have granted them earlier access to the product. Customers that churned were less likely to have been invited to use Engage.

When all Business customers have access to Engage and the sampling is less biased, we’ll be able to get a better understanding of Engage’s effect on retention and churn.

## Data Exploration

Let’s count the number of customers that used Engage.

# load packages (scales provides percent())
library(dplyr)
library(scales)

# count engage users
users %>%
  count(used_engage) %>%
  mutate(prop = percent(n / sum(n)))
## # A tibble: 2 x 3
##   used_engage     n prop
##   <lgl>       <int> <chr>
## 1 FALSE       11575 98%
## 2 TRUE          263 2%

Only 263 (2%) of these customers replied to a comment in Engage. Next we will use a technique called survival analysis to compare churn and retention rates of these customers.

## Survival Analysis

Survival analysis is a branch of statistics for analyzing the expected duration of time before a certain event occurs. It is commonly used in the medical field to analyze mortality rates, hence the name “survival”.

It’s especially useful when the data is right-censored, meaning there are cases in which the event hasn’t happened yet, but will likely happen at some time in the future. In the output below, see the numbers with a “+”. These refer to customers that haven’t churned yet: their time to churn is at least X days, written as X “+”.

# build survival object
library(survival)
km <- Surv(users$time_to_cancel, users$canceled)

# preview data
head(km, 20)
##  [1] 162+   0    1+   8+ 408+ 566+  10+  10+ 496+ 443+ 622+ 455+ 503+ 670   52
## [16] 111+ 185   78   60  243
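
To make the censoring concrete, here is a minimal sketch of how `time_to_cancel` and `canceled` might be derived from raw subscription dates. The `subs` data frame, its `created_at` and `canceled_at` columns, and the `snapshot_date` are all hypothetical and are not taken from the actual dataset.

```r
library(survival)

# Hypothetical subscription records; column names are assumptions
subs <- data.frame(
  created_at  = as.Date(c("2019-01-10", "2019-03-01", "2019-06-15")),
  canceled_at = as.Date(c("2019-04-20", NA, "2019-06-15"))
)

# Hypothetical date on which the data was pulled
snapshot_date <- as.Date("2020-01-01")

# A customer has churned only if a cancellation date exists
subs$canceled <- !is.na(subs$canceled_at)

# Customers that haven't canceled are right-censored at the snapshot date
end_date <- subs$canceled_at
end_date[is.na(end_date)] <- snapshot_date

# Days from subscription start to cancellation (or to the snapshot)
subs$time_to_cancel <- as.numeric(end_date - subs$created_at)

# The second customer is still active, so it prints with a "+"
Surv(subs$time_to_cancel, subs$canceled)
```

The second customer has been subscribed for 306 days without canceling, so all we know is that their time to churn is more than 306 days.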

To begin our analysis, we use the formula Surv(time_to_cancel, canceled) ~ 1 and the survfit() function to produce the Kaplan-Meier estimates of the probability of “survival” over time. The times parameter of the summary() function gives some control over which times to print. Here, it is set to print the estimates for 1, 7, 14, 30, 60 and 90 days, and then every 90 days thereafter.

# get survival probabilities
km_fit <- survfit(Surv(time_to_cancel, canceled) ~ 1, data = users)

# summarise
summary(km_fit, times = c(1, 7, 14, 30, 60, 90 * (1:10)))
## Call: survfit(formula = Surv(time_to_cancel, canceled) ~ 1, data = users)
##
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##     1  11729     181    0.985 0.00113        0.982        0.987
##     7  11433     150    0.972 0.00152        0.969        0.975
##    14  11258      66    0.966 0.00166        0.963        0.970
##    30  10750     400    0.931 0.00235        0.927        0.936
##    60   9309     873    0.853 0.00333        0.847        0.860
##    90   8165     702    0.787 0.00389        0.780        0.795
##   180   5859    1121    0.672 0.00461        0.663        0.681
##   270   4192     666    0.589 0.00504        0.580        0.599
##   360   2933     374    0.531 0.00538        0.520        0.541
##   450   1567     609    0.409 0.00603        0.397        0.421
##   540    890     123    0.370 0.00641        0.358        0.383
##   630    327      44    0.343 0.00721        0.329        0.358

The survival column shows the estimated proportion of Business customers still active a given number of days after subscribing. For example, around 93% of Business subscriptions are still active at day 30 and around 85% are still active at day 60.
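
To show where the survival column comes from, here is a minimal sketch of the Kaplan-Meier calculation on a small made-up dataset. The `toy` data frame and its values are hypothetical and are not drawn from `users`. At each time where a cancellation occurs, the running estimate is multiplied by one minus the share of at-risk customers that canceled at that time.

```r
library(survival)

# Toy data (hypothetical): days until cancellation or censoring
toy <- data.frame(
  time     = c(5, 10, 10, 20, 30, 30, 40),
  canceled = c(1,  1,  0,  1,  0,  1,  0)
)

# Distinct times at which at least one cancellation occurred
event_times <- sort(unique(toy$time[toy$canceled == 1]))

# Customers still at risk just before each event time
n_at_risk <- sapply(event_times, function(d) sum(toy$time >= d))

# Cancellations at each event time
n_events <- sapply(event_times, function(d) sum(toy$time == d & toy$canceled == 1))

# Kaplan-Meier estimate: cumulative product of (1 - events / at risk)
km_by_hand <- cumprod(1 - n_events / n_at_risk)

# The same estimate from survfit()
fit <- survfit(Surv(time, canceled) ~ 1, data = toy)
km_survfit <- summary(fit, times = event_times)$surv

# Compare the two sets of estimates
all.equal(km_by_hand, km_survfit)
```

Censored customers never count as events, but they do stay in the at-risk count until the day they drop out of observation. This is how the estimator uses their partial information.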

This curve can be plotted.

You can see dips in the curve every 30 days, which makes sense given the monthly billing period of most subscriptions. Now let’s segment the customers by whether or not they used Engage and fit survival curves for each segment.

# get survival probabilities
km_fit <- survfit(Surv(time_to_cancel, canceled) ~ used_engage, data = users)

# summarise
summary(km_fit, times = c(1, 7, 14, 30, 60, 90, 180))
## Call: survfit(formula = Surv(time_to_cancel, canceled) ~ used_engage,
##     data = users)
##
##                 used_engage=FALSE
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##     1  11466     181    0.984 0.00115        0.982        0.987
##     7  11172     149    0.971 0.00155        0.968        0.974
##    14  10997      65    0.966 0.00170        0.962        0.969
##    30  10495     397    0.930 0.00239        0.925        0.935
##    60   9061     870    0.850 0.00339        0.844        0.857
##    90   7928     697    0.783 0.00396        0.776        0.791
##   180   5657    1110    0.666 0.00468        0.657        0.675
##
##                 used_engage=TRUE
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##     1    263       0    1.000 0.00000        1.000        1.000
##     7    261       1    0.996 0.00380        0.989        1.000
##    14    261       1    0.992 0.00537        0.982        1.000
##    30    255       3    0.981 0.00847        0.964        0.998
##    60    248       3    0.969 0.01072        0.948        0.990
##    90    237       5    0.949 0.01372        0.923        0.977
##   180    202      11    0.902 0.01914        0.865        0.940

We can see that, although the sample of Engage users is small, they churned at much lower rates than customers that did not use Engage. For example, around 85% of Business customers that did not use Engage are still active at day 60, compared to around 97% of Business customers that did use Engage, and the 95% confidence intervals at day 60 (0.844 to 0.857 versus 0.948 to 0.990) do not overlap.
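
One way to check whether a gap like this is larger than chance alone would produce is a log-rank test, available as survdiff() in the survival package. This test was not part of the analysis above. The sketch below runs it on simulated stand-in data, since `users` isn't reproduced here. The `sim` data frame, the group sizes and the churn rates are all assumptions chosen for illustration.

```r
library(survival)

set.seed(42)

# Simulated stand-in for `users` (hypothetical values throughout)
n <- 2000
sim <- data.frame(used_engage = runif(n) < 0.1)

# Assumed exponential time to churn, with a lower hazard for Engage users
true_time <- rexp(n, rate = ifelse(sim$used_engage, 1 / 900, 1 / 300))

# Everyone still subscribed after 365 days of follow-up is right-censored
sim$time_to_cancel <- pmin(true_time, 365)
sim$canceled <- true_time <= 365

# Log-rank test of the hypothesis that both groups share one survival curve
lr <- survdiff(Surv(time_to_cancel, canceled) ~ used_engage, data = sim)

# p-value from the chi-squared statistic (degrees of freedom = groups - 1)
p_value <- pchisq(lr$chisq, df = length(lr$n) - 1, lower.tail = FALSE)
p_value
```

A small p-value would only tell us that the curves differ. It says nothing about why they differ, so the sampling bias concern still applies.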

Plotting these survival curves shows a seemingly large difference in churn rates.
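
The original figure isn't reproduced here. As a rough sketch of how a plot like this could be drawn with base graphics, the example below uses the `lung` dataset that ships with the survival package as a stand-in for `users`, with `sex` playing the role of `used_engage`. The colors and labels are arbitrary choices.

```r
library(survival)

# Stand-in data: `lung` ships with the survival package; `sex` replaces `used_engage`
fit <- survfit(Surv(time, status) ~ sex, data = lung)

# One step curve per group, with 95% confidence bands
group_colors <- c("grey40", "steelblue")
plot(
  fit,
  col = group_colors,
  lwd = 2,
  conf.int = TRUE,
  xlab = "Days",
  ylab = "Proportion still active"
)

# Label each curve with its stratum name
legend("topright", legend = names(fit$strata), col = group_colors, lwd = 2, bty = "n")
```

With the real data, the formula would be Surv(time_to_cancel, canceled) ~ used_engage, as in the fit above.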

The data suggests that customers that replied to a comment in Engage churn at significantly lower rates. However, this is likely affected by sampling bias. Those customers that used Engage likely have Instagram accounts with more followers and engagement, which would have granted them earlier access to the product. Customers that churned were less likely to have been invited to use Engage.