Analyzing Target Customer Survey Data

This is an exploratory analysis of the target customer survey that users see when they sign up for Publish. We’ll collect all of the responses with the SQL query below. We’ll only include users that connected at least one Profile to Publish in order to exclude drive-by signups.

# connect to redshift
con <- redshift_connect()

-- survey responses for users who connected at least one profile
select
  u.id as user_id
  , u.billing_stripe_customer_id as customer_id
  , date(u.created_at) as signup_at
  , u.billing_plan_name as billing_plan
  , r.id
  , r.attribution
  , r.b2b_or_b2c
  , r.business_type
  , r.direct_or_indirect
  , r.company_size
  , r.plan
  , r.product
  , r.is_target_customer
  , f.did_user_upgrade as upgraded
from dbt.users u
inner join dbt.target_customer_survey_responses r
  on u.id = r.user_id
inner join dbt.profiles as p
  on p.user_id = u.id
left join dbt.user_upgrade_facts as f
  on u.id = f.user_id
group by 1,2,3,4,5,6,7,8,9,10,11,12,13,14

There are around 71 thousand users in this dataset.

Exploratory Analysis

Let’s plot the distribution of responses for each question.

These are interesting distributions. Notice the different scales on each graph. This gives us an idea of the type of user signing up for Publish. Let’s see how likely they are to start trials.

Interesting results. Let’s look at the proportion of users that actually upgraded.

It looks like B2C businesses, physical stores, companies with 5-50 employees, online-first stores, and those who heard of Buffer through an acquaintance upgrade at the highest rates.

Now let’s plot the percentage of signups that are target customers.

Fewer than 2% of new signups fit our target customer criteria. Next, let’s plot the upgrade rates for each response within each category, starting with the is_target_customer field.

Users that fit our target customer criteria (B2B online direct-to-consumer store with fewer than 500 employees) upgrade at a rate of around 6.2%. Non-target customers upgrade at a rate of around 4.2%.
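
As a rough sketch of how rates like these can be computed, the snippet below takes the share of upgraded users within each is_target_customer group. The `toy_users` data frame and its values are made up for illustration; the real analysis would run on the `users` data returned by the query, and base R is used here only to keep the example free of package dependencies.

```r
# Toy data standing in for the `users` data frame (values are made up).
toy_users <- data.frame(
  is_target_customer = c(TRUE, TRUE, TRUE, TRUE,
                         FALSE, FALSE, FALSE, FALSE, FALSE, FALSE),
  upgraded           = c(TRUE, FALSE, FALSE, FALSE,
                         TRUE, FALSE, FALSE, FALSE, FALSE, FALSE)
)

# The mean of a logical column is the share of TRUE values,
# so this gives the upgrade rate within each group.
upgrade_rates <- aggregate(upgraded ~ is_target_customer,
                           data = toy_users, FUN = mean)
upgrade_rates
```

With the toy data, the non-target group upgrades at 1 in 6 and the target group at 1 in 4. The same call on the real data would return the two rates quoted above.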

Logistic Regression

Let’s fit a logistic regression model to get an idea of which responses are correlated with the probability of upgrading.

library(broom)
library(tidyverse)  # dplyr, forcats, stringr and ggplot2 are used below

# change column types to factors
factor_cols <- c("attribution", "b2b_or_b2c", "business_type", "direct_or_indirect", "company_size")
users[factor_cols] <- lapply(users[factor_cols], as.factor)

# plot model output
users %>% 
  mutate(attribution = fct_relevel(attribution, "other"),
         business_type = fct_relevel(business_type, "none"),
         b2b_or_b2c = fct_relevel(b2b_or_b2c, "-"),
         direct_or_indirect = fct_relevel(direct_or_indirect, "-"),
         company_size = fct_relevel(company_size, "na")) %>% 
  glm(upgraded ~ attribution + b2b_or_b2c + business_type + direct_or_indirect + company_size,
      family = "binomial",
      data = .)  %>% 
  tidy(conf.int = TRUE) %>% 
  filter(term != "(Intercept)" & p.value < 0.1) %>% 
  mutate(term = str_replace(term, "company_size", "Size: "),
         term = str_replace(term, "business_type", "Type: "),
         term = str_replace(term, "attribution", "Attribution: "),
         term = fct_reorder(term, estimate)) %>% 
  ggplot(aes(x = estimate, y = term)) +
  geom_point() +
  geom_errorbarh(aes(xmin = conf.low, xmax = conf.high)) +
  labs(x = "estimate",
       y = NULL,
       title = "Logistic Regression Coefficients",
       subtitle = "Company size indicates higher probability of upgrading.")

This plot shows the model coefficients and their confidence intervals. Only terms with a p-value below 0.1 are included, so the confidence intervals generally do not overlap 0. The coefficients are on the log-odds scale: users with a company size of 0-50 employees have a higher probability of upgrading, while users with a personal business type and those who heard of Buffer through social media are less likely to upgrade.

Let’s update the regression model to only include attribution, company size, and business type.

# set factor levels
users <- users %>% 
  mutate(attribution = fct_relevel(attribution, "other"),
         company_size = fct_relevel(company_size, "na"))

# fit model
mod <- glm(upgraded ~ attribution + company_size + business_type, family = "binomial", data = users)

# summarize model
summary(mod)
## 
## Call:
## glm(formula = upgraded ~ attribution + company_size + business_type, 
##     family = "binomial", data = users)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.4133  -0.3582  -0.3155  -0.2337   3.1805  
## 
## Coefficients: (1 not defined because of singularities)
##                             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                 -3.48235    0.11707 -29.745  < 2e-16 ***
## attribution-                -6.61666   72.46303  -0.091 0.927246    
## attributionacquiantance      0.03809    0.05313   0.717 0.473485    
## attributionadvertisement    -0.20104    0.11950  -1.682 0.092505 .  
## attributionblog             -0.48875    0.08684  -5.628 1.82e-08 ***
## attributionsearch           -0.21104    0.05517  -3.825 0.000131 ***
## attributionsocial-media     -0.57997    0.08352  -6.944 3.82e-12 ***
## company_size5-to-50          0.98004    0.07812  12.545  < 2e-16 ***
## company_size51-to-500        0.93706    0.09927   9.439  < 2e-16 ***
## company_sizeless-than-5      0.83452    0.07320  11.400  < 2e-16 ***
## company_sizemore-than-500    0.53306    0.14894   3.579 0.000345 ***
## business_typeagency          0.01549    0.09263   0.167 0.867167    
## business_typenone           -0.10456    0.09319  -1.122 0.261859    
## business_typeonline-store   -0.15675    0.09722  -1.612 0.106883    
## business_typepersonal       -0.98895    0.12727  -7.770 7.82e-15 ***
## business_typephysical-store  0.04707    0.10307   0.457 0.647926    
## business_typepublisher      -0.28993    0.10171  -2.850 0.004365 ** 
## business_typesaas                 NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 26883  on 71479  degrees of freedom
## Residual deviance: 26132  on 71463  degrees of freedom
##   (2950 observations deleted due to missingness)
## AIC: 26166
## 
## Number of Fisher Scoring iterations: 8
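
The estimates in this summary are on the log-odds scale, which is hard to read directly. As a small illustration, the sketch below converts a few of them into odds ratios and predicted probabilities. The numbers are copied by hand from the output above; the variable names are made up for this example, and the "baseline" is a user at the reference level of every factor ("other" for attribution, "na" for company size, and whichever business type R used as the reference).

```r
# Estimates copied from the summary(mod) output (log-odds scale).
intercept     <- -3.48235
size_5_to_50  <-  0.98004   # company_size5-to-50
type_personal <- -0.98895   # business_typepersonal

# Exponentiating a coefficient gives an odds ratio against the reference level.
or_size     <- exp(size_5_to_50)    # about 2.66
or_personal <- exp(type_personal)   # about 0.37

# plogis() is the inverse logit: it maps log-odds to a probability.
p_baseline <- plogis(intercept)                  # about 0.030
p_5_to_50  <- plogis(intercept + size_5_to_50)   # about 0.076

round(c(or_size = or_size, or_personal = or_personal,
        p_baseline = p_baseline, p_5_to_50 = p_5_to_50), 3)
```

Read this way, a company size of 5 to 50 multiplies the odds of upgrading by roughly 2.7 relative to the reference level, and a personal business type cuts them to a bit over a third. On the fitted object itself, `exp(coef(mod))` would give all of the odds ratios at once.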

Now let’s quickly evaluate the model. Unlike linear regression with ordinary least squares estimation, logistic regression has no $R^2$ statistic that gives the proportion of variance in the dependent variable explained by the predictors. However, there are a number of pseudo-$R^2$ metrics that could be of value. Most notable is McFadden’s $R^2$, which is defined as $1 - \ln(L_M) / \ln(L_0)$, where $\ln(L_M)$ is the log likelihood of the fitted model and $\ln(L_0)$ is the log likelihood of the null model with only an intercept as a predictor. The measure ranges from 0 to just under 1, with values closer to zero indicating that the model has no predictive power.

library(pscl)
pR2(mod)
##           llh       llhNull            G2      McFadden          r2ML 
## -1.306582e+04 -1.344132e+04  7.510037e+02  2.793637e-02  1.045149e-02 
##          r2CU 
##  3.334266e-02

McFadden’s $R^2$ is very low, around 0.028, indicating that this model has very little predictive power.
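
As a quick sanity check, the McFadden value can be reproduced by hand from the two log likelihoods that pR2 reports. This is only a sketch using numbers copied from the output above; the variable names are made up for the example.

```r
# Log likelihoods copied from the pR2(mod) output.
llh      <- -1.306582e+04   # fitted model, ln(L_M)
llh_null <- -1.344132e+04   # intercept-only model, ln(L_0)

# McFadden's pseudo R-squared: 1 - ln(L_M) / ln(L_0)
mcfadden <- 1 - llh / llh_null   # about 0.0279

# G2 is the likelihood-ratio statistic: -2 * (ln(L_0) - ln(L_M))
g2 <- -2 * (llh_null - llh)      # about 751

c(McFadden = mcfadden, G2 = g2)
```

Both values match the McFadden and G2 entries in the pR2 output. The fitted log likelihood is only about 3% closer to zero than the null model's, which is what the low pseudo-$R^2$ reflects.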