In this analysis we’ll explore the results from the sticky header A/B test, also known as experiment EID21. The experiment was run via our A/Bert framework, which split visitors randomly 50/50 between the control and variation groups.
The experiment hypothesis was:
- If we show a sticky header to all visitors across the entire website, then we’ll see an increase in Publish Small Business trial starts, because more visitors will have the opportunity to click the “Try Buffer for Business” CTA in the header, which will drive more traffic to the /business page and lead to an increase in Small Business trial starts.
Given this hypothesis, our success metric was:
- number of Publish Small Business trial starts
TL;DR
The result of the experiment is that there is insufficient evidence to confirm the hypothesis, as there was no statistically significant difference observed between the control and variation groups.
Data Collection
To analyze the results of this experiment, we will use the following query to retrieve data about users enrolled in the experiment.
# load the libraries used in this analysis
library(DBI)
library(bigrquery)
library(dplyr)
library(stringr)
library(skimr)
library(knitr)
library(kableExtra)

# connect to BigQuery
con <- dbConnect(
  bigrquery::bigquery(),
  project = "buffer-data"
)
# define sql query to get experiment enrolled visitors
sql <- "
with enrolled_users as (
select
anonymous_id
, experiment_group
, first_value(timestamp) over (
partition by anonymous_id order by timestamp asc
rows between unbounded preceding and unbounded following) as enrolled_at
from segment_marketing.experiment_viewed
where
first_viewed
and experiment_id = 'eid21_sticky_header_all_times'
)
select
e.anonymous_id
, e.experiment_group
, e.enrolled_at
, i.user_id as account_id
, c.email
, c.publish_user_id
, a.timestamp as account_created_at
, t.product as trial_product
, t.timestamp as trial_started_at
, t.subscription_id as trial_subscription_id
, t.stripe_event_id as stripe_trial_event_id
, t.plan_id as trial_plan_id
, t.cycle as trial_billing_cycle
, t.cta as trial_started_cta
, s.product as subscription_product
, s.timestamp as subscription_started_at
, s.subscription_id as subscription_id
, s.stripe_event_id as stripe_subscription_event_id
, s.plan_id as subscription_plan_id
, s.cycle as subscription_billing_cycle
, s.revenue as subscription_revenue
, s.amount as subscription_amount
, s.cta as subscription_started_cta
from enrolled_users e
left join segment_login_server.identifies i
on e.anonymous_id = i.anonymous_id
left join dbt_buffer.core_accounts c
on i.user_id = c.id
left join segment_login_server.account_created a
on i.user_id = a.user_id
left join segment_publish_server.trial_started t
on i.user_id = t.user_id
and t.timestamp > e.enrolled_at
left join segment_publish_server.subscription_started s
on i.user_id = s.user_id
and s.timestamp > e.enrolled_at
group by 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23
"
# query BQ
users <- dbGetQuery(con, sql)
Exploratory Analysis
Let’s start by reviewing a few of the summary statistics from our data.
skim(users)
Name | users |
Number of rows | 477118 |
Number of columns | 23 |
_______________________ | |
Column type frequency: | |
character | 17 |
numeric | 2 |
POSIXct | 4 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
anonymous_id | 0 | 1.00 | 36 | 36 | 0 | 473819 | 0 |
experiment_group | 0 | 1.00 | 7 | 9 | 0 | 2 | 0 |
account_id | 438745 | 0.08 | 24 | 24 | 0 | 35794 | 0 |
email | 438858 | 0.08 | 7 | 58 | 0 | 35686 | 0 |
publish_user_id | 439435 | 0.08 | 24 | 24 | 0 | 35114 | 0 |
trial_product | 450771 | 0.06 | 7 | 7 | 0 | 2 | 0 |
trial_subscription_id | 450771 | 0.06 | 18 | 18 | 0 | 25902 | 0 |
stripe_trial_event_id | 450771 | 0.06 | 18 | 18 | 0 | 25963 | 0 |
trial_plan_id | 450771 | 0.06 | 11 | 44 | 0 | 19 | 0 |
trial_billing_cycle | 450771 | 0.06 | 4 | 5 | 0 | 2 | 0 |
trial_started_cta | 450931 | 0.05 | 30 | 68 | 0 | 50 | 0 |
subscription_product | 474311 | 0.01 | 7 | 7 | 0 | 2 | 0 |
subscription_id | 474311 | 0.01 | 18 | 18 | 0 | 2183 | 0 |
stripe_subscription_event_id | 474311 | 0.01 | 18 | 18 | 0 | 2186 | 0 |
subscription_plan_id | 474311 | 0.01 | 11 | 44 | 0 | 19 | 0 |
subscription_billing_cycle | 474311 | 0.01 | 4 | 5 | 0 | 2 | 0 |
subscription_started_cta | 474625 | 0.01 | 30 | 67 | 0 | 40 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
subscription_revenue | 474311 | 0.01 | 35.72 | 51.04 | 10 | 15 | 15 | 50 | 500 | ▇▁▁▁▁ |
subscription_amount | 474311 | 0.01 | 98.60 | 173.92 | 10 | 15 | 50 | 144 | 2030 | ▇▁▁▁▁ |
Variable type: POSIXct
skim_variable | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|
enrolled_at | 0 | 1.00 | 2019-12-04 18:23:45 | 2020-02-13 13:19:42 | 2019-12-18 19:49:21 | 473756 |
account_created_at | 438745 | 0.08 | 2019-04-23 18:05:15 | 2020-02-18 19:22:35 | 2019-12-14 19:45:37 | 35794 |
trial_started_at | 450771 | 0.06 | 2019-12-04 20:26:27 | 2020-02-18 19:22:47 | 2019-12-24 10:25:08 | 25968 |
subscription_started_at | 474311 | 0.01 | 2019-12-04 20:55:43 | 2020-02-18 20:31:12 | 2020-01-07 21:02:03 | 2226 |
Let’s start with a quick validation of the visitor count split between the two experiment groups.
users %>%
group_by(experiment_group) %>%
summarise(
visitors = n_distinct(anonymous_id),
accounts = n_distinct(account_id),
trials = n_distinct(trial_subscription_id),
subscriptions = n_distinct(subscription_id)
) %>%
mutate(visitor_split_perct = visitors / sum(visitors)) %>%
kable() %>%
kable_styling()
experiment_group | visitors | accounts | trials | subscriptions | visitor_split_perct |
---|---|---|---|---|---|
control | 237661 | 18102 | 13186 | 1093 | 0.501586 |
variant_1 | 236158 | 17694 | 12718 | 1092 | 0.498414 |
Great, there are a total of 473,819 unique visitors enrolled in the experiment, and the split between the two experiment groups is within 0.16 percentage points of 50/50 (well within reason for our randomization). This confirms that our experiment framework correctly split enrollments for the experiment.
res <- prop.test(x = c(18348, 18055), n = c(238347, 236872), alternative = "two.sided")
res
##
## 2-sample test for equality of proportions with continuity correction
##
## data: c(18348, 18055) out of c(238347, 236872)
## X-squared = 0.95332, df = 1, p-value = 0.3289
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.0007589254 0.0022741252
## sample estimates:
## prop 1 prop 2
## 0.0769802 0.0762226
explain(res)
This was a two-sample proportion test of the null hypothesis that the true population proportions are equal. Using a significance level of 0.05, we do not reject the null hypothesis, and cannot conclude that two population proportions are different from one another. The observed difference in proportions is -0.000757599899208872. The observed proportion for the first group is 0.0769802011353195 (18,348 events out of a total sample size of 238,347). For the second group, the observed proportion is 0.0762226012361106 (18,055, out of a total sample size of 236,872).
The confidence interval for the true difference in population proportions is (-7.5892542 × 10^-4, 0.0022741). This interval will contain the true difference in population proportions 95 times out of 100.
The p-value for this test is 0.3288756. This, formally, is defined as the probability – if the null hypothesis is true – of observing a difference in sample proportions that is as or more extreme than the difference in sample proportions from this data set. In this case, this is the probability – if the true population proportions are equal – of observing a difference in sample proportions that is greater than 0.000757599899208872 or less than -0.000757599899208872.
We can see that the number of visitors that ended up creating a Buffer account (i.e., signed up for their first Buffer product) was a few hundred higher in the control group. A quick proportion test shows that this difference in the proportion of enrolled visitors that created a Buffer account is NOT statistically significant, with a p-value of 0.329 (far above the generally accepted 0.05 threshold). TL;DR, there was no difference between the variation and the control in the overall visitor-to-signup rate.
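To make the mechanics of this test concrete, here is a sketch of the same comparison computed by hand in base R, using the counts reported above. Note that `prop.test()` also applies a continuity correction, so its p-value differs slightly from this uncorrected version.

```r
# Manual two-proportion z-test for the signup rates above (a sketch;
# prop.test() adds a continuity correction, so results differ slightly).
x <- c(18348, 18055)         # signups in control / variation
n <- c(238347, 236872)       # enrolled visitors in control / variation
p_pooled <- sum(x) / sum(n)  # pooled signup rate under the null
se <- sqrt(p_pooled * (1 - p_pooled) * sum(1 / n))
z <- (x[1] / n[1] - x[2] / n[2]) / se
p_value <- 2 * pnorm(-abs(z))
p_value  # ~0.33, consistent with the non-significant result above
```

This is exactly the chi-squared statistic from the output above (X-squared is just z squared), which is why the conclusions match.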
Next, we will calculate how many users from each experiment group started a Publish trial.
users %>%
mutate(has_publish_trial = trial_product == "publish") %>%
group_by(experiment_group, has_publish_trial) %>%
summarise(users = n_distinct(account_id)) %>%
ungroup() %>%
filter(has_publish_trial) %>%
group_by(experiment_group) %>%
summarise(users_with_publish_trials = users) %>%
kable() %>%
kable_styling()
experiment_group | users_with_publish_trials |
---|---|
control | 11761 |
variant_1 | 11406 |
There were 11,790 users in the control group that started a Publish trial, and 11,541 in the variation group. Just like above, we should also run a proportion test here.
res <- prop.test(x = c(11790, 11541), n = c(18348, 18055), alternative = "two.sided")
res
##
## 2-sample test for equality of proportions with continuity correction
##
## data: c(11790, 11541) out of c(18348, 18055)
## X-squared = 0.43279, df = 1, p-value = 0.5106
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.006548244 0.013274911
## sample estimates:
## prop 1 prop 2
## 0.6425768 0.6392135
explain(res)
This was a two-sample proportion test of the null hypothesis that the true population proportions are equal. Using a significance level of 0.05, we do not reject the null hypothesis, and cannot conclude that two population proportions are different from one another. The observed difference in proportions is -0.0033633333508416. The observed proportion for the first group is 0.642576847612819 (11,790 events out of a total sample size of 18,348). For the second group, the observed proportion is 0.639213514261977 (11,541, out of a total sample size of 18,055).
The confidence interval for the true difference in population proportions is (-0.0065482, 0.0132749). This interval will contain the true difference in population proportions 95 times out of 100.
The p-value for this test is 0.5106212. This, formally, is defined as the probability – if the null hypothesis is true – of observing a difference in sample proportions that is as or more extreme than the difference in sample proportions from this data set. In this case, this is the probability – if the true population proportions are equal – of observing a difference in sample proportions that is greater than 0.0033633333508416 or less than -0.0033633333508416.
We can see that the difference in proportion of accounts that started a Publish trial is also NOT statistically significant, with a p-value of 0.51. TL;DR, there is no difference between the two groups in total Publish trial starts.
users %>%
mutate(has_sbp_trial = (trial_product == "publish" & str_detect(trial_plan_id, ".small."))) %>%
group_by(experiment_group, has_sbp_trial) %>%
summarise(users = n_distinct(account_id)) %>%
ungroup() %>%
filter(has_sbp_trial) %>%
group_by(experiment_group) %>%
summarise(users_with_sbp_trials = users) %>%
kable() %>%
kable_styling()
experiment_group | users_with_sbp_trials |
---|---|
control | 2787 |
variant_1 | 2668 |
There were 2,769 users in the control group that started a Publish Small Business trial, and 2,670 in the variation group. Just like above, we should also run a proportion test here.
res <- prop.test(x = c(2769, 2670), n = c(18348, 18055), alternative = "two.sided")
res
##
## 2-sample test for equality of proportions with continuity correction
##
## data: c(2769, 2670) out of c(18348, 18055)
## X-squared = 0.63555, df = 1, p-value = 0.4253
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.004344667 0.010412983
## sample estimates:
## prop 1 prop 2
## 0.1509156 0.1478815
explain(res)
This was a two-sample proportion test of the null hypothesis that the true population proportions are equal. Using a significance level of 0.05, we do not reject the null hypothesis, and cannot conclude that two population proportions are different from one another. The observed difference in proportions is -0.00303415785535768. The observed proportion for the first group is 0.150915631131458 (2,769 events out of a total sample size of 18,348). For the second group, the observed proportion is 0.147881473276101 (2,670, out of a total sample size of 18,055).
The confidence interval for the true difference in population proportions is (-0.0043447, 0.010413). This interval will contain the true difference in population proportions 95 times out of 100.
The p-value for this test is 0.4253263. This, formally, is defined as the probability – if the null hypothesis is true – of observing a difference in sample proportions that is as or more extreme than the difference in sample proportions from this data set. In this case, this is the probability – if the true population proportions are equal – of observing a difference in sample proportions that is greater than 0.00303415785535768 or less than -0.00303415785535768.
We can see that the difference in the proportion of accounts that started a Publish SBP trial is also NOT statistically significant, with a p-value of 0.43. In other words, we found no evidence of a difference in Publish Small Business trial starts between the two experiment groups.
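A null result on the success metric is only meaningful if the experiment had enough power to detect a realistic effect. As a rough sketch (the 10% relative lift here is an assumed minimum detectable effect, not something specified in the experiment plan), `stats::power.prop.test` estimates the per-group sample size needed at the ~15.1% baseline SBP trial rate observed in the control group:

```r
# Sketch of a power check: signups needed per group to detect an
# assumed 10% relative lift on the ~15.1% baseline SBP trial rate,
# at alpha = 0.05 and 80% power.
pwr <- power.prop.test(p1 = 0.151, p2 = 0.151 * 1.10,
                       sig.level = 0.05, power = 0.80)
ceiling(pwr$n)  # required signups per group
```

With roughly 18,000 signups per group, the experiment comfortably clears this bar, so the null result is unlikely to be a simple power problem for effects of that size.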
Next, we will calculate how many users from each experiment group started a paid Publish subscription.
users %>%
mutate(has_publish_sub = (subscription_product == "publish")) %>%
group_by(experiment_group, has_publish_sub) %>%
summarise(users = n_distinct(account_id)) %>%
ungroup() %>%
filter(has_publish_sub) %>%
group_by(experiment_group) %>%
summarise(paying_publish_users = users) %>%
kable() %>%
kable_styling()
experiment_group | paying_publish_users |
---|---|
control | 972 |
variant_1 | 959 |
There were 741 users in the control group that started a paid Publish subscription, and 709 in the variation group. Just like above, we should run proportion tests here, for both the signup-to-paid-subscription rate and the trial-to-paid-subscription rate.
res <- prop.test(x = c(741, 709), n = c(18348, 18055), alternative = "two.sided")
res
##
## 2-sample test for equality of proportions with continuity correction
##
## data: c(741, 709) out of c(18348, 18055)
## X-squared = 0.26838, df = 1, p-value = 0.6044
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.002955545 0.005189490
## sample estimates:
## prop 1 prop 2
## 0.04038587 0.03926890
explain(res)
This was a two-sample proportion test of the null hypothesis that the true population proportions are equal. Using a significance level of 0.05, we do not reject the null hypothesis, and cannot conclude that two population proportions are different from one another. The observed difference in proportions is -0.00111697253812972. The observed proportion for the first group is 0.0403858731196861 (741 events out of a total sample size of 18,348). For the second group, the observed proportion is 0.0392689005815564 (709, out of a total sample size of 18,055).
The confidence interval for the true difference in population proportions is (-0.0029555, 0.0051895). This interval will contain the true difference in population proportions 95 times out of 100.
The p-value for this test is 0.6044234. This, formally, is defined as the probability – if the null hypothesis is true – of observing a difference in sample proportions that is as or more extreme than the difference in sample proportions from this data set. In this case, this is the probability – if the true population proportions are equal – of observing a difference in sample proportions that is greater than 0.00111697253812972 or less than -0.00111697253812972.
There is no statistical difference between the two proportions of signups that started a paid Publish subscription, as the p-value is 0.604.
res <- prop.test(x = c(741, 709), n = c(11790, 11541), alternative = "two.sided")
res
##
## 2-sample test for equality of proportions with continuity correction
##
## data: c(741, 709) out of c(11790, 11541)
## X-squared = 0.17726, df = 1, p-value = 0.6737
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.004864409 0.007697851
## sample estimates:
## prop 1 prop 2
## 0.06284987 0.06143315
explain(res)
This was a two-sample proportion test of the null hypothesis that the true population proportions are equal. Using a significance level of 0.05, we do not reject the null hypothesis, and cannot conclude that two population proportions are different from one another. The observed difference in proportions is -0.00141672140017238. The observed proportion for the first group is 0.0628498727735369 (741 events out of a total sample size of 11,790). For the second group, the observed proportion is 0.0614331513733645 (709, out of a total sample size of 11,541).
The confidence interval for the true difference in population proportions is (-0.0048644, 0.0076979). This interval will contain the true difference in population proportions 95 times out of 100.
The p-value for this test is 0.6737409. This, formally, is defined as the probability – if the null hypothesis is true – of observing a difference in sample proportions that is as or more extreme than the difference in sample proportions from this data set. In this case, this is the probability – if the true population proportions are equal – of observing a difference in sample proportions that is greater than 0.00141672140017238 or less than -0.00141672140017238.
There is no statistical difference between the two proportions of trial starts that started a paid Publish subscription, as the p-value is 0.67.
Next, let’s look into the number of Publish SBP paid subscriptions between the two experiment groups.
users %>%
mutate(has_sbp_sub = (subscription_product == "publish" & str_detect(subscription_plan_id, ".small."))) %>%
group_by(experiment_group, has_sbp_sub) %>%
summarise(users = n_distinct(account_id)) %>%
ungroup() %>%
filter(has_sbp_sub) %>%
group_by(experiment_group) %>%
summarise(paying_sbp_users = users) %>%
kable() %>%
kable_styling()
experiment_group | paying_sbp_users |
---|---|
control | 76 |
variant_1 | 69 |
There were 47 users in the control group with a paid Publish SBP subscription, and 43 in the variation group. Just like above, we should run both proportion tests here too.
res <- prop.test(x = c(47, 43), n = c(18348, 18055), alternative = "two.sided")
res
##
## 2-sample test for equality of proportions with continuity correction
##
## data: c(47, 43) out of c(18348, 18055)
## X-squared = 0.057684, df = 1, p-value = 0.8102
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.0008949943 0.0012549450
## sample estimates:
## prop 1 prop 2
## 0.002561587 0.002381612
explain(res)
This was a two-sample proportion test of the null hypothesis that the true population proportions are equal. Using a significance level of 0.05, we do not reject the null hypothesis, and cannot conclude that two population proportions are different from one another. The observed difference in proportions is -0.000179975352061444. The observed proportion for the first group is 0.00256158709396119 (47 events out of a total sample size of 18,348). For the second group, the observed proportion is 0.00238161174189975 (43, out of a total sample size of 18,055).
The confidence interval for the true difference in population proportions is (-8.9499427 × 10^-4, 0.0012549). This interval will contain the true difference in population proportions 95 times out of 100.
The p-value for this test is 0.8101945. This, formally, is defined as the probability – if the null hypothesis is true – of observing a difference in sample proportions that is as or more extreme than the difference in sample proportions from this data set. In this case, this is the probability – if the true population proportions are equal – of observing a difference in sample proportions that is greater than 0.000179975352061444 or less than -0.000179975352061444.
There is no statistical difference between the two proportions of signups that started a paid SBP subscription, as the p-value is 0.810.
res <- prop.test(x = c(47, 43), n = c(11790, 11541), alternative = "two.sided")
res
##
## 2-sample test for equality of proportions with continuity correction
##
## data: c(47, 43) out of c(11790, 11541)
## X-squared = 0.0464, df = 1, p-value = 0.8294
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.001415507 0.001936671
## sample estimates:
## prop 1 prop 2
## 0.003986429 0.003725847
explain(res)
This was a two-sample proportion test of the null hypothesis that the true population proportions are equal. Using a significance level of 0.05, we do not reject the null hypothesis, and cannot conclude that two population proportions are different from one another. The observed difference in proportions is -0.000260582196937878. The observed proportion for the first group is 0.00398642917726887 (47 events out of a total sample size of 11,790). For the second group, the observed proportion is 0.00372584698033099 (43, out of a total sample size of 11,541).
The confidence interval for the true difference in population proportions is (-0.0014155, 0.0019367). This interval will contain the true difference in population proportions 95 times out of 100.
The p-value for this test is 0.8294495. This, formally, is defined as the probability – if the null hypothesis is true – of observing a difference in sample proportions that is as or more extreme than the difference in sample proportions from this data set. In this case, this is the probability – if the true population proportions are equal – of observing a difference in sample proportions that is greater than 0.000260582196937878 or less than -0.000260582196937878.
There is no statistical difference between the two proportions of Publish trials that started a paid SBP subscription, as the p-value is 0.829.
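Since this analysis runs seven proportion tests in total, it is worth confirming that multiple comparisons do not change the picture. A quick sketch with `stats::p.adjust`, using the p-values reported above:

```r
# Holm adjustment of the seven p-values reported in this analysis.
# Every unadjusted p-value is already far above 0.05, so the adjusted
# values remain non-significant as well.
p_values <- c(0.3289, 0.5106, 0.4253, 0.6044, 0.6737, 0.8102, 0.8294)
adjusted <- p.adjust(p_values, method = "holm")
adjusted
```

Adjusting for multiplicity can only make p-values larger, so the overall null result stands.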
Final Results
Given the above observations, the result of the experiment is that there is insufficient evidence to confirm the hypothesis: none of the metrics we tested showed a statistically significant difference between the control and the variation.