
Forecasting Twitter Usage in Reply

We’re currently working with Twitter to finalize our contract for the next 12 months, and we have the option to lock in the next pricing tier for the API that Reply uses. To make an informed decision, we’d like to forecast the number of Twitter handles used in Reply. We currently pay $11,250/mo for 1,000 Twitter handles.

Data Collection

The Reply data we need lives in BigQuery. We can collect this data with the bigrquery package.

# load packages used throughout this analysis
library(DBI)
library(dplyr)
library(stringr)
library(lubridate)
library(ggplot2)

# connect to BigQuery
con <- dbConnect(
  bigrquery::bigquery(),
  project = "buffer-data",
  dataset = "atlas_reply_meteor"
)
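
As a quick sanity check (not part of the original analysis), we can confirm the connection works by listing the tables available in the dataset.

# list the tables available in the dataset
dbListTables(con)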

Let’s start by querying the twitter_accounts table.

# define sql query
sql <- "
  select 
    _id as id
    , created_at
    , name
  from twitter_accounts
"

# collect data
profiles <- dbGetQuery(con, sql)

There are 26 thousand Twitter profiles in this dataset, though not all of them are currently in use.

Let’s quickly plot the cumulative sum of the number of Twitter profiles connected to Reply across all time.
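
A minimal sketch of that plot, assuming created_at comes back as a timestamp that as.Date() can parse:

# count profiles created per day and accumulate the total
# (assumes created_at is a parseable timestamp)
profiles %>% 
  mutate(created_date = as.Date(created_at)) %>% 
  count(created_date) %>% 
  mutate(cumulative_profiles = cumsum(n)) %>% 
  ggplot(aes(x = created_date, y = cumulative_profiles)) +
  geom_line() +
  labs(x = "Date", y = "Twitter Profiles",
       title = "Cumulative Twitter Profiles Connected to Reply")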

We can see that the number of Twitter profiles really ramped up in early 2016, and has increased slowly since then. What we need, though, is to see the number of Twitter profiles being used.

Twitter Conversations

We will use the conversations table to get a better sense of how many Twitter profiles are in use.

# define sql query
sql <- "
  select 
    _id as id
    , timestamp_millis(cast(created_at as INT64)) as created_at
    , contact_ref
    , source
  from conversations
  where type = 'twitter'
  and timestamp_millis(cast(created_at as INT64)) >= '2018-01-01'
"

# collect data
conversations <- dbGetQuery(con, sql)

There are 7.8 million Twitter conversations in this dataset. Let’s extract the profile ids.

# extract the twitter profile id
conversations <- conversations %>% 
  mutate(profile_id = str_sub(source, start = 7, end = 22))

Now let’s group the data by month and count the profile ids.

# group by month
by_month <- conversations %>% 
  mutate(month = floor_date(created_at, unit = "months")) %>% 
  group_by(month) %>% 
  summarise(profiles = n_distinct(profile_id))
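
A simple line chart of these monthly counts makes any anomalies easy to spot:

# plot the monthly count of distinct profiles
by_month %>% 
  ggplot(aes(x = month, y = profiles)) +
  geom_line() +
  labs(x = "Month", y = "Twitter Profiles",
       title = "Monthly Twitter Profiles With Conversations")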

It looks like there was a big spike in May 2018. Let’s only look at data beyond that point, so that we have less noise when making our predictions.

Great, we can use this dataset to make our forecasts.

Forecasting Twitter Usage

We’ll use the prophet package developed by the folks at Facebook to make our basic forecasts. We first need to prep a dataframe with columns ds and y, containing the date and numeric value respectively. The ds column should be YYYY-MM-DD for a date, or YYYY-MM-DD HH:MM:SS for a timestamp.

# load the prophet package
library(prophet)

# create dataframe to send to prophet
prophet_df <- by_month %>% 
  filter(month > "2018-05-01" & month < "2019-04-02") %>% 
  rename(ds = month, y = profiles)

# fit model
mod <- prophet(prophet_df)

Predictions are made on a dataframe with a column ds containing the dates for which predictions are to be made. The make_future_dataframe function takes the model object and a number of periods to forecast and produces a suitable dataframe. By default it will also include the historical dates so we can evaluate in-sample fit.

# make future dataframe
future <- make_future_dataframe(mod, periods = 24, freq = "month")

# view dataframe
tail(future)
##            ds
## 30 2020-11-01
## 31 2020-12-01
## 32 2021-01-01
## 33 2021-02-01
## 34 2021-03-01
## 35 2021-04-01

Now let’s make the forecast and plot the results.

# make predictions
fcast <- predict(mod, future)

# plot results
plot(mod, fcast) +
  geom_point(data = filter(fcast, ds > "2019-04-01"), aes(x = ds, y = yhat), color = "red") +
  buffer_theme() +
  labs(x = "Month", 
       y = "Twitter Profiles",
       title = "Forecasted Monthly Twitter Profile Usage",
       subtitle = "Based on conversations data from June 2018")

The red dots are the forecasted values. This forecast assumes a linear growth trend. The biggest source of uncertainty is the potential for future changes in that trend, so the months furthest out have the highest risk of being off.
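
To see where Prophet placed trend changepoints in the historical data, the package includes a helper that overlays them on the forecast plot (a quick diagnostic, not part of the original analysis):

# overlay the detected trend changepoints on the forecast plot
plot(mod, fcast) + add_changepoints_to_plot(mod)

If the fitted trend looks too rigid or too flexible, the changepoint.prior.scale argument to prophet() controls how readily the trend is allowed to bend.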

Let’s get the forecasted values and their uncertainty intervals. The lower_bound and upper_bound columns come from Prophet’s uncertainty interval, which reflects the variation in the historical data and the possibility of future trend changes. You can see that most of the black dots in the plot above fall within this interval.

# load table-formatting packages
library(knitr)
library(kableExtra)

# get forecasted values
fcast %>% 
  filter(ds > "2019-03-01") %>% 
  rename(month = ds, forecast = yhat, lower_bound = yhat_lower, upper_bound = yhat_upper) %>% 
  select(month, forecast, lower_bound, upper_bound) %>% 
  kable() %>% 
  kable_styling()
month        forecast  lower_bound  upper_bound
2019-04-01   631.0274     602.8680     658.9454
2019-05-01   651.8503     624.2726     682.6478
2019-06-01   673.3674     644.8695     702.8760
2019-07-01   694.1904     665.0089     724.8179
2019-08-01   715.7075     686.3447     742.8970
2019-09-01   737.2246     706.9211     768.3517
2019-10-01   758.0476     729.7161     787.9056
2019-11-01   779.5646     748.6914     808.1656
2019-12-01   800.3876     771.7929     828.4855
2020-01-01   821.9047     790.4469     850.7148
2020-02-01   843.4218     814.6097     872.1398
2020-03-01   863.5507     833.1483     893.1340
2020-04-01   885.0678     856.7935     914.1149
2020-05-01   905.8907     877.9398     934.7082
2020-06-01   927.4078     898.2667     955.3987
2020-07-01   948.2308     920.2203     976.7215
2020-08-01   969.7479     942.7830     999.8960
2020-09-01   991.2650     959.3655    1018.5003
2020-10-01  1012.0880     984.0225    1041.6228
2020-11-01  1033.6050    1004.1362    1062.6886
2020-12-01  1054.4280    1025.2365    1083.3474
2021-01-01  1075.9451    1046.5998    1105.1054
2021-02-01  1097.4622    1067.2860    1128.2006
2021-03-01  1116.8970    1088.6338    1144.5780
2021-04-01  1138.4141    1111.9210    1167.6486

Based on this linear growth, we would start getting close to hitting 1000 profiles per month around August or September of 2020.
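
Since the contract is priced per 1,000 handles, the month in which the forecast first crosses that threshold is the key number, and we can pull it straight from fcast:

# first month where the point forecast reaches 1,000 profiles
fcast %>% 
  filter(yhat >= 1000) %>% 
  slice(1) %>% 
  select(ds, yhat, yhat_lower, yhat_upper)

The point forecast first reaches 1,000 in October 2020, while the upper bound crosses it in September, which is why August or September is when we start getting close.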

Tracking Usage Across Social Networks

Let’s see how Twitter usage compares with usage of the other social networks. We’ll use DBI in conjunction with the bigrquery package to collect the data.

# define sql query
sql <- "
  select 
    extract(month from timestamp_millis(cast(created_at as INT64))) as month
    , extract(year from timestamp_millis(cast(created_at as INT64))) as year
    , type
    , array(select * from unnest(split(substr(source, 2 , length(source) - 2))))[offset(1)] as profile_id
    , count(distinct _id) as conversations
  from atlas_reply_meteor.conversations
  where timestamp_millis(cast(created_at as INT64)) >= '2018-01-01'
  group by 1, 2, 3, 4
"

# collect data
convos <- dbGetQuery(con, sql)
# zero-pad the month, build a year-month column, and trim whitespace from the profile id
convos <- convos %>% 
  mutate(month = sprintf("%02d", month),
         yearmon = paste(year, month, sep = "-"),
         profile_id = trimws(profile_id, which = "both"))

Now let’s plot the growth by month.
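
A minimal sketch of that chart, assuming we measure usage as distinct profiles per network per month (the type column holds the network name):

# count distinct profiles per network per month and plot the trend
convos %>% 
  group_by(yearmon, type) %>% 
  summarise(profiles = n_distinct(profile_id)) %>% 
  ggplot(aes(x = yearmon, y = profiles, color = type, group = type)) +
  geom_line() +
  labs(x = "Month", y = "Profiles", color = "Network",
       title = "Monthly Profile Usage by Network")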

It looks like Facebook has overtaken Twitter as the most popular network in recent months!