We’re currently working with Twitter to finalize our contract for the next 12 months, and we have the opportunity to lock in the next pricing tier for the API that Reply uses. To make an informed decision, we would like to forecast the number of Twitter handles used in Reply. We currently pay $11,250/mo for 1,000 Twitter handles.
The Reply data we need lives in BigQuery. We can collect this data with the following code.

```r
# load DBI package
library(DBI)

# connect to BigQuery
con <- dbConnect(
  bigrquery::bigquery(),
  project = "buffer-data",
  dataset = "atlas_reply_meteor"
)
```
Let’s start by querying the `twitter_accounts` table.

```r
# define sql query
sql <- "
  select
    _id as id
    , created_at
    , name
  from twitter_accounts
"

# collect data
profiles <- dbGetQuery(con, sql)
```
There are 26 thousand Twitter profiles in this dataset. Not all of them are currently in use, though.
Let’s quickly plot the cumulative sum of the number of Twitter profiles connected to Reply across all time.
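The plotting code isn’t shown here; a minimal sketch, assuming the `profiles` dataframe queried above and that `dplyr` and `ggplot2` are loaded, could look like this:

```r
# load packages for wrangling and plotting
library(dplyr)
library(ggplot2)

# count profiles created per day and take the running total
# (assumes created_at can be coerced to a date)
profiles %>%
  mutate(created_date = as.Date(created_at)) %>%
  count(created_date) %>%
  arrange(created_date) %>%
  mutate(total_profiles = cumsum(n)) %>%
  ggplot(aes(x = created_date, y = total_profiles)) +
  geom_line() +
  labs(x = "Date", y = "Twitter Profiles",
       title = "Cumulative Twitter Profiles Connected to Reply")
```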
We can see that the number of Twitter profiles really ramped up in early 2016, and has increased slowly since then. What we need, though, is to see the number of Twitter profiles being used.
We will use the `conversations` table to get a better sense of how many Twitter profiles are in use.
```r
# define sql query
sql <- "
  select
    _id as id
    , timestamp_millis(cast(created_at as INT64)) as created_at
    , contact_ref
    , source
  from conversations
  where type = 'twitter'
    and timestamp_millis(cast(created_at as INT64)) >= '2018-01-01'
"

# collect data
conversations <- dbGetQuery(con, sql)
```
There are 7.8 million Twitter conversations in this dataset. Let’s extract the profile ids.
```r
# extract the twitter profile id
conversations <- conversations %>%
  mutate(profile_id = str_sub(source, start = 7, end = 22))
```
Now let’s group the data by month and count the profile ids.
```r
# group by month
by_month <- conversations %>%
  mutate(month = floor_date(created_at, unit = "months")) %>%
  group_by(month) %>%
  summarise(profiles = n_distinct(profile_id))
```
It looks like there was a big spike in May 2018. Let’s only look at data beyond that point, so that we have less noise when making our predictions.
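The filtering step isn’t shown here; a quick sketch, assuming the `by_month` dataframe built above (the `by_month_recent` name is just illustrative):

```r
# keep only the months after the May 2018 spike
by_month_recent <- by_month %>%
  filter(month > "2018-05-01")

# plot monthly profile counts
ggplot(by_month_recent, aes(x = month, y = profiles)) +
  geom_line() +
  labs(x = "Month", y = "Twitter Profiles",
       title = "Monthly Active Twitter Profiles in Reply")
```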
Great, we can use this dataset to make our forecasts.
## Forecasting Twitter Usage
We’ll use the `prophet` package developed by the folks at Facebook to make our basic forecasts. We first need to prep a dataframe with columns `ds` and `y`, containing the date and numeric value respectively. The `ds` column should be YYYY-MM-DD for a date, or YYYY-MM-DD HH:MM:SS for a timestamp.
```r
# load prophet package
library(prophet)

# create dataframe to send to prophet
prophet_df <- by_month %>%
  filter(month > "2018-05-01" & month < "2019-04-02") %>%
  rename(ds = month, y = profiles)

# fit model
mod <- prophet(prophet_df, algorithm = 'LBFGS')
```
Predictions are made on a dataframe with a column `ds` containing the dates for which predictions are to be made. The `make_future_dataframe` function takes the model object and a number of periods to forecast, and produces a suitable dataframe. By default it will also include the historical dates, so we can evaluate in-sample fit.
```r
# make future dataframe
future <- make_future_dataframe(mod, periods = 24, freq = "month")

# view dataframe
tail(future)
```

```
##            ds
## 30 2020-11-01
## 31 2020-12-01
## 32 2021-01-01
## 33 2021-02-01
## 34 2021-03-01
## 35 2021-04-01
```
Now let’s make the forecast and plot the results.
```r
# make predictions
fcast <- predict(mod, future)

# plot results
plot(mod, fcast) +
  geom_point(data = filter(fcast, ds > "2019-04-01"),
             aes(x = ds, y = yhat), color = "red") +
  buffer_theme() +
  labs(x = "Month", y = "Twitter Profiles",
       title = "Forecasted Monthly Twitter Profile Usage",
       subtitle = "Based on conversations data from June 2018")
```
The red dots are the forecasted values. This forecast assumes a linear growth trend. The biggest source of uncertainty in the forecast is the potential for future trend changes, so the months furthest out carry the highest risk of being off.
Let’s get the forecasted values and the confidence intervals. The `lower_bound` and `upper_bound` columns represent the uncertainty interval, which is based on the variance between historic months. You can see that most of the black dots in the plot above fall within this uncertainty interval.
```r
# get forecasted values
fcast %>%
  filter(ds > "2019-03-01") %>%
  rename(month = ds,
         forecast = yhat,
         lower_bound = yhat_lower,
         upper_bound = yhat_upper) %>%
  select(month, forecast, lower_bound, upper_bound) %>%
  kable() %>%
  kable_styling()
```
Based on this linear growth trend, we would start getting close to 1,000 profiles per month around August or September of 2020.
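As a rough sanity check on the pricing implication, here is a quick sketch: the per-handle rate below is derived from our current contract, `fcast` is the forecast dataframe from above, and the actual price of the next tier is something we would confirm with Twitter.

```r
# current contract: $11,250/mo covers 1,000 handles
monthly_cost     <- 11250
handles_included <- 1000

# implied cost per handle per month
monthly_cost / handles_included
## [1] 11.25

# first forecasted month in which usage crosses the current tier
fcast %>%
  filter(yhat >= 1000) %>%
  slice(1) %>%
  select(ds, yhat)
```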