Introduction

This page describes the simulation, modeling and evaluation code in R/run_eval.R. The goal is to provide sufficient information to run and modify the code. The focus for the current data generating process (DGP) is SIRTD epidemiological models derived from compartments for the states (S) susceptible, (I) infectious and (R) for recovered – SIR models. The current simulations add compartments for (T) terminally-ill and (D) deceased counts useful for the grim task of distinguishing the deceased from recovered. There is additionally data generated for tweets that mention symptoms from individuals in the (I) infectious compartment but that exists separate from the core SIRTD model.

For more explanation about SIR models see Stan Case Study – Boarding School which we found to be a good explanation as well as an excellent introduction to Bayesian approaches using Stan. This is the model used to fit this data, there is a model reproduction checklist at codatmo/Simple_SIR.

Data Generating Process – DGP

The DGP simulates a fixed population at the individual level on a daily basis. There are a series of simulating DGPs with varied properties. This document covers a DGP that closely mirrors the compartment model structure with real-valued compartments with population transfers between states computed via real-valued multiplication on each day of the simulation. There is no stochastic (random) element at all in this simulation since it serves as a baseline for more complex models which will have stochastic elements as well as operating on different scales, e.g., agent-based models.

The actual code is shown below, the Rmarkdown commands are included to be clear about where code is and how the document is being generated. Note that there is a pull_section function that prints sections of code based on being between comments that will be used throughout. This allows software to exist outside the notebook format without introducing copy/paste errors that often come up or other mismatches between running code and it’s description.

#' Returns lines of file between the identified section '# <section name>' and
#' the matching section later in the file or the end of the file.
#' @param sections Comment that starts/stops collection of lines, e.g., '# section 1'
#' @param file Path to file to pull from
pull_section <- function(section, file, num_lines = FALSE) {
  collect = FALSE
  return_string = ''
  line_num = 0
  for (line in readLines(file)) {
    line_num = line_num + 1
    if (line == section) {
      collect = !collect
    }
    prefix = ""
    if (num_lines) {
        prefix = paste0(line_num,": ")
    }
    if (grepl(section,line)) {
      return_string = paste0(return_string, paste0(prefix,line), '\n')
    }
    else if (collect) {
      return_string = paste0(return_string, paste0(prefix,line), '\n')
    }
  }
  return(return_string)
}

cat(pull_section('# sirtd_vary_beta_exact', here::here('R', 'SIRTDsim.R')))

# sirtd_vary_beta_exact
#' Runs a simple deterministic SIRTD model with possible daily varied beta values. 
#' @param seed Seed for stochastic processes, ignored since this is a deterministic function
#' @param n_pop Population size
#' @param n_days Number of days to run simulation
#' @param print Boolean to control whether daily summary is printed
#' @param beta_daily_inf_rates Controls number of s that are infected by i to 
#' become i. Should be between 0 and 1
#' @param gamma_res_per_day_rate Controls number of i that transition to r or t
#' per day. Should be between 0 and 1
#' @param death_prob Probability transitioning from i to t
#' @param tweet_rate_infected If infected, what rate of tweets per day. Can be 
#' > 1
#' @param n_patient_zero Number of patients infected at day 0
#' @param mean_days_to_death_from_t Mean number of days in t before d
#' @return dataframe simulating sirtd+tweets state for number days 
sirtd_vary_beta_exact <- function(seed,
                            n_pop,
                            n_days,
                            print,
                            beta_daily_inf_rates,
                            gamma_res_per_day_rate,
                            death_prob,
                            tweet_rate_infected,
                            n_patient_zero,
                            mean_days_to_death_from_t) {
  old_seed <- .Random.seed
  on.exit({.Random.seed <<- old_seed})
  set.seed(seed)
  i <- n_patient_zero
  r <- 0
  t <- 0
  d <- 0
  s <- n_pop - n_patient_zero 
  col_names = c('day', 's', 'i', 'r', 't', 'd', 'tweets')
  df = data.frame(matrix(nrow = 0, ncol=length(col_names)))
  colnames(df) = col_names
  tweets = 0
  for (day in 1:n_days) {
    df[day,] = rep(NA,length(col_names)) 
    df[day,]$s=  round(s)
    df[day,]$i = round(i)
    df[day,]$r = round(r)
    df[day,]$t = round(t)
    df[day,]$d = round(d)
    df[day,]$day = day
    df[day,]$tweets = round(tweets)
    if (print) {
      cat(
        sprintf(
          "\nDay=%d, susceptible=%.2f, infected=%.2f, recovered=%.2f, terminal=%.2f,
           dead=%.2f, tweets=%.2f, R0=%.2f %.2f",
          df[day,]$day, df[day,]$s, df[day,]$i, df[day,]$r, df[day,]$t,
          df[day,]$d, df[day,]$tweets,
          beta_daily_inf_rates[day]/gamma_res_per_day_rate,
          s + i + r + t + d))
    }
    resolved <- gamma_res_per_day_rate * i  
    terminal <- death_prob * resolved
    recovered <- resolved - terminal
    dead <- 1/mean_days_to_death_from_t * t
    tweets = tweet_rate_infected * i
    infected <- min(i * beta_daily_inf_rates[day], s)
    s <- s - infected
    i <- i - resolved + infected
    r <- r + recovered
    t <- t - dead + terminal
    d <- d + dead
  }
  return(df)
}
# sirtd_vary_beta_exact

The focus on this presentation is the surrounding framework so we offer the simulation code without comment, discussion of the DGP/simulation is at to be determined. We are planning to include the following simulations:

SIRTD model, real-valued, non-stochastic (this model)
SIRTD model, real-valued, stochastic
SIRTD model, real-valued, stochastic, variying-beta
SIRTD model, integer-valued, agent-based, stochastic
SIRTD model, integer-valued, agent-based, variying-beta

We may also run models that are time-limited, feature-richer social interaction models etc…

Running simulation and fitting a model

Below we go over how to setup and run the simulation code that generates data and runs experiments. The central code is in R/run_eval.R with use of functions in R/util.R for printing/graphing functions, R/data_configs.R for code that creates data via simulation or provides actual data, R/modeling_configs.R for code that configures parameters for modeling software and R/SIRTDsim.R that contains simulation software and is called from functions in R/data_configs.R. The libraries and dependencies are:

cat(pull_section('# dependencies', here::here('R', 'run_eval.R')))

# dependencies
library(tidyverse)
library(cmdstanr)
library(data.table)
library(kableExtra)

source(here::here("R","util.R"))
source(here::here("R","SIRTDsim.R"))
source(here::here("R","sim_configs.R"))
source(here::here("R", "data_configs.R"))
source(here::here("R","modeling_configs.R"))
# dependencies

Setting up the `run_df` data frame that controls the experiments

cat(pull_section('# setup run_df', here::here('R', 'run_eval.R'), num_lines = TRUE))

15: # setup run_df
16: setup_run_df <- setup_run_df(seed = 93435, n_pop = 214110287, n_days = 300) # in R/util.R
17: brazil_sim_df <- sim_brazil_1(setup_run_df) # in R/sim_configs.R
18: brazil_actual_df <- data_brazil_1(setup_run_df)
19: draws_run_df <- sim_draw_params(2, run_df) # in R/sim_configs.R
20: run_data_df <- brazil_actual_df #rbind(brazil_sim_df, brazil_actual_df)
21: 
22: run_df <- model_stan_baseline(run_data_df) #in R/modeling_configs.R
23: run_df$use_tweets <- 0
24: #run_df$compute_likelihood <- 1 # compute likelihood across all runs
25: #run_df$reports <- list(c('graph_sim', 'graph_tweets', 'graph_d', 'plot'))
26:    #list(c('graph_sim','graph_ODE', 'graph_tweets', 'graph_d', 'plot','param_recovery'))
27: # setup run_df

The run_df is a dataframe where each row will be a complete experiment configuration. The above functions add information relevant to their role. The source is very straightforward should one want to modify for their own purposes. In more detail:

cat(pull_section('setup_run_df', here::here('R', 'run_eval.R'), num_lines = TRUE))

16: setup_run_df <- setup_run_df(seed = 93435, n_pop = 214110287, n_days = 300) # in R/util.R
17: brazil_sim_df <- sim_brazil_1(setup_run_df) # in R/sim_configs.R
18: brazil_actual_df <- data_brazil_1(setup_run_df)

Running code up to this line shows the following values, note that the dataframe is transposed with rows as columns and columns as rows–unreadable otherwise:

sim_run_id	NA
description	NA
seed	93435
n_pop	1000
n_days	7
model_to_run	none
compute_likelihood	NA
use_tweets	NA
s	NULL
i	NULL
r	NULL
t	NULL
d	NULL
tweets	NULL
beta_daily_rate	NA
beta_mean	NA
gamma	NA
death_prob	NA
tweet_rate	NA
days2death	NA
reports	NA
d_in_interval	NA
tweets_in_interval	NA
fit	NA

This line constructs a basic template for the run_df, there is just one row at this point–remember transposed output. All subsequent additions/elaborations draw from this dataframe:

cat(pull_section('sim_brazil_1', here::here('R', 'run_eval.R'), num_lines = TRUE))

17: brazil_sim_df <- sim_brazil_1(setup_run_df) # in R/sim_configs.R

The sim_brazil_1 is a parameterization that roughly mimics the shape of the COVID-19 outbreak with corresponding simulated values as determined by trying values by hand. Look at the source code for more info. Note that the prefix on the function is sim, which means that it is part of simulation. There is still one row in the dataframe but with simulation data and the generating parameterization.

The, remember transposed, dataframe is:

sim_run_id	1
description	Brazil 1 approximation
seed	93435
n_pop	1000
n_days	7
model_to_run	none
compute_likelihood	NA
use_tweets	NA
s	c(990, 987, 984, 981, 977, 974, 970)
i	c(10, 10, 11, 11, 12, 12, 13)
r	c(0, 3, 5, 8, 11, 14, 17)
t	c(0, 0, 0, 0, 0, 0, 0)
d	c(0, 0, 0, 0, 0, 0, 0)
tweets	c(0, 1, 1, 1, 1, 1, 1)
beta_daily_rate	c(0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3)
beta_mean	0.3
gamma	0.255
death_prob	0.01
tweet_rate	0.1
days2death	10
reports	c("graph_sim", "plot")
d_in_interval	NA
tweets_in_interval	NA
fit	NA
n_patient_zero	10

We now have simulation data from the SIRTD model for n_days. But still only one row–remember transposed above:

cat(pull_section('sim_draw_params', here::here('R','run_eval.R'), num_lines = TRUE))

19: draws_run_df <- sim_draw_params(2, run_df) # in R/sim_configs.R

The next line runs two random based draws from the parameterization in the single simulation in run_df. Two draws are conducted and returned:

sim_run_id	2	3
description	Brazil 1 approximation	Brazil 1 approximation
seed	93435	93435
n_pop	1000	1000
n_days	7	7
model_to_run	none	none
compute_likelihood	NA	NA
use_tweets	NA	NA
s	c(990, 985, 980, 974, 966, 958, 948)	c(990, 988, 986, 984, 983, 982, 981)
i	c(10, 12, 14, 16, 19, 22, 26)	c(10, 8, 6, 5, 4, 3, 3)
r	c(0, 3, 6, 10, 15, 20, 26)	c(0, 4, 8, 11, 13, 15, 16)
t	c(0, 0, 0, 0, 0, 0, 0)	c(0, 0, 0, 0, 0, 0, 0)
d	c(0, 0, 0, 0, 0, 0, 0)	c(0, 0, 0, 0, 0, 0, 0)
tweets	c(0, 2, 3, 3, 4, 4, 5)	c(0, 3, 2, 2, 2, 1, 1)
beta_daily_rate	c(0.459, 0.459, 0.459, 0.459, 0.459, 0.459, 0.459)	c(0.24, 0.24, 0.24, 0.24, 0.24, 0.24, 0.24)
beta_mean	0.459	0.24
gamma	0.29	0.443
death_prob	0.009	0.01
tweet_rate	0.234	0.306
days2death	10	10
reports	c("graph_sim", "plot")	c("graph_sim", "plot")
d_in_interval	NA	NA
tweets_in_interval	NA	NA
fit	NA	NA
n_patient_zero	10	10

Above are transposed output for two draws for slight variations of the Brazil hand parameterization.

cat(pull_section("rbind", here::here('R', 'run_eval.R'), num_lines = TRUE))

20: run_data_df <- brazil_actual_df #rbind(brazil_sim_df, brazil_actual_df)

We bind the Brazil run with the draws resulting in 3 rows:

sim_run_id	1	2	3
description	Brazil 1 approximation	Brazil 1 approximation	Brazil 1 approximation
seed	93435	93435	93435
n_pop	1000	1000	1000
n_days	7	7	7
model_to_run	none	none	none
compute_likelihood	NA	NA	NA
use_tweets	NA	NA	NA
s	c(990, 987, 984, 981, 977, 974, 970)	c(990, 985, 980, 974, 966, 958, 948)	c(990, 988, 986, 984, 983, 982, 981)
i	c(10, 10, 11, 11, 12, 12, 13)	c(10, 12, 14, 16, 19, 22, 26)	c(10, 8, 6, 5, 4, 3, 3)
r	c(0, 3, 5, 8, 11, 14, 17)	c(0, 3, 6, 10, 15, 20, 26)	c(0, 4, 8, 11, 13, 15, 16)
t	c(0, 0, 0, 0, 0, 0, 0)	c(0, 0, 0, 0, 0, 0, 0)	c(0, 0, 0, 0, 0, 0, 0)
d	c(0, 0, 0, 0, 0, 0, 0)	c(0, 0, 0, 0, 0, 0, 0)	c(0, 0, 0, 0, 0, 0, 0)
tweets	c(0, 1, 1, 1, 1, 1, 1)	c(0, 2, 3, 3, 4, 4, 5)	c(0, 3, 2, 2, 2, 1, 1)
beta_daily_rate	c(0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3)	c(0.459, 0.459, 0.459, 0.459, 0.459, 0.459, 0.459)	c(0.24, 0.24, 0.24, 0.24, 0.24, 0.24, 0.24)
beta_mean	0.3	0.459	0.24
gamma	0.255	0.29	0.443
death_prob	0.01	0.009	0.01
tweet_rate	0.1	0.234	0.306
days2death	10	10	10
reports	c("graph_sim", "plot")	c("graph_sim", "plot")	c("graph_sim", "plot")
d_in_interval	NA	NA	NA
tweets_in_interval	NA	NA	NA
fit	NA	NA	NA
n_patient_zero	10	10	10

Setting up modeling

At this point we have three sets of data, one hand configured to mirror Brazil data and two that vary simulation parameters from the hand configured model. Now we apply modeling configurations with subroutines from R/modeling_configs.R.

cat(pull_section("model_stan_baseline", here::here('R', 'run_eval.R'), num_lines = TRUE))

22: run_df <- model_stan_baseline(run_data_df) #in R/modeling_configs.R

There are two model configurations which apply to the three rows uniformly. Hence, we end up with 6 configurations total. One model applies tweets to the task, and one does not. The final result is:

sim_run_id	1	2	3
description	Brazil 1 approximation use tweets	Brazil 1 approximation use tweets	Brazil 1 approximation use tweets
seed	93435	93435	93435
n_pop	1000	1000	1000
n_days	7	7	7
model_to_run	baseline	baseline	baseline
compute_likelihood	1	1	1
use_tweets	NA	NA	NA
s	c(990, 987, 984, 981, 977, 974, 970)	c(990, 985, 980, 974, 966, 958, 948)	c(990, 988, 986, 984, 983, 982, 981)
i	c(10, 10, 11, 11, 12, 12, 13)	c(10, 12, 14, 16, 19, 22, 26)	c(10, 8, 6, 5, 4, 3, 3)
r	c(0, 3, 5, 8, 11, 14, 17)	c(0, 3, 6, 10, 15, 20, 26)	c(0, 4, 8, 11, 13, 15, 16)
t	c(0, 0, 0, 0, 0, 0, 0)	c(0, 0, 0, 0, 0, 0, 0)	c(0, 0, 0, 0, 0, 0, 0)
d	c(0, 0, 0, 0, 0, 0, 0)	c(0, 0, 0, 0, 0, 0, 0)	c(0, 0, 0, 0, 0, 0, 0)
tweets	c(0, 1, 1, 1, 1, 1, 1)	c(0, 2, 3, 3, 4, 4, 5)	c(0, 3, 2, 2, 2, 1, 1)
beta_daily_rate	c(0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3)	c(0.459, 0.459, 0.459, 0.459, 0.459, 0.459, 0.459)	c(0.24, 0.24, 0.24, 0.24, 0.24, 0.24, 0.24)
beta_mean	0.3	0.459	0.24
gamma	0.255	0.29	0.443
death_prob	0.01	0.009	0.01
tweet_rate	0.1	0.234	0.306
days2death	10	10	10
reports	c("graph_sim", "plot")	c("graph_sim", "plot")	c("graph_sim", "plot")
d_in_interval	NA	NA	NA
tweets_in_interval	NA	NA	NA
fit	NA	NA	NA
n_patient_zero	10	10	10
apply_twitter_data	1	1	1
ode_solver	block	block	block

Configuring reporting

The specific reporting can be controlled from the run_df as well.

cat(pull_section("reports <-", here::here('R','run_eval.R'), num_lines = TRUE))

25: #run_df$reports <- list(c('graph_sim', 'graph_tweets', 'graph_d', 'plot'))

After each run_df iteration the listed graphing and print routines are controlled via the above. This list is all available, they are explained in the next section.

Running models

Each row of run_df is a complete specification of both data and model configuration. The code simply iterates over each row, fits it with the appropriate model/data configuration (if specified) and adds the results by computing simulated days data for tweets and deaths using the .2 to .8 central sample’s interval for the model run. See R/util.R for the implementation of countPredictionsInQuantile.

cat(pull_section('# run models', here::here('R','run_eval.R')))

# run models
j <- 0
while (j < nrow(run_df)) {
  j <- j + 1
  fit <- NA
  if (run_df[j,]$model_to_run == 'baseline') {
    stan_data <-
      list(n_days = run_df[j,]$n_days,
           sDay1 = run_df[j,]$n_pop - 1,
           iDay1 = 1,
           rDay1 = 0,
           dDay1 = 0,
           NPop = run_df[j,]$n_pop,
           tweets = unlist(run_df[j,]$tweets),
           deaths = unlist(run_df[j,]$d),
           compute_likelihood = run_df[j,]$compute_likelihood,
           run_twitter = run_df[j,]$use_tweets,
           run_block_ODE = ifelse(run_df[j,]$ode_solver == 'block', 1, 0),
           run_rk45_ODE = ifelse(run_df[j,]$ode_solver == 'rk45', 1, 0),
           scale = 1,
           center = 1)
    model <- cmdstan_model(here::here("stan", "baseline.stan"))

    fit <- model$sample(data=stan_data,
                        parallel_chains = 4,
                        iter_warmup = 1000,
                        iter_sampling = 1000,
                        chains = 4,
                        seed = 4857)
    run_df[j,]$fit = list(fit)
  }
  else if (run_df[j,]$model_to_run == 'UNINOVE_Brazil') {
    stan_data_2 <- list(n_days = run_df[j,]$n_days,
                        y0 = c(run_df[j,]$n_pop - run_df[j,]$n_patient_zero,
                               run_df[j,]$n_patient_zero, 0, 0, 0), # one more zero here
                        t0 = 0,
                        ts = 1:run_df[j,]$n_days,
                        compute_likelihood = run_df[j,]$compute_likelihood,
                        use_twitter = run_df[j,]$apply_twitter_data,
                        death_count = unlist(run_df[j,]$d),
                        symptomaticTweets = unlist(run_df[j,]$tweets),
                        prior_beta_mean = 0,
                        prior_beta_std = 10,
                        prior_omega_mean = 0,
                        prior_omega_std = 10,
                        prior_dI_mean = 0,
                        prior_dI_std = 10,
                        prior_dT_mean = 0,
                        prior_dT_std = 10,
                        prior_twitter_lambda = 1
                        )
    model2 <- cmdstan_model(here::here("stan", "tweet_sirtd_negbin_ODE.stan"))
    fit <- model2$sample(data=stan_data_2,
                         parallel_chains = 4,
                         iter_warmup = 1000,
                         iter_sampling = 1000,
                         chains = 4,
                         seed = 4857)
  }
  if (run_df[j,]$model_to_run != 'none') {
    d_tweets_in_interval = countPredictionsInQuantile(fit = fit,
                                                      run_df = run_df,
                                                      j = j, print = TRUE)
    run_df[j,]$d_in_interval = d_tweets_in_interval[1]
    run_df[j,]$tweets_in_interval = d_tweets_in_interval[2]
  }
# run models

Only two models are available at present.

Presenting results

At this point results will likely get more idiosyncratic to experiment needs. The below if statements check the listed values for reports and apply the corresponding reporting. Note that each run is reported with no attempt to collect them into a collection–we intend to add that functionality eventually.There are the following graphing options:

Graph simulation compartments with dots.
Graph ODE compartments if the Stan model mirrors the states of the simulation, e.g., is roughly isomorphic.
Graph predicted tweets and predicted d or deceased from the Stan model. If the likelihood is computed then this is a posterior predictive check, if the likelihood is not computed then it is a prior predictive check. There is a 20%/80% central interval ribbon plot as well.
Control over whether the plots are printed. All plots for a single run are overlaid.

All subroutines are in the R/Util.R file.

  cat(pull_section('# section 6', here::here('R','run_eval.R')))

# section 6
  plot <- ggplot(data = NULL, aes(x = day, y = count))
  if ('graph_data' %in% unlist(run_df[j,]$reports)) {
    plot <- graph_real_data(data_df = run_df[j,], plot = plot)
  }
  if ('graph_sim' %in% unlist(run_df[j,]$reports)) {
    plot <- graph_sim_data(data_df = run_df[j,], hide_s = TRUE, plot = plot)
  }
  if ('graph_ODE' %in% unlist(run_df[j,]$reports)) {
    plot <- graph_ODE(data_df = run_df[j,], fit = fit, hide_s = TRUE,
                             plot = plot)
  }
  if ('graph_tweets' %in% unlist(run_df[j,]$reports)) {
    plot <- plot_predictions(plot = plot, prediction_label = 'pred_tweets',
                             fit = fit,
                             show_ribbon = TRUE)
  }
  if ('graph_d' %in% unlist(run_df[j,]$reports)) {
    plot <- plot_predictions(plot = plot, prediction_label = 'pred_deaths',
                             fit = fit,
                             show_ribbon = TRUE)
  }
  plot = plot + theme(legend.position = "none")
  if (length(unlist(run_df[j,]$reports)) > 0) {
    print(plot)
  }
# section 6

A parameter recovery report finalizes the reporting options per run:

 cat(pull_section('# section 7', here::here('R','run_eval.R')))

# section 7
  if ('param_recovery' %in% run_df[j,]$reports) {
    cat(param_recovery(data_df = run_df[j,], fit = fit))
  }
# section 7

which reports on whether the simulation parameters and those from the model.

All run summary

A summary across all the runs is done next by reporting relevant columns of the run_df:

 cat(pull_section('# section 8', here::here('R','run_eval.R')))

# section 8
summary_cols = c('sim_run_id', 'model_to_run', 'beta_mean', 'gamma', 'death_prob',
                 'tweet_rate', 'days2death', 'description',
                 'd_in_interval', 'tweets_in_interval','n_days')
print(run_df[,summary_cols])
# section 8

There are many more result views available that we will eventually integrate into this document. They include:

Time series analysis, e.g., what happens with simulations that only provide the first interval of data to the models?

Simulate/Run/Evaluate Code for SIRTD models

Breck Baldwin

Jose Storopoli

Conor Rosato

August 08, 2021

Introduction

Navigation

Data Generating Process – DGP

Running simulation and fitting a model

Setting up the `run_df` data frame that controls the experiments

Setting up modeling

Configuring reporting

Running models

Presenting results

All run summary

Simulate/Run/Evaluate Code for SIRTD models

Breck Baldwin

Jose Storopoli

Conor Rosato

August 08, 2021

Introduction

Navigation

Data Generating Process – DGP

Running simulation and fitting a model

Setting up the run_df data frame that controls the experiments

Setting up modeling

Configuring reporting

Running models

Presenting results

All run summary

Setting up the `run_df` data frame that controls the experiments