This page describes the simulation, modeling and evaluation code in R/run_eval.R. The goal is to provide sufficient information to run and modify the code. The focus for the current data generating process (DGP) is SIRTD epidemiological models derived from compartments for the states (S) susceptible, (I) infectious and (R) for recovered – SIR models. The current simulations add compartments for (T) terminally-ill and (D) deceased counts useful for the grim task of distinguishing the deceased from recovered. There is additionally data generated for tweets that mention symptoms from individuals in the (I) infectious compartment but that exists separate from the core SIRTD model.
For more explanation about SIR models see Stan Case Study – Boarding School which we found to be a good explanation as well as an excellent introduction to Bayesian approaches using Stan. This is the model used to fit this data, there is a model reproduction checklist at codatmo/Simple_SIR.
The DGP simulates a fixed population at the individual level on a daily basis. There are a series of simulating DGPs with varied properties. This document covers a DGP that closely mirrors the compartment model structure with real-valued compartments with population transfers between states computed via real-valued multiplication on each day of the simulation. There is no stochastic (random) element at all in this simulation since it serves as a baseline for more complex models which will have stochastic elements as well as operating on different scales, e.g., agent-based models.
The actual code is shown below, the Rmarkdown commands are included to be clear about where code is and how the document is being generated. Note that there is a pull_section function that prints sections of code based on being between comments that will be used throughout. This allows software to exist outside the notebook format without introducing copy/paste errors that often come up or other mismatches between running code and it’s description.
#' Returns lines of file between the identified section '# <section name>' and
#' the matching section later in the file or the end of the file.
#' @param sections Comment that starts/stops collection of lines, e.g., '# section 1'
#' @param file Path to file to pull from
pull_section <- function(section, file, num_lines = FALSE) {
collect = FALSE
return_string = ''
line_num = 0
for (line in readLines(file)) {
line_num = line_num + 1
if (line == section) {
collect = !collect
}
prefix = ""
if (num_lines) {
prefix = paste0(line_num,": ")
}
if (grepl(section,line)) {
return_string = paste0(return_string, paste0(prefix,line), '\n')
}
else if (collect) {
return_string = paste0(return_string, paste0(prefix,line), '\n')
}
}
return(return_string)
}
cat(pull_section('# sirtd_vary_beta_exact', here::here('R', 'SIRTDsim.R')))
# sirtd_vary_beta_exact
#' Runs a simple deterministic SIRTD model with possible daily varied beta values.
#' @param seed Seed for stochastic processes, ignored since this is a deterministic function
#' @param n_pop Population size
#' @param n_days Number of days to run simulation
#' @param print Boolean to control whether daily summary is printed
#' @param beta_daily_inf_rates Controls number of s that are infected by i to
#' become i. Should be between 0 and 1
#' @param gamma_res_per_day_rate Controls number of i that transition to r or t
#' per day. Should be between 0 and 1
#' @param death_prob Probability transitioning from i to t
#' @param tweet_rate_infected If infected, what rate of tweets per day. Can be
#' > 1
#' @param n_patient_zero Number of patients infected at day 0
#' @param mean_days_to_death_from_t Mean number of days in t before d
#' @return dataframe simulating sirtd+tweets state for number days
sirtd_vary_beta_exact <- function(seed,
n_pop,
n_days,
print,
beta_daily_inf_rates,
gamma_res_per_day_rate,
death_prob,
tweet_rate_infected,
n_patient_zero,
mean_days_to_death_from_t) {
old_seed <- .Random.seed
on.exit({.Random.seed <<- old_seed})
set.seed(seed)
i <- n_patient_zero
r <- 0
t <- 0
d <- 0
s <- n_pop - n_patient_zero
col_names = c('day', 's', 'i', 'r', 't', 'd', 'tweets')
df = data.frame(matrix(nrow = 0, ncol=length(col_names)))
colnames(df) = col_names
tweets = 0
for (day in 1:n_days) {
df[day,] = rep(NA,length(col_names))
df[day,]$s= round(s)
df[day,]$i = round(i)
df[day,]$r = round(r)
df[day,]$t = round(t)
df[day,]$d = round(d)
df[day,]$day = day
df[day,]$tweets = round(tweets)
if (print) {
cat(
sprintf(
"\nDay=%d, susceptible=%.2f, infected=%.2f, recovered=%.2f, terminal=%.2f,
dead=%.2f, tweets=%.2f, R0=%.2f %.2f",
df[day,]$day, df[day,]$s, df[day,]$i, df[day,]$r, df[day,]$t,
df[day,]$d, df[day,]$tweets,
beta_daily_inf_rates[day]/gamma_res_per_day_rate,
s + i + r + t + d))
}
resolved <- gamma_res_per_day_rate * i
terminal <- death_prob * resolved
recovered <- resolved - terminal
dead <- 1/mean_days_to_death_from_t * t
tweets = tweet_rate_infected * i
infected <- min(i * beta_daily_inf_rates[day], s)
s <- s - infected
i <- i - resolved + infected
r <- r + recovered
t <- t - dead + terminal
d <- d + dead
}
return(df)
}
# sirtd_vary_beta_exact
The focus on this presentation is the surrounding framework so we offer the simulation code without comment, discussion of the DGP/simulation is at to be determined. We are planning to include the following simulations:
We may also run models that are time-limited, feature-richer social interaction models etc…
Below we go over how to setup and run the simulation code that generates data and runs experiments. The central code is in R/run_eval.R with use of functions in R/util.R for printing/graphing functions, R/data_configs.R for code that creates data via simulation or provides actual data, R/modeling_configs.R for code that configures parameters for modeling software and R/SIRTDsim.R that contains simulation software and is called from functions in R/data_configs.R. The libraries and dependencies are:
cat(pull_section('# dependencies', here::here('R', 'run_eval.R')))
# dependencies
library(tidyverse)
library(cmdstanr)
library(data.table)
library(kableExtra)
source(here::here("R","util.R"))
source(here::here("R","SIRTDsim.R"))
source(here::here("R","sim_configs.R"))
source(here::here("R", "data_configs.R"))
source(here::here("R","modeling_configs.R"))
# dependencies
run_df data frame that controls the experimentscat(pull_section('# setup run_df', here::here('R', 'run_eval.R'), num_lines = TRUE))
15: # setup run_df
16: setup_run_df <- setup_run_df(seed = 93435, n_pop = 214110287, n_days = 300) # in R/util.R
17: brazil_sim_df <- sim_brazil_1(setup_run_df) # in R/sim_configs.R
18: brazil_actual_df <- data_brazil_1(setup_run_df)
19: draws_run_df <- sim_draw_params(2, run_df) # in R/sim_configs.R
20: run_data_df <- brazil_actual_df #rbind(brazil_sim_df, brazil_actual_df)
21:
22: run_df <- model_stan_baseline(run_data_df) #in R/modeling_configs.R
23: run_df$use_tweets <- 0
24: #run_df$compute_likelihood <- 1 # compute likelihood across all runs
25: #run_df$reports <- list(c('graph_sim', 'graph_tweets', 'graph_d', 'plot'))
26: #list(c('graph_sim','graph_ODE', 'graph_tweets', 'graph_d', 'plot','param_recovery'))
27: # setup run_df
The run_df is a dataframe where each row will be a complete experiment configuration. The above functions add information relevant to their role. The source is very straightforward should one want to modify for their own purposes. In more detail:
cat(pull_section('setup_run_df', here::here('R', 'run_eval.R'), num_lines = TRUE))
16: setup_run_df <- setup_run_df(seed = 93435, n_pop = 214110287, n_days = 300) # in R/util.R
17: brazil_sim_df <- sim_brazil_1(setup_run_df) # in R/sim_configs.R
18: brazil_actual_df <- data_brazil_1(setup_run_df)
Running code up to this line shows the following values, note that the dataframe is transposed with rows as columns and columns as rows–unreadable otherwise:
| sim_run_id | NA |
| description | NA |
| seed | 93435 |
| n_pop | 1000 |
| n_days | 7 |
| model_to_run | none |
| compute_likelihood | NA |
| use_tweets | NA |
| s | NULL |
| i | NULL |
| r | NULL |
| t | NULL |
| d | NULL |
| tweets | NULL |
| beta_daily_rate | NA |
| beta_mean | NA |
| gamma | NA |
| death_prob | NA |
| tweet_rate | NA |
| days2death | NA |
| reports | NA |
| d_in_interval | NA |
| tweets_in_interval | NA |
| fit | NA |
This line constructs a basic template for the run_df, there is just one row at this point–remember transposed output. All subsequent additions/elaborations draw from this dataframe:
cat(pull_section('sim_brazil_1', here::here('R', 'run_eval.R'), num_lines = TRUE))
17: brazil_sim_df <- sim_brazil_1(setup_run_df) # in R/sim_configs.R
The sim_brazil_1 is a parameterization that roughly mimics the shape of the COVID-19 outbreak with corresponding simulated values as determined by trying values by hand. Look at the source code for more info. Note that the prefix on the function is sim, which means that it is part of simulation. There is still one row in the dataframe but with simulation data and the generating parameterization.
The, remember transposed, dataframe is:
| sim_run_id | 1 |
| description | Brazil 1 approximation |
| seed | 93435 |
| n_pop | 1000 |
| n_days | 7 |
| model_to_run | none |
| compute_likelihood | NA |
| use_tweets | NA |
| s | c(990, 987, 984, 981, 977, 974, 970) |
| i | c(10, 10, 11, 11, 12, 12, 13) |
| r | c(0, 3, 5, 8, 11, 14, 17) |
| t | c(0, 0, 0, 0, 0, 0, 0) |
| d | c(0, 0, 0, 0, 0, 0, 0) |
| tweets | c(0, 1, 1, 1, 1, 1, 1) |
| beta_daily_rate | c(0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3) |
| beta_mean | 0.3 |
| gamma | 0.255 |
| death_prob | 0.01 |
| tweet_rate | 0.1 |
| days2death | 10 |
| reports | c("graph_sim", "plot") |
| d_in_interval | NA |
| tweets_in_interval | NA |
| fit | NA |
| n_patient_zero | 10 |
We now have simulation data from the SIRTD model for n_days. But still only one row–remember transposed above:
cat(pull_section('sim_draw_params', here::here('R','run_eval.R'), num_lines = TRUE))
19: draws_run_df <- sim_draw_params(2, run_df) # in R/sim_configs.R
The next line runs two random based draws from the parameterization in the single simulation in run_df. Two draws are conducted and returned:
| sim_run_id | 2 | 3 |
| description | Brazil 1 approximation | Brazil 1 approximation |
| seed | 93435 | 93435 |
| n_pop | 1000 | 1000 |
| n_days | 7 | 7 |
| model_to_run | none | none |
| compute_likelihood | NA | NA |
| use_tweets | NA | NA |
| s | c(990, 985, 980, 974, 966, 958, 948) | c(990, 988, 986, 984, 983, 982, 981) |
| i | c(10, 12, 14, 16, 19, 22, 26) | c(10, 8, 6, 5, 4, 3, 3) |
| r | c(0, 3, 6, 10, 15, 20, 26) | c(0, 4, 8, 11, 13, 15, 16) |
| t | c(0, 0, 0, 0, 0, 0, 0) | c(0, 0, 0, 0, 0, 0, 0) |
| d | c(0, 0, 0, 0, 0, 0, 0) | c(0, 0, 0, 0, 0, 0, 0) |
| tweets | c(0, 2, 3, 3, 4, 4, 5) | c(0, 3, 2, 2, 2, 1, 1) |
| beta_daily_rate | c(0.459, 0.459, 0.459, 0.459, 0.459, 0.459, 0.459) | c(0.24, 0.24, 0.24, 0.24, 0.24, 0.24, 0.24) |
| beta_mean | 0.459 | 0.24 |
| gamma | 0.29 | 0.443 |
| death_prob | 0.009 | 0.01 |
| tweet_rate | 0.234 | 0.306 |
| days2death | 10 | 10 |
| reports | c("graph_sim", "plot") | c("graph_sim", "plot") |
| d_in_interval | NA | NA |
| tweets_in_interval | NA | NA |
| fit | NA | NA |
| n_patient_zero | 10 | 10 |
Above are transposed output for two draws for slight variations of the Brazil hand parameterization.
cat(pull_section("rbind", here::here('R', 'run_eval.R'), num_lines = TRUE))
20: run_data_df <- brazil_actual_df #rbind(brazil_sim_df, brazil_actual_df)
We bind the Brazil run with the draws resulting in 3 rows:
| sim_run_id | 1 | 2 | 3 |
| description | Brazil 1 approximation | Brazil 1 approximation | Brazil 1 approximation |
| seed | 93435 | 93435 | 93435 |
| n_pop | 1000 | 1000 | 1000 |
| n_days | 7 | 7 | 7 |
| model_to_run | none | none | none |
| compute_likelihood | NA | NA | NA |
| use_tweets | NA | NA | NA |
| s | c(990, 987, 984, 981, 977, 974, 970) | c(990, 985, 980, 974, 966, 958, 948) | c(990, 988, 986, 984, 983, 982, 981) |
| i | c(10, 10, 11, 11, 12, 12, 13) | c(10, 12, 14, 16, 19, 22, 26) | c(10, 8, 6, 5, 4, 3, 3) |
| r | c(0, 3, 5, 8, 11, 14, 17) | c(0, 3, 6, 10, 15, 20, 26) | c(0, 4, 8, 11, 13, 15, 16) |
| t | c(0, 0, 0, 0, 0, 0, 0) | c(0, 0, 0, 0, 0, 0, 0) | c(0, 0, 0, 0, 0, 0, 0) |
| d | c(0, 0, 0, 0, 0, 0, 0) | c(0, 0, 0, 0, 0, 0, 0) | c(0, 0, 0, 0, 0, 0, 0) |
| tweets | c(0, 1, 1, 1, 1, 1, 1) | c(0, 2, 3, 3, 4, 4, 5) | c(0, 3, 2, 2, 2, 1, 1) |
| beta_daily_rate | c(0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3) | c(0.459, 0.459, 0.459, 0.459, 0.459, 0.459, 0.459) | c(0.24, 0.24, 0.24, 0.24, 0.24, 0.24, 0.24) |
| beta_mean | 0.3 | 0.459 | 0.24 |
| gamma | 0.255 | 0.29 | 0.443 |
| death_prob | 0.01 | 0.009 | 0.01 |
| tweet_rate | 0.1 | 0.234 | 0.306 |
| days2death | 10 | 10 | 10 |
| reports | c("graph_sim", "plot") | c("graph_sim", "plot") | c("graph_sim", "plot") |
| d_in_interval | NA | NA | NA |
| tweets_in_interval | NA | NA | NA |
| fit | NA | NA | NA |
| n_patient_zero | 10 | 10 | 10 |
At this point we have three sets of data, one hand configured to mirror Brazil data and two that vary simulation parameters from the hand configured model. Now we apply modeling configurations with subroutines from R/modeling_configs.R.
cat(pull_section("model_stan_baseline", here::here('R', 'run_eval.R'), num_lines = TRUE))
22: run_df <- model_stan_baseline(run_data_df) #in R/modeling_configs.R
There are two model configurations which apply to the three rows uniformly. Hence, we end up with 6 configurations total. One model applies tweets to the task, and one does not. The final result is:
| sim_run_id | 1 | 2 | 3 |
| description | Brazil 1 approximation use tweets | Brazil 1 approximation use tweets | Brazil 1 approximation use tweets |
| seed | 93435 | 93435 | 93435 |
| n_pop | 1000 | 1000 | 1000 |
| n_days | 7 | 7 | 7 |
| model_to_run | baseline | baseline | baseline |
| compute_likelihood | 1 | 1 | 1 |
| use_tweets | NA | NA | NA |
| s | c(990, 987, 984, 981, 977, 974, 970) | c(990, 985, 980, 974, 966, 958, 948) | c(990, 988, 986, 984, 983, 982, 981) |
| i | c(10, 10, 11, 11, 12, 12, 13) | c(10, 12, 14, 16, 19, 22, 26) | c(10, 8, 6, 5, 4, 3, 3) |
| r | c(0, 3, 5, 8, 11, 14, 17) | c(0, 3, 6, 10, 15, 20, 26) | c(0, 4, 8, 11, 13, 15, 16) |
| t | c(0, 0, 0, 0, 0, 0, 0) | c(0, 0, 0, 0, 0, 0, 0) | c(0, 0, 0, 0, 0, 0, 0) |
| d | c(0, 0, 0, 0, 0, 0, 0) | c(0, 0, 0, 0, 0, 0, 0) | c(0, 0, 0, 0, 0, 0, 0) |
| tweets | c(0, 1, 1, 1, 1, 1, 1) | c(0, 2, 3, 3, 4, 4, 5) | c(0, 3, 2, 2, 2, 1, 1) |
| beta_daily_rate | c(0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3) | c(0.459, 0.459, 0.459, 0.459, 0.459, 0.459, 0.459) | c(0.24, 0.24, 0.24, 0.24, 0.24, 0.24, 0.24) |
| beta_mean | 0.3 | 0.459 | 0.24 |
| gamma | 0.255 | 0.29 | 0.443 |
| death_prob | 0.01 | 0.009 | 0.01 |
| tweet_rate | 0.1 | 0.234 | 0.306 |
| days2death | 10 | 10 | 10 |
| reports | c("graph_sim", "plot") | c("graph_sim", "plot") | c("graph_sim", "plot") |
| d_in_interval | NA | NA | NA |
| tweets_in_interval | NA | NA | NA |
| fit | NA | NA | NA |
| n_patient_zero | 10 | 10 | 10 |
| apply_twitter_data | 1 | 1 | 1 |
| ode_solver | block | block | block |
The specific reporting can be controlled from the run_df as well.
cat(pull_section("reports <-", here::here('R','run_eval.R'), num_lines = TRUE))
25: #run_df$reports <- list(c('graph_sim', 'graph_tweets', 'graph_d', 'plot'))
After each run_df iteration the listed graphing and print routines are controlled via the above. This list is all available, they are explained in the next section.
Each row of run_df is a complete specification of both data and model configuration. The code simply iterates over each row, fits it with the appropriate model/data configuration (if specified) and adds the results by computing simulated days data for tweets and deaths using the .2 to .8 central sample’s interval for the model run. See R/util.R for the implementation of countPredictionsInQuantile.
cat(pull_section('# run models', here::here('R','run_eval.R')))
# run models
j <- 0
while (j < nrow(run_df)) {
j <- j + 1
fit <- NA
if (run_df[j,]$model_to_run == 'baseline') {
stan_data <-
list(n_days = run_df[j,]$n_days,
sDay1 = run_df[j,]$n_pop - 1,
iDay1 = 1,
rDay1 = 0,
dDay1 = 0,
NPop = run_df[j,]$n_pop,
tweets = unlist(run_df[j,]$tweets),
deaths = unlist(run_df[j,]$d),
compute_likelihood = run_df[j,]$compute_likelihood,
run_twitter = run_df[j,]$use_tweets,
run_block_ODE = ifelse(run_df[j,]$ode_solver == 'block', 1, 0),
run_rk45_ODE = ifelse(run_df[j,]$ode_solver == 'rk45', 1, 0),
scale = 1,
center = 1)
model <- cmdstan_model(here::here("stan", "baseline.stan"))
fit <- model$sample(data=stan_data,
parallel_chains = 4,
iter_warmup = 1000,
iter_sampling = 1000,
chains = 4,
seed = 4857)
run_df[j,]$fit = list(fit)
}
else if (run_df[j,]$model_to_run == 'UNINOVE_Brazil') {
stan_data_2 <- list(n_days = run_df[j,]$n_days,
y0 = c(run_df[j,]$n_pop - run_df[j,]$n_patient_zero,
run_df[j,]$n_patient_zero, 0, 0, 0), # one more zero here
t0 = 0,
ts = 1:run_df[j,]$n_days,
compute_likelihood = run_df[j,]$compute_likelihood,
use_twitter = run_df[j,]$apply_twitter_data,
death_count = unlist(run_df[j,]$d),
symptomaticTweets = unlist(run_df[j,]$tweets),
prior_beta_mean = 0,
prior_beta_std = 10,
prior_omega_mean = 0,
prior_omega_std = 10,
prior_dI_mean = 0,
prior_dI_std = 10,
prior_dT_mean = 0,
prior_dT_std = 10,
prior_twitter_lambda = 1
)
model2 <- cmdstan_model(here::here("stan", "tweet_sirtd_negbin_ODE.stan"))
fit <- model2$sample(data=stan_data_2,
parallel_chains = 4,
iter_warmup = 1000,
iter_sampling = 1000,
chains = 4,
seed = 4857)
}
if (run_df[j,]$model_to_run != 'none') {
d_tweets_in_interval = countPredictionsInQuantile(fit = fit,
run_df = run_df,
j = j, print = TRUE)
run_df[j,]$d_in_interval = d_tweets_in_interval[1]
run_df[j,]$tweets_in_interval = d_tweets_in_interval[2]
}
# run models
Only two models are available at present.
At this point results will likely get more idiosyncratic to experiment needs. The below if statements check the listed values for reports and apply the corresponding reporting. Note that each run is reported with no attempt to collect them into a collection–we intend to add that functionality eventually.There are the following graphing options:
d or deceased from the Stan model. If the likelihood is computed then this is a posterior predictive check, if the likelihood is not computed then it is a prior predictive check. There is a 20%/80% central interval ribbon plot as well.All subroutines are in the R/Util.R file.
cat(pull_section('# section 6', here::here('R','run_eval.R')))
# section 6
plot <- ggplot(data = NULL, aes(x = day, y = count))
if ('graph_data' %in% unlist(run_df[j,]$reports)) {
plot <- graph_real_data(data_df = run_df[j,], plot = plot)
}
if ('graph_sim' %in% unlist(run_df[j,]$reports)) {
plot <- graph_sim_data(data_df = run_df[j,], hide_s = TRUE, plot = plot)
}
if ('graph_ODE' %in% unlist(run_df[j,]$reports)) {
plot <- graph_ODE(data_df = run_df[j,], fit = fit, hide_s = TRUE,
plot = plot)
}
if ('graph_tweets' %in% unlist(run_df[j,]$reports)) {
plot <- plot_predictions(plot = plot, prediction_label = 'pred_tweets',
fit = fit,
show_ribbon = TRUE)
}
if ('graph_d' %in% unlist(run_df[j,]$reports)) {
plot <- plot_predictions(plot = plot, prediction_label = 'pred_deaths',
fit = fit,
show_ribbon = TRUE)
}
plot = plot + theme(legend.position = "none")
if (length(unlist(run_df[j,]$reports)) > 0) {
print(plot)
}
# section 6
A parameter recovery report finalizes the reporting options per run:
cat(pull_section('# section 7', here::here('R','run_eval.R')))
# section 7
if ('param_recovery' %in% run_df[j,]$reports) {
cat(param_recovery(data_df = run_df[j,], fit = fit))
}
# section 7
which reports on whether the simulation parameters and those from the model.
A summary across all the runs is done next by reporting relevant columns of the run_df:
cat(pull_section('# section 8', here::here('R','run_eval.R')))
# section 8
summary_cols = c('sim_run_id', 'model_to_run', 'beta_mean', 'gamma', 'death_prob',
'tweet_rate', 'days2death', 'description',
'd_in_interval', 'tweets_in_interval','n_days')
print(run_df[,summary_cols])
# section 8
There are many more result views available that we will eventually integrate into this document. They include: