This document lives in the repository at https://github.com/codatmo/model_template as index.Rmd and is rendered to HTML as index.html. Releases, which include this file, can be downloaded at https://github.com/codatmo/model_template/releases.
The goal of this example model is to provide a template for reproducing COVID models in the CoDatMo framework. The model presented is a three-parameter linear regression fit to artificial data.
This model is part of the CoDatMo (Co)vid (Dat)a (Mo)deling site (https://codatmo.github.io/) which is intended to replicate and make available important COVID models written in Bayesian modeling languages like Stan or PyMC3.
This toy example attempts to model the relationship between phone calls to government services (111 calls) that report COVID-19 symptoms and hospital admissions of patients who test positive for COVID-19 14 days later.
The model runs a simple linear regression with three parameters being estimated:
\[
\begin{aligned}
y_n &\sim \operatorname{normal}(\alpha + \beta x_n,\ \sigma) \\
\alpha &\sim \operatorname{normal}(0, 1000) \\
\beta &\sim \operatorname{normal}(0, 2) \\
\sigma &\sim \operatorname{normal}^{+}(500, 300)
\end{aligned}
\]
where \(\operatorname{normal}^{+}\) denotes a normal distribution truncated below at zero, matching the lower bound on \(\sigma\) in the Stan code below.
The data are not centered or scaled for this example, which leads to some oddly parameterized priors. This is not recommended practice, but it keeps the example simple.
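If standardization were applied (it is not here), a minimal sketch might look like the following once the data are loaded below; x_std and y_std are illustrative names, not part of the template:
# Hypothetical standardization (not used in this example):
x_std <- (x_calls_111 - mean(x_calls_111)) / sd(x_calls_111)
y_std <- (y_cvd_hosp - mean(y_cvd_hosp)) / sd(y_cvd_hosp)
# With standardized data, priors like alpha ~ normal(0, 1) become reasonable.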
The data are artificially generated, with generating values drawn randomly from the priors, and reside in data/data.R. The generating script is generate_data.R. Remove or change the random seed if the code is rerun with the expectation of getting different data. Usually we don't have access to generating parameters, so we won't consider them here. Later we will generate data as part of model validation.
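As a rough illustration of what generate_data.R does, here is a minimal sketch; the variable names match data/data.R, but the seed, day count, call-volume range, and the choice to keep the slope positive are assumptions, not the script's actual values:
# Illustrative sketch of the data generating process (assumed values)
set.seed(123)                                  # remove/change for new data
n_days_data <- 100
alpha <- rnorm(1, 0, 1000)                     # intercept drawn from its prior
beta <- abs(rnorm(1, 0, 2))                    # slope, kept positive by assumption
sigma <- abs(rnorm(1, 500, 300))               # noise sd, kept positive
x_calls_111 <- round(runif(n_days_data, 0, 1000))
y_cvd_hosp <- round(rnorm(n_days_data, alpha + beta * x_calls_111, sigma))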
The data generating process is as follows: a person calls 111 and reports COVID symptoms, as determined by the call center. We expect a baseline COVID admission rate at hospitals, plus some percentage of the callers eventually being admitted to the hospital 14 days later.
library(dagitty)

# Causal graph: 111 calls -> COVID hospital admissions
graph <- dagitty("dag {x_calls_111 -> y_cvd_hosp}")
coordinates(graph) <- list(x=c(x_calls_111=1, y_cvd_hosp=2),
                           y=c(x_calls_111=0, y_cvd_hosp=0))
plot(graph)
The above causal graph treats 111 calls as causal for hospital admissions, which is certainly incorrect. A more complete causal model would be sensitive to the underlying COVID rate, but we are keeping the model simple.
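For instance, a sketch of that more complete graph with an assumed covid_rate node acting as a common cause (illustrative only, not part of the template):
# Hypothetical extension: the underlying COVID rate drives both calls and admissions
graph_full <- dagitty("dag {
  covid_rate -> x_calls_111
  covid_rate -> y_cvd_hosp
  x_calls_111 -> y_cvd_hosp
}")
plot(graphLayout(graph_full))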
Below the data are loaded.
source(file="data/data.R")
# vars are: 'n_days_data','x_calls_111', 'y_cvd_hosp'
data <- data.frame(x_calls_111, y_cvd_hosp)
head(data)
  x_calls_111 y_cvd_hosp
1         228        184
2         820        584
3         998        480
4         743        466
5         820        454
6          35        101
There is no data cleaning or processing, but we will graph the data.
library(ggplot2)
ggplot(data) + aes(x=x_calls_111,y=y_cvd_hosp) + geom_point()
The conversion to Stan input given the above data is as follows:
stan_data <- list(N=nrow(data), x_calls_111=data$x_calls_111, y_cvd_hosp=data$y_cvd_hosp,
compute_likelihood=1, compute_prediction=0)
Note that the variables compute_likelihood and compute_prediction control Stan program execution: 1 means the relevant code is run and 0 means it is skipped.
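For example, to rebuild the same input list with prediction switched on, a hedged one-liner using base R's modifyList (stan_data_pred is an illustrative name):
# Toggle prediction on without retyping the whole list
stan_data_pred <- modifyList(stan_data, list(compute_prediction = 1))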
The Stan model is located at model_template/stan/linear_regression.stan:
data {
  int<lower = 0> N;                    // number of data elements
  vector[N] x_calls_111;               // predictor vector
  vector[N] y_cvd_hosp;                // outcomes vector
  int<lower = 0, upper = 1> compute_likelihood;
  int<lower = 0, upper = 1> compute_prediction;
}
transformed data {
  int P = compute_prediction ? N : 0;  // controls whether predictions are
                                       // accumulated, see generated quantities {}
}
parameters {
  real beta_coef_111_call;
  real alpha_intercept;
  real<lower = 0> sigma_sd;
}
model {
  alpha_intercept ~ normal(0, 1000);
  beta_coef_111_call ~ normal(0, 2);
  sigma_sd ~ normal(500, 300);
  if (compute_likelihood == 1) {
    for (n in 1:N) {
      y_cvd_hosp[n] ~ normal(alpha_intercept + beta_coef_111_call * x_calls_111[n],
                             sigma_sd);  // likelihood
    }
  }
}
generated quantities {
  vector[P] y_cvd_hosp_pred;  // if P == 0 the variable is not accumulated in posterior draws
  if (compute_prediction == 1) {
    for (p in 1:P) {
      y_cvd_hosp_pred[p] = normal_rng(alpha_intercept + beta_coef_111_call * x_calls_111[p],
                                      sigma_sd);
    }
  }
}
library(cmdstanr)
model <- cmdstan_model(file.path("stan","linear_regression.stan"))
stan_data <- list(N=nrow(data), x_calls_111=data$x_calls_111, y_cvd_hosp=data$y_cvd_hosp,
compute_likelihood=1, compute_prediction=0)
fit <- model$sample(data=stan_data, seed=999, chains=4)
Running MCMC with 4 sequential chains...
Chain 1 Iteration:    1 / 2000 [  0%]  (Warmup)
Chain 1 Iteration: 1000 / 2000 [ 50%]  (Warmup)
Chain 1 Iteration: 1001 / 2000 [ 50%]  (Sampling)
Chain 1 Iteration: 2000 / 2000 [100%]  (Sampling)
Chain 1 finished in 0.4 seconds.
... (analogous progress lines for chains 2-4 omitted)
Chain 2 finished in 0.3 seconds.
Chain 3 finished in 0.4 seconds.
Chain 4 finished in 0.3 seconds.
All 4 chains finished successfully.
Mean chain execution time: 0.3 seconds.
Total execution time: 1.9 seconds.
Viewing a text-based summary of the fit:
fit$cmdstan_summary()
Inference for Stan model: linear_regression_model
4 chains: each with iter=(1000,1000,1000,1000); warmup=(0,0,0,0); thin=(1,1,1,1); 4000 iterations saved.
Warmup took (0.22, 0.16, 0.25, 0.15) seconds, 0.77 seconds total
Sampling took (0.15, 0.14, 0.13, 0.16) seconds, 0.58 seconds total
                     Mean     MCSE StdDev     5%    50%    95%   N_Eff N_Eff/s   R_hat
lp__                 -430  3.3e-02    1.2   -433   -430   -429    1438    2467     1.0
accept_stat__        0.93  1.1e-02  0.098   0.73   0.97    1.0 8.5e+01 1.5e+02 1.0e+00
stepsize__           0.36  3.9e-02  0.055   0.27   0.40   0.40 2.0e+00 3.4e+00 4.0e+13
treedepth__           2.8  1.1e-01   0.73    2.0    3.0    4.0 4.3e+01 7.4e+01 1.0e+00
n_leapfrog__          9.1  7.2e-01    4.6    3.0    7.0     15 4.0e+01 6.9e+01 1.0e+00
divergent__          0.00      nan   0.00   0.00   0.00   0.00     nan     nan     nan
energy__              432  4.7e-02    1.7    430    431    435 1.4e+03 2.4e+03 1.0e+00
beta_coef_111_call   0.51  4.0e-04  0.016   0.49   0.51   0.54    1559    2674     1.0
alpha_intercept        68  2.3e-01    9.2     52     68     83    1527    2619     1.0
sigma_sd               47  7.6e-02    3.4     41     46     52    2009    3446     1.0
Samples were drawn using hmc with nuts.
For each parameter, N_Eff is a crude measure of effective sample size,
and R_hat is the potential scale reduction factor on split chains (at
convergence, R_hat=1).
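For programmatic access, cmdstanr's fit$summary() (backed by the posterior package) returns the same statistics as a data frame; restricting it to the model parameters is a minimal sketch:
# Summarize just the model parameters as a tibble
fit$summary(variables = c("alpha_intercept", "beta_coef_111_call", "sigma_sd"))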
Viewing the posteriors graphically via the very useful bayesplot package:
library(bayesplot)
mcmc_hist(fit$draws(variables=c("sigma_sd", "beta_coef_111_call", "alpha_intercept")))
Below are the diagnostics used to help validate the model. These standard diagnostics look for problems in how the posterior was sampled.
fit$cmdstan_diagnose()
Processing csv files: /var/folders/n7/p0bt_2fs5j1g9tmcrcgqpj4w0000gn/T/RtmpiRcrVF/linear_regression-202103082000-1-27219e.csv, /var/folders/n7/p0bt_2fs5j1g9tmcrcgqpj4w0000gn/T/RtmpiRcrVF/linear_regression-202103082000-2-27219e.csv, /var/folders/n7/p0bt_2fs5j1g9tmcrcgqpj4w0000gn/T/RtmpiRcrVF/linear_regression-202103082000-3-27219e.csv, /var/folders/n7/p0bt_2fs5j1g9tmcrcgqpj4w0000gn/T/RtmpiRcrVF/linear_regression-202103082000-4-27219e.csv
Checking sampler transitions treedepth.
Treedepth satisfactory for all transitions.
Checking sampler transitions for divergences.
No divergent transitions found.
Checking E-BFMI - sampler transitions HMC potential energy.
E-BFMI satisfactory for all transitions.
Effective sample size satisfactory.
Split R-hat values satisfactory all parameters.
Processing complete, no problems detected.
The prior predictive check estimates the model parameters without using the likelihood. The resulting draws are then used to predict new data by applying the likelihood predictively given those draws. Note that compute_likelihood=0 prevents the likelihood from being computed in the model.
library(cmdstanr)
library(rstan)
library(ggplot2)
library(bayesplot)
library(tidyr)
model <- cmdstan_model(file.path("stan", "linear_regression.stan"))
stan_data <- list(N=nrow(data), x_calls_111=data$x_calls_111, y_cvd_hosp=data$y_cvd_hosp,
                  compute_likelihood=0, compute_prediction=1)
fit <- model$sample(data=stan_data, seed=999, chains=4)
Running MCMC with 4 sequential chains...
Chain 1 Iteration:    1 / 2000 [  0%]  (Warmup)
Chain 1 Iteration: 1000 / 2000 [ 50%]  (Warmup)
Chain 1 Iteration: 1001 / 2000 [ 50%]  (Sampling)
Chain 1 Iteration: 2000 / 2000 [100%]  (Sampling)
Chain 1 finished in 0.2 seconds.
... (analogous progress lines for chains 2-4 omitted)
Chain 2 finished in 0.2 seconds.
Chain 3 finished in 0.2 seconds.
Chain 4 finished in 0.2 seconds.
All 4 chains finished successfully.
Mean chain execution time: 0.2 seconds.
Total execution time: 0.9 seconds.
mcmc_hist(fit$draws(variables=c("sigma_sd", "beta_coef_111_call", "alpha_intercept")))
Above we see draws that reflect only the priors, since no data informed them. Generally some justification for prior distributions is expected; the above are weakly informative in that they cover a broad range of plausible values.
Draws from the above distributions allow generation of simulated output data given the input data x_calls_111. Note that compute_likelihood=0 prevents adding information from the data.
rs_fit <- rstan::read_stan_csv(fit$output_files())  # read CmdStan csv output into rstan
rs_ex <- rstan::extract(rs_fit)                     # extract draws as a named list
# Pick 10 random draws and arrange them as columns alongside the data
random_draws <- sample(1:nrow(rs_ex$y_cvd_hosp_pred), 10, replace=FALSE)
draws <- data.frame(t(rs_ex$y_cvd_hosp_pred[random_draws, ]))
names(draws) <- random_draws
draw_names <- colnames(draws)
p_data2 <- cbind(data, draws)
# Reshape to long format: one row per (observation, draw) pair
p_long_data <- gather(p_data2, draw, y_sim, draw_names)
p <- ggplot(data=p_long_data, aes(x=x_calls_111)) +
  geom_point(aes(y=y_sim, group=draw, color=draw), size=.5) +
  geom_line(aes(y=y_cvd_hosp), color="black", size=.5)
print(p)
Actual data are plotted as a black line for context. At this point some comforting statements are made that the prior's informativeness is minimal and that Bayes himself would bless the entire effort were he alive to do so.
Parameter recovery establishes that, for some small set of values, the model reasons properly. We pick a draw from the above distributions, simulate data with it, and then attempt to recover the parameters we simulated with. Looking at the above graph, we can pick expected outliers or draws close to the actual data; here we pick draw 2959 as a middle-of-the-road example.
# Pick one arbitrary draw from the prior predictive distribution
draw <- 2959
stan_data <- list(N=nrow(data),
                  x_calls_111=data$x_calls_111,
                  y_cvd_hosp=rs_ex$y_cvd_hosp_pred[draw, ],
                  compute_likelihood=1, compute_prediction=0)
fit <- model$sample(data=stan_data, seed=999, chains=4)
Running MCMC with 4 sequential chains...
Chain 1 Iteration:    1 / 2000 [  0%]  (Warmup)
Chain 1 Iteration: 1000 / 2000 [ 50%]  (Warmup)
Chain 1 Iteration: 1001 / 2000 [ 50%]  (Sampling)
Chain 1 Iteration: 2000 / 2000 [100%]  (Sampling)
Chain 1 finished in 0.5 seconds.
... (analogous progress lines for chains 2-4 omitted)
Chain 2 finished in 0.5 seconds.
Chain 3 finished in 0.6 seconds.
Chain 4 finished in 0.5 seconds.
All 4 chains finished successfully.
Mean chain execution time: 0.5 seconds.
Total execution time: 2.5 seconds.
fit$cmdstan_summary()
Inference for Stan model: linear_regression_model
4 chains: each with iter=(1000,1000,1000,1000); warmup=(0,0,0,0); thin=(1,1,1,1); 4000 iterations saved.
Warmup took (0.40, 0.36, 0.42, 0.33) seconds, 1.5 seconds total
Sampling took (0.15, 0.14, 0.16, 0.14) seconds, 0.59 seconds total
                     Mean     MCSE StdDev     5%    50%    95%   N_Eff N_Eff/s   R_hat
lp__                 -726  3.0e-02    1.2   -728   -725   -724    1623    2759    1.00
accept_stat__        0.93  2.1e-03   0.10   0.70   0.97    1.0 2.5e+03 4.2e+03 1.0e+00
stepsize__           0.39  3.2e-02  0.046   0.31   0.41   0.43 2.0e+00 3.4e+00 2.9e+13
treedepth__           2.7  7.6e-02   0.73    2.0    3.0    4.0 9.1e+01 1.6e+02 1.0e+00
n_leapfrog__          8.6  4.9e-01    4.4    3.0    7.0     15 8.3e+01 1.4e+02 1.0e+00
divergent__          0.00      nan   0.00   0.00   0.00   0.00     nan     nan     nan
energy__              727  4.4e-02    1.7    725    727    731 1.5e+03 2.6e+03 1.0e+00
beta_coef_111_call    6.1  7.2e-03   0.29    5.6    6.1    6.5    1580    2688     1.0
alpha_intercept      1415  4.2e+00    167   1143   1419   1691    1589    2702     1.0
sigma_sd              856  1.2e+00     58    768    852    955    2429    4131    1.00
Samples were drawn using hmc with nuts.
For each parameter, N_Eff is a crude measure of effective sample size,
and R_hat is the potential scale reduction factor on split chains (at
convergence, R_hat=1).
report <- paste(sprintf("\nDraw number %d", draw),
sprintf("actual alpha_intercept=%.2f", rs_ex$alpha_intercept[draw]),
sprintf("actual beta_coef_111_call=%.2f", rs_ex$beta_coef_111_call[draw]),
sprintf("actual sigma_sd=%.2f", rs_ex$sigma_sd[draw]), sep="\n")
cat(report)
Draw number 2959
actual alpha_intercept=1700.66
actual beta_coef_111_call=5.88
actual sigma_sd=766.31
The generating value of beta_coef_111_call falls within the 5% to 95% posterior interval. The generating values of alpha_intercept and sigma_sd land just outside their intervals.
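A minimal sketch of checking this programmatically (check_interval is an illustrative helper, not part of the template; the 90% central interval matches the 5%/95% columns above):
library(posterior)
post <- as_draws_df(fit$draws())

# TRUE if the generating value lies inside the central 90% posterior interval
check_interval <- function(draws_vec, actual) {
  q <- quantile(draws_vec, probs = c(0.05, 0.95))
  actual >= q[1] && actual <= q[2]
}

check_interval(post$alpha_intercept,    rs_ex$alpha_intercept[draw])
check_interval(post$beta_coef_111_call, rs_ex$beta_coef_111_call[draw])
check_interval(post$sigma_sd,           rs_ex$sigma_sd[draw])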
The posterior predictive check is like the prior predictive check, except that actual data are included in estimating the parameters. Note that both compute_likelihood=1 and compute_prediction=1, and that the actual data are supplied in place of the simulated data from above.
stan_data <- list(N=nrow(data), x_calls_111=data$x_calls_111, y_cvd_hosp=data$y_cvd_hosp,
                  compute_likelihood=1, compute_prediction=1)
fit <- model$sample(data=stan_data, seed=999, chains=4)
Running MCMC with 4 sequential chains...
Chain 1 Iteration:    1 / 2000 [  0%]  (Warmup)
Chain 1 Iteration: 1000 / 2000 [ 50%]  (Warmup)
Chain 1 Iteration: 1001 / 2000 [ 50%]  (Sampling)
Chain 1 Iteration: 2000 / 2000 [100%]  (Sampling)
Chain 1 finished in 0.5 seconds.
... (analogous progress lines for chains 2-4 omitted)
Chain 2 finished in 0.5 seconds.
Chain 3 finished in 0.6 seconds.
Chain 4 finished in 0.5 seconds.
All 4 chains finished successfully.
Mean chain execution time: 0.5 seconds.
Total execution time: 2.4 seconds.
# Same extraction and plotting steps as above, now using posterior draws
rs_fit <- rstan::read_stan_csv(fit$output_files())
rs_ex <- rstan::extract(rs_fit)
random_draws <- sample(1:nrow(rs_ex$y_cvd_hosp_pred), 10, replace=FALSE)
draws <- data.frame(t(rs_ex$y_cvd_hosp_pred[random_draws, ]))
names(draws) <- random_draws
draw_names <- colnames(draws)
p_data2 <- cbind(data, draws)
p_long_data <- gather(p_data2, draw, y_sim, draw_names)
p <- ggplot(data=p_long_data, aes(x=x_calls_111)) +
  geom_point(aes(y=y_sim, group=draw, color=draw), size=.5) +
  geom_line(aes(y=y_cvd_hosp), color="black", size=.5)
print(p)
Actual data are shown as a black line, with 10 draws from the posterior predictive distribution. Clearly the data help.
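One way to quantify this fit is to compute the share of actual observations that land inside the central 90% posterior predictive interval; this is a hedged sketch, and the 90% interval choice is an assumption:
# Empirical coverage of the central 90% posterior predictive interval
pred <- rs_ex$y_cvd_hosp_pred                   # draws x N matrix of predictions
lower <- apply(pred, 2, quantile, probs=0.05)
upper <- apply(pred, 2, quantile, probs=0.95)
mean(data$y_cvd_hosp >= lower & data$y_cvd_hosp <= upper)
# Values near 0.90 suggest the predictive intervals are well calibrated.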