Skip to contents

Note on the printed output blocks. Stan / brms model fits take several minutes to run and are therefore not executed at vignette build time. The numerical outputs shown after each eval = FALSE chunk in this vignette are realistic illustrations – representative of what you would see on a real run, but not produced by the chunks above. To reproduce the exact numbers, copy the code into an interactive R session.

The Bayesian principle: response variables are never imputed

A core design rule in hbsaems is that multiple imputation only ever imputes auxiliary (predictor) variables, never the response. This is not a limitation – it is the statistically correct Bayesian treatment:

  • For a Bayesian hierarchical model with response YY, the posterior for the parameters θ\theta marginalises the missing outcomes YmisY_{\text{mis}}: p(θYobs,X)=p(θYobs,Ymis,X)p(YmisYobs,X,θ)dYmis. p(\theta \mid Y_{\text{obs}}, X) = \int p(\theta \mid Y_{\text{obs}}, Y_{\text{mis}}, X) \, p(Y_{\text{mis}} \mid Y_{\text{obs}}, X, \theta) \, \mathrm{d}Y_{\text{mis}}.
  • Plugging in a mice-imputed value for YY would replace this integral with a single point substitute, deflate the posterior uncertainty, and bias estimates if the imputation model is misspecified.
  • The correct way to “fill in” missing YY values is joint Bayesian imputation via brms::mi(), which samples YmisY_{\text{mis}} alongside θ\theta. This is the handle_missing = "model" strategy described below.

The three strategies thus have different scopes:

Strategy What it imputes Required completeness
"deleted" Nothing – drops rows with missing YY Auxiliary XX must be complete
"multiple" Only auxiliary XX via mice Any YY missingness handled separately (see below)
"model" **XX and YY jointly via brms::mi() Continuous outcomes only

When handle_missing = NULL (default), the package auto-selects a strategy based on the family’s supports_mi flag.

Missing-data patterns in SAE

Three patterns are common:

  1. Missing response, complete covariates – some areas were not sampled at all, but you have administrative auxiliaries for them.
  2. Complete response, missing covariates – the survey reached every area, but a census variable is missing for some.
  3. Missing in both – a mix of the above.

The package’s reaction to each pattern depends on which handle_missing strategy is in effect. The next three sections walk through each strategy and the patterns it handles.

Strategy 1: "deleted" (listwise deletion on Y only)

When handle_missing = "deleted":

  • Rows where the response YY is missing are dropped before model fitting.
  • If any auxiliary is missing, the function refuses with an informative error – listwise deletion of XX would discard information that more principled strategies can recover.
library(hbsaems)
data("data_fhnorm")

# Pattern 1: missing Y only, X complete
data_miss_y      <- data_fhnorm
data_miss_y$y[c(3, 14, 27)] <- NA

fit_deleted <- hbm(
  formula        = brms::bf(y ~ x1 + x2 + x3),
  re             = ~ (1 | regency),
  data           = data_miss_y,
  handle_missing = "deleted",
  chains = 4, iter = 4000, warmup = 2000, cores = 4, seed = 1
)
handle_missing = 'deleted': 3 row(s) with missing response variable removed
from model fitting.

If you pass data with missing XX to "deleted", the call is rejected:

data_miss_x <- data_fhnorm
data_miss_x$x1[6:8] <- NA

hbm(brms::bf(y ~ x1 + x2 + x3),
    data           = data_miss_x,
    re             = ~ (1 | regency),
    handle_missing = "deleted")
Error: Option `handle_missing = 'deleted'` requires all auxiliary (predictor)
variables to be complete. Missing values were detected in: x1. Consider
`handle_missing = 'multiple'` to impute missing predictors.

The dropped rows (under the legitimate Y-only case) are retained in fit_deleted$data so that sae_predict() can still produce predictions for those areas using their auxiliary values.

Strategy 2: "multiple" (mice on X only)

When handle_missing = "multiple", mice imputes the missing auxiliary variables mm times, fits the brms model on each imputed dataset, and pools the posterior draws via Rubin’s rules (via brms::brm_multiple).

The response is never imputed by mice. How the package handles each pattern:

Pattern 2a: Missing X only – normal mice workflow

data_miss_x         <- data_fhnorm
data_miss_x$x1[6:8] <- NA

fit_mi <- hbm(
  formula        = brms::bf(y ~ x1 + x2 + x3),
  re             = ~ (1 | regency),
  data           = data_miss_x,
  handle_missing = "multiple",
  m              = 5,                       # 5 imputations
  chains = 4, iter = 4000, warmup = 2000, cores = 4, seed = 1
)
Missing predictor variable(s): x1. Applying multiple imputation (mice)
with m = 5 imputations.

Compiling the C++ model
Fitting imputed model 1
Start sampling
...
Fitting imputed model 5
Start sampling
...

(Posterior draws pooled via Rubin's rules.)

The pooled posterior is returned in the usual hbmfit interface. Diagnostics are per-imputation; the global Rhat / ESS view is not informative for pooled chains – see ?brm_multiple for details.

Pattern 2b: Missing X and Y (or only Y) – response not imputed

If YY also has missing values, the package issues a warning explaining that mice will not impute YY and the rows with missing YY are excluded from fitting but retained in the returned object:

data_miss_both <- data_fhnorm
data_miss_both$y[2:3]   <- NA           # missing Y
data_miss_both$x1[6:8]  <- NA           # missing X

fit_mi_both <- hbm(
  formula        = brms::bf(y ~ x1 + x2 + x3),
  re             = ~ (1 | regency),
  data           = data_miss_both,
  handle_missing = "multiple"
)
Warning: Missing values detected in response variable(s): y.
Multiple imputation via `mice` applies ONLY to auxiliary predictor
variables (X). Response variable(s) are NEVER imputed inside a
Bayesian model -- missing Y rows will be excluded from model fitting
but are retained in the returned object for prediction via hbsae().
If you wish to model missingness in Y jointly with the parameters,
consider `handle_missing = 'model'` (continuous outcomes only).

Missing predictor variable(s): x1. Applying multiple imputation (mice)
with m = 5 imputations.

Pattern 2c: Missing Y only – auto-conversion

If handle_missing = "multiple" is given but only Y is missing (XX is complete), there is nothing for mice to impute. The package then auto-converts to a more appropriate strategy:

  • Continuous family (e.g. gaussian, lognormal): auto-converts to handle_missing = "model" and adds | mi() to the formula LHS so that missing YY values are jointly sampled with the parameters (the correct Bayesian approach).
  • Discrete family (e.g. binomial, poisson): mi() is not supported – falls back to handle_missing = "deleted", dropping the rows with missing YY.
data_miss_y_only <- data_fhnorm
data_miss_y_only$y[2:3] <- NA           # only Y missing

# Gaussian family -> auto-converts to "model" with | mi() on LHS
fit_auto1 <- hbm(
  formula        = brms::bf(y ~ x1 + x2 + x3),
  re             = ~ (1 | regency),
  data           = data_miss_y_only,
  handle_missing = "multiple"
)
handle_missing = 'multiple' was specified but only the response variable
(Y) is missing and X is complete. mice imputes predictor variables only and
cannot impute Y. Automatically converting to handle_missing = 'model':
'| mi()' will be added to the formula LHS so that brms jointly estimates
missing Y with the model parameters (the correct Bayesian approach).

Controlling mice

Pass options through mice_args:

fit_mi2 <- hbm(
  formula        = brms::bf(y ~ x1 + x2 + x3),
  re             = ~ (1 | regency),
  data           = data_miss_x,
  handle_missing = "multiple",
  m              = 10,
  mice_args      = list(method = "pmm", maxit = 20),
  chains = 4, iter = 4000, warmup = 2000, cores = 4, seed = 1
)

Strategy 3: "model" (joint Bayesian imputation)

When handle_missing = "model", missing values in both response and auxiliary variables are sampled jointly with the model parameters via brms::mi(). The user must specify imputation sub-models explicitly in the formula:

data_miss_both <- data_fhnorm
data_miss_both$y[2:3]   <- NA
data_miss_both$x1[6:8]  <- NA

fit_model <- hbm(
  formula        = brms::bf(y  | mi() ~ mi(x1) + x2 + x3) +
                   brms::bf(x1 | mi() ~ x2 + x3),
  re             = ~ (1 | regency),
  data           = data_miss_both,
  handle_missing = "model",
  chains = 4, iter = 4000, warmup = 2000, cores = 4, seed = 1
)
Compiling Stan program...
Start sampling...
  • y | mi() ~ ... tells brms to treat missing YY values as parameters to be sampled.
  • mi(x1) on the RHS tells brms to use the imputed x1x_1 (also sampled) as a predictor.
  • The second bf(x1 | mi() ~ x2 + x3) is the imputation sub-model for x1x_1.

The result is a multivariate brms model in which all missing values are sampled alongside the regression coefficients; posterior uncertainty in the imputations is automatically propagated.

Restriction: handle_missing = "model" only works for continuous distributions (the supports_mi flag in the family registry). For discrete families (binomial, Poisson, …), brms::mi() is not supported – use "multiple" for missing XX and accept listwise deletion for missing YY.

Auto-selection (when handle_missing = NULL)

If you omit handle_missing, the package looks at whether any missingness exists and selects based on the family’s supports_mi flag:

# Build a data frame with some missing values to demonstrate
data("data_lnln")
data_with_some_na <- data_lnln
data_with_some_na$x1[c(3, 14, 27)] <- NA    # missing covariate

fit_auto <- hbm_lnln(
  response  = "y_obs",
  auxiliary = c("x1", "x2", "x3"),
  area_var  = "district",
  data      = data_with_some_na,
  # handle_missing not given
  chains = 4, iter = 4000, warmup = 2000, cores = 4, seed = 1
)
Missing values detected. Auto-selecting handle_missing = 'multiple'
(family 'lognormal', supports_mi = FALSE).

The actual decision rule used in the code is:

Any NA detected in response or auxiliary?
  └── family.supports_mi == TRUE?   ->  "model"   (joint mi(); user must
  │                                                supply mi() formula
  │                                                terms when missing X)
  └── otherwise                      ->  "multiple" (mice on X only;
                                                     missing Y handled
                                                     per Strategy 2 logic)

The auto-selection is intentionally simple and does not branch on “Y-only” vs “X-only” patterns: those subtleties are handled inside the chosen strategy (see Pattern 2c above for the case where "multiple" detects Y-only missingness and auto-converts to "model" for continuous families or to "deleted" for discrete families).

You can always override by passing an explicit handle_missing = "...".

Comparing strategies (when fitting all three is feasible)

loo_compare <- loo::loo_compare(
  loo::loo(fit_deleted$brmsfit),
  loo::loo(fit_mi$brmsfit),
  loo::loo(fit_model$brmsfit)
)
loo_compare
                  elpd_diff se_diff
fit_model         0.0       0.0
fit_mi           -2.1       2.4
fit_deleted     -10.7       3.8

Note that fit_deleted and fit_mi here refer to fits on the same pattern (missing Y only, missing X only respectively), so the comparison is over comparable likelihood scales.

Practical guidance

  • Small fraction missing (< 5%) and likely MCAR, missing in Y only"deleted" is fine.
  • Missing in X (with or without missing Y):
    • Continuous family – prefer "model" (joint Bayesian imputation, most principled).
    • Discrete family – use "multiple" (mice on X; rows with missing Y are dropped from fitting but retained in the result for prediction).
  • Missing in response only, you have administrative auxiliaries for those areas – prefer "deleted" plus sae_predict() (this is more transparent than imputing the response).
  • Reproducibility – set seed for both the wrapper call and (if using mice) within mice_args to make imputations reproducible.

References

  • Rubin, D. B. (1987). Multiple Imputation for Nonresponse in Surveys. Wiley.
  • van Buuren, S. (2018). Flexible Imputation of Missing Data. CRC Press.
  • Bürkner, P.-C. (2017). brms: An R Package for Bayesian Multilevel Models Using Stan. Journal of Statistical Software, 80(1), 1-28. Section on mi().
  • Gelman, A., & Hill, J. (2006). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press, Chapter 25.