Handling Missing Data in hbsaems

Note on the printed output blocks. Stan / brms model fits take several minutes to run and are therefore not executed at vignette build time. The numerical outputs shown after each eval = FALSE chunk in this vignette are realistic illustrations – representative of what you would see on a real run, but not produced by the chunks above. To reproduce the exact numbers, copy the code into an interactive R session.

The Bayesian principle: response variables are never imputed

A core design rule in hbsaems is that multiple imputation only ever imputes auxiliary (predictor) variables, never the response. This is not a limitation – it is the statistically correct Bayesian treatment:

For a Bayesian hierarchical model with response $Y$ , the posterior for the parameters $\theta$ marginalises the missing outcomes $Y_{\text{mis}}$ : $p(\theta \mid Y_{\text{obs}}, X) = \int p(\theta \mid Y_{\text{obs}}, Y_{\text{mis}}, X) \, p(Y_{\text{mis}} \mid Y_{\text{obs}}, X, \theta) \, \mathrm{d}Y_{\text{mis}}.$
Plugging in a mice-imputed value for $Y$ would replace this integral with a single point substitute, deflate the posterior uncertainty, and bias estimates if the imputation model is misspecified.
The correct way to “fill in” missing $Y$ values is joint Bayesian imputation via brms::mi(), which samples $Y_{\text{mis}}$ alongside $\theta$ . This is the handle_missing = "model" strategy described below.

The three strategies thus have different scopes:

Strategy	What it imputes	Required completeness
`"deleted"`	Nothing – drops rows with missing $Y$	Auxiliary $X$ must be complete
`"multiple"`	Only auxiliary $X$ via `mice`	Any $Y$ missingness handled separately (see below)
`"model"`	** $X$ and $Y$ jointly via `brms::mi()`	Continuous outcomes only

When handle_missing = NULL (default), the package auto-selects a strategy based on the family’s supports_mi flag.

Missing-data patterns in SAE

Three patterns are common:

Missing response, complete covariates – some areas were not sampled at all, but you have administrative auxiliaries for them.
Complete response, missing covariates – the survey reached every area, but a census variable is missing for some.
Missing in both – a mix of the above.

The package’s reaction to each pattern depends on which handle_missing strategy is in effect. The next three sections walk through each strategy and the patterns it handles.

Strategy 1: `"deleted"` (listwise deletion on Y only)

When handle_missing = "deleted":

Rows where the response $Y$ is missing are dropped before model fitting.
If any auxiliary is missing, the function refuses with an informative error – listwise deletion of $X$ would discard information that more principled strategies can recover.

library(hbsaems)
data("data_fhnorm")

# Pattern 1: missing Y only, X complete
data_miss_y      <- data_fhnorm
data_miss_y$y[c(3, 14, 27)] <- NA

fit_deleted <- hbm(
  formula        = brms::bf(y ~ x1 + x2 + x3),
  re             = ~ (1 | regency),
  data           = data_miss_y,
  handle_missing = "deleted",
  chains = 4, iter = 4000, warmup = 2000, cores = 4, seed = 1
)

handle_missing = 'deleted': 3 row(s) with missing response variable removed
from model fitting.

If you pass data with missing $X$ to "deleted", the call is rejected:

data_miss_x <- data_fhnorm
data_miss_x$x1[6:8] <- NA

hbm(brms::bf(y ~ x1 + x2 + x3),
    data           = data_miss_x,
    re             = ~ (1 | regency),
    handle_missing = "deleted")

Error: Option `handle_missing = 'deleted'` requires all auxiliary (predictor)
variables to be complete. Missing values were detected in: x1. Consider
`handle_missing = 'multiple'` to impute missing predictors.

The dropped rows (under the legitimate Y-only case) are retained in fit_deleted$data so that sae_predict() can still produce predictions for those areas using their auxiliary values.

Strategy 2: `"multiple"` (mice on X only)

When handle_missing = "multiple", mice imputes the missing auxiliary variables $m$ times, fits the brms model on each imputed dataset, and pools the posterior draws via Rubin’s rules (via brms::brm_multiple).

The response is never imputed by mice. How the package handles each pattern:

Pattern 2a: Missing X only – normal mice workflow

data_miss_x         <- data_fhnorm
data_miss_x$x1[6:8] <- NA

fit_mi <- hbm(
  formula        = brms::bf(y ~ x1 + x2 + x3),
  re             = ~ (1 | regency),
  data           = data_miss_x,
  handle_missing = "multiple",
  m              = 5,                       # 5 imputations
  chains = 4, iter = 4000, warmup = 2000, cores = 4, seed = 1
)

Missing predictor variable(s): x1. Applying multiple imputation (mice)
with m = 5 imputations.

Compiling the C++ model
Fitting imputed model 1
Start sampling
...
Fitting imputed model 5
Start sampling
...

(Posterior draws pooled via Rubin's rules.)

The pooled posterior is returned in the usual hbmfit interface. Diagnostics are per-imputation; the global Rhat / ESS view is not informative for pooled chains – see ?brm_multiple for details.

Pattern 2b: Missing X and Y (or only Y) – response not imputed

If $Y$ also has missing values, the package issues a warning explaining that mice will not impute $Y$ and the rows with missing $Y$ are excluded from fitting but retained in the returned object:

data_miss_both <- data_fhnorm
data_miss_both$y[2:3]   <- NA           # missing Y
data_miss_both$x1[6:8]  <- NA           # missing X

fit_mi_both <- hbm(
  formula        = brms::bf(y ~ x1 + x2 + x3),
  re             = ~ (1 | regency),
  data           = data_miss_both,
  handle_missing = "multiple"
)

Warning: Missing values detected in response variable(s): y.
Multiple imputation via `mice` applies ONLY to auxiliary predictor
variables (X). Response variable(s) are NEVER imputed inside a
Bayesian model -- missing Y rows will be excluded from model fitting
but are retained in the returned object for prediction via hbsae().
If you wish to model missingness in Y jointly with the parameters,
consider `handle_missing = 'model'` (continuous outcomes only).

Missing predictor variable(s): x1. Applying multiple imputation (mice)
with m = 5 imputations.

Pattern 2c: Missing Y only – auto-conversion

If handle_missing = "multiple" is given but only Y is missing ( $X$ is complete), there is nothing for mice to impute. The package then auto-converts to a more appropriate strategy:

Continuous family (e.g. gaussian, lognormal): auto-converts to handle_missing = "model" and adds | mi() to the formula LHS so that missing $Y$ values are jointly sampled with the parameters (the correct Bayesian approach).
Discrete family (e.g. binomial, poisson): mi() is not supported – falls back to handle_missing = "deleted", dropping the rows with missing $Y$ .

data_miss_y_only <- data_fhnorm
data_miss_y_only$y[2:3] <- NA           # only Y missing

# Gaussian family -> auto-converts to "model" with | mi() on LHS
fit_auto1 <- hbm(
  formula        = brms::bf(y ~ x1 + x2 + x3),
  re             = ~ (1 | regency),
  data           = data_miss_y_only,
  handle_missing = "multiple"
)

handle_missing = 'multiple' was specified but only the response variable
(Y) is missing and X is complete. mice imputes predictor variables only and
cannot impute Y. Automatically converting to handle_missing = 'model':
'| mi()' will be added to the formula LHS so that brms jointly estimates
missing Y with the model parameters (the correct Bayesian approach).

Controlling mice

Pass options through mice_args:

fit_mi2 <- hbm(
  formula        = brms::bf(y ~ x1 + x2 + x3),
  re             = ~ (1 | regency),
  data           = data_miss_x,
  handle_missing = "multiple",
  m              = 10,
  mice_args      = list(method = "pmm", maxit = 20),
  chains = 4, iter = 4000, warmup = 2000, cores = 4, seed = 1
)

Strategy 3: `"model"` (joint Bayesian imputation)

When handle_missing = "model", missing values in both response and auxiliary variables are sampled jointly with the model parameters via brms::mi(). The user must specify imputation sub-models explicitly in the formula:

data_miss_both <- data_fhnorm
data_miss_both$y[2:3]   <- NA
data_miss_both$x1[6:8]  <- NA

fit_model <- hbm(
  formula        = brms::bf(y  | mi() ~ mi(x1) + x2 + x3) +
                   brms::bf(x1 | mi() ~ x2 + x3),
  re             = ~ (1 | regency),
  data           = data_miss_both,
  handle_missing = "model",
  chains = 4, iter = 4000, warmup = 2000, cores = 4, seed = 1
)

Compiling Stan program...
Start sampling...

y | mi() ~ ... tells brms to treat missing $Y$ values as parameters to be sampled.
mi(x1) on the RHS tells brms to use the imputed $x_1$ (also sampled) as a predictor.
The second bf(x1 | mi() ~ x2 + x3) is the imputation sub-model for $x_1$ .

The result is a multivariate brms model in which all missing values are sampled alongside the regression coefficients; posterior uncertainty in the imputations is automatically propagated.

Restriction: handle_missing = "model" only works for continuous distributions (the supports_mi flag in the family registry). For discrete families (binomial, Poisson, …), brms::mi() is not supported – use "multiple" for missing $X$ and accept listwise deletion for missing $Y$ .

Auto-selection (when `handle_missing = NULL`)

If you omit handle_missing, the package looks at whether any missingness exists and selects based on the family’s supports_mi flag:

# Build a data frame with some missing values to demonstrate
data("data_lnln")
data_with_some_na <- data_lnln
data_with_some_na$x1[c(3, 14, 27)] <- NA    # missing covariate

fit_auto <- hbm_lnln(
  response  = "y_obs",
  auxiliary = c("x1", "x2", "x3"),
  area_var  = "district",
  data      = data_with_some_na,
  # handle_missing not given
  chains = 4, iter = 4000, warmup = 2000, cores = 4, seed = 1
)

Missing values detected. Auto-selecting handle_missing = 'multiple'
(family 'lognormal', supports_mi = FALSE).

The actual decision rule used in the code is:

Any NA detected in response or auxiliary?
  └── family.supports_mi == TRUE?   ->  "model"   (joint mi(); user must
  │                                                supply mi() formula
  │                                                terms when missing X)
  └── otherwise                      ->  "multiple" (mice on X only;
                                                     missing Y handled
                                                     per Strategy 2 logic)

The auto-selection is intentionally simple and does not branch on “Y-only” vs “X-only” patterns: those subtleties are handled inside the chosen strategy (see Pattern 2c above for the case where "multiple" detects Y-only missingness and auto-converts to "model" for continuous families or to "deleted" for discrete families).

You can always override by passing an explicit handle_missing = "...".

Comparing strategies (when fitting all three is feasible)

loo_compare <- loo::loo_compare(
  loo::loo(fit_deleted$brmsfit),
  loo::loo(fit_mi$brmsfit),
  loo::loo(fit_model$brmsfit)
)
loo_compare

                  elpd_diff se_diff
fit_model         0.0       0.0
fit_mi           -2.1       2.4
fit_deleted     -10.7       3.8

Note that fit_deleted and fit_mi here refer to fits on the same pattern (missing Y only, missing X only respectively), so the comparison is over comparable likelihood scales.

Practical guidance

Small fraction missing (< 5%) and likely MCAR, missing in Y only – "deleted" is fine.
Missing in X (with or without missing Y):
- Continuous family – prefer "model" (joint Bayesian imputation, most principled).
- Discrete family – use "multiple" (mice on X; rows with missing Y are dropped from fitting but retained in the result for prediction).
Missing in response only, you have administrative auxiliaries for those areas – prefer "deleted" plus sae_predict() (this is more transparent than imputing the response).
Reproducibility – set seed for both the wrapper call and (if using mice) within mice_args to make imputations reproducible.

References

Rubin, D. B. (1987). Multiple Imputation for Nonresponse in Surveys. Wiley.
van Buuren, S. (2018). Flexible Imputation of Missing Data. CRC Press.
Bürkner, P.-C. (2017). brms: An R Package for Bayesian Multilevel Models Using Stan. Journal of Statistical Software, 80(1), 1-28. Section on mi().
Gelman, A., & Hill, J. (2006). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press, Chapter 25.