Handling Missing Data in hbsaems
Source:vignettes/articles/hbsaems-handle-missing.Rmd
hbsaems-handle-missing.RmdNote on the printed output blocks. Stan /
brmsmodel fits take several minutes to run and are therefore not executed at vignette build time. The numerical outputs shown after eacheval = FALSEchunk in this vignette are realistic illustrations – representative of what you would see on a real run, but not produced by the chunks above. To reproduce the exact numbers, copy the code into an interactive R session.
The Bayesian principle: response variables are never imputed
A core design rule in hbsaems is that multiple
imputation only ever imputes auxiliary (predictor) variables, never the
response. This is not a limitation – it is the statistically
correct Bayesian treatment:
- For a Bayesian hierarchical model with response , the posterior for the parameters marginalises the missing outcomes :
- Plugging in a mice-imputed value for would replace this integral with a single point substitute, deflate the posterior uncertainty, and bias estimates if the imputation model is misspecified.
- The correct way to “fill in” missing
values is joint Bayesian imputation via
brms::mi(), which samples alongside . This is thehandle_missing = "model"strategy described below.
The three strategies thus have different scopes:
| Strategy | What it imputes | Required completeness |
|---|---|---|
"deleted" |
Nothing – drops rows with missing | Auxiliary must be complete |
"multiple" |
Only auxiliary
via mice
|
Any missingness handled separately (see below) |
"model" |
**
and
jointly via brms::mi()
|
Continuous outcomes only |
When handle_missing = NULL (default), the package
auto-selects a strategy based on the family’s
supports_mi flag.
Missing-data patterns in SAE
Three patterns are common:
- Missing response, complete covariates – some areas were not sampled at all, but you have administrative auxiliaries for them.
- Complete response, missing covariates – the survey reached every area, but a census variable is missing for some.
- Missing in both – a mix of the above.
The package’s reaction to each pattern depends on which
handle_missing strategy is in effect. The next three
sections walk through each strategy and the patterns it handles.
Strategy 1: "deleted" (listwise deletion on Y
only)
When handle_missing = "deleted":
- Rows where the response is missing are dropped before model fitting.
- If any auxiliary is missing, the function refuses with an informative error – listwise deletion of would discard information that more principled strategies can recover.
library(hbsaems)
data("data_fhnorm")
# Pattern 1: missing Y only, X complete
data_miss_y <- data_fhnorm
data_miss_y$y[c(3, 14, 27)] <- NA
fit_deleted <- hbm(
formula = brms::bf(y ~ x1 + x2 + x3),
re = ~ (1 | regency),
data = data_miss_y,
handle_missing = "deleted",
chains = 4, iter = 4000, warmup = 2000, cores = 4, seed = 1
)handle_missing = 'deleted': 3 row(s) with missing response variable removed
from model fitting.
If you pass data with missing
to "deleted", the call is rejected:
data_miss_x <- data_fhnorm
data_miss_x$x1[6:8] <- NA
hbm(brms::bf(y ~ x1 + x2 + x3),
data = data_miss_x,
re = ~ (1 | regency),
handle_missing = "deleted")Error: Option `handle_missing = 'deleted'` requires all auxiliary (predictor)
variables to be complete. Missing values were detected in: x1. Consider
`handle_missing = 'multiple'` to impute missing predictors.
The dropped rows (under the legitimate Y-only case) are retained in
fit_deleted$data so that sae_predict() can
still produce predictions for those areas using their auxiliary
values.
Strategy 2: "multiple" (mice on X only)
When handle_missing = "multiple", mice
imputes the missing auxiliary variables
times, fits the brms model on each imputed dataset, and pools the
posterior draws via Rubin’s rules (via
brms::brm_multiple).
The response is never imputed by mice. How the package handles each pattern:
Pattern 2a: Missing X only – normal mice workflow
data_miss_x <- data_fhnorm
data_miss_x$x1[6:8] <- NA
fit_mi <- hbm(
formula = brms::bf(y ~ x1 + x2 + x3),
re = ~ (1 | regency),
data = data_miss_x,
handle_missing = "multiple",
m = 5, # 5 imputations
chains = 4, iter = 4000, warmup = 2000, cores = 4, seed = 1
)Missing predictor variable(s): x1. Applying multiple imputation (mice)
with m = 5 imputations.
Compiling the C++ model
Fitting imputed model 1
Start sampling
...
Fitting imputed model 5
Start sampling
...
(Posterior draws pooled via Rubin's rules.)
The pooled posterior is returned in the usual hbmfit
interface. Diagnostics are per-imputation; the global Rhat /
ESS view is not informative for pooled chains – see
?brm_multiple for details.
Pattern 2b: Missing X and Y (or only Y) – response not imputed
If also has missing values, the package issues a warning explaining that mice will not impute and the rows with missing are excluded from fitting but retained in the returned object:
data_miss_both <- data_fhnorm
data_miss_both$y[2:3] <- NA # missing Y
data_miss_both$x1[6:8] <- NA # missing X
fit_mi_both <- hbm(
formula = brms::bf(y ~ x1 + x2 + x3),
re = ~ (1 | regency),
data = data_miss_both,
handle_missing = "multiple"
)Warning: Missing values detected in response variable(s): y.
Multiple imputation via `mice` applies ONLY to auxiliary predictor
variables (X). Response variable(s) are NEVER imputed inside a
Bayesian model -- missing Y rows will be excluded from model fitting
but are retained in the returned object for prediction via hbsae().
If you wish to model missingness in Y jointly with the parameters,
consider `handle_missing = 'model'` (continuous outcomes only).
Missing predictor variable(s): x1. Applying multiple imputation (mice)
with m = 5 imputations.
Pattern 2c: Missing Y only – auto-conversion
If handle_missing = "multiple" is given but only
Y is missing
(
is complete), there is nothing for mice to impute. The package then
auto-converts to a more appropriate strategy:
-
Continuous family (e.g.
gaussian,lognormal): auto-converts tohandle_missing = "model"and adds| mi()to the formula LHS so that missing values are jointly sampled with the parameters (the correct Bayesian approach). -
Discrete family (e.g.
binomial,poisson):mi()is not supported – falls back tohandle_missing = "deleted", dropping the rows with missing .
data_miss_y_only <- data_fhnorm
data_miss_y_only$y[2:3] <- NA # only Y missing
# Gaussian family -> auto-converts to "model" with | mi() on LHS
fit_auto1 <- hbm(
formula = brms::bf(y ~ x1 + x2 + x3),
re = ~ (1 | regency),
data = data_miss_y_only,
handle_missing = "multiple"
)handle_missing = 'multiple' was specified but only the response variable
(Y) is missing and X is complete. mice imputes predictor variables only and
cannot impute Y. Automatically converting to handle_missing = 'model':
'| mi()' will be added to the formula LHS so that brms jointly estimates
missing Y with the model parameters (the correct Bayesian approach).
Strategy 3: "model" (joint Bayesian imputation)
When handle_missing = "model", missing values in both
response and auxiliary variables are sampled jointly with the model
parameters via brms::mi(). The user must specify imputation
sub-models explicitly in the formula:
data_miss_both <- data_fhnorm
data_miss_both$y[2:3] <- NA
data_miss_both$x1[6:8] <- NA
fit_model <- hbm(
formula = brms::bf(y | mi() ~ mi(x1) + x2 + x3) +
brms::bf(x1 | mi() ~ x2 + x3),
re = ~ (1 | regency),
data = data_miss_both,
handle_missing = "model",
chains = 4, iter = 4000, warmup = 2000, cores = 4, seed = 1
)Compiling Stan program...
Start sampling...
-
y | mi() ~ ...tells brms to treat missing values as parameters to be sampled. -
mi(x1)on the RHS tells brms to use the imputed (also sampled) as a predictor. - The second
bf(x1 | mi() ~ x2 + x3)is the imputation sub-model for .
The result is a multivariate brms model in which all missing values are sampled alongside the regression coefficients; posterior uncertainty in the imputations is automatically propagated.
Restriction: handle_missing = "model"
only works for continuous distributions (the
supports_mi flag in the family registry). For discrete
families (binomial, Poisson, …), brms::mi() is not
supported – use "multiple" for missing
and accept listwise deletion for missing
.
Auto-selection (when handle_missing = NULL)
If you omit handle_missing, the package looks at whether
any missingness exists and selects based on the
family’s supports_mi flag:
# Build a data frame with some missing values to demonstrate
data("data_lnln")
data_with_some_na <- data_lnln
data_with_some_na$x1[c(3, 14, 27)] <- NA # missing covariate
fit_auto <- hbm_lnln(
response = "y_obs",
auxiliary = c("x1", "x2", "x3"),
area_var = "district",
data = data_with_some_na,
# handle_missing not given
chains = 4, iter = 4000, warmup = 2000, cores = 4, seed = 1
)Missing values detected. Auto-selecting handle_missing = 'multiple'
(family 'lognormal', supports_mi = FALSE).
The actual decision rule used in the code is:
Any NA detected in response or auxiliary?
└── family.supports_mi == TRUE? -> "model" (joint mi(); user must
│ supply mi() formula
│ terms when missing X)
└── otherwise -> "multiple" (mice on X only;
missing Y handled
per Strategy 2 logic)
The auto-selection is intentionally simple and does
not branch on “Y-only” vs “X-only” patterns: those
subtleties are handled inside the chosen strategy (see Pattern 2c above
for the case where "multiple" detects Y-only missingness
and auto-converts to "model" for continuous families or to
"deleted" for discrete families).
You can always override by passing an explicit
handle_missing = "...".
Comparing strategies (when fitting all three is feasible)
loo_compare <- loo::loo_compare(
loo::loo(fit_deleted$brmsfit),
loo::loo(fit_mi$brmsfit),
loo::loo(fit_model$brmsfit)
)
loo_compare elpd_diff se_diff
fit_model 0.0 0.0
fit_mi -2.1 2.4
fit_deleted -10.7 3.8
Note that fit_deleted and fit_mi here refer
to fits on the same pattern (missing Y only, missing X only
respectively), so the comparison is over comparable likelihood
scales.
Practical guidance
-
Small fraction missing (< 5%) and likely MCAR, missing in
Y only –
"deleted"is fine. -
Missing in X (with or without missing Y):
- Continuous family – prefer
"model"(joint Bayesian imputation, most principled). - Discrete family – use
"multiple"(mice on X; rows with missing Y are dropped from fitting but retained in the result for prediction).
- Continuous family – prefer
-
Missing in response only, you have administrative
auxiliaries for those areas – prefer
"deleted"plussae_predict()(this is more transparent than imputing the response). -
Reproducibility – set
seedfor both the wrapper call and (if using mice) withinmice_argsto make imputations reproducible.
References
- Rubin, D. B. (1987). Multiple Imputation for Nonresponse in Surveys. Wiley.
- van Buuren, S. (2018). Flexible Imputation of Missing Data. CRC Press.
- Bürkner, P.-C. (2017). brms: An R Package for Bayesian Multilevel
Models Using Stan. Journal of Statistical Software, 80(1),
1-28. Section on
mi(). - Gelman, A., & Hill, J. (2006). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press, Chapter 25.