Advanced Features: Spatial, Benchmarking, Custom Families, Smooth Terms
Source:vignettes/articles/advanced-features.Rmd
advanced-features.RmdThis vignette demonstrates the advanced features available in
hbsaems v1.0.0, complementing the
complete-workflow vignette which focuses on the core
fitting pipeline.
Topics covered:
-
Pre-fit data quality checking
(
check_data) - Spatial weight matrices – building from shapefiles + theory checks
- Multi-domain benchmarking – district to regency with raking
- Fully Bayesian benchmarking – correct uncertainty after adjustment
- Smooth terms – splines and Gaussian processes
- Shrinkage priors – horseshoe and R2D2 (with cascade)
- Custom distributions – the model registry
-
Model averaging with LOO weights
(
model_average) -
Prior sensitivity analysis
(
prior_sensitivity) -
Hierarchical area random effects
(
area_varas a vector)
1. Pre-fit data quality checking
Before fitting any model, run check_data() to verify the
variables exist, identify missing patterns, and get a recommendation for
the handle_missing strategy.
data("data_fhnorm")
chk <- check_data(
data = data_fhnorm,
response = "y",
auxiliary = c("x1", "x2", "x3"),
spatial_var = "province"
)
print(chk)
#> HBSAE Data Check [hbsaems_data_check]
#> ---------------------------------------
#> Observations : 50
#> Missing pattern : none (data complete)
#> Issues : none
#>
#> - No missing values detected. handle_missing can stay NULL.check_data() returns an object of class
c("hbsaems_data_check", "hbsaems_check"), so you can test
it with the generic predicate:
is.hbsaems_check(chk) # TRUEWhen the response has missing values
Missing values in Y often correspond to non-sample
areas – domains for which you want a prediction but have no
direct estimate. check_data() highlights this
explicitly:
d2 <- data_fhnorm
d2$y[1:5] <- NA
chk2 <- check_data(d2, response = "y", auxiliary = c("x1", "x2", "x3"))
chk2$non_sample_warningIf those rows really are non-sample areas, keep
them: fit the model on the complete-Y subset and use
sae_predict(..., newdata = na_rows).
2. Spatial weight matrices
Building W from a shapefile
build_spatial_weight() constructs a CAR-ready binary
matrix or a SAR-ready row-standardised matrix from a polygon
shapefile:
# Queen contiguity + binary -> for CAR (Besag 1974)
M_car <- build_spatial_weight(
shp = "path/to/areas.shp",
for_model = "car" # implies type = "queen", style = "B"
)
# K-nearest + row-standardised -> for SAR (Anselin 1988)
M_sar <- build_spatial_weight(
shp = "path/to/areas.shp",
for_model = "sar", # implies type = "knn", style = "W"
k = 4
)Distance-band and rook contiguity are also supported – see
?build_spatial_weight.
Theoretical compatibility check
check_spatial_weight() runs five checks against the
requirements of the chosen model class:
check_spatial_weight(M_car, spatial_model = "car")
#> Spatial Weight Matrix Diagnostic
#> ---------------------------------
#> Square : TRUE
#> Zero diagonal : TRUE
#> Symmetric : TRUE
#> Detected style : B
#> Isolated areas : 0
#> Components : 1
#> Matrix is theoretically compatible.This catches the common mistakes: non-symmetric matrices fed into
CAR, isolated areas under ICAR, etc. It also runs automatically inside
hbm() whenever spatial_model is set.
3. Multi-domain benchmarking: district to regency
A common SAE scenario: you have estimates per district (kecamatan in
Indonesian) and need them to roll up to a published regency-level
(kabupaten in Indonesian) total – for example, from BPS, the Indonesian
central statistics office. sae_benchmark() with
method = "raking" handles this:
# Suppose we have SAE estimates per district, with a regency column
# tagging which regency each district belongs to.
estimates <- sae_predict(fit_hbm)
# Official regency-level totals (e.g.\ from BPS publications)
T_regency <- c(Bogor = 110, Sukabumi = 145, Cianjur = 145)
bm <- sae_benchmark(
predictions = estimates,
target = T_regency, # one target per regency
weights = my_data$population, # population per district
groups = my_data$regency, # mapping district -> regency
method = "raking"
)
bm$benchmark_info$converged # raking with one group var: converges in 1 sweepThe adjustment factor is constant within each regency but different across regencies – exactly what multi-domain benchmarking requires.
4. Fully Bayesian benchmarking (v1.0.0)
The default sae_benchmark() mode applies the adjustment
to the posterior mean only – the SD column is
approximated as unchanged. This understates uncertainty for
multiplicative ratios and over- or understates it for additive
shifts.
The fully Bayesian mode applies the benchmark to every posterior draw and recomputes SD, RSE, and quantiles correctly:
# Pass posterior = TRUE if predictions object carries draws
bm_bayes <- sae_benchmark(
predictions = estimates,
target = 1000,
method = "ratio",
posterior = TRUE, # NEW in v1.0.0
probs = c(0.025, 0.5, 0.975)
)
bm_bayes$result_table
#> Prediction SD RSE_percent Q025 Q50 Q975
#> 1 ... ... ... ... ... ...
# Or supply your own draws matrix (D x n)
draws <- posterior_predict(fit$model)
bm_bayes2 <- sae_benchmark(estimates, target = 1000,
method = "ratio",
posterior = draws)Why this matters
For multiplicative ratio benchmarking, the adjustment factor itself is a random variable computed from the same draws: Since and are negatively correlated (a draw with larger area- value implies a smaller ), the post-benchmark variance is strictly smaller than the naive approximation. Fully Bayesian benchmarking captures this exactly.
5. Smooth terms: splines and Gaussian processes
hbsaems exposes the full brms canonical
API for smooth terms. Replace any linear predictor with a
smooth function:
5.1 Penalised regression splines (mgcv backend)
# Default: thin-plate regression spline, basis dimension chosen by mgcv
fit_spline <- hbm(
formula = brms::bf(y ~ x1 + x2 + x3),
data = data_fhnorm,
re = ~ (1 | regency), # area-level random effect (Fay-Herriot)
sampling_variance = "D", # known sampling variance (Fay-Herriot)
nonlinear = c("x2", "x3"),
nonlinear_type = "spline",
spline_k = 7L, # basis dim (-1 = auto)
spline_bs = "tp", # "tp" (default), "cr", "cs", "ps"
control = list(adapt_delta = 0.99),
chains = 2, iter = 2000, refresh = 0
)Choosing the basis (spline_bs):
| Basis | When to use |
|---|---|
"tp" |
Thin-plate (default). Good general choice; rotationally invariant. |
"cr" |
Cubic regression spline. More numerically stable when auxiliary variables are highly correlated. |
"cs" |
Cubic with shrinkage. Allows variable selection — smooths on irrelevant covariates shrink to zero. |
"ps" |
P-splines. Strong choice for monotonic relationships. |
5.2 Gaussian processes
For SAE applications with more than ~100 areas, always use the Hilbert-space approximate GP (HSGP) of Riutort-Mayol et al. (2023):
# RECOMMENDED: HSGP with Matérn 5/2 covariance
fit_gp <- hbm(
formula = brms::bf(y ~ x1 + x2),
data = data_fhnorm,
re = ~ (1 | regency),
sampling_variance = "D", # known sampling variance (Fay-Herriot)
nonlinear = "x1",
nonlinear_type = "gp",
gp_k = 20L, # basis dim — REQUIRED for n > 100
gp_cov = "matern25", # more stable than "exp_quad"
gp_c = 1.25, # boundary scale (brms default)
control = list(adapt_delta = 0.99),
chains = 2, iter = 2000, refresh = 0
)GP argument reference:
| Argument | Meaning | Recommendation |
|---|---|---|
gp_k |
Number of basis functions for HSGP | ceiling(min(n/5, 25)) |
gp_cov |
Covariance function |
"matern25" for SAE; "exp_quad"
default |
gp_c |
HSGP boundary-scale factor |
1.25 (brms default); increase if GP truncates at
edges |
Note: exact GP warning. When gp_k = NA
(the default), an exact GP is fitted — which scales
.
For
,
hbsaems emits an immediate warning recommending an HSGP
basis size. Always set gp_k in production SAE
work.
5.3 Same arguments in distribution-specific wrappers
All wrappers (hbm_lnln, hbm_binlogitnorm,
hbm_betalogitnorm) and the flexible factory
hbm_flex() accept the same smooth-term arguments. Or use
the hbm_nonlinear() config helper:
# Build the smooth-term spec once, reuse across models
nl <- hbm_nonlinear(c("x1"), type = "gp",
k = 20, gp_cov = "matern25")
fit <- hbm_flex(family_key = "lognormal",
response = "y_obs",
auxiliary = c("x1", "x2"),
area_var = "district",
data = data_lnln,
nl) # spliced in automatically5.4 Convergence advice for smooth-term SAE models
Hierarchical Bayesian SAE models with smooth terms are prone to a few
recurring issues. Run convergence_check() after every fit;
the recommendations below address the most common pathologies.
1. Divergent transitions (most common). Cause is usually the “funnel” between the area-level standard deviation and the area intercepts. Fix:
- Set
control = list(adapt_delta = 0.99, max_treedepth = 12). - Tighten the prior on the area SD:
set_prior("normal(0, 0.5)", class = "sd").
2. Low effective sample size (ESS < 1000):
- Increase
iterto 6000 withwarmup = 3000. - Centre and scale the auxiliary variables before fitting.
- Run
prior_check()(sample_prior = “only”) to rule out prior–data conflict.
3. Exact GP at
— always hits max_treedepth and/or diverges. Set
gp_k. See §5.2 above.
4. Spline collinearity. When auxiliary variables are correlated:
- Switch to
spline_bs = "cr"for better numerical conditioning. - Use
spline_bs = "cs"if you want shrinkage-driven selection on the smooth terms.
5. Spatial random effects. See
vignette("hbsaems-spatial") for the BYM2
recommendation.
6. Shrinkage priors
hbsaems exposes brms-canonical global-local shrinkage
priors. Both horseshoe (Piironen & Vehtari 2017)
and R2D2 (Zhang et al. 2022) are available, and both
can be combined with splines and GPs.
6.1 Horseshoe (Piironen & Vehtari 2017)
fit_hs <- hbm(
formula = brms::bf(y ~ x1 + x2 + x3 + x4 + x5),
data = my_data,
re = ~ (1 | regency),
prior_type = "horseshoe",
hs_df = 1, # local half-t df (1 = original HS)
hs_df_global = 1, # global half-t df
hs_df_slab = 4, # slab half-t df (regularised HS+)
hs_scale_slab = 2, # slab scale
hs_autoscale = TRUE, # scale with residual sigma (Gaussian only)
chains = 2, iter = 2000, refresh = 0
)6.3 Cascading shrinkage across smooth and GP terms
When a model combines parametric coefficients with non-parametric
smooth terms (s()) or Gaussian processes
(gp()), the principled approach is to shrink all
components jointly. hbsaems does this
automatically using the brms-canonical main = TRUE
pattern:
# Model with linear coefficients + spline + Gaussian process
fit_combined <- hbm(
formula = brms::bf(y ~ x1 + x2 + x3),
data = my_data,
re = ~ (1 | regency),
nonlinear = c("x2", "x3"),
nonlinear_type = "spline", # spline for x2 AND x3
prior_type = "horseshoe", # cascades automatically
chains = 2, iter = 2000, refresh = 0
)Internally hbsaems builds the following brms prior
structure:
horseshoe(df = 1, df_global = 1, df_slab = 4, scale_slab = 2, main = TRUE) # class "b"
horseshoe(df = 1, df_global = 1, df_slab = 4, scale_slab = 2) # class "sds"The main = TRUE flag on the b prior is what
tells brms to apply the horseshoe jointly to
coefficients and to the smooth/GP standard deviations. Without this
cascade, the spline wiggliness and GP variance would be regularised by
their own (uninformative) default priors, undermining the shrinkage from
the parametric part.
The same cascade works for R2D2 and any combination of
s() + gp() terms:
| Formula contains | Parameter classes regularised |
|---|---|
y ~ x1 + x2 |
b |
y ~ x1 + s(x2) |
b, sds
|
y ~ x1 + gp(x2, k=10) |
b, sdgp
|
y ~ s(x1) + gp(x2, k=10) |
b, sds, sdgp
|
6.4 When to disable autoscale
The hs_autoscale / r2d2_autoscale flags
control whether brms re-scales the prior using the residual standard
deviation
.
This is the right thing to do for continuous responses
(Gaussian, lognormal) but
is undefined for discrete families. For Beta, binomial, or Poisson
responses, set autoscale = FALSE:
fit_bin_hs <- hbm_binlogitnorm(
response = "y",
trials = "n",
auxiliary = c("x1", "x2", "x3", "x4", "x5"),
area_var = "district",
data = data_binlogitnorm,
prior_type = "horseshoe",
hs_autoscale = FALSE, # binomial: no residual SD to scale by
chains = 2, iter = 2000, refresh = 0
)Both priors work with any family in the registry (Gaussian, Beta, binomial, lognormal, loglogistic, custom…) and combine with splines / GPs without extra setup.
7. Custom distributions: the model registry
The model registry (R/models-registry.R) catalogues
every model spec hbsaems knows about. Each spec bundles a likelihood
family, a default link, optional auxiliary-parameter hyperpriors,
missing-data support flags, and response-domain checks. Registered
models are usable via hbm(), the convenience wrappers, and
the flexible factory hbm_flex().
Inspecting the registry
list_hbsae_models()
#> [1] "Beta" "beta-binomial" "binomial" "categorical" ...
get_hbsae_model("beta")
#> $family : "Beta"
#> $link : "logit"
#> $discrete : FALSE
#> $supports_mi : TRUE
#> $aux_param_hyperprior: NULL # phi uses brms default
# gamma(0.01, 0.01) since v1.0.0Adding a custom model (Tier-1: pure metadata)
For distributions that just need a name, link, and domain check:
register_hbsae_model(
key = "gamma_log",
family = "Gamma",
link = "log",
discrete = FALSE,
supports_mi = TRUE,
response_check = function(y) all(y > 0, na.rm = TRUE),
response_check_msg = "Gamma response must be strictly positive."
)
# Use immediately
fit_gamma <- hbm_flex(
family_key = "gamma_log",
response = "expenditure",
auxiliary = c("age", "income", "education"),
area_var = "kecamatan", # area-level random effect
data = my_household_data
)Adding a custom model (Tier-2: auxiliary parameter hyperprior)
For distributions with auxiliary parameters (e.g. for Beta, shape for Gamma):
register_hbsae_model(
key = "gamma_with_hyperprior",
family = "Gamma",
link = "log",
response_check = function(y) all(y > 0, na.rm = TRUE),
# Inject Stan code via stanvars + a prior on the shape parameter
aux_param_hyperprior = function(args, data) {
if (!isTRUE(args$add_shape_prior)) return(NULL)
list(
stanvars = brms::stanvar(
scode = "real<lower=0> shape_a;\n real<lower=0> shape_b;",
block = "parameters") +
brms::stanvar(
scode = "shape_a ~ gamma(1, 1);\n shape_b ~ gamma(1, 1);",
block = "model"),
prior = brms::set_prior("gamma(shape_a, shape_b)", class = "shape")
)
}
)
# Trigger the hyperprior via aux_args
fit <- hbm_flex(
family_key = "gamma_with_hyperprior",
response = "y",
auxiliary = c("x1", "x2"),
area_var = "area_id", # area-level random effect
data = my_data,
aux_args = list(add_shape_prior = TRUE)
)8. Bayesian model averaging with LOO weights
When several candidate models pass convergence and
posterior-predictive checks, model averaging lets you
hedge across the candidates rather than committing to a single best
model. model_average() supports three weighting
strategies:
method |
Source of weights | When to use |
|---|---|---|
"manual" |
User-supplied weights vector |
When external constraints (e.g. official survey methodology) impose specific weights |
"stacking" |
loo::loo_model_weights(method="stacking") |
Recommended default; optimises log-score over the simplex (Yao et al. 2018) |
"pseudobma" |
loo::loo_model_weights(method="pseudobma") |
Robust when one model is clearly better than the others |
# Fit two competing models
fit_linear <- hbm(brms::bf(y ~ x1 + x2 + x3),
data = my_data, re = ~(1 | regency))
fit_smooth <- hbm(brms::bf(y ~ x1 + x2 + x3),
data = my_data,
re = ~(1 | regency),
nonlinear = c("x2", "x3"),
nonlinear_type = "spline")
# Stacking (recommended Bayesian model averaging)
avg_stack <- model_average(fit_linear, fit_smooth,
method = "stacking")
# Inspect the computed weights
attr(avg_stack, "weights")
#> [1] 0.34 0.66
attr(avg_stack, "weight_method")
#> [1] "stacking"The returned hbsae_results object can be passed to
sae_benchmark() exactly like any single-model estimate.
9. Prior sensitivity analysis
After convergence has been verified, the next question is: .
prior_sensitivity() wraps the modern power-scaling
diagnostics of Kallioinen et al.
(2024) which detect:
- Prior-data conflict – the posterior moves non-negligibly when the prior is up- or down-weighted. Often indicates an overly informative or misspecified prior.
- Weak likelihood – the posterior is dominated by the prior. Common in SAE for areas with few sampled units.
fit <- hbm(brms::bf(y ~ x1 + x2 + x3),
data = my_data,
re = ~(1 | regency),
prior_type = "horseshoe",
chains = 4, iter = 4000)
# Report power-scaling diagnostics for every monitored parameter
ps <- prior_sensitivity(fit)
print(ps)
#> # A tibble: 8 x 4
#> variable prior likelihood diagnosis
#> <chr> <dbl> <dbl> <chr>
#> 1 b_Intercept 0.012 0.821 -
#> 2 b_x1 0.156 0.683 -
#> 3 b_x2 0.834 0.412 prior-data conflict
#> ...A non-“-” diagnosis flags a parameter that warrants closer
inspection; in this hypothetical output, the coefficient on
x2 shows a prior-data conflict, suggesting the horseshoe
prior is too strong for that covariate.
prior_sensitivity() requires the package; install with
install.packages("priorsense").
A realistic workflow combining everything:
# 1. Check data
chk <- check_data(my_data, response = "y",
auxiliary = c("x1", "x2", "x3"),
spatial_var = "kecamatan")
# 2. Build CAR matrix from shapefile
M <- build_spatial_weight("kecamatan.shp",
for_model = "car",
id_col = "kec_code")
# 3. Configure with helpers (v1.0.0+); pass bundles directly to hbm()
priors <- hbm_priors(prior_type = "horseshoe", hs_df_slab = 4)
nl <- hbm_nonlinear(c("x1"), type = "spline", k = 5)
ctrl <- hbm_control(chains = 4, iter = 4000, cores = 4,
adapt_delta = 0.95)
# 4. Fit
fit <- hbm(
formula = brms::bf(y ~ x1 + x2 + x3),
data = my_data,
spatial_var = "kecamatan", spatial_model = "car", M = M,
handle_missing = chk$recommended_method,
priors, nl, ctrl
)
# 5. Predict at all areas (including non-sample)
est <- sae_predict(fit)
# 6. Benchmark to regency totals (fully Bayesian)
T_kab <- c(Bogor = 110, Sukabumi = 145, Cianjur = 145)
bm <- sae_benchmark(est,
target = T_kab,
weights = my_data$populasi,
groups = my_data$regency,
method = "raking",
posterior = TRUE)
# 7. Use the corrected uncertainty
bm$result_table10. Hierarchical area random effects
Many SAE designs have multiple geographic levels –
e.g. regencies within provinces, or districts within counties within
province. hbsaems supports this directly through the
area_var argument, which now accepts a character
vector of column names from the highest level down:
# Two-level hierarchy: regency nested within province
fit <- hbm_lnln(
response = "y_obs",
auxiliary = c("x1", "x2", "x3"),
data = my_data,
area_var = c("province", "regency"), # province first!
area_re_structure = "nested", # default
sampling_variance = "psi_i"
)Internally the helper builds the brms / lme4 formula
(1 | province / regency), which Stan expands to two
independent normal random intercepts:
so the linear predictor for area in province becomes . This is the canonical multi-stage SAE specification (Rao & Molina 2015, Ch. 8).
When to use "crossed" instead of
"nested"? Use crossed when the two grouping
variables share levels across each other – e.g.
urban / rural across all provinces. Pass
area_re_structure = "crossed":
fit_crossed <- hbm_lnln(
response = "y_obs",
auxiliary = c("x1", "x2"),
data = my_data,
area_var = c("province", "urbanrural"),
area_re_structure = "crossed", # NOT nested
sampling_variance = "psi_i"
)which becomes (1 | province) + (1 | urbanrural).
Three or more levels. The same syntax extends arbitrarily:
hbm_lnln(
response = "y",
auxiliary = c("x1", "x2"),
data = my_data,
area_var = c("province", "regency", "district", "village"),
# area_re_structure = "nested" # default
sampling_variance = "psi_i"
)produces (1 | province / regency / district / village)
which brms expands to four nested random effects.
References
- Besag, J. (1974). Spatial interaction and the statistical analysis of lattice systems. JRSS-B, 36(2), 192-225.
- Anselin, L. (1988). Spatial Econometrics: Methods and Models. Kluwer.
- Pfeffermann, D. (2013). New important developments in small area estimation. Statistical Science, 28(1), 40-68.
- Piironen, J. & Vehtari, A. (2017). Sparsity information and regularization in the horseshoe and other shrinkage priors. Electronic J. of Statistics, 11(2), 5018-5051.
- Zhang, Y. D., Naughton, B. P., Bondell, H. D., & Reich, B. J. (2022). Bayesian Regression Using a Prior on the Model Fit: The R2-D2 Shrinkage Prior. JASA, 117(538), 862-874.
- Riutort-Mayol, G., Burkner, P.-C., Andersen, M. R., Solin, A., & Vehtari, A. (2023). Practical Hilbert space approximate Bayesian Gaussian processes for probabilistic programming. Statistics and Computing, 33, 17.
- Yao, Y., Vehtari, A., Simpson, D., & Gelman, A. (2018). Using stacking to average Bayesian predictive distributions. Bayesian Analysis, 13(3), 917-1007.
- Kallioinen, N., Paananen, T., Burkner, P.-C., & Vehtari, A. (2024). Detecting and diagnosing prior and likelihood sensitivity with power-scaling. Statistics and Computing, 34, 57.
- Vehtari, A., Gelman, A., Simpson, D., Carpenter, B., & Burkner, P.-C. (2021). Rank-normalization, folding, and localization: An improved R-hat for assessing convergence of MCMC. Bayesian Analysis, 16(2), 667-718.