Performs three independent checks on the supplied dataset and returns a structured hbsaems_data_check object that summarises the results. This function is intended to be called before hbm or any of the distribution-specific wrappers.

Usage

check_data(
  data,
  response,
  auxiliary = NULL,
  area_var = NULL,
  spatial_var = NULL,
  M = NULL,
  trials = NULL,
  n_var = NULL,
  predictors = NULL,
  group = NULL,
  sre = NULL
)

Arguments

data

A data.frame.

response

Character. Name of the response variable in data.

auxiliary

Character vector. Names of the auxiliary variables (the SAE 'X' covariates).

area_var

Optional character. Name of the area (random-effect grouping) variable.

spatial_var

Optional character. Name of the spatial-grouping variable (column in data that indexes rows of the spatial weight matrix M).

M

Optional spatial weight matrix to dimension-check against data.

trials

Optional character. Name of the trials variable (binomial models).

n_var

Optional character. Name of the sample-size variable (beta / lognormal direct-estimator models).

predictors

Deprecated. Use auxiliary instead.

group

Deprecated. Use area_var instead.

sre

Deprecated. Use spatial_var instead.

Value

An object of class hbsaems_data_check with components:

n_obs

Number of rows in the data.

missing_summary

Named integer vector: per-variable count of NA.

missing_pattern

Character: "none", "y_only", "x_only", or "both".

dimension_check

Named list of dimension diagnostics.

non_sample_warning

Character or NULL – a hint to investigate whether NA-Y rows are non-sample (out-of-sample) areas.

recommended_method

Character: suggested handle_missing value ("deleted", "multiple", "model", or NA when no missing values are present).

recommendation_text

Human-readable rationale.

issues

Character vector of fatal errors (length 0 if OK).

Details

1. Variable presence

Verifies that response, every name in auxiliary, and the optional area_var / spatial_var / trials / n_var columns exist in data. Missing variables are reported in $issues.
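
The presence check can be pictured as a set difference between the requested column names and names(data). The sketch below uses only base R; find_missing_columns is a hypothetical helper written for illustration and is not the package's internal code.

```r
# Hypothetical helper: illustrates the presence check, not the package's code.
find_missing_columns <- function(data, response, auxiliary = NULL,
                                 area_var = NULL, spatial_var = NULL,
                                 trials = NULL, n_var = NULL) {
  # c() drops NULL entries, so unset optional arguments are ignored.
  requested <- c(response, auxiliary, area_var, spatial_var, trials, n_var)
  # Names that were requested but are not columns of data.
  setdiff(requested, names(data))
}

# Toy data (made up for this sketch): "x3" and "area" are not present.
toy <- data.frame(y = c(1.2, 0.8), x1 = c(3, 4), x2 = c(0.5, 0.1))
absent <- find_missing_columns(toy, response = "y",
                               auxiliary = c("x1", "x2", "x3"),
                               area_var = "area")
print(absent)
```

A non-empty result here corresponds to the kind of entry that would be reported in $issues.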

2. Missing-value pattern

The pattern is one of:

"none"

All listed columns are complete.

"y_only"

Only the response has NAs.

"x_only"

Only the auxiliary variables have NAs.

"both"

Both Y and X have NAs.
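
As a rough illustration of how the four labels relate to per-variable NA counts, the following base R sketch classifies a data frame. classify_missing_pattern is a hypothetical name used only here; the package may compute the pattern differently.

```r
# Hypothetical helper: derive the pattern label from per-variable NA counts.
classify_missing_pattern <- function(data, response, auxiliary) {
  # Named vector of NA counts, one entry per listed column.
  na_counts <- colSums(is.na(data[c(response, auxiliary)]))
  y_missing <- na_counts[[response]] > 0
  x_missing <- any(na_counts[auxiliary] > 0)

  if (y_missing && x_missing) {
    "both"
  } else if (y_missing) {
    "y_only"
  } else if (x_missing) {
    "x_only"
  } else {
    "none"
  }
}

# Toy data (made up for this sketch).
toy <- data.frame(y = c(1, NA, 3), x1 = c(0.2, 0.4, 0.6), x2 = c(5, 6, 7))
pattern_y <- classify_missing_pattern(toy, "y", c("x1", "x2"))

toy$x1[3] <- NA
pattern_both <- classify_missing_pattern(toy, "y", c("x1", "x2"))
```

With the toy data, pattern_y is "y_only" and, after an NA is introduced in x1, pattern_both is "both".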

Based on the pattern, a strategy is recommended:

"y_only"

First, check whether the NA-Y rows are non-sample (out-of-sample) areas – domains for which a prediction is desired but no direct estimate exists. If yes, do not delete them; fit on the complete-Y subset and pass the NA-Y rows to sae_predict via the newdata argument. If they are merely missing observations within sampled areas, use handle_missing = "deleted".

"x_only"

handle_missing = "multiple" – multiple imputation via mice.

"both" (continuous outcome)

handle_missing = "model" – joint Bayesian imputation via brms::mi().

"both" (discrete outcome, binomial)

handle_missing = "multiple".
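
The mapping above, together with the non-sample split suggested for the "y_only" case, might be sketched in base R as follows. recommend_method and its discrete_outcome argument are hypothetical names introduced for illustration; the model-fitting and sae_predict calls are deliberately left out because their full signatures are not shown on this page.

```r
# Hypothetical helper mirroring the pattern-to-strategy mapping described above.
recommend_method <- function(pattern, discrete_outcome = FALSE) {
  switch(pattern,
    none   = NA_character_,  # nothing to handle
    y_only = "deleted",      # after ruling out non-sample areas
    x_only = "multiple",
    both   = if (discrete_outcome) "multiple" else "model",
    stop("unknown pattern: ", pattern)
  )
}

# Splitting NA-Y rows from the rest when they are non-sample areas.
# Toy data (made up for this sketch).
d <- data.frame(area = 1:4,
                y    = c(2.1, NA, 1.7, NA),
                x1   = c(0.3, 0.9, 0.5, 0.1))
sampled    <- d[!is.na(d$y), ]  # complete-Y subset used to fit the model
non_sample <- d[is.na(d$y), ]   # rows to supply as newdata for prediction
```

Here sampled keeps areas 1 and 3, while non_sample holds areas 2 and 4 with their auxiliary values intact.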

3. Dimension check

When M is supplied, verifies that it is square and that nrow(M) matches the number of distinct levels in data[[spatial_var]] (or nrow(data) when spatial_var is NULL).
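
A minimal base R sketch of the same comparison is given below. check_dimensions is a hypothetical helper; the element names follow the $dimension_check output printed in the Examples, but the logic is an assumption about how such a check could be written, not the package's code.

```r
# Hypothetical helper: compare the size of M with the number of areas in data.
check_dimensions <- function(data, M, spatial_var = NULL) {
  n_areas_data <- if (is.null(spatial_var)) {
    nrow(data)                            # one area per row
  } else {
    length(unique(data[[spatial_var]]))   # distinct spatial levels
  }
  list(
    n_areas_data = n_areas_data,
    n_areas_M    = nrow(M),
    M_is_square  = nrow(M) == ncol(M),
    dim_match    = nrow(M) == n_areas_data
  )
}

# Toy data (made up for this sketch): four rows, three distinct regions.
toy <- data.frame(region = c("a", "a", "b", "c"), y = c(1, 2, 3, 4))
M <- matrix(c(0, 1, 0,
              1, 0, 1,
              0, 1, 0), nrow = 3, byrow = TRUE)

dims_by_region <- check_dimensions(toy, M, spatial_var = "region")
dims_by_row    <- check_dimensions(toy, M)
```

With spatial_var = "region" the three distinct levels match the 3 x 3 matrix, whereas without it the four rows do not, which shows why spatial_var matters when several rows share an area.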

Examples

data("data_fhnorm")

# 1. Complete data -> no warnings, no recommendation
chk <- check_data(data_fhnorm,
                  response   = "y",
                  auxiliary  = c("x1", "x2", "x3"))
print(chk)
#> 
#> HBSAE Data Check  [hbsaems_data_check]
#> ---------------------------------------
#>  Observations    : 100 
#>  Missing pattern : none (data complete) 
#>  Issues          : none
#> 
#>  - No missing values detected. handle_missing can stay NULL. 
#> 
#> Use summary() for full details.
#> 

# 2. Missing-Y pattern -> recommends checking for non-sample areas
d <- data_fhnorm
d$y[1:5] <- NA
chk2 <- check_data(d, response = "y",
                   auxiliary = c("x1", "x2", "x3"))
summary(chk2)
#> 
#> ===== HBSAE Data Check Summary =====
#> 
#> Observations: 100 
#> 
#> Missing values per variable:
#>  Variable NA_count    Pct
#>         y        5   5.0%
#>        x1        0   0.0%
#>        x2        0   0.0%
#>        x3        0   0.0%
#> 
#> Non-sample-area note:
#> Response 'y' has 5 missing value(s) (5.0% of rows).
#>   IMPORTANT: Inspect these rows -- are they NON-SAMPLE areas
#>   (out-of-sample domains for which you want PREDICTIONS rather
#>   than estimates)?
#>     * If YES (non-sample): keep these rows in the data, fit the
#>       model on the complete-Y subset, and use sae_predict() with
#>       newdata = <NA-Y rows> to obtain area-level predictions.
#>     * If NO  (truly missing within sampled areas): use
#>       handle_missing = 'deleted' (or 'model' for continuous Y
#>       and 'multiple' for discrete Y).
#> 
#> Recommendation:
#>   handle_missing =‘deleted’
#>   Only the response is missing. Recommended: handle_missing = 'deleted'. BUT first verify whether the NA-Y rows are non-sample
#>   areas (see non_sample_warning).

# 3. Missing-X-only -> recommends multiple imputation
d2 <- data_fhnorm
d2$x1[10:15] <- NA
chk3 <- check_data(d2, response = "y",
                   auxiliary = c("x1", "x2", "x3"))
chk3$recommended_method
#> [1] "multiple"

# 4. Spatial dimension check
data("adjacency_matrix_car")
chk4 <- check_data(data_fhnorm[1:5, ],
                   response   = "y",
                   auxiliary  = c("x1", "x2", "x3"),
                   M          = adjacency_matrix_car)
chk4$dimension_check
#> $n_areas_data
#> [1] 5
#> 
#> $n_areas_M
#> [1] 5
#> 
#> $M_is_square
#> [1] TRUE
#> 
#> $dim_match
#> [1] TRUE
#>