Implements the rescaled bootstrap method for variance estimation in survey data, supporting both stratified simple random sampling and multistage complex designs.
Usage
rescaled_bootstrap(
data,
y,
strata,
N_h = NULL,
psu = NULL,
weights = NULL,
estimator,
by_strata = TRUE,
B = 200,
m_h = NULL,
seed = NULL,
verbose = TRUE
)Arguments
- data
A data frame containing the survey data.
- y
A character string specifying the variable name to be used for the target variable.
- strata
A character string specifying the stratification variable.
- N_h
Optional vector of stratum population sizes, used for the finite population correction (FPC). Can be a single value (applied to all strata) or one value per stratum.
- psu
Optional character string specifying the Primary Sampling Unit (PSU) variable. Required for multistage complex designs.
- weights
Optional character string specifying the sampling weight variable. Required for complex designs with unequal inclusion probabilities.
- estimator
A function that computes the statistic of interest, accepting arguments
estimator(y, weights)for complex designs orestimator(y)for simple designs.- by_strata
Logical; if
TRUE, variances are computed separately by stratum.- B
Integer; number of bootstrap replicates (default = 200).
- m_h
Optional vector of bootstrap sample sizes per stratum (PSUs for complex designs). If
NULL, defaults to \(m_h = \lfloor (n_h - 2)^2 / (n_h - 1) \rfloor\).- seed
Optional integer for reproducibility.
- verbose
Logical; if
TRUE(default), displays a progress bar during bootstrap iterations.
Value
A list containing:
- variance
Bootstrap variance estimate
- boot_estimates
Vector of B bootstrap estimates
- B
Number of bootstrap replicates
- by_strata
Logical;
TRUEif variance is computed separately by stratum.- design
Character string:
"two-stage stratified"or"stratified SRS".- strata_info
Data frame with number of observations/PSUs per stratum.
- call
The matched function call.
Details
The rescaled bootstrap is a resampling technique designed for complex survey data that preserves stratification and primary sampling unit (PSU) structure, providing consistent variance estimation for both smooth and non-smooth statistics. The methodology is based on Rao and Wu (1988) and Rao et al. (1992) .
(1) Stratified Simple Random Sampling design
Consider a finite population divided into \(H\) strata, each of size \(N_h\), with a sample of size \(n_h\) selected independently in each \(h\) stratum. Suppose to be interested in some \(\theta\) parameter, with \(\hat{\theta}\) sampling estimator. For each \(b\) bootstrap replicate, \(b = 1, \ldots, B\) and stratum \(h\):
Draw a bootstrap sample of size \(m_h\) with replacement from the \(n_h\) sampled units. By default, \(m_h = \lfloor (n_h - 2)^2 / (n_h - 1) \rfloor \approx n_h - 3\).
Compute rescaled bootstrap values: $$ \tilde{y}_{hj}^{*(b)} = \bar{y}_h + \sqrt{\frac{m_h(1-f_h)}{n_h - 1}} (y_{hj}^{*(b)} - \bar{y}_h), $$ where \(y_{hj}^{*(b)}\) is the bootstrap observation, \(1 - f_h\) is the FPC, with \(f_h = n_h / N_h\), and \(\bar{y}_h\) is the sample stratum mean.
Compute the statistic of interest \(\hat{\theta}^{*(b)}_h\) using rescaled values.
The variance is then estimated by the bootstrap variance.
(2) Two-Stage Stratified Sampling design
For designs with PSUs and sampling weights:
Within each stratum \(h\), draw \(m_h\) PSUs with replacement from the \(n_h\) sampled PSUs. By default, \(m_h = \lfloor (n_h - 2)^2 / (n_h - 1) \rfloor \approx n_h - 3\).
Let \(m_{hi}^{(b)}\) denote the number of times PSU \(i\) is selected in replicate \(b\). Each observation in the \(i\)-th PSU is assigned a rescaled bootstrap weight: $$ w_{hij}^{*(b)} = \left[ 1 - c_h + c_h \frac{n_h}{m_h} m_{hi}^{(b)} \right] w_{hij}, \qquad c_h = \sqrt{\frac{m_h}{n_h - 1}}. $$ \(w_{hij}\) is the sampling weight associated to individual \(j\) in PSU \(i\) in stratum \(h\)
The statistic \(\hat{\theta}^{*(b)}_h\) is computed using the rescaled weights.
The variance is then estimated by the bootstrap variance.
Multiple estimators
The estimator argument accepts any function with signature
f(y, weights) (complex design) or f(y) (simple design),
including functions from this package and user-defined ones.
When estimator returns a named numeric vector, variances are
computed for all outputs simultaneously from the same bootstrap replicates,
so the resulting standard errors are directly comparable across indicators.
References
Rao J, Wu C (1988). “Resampling inference with complex survey data.” Journal of the American Statistical Association, 83, 231–241.
Rao J, Wu C, Yue K (1992). “Some recent work on resampling methods for complex surveys.” Survey methodology, 18, 209–217.
Kolenikov S (2010). “Resampling variance estimation for complex survey data.” The Stata Journal, 10, 165–199.
Scarpa S, Ferrante MR, Sperlich S (2025). “Inference for the quantile ratio inequality index in the context of survey data.” Journal of Survey Statistics and Methodology. doi:10.1093/jssam/smaf024 .
See also
For a convenience wrapper that automatically computes all package inequality
indicators and their standard errors in a single call, see inequantiles.
Examples
# \donttest{
data(synthouse)
# ================================================================
# Example 1: Stratified Simple Random Sampling (SRS)
# ================================================================
# Use NUTS2 as strata
set.seed(123)
# Simulate population sizes per stratum (for FPC)
N_values <- sample(2000:5000, length(unique(synthouse$NUTS2)), replace = TRUE)
names(N_values) <- sort(unique(synthouse$NUTS2))
# Define a simple mean estimator
mean_estimator <- function(y) mean(y, na.rm = TRUE)
# Apply the rescaled bootstrap under stratified SRS
boot_srs <- rescaled_bootstrap(
data = synthouse,
y = "eq_income",
strata = "NUTS2",
N_h = N_values,
estimator = mean_estimator,
by_strata = TRUE,
B = 50, # small number for illustration
seed = 123,
verbose = FALSE
)
# View results
boot_srs$variance
#> C01 C02 C03 C04 C05 C06 N01 N02
#> 499863.9 573118.2 671053.7 545446.5 775512.9 475065.6 341097.8 350873.9
#> N03 N04 N05 N06 NE01 NE02 NE03 NE04
#> 307397.0 232378.7 320968.3 398986.7 509534.9 371716.5 416917.1 551225.8
#> NE05 NE06 NO01 NO02 NO03 NO04 NO05 NO06
#> 447320.5 337993.0 374534.2 305472.6 758880.7 398586.5 782366.0 1053463.7
#> S01 S02 S03 S04 S05 S06
#> 257089.8 416672.4 301026.8 420146.3 342398.1 536383.0
# ================================================================
# Example 2: Two-stage Complex Design
# ================================================================
# Estimate the QRI estimator sampling variance.
boot_complex <- rescaled_bootstrap(
data = synthouse,
y = "eq_income",
strata = "NUTS2",
psu = "municipality",
weights = "weight",
estimator = qri,
by_strata = TRUE,
B = 50,
seed = 456,
verbose = FALSE
)
# Display variance and bootstrap estimates
summary(boot_complex$variance)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 0.0003333 0.0006152 0.0007907 0.0008676 0.0010729 0.0016656
# Strata and PSU summary
# ================================================================
# Example 3: Multiple estimators in a single bootstrap loop
# ================================================================
# Create a function returning a named vector of estimates,
# including package functions and user-defined ones. All indicators share
# the same bootstrap replicates, ensuring directly comparable standard errors.
multi_estimator <- function(y, weights) {
c(
w_mean = sum(y * weights) / sum(weights), # custom: weighted mean
qri = qri(y, weights = weights), # package function
qsr = share_ratio(y, weights = weights) # package function
)
}
boot_multi <- rescaled_bootstrap(
data = synthouse,
y = "eq_income",
strata = "NUTS2",
psu = "municipality",
weights = "weight",
estimator = multi_estimator,
by_strata = FALSE,
B = 50,
seed = 42,
verbose = FALSE
)
# One variance per indicator, all from the same replicates
boot_multi$variance
#> w_mean qri qsr
#> 1.292902e+05 2.151047e-05 3.087562e-02
# }
# ================================================================
# Note:
# These examples use small B for speed. For actual analysis,
# use B >= 200 for stable estimates.
# ================================================================