Quantile estimation with complex sampling data

Computes quantiles of weighted or unweighted data under several interpolation types. The method extends the quantile definitions of Hyndman and Fan (1996) and the Harrell and Davis (1982) estimator to complex survey data by incorporating sampling weights into the cumulative distribution function and the interpolation points, as proposed in Scarpa et al. (2025).

Usage

csquantile(y, weights = NULL, probs = seq(0, 1, 0.1), type = 4, na.rm = FALSE)

Arguments

y: Numeric vector of observations.
weights: Optional numeric vector of sampling weights; if NULL (default), all observations are equally weighted.
probs: Numeric vector of probabilities (default: seq(0, 1, 0.1)).
type: Quantile estimation type: integer 4–9 or "HD" for Harrell–Davis (default: 4).
na.rm: Logical; if TRUE, missing values are removed before computing. Default: FALSE. Higher-level functions in this package handle NA removal before calling csquantile(), so the default is kept FALSE to avoid redundant filtering.

Value

A named numeric vector of estimated quantiles corresponding to probs.

Details

Consider a random sample $s$ of size $n$ drawn from a finite population. Let $y_{(1)} \le \ldots \le y_{(n)}$ denote the ordered sample observations and $w_1, \ldots, w_n$ the sampling weights arranged in the same order, so that $w_k$ is the weight attached to $y_{(k)}$. Define the cumulative weights $W_j = \sum_{i \le j} w_i$, so that the total weight is $W_n = \sum_{i=1}^n w_i$. The weighted quantile estimator is computed by linear interpolation between adjacent order statistics:

$$ \widehat{Q}(p) = y_{(k-1)} + (y_{(k)} - y_{(k-1)}) \frac{p - \widehat{r}_{k-1}}{\widehat{r}_k - \widehat{r}_{k-1}}, $$

where $\widehat{r}_k$ denotes the estimated cumulative distribution function at $y_{(k)}$ (the “plotting position”), and the order $k$ is such that $W_{k-1} - \widehat{m}_{k-1} \le W_n p < W_k - \widehat{m}_k$, with $\widehat{m}_k$ determined by the interpolation method.
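
As a small worked illustration (the numbers are invented for this example and do not come from the package), take the simplest rule, type 4 in the table below, for which $\widehat{r}_k = W_k / W_n$ and $\widehat{m}_k = 0$. With ordered observations $(10, 20, 30, 40)$, weights $(1, 1, 2, 4)$ and $p = 0.75$:

$$
\begin{aligned}
(W_1, W_2, W_3, W_4) &= (1, 2, 4, 8), \qquad W_n p = 8 \times 0.75 = 6, \\
W_3 = 4 \le 6 < 8 = W_4 \;&\Rightarrow\; k = 4, \qquad \widehat{r}_3 = 0.5, \quad \widehat{r}_4 = 1, \\
\widehat{Q}(0.75) &= 30 + (40 - 30)\,\frac{0.75 - 0.5}{1 - 0.5} = 35.
\end{aligned}
$$

With equal weights the same rule gives $W_3 = 3 \le 4 \times 0.75 = 3 < 4 = W_4$, so again $k = 4$, but now $\widehat{r}_3 = 0.75$ and $\widehat{Q}(0.75) = 30$. The heavier weight on the largest observation pulls the weighted estimate upward.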

The table below summarizes the six interpolation types (4–9), extended from Hyndman and Fan (1996) to incorporate sampling weights as described in Scarpa et al. (2025).

Type	Estimator $\widehat{r}_k$	Interpolation $\widehat{m}_k$	Selection rule for $k$
4	$W_k / W_n$	0	$W_{k-1} \le W_n p < W_k$
5	$(W_k - \tfrac{1}{2} w_k) / W_n$	$w_k / 2$	$W_{k-1} - \tfrac{w_{k-1}}{2} \le W_n p < W_k - \tfrac{w_k}{2}$
6	$W_k / (W_n + w_n)$	$w_n p$	$W_{k-1} \le (W_n + w_n)p < W_k$
7	$W_{k-1} / W_{n-1}$	$w_k - w_n p$	$W_{k-2} \le W_{n-1}p < W_{k-1}$
8	$(W_k - \tfrac{1}{3}w_k) / (W_n + \tfrac{w_n}{3})$	$\tfrac{w_k}{3} + \tfrac{w_n}{3}p$	$W_{k-1} - \tfrac{w_{k-1}}{3} \le (W_n + \tfrac{w_n}{3})p < W_k - \tfrac{w_k}{3}$
9	$(W_k - \tfrac{3}{8}w_k) / (W_n + \tfrac{1}{4}w_n)$	$\tfrac{3}{8}w_k + \tfrac{w_n}{4}p$	$W_{k-1} - \tfrac{3w_{k-1}}{8} \le (W_n + \tfrac{w_n}{4})p < W_k - \tfrac{3w_k}{8}$

For unweighted data (weights = NULL), types 4–9 reduce to the standard R quantiles of the same type.
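
The interpolation rule and the plotting positions in the table above can be sketched in a few lines of R. This is an illustrative sketch, not the package's internal code: the function name `wquantile_sketch` is hypothetical, and it assumes strictly positive weights, no missing values, and at least two observations. It computes $\widehat{r}_k$ for the chosen type and interpolates linearly between adjacent positions with `approx()`. Returning the smallest or largest observation when $p$ falls outside the range of the plotting positions is an assumption of this sketch, chosen to match what `quantile()` does for unweighted data.

```r
# Illustrative sketch only (hypothetical helper, not part of the package).
# Assumes strictly positive weights, no NAs, and at least two observations.
wquantile_sketch <- function(y, w = rep(1, length(y)),
                             probs = c(0.25, 0.5, 0.75), type = 4) {
  stopifnot(length(y) == length(w), length(y) >= 2,
            all(w > 0), type %in% 4:9)

  ord <- order(y)
  ys  <- y[ord]      # order statistics y_(1) <= ... <= y_(n)
  ws  <- w[ord]      # weights carried along with their observations
  n   <- length(ys)
  W   <- cumsum(ws)  # cumulative weights W_1, ..., W_n
  Wn  <- W[n]        # total weight W_n
  wn  <- ws[n]       # weight attached to the largest observation

  # Plotting positions r_k from the table, one per order statistic
  r <- switch(as.character(type),
    "4" = W / Wn,
    "5" = (W - ws / 2) / Wn,
    "6" = W / (Wn + wn),
    "7" = (W - ws) / (Wn - wn),            # W_{k-1} / W_{n-1}
    "8" = (W - ws / 3) / (Wn + wn / 3),
    "9" = (W - 3 * ws / 8) / (Wn + wn / 4)
  )

  # Linear interpolation between (r_{k-1}, y_(k-1)) and (r_k, y_(k));
  # rule = 2 returns y_(1) or y_(n) when p lies outside the range of r.
  q <- approx(x = r, y = ys, xout = probs, rule = 2)$y
  names(q) <- paste0(100 * probs, "%")
  q
}

y <- c(10, 20, 30, 40)
w <- c(1, 1, 2, 4)
wquantile_sketch(y, w, probs = 0.5, type = 4)  # 50%: 30
wquantile_sketch(y, probs = 0.5, type = 4)     # 50%: 20 (equal weights)
```

Because the selection rule for $k$ simply locates $p$ between two consecutive plotting positions, interpolating on the pairs $(\widehat{r}_k, y_{(k)})$ avoids coding the rule separately for each type.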

The Harrell–Davis estimator ("HD") is extended to the weighted case as proposed in Kreutzmann (2018), by redefining the weighting coefficients $\widehat{\mathcal{W}}_j(p)$ for order statistics as:

$$ \widehat{\mathcal{W}}_j(p) = b_{(W_j / W_n)}\{(W_n + w_n)p, W_n - (W_n + w_n)p + w_n\} - b_{(W_{j-1}/W_n)}\{(W_n + w_n)p, W_n - (W_n + w_n)p + w_n\}, $$

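where $b_x(a,b)$ denotes the regularized incomplete beta function evaluated at $x$ with shape parameters $a$ and $b$, and $W_0 = 0$. The second shape parameter equals $(W_n + w_n)(1 - p)$.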

The resulting quantile estimator is $\widehat{Q}_{HD}(p) = \sum_{j \in s} \widehat{\mathcal{W}}_j(p) y_{(j)}$. For unweighted data, this reduces to the standard Harrell–Davis quantile estimator.
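
The weighted Harrell–Davis formula can be sketched along the same lines. Again this is an illustration under stated assumptions rather than the package's implementation: the name `whd_sketch` is hypothetical, weights are taken to be strictly positive, and $p$ is a single value in $(0, 1)$. The sketch reads $b_x(a, b)$ as the regularized incomplete beta function, which in R is the Beta distribution function `pbeta(x, a, b)`.

```r
# Illustrative sketch only (hypothetical helper, not part of the package).
# Assumes strictly positive weights, no NAs, and a single p with 0 < p < 1.
whd_sketch <- function(y, w = rep(1, length(y)), p = 0.5) {
  stopifnot(length(y) == length(w), all(w > 0),
            length(p) == 1, p > 0, p < 1)

  ord <- order(y)
  ys  <- y[ord]      # order statistics
  ws  <- w[ord]      # weights in the same order
  n   <- length(ys)
  W   <- cumsum(ws)  # cumulative weights W_1, ..., W_n
  Wn  <- W[n]        # total weight
  wn  <- ws[n]       # weight attached to the largest observation

  a <- (Wn + wn) * p              # first shape parameter
  b <- Wn - (Wn + wn) * p + wn    # second shape parameter, (Wn + wn) * (1 - p)

  # Beta distribution function at W_0 / W_n = 0, W_1 / W_n, ..., W_n / W_n = 1
  cdf  <- pbeta(c(0, W / Wn), a, b)
  coef <- diff(cdf)               # one coefficient per order statistic; sums to 1

  sum(coef * ys)
}

whd_sketch(c(10, 20, 30, 40), c(1, 1, 2, 4), p = 0.5)
```

Since the coefficients are increments of a distribution function over $[0, 1]$, they are non-negative and sum to one, so the estimate is a weighted average of all order statistics and always lies between the smallest and largest observation.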

References

Harrell FE, Davis CE (1982). “A new distribution-free quantile estimator.” Biometrika, 69, 635–640.

Hyndman RJ, Fan Y (1996). “Sample quantiles in statistical packages.” The American Statistician, 50, 361–365.

Kreutzmann AK (2018). “Estimation of sample quantiles: challenges and issues in the context of income and wealth distributions.” AStA Wirtschafts-und Sozialstatistisches Archiv, 12, 245–270.

Scarpa S, Ferrante MR, Sperlich S (2025). “Inference for the quantile ratio inequality index in the context of survey data.” Journal of Survey Statistics and Methodology. doi:10.1093/jssam/smaf024.

Examples

data(synthouse)
y <- synthouse$eq_income
w <- synthouse$weight

# Unweighted quantiles
csquantile(y, probs = c(0.25, 0.5, 0.75), type = 6)
#>      25%      50%      75% 
#> 12910.48 20429.20 32529.02 

# Weighted quantiles
csquantile(y, weights = w, probs = c(0.25, 0.5, 0.75), type = 6)
#>      25%      50%      75% 
#> 12353.29 20014.17 32222.93 

# Harrell-Davis estimator
csquantile(y, weights = w, probs = c(0.25, 0.5, 0.75), type = "HD")
#>      25%      50%      75% 
#> 12352.14 20017.37 32213.50
