Quantile Estimator for Grouped (Binned) Data

Computes quantiles from grouped frequency data using linear interpolation within the quantile class.

Usage

quantile_grouped(
  freq,
  lower_bounds,
  upper_bounds,
  probs = 0.5,
  midpoints = NULL
)

Arguments

freq: Numeric vector of class frequencies (counts). Must be non-negative.
lower_bounds: Numeric vector of lower class bounds. Must be strictly increasing.
upper_bounds: Numeric vector of upper class bounds. Must be strictly increasing, and each element must be greater than the corresponding element of lower_bounds.
probs: Numeric vector of probabilities (between 0 and 1) for which to compute the quantiles. Default is 0.5 (median).
midpoints: Optional numeric vector of class midpoints. Used only as a fallback when a quantile class has zero frequency. If NULL, each midpoint is computed as the arithmetic mean of the class bounds.

Value

A numeric vector of estimated quantiles, one for each element of probs. Returns NA if the total frequency is zero or missing.

Details

Consider grouped data divided into $J$ classes with known boundaries. Let:

$L_j$ be the lower bound of the $j$-th quantile class
$U_j$ be the upper bound of the $j$-th quantile class
$h_j = U_j - L_j$ be the $j$-th quantile class width
$C_{j-1}$ be the cumulative frequency of all classes before the quantile class (with $C_0 = 0$)
$f_j$ be the frequency within the quantile class $j$
$N = \sum_{i=1}^{J} f_i$ be the total frequency

The quantile class for the $p$-th quantile is the first class whose cumulative frequency $C_i = \sum_{m=1}^{i} f_m$ reaches $pN$:

$$j = \min\{i : C_i \geq pN\}$$

The $p$-th quantile $Q(p)$ is then estimated by linear interpolation within the quantile class:

$$\widetilde{Q(p)} = L_j + \frac{(pN - C_{j-1})}{f_j} \cdot h_j$$
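
As a worked illustration, take the income data from the Examples section: seven classes with frequencies 120, 180, 150, 80, 40, 20, 10, so $N = 600$. For $p = 0.8$ we have $pN = 480$, and the cumulative frequencies 120, 300, 450, 530, ... first reach 480 in the fourth class, which spans 45000 to 60000. Hence $j = 4$, $C_3 = 450$, $f_4 = 80$, $h_4 = 15000$, and

$$\widetilde{Q(0.8)} = 45000 + \frac{480 - 450}{80} \cdot 15000 = 50625$$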

The method assumes a uniform distribution of observations within each class interval. This is a standard approach for grouped data when individual observations are not available.
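
The steps above can be sketched in a few lines of R. This is only an illustration under simplifying assumptions (finite bounds, a single probability, no input validation); the helper name quantile_grouped_sketch is hypothetical and is not the function's actual implementation.

```r
# Hypothetical helper: a minimal sketch of the interpolation formula,
# assuming finite class bounds and a single probability p.
quantile_grouped_sketch <- function(freq, lower, upper, p) {
  N <- sum(freq)                            # total frequency
  if (is.na(N) || N == 0) return(NA_real_)  # no usable data
  cum <- cumsum(freq)                       # C_1, ..., C_J
  j <- which(cum >= p * N)[1]               # first class with C_i >= pN
  if (freq[j] == 0) {
    return((lower[j] + upper[j]) / 2)       # empty class: midpoint fallback
  }
  c_prev <- if (j == 1) 0 else cum[j - 1]   # C_{j-1}, with C_0 = 0
  lower[j] + (p * N - c_prev) / freq[j] * (upper[j] - lower[j])
}

freq  <- c(5, 8, 10, 4, 3)
lower <- c(0, 10, 20, 30, 40)
upper <- c(10, 20, 30, 40, 50)
sapply(c(0.25, 0.5, 0.75),
       function(p) quantile_grouped_sketch(freq, lower, upper, p))
#> [1] 13.125 22.000 29.500
```

The sketch handles one probability at a time to keep the correspondence with the formula visible; a vectorised version would apply the same steps to each element of probs.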

Handling Open-Ended Classes

In administrative or tax data, the first class often covers all negative incomes and the last class covers all incomes above a certain threshold. Such classes are open-ended: the lower bound of the first class is -Inf and the upper bound of the last class is Inf.

If -Inf or Inf appears in the supplied bounds, the function imputes a finite bound by borrowing the width of the adjacent class:

For an open left class (first lower bound = -Inf):

The imputed lower bound is given by: $$L_1^* = U_1 - h_2$$

where $U_1$ is the upper bound of the first class and $h_2 = U_2 - L_2$ is the width of the second class. This assumes the first class has the same width as the second class.

For an open right class (last upper bound = Inf):

The imputed upper bound is given by: $$U_J^* = L_J + h_{J-1}$$

where $L_J$ is the lower bound of the last class and $h_{J-1} = U_{J-1} - L_{J-1}$ is the width of the second-to-last class. This assumes the last class has the same width as the penultimate class.
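
The two imputation rules can be illustrated with a short R sketch. The helper name impute_open_bounds is hypothetical, and the code assumes at least two classes and that the class next to an open one is itself bounded; the function's internal handling may differ.

```r
# Hypothetical helper illustrating the imputation rules above;
# not the function's actual internal code.
impute_open_bounds <- function(lower, upper) {
  J <- length(lower)
  if (J < 2) stop("at least two classes are needed to borrow a width")
  if (is.infinite(lower[1])) {
    # L_1* = U_1 - h_2: reuse the width of the second class
    lower[1] <- upper[1] - (upper[2] - lower[2])
  }
  if (is.infinite(upper[J])) {
    # U_J* = L_J + h_{J-1}: reuse the width of the second-to-last class
    upper[J] <- lower[J] + (upper[J - 1] - lower[J - 1])
  }
  list(lower = lower, upper = upper)
}

b <- impute_open_bounds(
  lower = c(-Inf, 0, 10000, 30000),
  upper = c(0, 10000, 30000, Inf)
)
b$lower
#> [1] -10000      0  10000  30000
b$upper
#> [1]     0 10000 30000 50000
```

Here the first class borrows the width 10000 of the second class, and the last class borrows the width 20000 of the third.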

Special Cases

If the quantile class has zero frequency, the function returns the class midpoint as a fallback.
If total frequency is zero or NA, the function returns NA for all requested quantiles.

Examples

# Basic usage: compute quartiles
freq <- c(5, 8, 10, 4, 3)
lower <- c(0, 10, 20, 30, 40)
upper <- c(10, 20, 30, 40, 50)

quantile_grouped(freq, lower, upper, probs = c(0.25, 0.5, 0.75))
#>   25 %   50 %   75 % 
#> 13.125 22.000 29.500 

# Compute deciles
quantile_grouped(freq, lower, upper, probs = seq(0.1, 0.9, by = 0.1))
#>  10 %  20 %  30 %  40 %  50 %  60 %  70 %  80 %  90 % 
#>  6.00 11.25 15.00 18.75 22.00 25.00 28.00 32.50 40.00 

# With custom midpoints
midpts <- c(5, 15, 25, 35, 45)
quantile_grouped(freq, lower, upper, probs = 0.5, midpoints = midpts)
#> 50 % 
#>   22 

# Income distribution example
income_freq <- c(120, 180, 150, 80, 40, 20, 10)
income_lower <- c(0, 15000, 30000, 45000, 60000, 80000, 100000)
income_upper <- c(15000, 30000, 45000, 60000, 80000, 100000, 150000)

# Compute median income
quantile_grouped(income_freq, income_lower, income_upper, probs = 0.5)
#>  50 % 
#> 30000 

# Compute income quintiles
quantile_grouped(income_freq, income_lower, income_upper,
                 probs = seq(0.2, 0.8, by = 0.2))
#>  20 %  40 %  60 %  80 % 
#> 15000 25000 36000 50625