Synthetic Household Survey Data — synthouse • inequantiles

A realistic synthetic dataset based on the empirical structure of real IT-SILC (Italian Survey on Income and Living Conditions) 2024 data.

Usage

synthouse

Format

A data frame with 20,034 rows (individuals nested in 10,099 households) and 17 variables covering demographic, socio-economic, and geographic information:

person_id

Character. Unique person identifier, composed of the household ID followed by a person index within the household (format: HH000001P1, HH000001P2, HH000002P1, ...)

hh_id

Character. Household identifier. All individuals in the same household share this ID (format: HH000001, HH000002, ...)
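The identifier format described above can be reproduced with a few lines of base R. This is a minimal sketch: `make_ids()` is a hypothetical helper written for illustration, not a function of the package, and the zero-padding width of six digits is inferred from the documented examples.

```r
# Hypothetical helper (not part of inequantiles): build household and
# person identifiers in the documented format from a vector of household sizes.
make_ids <- function(hh_sizes) {
  # Repeat each household index once per member
  hh_index <- rep(seq_along(hh_sizes), times = hh_sizes)
  hh_id <- sprintf("HH%06d", hh_index)
  # Person index restarts at 1 within each household
  person_idx <- sequence(hh_sizes)
  data.frame(
    person_id = paste0(hh_id, "P", person_idx),
    hh_id = hh_id,
    stringsAsFactors = FALSE
  )
}

ids <- make_ids(c(2, 1))
ids$person_id
#> [1] "HH000001P1" "HH000001P2" "HH000002P1"

# The household ID can be recovered by dropping the person suffix
sub("P[0-9]+$", "", ids$person_id)
#> [1] "HH000001" "HH000001" "HH000002"
```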

NUTS1

Character. NUTS1 macro-region code (5 macro-regions, e.g. N, NE, ...)

NUTS2

Character. NUTS2 region code (30 regions, format: N01-N06, S01-S06, ...)

NUTS3

Character. NUTS3 province code (120 provinces, format: N01001-N01004, ...)

municipality

Character. Municipality code (1,079 municipalities, format: N010010001-N010010008, ...)
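Judging from the rows printed in the Examples section, each geographic code appears to extend its parent code with a fixed number of digits (two for NUTS2, three for NUTS3, four for the municipality). The sketch below checks this nesting on three code combinations copied from that output; the digit widths are an assumption inferred from those rows, not a documented guarantee.

```r
# Code combinations copied from the printed example rows
geo <- data.frame(
  NUTS1 = c("N", "NE", "N"),
  NUTS2 = c("N01", "NE06", "N05"),
  NUTS3 = c("N01005", "NE06004", "N05003"),
  municipality = c("N010050010", "NE060040007", "N050030007"),
  stringsAsFactors = FALSE
)

# Strip the trailing digits of a child code to obtain its assumed parent code
drop_last <- function(code, n) substr(code, 1, nchar(code) - n)

nested <- with(
  geo,
  drop_last(NUTS2, 2) == NUTS1 &
    drop_last(NUTS3, 3) == NUTS2 &
    drop_last(municipality, 4) == NUTS3
)
nested
#> [1] TRUE TRUE TRUE
```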

age

Numeric (whole numbers). Age in years (0-85)

age_class

Factor. Age class with 7 levels: "0-14", "15-17", "18-24", "25-34", "35-49", "50-64", "65+"
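As an illustration, the seven age classes can be derived from `age` with `cut()`. This is a sketch of one way to obtain the documented levels, not necessarily how the variable was built in the package.

```r
# Ages taken from the first rows of the printed example output
age <- c(39, 38, 15, 13, 37, 54, 69)

# Left-closed intervals: [0,15), [15,18), [18,25), [25,35), [35,50), [50,65), [65,Inf)
age_class <- cut(
  age,
  breaks = c(0, 15, 18, 25, 35, 50, 65, Inf),
  labels = c("0-14", "15-17", "18-24", "25-34", "35-49", "50-64", "65+"),
  right = FALSE
)
as.character(age_class)
#> [1] "35-49" "35-49" "15-17" "0-14"  "35-49" "50-64" "65+"
```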

gender

Numeric. Gender code:

1 = Male
2 = Female

education_level

Character. Education level (adults 18+ only, NA for minors):

"Low" = No education, primary, or lower secondary (ISCED 0-2)
"Medium" = Upper secondary or post-secondary non-tertiary (ISCED 3-4)
"High" = Tertiary education (ISCED 5-8)

employment_status

Character. Main activity status:

"Employed" = In employment
"Unemployed" = Unemployed
"Retired" = Retired
"Student" = Student or pupil
"Other" = Other (unable to work, domestic tasks, etc.)

hh_size

Integer. Household size (number of members): 1-7

hh_type

Character. Household type:

"Single" = One-person household
"Couple" = Two adults without children
"Single_parent" = Single parent with children
"Family" = Household with children (2+ adults)
"Other" = Other household types

eq_income

Numeric. Equivalised disposable household income in euros. This is the total household income divided by the OECD modified equivalence scale. All household members share the same equivalised income.

hh_income

Numeric. Total disposable household income in euros before equivalisation. All household members share the same total income.
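The relationship between the two income variables can be checked by hand. The sketch below uses rounded values for the first two households in the printed example output (total incomes 23991 and 36588, equivalence scales 2.3 and 1); the stored values may carry more decimals than shown there.

```r
# Rounded values for the first two households in the example output
hh_income  <- c(23991, 36588)
oecd_scale <- c(2.3, 1)

# Equivalised income = total household income / equivalence scale
eq_income <- hh_income / oecd_scale
round(eq_income)
#> [1] 10431 36588
```

For a one-person household the scale is 1, so equivalised and total income coincide.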

oecd_scale

Numeric. OECD modified equivalence scale for the household:

First adult (14+): weight = 1.0
Other adults (14+): weight = 0.5 each
Children (< 14): weight = 0.3 each

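Formula: oecd_scale = 1.0 + 0.5 × (n_adults - 1) + 0.3 × n_children, where n_adults is the number of household members aged 14 or over and n_children is the number of members under 14.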
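The formula can be written as a small function of the household members' ages. `oecd_modified_scale()` is a hypothetical name used only for this sketch, and it assumes every household has at least one member aged 14 or over.

```r
# Hypothetical helper (not part of inequantiles): OECD modified scale
# for one household, given the ages of its members.
oecd_modified_scale <- function(age) {
  n_adults   <- sum(age >= 14)  # members aged 14 or over
  n_children <- sum(age < 14)   # members under 14
  1.0 + 0.5 * (n_adults - 1) + 0.3 * n_children
}

oecd_modified_scale(c(39, 38, 15, 13))  # two adults, one 15-year-old, one child
#> [1] 2.3
oecd_modified_scale(c(54, 55))          # two adults
#> [1] 1.5
oecd_modified_scale(37)                 # single adult
#> [1] 1
```

The first call reproduces the value 2.3 shown for household HH000001 in the example output, where the 15-year-old member counts with weight 0.5.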

weight

Numeric. Sampling weight (inverse of the inclusion probability), i.e. the number of individuals in the population that each sampled person represents. All household members share the same weight.
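A short sketch of how such weights are typically used: summing them estimates the population size, and passing them to `weighted.mean()` gives a weighted estimate of the mean. The three-row data frame `toy` is made up for illustration, with values rounded from the printed example output.

```r
# Toy person-level data: two members of one household plus a single person
toy <- data.frame(
  hh_id     = c("HH000001", "HH000001", "HH000002"),
  eq_income = c(10431, 10431, 36588),
  weight    = c(83.7, 83.7, 167.2)
)

# Estimated number of individuals represented by the three sampled persons
pop_hat <- sum(toy$weight)
pop_hat
#> [1] 334.6

# Weighted mean of equivalised income (persons as units of analysis)
w_mean <- weighted.mean(toy$eq_income, w = toy$weight)
```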

Details

The synthetic dataset was generated to reproduce key characteristics of IT-SILC data, but contains fictional values; it is therefore suitable for methodology illustration and testing, not for policy analysis. It is primarily intended to demonstrate the tools provided by the inequantiles package: weighted quantiles, quantile-based inequality indicators, influence functions, and variance estimation.

Geographic variables follow a hierarchical NUTS structure with realistic proportions across macro-regions. The codes themselves were generated randomly and do not correspond to real NUTS codes. Individual characteristics (age, gender, education, ...) were assigned randomly based on conditional empirical distributions from IT-SILC. Equivalised income was generated using a regression model fitted to IT-SILC data:

$$ \log(\mathit{eq\_income}) \sim \mathit{education\_head} + \mathit{n\_employed} + \mathit{age\_head} + \mathit{age\_head}^2 + \mathit{hh\_size}, $$

where the suffix _head identifies variables measured for the household head (e.g., education_head is the education level of the household head, age_head is their age). Sampling weights follow a lognormal distribution fitted to IT-SILC.
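The generation step can be sketched as follows. Everything numeric here is an assumption made for illustration: the covariate distributions, the coefficients used to create the toy training data, the residual standard deviation, and the lognormal parameters are arbitrary and are not the values estimated from IT-SILC. The sketch only shows the general pattern of fitting a log-linear model with the stated formula and then simulating incomes and weights from it.

```r
set.seed(1)
n <- 500

# Toy household-level covariates (arbitrary distributions)
hh_size    <- sample(1:7, n, replace = TRUE)
n_employed <- pmin(sample(0:3, n, replace = TRUE), hh_size)
hh <- data.frame(
  education_head = factor(sample(c("Low", "Medium", "High"), n, replace = TRUE),
                          levels = c("Low", "Medium", "High")),
  n_employed = n_employed,
  age_head   = sample(20:85, n, replace = TRUE),
  hh_size    = hh_size
)

# Stand-in for the real survey incomes: arbitrary illustrative coefficients
log_mu <- 9.2 +
  0.15 * (hh$education_head == "Medium") +
  0.40 * (hh$education_head == "High") +
  0.10 * hh$n_employed +
  0.02 * hh$age_head - 0.0002 * hh$age_head^2 -
  0.03 * hh$hh_size
hh$eq_income <- exp(log_mu + rnorm(n, sd = 0.5))

# Fit the log-linear model with the formula given in the text
fit <- lm(log(eq_income) ~ education_head + n_employed + age_head +
            I(age_head^2) + hh_size, data = hh)

# Simulate synthetic incomes: fitted value plus residual noise, back-transformed
sim_income <- exp(predict(fit, newdata = hh) + rnorm(n, sd = sigma(fit)))

# Draw sampling weights from a lognormal distribution (arbitrary parameters)
hh$weight <- rlnorm(n, meanlog = 5, sdlog = 0.5)
```

Adding residual noise before exponentiating keeps the simulated incomes dispersed around the fitted values instead of collapsing each household onto its conditional mean.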

Key Statistics:

Sample size: 20,034 individuals in 10,099 households
Average household size: ~1.98 (matching IT-SILC)
Estimated population: 15,749,925 individuals (the sum of the weights)
Geographic coverage: 5 macro-regions, 30 NUTS2, 120 NUTS3, 1,079 municipalities

References

Eurostat (2024). EU Statistics on Income and Living Conditions (EU-SILC): Methodology. https://ec.europa.eu/eurostat/

Examples

# Load the package and the dataset
library(inequantiles)
data(synthouse)

# Basic structure
str(synthouse)
#> tibble [20,034 × 17] (S3: tbl_df/tbl/data.frame)
#>  $ person_id        : chr [1:20034] "HH000001P1" "HH000001P2" "HH000001P3" "HH000001P4" ...
#>  $ hh_id            : chr [1:20034] "HH000001" "HH000001" "HH000001" "HH000001" ...
#>  $ NUTS1            : chr [1:20034] "N" "N" "N" "N" ...
#>  $ NUTS2            : chr [1:20034] "N01" "N01" "N01" "N01" ...
#>  $ NUTS3            : chr [1:20034] "N01005" "N01005" "N01005" "N01005" ...
#>  $ municipality     : chr [1:20034] "N010050010" "N010050010" "N010050010" "N010050010" ...
#>  $ age              : num [1:20034] 39 38 15 13 37 54 55 61 56 69 ...
#>  $ age_class        : Factor w/ 7 levels "0-14","15-17",..: 5 5 2 1 5 6 6 6 6 7 ...
#>  $ gender           : num [1:20034] 1 2 1 2 2 2 1 2 1 1 ...
#>  $ education_level  : chr [1:20034] "Low" "Medium" NA NA ...
#>  $ employment_status: chr [1:20034] "Employed" "Employed" "Student" "Student" ...
#>  $ hh_size          : int [1:20034] 4 4 4 4 1 2 2 2 2 2 ...
#>  $ hh_type          : chr [1:20034] "Family" "Family" "Family" "Family" ...
#>  $ eq_income        : num [1:20034] 10431 10431 10431 10431 36588 ...
#>  $ hh_income        : num [1:20034] 23991 23991 23991 23991 36588 ...
#>  $ oecd_scale       : num [1:20034] 2.3 2.3 2.3 2.3 1 1.5 1.5 1.5 1.5 1.5 ...
#>  $ weight           : num [1:20034] 83.7 83.7 83.7 83.7 167.2 ...
head(synthouse)
#> # A tibble: 6 × 17
#>   person_id  hh_id    NUTS1 NUTS2 NUTS3   municipality   age age_class gender
#>   <chr>      <chr>    <chr> <chr> <chr>   <chr>        <dbl> <fct>      <dbl>
#> 1 HH000001P1 HH000001 N     N01   N01005  N010050010      39 35-49          1
#> 2 HH000001P2 HH000001 N     N01   N01005  N010050010      38 35-49          2
#> 3 HH000001P3 HH000001 N     N01   N01005  N010050010      15 15-17          1
#> 4 HH000001P4 HH000001 N     N01   N01005  N010050010      13 0-14           2
#> 5 HH000002P1 HH000002 NE    NE06  NE06004 NE060040007     37 35-49          2
#> 6 HH000003P1 HH000003 N     N05   N05003  N050030007      54 50-64          2
#> # ℹ 8 more variables: education_level <chr>, employment_status <chr>,
#> #   hh_size <int>, hh_type <chr>, eq_income <dbl>, hh_income <dbl>,
#> #   oecd_scale <dbl>, weight <dbl>

# Summary statistics
summary(synthouse$eq_income)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>    1523   12912   20429   25791   32529  298003 
summary(synthouse$age)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>    0.00   36.00   45.00   45.98   63.00   85.00 

# Number of households and individuals
length(unique(synthouse$hh_id))  # Households
#> [1] 10099
nrow(synthouse)                  # Individuals
#> [1] 20034

# Average household size
mean(table(synthouse$hh_id))
#> [1] 1.983761

# Distribution of household types
table(unique(synthouse[, c("hh_id", "hh_type")])$hh_type)
#> 
#>        Couple        Family         Other        Single Single_parent 
#>          2910          1537          1161          4329           162 

# Age distribution
table(synthouse$age_class)
#> 
#>  0-14 15-17 18-24 25-34 35-49 50-64   65+ 
#>  2011   557    47  1779  8291  2504  4845 

# Weighted quantiles
csquantile(synthouse$eq_income,
           weights = synthouse$weight,
           probs = c(0.25, 0.5, 0.75),
           type = 6)
#>      25%      50%      75% 
#> 12353.29 20014.17 32222.93 

# Quantile Ratio Index
qri(synthouse$eq_income,
    weights = synthouse$weight,
    type = 6)
#> [1] 0.5690895
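To make the idea of a weighted quantile concrete, the sketch below inverts the weighted empirical distribution function in base R. It is a deliberately simple estimator and is not the type 6 definition used by csquantile() above, so its results will generally differ slightly from the package output; `weighted_quantile()` is a hypothetical name used only here.

```r
# Hypothetical helper (not part of inequantiles): smallest observation whose
# cumulative weight share reaches each requested probability.
weighted_quantile <- function(x, w, probs) {
  ord <- order(x)
  x <- x[ord]
  w <- w[ord]
  cum_share <- cumsum(w) / sum(w)  # weighted empirical CDF at the sorted values
  vapply(probs, function(p) x[which(cum_share >= p)[1]], numeric(1))
}

x <- c(10, 20, 30, 40)

# Equal weights: the median is the second observation
weighted_quantile(x, w = c(1, 1, 1, 1), probs = 0.5)
#> [1] 20

# A heavy weight on the largest value pulls the median up
weighted_quantile(x, w = c(1, 1, 1, 5), probs = 0.5)
#> [1] 40
```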