A realistic synthetic dataset based on the empirical structure of real IT-SILC
(Italian Survey on Income and Living Conditions) 2024 data. The synthouse dataset contains
20,034 individuals nested in 10,099 households, with detailed
demographic, socio-economic, and geographic information.
Format
A data frame with 20,034 rows (individuals in 10,099 households) and 17 variables:
- person_id
Character. Unique person identifier, formed by the household ID followed by a person suffix (format: HH000001P1, HH000001P2, ...)
- hh_id
Character. Household identifier. All individuals in the same household share this ID (format: HH000001, HH000002, ...)
- NUTS1
Character. NUTS1 region code (5 macro-regions):
N
S
NE
NO
C
- NUTS2
Character. NUTS2 region code (30 regions, format: N01-N06, S01-S06, ...)
- NUTS3
Character. NUTS3 province code (120 provinces, format: N01001-N01004, ...)
- municipality
Character. Municipality code (1,079 municipalities, format: N010010001-N010010008, ...)
- age
Numeric. Age in whole years (0-85)
- age_class
Factor. Age class with 7 levels: "0-14", "15-17", "18-24", "25-34", "35-49", "50-64", "65+"
- gender
Numeric. Gender code:
1 = Male
2 = Female
- education_level
Character. Education level (adults 18+ only, NA for minors):
"Low" = No education, primary, or lower secondary (ISCED 0-2)
"Medium" = Upper secondary or post-secondary non-tertiary (ISCED 3-5)
"High" = Tertiary education (ISCED 6-8)
- employment_status
Character. Main activity status:
"Employed" = In employment
"Unemployed" = Unemployed
"Retired" = Retired
"Student" = Student or pupil
"Other" = Other (unable to work, domestic tasks, etc.)
- hh_size
Integer. Household size (number of members): 1-7
- hh_type
Character. Household type:
"Single" = One-person household
"Couple" = Two adults without children
"Single_parent" = Single parent with children
"Family" = Household with children (2+ adults)
"Other" = Other household types
- eq_income
Numeric. Equivalised disposable household income in euros. This is the total household income divided by the OECD modified equivalence scale. All household members share the same equivalised income.
- hh_income
Numeric. Total disposable household income in euros before equivalisation. All household members share the same total income.
- oecd_scale
Numeric. OECD modified equivalence scale for the household:
First adult (14+): weight = 1.0
Other adults (14+): weight = 0.5 each
Children (< 14): weight = 0.3 each
Formula: oecd_scale = 1.0 + 0.5 × (n_adults - 1) + 0.3 × n_children, where n_adults counts members aged 14+ and n_children counts members under 14
- weight
Numeric. Sampling weight (inverse inclusion probability), i.e. the number of individuals in the population that each sampled individual represents. All household members share the same weight.
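The relationship between oecd_scale, hh_income and eq_income described above can be checked by hand. The following is a minimal base R sketch, not the code used to build the dataset; the helper name oecd_modified_scale() is hypothetical. It uses the ages and household income of household HH000001 as printed in the Examples output (displayed values are rounded).

```r
# Hypothetical helper (not part of any package): OECD modified equivalence
# scale computed from the ages of the household members.
oecd_modified_scale <- function(ages) {
  n_adults   <- sum(ages >= 14)  # members aged 14 or over
  n_children <- sum(ages < 14)   # members under 14
  # Assumes at least one member aged 14+, who takes the weight of 1.0
  1.0 + 0.5 * (n_adults - 1) + 0.3 * n_children
}

ages      <- c(39, 38, 15, 13)   # ages in household HH000001
hh_income <- 23991               # total disposable income, as displayed

hh_scale  <- oecd_modified_scale(ages)
eq_income <- hh_income / hh_scale

hh_scale          # 2.3
round(eq_income)  # 10431
```

Both results agree with the oecd_scale and eq_income values shown for this household in the Examples.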
Details
The synthetic dataset was generated to reproduce key characteristics of
IT-SILC data. It is primarily intended to demonstrate the computation of
quantile-based inequality indicators provided by the inequantiles package,
such as quantiles, influence functions, and the quantile ratio index (QRI).
The data structure mirrors that of IT-SILC, but the values are fictional. The dataset is
therefore suitable for methodology illustration and testing, not for policy analysis.
Geographic variables follow a hierarchical NUTS structure with realistic proportions across macro-regions. The codes were generated randomly and do not correspond to real NUTS codes.
Individual characteristics (age, gender, education, etc.) were assigned randomly from conditional empirical distributions estimated on IT-SILC.
Income was generated using a regression model fitted to IT-SILC data, written here in R formula notation:
log(eq_income) ~ education_head + n_employed + age_head + I(age_head^2) + hh_size
Sampling weights follow a lognormal distribution fitted to IT-SILC.
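To make the income step more concrete, the sketch below fits a model of the same form with lm() and draws synthetic incomes from it. Everything in it is an assumption made for illustration: the toy data frame, the coefficients used to build it, and the choice of adding normal residual noise on the log scale are all invented, and the actual generation procedure may differ.

```r
set.seed(1)
n <- 500

# Toy household-level data standing in for IT-SILC (all values invented)
toy <- data.frame(
  education_head = factor(sample(c("Low", "Medium", "High"), n, replace = TRUE),
                          levels = c("Low", "Medium", "High")),
  n_employed     = sample(0:3, n, replace = TRUE),
  age_head       = sample(20:85, n, replace = TRUE),
  hh_size        = sample(1:7, n, replace = TRUE)
)

# Invented incomes so that the model has something to fit
toy$eq_income <- exp(
  9 + 0.2 * as.integer(toy$education_head) + 0.15 * toy$n_employed +
    0.02 * toy$age_head - 0.0002 * toy$age_head^2 - 0.05 * toy$hh_size +
    rnorm(n, sd = 0.5)
)

# Same model formula as in the text
fit <- lm(log(eq_income) ~ education_head + n_employed + age_head +
            I(age_head^2) + hh_size, data = toy)

# Draw synthetic incomes: fitted value plus residual noise, back-transformed
sim_income <- exp(predict(fit) + rnorm(n, sd = sigma(fit)))
summary(sim_income)
```

Because the model is fitted on the log scale, the simulated incomes are strictly positive and right-skewed, which is the qualitative shape seen in summary(synthouse$eq_income).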
Key Statistics:
Sample size: 20,034 individuals in 10,099 households
Average household size: ~1.98 (matching IT-SILC)
Estimated population: 15,749,925 individuals (the sum of the weights)
Geographic coverage: 5 macro-regions, 30 NUTS2, 120 NUTS3, 1,079 municipalities
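These figures can be recomputed from the data. Because household-level variables are repeated on every member's row, household counts require de-duplicating hh_id, while the population estimate sums weight over all individual rows. The sketch below illustrates this on a five-row toy data frame built from the first two households shown in the Examples output (weights as displayed, i.e. rounded); it is an illustration, not a check against the full dataset.

```r
# Toy data: household HH000001 (4 members) and HH000002 (1 member)
toy <- data.frame(
  hh_id  = c(rep("HH000001", 4), "HH000002"),
  weight = c(rep(83.7, 4), 167.2)
)

n_individuals  <- nrow(toy)                   # one row per person
n_households   <- length(unique(toy$hh_id))   # de-duplicate household IDs
avg_hh_size    <- n_individuals / n_households
est_population <- sum(toy$weight)             # every member carries the household weight

c(individuals = n_individuals, households = n_households,
  avg_hh_size = avg_hh_size, est_population = est_population)
```

Applied to synthouse itself, the same four expressions should give the sample size, number of households, average household size, and estimated population listed above.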
References
Eurostat (2024). "EU Statistics on Income and Living Conditions (EU-SILC) methodology". https://ec.europa.eu/eurostat/
Examples
# Load the package and the dataset
library(inequantiles)
data(synthouse)
# Basic structure
str(synthouse)
#> tibble [20,034 × 17] (S3: tbl_df/tbl/data.frame)
#> $ person_id : chr [1:20034] "HH000001P1" "HH000001P2" "HH000001P3" "HH000001P4" ...
#> $ hh_id : chr [1:20034] "HH000001" "HH000001" "HH000001" "HH000001" ...
#> $ NUTS1 : chr [1:20034] "N" "N" "N" "N" ...
#> $ NUTS2 : chr [1:20034] "N01" "N01" "N01" "N01" ...
#> $ NUTS3 : chr [1:20034] "N01005" "N01005" "N01005" "N01005" ...
#> $ municipality : chr [1:20034] "N010050010" "N010050010" "N010050010" "N010050010" ...
#> $ age : num [1:20034] 39 38 15 13 37 54 55 61 56 69 ...
#> $ age_class : Factor w/ 7 levels "0-14","15-17",..: 5 5 2 1 5 6 6 6 6 7 ...
#> $ gender : num [1:20034] 1 2 1 2 2 2 1 2 1 1 ...
#> $ education_level : chr [1:20034] "Low" "Medium" NA NA ...
#> $ employment_status: chr [1:20034] "Employed" "Employed" "Student" "Student" ...
#> $ hh_size : int [1:20034] 4 4 4 4 1 2 2 2 2 2 ...
#> $ hh_type : chr [1:20034] "Family" "Family" "Family" "Family" ...
#> $ eq_income : num [1:20034] 10431 10431 10431 10431 36588 ...
#> $ hh_income : num [1:20034] 23991 23991 23991 23991 36588 ...
#> $ oecd_scale : num [1:20034] 2.3 2.3 2.3 2.3 1 1.5 1.5 1.5 1.5 1.5 ...
#> $ weight : num [1:20034] 83.7 83.7 83.7 83.7 167.2 ...
head(synthouse)
#> # A tibble: 6 × 17
#> person_id hh_id NUTS1 NUTS2 NUTS3 municipality age age_class gender
#> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <fct> <dbl>
#> 1 HH000001P1 HH000001 N N01 N01005 N010050010 39 35-49 1
#> 2 HH000001P2 HH000001 N N01 N01005 N010050010 38 35-49 2
#> 3 HH000001P3 HH000001 N N01 N01005 N010050010 15 15-17 1
#> 4 HH000001P4 HH000001 N N01 N01005 N010050010 13 0-14 2
#> 5 HH000002P1 HH000002 NE NE06 NE06004 NE060040007 37 35-49 2
#> 6 HH000003P1 HH000003 N N05 N05003 N050030007 54 50-64 2
#> # ℹ 8 more variables: education_level <chr>, employment_status <chr>,
#> # hh_size <int>, hh_type <chr>, eq_income <dbl>, hh_income <dbl>,
#> # oecd_scale <dbl>, weight <dbl>
# Summary statistics
summary(synthouse$eq_income)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 1523 12912 20429 25791 32529 298003
summary(synthouse$age)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 0.00 36.00 45.00 45.98 63.00 85.00
# Number of households and individuals
length(unique(synthouse$hh_id)) # Households
#> [1] 10099
nrow(synthouse) # Individuals
#> [1] 20034
# Average household size
mean(table(synthouse$hh_id))
#> [1] 1.983761
# Distribution of household types
table(unique(synthouse[, c("hh_id", "hh_type")])$hh_type)
#>
#> Couple Family Other Single Single_parent
#> 2910 1537 1161 4329 162
# Age distribution
table(synthouse$age_class)
#>
#> 0-14 15-17 18-24 25-34 35-49 50-64 65+
#> 2011 557 47 1779 8291 2504 4845
# Weighted quantiles
csquantile(synthouse$eq_income,
           weights = synthouse$weight,
           probs = c(0.25, 0.5, 0.75),
           type = 6)
#> 25% 50% 75%
#> 12353.29 20014.17 32222.93
# Quantile Ratio Index
qri(synthouse$eq_income,
    weights = synthouse$weight,
    type = 6)
#> [1] 0.5690895