A realistic synthetic dataset based on the empirical structure of real IT-SILC (Italian Survey on Income and Living Conditions) 2024 data.
Format
A data frame with 20,034 rows (individuals nested in 10,099 households) and 17 variables covering demographic, socio-economic, and geographic information:
- person_id
Character. Unique person identifier, composed of the household ID followed by a person index within the household (format: HH000001P1, HH000001P2, HH000002P1, ...)
- hh_id
Character. Household identifier. All individuals in the same household share this ID (format: HH000001, HH000002, ...)
- NUTS1
Character. NUTS1 region code (5 macro-regions):
N
S
NE
NO
C
- NUTS2
Character. NUTS2 region code (30 regions, format: N01-N06, S01-S06, ...)
- NUTS3
Character. NUTS3 province code (120 provinces; NUTS2 code followed by a three-digit province index, e.g. N01001, N01005, ...)
- municipality
Character. Municipality code (1,079 municipalities; NUTS3 code followed by a four-digit municipality index, e.g. N010010001, N010050010, ...)
- age
Numeric (integer-valued). Age in years (0-85)
- age_class
Factor. Age class with 7 levels: "0-14", "15-17", "18-24", "25-34", "35-49", "50-64", "65+"
- gender
Numeric (integer-valued). Gender code:
1 = Male
2 = Female
- education_level
Character. Education level (adults 18+ only, NA for minors):
"Low" = No education, primary, or lower secondary (ISCED 0-2)
"Medium" = Upper secondary or post-secondary non-tertiary (ISCED 3-4)
"High" = Tertiary education (ISCED 5-8)
- employment_status
Character. Main activity status:
"Employed" = In employment
"Unemployed" = Unemployed
"Retired" = Retired
"Student" = Student or pupil
"Other" = Other (unable to work, domestic tasks, etc.)
- hh_size
Integer. Household size (number of members): 1-7
- hh_type
Character. Household type:
"Single" = One-person household
"Couple" = Two adults without children
"Single_parent" = Single parent with children
"Family" = Household with children (2+ adults)
"Other" = Other household types
- eq_income
Numeric. Equivalised disposable household income in euros. This is the total household income divided by the OECD modified equivalence scale. All household members share the same equivalised income.
- hh_income
Numeric. Total disposable household income in euros before equivalisation. All household members share the same total income.
- oecd_scale
Numeric. OECD modified equivalence scale for the household:
First adult (14+): weight = 1.0
Other adults (14+): weight = 0.5 each
Children (< 14): weight = 0.3 each
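Formula: oecd_scale = 1.0 + 0.5 × (n_adults - 1) + 0.3 × n_children, where n_adults is the number of household members aged 14 or over and n_children the number aged under 14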
- weight
Numeric. Sampling weight (inverse inclusion probability), i.e. the number of individuals in the population that this sample unit represents. All household members share the same weight.
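The relationship between hh_income, oecd_scale, and eq_income described above can be checked by hand. The following is a minimal sketch on a toy data frame whose ages and incomes are copied from the first two households shown in the Examples output; the names `toy` and `oecd_modified_scale()` are hypothetical helpers introduced here for illustration and are not part of the package.

```r
# Toy data: two households, values taken from the printed example output.
toy <- data.frame(
  hh_id     = c(rep("HH000001", 4), "HH000002"),
  age       = c(39, 38, 15, 13, 37),
  hh_income = c(rep(23991, 4), 36588)
)

# Hypothetical helper: OECD modified scale for ONE household's vector of ages.
# Assumes the household has at least one member aged 14 or over.
oecd_modified_scale <- function(age) {
  n_14plus  <- sum(age >= 14)  # members weighted as adults
  n_under14 <- sum(age < 14)   # members weighted as children
  1.0 + 0.5 * (n_14plus - 1) + 0.3 * n_under14
}

# ave() applies the helper within each household and recycles the
# household-level result to every member of that household.
toy$oecd_scale <- ave(toy$age, toy$hh_id, FUN = oecd_modified_scale)

# Equivalised income: total household income divided by the scale.
toy$eq_income <- toy$hh_income / toy$oecd_scale

toy
```

For the first household (three members aged 14+ and one child) the scale is 1.0 + 0.5 × 2 + 0.3 × 1 = 2.3, so the equivalised income is 23991 / 2.3, which rounds to the 10431 shown in the example output.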
Details
The synthetic dataset was generated to reproduce key characteristics of
IT-SILC data, but contains fictional values; it is therefore suitable for
methodology illustration and testing, not for policy analysis.
Its primary purpose is to demonstrate the tools provided by the
inequantiles package: quantile estimation, quantile-based inequality
indicators, influence functions, and variance estimation.
Geographic variables follow a hierarchical NUTS structure with realistic proportions across macro-regions; the codes themselves were generated randomly and do not correspond to real NUTS codes. Individual characteristics (age, gender, education, ...) were drawn at random from conditional empirical distributions estimated on IT-SILC. Income was generated using a regression model fitted to IT-SILC data:
$$ \log(\mathit{eq\_income}) \sim \mathit{education\_head} + \mathit{n\_employed} + \mathit{age\_head} + \mathit{age\_head}^2 + \mathit{hh\_size}, $$
where the suffix _head identifies variables measured for the
household head (e.g., education_head is the education level of
the household head, age_head is their age).
Sampling weights follow a lognormal distribution fitted to IT-SILC.
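To make the generation scheme more concrete, here is a rough sketch of how household incomes and weights could be simulated from a log-linear model of the form given above. It is not the code used to build the dataset: the covariate distributions, the coefficient vector `beta`, the residual standard deviation `sigma`, and the lognormal parameters are all hypothetical values chosen only to give plausible magnitudes.

```r
set.seed(1)   # reproducibility of this illustration only
n_hh <- 500   # number of simulated households (arbitrary)

# Hypothetical household-level covariates (NOT drawn from IT-SILC).
hh <- data.frame(
  education_head = factor(
    sample(c("Low", "Medium", "High"), n_hh, replace = TRUE),
    levels = c("Low", "Medium", "High")
  ),
  age_head = sample(25:85, n_hh, replace = TRUE),
  hh_size  = sample(1:7, n_hh, replace = TRUE)
)
# The number of employed members cannot exceed the household size.
hh$n_employed <- pmin(sample(0:3, n_hh, replace = TRUE), hh$hh_size)

# Design matrix with the same terms as the model formula above.
X <- model.matrix(
  ~ education_head + n_employed + age_head + I(age_head^2) + hh_size,
  data = hh
)

# Hypothetical coefficients on the log scale (assumed, not estimated).
beta <- c(
  "(Intercept)"          = 8.60,
  "education_headMedium" = 0.25,
  "education_headHigh"   = 0.55,
  "n_employed"           = 0.20,
  "age_head"             = 0.03,
  "I(age_head^2)"        = -0.0003,
  "hh_size"              = -0.05
)
sigma <- 0.5  # hypothetical residual standard deviation
stopifnot(identical(colnames(X), names(beta)))

# Linear predictor plus Gaussian noise on the log scale, then exponentiate.
hh$eq_income <- exp(drop(X %*% beta) + rnorm(n_hh, sd = sigma))

# Lognormal sampling weights with hypothetical parameters.
hh$weight <- rlnorm(n_hh, meanlog = 5, sdlog = 0.5)

# Refitting the model should approximately recover `beta`.
fit <- lm(
  log(eq_income) ~ education_head + n_employed + age_head +
    I(age_head^2) + hh_size,
  data = hh
)
round(coef(fit), 3)
```

Working on the log scale guarantees strictly positive incomes and produces the right-skewed distribution typical of income data.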
Key Statistics:
Sample size: 20,034 individuals in 10,099 households
Average household size: ~1.98 (matching IT-SILC)
Estimated population: 15,749,925 individuals (the sum of the weights)
Geographic coverage: 5 macro-regions, 30 NUTS2, 120 NUTS3, 1,079 municipalities
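The figures above follow directly from the person-level layout of the data. The sketch below reproduces the same kind of calculation on a toy data frame (weights copied from the first two households in the Examples output); the object names are hypothetical, and treating the sum of one weight per household as an estimate of the number of households is an assumption that relies on all members sharing the household weight.

```r
# Toy person-level data: one row per individual.
toy <- data.frame(
  hh_id  = c(rep("HH000001", 4), "HH000002"),
  weight = c(83.7, 83.7, 83.7, 83.7, 167.2)
)

n_individuals <- nrow(toy)                   # sample size (persons)
n_households  <- length(unique(toy$hh_id))   # sample size (households)
avg_hh_size   <- n_individuals / n_households

# Estimated population: sum of the person-level weights.
est_population <- sum(toy$weight)

# Check that the weight is constant within each household.
constant_within_hh <- tapply(
  toy$weight, toy$hh_id,
  function(w) length(unique(w)) == 1
)

# One weight per household; their sum is an (assumed) estimate of the
# number of households in the population.
hh_weight      <- tapply(toy$weight, toy$hh_id, function(w) w[1])
est_households <- sum(hh_weight)

c(individuals = n_individuals, households = n_households,
  avg_hh_size = avg_hh_size, est_population = est_population,
  est_households = est_households)
```

Applied to the full dataset, the same steps use synthouse$hh_id and synthouse$weight in place of the toy columns.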
References
Eurostat (2024). EU Statistics on Income and Living Conditions (EU-SILC): Methodology. https://ec.europa.eu/eurostat/
Examples
# Load the package and the dataset
library(inequantiles)
data(synthouse)
# Basic structure
str(synthouse)
#> tibble [20,034 × 17] (S3: tbl_df/tbl/data.frame)
#> $ person_id : chr [1:20034] "HH000001P1" "HH000001P2" "HH000001P3" "HH000001P4" ...
#> $ hh_id : chr [1:20034] "HH000001" "HH000001" "HH000001" "HH000001" ...
#> $ NUTS1 : chr [1:20034] "N" "N" "N" "N" ...
#> $ NUTS2 : chr [1:20034] "N01" "N01" "N01" "N01" ...
#> $ NUTS3 : chr [1:20034] "N01005" "N01005" "N01005" "N01005" ...
#> $ municipality : chr [1:20034] "N010050010" "N010050010" "N010050010" "N010050010" ...
#> $ age : num [1:20034] 39 38 15 13 37 54 55 61 56 69 ...
#> $ age_class : Factor w/ 7 levels "0-14","15-17",..: 5 5 2 1 5 6 6 6 6 7 ...
#> $ gender : num [1:20034] 1 2 1 2 2 2 1 2 1 1 ...
#> $ education_level : chr [1:20034] "Low" "Medium" NA NA ...
#> $ employment_status: chr [1:20034] "Employed" "Employed" "Student" "Student" ...
#> $ hh_size : int [1:20034] 4 4 4 4 1 2 2 2 2 2 ...
#> $ hh_type : chr [1:20034] "Family" "Family" "Family" "Family" ...
#> $ eq_income : num [1:20034] 10431 10431 10431 10431 36588 ...
#> $ hh_income : num [1:20034] 23991 23991 23991 23991 36588 ...
#> $ oecd_scale : num [1:20034] 2.3 2.3 2.3 2.3 1 1.5 1.5 1.5 1.5 1.5 ...
#> $ weight : num [1:20034] 83.7 83.7 83.7 83.7 167.2 ...
head(synthouse)
#> # A tibble: 6 × 17
#> person_id hh_id NUTS1 NUTS2 NUTS3 municipality age age_class gender
#> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <fct> <dbl>
#> 1 HH000001P1 HH000001 N N01 N01005 N010050010 39 35-49 1
#> 2 HH000001P2 HH000001 N N01 N01005 N010050010 38 35-49 2
#> 3 HH000001P3 HH000001 N N01 N01005 N010050010 15 15-17 1
#> 4 HH000001P4 HH000001 N N01 N01005 N010050010 13 0-14 2
#> 5 HH000002P1 HH000002 NE NE06 NE06004 NE060040007 37 35-49 2
#> 6 HH000003P1 HH000003 N N05 N05003 N050030007 54 50-64 2
#> # ℹ 8 more variables: education_level <chr>, employment_status <chr>,
#> # hh_size <int>, hh_type <chr>, eq_income <dbl>, hh_income <dbl>,
#> # oecd_scale <dbl>, weight <dbl>
# Summary statistics
summary(synthouse$eq_income)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 1523 12912 20429 25791 32529 298003
summary(synthouse$age)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 0.00 36.00 45.00 45.98 63.00 85.00
# Number of households and individuals
length(unique(synthouse$hh_id)) # Households
#> [1] 10099
nrow(synthouse) # Individuals
#> [1] 20034
# Average household size
mean(table(synthouse$hh_id))
#> [1] 1.983761
# Distribution of household types
table(unique(synthouse[, c("hh_id", "hh_type")])$hh_type)
#>
#> Couple Family Other Single Single_parent
#> 2910 1537 1161 4329 162
# Age distribution
table(synthouse$age_class)
#>
#> 0-14 15-17 18-24 25-34 35-49 50-64 65+
#> 2011 557 47 1779 8291 2504 4845
# Weighted quantiles
csquantile(synthouse$eq_income,
           weights = synthouse$weight,
           probs = c(0.25, 0.5, 0.75),
           type = 6)
#> 25% 50% 75%
#> 12353.29 20014.17 32222.93
# Quantile Ratio Index
qri(synthouse$eq_income,
    weights = synthouse$weight,
    type = 6)
#> [1] 0.5690895