Joint UNECE/Eurostat Meeting on Population and Housing Censuses (13-15 May 2008). Sample results expected accuracy in the Italian Population and Housing Census. Giancarlo Carbonetti, Marco Fortini. Istat – Italian National Statistical Institute, General Censuses Directorate. May 13th 2008.
Joint UNECE/Eurostat Meeting on Population and Housing Censuses (13-15 May 2008)
Sample results expected accuracy in the Italian Population and Housing Census
Giancarlo Carbonetti, Marco Fortini
Istat – Italian National Statistical Institute, General Censuses Directorate
May 13th 2008
Joint UNECE Eurostat Meeting
Outline
• Introduction
• Some aspects related to the use of samples of households for long form enumeration
• Sampling strategies
• Simulation study
• Some results
• Conclusions
Introduction - 1
Main critical issues of the last Census
• Huge organizational (and economic) effort required of Municipal Census Offices
  • sudden and time-concentrated increase in workload
  • for the largest municipalities, a massive network of enumerators and coordinators to be trained and managed
  • lack of adequately skilled resources, high turnover rates
Main objectives for the next Census
• to improve the efficiency of census operations
• to reduce the municipalities' workload
• to keep a high level of quality
Introduction - 2
• Innovations proposed to reach the objectives
  • the use of population registers
  • mail-out of census forms
  • mixed-mode data collection, mainly based on mail and web
• Expected consequences of the innovations
  • an increase in “back office” work
  • a reduction in the number of enumerators (“front office” work)
• How this is possible
  • by increasing the response rates
A proposal: use a “short form” version of the questionnaire to reach high response rates.
Introduction - 3
Consequences of the use of the short form
• increased response rates
• response time delay reduced as much as possible
This approach risks information loss!
How to preserve the richness of the census information
• by selecting a sample of households to which a “long form” version of the questionnaire is supplied
Strategy: the simultaneous use of short and long forms.
Some aspects related to the use of samples of households for long form enumeration - 1
Which type of information can be surveyed by means of a sample of long forms, and which must be collected on the whole population?
• The overall set of census variables is partitioned into two subsets
  • the demographic variables (gender, date of birth, marital status, nationality, …)
  • the remaining variables (educational level, occupational status, commuting)
• The short form covers only the first subset, whereas the long form covers the whole set
Some aspects related to the use of samples of households for long form enumeration - 2
What is the municipality population threshold below which the sampling strategy cannot be adopted?
• An option under consideration is to sample in municipalities with more than 5,000 inhabitants
  • long forms will be submitted to a sample of households
  • short forms will be administered to the remaining households
• In municipalities with fewer than 5,000 inhabitants, long forms will be submitted to the whole population
Some aspects related to the use of samples of households for long form enumeration - 3
Which domains have to be considered to plan the sample and to produce accurate estimates?
• New “census domains” have been defined
  • an appropriate methodology was adopted to build census domains by aggregating the smallest census areas
  • the new “areas” are defined at sub-municipal level
• Accuracy of sampling estimates for different territorial levels
  • similar precision is expected for estimates across areas
  • higher precision is expected for larger territorial levels (from sub-municipal to nationwide)
Some aspects related to the use of samples of households for long form enumeration - 4
• Which statistical methodology gives the most accurate estimates?
• … in terms of …
  • sampling design
  • use of appropriate lists
  • efficient estimation methods
  • sampling error assessment
Answering this question is the aim of the study, some results of which are presented here.
Sampling strategies
Two different sampling designs have been tested
• Simple Random Sampling of HOUseholds (SRSHOU) from administrative registers managed by municipalities
• Area frame sampling based on a Simple Random Sampling of ENumeration Areas (SRSENA), which implies complete data collection for the households dwelling in the selected enumeration areas (from the Digital Geocoded Database)
Different studies have been conducted
• to compare the two approaches (with a sampling ratio of about one third of the whole population considered)
• to evaluate, for SRSHOU, the improvement in the precision of the estimates as the sampling ratio increases (10%, 15%, 20%, 33%)
• to introduce some stratifications of the units involved
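The difference between the two designs can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the synthetic frame of (household id, enumeration area id) pairs and the function names `make_frame`, `srshou` and `srsena` are hypothetical and do not come from the study.

```python
import random

def make_frame(n_areas=200, seed=0):
    """Hypothetical frame: a list of (household_id, enumeration_area_id) pairs."""
    rng = random.Random(seed)
    frame = []
    for ea in range(n_areas):
        # Each enumeration area holds between 20 and 60 households (assumed sizes).
        for _ in range(rng.randint(20, 60)):
            frame.append((len(frame), ea))
    return frame

def srshou(frame, ratio, rng):
    """Simple random sample of households, drawn without replacement."""
    return rng.sample(frame, round(ratio * len(frame)))

def srsena(frame, ratio, rng):
    """Simple random sample of enumeration areas; every household
    living in a selected area enters the sample (complete enumeration)."""
    areas = sorted({ea for _, ea in frame})
    chosen = set(rng.sample(areas, round(ratio * len(areas))))
    return [hh for hh in frame if hh[1] in chosen]

frame = make_frame()
rng = random.Random(1)
hou = srshou(frame, 1 / 3, rng)   # households scattered over all areas
ena = srsena(frame, 1 / 3, rng)   # households clustered in about one third of the areas
```

In the household design the sample size is fixed in advance, while in the area design it depends on how many households live in the areas that happen to be selected.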
Simulation study - 1
The sampling strategies were compared through Monte Carlo sampling replications (carried out on 2001 Italian Census data) in order to assess the sampling error, measured by the coefficient of variation (CV), which is a measure of the accuracy of the sampling estimates.
Main features of the sampling designs
• Domains: the “new areas” defined at sub-municipal level
• Target variables: “variables” obtained by cross-classifying educational level, employment status and commuting with demographic variables
• Sampling units: “households” or “enumeration areas”
• Estimator: “calibrated estimators”, whose final weights are adjusted so as to make the sample more representative
Simulation study - 2
• Because of the strong differences among Italian municipalities, 40 of them, with different population sizes and from different regions of Italy, were considered
Simulation study - 3. Number of units involved in the simulation study.
Figure: scatter plot of CV and p (estimates) for each census area. SRSHOU design (sampling ratio = 33%). City of Perugia.
Figure: distribution of median CV by class of p for the SRSHOU and SRSENA designs (both with sampling ratio = 33%). Comparison of 4 municipalities. The difference between the two designs is due to the cluster effect.
Figure: loss of efficiency (in terms of CV, by class of p) of estimation with the SRSENA design with respect to the SRSHOU design (both with sampling ratio = 33%). Comparison of 4 municipalities. Plotted quantity: CV(SRSHOU, s.r. 33%) - CV(SRSENA, s.r. 33%).
Figure: distribution of median CV by class of p. Comparison of 4 different sampling ratios with the SRSHOU design.
Figure: gain of efficiency (in terms of CV, by class of p) of estimation with the SRSHOU design when the sampling ratio increases from 10% to 33%. Plotted quantity: [CV(SRSHOU, s.r. 10%) - CV(SRSHOU, s.r. N%)] x 100 / CV(SRSHOU, s.r. 10%). Gains reported on the chart, in increasing order: 21-23 percent, 33-38 percent, 53-58 percent.
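As a rough plausibility check on these gains, one can compare them with what plain simple random sampling without replacement would give. The sketch below assumes that the CV is proportional to sqrt((1 - f) / f) for sampling ratio f (the finite population correction divided by the sample size) and ignores the effect of calibration, so it is not the study's own computation; `relative_cv` and `gain_percent` are hypothetical helper names.

```python
import math

def relative_cv(f):
    """CV up to a constant factor under SRS without replacement
    with sampling ratio f: sqrt((1 - f) / f)."""
    return math.sqrt((1.0 - f) / f)

def gain_percent(f_base, f_new):
    """Percentage reduction in CV when moving from ratio f_base to f_new,
    following the formula on the slide: [CV(base) - CV(new)] x 100 / CV(base)."""
    base = relative_cv(f_base)
    return 100.0 * (base - relative_cv(f_new)) / base

for f in (0.15, 0.20, 1 / 3):
    print(f"s.r. {f:.0%}: gain of {gain_percent(0.10, f):.1f}% over s.r. 10%")
```

Under this assumption the gains come out at about 21%, 33% and 53%, close to the lower end of the three ranges shown on the chart, which presumably refer to sampling ratios of 15%, 20% and 33% respectively.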
Figure: distribution of median CV for five classes of p and three classes of area (according to population size). Comparison of 4 different sampling ratios with the SRSHOU design.
Figure: median CV for some classes of p, for three classes of area (according to population size). Comparison of 4 different sampling ratios (s.r.) with the SRSHOU design. Graph for areas with fewer than 10,000 inhabitants.
Figure: median CV for some classes of p, for three classes of area (according to population size). Comparison of 4 different sampling ratios (s.r.) with the SRSHOU design. Graph for areas with between 10,000 and 12,000 inhabitants. The gain of efficiency (in terms of CV) for census areas with between 10,000 and 12,000 inhabitants, with respect to census areas with fewer than 10,000, is about 14-20 percent. Similar results are obtained for all tested sampling ratios.
Figure: median CV for some classes of p, for three classes of area (according to population size). Comparison of 4 different sampling ratios (s.r.) with the SRSHOU design. Graph for areas with more than 12,000 inhabitants. The gain of efficiency (in terms of CV) for census areas with more than 12,000 inhabitants, with respect to census areas with fewer than 10,000, is about 22-28 percent. As before, similar results are obtained for all tested sampling ratios.
Figure: distribution of the estimates for areas larger than 12,000 inhabitants, by class of CV. Comparison of percentage frequencies for 4 different sampling ratios with the SRSHOU design. Legend: HA - high accuracy, MA - medium accuracy, LA - low accuracy.
Figure: distribution of the estimates for areas larger than 12,000 inhabitants, by class of CV. Comparison of percentage frequencies for 4 different sampling ratios with the SRSHOU design - 2. Legend: HA - high accuracy, MA - medium accuracy, LA - low accuracy.
Estimates of p for a territory given by aggregation of areas.
• Generic sampled area a
• Territory RS given by aggregation of K sampled areas
  • percentage expected reduction of CV in RS
• Territory R given by aggregation of sampled and not sampled areas
  • percentage expected reduction of CV in R
  • share of the sub-population of R eligible for drawing the LF sample
Conclusions - 1
• As expected, the most accurate estimates were obtained for:
  • simple random sampling of households from administrative registers
  • the largest sampling ratio
• Estimates are more efficient for the largest areas (>12,000 inhabitants)
  • this result suggests planning the sampling design with larger census areas (of about 15,000 people)
• The estimates for large domains given by aggregation of areas show high accuracy
  • accuracy increases with the number of areas aggregated in the domain
  • when part of the large domain is completely enumerated, the estimates show a further increase in accuracy
Conclusions - 2
• However, area frame sampling is only slightly less efficient than SRSHOU, so it could be adopted where reliable administrative registers are not available
• The sampling ratio will be chosen considering the trade-off between:
  • the financial savings needed
  • the accuracy required for different territorial domains
• Further analyses will be conducted on small area estimation techniques to produce more accurate estimates for:
  • the smallest territorial levels
  • rare populations
Thank you for your attention and … have a good lunch!
Simulation study - 4
• Cross-classification cells
  • educational level, employment status, commuting and gender
  • 90 simple estimation cells
• Calibration constraints defined by cross-classifying gender by age, and gender by marital status
• Computational algorithm implemented in SAS for each municipality and for each alternative sampling design:
  • step 1) selection of a sample (of households or enumeration areas)
  • step 2) computation of final weights
  • step 3) estimation of the relative frequency p for each target cell
  • step 4) iteration of steps 1), 2) and 3) for 1,000 sampling replications
  • step 5) computation of the mean and standard error of the sampling distribution for each of the 90 frequency cells
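The five steps can be sketched in Python for a single target cell and the household design. This is a simplified illustration, not the SAS code used in the study: the synthetic population is made up, the function name `simulate_cv` is hypothetical, and step 2 is reduced to equal weights (the calibration to gender by age and gender by marital status is omitted).

```python
import random
import statistics

def simulate_cv(population, ratio, replications=1000, seed=0):
    """Monte Carlo mean, standard error and CV of an estimated proportion
    under simple random sampling of households.
    population: list of households, each a list of 0/1 person indicators
    (1 = the person falls in the target cell)."""
    rng = random.Random(seed)
    n = round(ratio * len(population))
    estimates = []
    for _ in range(replications):                        # step 4: replicate
        sample = rng.sample(population, n)               # step 1: draw households
        # step 2: equal weights only; calibration is omitted in this sketch
        persons = [x for hh in sample for x in hh]
        estimates.append(sum(persons) / len(persons))    # step 3: estimate p
    mean = statistics.fmean(estimates)                   # step 5: summarise
    se = statistics.stdev(estimates)
    return mean, se, se / mean

# Hypothetical population: 2,000 households of 1 to 5 persons, true p near 0.10.
rng = random.Random(42)
population = [[1 if rng.random() < 0.1 else 0 for _ in range(rng.randint(1, 5))]
              for _ in range(2000)]
true_p = sum(sum(hh) for hh in population) / sum(len(hh) for hh in population)

mean10, se10, cv10 = simulate_cv(population, 0.10)
mean33, se33, cv33 = simulate_cv(population, 0.33)
```

Running the same function with an area-based selection in step 1 would give the comparison between the two designs; here only the effect of the sampling ratio is visible, with `cv33` smaller than `cv10`.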
Evaluation criterion: the coefficient of variation
To compare the sampling strategies, the coefficient of variation (CV), i.e. the ratio of the standard error of an estimate to its expected value, was used as the evaluation criterion; it is a measure of the accuracy of the sampling estimates.
Consequently, the percentage maximum expected error implied (with a probability of 0.95) by the estimation method can be computed: Δ% ≈ 1.96 · CV
• The distribution of the empirical CVs for all 90 target cells was determined.
• After classifying the target cells by their value of p, the distribution of the CVs of the cells in the same p group was studied.
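The formula shown on the original slide is not preserved in this transcript. A standard way to write the empirical CV from the Monte Carlo replications is given below, where B (the number of replications, 1,000 in the study) and \hat{p}_b (the estimate from replication b) are notation assumed here.

```latex
\[
\bar{p} = \frac{1}{B}\sum_{b=1}^{B}\hat{p}_b,
\qquad
\mathrm{CV}(\hat{p}) = \frac{1}{\bar{p}}
  \sqrt{\frac{1}{B-1}\sum_{b=1}^{B}\bigl(\hat{p}_b-\bar{p}\bigr)^{2}},
\qquad
\Delta\% \approx 1.96\cdot\mathrm{CV}(\hat{p})
\]
```

For example, with CV expressed as a percentage, a cell estimated with CV = 2% has a maximum expected error of about 1.96 · 2% ≈ 3.9% of the estimated value, with a probability of 0.95.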
Estimates of p for a territory given by aggregation of areas. Case 1: aggregation of sampled areas.
• Estimate for the generic sampled area a
• Estimate for the territory RS given by aggregation of K sampled areas
• Percentage expected reduction of CV (red%):
  • for K > 100 → red% > 90%
  • for K > 30 → red% > 80%
  • for K > 5 → red% > 50%
Figure: percentage expected reduction of CV against the number of areas K.
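The thresholds listed for K are consistent with a reduction of 1 - 1/sqrt(K), which is what one would expect when pooling K areas of equal size, sampled independently and with the same CV. The slide's own formula is not preserved in this transcript, so the sketch below rests on that assumption, and `expected_cv_reduction` is a hypothetical name.

```python
import math

def expected_cv_reduction(k):
    """Percentage reduction of CV when pooling k equal-sized,
    independently sampled areas with a common CV: 100 * (1 - 1/sqrt(k))."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return 100.0 * (1.0 - 1.0 / math.sqrt(k))

for k in (5, 30, 100):
    print(f"K = {k}: reduction of about {expected_cv_reduction(k):.0f}%")
```

With this formula the reduction is about 55% at K = 5, about 82% at K = 30 and 90% at K = 100, in line with the three bounds above.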
Estimates of p for a territory given by aggregation of areas. Case 2: aggregation of sampled and not sampled areas.
• Territory RNS of not sampled areas: long form to all the households
• Territory RS of sampled areas: long form to a sample of households
• γ: share of the population of R eligible for drawing the LF sample
Figure: percentage expected reduction of CV against the number of areas K, with curves for γ = 1, γ = 0.7, γ = 0.6 and γ = 0.5.
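One way to read the role of γ in this chart is sketched below. It is not the authors' formula, which is not preserved in this transcript. The sketch assumes K sampled areas of equal size with independent estimates and a common CV, a not sampled part that is enumerated in full (so it adds no sampling variance), and a proportion p that is roughly the same in every part of R. The symbols \hat{p}_{RS}, p_{RNS} and \mathrm{CV}_a are notation introduced here.

```latex
\begin{align*}
\hat{p}_{R} &= \gamma\,\hat{p}_{RS} + (1-\gamma)\,p_{RNS},\\
\operatorname{Var}(\hat{p}_{R}) &= \gamma^{2}\operatorname{Var}(\hat{p}_{RS})
  = \gamma^{2}\,\frac{\operatorname{Var}(\hat{p}_{a})}{K},\\
\frac{\mathrm{CV}(\hat{p}_{R})}{\mathrm{CV}_{a}} &\approx \frac{\gamma}{\sqrt{K}},
\qquad
\text{red\%} \approx 100\left(1-\frac{\gamma}{\sqrt{K}}\right).
\end{align*}
```

Under these assumptions γ = 1 gives back the case of sampled areas only, and a smaller γ (a larger fully enumerated part) gives a larger reduction of CV for the same K.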