
Sampling as a way to reduce risk and create a Public Use File maintaining weighted totals

Maria Cristina Casciano, Laura Corallo, Daniela Ichim. Topics: multiple releases (MFR and PUF); subsampling (allocation: reduce the risk of disclosure; selection: pre-defined quality standards); results.


Presentation Transcript


  1. Sampling as a way to reduce risk and create a Public Use File maintaining weighted totals. Maria Cristina Casciano, Laura Corallo, Daniela Ichim

  2. Outline. • Multiple releases: MFR and PUF • Subsampling: allocation (reduce the risk of disclosure) and selection (pre-defined quality standards) • Results: Career of Doctorate Holders Survey • Further work

  3. [Diagram] Multiple surveys, multiple countries, multiple releases: for each country (MS1, MS2, …, MS27) there are several surveys (SURVEY1, SURVEY2, …, SURVEYX), and each survey has several releases (TABLES, PUF, MFR, OTHER).

  4. ESSnet on SDC harmonisation and common tools. WP1: test the comparability concept (Istat, Destatis, Statistics Austria). Comparability across multiple countries: how? • 1. Assessment of the effects of different practices on predefined statistics • 2. Definition of a threshold to decide when action is needed • 3. Setting a process for choosing acceptable practices

  5. Multiple releases (SURVEY1: TABLES1, PUF1, MFR1, OTHER1) • A particular harmonisation dimension • Hierarchical structure • Utility • Risk of disclosure

  6. Multiple releases: hierarchical structure. MFR: less aggregated information, more restrictive license. PUF: more aggregated information, less restrictive license. UNIQUE PRODUCTION PROCESS!

  7. PUF-MFR
     • MFR
       • definition of a disclosure scenario
       • risk assessment R1
       • risk limitation w.r.t. the adopted disclosure scenario and some data utility requirements
     • PUF
       • harmonized with the MFR (e.g. weighted totals)
       • reduced risk of disclosure
       • random sample
       • internal consistency of records
       • some (other) data utility requirements (CV and weighted totals: precision and accuracy)

  8. Data description: Career of Doctorate Holders (CDH) 2009 survey. Timeline labels: Year t, Year t-5, Year t-3. Focus on the characterisation of the occupational status of the PhD holders: • job satisfaction • labour market entry • usefulness of the PhD for obtaining a job • type of contract • type of work • earnings. Estimates by PhD scientific area, by gender and by region.

  9. Data description. 18500 PhD holders (census): 72% response, 28% non-response; 12964 respondents. Adjustment for non-response via calibration: weights obtained by constraining on known marginal distributions of Citizenship (2 categories), PhD Scientific Area (14 categories), Gender and Region.
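The slide does not say which calibration method was used. As a minimal sketch, assuming raking (iterative proportional fitting) as one common way to constrain weights on known marginal distributions, the idea can be illustrated as follows. The records, variable names and margin totals are hypothetical toy values, not survey data.

```python
def rake(records, margins, max_iter=100, tol=1e-10):
    """Iterative proportional fitting of weights to known marginal totals.

    records: list of dicts, variable name -> category (one dict per respondent)
    margins: dict, variable name -> {category: known population total}
    """
    weights = [1.0] * len(records)
    for _ in range(max_iter):
        max_gap = 0.0
        for var, targets in margins.items():
            # Current weighted total in each category of this variable.
            current = {cat: 0.0 for cat in targets}
            for rec, w in zip(records, weights):
                current[rec[var]] += w
            # Rescale every weight so this margin is matched exactly.
            for k, rec in enumerate(records):
                weights[k] *= targets[rec[var]] / current[rec[var]]
            gap = max(abs(current[c] - targets[c]) for c in targets)
            max_gap = max(max_gap, gap)
        if max_gap < tol:  # all margins were already matched before this pass
            break
    return weights


# Hypothetical toy respondents and margins (both margins sum to 100).
records = [
    {"gender": "F", "citizenship": "IT"},
    {"gender": "F", "citizenship": "IT"},
    {"gender": "F", "citizenship": "other"},
    {"gender": "M", "citizenship": "IT"},
    {"gender": "M", "citizenship": "IT"},
    {"gender": "M", "citizenship": "other"},
]
margins = {
    "gender": {"F": 60.0, "M": 40.0},
    "citizenship": {"IT": 70.0, "other": 30.0},
}
weights = rake(records, margins)
```

After the call, the weighted counts by gender and by citizenship reproduce the assumed margins, which is the sense in which respondents are made to represent the full population.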

  10. PUF-subsampling: simple random sampling. Utility: weighted totals may always be preserved by calibration. Risk: how many units at risk are sampled? Example (MFR-CDH): 12964 units, 24.7% of units at risk.
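The trade-off on this slide can be made concrete with a small simulation. The sketch below assumes the simplest possible calibration (a single ratio adjustment to the overall weighted total) and uses a hypothetical toy file with made-up weights and risk flags. It shows that the total is preserved by construction, while the number of at-risk units in the subsample is left to chance.

```python
import random


def srs_subsample(units, n, rng):
    """Simple random sample without replacement, weights rescaled to keep the total."""
    sample = rng.sample(units, n)
    full_total = sum(u["weight"] for u in units)
    sample_total = sum(u["weight"] for u in sample)
    factor = full_total / sample_total  # ratio adjustment (assumed calibration)
    return [dict(u, weight=u["weight"] * factor) for u in sample]


rng = random.Random(0)
# Hypothetical toy MFR: 1000 units, one in four flagged as at risk.
mfr = [
    {"id": k, "weight": 1.0 + (k % 3) * 0.25, "at_risk": k % 4 == 0}
    for k in range(1000)
]
puf = srs_subsample(mfr, 300, rng)

mfr_total = sum(u["weight"] for u in mfr)
puf_total = sum(u["weight"] for u in puf)
n_at_risk = sum(u["at_risk"] for u in puf)  # varies from draw to draw
```

Under simple random sampling the expected share of at-risk units in the subsample equals their share in the original file, so sampling alone does not target risk reduction.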

  11. Subsampling [keyword diagram]: key variables, stratification, utility, scenario, allocation, auxiliary, disclosure, dissemination, calibration, domains, users, totals, sample size, quality

  12. PUF-subsampling: proposal. • Optimal allocation of units to be sampled in each domain according to Bethel’s approach (risk minimization) • Selection of a fixed-size balanced sample (CUBE method) (data utility maximization)

  13. 1. Bethel’s approach (1989) • Cost function to minimize [formula not captured in the transcript] • Expected coefficient of variation (CV) of the estimates of the total of variable P in domain j_d equal to or lower than prefixed thresholds [formula not captured in the transcript] • n_h and C_h are related to the risk to be reduced → Optimal allocation: n_h*
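The formulas on this slide did not survive in the transcript. For orientation only, the textbook form of a Bethel-type allocation problem under stratified simple random sampling is sketched below. The symbols $C_0$ (fixed cost), $N_h$ (stratum size), $S^2_{h,P,j_d}$ (stratum variance) and $Y_{P,j_d}$ (true total) are assumptions introduced here and may differ from the authors' notation.

```latex
\min_{n_1,\dots,n_H} \; C(n) = C_0 + \sum_{h=1}^{H} C_h \, n_h
\qquad \text{subject to} \qquad
\mathrm{CV}\bigl(\hat{Y}_{P,j_d}\bigr) \le \mathrm{CV}^{*}_{P,j_d}
\quad \text{for every variable } P \text{ and domain } j_d ,

\text{where} \qquad
\mathrm{CV}\bigl(\hat{Y}_{P,j_d}\bigr)
  = \frac{1}{Y_{P,j_d}}
    \sqrt{\sum_{h=1}^{H} N_h^{2} \left( \frac{1}{n_h} - \frac{1}{N_h} \right) S^{2}_{h,P,j_d}} .
```

In the presentation's setting the per-unit "cost" $C_h$ is tied to disclosure risk, so minimizing $C(n)$ favours sampling fewer units from the riskier strata while keeping the precision constraints satisfied.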

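  14. 2. Balanced sampling. A sampling design is said to be balanced on the auxiliary variables $x$ if and only if the balancing equations $\hat{X}_{HT} = \sum_{k \in s} x_k / \pi_k = X$ are satisfied, where $s$ is the sample, $\pi_k$ is the inclusion probability of unit $k$, $X$ is the vector of known population totals and $\hat{X}_{HT}$ is the Horvitz-Thompson (H.-T.) estimator. • Exact estimates for pre-defined variables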
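As a small illustration of the balancing equations, the sketch below checks whether a given sample reproduces the known population totals through the Horvitz-Thompson estimator. The population, auxiliary variables and inclusion probabilities are hypothetical toy values.

```python
def ht_totals(sample, x, pi):
    """Horvitz-Thompson estimate of the total of each auxiliary variable."""
    n_vars = len(x[0])
    return [sum(x[k][j] / pi[k] for k in sample) for j in range(n_vars)]


def is_balanced(sample, x, pi, tol=1e-9):
    """True if the H.-T. estimates equal the known totals for every variable."""
    totals = [sum(row[j] for row in x) for j in range(len(x[0]))]
    estimates = ht_totals(sample, x, pi)
    return all(abs(e - t) <= tol for e, t in zip(estimates, totals))


# Toy population of 4 units; columns: constant 1 (population size) and an
# indicator of membership in a hypothetical group.
x = [(1, 1), (1, 0), (1, 1), (1, 0)]
pi = [0.5, 0.5, 0.5, 0.5]

balanced = is_balanced([0, 1], x, pi)      # one unit from each group
unbalanced = is_balanced([0, 2], x, pi)    # both units from the same group
```

Both samples have the right size, but only the first one also reproduces the group total, which is what "exact estimates for pre-defined variables" refers to.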

  15. Balanced sampling: the CUBE method. Geometrically, each vertex of the hypercube is a sample. The balancing equations define a sub-space of $\mathbb{R}^N$ named K. The problem is to choose a vertex (sample) of the N-cube that remains in the sub-space of constraints K. [Figure: cube for N = 3 with labelled vertices (000), (100), (010), (110), (101), (011), (111), the vector p and the sub-space K] Cube method (Deville & Tillé, 2004): • Flight phase: a random walk starting from the vector p and moving in the intersection of the cube C and K. It stops at a vertex of the intersection of C and K, if such a vertex exists. • Landing phase: at the end of the flight phase, if a sample is not exactly determined in C∩K, a sample is selected as close as possible to the constraint space K.
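The flight phase can be illustrated in a heavily simplified setting. The sketch below is not the full cube algorithm: it assumes a single balancing constraint (fixed sample size), in which case each step moves two coordinates in opposite directions and the procedure resembles pivotal sampling. The function name and inputs are hypothetical.

```python
import random


def flight_fixed_size(p, rng, eps=1e-9):
    """Random walk from p to a 0/1 vertex, keeping sum(p) constant at every step.

    Simplified sketch: the only constraint is the sample size, so a direction
    u = e_i - e_j always stays inside the constraint space.
    """
    pi = list(p)
    while True:
        frac = [k for k, v in enumerate(pi) if eps < v < 1 - eps]
        if len(frac) < 2:
            break
        i, j = frac[0], frac[1]
        lam1 = min(1 - pi[i], pi[j])  # largest step along +u staying in the cube
        lam2 = min(pi[i], 1 - pi[j])  # largest step along -u staying in the cube
        # Direction chosen so that each coordinate keeps its expectation.
        if rng.random() < lam2 / (lam1 + lam2):
            pi[i] += lam1
            pi[j] -= lam1
        else:
            pi[i] -= lam2
            pi[j] += lam2
    return [round(v) for v in pi]


p = [0.5, 0.5, 0.5, 0.5, 0.5, 0.5]          # inclusion probabilities, sum = 3
sample = flight_fixed_size(p, random.Random(1))
```

Each step fixes at least one coordinate at 0 or 1, so the walk ends after at most N steps. With several balancing variables the direction must instead be drawn from the null space of the full constraint matrix, and a landing phase may be needed, as the slide describes.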

  16. Implementation. 1. Determination of the optimal strata sizes in terms of reduction of the overall risk (cost function), keeping the CV of the estimates below a 5% threshold for three combinations of the allocation and domain variables. Allocation variables: Occup, JobS, Contract, Work, Income. Domain variables: Gender, Region, Scientific Area, Year of Completion. 2. Six possible settings, corresponding to different choices of the parameters: a. risk R1 used as the minimization cost of the algorithm; b. risk R1 used as a stratification variable; c. include all units of the strata containing no units at risk.
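To make the 5% CV requirement tangible, the sketch below evaluates the expected CV of an estimated total for a candidate allocation, assuming stratified simple random sampling without replacement. The strata figures are hypothetical and the function is an illustration, not the allocation algorithm used by the authors.

```python
import math


def expected_cv(strata, allocation):
    """Expected CV of an estimated total under stratified simple random sampling.

    strata: list of (N_h, S_h, mean_h) = stratum size, std. deviation, mean
    allocation: list of n_h, the number of units sampled in each stratum
    """
    variance = sum(
        N ** 2 * (1 / n - 1 / N) * S ** 2
        for (N, S, _mean), n in zip(strata, allocation)
    )
    total = sum(N * mean for N, _S, mean in strata)
    return math.sqrt(variance) / total


# Hypothetical strata for a 0/1 variable (e.g. share employed in two domains).
strata = [(500, 0.5, 0.6), (300, 0.4, 0.8)]
cv_candidate = expected_cv(strata, [100, 60])
cv_take_all = expected_cv(strata, [500, 300])   # full enumeration

meets_threshold = cv_candidate <= 0.05
```

An allocation procedure would search for the cheapest (here: least risky) vector of stratum sample sizes for which a check like `meets_threshold` holds for every variable and domain.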

  17. Allocations (CV* = 5%)

  18. Allocations

  19. Balanced sample. Selection of samples of fixed size from the CDH survey. Utility constraints on: • the population size N • the optimal sample size n • the marginal frequency distributions by Gender, Year of Doctorate Completion and Scientific Area. 18 equations. CUBE algorithm: I. Input: the vector p is the optimal one determined by Bethel’s allocation. II. The flight phase ends with no exact solution. III. The landing phase starts: selection of a sample which ensures a small deviation from the balancing equations, according to the distance between p* and p.

  20. Results: median of absolute relative errors
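The results table is not preserved in the transcript. As a minimal sketch of the summary measure named on this slide, assuming the errors compare estimates from the subsample with reference estimates from the original file, the computation could look like this. The domain names and values are hypothetical.

```python
from statistics import median


def median_abs_relative_error(reference, estimates):
    """Median over domains of |estimate - reference| / |reference|."""
    errors = [
        abs(estimates[k] - reference[k]) / abs(reference[k])
        for k in reference
    ]
    return median(errors)


# Hypothetical weighted totals by domain: reference file vs. subsample.
reference = {"A": 100.0, "B": 200.0, "C": 400.0}
estimates = {"A": 110.0, "B": 190.0, "C": 400.0}
mare = median_abs_relative_error(reference, estimates)
```

The median is less sensitive than the mean to a few small domains with very large relative errors, which is one reason it is often preferred for this kind of comparison.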

  21. Results

  22. Further work. 1. The relationship between coefficients of variation and disclosure risk, together with different options for including the risk of disclosure in the sampling design. 2. The introduction of a utility-priority approach to the handling of the balancing equations. 3. The use of other data utility constraints, still to be investigated.
