
The Application of the Concept of Uniqueness for Creating Public Use Microdata Files



  1. The Application of the Concept of Uniqueness for Creating Public Use Microdata Files. Jay J. Kim, U.S. National Center for Health Statistics; Dong M. Jeong, Korea National Statistical Office

  2. Contents • Introduction • Intruders and Disclosure • Measures of Disclosure Risk 1. Narrow Definition of Disclosure Risk 2. Broader Definition of Disclosure Risk • Evaluation of Definition of Disclosure Risk • Concluding Remarks

  3. 1. Introduction. • Government agencies release microdata files from their survey data or administrative records data. • Large amounts of information on individuals are available to many organizations and data users, who can become “intruders”. • If a public use microdata file (PUMF) is released, intruders can try to match their records with those on the PUMF and gain access to new information.

  4. Intruders link the records on the two files using variables common to the PUMF and their own files, which are called “key variables” or “matching variables”. • In the U.S., laws such as Title 13 stipulate protection of the confidentiality of many types of data. • Thus, the data disseminating agencies must protect the confidentiality of the individuals on the PUMFs. On the other hand, they should not ignore the data users’ needs, i.e., the utility of the data files.

  5. Here, we develop probability models quantifying disclosure risk for a microdata file. • This is a modification of the Marsh et al. (1991) procedure. • The model can use population and sample “uniques” only, or it can also include population twins or triplets. • We will show the results of applying the probability model, using population and sample uniques only, to create disclosure-limited microdata files from the 2005 Korean demographic census data.

  6. 2. Intruders and Disclosure • Potential intruders: i). Organizational intruders, e.g., credit card companies, mortgage departments of banks, insurance companies, credit bureaus, trade associations, etc. ii). Individual intruders: with readily available high-powered computers, anyone can assemble his own database using information in the public domain and become an intruder.

  7. Two types of disclosure: i). Identity disclosure – identification. If the intruder is a journalist and tries to embarrass the data disseminating agencies, his claim that he has been successful in identifying someone on their PUMF would be sufficient. If the intruder publicizes the findings in the news media, it could have a devastating effect on the agencies’ data collection efforts.

  8. ii). Attribute disclosure: after identification is made, the intruder can gain new sensitive information. For defining a measure of disclosure risk, we treat identity disclosure as equivalent to disclosure.

  9. 3. Measures of Disclosure Risk • Define P(a) = the probability of key variables being recorded identically in both the PUMF and the intruder’s file; P(b|a) = the probability that an individual appears in the PUMF, which is the same as the sampling fraction for that individual in the PUMF;

  10. P(c|a,b) = the probability that the individual is a population unique; and P(d|a,b,c) = the probability of verifying population uniqueness. • Marsh et al. (1991) defined the probability of correct identification of an individual as P(a) P(b|a) P(c|a,b) P(d|a,b,c)

  11. We modify the model of Marsh et al. • In their formula, we assume that i). There are no recording or classification errors in the values of the key variables, i.e., P(a) = 1. ii). We can verify population uniqueness with certainty, i.e., P(d|a,b,c) = 1.
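Substituting the two assumptions into the product quoted from Marsh et al. (1991) leaves only the sampling-fraction term and the population-uniqueness term. The line below is a direct substitution written out for convenience; it is not a formula shown on the slide.

```latex
\[
P(a)\,P(b \mid a)\,P(c \mid a,b)\,P(d \mid a,b,c)
  \;=\; 1 \cdot P(b \mid a)\,P(c \mid a,b) \cdot 1
  \;=\; P(b \mid a)\,P(c \mid a,b)
\]
```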

  12. Disclosure can occur when all the following 5 conditions are met: i). An individual is unique in a population based on key variables. If the intruder’s file is a 100 percent population file, he can establish uniqueness of a certain individual by using his file. ii). The individual is on the PUMF.

  13. iii). The individual is on the intruder’s file. An intruder can have information on key variables for a specific person and try to examine whether that person appears in the PUMF. In this case, the intruder’s file has a single record. iv). The individual is unique on the PUMF, and v). The individual is unique on the intruder’s file.
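The five conditions can be checked mechanically on toy data. The sketch below is only an illustration: the record layout, the `id` field (a ground-truth identifier that a real PUMF would not carry), and the function names `unique_ids` and `at_risk_ids` are assumptions made for the example, not part of the authors' procedure.

```python
from collections import Counter


def unique_ids(records, keys):
    """Return ids of records whose key-variable combination occurs exactly once."""
    combos = Counter(tuple(r[k] for k in keys) for r in records)
    return {r["id"] for r in records if combos[tuple(r[k] for k in keys)] == 1}


def at_risk_ids(population, pumf, intruder_file, keys):
    """Ids that satisfy all five conditions listed on the slides."""
    pumf_ids = {r["id"] for r in pumf}
    intruder_ids = {r["id"] for r in intruder_file}
    return (
        unique_ids(population, keys)         # i) unique in the population
        & pumf_ids                           # ii) on the PUMF
        & intruder_ids                       # iii) on the intruder's file
        & unique_ids(pumf, keys)             # iv) unique on the PUMF
        & unique_ids(intruder_file, keys)    # v) unique on the intruder's file
    )


# Hypothetical toy data: two key variables, four people.
KEYS = ("gender", "age")
population = [
    {"id": 1, "gender": "F", "age": 34},
    {"id": 2, "gender": "M", "age": 34},
    {"id": 3, "gender": "M", "age": 34},
    {"id": 4, "gender": "F", "age": 71},
]
pumf = [population[0], population[1]]
intruder_file = [population[0], population[2], population[3]]

print(at_risk_ids(population, pumf, intruder_file, KEYS))  # prints {1}
```

Person 2 is unique on the toy PUMF but shares key values with person 3 in the population, so only person 1 meets every condition.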

  14. Define A = an individual of interest; = PUMF; = an intruder’s file; = unique class in the population;

  15. = unique class in PUMF; and = unique class in intruder’s file.

  16. 3.1 A Narrow Definition of Disclosure Risk This definition depends on the population and sample uniques only. 3.1.1 Assume an Intruder Goes on a Phishing (Fishing) Expedition.

  17. The probability of correct identification: (1) If an individual is a population unique, it would also be a sample unique, i.e.,

  18. Equation (1) reduces to which can be further re-expressed as follows: (2)

  19. The event that A is unique in population is independent of whether A is selected in sample or not. Thus, equation (2) reduces to (3) The event that A is in the PUMF is usually independent of the event that A is in the intruder’s file. In this case, equation (3) can be simplified as (4)

  20. However, a survey can be a subset of another survey. For example, U.S. Census Bureau’s PUMF is a subset of their census sample. Thus if is a subset of and equation (3) becomes (5) Also, (6)

  21. 3.1.2 Assuming an Intruder Already Knows That A is in PUMF If the intruder has response knowledge, then Thus, from equation (4), the disclosure risk will be
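The equations on slides 17 to 21 did not survive in this transcript. One reading that is consistent with the surrounding sentences is sketched below. All symbols are assumptions introduced here, since the original notation is lost: $F_S$ is the PUMF, $F_I$ the intruder's file, and $U_P$, $U_S$, $U_I$ the unique classes in the population, the PUMF, and the intruder's file. Equation (6) is not reconstructed because nothing in the text indicates its content.

```latex
% Hypothetical notation; requires amsmath.
\begin{align}
P(\text{correct identification})
  &= P(A \in U_S,\ A \in U_I \mid A \in U_P,\ A \in F_S,\ A \in F_I)\,
     P(A \in U_P,\ A \in F_S,\ A \in F_I) \tag{1} \\
  % A population unique is also unique in any file containing it,
  % so the conditional factor equals 1.
  &= P(A \in U_P \mid A \in F_S,\ A \in F_I)\, P(A \in F_S,\ A \in F_I) \tag{2} \\
  % Population uniqueness is independent of selection.
  &= P(A \in U_P)\, P(A \in F_S,\ A \in F_I) \tag{3} \\
  % Membership in the two files is independent.
  &= P(A \in U_P)\, P(A \in F_S)\, P(A \in F_I) \tag{4}
\end{align}

% If F_S is a subset of F_I, then P(A in F_S, A in F_I) = P(A in F_S), and (3) becomes
\begin{equation}
P(\text{correct identification}) = P(A \in U_P)\, P(A \in F_S) \tag{5}
\end{equation}

% Response knowledge: P(A in F_S) = 1, so (4) gives
\begin{equation*}
P(\text{correct identification}) = P(A \in U_P)\, P(A \in F_I)
\end{equation*}
```

This reading agrees with the figures quoted on slide 34: 0.00028 × 0.02 = 0.0000056 under the subset form, and 0.00028 × 0.1 = 0.000028 under response knowledge.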

  22. 3.2 Broader Definition of Disclosure Risk • Even if an individual is not unique in the population, he still can be identified with additional information. • Suppose C individuals in the population have the same values of the key variables and matching to any one of them is equally likely.

  23. Define = Equivalence class of size C in the population. Then the probability of correct identification is,
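The formula on this slide is missing from the transcript, so the sketch below illustrates only the stated assumption: when C people share the same key values and each is equally likely to be matched, the chance of picking the right one is 1/C. How the slide combines that factor with the file-membership terms is not recoverable here. The record layout and the function name `match_probabilities` are hypothetical.

```python
from collections import Counter


def match_probabilities(population, keys):
    """Map each record id to 1/C, where C is the size of its equivalence class."""
    combos = Counter(tuple(r[k] for k in keys) for r in population)
    return {r["id"]: 1 / combos[tuple(r[k] for k in keys)] for r in population}


# Hypothetical toy population: persons 2 and 3 are "twins" on the key variables.
people = [
    {"id": 1, "gender": "F", "age": 34},
    {"id": 2, "gender": "M", "age": 34},
    {"id": 3, "gender": "M", "age": 34},
    {"id": 4, "gender": "F", "age": 71},
]
probs = match_probabilities(people, ("gender", "age"))
print(probs)  # prints {1: 1.0, 2: 0.5, 3: 0.5, 4: 1.0}
```

With C = 1 the factor is 1, which corresponds to the narrow, uniques-only definition.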

  24. 4. Evaluation of Disclosure Risk • We used the measures of disclosure risk developed here in creating PUMFs from the 2005 Korean census data. • We show the results of the applications on the 2005 census data from Choongchung (CC) Province. • The masking scheme used is to coarsen (group) categories.
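Grouping categories reduces the number of uniques because records that differed only in a fine-grained value fall into the same cell. The sketch below shows the effect for age recoded into 5-year bands on made-up records; the data, the band width parameter, and the function names are assumptions for illustration and do not reproduce the census results.

```python
from collections import Counter


def count_uniques(records, keys):
    """Count key-variable combinations that occur exactly once."""
    combos = Counter(tuple(r[k] for k in keys) for r in records)
    return sum(1 for n in combos.values() if n == 1)


def coarsen_age(record, width=5):
    """Return a copy with age replaced by the lower bound of its band."""
    out = dict(record)
    out["age"] = (record["age"] // width) * width
    return out


# Hypothetical toy records.
records = [
    {"gender": "F", "age": 31},
    {"gender": "F", "age": 33},
    {"gender": "M", "age": 47},
    {"gender": "M", "age": 48},
    {"gender": "F", "age": 70},
]
keys = ("gender", "age")
before = count_uniques(records, keys)
after = count_uniques([coarsen_age(r) for r in records], keys)
print(before, after)  # prints 5 1
```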

  25. The Korea National Statistical Office (KNSO) creates its 2 percent PUMFs by taking a 20 percent subsample of the 10 percent census sample (0.1 × 0.2 = 0.02). : 2 percent PUMF. : 10 percent census sample.

  26. Table 1. Population Size, and Number of Households and Housing Units – CC Province

  27. Key variables used: gender (2); age (111); marital status (4); relationship to householder (14); household type (5); tenure (6); building type of residence (12); and type of housing and number of floors of the building (12). • The probability of a population unique is calculated using the 100 percent census file. • Without grouping, the number of uniques is 9,664. It is 0.54% of 1.8 million.
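A quick calculation shows why uniques appear before any grouping: the cross-classification of the eight key variables has far more cells than the province has people. The sketch below multiplies the category counts listed on the slide and recomputes the share of uniques. The population of 1.8 million is the rounded figure from the slide, and the cell count is an upper bound because not every combination is feasible.

```python
import math

# Category counts as listed on the slide.
categories = {
    "gender": 2,
    "age": 111,
    "marital status": 4,
    "relationship to householder": 14,
    "household type": 5,
    "tenure": 6,
    "building type of residence": 12,
    "type of housing and number of floors": 12,
}

cells = math.prod(categories.values())  # upper bound on distinct key combinations
population = 1_800_000                  # rounded figure from the slide
uniques = 9_664

share_percent = 100 * uniques / population
print(cells)                    # prints 53706240
print(round(share_percent, 2))  # prints 0.54
```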

  28. If we assume that the intruder has a 10 percent census sample file, the disclosure risk is However, whole blocks are selected in the 10 percent census sample, thus residents in the sample blocks know that their neighbors are also in the sample. To those who have response knowledge, the disclosure risk is

  29. Table 2. Number of Unique Persons before Grouping Categories

  30. Table 3. Number of Uniques with 5 Year Intervals for Age

  31. Table 4. Number of Uniques with Grouped Age and Relationship Categories

  32. Table 5. Number of Uniques with Grouped Age, Relationship and Marital Status Categories

  33. Table 6. Two different groupings in the number of categories

  34. Probability of unique = 0.028% for both groupings. If we assume the intruder has the 10 percent census sample file, the disclosure risk is 0.0000056, which is less than 1 in 100,000. If we assume response knowledge, the disclosure risk goes up to 0.000028.
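The two risk figures on this slide can be reproduced by multiplying the probability of a unique by a sampling fraction. The pairing of factors used below (2 percent for the intruder holding the census sample, 10 percent under response knowledge) is inferred from the quoted numbers, since the formulas themselves are missing from the transcript; the variable names are hypothetical.

```python
p_unique = 0.00028              # 0.028 percent, from the slide
pumf_fraction = 0.02            # 2 percent PUMF
census_sample_fraction = 0.10   # 10 percent census sample

# Intruder holds the 10 percent census sample, of which the PUMF is a subset.
risk_with_census_sample = p_unique * pumf_fraction

# Intruder has response knowledge.
risk_with_response_knowledge = p_unique * census_sample_fraction

print(f"{risk_with_census_sample:.7f}")       # prints 0.0000056
print(f"{risk_with_response_knowledge:.6f}")  # prints 0.000028
print(risk_with_census_sample < 1 / 100_000)  # prints True
```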

  35. 5. Concluding Remarks • We developed comprehensive probability models quantifying disclosure risk for microdata files and applied them to the Korean census data. • Using the models, we measured the disclosure risks for the original census data. The risks were too high.

  36. We grouped categories of the key variables and re-calculated the disclosure risks. The risks were lowered to a satisfactory level. • For creating their official 2 percent PUMFs from the census data, KNSO used the approaches mentioned here including the measures of disclosure risks and grouping categories.

  37. Thank you very much! Jay J. Kim (jkim5@cdc.gov), Dong M. Jeong (jedomy@nso.go.kr). Disclaimer: This paper represents the views of the authors and should not be interpreted as representing the views, policies or practices of the Centers for Disease Control and Prevention, National Center for Health Statistics.
