STATISTICAL CONFIDENTIALITY IN LONGITUDINAL LINKED DATA: OBJECTIVES AND ATTRIBUTES
Mario Trottini, University of Alicante (Spain), mario.trottini@ua.es
Joint UNECE/Eurostat Work Session on Statistical Confidentiality, Geneva, 9-11 November 2005
Problem Definition
Longitudinal linked microdata: "Microdata that contain observations from two or more related sampling frames, with measurements for multiple time periods for all units of observation" (Abowd and Woodcock 2004).
Such data have great research potential, but they raise two related issues:
• How to create the data set?
• How to disseminate the data?
Data Dissemination: Why Is It Difficult?
An ideal data dissemination procedure should pursue three objectives:
1. "Maximize usefulness": allow legitimate users to perform statistical analyses as if they were using the original data.
2. "Maximize safety": control the risk of misuse of the data by potential intruders.
3. "Minimize cost": be operational.
Two issues: (i) the objectives are too ambiguous (how do we measure their achievement?); (ii) the objectives are conflicting (how do we find a suitable balance?).
Data Dissemination as a Decision Problem
A solution requires:
Step (1) Identify the alternatives: the candidate data dissemination procedures.
Step (2) Structure the objectives: clarify the interpretation of "usefulness" (DU), "safety" (DS) and "cost" (C).
Step (3) Define suitable attributes: measures of "usefulness", "safety" and "cost".
Step (4) Assess the trade-offs between the fundamental objectives.
Outline
• Identifying the alternatives: review of existing data dissemination procedures
• Structuring the objectives: theory and current practice
• Selecting attributes: theory and current practice
• Conclusions
Identifying the Alternatives
Let M = { M_k, k ∈ E } denote the class of alternative data dissemination procedures.
CURRENT APPROACH: M_k is one of the following:
1. Data masking
2. Synthetic data
3. Licensing
4. Remote access
5. Research data center
MORE REALISTIC APPROACH: M_k should be a combination of 1-5.
Two rationales:
• Data users and data users' needs are very diverse (Mackie and Bradburn 2000).
• Combining different methods can produce greater data utility for any level of disclosure risk (Abowd and Lane 2003).
Identifying the Alternatives (cont.)
Under the more realistic approach, M_k is a combination of data masking, synthetic data, licensing, remote access and research data centers. Each method carries its own cost, so choosing which combination to offer, subject to the agency's budget, is a portfolio problem (as sketched in the code below).
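The "portfolio" framing can be made concrete with a small sketch. The method names are those listed above, but all utility, risk and cost scores, the budget and the risk cap are hypothetical numbers invented for illustration, and the additive scoring is a deliberate simplification of how an agency would really evaluate a combination of access modes.

```python
from itertools import combinations

# Hypothetical scores for each dissemination method (all numbers invented
# for illustration): (utility, disclosure-risk contribution, cost).
methods = {
    "data_masking":         (0.4, 0.10, 1.0),
    "synthetic_data":       (0.5, 0.05, 2.0),
    "licensing":            (0.6, 0.20, 1.5),
    "remote_access":        (0.7, 0.15, 3.0),
    "research_data_center": (0.9, 0.08, 4.0),
}

BUDGET = 6.0     # hypothetical total budget
RISK_CAP = 0.30  # hypothetical maximum tolerated aggregate risk

best_portfolio, best_utility = None, -1.0
for r in range(1, len(methods) + 1):
    for combo in combinations(methods, r):
        # Simplifying assumption: utility, risk and cost just add up.
        utility = sum(methods[m][0] for m in combo)
        risk = sum(methods[m][1] for m in combo)
        cost = sum(methods[m][2] for m in combo)
        if cost <= BUDGET and risk <= RISK_CAP and utility > best_utility:
            best_portfolio, best_utility = combo, utility

print("Chosen portfolio:", best_portfolio)
print("Total utility:", round(best_utility, 2))
```

In a real application the scores would themselves come from the attributes discussed later in the talk, rather than being fixed constants.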
Structuring the Objectives: Theory
Overall objective: "the best" data dissemination, decomposed into maximize usefulness, maximize safety, minimize cost.
These objectives are too broad and ambiguous to be of operational use.
STRATEGY (information organization): divide an objective into lower-level objectives that clarify the interpretation of the broader objective.
An Illustration: Usefulness
"[The data dissemination procedure] should allow legitimate data users to perform the statistical analyses of interest as if they were using the data set originally collected."
Sources of ambiguity:
a) Definition and identification of "legitimate data users"
b) For a given user in (a), identification of the statistical analyses of interest
c) For a given user in (a) and statistical analysis in (b), definition of "as if"
The Hierarchy: Maximize Usefulness
[Hierarchy diagram: "Maximize usefulness" branches into usefulness for data user 1, data user 2, ..., data user k, and other unknown data users; each of these branches into the statistical analyses of interest SA_k1, ..., SA_km (exploratory analysis, estimation, prediction); the lowest level covers quality, feasibility and transparency, with aspects such as model uncertainty, people/skills, technology, time, cost, access to the data, performing the analysis and interpreting the results.]
Structuring the Objectives: Current Practice
• No explicit hierarchy is used, and the implicit hierarchy is often incomplete.
• The research literature and current practice in SDC as a whole have identified relevant aspects of the fundamental objectives (maximize usefulness, maximize safety, minimize cost).
• However, only a few of them are taken into account in applications: transparency, accessibility and feasibility are often not considered.
An Illustration: Data Masking
Original microdata D_ORIG:
1) Apply some transformation T to the data: D_REL = T(D_ORIG), the output of the transformation.
2) Release to the user D_MASKED = (D_REL, I(T)), where I(T) is information about the transformation.
Let F(Data) denote the output of a statistical analysis of interest using "Data".
Usefulness assessment: Δ = F(D_ORIG) − F(D_MASKED), ignoring transparency! (A toy version of this assessment is sketched below.)
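A minimal sketch of this assessment, assuming additive Gaussian noise as the transformation T and an OLS slope as the analysis F. Both choices, the noise level and the toy data are illustrative assumptions, not part of the original slide.

```python
import numpy as np

rng = np.random.default_rng(0)

# D_ORIG: a toy original microdata set (1000 records, 2 numeric variables).
d_orig = rng.normal(loc=[50_000, 40], scale=[15_000, 10], size=(1_000, 2))

# T: one possible masking transformation, additive Gaussian noise with a
# noise level set (arbitrarily) to 10% of each column's standard deviation.
noise_sd = 0.10 * d_orig.std(axis=0)
d_rel = d_orig + rng.normal(scale=noise_sd, size=d_orig.shape)

# I(T): information about the transformation, released with the data.
info_T = {"method": "additive Gaussian noise", "noise_sd": noise_sd.tolist()}

# F: the statistical analysis of interest; here, the OLS slope of the
# second variable on the first (just one concrete choice of F).
def F(data):
    slope, _intercept = np.polyfit(data[:, 0], data[:, 1], deg=1)
    return slope

# Usefulness assessed as the discrepancy F(D_ORIG) - F(D_REL).
# Note that info_T plays no role in this number: transparency is ignored.
delta = F(d_orig) - F(d_rel)
print("Discrepancy in the estimated slope:", delta)
```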
General Guidelines for Structuring the Objectives
• The definitions of "safety", "usefulness" and "cost" are problem dependent. However, providing a clear definition of them in any specific data dissemination problem is crucial for the quality of the final decision.
• The use of hierarchies could be very beneficial in terms of:
1. clarifying the interpretation of the relevant objectives
2. checking that no relevant aspects of the problem have been ignored
3. facilitating communication
Selecting Attributes: Theory
Types of attributes:
• Natural attributes
• Constructed attributes
• Proxy attributes
Selecting Attributes: Theory
• Natural attributes: an obvious scale that can be used to measure the extent to which an objective is achieved. Not very common in SDC.
  Example: objective "Minimize cost"; (natural) attribute "cost in euros".
• Constructed attributes
• Proxy attributes
The Hierarchy: Maximize Usefulness
[The usefulness hierarchy shown earlier is displayed again: data users, their statistical analyses of interest, and the quality, feasibility and transparency of each analysis.]
Selecting Attributes: Theory
• Natural attributes
• Constructed attributes: a "subjective scale" constructed out of several aspects typically associated with the objective of interest.
• Proxy attributes
Table 1. Constructed attribute for public attitudes. (Keeney and Gregory 2005)
Selecting Attributes: Theory
• Natural attributes
• Constructed attributes: a "subjective scale" constructed out of several aspects typically associated with the objective of interest. Defining feature: interpretability. Not used in SDC.
• Proxy attributes
Selecting Attributes: Theory
• Natural attributes
• Constructed attributes
• Proxy attributes: reflect the degree to which an associated objective is met, but do not directly measure the objective of interest.
Proxy Attributes for "Usefulness" in SDC
GENERAL FORMULATION
D_ORIG: original data; D_REL: disseminated data; F(Data): some feature of "Data".
PROXY = DISCREPANCY( F(D_ORIG), F(D_REL) )
INTUITION: low distortion of the data implies nearly correct inferences for nearly all statistical analyses.
Proxy Attributes for "Usefulness" in SDC
PROXY = DISCREPANCY( F(D_ORIG), F(D_REL) )
• Proxy as discrepancy between summary statistics: Domingo-Ferrer and Torra (2001), Yancey et al. (2002), Oganian (2003), Grup Crises (2004)
• Proxy as discrepancy between distributions: Agrawal and Aggarwal (2001), Gomatam et al. (2004), Karr et al. (2005)
• Inference-based proxies: Gomatam et al. (2004), Karr et al. (2005)
(A toy computation of the first two proxy types is sketched below.)
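As a hedged illustration of the first two families of proxies, the sketch below computes a summary-statistic discrepancy and a Hellinger distance between binned marginal distributions of an original and a noise-masked variable. The binning, the noise level and the toy data are arbitrary choices for illustration, not prescriptions from the cited papers.

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two discrete probability vectors."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))

rng = np.random.default_rng(1)

# Toy original variable and a noise-masked released version of it.
x_orig = rng.normal(40, 10, size=5_000)
x_rel = x_orig + rng.normal(0, 3, size=5_000)

# Proxy as discrepancy between summary statistics (here: mean and variance).
stat_proxy = (abs(x_orig.mean() - x_rel.mean()),
              abs(x_orig.var() - x_rel.var()))

# Proxy as discrepancy between distributions: bin both variables on a common
# grid and compare the resulting histograms with the Hellinger distance.
edges = np.histogram_bin_edges(np.concatenate([x_orig, x_rel]), bins=30)
p = np.histogram(x_orig, bins=edges)[0].astype(float)
q = np.histogram(x_rel, bins=edges)[0].astype(float)
il = hellinger(p / p.sum(), q / q.sum())

print("Summary-statistic discrepancies (mean, variance):", stat_proxy)
print("Hellinger-distance proxy for information loss:", round(il, 4))
```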
Selecting Attributes: Theory
• Natural attributes
• Constructed attributes
• Proxy attributes: usually easier to handle, but they require some understanding of the relationship between the objective of interest and the associated objective measured by the proxy. (Too) often used in SDC.
An Illustration
Goal: assess the trade-off between "maximize usefulness" and "maximize safety" for a given level c of "cost".
• Attribute for "usefulness" (information loss): Hellinger distance (IL)
• Attribute for "safety" (disclosure risk): % of records correctly re-identified (DR)
Data dissemination 1: D1, with IL(D1) = 0.4 and DR(D1) = 1%
Data dissemination 2: D2, with IL(D2) = 0.5 and DR(D2) = 0.5%
Is it worth accepting an increase of 0.1 in IL to reduce DR by 0.5 percentage points? What does Δ(IL) = 0.1 mean in terms of fitting a regression model? (A toy computation of DR is sketched below.)
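The DR attribute used here can be operationalized in several ways; one common choice in the SDC literature is a distance-based record-linkage attack. The sketch below uses toy data, and the noise level and the nearest-neighbour intruder strategy are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy original microdata and a noise-masked release (same record order,
# noise level chosen arbitrarily for the illustration).
d_orig = rng.normal(size=(500, 3))
d_rel = d_orig + rng.normal(scale=0.3, size=d_orig.shape)

# Intruder strategy (an assumption): link every released record to its
# nearest original record in Euclidean distance.
diffs = d_rel[:, None, :] - d_orig[None, :, :]
distances = np.sqrt((diffs ** 2).sum(axis=2))   # shape (n_rel, n_orig)
best_match = distances.argmin(axis=1)

# DR: share of released records whose nearest original record is the true one.
dr = float(np.mean(best_match == np.arange(len(d_orig))))
print(f"Disclosure-risk attribute DR = {dr:.1%} of records re-identified")
```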
Attribute Selection: Theory and Current Practice
THEORY: prescriptive order in attribute selection:
1. Natural attributes
2. Constructed attributes
3. Proxy attributes
CURRENT PRACTICE: proxy attributes are used first and most often, natural attributes are not very common, and constructed attributes are not used.
Desirable properties of attributes are discussed in the paper.
Conclusions
"There is a tendency in all problem solving to move quickly away from the ill-defined to the well-defined, from constraint-free thinking to constrained thinking. There is a need to feel, and perhaps even to measure, progress toward reaching a 'solution' to a decision problem." (Keeney, 1992, page 9)
In this talk it is argued that too little effort has been made towards a comprehensive definition of the data dissemination problem in terms of:
• alternatives
• objectives
• attributes
Conclusions (cont.)
• Hierarchies and constructed attributes could represent useful tools to address these problems.
• Although the discussion has not focused on the dissemination of longitudinal linked data as much as desired, I think it is particularly relevant for this type of data, given:
- the complexity of the modeling
- the multiple decision makers involved
- the different perspectives on disclosure and utility that must be accommodated in the final decision.
Acknowledgements
Preparation of this paper was supported by the U.S. National Science Foundation under Grant EIA-0131884 to the National Institute of Statistical Sciences. The contents of the paper reflect the author's personal opinion. The National Science Foundation is not responsible for any views or results presented.