This article explores the challenges of disclosure control in research environments, with a focus on ensuring consistency, transparency, and security with limited resources. It discusses classifying safe and unsafe outputs, determining safety of different output types, and provides examples of assessing risks and setting limits for disclosure. Contact Felix Ritchie for further questions.
Disclosure detection & control in research environments Felix Ritchie
Why are research environments special? • Little disclosure control on input • Few limits on processing • Unpredictable, complex outputs • An infinity of “special cases” => Manual review for disclosiveness required
Problems of reviewing research outputs • Limited application of rules • How do we ensure • consistency? • transparency? • security? • How do we do this with few resources?
Classifying the research zoo • Some outputs inherently “safe” • Some inherently “unsafe” • Concentrate on the unsafe • Focus training • Define limits • Discourage use
Safe versus unsafe • Safe outputs • Will be released unless certain conditions arise • Unsafe outputs • Won’t be released unless demonstrated to be safe Examples: * = conditions for release apply
Determining safety • Key is to understand whether the underlying functional form is safe or unsafe • Each output type assessed for risk of • Primary disclosure • Disclosure by differencing
Example: linear aggregates of data are unsafe • Inherent disclosiveness: • Disclosure by differencing: • Differencing is feasible • Each data point needs to be assessed for threshold/dominance limits => resource problem for large datasets
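To make the differencing risk concrete, here is a minimal sketch in Python. The turnover figures, the threshold of three units, and the helper names are assumptions made for illustration; the 45% dominance limit is the figure quoted on the Herfindahl slide. Two totals that each pass the checks still reveal a single firm's value once they are subtracted.

```python
# Hypothetical turnover figures for five firms; not taken from the slides.
turnover_all = [120.0, 95.0, 80.0, 60.0, 45.0]
turnover_subset = turnover_all[:-1]  # the same cell with one firm excluded

MIN_UNITS = 3      # assumed threshold limit
MAX_SHARE = 0.45   # dominance limit quoted on the Herfindahl slide


def passes_threshold(values, min_units=MIN_UNITS):
    """True if the cell has at least min_units contributors."""
    return len(values) >= min_units


def passes_dominance(values, max_share=MAX_SHARE):
    """True if the largest contributor holds less than max_share of the total."""
    total = sum(values)
    return total > 0 and max(values) / total < max_share


def is_cell_safe(values):
    """Apply both the threshold and the dominance check to one cell."""
    return passes_threshold(values) and passes_dominance(values)


# Each published total looks safe when checked on its own.
published_a = sum(turnover_all)
published_b = sum(turnover_subset)

# Differencing the two totals recovers the excluded firm exactly.
recovered = published_a - published_b

# The cell implied by the difference has one contributor and fails the checks.
implied_cell = [recovered]
```

Checking every pair of related totals in this way is what drives the resource problem for large datasets.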
Example: linear regression coefficients are safe • Let • But can’t identify single data point => No risk of differencing • Exceptions • All right hand variables public and an excellent fit (easily tested, can generate automatic limits on prediction) • All observations on a single person/company • Must be a valid regression
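The "excellent fit" exception can be tested mechanically. The sketch below is one possible check, not the presenter's procedure: it fits a one-variable least-squares line in plain Python and flags the output when R² exceeds a limit. The data, the 0.95 limit, and the function names are all hypothetical.

```python
def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b, r_squared)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return a, b, 1.0 - ss_res / ss_tot


R2_LIMIT = 0.95  # assumed limit; a real facility would set its own


def needs_review(r_squared, regressors_public, limit=R2_LIMIT):
    """Flag a regression whose public regressors predict y almost exactly."""
    return regressors_public and r_squared > limit


# Hypothetical data: x is public, y is confidential.
x_public = [1.0, 2.0, 3.0, 4.0, 5.0]
y_confidential = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b, r2 = fit_line(x_public, y_confidential)

# With a near-perfect fit, predictions from public x approximate each y.
max_error = max(abs(yi - (a + b * xi))
                for xi, yi in zip(x_public, y_confidential))
flagged = needs_review(r2, regressors_public=True)
```

With these numbers anyone holding the public x values and the released coefficients can approximate each confidential y closely, which is why such a regression would be held back for review.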
Example: cross-product/variance-covariance matrices • Cross product matrix M = (X’X) is unsafe • Frequencies/totals identified by interaction with constant • And for any other categorical variables • What about variance-covariance matrices? • V is unsafe – can be inverted to produce M • But in the more general case • Can’t create a table for X unless Z=X and W=I => weighted covariance matrix is safe
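The way a cross-product matrix leaks frequencies and totals can be seen in a few lines. This is an illustrative sketch with made-up data: X holds a constant, one categorical dummy, and one continuous variable, and the entries of M = X’X that involve the constant or the dummy are exactly the counts and totals a table would show.

```python
def cross_product(rows):
    """Return M = X'X for a data matrix given as a list of equal-length rows."""
    k = len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) for j in range(k)]
            for i in range(k)]


# Hypothetical data: columns are (constant, category dummy, continuous x).
rows = [
    (1.0, 0.0, 10.0),
    (1.0, 1.0, 20.0),
    (1.0, 1.0, 30.0),
    (1.0, 0.0, 40.0),
]

M = cross_product(rows)

n_obs = M[0][0]                # constant x constant = number of observations
n_in_category = M[0][1]        # constant x dummy    = frequency of the category
total_x = M[0][2]              # constant x x        = grand total of x
total_x_in_category = M[1][2]  # dummy x x           = total of x in the category
```

Every additional categorical variable in X adds further frequencies and totals to M in the same way.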
Example: Herfindahl indices • Composite index of industrial concentration • Safe as long as at least 3 firms in the industry? • No: • Quadratic term exacerbates dominance • If second-largest share is much smaller, √H ≈ share of largest firm • Standard dominance rule of largest unit < 45% share doesn’t prevent this • Current tests for safety not very satisfactory
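A short numerical sketch shows why the dominance rule does not help here. It assumes the usual definition of the Herfindahl index as the sum of squared market shares; the industry below (one firm with 44% and 56 firms with 1% each) is invented for illustration.

```python
import math


def herfindahl(sizes):
    """Sum of squared market shares, with shares computed from firm sizes."""
    total = sum(sizes)
    return sum((s / total) ** 2 for s in sizes)


# Hypothetical industry: one large firm and 56 small, equal-sized firms.
sizes = [440.0] + [10.0] * 56

largest_share = max(sizes) / sum(sizes)  # 0.44, under the 45% dominance limit
H = herfindahl(sizes)

# The square root of H is a close estimate of the largest firm's share.
estimated_share = math.sqrt(H)
estimation_error = abs(estimated_share - largest_share)
```

The industry has far more than three firms and passes the 45% rule, yet the published index pins down the largest firm's share to within about one percentage point.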
Questions? • Felix Ritchie • Microdata Analysis and User Support, Office for National Statistics • felix.ritchie@ons.gov.uk • +44 1633 45 5846