Disclosure detection & control in research environments

Felix Ritchie

Why are research environments special?
• Little disclosure control on input
• Few limits on processing
• Unpredictable, complex outputs
• An infinity of “special cases”
=> Manual review for disclosiveness required

Problems of reviewing research outputs
• Limited application of rules
• How do we ensure
  • consistency?
  • transparency?
  • security?
• How do we do this with few resources?

Classifying the research zoo
• Some outputs inherently “safe”
• Some inherently “unsafe”
• Concentrate on the unsafe
  • Focus training
  • Define limits
  • Discourage use

Safe versus unsafe
• Safe outputs
  • Will be released unless certain conditions arise
• Unsafe outputs
  • Won’t be released unless demonstrated to be safe
Examples: * = conditions for release apply

Determining safety
• Key is to understand whether the underlying functional form is safe or unsafe
• Each output type assessed for risk of
  • Primary disclosure
  • Disclosure by differencing

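Example: linear aggregates of data are unsafe
• Inherent disclosiveness: an aggregate over a single unit is that unit's value
• Disclosure by differencing: the difference between two aggregates that differ by one unit is that unit's value
• Differencing is feasible
• Each data point needs to be assessed for threshold/dominance limits => resource problem for large datasets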
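The differencing risk for totals can be illustrated with a minimal Python sketch. The firm names, turnover values, the helper names `cell_total` and `differencing_risk`, and the `min_units=3` threshold are all assumptions made for illustration; none of them comes from the slides.

```python
# Hypothetical confidential microdata: turnover by firm (illustrative values).
turnover = {"A": 120.0, "B": 95.0, "C": 80.0, "D": 60.0, "E": 45.0}


def cell_total(units):
    """Linear aggregate: total turnover over a list of firm ids."""
    return sum(turnover[u] for u in units)


def differencing_risk(units_a, units_b, min_units=3):
    """Flag two cells whose membership differs by fewer than min_units firms.

    min_units is an assumed threshold, not a value taken from the slides.
    """
    difference = set(units_a) ^ set(units_b)
    return 0 < len(difference) < min_units


all_firms = ["A", "B", "C", "D", "E"]
without_e = ["A", "B", "C", "D"]

# Each total looks harmless on its own, but subtracting one from the
# other recovers firm E's turnover exactly.
recovered = cell_total(all_firms) - cell_total(without_e)
risky = differencing_risk(all_firms, without_e)
print(recovered, risky)
```

Checking every pair of requested cells this way grows quickly with the number of outputs, which is one way to read the resource problem noted above.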

Example: linear regression coefficients are safe
• Let y = Xβ + ε, estimated by β̂ = (X’X)⁻¹X’y
• But β̂ can’t identify a single data point => no risk of differencing
• Exceptions
  • All right-hand variables public and an excellent fit (easily tested, can generate automatic limits on prediction)
  • All observations on a single person/company
• Must be a valid regression
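The "excellent fit" exception lends itself to an automatic test. The sketch below fits a one-regressor OLS model in plain Python and flags the output when R² exceeds a limit. The function names, the data, and the 0.99 limit are assumptions for illustration, not rules stated in the slides.

```python
def fit_simple_ols(x, y):
    """Fit y = intercept + slope * x by ordinary least squares.

    Returns (intercept, slope, r_squared). Raises ValueError when the
    inputs cannot support a valid regression.
    """
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need at least 3 paired observations")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    if sxx == 0 or syy == 0:
        raise ValueError("not a valid regression: no variation in x or y")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return intercept, slope, r_squared


def fit_too_good(r_squared, limit=0.99):
    """Assumed release rule: hold back coefficients when the fit is near-perfect."""
    return r_squared > limit


public_x = [1.0, 2.0, 3.0, 4.0, 5.0]       # right-hand variable, assumed public
tight_y = [2.1, 3.9, 6.2, 7.8, 10.1]       # confidential, almost linear in x
noisy_y = [5.0, 1.0, 4.0, 2.0, 6.0]        # confidential, weakly related to x

_, _, r2_tight = fit_simple_ols(public_x, tight_y)
_, _, r2_noisy = fit_simple_ols(public_x, noisy_y)
print(fit_too_good(r2_tight), fit_too_good(r2_noisy))
```

With a near-perfect fit and public regressors, anyone holding the coefficients can predict each confidential value closely, which is why this case is treated as an exception.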

Example: cross-product/variance-covariance matrices
• Cross-product matrix M = (X’X) is unsafe
  • Frequencies/totals identified by interaction with constant
  • And for any other categorical variables
• What about variance-covariance matrices?
  • V is unsafe: can be inverted to produce M
  • But in the more general case, can’t create a table for X unless Z=X and W=I => weighted covariance matrix is safe
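A small sketch shows why the cross-product matrix leaks tabular information. The data matrix below (a constant, one dummy variable, one continuous variable) and the helper `cross_product` are hypothetical, chosen only to show which cells of X’X hold frequencies and totals.

```python
def cross_product(rows):
    """Return X'X for a data matrix given as a list of equal-length rows."""
    k = len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]


# Columns: constant, dummy (1 = member of some category), continuous value.
X = [
    [1, 1, 50.0],
    [1, 0, 30.0],
    [1, 1, 20.0],
    [1, 0, 10.0],
]

M = cross_product(X)

n_obs = M[0][0]           # constant x constant: number of observations
n_in_category = M[0][1]   # constant x dummy: frequency of the category
grand_total = M[0][2]     # constant x continuous: total of the variable
category_total = M[1][2]  # dummy x continuous: total within the category
print(n_obs, n_in_category, grand_total, category_total)
```

Every additional dummy column adds another row of frequencies and within-category totals, so M is effectively a set of unprotected tables.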

Example: Herfindahl indices
• Composite index of industrial concentration
• Safe as long as at least 3 firms in the industry?
• No:
  • Quadratic term exacerbates dominance
  • If second-largest share is much smaller, √H ≈ share of largest firm
  • Standard dominance rule of largest unit < 45% share doesn’t prevent this
• Current tests for safety not very satisfactory
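The dominance problem can be checked numerically. The sketch below assumes the usual definition H = sum of squared market shares and uses an invented industry of one firm with 44% of turnover and 28 small firms. All figures and the `herfindahl` helper are illustrative assumptions.

```python
import math


def herfindahl(values):
    """Herfindahl index: sum of squared shares of the industry total."""
    total = sum(values)
    if total <= 0:
        raise ValueError("industry total must be positive")
    return sum((v / total) ** 2 for v in values)


# Hypothetical industry: one large firm and 28 small ones (turnover units).
firm_turnover = [44.0] + [2.0] * 28

largest_share = max(firm_turnover) / sum(firm_turnover)
passes_dominance_rule = largest_share < 0.45   # the 45% rule from the slide

h = herfindahl(firm_turnover)
estimate_of_largest = math.sqrt(h)

# Because the largest squared share dominates the sum, sqrt(H) sits just
# above the largest share, and H itself sits below it.
print(passes_dominance_rule, largest_share, estimate_of_largest)
```

In this example the industry passes both a three-firm threshold and the 45% dominance rule, yet the published index pins the largest firm's share down to within about one percentage point.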

Questions?
Felix Ritchie
Microdata Analysis and User Support
Office for National Statistics
felix.ritchie@ons.gov.uk
+44 1633 45 5846
