The new Bank of Italy Remote access to micro Data (BIRD)
G. Bruno, L. D’Aurizio, R. Tartaglia-Polcini
Q2008 – Rome, July 10, 2008
Motivation
• Information release and data protection as competing goals
• The risk-utility tradeoff:
  • risk of data disclosure
  • utility of widespread availability of data for research
Motivation
GOALS (UTILITY):
• satisfy growing demand from external researchers for business data
• improve the accountability of the Central Bank as an economic research centre
• provide a service to the scientific community
CONSTRAINTS (RISK):
• Data confidentiality must be guaranteed:
  • as a prerequisite for respondents’ collaboration
  • to foster the quality of the data provided
  • as required by law
• A Public Use File (PUF) with individual data was judged infeasible: anonymisation is very problematic with business data
Motivation
SYNTHETIC DATA LIMITATIONS:
• Identity disclosure is impossible in principle but, particularly with extreme values, it may still be possible to re-identify a source record
• Attribute disclosure may happen
• Ample literature on data confounding and synthetic data (Duncan & Lambert 1989; Rubin 1993; Little 1993; Fuller 1993; Fienberg et al. 1996; Kennickell 1997; Abowd & Woodcock 2001; Reiter 2002; Raghunathan et al. 2003; etc.)
Choices
• Data confounding: create a PUF containing perturbed data to prevent identification of individual information. Downside: results (especially regressions) may depend heavily on the confounding technique adopted, and the literature on this is controversial
• Data lab (à la Istat: ADELE): the researcher has to go to the lab in person
• Remote processing over the internet, without direct access to individual data (à la Luxembourg Income Study: LISSY)
Other remote processing systems
• Luxembourg Income Study (LISSY, 1987)
• Statistics Canada (2001)
• Statistics Denmark (2001)
• Statistics Netherlands (2002)
• Australian Bureau of Statistics (2003)
• Statistics Sweden (2003)
• US Federal Agencies: NCHS (1997), NCES (1998), Census Bureau (2003)
The solution adopted at the Bank of Italy: BIRD
• Modeled on LISSY
• Low setup cost
• Easily customisable
• Supports multiple packages
• Maximum accessibility for users
• Multi-level control (user/group, dataset, keyword)
• Automatic and manual checks & review
How BIRD works
USER ELIGIBILITY CRITERIA
• Researcher status (not necessarily academic), supported by a letter of introduction
• Identification via a valid personal ID
• Detailed information supplied on a form to be filled in
How BIRD works
USER PROFILE CREATION
• The researcher indicates an e-mail address, which will be recognised by the system
• The researcher chooses her own username and password
• The user-chosen parameters are entered into the user database
• The access profile is created
How BIRD works
SUBMISSION PROCEDURE
• Communication with the processing environment takes place via e-mail
• The user sends a message containing user authentication info + the statements to be submitted
• The input message is parsed and checks are performed
• If there is no error or security violation, the statements are submitted
• The output is checked (automatically or manually)
• If there is no security violation, the output is forwarded to the user via e-mail
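The submission steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not BIRD's implementation: the message layout (`user=` and `password=` as the first two body lines), the `USERS` table, and the `check`, `run` and `review` callables are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from email import message_from_string
from email.utils import parseaddr

# Hypothetical user database: registered mailbox -> (username, password).
USERS = {"researcher@example.org": ("jdoe", "s3cret")}


@dataclass
class Submission:
    sender: str
    user: str
    statements: list


def parse_submission(raw):
    """Split a raw e-mail into authentication info and statements.

    Assumed layout (not BIRD's real format): the first two non-empty
    body lines are 'user=<name>' and 'password=<pwd>'; the rest is the
    program to run.
    """
    msg = message_from_string(raw)
    sender = parseaddr(msg.get("From", ""))[1].lower()
    body = msg.get_payload()
    lines = [ln.strip() for ln in body.splitlines() if ln.strip()]
    creds = dict(ln.split("=", 1) for ln in lines[:2] if "=" in ln)
    if sender not in USERS:
        raise PermissionError("mailbox not registered")
    if (creds.get("user"), creds.get("password")) != USERS[sender]:
        raise PermissionError("authentication failed")
    return Submission(sender, creds["user"], lines[2:])


def handle(raw, check, run, review):
    """Run one submission through the steps listed above."""
    try:
        sub = parse_submission(raw)
    except PermissionError as exc:
        return f"rejected: {exc}"
    if not check(sub.statements):        # input-side security check
        return "rejected: forbidden statement"
    output = run(sub.statements)         # the statistical package would run here
    if not review(output):               # output-side check (automatic or manual)
        return "withheld: output failed review"
    return output                        # would be e-mailed back to sub.sender
```

In this sketch the three callables stand in for the keyword checks, the statistical package and the output review, so each stage can be replaced or tested separately.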
Confidentiality safeguards
• User level
• Data level
• Processing level
Confidentiality safeguards
User level:
• Users are identified, qualified and registered
• Registered mailboxes are whitelisted; ordinarily only one mailbox per user
• Outputs are monitored and archived
• Code of professional ethics, privacy law, specific penalties
Sanctions
• Forbidden submissions or outputs are deleted
• Access may be revoked for users who try to run forbidden commands
• Any other sanctions or penalties required by the law, where applicable
Confidentiality safeguards
Data level:
• Extreme data are censored (Winsorised)
• Identifying variables (IDs, names, addresses) are expunged from the datasets used for remote processing
• Stratification variables are collapsed (geographical areas instead of regions; Ateco aggregations instead of individual codes)
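A minimal Python sketch of these three data-level safeguards follows. The quantile cut-offs, the field names (`firm_id`, `name`, `address`, `region`, `workforce`) and the partial region-to-area table are assumptions chosen for illustration; the slides do not state the thresholds or the variable names actually used.

```python
# Illustrative subset of a region -> macro-area mapping (assumed, not BIRD's table).
AREA_BY_REGION = {
    "Piemonte": "North-West",
    "Lombardia": "North-West",
    "Veneto": "North-East",
    "Lazio": "Centre",
    "Sicilia": "South and Islands",
}

# Hypothetical names of the identifying variables to expunge.
ID_FIELDS = ("firm_id", "name", "address")


def winsorise(values, lower_q=0.05, upper_q=0.95):
    """Replace values beyond the chosen quantiles with the quantile values."""
    ordered = sorted(values)
    n = len(ordered)
    lo = ordered[int(lower_q * (n - 1))]   # lower cut-off
    hi = ordered[int(upper_q * (n - 1))]   # upper cut-off
    return [min(max(v, lo), hi) for v in values]


def protect(record):
    """Drop identifying fields and collapse the region into a broader area."""
    out = {k: v for k, v in record.items() if k not in ID_FIELDS}
    out["area"] = AREA_BY_REGION[out.pop("region")]
    return out


workforce = [12, 15, 18, 20, 22, 25, 30, 35, 40, 5000]
censored = winsorise(workforce, lower_q=0.0, upper_q=0.9)
```

With these inputs the single very large firm (5000 employees) is pulled down to the next-largest value, 40, while the other observations are unchanged. Unlike trimming, Winsorising keeps every record in the dataset.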
Confidentiality safeguards
Processing level:
• Displaying individual data is formally forbidden
• Keyword parser implemented with ceiling, blacklist and graylist
• Particularly long and/or complex programmes are always reviewed manually
• In the learning stage, all submissions are reviewed manually
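One way such a keyword screen could work is sketched below in Python. The slide does not define the three mechanisms, so this sketch reads "ceiling" as a cap on program length, "blacklist" as keywords that cause outright rejection, and "graylist" as keywords that trigger manual review. These readings, the keyword sets and the `MAX_STATEMENTS` value are assumptions, not BIRD's actual rules.

```python
import re
from enum import Enum


class Verdict(Enum):
    REJECT = "reject"
    MANUAL_REVIEW = "manual review"
    AUTO_RUN = "run automatically"


# Hypothetical keyword sets; `list` and `browse` are Stata commands that
# display individual observations, so they serve as plausible examples.
BLACKLIST = {"list", "browse", "export"}
GRAYLIST = {"tabulate", "merge"}
MAX_STATEMENTS = 50  # assumed ceiling on program length


def screen(program):
    """Classify a submitted program by keyword and length."""
    statements = [ln for ln in program.splitlines() if ln.strip()]
    words = set(re.findall(r"[a-z_]+", program.lower()))
    if words & BLACKLIST:
        return Verdict.REJECT
    if words & GRAYLIST or len(statements) > MAX_STATEMENTS:
        return Verdict.MANUAL_REVIEW
    return Verdict.AUTO_RUN
```

Matching on bare words is deliberately crude: a variable that happens to be called `list` would also be rejected. A screen of this kind errs on the side of caution and relies on manual review for the borderline cases.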
How the parser works
(*) This feature will be available in the next release of the system.
Datasets available
STANDARD DATASET: quantitative data for the biggest firms (in terms of workforce) are censored (Winsorised)
COMPLETE DATASET: no data censoring
ID variables are expunged from both datasets
Datasets available
Stricter procedure for accessing the complete dataset:
• Access must be explicitly requested; a special profile is created
• Review is exclusively manual
• Waiting times are longer than average, because less time is allocated to manual review of submissions on the complete dataset
Documentation on the website
• Application form
• Instruction manual
• Dataset description
• Examples of submissions in the supported packages (SAS, Stata)
• Methodological notes on the survey
Support
• Documentation available on the Bank of Italy website (manuals, variable descriptions, questionnaires): http://www.bancaditalia.it/statistiche/indcamp/indimpser/bird
• Mailbox for queries and assistance: bird_assist@bancaditalia.it
An example
Program submitted by the user in Stata. Authentication is in the first four lines.
An example
Output forwarded after review
Usage of the system in the first weeks
System started officially on Mar 13, 2008
Beta users from Feb 1, 2008
8 registered users
172 submissions in 21 weeks
Future developments
• Web submission available alongside e-mail submission
• Other datasets will be made available in the future (e.g. data from the Business Outlook Survey)
• Support for open-source packages (e.g. R)
• Merging with external datasets provided by the user, for special projects, on a discretionary basis, under a stricter procedure and higher security levels
• Creation of closed groups with special authorisation levels for specific projects