Supporting the use of administrative data in official statistics
The System of Integrated Microdata (SIM) and the Quality Report Card of Administrative data (QRCA)
Marina Venturi, Grazia Di Bella
Italian National Institute of Statistics - Istat
NTTS 2017, Brussels, 14 - 16 March 2017
New Istat modernization process
Innovation of the statistical production system. Use of AdminData
The overall Istat modernization process is focused on the new System of Integrated Registries, based mainly on AdminData:
• Business Register
• Population Register
• Places Register
• Activities Register
A single Directorate deals with all aspects of data collection, for both surveys and AdminData, and with the first data treatment process.
Automation of processes for the management of the AdminData flows
Considering the complexity of the system and the need to process large amounts of data
Strategy
• Use of IT tools to acquire, store, integrate and disseminate AdminData
• Strengthening of the metadata system to automate procedures and to create up-to-date documentation of AdminData quality
AdminData centralized management (see Istat presentation 13C001 by Giacomi)
[Diagram: AdminData management functions, including AD quality evaluation and documentation, and the IT tools that support them: EDI, SIM, ARCAM]
SIM - The Integrated System of Administrative Microdata
SIM is the Istat database of integrated administrative microdata, built to support the Istat statistical production process.
Subsets of administrative sources (datasets) enter the system when they contain microdata referring to statistical target units:
• Individuals
• Economic units
• Places
The SIM process: concept analysis
AD come from different sources and therefore have different characteristics.
To make the data consistent with the integration system:
• data analysis, according to the entity-relationship model
• standardized procedures for each dataset delivered periodically
• data loading into relational tables (object/entity and its attributes)
SIM has integrated 70 different AdminData subsets each year since 2011 and is going to integrate further AdminData subsets in 2017.
The SIM process: data loading
Administrative objects/entities recognized as statistical target units of type k, with k = 1, 2, 3, feed the three main subsystems:
• Individuals
• Economic units
• Places
An AdminDataset may contain more than one basic unit in the same instance:
• Individuals (family relationships)
• Economic units (local units)
SIM relation subsystems:
• Individuals and economic units (e.g. Workers - LEED, Students - Schools)
• Places and economic units
• Places and individuals
Statistical Registers: Population / Business / Places / Activities
The SIM process: integrating data [1]
The integration stage is incremental: as datasets are acquired, they are progressively integrated into SIM with the data already present.
Operationally, the i-th dataset is integrated through a series of record linkage procedures between the input dataset and the corresponding Base Bk (k = 1, 2, 3).
Depending on the type of unit and on the AD domain, which is defined by several criteria, a suitable integration strategy and set of algorithms are applied.
The SIM process: integrating data [2]
Integrating a dataset involves as many integration processes as there are types of units in it.
The integration stage ends with the attribution of a unique identification code (the SIM code) within the respective Base Bk, and each instance becomes part of the Base.
Units that the integration procedures do not match with others already recorded in the Base receive a new code.
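The incremental logic described above can be illustrated with a minimal sketch. It is not the Istat implementation: the slides mention several record linkage strategies and algorithms, whereas this example assumes a single exact-match key (a hypothetical `tax_id` field) and a hypothetical `integrate` helper. It only shows how matched units reuse an existing code while unmatched units receive a new one and join the Base.

```python
from itertools import count

def integrate(base, dataset, key_fields, next_code):
    """Link incoming records to the base on an exact key (assumed strategy).

    base: dict mapping code -> linkage key values already in the Base.
    Returns the input records, each enriched with a 'sim_code'.
    """
    # Index the Base by linkage key for constant-time lookups.
    index = {tuple(rec[f] for f in key_fields): code for code, rec in base.items()}
    assigned = []
    for rec in dataset:
        key = tuple(rec[f] for f in key_fields)
        code = index.get(key)
        if code is None:
            # No match in the Base: mint a new code and add the unit to the Base.
            code = next(next_code)
            base[code] = {f: rec[f] for f in key_fields}
            index[key] = code
        assigned.append({**rec, "sim_code": code})
    return assigned

# Hypothetical Base of individuals (k = 1), initially empty.
base_individuals = {}
codes = count(1)

first = integrate(
    base_individuals,
    [{"tax_id": "A1", "income": 10}, {"tax_id": "B2", "income": 20}],
    ["tax_id"], codes,
)
second = integrate(
    base_individuals,
    [{"tax_id": "B2", "school": "X"}, {"tax_id": "C3", "school": "Y"}],
    ["tax_id"], codes,
)
```

In this toy run the unit with key `B2` keeps the same code across both deliveries, while `C3` enters the Base with a fresh code.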
The SIM process: a step for data security
Identification variables of individuals are stored in a separate table.
Integrated, anonymized data are made available to internal users who have requested them and are authorized to use them.
The advantage of integrating data centrally is twofold:
• Users can link datasets simply by using the SIM codes (no use of identification variables), avoiding duplicate work
• Compliance with legislation on data privacy
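A minimal sketch of this separation, under stated assumptions: the field names (`name`, `tax_id`) and the helpers `split_identifiers` and `link_by_sim_code` are hypothetical, and the real system stores these tables in a database rather than in Python dictionaries. The point is only that users can join two anonymized datasets on the SIM code without ever seeing identification variables.

```python
def split_identifiers(records, id_fields):
    """Move identification variables into a separate table keyed by SIM code."""
    identifiers = {}
    anonymized = []
    for rec in records:
        identifiers[rec["sim_code"]] = {f: rec[f] for f in id_fields}
        anonymized.append({k: v for k, v in rec.items() if k not in id_fields})
    return identifiers, anonymized

def link_by_sim_code(left, right):
    """Join two anonymized datasets using only the SIM code."""
    right_by_code = {r["sim_code"]: r for r in right}
    return [
        {**l, **right_by_code[l["sim_code"]]}
        for l in left
        if l["sim_code"] in right_by_code
    ]

# Hypothetical integrated records from two administrative sources.
tax = [{"sim_code": 1, "name": "Rossi", "tax_id": "A1", "income": 10}]
school = [{"sim_code": 1, "name": "Rossi", "tax_id": "A1", "school": "X"}]

ids_tax, anon_tax = split_identifiers(tax, ["name", "tax_id"])
ids_school, anon_school = split_identifiers(school, ["name", "tax_id"])
linked = link_by_sim_code(anon_tax, anon_school)
```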
About AdminData quality - What is useful and for whom
For Istat users
• AD usability analysis function
  • Information on AD availability and usability
• AD monitoring function
  • To identify possible regulatory changes that may induce discontinuity and that are not notified in advance
  • To detect unexpected lack of quality
For the AD acquisition process unit
• Support the collection of AD requirements
• Reporting on AdminData usability
• AD supply monitoring function
  • To promptly check AD compliance with data requests
For AD holders
• Feedback to improve AD quality
  • To share data quality information with suppliers in a manner defined case by case
Implementation using metadata
How to MAKE and UPDATE AD quality evaluation and documentation efficiently, in a timely and generalized way, for all of the roughly 300 datasets that Istat acquires each year, corresponding to about a terabyte of data?
• In the context of Istat modernization: metadata-driven production paradigm
• Istat IT systems contain many useful metadata
Useful metadata [1]
DB supporting AdminData acquisition - ARCAM
• Metadata to identify ADsets
  • Agreements for delivery with the AD holders
  • AD internal users
  • List of Admin datasets available (source, data holder, dataset name, periodicity)
• Quality measures
  • Admin dataset relevance (users, related EU regulations)
  • For each delivery (reference time, period):
    • Punctuality
    • Timeliness
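The slides do not define the two per-delivery measures. The sketch below assumes the definitions commonly used in statistical quality reporting: punctuality as the lag between the agreed and the actual delivery date, and timeliness as the lag between the end of the reference period and the actual delivery date. The field names and dates are hypothetical and do not come from ARCAM.

```python
from datetime import date

def punctuality_days(agreed_delivery, actual_delivery):
    """Days between the agreed and the actual delivery date (positive = late)."""
    return (actual_delivery - agreed_delivery).days

def timeliness_days(reference_period_end, actual_delivery):
    """Days between the end of the reference period and the actual delivery."""
    return (actual_delivery - reference_period_end).days

# Hypothetical delivery record, as it might be derived from acquisition metadata.
delivery = {
    "reference_period_end": date(2016, 12, 31),
    "agreed_delivery": date(2017, 3, 1),
    "actual_delivery": date(2017, 3, 15),
}

punctuality = punctuality_days(delivery["agreed_delivery"], delivery["actual_delivery"])
timeliness = timeliness_days(delivery["reference_period_end"], delivery["actual_delivery"])
```

Because both measures depend only on dates already recorded for each delivery, they can be recomputed automatically whenever a new delivery is registered.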
Useful metadata [2]
Oracle DB of the Integrated Admin Microdata - SIM
• Metadata supporting ETL and integration procedures - SISME
  • List of variables
  • Categories for categorical variables (classifications)
  • Kinds of target units and kinds of relationships available
Useful metadata [2]
Oracle DB of the Integrated Admin Microdata - SIM
• To support data quality evaluation in the acquisition process (compliance analysis), a parameterized table contains for each file:
  • number of records
  • NOT NULL values
  • frequency distribution for categorical variables
• Technical checks for data compliance (variables, units)
• Quality monitoring: comparison of the quality measures with those of previously delivered datasets, to promptly contact the supplier in case of serious problems
• Percentage of missing values for each relevant variable (also in time series)
• Metadata completeness for category descriptions
• …
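As an illustration of the per-file measures listed above, the following sketch computes a record count, non-null counts, missing-value percentages and frequency distributions, and then compares two deliveries. It is a simplified stand-in and not the parameterized Oracle table itself: the `profile` and `flag_changes` helpers, the variable names and the 5-point threshold are all assumptions made for the example.

```python
from collections import Counter

def profile(records, variables, categorical):
    """Compute basic compliance measures for one delivered file."""
    n = len(records)
    not_null = {v: sum(1 for r in records if r.get(v) is not None) for v in variables}
    missing_pct = {v: (100.0 * (n - not_null[v]) / n if n else 0.0) for v in variables}
    freq = {
        v: Counter(r[v] for r in records if r.get(v) is not None)
        for v in categorical
    }
    return {"n_records": n, "not_null": not_null,
            "missing_pct": missing_pct, "freq": freq}

def flag_changes(current, previous, max_increase=5.0):
    """List variables whose missing-value percentage rose by more than
    max_increase percentage points (threshold is an assumed value)."""
    return [
        v for v, pct in current["missing_pct"].items()
        if pct - previous["missing_pct"].get(v, 0.0) > max_increase
    ]

# Hypothetical deliveries of the same dataset in two consecutive periods.
previous_file = [{"sex": "F", "age": 30}, {"sex": "M", "age": 41},
                 {"sex": "F", "age": 25}, {"sex": "M", "age": 52}]
current_file = [{"sex": "F", "age": 31}, {"sex": None, "age": 42},
                {"sex": "F", "age": 26}, {"sex": "M", "age": 53}]

prev = profile(previous_file, ["sex", "age"], ["sex"])
curr = profile(current_file, ["sex", "age"], ["sex"])
to_review = flag_changes(curr, prev)
```

A flagged variable would be a candidate for contacting the supplier, in line with the monitoring purpose described on the slide.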
Record linkage quality indicators
Oracle DB of the Integrated Admin Microdata - SIM
Metadata of the record linkage process (Bk data)
• Measures of linkage variable quality (which variables are available and their quality)
• Measures for monitoring the quality of the integration procedures (deterministic measures)
• Record linkage quality indicators (false positives and false negatives), to be estimated for given values of the monitoring quality measures
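The slides do not say how the false positive and false negative indicators are estimated. One common approach, assumed here, is to compare the links declared by the procedure with a reference set of true matches (for instance from a manually reviewed sample). The `linkage_error_rates` helper and the pair identifiers below are hypothetical.

```python
def linkage_error_rates(declared_links, true_matches):
    """Estimate linkage errors against a reference set of true matches.

    Both arguments are iterables of (input_record_id, base_unit_id) pairs.
    """
    declared = set(declared_links)
    truth = set(true_matches)
    false_positives = declared - truth   # links declared but wrong
    false_negatives = truth - declared   # true matches that were missed
    return {
        "false_positives": len(false_positives),
        "false_negatives": len(false_negatives),
        # Share of declared links that are wrong.
        "false_match_rate": len(false_positives) / len(declared) if declared else 0.0,
        # Share of true matches that the procedure missed.
        "missed_match_rate": len(false_negatives) / len(truth) if truth else 0.0,
    }

# Hypothetical reviewed sample: one wrong link and one missed match.
declared = [(1, "a"), (2, "b"), (3, "c"), (4, "x")]
truth = [(1, "a"), (2, "b"), (3, "c"), (5, "e")]
rates = linkage_error_rates(declared, truth)
```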
Other useful information/metadata
• ARCAM
• SIM
and also
• The DB that manages the National Statistical Programme, the legislative measure that defines statistics production for the NSS. It holds information on the AD that statistical processes can use
  • Measures of compliance with the rules on confidentiality
• SIQual, the information system on quality for Istat surveys (oriented to users of statistics)
  • Measures of the response burden reduction from using AD, output quality documentation about the use of AD…
• SUM, the Unified System of Metadata related to statistical data and processes, modelled according to GSIM. Very useful for standardizing metadata and supporting the modernization process; up to now used for statistics dissemination
  • Conceptual comparability of Admin variables (comparing input and output)
A critical point is systems interoperability
Each system is designed with a specific functionality, and it is necessary to define a line that connects their information:
• From the conceptual point of view
• From the technical point of view
Where possible, it is useful for IT specialists and statisticians to share objectives from the beginning, in order to standardize procedures and make metadata reusable.
Steps and progress
a. Adoption of an AdminData quality framework
b. Definition of measures within the framework
c. Deep analysis of existing processes, metadata and data flows
d. Classification of measures as:
   • implementable in the short term, with metadata already available
   • implementable in the medium term, with metadata available but not yet accessible
   • implementable in the long term, with information still to be acquired
e. Propose and support the interoperability of systems
f. Implementation of measures
g. Prototype using generalized IT systems, BI (MicroStrategy), to produce the Quality Report
Objectives to achieve
Quality report card of administrative data
• For each ADset (about 300 datasets, of which 100 are integrated in SIM)
• A report automatically generated using metadata and data from existing DBs, and periodically updated
+
• Other quality measures carried out by users, to share with others (coverage indicators, accuracy indicators)
Further development
• First quality check of the Integrated System (Activities Register)
Prospects and challenges
• Improve efficiency
• Improve quality
• Manage complexity
• Interaction with many actors
• Need to comply with the timeliness of the processes