ESSnet “Preparation of standardisation” Istat Generalized Software Systems Rome, 6-7 June 2011
Generalised solutions for statistical production
• One of the goals of the Italian National Statistical Institute (Istat) is to make generalised solutions available for each survey stage, from sample design to data analysis and dissemination. These are tools or systems designed to provide production functionalities, with the following features:
  • they implement advanced methodologies and techniques, so as to ensure the best possible quality of the information produced;
  • they can be operated with little or no further software development;
  • they come with adequate documentation and a user-friendly interface, usable also by non-expert users.
• Another desired requirement is interoperability, so that tools and systems can be shared with other members of the official statistical community.
Technologies and software: open vs proprietary
• Five years ago Istat took a strategic decision: to favour open source software and technologies as the instruments for developing generalised IT tools.
• This was due to various factors:
  • costs;
  • interoperability;
  • dynamics of open source communities.
• Since 2006 we have obtained the following results:
  • migration of all IT tools previously developed in SAS to new releases based on open source software, first of all R;
  • mass training on R (about 300 people);
  • reduced use of SAS in production, and a consequent reduction of SAS fees (from 1.0 to 0.4 million euros).
Sample design: MAUSS
• MAUSS (Multivariate Allocation of Units in Sampling Surveys) is based on Bethel’s method and makes it possible to:
  • determine the overall sample size;
  • allocate the total number of units among the strata of the population.
• Required inputs are:
  • desired precision for each estimate of interest;
  • variability of estimates in the domains of interest;
  • available budget and costs associated with data collection.
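The link between precision, variability and sample size can be illustrated in the simplest setting. The sketch below is not Bethel's multivariate method and not the MAUSS code: it is the univariate special case (Neyman allocation with equal unit costs and a single target variable), and every name and figure in it is a made-up assumption.

```python
import math


def neyman_allocation(strata_sizes, strata_sds, target_variance):
    """Smallest sample size, and its split among strata, for ONE target variable.

    Simplified univariate case with equal costs (an assumption of this sketch):
    n = (sum W_h S_h)^2 / (V + (1/N) sum W_h S_h^2), n_h proportional to W_h S_h.
    """
    N = sum(strata_sizes)
    weights = [N_h / N for N_h in strata_sizes]
    t = sum(W * S for W, S in zip(weights, strata_sds))
    fpc_term = sum(W * S ** 2 for W, S in zip(weights, strata_sds)) / N
    n = t ** 2 / (target_variance + fpc_term)
    alloc = [n * W * S / t for W, S in zip(weights, strata_sds)]
    return n, alloc


def stratified_variance(strata_sizes, strata_sds, alloc):
    """Variance of the stratified mean for a given allocation."""
    N = sum(strata_sizes)
    return sum(
        (N_h / N) ** 2 * S ** 2 * (1.0 / n_h - 1.0 / N_h)
        for N_h, S, n_h in zip(strata_sizes, strata_sds, alloc)
    )


# Hypothetical population: three strata, target variance of the mean 0.25
sizes = [6000, 3000, 1000]
sds = [5.0, 10.0, 50.0]
n, alloc = neyman_allocation(sizes, sds, target_variance=0.25)
print(math.ceil(n), [math.ceil(a) for a in alloc])
```

In practice the allocation is rounded up, and a stratum whose allocation exceeds its population size is taken in full; neither refinement is handled here.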
Record linkage: RELAIS
In many situations it is necessary to link data from different sources, taking care to attribute correctly the information pertaining to the same units. If both datasets share common, unique identifiers, joining them is straightforward, but this situation is very uncommon. Usually a subset of common variables has to be compared in order to perform the record linkage. The task is complicated by the fact that matching variables are generally subject to errors and missing values, so a methodology able to deal with these complex situations is required, together with software to apply it.
The RELAIS software (REcord Linkage At IStat) makes it possible to develop different and complex record linkage procedures, both deterministic and probabilistic. The methodology used for probabilistic record linkage is the Fellegi-Sunter method (Statistics Canada).
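The core of the Fellegi-Sunter decision rule can be sketched in a few lines. This is an illustration of the general technique, not the RELAIS implementation: the matching variables, the m- and u-probabilities and the two thresholds are hypothetical values chosen for the example (in practice they are estimated from the data), and missing values are not handled.

```python
import math

# Hypothetical parameters per matching variable:
# m = P(agreement | records refer to the same unit)
# u = P(agreement | records refer to different units)
M_U = {
    "surname": (0.95, 0.01),
    "birth_year": (0.90, 0.05),
    "municipality": (0.85, 0.10),
}


def match_weight(rec_a, rec_b, m_u=M_U):
    """Sum of log2 likelihood ratios over the matching variables."""
    total = 0.0
    for field, (m, u) in m_u.items():
        if rec_a[field] == rec_b[field]:
            total += math.log2(m / u)              # agreement weight
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement weight
    return total


def classify(weight, upper=5.0, lower=0.0):
    """Three-way decision; the thresholds are illustrative assumptions."""
    if weight >= upper:
        return "link"
    if weight <= lower:
        return "non-link"
    return "possible link"


a = {"surname": "rossi", "birth_year": 1970, "municipality": "roma"}
b = {"surname": "rossi", "birth_year": 1971, "municipality": "milano"}
print(classify(match_weight(a, a)), classify(match_weight(a, b)))
```

Pairs classified as "possible link" are the ones normally sent to clerical review.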
Statistical matching: package StatMatch
“Statistical matching” integrates data sources that have no observed units in common (or only a limited subset of them), but do share a subset of variables observed in both sources. The integration may take place at the micro level, by creating a synthetic dataset, or at an aggregate level, by inferring the values of the parameters that describe the relations among variables that are not jointly observed. This technique is currently used when two or more samples have to be combined, since in that case the probability of observing the same units is very small.
Past applications in Istat include the integration of the Household Budget Survey (Istat) samples with the Income Survey (Bank of Italy) samples, with the aim of building the Social Accounting Matrices (SAM), and the integration of the Labour Force Survey with the Time Use Survey. A useful tool for this purpose is the open source R package “StatMatch”.
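Micro-level integration can be pictured with a nearest-neighbour donor scheme: each record of the recipient file receives the missing variable from the closest record of the donor file, where closeness is measured on the common variables. The sketch below is a generic illustration in plain Python, not the StatMatch API; the variable names and the records are invented, and the common variables are left unscaled for brevity.

```python
def nearest_donor_match(recipients, donors, common_vars, donated_var):
    """Build a synthetic file: recipient records plus one donated variable."""
    fused = []
    for rec in recipients:
        # Donor with the smallest squared Euclidean distance on the common variables
        best = min(
            donors,
            key=lambda d: sum((rec[v] - d[v]) ** 2 for v in common_vars),
        )
        fused.append({**rec, donated_var: best[donated_var]})
    return fused


# Invented records: file A observes expenditure, file B observes income
file_a = [
    {"age": 30, "hh_size": 2, "expenditure": 1800},
    {"age": 62, "hh_size": 1, "expenditure": 1200},
]
file_b = [
    {"age": 29, "hh_size": 2, "income": 2500},
    {"age": 65, "hh_size": 1, "income": 1500},
]
synthetic = nearest_donor_match(file_a, file_b, ["age", "hh_size"], "income")
print(synthetic)
```

A synthetic file built this way relies on the assumption that the common variables explain the relation between the variables that are never jointly observed.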
Edit and imputation: CONCORD and ADAMSOFT
• The CONCORD and Adamsoft (“localise error” function) systems make it possible to localise errors in data automatically.
• Both implement the probabilistic approach known as the Fellegi-Holt approach (with variants), which gives a better identification of random errors and therefore greater accuracy.
• Compared to the deterministic approach (IF-THEN rules), the probabilistic approach minimises:
  • false positives (true values considered as errors);
  • false negatives (errors considered as true values).
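The principle behind Fellegi-Holt error localisation is to change the smallest set of fields that lets the record pass all edits. The sketch below shows that principle by brute-force search on a toy categorical record; it is not how CONCORD or Adamsoft work internally (production systems derive implied edits rather than enumerating values), and the domains and edits are hypothetical.

```python
from itertools import combinations, product

# Hypothetical variable domains
DOMAINS = {
    "age_class": ["child", "adult"],
    "marital_status": ["single", "married"],
    "employment": ["none", "employed"],
}

# Each edit lists a forbidden combination of values (a failure condition)
EDITS = [
    {"age_class": {"child"}, "marital_status": {"married"}},
    {"age_class": {"child"}, "employment": {"employed"}},
]


def is_consistent(record):
    """True if the record triggers none of the edits."""
    return not any(
        all(record[field] in values for field, values in edit.items())
        for edit in EDITS
    )


def localise_errors(record):
    """Smallest set of fields whose change makes the record consistent."""
    fields = list(DOMAINS)
    for size in range(len(fields) + 1):
        for subset in combinations(fields, size):
            for values in product(*(DOMAINS[f] for f in subset)):
                candidate = {**record, **dict(zip(subset, values))}
                if is_consistent(candidate):
                    return set(subset)
    return None  # no consistent record exists for these edits


record = {"age_class": "child", "marital_status": "married", "employment": "employed"}
print(localise_errors(record))
```

Here the record fails both edits, yet changing the single field `age_class` repairs it, whereas a pair of IF-THEN rules acting on each failed edit separately would alter two fields.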
Edit and imputation: CONCORD and ADAMSOFT
The CONCORD (CONtrollo e CORrezione Dati) system can be applied to surveys where categorical variables are prevalent, and can be used in the main household surveys.
Adamsoft (“localise error” function) can be applied to surveys where continuous variables are prevalent, i.e. surveys on businesses and institutions.
Selective Editing: package SelEMix
• SelEMix (Selective Editing via Mixtures) is an R package for the identification of influential errors in numerical data. The methodology is based on latent class models (contamination models).
• Required inputs are:
  • model choice (normal or log-normal);
  • sample weights;
  • accuracy threshold;
  • technical parameters…
• Outputs are:
  • estimates of model parameters;
  • list of influential units (at the given threshold);
  • predictions of “true” values given observed values.
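How a contamination model turns observed values into error probabilities, predictions and a list of influential units can be sketched for one variable. This is a simplified illustration and not the SelEMix code or interface: the model parameters are taken as given instead of being estimated, the error is assumed additive and normal with variance inflation factor `lam`, and the selection rule (flag units by decreasing score until the residual expected error falls below the threshold) is an assumption of this sketch.

```python
import math


def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))


def posterior_error_prob(y, mu, sigma, lam, pi_err):
    """P(unit is erroneous | observed y) under a two-component normal mixture."""
    clean = (1 - pi_err) * normal_pdf(y, mu, sigma)
    contaminated = pi_err * normal_pdf(y, mu, sigma * math.sqrt(lam))
    return contaminated / (clean + contaminated)


def predict_true_value(y, mu, sigma, lam, pi_err):
    """E(true value | y): y if clean, shrunk towards mu if contaminated."""
    tau = posterior_error_prob(y, mu, sigma, lam, pi_err)
    return (1 - tau) * y + tau * (mu + (y - mu) / lam)


def influential_units(values, weights, threshold, **model):
    """Indices to review, by decreasing weighted expected error."""
    preds = [predict_true_value(y, **model) for y in values]
    total = abs(sum(w * p for w, p in zip(weights, preds)))
    scores = [w * abs(y - p) / total for y, p, w in zip(values, preds, weights)]
    residual = sum(scores)
    selected = []
    for i in sorted(range(len(values)), key=lambda i: scores[i], reverse=True):
        if residual <= threshold:
            break
        selected.append(i)
        residual -= scores[i]
    return selected


# Hypothetical data (e.g. on the log scale) and hypothetical parameters
model = {"mu": 10.0, "sigma": 0.2, "lam": 100.0, "pi_err": 0.05}
values = [10.0, 10.2, 9.9, 10.1, 16.0]
print(influential_units(values, [1.0] * 5, threshold=0.01, **model))
```

The threshold bounds the expected relative error left in the estimate after the selected units have been reviewed, which is why it can be set directly in terms of accuracy.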
Selective Editing: package SelEMix
Outlying observations are also returned, on the basis of their (posterior) probability of being erroneous. They are not necessarily influential errors.
Example: Small and Medium Enterprises Survey
Selective Editing: package SelEMix
• SelEMix can also be used to impute missing items robustly. Imputations are obtained as expectations of the true values given the observed (non-missing) values.
• Advantages:
  • the method does not require a set of cleaned data for tuning the parameters;
  • the threshold is directly related to the expected residual error in the data.
• Disadvantages:
  • departures from the model assumptions (in particular zero inflation) can degrade performance;
  • it is difficult to take balance edits into account explicitly.
Calculation of sampling estimates: ReGenesees
• ReGenesees (R evolution GENEralised software for Sampling Errors and Estimates in Surveys) is a generalised software that can be used to:
  • assign sampling weights to observations, taking account of the survey design, of total non-response and of the availability of auxiliary information (calibration estimators), in order to reduce the bias and variability of the estimates (thus maximising their accuracy);
  • produce the estimates of interest;
  • calculate sampling errors to document the accuracy of the estimates, and present them synthetically through regression models;
  • evaluate the efficiency of the sample (deff) for its optimisation and documentation.
• It makes it possible to:
  • build the vector of known totals for the calibration in an assisted and controlled way;
  • if required, calculate the known totals vector automatically by direct access to the sampling frame data;
  • calculate the sampling variance for estimators of any complexity.
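The idea of calibrating weights to known totals can be shown with raking (iterative proportional fitting), one of the simplest calibration techniques. The sketch is plain Python, not the ReGenesees interface; the units, the design weights and the known totals are invented for the example.

```python
def rake(units, design_weights, margins, max_iter=100, tol=1e-10):
    """Adjust weights until the weighted counts reproduce each known margin.

    margins: {variable: {category: known population total}}
    """
    weights = list(design_weights)
    for _ in range(max_iter):
        max_change = 0.0
        for var, totals in margins.items():
            # Current weighted total of each category of this variable
            current = {}
            for unit, w in zip(units, weights):
                current[unit[var]] = current.get(unit[var], 0.0) + w
            # Rescale every unit by known total / current total of its category
            for i, unit in enumerate(units):
                factor = totals[unit[var]] / current[unit[var]]
                weights[i] *= factor
                max_change = max(max_change, abs(factor - 1.0))
        if max_change < tol:
            break
    return weights


def weighted_total(units, weights, var, category):
    return sum(w for u, w in zip(units, weights) if u[var] == category)


# Invented sample of six units with their design weights
sample = [
    {"sex": "m", "region": "north"}, {"sex": "m", "region": "north"},
    {"sex": "m", "region": "south"}, {"sex": "f", "region": "north"},
    {"sex": "f", "region": "south"}, {"sex": "f", "region": "south"},
]
d = [12.0, 8.0, 10.0, 9.0, 11.0, 10.0]
known = {"sex": {"m": 45.0, "f": 55.0}, "region": {"north": 60.0, "south": 40.0}}
w = rake(sample, d, known)
print([round(x, 2) for x in w])
```

After calibration the weighted sample reproduces the known totals exactly, which is what ties the estimates to the auxiliary information.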
Disclosure control
• ARGUS is a generalised software used to ensure the confidentiality of both microdata and aggregate data. Two versions are available:
  • mu-Argus is used for microdata:
    • it assesses the risk of disclosure associated with a given dataset;
    • if this risk exceeds a prefixed threshold, the software makes it possible to apply different protection techniques (recoding, local suppression, microaggregation).
  • tau-Argus is used for aggregate data (tables). It makes it possible to:
    • identify the sensitive cells (using the dominance rule or the prior-posterior rule);
    • apply a series of protection techniques (re-design of tables, rounding or suppression).
• Protection techniques are based on algorithms that keep the loss of information as low as possible. In this sense, the quality aspect that is improved is accessibility.
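The dominance rule used to identify sensitive cells is simple to state: a cell is sensitive when its n largest contributors account for more than k percent of the cell total. The sketch below illustrates the rule only; it is not tau-Argus code, and the default values n=1 and k=75 are assumptions chosen for the example.

```python
def is_sensitive_dominance(contributions, n=1, k=75.0):
    """(n, k) dominance rule applied to the contributions of one table cell."""
    total = sum(contributions)
    if total <= 0:
        return False  # empty or zero cell: the rule is not applied in this sketch
    top_n = sum(sorted(contributions, reverse=True)[:n])
    return 100.0 * top_n / total > k


# Hypothetical turnover of the enterprises contributing to two cells
print(is_sensitive_dominance([900, 50, 50]))    # one unit holds 90% of the total
print(is_sensitive_dominance([400, 300, 300]))  # largest unit holds 40%
```

Cells flagged this way are the ones to which a protection technique such as suppression or rounding is then applied.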
Generalised solutions for statistical production
Software and the related information are developed, or gathered and tested, through an ad-hoc virtual space created by Istat to meet the needs of interested users: the Osservatorio Tecnologico per i Software generalizzati (OTS), or Technological Observatory for generalised software.
OTS has been released and is available at the following address: http://www.istat.it/strumenti/metodi/software/
These pages contain the software tools developed in Istat (in fully tested versions), available for download. For software systems that are not the property of Istat, indications are provided on how to find them.
‘Implementing a quality system in statistics’ Rome, 25 – 29 May 2009 Generalised solutions for statistical production