This handbook covers advanced topics in data validation for international trade in goods statistics, updated in 2012. It includes validation rules for Intrastat and Extrastat data, aggregated data to be sent to Eurostat, and a validation schema applied by Eurostat. Various checks and controls are discussed, focusing on improving data quality, structural investigations, outlier detection, and more.
Data validation handbook ADVANCED ISSUES IN INTERNATIONAL TRADE IN GOODS STATISTICS ESTP training course 3 April 2014 Evangelos Pongas Eurostat, Unit G.5
Background • At the Data Validation Task Force meeting of April 2006 Eurostat proposed to harmonise data validation in International Trade in Goods Statistics (ITGS) among Member States • Since that time Eurostat has been constantly working on keeping the validation framework up-to-date. • The last revision of this document was made in 2012. The main objective of this revision was to keep the document up to date, make it more user-friendly, reflect recent changes in the methodological requirements and include new data validation rules proposed by MS.
Contents of the manual • Intrastat – Validation rules applicable at PSI level • Micro-data validation rules for Intrastat and Extrastat data performed at NSA • Validation of aggregated data to be sent to Eurostat (DOCMET400) • Data validation schema applied by Eurostat for validation of Extrastat and Intrastat detailed data transmitted by Member States • Metadata
Intrastat – Validation rules applicable at PSI level • This chapter presents the data validation rules which could be introduced in the IT applications used by PSIs for filling in and transmitting Intrastat data to the NSA. • The controls relate mostly to the data elements collected (validity checks). • Some credibility checks are provided only as possibilities; whether they are applied depends on the national IT system for Intrastat/Extrastat data collection.
Micro-data validation rules for Intrastat and Extrastat data performed at NSA • Rules that are recommended to be applied by the NSA to the Intrastat and Extrastat micro-data which have been submitted by PSIs in Intrastat systems and provided by the Customs administration. • The controls relate to the indicators available in the database (validity checks) and to various credibility checks which can be performed using historical data and mathematical methods (outlier detection systems). • Validity references to Doc MET 400 and other major characteristics of the indicators have been included.
Validation of aggregated data to be sent to Eurostat (Doc MET 400) • The validation rules are described separately for Intrastat and Extrastat, as two files are transmitted to Eurostat. • Validation rules are based on the sections of Doc MET 400. • The controls relate to the compliance of the data with the format of Doc MET 400 and with legislation. • It is assumed that by the time the data are prepared for transmission to Eurostat, all credibility, logical and outlier detection controls have already been executed on the detailed data.
Eurostat data validation schema • Consolidates all validation rules executed at Eurostat and indicates to Member States which of the controls are recommended to be carried out at national level before sending the data to Eurostat. • There are four steps of validation rules, of which the first three could be implemented at national level: • Preliminary global data file checks • Record validation checks • Post-record global validation checks • Advanced post-validation checks
Improve quality: Sniff data • Is the number of records acceptable? • Do total value and net mass correspond to economic and political trends (expectations)? • What is the volume of missing (and/or estimated) data? • Check at chapter level
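The quick checks on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout (keys `cn8`, `value`, `net_mass`), the function name `sniff` and the record-count bounds are hypothetical and are not taken from the handbook; the chapter is assumed to be the first two digits of the commodity code.

```python
from collections import defaultdict


def sniff(records, min_records, max_records):
    """Return quick summary figures for one batch of trade records.

    Assumed (hypothetical) record layout: a dict with keys
    'cn8' (8-digit commodity code as a string), 'value' and
    'net_mass' (numbers, or None when the item is missing).
    """
    n = len(records)
    total_value = sum(r["value"] for r in records if r["value"] is not None)
    total_mass = sum(r["net_mass"] for r in records if r["net_mass"] is not None)
    # A record counts as incomplete if either measure is missing.
    missing = sum(
        1 for r in records if r["value"] is None or r["net_mass"] is None
    )
    # Chapter = first two digits of the commodity code (assumption).
    value_by_chapter = defaultdict(float)
    for r in records:
        if r["value"] is not None:
            value_by_chapter[r["cn8"][:2]] += r["value"]
    return {
        "record_count_ok": min_records <= n <= max_records,
        "total_value": total_value,
        "total_net_mass": total_mass,
        "missing_share": missing / n if n else 0.0,
        "value_by_chapter": dict(value_by_chapter),
    }
```

The totals and the chapter breakdown would then be compared with previous periods or with expectations; that comparison is left to the analyst here.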
Improve quality: Structural investigations 1/2 • High rates of data imputation complicate the use of survey data. • Is reporting error in your data likely to be non-random? If so, your results will be biased. • Do you think reporting error is more likely in some variables than in others? • Can you use other variables to identify sources of potential bias? For example, transport mode, origin, exchange rate etc.
Improve quality: Structural investigations 2/2 • Does the structure of the survey itself change the probability of reporting error? For example, does the order in which the survey asks questions suggest that people are less likely to answer certain questions because of that order, or might people have some reason for giving less reliable answers at one point in the survey than at some other point? • Do you provide users with summary results? For example, unit values, unit net-mass.
Improve quality: Outlier detection • Transformation (log, standardisation) • Unit values, unit net-mass • Parametric controls (average, standard deviation: trimmed or not) • Non-parametric controls (median, quartiles, MAD) • Techniques for few data points (fewer than 8) • Weight the importance of outlying data (with a view to publication) • Outliers at company level
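As an illustration of how several of these points combine, the sketch below log-transforms unit values (value divided by net mass) and flags observations using a non-parametric rule based on the median and the MAD. It is a minimal sketch, not the method prescribed by the handbook: the function name `mad_outliers`, the scaling constant 0.6745 and the threshold 3.5 are assumptions (the last two follow a common convention for the modified z-score).

```python
import math
from statistics import median


def mad_outliers(values, masses, threshold=3.5):
    """Return indices of records whose log unit value is an outlier.

    Unit value = value / net mass. The score is the modified z-score
    0.6745 * (x - median) / MAD; constant and threshold are assumptions.
    """
    logs = [math.log(v / m) for v, m in zip(values, masses)]
    med = median(logs)
    mad = median([abs(x - med) for x in logs])
    if mad == 0:
        # At least half of the observations are identical:
        # the score is undefined, so nothing is flagged here.
        return []
    return [
        i
        for i, x in enumerate(logs)
        if abs(0.6745 * (x - med) / mad) > threshold
    ]


# Nine transactions of the same product; one value looks mis-keyed.
values = [100, 110, 95, 105, 98, 102, 5000, 101, 99]
masses = [10] * 9
suspects = mad_outliers(values, masses)
```

The median and MAD are used instead of the mean and standard deviation because a single extreme record barely moves them. Flagged records would still need weighting by their importance before any correction is made.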
Improve quality: Outlier detection, global view • Plot kurtosis (y-axis) against skewness (x-axis) for 1200 series. Series containing outliers are situated at the edges of the cloud of points.
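The two coordinates of each point in such a plot can be computed per series as sketched below. This is an assumption-laden illustration: the slide does not say which estimator was used, so the simple population-moment formulas are taken here, and kurtosis is expressed as excess kurtosis (0 for a normal distribution). The function name `skewness_kurtosis` is hypothetical.

```python
def skewness_kurtosis(series):
    """Return (skewness, excess kurtosis) from population moments."""
    n = len(series)
    mean = sum(series) / n
    m2 = sum((x - mean) ** 2 for x in series) / n
    m3 = sum((x - mean) ** 3 for x in series) / n
    m4 = sum((x - mean) ** 4 for x in series) / n
    if m2 == 0:
        # Constant series: both measures are undefined; (0, 0) is
        # returned purely as a convention of this sketch.
        return 0.0, 0.0
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0


# A regular series and the same series with one extreme observation.
regular = skewness_kurtosis([1, 2, 3, 4, 5])
with_outlier = skewness_kurtosis([1, 2, 3, 4, 50])
```

A single extreme observation pulls the skewness of the second series well away from zero, which is why such series end up at the edge of the cloud of points.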
Thank you for your attention • Any questions?