This presentation discusses the modernization initiatives of Statistics Canada and their efforts to stay relevant in the data revolution. It explores leading-edge methods and research priorities, including integration and granularity, data suppression, data availability, confidentiality and access, random tabular adjustment, data quality measurement and communication, alternative data sources, handling big data volume challenges, combining classic and leading-edge methods, and machine learning for autocoding. The upcoming work and priorities for 2019-2020 are also highlighted.
Paving the way forward for a modern and responsive national statistical office, by Susie Fortier, International Cooperation and Methodology Innovation Centre. Prepared for the New Techniques and Technologies for Statistics conference (NTTS 2019), Brussels, March 12 to 14, 2019
CONTENT The content of this presentation represents the position of the author and may not necessarily represent that of Statistics Canada
BACKGROUND • To stay relevant and respond to the data revolution, StatCan is in the midst of a large modernization initiative. • Organized around 5 pillars: • Originally illustrated with 4 pathfinder projects: Cannabis, Tourism, Low carbon economy, Housing.
LEADING EDGE METHODS AND R&D • Innovations vs Research & Development • Methodology Research and Development Program • Theory and framework; solving current horizontal issues; new areas, connect with researchers, build capacity. • Annual publication of achievements: 12-206-X in Statistics Canada publication catalogue
2018-2019 METHODOLOGY RESEARCH PRIORITIES • Integration and granularity • Data suppression and data availability • New sources • Data volume challenge • Classic and leading edge
INTEGRATION AND GRANULARITY • Expand and facilitate the use of small area estimations with a variety of auxiliary sources (Fay-Herriot model) • Current work : variable preparation and selection, model validation, local diagnostics. • Expand and facilitate the use of record linkage methods • Machine learning assistance for threshold selection in probabilistic RL methods • Impact of RL errors in linked data analysis • Data integration review paper (Yung, Beaumont, Dasylva, Fortier, Nambeu and Sango, 2019)
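As a rough illustration of the Fay-Herriot model named above: each area has a direct survey estimate with a known sampling variance, and the small area estimate is a weighted blend of that direct estimate and a regression prediction from auxiliary data. The sketch below assumes one auxiliary variable plus an intercept and treats the model variance `sigma2_v` as known; in practice it is estimated from the data. The function name and arguments are hypothetical, chosen for illustration only, and this is not presented as StatCan's implementation.

```python
def fay_herriot_estimates(direct, psi, x, sigma2_v):
    """Composite small area estimates under a basic Fay-Herriot model.

    direct   : direct survey estimates, one per area
    psi      : known sampling variances of the direct estimates
    x        : one auxiliary variable per area (must not be constant)
    sigma2_v : model (between-area) variance, assumed known in this sketch
    """
    # Weighted least squares fit of direct ~ intercept + slope * x,
    # with weights 1 / (sigma2_v + psi_i).
    w = [1.0 / (sigma2_v + p) for p in psi]
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, direct)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar)
              for wi, xi, yi in zip(w, x, direct))
    slope = sxy / sxx
    intercept = ybar - slope * xbar

    estimates = []
    for yi, pi, xi in zip(direct, psi, x):
        # Shrinkage factor: close to 1 when the direct estimate is precise.
        gamma = sigma2_v / (sigma2_v + pi)
        synthetic = intercept + slope * xi
        estimates.append(gamma * yi + (1.0 - gamma) * synthetic)
    return estimates
```

Areas with a large sampling variance are pulled toward the regression prediction, while areas with a precise direct estimate stay close to it. Variable selection, model validation and local diagnostics, listed as current work on the slide, are outside the scope of this sketch.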
CONFIDENTIALITY AND ACCESS (data suppression, data availability) • Operationalize and further develop new techniques such as Random Tabular Adjustment (RTA) • Stinner, M. (2017) “Disclosure control and random tabular adjustment”, in SSC proceedings. • Further evaluate, operationalize and develop techniques for micro data access • Analytically-rich synthetic data file or scientific use data files • Use case with R package synthpop: 2006 Census linked to mortality data (used as open data in an analytical hackathon) • Sallier, K. and Girard, C. (2018), paper presented at Privacy in Statistical Databases conference.
RANDOM TABULAR ADJUSTMENT (RTA) • Confidentiality method based on perturbation and inference. • Our contribution is mostly on the risk measurement, based on Bayesian theory and our current sensitivity approaches. • RTA ensures that there is a certain level of uncertainty for any attempted inference about an individual's value, considering existing uncertainties (from sampling or non-response) and added noise (random value) when needed.
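The general perturbation idea can be sketched as follows: add random noise to cells flagged as sensitive, then recompute the marginal totals so the published table stays additive. This is a toy illustration only, not the RTA algorithm itself. In particular, the slide indicates that the amount of noise follows from a Bayesian risk measure that accounts for existing uncertainty, whereas here the noise standard deviation `noise_sd` is simply a hypothetical input, as are the function name and the `sensitive` flags.

```python
import random

def perturb_table(cells, sensitive, noise_sd, seed=0):
    """Toy perturbation of a two-way table.

    cells     : list of rows, each a list of cell values
    sensitive : same shape as cells, True where a cell needs protection
    noise_sd  : standard deviation of the added noise (hypothetical input;
                a real method would derive it from a disclosure risk measure)
    """
    rng = random.Random(seed)
    adjusted = []
    for row, flags in zip(cells, sensitive):
        # Only sensitive cells receive noise; other cells are published as is.
        adjusted.append([v + rng.gauss(0.0, noise_sd) if f else v
                         for v, f in zip(row, flags)])
    # Marginals are recomputed from the adjusted cells, so the table
    # remains additive after perturbation.
    row_totals = [sum(r) for r in adjusted]
    col_totals = [sum(c) for c in zip(*adjusted)]
    grand_total = sum(row_totals)
    return adjusted, row_totals, col_totals, grand_total
```

Unlike cell suppression, every cell in the output carries a value, and only the sensitive cells and the totals that include them differ from the original table.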
QUALITY: MEASURE AND COMMUNICATE • The methods and language for measuring and communicating data quality mostly originate from sampling theory. • Expansion needed in a world less based on surveys: • exploring the theoretical framework • addressing the immediate needs of data users and producers • Recent work: quality awareness toolkit, upcoming revision to quality guidelines, National Quality Assurance Framework (UN), Federal data strategy.
Alternative data sources (new sources) • Address methodological challenges with the newest sources of alternative data • Scanner data, web scraping, crowdsourcing, smart meters, etc. • Highlighted use case: Wastewater • Reedman and Brennan (2019). Experimental Statistics from an unlikely source, NTTS 2019 poster session
Big data volume challenge • Fully explore mathematical and statistical options to handle data volume challenges. • Complements IT options such as cloud and high-performance computing environments, but with a mathematical and statistical focus. • Sampling, parallelisation and optimisation algorithms.
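One well-known example of a sampling approach to data volume is reservoir sampling, which draws a fixed-size simple random sample in a single pass over a stream too large to hold in memory. The slide does not say which sampling methods are under study, so the choice of this technique is an assumption made purely for illustration.

```python
import random

def reservoir_sample(stream, k, seed=0):
    """Draw k items uniformly at random from an iterable of unknown length,
    reading each item once and keeping at most k items in memory."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir
```

For example, `reservoir_sample(range(10_000_000), 1000)` returns 1000 of the values without materializing the full range. If the stream has fewer than `k` items, all of them are returned.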
Combine to conquer: classic and leading-edge methods • Further build our capacity to combine classic and leading-edge methods. • Non-probabilistic data: use of non-probability data sources in a scientifically rigorous framework (Beaumont, 2018) • Data Science: continue and expand experimentation with machine learning and AI techniques. • StatCan has both a Data Science Accelerator and a Data Science Centre of Excellence.
Machine Learning for autocoding • Many ongoing research and accelerator projects on the use of AI and machine learning techniques. • The most advanced ones are related to text classification (autocoding) • Chu, Yeung, Laroche and Fortier (2018), “Exploring modern coding Method”, presented to Statistics Canada’s Advisory Committee on Statistical Methods. • Unbalanced training data, transfer learning, hyperparameter tuning, quality assurance in production, (unsupervised learning).
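To make the autocoding task concrete, the sketch below trains a small multinomial naive Bayes classifier that assigns a classification code to a free-text response and reports a confidence score. The slide does not name the models used in the StatCan projects, so naive Bayes is an assumption chosen for brevity; the function names, example texts and codes are all hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(examples):
    """examples: list of (text, code) pairs of already coded responses."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, code in examples:
        tokens = text.lower().split()
        class_counts[code] += 1
        word_counts[code].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def predict_nb(model, text):
    """Return (most likely code, posterior probability of that code)."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    # Words never seen in training are ignored.
    tokens = [t for t in text.lower().split() if t in vocab]
    scores = {}
    for code, n in class_counts.items():
        denom = sum(word_counts[code].values()) + len(vocab)
        score = math.log(n / total)  # class prior
        for t in tokens:
            # Laplace (add-one) smoothing avoids zero probabilities.
            score += math.log((word_counts[code][t] + 1) / denom)
        scores[code] = score
    best = max(scores, key=scores.get)
    # Normalize log scores into a posterior probability for the best code.
    z = sum(math.exp(s - scores[best]) for s in scores.values())
    return best, 1.0 / z

# Hypothetical training data: free-text job titles with made-up codes.
model = train_nb([
    ("primary school teacher", "EDU"),
    ("high school teacher", "EDU"),
    ("registered nurse hospital", "HEALTH"),
])
code, confidence = predict_nb(model, "school teacher")
```

One possible use of the confidence score, offered here as an assumption rather than a described practice, is to route low-confidence responses to manual coding. The class prior term also shows where unbalanced training data enters: frequent codes get a head start over rare ones.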
UPCOMING WORK AND PRIORITIES • 2018-2019 annual report • 2019-2020 priorities: • Goals: • More timely • More detailed • More efficient • Diagram labels: Valid statistical inference; Quality; Rigour and ethics
THANK YOU / MERCI For more information, please contact: Susie.Fortier@canada.ca The content of this presentation represents the position of the author and may not necessarily represent that of Statistics Canada. Methods described may be planned approaches that have not yet been implemented in Statistics Canada’s programs.
Simple Fictitious Example • Suppression solution (G-Confid): the sensitive cell is suppressed, followed by secondary suppressions • Loss of $175,008 with cell suppressions (66% of internal cells) • Perturbation solution (RTA): determine and add protection noise • Marginal cells are adjusted accordingly
Figure panels: Suppression (G-Confid); Perturbation (RTA)