Dealing with Big Data for Official Statistics: IT Issues Giulio Barcaroli Stefano De Francisci

Presentation Transcript


  1. Dealing with Big Data for Official Statistics: IT Issues
     Giulio Barcaroli, Stefano De Francisci, Monica Scannapieco, Donato Summa
     Istat – Italian National Institute of Statistics

  2. Outline
     • Background: Istat Big Data strategy and experimental projects
     • IT issues in experimental projects
     • Final remarks
     Monica Scannapieco, Dealing with Big Data for Official Statistics: IT Issues, MSIS 2014

  3. Istat Big Data Strategy - 1
     • Istat (the Italian National Institute of Statistics) set up a technical Commission to guide investment in Big Data adoption in statistical production processes
     • Duration: from February 2013 to February 2015
     • Members come from different areas: Official Statistics, Academia, Private Sector

  4. Objective of the talk
     • I will NOT deal with (just) technological issues
     • I will deal instead (mainly) with IT methodological issues
     • Example:
       • MapReduce-Hadoop: open source framework
       • Map-Reduce programming model: are OS (Official Statistics) statistical methods mapreduce-able?
       • Current research investigates the mapreduce-ability of (classes of) computational problems
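
To make the "mapreduce-able" question concrete, here is a minimal sketch in plain Python (not Hadoop) of a statistic that fits the model well: mean and variance, computed from partial summaries that can be merged in any order. The function names and the toy partitions are illustrative assumptions, not part of the talk.

```python
from functools import reduce

def map_record(x):
    # Map step: turn one observation into a partial summary
    # (count, sum, sum of squares).
    return (1, x, x * x)

def combine(a, b):
    # Reduce step: component-wise addition is associative and commutative,
    # so partial summaries can be merged in any order across partitions.
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

def mean_and_variance(partitions):
    # Each non-empty partition is reduced independently (as a worker would),
    # then the per-partition summaries are reduced again.
    partials = [reduce(combine, map(map_record, part)) for part in partitions if part]
    n, s, ss = reduce(combine, partials)
    mean = s / n
    # Population variance; this textbook formula can lose precision
    # on large or badly scaled data.
    variance = ss / n - mean * mean
    return mean, variance

# Toy data split into three hypothetical partitions.
partitions = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
mean, variance = mean_and_variance(partitions)
```

By contrast, an exact median cannot be merged from fixed-size per-partition summaries in the same way, which is the kind of distinction the mapreduce-ability question is about.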

  5. Istat Big Data Strategy - 2
     • The Commission will release a strategy for Big Data adoption
     • Three experimental projects launched and monitored by the Commission:
       • Persons and Places
       • Labour Market Estimation based on Google Trends
       • ICT Usage in enterprises based on Internet as a Data Source (IaD)
     • Status: advanced implementation (first results already available)

  6. Persons and Places
     • Purpose
       • Production of the origin/destination matrix of daily mobility for work and study purposes, at the spatial granularity of municipalities, starting from phone (tracking) data
     • Actors involved in the project
       • Istat
       • National Research Council
       • University of Pisa
     • Methodology
       • Inference of population mobility profiles from GSM Call Data Records (CDRs)
       • Comparison with data derived from administrative sources
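
As a rough illustration of what an origin/destination matrix at municipality level looks like, the sketch below aggregates simplified CDR-like records. Everything specific here is an assumption made for the example: the record layout `(user_id, hour_of_day, municipality)`, the 9-18 daytime window, and the heuristic "origin = most frequent night-time municipality, destination = most frequent daytime municipality". It is not the inference method used in the project.

```python
from collections import Counter, defaultdict

# Hypothetical, simplified CDR-like records: (user_id, hour_of_day, municipality).
records = [
    ("u1", 2, "Pisa"), ("u1", 23, "Pisa"), ("u1", 10, "Livorno"), ("u1", 15, "Livorno"),
    ("u2", 1, "Lucca"), ("u2", 11, "Pisa"), ("u2", 16, "Pisa"),
    ("u3", 3, "Pisa"), ("u3", 12, "Pisa"),
]

def most_frequent(places):
    # Most common municipality in a list of observations.
    return Counter(places).most_common(1)[0][0]

def od_matrix(records, day_start=9, day_end=18):
    # Split each user's observations into night-time and daytime ones.
    night, day = defaultdict(list), defaultdict(list)
    for user, hour, municipality in records:
        bucket = day if day_start <= hour < day_end else night
        bucket[user].append(municipality)
    # Count users per (origin, destination) pair; users lacking either
    # night-time or daytime observations are skipped.
    matrix = Counter()
    for user in night.keys() & day.keys():
        matrix[(most_frequent(night[user]), most_frequent(day[user]))] += 1
    return matrix

matrix = od_matrix(records)
```

Each key of `matrix` is an (origin, destination) pair of municipalities and each value is the number of users assigned to that flow, which is the shape of table that can then be compared with administrative sources.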

  7. Labour Market Estimation
     • Purpose
       • Test the usage of Google Trends for forecasting and nowcasting purposes in the Labour Force domain
     • Actors involved in the project
       • Istat: Central Methodology Sector and Labour Force Survey
     • Methodology
       • Autoregressive model vs. usage of Google Trends as prediction models
       • Comparison extended to macroeconomic prediction models
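
The comparison named in the methodology can be sketched on toy numbers: an AR(1) benchmark that predicts the series from its own previous value, against a regression on a same-period search index. The two series below are made up for illustration (they are not Istat or Google data), the helper names are hypothetical, and the comparison is in-sample only; a real evaluation would use out-of-sample forecasts.

```python
def ols(x, y):
    # Simple least-squares fit y = a + b*x; returns (a, b).
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b

def mse(pred, actual):
    # Mean squared error between predictions and observed values.
    return sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(actual)

# Made-up toy series, for illustration only.
unemployment = [10.0, 10.4, 10.1, 10.9, 11.2, 10.8, 11.5, 11.9]
search_index = [50, 55, 51, 58, 62, 59, 65, 68]

# Benchmark: AR(1), y_t regressed on y_{t-1}.
a_ar, b_ar = ols(unemployment[:-1], unemployment[1:])
ar_pred = [a_ar + b_ar * y for y in unemployment[:-1]]

# Alternative: y_t regressed on the same-period search index (nowcasting).
a_gt, b_gt = ols(search_index[1:], unemployment[1:])
gt_pred = [a_gt + b_gt * x for x in search_index[1:]]

ar_mse = mse(ar_pred, unemployment[1:])
gt_mse = mse(gt_pred, unemployment[1:])
```

Comparing `ar_mse` with `gt_mse` is the toy analogue of the model comparison; on these constructed numbers the search-index model fits better simply because the index was built to track the series.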

  8. ICT Usage in Enterprises
     • Purpose
       • Evaluate the possibility of adopting Web scraping and text mining techniques for estimates on the usage of ICT by enterprises and public institutions
     • Actors involved in the project
       • Istat: Survey on the ICT Usage in Enterprises
       • Cineca (Consortium of Italian universities, National Research Council and Ministry of Education and Research)
     • Methodology
       • Scraping of web sites for data extraction
       • Supervised classification task

  9. Features of Experimental Projects

  10. Statistical Phases for Big Data Management
      [Diagram labels: “Collapsed phases”; “Inversion of the two phases”]
      • Principal selected phases
      • Inversion due to the fact that the “traditional” design phase is no longer present for Big Data
      • Collapse due to the fact that the same methods can be used for both phases
      • Other phases, e.g. Dissemination, not (yet) involved in Big Data

  11. Collect: IT issues - 1
      • Access to Big Data sources:
        • Type 1: access control mechanisms deliberately set up by the Big Data provider, and/or
        • Type 2: technological barriers
      • Google Trends:
        • Absence of APIs, which prevents software programs from accessing Google Trends (GT) data
        • As a result, the usage of such a facility in production processes cannot be foreseen

  12. Collect: IT issues - 2
      • ICT Usage: both type 1 and type 2 problems
        • 8,647 URLs of enterprises’ Web sites, but only about 5,600 were actually accessed
        • Type 1: scrapers deliberately blocked, e.g. by mechanisms that verify human access to sites, like CAPTCHA
        • Type 2: some technologies, such as Adobe Flash, simply prevented access to the content

  13. Design: IT Issues - 1
      • Even if a traditional survey design cannot take place, the problem of “understanding” the data is still present
      • Semantic extraction techniques
      • Knowledge representation and natural language processing
      • E.g.: FRED (http://wit.istc.cnr.it/stlab-tools/fred) extracts an ontology from sentences in natural language

  14. Design: IT Issues - 2
      • ICT Usage:
        • Human inspection refined by:
          • Some NLP techniques: tokenization, stemming, stopword removal, etc.
          • Semantic enrichment by semantic dictionaries (WordNet)
          • Images: tag extraction
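
The NLP steps listed above (tokenization, stopword removal, stemming) can be sketched with the standard library alone. The stopword list is a tiny illustrative one and `naive_stem` is a toy suffix stripper, not the Porter algorithm or whatever tool the project used; English text is assumed purely for readability.

```python
import re

# Tiny illustrative stopword list; a real pipeline would use a full,
# language-specific list.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "on",
             "for", "our", "is", "are", "you", "can"}

def tokenize(text):
    # Lowercase the text and keep alphabetic runs only.
    return re.findall(r"[a-z]+", text.lower())

def naive_stem(token):
    # Toy stemmer: strip a few common English endings, keeping at least
    # three characters of the stem.
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    # Tokenize, drop stopwords, then stem what is left.
    return [naive_stem(t) for t in tokenize(text) if t not in STOPWORDS]

tokens = preprocess("You can order products online and track the orders in our shop.")
```

After this step "order" and "orders" collapse to the same term, which is what makes simple term-presence features usable by a downstream classifier.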

  15. Process/Analyse: IT Issues - 1
      • Big size, possibly solvable by Map-Reduce algorithms
      • Model absence, possibly solvable by learning techniques
      • Privacy constraints, solvable by privacy-preserving techniques

  16. Process/Analyse: IT Issues - 2
      • Map-Reduce algorithms: the problem is formulating algorithms that can be implemented according to this paradigm -> “mapreduce-ability”
      • Recent state-of-the-art Map-Reduce algorithms exist for:
        • Basic graph problems, e.g. minimum spanning trees, triangle counting and matching
        • Combinatorial optimization, e.g. maximum coverage, densest subgraph, and k-means clustering
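
Of the problems listed, k-means is the easiest to show in map-reduce form. The sketch below runs one iteration in plain Python: the map phase assigns each point to its nearest centroid, and the reduce phase averages the points per cluster. It is a simplified, single-machine illustration with hypothetical function names, not one of the state-of-the-art algorithms the slide refers to.

```python
from collections import defaultdict

def nearest(point, centroids):
    # Index of the closest centroid (squared Euclidean distance).
    return min(range(len(centroids)),
               key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])))

def map_phase(points, centroids):
    # Map: emit (cluster_id, (point, 1)) for every point.
    for point in points:
        yield nearest(point, centroids), (point, 1)

def reduce_phase(pairs):
    # Reduce: per cluster, sum coordinates and counts, then average.
    sums, counts = {}, defaultdict(int)
    for key, (point, count) in pairs:
        if key in sums:
            sums[key] = tuple(s + p for s, p in zip(sums[key], point))
        else:
            sums[key] = tuple(point)
        counts[key] += count
    return {k: tuple(s / counts[k] for s in sums[k]) for k in sums}

def kmeans_step(points, centroids):
    updated = reduce_phase(map_phase(points, centroids))
    # Keep the old centroid for clusters that received no points.
    return [updated.get(i, tuple(c)) for i, c in enumerate(centroids)]

# Toy points and starting centroids.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (12.0, 10.0)]
centroids = kmeans_step(points, [(1.0, 1.0), (9.0, 9.0)])
```

The per-cluster sums and counts are mergeable, so the reduce phase could run on partial results from many workers; the iteration itself still has to be repeated until the centroids stop moving.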

  17. Process/Analyse: IT Issues - 3
      • Persons and Places:
        • Match mobility-related data with data stored in Istat archives
        • A record linkage problem must be solved (future task)
      • Model absence: neither survey-based nor “traditional” model-based approaches are directly applicable to Big Data
        • Possible usage of machine learning techniques

  18. Process/Analyse: IT Issues - 4
      • ICT Usage:
        • Supervised learning methods used to learn the YES/NO answers to the questionnaire (including classification trees, random forests and adaptive boosting)
      • Persons and Places:
        • Unsupervised learning technique, namely SOM (Self Organizing Map), to learn mobility profiles
        • E.g. “free city users” vs. “embedded city users” (more confidently estimated by deterministic constraints)
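
To show what "learning a YES/NO answer" means in the simplest case, here is a decision stump, i.e. a classification tree of depth one, fitted on binary features. The feature names (`has_cart`, `has_contact_form`), the four toy rows and the question they answer are hypothetical; the project's actual features and models (classification trees, random forests, adaptive boosting) are richer than this.

```python
def fit_stump(rows, labels):
    # rows: list of dicts mapping feature name -> 0/1; labels: "YES"/"NO".
    # Try every feature and both polarities; keep the split with fewest errors.
    best = None
    for feature in rows[0]:
        for yes_when in (1, 0):
            preds = ["YES" if r[feature] == yes_when else "NO" for r in rows]
            errors = sum(p != y for p, y in zip(preds, labels))
            if best is None or errors < best[0]:
                best = (errors, feature, yes_when)
    _, feature, yes_when = best
    return feature, yes_when

def predict(stump, row):
    # Apply the single learned split to a new row.
    feature, yes_when = stump
    return "YES" if row[feature] == yes_when else "NO"

# Hypothetical features extracted from scraped sites, with toy labels
# standing in for the survey answer to "does the enterprise sell online?".
rows = [
    {"has_cart": 1, "has_contact_form": 1},
    {"has_cart": 1, "has_contact_form": 0},
    {"has_cart": 0, "has_contact_form": 1},
    {"has_cart": 0, "has_contact_form": 0},
]
labels = ["YES", "YES", "NO", "NO"]
stump = fit_stump(rows, labels)
```

A full classification tree repeats this split search recursively on each branch, and random forests and boosting combine many such trees.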

  19. Process/Analyse: IT Issues - 5
      • Privacy constraints could potentially be solved in a proactive way by relying on techniques that work on anonymous data
        • Privacy-preserving data integration, e.g. [DMKM-2004]
        • Privacy-preserving data mining, e.g. [TKDE - 2004]
      • Persons and Places: anonymous matching of CDRs with Istat archives via privacy-preserving record linkage
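
One very simple form of matching on anonymous data is to compare keyed hashes of a shared identifier instead of the identifier itself. The sketch below is only that: an exact-match illustration using the standard `hmac` module, with a hypothetical shared key and made-up identifiers. It is not the technique of the cited works, and it does not tolerate typos the way more elaborate privacy-preserving record linkage schemes are designed to.

```python
import hashlib
import hmac

def pseudonym(identifier, secret_key):
    # Normalise (drop whitespace, uppercase), then apply a keyed hash so
    # that the raw identifier is never exchanged.
    normalised = "".join(identifier.split()).upper()
    return hmac.new(secret_key, normalised.encode("utf-8"), hashlib.sha256).hexdigest()

def link(left, right, secret_key):
    # left/right: dicts mapping a raw identifier to that party's payload.
    # In practice each party would hash its own data before sharing it.
    left_hashed = {pseudonym(k, secret_key): v for k, v in left.items()}
    right_hashed = {pseudonym(k, secret_key): v for k, v in right.items()}
    common = left_hashed.keys() & right_hashed.keys()
    return [(left_hashed[h], right_hashed[h]) for h in common]

key = b"shared-secret-agreed-out-of-band"  # hypothetical key
cdr_side = {"ab 123": "commuter profile", "cd456": "resident profile"}
archive_side = {"AB123": "municipality: Pisa", "EF789": "municipality: Lucca"}
matches = link(cdr_side, archive_side, key)
```

Only records whose normalised identifiers are identical are linked, and neither side needs to see the other's raw identifiers, provided the key is kept away from whoever performs the comparison.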

  20. Concluding Remarks
      • Illustration of some IT issues considered relevant for Big Data adoption by OS, on the basis of practical experiences
      • Technology is probably not the issue, but IT methodology is!
      • Some IT issues also share some statistical methodological aspects
      • Other relevant IT issues:
        • Event data management
        • On-line analytics

  21. Thank you for the attention!
