Dealing with Big Data for Official Statistics: IT Issues
Giulio Barcaroli, Stefano De Francisci, Monica Scannapieco, Donato Summa
Istat – Italian National Institute of Statistics

Outline
• Background: Istat Big Data strategy and experimental projects
• IT issues in experimental projects
• Final remarks
Monica Scannapieco, Dealing with Big Data for Official Statistics: IT Issues, MSIS 2014

Istat Big Data Strategy - 1
• Istat (the Italian National Institute of Statistics) set up a technical Commission to guide investments in Big Data adoption in statistical production processes
• Duration: from February 2013 to February 2015
• Members come from different areas: Official Statistics, Academia, Private Sector

Objective of the talk
• I will NOT deal (just) with technological issues
• I will deal instead (mainly) with IT methodological issues
• Example:
  • Hadoop MapReduce: open-source framework
  • MapReduce programming model: are Official Statistics (OS) methods mapreduce-able?
  • Current research investigates the mapreduce-ability of (classes of) computational problems

Istat Big Data Strategy - 2
• The Commission will release a strategy for Big Data adoption
• Three experimental projects launched and monitored by the Commission:
  • Persons and Places
  • Labour Market Estimation based on Google Trends
  • ICT Usage in Enterprises based on Internet as a Data Source (IaD)
• Status: advanced implementation (first results already available)

Persons and Places
• Purpose
  • Production of the origin/destination matrix of daily mobility for work and study, at the spatial granularity of municipalities, starting from phone (tracking) data
• Actors involved in the project
  • Istat
  • National Research Council
  • University of Pisa
• Methodology
  • Inference of population mobility profiles from GSM Call Data Records (CDRs)
  • Comparison with data derived from administrative sources

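As a rough illustration of the target output, the origin/destination matrix can be read as a count of people per (home municipality, work/study municipality) pair. The sketch below assumes the hard part, inferring each person's home and work/study municipality from CDRs, has already been done. The function name, the input layout, and the sample municipalities are hypothetical and not taken from the project.

```python
from collections import Counter


def od_matrix(commutes):
    """Count people per (origin, destination) pair.

    `commutes` is an iterable of (home_municipality, work_or_study_municipality)
    tuples, one per person. This layout is an assumption made for illustration.
    """
    return Counter((origin, destination) for origin, destination in commutes)


# Hypothetical, hand-made sample: one tuple per (anonymous) person.
commutes = [
    ("Pisa", "Firenze"),
    ("Pisa", "Firenze"),
    ("Lucca", "Pisa"),
    ("Pisa", "Pisa"),
]

matrix = od_matrix(commutes)
for (origin, destination), people in sorted(matrix.items()):
    print(f"{origin} -> {destination}: {people}")
```

A sparse mapping keyed by municipality pair is used here instead of a dense table because most pairs of municipalities would have no commuters at all.
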
Labour Market Estimation
• Purpose
  • Test the use of Google Trends for forecasting and nowcasting in the Labour Force domain
• Actors involved in the project
  • Istat: Central Methodology Sector and Labour Force Survey
• Methodology
  • Autoregressive model vs. use of Google Trends as prediction models
  • Comparison extended to macroeconomic prediction models

ICT Usage in Enterprises
• Purpose
  • Evaluate the possibility of adopting Web scraping and text mining techniques for estimates of ICT usage by enterprises and public institutions
• Actors involved in the project
  • Istat: Survey on the ICT Usage in Enterprises
  • Cineca (Consortium of Italian universities, National Research Council and Ministry of Education and Research)
• Methodology
  • Scraping of web sites for data extraction
  • Supervised classification task

Features of Experimental Projects

Statistical Phases for Big Data Management
[Diagram of statistical phases; labels: "Collapsed phases", "Inversion of the two phases"]
• Principal selected phases
• Inversion due to the fact that the "traditional" design phase is no longer present for Big Data
• Collapse due to the fact that the same methods can be used for both phases
• Other phases, e.g. Dissemination, not (yet) involved in Big Data

Collect: IT issues - 1
• Access to Big Data sources:
  • Type 1: Access control mechanisms deliberately set up by the Big Data provider, and/or
  • Type 2: Technological barriers
• Google Trends (GT):
  • Absence of APIs, which prevents access to GT data by a software program
  • Hence the use of such a facility in production processes cannot be foreseen

Collect: IT issues - 2
• ICT Usage: both Type 1 and Type 2 problems
• 8,647 URLs of enterprises' Web sites, but only about 5,600 were actually accessed
  • Type 1: Scrapers deliberately blocked, e.g. by mechanisms that verify human access to sites, such as CAPTCHA
  • Type 2: Some technologies, such as Adobe Flash, simply prevented access to content

Design: IT Issues - 1
• Even if a traditional survey design cannot take place, the problem of "understanding" the data is still present
• Semantic extraction techniques
  • Knowledge representation and natural language processing
  • E.g.: FRED (http://wit.istc.cnr.it/stlab-tools/fred) extracts an ontology from natural-language sentences

Design: IT Issues - 2
• ICT Usage:
  • Human inspection refined by:
    • Some NLP techniques: tokenization, stemming, stopword removal, etc.
    • Semantic enrichment by semantic dictionaries (WordNet)
    • Images: tag extraction

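The NLP steps named above (tokenization, stopword removal, stemming) can be sketched in a few lines. This is only an illustration under stated assumptions: the stopword list is a tiny hand-picked set, the stemmer is a naive suffix stripper (not Porter or any stemmer the project may have used), and the sample sentence is invented.

```python
import re

# Tiny illustrative stopword list (an assumption, not a standard resource).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "our", "is", "are", "for", "on", "with"}

# Suffixes tried in order by the naive stemmer below.
SUFFIXES = ("ing", "es", "s")


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def naive_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, drop stopwords, then stem the remaining tokens."""
    return [naive_stem(t) for t in tokenize(text) if t not in STOPWORDS]


tokens = preprocess("Online ordering and shopping carts are available on our site")
print(tokens)
# ['online', 'order', 'shopp', 'cart', 'available', 'site']
```

The output shows why a real stemmer is preferable: "shopping" becomes "shopp" rather than "shop", because the naive rule does not handle doubled consonants.
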
Process/Analyse: IT Issues - 1
• Big size, possibly solvable by Map-Reduce algorithms
• Model absence, possibly solvable by learning techniques
• Privacy constraints, solvable by privacy-preserving techniques

Process/Analyse: IT Issues - 2
• Map-Reduce algorithms: the problem is formulating algorithms that can be implemented according to this paradigm -> "mapreduce-ability"
• Recent state-of-the-art Map-Reduce algorithms for:
  • Basic graph problems, e.g. minimum spanning trees, triangle counting and matching
  • Combinatorial optimization, e.g. maximum coverage, densest subgraph, and k-means clustering

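To make "mapreduce-ability" concrete, here is a minimal single-machine sketch, not tied to Hadoop, of one statistic that fits the paradigm easily: mean and population variance. Each partition is mapped to a small summary (count, sum, sum of squares), and summaries are merged by an associative, commutative reduce, so partitions can be processed independently and in any order. The function names and sample data are assumptions made for illustration.

```python
from functools import reduce


def map_partition(values):
    """Map step: summarise one partition as (count, sum, sum of squares)."""
    return (len(values), sum(values), sum(v * v for v in values))


def combine(a, b):
    """Reduce step: merge two summaries component-wise (associative, commutative)."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def mean_and_variance(partitions):
    """Compute mean and population variance from independently mapped partitions."""
    n, s, ss = reduce(combine, map(map_partition, partitions), (0, 0.0, 0.0))
    mean = s / n
    variance = ss / n - mean * mean  # simple, but numerically fragile for large values
    return mean, variance


# Hypothetical data split across three "nodes".
partitions = [[2.0, 4.0, 4.0], [4.0, 5.0], [5.0, 7.0, 9.0]]
mean, variance = mean_and_variance(partitions)
print(mean, variance)
```

The design point is that the summary has a fixed size regardless of partition size. A statistic such as the exact median has no equally simple fixed-size summary, which is the kind of difference the mapreduce-ability question is about.
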
Process/Analyse: IT Issues - 3
• Persons and Places:
  • Match mobility-related data with data stored in Istat archives
  • A record linkage problem has to be solved (future task)
• Model absence: neither survey-based nor "traditional" model-based approaches are directly applicable to Big Data
  • Possible use of machine learning techniques

Process/Analyse: IT Issues - 4
• ICT Usage:
  • Supervised learning methods used to learn the YES/NO answers to the questionnaire (including classification trees, random forests and adaptive boosting)
• Persons and Places:
  • Unsupervised learning technique, namely SOM (Self-Organizing Map), to learn mobility profiles
  • E.g. "free city users" vs. "embedded city users" (more confidently estimated by deterministic constraints)

Process/Analyse: IT Issues - 5
• Privacy constraints could potentially be solved in a proactive way by relying on techniques that work on anonymous data
  • Privacy-preserving data integration, e.g. [DMKM-2004]
  • Privacy-preserving data mining, e.g. [TKDE - 2004]
• Persons and Places: Anonymous matching of CDRs with Istat archives via privacy-preserving record linkage

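One simple way to picture privacy-preserving record linkage is exact matching on keyed hashes: each data holder replaces identifiers with an HMAC computed under a shared secret, and the linking party joins on the digests without ever seeing raw identifiers. This is a sketch of that generic idea only; it is not claimed to be the technique in the cited works or the one planned for the project, and the key, identifiers, and payloads are hypothetical.

```python
import hashlib
import hmac


def pseudonymise(identifier, key):
    """Replace an identifier with a keyed SHA-256 digest after light normalisation."""
    normalised = identifier.strip().lower()
    return hmac.new(key, normalised.encode("utf-8"), hashlib.sha256).hexdigest()


def encode_records(records, key):
    """Run by each data holder: map digest -> payload, dropping the raw identifier."""
    return {pseudonymise(ident, key): payload for ident, payload in records}


def link(encoded_a, encoded_b):
    """Run by the linking party: join the two sources on shared digests only."""
    shared = sorted(encoded_a.keys() & encoded_b.keys())
    return [(encoded_a[d], encoded_b[d]) for d in shared]


# Hypothetical secret, known to the two data holders but not to the linking party.
KEY = b"shared-secret-known-to-data-holders-only"

# Hypothetical records: (identifier, payload).
cdr_profiles = [("ID-001 ", "free city user"), ("id-002", "embedded city user")]
archive = [("ID-002", "municipality X"), ("ID-003", "municipality Y")]

linked = link(encode_records(cdr_profiles, KEY), encode_records(archive, KEY))
print(linked)
```

Exact matching on digests only links identifiers that agree after normalisation; tolerating typos or partial identifiers requires more elaborate encodings, which this sketch does not attempt.
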
Concluding Remarks
• Illustration of some IT issues considered relevant for Big Data adoption by OS, on the basis of practical experiences
• Technology is probably not an issue, but IT methodology is!
• Some IT issues also share some statistical methodological aspects
• Other relevant IT issues:
  • Event data management
  • On-line analytics

Thank you for your attention!