This article discusses the IT issues faced in dealing with big data for official statistics, including access to data sources, ICT usage, and design challenges. It also presents the experimental projects undertaken by the Italian National Institute of Statistics (Istat) in this field.
Dealing with Big Data for Official Statistics: IT Issues
Giulio Barcaroli, Stefano De Francisci, Monica Scannapieco, Donato Summa
Istat – Italian National Institute of Statistics
Outline
• Background: Istat Big Data strategy and experimental projects
• IT issues in experimental projects
• Final remarks
Monica Scannapieco, Dealing with Big Data for Official Statistics: IT Issues, MSIS 2014
Istat Big Data Strategy - 1
• Istat (the Italian National Institute of Statistics) set up a technical Commission to guide investments in Big Data adoption in statistical production processes
• Duration: from February 2013 to February 2015
• Members come from different areas: official statistics, academia, and the private sector
Objective of the talk
• I will NOT deal with (just) technological issues
• I will deal instead (mainly) with IT methodological issues
• Example:
  • MapReduce-Hadoop: open source framework
  • MapReduce programming model: are official statistics (OS) methods mapreduce-able?
  • Current research investigates the mapreduce-ability of (classes of) computational problems
Istat Big Data Strategy - 2
• The Commission will release a strategy for Big Data adoption
• Three experimental projects launched and monitored by the Commission:
  • Persons and Places
  • Labour Market Estimation based on Google Trends
  • ICT Usage in enterprises based on Internet as a Data Source (IaD)
• Status: advanced implementation (first results already available)
Persons and Places
• Purpose
  • Produce the origin/destination matrix of daily mobility for work and study, at the spatial granularity of municipalities, starting from phone (tracking) data
• Actors involved in the project
  • Istat
  • National Research Council
  • University of Pisa
• Methodology
  • Inference of population mobility profiles from GSM Call Data Records (CDRs)
  • Comparison with data derived from administrative sources
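As a rough illustration of the target output, the sketch below builds an origin/destination matrix from per-person (origin, destination) municipality pairs. The function name `build_od_matrix`, the input layout and the sample records are hypothetical; inferring home and work/study locations from CDRs is the hard part of the project and is not shown here.

```python
from collections import Counter


def build_od_matrix(trips):
    """Count daily flows between municipalities.

    `trips` is an iterable of (origin, destination) pairs, one per person,
    assumed to have been inferred already from call data records.
    """
    flows = Counter(trips)
    municipalities = sorted({m for pair in flows for m in pair})
    # Dense matrix as nested dicts: matrix[origin][destination] -> count
    return {
        o: {d: flows.get((o, d), 0) for d in municipalities}
        for o in municipalities
    }


# Made-up records for illustration only (not real data)
trips = [
    ("Pisa", "Firenze"),
    ("Pisa", "Firenze"),
    ("Pisa", "Pisa"),
    ("Lucca", "Pisa"),
]
od = build_od_matrix(trips)
```

The diagonal cells (for example `od["Pisa"]["Pisa"]`) hold people who live and work or study in the same municipality.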
Labour Market Estimation
• Purpose
  • Test the usage of Google Trends for forecasting and nowcasting purposes in the Labour Force domain
• Actors involved in the project
  • Istat: Central Methodology Sector and Labour Force Survey
• Methodology
  • Autoregressive model vs. usage of Google Trends as prediction models
  • Comparison extended to macroeconomic prediction models
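The comparison in the methodology can be sketched, under strong simplifying assumptions, as two one-regressor least-squares fits: an AR(1) model that explains the target by its own previous value, and a nowcast that explains it by a same-period search index. The function names and the two short series are made up for illustration; the project's actual model specifications are not given in the slides, and a real evaluation would measure errors out of sample.

```python
def ols(x, y):
    """Fit y = a + b*x by ordinary least squares; return (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b


def mean_abs_error(x, y, a, b):
    """In-sample mean absolute error of the fitted line."""
    return sum(abs(yi - (a + b * xi)) for xi, yi in zip(x, y)) / len(y)


def compare_models(target, index):
    """Compare an AR(1) fit with a regression on a same-period index."""
    y = target[1:]
    lagged = target[:-1]      # AR(1): y_t explained by y_{t-1}
    current = index[1:]       # nowcast: y_t explained by the index at t
    ar = ols(lagged, y)
    gt = ols(current, y)
    return {
        "ar1_mae": mean_abs_error(lagged, y, *ar),
        "index_mae": mean_abs_error(current, y, *gt),
    }


# Made-up series for illustration only (not real labour or search data)
unemployment = [1.0, 2.0, 4.0, 3.0, 5.0]
search_index = [2.0, 4.0, 8.0, 6.0, 10.0]
print(compare_models(unemployment, search_index))
```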
ICT Usage in Enterprises
• Purpose
  • Evaluate the possibility of adopting Web scraping and text mining techniques for estimates on the usage of ICT by enterprises and public institutions
• Actors involved in the project
  • Istat: Survey on the ICT Usage in Enterprises
  • Cineca (Consortium of Italian universities, National Research Council and Ministry of Education and Research)
• Methodology
  • Scraping of web sites for data extraction
  • Supervised classification task
Features of Experimental Projects
Statistical Phases for Big Data Management
(Diagram labels: collapsed phases; inversion of the two phases)
• Principal selected phases
• Inversion: the "traditional" design phase is no longer present for Big Data
• Collapse: the same methods can be used for both phases
• Other phases, e.g. Dissemination, are not (yet) involved in Big Data
Collect: IT issues - 1
• Access to Big Data sources:
  • Type 1: access control mechanisms that the Big Data provider deliberately set up, and/or
  • Type 2: technological barriers
• Google Trends:
  • Absence of APIs, which prevents a software program from accessing Google Trends (GT) data
  • As a result, usage of such a facility in production processes cannot be planned
Collect: IT issues - 2
• ICT Usage: both type 1 and type 2 problems
  • 8,647 URLs of enterprises' Web sites, but only about 5,600 were actually accessed
  • Type 1: scrapers deliberately blocked, e.g. by mechanisms that verify human access to sites, such as CAPTCHA
  • Type 2: some technologies, such as Adobe Flash, simply made the contents inaccessible to the scraper
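With about 5,600 of 8,647 sites reached, roughly 65% of the URLs were usable. One way to keep track of why the rest failed is to label each fetch outcome as a type 1 or type 2 problem. The sketch below is only a heuristic under stated assumptions: the status codes, the CAPTCHA marker strings, the Flash pattern and the 50-character threshold are illustrative choices, not the rules used in the Istat project.

```python
import re

CAPTCHA_HINTS = ("captcha", "are you human")  # illustrative markers only
FLASH_PATTERN = re.compile(r"<(object|embed)[^>]*\.swf", re.IGNORECASE)
TAG_PATTERN = re.compile(r"<[^>]+>")


def classify_access(status_code, html):
    """Label a fetched page as ok, type 1 (blocked) or type 2 (unreadable)."""
    # Type 1: the provider deliberately refuses automated access
    if status_code in (401, 403, 429):
        return "type1_blocked"
    lowered = html.lower()
    if any(hint in lowered for hint in CAPTCHA_HINTS):
        return "type1_blocked"
    # Type 2: the page loads, but its content sits inside a Flash object
    visible_text = TAG_PATTERN.sub(" ", html).strip()
    if FLASH_PATTERN.search(html) and len(visible_text) < 50:
        return "type2_unreadable"
    if status_code != 200:
        return "other_failure"
    return "ok"


print(classify_access(200, '<html><body><embed src="intro.swf"></body></html>'))
```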
Design: IT Issues - 1
• Even if a traditional survey design cannot take place, the problem of "understanding" the data is still present
• Semantic extraction techniques
• Knowledge representation and natural language processing
• E.g.: FRED (http://wit.istc.cnr.it/stlab-tools/fred) extracts an ontology from sentences in natural language
Design: IT Issues - 2
• ICT Usage:
  • Human inspection refined by:
    • Some NLP techniques: tokenization, stemming, stopword removal, etc.
    • Semantic enrichment by semantic dictionaries (WordNet)
    • Images: tag extraction
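The NLP steps named above can be sketched as a small preprocessing pipeline. This is a minimal illustration under stated assumptions: the English stopword list is a tiny hand-picked sample, and `naive_stem` is a crude suffix stripper written for this example, not the Porter algorithm or whatever stemmer the project used.

```python
import re

# Tiny illustrative stopword list (a real one would be language-specific)
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "on", "for",
             "our", "is", "are"}
SUFFIXES = ("ing", "ed", "es", "s")  # naive rules, not the Porter algorithm


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def naive_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, drop stopwords, then stem the remaining tokens."""
    return [naive_stem(t) for t in tokenize(text) if t not in STOPWORDS]


print(preprocess("Ordering products on our online shops"))
# prints ['order', 'product', 'online', 'shop']
```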
Process/Analyse: IT Issues - 1
• Big size, possibly solvable by Map-Reduce algorithms
• Model absence, possibly solvable by learning techniques
• Privacy constraints, solvable by privacy-preserving techniques
Process/Analyse: IT Issues - 2
• Map-Reduce algorithms: the problem is formulating algorithms that can be implemented according to this paradigm ("mapreduce-ability")
• Recent state-of-the-art Map-Reduce algorithms for:
  • Basic graph problems, e.g. minimum spanning trees, triangle counting and matching
  • Combinatorial optimization, e.g. maximum coverage, densest subgraph, and k-means clustering
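A statistic is "mapreduce-able" when each data partition can be summarised independently and the summaries can be merged in any order. The sketch below shows this for the mean and sample variance, using the standard pairwise merge of (count, mean, sum of squared deviations) summaries. It runs in a single process with `functools.reduce`, not on Hadoop, and the function names are chosen for this example; partitions are assumed to be non-empty.

```python
from functools import reduce


def map_partition(values):
    """Map step: summarise one non-empty partition as (count, mean, M2)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values)
    return n, mean, m2


def merge(a, b):
    """Reduce step: combine two partial summaries (an associative merge)."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta ** 2 * n_a * n_b / n
    return n, mean, m2


def mapreduce_variance(partitions):
    """Return (mean, sample variance) computed from partition summaries."""
    n, mean, m2 = reduce(merge, map(map_partition, partitions))
    return mean, m2 / (n - 1)


print(mapreduce_variance([[1, 2, 3], [4, 5], [6]]))
```

By contrast, an exact median cannot be recovered from per-partition medians alone, which is the kind of obstacle the mapreduce-ability question is about.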
Process/Analyse: IT Issues - 3
• Persons and Places:
  • Match mobility-related data with data stored in Istat archives
  • The record linkage problem remains to be solved (future task)
• Model absence: neither survey-based nor "traditional" model-based approaches are directly applicable to Big Data
  • Possible usage of machine learning techniques
Process/Analyse: IT Issues - 4
• ICT Usage:
  • Supervised learning methods used to learn the YES/NO answers to the questionnaire (including classification trees, random forests and adaptive boosting)
• Persons and Places:
  • Unsupervised learning technique, namely SOM (Self Organizing Map), to learn mobility profiles
  • E.g. "free city users" vs. "embedded city users" (more confidently estimated by deterministic constraints)
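To make the supervised task concrete, the sketch below trains a decision stump, i.e. a classification tree with a single split, on binary term-presence features to predict a YES/NO answer. It is a toy stand-in for the classification trees, random forests and boosting mentioned above, not a reproduction of them; the terms ("cart", "contact"), the labels and the function names are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def majority(labels):
    """Most frequent label; ties go to 1, empty lists to 0."""
    return int(sum(labels) * 2 >= len(labels)) if labels else 0


def fit_stump(rows, labels):
    """Pick the term whose presence/absence split has the lowest impurity.

    rows: list of dicts mapping term -> 0/1; labels: list of 0/1 (1 = YES).
    """
    best = None
    for term in sorted(rows[0]):
        present = [y for r, y in zip(rows, labels) if r[term]]
        absent = [y for r, y in zip(rows, labels) if not r[term]]
        score = (len(present) * gini(present)
                 + len(absent) * gini(absent)) / len(labels)
        if best is None or score < best[0]:
            best = (score, term, majority(present), majority(absent))
    _, term, if_present, if_absent = best
    return term, if_present, if_absent


def predict(stump, row):
    term, if_present, if_absent = stump
    return if_present if row[term] else if_absent


# Made-up training set: does the site appear to sell online? (1 = YES)
rows = [
    {"cart": 1, "contact": 1},
    {"cart": 1, "contact": 0},
    {"cart": 0, "contact": 1},
    {"cart": 0, "contact": 1},
]
labels = [1, 1, 0, 0]
stump = fit_stump(rows, labels)
```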
Process/Analyse: IT Issues - 5
• Privacy constraints could potentially be solved in a proactive way by relying on techniques that work on anonymous data
  • Privacy-preserving data integration, e.g. [DMKM-2004]
  • Privacy-preserving data mining, e.g. [TKDE - 2004]
• Persons and Places: anonymous matching of CDRs with Istat archives via privacy-preserving record linkage
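The simplest form of privacy-preserving record linkage replaces identifiers with keyed hashes, so that records can be matched without exposing the identifiers themselves. The sketch below assumes that both parties share a secret key and hold an exact common identifier; the function names and sample records are hypothetical, and the techniques in the cited works may differ (for instance, exact hashing does not tolerate typos).

```python
import hashlib
import hmac


def pseudonymise(identifier, secret_key):
    """Replace an identifier with a keyed hash (HMAC-SHA256)."""
    normalised = identifier.strip().lower().encode("utf-8")
    return hmac.new(secret_key, normalised, hashlib.sha256).hexdigest()


def link(records_a, records_b, secret_key):
    """Join two datasets on hashed identifiers only.

    Each argument maps identifier -> payload. In practice each party would
    hash its own identifiers before sharing; here both are done locally.
    """
    hashed_a = {pseudonymise(k, secret_key): v for k, v in records_a.items()}
    hashed_b = {pseudonymise(k, secret_key): v for k, v in records_b.items()}
    common = hashed_a.keys() & hashed_b.keys()
    return {h: (hashed_a[h], hashed_b[h]) for h in common}


# Made-up records for illustration only
key = b"shared-secret"
mobility = {"ID-001": "profile A", "ID-002": "profile B"}
archive = {" id-001 ": "archive X", "ID-003": "archive Y"}
matches = link(mobility, archive, key)
```

Only the hashed keys and the paired payloads appear in `matches`; the original identifiers never leave the hashing step.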
Concluding Remarks
• Illustration of some IT issues considered relevant for Big Data adoption by OS, on the basis of practical experiences
• Technology is probably not the issue, but IT methodology is!
• Some IT issues also share some statistical methodological aspects
• Other relevant IT issues:
  • Event data management
  • On-line analytics
Thank you for your attention!