Giulio Barcaroli(*), Alessandra Nurra(*), Marco Scarnò(**), Donato Summa(*)

Quality 2014
Wien, June 2-5 2014

Use of web scraping and text mining techniques in the Istat survey on “Information and Communication Technology in enterprises”

Giulio Barcaroli(*), Alessandra Nurra(*), Marco Scarnò(**), Donato Summa(*)
(*) Italian National Institute of Statistics (Istat)
(**) Cineca

The “ICT in enterprises” survey
• In Italy, the survey covers a universe of 211,851 enterprises with at least 10 employees by means of a sample survey involving 19,186 of them (2011).
• In the 2013 round of the survey, 8,687 enterprises indicated their website (45% of the responding sample units).
• Accessing the indicated websites to gather information directly from them opens up different opportunities.



Predictive approach vs Content Analysis
We assume that our target is to increase the accuracy of estimates by using data originating from the Internet as auxiliary data. This particular case is based on the use of textual data as auxiliary data. Texts are a “perfect” example of unstructured data, which is one of the characteristics of most Big Data.
First, the usual model-based approach will be followed, requiring the prediction of values at unit level: under this approach, the target is to maximise the correctness of classification for each unit in the reference population.
Next, a different approach will be illustrated, where the prediction of values at unit level is no longer required and the target becomes to directly maximise the accuracy at the aggregate level (accuracy of the estimates).

Predictive approach
In a predictive approach, the subset of data related to sampled respondent units can be considered as the labeled data, and supervised learning methods can be applied. In other words, the subset of 8,687 enterprises that indicated that they have a website or a home page, and that also responded to questions [B8a : B8g], can be used as the training and test set on which different models are estimated in order to predict the answers to questions [B8a : B8g] for the whole reference population.
[Diagram: Texts (websites content); Text and data mining; Model; Survey microdata]
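The workflow described above (learn from labeled respondent websites, then predict for other units) can be sketched as follows. This is only a minimal illustration, assuming a multinomial Naïve Bayes classifier with add-one smoothing and a whitespace tokenizer; the toy texts, the labels and the function names `tokenize`, `train_naive_bayes` and `predict` are hypothetical and do not come from the survey.

```python
import math
from collections import Counter


def tokenize(text):
    """Lower-case a text and split it on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def train_naive_bayes(texts, labels):
    """Count documents per class and words per class from the labeled texts."""
    class_counts = Counter(labels)
    word_counts = {label: Counter() for label in class_counts}
    for text, label in zip(texts, labels):
        word_counts[label].update(tokenize(text))
    vocabulary = set()
    for counts in word_counts.values():
        vocabulary.update(counts)
    return class_counts, word_counts, vocabulary


def predict(model, text):
    """Return the class with the highest log-posterior under add-one smoothing."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        denom = sum(word_counts[label].values()) + len(vocabulary)
        for word in tokenize(text):
            if word in vocabulary:  # words never seen in training are ignored
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy labeled data standing in for scraped website texts and survey answers
# to a Yes/No question (assumption: real inputs would be far larger and noisier).
train_texts = [
    "add to cart checkout order online",
    "book your room online reservation",
    "company history and contact details",
    "our mission and where to find us",
]
train_labels = ["Yes", "Yes", "No", "No"]
model = train_naive_bayes(train_texts, train_labels)

# Texts of units whose answers are unknown (the "population" to be predicted).
population_texts = ["order online and checkout", "contact details and history"]
predictions = [predict(model, text) for text in population_texts]
print(predictions)
```

In practice part of the labeled subset would be held out as a test set, so that the quality of the predictions can be measured before applying the model to the whole population.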

Predictive approach
In our case, we can apply any of the following supervised learning methods:
• Classification Trees;
• “ensembles” (Bootstrap Aggregating, Adaptive Boosting, Random Forests);
• Supervised Latent Dirichlet Allocation for classification (SLDA);
• Neural Networks;
• Logistic Regression;
• Support Vector Machines;
• Naïve Bayes.

Evaluation of predictive models
From the error matrix it is possible to compute the following indicators:
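The indicator formulas shown on the original slide are not available here. As a hedged illustration, the sketch below computes the standard textbook indicators from a 2x2 error matrix; the function name `classification_indicators` and the example counts are assumptions, and the slide's own definitions (in particular of “precision”) may differ.

```python
def classification_indicators(tp, fp, fn, tn):
    """Standard indicators from a 2x2 error (confusion) matrix.

    tp: observed Yes, predicted Yes    fn: observed Yes, predicted No
    fp: observed No,  predicted Yes    tn: observed No,  predicted No
    """
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,    # share of units classified correctly
        "sensitivity": tp / (tp + fn),    # share of observed Yes that are found
        "specificity": tn / (tn + fp),    # share of observed No that are found
        "precision": tp / (tp + fp),      # share of predicted Yes that are right
    }


# Hypothetical counts, for illustration only.
example = classification_indicators(tp=40, fp=10, fn=20, tn=130)
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```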

Evaluation of predictive models
Application of different learners to predict question B8a “Online ordering or reservation or booking (Yes/No)”

Evaluation of predictive models
In general, when the two kinds of misclassification are not balanced in absolute terms, the distribution of predicted values can differ significantly from the distribution of observed cases.
From these results, the Naïve Bayes predictor can be considered the most convenient: although its precision (78%) is the lowest, its sensitivity is the highest, its specificity is good, and the observed and predicted proportions are perfectly aligned.
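The point about unbalanced misclassification can be made concrete with a little arithmetic: the predicted proportion of “Yes” exceeds the observed one by (false positives - false negatives) / N, so the two proportions coincide only when the two error counts are equal. The sketch below uses hypothetical counts and a hypothetical helper name, `proportion_gap`.

```python
def proportion_gap(fp, fn, total):
    """Predicted minus observed proportion of positives: (FP - FN) / N."""
    return (fp - fn) / total


# Two hypothetical classifiers, both wrong on 30 of 200 units (same accuracy).
balanced = proportion_gap(fp=15, fn=15, total=200)    # errors cancel out
unbalanced = proportion_gap(fp=5, fn=25, total=200)   # errors do not cancel
print(balanced, unbalanced)
```

Two classifiers with identical accuracy can therefore yield very different aggregate estimates, which is why the alignment of observed and predicted proportions is checked separately.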

Evaluation of predictive models
Application of Naïve Bayes to predict all questions in section B8

Content analysis

Content analysis performance …
In order to verify the robustness of the Content Analysis, we repeated the selection of a training set from the survey data 40 times (each time producing an estimate of the proportion of websites with web sales functionality), for different ratios of training set size to the total (from 10% to 90%).
The results show that the method is correct up to a training rate of 30%, but that the estimates vary widely at every rate.
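The resampling scheme just described (40 random training sets for each training rate from 10% to 90%) can be sketched as below. This is an illustration of the loop only: the Content Analysis estimator itself is not specified here, so a placeholder that simply returns the proportion of “Yes” in the training set is used. The names `robustness_check` and `training_proportion` and the synthetic labels are assumptions.

```python
import random
import statistics


def robustness_check(labels, estimator, rates, iterations=40, seed=0):
    """For each training rate, draw `iterations` random training sets and
    summarise the bias and spread of the resulting estimates."""
    rng = random.Random(seed)
    true_value = sum(labels) / len(labels)
    summary = {}
    for rate in rates:
        size = max(1, round(rate * len(labels)))
        estimates = []
        for _ in range(iterations):
            training = rng.sample(labels, size)  # sampling without replacement
            estimates.append(estimator(training))
        summary[rate] = {
            "bias": statistics.mean(estimates) - true_value,
            "std_dev": statistics.pstdev(estimates),
        }
    return summary


def training_proportion(training_labels):
    """Placeholder estimator (assumption): proportion of 1s in the training set.
    A real study would fit a model on the training set and apply it to the rest."""
    return sum(training_labels) / len(training_labels)


# Synthetic labels: 1 = website offers web sales, 0 = it does not.
labels = [1] * 300 + [0] * 700
rates = [r / 10 for r in range(1, 10)]  # 10%, 20%, ..., 90%
summary = robustness_check(labels, training_proportion, rates)
for rate, stats in summary.items():
    print(f"{rate:.0%}: bias={stats['bias']:+.4f}, std_dev={stats['std_dev']:.4f}")
```

Reporting bias and standard deviation side by side for each rate makes it possible to compare estimators on both components of the error rather than on either one alone.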

… compared to Naïve Bayes
The same exercise has been carried out for Naïve Bayes. The results show a minimal bias (on the order of one or two percentage points), but a much lower variability.

Future work
The approach tested here will be improved and extended in different directions:
• with reference to the population of interest: we will consider the URLs of all the units belonging to the Business Register and perform a mass scraping of the related websites (in this case also addressing more properly the high-volume problems related to Big Data), using the whole sampling subset of websites as a training set, so as to obtain a model that can be applied to the whole population. The aim is to produce estimates under a full predictive approach, reducing the sampling errors at the cost of introducing additional bias (both components of the MSE should be evaluated);
• with reference to the content of the questionnaire: the results obtained with the set of variables contained in the “B8” section of the questionnaire will also be evaluated for the other suitable variables in the questionnaire (e-recruitment, e-procurement, use of social networks, etc.).

Thank you for your attention
Contacts
barcaroli@istat.it
nurra@istat.it
marco.scarno@cineca.it
summa@istat.it
