Q2014 European Conference on Quality in Official Statistics, Vienna, 2-5 June 2014
Web scraping techniques to collect data on consumer electronics and airfares for Italian HICP compilation
Riccardo Giannini (rigianni@istat.it), Rosanna Lo Conte (rolocont@istat.it), Stefano Mosca (stmosca@istat.it), Federico Polidoro (polidoro@istat.it), Francesca Rossetti (frrosset@istat.it)
Outline of the presentation
• Implementing the European project “Multipurpose price statistics”
• Centralised data collection for the Italian Harmonized Index of Consumer Prices (HICP)
• Testing and implementing web scraping techniques on the survey concerning prices of “consumer electronics” products
• Testing web scraping techniques on the survey concerning “airfares”
• IT choices adopted to implement web scraping procedures
• Possible future developments and conclusive remarks
Implementing European project “Multipurpose price statistics”
• The key role of price statistics in the choices of policy makers
• The requirement to ensure and improve quality in terms of methodology and production process, with specific reference to the data collection phase
• The demand for timely and cost-efficient production of high-quality statistical data is increasing, as is the need for new solutions to declining response levels (Scheveningen Memorandum)
• Multipurpose price statistics as a reply to these requirements
Implementing European project “Multipurpose price statistics”
• Modernization of data collection tools for improving HICP quality is one of the pillars of “Multipurpose price statistics”
• Focus on “scanner data” and “web scraping” techniques as tools to capture large amounts of data for the compilation of inflation estimates
• Concerning web scraping, Istat is testing and implementing procedures to “scrape” large amounts of data for HICP purposes, using the Internet as a data source
• Focus on two groups of products: “consumer electronics” (goods) and “airfares” (services)
Centralised data collection for Italian HICP Breakdown of the basket of products in terms of weights
Centralised data collection for Italian HICP Criteria to select the products for testing web scraping techniques
Centralised data collection for Italian HICP Relevance of the web as a retail trade channel. Source: Istat survey on “Aspects of daily life”
Centralised data collection for Italian HICP
• The choice finally fell on two groups of products: consumer electronics (goods) and airfares (services)
• Testing web scraping techniques on these two groups of products was aimed first of all at making online data collection more efficient for products for which the web is a relevant retail trade channel
• A further aim was to explore the potential of web scraping to allow better coverage of the reference population (an issue linked to the use of big data for statistical purposes and its consequences for traditional sampling methodologies)
Testing and implementing web scraping techniques on “consumer electronics” products • The survey on prices of consumer electronics is carried out in ten phases.
Testing and implementing web scraping techniques on “consumer electronics” products • The web scraping test focused on Phase 5, the semi-automatic detection of prices, because it is a time-consuming phase and a feasible target for the test.
Testing and implementing web scraping techniques on “consumer electronics” products
• Evaluation of the results obtained, in terms of:
• The number of prices downloaded in the lists
• The number of prices that could be linked automatically, for each store (via SAS procedures), to the product codes in the sample selected in Phase 4 (see the sketch below)
• The efficiency improvements in terms of time saved
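At Istat the automatic linking is carried out via SAS procedures; as a rough illustration of the same logic only, the following Java sketch matches a scraped price list against the sampled product codes. The file names and the semicolon-separated layout are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

/** Links a web-scraped price list to the product codes sampled in Phase 4. */
public class PriceLinker {

    public static void main(String[] args) throws IOException {
        // Hypothetical inputs: one scraped price list per store
        // ("code;price" rows) and the sampled product codes, one per line.
        Map<String, Double> scraped = new HashMap<>();
        for (String line : Files.readAllLines(Path.of("scraped_store01.csv"))) {
            String[] fields = line.split(";");
            scraped.put(fields[0].trim(), Double.parseDouble(fields[1].trim()));
        }
        Set<String> sample = Files.readAllLines(Path.of("sampled_codes.txt"))
                .stream().map(String::trim).collect(Collectors.toSet());

        // Automatic link: keep scraped quotes whose product code matches a
        // sampled code; unmatched sampled codes go to manual revision.
        Map<String, Double> linked = new HashMap<>();
        List<String> unmatched = new ArrayList<>();
        for (String code : sample) {
            Double price = scraped.get(code);
            if (price != null) linked.put(code, price);
            else unmatched.add(code);
        }

        System.out.printf("Linked %d of %d sampled codes (%.1f%%)%n",
                linked.size(), sample.size(),
                100.0 * linked.size() / sample.size());
        System.out.println("To manual revision: " + unmatched);
    }
}
```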
Testing and implementing web scraping techniques on “consumer electronics” products Table 2. Number of elementary price quotes manually collected and web-scraped and usable for index compilation. A comparison between February and March 2013. Source: Istat
Testing and implementing web scraping techniques on “consumer electronics” products Table 3. Current workload for monthly data collection. Comparison between manual and web scraping download. Hours. The comparison between the workload necessary for the manual detection of prices and the download of prices through web scraping macros shows the efficiency gains obtained. Source: Istat
Testing and implementing web scraping techniques on “consumer electronics” products Table 4. Annual working hours for half of the shop sample for data collection on prices of consumer electronics products. Comparison between manual download and web scraping. Hours. Source: Istat
• A correct evaluation of the efficiency gains requires assessing the annual workload, taking into account the start-up cost of developing the macros (a workload to be sustained at each annual index rebasing) and their maintenance (current maintenance)
• Also on an annual basis efficiency increases: the workload necessary to manage the survey is reduced from about 23 working days to 16 (more than 30% of time saved, and even more if we take into account that the freed human resources are available for other tasks)
• Hence the choice of switching the web scraping macros for consumer electronics from test to production
Testing and implementing web scraping techniques on “consumer electronics” products Table 5. Sample of models, price quotes scraped and price quotes collected for consumer electronics products survey for Italian CPI/HICP compilation. January 2014. Units and percentages
Testing and implementing web scraping techniques on “consumer electronics” products
• The potential of web scraping techniques in terms of the amount of information captured
• A big “waste” of information: why?
• Is the discarded information, from the statistical point of view, useless for the aim of estimating inflation for consumer electronics products?
• Or were the sampling schemes conceived around the constraints of data collection, in particular the impossibility of collecting all the information available on the web?
• Does the possible massive use of web scraping techniques open “new doors” for statisticians concerning the capture of the information necessary to measure the object of the survey?
Testing web scraping techniques on the survey concerning “airfares” A brief description of the survey
• The COICOP class (passenger transport by air; weight on the total HICP basket of products = 0.85% in 2013) is articulated in three consumption segments: domestic flights, European flights, intercontinental flights, further stratified by type of carrier, destination and route
• In 2013 the final sample consisted of 208 routes (from/to 16 Italian airports): 47 national routes, 97 European routes, 64 intercontinental routes. 81 routes referred to traditional carriers (TCs) and 127 to low cost carriers (LCCs)
• Product definition: one ticket, economy class, adult, fixed route connecting two cities, outward and return trip, on fixed departure/return days, final price including airport or agency fees
Testing web scraping techniques on the survey concerning “airfares” A brief description of the survey
• Prices are collected by means of purchasing simulations on the Internet, according to a pre-fixed yearly calendar (the first Tuesday of the month, considering 2 or 4 time distances from the date of departure); a sketch of such a calendar follows this list
• In 2013, data collection was carried out on 16 LCCs’ websites and on three web agencies selling air tickets (Opodo, Travelprice and Edreams), where only TCs’ airfares are collected
• More than 960 elementary price quotes are registered monthly; each corresponds to the cheapest economy fare available at the moment of booking for the selected route and dates, including taxes and compulsory service charges
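A minimal Java sketch of such a calendar, assuming the two ordinary purchasing advances of ten days and one month mentioned in the next slide; the extended advances being tested (two and three months) would be added in the same way.

```java
import java.time.DayOfWeek;
import java.time.LocalDate;
import java.time.Month;
import java.time.temporal.TemporalAdjusters;

/** Builds the yearly data-collection calendar for the airfares survey. */
public class AirfareCalendar {

    public static void main(String[] args) {
        int year = 2013;
        for (Month month : Month.values()) {
            // Collection day: the first Tuesday of the month.
            LocalDate collection = LocalDate.of(year, month, 1)
                    .with(TemporalAdjusters.firstInMonth(DayOfWeek.TUESDAY));
            // Ordinary purchasing advances: departures ten days and one
            // month after the collection date.
            LocalDate depTenDays = collection.plusDays(10);
            LocalDate depOneMonth = collection.plusMonths(1);
            System.out.printf("collect %s -> departures %s, %s%n",
                    collection, depTenDays, depOneMonth);
        }
    }
}
```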
Testing web scraping techniques on the survey concerning “airfares” Testing web scraping
• The aim of testing web scraping techniques on airfares is twofold: verifying the efficiency improvements, and evaluating the chance of extending data collection to further dates (two and three months of “purchasing advance”) beyond those ordinarily scheduled (monthly or twice a month, with departure dates ten days and one month ahead), exploiting the potential of web scraping procedures
• Given the characteristics and peculiarities of the airfares survey, testing web scraping on its data collection required not only developing and assembling scraping macros but also implementing a multitude of logic controls derived from the statistical design of the survey (see the sketch below)
• Web scraping techniques have been tested on EasyJet, Ryanair and Meridiana and, for the traditional airline companies, on Opodo.it
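A minimal sketch of such logic controls, with hypothetical routes, dates and field names; the actual checks are derived from the product definition and calendar described above.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Logic controls applied to each scraped airfare quote. */
public class FareChecks {

    record Fare(String route, LocalDate departure, LocalDate ret,
                String fareClass, double price, boolean taxesIncluded) {}

    // Hypothetical sample of routes and scheduled departure dates.
    static final Set<String> SAMPLED_ROUTES = Set.of("FCO-CDG", "MXP-STN");
    static final Set<LocalDate> SCHEDULED_DEPARTURES =
            Set.of(LocalDate.of(2014, 1, 17), LocalDate.of(2014, 2, 7));

    /** Returns the reasons a quote must be rejected (empty list = valid). */
    static List<String> check(Fare f) {
        List<String> problems = new ArrayList<>();
        if (!SAMPLED_ROUTES.contains(f.route()))
            problems.add("route not in sample");
        if (!SCHEDULED_DEPARTURES.contains(f.departure()))
            problems.add("departure date off calendar");
        if (!f.ret().isAfter(f.departure()))
            problems.add("return not after departure");
        if (!"economy".equalsIgnoreCase(f.fareClass()))
            problems.add("not an economy fare");
        if (f.price() <= 0)
            problems.add("non-positive price");
        if (!f.taxesIncluded())
            problems.add("taxes and compulsory charges missing");
        return problems;
    }

    public static void main(String[] args) {
        Fare f = new Fare("FCO-CDG", LocalDate.of(2014, 1, 17),
                LocalDate.of(2014, 1, 20), "economy", 89.50, true);
        List<String> problems = check(f);
        System.out.println(problems.isEmpty() ? "OK" : problems);
    }
}
```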
Testing web scraping techniques on the survey concerning “airfares” Results of testing web scraping. With regard to the LCCs, each airline company’s site showed its own specific problems.
Testing web scraping techniques on the survey concerning “airfares” Results of testing web scraping
• Finally, attention was concentrated on EasyJet
• The macros developed have provided excellent results in correctly replicating manual data collection. They are used in the current activities, starting from the most recent data collections
• Improvements in terms of time saving have been quite small. This is due to the time spent preparing the input files used by the macros to correctly identify the routes and dates for which to scrape prices and to return a correct output usable for index compilation (a sketch of such an input file follows this list), but also to the limited number of elementary quotes involved (60), which does not allow a meaningful measure of the time savings that web scraping can deliver as a tool to acquire large amounts of elementary data efficiently
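A minimal sketch of how such an input file could be generated. The semicolon-separated layout, the file name (easyjet_input.csv), the three-night stay and the ten-day and one-month advances are assumptions for illustration; the real input files follow the survey design.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

/** Writes the input file that drives the scraping macro for one carrier. */
public class MacroInputFile {

    public static void main(String[] args) throws IOException {
        // Hypothetical route sample; the real file lists the sampled routes
        // with their scheduled departure and return dates.
        List<String> routes = List.of("MXP-STN", "LIN-AMS");
        LocalDate collection = LocalDate.of(2014, 2, 4); // first Tuesday

        List<String> rows = new ArrayList<>();
        rows.add("route;departure;return");
        for (String route : routes) {
            for (int advanceDays : new int[] {10, 30}) {
                LocalDate dep = collection.plusDays(advanceDays);
                rows.add(String.join(";", route,
                        dep.toString(), dep.plusDays(3).toString()));
            }
        }
        Files.write(Path.of("easyjet_input.csv"), rows);
        rows.forEach(System.out::println);
    }
}
```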
Testing web scraping techniques on the survey concerning “airfares” Results of testing web scraping
• For airfares offered by traditional airline companies, web scraping macros have been tested on the web agency Opodo (www.opodo.it). In this case, about 160 monthly price quotes were involved
• The efficiency improvements are more meaningful than those obtained with the EasyJet macro: 1 hour and 48 minutes to download the 160 elementary price quotes that take about two and a half hours manually
• Also for Opodo it is necessary to prepare an input file to drive the macro in searching the correct sample of routes and, unlike the EasyJet macro, in managing the distinction between traditional and low cost carriers
Testing web scraping techniques on the survey concerning “airfares” Results of testing web scraping
• Therefore the total time necessary for the automatic detection of prices is not so different from that of manual detection, and time to update the macro is also needed
• But it has to be considered that, if the Opodo macro works correctly and only a marginal checking activity is needed, the two hours of manual work are saved and can be dedicated to other phases of the production process or to improving the quality and coverage of the survey
• Also in the case of airfares, the possibility of better covering the reference universe by enlarging the amount of elementary data collected through web scraping emerges clearly
IT choices adopted to implement web scraping procedures
• Why was iMacros chosen as the software for testing web scraping techniques in the field of consumer price data collection?
• It allows speeding up the acquisition of textual information on the web and, above all, it can be used together with programming and scripting languages (e.g. Java, JavaScript)
• iMacros tasks can be performed with the most popular browsers
• It is documented by a wiki (i.e. http://wiki.imacros.net/iMacros_for_Firefox) and forums (e.g. http://forum.iopus.com/viewforum.php)
• It is possible to take advantage of projects such as JACOB (e.g. http://sourceforge.net/projects/jacob-project/) for the use of Java, giving the user great potential for interfacing and integrating with other software solutions and legacy environments
IT choices adopted to implement web scraping procedures
• The approach adopted has been to implement two different macros for each survey: a pointing macro and a scraping macro (the pointing macro reaches the page; the scraping macro collects the data and registers them into a flat file); a sketch of how the pair is run follows this list
• Main advantages: a) easy maintenance, since modularity helps the identification of problems when they occur; b) whenever a problem resides in the pointing macro, no IT specialist support is needed for maintenance
• Main disadvantages: a) lower usability (collectors are forced to use two macros instead of one); b) more time necessary to execute the complete web scraping activity
• But the advantages prevail over the disadvantages
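A minimal sketch of how the pointing/scraping pair could be driven from Java through JACOB and the iMacros Scripting Interface (iimInit, iimPlay, iimExit). The "imacros" ProgID, the "-ie" browser flag and the macro file names are assumptions taken from the iMacros Scripting Edition documentation; the authors' actual runner may differ.

```java
import com.jacob.activeX.ActiveXComponent;
import com.jacob.com.Dispatch;

/** Runs the pointing macro and then the scraping macro for one store. */
public class MacroRunner {

    public static void main(String[] args) {
        // Assumption: the iMacros Scripting Edition is installed and its
        // COM component is registered under the "imacros" ProgID.
        ActiveXComponent imacros = new ActiveXComponent("imacros");
        Dispatch.call(imacros, "iimInit", "-ie"); // open the browser

        // Pointing macro: reaches the product-list page of the store.
        // Scraping macro: extracts codes/prices and saves a flat file.
        // The macro names are hypothetical examples.
        Dispatch.call(imacros, "iimPlay", "store01_pointing.iim");
        Dispatch.call(imacros, "iimPlay", "store01_scraping.iim");

        Dispatch.call(imacros, "iimExit");
    }
}
```

Keeping the pointing and scraping steps as separate .iim files is what makes the maintenance modular: when a store changes its navigation, only the pointing macro needs editing.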
Possible future developments and conclusive remarks
• Developing and testing web scraping procedures for the Italian consumer price survey has confirmed the enormous potential of the automatic detection of prices (and related information)
• Concerning efficiency, the improvements are clear when data collection is carried out on a few websites with a large amount of information. The situation appears to be partially different when it is necessary to collect a few prices on several distinct websites
• This issue underlines the potential of web scraping techniques to collect information for Purchasing Power Parity (PPP) or Detailed Average Price (DAP) exercises at the international level of comparison, but it seems to limit their use for sub-national spatial comparisons, for which data collection on a large number of websites would be necessary
• But the actual challenges that emerged have become clearer, and they stand in front of statisticians in terms of the use of “big data” for statistical purposes
Possible future developments and conclusive remarks
• The open questions regard the adoption of web scraping techniques to gather large amounts of data useful to better estimate inflation. This challenge has already been posed by the study carried out by economic researchers at the Massachusetts Institute of Technology (MIT) within the project called "The Billion Prices Project @ MIT", aimed at monitoring the daily price fluctuations of online retailers across the world
• Are web scraping (and scanner data) the future of consumer price statistics, as a basis for reengineering production processes, or also a challenge requiring a deep revision of the statistical survey design?
• Is it possible to fully exploit the potential of these “big data” (web-scraped prices and scanner data) to enhance the quality of official statistical information in a field as delicate as inflation estimation?