Olav ten Bosch 23 March 2016, ESSnet big data WP2, Rome

Webscraping at Statistics Netherlands Olav ten Bosch, 23 March 2016, ESSnet big data WP2, Rome

Content • Internet as a datasource (IAD): motivation • Some IAD projects over past years • Technologies used • Summary / trends • Observations / thoughts • Legal • The Dutch Business Register

The why Internet sources Faster, better, more efficient Administrative sources • Tax, social security services • Municipalities/ Provinces • Supermarkets • … • … • Surveys New indicators Less!!!

Fuel prices (2009) • Daily fuel prices from the website of unmanned petrol stations (tinq.nl) • Regional prices (per station) every day Now (2016): • A direct data feed from a travelcard company, weekly • Fuel prices per day and all transactions of that week • Publication on the website: prices per month

Airline tickets (2010) • Pilot: 3 robots on 6 airline companies • 2 robots by external companies, 1 by Statistics Netherlands (SN) • Prices consistent with manual collection • Quite expensive; negative business case • 2016: still manual price collection of airline tickets

Housing market • Housing market (from 2011): • Discussions with external company for > 1 year (iWoz) • We scraped 5 sites, about 250,000 observations / week, for 2 years From 2013: • Direct feed from one of the sites (Jaap.nl) • StatLine tables: Bestaande woningen in verkoop (existing homes for sale) • “based on 80-90 percent of the market”

Bulk price collection for CPI (1) • Bulk price collection for CPI (from 2012): • Mainly clothing • Software scrapes all prices and product data (id, name, description, category, colour, size, …) 2016: • About 500,000 price observations daily from 10 sites • Data from 3 sites used in production of the Dutch CPI • Price collection process embedded in the organisation • Plans to extend to > 20 sites and to other domains

Bulk price collection for CPI (2) [Diagram: Processing bulk data from the Internet. Labels: Data collection & Feature extraction (example features: Fine-knit, Jumper, Dark blue, Striped, Cotton edges) • Structured data • Big Data • Index methods • Index based on internet data]
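The feature extraction step on this slide can be illustrated with a minimal sketch. It assumes a simple keyword vocabulary; the `FEATURE_VOCABULARY` table and the `extract_features` function are hypothetical names introduced here for illustration and are not taken from the Statistics Netherlands software.

```python
import re

# Hypothetical vocabulary: feature name -> candidate terms.
# More specific terms ("dark blue") are listed before general ones ("blue").
FEATURE_VOCABULARY = {
    "garment": ["jumper", "shirt", "trousers"],
    "colour": ["dark blue", "blue", "red", "black"],
    "pattern": ["striped", "checked", "plain"],
    "material": ["cotton", "wool", "linen"],
}


def extract_features(description):
    """Return the first matching term per feature found in a product description."""
    text = description.lower()
    features = {}
    for feature, terms in FEATURE_VOCABULARY.items():
        for term in terms:
            # Whole-word match so that e.g. "red" does not match inside "striped".
            if re.search(r"\b" + re.escape(term) + r"\b", text):
                features[feature] = term
                break
    return features


example = extract_features("Fine-knit Jumper Dark blue Striped Cotton edges")
print(example)
```

A keyword table like this is only one possible approach; it turns free-text descriptions into structured fields that an index method could then group on.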

Robot-assisted price collection • Robot tool for detecting price changes on (parts of) websites • Traffic light indicates status: • Green: nothing changed, price is saved in database • Red: something changed, needs attention of statistician • Two clicks to hold the old price or store a new one • In production from 2014
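The traffic-light idea can be sketched as a comparison between a stored fingerprint of a page fragment and the fragment as it looks today. This is a minimal sketch under assumptions: the function names and the choice of a SHA-256 hash over whitespace-normalised text are illustrative, not a description of the production tool.

```python
import hashlib


def fingerprint(fragment):
    """Hash a page fragment after collapsing whitespace, so layout-only changes are ignored."""
    normalised = " ".join(fragment.split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def traffic_light(stored_fingerprint, current_fragment):
    """Green: fragment unchanged, the price can be saved. Red: a statistician should look."""
    if fingerprint(current_fragment) == stored_fingerprint:
        return "green"
    return "red"


# Hypothetical fragment captured on an earlier visit.
baseline = fingerprint("<span class='price'>EUR 19.95</span>")

# Same content with extra whitespace: treated as unchanged.
print(traffic_light(baseline, "<span class='price'>EUR  19.95</span>"))
# Different price: flagged for attention.
print(traffic_light(baseline, "<span class='price'>EUR 21.50</span>"))
```

Watching only a part of the page, rather than the whole page, keeps unrelated changes such as banners from turning the light red.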

Collect data on enterprises for EGR (2013) • Pilot: find data about EGR enterprises on the web • We scraped semi-structured data from Wikipedia • Multiple Wikipedia languages (NL, EN, DE, FR) • 2016: something similar in ESSnet BD WP2?

Search product descriptions for classifying business activities • Search product descriptions on the web (from 2014) • First time we used automated search with the Google Search API for statistics • Pilot, no production • Some doubts about the Google results

Twitter-LinkedIn (1) • LinkedIn-Twitter for profiling (2015) • Automated search on LinkedIn based on a sample of Twitter users • Very specific and experimental • “Profiling of Twitter data, a big data selectivity study”, Piet Daas, Joep Burger, Quan Lé, Olav ten Bosch

Scraping websites of enterprises • Identify family businesses (search and / or crawling) (2016) • Identify businesses with a Corporate Social Responsibility (CSR) (search and / or crawling) (2016) • Research program: • “Extracting information from websites to improve economic figures” • This ESSnet BD WP2 !!!

Crawling for Statistics [Diagram labels: Incomplete statistical data • Url-base • Search terms • Navigation terms • Focused Crawler (Roboto) • Internet • Item identifier terms (“year report, family business”) • Search & Match • ElasticSearch Datastore • More complete statistical data]
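The roles of navigation terms and item identifier terms in a focused crawler can be sketched as follows. This is a toy illustration, not Roboto's API: `focused_crawl`, `TOY_SITE` and the scoring rule are assumptions, and the "web" is an in-memory dictionary so the example runs without network access.

```python
import heapq


def focused_crawl(start_urls, get_page, navigation_terms, item_terms, max_pages=50):
    """Visit links whose anchor text contains navigation terms first;
    report pages whose text contains an item identifier term."""
    frontier = [(0, url) for url in start_urls]  # (negated score, url)
    heapq.heapify(frontier)
    seen = set(start_urls)
    hits = []
    visited = 0
    while frontier and visited < max_pages:
        _, url = heapq.heappop(frontier)
        page = get_page(url)
        if page is None:
            continue
        visited += 1
        text, links = page
        if any(term in text.lower() for term in item_terms):
            hits.append(url)
        for anchor, target in links:
            if target in seen:
                continue
            seen.add(target)
            score = sum(term in anchor.lower() for term in navigation_terms)
            heapq.heappush(frontier, (-score, target))  # higher score is popped first
    return hits


# Hypothetical site: url -> (page text, [(anchor text, target url), ...])
TOY_SITE = {
    "http://example.org/": ("Welcome", [("About us", "http://example.org/about"),
                                        ("Webshop", "http://example.org/shop")]),
    "http://example.org/about": ("We are a family business since 1950.",
                                 [("Year report", "http://example.org/report")]),
    "http://example.org/shop": ("Buy things", []),
    "http://example.org/report": ("Year report 2015", []),
}

hits = focused_crawl(["http://example.org/"], TOY_SITE.get,
                     navigation_terms=["about", "report"],
                     item_terms=["family business", "year report"])
print(hits)
```

In a real setting `get_page` would fetch and parse live pages, and the hits would be stored in a datastore for the search and match step.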

Technologies used • Perl (2009), Djuggler (2010) • Python, Scrapy (2010) • R (2011-2015) • NodeJS (JavaScript on server) (2014-) • Google Search API (2014-) • ElasticSearch (2016) • Roboto (nodejs package, 2015-2016) • Nutch: tested, not used • Generic Framework (robot framework) for bulk scraping of prices

Summary / trends

Observations / thoughts … • If it is there, we can get it • Technology is (usually) not the problem! • The internet is a living thing! • It’s too simple to think we can just buy the internet somewhere and then make statistics! • It’s powerful to combine something we know with something we observe! • External companies can help, but be careful …

Legal • Dutch Statistics Law: • Enterprises have to provide data to Statistics Netherlands on request • Scraping information from websites reduces response burden • Statistics Netherlands uses data for official statistics only • Dutch database legislation: • Commercial re-use of intellectual property is forbidden • This may also apply to internet sources • Privacy: • Dutch (statistical) legislation on protection of personal information • Statistics Netherlands scrapes only public sources and processes the data within its own safe environment, just as it does with other (privacy-sensitive) data • Netiquette: • respect robots.txt • identify yourself (user-agent) • do not overload servers, use some idle time between requests
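The three netiquette rules can be sketched with the Python standard library. This is a minimal sketch under assumptions: the user-agent string, the idle time and the helper names are hypothetical, and the robots.txt content is parsed from an in-memory list so that no request is actually sent.

```python
import time
import urllib.request
import urllib.robotparser

# Hypothetical identification string and pause; choose values that fit your organisation.
USER_AGENT = "example-stats-bot/0.1 (+mailto:webmaster@example.org)"
IDLE_SECONDS = 2.0


def build_parser(robots_lines):
    """Parse robots.txt content that has already been downloaded."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_lines)
    return parser


def make_request(url):
    """Build (but do not send) a request that identifies the robot."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def polite_plan(parser, urls, idle_seconds=IDLE_SECONDS, sleep=time.sleep):
    """Return requests for the URLs robots.txt allows, pausing between them."""
    requests = []
    for url in urls:
        if not parser.can_fetch(USER_AGENT, url):
            continue  # respect robots.txt
        requests.append(make_request(url))
        sleep(idle_seconds)  # idle time so the server is not overloaded
    return requests


robots = build_parser(["User-agent: *", "Disallow: /private/"])
plan = polite_plan(
    robots,
    ["https://www.example.org/products", "https://www.example.org/private/report"],
    sleep=lambda seconds: None,  # skip the real pause in this demonstration
)
print([request.full_url for request in plan])
```

In real use each request would be sent with `urllib.request.urlopen` and the default `time.sleep` would provide the pause between requests.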

Dutch Business Register (simplified) - From administrative units to statistical units: • Sources: • Trade Register • Tax Register • Social security register (employees) • Profilers • About 1.5 million administrative entities • About 0.5 million have a URL • Quality of the URL field is not known, but it seems usable
