Webscraping at Statistics Netherlands. Olav ten Bosch, 23 March 2016, ESSnet big data WP2, Rome
Content • Internet as a datasource (IAD): motivation • Some IAD projects over past years • Technologies used • Summary / trends • Observations / thoughts • Legal • The Dutch Business Register
The why • Internet sources: faster, better, more efficient; new indicators • Administrative sources: tax, social security services; municipalities / provinces; supermarkets; … • Surveys: less!!!
Fuel prices (2009) • Daily fuel prices from the website of unmanned petrol stations (tinq.nl) • Regional prices (per station) every day • Now (2016): • A direct data feed from a travelcard company, weekly • Fuel prices per day and all transactions of that week • Publication on website: prices per month
Airline tickets (2010) • Pilot: 3 robots on 6 airline companies • 2 robots by external companies, 1 by Statistics Netherlands (SN) • Prices agreed with manual collection • Quite expensive; negative business case • 2016: still manual price collection of airline tickets
Housing market (from 2011) • Discussions with external company for > 1 year (iWoz) • We scraped 5 sites, about 250,000 observations / week, for 2 years • From 2013: • Direct feed from one of the sites (Jaap.nl) • Statline tables: Bestaande woningen in verkoop (existing homes for sale) • “based on 80-90 percent of the market”
Bulk price collection for CPI (1) • From 2012: • Mainly clothing • Software scrapes all prices and product data (id, name, description, category, colour, size, …) • 2016: • About 500,000 price observations daily from 10 sites • Data from 3 sites used in production of Dutch CPI • Price collection process embedded in organisation • Plans to extend to > 20 sites; other domains
Bulk price collection for CPI (2): Processing bulk data from the Internet • Features (example): Fine-knit Jumper, Dark blue, Striped, Cotton edges • Data collection & feature extraction • Structured data • Big Data • Index methods • Index based on internet data
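The feature-extraction step on this slide can be illustrated with a minimal Python sketch. It is not the method used at Statistics Netherlands, which the slides do not describe. It assumes simple keyword matching, and the vocabularies and the names `first_match` and `extract_features` are hypothetical.

```python
import re

# Hypothetical vocabularies; a production system would need much larger,
# curated lists. Longer phrases come first so "dark blue" wins over "blue".
GARMENTS = ["jumper", "jeans", "t-shirt", "dress"]
COLOURS = ["dark blue", "light blue", "blue", "black", "white", "red"]
PATTERNS = ["striped", "checked", "plain"]
MATERIALS = ["cotton", "wool", "polyester"]


def first_match(text, vocabulary):
    """Return the first vocabulary term that occurs in text as a whole word or phrase."""
    for term in vocabulary:
        if re.search(r"\b" + re.escape(term) + r"\b", text):
            return term
    return None


def extract_features(description):
    """Turn a free-text product description into a structured record."""
    text = description.lower()
    return {
        "garment": first_match(text, GARMENTS),
        "colour": first_match(text, COLOURS),
        "pattern": first_match(text, PATTERNS),
        "material": first_match(text, MATERIALS),
    }


# The example description from the slide.
record = extract_features("Fine-knit Jumper Dark blue Striped Cotton edges")
print(record)
```

Records of this kind are the "structured data" that an index method could then work on.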
Robot-assisted price collection • Robot tool for detecting price changes on (parts of) websites • Traffic light indicates status: • Green: nothing changed, price is saved in database • Red: some change, needs attention of statistician • Two clicks to hold old price or store a new one • In production from 2014
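A minimal Python sketch of the traffic-light idea follows. It is an assumption about how such change detection could work, not a description of the actual tool: it compares a hash of the watched page fragment with the hash stored at the previous visit. The names `fingerprint` and `traffic_light` are hypothetical.

```python
import hashlib


def fingerprint(page_fragment):
    """Hash a page fragment after collapsing whitespace, so layout-only changes are ignored."""
    normalised = " ".join(page_fragment.split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def traffic_light(previous_fingerprint, current_fragment):
    """Return 'green' if the fragment is unchanged since the last visit, otherwise 'red'."""
    if fingerprint(current_fragment) == previous_fingerprint:
        return "green"  # nothing changed: the old price can be saved automatically
    return "red"        # something changed: a statistician has to look at it


# Fingerprint stored at the previous visit (made-up fragment).
stored = fingerprint("<span class='price'>EUR 19,95</span>")

print(traffic_light(stored, "<span class='price'>EUR  19,95</span>\n"))  # whitespace only
print(traffic_light(stored, "<span class='price'>EUR 17,95</span>"))     # price changed
```

Hashing only the relevant part of a page, rather than the whole page, keeps unrelated changes such as banners from turning the light red.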
Collect data on enterprises for EGR (2013) • Pilot: find data about EGR enterprises on the web • We scraped semi-structured data from Wikipedia • Multiple Wikipedia languages (NL, EN, DE, FR) • 2016: something similar in ESSnet BD WP2?
Search product descriptions for classifying business activities • Search product descriptions on web (from 2014) • First time we used automated search with Google search API for statistics • Pilot, no production • Some doubts on Google results
Twitter-LinkedIn (1) • LinkedIn-Twitter for profiling (2015) • Automated search on LinkedIn based on a sample of Twitter users • Very specific and experimental • “Profiling of Twitter data, a big data selectivity study”, Piet Daas, Joep Burger, Quan Lé, Olav ten Bosch
Scraping websites of enterprises • Identify family businesses (search and / or crawling) (2016) • Identify businesses with a Corporate Social Responsibility (CSR) (search and / or crawling) (2016) • Research program: • “Extracting information from websites to improve economic figures” • This ESSnet BD WP2 !!!
Crawling for Statistics • Input: incomplete statistical data • Url-base, search terms, navigation terms • Focused Crawler (Roboto) on the Internet • Item identifier terms: “year report, family business” • Search & Match, ElasticSearch datastore • Output: more complete statistical data
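The role of navigation terms and item identifier terms can be sketched in Python as follows. This is not Roboto's API (Roboto is a Node.js package) and not the production crawler. It is a toy that assumes a priority queue in which links whose anchor text contains a navigation term are visited first. The in-memory `SITE`, the navigation terms and the function names are hypothetical, and only the item terms come from the slide.

```python
import heapq

NAVIGATION_TERMS = ["about", "investor", "annual report"]  # hypothetical examples
ITEM_TERMS = ["year report", "family business"]            # example terms from the slide

# Toy in-memory site standing in for the internet:
# url -> (page text, [(anchor text, target url), ...])
SITE = {
    "http://example.org/": ("Welcome", [("Shop", "http://example.org/shop"),
                                        ("About us", "http://example.org/about")]),
    "http://example.org/about": ("We are a family business since 1950.", []),
    "http://example.org/shop": ("Buy things", []),
}


def link_priority(anchor_text):
    """Lower value is visited earlier; anchors containing a navigation term get 0."""
    anchor = anchor_text.lower()
    return 0 if any(term in anchor for term in NAVIGATION_TERMS) else 1


def focused_crawl(start_url, max_pages=10):
    """Visit pages in priority order and record which item terms each page contains."""
    frontier = [(0, start_url)]
    seen = {start_url}
    visited, hits = [], {}
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        text, links = SITE.get(url, ("", []))
        visited.append(url)
        found = [term for term in ITEM_TERMS if term in text.lower()]
        if found:
            hits[url] = found
        for anchor, target in links:
            if target not in seen:
                seen.add(target)
                heapq.heappush(frontier, (link_priority(anchor), target))
    return visited, hits


visited, hits = focused_crawl("http://example.org/")
print(visited)
print(hits)
```

In this toy run the "About us" link is followed before "Shop", although it appears later on the page, and the match on "family business" is recorded for the about page.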
Technologies used • Perl (2009), Djuggler (2010) • Python, Scrapy (2010) • R (2011-2015) • NodeJS (JavaScript on server) (2014-) • Google Search API (2014-) • ElasticSearch (2016) • Roboto (nodejs package, 2015-2016) • Nutch: tested, not used • Generic Framework (robot framework) for bulk scraping of prices
Observations / thoughts … • If it is there, we can get it • Technology is (usually) not the problem! • The internet is a living thing! • It’s too simple to think we can just buy the internet somewhere and then make statistics! • It’s powerful to combine something we know with something we observe! • External companies can help, but be careful …
Legal • Dutch Statistics Law: • Enterprises have to provide data to Statistics Netherlands on request • Scraping information from websites reduces response burden • Statistics Netherlands uses data for official statistics only • Dutch database legislation: • Commercial re-use of intellectual property is forbidden • This may also apply to internet sources • Privacy: • Dutch (statistical) legislation on protection of personal information • Statistics Netherlands only scrapes public sources and processes the data within its own safe environment, just as with other (privacy-sensitive) internal data • Netiquette: • respect robots.txt • identify yourself (user-agent) • do not overload servers, use some idle time between requests
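The three netiquette rules can be sketched with the Python standard library. This is an illustration under assumptions, not the scraper used at Statistics Netherlands: the user-agent string, the default delay, the sample robots.txt and the helper names are all hypothetical.

```python
import time
import urllib.request
import urllib.robotparser

# Hypothetical identification string; a real robot would name its organisation and a contact.
USER_AGENT = "example-statistics-bot/0.1 (contact: webmaster@example.org)"
DEFAULT_DELAY_SECONDS = 2.0  # assumed idle time when robots.txt gives no Crawl-delay


def load_rules(robots_txt):
    """Parse the text of a robots.txt file into a rule checker."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed_urls(parser, urls):
    """Rule 1: keep only the URLs that robots.txt allows for our user agent."""
    return [url for url in urls if parser.can_fetch(USER_AGENT, url)]


def delay_for(parser):
    """Rule 3: use the site's Crawl-delay if present, else the assumed default."""
    delay = parser.crawl_delay(USER_AGENT)
    return float(delay) if delay is not None else DEFAULT_DELAY_SECONDS


def polite_get(url, delay):
    """Rule 2: send an identifying User-Agent header, then idle before returning."""
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request) as response:
        body = response.read()
    time.sleep(delay)
    return body


# Offline demonstration with a made-up robots.txt; no request is sent here.
rules = load_rules("User-agent: *\nDisallow: /private/\nCrawl-delay: 5\n")
candidates = ["http://example.org/prices", "http://example.org/private/x"]
print(allowed_urls(rules, candidates))
print(delay_for(rules))
```

In real use the robots.txt text would first be downloaded from the site, and `polite_get` would be called only for the URLs returned by `allowed_urls`.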
Dutch Business Register (simplified) • From administrative units to statistical units • Sources: • Trade Register • Tax Register • Social security register (employees) • Profilers • About 1.5 million administrative entities • About 0.5 million have a url • Quality of url field not known, but seems usable
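A first look at the quality of the url field could start with a sketch like the one below. It is an assumed approach, not one described in the slides: it normalises each raw value to scheme plus host and counts how many values survive. The sample values and the name `normalise_url` are hypothetical.

```python
from urllib.parse import urlsplit


def normalise_url(raw):
    """Return 'scheme://host' for a plausible web address, or None if the value is unusable."""
    value = (raw or "").strip().lower()
    if not value:
        return None
    if "://" not in value:
        value = "http://" + value  # assume http when the register omits the scheme
    parts = urlsplit(value)
    host = parts.hostname or ""
    if parts.scheme not in ("http", "https") or "." not in host or " " in host:
        return None
    return f"{parts.scheme}://{host}"


# Made-up register values of the kind one might expect in a free-text url field.
samples = ["WWW.Example.NL ", "https://example.org/contact", "n/a", "", "geen website"]
usable = [normalise_url(s) for s in samples]
share_usable = sum(u is not None for u in usable) / len(samples)
print(usable)
print(share_usable)
```

A check of this kind only shows that a value looks like a web address; whether the site exists and belongs to the enterprise needs a separate test.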