Explore how web scraping is revolutionizing data collection for statistics in the Netherlands, enabling faster, better, and more efficient analysis of diverse sources like fuel prices, airline tickets, and housing market data. Discover the technologies, insights, and legal considerations shaping this innovative approach.
Webscraping at Statistics Netherlands • Olav ten Bosch, 23 March 2016, ESSnet big data WP2, Rome
Content • Internet as a datasource (IAD): motivation • Some IAD projects over past years • Technologies used • Summary / trends • Observations / thoughts • Legal • The Dutch Business Register
The why (diagram of data sources) • Surveys: less!!! • Administrative sources: tax, social security services; municipalities / provinces; supermarkets; … • Internet sources: faster, better, more efficient; new indicators
Fuel prices (2009) • Daily fuel prices from the website of unmanned petrol stations (tinq.nl) • Regional prices (per station) every day • Now (2016): • A direct data feed from a travelcard company, weekly • Fuel prices per day and all transactions of that week • Publication on website: prices per month
Airline tickets (2010) • Pilot: 3 robots on 6 airline companies • 2 robots by external companies, 1 by Statistics Netherlands (SN) • Prices agree with manual collection • Quite expensive; negative business case • 2016: still manual price collection of airline tickets
Housing market (from 2011) • Discussions with external company for > 1 year (iWoz) • We scraped 5 sites, about 250,000 observations / week, for 2 years • From 2013: • Direct feed from one of the sites (Jaap.nl) • Statline tables: Bestaande woningen in verkoop (existing homes for sale) • “based on 80-90 percent of the market”
Bulk price collection for CPI (1) • From 2012: • Mainly clothing • Software scrapes all prices and product data (id, name, description, category, colour, size, …) • 2016: • About 500,000 price observations daily from 10 sites • Data from 3 sites used in production of Dutch CPI • Price collection process embedded in organisation • Plans to extend to > 20 sites; other domains
Bulk price collection for CPI (2) • Processing bulk data from the Internet (diagram stages): Big Data; data collection & feature extraction; structured data; index methods; index based on internet data • Example features: fine-knit jumper, dark blue, striped, cotton edges
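The feature extraction step named on this slide can be illustrated with a small sketch. It is not the method used at Statistics Netherlands; it only assumes that features are found by matching a product description against keyword lists. The vocabulary below and the function name `extract_features` are hypothetical and chosen to fit the jumper example.

```python
import re

# Hypothetical keyword lists per feature; a real vocabulary would be far larger.
VOCABULARY = {
    "garment": ["jumper", "t-shirt", "jeans"],
    "colour": ["dark blue", "blue", "red", "black"],
    "pattern": ["striped", "checked"],
    "material": ["cotton", "wool"],
    "knit": ["fine-knit", "chunky-knit"],
}


def extract_features(description):
    """Turn a free-text product description into a dict of structured features."""
    text = description.lower()
    features = {}
    for feature, terms in VOCABULARY.items():
        # Try longer terms first so that "dark blue" wins over "blue".
        for term in sorted(terms, key=len, reverse=True):
            # Match whole terms only, not fragments inside other words.
            pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
            if re.search(pattern, text):
                features[feature] = term
                break
    return features


print(extract_features("Fine-knit Jumper Dark blue Striped Cotton edges"))
```

Features that are missing from the description are simply left out of the result, so downstream index methods can decide how to treat incomplete products.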
Robot-assisted price collection • Robot tool for detecting price changes on (parts of) websites • Traffic light indicates status: • Green: nothing changed, price is saved in database • Red: something changed, needs attention of statistician • Two clicks to keep the old price or store a new one • In production since 2014
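The traffic-light logic on this slide can be sketched as follows. This is a minimal illustration, not the production tool: the `Observation` record, the in-memory `database` dict and the `review_queue` list are hypothetical stand-ins for whatever storage and review screen the real robot uses.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Observation:
    item_id: str
    price: Optional[str]  # price text as found on the page; None if not found


def traffic_light(previous, current):
    """Return 'green' when the price is unchanged, 'red' in every other case."""
    if current.price is not None and current.price == previous.price:
        return "green"
    return "red"


def process(previous, current, database, review_queue):
    """Save unchanged prices directly; hand everything else to a statistician."""
    status = traffic_light(previous, current)
    if status == "green":
        database[current.item_id] = current.price
    else:
        # A person later decides whether to keep the old price or store a new one.
        review_queue.append((previous, current))
    return status


database, review_queue = {}, []
print(process(Observation("fuel-95", "1.59"), Observation("fuel-95", "1.59"),
              database, review_queue))
print(process(Observation("fuel-98", "1.69"), Observation("fuel-98", "1.72"),
              database, review_queue))
```

A price that can no longer be found on the page is treated as red as well, since a changed page layout also needs human attention.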
Collect data on enterprises for EGR (2013) • Pilot: find data about EGR enterprises on the web • We scraped semi-structured data from Wikipedia • Multiple Wikipedia languages (NL, EN, DE, FR) • 2016: something similar in ESSnet BD WP2?
Search product descriptions for classifying business activities • Search product descriptions on web (from 2014) • First time we used automated search with Google search API for statistics • Pilot, no production • Some doubts on Google results
Twitter-LinkedIn (1) • LinkedIn-Twitter for profiling (2015) • Automated search on LinkedIn based on a sample of Twitter users • Very specific and experimental • “Profiling of Twitter data, a big data selectivity study”, Piet Daas, Joep Burger, Quan Lé, Olav ten Bosch
Scraping websites of enterprises • Identify family businesses (search and / or crawling) (2016) • Identify businesses with a Corporate Social Responsibility (CSR) (search and / or crawling) (2016) • Research program: • “Extracting information from websites to improve economic figures” • This ESSnet BD WP2 !!!
Crawling for Statistics (diagram) • Input: incomplete statistical data, providing a URL base, search terms, navigation terms and item identifier terms (e.g. “year report, family business”) • Focused crawler (Roboto) collects pages from the Internet into an ElasticSearch datastore • Search & match on the datastore yields more complete statistical data
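The idea of a focused crawler driven by navigation terms and item identifier terms can be sketched in a few lines. This is not Roboto (a Node.js package) and not its API; it is a standalone illustration that assumes links whose anchor text or URL contain navigation terms are visited first, and that a page counts as a hit when its text contains an item identifier term. The `fetch` argument is a hypothetical callable that returns HTML for a URL, or `None` when the page is unavailable.

```python
import heapq
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collect page text and (href, anchor text) pairs."""

    def __init__(self):
        super().__init__()
        self.links, self.text_parts = [], []
        self._href, self._anchor = None, []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href, self._anchor = dict(attrs).get("href"), []

    def handle_data(self, data):
        self.text_parts.append(data)
        if self._href is not None:
            self._anchor.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._anchor)))
            self._href = None


def score(text, terms):
    """Count how many (lowercase) terms occur in the text."""
    text = text.lower()
    return sum(1 for term in terms if term in text)


def focused_crawl(start_url, fetch, navigation_terms, item_terms, max_pages=20):
    """Visit the most promising links first; return URLs whose text matches."""
    site = urlparse(start_url).netloc
    frontier, seen, hits, visited = [(0, start_url)], {start_url}, [], 0
    while frontier and visited < max_pages:
        _, url = heapq.heappop(frontier)  # lowest value = highest priority
        html = fetch(url)
        visited += 1
        if html is None:
            continue
        page = LinkExtractor()
        page.feed(html)
        if score(" ".join(page.text_parts), item_terms) > 0:
            hits.append(url)
        for href, anchor in page.links:
            target = urljoin(url, href)
            if target in seen or urlparse(target).netloc != site:
                continue  # stay on the enterprise's own website
            seen.add(target)
            priority = -score(anchor + " " + target, navigation_terms)
            heapq.heappush(frontier, (priority, target))
    return hits
```

With a small page budget, the priority queue makes the crawler reach an "About us" page before unrelated pages, which is the point of supplying navigation terms.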
Technologies used • Perl (2009), Djuggler (2010) • Python, Scrapy (2010) • R (2011-2015) • Node.js (JavaScript on server) (2014-) • Google Search API (2014-) • ElasticSearch (2016) • Roboto (Node.js package, 2015-2016) • Nutch: tested, not used • Generic Framework (robot framework) for bulk scraping of prices
Observations / thoughts … • If it is there, we can get it • Technology is (usually) not the problem! • The internet is a living thing! • It’s too simple to think we can just buy the internet somewhere and then make statistics! • It’s powerful to combine something we know with something we observe! • External companies can help, but be careful …
Legal • Dutch Statistics Law: • Enterprises have to provide data to Statistics Netherlands on request • Scraping information from websites reduces response burden • Statistics Netherlands uses data for official statistics only • Dutch database legislation: • Commercial re-use of intellectual property is forbidden • This may also apply to internet sources • Privacy: • Dutch (statistical) legislation on protection of personal information • Statistics Netherlands only scrapes public sources and processes the data within Statistics Netherlands’ safe environment, just as with other (privacy-sensitive) data internally • Netiquette: • respect robots.txt • identify yourself (user-agent) • do not overload servers, use some idle time between requests
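The three netiquette rules can be combined in one small helper. The sketch below uses only the Python standard library; the `PoliteFetcher` class, the user-agent string and the default delay of 2 seconds are assumptions made for illustration, not settings used by Statistics Netherlands. The caller is expected to download the site's robots.txt once and pass its text in.

```python
import time
import urllib.request
import urllib.robotparser

# Hypothetical identification string; point it at a real contact page in practice.
USER_AGENT = "example-statistics-bot/0.1 (+https://example.org/bot-info)"


class PoliteFetcher:
    """Fetch pages from one site while following the netiquette rules."""

    def __init__(self, robots_txt, delay_seconds=2.0,
                 sleep=time.sleep, clock=time.monotonic):
        self.parser = urllib.robotparser.RobotFileParser()
        self.parser.parse(robots_txt.splitlines())
        # Honour a Crawl-delay from robots.txt if it is longer than our own.
        declared = self.parser.crawl_delay(USER_AGENT)
        self.delay = max(delay_seconds, declared or 0)
        self.sleep, self.clock = sleep, clock
        self.last_request = None

    def allowed(self, url):
        """Rule 1: respect robots.txt."""
        return self.parser.can_fetch(USER_AGENT, url)

    def wait(self):
        """Rule 3: keep some idle time between requests."""
        if self.last_request is not None:
            remaining = self.delay - (self.clock() - self.last_request)
            if remaining > 0:
                self.sleep(remaining)
        self.last_request = self.clock()

    def fetch(self, url):
        """Return the page body, or None if robots.txt disallows the URL."""
        if not self.allowed(url):
            return None
        self.wait()
        # Rule 2: identify yourself through the User-Agent header.
        request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
        with urllib.request.urlopen(request, timeout=30) as response:
            return response.read()
```

The `sleep` and `clock` parameters exist only so the waiting behaviour can be checked without real delays; in normal use the defaults are fine.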
Dutch Business Register (simplified) - From administrative units to statistical units: • Sources: • Trade Register • Tax Register • Social security register (employees) • Profilers • About 1.5 million administrative entities • About 0.5 million have a URL • Quality of the URL field is not known, but it seems usable