This presentation describes the JRC approach to the Big Data Competition (BDCOMP): automated methods with minimal human involvement, a wide rather than deep strategy intended to generalize in the future, and a focus on permanently accessible, machine-readable data sources. ARIMA models, random forest and xgboost regression are used to now-cast inflation and tourism indicators across European Member States, and data sources and possible improvements are discussed in a post-competition analysis.
Now-casting competition: the JRC approach
Panos Christidis, Andris Peize
European Commission, Joint Research Centre, Seville, Spain
JRC participation in the Big Data competition:
• Track 2 (inflation): 8 Member States
• Track 3 (inflation w/o energy): 8 Member States
• Track 4 (tourism): all Member States
• Track 5 (tourism/hotel): all Member States
General strategy:
• Wide approach (vs. deep) in order to generalize in the future (BDCOMP as a benchmark)
• Automated methods with minimum human involvement
• Data sources should be permanently accessible, machine readable and updated at least monthly
• Where possible, apply machine learning techniques
Automated data pipeline in R (a minimal sketch follows the list):
• eurostat: to automate the extraction of data from the Eurostat database
• forecast: for time series processing and ARIMA modelling
• randomForest: for random forest based regression
• xgboost: for xgboost based regression
• data.table and dplyr: for data processing
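A minimal sketch of such a pipeline, assuming the Eurostat HICP index table prc_hicp_midx with illustrative filters (geo, coicop, unit) rather than the exact series used in the competition; the time/values column names follow the eurostat package versions of that period (newer releases use TIME_PERIOD):

library(eurostat)
library(forecast)
library(dplyr)

# Monthly all-items HICP index for one Member State (dataset id and filters are illustrative)
hicp <- get_eurostat("prc_hicp_midx", time_format = "date") %>%
  filter(geo == "DE", coicop == "CP00", unit == "I15") %>%
  arrange(time)

# Fit an ARIMA model automatically and produce a one-month-ahead now-cast
y   <- ts(hicp$values, frequency = 12)
fit <- auto.arima(y)
forecast(fit, h = 1)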
Structure of approaches:
• ARIMA only
• ARIMA combined with external "Big" data
• "New" methods (randomForest & xgboost) combined with external "Big" data and ARIMA (sketched below)
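A sketch of the "new"-methods branch, assuming 12 monthly lags of the target as the feature set; the lag choice and the AirPassengers placeholder series are illustrative, not the competition inputs:

library(randomForest)
library(xgboost)

# Placeholder monthly series standing in for the target indicator
y <- as.numeric(log(AirPassengers))

# Simple lag matrix: column 1 is y_t, the remaining columns are lags 1..12
lagged <- embed(y, 13)
target <- lagged[, 1]
X      <- lagged[, -1]
colnames(X) <- paste0("lag", 1:12)

rf <- randomForest(x = X, y = target, ntree = 500)

dtrain <- xgb.DMatrix(data = X, label = target)
xgb    <- xgb.train(params = list(objective = "reg:squarederror"),
                    data = dtrain, nrounds = 200, verbose = 0)

# Now-cast the next month from the 12 most recent observations
newx <- matrix(rev(tail(y, 12)), nrow = 1, dimnames = list(NULL, colnames(X)))
predict(rf, newx)
predict(xgb, xgb.DMatrix(newx))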
External data:
• Billion Prices Project (inflation)
• SABRE/Amadeus (flight reservations)
Track 2 (inflation):
• Good match for Germany, France, UK (RMSE 0.20%-0.25%)
• Reasonable match for Ireland, Spain, Netherlands, Italy (RMSE 0.30%-0.75%)
• Greece: all approaches > 1%
Track 2 insights (1/2)
• The differences between the approaches were mainly the result of Member State characteristics rather than of the method.
• The machine learning methods over-fitted to past data (can be improved)
• The "Big Data" source (Billion Prices Project) introduced too much noise (further research needed)
• Post-competition analysis: taking the mean of the two best approaches further improves their score (see the sketch below)
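A sketch of that post-competition check, using a simple RMSE helper; the series below are random placeholders, not the competition submissions:

rmse <- function(actual, pred) sqrt(mean((actual - pred)^2))

# Placeholder data: 12 months of "official" values and two sets of now-casts
set.seed(1)
actual <- cumsum(rnorm(12))
pred_a <- actual + rnorm(12, sd = 0.3)
pred_b <- actual + rnorm(12, sd = 0.3)

# Averaging the two approaches and comparing scores
pred_mean <- (pred_a + pred_b) / 2
rmse(actual, pred_a)
rmse(actual, pred_b)
rmse(actual, pred_mean)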
Track 2 insights (2/2)
• The impact of long and short term trends differs across Member States (and, compared with the other indicators, also across indicators): the individual ARIMA models 1-5 had different parameters, and the right weights for their ensemble gave the best results (a weight-selection sketch follows this list).
• The xgboost and random forest models did not capture shifts in the trends as well (but there are methods that would allow them to do so)
• None of the approaches was capable of capturing structural changes such as those in Greece due to the crisis.
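One way to pick such ensemble weights is sketched below, minimising RMSE over a validation window with base R's optim; the forecast matrix is a random placeholder and the optimiser choice is an assumption, not the competition procedure:

# Placeholder: 24 validation months and forecasts from five ARIMA variants (columns)
set.seed(2)
actual <- cumsum(rnorm(24))
fc     <- sapply(1:5, function(i) actual + rnorm(24, sd = 0.1 * i))

# RMSE of a weighted combination, with weights kept positive and summing to 1
ensemble_rmse <- function(w, fc, actual) {
  w <- abs(w) / sum(abs(w))
  sqrt(mean((actual - fc %*% w)^2))
}

opt     <- optim(rep(1/5, 5), ensemble_rmse, fc = fc, actual = actual)
weights <- abs(opt$par) / sum(abs(opt$par))
weights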
Track 4&5 insights (1/2)
• Wide range of precision between countries and indicators (2% to over 10%)
• Better match for track 5 (stays in hotels) than for track 4 (all types)
• "Plain vanilla" ARIMA still better than more complicated methods (but with some positive messages)
• "Big" data sources useful for Member States with a high share of foreign tourists and a high share of stays in hotels (but otherwise they overfit without correct weights)
Track 4&5 insights (2/2)
• Largest source of error: March/April (Semana Santa; one possible treatment is sketched below)
• Calendar differences (weekends, leap year)
• Weather conditions probably have an effect (but were not taken into account)
• Different trends per market segment, difficult to calibrate
• Length of stay changes, but this is not captured
• Airbnb effect?
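One possible treatment of the Easter shift: an Easter-month dummy added as an external regressor to an automatically selected ARIMA model. The timeDate-based dummy, the 2010 start date and the AirPassengers placeholder series are assumptions for illustration, not the method used in the competition:

library(forecast)
library(timeDate)

# Placeholder monthly series standing in for nights spent, assumed to start in January 2010
y     <- ts(as.numeric(log(AirPassengers)), start = c(2010, 1), frequency = 12)
dates <- seq(as.Date("2010-01-01"), by = "month", length.out = length(y))

# Dummy equal to 1 in the month containing Easter Sunday
easter_days  <- as.Date(Easter(2010:2022))
easter_month <- as.integer(format(dates, "%Y-%m") %in% format(easter_days, "%Y-%m"))

fit <- auto.arima(y, xreg = cbind(easter = easter_month))

# One-month-ahead now-cast, stating whether Easter falls in the next month (here: no)
forecast(fit, h = 1, xreg = cbind(easter = 0))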
What did we learn?
• Feasible to now-cast official statistics (technically and methodologically)
• Precision can improve with further work
• "Big Data" sources are promising, but require considerable effort to make usable
• Over-fitting is a problem for both conventional and new approaches
• Approaches need to be tested over even longer periods (but the 12 months of the BDCOMP are already impressive)
What next?
CASSANDRA: a policy support tool inspired by the BDCOMP