Enhancing Now-Casting Accuracy with Automated Data Pipeline and Machine Learning

This presentation describes the JRC's approach to the Big Data Competition (BDCOMP), which relied on automated methods with minimal human involvement and on machine learning techniques. The strategy favoured a wide rather than deep approach so that it could be generalized later, and required data sources to be permanently accessible and machine readable. ARIMA models, random forest and xgboost regression were used to now-cast inflation and tourism indicators for European Member States; the post-competition analysis discusses the data sources and possible improvements.

Presentation Transcript


  1. Now-casting competition: the JRC approach
     Panos Christidis, Andris Peize
     European Commission, Joint Research Centre, Seville, Spain

  2. JRC participation in Big Data competition:
     • Track 2 (inflation): 8 Member States
     • Track 3 (inflation w/o energy): 8 Member States
     • Track 4 (tourism): all Member States
     • Track 5 (tourism/hotel): all Member States

  3. General strategy:
     • Wide approach (vs. deep) in order to generalize in the future (BDCOMP as a benchmark)
     • Automated methods with minimal human involvement
     • Data sources should be permanently accessible, machine readable and updated at least monthly
     • Where possible, apply machine learning techniques

  4. Automated data pipeline in R:
     • eurostat: to automate the extraction of data from the EUROSTAT database
     • forecast: for time series processing and ARIMA modelling
     • randomForest: for Random Forest based regression
     • xgboost: for xgboost based regression
     • data.table and dplyr: for processing

  5. Structure of approaches (labels from the slide diagram): ARIMA / External "Big" Data / ARIMA / "new" methods: randomForest & xgboost / External "Big" Data / ARIMA

  6. ARIMA:

  7. External data:
     • Billion Prices Project (inflation)
     • SABRE/Amadeus (flight reservations)

  8. Track 2 (inflation)
     • Good match for Germany, France, UK (RMSE 0.20%-0.25%)
     • Reasonable match for Ireland, Spain, Netherlands, Italy (RMSE 0.30%-0.75%)
     • Greece: all approaches > 1%
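The scores on the slide above are root mean squared errors (RMSE) between the now-casts and the values later published. As a reminder of how that metric is computed, here is a minimal sketch in plain Python (the original pipeline was in R). The numbers are made up for illustration and are not competition data; the function and variable names are assumptions.

```python
import math


def rmse(nowcasts, actuals):
    """Root mean squared error between now-casts and published values."""
    if len(nowcasts) != len(actuals) or not nowcasts:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(n - a) ** 2 for n, a in zip(nowcasts, actuals)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical monthly inflation rates in percent (NOT competition data).
nowcasts = [0.3, 0.1, -0.2, 0.4]
actuals = [0.2, 0.3, -0.1, 0.1]

score = rmse(nowcasts, actuals)
print(f"RMSE: {score:.3f} percentage points")
```

Because the error is squared before averaging, a single large miss (for example around a structural break) weighs more heavily than several small ones.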

  9. Track 2 insights (1/2)
     • The differences between the approaches were mainly the result of Member State characteristics rather than of the method.
     • The machine learning methods over-fitted to past data (can be improved)
     • The "Big Data" source (Billion Prices Project) introduced too much noise (further research needed)
     • Post-competition analysis: the mean of the two best approaches further improves their score
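The last point above, that the mean of the two best approaches scored better than either alone, is what one would expect when the errors of the two approaches partly offset each other. The sketch below illustrates the effect with invented numbers (not competition data); averaging is not guaranteed to help when both approaches err in the same direction.

```python
import math


def rmse(pred, actual):
    """Root mean squared error."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(actual))


def mean_ensemble(pred_a, pred_b):
    """Simple average of two sets of now-casts, period by period."""
    return [(a + b) / 2 for a, b in zip(pred_a, pred_b)]


# Hypothetical values: approach A tends to overshoot, approach B to undershoot.
actual = [0.2, 0.3, -0.1, 0.1]
approach_a = [0.4, 0.5, 0.1, 0.3]
approach_b = [0.1, 0.2, -0.3, 0.0]

combined = mean_ensemble(approach_a, approach_b)

for name, pred in [("A", approach_a), ("B", approach_b), ("mean", combined)]:
    print(name, round(rmse(pred, actual), 4))
```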

  10. Track 2 insights (2/2)
     • The impact of long- and short-term trends differs between Member States (and, when compared across indicators, between indicators as well): the individual ARIMA models 1-5 had different parameters, and the right weights for their ensemble gave the best results.
     • The xgboost and random forest models did not capture the shift in the trends as well (but there are methods that allow them to do so)
     • None of the approaches was capable of capturing structural changes like the ones in Greece due to the crisis.
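The slide does not say how the ensemble weights for the ARIMA models were chosen. One common option, shown here purely as an assumption, is to weight each model by the inverse of its mean squared error on a validation period. The sketch uses invented numbers and hypothetical function names.

```python
def mse(pred, actual):
    """Mean squared error of one model on the validation period."""
    return sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(actual)


def inverse_mse_weights(model_preds, actual):
    """Weights proportional to 1/MSE, normalised to sum to one.

    This weighting rule is an assumption; the presentation does not
    state which rule was used.
    """
    inverse = [1.0 / max(mse(p, actual), 1e-12) for p in model_preds]
    total = sum(inverse)
    return [w / total for w in inverse]


def weighted_nowcast(latest_preds, weights):
    """Combine the models' latest now-casts using the given weights."""
    return sum(w * p for w, p in zip(weights, latest_preds))


# Hypothetical validation data for two models (NOT competition data).
actual = [1.0, 2.0, 3.0]
model_preds = [
    [1.1, 2.1, 3.1],  # small errors, so it receives the larger weight
    [1.2, 1.8, 3.2],  # larger errors, so it receives the smaller weight
]

weights = inverse_mse_weights(model_preds, actual)
nowcast = weighted_nowcast([4.0, 4.5], weights)
print(weights, nowcast)
```

Weights estimated on a short validation window can themselves over-fit, which is consistent with the over-fitting concerns raised elsewhere in the presentation.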

  11. Tracks 4 & 5 (tourism)

  12. Track 4&5 insights (1/2)
     • Wide range of precision between countries and indicators (2% to over 10%)
     • Better match for Track 5 (stays in hotels) than Track 4 (all types)
     • "Plain vanilla" ARIMA still better than more complicated methods (but some positive messages)
     • "Big" data sources useful for Member States with high % of foreign tourists with high % in hotels (but otherwise overfit without correct weights)

  13. Track 4&5 insights (2/2)
     • Largest source of error: March/April (Semana Santa)
     • Calendar differences (weekends, leap year)
     • Weather conditions probably have an effect (but were not taken into account)
     • Different trends per market segment, difficult to calibrate
     • Length of stay changes, but this was not captured
     • Airbnb effect?
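The calendar effects listed above (the shifting date of Semana Santa, the number of weekends, leap years) can be turned into explanatory variables for a monthly model. The sketch below is one possible way to derive such features using only the Python standard library; it is not taken from the JRC pipeline, and the feature names are assumptions. Easter Sunday is computed with the widely used anonymous Gregorian algorithm, and Holy Week is taken here as Palm Sunday through Easter Sunday.

```python
import calendar
from datetime import date, timedelta


def easter_sunday(year):
    """Date of Easter Sunday (Gregorian calendar, anonymous algorithm)."""
    a = year % 19
    b, c = divmod(year, 100)
    d, e = divmod(b, 4)
    f = (b + 8) // 25
    g = (b - f + 1) // 3
    h = (19 * a + b - d - g + 15) % 30
    i, k = divmod(c, 4)
    l = (32 + 2 * e + 2 * i - h - k) % 7
    m = (a + 11 * h + 22 * l) // 451
    month, day = divmod(h + l - 7 * m + 114, 31)
    return date(year, month, day + 1)


def calendar_features(year, month):
    """Calendar regressors for one month (feature names are hypothetical)."""
    n_days = calendar.monthrange(year, month)[1]
    days = [date(year, month, d) for d in range(1, n_days + 1)]
    easter = easter_sunday(year)
    # Palm Sunday through Easter Sunday: eight days in total.
    holy_week = {easter - timedelta(days=k) for k in range(8)}
    return {
        "days_in_month": n_days,
        "weekend_days": sum(d.weekday() >= 5 for d in days),
        "leap_year": calendar.isleap(year),
        "holy_week_days": sum(d in holy_week for d in days),
    }


for year in (2016, 2017, 2018):
    print(year, easter_sunday(year),
          calendar_features(year, 3)["holy_week_days"],
          calendar_features(year, 4)["holy_week_days"])
```

Counting Holy Week days per month, rather than flagging only the month that contains Easter Sunday, handles years in which the holiday week straddles March and April.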

  14. What did we learn?
     • Feasible to now-cast official statistics (technically & methodologically)
     • Precision can improve with further work
     • "Big Data" sources promising, but require considerable effort to make usable
     • Over-fitting is a problem for both conventional and new approaches
     • Approaches need to be tested over even longer periods (but the 12 months of the BDCOMP are already impressive)
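Testing an approach over a longer period, and checking whether it over-fits, is often done with a rolling-origin (expanding window) evaluation: at each step the model sees only the data available up to that month and is scored on the next one. The sketch below illustrates the idea with a seasonal-naive forecaster standing in for any real model; the series is synthetic and all names are assumptions, not part of the JRC pipeline.

```python
import math


def seasonal_naive(history, season=12):
    """Forecast the next value as the value observed one season earlier."""
    return history[-season]


def rolling_origin_rmse(series, forecaster, min_train):
    """Score one-step-ahead forecasts made with an expanding window."""
    errors = []
    for t in range(min_train, len(series)):
        prediction = forecaster(series[:t])  # only data before period t
        errors.append(prediction - series[t])
    return math.sqrt(sum(e * e for e in errors) / len(errors))


# Synthetic monthly series: a repeating seasonal pattern plus a yearly step up.
pattern = [5, 6, 9, 12, 15, 20, 30, 32, 18, 11, 7, 6]
series = [pattern[m % 12] + 0.5 * (m // 12) for m in range(48)]

score = rolling_origin_rmse(series, seasonal_naive, min_train=24)
print(f"rolling-origin RMSE: {score:.2f}")
```

A model whose in-sample fit is much better than its rolling-origin score is a candidate for over-fitting, whichever family of methods it belongs to.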

  15. What next? CASSANDRA: a policy support tool inspired by the BDCOMP
