This presentation describes the JRC approach to the Big Data Competition (BDCOMP): automated methods with minimal human involvement, a wide rather than deep strategy intended to generalize in the future, and a focus on permanently accessible, machine-readable data sources. ARIMA models, random forest and xgboost regression are used to now-cast inflation and tourism indicators across European Member States, and data sources and possible improvements are discussed in a post-competition analysis.
Now-casting competition: the JRC approach
Panos Christidis, Andris Peize
European Commission, Joint Research Centre, Seville, Spain
JRC participation in the Big Data competition:
• Track 2 (inflation): 8 Member States
• Track 3 (inflation w/o energy): 8 Member States
• Track 4 (tourism): all Member States
• Track 5 (tourism/hotel): all Member States
General strategy:
• Wide approach (vs. deep) in order to generalize in the future (BDCOMP as a benchmark)
• Automated methods with minimum human involvement
• Data sources should be permanently accessible, machine readable and updated at least monthly
• Where possible, apply machine learning techniques
Automated data pipeline in R (a minimal sketch follows the list):
• eurostat: to automate the extraction of data from the Eurostat database
• forecast: for time series processing and ARIMA modelling
• randomForest: for random forest based regression
• xgboost: for xgboost based regression
• data.table and dplyr: for data processing
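A minimal sketch of such a pipeline, assuming the Eurostat HICP index table prc_hicp_midx with illustrative filters (geo, coicop, unit) rather than the exact series used in the competition; the time/values column names follow the eurostat package versions of that period (newer releases use TIME_PERIOD):

library(eurostat)
library(forecast)
library(dplyr)

# Monthly all-items HICP index for one Member State (dataset id and filters are illustrative)
hicp <- get_eurostat("prc_hicp_midx", time_format = "date") %>%
  filter(geo == "DE", coicop == "CP00", unit == "I15") %>%
  arrange(time)

# Fit an ARIMA model automatically and produce a one-month-ahead now-cast
y   <- ts(hicp$values, frequency = 12)
fit <- auto.arima(y)
forecast(fit, h = 1)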
Structure of approaches:
• ARIMA only
• ARIMA combined with external "Big" data
• "New" methods (randomForest & xgboost) combined with external "Big" data and ARIMA (sketched below)
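A sketch of the "new"-methods branch, assuming 12 monthly lags of the target as the feature set; the lag choice and the AirPassengers placeholder series are illustrative, not the competition inputs:

library(randomForest)
library(xgboost)

# Placeholder monthly series standing in for the target indicator
y <- as.numeric(log(AirPassengers))

# Simple lag matrix: column 1 is y_t, the remaining columns are lags 1..12
lagged <- embed(y, 13)
target <- lagged[, 1]
X      <- lagged[, -1]
colnames(X) <- paste0("lag", 1:12)

rf <- randomForest(x = X, y = target, ntree = 500)

dtrain <- xgb.DMatrix(data = X, label = target)
xgb    <- xgb.train(params = list(objective = "reg:squarederror"),
                    data = dtrain, nrounds = 200, verbose = 0)

# Now-cast the next month from the 12 most recent observations
newx <- matrix(rev(tail(y, 12)), nrow = 1, dimnames = list(NULL, colnames(X)))
predict(rf, newx)
predict(xgb, xgb.DMatrix(newx))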
External data:
• Billion Prices Project (inflation)
• SABRE/Amadeus (flight reservations)
Track 2 (inflation):
• Good match for Germany, France, UK (RMSE 0.20%-0.25%)
• Reasonable match for Ireland, Spain, Netherlands, Italy (RMSE 0.30%-0.75%)
• Greece: all approaches > 1%
Track 2 insights (1/2)
• The differences between the approaches were mainly the result of Member State characteristics rather than of the method.
• The machine learning methods over-fitted to past data (can be improved)
• The "Big Data" source (Billion Prices Project) introduced too much noise (further research needed)
• Post-competition analysis: taking the mean of the two best approaches further improves their score (see the sketch below)
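A sketch of that post-competition check, using a simple RMSE helper; the series below are random placeholders, not the competition submissions:

rmse <- function(actual, pred) sqrt(mean((actual - pred)^2))

# Placeholder data: 12 months of "official" values and two sets of now-casts
set.seed(1)
actual <- cumsum(rnorm(12))
pred_a <- actual + rnorm(12, sd = 0.3)
pred_b <- actual + rnorm(12, sd = 0.3)

# Averaging the two approaches and comparing scores
pred_mean <- (pred_a + pred_b) / 2
rmse(actual, pred_a)
rmse(actual, pred_b)
rmse(actual, pred_mean)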
Track 2 insights (2/2)
• The impact of long and short term trends differs across Member States (and, compared with the other indicators, also across indicators): the individual ARIMA models 1-5 had different parameters, and the right weights for their ensemble gave the best results (a weight-selection sketch follows this list).
• The xgboost and random forest models did not capture shifts in the trends as well (but there are methods that would allow them to do so)
• None of the approaches was capable of capturing structural changes such as those in Greece due to the crisis.
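One way to pick such ensemble weights is sketched below, minimising RMSE over a validation window with base R's optim; the forecast matrix is a random placeholder and the optimiser choice is an assumption, not the competition procedure:

# Placeholder: 24 validation months and forecasts from five ARIMA variants (columns)
set.seed(2)
actual <- cumsum(rnorm(24))
fc     <- sapply(1:5, function(i) actual + rnorm(24, sd = 0.1 * i))

# RMSE of a weighted combination, with weights kept positive and summing to 1
ensemble_rmse <- function(w, fc, actual) {
  w <- abs(w) / sum(abs(w))
  sqrt(mean((actual - fc %*% w)^2))
}

opt     <- optim(rep(1/5, 5), ensemble_rmse, fc = fc, actual = actual)
weights <- abs(opt$par) / sum(abs(opt$par))
weights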
Track 4&5 insights (1/2)
• Wide range of precision between countries and indicators (2% to over 10%)
• Better match for track 5 (stays in hotels) than for track 4 (all types)
• "Plain vanilla" ARIMA still better than more complicated methods (but with some positive messages)
• "Big" data sources useful for Member States with a high share of foreign tourists and a high share of stays in hotels (but otherwise they overfit without correct weights)
Track 4&5 insights (2/2)
• Largest source of error: March/April (Semana Santa; one possible treatment is sketched below)
• Calendar differences (weekends, leap year)
• Weather conditions probably have an effect (but were not taken into account)
• Different trends per market segment, difficult to calibrate
• Length of stay changes, but this is not captured
• Airbnb effect?
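One possible treatment of the Easter shift: an Easter-month dummy added as an external regressor to an automatically selected ARIMA model. The timeDate-based dummy, the 2010 start date and the AirPassengers placeholder series are assumptions for illustration, not the method used in the competition:

library(forecast)
library(timeDate)

# Placeholder monthly series standing in for nights spent, assumed to start in January 2010
y     <- ts(as.numeric(log(AirPassengers)), start = c(2010, 1), frequency = 12)
dates <- seq(as.Date("2010-01-01"), by = "month", length.out = length(y))

# Dummy equal to 1 in the month containing Easter Sunday
easter_days  <- as.Date(Easter(2010:2022))
easter_month <- as.integer(format(dates, "%Y-%m") %in% format(easter_days, "%Y-%m"))

fit <- auto.arima(y, xreg = cbind(easter = easter_month))

# One-month-ahead now-cast, stating whether Easter falls in the next month (here: no)
forecast(fit, h = 1, xreg = cbind(easter = 0))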
What did we learn?
• Feasible to now-cast official statistics (technically and methodologically)
• Precision can improve with further work
• "Big Data" sources are promising, but require considerable effort to make usable
• Over-fitting is a problem for both conventional and new approaches
• Approaches need to be tested over even longer periods (but the 12 months of the BDCOMP are already impressive)
What next?
CASSANDRA: a policy support tool inspired by the BDCOMP