Predictive Analytics Solutions: Leveraging HDInsight and RapidMiner November 2013


Presentation Transcript


  1. Predictive Analytics Solutions: Leveraging HDInsight and RapidMiner November 2013

  2. Agenda • Quick background on me • Synopsis of talk • Principled approach to predictive work • Summary of a customer case

  3. Background • Database Marketing • VLDB Marketing systems for the Cable Industry • Dashboards and Modeling led to Recommender Engines (nnus) • Web 2.0 Analytics • Migrated legacy processing from RDBMS to Hadoop (Hive) • Social Graph Processing in Hadoop (MR/Hive) • Healthcare • Document management system in Hadoop (HBase) • High volume, low-latency processing

  4. Synopsis • Predictive Analytics is all the rage lately. The use of predictive techniques has expanded beyond simple recommender engines (Netflix and Amazon, for example) and is now becoming a key strategic tool for competitive advantage. Organizations looking for that advantage want to better understand their customers and the industry they compete in. One challenge is understanding internal operational data; the other is identifying and analyzing external data that adds the insight needed to gain a real competitive edge. In this talk, we will present a pattern for leveraging Hadoop (through HDInsight) to obtain, integrate, conform and aggregate data (internal and external) for consumption/presentation by RapidMiner (a leading predictive analytics tool).

  5. Approach: CRISP-DM • Business-focused • Results-Oriented • Recognizes that data preparation is a significant portion of the overall effort • Studies say 90% • Recognizes the iterative nature of the process • Tenured • Conceived in 1996 • Most common practice by 2002

  6. Cross Industry Standard Process for Data Mining (CRISP-DM) CRISP-DM 1.0

  7. Summary of a customer case • Customer’s Core Business and Goals • Customer provides roadside assistance service for major auto manufacturers • Current business is reactive • call a tow truck when a car breaks down • Opportunities (to save costs, increase customer satisfaction, and generate new revenue) depend on predicting when cars will break down • Project Goals • Prove HDInsight as a processing platform for handling large data • Prepare (obtain, understand, massage) existing customer data • Incorporate external data that could be useful (e.g. weather data) • Produce a preliminary model

  8. Business Understanding: Initial hypotheses • It might be possible to predict breakdowns (to some degree and under some conditions) so that we can know when to “pre-position” trucks. • Weather should affect breakdowns • but we don’t know how much, under what conditions, or what type of breakdowns might be predictable • Commercially available weather data for all of the US for the last 7 years is prohibitively expensive, but free data might be “good enough” to validate a weather-based model.

  9. Data Preparation (diagram labels) • Client (DB extracts/email) • NOAA (ftp) • Zip Code Masterfile • Roadside Events • Roadside Lookups • Station List • ISD/ISH Files • Provider Rosters • Contract Performance • Azure Windows VM • Produced Lookup CSV files • find_nearest_n_stations.py • get_noaa_ish_files.py • normalize_noaa_data.py • Local Workstation • Split Events into (64) .gz files • create_hive_tables.py • Roadside Events • Azure Blob Storage • roadside_weather export.csv • Zip Code Masterfile • Roadside Lookups • ISD/ISH Files • Station List • Azure HDInsight/Hadoop
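
The slide names a script, find_nearest_n_stations.py, but the transcript does not show its contents. The following is only a minimal sketch of what such a step might do, assuming each zip code has a centroid latitude/longitude and each weather station is a (station_id, lat, lon) tuple. The function names, the tuple layout, and the choice of great-circle (haversine) distance are all assumptions made for illustration, not the presenters' implementation.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_n_stations(zip_lat, zip_lon, stations, n=3):
    """Return the n stations closest to a zip code centroid.

    `stations` is an iterable of (station_id, lat, lon) tuples; this layout is
    hypothetical and chosen only for the sketch.
    """
    ranked = sorted(
        stations,
        key=lambda s: haversine_km(zip_lat, zip_lon, s[1], s[2]),
    )
    return ranked[:n]
```

Keeping several candidate stations per zip code, rather than only the closest one, would allow a fallback when the nearest station has gaps in its record; whether the original script did this is not stated in the deck.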

  10. Modeling: Prototyping a preliminary model • Industry-standard data mining methodology • Sample, Explore, Modify, Model, Assess (SEMMA) • Randomly sample prototype-scale dataset • Perform exploratory data analysis • Construct several models • Visualize model results

  11. Randomly sample dataset • 2007-01 to 2013-07 • Select 10 coldest states: • AK, ME, MN (majority), MT, ND, NH, SD, VT, WI, WY • Random sample of 20k events
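
The sampling step described on this slide could be sketched roughly as follows. The event layout (a dict with "state" and "event_date" keys), the inclusive end date of 31 July 2013, and the fixed random seed are assumptions for illustration; the deck does not say how the sample was actually drawn.

```python
import random
from datetime import date

# The ten states listed on the slide.
COLD_STATES = {"AK", "ME", "MN", "MT", "ND", "NH", "SD", "VT", "WI", "WY"}

# "2007-01 to 2013-07" is read here as whole months, inclusive (an assumption).
START, END = date(2007, 1, 1), date(2013, 7, 31)


def sample_events(events, k=20000, seed=0):
    """Filter to the cold states and date window, then draw up to k events at random.

    Each event is a dict with hypothetical keys "state" (two-letter code) and
    "event_date" (a datetime.date).
    """
    eligible = [
        e for e in events
        if e["state"] in COLD_STATES and START <= e["event_date"] <= END
    ]
    rng = random.Random(seed)  # fixed seed so the sample can be reproduced
    return rng.sample(eligible, min(k, len(eligible)))
```

Sampling without replacement after filtering keeps every eligible event equally likely to be chosen, which is what a simple random sample of 20k events implies.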

  12. Perform exploratory analysis • Rough analysis can reveal preparation errors

  13. Winch calls relate to temperature • Total events / hour

  14. Winch - temperature effect

  15. Proportion of Winch events
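
The chart behind this slide is not preserved in the transcript. As a hedged illustration of how a "proportion of winch events" figure might be computed, the sketch below groups events into temperature bins and reports the share of winch calls in each bin. The (temperature, service_type) tuple layout, the "winch" label, and the bin width of 10 are all assumptions.

```python
from collections import defaultdict


def winch_proportion_by_temp(events, bin_width=10):
    """Return {bin_lower_bound: share of events in that bin that are winch calls}.

    Each event is a (temperature, service_type) tuple; the layout and the
    "winch" label are hypothetical.
    """
    totals = defaultdict(int)
    winch = defaultdict(int)
    for temperature, service_type in events:
        # Floor division keeps negative temperatures in the correct bin,
        # e.g. -5 falls in the bin starting at -10 when bin_width is 10.
        lower = int(temperature // bin_width) * bin_width
        totals[lower] += 1
        if service_type == "winch":
            winch[lower] += 1
    return {lower: winch[lower] / totals[lower] for lower in sorted(totals)}
```

Working with proportions rather than raw counts separates a temperature effect from simple changes in overall call volume.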

  16. Construct RapidMiner models

  17. Linear regression on Winch
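
The models in the talk were built in RapidMiner, and the transcript does not record their inputs or coefficients. Purely as a stand-in to show the technique, here is a plain-Python ordinary least squares fit of one response (for example, winch proportion) on one predictor (for example, temperature). The function name and the single-predictor setup are assumptions, not a reconstruction of the RapidMiner process.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x; zero means no variation to fit against.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    # Sum of cross-deviations of x and y.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept
```

A negative slope from such a fit would be consistent with the slide's claim that winch calls relate to temperature, but the direction and size of the effect have to come from the actual data.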

  18. Decision tree on winch ratio

  19. Naïve Bayes probabilities

  20. Naïve Bayes results

  21. Naïve Bayes temperature

  22. Naïve Bayes latitude
