Predictive Analytics Solutions: Leveraging HDInsight and RapidMiner • November 2013
Agenda • Quick background on me • Synopsis of talk • Principled approach to predictive work • Summary of a customer case
Background • Database Marketing • VLDB Marketing systems for the Cable Industry • Dashboards and Modeling led to Recommender Engines (nnus) • Web 2.0 Analytics • Migrated legacy processing from RDBMS to Hadoop (Hive) • Social Graph Processing in Hadoop (MR/Hive) • Healthcare • Document management system in Hadoop (HBase) • High-volume, low-latency processing
Synopsis • Predictive analytics is all the rage lately. The use of predictive techniques has expanded beyond simple recommender engines (Netflix and Amazon, for example) and is becoming a key strategic tool. Organizations looking for competitive advantage want to better understand their customers and the industry they compete in. One challenge is understanding internal operational data; the other is identifying and analyzing external data that supplies the additional insight needed for a real competitive edge. In this talk, we present a pattern for leveraging Hadoop (through HDInsight) to obtain, integrate, conform, and aggregate data (internal and external) for consumption and presentation by RapidMiner (a leading predictive analytics tool).
Approach: CRISP-DM • Business-focused • Results-Oriented • Recognizes that data preparation is a significant portion of the overall effort • Studies say 90% • Recognizes the iterative nature of the process • Tenured • Conceived in 1996 • Most common practice by 2002
Cross Industry Standard Process for Data Mining (CRISP-DM) CRISP-DM 1.0
Summary of a customer case • Customer’s Core Business and Goals • Customer provides roadside assistance service for major auto manufacturers • Current business is reactive • call a tow truck when a car breaks down • Opportunities (to save costs, increase customer satisfaction, and generate new revenue) depend on predicting when cars will break down • Project Goals • Prove HDInsight as a processing platform for handling large data • Prepare (obtain, understand, massage) existing customer data • Incorporate external data that could be useful (e.g. weather data) • Produce a preliminary model
Business Understanding: Initial hypotheses • It might be possible to predict breakdowns (to some degree and under some conditions) so that we can know when to “pre-position” trucks. • Weather should affect breakdowns • but we don’t know how much, under what conditions, or what type of breakdowns might be predictable • Commercially available weather data for all of the US for the last 7 years is prohibitively expensive, but free data might be “good enough” to validate a weather-based model.
Data Preparation (data-flow diagram; labels in extraction order) • Sources: Client (DB extracts/email), NOAA (ftp) • Source data: Zip Code Masterfile, Roadside Events, Roadside Lookups, Station List, ISD/ISH Files, Provider Rosters, Contract Performance • Azure Windows VM: Produced Lookup CSV files, find_nearest_n_stations.py, get_noaa_ish_files.py, normalize_noaa_data.py • Local Workstation: Split Events into (64) .gz files, create_hive_tables.py, Roadside Events • Azure Blob Storage: roadside_weather export.csv, Zip Code Masterfile, Roadside Lookups, ISD/ISH Files, Station List • Azure HDInsight/Hadoop
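The deck names a script, find_nearest_n_stations.py, but does not show its contents. As an illustration only, here is a minimal sketch of how matching a zip code to its closest weather stations might look, assuming each zip code has a centroid latitude/longitude and each station has coordinates. The function names, the tuple layout, and the station IDs and coordinates are all hypothetical; the haversine great-circle distance is a swapped-in, commonly used choice, not necessarily what the original script did.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))


def nearest_n_stations(zip_lat, zip_lon, stations, n=3):
    """Return the n stations closest to a zip code centroid.

    stations: iterable of (station_id, lat, lon) tuples (assumed layout).
    """
    ranked = sorted(
        stations,
        key=lambda s: haversine_km(zip_lat, zip_lon, s[1], s[2]),
    )
    return ranked[:n]


# Hypothetical station list; IDs and coordinates are made up for the example.
stations = [
    ("STN_A", 44.88, -93.22),
    ("STN_B", 46.84, -92.19),
    ("STN_C", 43.90, -92.50),
    ("STN_D", 61.17, -150.00),
]

# Hypothetical zip code centroid.
nearest = nearest_n_stations(44.98, -93.27, stations, n=2)
nearest_ids = [s[0] for s in nearest]
```

Keeping several stations per zip code, rather than only the closest one, leaves a fallback when a station has gaps in its record; whether the original pipeline used them that way is not stated in the deck.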
Modeling: Prototyping a preliminary model • Industry-standard data mining methodology • Sample, Explore, Modify, Model, Assess (SEMMA) • Randomly sample prototype-scale dataset • Perform exploratory data analysis • Construct several models • Visualize model results
Randomly sample dataset • 2007-01 to 2013-07 • Select 10 coldest states: • AK, ME, MN (majority), MT, ND, NH, SD, VT, WI, WY • Random sample of 20k events
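The sampling step above can be sketched as a filter followed by a seeded random draw. This is a minimal illustration, not the presenter's code: the record fields (`state`, `date`), the fixed seed, and the reading of "2013-07" as inclusive through July 31 are assumptions, and the "(majority)" qualifier on MN is not modeled because the deck does not explain it.

```python
import random
from datetime import date

# The ten states listed on the slide.
COLD_STATES = {"AK", "ME", "MN", "MT", "ND", "NH", "SD", "VT", "WI", "WY"}

# Assumption: the window runs from 2007-01-01 through 2013-07-31 inclusive.
START, END = date(2007, 1, 1), date(2013, 7, 31)


def sample_events(events, k, seed=42):
    """Filter events to the cold states and date window, then sample k.

    events: list of dicts with "state" and "date" keys (assumed layout).
    A fixed seed makes the draw reproducible between runs.
    """
    eligible = [
        e for e in events
        if e["state"] in COLD_STATES and START <= e["date"] <= END
    ]
    rng = random.Random(seed)
    # Never request more rows than are eligible.
    return rng.sample(eligible, min(k, len(eligible)))


# Synthetic events for demonstration only.
states = ["MN", "TX", "AK", "FL"]
dates = [date(2006, 12, 31), date(2010, 1, 15), date(2013, 8, 1)]
events = [
    {"event_id": i, "state": states[i % 4], "date": dates[i % 3]}
    for i in range(1200)
]

sample = sample_events(events, k=50)
```

On the real data the same call would use k=20000. Sampling after filtering, rather than before, guarantees the requested sample size whenever enough eligible events exist.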
Perform exploratory analysis • Rough analysis can reveal preparation errors
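As one hedged example of how a rough pass can surface preparation errors, the sketch below summarizes a temperature column and counts values outside a plausible range. The field name `temp_c`, the plausibility bounds, and the sample rows are assumptions made for illustration; the 999.9 value stands in for any missing-value code that was not converted to a null during normalization.

```python
def temperature_sanity_report(rows, low_c=-60.0, high_c=60.0):
    """Summarize a temperature column and flag implausible values.

    rows: list of dicts with a "temp_c" key (assumed layout); None = missing.
    low_c / high_c: assumed plausibility bounds in degrees Celsius.
    """
    temps = [r["temp_c"] for r in rows if r["temp_c"] is not None]
    out_of_range = [t for t in temps if not (low_c <= t <= high_c)]
    return {
        "n": len(rows),
        "missing": len(rows) - len(temps),
        "min": min(temps, default=None),
        "max": max(temps, default=None),
        "out_of_range": len(out_of_range),
    }


# Illustrative rows; 999.9 mimics an unhandled missing-value sentinel.
rows = [
    {"event_id": 1, "temp_c": -12.5},
    {"event_id": 2, "temp_c": 3.0},
    {"event_id": 3, "temp_c": None},
    {"event_id": 4, "temp_c": 999.9},
]

report = temperature_sanity_report(rows)
```

A maximum far outside any real weather reading is the kind of signal that sends the analyst back to the preparation step before any modeling is attempted.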
Winch calls relate to temperature (chart; axis label: Total events / hour)