DATA MINING
Team #1: Kristen Durst, Mark Gillespie, Banan Mandura
University of Dayton, MBA 664
13 APR 09
Data Mining: Outline • Introduction • Applications / Issues • Products • Process • Techniques • Example MBA 664, Team #1
Introduction
• Data Mining Definition
  • Analysis of large amounts of digital data
  • Identify unknown patterns, relationships
  • Draw conclusions AND predict future
• Data Mining Growth
  • Increase in computer processing speed
  • Decrease in cost of data storage
Introduction
• High Level Process
  • Summarize the Data
  • Generate Predictive Model
  • Verify the Model
• Analyst Must Understand
  • The business
  • Data and its origins
  • Analysis methods and results
  • Value provided
Applications / Issues
• Applications
  • Telecommunications: cell phone contract turnover
  • Credit Card: fraud identification
  • Finance: corporate performance
  • Retail: targeting products to customers
• Legal and Ethical Issues
  • Aggregation of data to track individual behavior
Data Mining Products
• Angoss Software (www.angoss.com)
  • Knowledge Seeker/Studio
  • Strategy Builder
• Infor Global Solutions (www.infor.com)
  • Infor CRM Epiphany
• Portrait Software (www.portraitsoftware.com)
• SAS Institute (www.sas.com)
  • SAS Enterprise Miner
  • SAS Analytics
• SPSS Inc (www.spss.com)
  • Clementine
Angoss Knowledge Studio
SAS Institute
SPSS Inc.
Data Mining Process
• No uniformly accepted practice
• 2002 www.KDnuggets.com survey
• SPSS CRISP-DM
• SAS SEMMA
Data Mining Process
• SPSS CRISP-DM
  • CRoss Industry Standard Process for Data Mining
  • Consortium: Daimler-Chrysler, SPSS, NCR
  • Hierarchical process that is cyclical and iterative
Data Mining Process • CRISP-DM
Data Mining Process
• SAS SEMMA
  • Model development is the focus
  • User defines the problem and conditions the data outside SEMMA
  • Sample: portion the data statistically
  • Explore: view, plot, subgroup
  • Modify: select, transform, update
  • Model: fit the data with any technique
  • Assess: evaluate for usefulness
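The Sample step can be sketched as a random split that sets part of the data aside for the later Assess step. This is a minimal sketch, not SAS functionality: the function name `sample_split`, the 30% holdout fraction, and the fixed seed are assumptions chosen for illustration.

```python
import random


def sample_split(records, holdout_fraction=0.3, seed=0):
    """Shuffle the records and split them into a modeling set and a holdout set."""
    rng = random.Random(seed)      # fixed seed so the split is repeatable
    shuffled = list(records)       # copy, so the caller's data is not reordered
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * (1 - holdout_fraction))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical record IDs, made up for illustration.
records = list(range(10))
train, holdout = sample_split(records)
```

The model is fit on `train` only; `holdout` stays untouched until the model is evaluated.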
Data Mining Process
• Common Steps in Any DM Process
  1. Problem Definition
  2. Data Collection
  3. Data Review
  4. Data Conditioning
  5. Model Building
  6. Model Evaluation
  7. Documentation / Deployment
Data Mining Techniques
• Statistical Methods (Sample Statistics, Linear Regression)
• Nearest Neighbor Prediction
• Neural Network
• Clustering/Segmenting
• Decision Tree
Statistical Methods
• Sample Statistics
  • Quick look at the data
  • Ex: Minimum, Maximum, Mean, Median, Variance
• Linear Regression
  • Easy to apply and works well for simple problems
  • Harder problems may need a more complex model built with a different method
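The sample statistics listed above can be computed with Python's standard `statistics` module. This is a minimal sketch; the function name `summarize` and the lead-time values are hypothetical and not taken from the presentation.

```python
import statistics


def summarize(values):
    """Return the quick-look statistics for one numeric column."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "variance": statistics.variance(values),  # sample variance (divides by n - 1)
    }


# Hypothetical lead times in days, made up for illustration.
lead_times = [4, 7, 6, 9, 4]
summary = summarize(lead_times)
```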
Example: Linear Regression
[Figure: Total Purchase Amount vs. Customer Income]
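A one-predictor linear regression such as purchase amount against income can be fit with ordinary least squares. The sketch below assumes that setup; the function name `fit_line` and all data values are hypothetical, since the slide's underlying data is not given.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical customers: income (thousands) and total purchase amount.
incomes = [30, 40, 50, 60, 70]
purchases = [150, 200, 250, 300, 350]

slope, intercept = fit_line(incomes, purchases)
predicted = slope * 55 + intercept  # predicted purchases at an income of 55
```

Real data will not fall exactly on a line, so the quality of the fit should be checked before the model is used for prediction.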
Nearest Neighbor Prediction
• Easy to understand
• Used for predicting
• Works best with few predictor variables
• Based on the idea that an item will behave the way the items “near” it behave
• Can also show the level of confidence in a prediction
Example: Nearest Neighbor
[Figure: Product Sales by Population of City and Distance from Competitor. Axes: Population of City, Distance from Competitor. Points are labeled by sales class (A: > 200 units, B: 100 – 200 units, C: < 100 units); one point is labeled U.]
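Nearest neighbor prediction for a case like this can be sketched as a majority vote among the k closest known cities. The function name `predict_nearest`, the choice of k = 3, the use of the vote share as the confidence level, and all data points are assumptions for illustration. Both predictors are assumed to be pre-scaled to the 0 to 1 range so that neither dominates the distance.

```python
import math
from collections import Counter


def predict_nearest(train, query, k=3):
    """Return (label, confidence) from a majority vote of the k nearest points."""
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    label, count = votes.most_common(1)[0]
    return label, count / k  # confidence = share of the k votes won


# Hypothetical cities: (scaled population, scaled distance from competitor) -> sales class.
train = [
    ((0.9, 0.8), "A"), ((0.8, 0.9), "A"), ((0.7, 0.7), "A"),
    ((0.5, 0.5), "B"), ((0.4, 0.6), "B"),
    ((0.2, 0.2), "C"), ((0.1, 0.3), "C"),
]

label, confidence = predict_nearest(train, (0.75, 0.8))
```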
Neural Network
• Contains input, hidden, and output layers
• Used when there is a large number of predictor variables
• Model can be reused once it is confirmed successful
• Can be hard to interpret
• Formatting the data is extremely time consuming
Example: Neural Network
[Figure: inputs Population of City (W1 = .36) and Distance from Competitor (W2 = .64) feed the output Product Sales Prediction (0.736)]
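The forward pass of a small network with input, hidden, and output layers can be sketched as repeated weighted sums passed through an activation function. Everything below is an assumption for illustration: the logistic activation, the hidden-layer weights, the zero biases, and the input values. Only the two weights .36 and .64 are borrowed from the slide, and the sketch does not attempt to reproduce the slide's 0.736 prediction.

```python
import math


def sigmoid(z):
    """Logistic activation, squashing any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def layer(inputs, weights, biases):
    """One layer: each row of weights produces one activated output."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def forward(inputs, hidden_w, hidden_b, out_w, out_b):
    """Input layer -> hidden layer -> single output node."""
    hidden = layer(inputs, hidden_w, hidden_b)
    return layer(hidden, out_w, out_b)[0]


# Hypothetical scaled inputs: population of city, distance from competitor.
inputs = [0.8, 0.3]
hidden_w = [[0.5, -0.4], [0.3, 0.8]]  # hypothetical hidden-layer weights
hidden_b = [0.0, 0.0]
out_w = [[0.36, 0.64]]                # weights shown on the slide, reused here
out_b = [0.0]

prediction = forward(inputs, hidden_w, hidden_b, out_w, out_b)
```

In practice the weights are learned from training data rather than set by hand.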
Clustering/Segmenting
• Not used for prediction
• Forms groups whose members are very similar to each other and very different from other groups
• Gives an overall view of the data
• Can also expose potential problems by revealing outliers
Example: Clustering/Segmenting
[Figure: scatter plot on Dimension A and Dimension B. Legend: < 40 years, >= 40 years; Red = Female, Blue = Male]
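One common way to form such groups is k-means clustering, sketched below. The presentation does not say which clustering algorithm was used, so k-means is an assumption here, as are the function name `kmeans`, the starting centers, and the data points.

```python
import math


def kmeans(points, centers, iterations=10):
    """Alternate between assigning points to the nearest center and re-centering."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster; keep it in place if the cluster is empty.
        centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else center
            for cluster, center in zip(clusters, centers)
        ]
    return centers, clusters


# Hypothetical records described by two dimensions.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centers, clusters = kmeans(points, centers=[(0, 0), (10, 10)])
```

A point that sits far from every center after clustering is a candidate outlier worth reviewing.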
Decision Trees
• Uses categorical variables
• Determines which variable causes the greatest “split” in the data
• Easy to interpret
• Requires little data formatting
• Can be used in many different situations
Example: Decision Trees
[Figure: decision tree for Change from original score. Split variables: Baseline (< 3.75 / >= 3.75), sex (F / M), body type (Large / Small). Node means: .76, .14, .58, -.46; node sizes: n = 67, n = 51, n = 115, n = 48. Leaves: -.29 (n = 24), -.63 (n = 24), .47 (n = 28), 1.11 (n = 23), -.29 (n = 24)]
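Finding the variable that causes the greatest split can be sketched as follows for a numeric outcome such as a change in score. The criterion used here, the lowest total squared error within the resulting groups, is one common choice and an assumption, since the presentation does not name its splitting rule. The function names and the data rows are hypothetical.

```python
from collections import defaultdict


def sse(values):
    """Sum of squared deviations from the group mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(rows, target, candidates):
    """Return the categorical variable whose groups are most homogeneous in the target."""
    def split_sse(variable):
        groups = defaultdict(list)
        for row in rows:
            groups[row[variable]].append(row[target])
        return sum(sse(group) for group in groups.values())

    return min(candidates, key=split_sse)


# Hypothetical subjects, made up for illustration.
rows = [
    {"sex": "F", "body": "large", "change": 1.0},
    {"sex": "M", "body": "large", "change": 1.2},
    {"sex": "F", "body": "small", "change": -0.4},
    {"sex": "M", "body": "small", "change": -0.6},
]

first_split = best_split(rows, "change", ["sex", "body"])
```

A full tree repeats this choice within each resulting group until the groups are too small or too uniform to split further.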
Data Mining Example: 1. Problem Definition
• Improve On-Time Delivery of New Products
Data Mining Example: 2. Collect Data
• Brainstorm Variation Sources
• Data Collection Plan
Data Mining Example: 3. Data Review
• Data Segments: TOTAL LEAD TIME by Part Type (p < .05)

Level      N     Mean    StDev
BRACKET    520   x6.76   x3.14
DUCT       138   x6.70   x0.40
MANIFOLD    44   x9.95   x4.68
TUBE        47   x3.60   x2.79

Pooled StDev = 68.47
[Plot: interval for the mean of each level]
Data Mining Example: 5. Build Model
Data Mining Example: 5. Build Model
• Combined Model: 2 separate regressions (Design and Manufacturing), combined through a common term

SHIP-DUE = 7.97 + 0.269*(MODEL_CR-DUE) + 0.173*(CR-ISS) + 0.704*(MAN_BOMC) + 0.748*(SCH_ST-MAN) + 0.862*(MOS_MOFIN)

[R^2A 4.4%] – {R^2A(1) 76.5%, R^2A(2) 68.0%}
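The combined regression equation can be applied to a new order as sketched below. The intercept and coefficients are copied from the equation above; the function name `predict_ship_minus_due` and the input values are hypothetical, and the units of each term are not stated in the presentation.

```python
INTERCEPT = 7.97
COEFFICIENTS = {
    "MODEL_CR-DUE": 0.269,
    "CR-ISS": 0.173,
    "MAN_BOMC": 0.704,
    "SCH_ST-MAN": 0.748,
    "MOS_MOFIN": 0.862,
}


def predict_ship_minus_due(terms):
    """Evaluate SHIP-DUE for one order given a value for every model term."""
    missing = COEFFICIENTS.keys() - terms.keys()
    if missing:
        raise KeyError(f"missing terms: {sorted(missing)}")
    return INTERCEPT + sum(coef * terms[name] for name, coef in COEFFICIENTS.items())


# Hypothetical order with every term set to 1.0, for illustration only.
example_terms = {name: 1.0 for name in COEFFICIENTS}
prediction = predict_ship_minus_due(example_terms)
```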
Data Mining Example: 6. Model Evaluation
• Model Accurately Reflects Delivery Distribution
Data Mining Example: 7. Document / Deploy
• Design Release Required for On Time Delivery Due Date
Data Mining Example: 7. Document / Deploy
• Update Planning and Automate Tracking
[Figure legend: Requirements, Plan, Actual]
Data Mining • Questions?