Science in Business Data Mining?
• Background: support managerial decision making
• Is there a science to data mining (with CI-methods)?
Outline
• Data Mining in Business & Management
• Rules established in business practice vs. data mining?
• Statistics vs. data-driven modelling
• A personal view
• How to develop meta-knowledge
YES, but it depends (and it may be empirical wizardry driven by efficiency rather than effectiveness!)
Sven F. Crone, Lancaster University Management School, Research Centre for Forecasting
Business Data Mining?
• Main areas for Data Mining:
  • Finance: Credit risk (personal & corporate)
  • Marketing: Customer Relationship Management (= Direct Marketing, Database Marketing)
[Figure: application areas, including Churn Prediction, Direct Marketing and Credit Scoring; adapted from Berry and Linoff (2004) and Olafson et al (2006)]
Practitioners & Consultants use statistics
Best practices: Credit Scoring | Cross-Selling
Large & imbalanced sample; use large sample sizes; original (imbalanced) class distribution; …
• Small & balanced classes
• Use 2000 of minority class
• Use undersampling
• Discretise all (!) variables
• Binary dummies / WOE to capture non-linearity
• Use logistic regression
Extensive use of expert domain knowledge
Efficient solution ≠ best solution (GAP)
A personal view:
• Data selection is best done using prior domain knowledge (use filters)
• Pre-processing is more important than the method [Crone et al, 2006; Keogh 2002]
• (Balanced) sampling & pre-processing are method dependent
• Best practices exist & are domain dependent (e.g. homogeneous datasets in credit scoring)
• Flat maximum effect [Lovie & Lovie, 1986]
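The credit scoring practices listed above (balanced undersampling, discretised variables coded as weights of evidence) can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the procedure used by the author or by any particular scorecard vendor: the function names, the `bad` and `income_band` keys, the additive smoothing and the sign convention WOE = ln(share of goods / share of bads) are all choices made here for the example.

```python
import math
import random
from collections import Counter


def undersample(rows, label_key="bad", n_minority=2000, seed=0):
    """Return a balanced sample: up to n_minority minority cases plus
    the same number of randomly drawn majority cases."""
    rng = random.Random(seed)
    pos = [r for r in rows if r[label_key] == 1]
    neg = [r for r in rows if r[label_key] == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    k = min(n_minority, len(minority))
    return rng.sample(minority, k) + rng.sample(majority, k)


def weight_of_evidence(rows, feature_key, label_key="bad", smoothing=0.5):
    """WOE per bin of an already discretised variable.

    Assumed convention: ln(share of goods in bin / share of bads in bin),
    with additive smoothing so that empty cells do not produce log(0).
    """
    goods = Counter(r[feature_key] for r in rows if r[label_key] == 0)
    bads = Counter(r[feature_key] for r in rows if r[label_key] == 1)
    bins = set(goods) | set(bads)
    total_good = sum(goods.values()) + smoothing * len(bins)
    total_bad = sum(bads.values()) + smoothing * len(bins)
    return {
        b: math.log(
            ((goods[b] + smoothing) / total_good)
            / ((bads[b] + smoothing) / total_bad)
        )
        for b in bins
    }


# Hypothetical toy portfolio: 5% bads, concentrated in the "low" band.
portfolio = (
    [{"income_band": "low", "bad": 1}] * 40
    + [{"income_band": "high", "bad": 1}] * 10
    + [{"income_band": "low", "bad": 0}] * 300
    + [{"income_band": "high", "bad": 0}] * 650
)
balanced = undersample(portfolio, n_minority=2000)
print(len(balanced), weight_of_evidence(portfolio, "income_band"))
```

The WOE values would then replace the raw variable as a single numeric input to a logistic regression, which is one way a linear model can pick up a non-linear relationship. Whether WOE is estimated before or after undersampling is itself a design decision, in line with the point that sampling and pre-processing are method dependent.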
How to derive (meta-)knowledge?
• Lessons from other disciplines: Time Series Forecasting
• More "evidence-based methods" [Armstrong 2000]
• Empirical evidence
  • Conditions under which methods perform well (multiple hypotheses)
  • Domain-specific competitions (valid & reliable)
  • Multiple out-of-sample evaluations (≠ single fold, one origin)
  • Multiple homogeneous datasets from one domain
  • Use of valid benchmark methods & unbiased error measures
  • Honour the domain & decision context (active learning, cost sensitive)
• Replications
  • Studies must allow replications: document all steps / parameters
• STOP FINE-TUNING / MARGINAL EXTENSION OF A SINGLE METHOD ON A SINGLE TOY DATASET
• Develop solutions for the domain (Why make life harder?)
• Where to start? Follow a high-impact approach!
  • Identify the most prominent application domains (e.g. credit risk)
  • Select promising application domains for CI-methods
  • Get a corporate sponsor & run a competition
  • Analyse conditions (!) using meta-studies!
  • Embed findings as methodology in SOFTWARE
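The call for multiple out-of-sample evaluations against a valid benchmark can be illustrated with a rolling-origin evaluation, as used in time series forecasting. The sketch below is a hedged example only: the function names, the moving-average candidate, the naive (last value) benchmark and the relative MAE are assumptions chosen for illustration, not the error measures or methods the slides prescribe.

```python
from statistics import mean


def naive_forecast(history):
    """Benchmark: the next value equals the last observed value."""
    return history[-1]


def moving_average_forecast(history, window=3):
    """Candidate method: mean of the last `window` observations."""
    return mean(history[-window:])


def rolling_origin_mae(series, forecaster, first_origin):
    """One-step-ahead MAE averaged over every forecast origin from
    first_origin to the end of the series (not a single split)."""
    errors = []
    for origin in range(first_origin, len(series)):
        history = series[:origin]  # only data available at the origin
        errors.append(abs(series[origin] - forecaster(history)))
    return mean(errors)


def relative_mae(series, forecaster, benchmark=naive_forecast, first_origin=3):
    """MAE of the candidate divided by MAE of the benchmark.
    Values below 1 mean the candidate beats the benchmark.
    Undefined (division by zero) if the benchmark is perfect."""
    candidate_mae = rolling_origin_mae(series, forecaster, first_origin)
    benchmark_mae = rolling_origin_mae(series, benchmark, first_origin)
    return candidate_mae / benchmark_mae


# Hypothetical trending series: the moving average lags behind the trend.
demand = [1, 2, 3, 4, 5, 6, 7, 8]
print(relative_mae(demand, moving_average_forecast))
```

Averaging the same ratio over many homogeneous series from one domain, rather than one series, is what would turn this from a single test into evidence about the conditions under which a method performs well.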
Literature
• Ayres, I. (2007) Super Crunchers: Why Thinking-by-Numbers Is the New Way to Be Smart, Bantam
• Davenport, T. H., Harris, J. G. (2007) Competing on Analytics: The New Science of Winning, Harvard Business School Press
• Fildes, Nikolopoulos, Crone, Syntetos (2009) Forecasting and Operational Research: a Review, JORS, forthcoming
• Finlay, Crone (under review) Sampling issues in Credit Scoring: the effect of sample size and sample distribution on predictive accuracy, EJOR
• Keogh, Kasetty (2002, 2004) On the Need for Time Series Data Mining Benchmarks: A Survey and Empirical Demonstration, SIGKDD'02 & Data Mining Journal