
Final Project

Final Project. Data sets. Visit web site: http://www.kdnuggets.com/datasets/index.html

radamson

Presentation Transcript


  1. Final Project

  2. Data sets • Visit web site: http://www.kdnuggets.com/datasets/index.html • This is an online repository of large data sets which encompasses a wide variety of data types, analysis tasks, and application areas. The primary role of this repository is to enable researchers in knowledge discovery and data mining to scale existing and future data analysis algorithms to very large and complex data sets. • http://kdd.ics.uci.edu/

  3. Data sets • Data Sets: by application area, by name, by date (reverse chronological) • Machine Learning Repository • Task Files: by task type, by application area, by name, by date (reverse chronological), by data type

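  4. Report & Presentation • Written report (50%) + presentation (50%) ==> counts as the final exam grade • Groups of 4 students • Written report (8 pages at least, cover not included) • Presentation: 15 minutes + questions (5 minutes). The presenting students do not ask questions; all other students must answer questions, not necessarily immediately; answers may be given before the end of class. • One class period is used for discussion and questions, and for fixing the chosen data set in advance (it may be changed within one week).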

  5. Business Data Mining Applications

  6. Business Data Mining Applications • Partial representative sample of applications • Catalog sales • CRM • Credit scoring • Banking (loans) • Investment risk • Insurance

  7. Fingerhut • Founded 1948 • today sends out 130 different catalogs • to over 65 million customers • 6 terabyte data warehouse • 3000 variables of 12 million most active customers • over 300 predictive models • Focused marketing

  8. Fingerhut • Purchased by Federated Department Stores for $1.7 billion in 1999 (for database) • Fingerhut had $1.6 to $2 billion business per year, targeted at lower income households • Can mail 400,000 packages per day • Each product line has its own catalog

  9. Fingerhut • Uses segmentation, decision tree, regression, neural network tools from SAS and SPSS • Segmentation - combines order & demographic data with product offerings • can target mailings to greatest payoff • customers who recently had moved tripled their purchasing 12 weeks after the move • send furniture, telephone, decoration catalogs

  10. Data for SEGMENTATION
      cluster indices
      subj   age   income   marital   grocery   dine out   savings
      1001    53    80000   wife          180         90     30000
      1002    48   120000   husband       120        110     20000
      1003    32    90000   single         30        160      5000
      1004    26    40000   wife           80         40         0
      1005    51    90000   wife          110         90     20000
      1006    59   150000   wife          160        120     30000
      1007    43   120000   husband       140        110     10000
      1008    38   160000   wife           80        130     15000
      1009    35    70000   single         40        170      5000
      1010    27    50000   wife          130         80         0

  11. Initial Look at Data • Want to know features of those who spend a lot dining out • INCLUDE AS MANY ACTIONABLE VARIABLES AS POSSIBLE • things you can identify • Manipulate data • sort on most likely indicator (dine out)

  12. Sorted by Dine Out
      cluster indices
      subject   age   income   marital   grocery   dine out   savings
      1004       26    40000   wife           80         40         0
      1010       27    50000   wife          130         80         0
      1001       53    80000   wife          180         90     30000
      1005       51    90000   wife          110         90     20000
      1002       48   120000   husband       120        110     20000
      1007       43   120000   husband       140        110     10000
      1006       59   150000   wife          160        120     30000
      1008       38   160000   wife           80        130     15000
      1003       32    90000   single         30        160      5000
      1009       35    70000   single         40        170      5000

  13. Analysis • Best indicators • marital status • groceries • Available • marital status might be easier to get
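
The sort-and-compare step from slides 11 to 13 can be sketched in a few lines of Python. This is only an illustration using the ten rows from slide 10; the tuple layout and variable names are assumptions made for this sketch, not part of the original deck.

```python
# Rows from the segmentation table on slide 10:
# (subject, age, income, marital, grocery, dine_out, savings)
rows = [
    (1001, 53, 80000, "wife", 180, 90, 30000),
    (1002, 48, 120000, "husband", 120, 110, 20000),
    (1003, 32, 90000, "single", 30, 160, 5000),
    (1004, 26, 40000, "wife", 80, 40, 0),
    (1005, 51, 90000, "wife", 110, 90, 20000),
    (1006, 59, 150000, "wife", 160, 120, 30000),
    (1007, 43, 120000, "husband", 140, 110, 10000),
    (1008, 38, 160000, "wife", 80, 130, 15000),
    (1009, 35, 70000, "single", 40, 170, 5000),
    (1010, 27, 50000, "wife", 130, 80, 0),
]

# Sort on the most likely indicator (dine out), lowest first, as on slide 12.
by_dine_out = sorted(rows, key=lambda r: r[5])
sorted_subjects = [r[0] for r in by_dine_out]

# Average dine-out spending per marital-status group.
totals = {}
for _, _, _, marital, _, dine_out, _ in rows:
    running_sum, count = totals.get(marital, (0, 0))
    totals[marital] = (running_sum + dine_out, count + 1)
mean_dine_out = {k: s / n for k, (s, n) in totals.items()}

print(sorted_subjects)
print(mean_dine_out)
```

In this small sample the "single" group has the highest average dine-out spending, which is consistent with marital status being listed as a best indicator.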

  14. Fingerhut • Mailstream optimization • identifies which customers are most likely to respond to existing catalog mailings • saves nearly $3 million per year • reversed trend of catalog sales industry in 1998 • reduced mailings by 20% while increasing net earnings to over $37 million

  15. LIFT • LIFT = probability in class by sample divided by probability in class by population • if population probability is 20% and sample probability is 30%, LIFT = 0.3/0.2 = 1.5 • Best lift not necessarily best • need sufficient sample size • as confidence increases, longer list but lower lift
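
As a minimal sketch of the lift definition above, the slide's own numbers (30% in the sample, 20% in the population) can be checked directly. The function name `lift` is chosen here for illustration.

```python
def lift(sample_rate: float, population_rate: float) -> float:
    """Lift = response rate in the selected sample / response rate in the population."""
    if population_rate <= 0:
        raise ValueError("population_rate must be positive")
    return sample_rate / population_rate


example = lift(0.30, 0.20)
print(round(example, 2))  # 1.5, matching the slide's example
```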

  16. Lift Example • Product to be promoted • Sampled over 10 identifiable segments of potential buying population • Profit $50 per item sold • Mailing cost $1 • Sorted by Estimated response rates

  17. Lift Data

  18. Lift Chart

  19. Profit Impact
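
The lift data, chart, and profit figures from slides 17 to 19 are not in this transcript. The sketch below shows how such a calculation could be set up, using the $50 profit and $1 mailing cost from slide 16. The three segments and their response rates are hypothetical values invented for illustration, and the $50 is assumed to be profit per sale before mailing cost.

```python
PROFIT_PER_SALE = 50.0    # from slide 16
COST_PER_MAILING = 1.0    # from slide 16

# HYPOTHETICAL segments: (name, number of prospects, estimated response rate).
segments = [("A", 1000, 0.05), ("B", 1000, 0.03), ("C", 1000, 0.01)]

# A segment pays for itself only if its response rate exceeds cost / profit.
break_even_rate = COST_PER_MAILING / PROFIT_PER_SALE


def expected_profit(prospects: int, rate: float) -> float:
    """Expected profit from mailing every prospect in one segment."""
    return prospects * (rate * PROFIT_PER_SALE - COST_PER_MAILING)


# Sort by estimated response rate, best first, then accumulate.
ranked = sorted(segments, key=lambda s: s[2], reverse=True)
base_rate = sum(n * r for _, n, r in segments) / sum(n for _, n, _ in segments)

report = []
cum_n = cum_resp = cum_profit = 0.0
for name, n, rate in ranked:
    cum_n += n
    cum_resp += n * rate
    cum_profit += expected_profit(n, rate)
    cum_lift = (cum_resp / cum_n) / base_rate
    report.append((name, round(cum_lift, 2), round(cum_profit, 2)))

for row in report:
    print(row)  # (segment, cumulative lift, cumulative expected profit)
```

With these made-up rates, cumulative profit peaks before the last segment is mailed, because segment C responds below the break-even rate of 2%.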

  20. RFM • Recency, Frequency, Monetary • Same purpose as lift • Identify customers more likely to respond • RFM tracks customer transactions by its 3 measures • Code each customer • Often 5 cells for each measure, or 125 combinations • Identify positive response of each of the combinations
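
A minimal sketch of RFM coding with 5 cells per measure follows. The ten customers and their values are hypothetical, and rank-based fifths are only one possible way to form the cells; the slide does not say how cell boundaries are chosen.

```python
def quintile_codes(values, higher_is_better=True):
    """Assign codes 1..5 by rank; 5 marks the best fifth of customers."""
    order = sorted(range(len(values)), key=lambda i: values[i],
                   reverse=not higher_is_better)
    codes = [0] * len(values)
    for rank, i in enumerate(order):
        codes[i] = 1 + (rank * 5) // len(values)
    return codes


# HYPOTHETICAL customers.
ids = ["c01", "c02", "c03", "c04", "c05", "c06", "c07", "c08", "c09", "c10"]
recency_days = [5, 40, 200, 15, 90, 365, 30, 60, 120, 10]   # days since last order
frequency = [12, 4, 1, 9, 3, 2, 7, 5, 6, 10]                # orders per year
monetary = [640, 120, 35, 410, 80, 20, 300, 150, 95, 520]   # spend per year

r_codes = quintile_codes(recency_days, higher_is_better=False)  # recent is better
f_codes = quintile_codes(frequency)
m_codes = quintile_codes(monetary)

# Each customer falls into one of 5 * 5 * 5 = 125 possible cells.
cells = {cid: (r, f, m) for cid, r, f, m in zip(ids, r_codes, f_codes, m_codes)}
for cid, cell in cells.items():
    print(cid, cell)
```

Response rates would then be tallied per cell from past mailings to identify the combinations worth targeting.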

  21. CUSTOMER RELATIONSHIP MANAGEMENT (CRM) • understanding value customer provides to firm • Kathleen Khirallah - The Tower Group • Banks will spend $9 billion on CRM by end of 1999 • Deloitte • only 31% of senior bank executives confident that their current distribution mix anticipated customer needs

  22. Customer Value
      Middle age (41-55), 3-9 years on job, 3-9 years in town, savings account
      year   annual purchases   profit   discounted (1.3 rate)   net
         1               1000      200                     153   153
         2               1000      200                     118   272
         3               1000      200                      91   363
         4               1000      200                      70   433
         5               1000      200                      53   487
         6               1000      200                      41   528
         7               1000      200                      31   560
         8               1000      200                      24   584
         9               1000      200                      18   603
        10               1000      200                      14   618

  23. Younger Customer
      Young (21-29), 0-2 years on job, 0-2 years in town, no savings account
      year   annual purchases   profit   discounted (1.3 rate)   net
         1                300       60                      46    46
         2                360       72                      43    89
         3                432       86                      39   128
         4                518      104                      36   164
         5                622      124                      34   198
         6                746      149                      31   229
         7                896      179                      29   257
         8               1075      215                      26   284
         9               1290      258                      24   308
        10               1548      310                      22   331
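
The two customer value tables can be approximately reproduced with a short net present value calculation. The sketch assumes what the numbers imply but the slides do not state outright: year-k profit is divided by 1.3 to the power k, profit is 20% of purchases, and the younger customer's purchases grow 20% per year from 300. It returns unrounded values, so results differ from the slides by rounding.

```python
def discounted_profits(profits, factor=1.3):
    """Divide the profit of year k (k = 1, 2, ...) by factor**k."""
    return [p / factor ** k for k, p in enumerate(profits, start=1)]


def cumulative(values):
    """Running total, corresponding to the 'net' column."""
    total, out = 0.0, []
    for v in values:
        total += v
        out.append(total)
    return out


# Middle-aged customer: constant 1000 in purchases, 200 profit per year.
middle_profits = [200.0] * 10
# Younger customer (ASSUMED pattern): purchases 300 growing 20% a year, 20% margin.
young_profits = [0.20 * 300 * 1.2 ** k for k in range(10)]

middle_net = cumulative(discounted_profits(middle_profits))
young_net = cumulative(discounted_profits(young_profits))

print(round(middle_net[-1], 2))  # ten-year value, middle-aged customer
print(round(young_net[-1], 2))   # ten-year value, younger customer
```

Over ten years the middle-aged customer is worth roughly twice as much as the younger one at this discount factor, even though the younger customer's annual purchases end up higher.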

  24. Lifetime Value Application Drew et al. (2001), Journal of Service Research 3:3 • Cellular telephone division, major US telecommunications firm • Data on billing, usage, demographics • Neural net model of churn proportion by month of tenure • 36 tenure classes • Tested model on 21,500 subscribers • April 1998 • Trained on 15,000, tested on 6,500

  25. Customer Tenure Segments • Least likely to churn • Left alone • Slight propensity to churn at end of tenure • Moderate pre-expiration marketing • Large spike in churn at expiration • Concentrated marketing efforts before expiration • Highest risk • Continued competitive offers

  26. CREDIT SCORING • Data warehouse including demand deposits, savings, loans, credit cards, insurance, annuities, retirement programs, securities underwriting, other • Statistical & mathematical models (regression) to predict repayment

  27. CREDIT SCORING Bank Loan Applications
      Age   Income   Assets    Debts    Want   On-time
       24    55557    27040    48191    1500         1
       20    17152    11090    20455     400         1
       20    85104        0    14361    4500         1
       33    40921    91111    90076    2900         1
       30    76183   101162   114601    1000         1
       55    80149   511937    21923    1000         1
       28    26169    47355    49341    3100         0
       20    34843        0    21031    2100         1
       20    52623        0    23054   15900         0
       39    59006   195759   161750     600         1

  28. Credit Card Management • Very profitable industry • Card surfing - pay old balance with new card • Promotions typically generate 1000 responses, about 1% • In early 1990s, almost all mass marketing • Data mining improves (lift)

  29. British Credit Card Company • Monthly credit data • Didn’t want those who paid in full (no profit) • Application scoring • Continued what had been done manually for over 50 years • Behavioral scoring • Monitor revolving credit accounts for early warning • 90,000 customers • State variable: cumulative months of missed repayment • Selected sample of 10,000 observations • Initial state all 0 in selected data • Over 70% of customers never left state 0

  30. Analysis • Clustering • Unsupervised partitioning • K-median to get more stable results • Pattern search • Sought patterns from object grouping • Unexpectedly large number of similar objects • Estimated probability of each case belonging to objects

  31. Comparison • Compared clustering partitions with pattern search groupings • Pattern search identified those behaving in anomalous manner

  32. Banking • Among first users of data mining • Used to find out what motivates their customers (reduce churn) • Loan applications • Target marketing • Norwest: 3% of customers provided 44% profits • Bank of America: program cultivating top 10% of customers

  33. CHURN • Customer turnover • Critical to: • telecommunications • banks • human resource management • retailers

  34. Characteristics of Not On-Time
      Age   Income   Assets   Debts    Want   On-time
       28    26169    47355   49341    3100         0
       20    52623        0   23054   15900         0
      Here: • Debts exceed Assets • Age: Young • Income: Low
      BETTER: • Base on statistics, large sample • Supplement data with other relevant variables
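
The caution on this slide can be illustrated against the ten loan applications from slide 27. The sketch below counts how often the "debts exceed assets" pattern also appears among applicants who did pay on time. The tuple layout and names are assumptions made for this illustration.

```python
# Rows from slide 27: (age, income, assets, debts, want, on_time)
applications = [
    (24, 55557, 27040, 48191, 1500, 1),
    (20, 17152, 11090, 20455, 400, 1),
    (20, 85104, 0, 14361, 4500, 1),
    (33, 40921, 91111, 90076, 2900, 1),
    (30, 76183, 101162, 114601, 1000, 1),
    (55, 80149, 511937, 21923, 1000, 1),
    (28, 26169, 47355, 49341, 3100, 0),
    (20, 34843, 0, 21031, 2100, 1),
    (20, 52623, 0, 23054, 15900, 0),
    (39, 59006, 195759, 161750, 600, 1),
]


def debts_exceed_assets(row):
    _, _, assets, debts, _, _ = row
    return debts > assets


late = [r for r in applications if r[5] == 0]
on_time = [r for r in applications if r[5] == 1]

late_flagged = sum(debts_exceed_assets(r) for r in late)
on_time_flagged = sum(debts_exceed_assets(r) for r in on_time)

print(f"late payers with debts > assets:    {late_flagged} of {len(late)}")
print(f"on-time payers with debts > assets: {on_time_flagged} of {len(on_time)}")
```

Both late payers show the pattern, but so do most of the on-time payers, which is why a rule read off two cases needs to be checked statistically on a larger sample.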

  35. Identify Characteristics of Those Who Leave
      Account columns: checking, savings, card, loan (x marks; column positions not preserved in this transcript)
      Age       Time-job   Time-town   Min bal   Accounts
      (years)   (months)   (months)    ($)
      27              12          12       549   x x
      41              18          41      3259   x x x
      28               9          15       286   x x
      55             301           5      2854   x x x
      43              18          18      1112   x x x
      29               6           3         0   x
      38              55          20       321   x x x
      63             185           3      2175   x x x
      26              15          15       386   x x
      46              13          12      1187   x x x
      37              32          25      1865   x x x

  36. Analysis • What are the characteristics of those who leave? • Correlation analysis • Which customers do you want to keep? • Customer value - net present value of customer to the firm

  37. Correlation
                  Age   Time-Job   Time-Town   Min-Bal   Check   Saving   Card   Loan
      Age         1.0        0.6         0.4      -0.4     0.0      0.4    0.2    0.3
      Job                    1.0         0.9      -0.6     0.1      0.6    0.9   -0.2
      Town                               1.0      -0.5    -0.1      0.3    0.5    0.4
      Min-Bal                                      1.0    -0.2      0.3    0.6   -0.1
      Check                                                1.0      0.5    0.2    0.2
      Saving                                                        1.0    0.9    0.3
      Card                                                                 1.0    0.5
      Loan                                                                        1.0
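
For reference, each entry in a matrix like this is a Pearson correlation coefficient between two columns. The sketch below computes one such coefficient from the Age and Time-job columns listed on slide 35. It is an illustration of the calculation only; the result need not match the rounded value in the matrix above, whose underlying sample is not given.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Age (years) and time on job (months) for the 11 customers on slide 35.
ages = [27, 41, 28, 55, 43, 29, 38, 63, 26, 46, 37]
job_months = [12, 18, 9, 301, 18, 6, 55, 185, 15, 13, 32]

r_age_job = pearson(ages, job_months)
print(round(r_age_job, 2))
```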

  38. Bankruptcy Prediction Sung et al. (1999), Journal of MIS 16:1 • Late 20th-century, East Asian corporate bankruptcy critical • Models built for normal & crisis conditions • Used decision tree models for explanation • Discriminant analysis applied to benchmark • Korean corporations • Data for all bankrupt corporations on Korean Stock Exchange, 2nd quarter 1997 to 1st quarter 1998 • 75 such cases – full data on 30 of those • Normal 2nd Qtr 1991 to 1st Qtr 1995 • 56 firms, full data on 26

  39. Korean Bankruptcy Study • Matched bankrupt firms with one or two nonbankrupt firms that had similar assets and size • 56 financial ratios used • Eliminated 16 due to duplication

  40. Financial Ratios • Growth (5) • Profitability (13) • Leverage (9) • Efficiency (6) • Productivity (7) • DV: 0/1 variable of bankruptcy or not

  41. Multivariate Discriminant Analysis • Used stepwise procedure • NORMAL PERIOD Normal = 0.58 * cash flow/assets + 0.0623 * productivity of capital - 0.006 * average inventory turnover • BANKRUPT PERIOD Bankrupt = 0.053 * cash flow/liabilities + 0.056 * productivity of capital + 0.014 * fixed assets/(equity+LT liab)

  42. Decision Tree Models • Used C4.5 • Applied boosting to improve predictive power, improved prediction success • NORMAL RULES • IF productivity of capital > 19.65 THEN OK • IF cash flow/total assets > 5.64 THEN OK • IF cash flow/total assets ≤ 55.64 & productivity of capital ≤ 19.65 THEN bankrupt

  43. CRISIS RULES • IF productivity of capital > 20.61 THEN OK • IF cash flow/liabilities > 2.64 THEN OK • IF fixed assets/(equity+long-term invest) > 87.23 THEN OK • IF cash flow/liabilities ≤ 2.64 AND productivity of capital ≤ 20.61 AND fixed assets/(equity+long-term invest) ≤ 87.23 THEN bankrupt
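
The crisis rules above translate directly into a small classifier. This is a sketch of the rule set as printed on the slide, not the authors' C4.5 model; the function name, parameter names, and the two test firms are hypothetical.

```python
def crisis_rule(productivity_of_capital: float,
                cash_flow_to_liabilities: float,
                fixed_assets_ratio: float) -> str:
    """Apply the crisis-period rules from slide 43.

    fixed_assets_ratio stands for fixed assets / (equity + long-term invest).
    Thresholds are the values printed on the slide.
    """
    if productivity_of_capital > 20.61:
        return "OK"
    if cash_flow_to_liabilities > 2.64:
        return "OK"
    if fixed_assets_ratio > 87.23:
        return "OK"
    # All three ratios are at or below their thresholds.
    return "bankrupt"


# HYPOTHETICAL firms, for illustration only.
print(crisis_rule(25.0, 1.0, 50.0))   # OK: productivity of capital above 20.61
print(crisis_rule(10.0, 1.0, 50.0))   # bankrupt: nothing above its threshold
```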

  44. Comparison

  45. Mortgage Market • Early 1990s - massive refinancing • Need to keep customers happy to retain • Contact current customers who have rates significantly higher than market • a major change in practice • data mining & telemarketing increased Crestar Mortgage’s retention rate from 8% to over 20%

  46. Country Investment Risk • Outcome categories: • Most safe • Developed • Mature emerging markets • New emerging markets • Frontier

  47. Investment Risk Analysis Becerra-Fernandez et al. (2002), Computers and Industrial Engineering 43 • Risk by country • Expert assessment available • Decision tree (C5), neural network models • Data: • Economic indicators (4) • Depth & liquidity (4) • Performance & value (5) • Economic & market risk (4) • Regulation & efficiency (4) • 52 samples, so used bootstrapping
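
Bootstrapping, as mentioned for the 52-sample data set, means drawing repeated samples with replacement from the original observations. The sketch below shows the general idea only; the study's exact procedure is not described in the slide, and the placeholder data, the number of resamples, and the function name are assumptions.

```python
import random


def bootstrap_samples(data, n_resamples, seed=0):
    """Draw n_resamples samples of len(data) items, with replacement."""
    rng = random.Random(seed)
    return [rng.choices(data, k=len(data)) for _ in range(n_resamples)]


# PLACEHOLDER observations: 52 country indices standing in for the real records.
countries = list(range(52))
resamples = bootstrap_samples(countries, n_resamples=100)

# Observations left out of a resample can serve as a test set for that round.
out_of_bag = [sorted(set(countries) - set(s)) for s in resamples]

print(len(resamples), len(resamples[0]))
print(len(out_of_bag[0]), "observations not drawn in the first resample")
```

A model would be fitted on each resample and its error averaged over the rounds, which gives a more stable estimate than a single split of so few cases.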

  48. Models • Decision trees • Pruning rate 50% • Pruning rate 75% • Neural networks • Backpropagation • Fuzzy (ARTMAP) • Learning vector quantization

  49. Results • Decision tree algorithms more accurate • Lower pruning rate – lowest error rate • Neural networks disadvantaged by small data set • Decision tree algorithms consistently optimistic relative to expert ratings

  50. Banking • Fleet Financial Group • $30 million data warehouse • hired 60 database marketers, statistical/quantitative analysts & DSS specialists • expected to add $100 million in profit by 2001
