
KDD’99 Knowledge Discovery Contest

KDD’99 Knowledge Discovery Contest. Advisor: Dr. Hsu Graduate: Yu-Wei Su. Outline. Motivation Objective Contest target KDD’99 Competition: Knowledge Discovery Contest Knowledge Discovery in a Charitable Organization’s Donor Database


Presentation Transcript


  1. KDD’99 Knowledge Discovery Contest Advisor: Dr. Hsu Graduate: Yu-Wei Su Intelligent Database System Lab, IDSL

  2. Outline • Motivation • Objective • Contest target • KDD’99 Competition: Knowledge Discovery Contest • Knowledge Discovery in a Charitable Organization’s Donor Database • Profiling Your Customers Using Bayesian Networks • Opinion

  3. Motivation • Direct mail to all customers is inefficient and costly • Use unsupervised clustering methods instead of the supervised classification methods used in the 1998 competition

  4. Objective • Discover higher-level knowledge from data • Maximize the profit of the predictive model

  5. Contest target • The database is the same one used in the 1998 competition • The data come from an American charity’s June ’97 renewal campaign • Techniques included • Unsupervised clustering • Knowledge-driven segmentation • Association rule discovery • Causal modeling

  6. Contest target (cont’d) • Each team had to build a profit-maximizing predictive model for the ’98 competition task

  7. KDD’99 Competition: Knowledge Discovery Contest • Introduction • Exploratory data analysis • A two-stage prediction model • Understanding the model • Conclusion

  8. Introduction • The paper discusses SAS Institute’s findings • Expands on the ’98 KDD Cup competition • Reveals unusual data anomalies • A two-stage prediction model yields results superior to those of ’98 • Uses a decision tree to better understand the model • Applies a confidence interval to judge model performance

  9. Introduction (cont’d) • The models were built using a 95412-case training data set with known responses • To judge model efficacy, expected gift amounts were calculated for a validation data set with concealed responses

  10. Exploratory data analysis • Elements of successful statistical prediction models • Problem-specific knowledge • Historical data • Analytical savvy

  11. Exploratory data analysis • Four anomalies are immediately apparent

  12. Exploratory data analysis (cont’d)

  13. Exploratory data analysis (cont’d)

  14. A two-stage prediction model • Goal: accurately estimate the probability distribution of the gift amount for a potential donor • Prediction can be done in two ways • Directly estimating the expected gift • Separately estimating the donation probability and the expected gift amount, then multiplying them

  15. A two-stage prediction model (cont’d) • The task was done using two multi-layer perceptron (MLP) neural networks • Gift amount model • Trained first, using only the cases where a gift occurred (5% of the data) • Inputs reflect historical patterns in the gift amount (fitting a class-probability decision tree)

  16. A two-stage prediction model (cont’d) • Gift probability model • Uses all the cases • Inputs reflect recency, frequency, amount (RFA) and demographic data, plus the patterns noted in the exploratory data analysis

  17. A two-stage prediction model (cont’d) • The first-stage MLP • Input layer with five inputs fully connected to 20 hidden units, which are in turn fully connected to a target unit • 4843 cases with TARGET_B=1

  18. A two-stage prediction model (cont’d)

  19. A two-stage prediction model (cont’d) • The second-stage MLP • Input layer with eight inputs fully connected to 20 hidden units, which are in turn fully connected to a target unit • The expected gift amount for each case was calculated as the product of the first- and second-stage model outputs
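
The product rule described on slides 14 to 19 can be sketched as follows. This is a minimal illustration, not SAS Institute's implementation: the function names, the three sample scores, and the 0.68 cost per mailed piece are assumptions introduced here, and plain lists stand in for the outputs of the two MLPs.

```python
def expected_gifts(donation_probs, conditional_amounts):
    """Multiply P(gift) by E[amount | gift], case by case."""
    if len(donation_probs) != len(conditional_amounts):
        raise ValueError("both score lists must cover the same cases")
    return [p * a for p, a in zip(donation_probs, conditional_amounts)]


def select_for_mailing(expected, cost_per_piece):
    """Return the indices of cases whose expected gift exceeds the mailing cost."""
    return [i for i, e in enumerate(expected) if e > cost_per_piece]


# Hypothetical model outputs for three cases (not taken from the contest data).
probs = [0.02, 0.10, 0.05]     # stand-in for the gift probability model
amounts = [20.0, 15.0, 10.0]   # stand-in for the gift amount model

gifts = expected_gifts(probs, amounts)
# The cost per piece is an assumed value; the slides do not state it.
mailed = select_for_mailing(gifts, cost_per_piece=0.68)
print(gifts, mailed)
```

Only the second case clears the assumed cost threshold here, which is the kind of selection a profit-maximizing mailing rule would make.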

  20. A two-stage prediction model (cont’d) • Net revenue of the model on the validation data is $14877.77 • This is $165.53 more than the gold-medal winner at KDD-98

  21. Understanding the model

  22. Brief summary • A smaller mailing size leads to smaller variability in the expected total gift because the variance sum has fewer terms • Further development would incorporate both profit maximization and risk minimization as determinants of the optimum mailing depth
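
One way to read the variance remark on this slide, assuming the gifts G_i of the n mailed cases are independent (an assumption made here, not stated on the slide, and the symbols are introduced only for illustration):

```latex
\[
\operatorname{Var}\!\left(\sum_{i=1}^{n} G_i\right) = \sum_{i=1}^{n} \operatorname{Var}(G_i)
\]
```

Every term on the right is non-negative, so mailing fewer cases (a smaller n) cannot increase the variance of the total gift.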

  23. Knowledge Discovery in a Charitable Organization’s Donor Database • Introduction • Main results • Detailed results • Comments

  24. Introduction • The data set contains about 95000 customers, with an average net donation of slightly over 11 cents per customer and a total net donation of around $10500 under the “mail to all” policy • The task was approached with standard 2-class knowledge discovery combined with Value Weighted Analysis (VWA)
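
A quick arithmetic check of the "mail to all" figures on this slide, sketched in Python. It uses the 95412-case training count from slide 9 as the customer count, which is an assumption about which data set the slide refers to; the variable names are illustrative.

```python
# Customer count taken from slide 9 (training set size); assumed to be the
# "about 95000" referred to on this slide.
cases = 95412
# "Slightly over 11 cents" per customer, so 0.11 is a lower bound.
avg_net_donation = 0.11

mail_to_all_total = cases * avg_net_donation
print(round(mail_to_all_total, 2))
```

Because the average is slightly over 11 cents, the true total sits slightly above this product, which is consistent with the "around $10500" quoted on the slide.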

  25. Main results • Maximal net profit of $15515 when checked against the evaluation data set, compared with the KDD-Cup ’98 best result of $14712 net profit • Built a “white-box” model comprising 11 customer segments that bring a combined net donation of $13397 on the evaluation data set

  26. Main results (cont’d) • Donation segments that are highly profitable and actionable • Approximately 14000 people who live in an area where over 5% of renters pay over $400 per month • Have donated over $100 in the past • Have an average donation of over $12 • Account for $8200 net donation in the training set

  27. Main results (cont’d)

  28. Main results (cont’d) • Identifying donors is a different task from maximizing donations • The variability of profit results • A profit difference of less than $500 cannot be considered significant • A difference of $2000 in profit is not significant across different data sets

  29. Main results (cont’d) • The main discovery and modeling approach was a 1-stage, 2-class model based on VWA

  30. Detailed results • The most significant variables for predicting a customer’s donation behavior are the summaries of previous donation behavior • The NK phenomenon • US census data turn out to be quite strongly connected to the donation performance of the population

  31. Detailed results (cont’d) • Five models were chosen in the end • Two “white-box” models, one relatively simple model based on 40 variables, and two candidates for “best overall” model

  32. Detailed results (cont’d) • Building the white-box model • Total net donation of the 11 segments is $13397 for 55086 customers

  33. Detailed results (cont’d)

  34. Detailed results (cont’d) • Best single model • Selected 31 original variables, plus 9 additional demographic summary variables • (Figure labels: Optimal point, Predict point)

  35. Detailed results (cont’d)

  36. Detailed results (cont’d) • Improving prediction by averaging

  37. Brief summary • Good performance on a left-out test set can hardly be considered a reliable indication of good future performance • The net profit has a very large variance

  38. Profiling Your Customers Using Bayesian Networks • Introduction • Data manipulation and preprocessing • Result • Conclusion

  39. Introduction • Builds two causal models to understand the characteristics of respondents to direct-mail fund-raising campaigns • The first model (response-net) captures how the probability of response to the mailing campaign depends on the independent variables • Data set of 96376 lapsed donors

  40. Introduction (cont’d) • The second network (donation-net) models the dependency of the dollar amount of the gift • The 5% who responded to the ’97 mailing campaign

  41. Data manipulation and preprocessing • Removed redundant variables, variables with more than 99% missing values, and variables with only one state • All continuous variables were discretized into four bins of equal length • 30 variables remained available in the end
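
The discretization step on this slide can be sketched in plain Python. This reads "four bins of equal length" as equal-width binning over each variable's observed range, which is an interpretation rather than something the slide confirms; the function name and the sample ages are hypothetical.

```python
def equal_width_bins(values, n_bins=4):
    """Map each value to a bin index 0..n_bins-1 over the observed range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A single-state variable carries no information; all values share bin 0.
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical continuous variable (not taken from the contest data).
ages = [18, 30, 45, 60, 75, 98]
bins = equal_width_bins(ages)
print(bins)
```

With a range of 18 to 98 each bin spans 20 units, so the sample values fall into bins 0, 0, 1, 2, 2 and 3.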

  42. Data manipulation and preprocessing (cont’d) • These variables can be divided into three groups • Variables with personal information about the donors • Variables describing the donors’ neighborhoods, such as socio-economic and urbanicity indicators • Variables about the donors’ history and promotion history file

  43. Result-profiling respondents

  44. Result-profiling donors

  45. Result • Profit prediction • Both models can be used to predict the expected profit

  46. Brief summary • Showed an application of Bayesian methods to a knowledge discovery task • Maintaining a high response rate to direct fund raising requires continuously updating the donor database

  47. Opinion • The concepts in these papers are too high-level • A simple methodology can achieve great results
