KDD’99 Knowledge Discovery Contest Advisor: Dr. Hsu Graduate: Yu-Wei Su Intelligent Database System Lab, IDSL
Outline • Motivation • Objective • Contest target • KDD’99 Competition: Knowledge Discovery Contest • Knowledge Discovery in a Charitable Organization’s Donor Database • Profiling Your Customers Using Bayesian Networks • Opinion
Motivation • Sending direct mail to all customers is inefficient and costly • Using unsupervised clustering methods instead of the supervised classification methods used in the 1998 competition
Objective • Discovering higher-level knowledge from data • Maximizing the profit of the predictive model
Contest target • The database is the same one used in the 1998 competition • The data come from an American charity’s June ’97 renewal campaign • Techniques included • Unsupervised clustering • Knowledge-driven segmentation • Association rule discovery • Causal modeling
Contest target (cont’d) • Each team had to build a profit-maximizing predictive model for the ’98 competition task
KDD’99 Competition: Knowledge Discovery Contest • Introduction • Exploratory data analysis • A two-stage prediction model • Understanding the model • Conclusion
Introduction • The paper discusses SAS Institute’s findings • Expands on the ’98 KDD Cup competition • Reveals unusual data anomalies • A two-stage prediction model yields results superior to those in ’98 • Uses a decision tree to better understand the model • Applies a confidence interval to judge model performance
Introduction (cont’d) • The models were built using a 95412-case training data set with known response • To judge model efficacy, expected gift amounts were calculated for a validation data set with concealed response
Exploratory data analysis • Elements of a successful statistical prediction model • Problem-specific knowledge • Historical data • Analytical savvy
Exploratory data analysis • Four anomalies are immediately apparent
A two-stage prediction model • Goal: accurately estimate the probability distribution of the gift amount for a potential donor • Prediction can be done in two ways • Directly estimating the expected gift • Separately estimating the donation probability and the expected gift amount, then multiplying them
A two-stage prediction model (cont’d) • The task was done using two multi-layer perceptron (MLP) neural networks • Gift amount model • Uses only the cases where a gift occurred (5% of the data) • Inputs reflect historical patterns in the gift amount (fitting a class-probability decision tree)
A two-stage prediction model (cont’d) • Gift probability model • Uses all the cases • Inputs reflect recency, frequency, amount (RFA), demographic data, and the patterns noted in the exploratory data analysis
A two-stage prediction model (cont’d) • The first-stage MLP • Input layer with five inputs fully connected to 20 hidden units, which are in turn fully connected to a target unit • 4843 cases with TARGET_B=1
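The architecture on this slide (five inputs, 20 hidden units, one target unit) can be sketched as a forward pass. This is only an illustration: the slides give neither the activation functions nor the trained weights, so the tanh hidden units, the linear output, the random weights, and the names `init_mlp`, `forward`, and `n_parameters` are all assumptions.

```python
import math
import random


def init_mlp(n_inputs, n_hidden, seed=0):
    """Return random (untrained) weights for an n_inputs -> n_hidden -> 1 network."""
    rng = random.Random(seed)
    # Each hidden unit holds one weight per input plus a trailing bias term.
    hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)]
              for _ in range(n_hidden)]
    # The target unit holds one weight per hidden unit plus a trailing bias term.
    output = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
    return hidden, output


def forward(weights, x):
    """Compute the network output for one input vector x."""
    hidden, output = weights
    # Assumed activation: tanh on the hidden layer.
    h = [math.tanh(w[-1] + sum(wi * xi for wi, xi in zip(w, x))) for w in hidden]
    # Assumed output: linear, as one would use for a gift-amount regression.
    return output[-1] + sum(wi * hi for wi, hi in zip(output, h))


def n_parameters(weights):
    """Count all weights and biases in the network."""
    hidden, output = weights
    return sum(len(w) for w in hidden) + len(output)


weights = init_mlp(n_inputs=5, n_hidden=20)
print(n_parameters(weights))  # 5*20 weights + 20 biases + 20 weights + 1 bias = 141
print(forward(weights, [0.1, 0.2, 0.3, 0.4, 0.5]))
```

With 141 free parameters against 4843 training cases, a network of this size is small relative to the data, which is one plausible reason for choosing so few inputs.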
A two-stage prediction model (cont’d) • The second-stage MLP • Input layer with eight inputs fully connected to 20 hidden units, which are in turn fully connected to a target unit • The expected gift amount for each case was calculated as the product of the outputs of the first- and second-stage models
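The product rule on this slide can be sketched in a few lines. The mailing cost of $0.68 per piece and the rule "mail when the expected gift exceeds the cost" are assumptions that do not appear in the slides, and the three example cases are made up; `expected_gift` and `select_for_mailing` are hypothetical names.

```python
MAILING_COST = 0.68  # assumed cost per mailed piece, in dollars (not stated in the slides)


def expected_gift(p_donate, amount_if_donate):
    """Expected gift = P(donation) * E[gift amount | donation]."""
    return p_donate * amount_if_donate


def select_for_mailing(cases, cost=MAILING_COST):
    """Keep the cases whose expected gift exceeds the mailing cost.

    cases: list of (case_id, p_donate, amount_if_donate) tuples.
    """
    return [cid for cid, p, amt in cases if expected_gift(p, amt) > cost]


# Made-up outputs of the two models for three cases.
cases = [("a", 0.05, 20.0), ("b", 0.02, 15.0), ("c", 0.10, 8.0)]
print(select_for_mailing(cases))  # ['a', 'c']
```

Case "b" has an expected gift of $0.30, below the assumed cost, so it is dropped even though its predicted gift amount is larger than that of case "c".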
A two-stage prediction model (cont’d) • Net revenue of the model on the validation data is $14877.77 • This is $165.53 more than that of the gold-medal winner at KDD-98
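As a quick consistency check, the two figures on this slide imply the net revenue of the KDD-98 winner:

```latex
\[
14877.77 - 165.53 = 14712.24
\]
```

This agrees with the $14712 quoted as the KDD-Cup 98 best result later in these slides.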
Understanding the model
Brief summary • A smaller mailing size leads to smaller variability in the expected total gift because the variance sum has fewer terms • Further development would incorporate both profit maximization and risk minimization as determinants of the optimum mailing depth
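The variance argument can be written out under one extra assumption not stated in the slides, namely that the gifts $G_i$ of the $n$ mailed cases are independent (the symbols are introduced here for illustration):

```latex
\[
\operatorname{Var}\!\left(\sum_{i=1}^{n} G_i\right)
  = \sum_{i=1}^{n} \operatorname{Var}(G_i),
\qquad \operatorname{Var}(G_i) \ge 0 .
\]
```

Every term is non-negative, so removing a case from the mailing can only lower the variance of the total or leave it unchanged.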
Knowledge Discovery in a Charitable Organization’s Donor Database • Introduction • Main results • Detailed results • Comments
Introduction • The data set contains about 95000 customers, with an average net donation of slightly over 11 cents per customer and a total net donation of around $10500 under the “mail to all” policy • The task was approached with standard 2-class knowledge discovery combined with Value Weighted Analysis (VWA)
Main results • Maximal net profit of $15515 when checked against the evaluation data set, compared to the KDD-Cup ’98 best result of $14712 net profit • Built a “white-box” model composed of 11 customer segments that bring a combined net donation of $13397 on the evaluation data set
Main results (cont’d) • Donation segments that are highly profitable and actionable • Approximately 14000 people who live in an area where over 5% of renters pay over $400 per month • Have donated over $100 in the past • Have an average donation of over $12 • Account for $8200 net donation in the training set
Main results (cont’d) • Identifying donors is a different task from maximizing donations • The variability of profit results • A profit difference of less than $500 cannot be considered significant • A difference of $2000 in profit is not significant across different data sets
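One way to see why profit differences of a few hundred dollars can be noise is to bootstrap the net profit of a mailing. The sketch below is not the authors' procedure; it runs on synthetic data, assuming a 5% response rate, a $15 gift, and a $0.68 mailing cost, and `bootstrap_interval` is a hypothetical helper.

```python
import random


def bootstrap_interval(net_gifts, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the total net profit of a mailing."""
    rng = random.Random(seed)
    n = len(net_gifts)
    # Resample the mailed cases with replacement and total each resample.
    totals = sorted(sum(rng.choices(net_gifts, k=n)) for _ in range(n_resamples))
    low = totals[int((alpha / 2) * n_resamples)]
    high = totals[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high


# Synthetic mailing: each case costs 0.68, and about 5% return a 15-dollar gift.
data_rng = random.Random(1)
net_gifts = [15.0 - 0.68 if data_rng.random() < 0.05 else -0.68
             for _ in range(2000)]

low, high = bootstrap_interval(net_gifts)
print(sum(net_gifts), low, high)
```

Because a small fraction of cases carries all of the revenue, the interval is wide relative to the total, which is consistent with the slide's caution about comparing profit figures.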
Main results (cont’d) • The main discovery and modeling approach was a 1-stage, 2-class model based on VWA
Detailed results • The most significant variables for predicting a customer’s donation behavior are the previous donation behavior summaries • The NK phenomenon • US-census data turns out to be quite strongly connected to the donation performance of the population
Detailed results (cont’d) • Five models were finally chosen • Two “white-box” models, one relatively simple model based on 40 variables, and two candidates for “best overall” model
Detailed results (cont’d) • Building the white-box model • Total net donation of the 11 segments is $13397 for 55086 customers
Detailed results (cont’d) • Best single model • Selected 31 original variables, plus 9 additional demographic summary variables [Figure: optimal point and predicted point]
Detailed results (cont’d) • Improving prediction by averaging
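The slide does not say how the averaging was done. A minimal sketch, assuming a simple unweighted mean of the per-customer scores from several models (the scores and the name `average_predictions` are made up for illustration):

```python
def average_predictions(model_scores):
    """Average per-customer scores across models.

    model_scores: list of score lists, one per model, all of equal length.
    """
    n_models = len(model_scores)
    # zip(*...) groups the scores that different models gave the same customer.
    return [sum(scores) / n_models for scores in zip(*model_scores)]


# Made-up expected-profit scores from three models for three customers.
model_a = [0.9, 0.2, 1.4]
model_b = [1.1, 0.4, 1.0]
model_c = [1.0, 0.3, 1.2]

averaged = average_predictions([model_a, model_b, model_c])
print(averaged)
```

Averaging tends to cancel the idiosyncratic errors of individual models, which matters when the profit estimate has high variance.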
Brief summary • Good performance on a held-out test set can hardly be considered a reliable indication of good future performance • The net profit has a very large variance
Profiling Your Customers Using Bayesian Networks • Introduction • Data manipulation and preprocessing • Result • Conclusion
Introduction • Builds two causal models to understand the characteristics of respondents to direct mail fund raising campaigns • The first model (response-net) captures the dependency of the probability of response to the mailing campaign on the independent variables • Data set of 96376 lapsed donors
Introduction (cont’d) • The second network (donation-net) models the dependency of the dollar amount of the gift • Uses the 5% who responded to the ’97 mailing campaign
Data manipulation and preprocessing • Removed redundant variables, variables with more than 99% missing values, and variables with only one state • All continuous variables were discretized into four bins of equal length • 30 variables were finally available
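Discretizing a continuous variable into four bins of equal length can be sketched as follows. The slides do not describe edge handling, so placing the maximum value in the last bin and mapping a constant variable to bin 0 are assumptions, as are the example ages and the name `equal_width_bins`.

```python
def equal_width_bins(values, n_bins=4):
    """Map each value to a bin index 0..n_bins-1 using bins of equal length."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant variable has a single state; put everything in bin 0.
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum would fall just past the last bin, so clamp it into it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


ages = [18, 25, 40, 41, 63, 90]
print(equal_width_bins(ages))  # [0, 0, 1, 1, 2, 3]
```

Equal-length bins are simple but sensitive to outliers: one extreme value stretches the range and can push most cases into the first bin.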
Data manipulation and preprocessing (cont’d) • These variables can be divided into three groups • Variables holding personal information about the donors • Variables describing the donors’ neighborhoods, such as socio-economic and urbanicity indicators • Variables from the history and promotion history files of the donors
Result: profiling respondents
Result: profiling donors
Result • Profit prediction • Both models can be used to predict the expected profit
Brief summary • Shows an application of Bayesian methods to a knowledge discovery task • Maintaining a high response rate to direct fund raising requires continuously updating the donor database
Opinion • These papers stay at too high a conceptual level • A simple methodology can achieve great results