330 likes | 387 Views
Explore the evolution of KDD Cup competitions from 1997 to 2000, showcasing top methods used, winning strategies, and impact on the knowledge discovery and data mining field.
E N D
KDD Cup Survey Xinyue Liu
Outline • Nuts and Bolts of KDD Cup • KDD Cup 97-99 • KDD Cup 2000 • Summary
About KDD Cup A knowledge discovery and data mining tools competition in conjunction with KDD conferences. It aims at: • showcase the best methods for discovering higher-level knowledge from data. • Helping to close the gap between research and industry • Stimulating further KDD research and development
Statistics • Participation in KDD Cup grew steadily, especially requests to access the data • Average person-hours per submission: 204Max person-hours per submission: 910 • Commercial software grew from 44% (cup 97) to 52% (cup 98) to 77% (cup 2000)
Algorithms Decision trees most widely tried and by far the most commonly submitted
KDD Cup 97 • A classification task – to predict Financial services industry direct mail response • Winners • Charles Elkan, a PhD from UC-San Diego with his Boosted Naive Bayesian (BNB) • Silicon Graphics, Inc with their software MineSet • Urban Science Applications, Inc. with their software gain, Direct Marketing Selection System
BNB • Boosting – to learn a series of classifiers, where each classifier in the series pays more attention to the examples misclassified by its predecessor. Repeated T rounds. • BNB – representationally equivalent to a multilayer perceptron with a single hidden layer. • Complexity – O(ef) e – examples f - attributes
MineSet • A KDD tool that combines data access, transformation, classification, and visualization.
KDD Cup 98 • URL:www.kdnuggets.com/meetings/kdd98/kdd-cup-98.html • A classification task – to analyze fund raising mail responses to a non-profit organization • Winners • Urban Science Applications, Inc. with their software GainSmarts. • SAS Institute, Inc. with their software Enterprise Miner. • Quadstone Limited with their software Decisionhouse
GainSmarts • GainSmarts – a feature selection expert system • First step - used Logistic Regression to assign each prospect a probability of donation (Pi). • Second step - used Linear Regression to estimate a conditional donation amount of responding donors (Ai) • Result (<1% error) - Prediction = Pi * Ai
Enterprise Miner • A data mining solution that addresses the entire data mining process • SEMMA Process • Sample • Explore • Modify • Model • Assess • Algorithms • Decision tree • Neural network • Regression
Decisionhouse • Decisionhouse – an integrated modelling software suite by Quadstone • Data exploration using visualization modules. • Use Decision trees and Scorecards to model more complex tasks. • Choose the final model by comparing a variety of modeling approaches and looking at the difference in predicted net profitability (lift curve).
Results Maximum Possible Profit Line ($72,776 in profits with 4,873 mailed) Mail to Everyone Solution ($10,560 in profits with 96,367 mailed) GainSmarts SAS/Enterprise Miner Quadstone/Decisionhouse
KDD Cup 99 • URL: www.cse.ucsd.edu/users/elkan/kdresults.html • Problem same data set as KDD Cup 98 • Winners • SAS Institute Inc. with their software Enterprise Miner. • Amdocs with their Information Analysis Environment
Software • SAS – using two-stage model which includes two multi-layer perceptron (MLP) neural networks models. • Amdocs – using its own Information Analysis Environment, which allows modeling of the value and class membership simultaneously. Algorithms used is a hybrid logistic regression model
KDD Cup 2000 www.ecn.purdue.edu/KDDCUP/ Sponsored by Purdue University Blue Martini Software
Data Set Data collected from Gazelle.com, a legwear and legcare web retailer • Pre-processed • Training set: 2 months • Test sets: one month • Data collected includes: • Click streams • Order information • Registration form
Problems • The goal – to design models to support web-site personalization and to improve the profitability of the site by increasing customer response. • Questions - Whengiven a set of page views, • will the visitor view another page on the site or leave? • which product brand will the visitor view in the remainder of the session? • characterize heavy spenders • characterize killer pages • characterize which product brand a visitor will view in the remainder of the session?
Evaluation • Accuracy/score was measured for the two questions with test sets • Insight questions judged with help of retail experts from Gazelle and Blue Martini Created a list of insights from all participants • Each insight was given a weigh • Each participant was scored on all insights • Additional factors: • Presentation quality • Correctness
The Winners • Question 1 & 5 Winner: Amdocs • Question 2 & 3 Winner: Salford Systems • Question 4 Winner: e-steam poster
Software (Amdocs) • Exploratory Data Analysis – SAS • Classification Tree – Amdocs Business Insight Tool • Decision tree • Rules Extraction • Modeling • Combining models
Rule Generator Rule Generator Rule Generator 1466 rules 1466 rules 1466 rules 111 continue rules 111 continue rules 111 continue rules Best Best Best Hybrid Hybrid Hybrid Merged Merged Merged Rule Rule Rule Model Model Model Rules Rules Rules Main Model Decision Tree Decision Tree Decision Tree 5 trees 5 trees 5 trees built on 34000 cases built on 34000 cases built on 34000 cases
Sub-models Each model captures a different aspect of the overall behavior in the data. Combining or ensembling the models provides the best prediction results. Best rule Chooses most accurate rule satisfied by each record Logistic regression on rule set + raw field values combine to define score for each record Hybrid Model Logistic regression on rule set defines score for each record as a combination of rules the record satisfies Merged Rules
Software (Salford) • CART - a decision tree tool that automatically searching for and isolating significant patterns and relationships • MARS - a multivariate non-parametric regression procedure • HotSpotDetector • TreeNet
Cart • Binary recursive partitioning. • Key elements: • Splitting rules • Brute force search all possible splits for all variables • Rank each splitting rule on the basis of a quality-of-split criterion (default GINI) • Recursion - split until further splitting is impossible or stopped. • Class assignment • Plurality rule • Assign every node whether it is terminal or not. • Pruning Trees – does not stop in the middle • Testing - best sub-tree is the one with the lowest error
MARs • Automatic variable search • Automatic variable transformation • Automatic limited interaction searches • Variable nesting • Built-in testing regimens model selection parameters.
Insights (Heavy Spenders) • Some of the Good insights • Referrers - establish ad policy based on conversion rates, not click-throughs • Not an AOL user - browser window too small for layout • Referring site traffic changed dramatically over time • Came to site from print-ad or news, not friends & families • Very high and very low income • Geographic: Northeast U.S. states • Repeat visitors
Insights (Who leaves?) • Some of the good insights • Crawlers, bots accounted for 16% of sessions • Long processing time (> 12 seconds) implies high abandonment • Referring sites: mycoupons have long sessions, shopnow.com are prone to exit quickly • Returning visitors' prob of continuing is double • View of specific products (Oroblue,Levante) cause abandonment • Probability of leaving decreases with page views • Free Gift and Welcome templates on first three pages encouraged visitors to stay at site
Insights(Brand view) • Some good insights • Referrer URL is great predictor: • Fashionmall.com and winnie-cooper are referrers for Hanes and Donna Karan • mycoupons.com, tripod, deal-finder are referrers for American Essentials • Previous views of a product imply later views
Summary • Data mining requires background knowledge and access to business users • Successful data mining solutions combine automated and manual analysis, integrating the power of the machine with expert knowledge and human insight • Web Mining is challenging: crawlers/bots, frequent site changes, etc. • KDD Cup is an excellent source to learn the state-of-art KDD techniques • KDD Cup data available for research and education
References Elkan C. (1997). Boosting and Naive Bayesian Learning. Technical Report No. CS97-557, September 1997, UCSD. Decisionhouse (1998). KDD Cup 98: Quadstone Take Bronze Miner Award. Retrieved March 15, 2001 fromhttp://www.kdnuggets.com/meetings/kdd98/quadstone/index.html Urbane Science (1998). Urbane Science wins the KDD-98 Cup. Retrieved March 15, 2001 from http://www.kdnuggets.com/meetings/kdd98/gain-kddcup98-release.html Georges, J. & Milley, A. (1999). KDD’99 Competition: Knowledge Discovery Contest. Retrieved March 15, 2001 from http://www.cse.ucsd.edu/users/elkan/saskdd99.pdf Rosset, S. & Inger A. (1999). KDD-Cup 99 : Knowledge Discovery In a Charitable Organization’s Donor Database. Retrieved March 15, 2001 from http://www.cse.ucsd.edu/users/elkan/KDD2.doc
References (Cont.) Sebastiani P., Ramoni M. & Crea A. (1999). Profiling your Customers using Bayesian Networks. Retrieved March 15, 2001 from http://bayesware.com/resources/tutorials/kddcup99/kddcup99.pdf Inger A., Vatnik N., Rosset S. & Neumann E. (2000). KDD-Cup 2000: Question 1 Winner’s Report. Retrieved March 18, 2000 from http://www.ecn.purdue.edu/KDDCUP/amdocs-slides-1.ppt Neumann E., Vatnik N., Rosset S., Duenias M., Sasson I. & Inger A. (2000). KDD-Cup 2000: Question 5 Winner’s Report. Retrieved March 18, 2000 from http://www.ecn.purdue.edu/KDDCUP/amdocs-slides-5.ppt Salford System white papers: http://www.salford-systems.com/whitepaper.html Summary talk presented at KDD (2000) http://robotics.stanford.edu/~ronnyk/kddCupTalk.ppt