KDD-Cup A Survey: 1997-2012

KDD-Cup A Survey: 1997-2012 Special Thanks to Prof. Qiang YANG’s course materials! (partly based on Xinyue Liu’s slides @SFU, and Nathan Liu’s slides @HKUST) Hong Kong University of Science and Technology

About ACM KDDCUP • ACM KDD: premier conference in knowledge discovery and data mining • ACM KDDCUP: • Worldwide competition held in conjunction with the ACM KDD conferences • It aims at: • Showcasing the best methods for discovering higher-level knowledge from data • Helping to close the gap between research and industry • Stimulating further KDD research and development

Statistics • Participation in KDD Cup grew steadily • Average person-hours per submission: 204 • Max person-hours per submission: 910

KDD Cup 97 • A classification task: predict direct mail response in the financial services industry • Winners • Charles Elkan, a professor at UC San Diego, with his Boosted Naive Bayesian (BNB) • Silicon Graphics, Inc. with their software MineSet • Urban Science Applications, Inc. with their software gain, Direct Marketing Selection System

MineSet (Silicon Graphics Inc.) • A KDD tool that combines data access, transformation, classification, and visualization.

KDD Cup 98: CRM Benchmark • URL: www.kdnuggets.com/meetings/kdd98/kdd-cup-98.html • A classification task: analyze responses to fundraising mail sent by a non-profit organization • Winners • Urban Science Applications, Inc. with their software GainSmarts • SAS Institute, Inc. with their software SAS Enterprise Miner ™ • Quadstone Limited with their software Decisionhouse ™

KDDCUP 1998 Results • Maximum Possible Profit Line: $72,776 in profits with 4,873 mailed • Mail to Everyone Solution: $10,560 in profits with 96,367 mailed • Results shown for: GainSmarts, SAS/Enterprise Miner, Quadstone/Decisionhouse
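To make the gap between the two reference lines concrete, the short sketch below works out profit per piece mailed and the share of the list that the maximum-profit line mails. It uses only the four figures quoted on the slide; the function and variable names are illustrative and not part of the competition materials.

```python
# Figures quoted on the KDD Cup 1998 results slide: (profit in USD, pieces mailed).
SCENARIOS = {
    "maximum possible profit": (72_776, 4_873),
    "mail to everyone": (10_560, 96_367),
}


def profit_per_piece(profit: float, mailed: int) -> float:
    """Average profit earned per piece of mail sent."""
    return profit / mailed


def fraction_mailed(mailed: int, total: int) -> float:
    """Share of the full mailing list that a selective campaign contacts."""
    return mailed / total


if __name__ == "__main__":
    total = SCENARIOS["mail to everyone"][1]
    for name, (profit, mailed) in SCENARIOS.items():
        print(
            f"{name}: ${profit_per_piece(profit, mailed):.2f} per piece, "
            f"{fraction_mailed(mailed, total):.1%} of the list mailed"
        )
```

The comparison suggests why the task rewards ranking: on these numbers, mailing roughly one twentieth of the list could in principle earn about seven times the profit of mailing everyone.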

ACM KDD Cup 1999 • URL: www.cse.ucsd.edu/users/elkan/kdresults.html • Problem: detect network intrusion and protect a computer network from unauthorized users, including perhaps insiders • Data: from DoD • Winners • SAS Institute Inc. with their software Enterprise Miner • Amdocs with their Information Analysis Environment

KDDCUP 2000: Data Set and Goal • Data collected from Gazelle.com, a legwear and legcare Web retailer • Pre-processed • Training set: 2 months • Test sets: one month • Data collected includes: • Click streams • Order information • The goal: to design models that support web-site personalization and improve the profitability of the site by increasing customer response • Questions, when given a set of page views: • characterize heavy spenders • characterize killer pages • characterize which product brand a visitor will view in the remainder of the session

KDD Cup 2001 • 3 Bioinformatics Tasks • Dataset 1: Prediction of Molecular Bioactivity for Drug Design (task 1) • Half a gigabyte when uncompressed • Dataset 2: Prediction of Gene/Protein Function (task 2) and Localization (task 3) • Smaller and easier to understand • 7 megabytes uncompressed • A total of 136 groups participated, producing 200 submitted predictions over the 3 tasks: 114 for Thrombin, 41 for Function, and 45 for Localization

2001 Winners • Task 1, Thrombin: Jie Cheng (Canadian Imperial Bank of Commerce), Bayesian network learner and classifier • Task 2, Function: Mark-A. Krogel (University of Magdeburg), inductive logic programming • Task 3, Localization: Hisashi Hayashi, Jun Sese, and Shinichi Morishita (University of Tokyo), k-nearest neighbor • Task 2: the genes of one particular type of organism • A gene/protein can have more than one function, but only one localization

KDD Cup 2002 • Molecular biology: two tasks • Task 1: Document extraction from biological articles • Task 2: Classification of proteins based on gene deletion experiments • Winners: • Task 1: ClearForest and Celera, USA (Yizhar Regev and Michal Finkelstein) • Task 2: Telstra Research Laboratories, Australia (Adam Kowalczyk and Bhavani Raskutti)

2003 KDDCUP • Information Retrieval/Citation Mining of scientific research papers • Based on a very large archive of research papers • First Task: predict how many citations each paper will receive during the three months leading up to the KDD 2003 conference • Second Task: build a citation graph of a large subset of the archive using only the LaTeX sources • Third Task: estimate each paper's popularity from partial download logs • Last Task: participants devise their own questions

2004 Tasks and Results • (Particle physics; plus protein homology prediction) • Winners of the two tasks: • David S. Vogel, Eric Gottschalk, and Morgan C. Wang • Bernhard Pfahringer, Yan Fu, RuiXiang Sun, Qiang Yang, Simin He, Chunli Wang, Haipeng Wang, Shiguang Shan, Junfa Liu, Wen Gao

Past KDDCUP Overview: 2005-2010

KDDCUP’11 Dataset • 11 years of data • Rated items are • Tracks • Albums • Artists • Genres • Items arranged in a taxonomy • Two tasks

Items in a Taxonomy

Track 1 Details

Track 1 Highlights • Largest publicly available dataset • Large number of items (50 times more than Netflix) • Extreme rating sparsity (20 times sparser than Netflix) • The taxonomy can help with sparsely rated items • Fine-grained timestamps, with both date and time, allow sophisticated temporal modeling

Track 2 Details

Track 2 Highlights • The performance metric focuses on ranking/classification, which differs from traditional collaborative filtering • No validation data is provided; participants need to construct binary labeled data from the rating data themselves • Unlike track 1, track 2 removed timestamps to focus on long-term preference rather than short-term behavior
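One way to picture the "self-constructed binary labels" step is the sketch below. It is not the organizers' or any winning team's procedure. It assumes, purely for illustration, that a rating at or above a threshold counts as a positive example and that negatives are sampled from items the user never rated. The threshold of 80, the sampling ratio, and all names are hypothetical choices.

```python
import random


def build_binary_examples(ratings, all_items, threshold=80,
                          negatives_per_positive=1, seed=0):
    """Turn per-user ratings into (user, item, label) triples.

    ratings:   dict mapping user -> dict mapping item -> numeric score
    all_items: iterable of every known item id
    threshold: hypothetical cut-off; scores >= threshold become label 1
    Negatives (label 0) are sampled from items the user has not rated.
    """
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    catalogue = set(all_items)
    examples = []
    for user, rated in ratings.items():
        positives = [item for item, score in rated.items() if score >= threshold]
        unrated = sorted(catalogue - set(rated))
        n_neg = min(len(unrated), negatives_per_positive * len(positives))
        negatives = rng.sample(unrated, n_neg)
        examples.extend((user, item, 1) for item in positives)
        examples.extend((user, item, 0) for item in negatives)
    return examples


if __name__ == "__main__":
    toy_ratings = {"u1": {"a": 90, "b": 30}, "u2": {"c": 85}}
    for row in build_binary_examples(toy_ratings, ["a", "b", "c", "d"]):
        print(row)
```

A held-out slice of triples built this way can stand in for the missing validation set, with the caveat that how negatives are sampled strongly shapes what the classifier learns.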

Submission Stats

Winners

Chinese Teams at KDDCUP (NTU, CAS, HKUST) Nathan Liu: HKUST CSE PhD student

KDDCUP 2012 • Tencent • Task 1: Micro-blog (Weibo) User Recommendation • Recommends a popular person / an organization / a group TO a user • Task 2: Ad click-through rate prediction from search log • How often will an Ad be clicked by a user?

Task1: User recommendation UI • Popular user recommendation

Task2: Ad click-through rate prediction

Task1 Data – User-Item Matrix • rec_log_train.txt / rec_log_test.txt • Fields: UserID ItemID ?followed TimeStamp • Sample records:
2088948 1760350 -1 1318348785
2088948 1774722 -1 1318348785
2088948 786313 -1 1318348785
601635 1775029 -1 1318348785
601635 1902321 -1 1318348785
601635 462104 -1 1318348785
1529353 1774509 -1 1318348786
• ~75M records in training data • ?followed: 1 if the user accepts the recommendation, -1 if not • In test data it is filled with 0, to be predicted as -1/1 • TimeStamp: Unix timestamp, seconds since 1970-01-01 00:00:00 (UTC)
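The four-field record layout described above can be read with a few lines of Python. The sketch below is a minimal illustration: the class and function names are made up here, and whitespace-separated fields are an assumption based on how the sample rows appear on the slide.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class RecLogRecord(NamedTuple):
    user_id: int
    item_id: int
    followed: int        # -1 or 1 in training data, 0 in test data
    timestamp: datetime  # converted from Unix seconds to a UTC datetime


def parse_rec_log_line(line: str) -> RecLogRecord:
    """Parse one 'UserID ItemID ?followed TimeStamp' row."""
    user_id, item_id, followed, ts = line.split()
    return RecLogRecord(
        int(user_id),
        int(item_id),
        int(followed),
        datetime.fromtimestamp(int(ts), tz=timezone.utc),
    )


# Two of the sample rows shown on the slide.
SAMPLE = """\
2088948 1760350 -1 1318348785
1529353 1774509 -1 1318348786
"""

if __name__ == "__main__":
    for raw in SAMPLE.splitlines():
        record = parse_rec_log_line(raw)
        print(record.user_id, record.item_id, record.followed,
              record.timestamp.isoformat())
```

With about 75M training rows, streaming the file line by line through a parser like this is likely more practical than loading it all into memory at once.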

Task2 Data – Main Data Table • Extremely large training data • ~150M records • 10 GB raw csv file + keywords + userProfiles • Predicting CTR helps the search provider rank and price ads correctly
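Click-through rate is the number of clicks divided by the number of times an ad was shown. The sketch below computes it, along with a smoothed variant that pulls sparse ads toward a prior. It assumes each log row carries a click count and an impression count, which the slide does not spell out. The smoothing scheme and its default values are illustrative choices, not the competition's scoring rule.

```python
def click_through_rate(clicks: int, impressions: int) -> float:
    """Raw CTR: the fraction of impressions that led to a click."""
    if impressions <= 0:
        raise ValueError("impressions must be positive")
    return clicks / impressions


def smoothed_ctr(clicks: int, impressions: int,
                 prior_ctr: float = 0.05, prior_strength: float = 10.0) -> float:
    """CTR shrunk toward prior_ctr (hypothetical defaults).

    Equivalent to adding prior_strength pseudo-impressions that clicked at
    rate prior_ctr, so ads with few impressions do not get extreme estimates.
    """
    return (clicks + prior_ctr * prior_strength) / (impressions + prior_strength)


if __name__ == "__main__":
    print(click_through_rate(3, 60))
    print(smoothed_ctr(1, 1))       # one click on one impression, heavily shrunk
    print(smoothed_ctr(300, 6000))  # plenty of data, close to the raw rate
```

Smoothing of this kind matters when ranking ads, because an ad clicked once in a single impression would otherwise outrank every ad with a long, reliable history.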

Winners

Summary • Placing at the top of KDDCUP requires • Teamwork • Expertise in domain knowledge as well as mathematical tools • Often done by world-famous institutes and companies • Recent trends: • Datasets are increasingly realistic • Participants are increasingly professional • Tasks are increasingly difficult

Summary • KDD Cup is an excellent source for learning state-of-the-art KDD techniques • KDDCUP datasets often become standard benchmarks for future research, development and teaching • Top winners are highly regarded and respected • References: http://www.sigkdd.org/kddcup/index.php
