Introducing Project 2/Assignment 3: KDDCUP 2012 (Track 1)
COMP 4332 Tutorial 9
April 8
CHEN Zhao
KDDCup 2012 hosted by Tencent
• Website by Kaggle: https://www.kddcup2012.org/
• Tencent: 5th largest Internet company in the world by market cap (after Google, Amazon, eBay, Facebook)
  • QQ (online messenger)
  • Weibo (micro-blog system): the KDDCup 2012 data source
  • QZone (blog system)
  • Games, Mail, WeChat, etc.
• Two tasks
  • User recommendation
  • Ad click-through rate prediction
• US$16,000 prize
KDDCUP 2012 tasks
• Task 1: Micro-blog user recommendation
  • Recommend a popular person, an organization, or a group to a user
• Task 2: Ad click-through rate prediction
  • How often will an ad be clicked by a user?
Project 2 only includes Task 1.
User recommendation UI
[Screenshot: popular user recommendation]
You don’t need this website to do the task.
Data
• Download: http://www.kddcup2012.org/c/kddcup2012-track1/data
• Description: http://www.kddcup2012.org/c/kddcup2012-track1
• File list:
  • rec_log_train.txt (1.99 GB)
  • rec_log_test.txt (1 GB)
  • user_profile.txt (55.8 MB)
  • item.txt (1.18 MB)
  • user_action.txt (217 MB)
  • user_sns.txt (740 MB)
  • user_key_word.txt (182 MB)
Data - item
• item.txt:
  • An item is a specific user in Tencent Weibo (a person, an organization, or a group) that was selected and recommended to other users.
  • Total: 6096
• Line format: ID category keywords
  • Category is a four-field string representing a hierarchical category.
• One example:
  2335869 8.1.4.2 412042;974;85658;...;6183;974
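As a minimal sketch of reading one item.txt line, the function below splits the three fields and expands the four-field category into its hierarchy levels. The function name, the returned dictionary keys, and the whitespace-separated layout are assumptions for illustration; the sample line is the slide's example with the elided keywords simply left out.

```python
def parse_item_line(line):
    """Parse one item.txt line: ID, dotted category, ';'-separated keywords."""
    # Assumption: fields are separated by whitespace (space or tab).
    item_id, category, keywords = line.split()
    levels = category.split(".")
    return {
        "id": int(item_id),
        # "8.1.4.2" -> ["8", "8.1", "8.1.4", "8.1.4.2"], one entry per level
        "category_path": [".".join(levels[:i + 1]) for i in range(len(levels))],
        "keywords": [int(k) for k in keywords.split(";")],
    }


item = parse_item_line("2335869 8.1.4.2 412042;974;85658;6183;974")
print(item["category_path"])
```

Keeping every prefix of the category makes it easy to compare two items at any level of the hierarchy, for example to check whether they share the same top-level category.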
Data – user profile
• user_profile.txt
• Line format: ID birth-year gender #tweets tags
  • Tag 0: this user has not marked any tag.
• Examples:
  100044 1899 1 5 831;55;198;8;450;7;39;5;111
  100054 1987 2 6 0
  100065 1989 1 57 0
Data – user-item matrix
• rec_log_train.txt / rec_log_test.txt
• Line format: UserID ItemID ?followed TimeStamp
• ?followed: 1 if the user accepted the recommendation, -1 if not
  • In the test data it is filled with 0 and has to be predicted as -1 or 1.
• TimeStamp: Unix timestamp
  • Seconds since 1970-01-01 00:00:00 (UTC)
• Examples:
  2088948 1760350 -1 1318348785
  2088948 1774722 -1 1318348785
  2088948 786313 -1 1318348785
  601635 1775029 -1 1318348785
  601635 1902321 -1 1318348785
  601635 462104 -1 1318348785
  1529353 1774509 -1 1318348786
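A minimal sketch of parsing one line of the recommendation log, including the conversion of the Unix timestamp to a UTC date. The function name and the whitespace-separated layout are assumptions; the sample line is taken from the slide.

```python
from datetime import datetime, timezone


def parse_rec_log_line(line):
    """Return (user_id, item_id, followed, time) for one rec_log line."""
    # Assumption: fields are separated by whitespace (space or tab).
    user_id, item_id, followed, ts = line.split()
    # The timestamp counts seconds since 1970-01-01 00:00:00 UTC.
    when = datetime.fromtimestamp(int(ts), tz=timezone.utc)
    return int(user_id), int(item_id), int(followed), when


user_id, item_id, followed, when = parse_rec_log_line(
    "2088948 1760350 -1 1318348785"
)
print(user_id, item_id, followed, when.isoformat())
```

Passing `tz=timezone.utc` avoids depending on the local time zone of the machine that runs the script.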
Data -- Other information on users
• user_action.txt
  • If user A has retweeted user B 5 times, has mentioned (“at”) B 3 times, and has commented on user B 6 times, then there is one line “A B 3 5 6” in user_action.txt (the counts are ordered: at, retweet, comment).
• user_sns.txt
  • (Follower-userid)\t(Followee-userid)
• user_key_word.txt
  • Contains the keywords extracted from the tweets/retweets/comments of each user.
• For details: http://www.kddcup2012.org/c/kddcup2012-track1
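The field order of user_action.txt is easy to mix up, so here is a small sketch that names each count explicitly. The `UserAction` tuple, the function name, and the numeric IDs 1001 and 1002 (standing in for users A and B) are hypothetical; the order at, retweet, comment follows the “A B 3 5 6” example.

```python
from collections import namedtuple

# Hypothetical record type; field order mirrors the "A B 3 5 6" example.
UserAction = namedtuple(
    "UserAction", ["src", "dst", "at_count", "retweet_count", "comment_count"]
)


def parse_user_action_line(line):
    """Parse one user_action.txt line into a named record."""
    # Assumption: fields are separated by whitespace (space or tab).
    src, dst, n_at, n_retweet, n_comment = line.split()
    return UserAction(int(src), int(dst), int(n_at), int(n_retweet), int(n_comment))


# User 1001 (A) mentioned 1002 (B) 3 times, retweeted 5 times, commented 6 times.
action = parse_user_action_line("1001 1002 3 5 6")
print(action.at_count, action.retweet_count, action.comment_count)
```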
Collaborative Filtering (CF) in one slide
• Will a user follow an item (popular person/organization/group)?
[Figure: users and items, with “?” marking the unknown entries]
More CF in next tutorial.
Evaluation metric, AP@n
• http://www.kddcup2012.org/c/kddcup2012-track1/details/Evaluation
• Page 7 of http://sas.uwaterloo.ca/stats_navigation/techreports/04WorkingPapers/2004-09.pdf
• ap@n = Σ k=1,...,n P(k) / (number of items clicked among the m recommended items)
  • P(k) is the precision at cut-off k, and it counts only when the k-th item was clicked (otherwise it is 0).
• (1) If among the 5 items recommended to the user, the user clicked #1, #3, #4, then ap@3 = (1/1 + 2/3)/3 ≈ 0.56
• (2) If among the 4 items recommended to the user, the user clicked #1, #2, #4, then ap@3 = (1/1 + 2/2)/3 ≈ 0.67
• (3) If among the 3 items recommended to the user, the user clicked #1, #3, then ap@3 = (1/1 + 2/3)/2 ≈ 0.83
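The three worked examples can be checked with a short sketch. The function name and the representation of a recommendation list as booleans (one per position, True meaning clicked) are assumptions for illustration; returning 0 when nothing was clicked is also an assumption, since the slide does not cover that case.

```python
def average_precision_at_n(clicked, n=3):
    """clicked: booleans in recommendation order, True if the item was clicked."""
    total_clicked = sum(clicked)  # clicks among all m recommended items
    if total_clicked == 0:
        return 0.0  # assumption: define the score as 0 when nothing was clicked
    hits = 0
    score = 0.0
    for k, was_clicked in enumerate(clicked[:n], start=1):
        if was_clicked:
            hits += 1
            score += hits / k  # P(k), added only at clicked positions
    return score / total_clicked


print(round(average_precision_at_n([True, False, True, True, False]), 2))  # example (1)
print(round(average_precision_at_n([True, True, False, True]), 2))         # example (2)
print(round(average_precision_at_n([True, False, True]), 2))               # example (3)
```

Note that the denominator counts clicks over the whole list, not only the first n positions, which is why example (2) divides by 3 even though only two clicks fall within the top 3.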
Leaderboard • http://www.kddcup2012.org/c/kddcup2012-track1/leaderboard
An open challenge
• There is little literature on this application.
• Generally, methods from CF/RecSys can be used.
• Try simple heuristics first.
• Basic target: beat the “Example Submission”.
Simple heuristics
• Recommend the most popular items to every user.
• Recommend the items with a high acceptance ratio.
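Both heuristics can be sketched from the training log alone. This is one possible reading, not a reference solution: “popular” is taken here to mean the number of accepted recommendations, and the function names, the `min_shown` threshold, and the toy records are all hypothetical.

```python
from collections import Counter


def acceptance_stats(records):
    """records: iterable of (user_id, item_id, followed) with followed in {-1, 1}."""
    shown, accepted = Counter(), Counter()
    for _user_id, item_id, followed in records:
        shown[item_id] += 1
        if followed == 1:
            accepted[item_id] += 1
    return shown, accepted


def top_by_popularity(records, n=3):
    """Items ranked by how many times they were accepted (ties: smaller ID first)."""
    _shown, accepted = acceptance_stats(records)
    return sorted(accepted, key=lambda item: (-accepted[item], item))[:n]


def top_by_acceptance_ratio(records, n=3, min_shown=1):
    """Items ranked by accepted / shown, ignoring items shown fewer than min_shown times."""
    shown, accepted = acceptance_stats(records)
    ratio = {i: accepted[i] / shown[i] for i in shown if shown[i] >= min_shown}
    return sorted(ratio, key=lambda item: (-ratio[item], item))[:n]


# Toy training records (hypothetical IDs).
toy = [(1, 10, 1), (2, 10, 1), (3, 10, -1), (4, 10, -1),
       (1, 20, 1), (2, 20, -1), (3, 30, 1)]
print(top_by_popularity(toy, n=2))
print(top_by_acceptance_ratio(toy, n=2))
```

On real data a `min_shown` larger than 1 is probably worth trying, because an item shown once and accepted once gets a ratio of 1.0 from very little evidence.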
Assignment 3
• A survey of solutions to KDDCUP 2012 Track 1
  • Cover at least 2 of the top 3 teams (http://www.kddcup2012.org/workshop)
  • Compare their pros and cons
  • Try to propose some improvements
• Deadline: 17th April 2014
• Get prepared.
• Requirements of Project 2 will be released next week.