Heterogeneous Cross Domain Ranking in Latent Space
Bo Wang¹, Jie Tang², Wei Fan³, Songcan Chen¹, Zi Yang², Yanzhu Liu⁴
¹Nanjing University of Aeronautics and Astronautics ²Tsinghua University ³IBM T.J. Watson Research Center, USA ⁴Peking University
Introduction
• The web is becoming more and more heterogeneous
• Ranking is a fundamental problem on the web
• unsupervised vs. supervised
• homogeneous vs. heterogeneous
Motivation
Main challenges of heterogeneous cross-domain ranking:
1) How to capture the correlation between heterogeneous objects?
2) How to preserve the preference orders between objects across heterogeneous domains?
Outline • Related Work • Heterogeneous cross domain ranking • Experiments • Conclusion
Related Work • Learning to rank • Supervised: [Burges, 05] [Herbrich, 00] [Xu and Li, 07] [Yue, 07] • Semi-supervised: [Duh, 08] [Amini, 08] [Hoi and Jin, 08] • Ranking adaptation: [Chen, 08] • Transfer learning • Instance-based: [Dai, 07] [Gao, 08] • Feature-based: [Jebara, 04] [Argyriou, 06] [Raina, 07] [Lee, 07] [Blitzer, 06] [Blitzer, 07] • Model-based: [Bonilla, 08]
Outline • Related Work • Heterogeneous cross domain ranking • Basic idea • Proposed algorithm: HCDRank • Experiments • Conclusion
[Figure: example for the query “data mining” — conferences (source domain) and experts (target domain) are mapped into a shared latent space, where mis-ranked pairs are marked in each domain. The target domain might be empty, i.e., contain no labelled data.]
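To make the latent-space idea concrete, here is a minimal numpy sketch (an illustration, not the paper's implementation): each domain gets its own projection into a shared latent space, where a single weight vector can score objects from either domain. The matrix names, dimensions, and random values are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

d_src, d_tgt, k = 16, 17, 5   # feature dims (conference/expert) and latent dim

# Hypothetical per-domain projection matrices (learned in practice).
M_src = rng.normal(size=(k, d_src))
M_tgt = rng.normal(size=(k, d_tgt))

# A single ranking weight vector lives in the shared latent space.
w = rng.normal(size=k)

def score(x, M):
    """Project a raw feature vector into the latent space and score it."""
    return float(w @ (M @ x))

conference = rng.normal(size=d_src)   # a source-domain object
expert = rng.normal(size=d_tgt)       # a target-domain object
print(score(conference, M_src), score(expert, M_tgt))
```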
Learning Task
• In the HCD ranking problem, the transfer ranking task can be defined as:
• Given a limited number of labeled examples L_T and a large number of unlabeled examples S from the target domain, together with sufficient labeled data L_S from the source domain, the goal is to learn a ranking function f_T^* for predicting the rank levels of the unlabeled data in the target domain.
• Key issues:
• The feature distributions, or even the feature spaces, differ across domains
• The number of rank levels may differ
• The number of labeled training examples is very unbalanced (thousands in the source domain vs. a few in the target domain)
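In symbols, the task can be restated as follows (the superscripts and set sizes are added here for clarity and follow the slide's notation, not necessarily the paper's exact formulation):

```latex
% Inputs: labeled source data, a few labeled target examples, unlabeled target data
L_S = \{(\mathbf{x}_i^S, y_i^S)\}_{i=1}^{n_S}, \qquad
L_T = \{(\mathbf{x}_j^T, y_j^T)\}_{j=1}^{n_T} \;\;(n_T \ll n_S,\ \text{possibly } n_T = 0), \qquad
S = \{\mathbf{x}_k^T\}_{k=1}^{m}

% Goal: a target-domain ranking function
f_T^{*} : \mathcal{X}_T \to \mathbb{R}, \qquad
f_T^{*}(\mathbf{x}) > f_T^{*}(\mathbf{x}') \;\Leftrightarrow\; \mathbf{x} \succ \mathbf{x}'
```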
The Proposed Algorithm — HCDRank
• How to define? Loss function: the number of mis-ranked pairs in each domain (non-convex)
• Objective: loss function in the source domain + loss function in the target domain + penalty:
min_f  L_S(f) + C · L_T(f) + λ · Ω(f)
• C: cost-sensitive parameter which deals with the imbalance of labeled data between the domains
• λ: balances the empirical loss and the penalty
• How to optimize? The non-convex primal problem is unsolvable directly; it is optimized through its dual problem
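As an illustration of such an objective, the sketch below uses the standard convex hinge-loss upper bound on the number of mis-ranked pairs; the surrogate, the variable names, and the assumption that both domains are already mapped into a common space are simplifications, not the paper's exact formulation.

```python
import numpy as np

def pairwise_hinge_loss(w, pairs):
    """Convex upper bound on the number of mis-ranked pairs.
    pairs: iterable of (x_hi, x_lo) where x_hi should rank above x_lo."""
    return sum(max(0.0, 1.0 - float(w @ (x_hi - x_lo))) for x_hi, x_lo in pairs)

def hcd_objective(w, source_pairs, target_pairs, C, lam):
    # source loss + cost-weighted target loss + L2 penalty:
    # C compensates for the label imbalance between the domains,
    # lam trades the empirical loss against the penalty.
    return (pairwise_hinge_loss(w, source_pairs)
            + C * pairwise_hinge_loss(w, target_pairs)
            + lam * float(w @ w))

rng = np.random.default_rng(0)
pairs = [(rng.normal(size=5), rng.normal(size=5)) for _ in range(4)]
print(hcd_objective(rng.normal(size=5), pairs, pairs[:1], C=10.0, lam=0.1))
```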
The algorithm proceeds in four steps (d: number of features, N: number of instance pairs for training, s: number of non-zero features, T: number of iterations):
1) Alternately optimize the matrices M and D — O(2T · sN log N)
2) Construct the transformation matrix — O(d³)
3) Learn the weight vector of the target domain in the latent space — O(sN log N)
4) Apply the learnt weight vector to predict
Overall complexity: O((2T + 1) · sN log N + d³)
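An illustrative control-flow skeleton of these four steps (the inner updates are trivial stubs standing in for the paper's sub-problem solvers; all names, shapes, and placeholder computations are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# --- placeholder inner solvers: stand-ins for the real sub-problems ---
def update_M(M, D):                         # step 1a, stub update for M
    return 0.5 * (M + M.T)

def update_D(M, D):                         # step 1b, stub update for D
    return D

def hcdrank_train(d, k, T=10):
    """Control-flow skeleton of HCDRank training; the real updates are
    the paper's sub-problems, replaced here by trivial stubs."""
    M = rng.normal(size=(d, d))
    D = np.eye(d)
    for _ in range(T):                      # 1) alternate M and D: O(2T * sN log N)
        M = update_M(M, D)
        D = update_D(M, D)
    _, vecs = np.linalg.eigh(M + D)         # 2) transformation matrix: O(d^3)
    P = vecs[:, -k:].T                      #    keep the top-k latent directions
    w_T = rng.normal(size=k)                # 3) learn target weights: O(sN log N), stub
    return P, w_T

def hcdrank_predict(P, w_T, X):             # 4) apply the learnt weights to predict
    return (X @ P.T) @ w_T

P, w_T = hcdrank_train(d=17, k=5)
scores = hcdrank_predict(P, w_T, rng.normal(size=(8, 17)))
```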
Outline • Related Work • Heterogeneous cross domain ranking • Experiments • Ranking on Homogeneous data • Ranking on Heterogeneous data • Ranking on Heterogeneous tasks • Conclusion
Experiments
• Data sets
• Homogeneous data set: LETOR_TR
• 50/75/106 queries with 44/44/25 features for TREC2003_TR, TREC2004_TR and OHSUMED_TR
• Heterogeneous academic data set: ArnetMiner.org
• 14,134 authors, 10,716 papers, and 1,434 conferences
• Heterogeneous task data set:
• 9 queries, 900 experts, 450 best supervisor candidates
• Evaluation measures
• P@n: precision at n
• MAP: mean average precision
• NDCG: normalized discounted cumulative gain
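For reference, minimal implementations of the three measures for a single ranked list (standard textbook definitions, not code from the paper; binary relevance for P@n and AP, graded relevance for NDCG):

```python
import numpy as np

def precision_at_n(rels, n):
    """P@n for a ranked list of binary relevance labels (1 = relevant)."""
    return sum(rels[:n]) / n

def average_precision(rels):
    """AP for one query; MAP is the mean of AP over all queries."""
    hits, total = 0, 0.0
    for i, r in enumerate(rels, start=1):
        if r:
            hits += 1
            total += hits / i
    return total / max(hits, 1)

def ndcg_at_n(gains, n):
    """NDCG@n for graded relevance labels (higher gain = more relevant)."""
    def dcg(g):
        return sum((2**v - 1) / np.log2(i + 2) for i, v in enumerate(g[:n]))
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0

rels = [1, 0, 1, 1, 0]          # toy ranked list
print(precision_at_n(rels, 3), average_precision(rels), ndcg_at_n(rels, 3))
```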
Ranking on Homogeneous data
• LETOR_TR
• We made a slight revision of LETOR 2.0 to fit the cross-domain ranking scenario
• three sub-datasets: TREC2003_TR, TREC2004_TR, and OHSUMED_TR
• Baselines
[Figures: ranking results on TREC2003_TR (cross-domain cosine similarity = 0.01), TREC2004_TR (cosine similarity = 0.23), and OHSUMED_TR (cosine similarity = 0.18)]
Observations
• Ranking accuracy: HCDRank is +5.6% to +6.1% better in terms of MAP
• Effect of domain difference: when the cosine similarity between domains is high (TREC2004), simply combining the two domains already results in better ranking performance
• Training time: next slide
Training Time
• BUT: HCDRank can easily be parallelized
• And the training process only needs to be run once on a data set
Ranking on Heterogeneous data
• ArnetMiner data set (www.arnetminer.org): 14,134 authors, 10,716 papers, and 1,434 conferences
• Training and test data set:
• the 44 most frequently queried keywords from the log file
• Author collection: Libra, Rexa and ArnetMiner
• Conference collection: Libra, ArnetMiner
• Ground truth:
• Conference: online resources
• Expert: two faculty members and five graduate students from CS provided human judgments for expert ranking
Feature Definition
• 16 features for a conference, 17 features for an expert
Observations
• Ranking accuracy: HCDRank outperforms the baselines, especially the two unsupervised systems
• Feature analysis (next slide): the final weight vector exploits information from both domains and adjusts the weights learned from single-domain data
• Training time: next slide
Ranking on Heterogeneous tasks
• Expert finding task vs. best supervisor finding task
• Training and test data set:
• expert finding task: ranking lists from ArnetMiner or annotated lists
• best supervisor finding task: the 9 most frequent queries from the ArnetMiner log file
• For each query, we collected 50 best supervisor candidates and sent emails to 100 researchers for annotation
• Ground truth:
• collection of feedbacks about the candidates (yes / no / not sure)
Best supervisor finding
• Training/test set and ground truth:
• 724 mails sent [figure: fragment of the mail]
• more than 82 feedbacks in effect (and increasing)
• each candidate is rated by the definite feedbacks (yes/no)
Outline • Related Work • Heterogeneous cross domain ranking • Experiments • Conclusion
Conclusion
• We formally define the problem of heterogeneous cross-domain ranking and propose a general framework
• We provide a solution under the regularized framework by simultaneously minimizing two ranking loss functions in the two domains
• The experimental results on three different genres of data sets verified the effectiveness of the proposed algorithm