
Heterogeneous Cross Domain Ranking in Latent Space


Presentation Transcript


  1. Heterogeneous Cross Domain Ranking in Latent Space Bo Wang1, Jie Tang2, Wei Fan3, Songcan Chen1, Zi Yang2, Yanzhu Liu4 1Nanjing University of Aeronautics and Astronautics, 2Tsinghua University, 3IBM T.J. Watson Research Center, USA, 4Peking University

  2. Introduction • The web is becoming more and more heterogeneous • Ranking is a fundamental problem on the web • Unsupervised vs. supervised • Homogeneous vs. heterogeneous

  3. Motivation • Main challenges: 1) How to capture the correlation between heterogeneous objects? 2) How to preserve the preference orders between objects across heterogeneous domains? • These two challenges motivate the heterogeneous cross-domain ranking problem

  4. Outline • Related Work • Heterogeneous cross domain ranking • Experiments • Conclusion

  5. Related Work • Learning to rank • Supervised: [Burges, 05] [Herbrich, 00] [Xu and Li, 07] [Yue, 07] • Semi-supervised: [Duh, 08] [Amini, 08] [Hoi and Jin, 08] • Ranking adaptation: [Chen, 08] • Transfer learning • Instance-based: [Dai, 07] [Gao, 08] • Feature-based: [Jebara, 04] [Argyriou, 06] [Raina, 07] [Lee, 07] [Blitzer, 06] [Blitzer, 07] • Model-based: [Bonilla, 08]

  6. Outline • Related Work • Heterogeneous cross domain ranking • Basic idea • Proposed algorithm: HCDRank • Experiments • Conclusion

  7. [Figure: for the query "data mining", a conference ranking (source domain) and an expert ranking (target domain) are connected through a shared latent space; the target domain might be empty (no labelled data), and mis-ranked pairs are counted in both domains]

  8. Learning Task • In the HCD ranking problem, the transfer ranking task can be defined as: given a limited number of labeled examples L_T and a large number of unlabeled examples S from the target domain, together with sufficient labeled data L_S from the source domain, learn a ranking function f_T^* that predicts the rank levels of the unlabeled data in the target domain • Key issues: • Different feature distributions / different feature spaces • Different numbers of rank levels • Very unbalanced numbers of labeled training examples (thousands vs. a few) • A sketch of the underlying pairwise view follows this slide
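(A minimal, hypothetical sketch of the pairwise view behind "mis-ranked pairs": graded relevance labels are turned into preference pairs within each query, so the source and target domains can use different numbers of rank levels. The function name and data layout are assumptions for illustration, not part of the paper.)

```python
from itertools import combinations

def preference_pairs(query_docs):
    """Turn graded labels into (higher, lower) preference pairs.

    query_docs maps a query id to a list of (feature_vector, rank_level)
    tuples; pairs are formed only within a query, so the number of rank
    levels may differ between the source and target domains.
    """
    pairs = []
    for docs in query_docs.values():
        for (x_i, y_i), (x_j, y_j) in combinations(docs, 2):
            if y_i > y_j:
                pairs.append((x_i, x_j))   # x_i should be ranked above x_j
            elif y_j > y_i:
                pairs.append((x_j, x_i))
    return pairs
```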

  9. The Proposed Algorithm: HCDRank • How to define the loss? The natural loss function is the number of mis-ranked pairs, which is non-convex and makes the problem unsolvable directly, so it is relaxed and optimized through the dual problem • Objective: loss function in the source domain + loss function in the target domain + a regularization penalty (a rough sketch of this structure follows this slide) • C: cost-sensitive parameter which deals with the imbalance of labeled data between the two domains • \lambda: balances the empirical loss and the penalty
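(To make the structure of the objective concrete, here is a rough sketch rather than the paper's exact formulation: a convex pairwise hinge loss stands in for the non-convex count of mis-ranked pairs, the target-domain loss is weighted by the cost-sensitive parameter C, and \lambda scales the penalty. The single shared weight vector w and the function names are simplifying assumptions; HCDRank actually couples the two domains through a learned latent space.)

```python
import numpy as np

def pairwise_hinge_loss(w, pairs):
    """Convex upper bound on the number of mis-ranked pairs.

    pairs: iterable of (x_hi, x_lo) numpy feature vectors, where x_hi
    should be ranked above x_lo.
    """
    return sum(max(0.0, 1.0 - w @ (hi - lo)) for hi, lo in pairs)

def two_domain_objective(w, source_pairs, target_pairs, C=10.0, lam=0.1):
    """Illustrative objective: source loss + C * target loss + lambda * penalty."""
    return (pairwise_hinge_loss(w, source_pairs)
            + C * pairwise_hinge_loss(w, target_pairs)
            + lam * float(w @ w))
```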

  10. The Algorithm • Alternately optimize the matrices M and D • Learning in the latent space: O(2T·sN log N) • Constructing the transformation matrix: O(d^3) • Learning the weight vector of the target domain: O(sN log N) • Apply the learnt weight vector to predict • Overall complexity: O((2T+1)·sN log N + d^3), where d is the number of features, N is the number of instance pairs for training, and s is the number of non-zero features • (An illustration of the O(d^3) step follows this slide)
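(The O(d^3) term comes from constructing the transformation into the latent space. Below is a small, hypothetical illustration of such a step using an eigendecomposition of a learned d x d matrix; the actual construction of M and D in HCDRank is more involved than this sketch.)

```python
import numpy as np

def build_latent_transform(M, k):
    """Hypothetical O(d^3) step: given a learned d x d matrix M, take the
    top-k eigenvectors as a d x k transformation into the latent space."""
    sym = (M + M.T) / 2.0                 # symmetrize for numerical stability
    vals, vecs = np.linalg.eigh(sym)      # O(d^3) eigendecomposition
    top = np.argsort(vals)[::-1][:k]      # indices of the k largest eigenvalues
    return vecs[:, top]                   # columns span the latent space

# Example: project 44-dimensional features (as in TREC2003_TR)
# to a 10-dimensional latent space.
M = np.random.rand(44, 44)
theta = build_latent_transform(M, k=10)
print(theta.shape)                        # (44, 10)
```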

  11. Outline • Related Work • Heterogeneous cross domain ranking • Experiments • Ranking on Homogeneous data • Ranking on Heterogeneous data • Ranking on Heterogeneous tasks • Conclusion

  12. Experiments • Data sets • Homogeneous data set: LETOR_TR • 50/75/106 queries with 44/44/25 features for TREC2003_TR, TREC2004_TR, and OHSUMED_TR • Heterogeneous academic data set: ArnetMiner.org • 14,134 authors, 10,716 papers, and 1,434 conferences • Heterogeneous task data set: 9 queries, 900 experts, 450 best supervisor candidates • Evaluation measures (sketched below) • P@n: precision at n • MAP: mean average precision • NDCG: normalized discounted cumulative gain
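(For reference, one common way to compute the three measures is sketched below; the exact gain and discount conventions used in the LETOR evaluation scripts may differ slightly.)

```python
import numpy as np

def precision_at_n(rels, n):
    """P@n: fraction of the top-n results that are relevant.

    rels is a list of 0/1 relevance labels in ranked order."""
    return sum(rels[:n]) / float(n)

def average_precision(rels):
    """AP for one query; MAP is the mean of AP over all queries."""
    hits, score = 0, 0.0
    for i, r in enumerate(rels, start=1):
        if r:
            hits += 1
            score += hits / float(i)
    return score / max(hits, 1)

def ndcg_at_n(gains, n):
    """NDCG@n with graded relevance gains in ranked order."""
    gains = np.asarray(gains, dtype=float)
    discounts = 1.0 / np.log2(np.arange(2, n + 2))
    m = len(gains[:n])
    dcg = np.sum((2 ** gains[:n] - 1) * discounts[:m])
    ideal = np.sort(gains)[::-1]
    idcg = np.sum((2 ** ideal[:n] - 1) * discounts[:m])
    return dcg / idcg if idcg > 0 else 0.0
```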

  13. Ranking on Homogeneous data • LETOR_TR • We made a slight revision of LETOR 2.0 to fit into the cross-domain ranking scenario • Three sub-datasets: TREC2003_TR, TREC2004_TR, and OHSUMED_TR • Baselines

  14. [Result charts on the three sub-datasets: TREC2003_TR (cosine similarity between the domains = 0.01), TREC2004_TR (cosine similarity = 0.23), and OHSUMED_TR (cosine similarity = 0.18)]
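(The slides do not say exactly how the cross-domain cosine similarity is computed; one plausible reading, purely an assumption here, is the cosine between the mean feature vectors of the two domains.)

```python
import numpy as np

def domain_cosine_similarity(source_X, target_X):
    """Cosine similarity between the mean feature vectors of two domains
    (one possible interpretation of the numbers on this slide)."""
    mu_s = np.asarray(source_X, dtype=float).mean(axis=0)
    mu_t = np.asarray(target_X, dtype=float).mean(axis=0)
    denom = np.linalg.norm(mu_s) * np.linalg.norm(mu_t)
    return float(mu_s @ mu_t / denom) if denom > 0 else 0.0
```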

  15. Observations • Ranking accuracy: HCDRank is +5.6% to +6.1% better in terms of MAP • Effect of domain difference: when the cosine similarity between the two domains is high (TREC2004), simply combining the two domains already yields better ranking performance • Training time: next slide

  16. Training Time • BUT: HCDRank can easily be parallelized • And the training process only needs to be run once on a data set

  17. Ranking on Heterogeneous data • ArnetMiner data set (www.arnetminer.org): 14,134 authors, 10,716 papers, and 1,434 conferences • Training and test data set: • 44 most frequently queried keywords from the log file • Author collection: Libra, Rexa, and ArnetMiner • Conference collection: Libra, ArnetMiner • Ground truth: • Conference: online resources • Expert: two faculty members and five graduate students from CS provided human judgments for expert ranking

  18. Feature Definition 16 features for a conference, 17 features for an expert

  19. Expert Finding Results

  20. Observations • Ranking accuracy: HCDRank outperforms the baselines, especially the two unsupervised systems • Feature analysis (next slide): the final weight vector exploits information from both domains and adjusts the weights learned from single-domain data • Training time: next slide

  21. Feature Correlation Analysis

  22. Ranking on Heterogeneous tasks • Expert finding task vs. best supervisor finding task • Training and test data set: • Expert finding task: ranking lists from ArnetMiner or annotated lists • Best supervisor finding task: 9 most frequent queries from the ArnetMiner log file • For each query, we collected 50 best supervisor candidates and sent emails to 100 researchers for annotation • Ground truth: collected feedback about the candidates (yes / no / not sure)

  23. Best supervisor finding: training/test set and ground truth • 724 emails sent (a fragment of the email is shown on the slide) • More than 82 effective feedbacks received (still increasing) • Each candidate is rated by the definite feedbacks (yes/no)

  24. Feature Definition

  25. Best supervisor finding results

  26. Outline • Related Work • Heterogeneous cross domain ranking • Experiments • Conclusion

  27. Conclusion • We formally define the problem of heterogeneous cross domain ranking and propose a general framework • We provide a preferred solution under the regularized framework by simultaneously minimizing two ranking loss functions in the two domains • The experimental results on three different genres of data sets verified the effectiveness of the proposed algorithm
