Heterogeneous Cross Domain Ranking in Latent Space
Bo Wang¹, Jie Tang², Wei Fan³, Songcan Chen¹, Zi Yang², Yanzhu Liu⁴
¹Nanjing University of Aeronautics and Astronautics ²Tsinghua University ³IBM T.J. Watson Research Center, USA ⁴Peking University
Introduction
• The web is becoming more and more heterogeneous
• Ranking is a fundamental problem on the web
• unsupervised vs. supervised
• homogeneous vs. heterogeneous
Motivation
Main challenges of heterogeneous cross-domain ranking:
1) How to capture the correlation between heterogeneous objects?
2) How to preserve the preference orders between objects across heterogeneous domains?
Outline • Related Work • Heterogeneous cross domain ranking • Experiments • Conclusion
Related Work • Learning to rank • Supervised: [Burges, 05] [Herbrich, 00] [Xu and Li, 07] [Yue, 07] • Semi-supervised: [Duh, 08] [Amini, 08] [Hoi and Jin, 08] • Ranking adaptation: [Chen, 08] • Transfer learning • Instance-based: [Dai, 07] [Gao, 08] • Feature-based: [Jebara, 04] [Argyriou, 06] [Raina, 07] [Lee, 07] [Blitzer, 06] [Blitzer, 07] • Model-based: [Bonilla, 08]
Outline • Related Work • Heterogeneous cross domain ranking • Basic idea • Proposed algorithm: HCDRank • Experiments • Conclusion
[Figure: example for the query “data mining” — conferences (source domain) and experts (target domain) are mapped into a shared latent space, where mis-ranked pairs are marked in each domain. The target domain might be empty, i.e., contain no labelled data.]
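To make the latent-space idea concrete, here is a minimal numpy sketch (an illustration, not the paper's implementation): each domain gets its own projection into a shared latent space, where a single weight vector can score objects from either domain. The matrix names, dimensions, and random values are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

d_src, d_tgt, k = 16, 17, 5   # feature dims (conference/expert) and latent dim

# Hypothetical per-domain projection matrices (learned in practice).
M_src = rng.normal(size=(k, d_src))
M_tgt = rng.normal(size=(k, d_tgt))

# A single ranking weight vector lives in the shared latent space.
w = rng.normal(size=k)

def score(x, M):
    """Project a raw feature vector into the latent space and score it."""
    return float(w @ (M @ x))

conference = rng.normal(size=d_src)   # a source-domain object
expert = rng.normal(size=d_tgt)       # a target-domain object
print(score(conference, M_src), score(expert, M_tgt))
```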
Learning Task
• In the HCD ranking problem, the transfer ranking task can be defined as:
• Given a limited number of labeled examples L_T and a large number of unlabeled examples S from the target domain, together with sufficient labeled data L_S from the source domain, the goal is to learn a ranking function f_T^* for predicting the rank levels of the unlabeled data in the target domain.
• Key issues:
• The feature distributions, or even the feature spaces, differ across domains
• The number of rank levels may differ
• The number of labeled training examples is very unbalanced (thousands in the source domain vs. a few in the target domain)
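In symbols, the task can be restated as follows (the superscripts and set sizes are added here for clarity and follow the slide's notation, not necessarily the paper's exact formulation):

```latex
% Inputs: labeled source data, a few labeled target examples, unlabeled target data
L_S = \{(\mathbf{x}_i^S, y_i^S)\}_{i=1}^{n_S}, \qquad
L_T = \{(\mathbf{x}_j^T, y_j^T)\}_{j=1}^{n_T} \;\;(n_T \ll n_S,\ \text{possibly } n_T = 0), \qquad
S = \{\mathbf{x}_k^T\}_{k=1}^{m}

% Goal: a target-domain ranking function
f_T^{*} : \mathcal{X}_T \to \mathbb{R}, \qquad
f_T^{*}(\mathbf{x}) > f_T^{*}(\mathbf{x}') \;\Leftrightarrow\; \mathbf{x} \succ \mathbf{x}'
```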
The Proposed Algorithm — HCDRank
• How to define? Loss function: the number of mis-ranked pairs in each domain (non-convex)
• Objective: loss function in the source domain + loss function in the target domain + penalty:
min_f  L_S(f) + C · L_T(f) + λ · Ω(f)
• C: cost-sensitive parameter which deals with the imbalance of labeled data between the domains
• λ: balances the empirical loss and the penalty
• How to optimize? The non-convex primal problem is unsolvable directly; it is optimized through its dual problem
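As an illustration of such an objective, the sketch below uses the standard convex hinge-loss upper bound on the number of mis-ranked pairs; the surrogate, the variable names, and the assumption that both domains are already mapped into a common space are simplifications, not the paper's exact formulation.

```python
import numpy as np

def pairwise_hinge_loss(w, pairs):
    """Convex upper bound on the number of mis-ranked pairs.
    pairs: iterable of (x_hi, x_lo) where x_hi should rank above x_lo."""
    return sum(max(0.0, 1.0 - float(w @ (x_hi - x_lo))) for x_hi, x_lo in pairs)

def hcd_objective(w, source_pairs, target_pairs, C, lam):
    # source loss + cost-weighted target loss + L2 penalty:
    # C compensates for the label imbalance between the domains,
    # lam trades the empirical loss against the penalty.
    return (pairwise_hinge_loss(w, source_pairs)
            + C * pairwise_hinge_loss(w, target_pairs)
            + lam * float(w @ w))

rng = np.random.default_rng(0)
pairs = [(rng.normal(size=5), rng.normal(size=5)) for _ in range(4)]
print(hcd_objective(rng.normal(size=5), pairs, pairs[:1], C=10.0, lam=0.1))
```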
The algorithm proceeds in four steps (d: number of features, N: number of instance pairs for training, s: number of non-zero features, T: number of iterations):
1) Alternately optimize the matrices M and D — O(2T · sN log N)
2) Construct the transformation matrix — O(d³)
3) Learn the weight vector of the target domain in the latent space — O(sN log N)
4) Apply the learnt weight vector to predict
Overall complexity: O((2T + 1) · sN log N + d³)
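An illustrative control-flow skeleton of these four steps (the inner updates are trivial stubs standing in for the paper's sub-problem solvers; all names, shapes, and placeholder computations are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# --- placeholder inner solvers: stand-ins for the real sub-problems ---
def update_M(M, D):                         # step 1a, stub update for M
    return 0.5 * (M + M.T)

def update_D(M, D):                         # step 1b, stub update for D
    return D

def hcdrank_train(d, k, T=10):
    """Control-flow skeleton of HCDRank training; the real updates are
    the paper's sub-problems, replaced here by trivial stubs."""
    M = rng.normal(size=(d, d))
    D = np.eye(d)
    for _ in range(T):                      # 1) alternate M and D: O(2T * sN log N)
        M = update_M(M, D)
        D = update_D(M, D)
    _, vecs = np.linalg.eigh(M + D)         # 2) transformation matrix: O(d^3)
    P = vecs[:, -k:].T                      #    keep the top-k latent directions
    w_T = rng.normal(size=k)                # 3) learn target weights: O(sN log N), stub
    return P, w_T

def hcdrank_predict(P, w_T, X):             # 4) apply the learnt weights to predict
    return (X @ P.T) @ w_T

P, w_T = hcdrank_train(d=17, k=5)
scores = hcdrank_predict(P, w_T, rng.normal(size=(8, 17)))
```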
Outline • Related Work • Heterogeneous cross domain ranking • Experiments • Ranking on Homogeneous data • Ranking on Heterogeneous data • Ranking on Heterogeneous tasks • Conclusion
Experiments
• Data sets
• Homogeneous data set: LETOR_TR
• 50/75/106 queries with 44/44/25 features for TREC2003_TR, TREC2004_TR and OHSUMED_TR
• Heterogeneous academic data set: ArnetMiner.org
• 14,134 authors, 10,716 papers, and 1,434 conferences
• Heterogeneous task data set:
• 9 queries, 900 experts, 450 best supervisor candidates
• Evaluation measures
• P@n: precision at n
• MAP: mean average precision
• NDCG: normalized discounted cumulative gain
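For reference, minimal implementations of the three measures for a single ranked list (standard textbook definitions, not code from the paper; binary relevance for P@n and AP, graded relevance for NDCG):

```python
import numpy as np

def precision_at_n(rels, n):
    """P@n for a ranked list of binary relevance labels (1 = relevant)."""
    return sum(rels[:n]) / n

def average_precision(rels):
    """AP for one query; MAP is the mean of AP over all queries."""
    hits, total = 0, 0.0
    for i, r in enumerate(rels, start=1):
        if r:
            hits += 1
            total += hits / i
    return total / max(hits, 1)

def ndcg_at_n(gains, n):
    """NDCG@n for graded relevance labels (higher gain = more relevant)."""
    def dcg(g):
        return sum((2**v - 1) / np.log2(i + 2) for i, v in enumerate(g[:n]))
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0

rels = [1, 0, 1, 1, 0]          # toy ranked list
print(precision_at_n(rels, 3), average_precision(rels), ndcg_at_n(rels, 3))
```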
Ranking on Homogeneous data
• LETOR_TR
• We made a slight revision of LETOR 2.0 to fit the cross-domain ranking scenario
• three sub-datasets: TREC2003_TR, TREC2004_TR, and OHSUMED_TR
• Baselines
[Figures: ranking results on TREC2003_TR (cross-domain cosine similarity = 0.01), TREC2004_TR (cosine similarity = 0.23), and OHSUMED_TR (cosine similarity = 0.18)]
Observations
• Ranking accuracy: HCDRank is +5.6% to +6.1% better in terms of MAP
• Effect of domain difference: when the cosine similarity between domains is high (TREC2004), simply combining the two domains already results in better ranking performance
• Training time: next slide
Training Time
• BUT: HCDRank can easily be parallelized
• And the training process only needs to be run once on a data set
Ranking on Heterogeneous data
• ArnetMiner data set (www.arnetminer.org): 14,134 authors, 10,716 papers, and 1,434 conferences
• Training and test data set:
• the 44 most frequently queried keywords from the log file
• Author collection: Libra, Rexa and ArnetMiner
• Conference collection: Libra, ArnetMiner
• Ground truth:
• Conference: online resources
• Expert: two faculty members and five graduate students from CS provided human judgments for expert ranking
Feature Definition
• 16 features for a conference, 17 features for an expert
Observations
• Ranking accuracy: HCDRank outperforms the baselines, especially the two unsupervised systems
• Feature analysis (next slide): the final weight vector exploits information from both domains and adjusts the weights learned from single-domain data
• Training time: next slide
Ranking on Heterogeneous tasks
• Expert finding task vs. best supervisor finding task
• Training and test data set:
• expert finding task: ranking lists from ArnetMiner or annotated lists
• best supervisor finding task: the 9 most frequent queries from the ArnetMiner log file
• For each query, we collected 50 best supervisor candidates and sent emails to 100 researchers for annotation
• Ground truth:
• collection of feedbacks about the candidates (yes / no / not sure)
Best supervisor finding
• Training/test set and ground truth:
• 724 mails sent [figure: fragment of the mail]
• more than 82 feedbacks in effect (and increasing)
• each candidate is rated by the definite feedbacks (yes/no)
Outline • Related Work • Heterogeneous cross domain ranking • Experiments • Conclusion
Conclusion
• We formally define the problem of heterogeneous cross-domain ranking and propose a general framework
• We provide a solution under the regularized framework by simultaneously minimizing two ranking loss functions in the two domains
• The experimental results on three different genres of data sets verified the effectiveness of the proposed algorithm