Heterogeneous Cross Domain Ranking in Latent Space
Bo Wang, joint work with Jie Tang, Wei Fan and Songcan Chen
Ranking over Web 2.0
• Traditional Web: standard (long) documents
  • Relevance measures such as the BM25 and PageRank scores may play a key role
• Web 2.0: shorter, non-standard documents
  • Users' click-through data and comments might be much more important
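Since this slide contrasts BM25-style relevance with Web 2.0 signals, here is a minimal sketch of the standard Okapi BM25 score for one query-document pair. The toy corpus and the k1, b defaults are illustrative assumptions, not part of the original deck.

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Okapi BM25 score of one document for a query (k1, b are common defaults)."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)          # document frequency
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)   # smoothed IDF
        f = tf[term]
        denom = f + k1 * (1 - b + b * len(doc_terms) / avgdl)
        score += idf * f * (k1 + 1) / denom
    return score

# toy corpus: a list of tokenized "documents"
corpus = [["ranking", "svm", "pairwise"],
          ["web", "click", "data"],
          ["latent", "space", "ranking"]]
print(bm25_score(["ranking", "latent"], corpus[2], corpus))
```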
Heterogeneous Transfer Ranking
• If there is not sufficient supervision in the domain of interest, how can one borrow labeled information from a related but heterogeneous domain to build an accurate model?
• Differences from transfer learning:
  • What to transfer: the instance type
  • What we care about: feature extraction
Main Challenges
• How to formalize the problem in a unified framework, when both the feature distributions and the object types may differ between the source and target domains?
• How to transfer knowledge about heterogeneous objects across domains?
• How to preserve the preference relationships between instances across heterogeneous data sources?
Outline
• Motivation
• Problem Formulation
• Transfer Ranking
  • Basic Idea
  • The proposed algorithm
  • Generalization bound
• Experiment
  • Ranking on homogeneous data
  • Ranking on heterogeneous data
• Conclusion
Problem Formulation
• Source domain:
  • Instance space: $X^s = \{x_i^s\}_{i=1}^{n_1}$
  • Rank level set: $Y^s = \{r_1, \ldots, r_K\}$, where $r_K \succ r_{K-1} \succ \cdots \succ r_1$
• Target domain: $X^t = \{x_i^t\}_{i=1}^{n_2}$ and $Y^t$, defined analogously
• The two domains are heterogeneous but related
• Problem definition: given $(X^s, Y^s)$ and $(X^t, Y^t)$, the goal is to learn a ranking function $f$ for predicting the rank levels of the target-domain test set
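As a concrete reading of this formulation, the sketch below represents each domain as feature vectors plus ordinal rank levels. The names, sizes, and the integer encoding of rank levels are illustrative assumptions.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class RankingDomain:
    """Labeled instances of one domain: features plus ordinal rank levels."""
    X: np.ndarray  # (n, d) features; d and the object type may differ across domains
    y: np.ndarray  # (n,) rank levels r_1 ... r_K, encoded here as integers 0..K-1

# heterogeneous but related: different feature dimensionalities (sizes are illustrative)
rng = np.random.default_rng(0)
source = RankingDomain(X=rng.normal(size=(100, 20)), y=rng.integers(0, 3, size=100))
target = RankingDomain(X=rng.normal(size=(30, 15)), y=rng.integers(0, 3, size=30))
```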
Outline
• Motivation
• Problem Formulation
• Transfer Ranking
  • Basic Idea
  • The proposed algorithm
  • Generalization bound
• Experiment
  • Ranking on homogeneous data
  • Ranking on heterogeneous data
• Conclusion
Basic Idea
• Because the feature distributions, or even the object types, may differ across domains, we look for a common latent space in which the preference relationships of both the source and target domains are preserved
• A ranking loss function can directly evaluate how well the preferences are preserved in that latent space
• The two ranking loss functions, one per domain, are optimized simultaneously to find the best latent space
The Proposed Algorithm
• Given the labeled data in the source domain, $\{(x_i^s, y_i^s)\}_{i=1}^{n_1}$, we aim to learn a ranking function $f(x) = \langle w, x \rangle$ which satisfies $f(x_i) > f(x_j)$ whenever $y_i \succ y_j$
• The ranking loss function can be defined as the pairwise hinge loss $L(w) = \sum_{y_i \succ y_j} [\,1 - \langle w, x_i - x_j \rangle\,]_+$
• The latent space can be described by a shared matrix $D \in \mathbb{R}^{d \times d}$ coupling the domain-specific ranking vectors, collected as the columns of $W = [\,w_s, w_t\,]$
• The framework: $\min_{W, D}\; L_s(w_s) + \lambda L_t(w_t) + \mu\,\mathrm{tr}(W^\top D^{-1} W)$, solved by alternating between fitting the two rankers and updating $D$ (a runnable sketch follows)
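Since the slide's original formulas did not survive extraction, here is a minimal runnable sketch of the alternating scheme reconstructed above: each round fits one ranking vector per domain under the current latent matrix D (using a plain subgradient solver rather than the authors' Ranking SVM inner solver), then updates D in closed form from the stacked W. Function names, step sizes, iteration counts, and the omitted domain trade-off λ are illustrative assumptions, not the exact Tr2SVM implementation.

```python
import numpy as np

def pref_pairs(y):
    """Index pairs (i, j) with y[i] > y[j], i.e. i should be ranked above j."""
    return [(i, j) for i in range(len(y)) for j in range(len(y)) if y[i] > y[j]]

def fit_domain_ranker(X, pairs, D_inv, mu, lr=1e-3, steps=200):
    """Subgradient descent on the pairwise hinge loss plus mu * w^T D^{-1} w."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        g = 2 * mu * (D_inv @ w)          # gradient of the quadratic regularizer
        for i, j in pairs:
            d = X[i] - X[j]
            if 1.0 - w @ d > 0:           # violated preference pair
                g -= d
        w -= lr * g
    return w

def transfer_rank(Xs, ys, Xt, yt, mu=0.1, T=10, eps=1e-6):
    """Alternate between domain rankers w_s, w_t and the shared latent matrix D."""
    d = Xs.shape[1]                       # features already mapped to a common d-dim space
    D = np.eye(d) / d                     # start from the uninformative latent space
    ps, pt = pref_pairs(ys), pref_pairs(yt)
    for _ in range(T):
        D_inv = np.linalg.inv(D + eps * np.eye(d))
        ws = fit_domain_ranker(Xs, ps, D_inv, mu)
        wt = fit_domain_ranker(Xt, pt, D_inv, mu)
        W = np.stack([ws, wt], axis=1)    # d x 2, as on the next slide
        # closed-form update: D = (W W^T)^{1/2} / tr((W W^T)^{1/2})
        vals, vecs = np.linalg.eigh(W @ W.T)
        root = (vecs * np.sqrt(np.clip(vals, 0, None))) @ vecs.T
        D = root / max(np.trace(root), eps)
    return ws, wt, D
```

A target instance x is then scored as wt @ x. Each of the T rounds solves two ranking problems, which is consistent with the O((2T + 1)(n1 + n2)^3) training cost quoted on the next slide.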
Scalability and Complexity
• Scalability: let d be the total number of distinct features in the two domains; then the matrix D is d×d and W is d×2, so the algorithm can be applied to very large-scale data as long as the number of features is not too large
• Complexity: Ranking SVM training takes O((n1 + n2)^3) time and O((n1 + n2)^2) space; in our algorithm Tr2SVM, with T the maximum number of iterations, training takes O((2T + 1)(n1 + n2)^3) time and O((n1 + n2)^2) space
Outline
• Motivation
• Problem Formulation
• Transfer Ranking
  • Basic Idea
  • The proposed algorithm
  • Generalization bound
• Experiment
  • Ranking on homogeneous data
  • Ranking on heterogeneous data
• Conclusion
Data Set
• LETOR 2.0
  • Three sub-datasets: TREC2003, TREC2004, and OHSUMED
  • Collections of query-document pairs
  • TREC data: a topic distillation task, which aims to find good entry points principally devoted to a given topic
  • OHSUMED data: a collection of records from medical journals
• LETOR_TR
  • Three sub-datasets: TREC2003_TR, TREC2004_TR, and OHSUMED_TR
Experiment Setting
• Baselines:
• Measures: MAP (mean average precision) and NDCG (normalized discounted cumulative gain), sketched below
• Three transfer ranking tasks:
  • From S1 to T1
  • From S2 to T2
  • From S3 to T3
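For reference, here is a minimal per-query sketch of the two measures named above, following their standard definitions: binary relevance for MAP, and graded relevance with the 2^rel - 1 gain commonly used with LETOR for NDCG. It assumes the ranked list contains all judged documents for the query.

```python
import numpy as np

def average_precision(ranked_rels):
    """AP for one query; ranked_rels: binary relevance in predicted rank order."""
    hits, total = 0, 0.0
    for pos, rel in enumerate(ranked_rels, start=1):
        if rel:
            hits += 1
            total += hits / pos           # precision at each relevant position
    return total / max(hits, 1)

def ndcg_at_k(ranked_gains, k):
    """NDCG@k for one query; ranked_gains: graded relevance in predicted rank order."""
    def dcg(gains):
        return sum((2**g - 1) / np.log2(pos + 1)
                   for pos, g in enumerate(gains[:k], start=1))
    ideal = dcg(sorted(ranked_gains, reverse=True))
    return dcg(ranked_gains) / ideal if ideal > 0 else 0.0

print(average_precision([1, 0, 1, 0]))    # 0.5 * (1/1 + 2/3)
print(ndcg_at_k([2, 0, 1], k=3))
```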
Why Effective?
• Why is transfer ranking effective on the LETOR_TR dataset? Because the features used for ranking already contain relevance information between queries and documents.
Outline
• Motivation
• Problem Formulation
• Transfer Ranking
  • Basic Idea
  • The proposed algorithm
  • Generalization bound
• Experiment
  • Ranking on homogeneous data
  • Ranking on heterogeneous data
• Conclusion
Data Set
• A subset of ArnetMiner: 14,134 authors, 10,716 papers, and 1,434 conferences
• The 8 most frequent queries from the log file:
  • 'information extraction', 'machine learning', 'semantic web', 'natural language processing', 'support vector machine', 'planning', 'intelligent agents' and 'ontology alignment'
• Author collection: for each query, we gathered authors from Libra, Rexa and ArnetMiner
• Conference collection: for each query, we gathered conferences from Libra and ArnetMiner
• Evaluation: one faculty member and two graduate students judged the relevance between queries and authors/conferences
Feature Definition
• All features are defined between queries and virtual documents (see the sketch below)
• Conference: use all the paper titles published at a conference to form the conference's "document"
• Author: use all the paper titles authored by an expert as the expert's "document"
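A small sketch of the virtual-document construction described above; the paper records and names are hypothetical stand-ins for the ArnetMiner data.

```python
from collections import defaultdict

# hypothetical records: (title, authors, conference)
papers = [
    ("Learning to Rank with SVMs", ["J. Smith", "A. Lee"], "SIGIR"),
    ("Latent Space Transfer Ranking", ["A. Lee"], "CIKM"),
]

def virtual_documents(papers, key):
    """Concatenate paper titles into one 'document' per author or per conference."""
    docs = defaultdict(list)
    for title, authors, venue in papers:
        if key == "author":
            for a in authors:
                docs[a].append(title)
        else:  # key == "conference"
            docs[venue].append(title)
    return {name: " ".join(titles) for name, titles in docs.items()}

print(virtual_documents(papers, "author")["A. Lee"])
# "Learning to Rank with SVMs Latent Space Transfer Ranking"
```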
Why Effective?
• Why can our approach be effective on the heterogeneous network? Because of the latent dependencies between objects, some common features can still be extracted across domains.
Conclusion
• We formally define the transfer ranking problem and propose a general framework for it
• We provide a preferred solution under the regularized framework by simultaneously minimizing two ranking loss functions in the two domains, and derive a generalization bound
• Experimental results on LETOR and a heterogeneous academic network verify the effectiveness of the proposed algorithm
Future Work
• Develop new algorithms under the framework
• Reduce the time complexity for online usage
• Negative transfer
  • Similarity between queries
  • Actively select similar queries
Thanks! Your Question. Our Passion.