NUS April 12, 2006 Statistical Learning Methods for Information Retrieval Hang Li Microsoft Research Asia
Talk Outline • Expert Search: Two-Stage Model • Relevance Ranking: Ranking SVM for IR
Two-Stage Model for Expert Search • Yunbo Cao, Jingjing Liu, Shenghua Bao, Hang Li, Nick Craswell
Expert Search • Who knows about X? [Diagram: a query is matched against people rather than documents]
Expert Search -- Example • Query: Who knows about digital ink? [Screenshot: ranked list of persons returned for the query]
Related Work • Profile-based approach [Craswell]: co-occurrences between keywords and personal names
Two-Stage Model for Expert Search • Rank people using two probability models • Relevance model • Co-occurrence model • p(e|q) ∝ Σ_d p(e|d,q) · p(d) · p(q|d) (co-occurrence · prior · relevance)
Two-Stage Model for Expert Search • [Diagram: query q is matched against documents d1, d2, d3, which in turn link to experts e1, e2]
Two-Stage Model • Document Relevance Model: language model • Co-occurrence Model: mixture of sub-models
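To make the decomposition concrete, here is a minimal Python sketch of the two-stage scoring idea. It assumes two hypothetical callables, relevance(q, d) for p(q|d) from the document relevance model and cooccurrence(e, d, q) for p(e|d,q) from the co-occurrence model, plus a uniform document prior; it illustrates the factorization only, not the authors' implementation.

```python
# Hedged sketch of two-stage expert scoring (illustration, not the authors' code).
def expert_score(query, experts, documents, relevance, cooccurrence):
    """Score each expert e by sum_d p(q|d) * p(d) * p(e|d,q)."""
    prior = 1.0 / len(documents)          # uniform document prior p(d): an assumption
    scores = {}
    for e in experts:
        scores[e] = sum(
            relevance(query, d) * prior * cooccurrence(e, d, query)
            for d in documents
        )
    # Return experts ranked by score, best first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```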
Document Relevance Model • Who knows about timed text?
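As a rough illustration of a document relevance model based on a language model, the sketch below scores log p(q|d) by query likelihood with Jelinek-Mercer smoothing. The smoothing method, the parameter lam, and the input format are assumptions made for illustration; the slides do not specify them.

```python
import math
from collections import Counter

def lm_score(query_terms, doc_terms, collection_tf, collection_len, lam=0.8):
    """Query-likelihood log p(q|d) with Jelinek-Mercer smoothing.
    lam mixes the document model with the collection model (assumed value)."""
    tf = Counter(doc_terms)
    dlen = len(doc_terms)
    score = 0.0
    for t in query_terms:
        p_doc = tf[t] / dlen if dlen else 0.0
        p_col = collection_tf.get(t, 0) / collection_len
        p = lam * p_doc + (1 - lam) * p_col
        score += math.log(p) if p > 0 else math.log(1e-12)  # floor unseen terms
    return score
```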
Window-based Sub-model • [Diagram: a text window around the query; a person name inside the window counts as a relevant co-occurrence, one outside the window as irrelevant]
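A minimal sketch of how a window-based co-occurrence count might be computed, assuming tokenized documents and a single-token person identifier; the window size and the simple counting scheme are illustrative assumptions, not the paper's exact definition.

```python
def window_cooccurrence(person, query_terms, doc_tokens, window=100):
    """Count occurrences of `person` within `window` tokens of any query term."""
    query_positions = [i for i, tok in enumerate(doc_tokens) if tok in query_terms]
    person_positions = [i for i, tok in enumerate(doc_tokens) if tok == person]
    return sum(
        1
        for p in person_positions
        if any(abs(p - q) <= window for q in query_positions)
    )
```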
Title-Author Sub-model • [Diagram: the query matches a document title and the document's author is counted as a co-occurring expert]
Block-based Sub-model • Co-occurrences are counted within the tree structure of sections (tags <H1> <H2> <H3> <H4> <H5> <H6>) • [Example: the query "W3C Management Team" appears in an <H1> heading and person names appear in the nested <H2> blocks]
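The block-based idea can be sketched as follows, assuming the document has already been segmented into a tree of heading blocks (<H1> down to <H6>). The Block class and the matching rule (query matches a heading, persons counted in that block's subtree) are illustrative assumptions.

```python
class Block:
    """One heading block; children are sub-blocks at deeper heading levels."""
    def __init__(self, heading, text="", children=None):
        self.heading = heading          # text of an <H1>..<H6> tag
        self.text = text                # body text directly under this heading
        self.children = children or []  # nested sub-blocks

def block_cooccurrence(person, query, block):
    """Count person mentions inside the subtree of any block whose heading
    contains the query (e.g. "W3C Management Team" in an <H1>)."""
    def mentions(b):
        return b.text.count(person) + sum(mentions(c) for c in b.children)
    if query.lower() in block.heading.lower():
        return mentions(block)
    return sum(block_cooccurrence(person, query, c) for c in block.children)
```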
Neighbor-based Sub-model • [Diagram: the query with neighboring person names; nearby persons count as relevant co-occurrences, distant ones as irrelevant]
Cluster-based Sub-Model • People that often co-occur share the same expertise areas • Cluster people and then use a cluster-based model
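One way to realize the cluster-based idea is sketched below: cluster people by their document co-occurrence vectors and smooth each person's score toward the mean score of their cluster. The use of k-means, the number of clusters, and the smoothing weight alpha are assumptions for illustration, not the authors' formulation.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_smoothed_scores(person_doc_matrix, base_scores, n_clusters=10, alpha=0.7):
    """Cluster people by co-occurrence vectors, then smooth each person's score
    toward the mean of their cluster (n_clusters and alpha are assumed values)."""
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(person_doc_matrix)
    base = np.asarray(base_scores, dtype=float)
    smoothed = base.copy()
    for c in range(n_clusters):
        members = labels == c
        smoothed[members] = alpha * base[members] + (1 - alpha) * base[members].mean()
    return smoothed
```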
Expert Search -- Implementation • [System diagram: the query is fed to the co-occurrence model and the other components of the ranking pipeline]
TREC Expert Search • Document collection • A crawl of the W3C site (http://w3c.org) from June 2004 • 331,307 web pages • Ground truth • W3C working groups, with group names as query topics and group members as experts (10 training topics and 50 test topics)
Ranking SVM for IR Yunbo Cao, Jun Xu, Tie-Yan Liu, Hang Li, Yalou Huang, Hsiao-Wuen Hon
General Model for Ranking • [Diagram: a query (or question) and documents (information) are mapped to relevance scores used for ranking]
Learning to Rank (Herbrich et al., 2000; Burges et al., 2005) • [Diagram: query (or question) and documents mapped to relevance scores for ranking] • Multiple ranks (graded relevance labels) • Methods: Ranking SVM, RankNet
Evaluation Measures • MRR (Mean Reciprocal Rank) • WTA (Winners Take All) • MAP (Mean Average Precision) • NDCG (Normalized Discounted Cumulative Gain)
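For reference, minimal Python implementations of two of these measures, MRR and MAP, assuming each query is represented by a list of binary relevance judgments ordered by the system's ranking (NDCG is worked out on the next slide).

```python
def mean_reciprocal_rank(ranked_relevance):
    """ranked_relevance: per-query lists of 0/1 judgments, best rank first."""
    rr = []
    for rels in ranked_relevance:
        rr.append(next((1.0 / (i + 1) for i, r in enumerate(rels) if r), 0.0))
    return sum(rr) / len(rr)

def mean_average_precision(ranked_relevance):
    """Average precision per query, averaged over queries."""
    aps = []
    for rels in ranked_relevance:
        hits, precisions = 0, []
        for i, r in enumerate(rels):
            if r:
                hits += 1
                precisions.append(hits / (i + 1))
        aps.append(sum(precisions) / max(hits, 1))
    return sum(aps) / len(aps)
```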
NDCG • Query: a ranked list of documents with graded relevance labels l(r) at each rank r • DCG at position m: DCG(m) = Σ_{r=1..m} (2^l(r) − 1) / log2(1 + r), where 2^l(r) − 1 is the gain and 1/log2(1 + r) the position discount • NDCG at position m: DCG(m) divided by the DCG of the ideal ranking, averaged over queries • Example • Labels: (3, 3, 2, 2, 1, 1, 1) • Gains 2^l − 1: (7, 7, 3, 3, 1, 1, 1) • Discounts 1/log2(1 + r): (1, 0.63, 0.5, 0.43, 0.39, 0.36, 0.33) • Cumulative DCG: (7, 11.41, 12.91, 14.2, 14.59, 14.95, 15.28)
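The following sketch computes DCG and NDCG with the gain 2^l − 1 and discount 1/log2(1 + r) shown on the slide, and reproduces the example's cumulative DCG values up to rounding of the discounts.

```python
import math

def dcg_at(ratings, m):
    """DCG@m with gain 2^l - 1 and discount 1/log2(1 + r)."""
    return sum(
        (2 ** l - 1) / math.log2(1 + r)
        for r, l in enumerate(ratings[:m], start=1)
    )

def ndcg_at(ratings, m):
    """NDCG@m: DCG normalized by the DCG of the ideal (sorted) ranking."""
    ideal = dcg_at(sorted(ratings, reverse=True), m)
    return dcg_at(ratings, m) / ideal if ideal > 0 else 0.0

# Slide example: labels (3, 3, 2, 2, 1, 1, 1) give gains (7, 7, 3, 3, 1, 1, 1);
# cumulative DCG comes out as roughly (7.0, 11.42, 12.92, 14.21, 14.6, 14.95, 15.28),
# matching the slide's figures up to rounding of the discounts.
print([round(dcg_at([3, 3, 2, 2, 1, 1, 1], m), 2) for m in range(1, 8)])
```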
Ranking SVM • Given: training instances x with rank labels, inducing ordered pairs x_i ≻ x_j • We learn a function f such that f(x_i) > f(x_j) whenever x_i ≻ x_j • Consider a linear function f(x) = ⟨w, x⟩ • Transforming to classification: f(x_i) > f(x_j) ⇔ ⟨w, x_i − x_j⟩ > 0, so each ordered pair becomes a classification instance (x_i − x_j, +1)
Ranking SVM (cont'd) • Ranking model: f(x) = ⟨w, x⟩, with w learned by solving the SVM classification problem over the pair instances
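A minimal sketch of the pairwise transformation with a linear model trained on the difference vectors, using scikit-learn's LinearSVC as a stand-in for the SVM solver; the toy data, the mirrored-pair trick, and the choice of C are assumptions for illustration, not the authors' setup.

```python
import numpy as np
from itertools import combinations
from sklearn.svm import LinearSVC

def to_pairs(X, y):
    """Turn (feature vector, rank label) instances of one query into
    difference vectors x_i - x_j, labeled +1 if y_i > y_j and -1 otherwise."""
    diffs, labels = [], []
    for i, j in combinations(range(len(y)), 2):
        if y[i] == y[j]:
            continue
        sign = 1 if y[i] > y[j] else -1
        diffs.append(X[i] - X[j]); labels.append(sign)
        diffs.append(X[j] - X[i]); labels.append(-sign)  # mirrored pair keeps both classes
    return np.array(diffs), np.array(labels)

# Toy data: 4 documents for one query, 3 features each, labels = relevance grades.
X = np.array([[0.9, 0.2, 0.4], [0.7, 0.1, 0.6], [0.2, 0.8, 0.1], [0.1, 0.3, 0.2]])
y = np.array([2, 1, 1, 0])

P, s = to_pairs(X, y)
svm = LinearSVC(C=1.0).fit(P, s)      # learns the linear f(x) = <w, x>
scores = X @ svm.coef_.ravel()        # rank documents by f(x)
print(np.argsort(-scores))            # document indices, best first
```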
Direct Application of Ranking SVM to Document Retrieval • Each query-document pair is represented as a feature vector • Instance pairs from all queries are combined into one training set
Problems with Direct Application • Cost sensitivity: errors at the top of the ranking should cost more (d: definitely relevant, p: partially relevant, n: not relevant; ranking 1: p d p n n n n vs. ranking 2: d p n p n n n) • Query normalization: the number of instance pairs varies by query (q1: d p p n n n n gives 2 (d,p) + 4 (d,n) + 8 (p,n) = 14 pairs; q2: d d p p p n n n n n gives 6 (d,p) + 10 (d,n) + 15 (p,n) = 31 pairs)
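The query-normalization issue is easy to reproduce; the sketch below counts ordered instance pairs per query for the slide's two examples (14 pairs for q1 versus 31 for q2).

```python
from itertools import combinations
from collections import Counter

def count_pairs(labels):
    """Count rank-ordered instance pairs per query (with d > p > n), as on the slide."""
    order = {"d": 2, "p": 1, "n": 0}
    pairs = Counter()
    for a, b in combinations(labels, 2):
        if order[a] != order[b]:
            hi, lo = (a, b) if order[a] > order[b] else (b, a)
            pairs[(hi, lo)] += 1
    return pairs, sum(pairs.values())

print(count_pairs(list("dppnnnn")))     # q1: 2 (d,p), 4 (d,n), 8 (p,n) -> 14 pairs
print(count_pairs(list("ddpppnnnnn")))  # q2: 6 (d,p), 10 (d,n), 15 (p,n) -> 31 pairs
```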