NUS April 12, 2006 Statistical Learning Methods for Information Retrieval Hang Li Microsoft Research Asia
Talk Outline • Expert Search: Two-Stage Model • Relevance Ranking: Ranking SVM for IR
Two-Stage Model for Expert Search • Yunbo Cao, Jingjing Liu, Shenghua Bao, Hang Li, Nick Craswell
Expert Search • Who knows about X? [Diagram: a query is matched against people rather than documents]
Expert Search -- Example • Query: Who knows about digital ink? [Screenshot: ranked list of persons returned for the query]
Related Work • Profile-based approach [Craswell]: co-occurrences between keywords and personal names
Two-Stage Model for Expert Search • Rank people using two probability models • Relevance model • Co-occurrence model • p(e|q) ∝ Σ_d p(e|d,q) · p(d) · p(q|d) (co-occurrence · prior · relevance)
Two-Stage Model for Expert Search • [Diagram: query q is matched against documents d1, d2, d3, which in turn link to experts e1, e2]
Two-Stage Model • Document Relevance Model: language model • Co-occurrence Model: mixture of sub-models
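To make the decomposition concrete, here is a minimal Python sketch of the two-stage scoring idea. It assumes two hypothetical callables, relevance(q, d) for p(q|d) from the document relevance model and cooccurrence(e, d, q) for p(e|d,q) from the co-occurrence model, plus a uniform document prior; it illustrates the factorization only, not the authors' implementation.

```python
# Hedged sketch of two-stage expert scoring (illustration, not the authors' code).
def expert_score(query, experts, documents, relevance, cooccurrence):
    """Score each expert e by sum_d p(q|d) * p(d) * p(e|d,q)."""
    prior = 1.0 / len(documents)          # uniform document prior p(d): an assumption
    scores = {}
    for e in experts:
        scores[e] = sum(
            relevance(query, d) * prior * cooccurrence(e, d, query)
            for d in documents
        )
    # Return experts ranked by score, best first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```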
Document Relevance Model • Who knows about timed text?
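As a rough illustration of a document relevance model based on a language model, the sketch below scores log p(q|d) by query likelihood with Jelinek-Mercer smoothing. The smoothing method, the parameter lam, and the input format are assumptions made for illustration; the slides do not specify them.

```python
import math
from collections import Counter

def lm_score(query_terms, doc_terms, collection_tf, collection_len, lam=0.8):
    """Query-likelihood log p(q|d) with Jelinek-Mercer smoothing.
    lam mixes the document model with the collection model (assumed value)."""
    tf = Counter(doc_terms)
    dlen = len(doc_terms)
    score = 0.0
    for t in query_terms:
        p_doc = tf[t] / dlen if dlen else 0.0
        p_col = collection_tf.get(t, 0) / collection_len
        p = lam * p_doc + (1 - lam) * p_col
        score += math.log(p) if p > 0 else math.log(1e-12)  # floor unseen terms
    return score
```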
Window-based Sub-model • [Diagram: a text window around the query; a person name inside the window counts as a relevant co-occurrence, one outside the window as irrelevant]
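A minimal sketch of how a window-based co-occurrence count might be computed, assuming tokenized documents and a single-token person identifier; the window size and the simple counting scheme are illustrative assumptions, not the paper's exact definition.

```python
def window_cooccurrence(person, query_terms, doc_tokens, window=100):
    """Count occurrences of `person` within `window` tokens of any query term."""
    query_positions = [i for i, tok in enumerate(doc_tokens) if tok in query_terms]
    person_positions = [i for i, tok in enumerate(doc_tokens) if tok == person]
    return sum(
        1
        for p in person_positions
        if any(abs(p - q) <= window for q in query_positions)
    )
```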
Title-Author Sub-model • [Diagram: the query matches a document title and the document's author is counted as a co-occurring expert]
Block-based Sub-model • Co-occurrences are counted within the tree structure of sections (tags <H1> <H2> <H3> <H4> <H5> <H6>) • [Example: the query "W3C Management Team" appears in an <H1> heading and person names appear in the nested <H2> blocks]
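The block-based idea can be sketched as follows, assuming the document has already been segmented into a tree of heading blocks (<H1> down to <H6>). The Block class and the matching rule (query matches a heading, persons counted in that block's subtree) are illustrative assumptions.

```python
class Block:
    """One heading block; children are sub-blocks at deeper heading levels."""
    def __init__(self, heading, text="", children=None):
        self.heading = heading          # text of an <H1>..<H6> tag
        self.text = text                # body text directly under this heading
        self.children = children or []  # nested sub-blocks

def block_cooccurrence(person, query, block):
    """Count person mentions inside the subtree of any block whose heading
    contains the query (e.g. "W3C Management Team" in an <H1>)."""
    def mentions(b):
        return b.text.count(person) + sum(mentions(c) for c in b.children)
    if query.lower() in block.heading.lower():
        return mentions(block)
    return sum(block_cooccurrence(person, query, c) for c in block.children)
```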
Neighbor-based Sub-model • [Diagram: the query with neighboring person names; nearby persons count as relevant co-occurrences, distant ones as irrelevant]
Cluster-based Sub-Model • People that often co-occur share the same expertise areas • Cluster people and then use a cluster-based model
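One way to realize the cluster-based idea is sketched below: cluster people by their document co-occurrence vectors and smooth each person's score toward the mean score of their cluster. The use of k-means, the number of clusters, and the smoothing weight alpha are assumptions for illustration, not the authors' formulation.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_smoothed_scores(person_doc_matrix, base_scores, n_clusters=10, alpha=0.7):
    """Cluster people by co-occurrence vectors, then smooth each person's score
    toward the mean of their cluster (n_clusters and alpha are assumed values)."""
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(person_doc_matrix)
    base = np.asarray(base_scores, dtype=float)
    smoothed = base.copy()
    for c in range(n_clusters):
        members = labels == c
        smoothed[members] = alpha * base[members] + (1 - alpha) * base[members].mean()
    return smoothed
```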
Expert Search -- Implementation • [System diagram: the query is fed to the co-occurrence model and the other components of the ranking pipeline]
TREC Expert Search • Document collection • A crawl of the W3C site (http://w3c.org) from June 2004 • 331,307 web pages • Ground truth • W3C working groups, with group names as query topics and group members as experts (10 training topics and 50 test topics)
Ranking SVM for IR Yunbo Cao, Jun Xu, Tie-Yan Liu, Hang Li, Yalou Huang, Hsiao-Wuen Hon
General Model for Ranking • [Diagram: a query (or question) and documents (information) are mapped to relevance scores used for ranking]
Learning to Rank (Herbrich et al., 2000; Burges et al., 2005) • [Diagram: query (or question) and documents mapped to relevance scores for ranking] • Multiple ranks (graded relevance labels) • Methods: Ranking SVM, RankNet
Evaluation Measures • MRR (Mean Reciprocal Rank) • WTA (Winners Take All) • MAP (Mean Average Precision) • NDCG (Normalized Discounted Cumulative Gain)
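For reference, minimal Python implementations of two of these measures, MRR and MAP, assuming each query is represented by a list of binary relevance judgments ordered by the system's ranking (NDCG is worked out on the next slide).

```python
def mean_reciprocal_rank(ranked_relevance):
    """ranked_relevance: per-query lists of 0/1 judgments, best rank first."""
    rr = []
    for rels in ranked_relevance:
        rr.append(next((1.0 / (i + 1) for i, r in enumerate(rels) if r), 0.0))
    return sum(rr) / len(rr)

def mean_average_precision(ranked_relevance):
    """Average precision per query, averaged over queries."""
    aps = []
    for rels in ranked_relevance:
        hits, precisions = 0, []
        for i, r in enumerate(rels):
            if r:
                hits += 1
                precisions.append(hits / (i + 1))
        aps.append(sum(precisions) / max(hits, 1))
    return sum(aps) / len(aps)
```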
NDCG • Query: a ranked list of documents with graded relevance labels l(r) at each rank r • DCG at position m: DCG(m) = Σ_{r=1..m} (2^l(r) − 1) / log2(1 + r), where 2^l(r) − 1 is the gain and 1/log2(1 + r) the position discount • NDCG at position m: DCG(m) divided by the DCG of the ideal ranking, averaged over queries • Example • Labels: (3, 3, 2, 2, 1, 1, 1) • Gains 2^l − 1: (7, 7, 3, 3, 1, 1, 1) • Discounts 1/log2(1 + r): (1, 0.63, 0.5, 0.43, 0.39, 0.36, 0.33) • Cumulative DCG: (7, 11.41, 12.91, 14.2, 14.59, 14.95, 15.28)
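The following sketch computes DCG and NDCG with the gain 2^l − 1 and discount 1/log2(1 + r) shown on the slide, and reproduces the example's cumulative DCG values up to rounding of the discounts.

```python
import math

def dcg_at(ratings, m):
    """DCG@m with gain 2^l - 1 and discount 1/log2(1 + r)."""
    return sum(
        (2 ** l - 1) / math.log2(1 + r)
        for r, l in enumerate(ratings[:m], start=1)
    )

def ndcg_at(ratings, m):
    """NDCG@m: DCG normalized by the DCG of the ideal (sorted) ranking."""
    ideal = dcg_at(sorted(ratings, reverse=True), m)
    return dcg_at(ratings, m) / ideal if ideal > 0 else 0.0

# Slide example: labels (3, 3, 2, 2, 1, 1, 1) give gains (7, 7, 3, 3, 1, 1, 1);
# cumulative DCG comes out as roughly (7.0, 11.42, 12.92, 14.21, 14.6, 14.95, 15.28),
# matching the slide's figures up to rounding of the discounts.
print([round(dcg_at([3, 3, 2, 2, 1, 1, 1], m), 2) for m in range(1, 8)])
```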
Ranking SVM • Given: training instances x with rank labels, inducing ordered pairs x_i ≻ x_j • We learn a function f such that f(x_i) > f(x_j) whenever x_i ≻ x_j • Consider a linear function f(x) = ⟨w, x⟩ • Transforming to classification: f(x_i) > f(x_j) ⇔ ⟨w, x_i − x_j⟩ > 0, so each ordered pair becomes a classification instance (x_i − x_j, +1)
Ranking SVM (cont'd) • Ranking model: f(x) = ⟨w, x⟩, with w learned by solving the SVM classification problem over the pair instances
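A minimal sketch of the pairwise transformation with a linear model trained on the difference vectors, using scikit-learn's LinearSVC as a stand-in for the SVM solver; the toy data, the mirrored-pair trick, and the choice of C are assumptions for illustration, not the authors' setup.

```python
import numpy as np
from itertools import combinations
from sklearn.svm import LinearSVC

def to_pairs(X, y):
    """Turn (feature vector, rank label) instances of one query into
    difference vectors x_i - x_j, labeled +1 if y_i > y_j and -1 otherwise."""
    diffs, labels = [], []
    for i, j in combinations(range(len(y)), 2):
        if y[i] == y[j]:
            continue
        sign = 1 if y[i] > y[j] else -1
        diffs.append(X[i] - X[j]); labels.append(sign)
        diffs.append(X[j] - X[i]); labels.append(-sign)  # mirrored pair keeps both classes
    return np.array(diffs), np.array(labels)

# Toy data: 4 documents for one query, 3 features each, labels = relevance grades.
X = np.array([[0.9, 0.2, 0.4], [0.7, 0.1, 0.6], [0.2, 0.8, 0.1], [0.1, 0.3, 0.2]])
y = np.array([2, 1, 1, 0])

P, s = to_pairs(X, y)
svm = LinearSVC(C=1.0).fit(P, s)      # learns the linear f(x) = <w, x>
scores = X @ svm.coef_.ravel()        # rank documents by f(x)
print(np.argsort(-scores))            # document indices, best first
```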
Direct Application of Ranking SVM to Document Retrieval • Each query-document pair is represented as a feature vector • Instance pairs from all queries are combined into one training set
Problems with Direct Application • Cost sensitivity: errors at the top of the ranking should cost more (d: definitely relevant, p: partially relevant, n: not relevant; ranking 1: p d p n n n n vs. ranking 2: d p n p n n n) • Query normalization: the number of instance pairs varies by query (q1: d p p n n n n gives 2 (d,p) + 4 (d,n) + 8 (p,n) = 14 pairs; q2: d d p p p n n n n n gives 6 (d,p) + 10 (d,n) + 15 (p,n) = 31 pairs)
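The query-normalization issue is easy to reproduce; the sketch below counts ordered instance pairs per query for the slide's two examples (14 pairs for q1 versus 31 for q2).

```python
from itertools import combinations
from collections import Counter

def count_pairs(labels):
    """Count rank-ordered instance pairs per query (with d > p > n), as on the slide."""
    order = {"d": 2, "p": 1, "n": 0}
    pairs = Counter()
    for a, b in combinations(labels, 2):
        if order[a] != order[b]:
            hi, lo = (a, b) if order[a] > order[b] else (b, a)
            pairs[(hi, lo)] += 1
    return pairs, sum(pairs.values())

print(count_pairs(list("dppnnnn")))     # q1: 2 (d,p), 4 (d,n), 8 (p,n) -> 14 pairs
print(count_pairs(list("ddpppnnnnn")))  # q2: 6 (d,p), 10 (d,n), 15 (p,n) -> 31 pairs
```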