
Discriminative Models for Information Retrieval



  1. Discriminative Models for Information Retrieval. Ramesh Nallapati, UMass, SIGIR 2004

  2. Abstract • Discriminative model vs. Generative model • Discriminative – attractive theoretical properties • Performance comparison • Discriminative – Maximum Entropy (ME), Support Vector Machine (SVM) • Generative – Language Modeling (LM) • Experiments • Ad-hoc retrieval – ME is worse than LM; SVMs are on par with LM • Home-page finding – SVMs are preferred over LM

  3. Introduction • Traditional IR • A problem of measuring the similarity between docs and queries, as in the Vector Space Model • Shortcoming • Term weights are empirically tuned • No theoretical basis for computing optimum weights • Binary Independence Retrieval (BIR) • Robertson and Sparck Jones (1976) • The first model that viewed IR as a classification problem • This allows us to leverage many sophisticated techniques developed in the ML domain • Discriminative models • Good success in many applications of ML

  4. Discriminative and Generative Classifiers • Pattern Classification • The problem of classifying an example, based on its vector of features x, into its class C through a posterior probability P(C|x) or simply a confidence score g(C|x) • Discriminative models • Model the posterior directly, or learn a direct map from inputs x to the class labels C • Generative models • Model the class-conditional probability P(x|C) and the prior probability P(C), and estimate the posterior through Bayes' rule (shown below)
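For reference, Bayes' rule yields the posterior from the two generative ingredients named above, in the slide's notation:

P(C \mid x) = \frac{P(x \mid C)\,P(C)}{\sum_{C'} P(x \mid C')\,P(C')}

A discriminative model parameterizes P(C|x) (or g(C|x)) directly and never commits to a model of how x itself is distributed.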

  5. Probabilistic IR models as Classifiers (1/3) • Binary Independence Retrieval (BIR) model • Ranking is done by the log-likelihood ratio of relevance, shown below • The model has not met with good empirical success owing to the difficulty in estimating the class-conditional P(x_i = 1 | R) • Assume a uniform probability distribution over the entire vocabulary and update the probabilities as relevant docs are provided by the user
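The slide's equation was rendered as an image; in its standard form, with binary term-presence features x_i, p_i = P(x_i = 1 | R) and q_i = P(x_i = 1 | \bar{R}), the BIR log-likelihood ratio reduces, up to a document-independent constant, to

\log \frac{P(x \mid R)}{P(x \mid \bar{R})} \;\propto\; \sum_{i:\,x_i = 1} \log \frac{p_i\,(1 - q_i)}{q_i\,(1 - p_i)}

which is the classical Robertson-Sparck Jones term weight summed over the terms present in the document.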

  6. Probabilistic IR models as Classifiers (2/3) • Two-Poisson model • Follows the same framework as the BIR model, but uses a mixture of two Poisson distributions to model the class-conditionals for the relevant and non-relevant classes (see the mixture below) • This is also a generative model • Similar to the BIR model, it needs relevance feedback for accurate parameter estimation
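The slide's two distributions were images; the two-Poisson class-conditional for a term's within-document frequency k is conventionally written as the mixture

P(tf = k \mid R) = \pi\,\frac{\lambda_1^{k} e^{-\lambda_1}}{k!} + (1 - \pi)\,\frac{\lambda_2^{k} e^{-\lambda_2}}{k!}

where one Poisson component models documents that are "elite" for the term and the other models the rest, with an analogous mixture for the non-relevant class.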

  7. Probabilistic IR models as Classifiers (3/3) • Language Models • Ponte and Croft (1998) • The ranking of a doc is given by the probability of generating the query from the doc's language model, as shown below • This model circumvents the problem of estimating the model of relevant documents that the BIR and Two-Poisson models suffer from • LMs can be considered generative classifiers in a multi-class classification sense
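For reference, the query-likelihood ranking function is usually written as

P(Q \mid M_D) = \prod_{q \in Q} P(q \mid M_D)

where M_D is the smoothed language model of document D. Smoothing assigns non-zero probability to query terms absent from D; a common choice is Dirichlet smoothing, P(q \mid M_D) = \frac{tf(q, D) + \mu\,P(q \mid C)}{|D| + \mu}, though the slides only say that a smoothing parameter is tuned, so the Dirichlet form here is illustrative.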

  8. The Case for Discriminative Models for IR (1/3) • Discriminative vs. Generative • One should solve the problem (classification) directly and never solve a more general problem (estimating class-conditionals) as an intermediate step • Model Assumptions • GM • Terms are conditionally independent • LMs assume docs obey a multinomial distribution of terms • DM • Typically makes few assumptions and, in a sense, lets the data speak for itself

  9. The Case for Discriminative Models for IR (2/3) • Expressiveness • GM – LMs are not expressive enough to incorporate many features into the model • DM – can include arbitrary features effortlessly in a single model • Learning arbitrary features • In view of the many query-dependent and query-independent doc features and user preferences that influence relevance, we believe that a DM that learns all the features is best suited for the generalized IR problem

  10. The Case for Discriminative Models for IR (3/3) • Notion of Relevance • In LM, there is no explicit notion of relevance. There has been considerable controversy on the missing relevance variable in LM. • We believe that Robertson’s view of IR as a binary classification problem of relevance is more realistic than the implicit notion of relevance as it exists in LM.

  11. Discriminative Models Used in Current Work (1/2) • Maximum Entropy Model • The principle of ME – model all that is known and assume nothing about that which is unknown • The parametric form of the ME probability function takes the log-linear form shown below • The feature weights (λ) are learned from training data using a fast gradient descent algorithm • As in Robertson's BIR model, we use the log-likelihood ratio as the scoring function for ranking (see below)
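The parametric form on the slide was an image; the standard log-linear (maximum-entropy) parameterization is

P(R \mid D, Q) = \frac{1}{Z(D, Q)} \exp\Big(\sum_i \lambda_i f_i(D, Q)\Big)

where the f_i are the features, the \lambda_i their learned weights, and Z(D, Q) normalizes over the relevant and non-relevant classes. In this binary case the log-likelihood-ratio score used for ranking is, up to a class-prior constant, just \sum_i \lambda_i f_i(D, Q).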

  12. Discriminative Models Used in Current Work (2/2) • Support Vector Machines (SVM) • Basic idea – find the hyperplane that separates the two classes of training examples with the largest margin • If f(D,Q) is the vector of features, the discriminant function is g(R|D,Q) = w · f(D,Q) + b (a sketch follows) • The SVM is trained such that g(R|D,Q) >= 1 for positive (relevant) examples and g(R|D,Q) <= -1 for negative (non-relevant) examples, as long as the data is separable • Both DMs • Retain the basic framework of the BIR model, while avoiding estimation of the class-conditionals; instead they directly compute the posterior P(R|Q,D) or the mapping function g(R|D,Q)
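A minimal sketch of this max-margin ranker, assuming scikit-learn's LinearSVC as a stand-in for the svm-light package used in the paper; the feature values and their dimensionality are made up for illustration:

import numpy as np
from sklearn.svm import LinearSVC

# Hypothetical training data: each row is a feature vector f(D, Q) for one
# (document, query) pair; labels are +1 (relevant) / -1 (non-relevant).
X_train = np.array([[2.1, 0.8], [0.3, 0.1], [1.9, 0.7], [0.2, 0.2]])
y_train = np.array([1, -1, 1, -1])

svm = LinearSVC(C=1.0)  # linear kernel, reported to work best in the paper
svm.fit(X_train, y_train)

# Rank candidate documents for a query by g(R|D,Q) = w . f(D,Q) + b,
# i.e. the signed distance to the separating hyperplane.
X_test = np.array([[1.5, 0.6], [0.4, 0.3]])
scores = svm.decision_function(X_test)
ranking = np.argsort(-scores)  # highest-scoring document first

Ranking by the raw decision value rather than the thresholded class label is what turns the binary classifier into a retrieval function.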

  13. Other Modeling Issues • Out-of-Vocabulary (OOV) words problem • Test queries are almost always guaranteed to contain words that are not seen in the training queries • The features are therefore not based on the words themselves, but on query-based statistics of documents, such as the total frequency of occurrence or the sum of the idf values of the query terms (see the sketch below) • Unbalanced data • One class (non-relevant) accounts for a large portion of all the examples, while the other (relevant) class has only a small percentage of the examples • Over-sampling the minority class • Under-sampling the majority class
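A minimal sketch of such word-independent features; the exact feature set and names here are assumptions for illustration, not the paper's definition:

import math
from collections import Counter

def query_doc_features(query_terms, doc_terms, df, num_docs):
    """Query-based statistics of one document, independent of word identities."""
    tf = Counter(doc_terms)
    total_tf = sum(tf[t] for t in query_terms)             # total frequency of query terms in the doc
    sum_idf = sum(math.log(num_docs / (1 + df.get(t, 0)))  # sum of idf values of the query terms
                  for t in query_terms)
    coverage = sum(1 for t in query_terms if tf[t] > 0) / len(query_terms)
    return [total_tf, sum_idf, coverage]

Because every feature is a statistic of the query terms rather than a weight attached to a specific vocabulary word, unseen words in test queries cause no OOV problem.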

  14. Experiments and Results (1/8) • Ad-hoc retrieval • Data sets • Preprocessing – K-stemmer and stop-word removal • Only title queries are used for retrieval • LM • Training the LM consists of learning the optimal value of the smoothing parameter • All LM runs were performed using Lemur

  15. Experiments and Results (2/8) • DM • Features • SVM • svm-light is used for the SVM runs • A linear kernel gives the best performance on most data sets (and converges rapidly) • ME • The toolkit of Zhang

  16. Experiments and Results (3/8) • Comparison of the performance of LM, SVM and ME • In 50% (8/16) of the runs the differences are statistically indistinguishable; in 12.5% (2/16) SVM is statistically better than LM; in 37.5% (6/16) LM is superior to SVM

  17. Experiments and Results (4/8) • Discussion • Official TREC runs use query expansion • DMs can improve performance by including other features such as proximity of query terms or occurrence of query terms as noun phrases; such features would not be easy to incorporate into the LM framework • With the emergence of modern IR collections such as the web and scientific literature, which are characterized by a diverse variety of features, we will increasingly rely on models that can automatically learn these features from examples

  18. Experiments and Results (5/8) • Home-page finding on a web collection • We choose the home-page finding task of TREC-10, where many features such as title, anchor text and link structure influence relevance • Example – return the web page http://trec.nist.gov when the query "Text Retrieval Conference" is issued • Corpus – WT10G • Queries • 50 for training, 50 for development and 145 for testing • Evaluation metrics (sketched below) • Mean reciprocal rank (MRR) • Success rate – an answer is found in the top 10 • Failure rate – no answer is returned in the top 100
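A minimal sketch of the three metrics, assuming each query is summarized by the 1-based rank of the first correct home page (None if it was never retrieved); the function names are illustrative:

def mean_reciprocal_rank(ranks):
    """MRR: average of 1/rank of the first correct answer; 0 for misses."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)

def success_rate(ranks, cutoff=10):
    """Fraction of queries whose answer appears in the top `cutoff`."""
    return sum(1 for r in ranks if r is not None and r <= cutoff) / len(ranks)

def failure_rate(ranks, cutoff=100):
    """Fraction of queries with no answer in the top `cutoff`."""
    return sum(1 for r in ranks if r is None or r > cutoff) / len(ranks)

# e.g. mean_reciprocal_rank([1, 3, None, 12]) == (1 + 1/3 + 0 + 1/12) / 4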

  19. Experiments and Results (6/8) • Three indexes • A content index consisting of the textual content of the documents with all HTML tags removed • An index of the anchor-text documents • An index of the titles of all documents • 20 features • The 6 previous features, computed from each of the three indexes • Two additional features • URL-depth – a home page typically is at depth 1 (see the sketch below) • Link-Factor
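A minimal sketch of how the URL-depth feature might be computed; the exact encoding is an assumption, as the slide only states that home pages typically sit at depth 1:

from urllib.parse import urlparse

def url_depth(url):
    """Depth 1 for a site root such as http://trec.nist.gov, +1 per path segment."""
    path = urlparse(url).path.strip("/")
    return 1 if not path else 1 + len(path.split("/"))

# url_depth("http://trec.nist.gov")       -> 1
# url_depth("http://trec.nist.gov/data")  -> 2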

  20. Experiments and Results (7/8) • Performance on development set • Performance on test set

  21. Experiments and Results (8/8) • Discussion • SVMs leverage a variety of features and improve on the baseline LM performance by 48.6% in MRR • The best run in TREC-10 achieved an MRR of 0.77 on the test set; however, its feature weights were optimized by empirical means, while our models learn them automatically • Our runs only demonstrate the learning ability of SVMs • We believe there is a lot more that needs to be done in defining the right kind of features, such as PageRank for the link-factor feature

  22. Related Work • There have been a few attempts at applying discriminative models to IR • Cooper and Huizinga make a strong case for applying the maximum entropy approach to the problems of information retrieval • Kantor and Lee extend the analysis of the principle of maximum entropy in the context of information retrieval • Greiff and Ponte showed that the classic binary independence model and the maximum entropy approach are equivalent • Gey suggested the method of logistic regression, which is equivalent to the method of maximum entropy used in our work

  23. Conclusion and Future Work (1/2) • Treat IR as a problem of binary classification • Quantifies relevance explicitly • Permits us to apply sophisticated pattern classification techniques • Apply SVMs and ME to IR • Their main utility to IR lies in their ability to automatically learn a variety of features that influence relevance • Ad-hoc retrieval • SVMs perform as well as LMs • Home-page finding • SVMs outperform the baseline runs by about 50% in MRR

  24. Conclusion and Future Work (2/2) • Future Work • Further improvement through better feature engineering and by leveraging the huge body of literature on SVMs and other learning algorithms • Evaluate the performance of SVMs on the ad-hoc retrieval task with longer queries • Add features such as proximity of query terms, synonyms, etc. • Study user modeling by incorporating user preferences as features in the SVMs
