Discriminative Models for Information Retrieval • Ramesh Nallapati, UMass • SIGIR 2004
Abstract • Discriminative models vs. generative models • Discriminative models have attractive theoretical properties • Performance comparison • Discriminative – Maximum Entropy (ME), Support Vector Machines (SVMs) • Generative – Language Modeling (LM) • Experiments • Ad-hoc retrieval • ME performs worse than LM; SVMs are on par with LM • Home-page finding • SVMs are preferred over LM
Introduction • Traditional IR • A problem of measuring the similarity between docs and a query, as in the Vector Space Model • Shortcoming • Term weights are empirically tuned • No theoretical basis for computing optimum weights • Binary independence retrieval (BIR) • Robertson and Sparck Jones (1976) • First model that viewed IR as a classification problem • This allows us to leverage many sophisticated techniques developed in the ML domain • Discriminative models • Good success in many applications of ML
Discriminative and Generative Classifiers • Pattern classification • The problem of classifying an example, based on its feature vector x, into its class C through a posterior probability P(C|x) or simply a confidence score g(C|x) • Discriminative models • Model the posterior directly, or learn a direct map from inputs x to the class labels C • Generative models • Model the class-conditional probability P(x|C) and the prior probability P(C), and estimate the posterior through Bayes' rule
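To make the contrast concrete: a discriminative model estimates P(C|x) (or g(C|x)) directly, while a generative model composes the same posterior from its two learned pieces via Bayes' rule:

$$P(C \mid \mathbf{x}) = \frac{P(\mathbf{x} \mid C)\, P(C)}{\sum_{C'} P(\mathbf{x} \mid C')\, P(C')}$$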
Probabilistic IR models as Classifiers (1/3) • Binary Independence Retrieval (BIR) model • Ranking is done by the log-likelihood ratio of relevance (reconstructed below) • The model has not met with good empirical success owing to the difficulty of estimating the class-conditional P(xi=1|R) • Assume a uniform probability distribution over the entire vocabulary and update the probabilities as relevant docs are provided by the user
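The equation on this slide did not survive extraction; a standard reconstruction from Robertson and Sparck Jones' formulation, with binary term indicators x_i and term independence assumed within each class (R = relevant, R̄ = non-relevant), is:

$$\operatorname{rank}(D, Q) = \log \frac{P(\mathbf{x} \mid R)}{P(\mathbf{x} \mid \bar{R})} = \sum_i \log \frac{P(x_i \mid R)}{P(x_i \mid \bar{R})}$$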
Probabilistic IR models as Classifiers (2/3) • Two-Poisson model • Follows the same framework as the BIR model, but uses a mixture of two Poisson distributions to model the class-conditionals P(x|R) and P(x|R̄) • This also is a generative model • Similar to the BIR model, it needs relevance feedback for accurate parameter estimation
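As a sketch of the two-Poisson class-conditional for a term frequency tf, with mixture weight π and Poisson rates μ₁ (elite) and μ₂ (non-elite) as the parameters to estimate:

$$P(\mathrm{tf} \mid C) = \pi\, \frac{e^{-\mu_1} \mu_1^{\mathrm{tf}}}{\mathrm{tf}!} + (1 - \pi)\, \frac{e^{-\mu_2} \mu_2^{\mathrm{tf}}}{\mathrm{tf}!}$$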
Probabilistic IR models as Classifiers (3/3) • Language Models (LM) • Ponte and Croft (1998) • The ranking of a doc is given by the probability of generating the query from the doc's language model • This model circumvents the problem of estimating a model of relevant documents that the BIR and Two-Poisson models suffer from • LMs can be considered generative classifiers in a multi-class classification sense
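In the query-likelihood framing, documents are ranked by the probability that each document's language model θ_D generated the query; for a unigram model this is

$$P(Q \mid D) = \prod_{q \in Q} P(q \mid \theta_D)$$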
The Case for Discriminative Models for IR (1/3) • Discriminative vs. Generative • One should solve the problem of interest (classification) directly, and never solve a more general problem (estimating the class-conditional) as an intermediate step • Model assumptions • GM • Terms are conditionally independent • LMs assume docs obey a multinomial distribution over terms • DM • Typically make few assumptions and, in a sense, let the data speak for itself
The Case for Discriminative Models for IR (2/3) • Expressiveness • GM – LMs are not expressive enough to incorporate many features into the model • DM – Can include arbitrary features effortlessly in a single model • Learning arbitrary features • In view of the many query-dependent and query-independent doc features and user preferences that influence relevance, we believe a DM that learns all these features is best suited for the generalized IR problem
The Case for Discriminative Models for IR (3/3) • Notion of relevance • In LMs there is no explicit notion of relevance, and there has been considerable controversy over the missing relevance variable in LMs • We believe Robertson's view of IR as a binary classification problem of relevance is more realistic than the implicit notion of relevance in LMs
Discriminative Models Used in Current Work (1/2) • Maximum Entropy (ME) model • The principle of ME – model all that is known and assume nothing about that which is unknown • The parametric form of the ME probability function can be expressed as shown below • The feature weights (λ) are learned from training data using a fast gradient descent algorithm • As in Robertson's BIR model, we use the log-likelihood ratio as the scoring function for ranking
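The parametric form was lost in extraction; the standard conditional exponential form it refers to, with features f_i(D,Q), weights λ_i, and normalizer Z, is

$$P(R \mid D, Q) = \frac{1}{Z(D, Q)} \exp\Big(\sum_i \lambda_i f_i(D, Q)\Big)$$

and the scoring function for ranking is the log-likelihood ratio

$$\operatorname{score}(D, Q) = \log \frac{P(R \mid D, Q)}{P(\bar{R} \mid D, Q)}$$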
Discriminative Models Used in Current Work (2/2) • Support Vector Machines (SVMs) • Basic idea – find the hyper-plane that separates the two classes of training examples with the largest margin • If f(D,Q) is the vector of features, the discriminant function is as given below • The SVM is trained such that g(R|D,Q) >= 1 for positive (relevant) examples and g(R|D,Q) <= -1 for negative (non-relevant) examples, as long as the data is separable • Both DMs • Retain the basic framework of the BIR model while avoiding estimation of the class-conditional, instead directly computing the posterior P(R|D,Q) or the mapping function g(R|D,Q)
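Reconstructing the discriminant function referenced above, with weight vector w and bias b learned from the training examples:

$$g(R \mid D, Q) = \mathbf{w} \cdot \mathbf{f}(D, Q) + b$$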
Other Modeling Issues • Out-of-Vocabulary (OOV) words problem • Test queries are almost always guaranteed to contain words not seen in the training queries • The features are therefore based not on the words themselves but on query-based statistics of documents, such as the total frequency of occurrences or the sum of the idf values of the query terms • Unbalanced data • One class (non-relevant) accounts for a large portion of all the examples, while the other (relevant) class has only a small percentage of the examples • Over-sampling the minority class • Under-sampling the majority class (a sketch of both follows)
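A minimal sketch of the two rebalancing options just listed; the function and variable names are illustrative, not from the paper:

```python
import random

def rebalance(relevant, non_relevant, strategy="under"):
    """Balance a binary training set dominated by non-relevant examples.

    strategy="under": randomly under-sample the majority (non-relevant) class.
    strategy="over":  randomly over-sample the minority (relevant) class
                      with replacement up to the majority size.
    """
    if strategy == "under":
        non_relevant = random.sample(non_relevant, len(relevant))
    else:
        relevant = random.choices(relevant, k=len(non_relevant))
    return relevant, non_relevant
```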
Experiments and Results (1/8) • Ad-hoc retrieval • Data sets • Preprocessing • Krovetz stemmer and stop-word removal • Only title queries are used for retrieval • LM • Training the LM consists of learning the optimal value of the smoothing parameter • All LM runs were performed using Lemur
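The slide does not name the smoothing scheme; assuming Dirichlet smoothing (a common choice in Lemur), the tuned parameter would be μ, with collection model P(q|C):

$$P(q \mid \theta_D) = \frac{\mathrm{tf}(q, D) + \mu\, P(q \mid \mathcal{C})}{|D| + \mu}$$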
Experiments and Results (2/8) • DM • Features • SVM • svm-light is used for the SVM runs • A linear kernel gives the best performance on most data sets (and converges rapidly) • ME • Zhang's ME toolkit
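A hedged sketch of computing query-based document statistics of the kind described under the OOV discussion (aggregate tf and idf over query terms); the slide does not reproduce the exact feature set, so these names and formulas are illustrative:

```python
import math

def query_features(query_terms, doc_tf, doc_len, df, num_docs):
    """Query-dependent document features (illustrative, not the paper's exact set).

    doc_tf: dict mapping term -> frequency in this document
    df:     dict mapping term -> document frequency in the collection
    """
    sum_tf = sum(doc_tf.get(t, 0) for t in query_terms)
    sum_idf = sum(math.log(num_docs / df[t]) for t in query_terms if df.get(t))
    sum_norm_tf = sum(doc_tf.get(t, 0) / doc_len for t in query_terms)
    return [sum_tf, sum_idf, sum_norm_tf]
```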
Experiments and Results (3/8) • Comparison of the performance of LM, SVM and ME • In 50% (8/16) of the comparisons the difference is statistically indistinguishable; in 12.5% (2/16) SVM is statistically better than LM; in 37.5% (6/16) LM is superior to SVM
Experiments and Results (4/8) • Discussion • Official TREC runs use query expansion • DM can improve performance by including other features such as proximity of query terms, occurrence of query terms as noun phrases, etc.; such features would not be easy to incorporate into the LM framework • With the emergence of modern IR collections such as the web and scientific literature, characterized by a diverse variety of features, we will increasingly rely on models that can automatically learn these features from examples
Experiments and Results (5/8) • Home-page finding on a web collection • We choose the home-page finding task of TREC-10, where many features such as title, anchor text and link structure influence relevance • Example – return the web page http://trec.nist.gov when the query "Text Retrieval Conference" is issued • Corpus – WT10G • Queries • 50 for training, 50 for development and 145 for testing • Evaluation metrics • Mean reciprocal rank (MRR) • Success rate – an answer is found in the top 10 • Failure rate – no answer is returned in the top 100
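A small sketch of the three metrics exactly as defined on this slide (reciprocal rank of the first correct answer, success in the top 10, failure when nothing is found in the top 100):

```python
def evaluate(ranked_answers):
    """ranked_answers: one entry per query, the 1-based rank of the correct
    home page, or None if it was not retrieved at all."""
    n = len(ranked_answers)
    mrr = sum(1.0 / r if r else 0.0 for r in ranked_answers) / n
    success_at_10 = sum(1 for r in ranked_answers if r and r <= 10) / n
    failure_at_100 = sum(1 for r in ranked_answers if not r or r > 100) / n
    return mrr, success_at_10, failure_at_100
```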
Experiments and Results (6/8) • Three indexes • A content index consisting of the textual content of the documents with all HTML tags removed • An index of the anchor-text documents • An index of the titles of all documents • 20 features • The 6 previous features from each of the three indexes • Two additional features, sketched below • URL-depth – a home page typically sits at depth 1 • Link-Factor
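A minimal sketch of the URL-depth feature; the paper's exact definition may differ (e.g., in how trailing slashes or index pages are counted):

```python
from urllib.parse import urlparse

def url_depth(url):
    """Number of path segments; home pages tend to sit at the shallowest depths."""
    path = urlparse(url).path.strip("/")
    return len(path.split("/")) if path else 0

# url_depth("http://trec.nist.gov/")          -> 0
# url_depth("http://trec.nist.gov/pubs/x")    -> 2
```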
Experiments and Results (7/8) • Performance on development set • Performance on test set
Experiments and Results (8/8) • Discussion • SVMs leverage a variety of features and improve on the baseline LM performance by 48.6% in MRR • The best run in TREC-10 achieved an MRR of 0.77 on the test set; however, its feature weights were optimized by empirical means, while our models learn them automatically • This only demonstrates the learning ability of SVMs • We believe a lot more needs to be done in defining the right kind of features, such as PageRank for the link-factor feature
Related Work • A few attempts at applying discriminative models to IR • Cooper and Huizinga make a strong case for applying the maximum entropy approach to the problems of information retrieval • Kantor and Lee extend the analysis of the principle of maximum entropy in the context of information retrieval • Greiff and Ponte showed that the classic binary independence model and the maximum entropy approach are equivalent • Gey suggested the method of logistic regression, which is equivalent to the maximum entropy method used in our work
Conclusion and Future Work (1/2) • Treat IR as a problem of binary classification • Quantifies relevance explicitly • Permits us to apply sophisticated pattern classification techniques • Apply SVMs and ME to IR • Their main utility to IR lies in their ability to automatically learn a variety of features that influence relevance • Ad-hoc retrieval • SVMs perform as well as LMs • Home-page finding • SVMs outperform the baseline runs by about 50% in MRR
Conclusion and Future Work (2/2) • Future Work • Further improvement through better feature engineering and by leveraging the huge body of literature on SVMs and other learning algorithms • Evaluate the performance of SVMs on the ad-hoc retrieval task with longer queries • Add features such as proximity of query terms, synonyms, etc. • Study user modeling by incorporating user preferences as features in the SVMs