LM Approaches to Filtering

  1. LM Approaches to Filtering
     Richard Schwartz, BBN
     LM/IR ARDA 2002, September 11-12, 2002, UMass

  2. Topics
     • The LM approach
       • What is it?
       • Why is it preferred?
     • Controlling the filtering decision

  3. What is the LM Approach?
     • We distinguish all ‘statistical’ approaches from ‘probabilistic’ approaches.
     • The tf-idf metric computes various statistics of words and documents.
     • By ‘probabilistic’ approaches, we mean methods that compute the probability of a document being relevant to a user’s need, given the query, the document, and the rest of the world, using a formula that arguably computes P(Doc is Relevant | Query, Document, Collection, etc.).
     • If we apply Bayes’ rule, we end up with a prior for each document, P(Doc is Relevant | everything except the Query), and the likelihood of the query, P(Query | Doc is Relevant), as sketched after this slide.
     • The LM approach is a solution to the second part.
     • The prior probability component is also important.
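A minimal sketch of the Bayes decomposition the slide describes, written in odds form (notation mine: R is the event that the document is relevant, Q the query, D the document):

```latex
\frac{P(R \mid Q, D)}{P(\bar{R} \mid Q, D)}
  = \underbrace{\frac{P(Q \mid R, D)}{P(Q \mid \bar{R}, D)}}_{\text{query likelihood ratio (the LM part)}}
    \times
    \underbrace{\frac{P(R \mid D)}{P(\bar{R} \mid D)}}_{\text{prior odds, independent of } Q}
```

The odds form cancels the P(Q | D) denominator that plain Bayes' rule would leave behind, which is why the log of a likelihood ratio appears as the natural score later in the talk.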

  4. What it is not
     • If we compute an LM for the query and an LM for a document, and ask for the probability that the two underlying LMs are the same, I would NOT call this a posterior probability model.
     • The two LMs would not be expected to be the same even with long queries.

  5. Issues in LM Approaches for Filtering
     • We (ideally) have three sets of documents:
       • Positive documents
       • Negative documents
       • A large corpus of unknown (mostly negative) documents
     • We can estimate a model for both the positive and the negative documents.
     • We can find more positive documents in the large corpus.
     • We use the large corpus to smooth the models estimated from the positive and negative documents.
     • We compute the probability of each new document given each of the two models.
     • The log of the ratio of these two likelihoods is a score that indicates whether the document is positive or negative (see the sketch after this slide).
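A minimal sketch of that log-likelihood-ratio score in Python. The linear smoothing against a background model and all names here (`unigram_model`, `llr_score`, the weight `lam`) are my illustrative assumptions, not the BBN implementation:

```python
import math
from collections import Counter

def unigram_model(docs, background, lam=0.5):
    """Topic unigram model, linearly smoothed with a background model.
    docs: list of token lists; background: dict mapping word -> probability."""
    counts = Counter(w for doc in docs for w in doc)
    total = sum(counts.values())
    def prob(w):
        topic = counts[w] / total if total else 0.0
        return lam * topic + (1.0 - lam) * background.get(w, 1e-9)
    return prob

def llr_score(doc, pos_model, neg_model):
    """log P(doc | positive model) - log P(doc | negative model).
    Positive scores favor the positive (on-topic) model."""
    return sum(math.log(pos_model(w)) - math.log(neg_model(w)) for w in doc)
```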

  6. Language Modeling Choices
     • We can model the probability of the document given the topic in many ways.
     • A simple unigram mixture works surprisingly well:
       • A weighted mixture of the distributions from the topic training documents and from the full corpus.
     • We improve significantly over the ‘naïve Bayes’ model by estimating the mixture with the Expectation-Maximization (EM) technique (see the sketch after this slide).
     • We can extend the model in many ways:
       • N-gram models of words
       • Phrases: proper names, collocations
     • Because we use a formal generative model, we know how to incorporate any effect we want.
       • E.g., the probability of features of the top-5 documents given that some document is relevant.
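A minimal EM sketch for the mixture weight, assuming fixed topic and background distributions. The per-document re-estimation shown here is one common variant, offered as illustration rather than the talk's exact procedure:

```python
def em_mixture_weight(doc, topic_prob, bg_prob, iters=20, lam=0.5):
    """Estimate lambda in  P(w) = lam * P_topic(w) + (1 - lam) * P_bg(w)
    by EM over the tokens of one document."""
    if not doc:
        return lam
    for _ in range(iters):
        # E-step: posterior that each token came from the topic component
        posteriors = []
        for w in doc:
            t = lam * topic_prob(w)
            b = (1.0 - lam) * bg_prob(w)
            posteriors.append(t / (t + b))
        # M-step: the new weight is the mean topic posterior
        lam = sum(posteriors) / len(posteriors)
    return lam
```

With `topic_prob` and `bg_prob` built as in the earlier sketch, the learned `lam` replaces a fixed interpolation weight.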

  7. How to Set the Threshold
     • For filtering, we are required to make a hard decision about whether to accept each document, rather than just rank the documents.
     • Problems:
       • The score for a particular document depends on many factors that are not important for the decision (a simple mitigation is sketched after this slide):
         • Length of the document
         • Percentage of low-likelihood words
       • The range of scores depends on the particular topic.
     • We would like to map the score for any document and topic into a real posterior probability.
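One common way to remove the length dependence (my illustration, reusing `llr_score` from the earlier sketch, not a technique the slide names) is to average the log-likelihood ratio per token:

```python
def per_word_llr(doc, pos_model, neg_model):
    """Length-normalized score: average log-likelihood ratio per token,
    so long and short documents score on a comparable scale."""
    return llr_score(doc, pos_model, neg_model) / max(len(doc), 1)
```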

  8. Score Normalization Techniques
     • By using the relative score for two models, we remove some of the variance due to the particular document.
     • We can normalize for the peculiarities of the topic by computing the distribution of scores for off-topic documents.
     • Advantages of using off-topic documents:
       • We have a very large number of documents.
       • We can fix the probability of false alarms.
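A minimal sketch of that normalization, under the (illustrative) assumption that off-topic scores for a topic are roughly Gaussian: z-normalize a raw score against the fitted off-topic distribution, then choose the threshold that fixes the false-alarm probability.

```python
import statistics
from statistics import NormalDist

def znorm(score, off_topic_scores):
    """Map a raw topic score onto the off-topic score distribution
    (off-topic documents then score with mean 0, stdev 1)."""
    mu = statistics.mean(off_topic_scores)
    sigma = statistics.stdev(off_topic_scores)
    return (score - mu) / sigma

def threshold_for_false_alarms(p_fa):
    """Normalized threshold exceeded by an off-topic document
    with probability p_fa (e.g., 0.01 for a 1% false-alarm rate)."""
    return NormalDist().inv_cdf(1.0 - p_fa)
```

Accepting a document when `znorm(score, off_topic_scores) > threshold_for_false_alarms(0.01)` fixes the false-alarm rate per topic, regardless of each topic's raw score range.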

  9. The Bottom Line
     • For TDT tracking, the probabilistic approach to modeling the document and to score normalization results in better performance, whether for mono-lingual text, cross-language text, speech recognition output, etc.
     • Larger improvements will come once multiple sites start using similar techniques.

  10. Grand Challenges
     • Tested in TDT:
       • Operating with small amounts of training data per category (1 to 4 documents per event)
       • Robustness to changes over time (adaptation)
       • Multi-lingual domains
       • How to set the threshold for filtering, using a model of ‘eventness’
     • Large hierarchical category sets: how to use the structure
     • Effective use of prior knowledge
     • Predicting performance and characterizing classes
     • Need a task where both the discriminative and the LM approaches will be tested.

  11. What do you really want?
     • If a user provides a document about the 9/11 World Trade Center crash and says they want “more like this”, do they want documents about:
       • Airplane crashes
       • Terrorism
       • Building fires
       • Injuries and death
       • Some combination of the above?
     • In general, we need a way to clarify which combination of topics the user wants.
     • In TDT, we predefine the task to mean that we want more about this specific event (and not about some other terrorist airplane crash into a building).