LM Approaches to Filtering • Richard Schwartz, BBN • LM/IR ARDA 2002 • September 11-12, 2002 • UMASS
Topics • LM approach • What is it? • Why is it preferred? • Controlling Filtering decision
What is the LM Approach? • We distinguish all ‘statistical’ approaches from ‘probabilistic’ approaches. • The tf-idf metric computes various statistics of words and documents. • By ‘probabilistic’ approaches, we (I) mean methods where we compute the probability of a document being relevant to a user’s need, given the query, the document, and the rest of the world, using a formula that arguably computes P(Doc is Relevant | Query, Document, Collection, etc.). • If we use Bayes’ rule, we end up with the prior for each document, p(Doc is Relevant | Everything except Query), and the likelihood of the query, p(Q | Doc is Relevant). • The LM approach is a solution to the second part of this. • The prior probability component is also important.
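The Bayes' rule step can be written out explicitly. This is a sketch in shorthand that is not on the slide: R_D stands for the event "Doc is Relevant", Q for the query, and the conditioning on the collection and the rest of the world is left implicit.

```latex
\begin{align*}
P(R_D \mid Q)
  &= \frac{P(Q \mid R_D)\, P(R_D)}{P(Q)} \\
  &\propto \underbrace{P(Q \mid R_D)}_{\text{query likelihood (the LM part)}}
     \;\cdot\;
     \underbrace{P(R_D)}_{\text{document prior}}
\end{align*}
```

For a fixed query, P(Q) is the same for every document, so it can be dropped when documents are only compared with each other.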
What it is not • If we compute an LM for the query and an LM for a document and ask for the probability that the two underlying LMs are the same, I would NOT call this a posterior probability model. • The LMs would not be expected to be the same even with long queries.
Issues in LM Approaches for Filtering • We (ideally) have three sets of documents: • Positive documents • Negative documents • Large corpus of unknown (mostly negative) documents • We can estimate a model for both positive and negative documents • We can find more positive documents in the large corpus • We use the large corpus to smooth the models from positive and negative documents • We compute the probability of each new document given each of the models • The log of the ratio of these two likelihoods is a score that indicates whether the document is positive or negative.
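The scoring recipe on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: unigram models, a fixed interpolation weight of 0.5, add-one smoothing on the background corpus, and toy documents. None of these choices, nor the function names, come from the slides.

```python
import math
from collections import Counter

def count_words(docs):
    """Count word occurrences over a list of tokenized documents."""
    counts = Counter()
    for doc in docs:
        counts.update(doc)
    return counts

def smoothed_prob(word, topic_counts, corpus_counts, topic_weight=0.5):
    """Interpolate a topic unigram estimate with the large-corpus estimate.

    topic_weight=0.5 is an arbitrary illustrative value (assumption).
    """
    topic_total = sum(topic_counts.values())
    corpus_total = sum(corpus_counts.values())
    p_topic = topic_counts[word] / topic_total if topic_total else 0.0
    # Add-one smoothing with one extra slot for unseen words (assumption),
    # so every word keeps nonzero probability and the log is defined.
    p_corpus = (corpus_counts[word] + 1) / (corpus_total + len(corpus_counts) + 1)
    return topic_weight * p_topic + (1 - topic_weight) * p_corpus

def log_likelihood(doc, topic_counts, corpus_counts):
    """Log probability of a tokenized document under one smoothed model."""
    return sum(math.log(smoothed_prob(w, topic_counts, corpus_counts)) for w in doc)

def llr_score(doc, pos_counts, neg_counts, corpus_counts):
    """Log of the ratio of the positive-model and negative-model likelihoods."""
    return (log_likelihood(doc, pos_counts, corpus_counts)
            - log_likelihood(doc, neg_counts, corpus_counts))

# Toy data, invented for illustration only.
positive = [["plane", "crash", "tower"], ["hijacked", "plane", "crash"]]
negative = [["stock", "market", "rally"], ["market", "prices", "fall"]]
unknown = [["weather", "rain", "today"], ["market", "news", "today"]]

pos_counts = count_words(positive)
neg_counts = count_words(negative)
corpus_counts = count_words(positive + negative + unknown)

on_topic = llr_score(["plane", "crash"], pos_counts, neg_counts, corpus_counts)
off_topic = llr_score(["market", "prices"], pos_counts, neg_counts, corpus_counts)
print(on_topic, off_topic)
```

A positive score favors the positive model and a negative score favors the negative model; where to place the actual decision threshold is a separate question.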
Language Modeling Choices • We can model the probability of the document given the topic in many ways. • A simple unigram mixture works surprisingly well. • Weighted mixture of distributions from the topic training data and the full corpus • We improve over the ‘naïve Bayes’ model significantly by using the Estimate-Maximize (EM) technique • We can extend the model in many ways: • Ngram model of words • Phrases: proper names, collocations • Because we use a formal generative model, we know how to incorporate any effect we want. • E.g., probability of features of top-5 documents given some document is relevant
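The slide does not say which parameters of the unigram mixture are re-estimated with EM. As one possible reading, the sketch below uses EM to estimate only the mixture weight between a fixed topic distribution and a fixed background distribution. The function name, the toy distributions, and the held-out token list are all hypothetical.

```python
def em_mixture_weight(tokens, p_topic, p_background, weight=0.5, iterations=20):
    """Estimate `weight` in weight * p_topic + (1 - weight) * p_background.

    Both component distributions are held fixed (assumption); only the
    mixture weight is updated.
    """
    for _ in range(iterations):
        # E-step: posterior probability that each token was generated
        # by the topic component rather than the background component.
        responsibilities = []
        for word in tokens:
            topic_part = weight * p_topic.get(word, 0.0)
            background_part = (1 - weight) * p_background.get(word, 0.0)
            total = topic_part + background_part
            if total > 0:  # tokens unknown to both components are skipped
                responsibilities.append(topic_part / total)
        # M-step: the new weight is the average responsibility.
        if responsibilities:
            weight = sum(responsibilities) / len(responsibilities)
    return weight

# Toy distributions, invented for illustration only.
p_topic = {"plane": 0.5, "crash": 0.5}
p_background = {"plane": 0.1, "crash": 0.1, "the": 0.4, "of": 0.4}
held_out = ["the", "plane", "crash", "of", "the", "plane"]

topic_weight = em_mixture_weight(held_out, p_topic, p_background)
print(round(topic_weight, 3))
```

For this toy data the maximum-likelihood weight can be worked out by hand as 0.375, and the iteration settles there. Estimating the weight on held-out text rather than on the topic training data itself avoids the weight drifting toward the topic component.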
How to Set the Threshold • For filtering, we are required to make a hard decision of whether to accept the document, rather than just rank the documents. • Problems: • The score for a particular document depends on many factors that are not important for the decision • Length of document • Percentage of low-likelihood words • The range of scores depends on the particular topic. • Would like to map the score for any document and topic into a real posterior probability
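The length problem is easy to see with made-up numbers. The sketch below assumes a score that is a sum of per-token log-likelihood ratios, and shows per-word averaging as one simple way to remove the length effect. The slide does not prescribe this particular fix, and the values are invented.

```python
def raw_score(token_log_ratios):
    """Document score as the sum of per-token log-likelihood ratios."""
    return sum(token_log_ratios)

def per_word_score(token_log_ratios):
    """Average per-token log ratio, so document length cancels out."""
    return sum(token_log_ratios) / len(token_log_ratios)

# Hypothetical per-token log ratios for a short document.
short_doc = [0.8, -0.2, 0.5]
# The same content repeated ten times: no new evidence, just more length.
long_doc = short_doc * 10

print(raw_score(short_doc), raw_score(long_doc))
print(per_word_score(short_doc), per_word_score(long_doc))
```

The raw score grows in proportion to length while the per-word score stays put, which is why one fixed threshold on the raw score cannot serve documents of different lengths.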
Score Normalization Techniques • By using the relative score for two models, we remove some of the variance due to the particular document. • We can normalize for the peculiarities of the topic by computing the distribution of scores for Off-Topic documents. • Advantages of using Off-Topic documents: • We have a very large number of documents • We can fix the probability of false alarms
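Both uses of Off-Topic scores can be sketched as follows. The z-score form of the normalization and the rank-based threshold are assumptions chosen for illustration, since the slide does not give the exact formulas. The function names and score values are hypothetical.

```python
import statistics

def normalize_score(score, off_topic_scores):
    """Express a score in standard deviations above the Off-Topic mean."""
    mean = statistics.mean(off_topic_scores)
    stdev = statistics.stdev(off_topic_scores)
    return (score - mean) / stdev

def threshold_for_false_alarm_rate(off_topic_scores, false_alarm_rate):
    """Threshold such that at most the given fraction of Off-Topic
    scores lie strictly above it (documents are accepted if score > threshold).
    """
    ranked = sorted(off_topic_scores, reverse=True)
    allowed = int(false_alarm_rate * len(ranked))
    if allowed >= len(ranked):
        return float("-inf")  # every Off-Topic document may be accepted
    return ranked[allowed]

# Hypothetical scores of Off-Topic documents against one topic model.
off_topic = [-5.0, -4.0, -3.5, -3.0, -2.5, -2.0, -1.5, -1.0, 0.0, 1.0]

threshold = threshold_for_false_alarm_rate(off_topic, 0.2)
false_alarms = sum(score > threshold for score in off_topic)
normalized = normalize_score(2.0, off_topic)
print(threshold, false_alarms, normalized)
```

Because the threshold is read off the Off-Topic distribution for each topic separately, the false-alarm rate is held fixed even though the raw score range differs from topic to topic.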
The Bottom Line • For TDT tracking, the probabilistic approach to modeling the document and to score normalization results in better performance, whether for mono-language, cross-language, speech recognition output, etc. • Large improvement will come after multiple sites start using similar techniques.
Grand Challenges • Tested in TDT • Operating with small amounts of training data for each category • 1 to 4 documents per event • Robustness to changes over time • adaptation • Multi-lingual domains • How to set threshold for filtering • Using model of ‘eventness’ • Large hierarchical category sets • How to use the structure • Effective use of prior knowledge • Predicting performance and characterizing classes • Need a task where both the discriminative and the LM approach will be tested.
What do you really want? • If a user provides a document about the 9/11 World Trade Center crash and says they want “more like this”, do they want documents about: • Airplane crashes • Terrorism • Building fires • Injuries and Death • Some combination of the above • In general, we need a way to clarify which combination of topics the user wants • In TDT, we predefine the task to mean we want more about this specific event (and not about some other terrorist airplane crash into a building).