Language Modeling Again • So are we smooth now? • Courtesy of Chris Jordan
So what did we talk about last week? • Language models represent documents as multinomial distributions • What is a multinomial? • The Maximum Likelihood Estimate calculates the document model • What is the Maximum Likelihood Estimate? • Smoothing document models
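As a quick refresher, here is a minimal sketch of a maximum-likelihood unigram document model; the tokenization and example sentence are made up purely for illustration:

```python
from collections import Counter

def mle_unigram_model(document_tokens):
    """Maximum Likelihood Estimate: p(t|d) = count(t, d) / |d|."""
    counts = Counter(document_tokens)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}

doc = "the quick brown fox jumps over the lazy dog".split()
model = mle_unigram_model(doc)
print(model["the"])            # 2/9 -- seen terms get their relative frequency
print(model.get("cat", 0.0))   # 0.0 -- unseen terms get zero, which is why we smooth
```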
Why is smoothing so important? • Maximum Likelihood Estimate gives 0 probabilities • Why is that an issue? • What does smoothing do? • What types of smoothing are there?
Challenge questions • What is common to every smoothing technique that we have covered? • What does smoothing really do? • Does it make for a more accurate document model? • Does it replace the need for more data?
A Study of Smoothing Methods for Language Models Applied to Ad Hoc Information Retrieval • Thoughts? • What is Additive? • What is Interpolation? • What is Backoff?
Laplace / Additive Smoothing • Just increases every raw term frequency by a constant • Is that representative of the document model? • How hard is this to implement? • What happens if the constant added is really large?
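A hedged sketch of additive (Laplace) smoothing over an MLE model; delta = 1 gives classic Laplace, and the vocabulary argument is an assumed stand-in for the corpus vocabulary:

```python
from collections import Counter

def additive_smoothing(document_tokens, vocabulary, delta=1.0):
    """p(t|d) = (count(t, d) + delta) / (|d| + delta * |V|)."""
    counts = Counter(document_tokens)
    doc_len = sum(counts.values())
    denom = doc_len + delta * len(vocabulary)
    return {term: (counts.get(term, 0) + delta) / denom for term in vocabulary}
```

Note that as delta grows very large, the observed counts stop mattering and every document model flattens toward the uniform distribution over the vocabulary.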
Interpolation • Jelinek-Mercer: ps(t) = λp(t|d) + (1 − λ)p(t|corpus) • Dirichlet • Anyone know what this is? • Remember Gaussian? Poisson? Beta? Gamma? • Beta is a distribution over binomial parameters • Dirichlet is the analogous distribution over multinomial parameters
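A minimal Jelinek-Mercer sketch that interpolates the document MLE with a corpus (background) model; the mixing weight λ is an assumed, untuned parameter:

```python
def jelinek_mercer(p_doc, p_corpus, lam=0.5):
    """p_s(t) = lam * p(t|d) + (1 - lam) * p(t|corpus)."""
    terms = set(p_doc) | set(p_corpus)
    return {t: lam * p_doc.get(t, 0.0) + (1 - lam) * p_corpus.get(t, 0.0)
            for t in terms}
```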
Dirichlet / Absolute Discounting • What does Absolute Discounting do? • How is it different from Laplace? From Jelinek-Mercer? • What is the key difference between the λ in Jelinek-Mercer and the δ in Dirichlet and Absolute Discounting? • δ determines how much probability mass is subtracted from seen terms and added to unseen ones
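Sketches of Dirichlet and absolute-discounting smoothing in their usual formulations (as in the Zhai & Lafferty study above); μ and δ are assumed illustrative values, and the corpus model is assumed to cover the full vocabulary:

```python
from collections import Counter

def dirichlet_smoothing(document_tokens, p_corpus, mu=2000):
    """p(t|d) = (count(t, d) + mu * p(t|corpus)) / (|d| + mu)."""
    counts = Counter(document_tokens)
    doc_len = sum(counts.values())
    return {t: (counts.get(t, 0) + mu * p_corpus.get(t, 0.0)) / (doc_len + mu)
            for t in p_corpus}

def absolute_discounting(document_tokens, p_corpus, delta=0.7):
    """Subtract a fixed delta from every seen count; the freed mass goes to the corpus model."""
    counts = Counter(document_tokens)
    doc_len = sum(counts.values())
    sigma = delta * len(counts) / doc_len   # probability mass redistributed to unseen terms
    return {t: max(counts.get(t, 0) - delta, 0) / doc_len + sigma * p_corpus.get(t, 0.0)
            for t in p_corpus}
```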
Back off • What is the idea here? • Do not pad the probability of seen terms • Any idea why this doesn't work? • The seen terms have their probabilities decreased • Too much smoothing?
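A hedged back-off sketch: seen terms keep a discounted document estimate, and only unseen terms fall back to a scaled corpus model. The discount value and the normalizer alpha are illustrative choices, and δ is assumed to be below 1:

```python
from collections import Counter

def backoff(document_tokens, p_corpus, delta=0.7):
    """Seen terms: discounted p(t|d). Unseen terms: alpha * p(t|corpus)."""
    counts = Counter(document_tokens)
    doc_len = sum(counts.values())
    seen = {t: (c - delta) / doc_len for t, c in counts.items()}
    reserved = 1.0 - sum(seen.values())                       # mass freed by the discount
    unseen_mass = sum(p for t, p in p_corpus.items() if t not in counts)
    alpha = reserved / unseen_mass if unseen_mass > 0 else 0.0
    model = dict(seen)
    model.update({t: alpha * p for t, p in p_corpus.items() if t not in counts})
    return model
```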
Pause… Review • Why do we smooth? • Does smoothing make sense? • What is Laplace? • What is Jelinek-Mercer? • What is Dirichlet smoothing? • What is Absolute Discounting? • What is Back off?
Let’s beat this horse some more! • Does everyone know what mean average precision is? • Let’s have a look at the results • Are these really improvements? • What does an increase of .05 precision really mean? • Will that matter to the user?
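For reference, a small sketch of average precision for a single query (mean average precision is just the mean of this value over all queries); the ranking and relevance judgements below are made up:

```python
def average_precision(ranked_doc_ids, relevant_ids):
    """Average of precision@k over the ranks k at which relevant documents appear."""
    hits, precisions = 0, []
    for k, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / len(relevant_ids) if relevant_ids else 0.0

print(average_precision(["d3", "d1", "d7", "d2"], {"d1", "d2"}))  # (1/2 + 2/4) / 2 = 0.5
```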
And now we come full circle • What is a real performance improvement? • Cranfield paradigm evaluation • Corpus • Queries • Qrels • User trials • Satisfaction • Effectiveness • Efficiency
Cluster Based Smoothing • What will clustering give us? • Cluster the corpus • Find clusters for each document • Mixture model now involves • Document model • Cluster model • Corpus model • Some performance gains • Significant but not so special
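A minimal sketch of the three-component mixture behind cluster-based smoothing; the weights are illustrative placeholders, not tuned values:

```python
def cluster_mixture(p_doc, p_cluster, p_corpus, w_doc=0.6, w_cluster=0.3, w_corpus=0.1):
    """p(t) = w_doc * p(t|d) + w_cluster * p(t|cluster(d)) + w_corpus * p(t|corpus)."""
    terms = set(p_doc) | set(p_cluster) | set(p_corpus)
    return {t: w_doc * p_doc.get(t, 0.0)
               + w_cluster * p_cluster.get(t, 0.0)
               + w_corpus * p_corpus.get(t, 0.0)
            for t in terms}
```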
Relevance Modeling • Blind Relevance Feedback approach • Top documents in the result set used as feedback • A language model is constructed from these top ranked documents for each query • This model is used as the relevance model for probabilistic retrieval
On the topic of Blind Relevance Feedback • How can we use Relative Entropy here? • Derive a model that minimizes the relative entropy with respect to the top-ranked documents • Does Relevance Modeling make sense? • Does using Relative Entropy make sense?
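A sketch of relative entropy (KL divergence) between a relevance model built from the top-ranked documents and a smoothed document model; the tiny floor on the document probabilities is an assumption to guard against zeros:

```python
import math

def kl_divergence(p_relevance, p_document):
    """KL(R || D) = sum_t p_R(t) * log(p_R(t) / p_D(t)); lower means a closer match."""
    return sum(p * math.log(p / p_document.get(t, 1e-12))
               for t, p in p_relevance.items() if p > 0)
```

Ranking documents by increasing KL(R || D) is one standard way to plug a relevance model into retrieval.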
The big assumption • Top ranked documents are a good source of relevant text • This obviously is not always true • There is a lot of noise • Are the top-ranked documents representative of the relevant set? • Relevance modeling and Relative Entropy BRF approaches have been shown to improve performance • But not really…
Review • What is average precision? • What is the Cranfield paradigm? • What alternative sources can be used for smoothing? • Does Blind Relevance Feedback make sense? • Why does it work?
You have been a good class • We have covered • Language Modeling for ad-hoc document retrieval • Unigram model • Maximum Likelihood Estimate • Smoothing Techniques • Different mixture models • Blind Relevance Feedback for Language Modeling
Questions for you • Why do we work with the unigram model? • Why is smoothing important? • How does a language model represent a document? • What is interpolation?
Another application of language modeling • Unsupervised Morphological Analysis • A morpheme is a basic unit of meaning in a language, e.g. pretested: pre - test - ed • English is a relatively easy language • Turkish, Finnish, and German are agglutinative • Very hard
Morfessor • All terms in the vocabulary are candidate morphemes • Terms are recursively split • Build up the candidate morpheme set • Repeatedly analyze the whole vocabulary until the candidate morpheme set can no longer be improved
Swordfish • Ngram based unsupervised morpheme analyzer • Character Ngrams • Substrings • A language model is constructed over all ngrams of all lengths • Maximum Likelihood Estimate • Terms are recursively split based on the likelihood of the ngrams
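A very simplified sketch of the recursive-splitting idea (not the published Swordfish implementation): split a term at the point where the two resulting character ngrams are jointly most likely under the ngram model, and recurse while splitting beats keeping the string whole:

```python
def split_term(term, ngram_prob, min_len=2):
    """Recursively split a term wherever the product of ngram probabilities beats the whole string."""
    best_score = ngram_prob.get(term, 0.0)
    best_split = None
    for i in range(min_len, len(term) - min_len + 1):
        left, right = term[:i], term[i:]
        score = ngram_prob.get(left, 0.0) * ngram_prob.get(right, 0.0)
        if score > best_score:
            best_score, best_split = score, (left, right)
    if best_split is None:
        return [term]
    return (split_term(best_split[0], ngram_prob, min_len)
            + split_term(best_split[1], ngram_prob, min_len))

# e.g. split_term("pretested", probs) could yield ["pre", "test", "ed"] given suitable ngram probabilities
```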
Swordfish Results • Reasonable Results • Character ngrams are useful in finding morphemes • All morphemes are ngrams but not all ngrams are morphemes • The most prominent ngrams appear to be morphemes • How one defines prominent is an open question • Check out the PASCAL Morpho-Challenge