
Efficient Probabilistic Language Model for Document Retrieval

Using a Naïve Bayes approach, this probabilistic language-model document retrieval system classifies queries and ranks documents by posterior probability. Proper smoothing is essential: experiments show that linear interpolation smoothing outperforms Laplace smoothing. Larger-scale TREC experiments indicate the LM approach works slightly better than vector-space approaches, and exploiting an inverted index keeps retrieval efficient.


Presentation Transcript


  1. Probabilistic Language-Model Based Document Retrieval

  2. Naïve Bayes for Retrieval Naïve Bayes can also be used for ad-hoc document retrieval. Treat each of the n documents as a category with only one training example, the document itself. Classify queries using this n-way categorization. Rank documents based on the posterior probability of their category. For historical reasons, this is called the “language model” (LM) approach.
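
A minimal sketch of this ranking in Python, assuming a toy in-memory corpus (document names and contents below are hypothetical). Each document is scored by the log-probability that its unigram "urn" generated the query, under a uniform prior over documents:

```python
import math
from collections import Counter

# Toy corpus; document names and contents are hypothetical.
docs = {
    "doc1": "this test urn has many words".split(),
    "doc2": "this urn holds other words".split(),
}

def log_posterior(query_words, doc_words):
    """Log P(doc | query) up to a constant, assuming a uniform prior
    over documents and an unsmoothed unigram model of each document."""
    counts = Counter(doc_words)
    n = len(doc_words)
    total = 0.0
    for w in query_words:
        if counts[w] == 0:
            return float("-inf")  # unseen word: zero probability
        total += math.log(counts[w] / n)
    return total

query = "test urn".split()
ranking = sorted(docs, key=lambda d: log_posterior(query, docs[d]), reverse=True)
print(ranking)  # doc1 contains both query words, so it ranks first
```

With no smoothing, a single query word absent from a document drives that document's probability to zero, which is why the following slides turn to smoothing.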

  3. Generative Model for Retrieval [Figure: each document is modeled as an urn of words ("this test urn has many words"), and the query is scored by how likely each urn is to have generated it. Ranked retrievals: Doc 1 (0.3), Doc 6 (0.28), Doc 5 (0.18), Doc 2 (0.1), Doc 3 (0.08), Doc 4 (0.06).]

  4. Smoothing Proper smoothing is important for this approach to work well. Laplace smoothing does not work well for this application. Better to use linear interpolation for smoothing.

  5. Linear Interpolation Smoothing Estimate conditional probabilities as a mixture of conditioned and unconditioned estimates: P(Xi | Y) = λ·P̂(Xi | Y) + (1 − λ)·P̂(Xi), where P̂(Xi | Y) is the probability of drawing word Xi from the urn of words in category (i.e., document) Y, and P̂(Xi) is the probability of drawing word Xi from the urn of words in the entire corpus (i.e., all document urns combined into one big urn).
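
A minimal sketch of this mixture estimate (function and argument names are hypothetical), assuming word counts for the individual document and for the combined corpus are available:

```python
def interpolated_prob(word, doc_counts, doc_len, corpus_counts, corpus_len, lam):
    """Mixture of the conditioned (document) and unconditioned (corpus)
    maximum-likelihood estimates: lam * P^(Xi | Y) + (1 - lam) * P^(Xi)."""
    p_doc = doc_counts.get(word, 0) / doc_len            # P^(Xi | Y): document urn
    p_corpus = corpus_counts.get(word, 0) / corpus_len   # P^(Xi): corpus urn
    return lam * p_doc + (1 - lam) * p_corpus
```

Because the corpus term is nonzero for any word that appears somewhere in the collection, a document no longer receives zero probability for query words it happens to lack.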

  6. Amount of Smoothing Value of λ controls the amount of smoothing. The lower λ is, the more smoothing there is, since the unconditioned term is weighted higher (1 − λ). Setting λ properly is important for good performance. Set λ manually, or automatically by maximizing performance on a development set of queries. Low λ tends to work better for long queries, high λ for short queries.
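
As a hypothetical sketch of the automatic option, λ can be chosen by grid search over development queries; `evaluate` below is an assumed hook that runs retrieval with a given λ and returns a quality score (e.g., mean average precision):

```python
def tune_lambda(dev_queries, evaluate, grid=(0.1, 0.3, 0.5, 0.7, 0.9)):
    """Return the lambda from `grid` that maximizes performance on the
    development queries; `evaluate` is an assumed scoring function."""
    return max(grid, key=lambda lam: evaluate(dev_queries, lam))
```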

  7. Experimental Results on CF Corpus: Effect of λ Parameter Given long queries, a large amount of smoothing (λ = 0.1) seems to work best.

  8. Experimental Results on CF Corpus: Effect of Laplace Smoothing Parameter

  9. Experimental Results on CF Corpus: Comparison of Smoothing Methods and VSR Laplace smoothing does much worse. Linear interpolation does about the same as vector-space retrieval.

  10. Performance of Language Model Approach Larger-scale TREC experiments demonstrate that the LM approach with proper smoothing works slightly better than a well-tuned vector-space approach. Need to make the LM approach efficient by exploiting the inverted index. Don't bother to compute the probability of documents that do not contain any of the query words (see the sketch below).
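
A minimal sketch of that optimization (names are hypothetical): build an inverted index mapping each word to the documents containing it, then score only the union of the query words' posting lists.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, words in docs.items():
        for w in words:
            index[w].add(doc_id)
    return index

def candidate_docs(query_words, index):
    """Union of posting lists: the only documents worth scoring."""
    result = set()
    for w in query_words:
        result |= index.get(w, set())
    return result
```

Under interpolated smoothing, every skipped document would collapse to the same background-only score, so omitting them cannot change the ranking among documents that do match the query.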
