Chapter 23: Probabilistic Language Models April 13, 2004
Corpus-Based Learning • Information Retrieval • Information Extraction • Machine Translation
23.1 Probabilistic Language Models • There are several advantages • Can be trained from data • Robust (accept any sentence) • Reflect the fact that not all speakers agree on which sentences are part of a language • Can be used for disambiguation
Unigram Model: P(w_i) • Bigram Model: P(w_i | w_{i-1}) • Trigram Model: P(w_i | w_{i-2}, w_{i-1})
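These n-gram estimates come straight from counts. A rough sketch of maximum-likelihood unigram and bigram estimates (the toy corpus and function names are invented for illustration):

```python
from collections import Counter

def bigram_model(corpus_tokens):
    """Estimate unsmoothed unigram and bigram probabilities from a list of tokens."""
    unigrams = Counter(corpus_tokens)
    bigrams = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    n = len(corpus_tokens)

    def p_unigram(w):
        return unigrams[w] / n

    def p_bigram(w, prev):
        # P(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1})
        return bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0

    return p_unigram, p_bigram

tokens = "the cat sat on the mat the cat ran".split()
p_uni, p_bi = bigram_model(tokens)
print(p_uni("the"))        # 3/9
print(p_bi("cat", "the"))  # 2/3
```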
Smoothing • Problem: many pairs (triples, etc.) of words never occur in the training text, so their raw counts are zero. • N: number of words in the corpus • B: number of possible bigrams • c: actual count of a given bigram • Add-One Smoothing: P = (c + 1) / (N + B)
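A minimal sketch of the add-one estimate above, assuming B is taken as V² for a vocabulary of size V (the helper name is made up):

```python
from collections import Counter

def add_one_bigram_prob(corpus_tokens, vocab_size=None):
    """Add-one smoothed bigram probability using the (c + 1) / (N + B) estimate."""
    bigram_counts = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    N = len(corpus_tokens)                      # words in the corpus
    V = vocab_size or len(set(corpus_tokens))   # vocabulary size
    B = V * V                                   # possible bigrams

    def prob(prev, w):
        c = bigram_counts[(prev, w)]            # actual count of this bigram
        return (c + 1) / (N + B)

    return prob

p = add_one_bigram_prob("the cat sat on the mat".split())
print(p("the", "cat"))   # seen bigram: (1 + 1) / (6 + 25)
print(p("mat", "cat"))   # unseen bigram still gets a nonzero probability
```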
Smoothing • Linear Interpolation Smoothing: P̂(w_i | w_{i-2}, w_{i-1}) = c_3 P(w_i | w_{i-2}, w_{i-1}) + c_2 P(w_i | w_{i-1}) + c_1 P(w_i), where c_1 + c_2 + c_3 = 1
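A sketch of the interpolation, assuming the three component models are supplied as functions and the weights are fixed by hand rather than learned:

```python
def interpolated_trigram(p_uni, p_bi, p_tri, c1=0.1, c2=0.3, c3=0.6):
    """Linear interpolation smoothing: mix unigram, bigram, and trigram estimates.

    Requires c1 + c2 + c3 = 1; the weights here are illustrative, not tuned.
    """
    assert abs(c1 + c2 + c3 - 1.0) < 1e-9

    def prob(w, w_minus2, w_minus1):
        # P_hat(w_i | w_{i-2}, w_{i-1})
        return c3 * p_tri(w, w_minus2, w_minus1) + c2 * p_bi(w, w_minus1) + c1 * p_uni(w)

    return prob
```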
Segmentation • The task is to find the word boundaries in a text with no spaces • P(“with”) = .2 • P(“out”) = .1 • P(“with out”) = .02 (unigram model) • P(“without”) = .05 • Figure 23.1, Viterbi-based segmentation algorithm
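A dynamic-programming sketch in the spirit of the Viterbi-based algorithm of Figure 23.1 (not a transcription of it); the unigram probabilities are supplied as a function, and the example reuses the numbers above:

```python
def segment(text, p_word, max_len=20):
    """Split an unspaced string into the most probable word sequence under a unigram model."""
    n = len(text)
    best = [0.0] * (n + 1)   # best[i] = probability of the best segmentation of text[:i]
    best[0] = 1.0
    back = [0] * (n + 1)     # back[i] = start index of the last word in that segmentation

    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            p = best[j] * p_word(text[j:i])
            if p > best[i]:
                best[i], back[i] = p, j

    # Recover the words by following the back-pointers.
    words, i = [], n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return list(reversed(words)), best[n]

probs = {"with": 0.2, "out": 0.1, "without": 0.05}
print(segment("without", lambda w: probs.get(w, 0.0)))
# (['without'], 0.05) -- beats 'with' + 'out' at 0.2 * 0.1 = 0.02
```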
Probabilistic CFG (PCFG) • N-Gram models have no notion of grammar at distances greater than n • Figure 23.2, PCFG example • Figure 23.3, PCFG parse • Problem: context-free • Problem: preference for short sentences
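To make the PCFG idea concrete, here is a toy sketch that scores a parse tree as the product of the probabilities of the rules it uses; the grammar fragment is invented for illustration and is not the grammar of Figure 23.2:

```python
# Each entry maps (parent, children) to a rule probability; invented fragment.
grammar = {
    ("S",    ("NP", "VP")):  1.0,
    ("NP",   ("Name",)):     0.4,
    ("NP",   ("Det", "N")):  0.6,
    ("VP",   ("V", "NP")):   0.7,
    ("VP",   ("V",)):        0.3,
    ("Name", ("John",)):     0.1,
    ("V",    ("sleeps",)):   0.2,
}

def tree_probability(tree):
    """Probability of a parse tree = product of the probabilities of all rules used.

    A tree is (label, [subtrees]); a leaf is just a string (a word). Lexical rules
    such as Name -> John are looked up in the same grammar table.
    """
    if isinstance(tree, str):
        return 1.0
    label, children = tree
    child_labels = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = grammar.get((label, child_labels), 0.0)
    for c in children:
        p *= tree_probability(c)
    return p

tree = ("S", [("NP", [("Name", ["John"])]), ("VP", [("V", ["sleeps"])])])
print(tree_probability(tree))   # 1.0 * 0.4 * 0.1 * 0.3 * 0.2 = 0.0024
```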
Learning PCFG Probabilities • Parsed Data: straightforward (count how often each rule is used in the parses) • Unparsed Data: two challenges • Learning the structure of the grammar rules. A Chomsky Normal Form bias can be used (X → Y Z, X → t). Something similar to SEQUITUR can be used. • Learning the probabilities associated with each rule (inside-outside algorithm, based on dynamic programming)
23.2 Information Retrieval • Components of IR System: • Document Collection • Query Posed in Query Language • Result Set • Presentation of Result Set
Boolean Keyword Model • Boolean queries • Each word in a document is treated as a boolean feature • Drawbacks • Each word contributes only a single bit of relevance (present or absent), so there is no ranking by degree of match • Boolean logic can be difficult to use correctly for the average user
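A minimal sketch of boolean keyword retrieval, treating each word as a yes/no feature of a document (the documents and helper name are invented):

```python
def boolean_search(documents, required, excluded=()):
    """Return the ids of documents containing every required word and no excluded word.

    There is no ranking of matches, which is one of the drawbacks listed above.
    """
    results = []
    for doc_id, text in documents.items():
        words = set(text.lower().split())
        if all(w in words for w in required) and not any(w in words for w in excluded):
            results.append(doc_id)
    return results

docs = {1: "probabilistic language models", 2: "boolean retrieval models", 3: "language and logic"}
print(boolean_search(docs, required=["language", "models"]))  # [1]
```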
General Framework • r: the event that the Boolean relevance variable is true (¬r: that it is false) • D: Document • Q: Query • P(r | D, Q) • Order results by decreasing probability
Language Modeling • P(r | D, Q)
= P(D, Q | r) P(r) / P(D, Q) (Bayes' rule)
= P(Q | D, r) P(D | r) P(r) / P(D, Q) (chain rule)
= P(Q | D, r) P(r | D) P(D) / P(D, Q) (Bayes' rule applied to D and r)
• Rather than compute this directly, maximize the odds ratio P(r | D, Q) / P(¬r | D, Q), so that P(D) and P(D, Q) cancel
Language Modeling • P(r | D, Q) / P(¬r | D, Q) = [P(Q | D, r) P(r | D)] / [P(Q | D, ¬r) P(¬r | D)] • Eliminate P(Q | D, ¬r): if a document is irrelevant to a query, then knowing the document won't help determine the query, so this factor does not depend on D • ∝ P(Q | D, r) P(r | D) / P(¬r | D)
Language Modeling • P(r | D) / P(¬r | D) is a query-independent measure of document quality. It can be estimated from references to the document, the recency of the document, etc. • P(Q | D, r) = ∏_j P(Q_j | D, r), where each Q_j is a word in the query • Figure 23.4.
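Putting the pieces together, a sketch of ranking by quality × ∏_j P(Q_j | D, r), with additive smoothing so that query words absent from a document don't zero out the score; the constants and names are illustrative, not from the book:

```python
from collections import Counter

def score(query, doc_text, quality=1.0, alpha=0.5, vocab_size=50000):
    """Unigram query-likelihood score: quality * product over query words of P(Q_j | D).

    'quality' stands in for the query-independent odds P(r | D) / P(¬r | D);
    additive (add-alpha) smoothing handles query words missing from the document.
    """
    counts = Counter(doc_text.lower().split())
    n = sum(counts.values())
    s = quality
    for qj in query.lower().split():
        s *= (counts[qj] + alpha) / (n + alpha * vocab_size)
    return s

docs = {"d1": "probabilistic models of language", "d2": "boolean keyword retrieval"}
ranked = sorted(docs, key=lambda d: score("language models", docs[d]), reverse=True)
print(ranked)  # ['d1', 'd2']
```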
Evaluating IR Systems • Precision. Proportion of documents in the result set that are actually relevant. • Recall. Proportion of relevant documents in the collection that appear in the result set. • Average Reciprocal Rank. • Time to Answer. Length of time for the user to find the desired answer.
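A sketch of how precision, recall, and reciprocal rank might be computed for a single query (the document ids are invented):

```python
def precision_recall(result_set, relevant):
    """Precision and recall of a result set against the set of relevant documents."""
    retrieved_relevant = len(set(result_set) & set(relevant))
    precision = retrieved_relevant / len(result_set) if result_set else 0.0
    recall = retrieved_relevant / len(relevant) if relevant else 0.0
    return precision, recall

def reciprocal_rank(ranked_results, relevant):
    """1 / rank of the first relevant document in the ranking (0 if none is retrieved)."""
    for rank, doc in enumerate(ranked_results, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

print(precision_recall([1, 2, 3, 4], relevant={2, 4, 7}))  # (0.5, 0.666...)
print(reciprocal_rank([1, 2, 3, 4], relevant={2, 4, 7}))   # 0.5
```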
IR Refinements • Stemming. Can help recall, can hurt precision. • Case Folding. • Synonyms. • Use a bigram model. • Spelling Corrections. • Metadata.
Result Sets • Relevance feedback from user. • Document classification. • Document clustering. • K-Means clustering • 1. Pick k documents at random as category seeds • 2. Assign every document to the closest category • 3. Compute the mean of each cluster and use these means as the new seeds. • 4. Go to step 2 until convergence occurs.
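A sketch of the four k-means steps above, assuming documents are already represented as fixed-length numeric vectors and using plain Euclidean distance:

```python
import random

def k_means(vectors, k, iterations=20, seed=0):
    """K-means clustering of document vectors following the four steps above."""
    rng = random.Random(seed)
    seeds = [list(v) for v in rng.sample(vectors, k)]          # 1. random seeds

    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    assignment = []
    for _ in range(iterations):
        # 2. assign every document to the closest seed
        new_assignment = [min(range(k), key=lambda c: dist(v, seeds[c])) for v in vectors]
        if new_assignment == assignment:                        # 4. stop at convergence
            break
        assignment = new_assignment
        # 3. recompute each cluster mean and use it as the new seed
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                seeds[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, seeds

docs = [[5, 0], [4, 1], [0, 6], [1, 5]]
print(k_means(docs, k=2)[0])   # e.g. [0, 0, 1, 1] (cluster labels)
```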
Implementing IR Systems • Lexicon. Given a word, return the location in the inverted index. Stop words are often omitted. • Inverted Index. Might be a list of (document, count) pairs.
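A sketch of building such an inverted index of (document, count) pairs, with an illustrative stop-word list:

```python
from collections import Counter, defaultdict

STOP_WORDS = {"the", "a", "of", "to", "and", "in"}   # illustrative stop-word list

def build_index(documents):
    """Build an inverted index: word -> list of (document id, count) pairs."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        counts = Counter(w for w in text.lower().split() if w not in STOP_WORDS)
        for word, count in counts.items():
            index[word].append((doc_id, count))
    return index

docs = {1: "the cat sat on the mat", 2: "the cat ran"}
index = build_index(docs)
print(index["cat"])   # [(1, 1), (2, 1)]
```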
Vector Space Model • Used more often in practice than the probabilistic model • Documents are represented as vectors of unigram word frequencies. • A query is represented as a vector consisting of 0s and 1s, e.g. [0 1 1 0 0].
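A sketch of the vector space model: documents as unigram frequency vectors, queries as 0/1 vectors, ranked by cosine similarity (vocabulary and texts are invented):

```python
import math
from collections import Counter

def vectorize(text, vocabulary):
    """Unigram word-frequency vector over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocabulary]

def cosine(a, b):
    """Cosine of the angle between two vectors, a common similarity measure in this model."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

vocabulary = ["boolean", "language", "models", "retrieval", "smoothing"]
doc = vectorize("language models and smoothing for language tasks", vocabulary)
query = [0, 1, 1, 0, 0]        # query vector of 0s and 1s, as above
print(doc, cosine(query, doc)) # [0, 2, 1, 0, 1], roughly 0.87
```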