Chapter 23: Probabilistic Language Models April 13, 2004
Corpus-Based Learning • Information Retrieval • Information Extraction • Machine Translation
23.1 Probabilistic Language Models • There are several advantages • Can be trained from data • Robust (accept any sentence) • Reflect the fact that not all speakers agree on which sentences are part of a language • Can be used for disambiguation
Unigram Model P(wi) • Bigram Model P(wi | wi-1) • Trigram Model P(wi | wi-2, wi-1)
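A minimal sketch of how these models might be estimated from counts (maximum-likelihood estimates over a toy corpus; the helper and example sentence are illustrative, not from the slides):

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count all n-grams (as tuples) in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

tokens = "the cat sat on the mat".split()
unigrams = ngram_counts(tokens, 1)
bigrams = ngram_counts(tokens, 2)

# Maximum-likelihood estimates
p_the = unigrams[("the",)] / len(tokens)                        # P(wi)
p_cat_given_the = bigrams[("the", "cat")] / unigrams[("the",)]  # P(wi | wi-1)
print(p_the, p_cat_given_the)
```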
Smoothing • Problem: many pairs (triples, etc.) of words never occur in the training text. • N: words in corpus • B: possible bigrams • c: actual count of bigram • Add-One Smoothing (c + 1) / (N + B)
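A sketch of the add-one estimate exactly as the slide defines it, with N the number of words in the corpus and B the number of possible bigrams (the `vocab_size ** 2` expansion of B is an assumption):

```python
def add_one_bigram_prob(bigram, bigram_counts, n_words, vocab_size):
    """Add-one (Laplace) smoothed bigram probability: (c + 1) / (N + B),
    where c is the observed bigram count, N the number of words in the
    corpus, and B the number of possible bigrams."""
    c = bigram_counts.get(bigram, 0)
    B = vocab_size ** 2
    return (c + 1) / (n_words + B)
```

Unseen bigrams (c = 0) now get a small nonzero probability instead of zero.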
Smoothing • Linear Interpolation Smoothing P̂(wi | wi-2, wi-1) = λ3 P(wi | wi-2, wi-1) + λ2 P(wi | wi-1) + λ1 P(wi) • λ1 + λ2 + λ3 = 1
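One way the interpolation could look in code; the fixed weights below are an arbitrary illustration (in practice they would be tuned on held-out data), and the dictionaries of pre-estimated probabilities are assumptions:

```python
def interpolated_prob(w, prev1, prev2, p_uni, p_bi, p_tri, lambdas=(0.1, 0.3, 0.6)):
    """Linear interpolation of unigram, bigram, and trigram estimates.
    prev1 is wi-1, prev2 is wi-2; the weights must sum to 1."""
    l1, l2, l3 = lambdas
    return (l3 * p_tri.get((prev2, prev1, w), 0.0)
            + l2 * p_bi.get((prev1, w), 0.0)
            + l1 * p_uni.get(w, 0.0))
```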
Segmentation • The task is to find the word boundaries in a text with no spaces • P(“with”) = .2 • P(“out”) = .1 • P(“with out”) = .02 (unigram model) • P(“without”) = .05 • Figure 23.1, Viterbi-based segmentation algorithm
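The slides reference Figure 23.1 for the Viterbi-based algorithm; here is a minimal dynamic-programming sketch in the same spirit (a unigram model, with only the word probabilities taken from the slide):

```python
def segment(text, unigram_prob):
    """Most probable segmentation of a string with no spaces, under a
    unigram model. best[i] is the probability of the best segmentation
    of text[:i]; backpointer[i] records where that last word starts."""
    n = len(text)
    best = [0.0] * (n + 1)
    best[0] = 1.0
    backpointer = [0] * (n + 1)
    for i in range(1, n + 1):
        for j in range(i):
            p = best[j] * unigram_prob.get(text[j:i], 0.0)
            if p > best[i]:
                best[i], backpointer[i] = p, j
    # Recover the word sequence by following backpointers.
    words, i = [], n
    while i > 0:
        words.append(text[backpointer[i]:i])
        i = backpointer[i]
    return list(reversed(words)), best[n]

print(segment("without", {"with": 0.2, "out": 0.1, "without": 0.05}))
# (['without'], 0.05) -- "without" beats "with out" (0.02), matching the slide
```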
Probabilistic CFG (PCFG) • N-Gram models have no notion of grammar at distances greater than n • Figure 23.2, PCFG example • Figure 23.3, PCFG parse • Problem: context-free • Problem: preference for short sentences
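The defining property of a PCFG is that a parse tree's probability is the product of the probabilities of the rules it uses. A small sketch, with a hypothetical (symbol, children) tree encoding and made-up rule probabilities (the figures 23.2/23.3 grammars are not reproduced here):

```python
def tree_probability(tree, rule_prob):
    """Probability of a parse tree under a PCFG: the product of the
    probabilities of every rule used in the tree. A nonterminal node is
    (symbol, children); a terminal word is a plain string."""
    if isinstance(tree, str):          # terminal word, contributes nothing
        return 1.0
    symbol, children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = rule_prob[(symbol, rhs)]
    for child in children:
        p *= tree_probability(child, rule_prob)
    return p

# Toy grammar fragment (illustrative probabilities only)
rules = {("S", ("NP", "VP")): 0.9,
         ("NP", ("dogs",)): 0.1,
         ("VP", ("bark",)): 0.2}
tree = ("S", [("NP", ["dogs"]), ("VP", ["bark"])])
print(tree_probability(tree, rules))   # 0.9 * 0.1 * 0.2 = 0.018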
Learning PCFG Probabilities • Parsed Data: straightforward • Unparsed Data: two challenges • Learning the structure of the grammar rules. A Chomsky Normal Form bias can be used (X → Y Z, X → t). Something similar to SEQUITUR can be used. • Learning the probabilities associated with each rule (inside-outside algorithm, based on dynamic programming)
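For the parsed-data case, "straightforward" means the rule probabilities are just normalized rule counts from the treebank. A minimal sketch, reusing the same hypothetical tree encoding as the PCFG sketch above:

```python
from collections import Counter, defaultdict

def rule_probabilities(parsed_trees):
    """MLE rule probabilities from parsed data: count how often each rule
    is used, then normalize per left-hand-side symbol."""
    counts = Counter()

    def collect(tree):
        if isinstance(tree, str):
            return
        symbol, children = tree
        rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
        counts[(symbol, rhs)] += 1
        for child in children:
            collect(child)

    for t in parsed_trees:
        collect(t)
    totals = defaultdict(int)
    for (lhs, _), c in counts.items():
        totals[lhs] += c
    return {rule: c / totals[rule[0]] for rule, c in counts.items()}
```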
23.2 Information Retrieval • Components of IR System: • Document Collection • Query Posed in Query Language • Result Set • Presentation of Result Set
Boolean Keyword Model • Boolean queries • Each word in a document is treated as a boolean feature • Drawbacks • Each word is a single bit of relevance • Boolean logic can be difficult to use correctly for the average user
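Treating each word as a Boolean feature makes query evaluation trivial; a tiny sketch with an implicit-AND query format chosen only for illustration:

```python
def matches(required_words, document_words):
    """True if every query keyword (Boolean feature) is present in the document."""
    return all(word in document_words for word in required_words)

doc = set("probabilistic language models can be trained from data".split())
print(matches(["language", "models"], doc))     # True
print(matches(["language", "retrieval"], doc))  # False -- no partial credit
```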
General Framework • r: the event that the document is relevant (the Boolean relevance variable has the value true); ¬r: the event that it is irrelevant • D: Document • Q: Query • P(r | D, Q) • Order results by decreasing probability
Language Modeling • P(r | D, Q) • = P(D, Q | r) P(r) / P(D, Q)   (Bayes' rule) • = P(Q | D, r) P(D | r) P(r) / P(D, Q)   (chain rule) • = P(Q | D, r) P(r | D) P(D) / P(D, Q)   (Bayes' rule again, with D fixed) • Rank documents by maximizing the odds P(r | D, Q) / P(¬r | D, Q)
Language Modeling • = P(Q | D, r) P(r | D) / [P(Q | D, ¬r) P(¬r | D)]   (the P(D) / P(D, Q) factors cancel) • Eliminate P(Q | D, ¬r): if a document is irrelevant to a query, then knowing the document won't help determine the query, so this factor does not depend on D • = P(Q | D, r) P(r | D) / P(¬r | D)
Language Modeling • P(r | D) / P(¬r | D) is a query-independent measure of document quality. It can be estimated from references to the document, the recency of the document, etc. • P(Q | D, r) = ∏j P(Qj | D, r), where Qj is the j-th word of the query. • Figure 23.4.
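A sketch of how the product ∏j P(Qj | D, r) might be computed; estimating each word probability by mixing the document's word frequency with the collection frequency is a common smoothing choice and an assumption here, not something stated on the slides:

```python
from collections import Counter

def query_likelihood(query, doc_tokens, collection_counts, collection_size, mu=0.5):
    """Sketch of P(Q | D, r) as a product of per-word probabilities.
    Each P(Qj | D, r) mixes the document frequency with the collection
    frequency; the weight mu is an illustrative assumption."""
    doc_counts = Counter(doc_tokens)
    score = 1.0
    for w in query.split():
        p_doc = doc_counts[w] / len(doc_tokens)
        p_coll = collection_counts.get(w, 0) / collection_size
        score *= mu * p_doc + (1 - mu) * p_coll
    return score
```

Documents would then be ranked by this likelihood multiplied by the query-independent quality term P(r | D) / P(¬r | D).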
Evaluating IR Systems • Precision. Proportion of documents in the result set that are actually relevant. • Recall. Proportion of relevant documents in the collection that appear in the result set. • Average Reciprocal Rank. • Time to Answer. Length of time for the user to find the desired answer.
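A small sketch of the first three measures for a single query (average reciprocal rank would average the per-query reciprocal rank over a query set); the function and argument names are illustrative:

```python
def evaluate(result_ids, relevant_ids):
    """Precision, recall, and reciprocal rank for one ranked result set.
    result_ids: ranked list returned by the system; relevant_ids: the
    documents actually relevant to the query."""
    relevant = set(relevant_ids)
    retrieved_relevant = [d for d in result_ids if d in relevant]
    precision = len(retrieved_relevant) / len(result_ids) if result_ids else 0.0
    recall = len(retrieved_relevant) / len(relevant) if relevant else 0.0
    # Reciprocal rank: 1 / rank of the first relevant document (0 if none).
    rr = 0.0
    for rank, d in enumerate(result_ids, start=1):
        if d in relevant:
            rr = 1.0 / rank
            break
    return precision, recall, rr
```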
IR Refinements • Stemming. Can help recall, can hurt precision. • Case Folding. • Synonyms. • Use a bigram model. • Spelling Corrections. • Metadata.
Result Sets • Relevance feedback from the user. • Document classification. • Document clustering. • K-Means clustering • 1. Pick k documents at random as category seeds • 2. Assign every document to the closest category • 3. Compute the mean of each cluster and use these means as the new seeds • 4. Go to step 2 until convergence occurs.
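A minimal sketch of the K-means steps above, for documents already represented as numeric vectors; a fixed iteration cap stands in for a real convergence test, and squared Euclidean distance is an illustrative choice:

```python
import random

def k_means(doc_vectors, k, iterations=20):
    """K-means clustering of document vectors: pick k documents as seeds,
    assign each document to the closest seed, recompute each cluster's
    mean, and repeat."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    seeds = random.sample(doc_vectors, k)          # step 1: random seeds
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in doc_vectors:                      # step 2: assign to closest seed
            i = min(range(k), key=lambda i: dist(v, seeds[i]))
            clusters[i].append(v)
        for i, cluster in enumerate(clusters):     # step 3: recompute means
            if cluster:
                seeds[i] = [sum(xs) / len(cluster) for xs in zip(*cluster)]
    return seeds, clusters
```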
Implementing IR Systems • Lexicon. Given a word, return the location in the inverted index. Stop words are often omitted. • Inverted Index. Might be a list of (document, count) pairs.
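A toy sketch of the (document, count) inverted index described above; stop-word removal and the lexicon's location lookup are omitted, and the dict-of-token-lists input format is an assumption:

```python
from collections import Counter, defaultdict

def build_index(documents):
    """Inverted index: for each word, a list of (document id, count) pairs.
    documents maps a document id to its token list."""
    index = defaultdict(list)
    for doc_id, tokens in documents.items():
        for word, count in Counter(tokens).items():
            index[word].append((doc_id, count))
    return index

index = build_index({"d1": "the cat sat".split(),
                     "d2": "the cat and the dog".split()})
print(index["the"])   # [('d1', 1), ('d2', 2)]
```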
Vector Space Model • Used more often in practice than the probabilistic model • Documents are represented as vectors of unigram word frequencies. • A query is represented as a vector consisting of 0s and 1s, e.g. [0 1 1 0 0].
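A sketch of scoring in this representation: the 0/1 query vector selects the document's frequencies for the query words via a dot product; normalizing by vector lengths (cosine similarity) is a common choice, though the slide does not specify a particular normalization:

```python
import math

def score(query_vector, doc_vector):
    """Cosine similarity between a 0/1 query vector and a document's
    unigram word-frequency vector."""
    dot = sum(q * d for q, d in zip(query_vector, doc_vector))
    norm = (math.sqrt(sum(q * q for q in query_vector))
            * math.sqrt(sum(d * d for d in doc_vector)))
    return dot / norm if norm else 0.0

print(score([0, 1, 1, 0, 0], [2, 3, 0, 1, 5]))
```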