
Text Models


Presentation Transcript


  1. Text Models

  2. Why? • To “understand” text • To assist in text search & ranking • For autocompletion • Part of Speech Tagging

  3. Simple application: spelling suggestions • Say that we have a dictionary of words • Real dictionary or the result of crawling • Sentences instead of words • Now we are given a word w not in the dictionary • How can we correct it to something in the dictionary?

  4. String editing • Given two strings (sequences) the “distance” between the two strings is defined by the minimum number of “character edit operations” needed to turn one sequence into the other. • Edit operations: delete, insert, modify (a character) • Cost assigned to each operation (e.g. uniform =1 )

  5. Edit distance • Already a simple model for languages • Modeling the creation of strings (and errors in them) through simple edit operations

  6. Distance between strings • Edit distance between strings = minimum number of edit operations that can be used to get from one string to the other • Symmetric because of the particular choice of edit operations and uniform cost • distance(“Willliam Cohon”, “William Cohen”) = 2

  7. Finding the edit distance • An “alignment” problem • Deciding how to align the two strings • Can we try all alignments? • How many (reasonable) options are there?

  8. Dynamic Programming • An umbrella name for a collection of algorithms • Main idea: reuse computation for sub-problems, combined in different ways

  9. Example: Fibonacci

         def fib(n):
             if n == 0 or n == 1:
                 return n
             else:
                 return fib(n-1) + fib(n-2)

     Exponential time!

  10. Fib with Dynamic Programming

          table = {}  # memo table: n -> fib(n)

          def fib(n):
              if n in table:  # dict.has_key() no longer exists in Python 3
                  return table[n]
              if n == 0 or n == 1:
                  value = n
              else:
                  value = fib(n-1) + fib(n-2)
              table[n] = value  # store the result so it is computed only once
              return value

  11. Using a partial solution • Partial solution: • Alignment of s up to location i, with t up to location j • How to reuse? • Try all options for the “last” operation

  12. Base case: D(i,0) = i, D(0,i) = i (i insertions / deletions) • Easy to generalize to arbitrary cost functions!
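
The partial-solution idea and the base case above can be sketched as a table-filling routine. This is a minimal illustration assuming uniform cost 1 for every operation; the function name `edit_distance` and the table name `D` are chosen here for the example.

```python
def edit_distance(s, t):
    """Minimum number of single-character deletes, inserts and
    modifications needed to turn s into t (uniform cost 1)."""
    # D[i][j] = distance between the first i chars of s and the first j of t.
    D = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(len(s) + 1):
        D[i][0] = i  # base case: delete all i characters of s
    for j in range(len(t) + 1):
        D[0][j] = j  # base case: insert all j characters of t
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            modify = 0 if s[i - 1] == t[j - 1] else 1
            # Try all options for the "last" operation and keep the cheapest.
            D[i][j] = min(
                D[i - 1][j] + 1,           # delete s[i-1]
                D[i][j - 1] + 1,           # insert t[j-1]
                D[i - 1][j - 1] + modify,  # match or modify
            )
    return D[len(s)][len(t)]


print(edit_distance("kitten", "sitting"))  # 3
```

The table has (|s|+1) x (|t|+1) cells and each is filled in constant time, so the running time is proportional to |s| * |t| instead of the number of alignments. Replacing the constants 1 with per-operation costs gives the generalization to arbitrary cost functions.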

  13. Models • Bag-of-words • N-grams • Hidden Markov Models • Probabilistic Context Free Grammar

  14. Bag-of-words • Every document is represented as a bag of the words it contains • Bag means that we keep the multiplicity (=number of occurrences) of each word • Very simple, but we lose all track of structure
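
A bag of words maps naturally onto a multiset. The sketch below uses Python's `collections.Counter` and assumes a naive tokenizer (lowercase, split on whitespace); the helper name `bag_of_words` is illustrative.

```python
from collections import Counter


def bag_of_words(text):
    """Represent a document as a multiset of its (lowercased) words."""
    return Counter(text.lower().split())


bag = bag_of_words("the dog chased the cat")
print(bag["the"])  # 2: multiplicity is kept

# Word order is lost: these two documents get the same representation.
print(bag_of_words("dog bites man") == bag_of_words("man bites dog"))  # True
```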

  15. n-grams • Limited structure • Sliding window of n words
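
The sliding window can be sketched in a few lines. The function name `ngrams` is chosen for this example, and the input is assumed to be an already tokenized list of words.

```python
def ngrams(tokens, n):
    """Return every window of n consecutive tokens, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = "the sailor dogs me every day".split()
print(ngrams(sentence, 2))
```

A sentence of length L yields L - n + 1 windows, and none at all if it is shorter than n.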

  16. n-gram model

  17. How would we infer the probabilities? • Issues: • Overfitting • Probability 0

  18. How would we infer the probabilities? • Maximum Likelihood:

  19. "add-one" (Laplace) smoothing • V = Vocabulary size
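
The formulas on these slides did not survive in the transcript, so the following is a sketch of the standard bigram estimates rather than a reproduction of the slides: maximum likelihood divides the bigram count by the count of its context, and add-one smoothing adds 1 to every bigram count and V to the denominator. The toy corpus and the helper names `p_ml` and `p_add_one` are assumptions made for the example.

```python
from collections import Counter

tokens = "the dog barks the dog sleeps the cat sleeps".split()
bigram_counts = Counter(zip(tokens, tokens[1:]))
context_counts = Counter(tokens[:-1])  # how often each word starts a bigram
vocab = set(tokens)                    # V = len(vocab)


def p_ml(word, prev):
    """Maximum-likelihood estimate of P(word | prev)."""
    return bigram_counts[(prev, word)] / context_counts[prev]


def p_add_one(word, prev):
    """Add-one (Laplace) estimate of P(word | prev)."""
    return (bigram_counts[(prev, word)] + 1) / (context_counts[prev] + len(vocab))


print(p_ml("barks", "the"))       # 0.0: unseen bigram gets probability 0
print(p_add_one("barks", "the"))  # 0.125: smoothing gives it some mass
```

The unseen bigram illustrates the "probability 0" issue: under maximum likelihood any sentence containing it would be assigned probability zero.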

  20. Good-Turing Estimate

  21. Good-Turing
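
The slide content for Good-Turing is missing from the transcript. As a hedged sketch of the basic (unsmoothed) estimator: an item seen r times gets the adjusted count r* = (r + 1) * N(r+1) / N(r), where N(r) is the number of distinct items seen exactly r times, and the total probability reserved for unseen items is N(1) / N. The function name and the fallback to the raw count when N(r+1) = 0 are choices made for this example.

```python
from collections import Counter


def good_turing(counts):
    """Return (adjusted_counts, p_unseen) for a dict item -> raw count."""
    total = sum(counts.values())
    n = Counter(counts.values())  # n[r] = number of items seen exactly r times
    adjusted = {}
    for r in sorted(n):
        if n[r + 1] > 0:
            adjusted[r] = (r + 1) * n[r + 1] / n[r]
        else:
            adjusted[r] = float(r)  # no data for r + 1: keep the raw count
    return adjusted, n[1] / total


counts = {"a": 3, "b": 2, "c": 2, "d": 1, "e": 1, "f": 1}
adjusted, p_unseen = good_turing(counts)
print(adjusted)
print(p_unseen)  # 0.3
```

Practical implementations usually smooth the N(r) values first, because they become sparse and noisy for large r.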

  22. More than a fixed n: Linear Interpolation
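
The slide's formula is not preserved in the transcript. In its usual form, linear interpolation mixes the maximum-likelihood estimates of several n-gram orders; for a trigram model this can be written as below, where the weights $\lambda_i$ are symbols introduced here and are typically tuned on held-out data.

```latex
P_{\text{interp}}(w_i \mid w_{i-2}, w_{i-1})
  = \lambda_3 \, P_{\text{ML}}(w_i \mid w_{i-2}, w_{i-1})
  + \lambda_2 \, P_{\text{ML}}(w_i \mid w_{i-1})
  + \lambda_1 \, P_{\text{ML}}(w_i),
\qquad \lambda_1 + \lambda_2 + \lambda_3 = 1, \quad \lambda_i \ge 0
```

Because the unigram term is nonzero for every word seen in training, the mixture is nonzero even when the trigram itself was never observed.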

  23. Precision vs. Recall

  24. Richer Models • HMM • PCFG

  25. Motivation: Part-of-Speech Tagging • Useful for ranking • For machine translation • Word-Sense Disambiguation • …

  26. Part-of-Speech Tagging • Tag this word. This word is a tag. • He dogs like a flea • The can is in the fridge • The sailor dogs me every day

  27. A Learning Problem • Training set: tagged corpus • Most famous is the Brown Corpus with about 1M words • The goal is to learn a model from the training set, and then perform tagging of untagged text • Performance tested on a test-set

  28. Simple Algorithm • Assign to each word its most popular tag in the training set • Problem: Ignores context • “Dogs” and “tag” will always be tagged as nouns… • “Can” will be tagged as a verb • Still, achieves around 80% correctness for real-life test-sets • Goes up to as high as 90% when combined with some simple rules
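
The baseline can be sketched as follows. The three-sentence training set, the tag names, and the default tag for unknown words are all made up for illustration; a real experiment would use a tagged corpus such as Brown.

```python
from collections import Counter, defaultdict


def train_baseline(tagged_sentences):
    """Map each word to the tag it carries most often in the training set."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def tag(words, model, default="NOUN"):
    """Tag each word independently of its context."""
    return [(w, model.get(w.lower(), default)) for w in words]


training = [
    [("the", "DET"), ("can", "NOUN"), ("is", "VERB"), ("full", "ADJ")],
    [("she", "PRON"), ("can", "VERB"), ("swim", "VERB")],
    [("they", "PRON"), ("can", "VERB"), ("run", "VERB")],
]
model = train_baseline(training)
print(tag("The can is in the fridge".split(), model))
```

Because "can" is a verb in two of its three training occurrences, it is tagged as a verb here even though the context calls for a noun, which is exactly the weakness the slide points out.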

  29. Hidden Markov Model (HMM) • Model: sentences are generated by a probabilistic process • In particular, a Markov Chain whose states correspond to Parts-of-Speech • Transitions are probabilistic • In each state a word is output • The output word is again chosen probabilistically based on the state

  30. HMM • HMM is: • A set of N states • A set of M symbols (words) • An N×N matrix of transition probabilities $P_{\text{trans}}$ • A vector of size N of initial state probabilities $P_{\text{start}}$ • An N×M matrix of emission probabilities $P_{\text{out}}$ • “Hidden” because we see only the outputs, not the sequence of states traversed
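
To make the definition concrete, here is a toy HMM with N = 2 states and M = 3 words, together with the generative process it describes. All numbers, the state names "N" and "V", and the helper names are invented for this sketch.

```python
import random

states = ["N", "V"]                      # hidden states (noun, verb)
p_start = {"N": 0.7, "V": 0.3}           # initial state probabilities
p_trans = {"N": {"N": 0.4, "V": 0.6},    # p_trans[k][k2] = P(next = k2 | current = k)
           "V": {"N": 0.8, "V": 0.2}}
p_out = {"N": {"dogs": 0.5, "bark": 0.1, "cats": 0.4},   # emission probabilities
         "V": {"dogs": 0.1, "bark": 0.8, "cats": 0.1}}


def draw(dist, rng):
    """Sample one key of dist according to its probability."""
    keys = list(dist)
    return rng.choices(keys, weights=[dist[k] for k in keys])[0]


def generate(length, rng):
    """Run the chain for `length` steps; return (hidden states, output words)."""
    hidden, words = [], []
    state = draw(p_start, rng)
    for _ in range(length):
        hidden.append(state)
        words.append(draw(p_out[state], rng))
        state = draw(p_trans[state], rng)
    return hidden, words


hidden, words = generate(5, random.Random(0))
print(words)   # what an observer sees
print(hidden)  # what stays hidden
```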

  31. Example

  32. 3 Fundamental Problems • 1) Compute the probability of a given observation sequence (= sentence) • 2) Given an observation sequence, find the most likely hidden state sequence (this is tagging) • 3) Given a training set, find the model that would make the observations most likely

  33. Tagging • Find the most likely sequence of states that led to an observed output sequence • Problem: exponentially many possible sequences!

  34. Viterbi Algorithm • Dynamic Programming • $V_{t,k}$ is the probability of the most probable state sequence • Generating the first $t+1$ observations $(X_0, \dots, X_t)$ • And terminating at state $k$

  35. Viterbi Algorithm • Dynamic Programming • $V_{t,k}$ is the probability of the most probable state sequence • Generating the first $t+1$ observations $(X_0, \dots, X_t)$ • And terminating at state $k$ • $V_{0,k} = P_{\text{start}}(k) \cdot P_{\text{out}}(k, X_0)$ • $V_{t,k} = P_{\text{out}}(k, X_t) \cdot \max_{k'} \{ V_{t-1,k'} \cdot P_{\text{trans}}(k', k) \}$

  36. Finding the path • Note that we are interested in the most likely path, not only in its probability • So we need to keep track at each point of the argmax • Combine them to form a sequence • What about top-k?
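
The recurrence and the argmax bookkeeping can be sketched together. The two-state model below is a toy with invented numbers, and the function signature is an assumption of this example; probabilities are multiplied directly, which is fine for short sentences but would call for log-probabilities on long ones.

```python
states = ["N", "V"]
p_start = {"N": 0.7, "V": 0.3}
p_trans = {"N": {"N": 0.4, "V": 0.6}, "V": {"N": 0.8, "V": 0.2}}
p_out = {"N": {"dogs": 0.5, "bark": 0.1, "cats": 0.4},
         "V": {"dogs": 0.1, "bark": 0.8, "cats": 0.1}}


def viterbi(obs, states, p_start, p_trans, p_out):
    """Return (most likely state sequence, its joint probability with obs)."""
    # V[t][k]: probability of the best state sequence emitting obs[0..t], ending in k.
    V = [{k: p_start[k] * p_out[k][obs[0]] for k in states}]
    back = []  # back[t-1][k]: best predecessor of state k at time t
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for k in states:
            best_prev = max(states, key=lambda j: V[t - 1][j] * p_trans[j][k])
            V[t][k] = p_out[k][obs[t]] * V[t - 1][best_prev] * p_trans[best_prev][k]
            back[t - 1][k] = best_prev
    # Follow the stored argmax pointers backwards from the best final state.
    last = max(states, key=lambda k: V[-1][k])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, V[-1][last]


path, prob = viterbi(["dogs", "bark"], states, p_start, p_trans, p_out)
print(path)  # ['N', 'V']
```

Each of the T time steps examines every pair of states once, which is where a cost quadratic in the number of states comes from.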

  37. Complexity • O(T*|S|^2) • Where T is the sequence (=sentence) length, |S| is the number of states (= number of possible tags)

  38. Computing the probability of a sequence • Forward probabilities: $\alpha_t(k)$ is the probability of seeing the sequence $X_0 \dots X_t$ and terminating at state $k$ • Backward probabilities: $\beta_t(k)$ is the probability of seeing the sequence $X_{t+1} \dots X_n$ given that the Markov process is at state $k$ at time $t$

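  39. Computing the probabilities • Forward algorithm • $\alpha_0(k) = P_{\text{start}}(k) \cdot P_{\text{out}}(k, X_0)$ • $\alpha_t(k) = P_{\text{out}}(k, X_t) \cdot \sum_{k'} \alpha_{t-1}(k') \cdot P_{\text{trans}}(k', k)$ • $P(X_0, \dots, X_n) = \sum_k \alpha_n(k)$ • Backward algorithm • $\beta_t(k) = P(X_{t+1}, \dots, X_n \mid \text{state at time } t \text{ is } k)$ • $\beta_t(k) = \sum_{k'} P_{\text{trans}}(k, k') \cdot P_{\text{out}}(k', X_{t+1}) \cdot \beta_{t+1}(k')$ • $\beta_n(k) = 1$ for all $k$ • $P(X_0, \dots, X_n) = \sum_k P_{\text{start}}(k) \cdot P_{\text{out}}(k, X_0) \cdot \beta_0(k)$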
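
Both recursions can be checked against each other on a small example, since they must give the same sequence probability. The model is a toy with invented numbers and the function names are illustrative. Note that the backward total has to include the emission of the first word, because the backward probability at time 0 only covers the words after it.

```python
states = ["N", "V"]
p_start = {"N": 0.7, "V": 0.3}
p_trans = {"N": {"N": 0.4, "V": 0.6}, "V": {"N": 0.8, "V": 0.2}}
p_out = {"N": {"dogs": 0.5, "bark": 0.1, "cats": 0.4},
         "V": {"dogs": 0.1, "bark": 0.8, "cats": 0.1}}


def forward(obs):
    """P(obs) via forward probabilities alpha_t(k)."""
    alpha = {k: p_start[k] * p_out[k][obs[0]] for k in states}
    for x in obs[1:]:
        alpha = {k: p_out[k][x] * sum(alpha[j] * p_trans[j][k] for j in states)
                 for k in states}
    return sum(alpha.values())


def backward(obs):
    """P(obs) via backward probabilities beta_t(k)."""
    beta = {k: 1.0 for k in states}  # beta at the last position
    for x_next in reversed(obs[1:]):
        beta = {k: sum(p_trans[k][j] * p_out[j][x_next] * beta[j] for j in states)
                for k in states}
    # beta now refers to time 0; add the start and first-emission terms.
    return sum(p_start[k] * p_out[k][obs[0]] * beta[k] for k in states)


sentence = ["dogs", "bark"]
print(forward(sentence), backward(sentence))  # both approximately 0.1892
```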

  40. Learning the HMM probabilities • Expectation-Maximization Algorithm • 1. Start with initial probabilities • 2. Compute $E_{ij}$, the expected number of transitions from i to j while generating a sequence, for each i, j (see next) • 3. Set the probability of transition from i to j to be $E_{ij} / \sum_k E_{ik}$ • 4. Similarly for the emission probabilities • 5. Repeat 2-4 using the new model, until convergence

  41. Estimating the expectations • By sampling • Re-run a random execution of the model 100 times • Count transitions • By analysis • Use Bayes' rule on the formula for sequence probability • Called the Forward-backward algorithm

  42. Accuracy • Tested experimentally • Exceeds 96% for the Brown corpus • Trained on half and tested on the other half • Compare with the 80-90% by the trivial algorithm • The hard cases are few but are very hard

  43. NLTK • http://www.nltk.org/ • Natural Language ToolKit • Open source Python modules for NLP tasks • Including stemming, POS tagging and much more

  44. Context Free Grammars • Context Free Grammars are a more natural model for Natural Language • Syntax rules are very easy to formulate using CFGs • Provably more expressive than Finite State Machines • E.g. Can check for balanced parentheses

  45. Context Free Grammars • Non-terminals • Terminals • Production rules • V → w where V is a non-terminal and w is a sequence of terminals and non-terminals

  46. Context Free Grammars • Can be used as acceptors • Can be used as a generative model • Similarly to the case of Finite State Machines • How long can a string generated by a CFG be?

  47. Stochastic Context Free Grammar • Non-terminals • Terminals • Production rules associated with probability • V → w where V is a non-terminal and w is a sequence of terminals and non-terminals
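
Used as a generative model, a stochastic CFG repeatedly expands non-terminals by sampling a production according to its probability. The grammar below is a made-up toy, and representing rules as (probability, right-hand side) pairs is just one possible encoding.

```python
import random

# Non-terminals are the dict keys; every other symbol is a terminal.
grammar = {
    "S":  [(1.0, ["NP", "VP"])],
    "NP": [(0.5, ["the", "dog"]), (0.3, ["the", "sailor"]), (0.2, ["NP", "PP"])],
    "VP": [(0.6, ["sleeps"]), (0.4, ["dogs", "NP"])],
    "PP": [(1.0, ["in", "the", "fridge"])],
}


def generate(symbol, rng):
    """Expand `symbol` into a list of terminals by sampling productions."""
    if symbol not in grammar:
        return [symbol]  # terminal: emit as is
    rules = grammar[symbol]
    _, rhs = rng.choices(rules, weights=[p for p, _ in rules])[0]
    words = []
    for s in rhs:
        words.extend(generate(s, rng))
    return words


print(" ".join(generate("S", random.Random(0))))
```

The recursive rule NP → NP PP means there is no upper bound on the length of a generated string, although long strings become increasingly unlikely because the rule is chosen with probability 0.2 each time.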
