CPSC 7373: Artificial Intelligence
Lecture 13: Natural Language Processing
Jiang Bian, Fall 2012
University of Arkansas at Little Rock
Natural Language Processing • Understanding natural languages: • Philosophically: We humans have defined ourselves in terms of our ability to speak with and understand each other. • Application-wise: We want to be able to talk to computers. • Learning: We want computers to become smarter and to learn human knowledge from textbooks and other text.
Language Models • Two types of language models: • Probabilistic: represented as a sequence of letters/words; the probability of a sequence, P(word1, word2, …); mostly word-based; learned from data. • Logical: trees and abstract structure of words; L = {S1, S2, …}; abstraction via trees/categories; hand-coded. • (Example parse tree: S → NP VP, with NP → Name → "Sam" and VP → Verb → "slept", for the sentence "Sam slept".)
Bag of Words • A bag rather than a sequence. • Unigram, Naïve Bayes model: each individual word is treated as a separate factor, unrelated to (conditionally independent of) all the other words. • It is also possible to take the sequence into account. • (Illustration: the words "IF YOU LOVE THE BAG OF WORDS MODEL HONK" shown scattered in a bag; word order is lost.)
Probabilistic Models • P(w1 w2 w3 … wn) = P(w1:n) = ∏i P(wi | w1:i-1) (chain rule) • Markov Assumption: • the effect of one variable on another is local; • the i-th word depends only on its previous k words: • P(wi | w1:i-1) ≈ P(wi | wi-k:i-1) • For a first-order Markov model: P(wi | wi-1) • Stationary Assumption: • the conditional probabilities are the same at every position; • i.e., a word's probability depends only on its neighboring words, not on where in the text the sentence occurs: • P(wi | wi-1) = P(wj | wj-1)
Applications of Language Models • Classification (e.g., spam) • Clustering (e.g., news stories) • Input correction (spelling, segmentation) • Sentiment analysis (e.g., product reviews) • Information retrieval (e.g., web search) • Question answering (e.g., IBM’s Watson) • Machine translation (e.g., Chinese to English) • Speech recognition (e.g., Apple’s Siri)
N-gram Model • An n-gram is a contiguous sequence of n items from a given sequence of text or speech. • Language Models (LM) • Unigrams, Bigrams, Trigrams… • Applications: • Speech recognition / data compression • Predicting the next word • Information Retrieval • Retrieved documents are ranked based on the probability of the query under the document's language model • P(Q | Md)
N-gram examples • S = “I saw the red house” • Unigram: • P(S) = P(I, saw, the, red, house) = P(I)P(saw)P(the)P(red)P(house) • Bigram – Markov assumption • P(S) = P(I|<s>)P(saw|I)P(the|saw)P(red|the)P(house|red)P(</s>|house) • Trigram: • P(S) = P(I|<s>, <s>)P(saw|<s>, I)P(the|I, saw)P(red|saw, the)P(house|the, red)P(</s>|red, house)
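The unigram and bigram factorizations above are easy to turn into code. Below is a minimal sketch that multiplies bigram probabilities along a sentence padded with <s> and </s>; the probability table is made up for illustration, not estimated from a real corpus.

# Minimal sketch of the bigram factorization P(S) = prod_i P(w_i | w_{i-1}).
# The probability table below is illustrative, not estimated from a real corpus.
from functools import reduce

bigram_prob = {
    ("<s>", "I"): 0.25, ("I", "saw"): 0.1, ("saw", "the"): 0.3,
    ("the", "red"): 0.05, ("red", "house"): 0.4, ("house", "</s>"): 0.2,
}

def sentence_probability(words, table):
    """Multiply P(w_i | w_{i-1}) over the sentence, padded with <s> and </s>."""
    padded = ["<s>"] + words + ["</s>"]
    probs = [table.get((prev, cur), 0.0) for prev, cur in zip(padded, padded[1:])]
    return reduce(lambda a, b: a * b, probs, 1.0)

print(sentence_probability("I saw the red house".split(), bigram_prob))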
How do we train these models? • Very large corpora: collections of text and speech • Shakespeare • Brown Corpus • Wall Street Journal • AP newswire • Hansards • TIMIT • DARPA/NIST text/speech corpora (Call Home, Call Friend, ATIS, Switchboard, Broadcast News, Broadcast Conversation, TDT, Communicator) • TRAINS, Boston Radio News Corpus
A Simple Bigram Example • Estimate the likelihood of the sentence I want to eat Chinese food. • P(I want to eat Chinese food) = P(I | <start>) P(want | I) P(to | want) P(eat | to) P(Chinese | eat) P(food | Chinese) P(<end>|food) • What do we need to calculate these likelihoods? • Bigram probabilities for each word pair sequence in the sentence • Calculated from a large corpus
Early Bigram Probabilities from BERP
Eat on .16        Eat Thai .03
Eat some .06      Eat breakfast .03
Eat lunch .06     Eat in .02
Eat dinner .05    Eat Chinese .02
Eat at .04        Eat Mexican .02
Eat a .04         Eat tomorrow .01
Eat Indian .04    Eat dessert .007
Eat today .03     Eat British .001
<start> I .25     Want some .04
<start> I'd .06   Want Thai .01
<start> Tell .04  To eat .26
<start> I'm .02   To have .14
I want .32        To spend .09
I would .29       To be .02
I don't .08       British food .60
I have .04        British restaurant .15
Want to .65       British cuisine .01
Want a .05        British lunch .01
P(I want to eat British food) = P(I|<start>) P(want|I) P(to|want) P(eat|to) P(British|eat) P(food|British) = .25 × .32 × .65 × .26 × .001 × .60 ≈ .0000081 • Suppose P(<end>|food) = .2? • How would we calculate P(I want to eat Chinese food)? • The probabilities roughly capture "syntactic" facts and "world knowledge": • eat is often followed by an NP • British food is not too popular • N-gram models can be trained by counting and normalization
Early BERP Bigram Counts (rows = first word, columns = second word)
           I     Want   To    Eat   Chinese  Food  Lunch
I          8     1087   0     13    0        0     0
Want       3     0      786   0     6        8     6
To         3     0      10    860   3        0     12
Eat        0     0      2     0     19       2     52
Chinese    2     0      0     0     0        120   1
Food       19    0      17    0     0        0     0
Lunch      4     0      0     0     0        1     0
Early BERP Bigram Probabilities
Unigram counts: I 3437, Want 1215, To 3256, Eat 938, Chinese 213, Food 1506, Lunch 459
• Normalization: divide each row's bigram counts by the appropriate unigram count for wn-1
• Computing the bigram probability of I I: C(I, I) / C(I in all contexts)
• P(I|I) = 8 / 3437 = .0023
• Maximum Likelihood Estimation (MLE): relative frequency
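A minimal sketch of this normalization step: the MLE bigram probability is the bigram count divided by the unigram count of the first word. The counts below are just the subset of BERP counts shown in the tables above.

# MLE bigram estimation: P(w2 | w1) = C(w1, w2) / C(w1).
# Counts are the subset of BERP counts shown in the tables above.
unigram_counts = {"i": 3437, "want": 1215, "to": 3256, "eat": 938,
                  "chinese": 213, "food": 1506, "lunch": 459}
bigram_counts = {("i", "i"): 8, ("i", "want"): 1087, ("want", "to"): 786,
                 ("to", "eat"): 860, ("eat", "chinese"): 19, ("chinese", "food"): 120}

def mle_bigram(w1, w2):
    """Relative-frequency (maximum likelihood) estimate of P(w2 | w1)."""
    return bigram_counts.get((w1, w2), 0) / unigram_counts[w1]

print(round(mle_bigram("i", "i"), 4))      # 0.0023
print(round(mle_bigram("i", "want"), 4))   # 0.3163 (≈ .32)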
What do we learn about the language? • What's being captured with ... • P(want | I) = .32 • P(to | want) = .65 • P(eat | to) = .26 • P(food | Chinese) = .56 • P(lunch | eat) = .055 • What about... • P(I | I) = .0023 • P(I | want) = .0025 • P(I | food) = .013
P(I | I) = .0023: "I I I I want …" • P(I | want) = .0025: "I want I want …" • P(I | food) = .013: "the kind of food I want is …"
Approximating Shakespeare • Generating sentences with random unigrams... • Every enter now severally so, let • Hill he late speaks; or! a more to leg less first you enter • With bigrams... • What means, sir. I confess she? then all sorts, he is trim, captain. • Why dost stand forth thy canopy, forsooth; he is this palpable hit the King Henry. • Trigrams • Sweet prince, Falstaff shall die. • This shall forbid it should be branded, if renown made it empty.
Quadrigrams • What! I will go seek the traitor Gloucester. • Will you not tell me who I am? • What's coming out here looks like Shakespeare because it is Shakespeare • Note: As we increase the value of N, the accuracy of an n-gram model increases, since choice of next word becomes increasingly constrained
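A hedged sketch of the generation experiment above: train a bigram table from a corpus and repeatedly sample the next word given the previous one. The toy corpus here stands in for Shakespeare or the Wall Street Journal.

# Sketch: sample sentences from a bigram model, as in the Shakespeare experiment.
# The tiny training text is a stand-in for a real corpus (Shakespeare, WSJ, ...).
import random
from collections import defaultdict

def train_bigrams(tokens):
    model = defaultdict(list)
    for prev, cur in zip(["<s>"] + tokens, tokens + ["</s>"]):
        model[prev].append(cur)
    return model

def generate(model, max_len=20):
    word, out = "<s>", []
    while len(out) < max_len:
        word = random.choice(model[word])   # sample the next word given the previous one
        if word == "</s>":
            break
        out.append(word)
    return " ".join(out)

toy_corpus = "sweet prince falstaff shall die . this shall forbid it".split()
print(generate(train_bigrams(toy_corpus)))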
N-Gram Training Sensitivity • If we repeated the Shakespeare experiment but trained our n-grams on a Wall Street Journal corpus, what would we get? • Note: This question has major implications for corpus selection or design
Probabilistic Letter Models • The probability of a sequence of letters. • What can we do with letter models? • Language identification
Language Identification Bigram Model:
Language Identification Trigram Model:
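Following the bigram/trigram letter models above, here is a minimal language-identification sketch: score the new text under each language's smoothed letter-bigram model and pick the best. The tiny training strings are stand-ins for real corpora, and add-alpha smoothing with an assumed alphabet size is one simple choice among many.

# Sketch: identify a language by scoring text under per-language letter bigram models.
# The tiny "corpora" are stand-ins; real systems train on much larger text.
import math
from collections import Counter

def letter_bigram_model(text):
    text = "^" + text.lower() + "$"
    return Counter(zip(text, text[1:]))

def log_prob(text, model, alpha=1.0, vocab=30):
    """Add-alpha smoothed log probability of the text's letter bigrams."""
    total = sum(model.values())
    text = "^" + text.lower() + "$"
    return sum(math.log((model[bg] + alpha) / (total + alpha * vocab * vocab))
               for bg in zip(text, text[1:]))

corpora = {"EN": "hello world this is english text",
           "DE": "hallo welt dies ist deutscher text"}
models = {lang: letter_bigram_model(txt) for lang, txt in corpora.items()}
query = "this is a new text"
print(max(models, key=lambda lang: log_prob(query, models[lang])))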
Classification • Naïve Bayes, k-Nearest Neighbor, Support Vector Machine, Logistic Regression • … or the gzip command???
Gzip • EN • Hello world! • This is a file full of English words… • AZ • Salam Dunya! • Bu fayl Azərbaycan tam sözlər… • DE • Hallo Welt! • Dies ist eine Datei voll von deutschen Worte… • This is a new piece of text to be classified:
(echo `cat new EN | gzip | wc -c` EN; \
 echo `cat new DE | gzip | wc -c` DE; \
 echo `cat new AZ | gzip | wc -c` AZ) \
| sort -n | head -1
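A Python equivalent of the shell one-liner above, same idea: concatenate the new text with each language's training file, compress, and pick the language that compresses best. The file names EN, DE, AZ follow the slide and are assumed to hold the training text.

# Python version of the gzip trick above: concatenate the new text with each
# language's training file, compress, and pick the language with the smallest
# compressed size. File names EN, DE, AZ follow the slide and are assumptions.
import gzip

def compressed_size(data: bytes) -> int:
    return len(gzip.compress(data))

def classify(new_text: str, training_files=("EN", "DE", "AZ")) -> str:
    sizes = {}
    for lang in training_files:
        with open(lang, "rb") as f:
            sizes[lang] = compressed_size(f.read() + new_text.encode("utf-8"))
    return min(sizes, key=sizes.get)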
Segmentation • Given a sequence of words, how to break it up into meaningful segments. • e.g., 羽西中国新锐画家大奖 • Written English has spaces in between words: • e.g., words have spaces • Speech Recognition • URL: choosespain.com • Choose Spain • Chooses pain
Segmentation • The best segmentation is the one that maximizes the joint probability of the segmentation: • S* = argmax P(w1:n) = argmax ∏i P(wi | w1:i-1) • Markov assumption: • S* ≈ argmax ∏i P(wi | wi-1) • Naïve Bayes assumption (words don't depend on each other): • S* ≈ argmax ∏i P(wi)
Segmentation • "nowisthetime": 12 letters • How many possible segmentations? • n-1 • (n-1)^2 • (n-1)! • 2^(n-1) (the answer: each of the n-1 gaps is either a break or not) • Naïve Bayes assumption: • S* = argmax over S = f + r of P(f) P(S*(r)), where f is the first word and r is the rest • 1) Computationally easy • 2) Learning is easier: it's easier to estimate the unigram probabilities
Best Segmentation • S* = argmax over S = f + r of P(f) P(S*(r)) • "nowisthetime"
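A sketch of this recursion with memoization, in the spirit of Norvig's segmenter: try every first-word/rest split, score P(f) times the best probability of the rest, and keep the maximum. The unigram table is a toy stand-in; a real model would use counts from a large corpus and a length-based penalty for unknown words.

# Sketch of S* = argmax_{S = f + r} P(f) * P(S*(r)) with memoization.
# The unigram table is a toy stand-in for counts from a large corpus.
from functools import lru_cache

unigram = {"now": 1e-4, "is": 1e-3, "the": 2e-2, "time": 1e-4, "no": 1e-3, "wis": 1e-9}

def p_word(w):
    # Tiny probability for unknown words; real systems penalize unknowns by length.
    return unigram.get(w, 1e-20)

@lru_cache(maxsize=None)
def segment(text):
    """Return (probability, list of words) of the best segmentation of `text`."""
    if not text:
        return 1.0, []
    candidates = []
    for i in range(1, len(text) + 1):
        first, rest = text[:i], text[i:]
        p_rest, words_rest = segment(rest)
        candidates.append((p_word(first) * p_rest, [first] + words_rest))
    return max(candidates, key=lambda c: c[0])

print(segment("nowisthetime")[1])   # expected: ['now', 'is', 'the', 'time']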
Segmentation Examples • Trained on a 4-billion-word corpus • e.g., • baseratesoughtto → "base rate sought to" (should be "base rates ought to") • smallandinsignificant → "small and in significant" (should be "small and insignificant") • ginormousego → "g in or mouse go" (should be "ginormous ego") • What to do to improve? More data??? Relax the Markov assumption??? Smoothing???
Spelling Correction • Given a misspelled word w, find the best correction c: • c* = argmax_c P(c | w) • Bayes' theorem: c* = argmax_c P(w | c) P(c) • P(c): estimated from data counts (language model) • P(w | c): estimated from spelling-correction data (error model)
Spelling Data • c:w => P(w|c) • pulse: pluse • elegant: elagent, elligit • second: secand, sexeon, secund, seconnd, seond, sekon • sailed: saled, saild • blouse: boludes • thunder: thounder • cooking: coking, chocking, kooking, cocking • fossil: fosscil • We cannot enumerate every possible misspelling, so we also use letter-based (edit) models, e.g., • ul:lu
Correction Example • w = "thew" => rank candidate corrections c by P(w|c) P(c)
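A minimal noisy-channel sketch for this example: generate candidate corrections within one edit of w and rank them by P(w|c) P(c). Both probability tables below are toy stand-ins for values learned from spelling data and a corpus.

# Noisy-channel spelling correction sketch: c* = argmax_c P(w | c) P(c).
# Both probability tables are toy stand-ins for values learned from data.
import string

p_c = {"the": 0.05, "thaw": 1e-5, "threw": 1e-4, "thew": 1e-7}          # language model P(c)
p_w_given_c = {("thew", "the"): 1e-4, ("thew", "thaw"): 1e-3,
               ("thew", "threw"): 1e-4, ("thew", "thew"): 0.9}          # error model P(w | c)

def edits1(word):
    """All strings one insert/delete/replace/transpose away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in string.ascii_lowercase]
    inserts = [a + c + b for a, b in splits for c in string.ascii_lowercase]
    return set(deletes + transposes + replaces + inserts) | {word}

def correct(w):
    candidates = [c for c in edits1(w) if c in p_c]
    return max(candidates, key=lambda c: p_w_given_c.get((w, c), 1e-9) * p_c[c])

print(correct("thew"))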
Sentence Structure • P(Fed raises interest rates) = ??? • (Two parse trees for "Fed raises interest rates": in one, "Fed" is the subject NP, "raises" the verb, and "interest rates" the object NP; in the other, "Fed raises" is the subject NP, "interest" the verb, and "rates" the object NP.)
Context Free Grammar Parsing • Sentence structure trees are constructed according to grammar. • A grammar is a list of rules: e.g., • S -> NP VP • NP -> N | D (determiners: e.g., the, a) N | NN | NNN (mortgage interest rates), etc. • VP -> V NP | V | V NP NP (e.g., give me the money) • N -> interest | Fed | rates | raises • V -> interest | rates | raises • D -> the | a
Ambiguity • How many parsing options do I have for each of these sentences? • The Fed raises interest rates • The Fed raises raises • Raises raises interest raises
Ambiguity • How many parsing options do I have? • The Fed raises interest rates (2) • The Fed (NP) raises (V) interest rates (NP) • The Fed raises (NP) interest (V) rates (NP) • The Fed raises raises (1) • The Fed (NP) raises (V) raises (NP) • Raises raises interest raises (4) • Raises (NP) raises (V) interest raises (NP) • Raises (NP) raises (V) interest (NP) raises (NP) • Raises raises (NP) interest (V) raises (NP) • Raises raises interest (NP) raises (V)
Problems and Solutions Problems: Solutions:
Problems of writing grammars • Natural languages are messy, unorganized things that have evolved through human history in a variety of contexts. • It is hard to specify a set of grammar rules that covers all possibilities without introducing errors. • Ambiguity is the "enemy"…
Probabilistic Context-Free Grammar • S -> NP VP (1) • NP -> N (.3) | D N (.4) | N N (.2) | N N N (.1) • VP -> V NP (.4) | V (.4) | V NP NP (.2) • N -> interest (.3) | Fed (.3) | rates (.3) | raises (.1) • V -> interest (.1) | rates (.3) | raises (.6) • D -> the (.5) | a (.5)
Probabilistic Context-Free Grammar • S -> NP VP (1) • NP -> N (.3) | D N (.4) | N N (.2) | N N N (.1) • VP -> V NP (.4) | V (.4) | V NP NP (.2) • N -> interest (.3) | Fed (.3) | rates (.3) | raises (.1) • V -> interest (.1) | rates (.3) | raises (.6) • D -> the (.5) | a (.5)
Parse of "Fed raises interest rates" with Fed (N) as subject NP, raises as V, and interest rates (N N) as object NP:
P(tree) = 1 (S -> NP VP) × .3 (NP -> N) × .3 (Fed) × .4 (VP -> V NP) × .6 (raises) × .2 (NP -> N N) × .3 (interest) × .3 (rates) = 0.0003888 ≈ 0.039%
Probabilistic Context-Free Grammar • S -> NP VP (1) • NP -> N (.3) | D N (.4) | N N (.2) | N N N (.1) • VP -> V NP (.4) | V (.4) | V NP NP (.2) • N -> interest (.3) | Fed (.3) | rates (.3) | raises (.1) • V -> interest (.1) | rates (.3) | raises (.6) • D -> the (.5) | a (.5)
Two parses of "Raises raises interest rates": • Raises (NP -> N) raises (V) interest rates (NP -> N N): P() = ???% • Raises raises (NP -> N N) interest (V) rates (NP -> N): P() = ???%
Probabilistic Context-Free Grammar • S -> NP VP (1) • NP -> N (.3) | D N (.4) | N N (.2) | N N N (.1) • VP -> V NP (.4) | V (.4) | V NP NP (.2) • N -> interest (.3) | Fed (.3) | rates (.3) | raises (.1) • V -> interest (.1) | rates (.3) | raises (.6) • D -> the (.5) | a (.5)
Two parses of "Raises raises interest rates": • Raises (NP -> N) raises (V) interest rates (NP -> N N): P = 1 × .3 × .1 × .4 × .6 × .2 × .3 × .3 = 0.0001296 ≈ .013% • Raises raises (NP -> N N) interest (V) rates (NP -> N): P = 1 × .2 × .1 × .1 × .4 × .1 × .3 × .3 = 0.0000072 = .00072%
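The tree probabilities above are just products of rule probabilities. A small sketch that computes them for nested-tuple trees, using the grammar from these slides:

# Sketch: the probability of a parse tree under a PCFG is the product of the
# probabilities of all rules used in it. Grammar values follow the slides above.
rule_prob = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("N",)): 0.3, ("NP", ("N", "N")): 0.2,
    ("VP", ("V", "NP")): 0.4, ("VP", ("V", "NP", "NP")): 0.2,
    ("N", ("Fed",)): 0.3, ("N", ("interest",)): 0.3,
    ("N", ("rates",)): 0.3, ("N", ("raises",)): 0.1,
    ("V", ("raises",)): 0.6, ("V", ("interest",)): 0.1,
}

def tree_prob(tree):
    """tree = (label, child1, child2, ...); leaves are plain strings."""
    if isinstance(tree, str):
        return 1.0
    label, *children = tree
    child_labels = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = rule_prob[(label, child_labels)]
    for child in children:
        p *= tree_prob(child)
    return p

# Parse of "Fed raises interest rates" from the earlier slide:
fed = ("S", ("NP", ("N", "Fed")),
            ("VP", ("V", "raises"), ("NP", ("N", "interest"), ("N", "rates"))))
print(tree_prob(fed))   # 0.0003888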
Statistical Parsing • Where do these probabilities come from? • Training on a large annotated corpus • e.g., the Penn Treebank Project (1990): the Penn Treebank Project annotates naturally occurring text for linguistic structure. • S -> NP VP (1) • NP -> N (.3) | D N (.4) | N N (.2) | N N N (.1) • VP -> V NP (.4) | V (.4) | V NP NP (.2) • N -> interest (.3) | Fed (.3) | rates (.3) | raises (.1) • V -> interest (.1) | rates (.3) | raises (.6) • D -> the (.5) | a (.5)
The Penn Treebank Project • ( (S • (NP-SBJ (NN Stock-market) (NNS tremors) ) • (ADVP-TMP (RB again) ) • (VP (VBD shook) • (NP (NN bond) (NNS prices) ) • (, ,) • (SBAR-TMP (IN while) • (S • (NP-SBJ (DT the) (NN dollar) ) • (VP (VBD turned) • (PRT (RP in) ) • (NP-PRD (DT a) (VBN mixed) (NN performance) ))))) • (. .) ))
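A sketch of how the rule probabilities could be estimated from an annotated corpus such as the Treebank: count every production and normalize per left-hand side. The toy trees here are nested tuples, not actual Treebank data.

# Sketch: estimate PCFG rule probabilities from a treebank by counting each
# production and normalizing per left-hand side. Toy trees, not real Treebank data.
from collections import Counter, defaultdict

def productions(tree):
    """Yield (lhs, rhs) pairs for every internal node of a nested-tuple tree."""
    if isinstance(tree, str):
        return
    label, *children = tree
    yield label, tuple(c if isinstance(c, str) else c[0] for c in children)
    for child in children:
        yield from productions(child)

def estimate_pcfg(treebank):
    counts = defaultdict(Counter)
    for tree in treebank:
        for lhs, rhs in productions(tree):
            counts[lhs][rhs] += 1
    return {lhs: {rhs: c / sum(rhs_counts.values()) for rhs, c in rhs_counts.items()}
            for lhs, rhs_counts in counts.items()}

toy_treebank = [("S", ("NP", ("N", "Fed")),
                      ("VP", ("V", "raises"), ("NP", ("N", "interest"), ("N", "rates"))))]
print(estimate_pcfg(toy_treebank)["NP"])   # {('N',): 0.5, ('N', 'N'): 0.5}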
Resolving Ambiguity • Ambiguity: • Syntactic: more than one possible structure for the same string of words. • e.g., We need more intelligent leaders. ("need more" or "more intelligent"?) • Lexical (homonymy): a word form has more than one meaning. • e.g., Did you see the bat? • e.g., Where is the bank?
"The boy saw the man with the telescope" • Parse 1 (PP attached to the VP: the boy uses the telescope): [S [NP [Det The] [N boy]] [VP [V saw] [NP [Det the] [N man]] [PP [P with] [NP [Det the] [N telescope]]]]]
"The boy saw the man with the telescope" • Parse 2 (PP attached to the NP: the man has the telescope): [S [NP [Det The] [N boy]] [VP [V saw] [NP [Det the] [N man] [PP [P with] [NP [Det the] [N telescope]]]]]]