CPSC 7373: Artificial Intelligence
Lecture 13: Natural Language Processing
Jiang Bian, Fall 2012
University of Arkansas at Little Rock
Natural Language Processing • Understanding natural languages: • Philosophically: We humans have defined ourselves in terms of our ability to speak with and understand each other. • Application-wise: We want to be able to talk to computers. • Learning: We want computers to become smarter and to learn human knowledge from textbooks and other text.
Language Models • Two types of language models: • Probabilistic: represented as a sequence of letters/words; the probability of a sequence, P(word1, word2, …); mostly word-based; learned from data. • Logical: trees and abstract structure of words; L = {S1, S2, …}; abstraction via trees/categories; hand-coded. • (Example parse tree: S → NP VP, with NP → Name → "Sam" and VP → Verb → "slept", for the sentence "Sam slept".)
Bag of Words • A bag rather than a sequence. • Unigram, Naïve Bayes model: each individual word is treated as a separate factor, unrelated to (conditionally independent of) all the other words. • It is also possible to take the sequence into account. • (Illustration: the words "IF YOU LOVE THE BAG OF WORDS MODEL HONK" shown scattered in a bag; word order is lost.)
Probabilistic Models • P(w1 w2 w3 … wn) = P(w1:n) = ∏i P(wi | w1:i-1) (chain rule) • Markov Assumption: • the effect of one variable on another is local; • the i-th word depends only on its previous k words: • P(wi | w1:i-1) ≈ P(wi | wi-k:i-1) • For a first-order Markov model: P(wi | wi-1) • Stationary Assumption: • the conditional probabilities are the same at every position; • i.e., a word's probability depends only on its neighboring words, not on where in the text the sentence occurs: • P(wi | wi-1) = P(wj | wj-1)
Applications of Language Models • Classification (e.g., spam) • Clustering (e.g., news stories) • Input correction (spelling, segmentation) • Sentiment analysis (e.g., product reviews) • Information retrieval (e.g., web search) • Question answering (e.g., IBM’s Watson) • Machine translation (e.g., Chinese to English) • Speech recognition (e.g., Apple’s Siri)
N-gram Model • An n-gram is a contiguous sequence of n items from a given sequence of text or speech. • Language Models (LM) • Unigrams, Bigrams, Trigrams… • Applications: • Speech recognition / data compression • Predicting the next word • Information Retrieval • Retrieved documents are ranked based on the probability of the query under the document's language model • P(Q | Md)
N-gram examples • S = “I saw the red house” • Unigram: • P(S) = P(I, saw, the, red, house) = P(I)P(saw)P(the)P(red)P(house) • Bigram – Markov assumption • P(S) = P(I|<s>)P(saw|I)P(the|saw)P(red|the)P(house|red)P(</s>|house) • Trigram: • P(S) = P(I|<s>, <s>)P(saw|<s>, I)P(the|I, saw)P(red|saw, the)P(house|the, red)P(</s>|red, house)
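The unigram and bigram factorizations above are easy to turn into code. Below is a minimal sketch that multiplies bigram probabilities along a sentence padded with <s> and </s>; the probability table is made up for illustration, not estimated from a real corpus.

# Minimal sketch of the bigram factorization P(S) = prod_i P(w_i | w_{i-1}).
# The probability table below is illustrative, not estimated from a real corpus.
from functools import reduce

bigram_prob = {
    ("<s>", "I"): 0.25, ("I", "saw"): 0.1, ("saw", "the"): 0.3,
    ("the", "red"): 0.05, ("red", "house"): 0.4, ("house", "</s>"): 0.2,
}

def sentence_probability(words, table):
    """Multiply P(w_i | w_{i-1}) over the sentence, padded with <s> and </s>."""
    padded = ["<s>"] + words + ["</s>"]
    probs = [table.get((prev, cur), 0.0) for prev, cur in zip(padded, padded[1:])]
    return reduce(lambda a, b: a * b, probs, 1.0)

print(sentence_probability("I saw the red house".split(), bigram_prob))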
How do we train these models? • Very large corpora: collections of text and speech • Shakespeare • Brown Corpus • Wall Street Journal • AP newswire • Hansards • TIMIT • DARPA/NIST text/speech corpora (Call Home, Call Friend, ATIS, Switchboard, Broadcast News, Broadcast Conversation, TDT, Communicator) • TRAINS, Boston Radio News Corpus
A Simple Bigram Example • Estimate the likelihood of the sentence I want to eat Chinese food. • P(I want to eat Chinese food) = P(I | <start>) P(want | I) P(to | want) P(eat | to) P(Chinese | eat) P(food | Chinese) P(<end>|food) • What do we need to calculate these likelihoods? • Bigram probabilities for each word pair sequence in the sentence • Calculated from a large corpus
Early Bigram Probabilities from BERP
Eat on .16        Eat Thai .03
Eat some .06      Eat breakfast .03
Eat lunch .06     Eat in .02
Eat dinner .05    Eat Chinese .02
Eat at .04        Eat Mexican .02
Eat a .04         Eat tomorrow .01
Eat Indian .04    Eat dessert .007
Eat today .03     Eat British .001
<start> I .25     Want some .04
<start> I'd .06   Want Thai .01
<start> Tell .04  To eat .26
<start> I'm .02   To have .14
I want .32        To spend .09
I would .29       To be .02
I don't .08       British food .60
I have .04        British restaurant .15
Want to .65       British cuisine .01
Want a .05        British lunch .01
P(I want to eat British food) = P(I|<start>) P(want|I) P(to|want) P(eat|to) P(British|eat) P(food|British) = .25 × .32 × .65 × .26 × .001 × .60 ≈ .0000081 • Suppose P(<end>|food) = .2? • How would we calculate P(I want to eat Chinese food)? • The probabilities roughly capture "syntactic" facts and "world knowledge": • eat is often followed by an NP • British food is not too popular • N-gram models can be trained by counting and normalization
Early BERP Bigram Counts (rows = first word, columns = second word)
           I     Want   To    Eat   Chinese  Food  Lunch
I          8     1087   0     13    0        0     0
Want       3     0      786   0     6        8     6
To         3     0      10    860   3        0     12
Eat        0     0      2     0     19       2     52
Chinese    2     0      0     0     0        120   1
Food       19    0      17    0     0        0     0
Lunch      4     0      0     0     0        1     0
Early BERP Bigram Probabilities
Unigram counts: I 3437, Want 1215, To 3256, Eat 938, Chinese 213, Food 1506, Lunch 459
• Normalization: divide each row's bigram counts by the appropriate unigram count for wn-1
• Computing the bigram probability of I I: C(I, I) / C(I in all contexts)
• P(I|I) = 8 / 3437 = .0023
• Maximum Likelihood Estimation (MLE): relative frequency
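A minimal sketch of this normalization step: the MLE bigram probability is the bigram count divided by the unigram count of the first word. The counts below are just the subset of BERP counts shown in the tables above.

# MLE bigram estimation: P(w2 | w1) = C(w1, w2) / C(w1).
# Counts are the subset of BERP counts shown in the tables above.
unigram_counts = {"i": 3437, "want": 1215, "to": 3256, "eat": 938,
                  "chinese": 213, "food": 1506, "lunch": 459}
bigram_counts = {("i", "i"): 8, ("i", "want"): 1087, ("want", "to"): 786,
                 ("to", "eat"): 860, ("eat", "chinese"): 19, ("chinese", "food"): 120}

def mle_bigram(w1, w2):
    """Relative-frequency (maximum likelihood) estimate of P(w2 | w1)."""
    return bigram_counts.get((w1, w2), 0) / unigram_counts[w1]

print(round(mle_bigram("i", "i"), 4))      # 0.0023
print(round(mle_bigram("i", "want"), 4))   # 0.3163 (≈ .32)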
What do we learn about the language? • What's being captured with ... • P(want | I) = .32 • P(to | want) = .65 • P(eat | to) = .26 • P(food | Chinese) = .56 • P(lunch | eat) = .055 • What about... • P(I | I) = .0023 • P(I | want) = .0025 • P(I | food) = .013
P(I | I) = .0023: "I I I I want …" • P(I | want) = .0025: "I want I want …" • P(I | food) = .013: "the kind of food I want is …"
Approximating Shakespeare • Generating sentences with random unigrams... • Every enter now severally so, let • Hill he late speaks; or! a more to leg less first you enter • With bigrams... • What means, sir. I confess she? then all sorts, he is trim, captain. • Why dost stand forth thy canopy, forsooth; he is this palpable hit the King Henry. • Trigrams • Sweet prince, Falstaff shall die. • This shall forbid it should be branded, if renown made it empty.
Quadrigrams • What! I will go seek the traitor Gloucester. • Will you not tell me who I am? • What's coming out here looks like Shakespeare because it is Shakespeare • Note: As we increase the value of N, the accuracy of an n-gram model increases, since choice of next word becomes increasingly constrained
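A hedged sketch of the generation experiment above: train a bigram table from a corpus and repeatedly sample the next word given the previous one. The toy corpus here stands in for Shakespeare or the Wall Street Journal.

# Sketch: sample sentences from a bigram model, as in the Shakespeare experiment.
# The tiny training text is a stand-in for a real corpus (Shakespeare, WSJ, ...).
import random
from collections import defaultdict

def train_bigrams(tokens):
    model = defaultdict(list)
    for prev, cur in zip(["<s>"] + tokens, tokens + ["</s>"]):
        model[prev].append(cur)
    return model

def generate(model, max_len=20):
    word, out = "<s>", []
    while len(out) < max_len:
        word = random.choice(model[word])   # sample the next word given the previous one
        if word == "</s>":
            break
        out.append(word)
    return " ".join(out)

toy_corpus = "sweet prince falstaff shall die . this shall forbid it".split()
print(generate(train_bigrams(toy_corpus)))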
N-Gram Training Sensitivity • If we repeated the Shakespeare experiment but trained our n-grams on a Wall Street Journal corpus, what would we get? • Note: This question has major implications for corpus selection or design
Probabilistic Letter Models • The probability of a sequence of letters. • What can we do with letter models? • Language identification
Language Identification Bigram Model:
Language Identification Trigram Model:
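Following the bigram/trigram letter models above, here is a minimal language-identification sketch: score the new text under each language's smoothed letter-bigram model and pick the best. The tiny training strings are stand-ins for real corpora, and add-alpha smoothing with an assumed alphabet size is one simple choice among many.

# Sketch: identify a language by scoring text under per-language letter bigram models.
# The tiny "corpora" are stand-ins; real systems train on much larger text.
import math
from collections import Counter

def letter_bigram_model(text):
    text = "^" + text.lower() + "$"
    return Counter(zip(text, text[1:]))

def log_prob(text, model, alpha=1.0, vocab=30):
    """Add-alpha smoothed log probability of the text's letter bigrams."""
    total = sum(model.values())
    text = "^" + text.lower() + "$"
    return sum(math.log((model[bg] + alpha) / (total + alpha * vocab * vocab))
               for bg in zip(text, text[1:]))

corpora = {"EN": "hello world this is english text",
           "DE": "hallo welt dies ist deutscher text"}
models = {lang: letter_bigram_model(txt) for lang, txt in corpora.items()}
query = "this is a new text"
print(max(models, key=lambda lang: log_prob(query, models[lang])))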
Classification • Naïve Bayes, k-Nearest Neighbor, Support Vector Machine, Logistic Regression • … or the gzip command???
Gzip • EN • Hello world! • This is a file full of English words… • AZ • Salam Dunya! • Bu fayl Azərbaycan tam sözlər… • DE • Hallo Welt! • Dies ist eine Datei voll von deutschen Worte… • This is a new piece of text to be classified:
(echo `cat new EN | gzip | wc -c` EN; \
 echo `cat new DE | gzip | wc -c` DE; \
 echo `cat new AZ | gzip | wc -c` AZ) \
| sort -n | head -1
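A Python equivalent of the shell one-liner above, same idea: concatenate the new text with each language's training file, compress, and pick the language that compresses best. The file names EN, DE, AZ follow the slide and are assumed to hold the training text.

# Python version of the gzip trick above: concatenate the new text with each
# language's training file, compress, and pick the language with the smallest
# compressed size. File names EN, DE, AZ follow the slide and are assumptions.
import gzip

def compressed_size(data: bytes) -> int:
    return len(gzip.compress(data))

def classify(new_text: str, training_files=("EN", "DE", "AZ")) -> str:
    sizes = {}
    for lang in training_files:
        with open(lang, "rb") as f:
            sizes[lang] = compressed_size(f.read() + new_text.encode("utf-8"))
    return min(sizes, key=sizes.get)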
Segmentation • Given a sequence of words, how to break it up into meaningful segments. • e.g., 羽西中国新锐画家大奖 • Written English has spaces in between words: • e.g., words have spaces • Speech Recognition • URL: choosespain.com • Choose Spain • Chooses pain
Segmentation • The best segmentation is the one that maximizes the joint probability of the segmentation: • S* = argmax P(w1:n) = argmax ∏i P(wi | w1:i-1) • Markov assumption: • S* ≈ argmax ∏i P(wi | wi-1) • Naïve Bayes assumption (words don't depend on each other): • S* ≈ argmax ∏i P(wi)
Segmentation • "nowisthetime": 12 letters • How many possible segmentations? • n-1 • (n-1)^2 • (n-1)! • 2^(n-1) (the answer: each of the n-1 gaps is either a break or not) • Naïve Bayes assumption: • S* = argmax over S = f + r of P(f) P(S*(r)), where f is the first word and r is the rest • 1) Computationally easy • 2) Learning is easier: it's easier to estimate the unigram probabilities
Best Segmentation • S* = argmax over S = f + r of P(f) P(S*(r)) • "nowisthetime"
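A sketch of this recursion with memoization, in the spirit of Norvig's segmenter: try every first-word/rest split, score P(f) times the best probability of the rest, and keep the maximum. The unigram table is a toy stand-in; a real model would use counts from a large corpus and a length-based penalty for unknown words.

# Sketch of S* = argmax_{S = f + r} P(f) * P(S*(r)) with memoization.
# The unigram table is a toy stand-in for counts from a large corpus.
from functools import lru_cache

unigram = {"now": 1e-4, "is": 1e-3, "the": 2e-2, "time": 1e-4, "no": 1e-3, "wis": 1e-9}

def p_word(w):
    # Tiny probability for unknown words; real systems penalize unknowns by length.
    return unigram.get(w, 1e-20)

@lru_cache(maxsize=None)
def segment(text):
    """Return (probability, list of words) of the best segmentation of `text`."""
    if not text:
        return 1.0, []
    candidates = []
    for i in range(1, len(text) + 1):
        first, rest = text[:i], text[i:]
        p_rest, words_rest = segment(rest)
        candidates.append((p_word(first) * p_rest, [first] + words_rest))
    return max(candidates, key=lambda c: c[0])

print(segment("nowisthetime")[1])   # expected: ['now', 'is', 'the', 'time']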
Segmentation Examples • Trained on a 4-billion-word corpus • e.g., • baseratesoughtto → "base rate sought to" (should be "base rates ought to") • smallandinsignificant → "small and in significant" (should be "small and insignificant") • ginormousego → "g in or mouse go" (should be "ginormous ego") • What to do to improve? More data??? Relax the Markov assumption??? Smoothing???
Spelling Correction • Given a misspelled word w, find the best correction c: • c* = argmax_c P(c | w) • Bayes' theorem: c* = argmax_c P(w | c) P(c) • P(c): estimated from data counts (language model) • P(w | c): estimated from spelling-correction data (error model)
Spelling Data • c:w => P(w|c) • pulse: pluse • elegant: elagent, elligit • second: secand, sexeon, secund, seconnd, seond, sekon • sailed: saled, saild • blouse: boludes • thunder: thounder • cooking: coking, chocking, kooking, cocking • fossil: fosscil • We cannot enumerate every possible misspelling, so we also use letter-based (edit) models, e.g., • ul:lu
Correction Example • w = "thew" => rank candidate corrections c by P(w|c) P(c)
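A minimal noisy-channel sketch for this example: generate candidate corrections within one edit of w and rank them by P(w|c) P(c). Both probability tables below are toy stand-ins for values learned from spelling data and a corpus.

# Noisy-channel spelling correction sketch: c* = argmax_c P(w | c) P(c).
# Both probability tables are toy stand-ins for values learned from data.
import string

p_c = {"the": 0.05, "thaw": 1e-5, "threw": 1e-4, "thew": 1e-7}          # language model P(c)
p_w_given_c = {("thew", "the"): 1e-4, ("thew", "thaw"): 1e-3,
               ("thew", "threw"): 1e-4, ("thew", "thew"): 0.9}          # error model P(w | c)

def edits1(word):
    """All strings one insert/delete/replace/transpose away from `word`."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in string.ascii_lowercase]
    inserts = [a + c + b for a, b in splits for c in string.ascii_lowercase]
    return set(deletes + transposes + replaces + inserts) | {word}

def correct(w):
    candidates = [c for c in edits1(w) if c in p_c]
    return max(candidates, key=lambda c: p_w_given_c.get((w, c), 1e-9) * p_c[c])

print(correct("thew"))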
Sentence Structure • P(Fed raises interest rates) = ??? • (Two parse trees for "Fed raises interest rates": in one, "Fed" is the subject NP, "raises" the verb, and "interest rates" the object NP; in the other, "Fed raises" is the subject NP, "interest" the verb, and "rates" the object NP.)
Context Free Grammar Parsing • Sentence structure trees are constructed according to grammar. • A grammar is a list of rules: e.g., • S -> NP VP • NP -> N | D (determiners: e.g., the, a) N | NN | NNN (mortgage interest rates), etc. • VP -> V NP | V | V NP NP (e.g., give me the money) • N -> interest | Fed | rates | raises • V -> interest | rates | raises • D -> the | a
Ambiguity • How many parsing options do I have for each of these sentences? • The Fed raises interest rates • The Fed raises raises • Raises raises interest raises
Ambiguity • How many parsing options do I have? • The Fed raises interest rates (2) • The Fed (NP) raises (V) interest rates (NP) • The Fed raises (NP) interest (V) rates (NP) • The Fed raises raises (1) • The Fed (NP) raises (V) raises (NP) • Raises raises interest raises (4) • Raises (NP) raises (V) interest raises (NP) • Raises (NP) raises (V) interest (NP) raises (NP) • Raises raises (NP) interest (V) raises (NP) • Raises raises interest (NP) raises (V)
Problems and Solutions Problems: Solutions:
Problems of writing grammars • Natural languages are messy, unorganized things that have evolved through human history in a variety of contexts. • It is hard to specify a set of grammar rules that covers all possibilities without introducing errors. • Ambiguity is the "enemy"…
Probabilistic Context-Free Grammar • S -> NP VP (1) • NP -> N (.3) | D N (.4) | N N (.2) | N N N (.1) • VP -> V NP (.4) | V (.4) | V NP NP (.2) • N -> interest (.3) | Fed (.3) | rates (.3) | raises (.1) • V -> interest (.1) | rates (.3) | raises (.6) • D -> the (.5) | a (.5)
Probabilistic Context-Free Grammar • S -> NP VP (1) • NP -> N (.3) | D N (.4) | N N (.2) | N N N (.1) • VP -> V NP (.4) | V (.4) | V NP NP (.2) • N -> interest (.3) | Fed (.3) | rates (.3) | raises (.1) • V -> interest (.1) | rates (.3) | raises (.6) • D -> the (.5) | a (.5)
Parse of "Fed raises interest rates" with Fed (N) as subject NP, raises as V, and interest rates (N N) as object NP:
P(tree) = 1 (S -> NP VP) × .3 (NP -> N) × .3 (Fed) × .4 (VP -> V NP) × .6 (raises) × .2 (NP -> N N) × .3 (interest) × .3 (rates) = 0.0003888 ≈ 0.039%
Probabilistic Context-Free Grammar • S -> NP VP (1) • NP -> N (.3) | D N (.4) | N N (.2) | N N N (.1) • VP -> V NP (.4) | V (.4) | V NP NP (.2) • N -> interest (.3) | Fed (.3) | rates (.3) | raises (.1) • V -> interest (.1) | rates (.3) | raises (.6) • D -> the (.5) | a (.5)
Two parses of "Raises raises interest rates": • Raises (NP -> N) raises (V) interest rates (NP -> N N): P() = ???% • Raises raises (NP -> N N) interest (V) rates (NP -> N): P() = ???%
Probabilistic Context-Free Grammar • S -> NP VP (1) • NP -> N (.3) | D N (.4) | N N (.2) | N N N (.1) • VP -> V NP (.4) | V (.4) | V NP NP (.2) • N -> interest (.3) | Fed (.3) | rates (.3) | raises (.1) • V -> interest (.1) | rates (.3) | raises (.6) • D -> the (.5) | a (.5)
Two parses of "Raises raises interest rates": • Raises (NP -> N) raises (V) interest rates (NP -> N N): P = 1 × .3 × .1 × .4 × .6 × .2 × .3 × .3 = 0.0001296 ≈ .013% • Raises raises (NP -> N N) interest (V) rates (NP -> N): P = 1 × .2 × .1 × .1 × .4 × .1 × .3 × .3 = 0.0000072 = .00072%
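The tree probabilities above are just products of rule probabilities. A small sketch that computes them for nested-tuple trees, using the grammar from these slides:

# Sketch: the probability of a parse tree under a PCFG is the product of the
# probabilities of all rules used in it. Grammar values follow the slides above.
rule_prob = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("N",)): 0.3, ("NP", ("N", "N")): 0.2,
    ("VP", ("V", "NP")): 0.4, ("VP", ("V", "NP", "NP")): 0.2,
    ("N", ("Fed",)): 0.3, ("N", ("interest",)): 0.3,
    ("N", ("rates",)): 0.3, ("N", ("raises",)): 0.1,
    ("V", ("raises",)): 0.6, ("V", ("interest",)): 0.1,
}

def tree_prob(tree):
    """tree = (label, child1, child2, ...); leaves are plain strings."""
    if isinstance(tree, str):
        return 1.0
    label, *children = tree
    child_labels = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = rule_prob[(label, child_labels)]
    for child in children:
        p *= tree_prob(child)
    return p

# Parse of "Fed raises interest rates" from the earlier slide:
fed = ("S", ("NP", ("N", "Fed")),
            ("VP", ("V", "raises"), ("NP", ("N", "interest"), ("N", "rates"))))
print(tree_prob(fed))   # 0.0003888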
Statistical Parsing • Where do these probabilities come from? • Training on a large annotated corpus • e.g., the Penn Treebank Project (1990): the Penn Treebank Project annotates naturally occurring text for linguistic structure. • S -> NP VP (1) • NP -> N (.3) | D N (.4) | N N (.2) | N N N (.1) • VP -> V NP (.4) | V (.4) | V NP NP (.2) • N -> interest (.3) | Fed (.3) | rates (.3) | raises (.1) • V -> interest (.1) | rates (.3) | raises (.6) • D -> the (.5) | a (.5)
The Penn Treebank Project • ( (S • (NP-SBJ (NN Stock-market) (NNS tremors) ) • (ADVP-TMP (RB again) ) • (VP (VBD shook) • (NP (NN bond) (NNS prices) ) • (, ,) • (SBAR-TMP (IN while) • (S • (NP-SBJ (DT the) (NN dollar) ) • (VP (VBD turned) • (PRT (RP in) ) • (NP-PRD (DT a) (VBN mixed) (NN performance) ))))) • (. .) ))
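A sketch of how the rule probabilities could be estimated from an annotated corpus such as the Treebank: count every production and normalize per left-hand side. The toy trees here are nested tuples, not actual Treebank data.

# Sketch: estimate PCFG rule probabilities from a treebank by counting each
# production and normalizing per left-hand side. Toy trees, not real Treebank data.
from collections import Counter, defaultdict

def productions(tree):
    """Yield (lhs, rhs) pairs for every internal node of a nested-tuple tree."""
    if isinstance(tree, str):
        return
    label, *children = tree
    yield label, tuple(c if isinstance(c, str) else c[0] for c in children)
    for child in children:
        yield from productions(child)

def estimate_pcfg(treebank):
    counts = defaultdict(Counter)
    for tree in treebank:
        for lhs, rhs in productions(tree):
            counts[lhs][rhs] += 1
    return {lhs: {rhs: c / sum(rhs_counts.values()) for rhs, c in rhs_counts.items()}
            for lhs, rhs_counts in counts.items()}

toy_treebank = [("S", ("NP", ("N", "Fed")),
                      ("VP", ("V", "raises"), ("NP", ("N", "interest"), ("N", "rates"))))]
print(estimate_pcfg(toy_treebank)["NP"])   # {('N',): 0.5, ('N', 'N'): 0.5}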
Resolving Ambiguity • Ambiguity: • Syntactic: more than one possible structure for the same string of words. • e.g., We need more intelligent leaders. ("need more" or "more intelligent"?) • Lexical (homonymy): a word form has more than one meaning. • e.g., Did you see the bat? • e.g., Where is the bank?
"The boy saw the man with the telescope" • Parse 1 (PP attached to the VP: the boy uses the telescope): [S [NP [Det The] [N boy]] [VP [V saw] [NP [Det the] [N man]] [PP [P with] [NP [Det the] [N telescope]]]]]
"The boy saw the man with the telescope" • Parse 2 (PP attached to the NP: the man has the telescope): [S [NP [Det The] [N boy]] [VP [V saw] [NP [Det the] [N man] [PP [P with] [NP [Det the] [N telescope]]]]]]