Survey of NLP. JILLIAN K. CHAVES CUBRC, Inc. Survey of NLP. Module 1 Introduction Tokenization Sentence Breaking Module 2 Part-of-Speech (POS) Tagging N-gram Analysis Module 3 Phrase Structure Parsing Syntactic Parsing Module 4 Semantic Analysis NLP & Ontologies. Introduction.
Survey of NLP JILLIAN K. CHAVES CUBRC, Inc.
Survey of NLP • Module 1 • Introduction • Tokenization • Sentence Breaking • Module 2 • Part-of-Speech (POS) Tagging • N-gram Analysis • Module 3 • Phrase Structure Parsing • Syntactic Parsing • Module 4 • Semantic Analysis • NLP & Ontologies
Introduction • What is Natural Language? • A set of subconscious rules about the pronunciation (phonology), order (syntax), and meaning (semantics) of linguistic expressions. • What is Linguistics? • The scientific study of language use, acquisition, and evolution. • What is Computation? • Computation is the manipulation of information according to a specific method (e.g., algorithm) for determining an output value from a set of input values. • What is Computational Linguistics? • The study of the computational processes that are necessary for the generation and understanding of natural language.
Introduction • Processing natural language is far from trivial • Language is: • based on very large vocabularies (± 20,000 words) • rich in meaning (sometimes vague and context-dependent) • regulated by complicated patterns and subconscious rules • massively ambiguous (resolved only by world knowledge) • noisy (speakers routinely produce and are tolerant of errors) • produced and comprehended very quickly (and usually effortlessly) • Humans are specially equipped to handle these difficulties, but machines are not (yet). Is it possible to make a machine understand and use natural language as a human does, or even approximate the same utility?
A Typical NLP Pipeline • More-or-less standardized approach • Tokenization: Isolate all words and word parts • Sentence Segmentation: Isolate each individual sentence • POS Tagging: Assign part(s) of speech for each word • Phrase Structure Parsing: Isolate constituent boundaries • Syntactic Parsing: Identify argument structures • Semantic Analysis: Divine the meaning of a sentence • Ontology Translation: Map meaning to a concept model
Problems for NLP: Ambiguity • Speech Segmentation • Misheard song lyrics, for example • Discourse phenomena such as casual speech • Lexical Categorization • I saw her duck. • She fed her baby carrots. • Lexical/Phrasal Structure • British Prime Minister • The Prime Minister of Britain? • A Prime Minister (of some unknown country) who is of British descent? • Unlockable • Something that can be unlocked? • Something that can not be locked? • Analogous to mathematical order of operations: 12 ÷ 2 + 1 = 7 or 4?
Problems for NLP: Ambiguity • Sentence Structure • People with kids who use drugs should be locked up. • I forgot how good beer tastes. • Semantic Structure • Someone always wins the game. [reference ambiguity] • Every arrow hit a target. [scope ambiguity] • Implicitness • Can you open the door? A) Are you able to open the door? B) Open the door! • What is the dog doing in the garage? A) What activity is the dog carrying out? B) The dog doesn’t belong there. • Yeah, right. A) Yes, that is correct. (= agreement) B) No, that is incorrect. (= sarcasm)
Survey of NLP • Module 1 • Introduction • Tokenization • Sentence Segmentation • Module 2 • Part-of-Speech (POS) Tagging • N-gram Analysis • Module 3 • Phrase Structure Parsing • Syntactic Parsing • Module 4 • Semantic Analysis • NLP & Ontologies
Tokenization • Type • The set of “word form” types in a language is the lexicon • Token • A single instance of a linguistic type (word or contracted word) • I am hungry. { I | am | hungry | . } (types = 4; tokens = 4) • He’s Mary’s friend? { He | ’s | Mary | ’s | friend | ? } (types = 5; tokens = 6) • The blue car chased the red car. (types = 6; tokens = 8) • Types vs. Tokens in Comparative Corpora
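The type/token counts on this slide can be reproduced with a short sketch. The regular expression below is an assumption chosen for illustration (it splits off clitics such as ’s and treats each punctuation mark as its own token), and types are counted case-insensitively so that "The" and "the" are one type.

```python
import re

def tokenize(text):
    """Split text into word, clitic ('s), and punctuation tokens."""
    text = text.replace("\u2019", "'")  # normalize curly apostrophes
    return re.findall(r"'\w+|\w+|[^\w\s]", text)

def count_types_and_tokens(text):
    """Return (number of types, number of tokens), ignoring case for types."""
    tokens = tokenize(text)
    types = {token.lower() for token in tokens}
    return len(types), len(tokens)

for sentence in ["I am hungry.",
                 "He\u2019s Mary\u2019s friend?",
                 "The blue car chased the red car."]:
    print(sentence, count_types_and_tokens(sentence))
```

A real tokenizer needs many more cases (hyphens, abbreviations, numbers); this one only covers the three example sentences.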
Tokenization • Tokenization • The process of individuating/indexing all tokens in a text • Very difficult in writing systems with lax compounding rules or flexible word boundaries • German: der Donaudampfschifffahrtsgesellschaftskapitän THE DANUBE· STEAMBOAT· VOYAGE· COMPANY· CAPTAIN (“The Danube Steamship Company captain”) • English: gonna, wanna, shoulda, hafta, … • Every token has a unique (within context) part-of-speech category and semantics • Cross-POS homography • Verb/Noun: record, progress, attribute, ... • Syncretism • Simple past and past participle: bought, cost, led, meant, …
Tokenization • The problem is token delineation • Spaces: United States of America • Hyphens: well-rounded; father-in-law • Multiple “spellings”: • US, USA, U.S., U.S.A., United States, … • 1/11/11, 01/11/11, January 11, 2011, 11 January 2011, 2011-01-11, … • (716) 555-5555, 716-555-5555, 716.555.5555, … • The solution is normalization • Lemmatization: identifying the root (lemma) of each token • Lemma: open • Inflectional Paradigm: open, opens, opening, opened, … • Lemma: be • Inflectional Paradigm: am, is, are, was, were, being, been, isn’t, aren’t, …
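A minimal sketch of normalization, assuming a hand-written lookup table built from the two paradigms on this slide. The names `PARADIGMS`, `lemmatize`, and `normalize_phone` are hypothetical, and a practical lemmatizer would rely on a full lexicon plus morphological rules rather than a table this small.

```python
import re

# Inflectional paradigms copied from the slide (lemma -> surface forms).
PARADIGMS = {
    "open": ["open", "opens", "opening", "opened"],
    "be": ["am", "is", "are", "was", "were", "being", "been", "isn't", "aren't"],
}

# Invert the table so each surface form points at its lemma.
LEMMA_OF = {form: lemma for lemma, forms in PARADIGMS.items() for form in forms}

def lemmatize(token):
    """Return the lemma of a known form, or the lowercased token otherwise."""
    key = token.lower().replace("\u2019", "'")
    return LEMMA_OF.get(key, key)

def normalize_phone(text):
    """Reduce any of the phone 'spellings' to a bare digit string."""
    return re.sub(r"\D", "", text)

print([lemmatize(t) for t in ["Opened", "was", "aren\u2019t", "cat"]])
print(normalize_phone("(716) 555-5555"), normalize_phone("716-555-5555"))
```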
Lemmatization • Lemma vs. linguistic type • The set of possible words is much bigger than the set of lemmas, thanks to derivation and inflection • Nouns/verbs • bike, skate, shelf, fax, email, Facebook, Google, … • Plural (-s) combines with most singular common nouns • Cat(s), table(s), day(s), idea(s), … • Genitive (-’s) combines with most nominals (simple or complex) • John’s cat, the black cat’s food, the Queen of England’s hat, the girl I met yesterday’s car • Progressive (-ing) attaches to almost any verb • Biking, skating, shelving, faxing, emailing, Facebooking, Googling, … • …which again can be ambiguous with another POS, e.g., shelving
Inflection and Derivation • Inflection • The paradigm (aka conjugation) of a single verb to account for person, number, and tense agreement • Regular • I act, he acts, you acted, we are acting, they have acted, he will act • Irregular • I go, he goes, you went, we are going, they have gone, she will go • I catch, he catches, you caught, we are catching, they have caught, she will catch • New/introduced verbs (e.g., tweet, Google) have regular inflection • Derivation • The process of deriving new words from a single root word • Nation (n.) → national (adj.) → nationalize (v.) → nationalization (n.)
The Importance of Accurate Tokenization • Better downstream syntactic parsing • Stochastic (statistical) parsing thrives on high-quality input • Better downstream semantic assessment • Stable but rare lexical composition patterns • Anti-tank-missile (= a missile that targets tanks) • Anti-missile-missile (= a missile that targets missiles) • Anti-anti-missile-missile-missile (= a missile that targets anti-missile-missiles) • Great-grandfather (= a grandparent’s father) • Great-great-grandfather (= a grandparent’s parent’s father) • Great-great-great-grandfather … • Reliable lexical decomposition, especially with new/nonce words • I Yandexed it. {v|Yandex} simple past • I’m a Yandexer. {v|Yandex} agentive nominalization • I can’t stop Yandexing. {v|Yandex} progressive aspect
Survey of NLP • Module 1 • Introduction • Tokenization • Sentence Breaking • Module 2 • Part-of-Speech (POS) Tagging • N-gram Analysis • Module 3 • Phrase Structure Parsing • Syntactic Parsing • Module 4 • Semantic Analysis • NLP & Ontologies
Sentence Segmentation • Naïve approach to identifying a sentence boundary: • If the current token is a period, it’s the end of the sentence • If the preceding token is on a list of known abbreviations, then the period might not end the sentence • If the following token is capitalized, then the period ends the sentence • Shockingly: 95% accuracy! Demo: An Online Sentence Breaker • Mr. and Mrs. Jack Giancarlo of Lancaster celebrated their 50th wedding anniversary with a family cruise to the Bahamas. Mr. Giancarlo and Patricia Keenan were married September 28, 1963, in Holy Angels Catholic Church, Buffalo. He is a retired inspector for the Ford Motor Co. Buffalo Stamping Plant; she is working as a tax preparer for H&R Block. They have five children and 13 grandchildren. [1] • The bookkeeper/office manager at an Amherst jewelry store has admitted stealing more than $51,000.00 in cash from daily sales at the business. Rena Carrow, 44, of Lancaster, pleaded guilty to third-degree grand larceny in the theft at Andrews Jewelers on Transit Road, according to Erie County District Attorney Frank A. Sedita III. Carrow admitted that between Aug. 31, 2011 and Dec. 5, 2012 she stole $51,069.14. She faces up to seven years in prison when she is sentenced Jan. 16 by Erie County Judge Kenneth F. Case. [2] • [1] Adapted from http://www.buffalonews.com/life-arts/golden-weddings/patricia-and-jack-giancarlo-20131010, accessed 10 October 2013. • [2] Adapted from http://www.buffalonews.com/city-region/amherst/jewelry-store-bookkeeper-admits-to-stealing-more-than-51000-20131010, accessed 10 October 2013.
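The naïve approach can be sketched as follows. This is one possible ordering of the three rules, not the online demo's implementation: the abbreviation list is a hypothetical sample, tokens are whitespace-separated words, and an abbreviation always blocks a boundary.

```python
# Hypothetical abbreviation list; a real system would use a much longer one.
ABBREVIATIONS = {"mr", "mrs", "dr", "co", "jan", "aug", "dec"}

def split_sentences(text):
    """Break text into sentences using the three naive period rules."""
    words = text.split()
    sentences, current = [], []
    for i, word in enumerate(words):
        current.append(word)
        if not word.endswith("."):
            continue  # rule 1: only a period can end a sentence
        if word[:-1].lower() in ABBREVIATIONS:
            continue  # rule 2: a known abbreviation does not end the sentence
        next_word = words[i + 1] if i + 1 < len(words) else None
        if next_word is None or next_word[0].isupper():
            sentences.append(" ".join(current))  # rule 3: capital follows
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

sample = ("Mr. Giancarlo and Patricia Keenan were married in Buffalo. "
          "He is a retired inspector for the Ford Motor Co. Buffalo Stamping Plant.")
for sentence in split_sentences(sample):
    print(sentence)
```

The sketch still fails on cases the rules do not cover, such as the middle initial in "Frank A. Sedita III", which it wrongly treats as a boundary.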
End of Module 1 Questions?
Survey of NLP • Module 1 • Introduction • Tokenization • Sentence Breaking • Module 2 • Part-of-Speech (POS) Tagging • N-gram Analysis • Module 3 • Phrase Structure Parsing • Syntactic Parsing • Module 4 • Semantic Analysis • NLP & Ontologies
Parts of Speech • Closed class (function words) • Pronouns: I, me, you, he, his, she, her, it, … • Possessive: my, mine, your, his, her, their, its, … • Wh-pronouns: who, what, which, when, whom, whomever, … • Prepositions: in, under, to, by, for, about, … • Determiners: a, an, the, each, every, some, … • Conjunctions • Coordinating: and, or, but, as, … • Subordinating: that, then, who, because, … • Particles: up, down, off, on, … • Numerals: one, two, three, first, second, … • Auxiliary verbs: can, may, should, could, … • Open class (content words) • Nouns • Proper nouns: Jackie, Microsoft, France, Jupiter, … • Common nouns • Count nouns: cat, table, dream, height, … • Mass (non-count) nouns: milk, oil, mail, music, furniture, fun, … • Verbs: read, eat, paint, think, tell, sleep, … • Adjectives: purple, bad, false, original, … • Adverbs: quietly, always, very, often, never, …
POS Annotation Tagsets • Penn Treebank • A syntactically-annotated corpus of about 4.5M words, using a set of 45 POS tags devised by UPenn (sampling of tagset below)
POS Annotation Tagsets • Comparison (Corpus: Word Count; Tagset Size) • Penn Treebank: 4.5M; n = 45 • British National Corpus (BNC): 100M; n = 61 • Brown Corpus (Brown University): 1M; n = 82 • Corpus of Contemporary American English (COCA): 450M; n = 137 • Global Web-Based English (GloWbE): 1.9B; n = 137 • Why such a range across tagsets? • Occurrence of “complex” tags • Penn: [isn’t] → is/VBZ n’t/RB • Brown: [isn’t] → VBZ* (‘*’ indicates negation) • Most category distinctions are recoverable by context • A more exhaustive list of corpora is available here.
POS Annotation Tagsets • Each token is assigned its possible POS tags • Ambiguity resolved with statistical likelihood measures • e.g., nouns more likely than verbs to begin sentences, etc. • 4^1 x 3^3 x 2^3 x 1^1 = 864 possible tag combinations • Given the syntactic patterns of English, only 1 is statistically likely: Bill/NNP saw/VBD her/PRP$ father/NN ’s/POS bike/NN yesterday/RB ./.
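The count of 864 is the product of the number of candidate tags per token: one token with four candidates, three with three, three with two, and one with a single candidate. The sketch below checks the arithmetic and also enumerates the combinations; the tag names are placeholders, because the slide's per-token candidate lists did not survive, and the ordering of the counts is arbitrary.

```python
from itertools import product
from math import prod

# Number of candidate tags for each of the 8 tokens (order is arbitrary here).
candidate_counts = [4, 3, 3, 3, 2, 2, 2, 1]

# Closed form: multiply the per-token counts.
total = prod(candidate_counts)

# Brute force: enumerate every tag sequence using placeholder tag names.
placeholder_tags = [[f"tag{k}" for k in range(count)] for count in candidate_counts]
enumerated = sum(1 for _ in product(*placeholder_tags))

print(total, enumerated)
```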
POS Annotation Tagsets • Lexical ambiguity metrics: Brown Corpus • 11.5% of word types are ambiguous • However, those 11.5% tend to be the most frequent types: • I know that/IN she is honest. • Yes, that/DT concert was fun. • I’m not that/RB hungry. • In fact, those 11.5% of types account for 40% of the tokens in the Brown corpus!
Methods & Accuracy • Rule-based POS Tagging: 50.0% - 90.0% • Probability-based (Trigram HMM): 55.0% - 95.0% • Maximum Entropy P(t|w): 93.7% - 82.6% • TnT (HMM++): 96.2% - 86.9% • MEMM Tagger: 96.9% - 86.9% • Dependency Parser (Stanford): 97.2% - 90.0% • Manual (Human): 98% upper bound • “Current part-of-speech taggers work rapidly and reliably, with per-token accuracies of slightly over 97%. [...] Good taggers have sentence accuracies around 55-57%.” Source: Manning 2011
Rule-based Method • Create a list of words with their most likely parts of speech • For each word in a sentence, tag it by looking up its most likely tag • e.g., dog/NN > dog/VB > dog/VBP • Correct for errors with tag-changing rules • Contextual rules: revise the tag based on the surrounding words or the tags of the surrounding words • IN DT NEXTTAG NN (IN becomes DT if next tag is NN) • that/IN cat/NN → that/DT cat/NN • Lexical rules: revise the tag based on an analysis of the stemmed word, in concert with the understanding of derivational rules of English
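A minimal sketch of the two-pass method, covering only the lookup step and the single contextual rule shown on the slide. The lexicon entries and the default tag NN for unknown words are assumptions made for illustration.

```python
# Hypothetical most-likely-tag lexicon; real taggers learn this from a corpus.
MOST_LIKELY_TAG = {"that": "IN", "cat": "NN", "dog": "NN", "the": "DT"}

# Contextual rules as (old_tag, new_tag, condition, value).
# ("IN", "DT", "NEXTTAG", "NN") reads: IN becomes DT if the next tag is NN.
CONTEXTUAL_RULES = [("IN", "DT", "NEXTTAG", "NN")]

def tag(words):
    """Tag by lexicon lookup, then revise tags with the contextual rules."""
    tags = [MOST_LIKELY_TAG.get(word.lower(), "NN") for word in words]
    for old_tag, new_tag, condition, value in CONTEXTUAL_RULES:
        for i in range(len(tags) - 1):
            if condition == "NEXTTAG" and tags[i] == old_tag and tags[i + 1] == value:
                tags[i] = new_tag
    return list(zip(words, tags))

print(tag(["that", "cat"]))
```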
Stemming • Affixation • Regular but not universal • -ize: modernize, legalize, finalize; but *newize, *lawfulize, *permanentize • un-: unhealthy, unhappy, unstable; but *unsick, *unsad, *unmiserable • -s (plural): cats, dogs, birds; but *oxs (oxen), *mouses (mice), *hippopotamuss (hippopotami or hippopotamuses) • Irregular verbs • Root form changes for tense/aspect • sink → sank → sunk • begin → began → begun • go → went → gone • do → did → done • Unstable paradigms • dive → dove? dived? (= usually a dialectal variation)
Stemming: Variation Predictability • Pluralization via affix • cf. root change, e.g., man → men • A singular root that does not end in “s”, “z”, “sh”, “ch”, “dg” sounds or a vowel will take ‘-s’ in the plural form. • cat, dog, lab, map, batter, seagull, button, firm, … • A singular root ending in “s”, “z”, “sh”, “ch”, or “dg” sounds will take ‘-es’ in the plural form; if this results in an overlapping orthographic ‘e’, they will collapse. • loss + es = losses / bus + es = buses / house + es = houses / … • buzz + es = buzzes / waltz + es = waltzes / … • ash + es = ashes / match + es = matches / hedge + es = hedges / … • Corollary: A singular root ending in a single ‘z’ will geminate in the plural form. • quiz + es = quizzes / … • Predictable variation can be captured with rules
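The pluralization rules above can be sketched in code. The slide states them in terms of sounds; this version approximates them with spelling only, which is an assumption that breaks on words such as "stomach" (where "ch" is not a sibilant) and on irregular plurals.

```python
def pluralize(root):
    """Apply the regular plural rules, approximated on spelling."""
    # Sibilant ending already spelled with 'e': the two e's collapse (house + es).
    if root.endswith(("se", "ze", "dge")):
        return root + "s"
    # A single 'z' after a vowel geminates (quiz + es = quizzes).
    if len(root) >= 2 and root[-1] == "z" and root[-2] in "aeiou":
        return root + "zes"
    # Other sibilant endings take '-es'.
    if root.endswith(("s", "z", "sh", "ch")):
        return root + "es"
    # Everything else takes '-s'.
    return root + "s"

for word in ["cat", "loss", "house", "buzz", "waltz", "match", "hedge", "quiz"]:
    print(word, "->", pluralize(word))
```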
Survey of NLP • Module 1 • Introduction • Tokenization • Sentence Breaking • Module 2 • Part-of-Speech (POS) Tagging • N-gram Analysis • Module 3 • Phrase Structure Parsing • Syntactic Parsing • Module 4 • Semantic Analysis • NLP & Ontologies
N-grams • Probabilistic language modeling • Goal: determine probability P of a sequence of words • Applications: • POS Tagging • P(She/PRP bikes/VBZ) > P(She/PRP bikes/NNS) • Spellchecking • P(their cat is sick) > P(there cat is sick) • Speech Recognition • P(I can forgive you) > P(I can for give you) • Machine translation, natural language generation, language identification, authorship (genre) identification, word similarity, sentiment analysis, etc.
N-grams • N-gram: a sequence of n words • Unigram: occurrence of a single isolated word • Bigram: a sequence of two words • Trigram: a sequence of three words • 4-gram: a sequence of four words • … • Resources/demonstrations • Online N-gram calculator • GoogleBooks N-gram Viewer • Automatic random language generation • (based on N-gram probabilities of input text)
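Extracting n-grams from a token list takes only a sliding window. The sketch below assumes whitespace tokenization and uses the example sentence from the tokenization slides.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every contiguous sequence of n tokens, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the blue car chased the red car".split()
unigram_counts = Counter(ngrams(tokens, 1))
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(unigram_counts.most_common(2))
print(bigrams)
print(trigrams)
```

A text of k tokens yields k - n + 1 n-grams, so longer n-grams are both fewer and far less likely to repeat.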
N-grams: Scope of Usefulness • In a text… • The set of bigrams is large and exhibits high frequencies • The set of trigrams is smaller than the set of bigrams, and each trigram is less frequent • … • The set of 15-grams is small and each probably occurs only once • Zipf’s Law (long tail phenomenon): the frequency of a word is roughly inversely proportional to its frequency rank, so a few words are very common and most are rare • Related Task • Compute probability of an upcoming word: P(w5|w1,w2,w3,w4) • “The probability of the next word being w5 given the preceding environment w1 followed by w2 followed by w3 followed by w4.” • Example: • What is the value of P(the|is,easy,to,see)?
N-grams: Scope of Usefulness • What is the value of P (the|it,is,easy,to,see)? • Approach #1: Counting! • Per Google (as of 22-Oct-2013): • Problem: not all possible sequences occur very often
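The counting approach estimates P(word | context) as count(context followed by word) / count(context). The sketch below runs on a made-up three-sentence corpus, not on the Google counts the slide referred to, and it shows the stated problem directly: a context that never occurs gives no usable estimate.

```python
def conditional_probability(tokens, context, word):
    """Estimate P(word | context) by counting occurrences in tokens."""
    context = tuple(context)
    n = len(context)
    context_count = 0
    followed_count = 0
    # Only positions that still have a following token are considered.
    for i in range(len(tokens) - n):
        if tuple(tokens[i:i + n]) == context:
            context_count += 1
            if tokens[i + n] == word:
                followed_count += 1
    return followed_count / context_count if context_count else 0.0

# Toy corpus invented for this example.
corpus = ("it is easy to see the point . it is easy to see why . "
          "it is easy to see the light .").split()

print(conditional_probability(corpus, "it is easy to see".split(), "the"))
print(conditional_probability(corpus, "it was easy to see".split(), "the"))
```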
N-grams: Scope of Usefulness • What is the value of P(the|it,is,easy,to,see)? • Approach #2: Estimate with N-grams • Joint probabilities: P(w1,w2,…,wn) = P(w1) * P(w2|w1) * P(w3|w1,w2) * … * P(wn|w1,w2,…,wn-1) • Complex, time-consuming, and, in the end, not very helpful • Limitations • N-gram probability analysis doesn’t give the whole picture • “Garden path” sentences • The man that I saw with her bikes to work every day. • The man that I saw with her bikes was a thief. • News headlines (“Journalese”) • Corn maze cutter stalks fall fun across country • After Earth Lost To Both Fast & Curious And Now You See Me At Friday Box Office • Jury awards $6.5M in CA case of nozzle thought gun
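In practice the chain of conditional probabilities is usually shortened by conditioning each word on only the previous n-1 words. The sketch below uses a bigram approximation, P(wi | wi-1), with unsmoothed maximum-likelihood counts; the training sentences and the `<s>`/`</s>` boundary markers are assumptions added for the example.

```python
from collections import Counter

def train_bigram_model(sentences):
    """Count unigrams and bigrams, padding each sentence with boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def sentence_probability(sentence, unigrams, bigrams):
    """Multiply P(word | previous word) over the whole sentence."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    probability = 1.0
    for previous, word in zip(tokens, tokens[1:]):
        if unigrams[previous] == 0:
            return 0.0  # unseen history: no estimate without smoothing
        probability *= bigrams[(previous, word)] / unigrams[previous]
    return probability

# Toy training data invented for this example.
training = ["their cat is sick", "their dog is sick", "there is a cat"]
unigrams, bigrams = train_bigram_model(training)

print(sentence_probability("their cat is sick", unigrams, bigrams))
print(sentence_probability("there cat is sick", unigrams, bigrams))
```

On this toy data the model prefers "their cat is sick" over "there cat is sick", the spellchecking comparison from the earlier N-grams slide, but only because "there cat" never occurs in the training sentences.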
Recurring Problem: Non-linearity • Predictive sequence models fail because they assume that: • Syntax is linear (cf. hierarchical) • “She sent a postcard to her friend from Australia.” • L: She sent a postcard to [her friend from Australia]. • H: [She sent a postcard] to her friend [from Australia]. • All dependencies are local (cf. long-distance) • Which instrument did you play? • Deconstruction: • Determine the value of x such that x is an instrument and you play x • Which instrument did your college roommate try to annoy you by playing? • Deconstruction: • Define set v that is identical to the set of your roommates • Define subset x of set v as the set of roommates from college • Define subset y of set v that played an instrument w • Define subset z of set v that played w to annoy you • Determine the value of w
End of Module 2 Questions?
Survey of NLP • Module 1 • Introduction • Tokenization • Sentence Breaking • Module 2 • Part-of-Speech (POS) Tagging • N-gram Analysis • Module 3 • Phrase Structure Parsing • Syntactic Parsing • Module 4 • Semantic Analysis • NLP & Ontologies
Phrase Structures • Computational Analogy: base-10 arithmetic • Lexicon: • N → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 • O → + | - | x | = • Grammar: • N → N O N • Example 1: • N → (1+2) x (3+4) • N → 3 x 7 • 21 → 3 x 7 • Example 2: • N → 9 – ((2 x 3) + 1) • N → 9 – (6 + 1) • N → 9 – 7 • 2 → 9 – 7
Phrase Structures • Natural language has a bigger lexicon and more rules • How? • Recursion: a phrase defined in terms of itself • A noun phrase can be rewritten as (for instance): • NP → DT N “the dog” • N → N PP “dog in the yard” • A prepositional phrase is rewritten as a preposition (relational term) and a noun phrase. • PP → P NP • These three rules alone allow for infinite recursion! • Example: • “Put the ring in the box on the table at the end of the hallway.” • Where is the ring now? Where is it going?
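The recursion in these three rules can be made concrete with a function that calls itself once per embedded prepositional phrase. This sketch hard-codes the determiner "the" and builds only one (right-branching) bracketing; the function name and argument layout are hypothetical.

```python
def noun_phrase(nouns, prepositions):
    """Expand NP -> DT N, recursing through N -> N PP and PP -> P NP."""
    head = "the " + nouns[0]
    if len(nouns) == 1:
        return head  # base case: NP -> DT N with no attached PP
    # The PP contains another NP, which is where the recursion happens.
    embedded = noun_phrase(nouns[1:], prepositions[1:])
    return head + " " + prepositions[0] + " " + embedded

phrase = noun_phrase(["ring", "box", "table", "end", "hallway"],
                     ["in", "on", "at", "of"])
print("Put " + phrase + ".")
```

Nothing in the rules limits the depth, so adding a noun and a preposition always yields a longer well-formed phrase.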
Phrase Structures • Phrasal rewrite rules • Additional rules of English • S → NP VP [the dog] [barked] • NP → DT N [the dog] • VP → IV [barked] • N → AdjP N [big] [dog] • VP → TV NP [gnawed] [the bone] • VP → DTV NP NP [gave] [Mary] [a kiss] • VP → DTV NP PP [gave] [a kiss] [to Mary] • PDV → DTV [was given] • VP → DTV NP PP [was given] [a kiss] [by the dog] • VP → VP PP [went] [to the park] • PP → P NP [to] [the park]
Phrase Structures • Syntactic tree structure • “The woman called a friend from Australia.” • Is this parse predicted by the grammar rules? [The woman] called a friend [from Australia]. Parse #1: The woman [called] a friend [from Australia]. Parse #2: The woman called [a friend from Australia].
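Whether the grammar predicts a parse can be checked mechanically. The sketch below is a CKY-style chart that counts analyses, using a subset of the rewrite rules from the preceding slides (S → NP VP, NP → DT N, N → N PP, VP → TV NP, VP → VP PP, PP → P NP). The lexicon is an assumption, and the proper noun "Australia" is entered directly as an NP to keep every rule binary.

```python
from collections import defaultdict

# Binary rules as (parent, left child, right child).
RULES = [
    ("S", "NP", "VP"), ("NP", "DT", "N"), ("N", "N", "PP"),
    ("VP", "TV", "NP"), ("VP", "VP", "PP"), ("PP", "P", "NP"),
]

# Hypothetical lexicon for the example sentence.
LEXICON = {
    "the": ["DT"], "a": ["DT"], "woman": ["N"], "friend": ["N"],
    "called": ["TV"], "from": ["P"], "australia": ["NP"],
}

def count_parses(sentence, start="S"):
    """Count the distinct parse trees the grammar assigns to the sentence."""
    words = sentence.lower().rstrip(".").split()
    n = len(words)
    chart = defaultdict(int)  # (i, j, category) -> analyses of words[i:j]
    for i, word in enumerate(words):
        for category in LEXICON.get(word, []):
            chart[(i, i + 1, category)] += 1
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):  # split point between the two children
                for parent, left, right in RULES:
                    chart[(i, j, parent)] += chart[(i, k, left)] * chart[(k, j, right)]
    return chart[(0, n, start)]

print(count_parses("The woman called a friend from Australia."))
```

Under these assumptions the chart finds two analyses, matching Parse #1 (the PP attaches to the verb phrase) and Parse #2 (the PP attaches to "friend"). The reading in which "from Australia" modifies "the woman" gets no analysis, because these rules cannot link non-adjacent constituents.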
Phrase Structures • Other common sources of recursion • Complex/non-canonical phrases • VP → AUX VP • By this time next month, I [will [have [been [married]]]] for 10 years. • Complex/non-canonical phrases • NP → GerundVP • [Swimming] is fun. GerundVP → VBG • [Going to the beach] is a great way to relax. GerundVP → VBG PP • [Visiting the cemetery] was very sad. GerundVP → VBG NP • Reiteration within rules • NP → DT AdjP N “the big dog” • AdjP → Adj* “big brown furry” • AdjP → (Adv*) Adj* “[awesomely [big]] [really [furry]]”
Phrase Structures • How do we know phrase structure rules exist? • Ability to parse novel grammatical sentences • “They laboriously cavorted with intrepid neighbors.” • Ability to intuit when a sentence is ungrammatical. • “Like almost eyes feel been have fully indigo.” • How many rules are there? • Nobody knows! Open problem since the 1950s. • The statistical universals have been identified – • Existing phrase structure rules account for 97% of natural language constructions • Psycholinguists focus on the remaining 3% via the grammaticality/acceptability interface
Survey of NLP • Module 1 • Introduction • Tokenization • Sentence Breaking • Module 2 • Part-of-Speech (POS) Tagging • N-gram Analysis • Module 3 • Phrase Structure Parsing • Syntactic Parsing • Module 4 • Semantic Analysis • NLP & Ontologies
Online Parsers • Phrase Structure Parsers • Probabilistic LFG F-structure parsing • Link Grammar • ZZCad • Dependency Parsers • Stanford Parser • Connexor ROOT The woman called a friend. det(woman-2, the-1) nsubj(called-3, woman-2) root(root-0, called-3) det(friend-5, a-4) dobj(called-3, friend-5)
Long-distance Dependencies • Local • Which instrument did you play? det(instrument-2, which-1) dobj(play-5, instrument-2) aux(play-5, did-3) nsubj(play-5, you-4) root(root-0, play-5) • Long-distance • Which instrument did your college roommate try to annoy you by playing? det(instrument-2, which-1) dep(try-7, instrument-2) aux(try-7, did-3) poss(roommate-6, your-4) nn(roommate-6, college-5) nsubj(try-7, roommate-6) xsubj(annoy-9, roommate-6) root(root-0, try-7) aux(annoy-9, to-8) xcomp(try-7, annoy-9) dobj(annoy-9, you-10) prep(annoy-9, by-11) pobj(by-11, playing-12)
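Dependency output in this relation(head-index, dependent-index) notation is easy to load into a data structure. The sketch below parses the two outputs shown on this slide and looks up how "instrument" is attached in each; the regular expression and function names are assumptions, not part of any parser's API.

```python
import re

# Matches e.g. "dobj(play-5, instrument-2)".
DEPENDENCY = re.compile(r"(\w+)\(([^\s,()]+)-(\d+), ([^\s,()]+)-(\d+)\)")

def parse_dependencies(text):
    """Return (relation, head, head_index, dependent, dependent_index) tuples."""
    return [(rel, head, int(h), dep, int(d))
            for rel, head, h, dep, d in DEPENDENCY.findall(text)]

def head_of(dependencies, word):
    """Return (relation, head) for the first dependency whose dependent is word."""
    for relation, head, _, dependent, _ in dependencies:
        if dependent == word:
            return relation, head
    return None

LOCAL = ("det(instrument-2, which-1) dobj(play-5, instrument-2) aux(play-5, did-3) "
         "nsubj(play-5, you-4) root(root-0, play-5)")
LONG = ("det(instrument-2, which-1) dep(try-7, instrument-2) aux(try-7, did-3) "
        "poss(roommate-6, your-4) nn(roommate-6, college-5) nsubj(try-7, roommate-6) "
        "xsubj(annoy-9, roommate-6) root(root-0, try-7) aux(annoy-9, to-8) "
        "xcomp(try-7, annoy-9) dobj(annoy-9, you-10) prep(annoy-9, by-11) "
        "pobj(by-11, playing-12)")

print(head_of(parse_dependencies(LOCAL), "instrument"))
print(head_of(parse_dependencies(LONG), "instrument"))
```

In the local case "instrument" is the direct object of "play". In the long-distance case it is attached to "try" with the generic dep relation rather than to "playing".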
End of Module 3 Questions?
Survey of NLP • Module 1 • Introduction • Tokenization • Sentence Breaking • Module 2 • Part-of-Speech (POS) Tagging • N-gram Analysis • Module 3 • Phrase Structure Parsing • Syntactic Parsing • Module 4 • Semantic Analysis • NLP & Ontologies
The Syntax-Semantics Interface • Can we automate the process of associating semantic representations with parsed natural language expressions? • Is the association even systematic?
The Syntax-Semantics Interface • The meaning of an expression is a function of the meanings of its parts and the way the parts are combined syntactically • [The cat] chased the dog. • [The cat] was chased by the dog. • The dog chased [the cat]. • The meaning of [the cat] is fairly stable, but its role in the sentence is determined by syntax • The primary tenet of the syntax-semantics interface is this Principle of Compositionality
Compositionality • Semantic λ-calculus • Notational extension of First-Order Logic • Grammar is extended with semantic representations • Proper names: (PN; tom) Tom; (PN; mia) Mia • Intrans. verbs: (IV; snores • Transitive verbs: (TV; likes • Phrasal rules: • Sentence (S; ()()) (NP; )(VP; ) • Noun Phrase (NP; ) (PN; ) • Intransitive VP (VP; ) (IV; ) • Transitive VP (VP; () ()) (TV; )(NP; )
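Compositional semantics of this kind can be imitated with ordinary functions. The slide's own λ-terms did not survive extraction, so everything below is an assumption: verbs are modeled as functions that return a first-order formula as a string, a transitive verb takes its object first and then its subject, and a sentence applies the VP meaning to the NP meaning.

```python
# Lexical entries (assumed): proper names denote constants.
tom = "tom"
mia = "mia"

# Intransitive verb: a one-place function from an entity to a formula.
snores = lambda x: f"snore({x})"

# Transitive verb: takes the object, then the subject.
likes = lambda y: lambda x: f"like({x},{y})"

def noun_phrase(proper_name):
    """NP rule: the NP inherits the meaning of the proper name."""
    return proper_name

def intransitive_vp(verb):
    """Intransitive VP rule: the VP inherits the meaning of the verb."""
    return verb

def transitive_vp(verb, object_np):
    """Transitive VP rule: apply the verb meaning to the object meaning."""
    return verb(object_np)

def sentence(subject_np, vp):
    """Sentence rule: apply the VP meaning to the subject meaning."""
    return vp(subject_np)

print(sentence(noun_phrase(mia), intransitive_vp(snores)))
print(sentence(noun_phrase(tom), transitive_vp(likes, noun_phrase(mia))))
```

Each phrasal rule does nothing but function application, which is the Principle of Compositionality in miniature: the meaning of the whole is computed from the meanings of the parts and the way the syntax combines them.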