CSA2050: Introduction to Computational Linguistics Part of Speech (POS) Tagging I: Rule-based Tagging, Stochastic Tagging
Acknowledgment • Most slides taken from Bonnie Dorr's course notes: www.umiacs.umd.edu/~bonnie/courses/cmsc723-03 • In turn based on Jurafsky & Martin, Chapter 8 CLINT Lecture IV
Outline • The Task • Tags • Rule-based Tagging • Stochastic Tagging
Definition: PoS-Tagging • "Part-of-Speech Tagging is the process of assigning a part-of-speech or other lexical class marker to each word in a corpus" (Jurafsky and Martin) • WORDS: the girl kissed the boy on the cheek • TAGS: N, V, P, DET
Motivation • Corpus analysis of tagged corpora yields useful information • Speech synthesis — pronunciation: CONtent (N) vs. conTENT (Adj) • Speech recognition — word-class-based N-grams predict the category of the next word • Information retrieval • stemming • selection of high-content words • Word-sense disambiguation
Word Classes • Parts of Speech: words in the same class (same POS) have similar behaviour w.r.t. • Morphology (what kinds of affixes they take) • Syntax (what kinds of words can occur nearby) • Open Classes • Closed Classes
Open Class Words • Nouns: people, places, things • Classes of nouns • proper vs. common • count vs. mass • Verbs: actions and processes • Adjectives: properties, qualities • Adverbs: many different kinds
Closed Class Words • Fixed membership • Examples: • prepositions: on, under, over, … • particles: up, down, on, off, … • determiners: a, an, the, … • pronouns: she, who, I, … • conjunctions: and, but, or, … • auxiliary verbs: can, may, should, … • numerals: one, two, three, third, …
Prepositions from CELEX • Frequencies from the COBUILD 16M-word corpus
Tagsets: how detailed?
Penn Treebank Tagset
Example of Penn Treebank Tagging of Brown Corpus Sentence • The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./. • Book/VB that/DT flight/NN ./. • Does/VBZ that/DT flight/NN serve/VB dinner/NN ?/.
The Problem • He can can a can. • I can light a fire and you can open a can of beans. Now the can is open, and we can eat in the light of the fire. • Flying planes can be dangerous.
The Problem • Words often belong to more than one word class: this • This is a nice day = PRP • This day is nice = DT • You can go this far = RB (adverb) • Many of the most common words are ambiguous
How Hard is the Tagging Task? • In the Brown Corpus • 11.5% of word types are ambiguous • 40% of word tokens are ambiguous • Most words in English are unambiguous. • Many of the most common words are ambiguous. • Typically ambiguous tags are not equally probable.
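These proportions can be computed directly from any tagged corpus. A minimal sketch in Python, using a tiny made-up corpus in place of the Brown Corpus:

```python
from collections import defaultdict

# Toy tagged corpus (made up for illustration); each token is (word, tag).
corpus = [("the", "DT"), ("can", "NN"), ("can", "MD"), ("race", "VB"),
          ("race", "NN"), ("the", "DT"), ("dog", "NN"), ("can", "MD")]

# Collect the set of tags seen with each word type.
tags_per_type = defaultdict(set)
for word, tag in corpus:
    tags_per_type[word].add(tag)

ambiguous_types = {w for w, tags in tags_per_type.items() if len(tags) > 1}
type_ambiguity = len(ambiguous_types) / len(tags_per_type)
token_ambiguity = sum(w in ambiguous_types for w, _ in corpus) / len(corpus)

print(f"ambiguous types: {type_ambiguity:.1%}, ambiguous tokens: {token_ambiguity:.1%}")
```

Note that even with only half the types ambiguous, well over half the tokens are, because ambiguous words tend to be frequent.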
Word Class Ambiguity (in the Brown Corpus) (DeRose, 1988) • Unambiguous (1 tag): 35,340 types • Ambiguous (2–7 tags): 4,100 types
3 Approaches to Tagging • Rule-Based Tagger: ENGTWOL Tagger (Voutilainen 1995) • Stochastic Tagger: HMM-based Tagger • Transformation-Based Tagger: Brill Tagger (Brill 1995)
Rule-Based Tagger • Basic Idea: • Assign all possible tags to words • Remove tags according to a set of rules, e.g.: if word+1 is an adj, adv, or quantifier and the following word is a sentence boundary, and word-1 is not a verb like "consider", then eliminate non-adv tags, else eliminate the adv tag. • Typically more than 1,000 hand-written rules, but they may also be machine-learned.
ENGTWOL • Based on two-level morphology • 56,000 entries for English word stems • Each entry annotated with morphological and syntactic features
Sample ENGTWOL Lexicon
ENGTWOL Tagger • Stage 1: Run words through the morphological analyzer to get all parts of speech. • E.g. for the phrase "the tables", we get the following output:
"<the>"    "the"   <Def> DET CENTRAL ART SG/PL
"<tables>" "table" N NOM PL
           "table" <SVO> V PRES SG3 VFIN
• Stage 2: Apply constraints to rule out incorrect POSs
Examples of Constraints • Discard all verb readings if to the left there is an unambiguous determiner, and between that determiner and the ambiguous word itself, there are no nominals (nouns, abbreviations etc.). • Discard all finite verb readings if the immediately preceding word is to. • Discard all subjunctive readings if to the left, there are no instances of the subordinating conjunction that or lest. • The first constraint would discard the verb reading in the previous representation. • There are about 1,100 constraints.
Example
Pavlov     PAVLOV N NOM SG PROPER
had        HAVE V PAST VFIN SVO
           HAVE PCP2 SVO
shown      SHOW PCP2 SVOO SVO SV
that       ADV
           PRON DEM SG
           DET CENTRAL DEM SG
           CS
salivation N NOM SG
Actual Constraint Syntax • Given input: "that"
If (+1 A/ADV/QUANT) (+2 SENT-LIM) (NOT -1 SVOC/A)
Then eliminate non-ADV tags
Else eliminate ADV tag
• This rule eliminates the adverbial sense of that, as in "it isn't that odd"
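A rough Python rendering of this one constraint, using hypothetical token and tag-set representations (the real ENGTWOL engine works quite differently):

```python
# Sketch of the "that" constraint. Each token is (word, set_of_candidate_tags);
# the tag names below (A, ADV, QUANT, SENT-LIM, SVOC/A) follow the slide.
def apply_that_constraint(tokens, i):
    """If 'that' is followed by an adj/adv/quantifier and then a sentence
    boundary, and not preceded by a verb like 'consider' (SVOC/A),
    keep only the ADV reading; otherwise drop the ADV reading."""
    word, tags = tokens[i]
    if word.lower() != "that" or "ADV" not in tags:
        return
    nxt = tokens[i + 1][1] if i + 1 < len(tokens) else set()
    nxt2 = tokens[i + 2][1] if i + 2 < len(tokens) else set()
    prev = tokens[i - 1][1] if i > 0 else set()
    if (nxt & {"A", "ADV", "QUANT"}) and "SENT-LIM" in nxt2 and "SVOC/A" not in prev:
        tokens[i] = (word, {"ADV"})           # eliminate non-ADV tags
    else:
        tokens[i] = (word, tags - {"ADV"})    # eliminate the ADV tag

# "it isn't that odd" -- here 'that' should come out as ADV
sent = [("it", {"PRON"}), ("is", {"V"}), ("n't", {"NEG"}),
        ("that", {"ADV", "DET", "CS", "PRON"}), ("odd", {"A"}),
        (".", {"SENT-LIM"})]
apply_that_constraint(sent, 3)
print(sent[3])  # ('that', {'ADV'})
```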
Stochastic Tagging • Based on the probability of a certain tag given various possibilities. • Requires a training corpus. • Difficulties: • There are no probabilities for words that are not in the training corpus (addressed by smoothing). • The training corpus may be too different from the test corpus.
Stochastic Tagging • Simple Method: choose the most frequent tag in the training text for each word! • Result: 90% accuracy! • But we can do better than that by employing a more elaborate statistical model • Hidden Markov Models (HMMs) are a class of such models.
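The most-frequent-tag baseline is easy to sketch; the training data below is made up for illustration:

```python
from collections import Counter, defaultdict

# Tiny hypothetical training corpus of (word, tag) pairs.
train = [("the", "DT"), ("can", "MD"), ("can", "MD"), ("can", "NN"),
         ("race", "NN"), ("race", "VB"), ("race", "NN"), ("is", "VBZ")]

# Count how often each word appears with each tag.
counts = defaultdict(Counter)
for word, tag in train:
    counts[word][tag] += 1

def most_frequent_tag(word, default="NN"):
    """Baseline: pick the tag seen most often with this word in training."""
    return counts[word].most_common(1)[0][0] if word in counts else default

print(most_frequent_tag("can"))    # MD (seen twice vs. NN once)
print(most_frequent_tag("race"))   # NN
print(most_frequent_tag("zebra"))  # NN (unseen word, fall back to default)
```

The default tag for unseen words is one simple way around the missing-probability problem from the previous slide; smoothing is the more principled fix.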
Hidden Markov Model (for pronunciation)
Three Fundamental Questions for HMMs • Given an HMM, how likely is a given observation sequence? • Given an observation sequence, how do we choose a state sequence that best explains the observations? • Given an observation sequence and a space of possible HMMs, how do we find the HMM that best fits the observed data?
Two Observation Sequences for Tagging
Two Kinds of Probability Involved in Generating a Sequence • Transitional probability: P(tag | previous n tags), the probability of a tag given the tags that precede it • Output probability: P(w | t), the probability of a word given its tag
Simplifying Assumptions cannot handle all phenomena • Limited Horizon: a given tag depends only upon the N previous tags (usually N = 2) • central embedding? The cat the dog the bird saw bark meowed. • long-distance dependencies: Chris is easy to consider it impossible for anyone but a genius to try to talk to __. • Time (sentence position) invariance: (P,V) may not be equally likely at beginning/end of sentence
Estimating N-gram probabilities • To estimate the probability that “such” appears after “to create”: • count how many times “to create such” appears= A • count how many times “to create” appears = B • Estimate = A/B • Same principle applies for tags • We can use these estimates to rank alternative tags for a given word.
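The A/B estimate above is a simple count over a corpus. The text below is a toy example; real estimates need a large corpus plus smoothing:

```python
# Estimating P(such | to create) by relative frequency from a toy corpus.
text = ("we want to create such a model and to create such tools "
        "but also to create new data").split()

trigram = ("to", "create", "such")
A = sum(1 for i in range(len(text) - 2) if tuple(text[i:i+3]) == trigram)
B = sum(1 for i in range(len(text) - 1) if tuple(text[i:i+2]) == trigram[:2])

print(f"P(such | to create) = {A}/{B} = {A/B:.2f}")  # 2/3 = 0.67
```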
Data Used for Training a Hidden Markov Model • Estimate the probabilities from relative frequencies. • Transitional probabilities: the probability that a sequence of tags t1, …, tn is followed by a tag t: P(t|t1…tn) = count(t1…tn followed by t) / count(t1…tn) • Output probabilities: the probability that a given tag t will be realised as a word w: P(w|t) = count(w tagged as t) / count(t)
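Both kinds of probability can be estimated with counters over a tagged corpus. A bigram sketch (N = 1 previous tag), with made-up training data:

```python
from collections import Counter

# Relative-frequency estimates from a toy tagged corpus (hypothetical data).
tagged = [("the", "DT"), ("race", "NN"), ("is", "VBZ"), ("on", "IN"),
          ("the", "DT"), ("race", "NN"), ("to", "TO"), ("race", "VB"),
          ("for", "IN"), ("dinner", "NN")]

tag_seq = [t for _, t in tagged]
tag_counts = Counter(tag_seq)
bigram_counts = Counter(zip(tag_seq, tag_seq[1:]))
word_tag_counts = Counter(tagged)

def p_trans(t, prev):
    """P(t | prev) = count(prev followed by t) / count(prev)."""
    return bigram_counts[(prev, t)] / tag_counts[prev]

def p_out(w, t):
    """P(w | t) = count(w tagged as t) / count(t)."""
    return word_tag_counts[(w, t)] / tag_counts[t]

print(p_trans("NN", "DT"))   # both DT tokens are followed by NN -> 1.0
print(p_out("race", "NN"))   # 2 of the 3 NN tokens are 'race' -> 0.666...
print(p_out("race", "VB"))   # the only VB token is 'race' -> 1.0
```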
An Example • Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NN • People/NNS continue/VBP to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN • Consider the first sentence; choose between A = to/TO race/VB and B = to/TO race/NN • We need to choose the maximum probability: • P(A) = P(VB|TO) × P(race|VB) • P(B) = P(NN|TO) × P(race|NN)
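Plugging in illustrative numbers (made up for this sketch; real values are relative-frequency estimates from a tagged corpus) shows the comparison:

```python
# Illustrative (made-up) probabilities for the two readings of 'race'.
trans_p = {("TO", "VB"): 0.34, ("TO", "NN"): 0.021}         # P(tag | TO)
out_p = {("race", "VB"): 0.00003, ("race", "NN"): 0.00041}  # P(word | tag)

p_A = trans_p[("TO", "VB")] * out_p[("race", "VB")]  # reading A: to/TO race/VB
p_B = trans_p[("TO", "NN")] * out_p[("race", "NN")]  # reading B: to/TO race/NN

best = "VB" if p_A > p_B else "NN"
print(f"P(A) = {p_A:.2e}, P(B) = {p_B:.2e} -> race/{best}")
```

Even though race is more often a noun overall, the strong transition P(VB|TO) tips the decision towards the verb reading here.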
Calculating Maximum
Remarks • We have shown how to calculate the most probable tag for one word. • Normally we are interested in the most probable sequence of tags for the entire sentence. • This is achieved by the Viterbi algorithm. • See Jurafsky and Martin Ch 5 for an introduction to this algorithm
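A minimal bigram Viterbi sketch, with toy (made-up) probabilities; see Jurafsky & Martin for the full algorithm with start and end states:

```python
# Dynamic programming over tag sequences: at each position keep, for every
# tag, the best probability and path that ends in that tag.
def viterbi(words, tags, start_p, trans_p, out_p):
    """Return the most probable tag sequence for `words`."""
    V = [{t: (start_p.get(t, 0.0) * out_p.get((words[0], t), 0.0), [t])
          for t in tags}]
    for w in words[1:]:
        row = {}
        for t in tags:
            prob, path = max(
                (V[-1][prev][0] * trans_p.get((prev, t), 0.0)
                 * out_p.get((w, t), 0.0), V[-1][prev][1])
                for prev in tags)
            row[t] = (prob, path + [t])
        V.append(row)
    return max(V[-1].values())[1]

tags = ["TO", "VB", "NN"]
start_p = {"TO": 1.0}                                        # toy values
trans_p = {("TO", "VB"): 0.34, ("TO", "NN"): 0.021}
out_p = {("to", "TO"): 1.0, ("race", "VB"): 0.00003, ("race", "NN"): 0.00041}

print(viterbi(["to", "race"], tags, start_p, trans_p, out_p))  # ['TO', 'VB']
```

Keeping only the best path per tag at each position makes the search linear in sentence length instead of exponential in the number of tag sequences.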