EECS 595 / LING 541 / SI 661: Natural Language Processing. Fall 2005. Lecture Notes #4: Features and unification.
Introduction • Grammatical categories have properties • Constraint-based formalisms • Example: “this flights”: agreement is difficult to handle at the level of grammatical categories • Example: “many water”: count/mass nouns • Sample rule that takes features into account: S → NP VP (but only if the number of the NP is equal to the number of the VP)
Feature structures
[CAT NP, NUMBER SINGULAR, PERSON 3]
[CAT NP, AGREEMENT [NUMBER SG, PERSON 3]]
Feature paths: {x agreement number}
Unification
[NUMBER SG] ⊔ [NUMBER SG]: succeeds (+)
[NUMBER SG] ⊔ [NUMBER PL]: fails (-)
[NUMBER SG] ⊔ [NUMBER []] = [NUMBER SG]
[NUMBER SG] ⊔ [PERSON 3] = ?
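The unification examples on this slide can be tried out with a minimal sketch. It assumes feature structures are plain nested Python dicts with string atoms, that an empty dict stands for the unspecified value [], and that there is no structure sharing (reentrancy). The names `unify` and `FAIL` are illustrative and not taken from the lecture.

```python
FAIL = None  # sentinel returned when unification fails (illustrative name)

def unify(f, g):
    """Unify two feature structures given as nested dicts with string atoms.

    An empty dict plays the role of the unspecified value [].
    Returns a new structure, or FAIL if the two inputs conflict.
    """
    if isinstance(f, dict) and isinstance(g, dict):
        result = dict(f)
        for feature, g_value in g.items():
            if feature in result:
                sub = unify(result[feature], g_value)
                if sub is FAIL:
                    return FAIL
                result[feature] = sub
            else:
                result[feature] = g_value
        return result
    # [] unifies with any atom
    if f == {}:
        return g
    if g == {}:
        return f
    # two atoms (or an atom and a non-empty structure) must be identical
    return f if f == g else FAIL

print(unify({"NUMBER": "SG"}, {"NUMBER": "SG"}))  # {'NUMBER': 'SG'}
print(unify({"NUMBER": "SG"}, {"NUMBER": "PL"}))  # None
print(unify({"NUMBER": "SG"}, {"NUMBER": {}}))    # {'NUMBER': 'SG'}
print(unify({"NUMBER": "SG"}, {"PERSON": "3"}))   # {'NUMBER': 'SG', 'PERSON': '3'}
```

Under these assumptions the last example succeeds: the two structures mention different features, so the result simply carries both.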
Agreement • S → NP VP, {NP AGREEMENT} = {VP AGREEMENT} • Does this flight serve breakfast? • Do these flights serve breakfast? • S → Aux NP VP, {Aux AGREEMENT} = {NP AGREEMENT}
Agreement • These flights • This flight • NP → Det Nominal, {Det AGREEMENT} = {Nominal AGREEMENT} • Verb → serve, {Verb AGREEMENT NUMBER} = PL • Verb → serves, {Verb AGREEMENT NUMBER} = SG
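As a rough illustration of how the constraint on NP → Det Nominal blocks “this flights”, the sketch below attaches an AGREEMENT structure to each word and only builds an NP when the two structures are compatible. The toy lexicon and the helpers `compatible` and `parse_np` are hypothetical, and the agreement structures are kept flat for brevity.

```python
# Hypothetical toy lexicon: each word carries a category and a flat AGREEMENT structure.
LEXICON = {
    "this":    {"CAT": "Det",     "AGREEMENT": {"NUMBER": "SG"}},
    "these":   {"CAT": "Det",     "AGREEMENT": {"NUMBER": "PL"}},
    "flight":  {"CAT": "Nominal", "AGREEMENT": {"NUMBER": "SG"}},
    "flights": {"CAT": "Nominal", "AGREEMENT": {"NUMBER": "PL"}},
}

def compatible(a, b):
    """True if two flat feature structures have no conflicting values."""
    return all(b[feature] == value for feature, value in a.items() if feature in b)

def parse_np(det, nominal):
    """Apply NP -> Det Nominal with {Det AGREEMENT} = {Nominal AGREEMENT}."""
    d, n = LEXICON[det], LEXICON[nominal]
    if d["CAT"] != "Det" or n["CAT"] != "Nominal":
        return None
    if not compatible(d["AGREEMENT"], n["AGREEMENT"]):
        return None  # agreement clash, e.g. "this flights"
    # The NP inherits the merged agreement features of its daughters.
    return {"CAT": "NP", "AGREEMENT": {**d["AGREEMENT"], **n["AGREEMENT"]}}

print(parse_np("this", "flight"))    # {'CAT': 'NP', 'AGREEMENT': {'NUMBER': 'SG'}}
print(parse_np("these", "flights"))  # {'CAT': 'NP', 'AGREEMENT': {'NUMBER': 'PL'}}
print(parse_np("this", "flights"))   # None
```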
Subcategorization • VP → Verb, {VP HEAD} = {Verb HEAD}, {VP HEAD SUBCAT} = INTRANS • VP → Verb NP, {VP HEAD} = {Verb HEAD}, {VP HEAD SUBCAT} = TRANS • VP → Verb NP NP, {VP HEAD} = {Verb HEAD}, {VP HEAD SUBCAT} = DITRANS
Regular expressions • Searching for “woodchuck” • Searching for “woodchucks” with an optional final “s” • Regular expressions • Finite-state automata (singular: automaton)
Regular expressions • Basic regular expression patterns • Perl-based syntax (slightly different from other notations for regular expressions) • Disjunctions [abc] • Ranges [A-Z] • Negations [^Ss] • Optional and repeated characters: ? and * • Wild cards . • Anchors ^ and $, also \b and \B • Disjunction, grouping, and precedence |
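Each pattern type in this list can be exercised on a short string. The sketch below uses Python's `re` module rather than Perl (an assumption made for convenience; the operators shown behave the same way in both). The sample sentence is invented.

```python
import re

text = "Woodchucks and a woodchuck met 3 geese."

examples = {
    "disjunction + optional s": re.findall(r"[wW]oodchucks?", text),
    "range":                    re.findall(r"[0-9]", text),
    "negation":                 re.findall(r"woodchuck[^Ss]", text),
    "wildcard":                 re.findall(r"g..se", text),
    "Kleene star":              re.findall(r"ge*se", text),
    "word boundary":            re.findall(r"\ba\b", text),
    "alternation":              re.findall(r"woodchuck|geese", text),
    "anchors":                  bool(re.search(r"^Woodchucks", text))
                                and bool(re.search(r"geese\.$", text)),
}

for name, result in examples.items():
    print(f"{name}: {result}")
```

Note that the negation example returns the match together with the following space, because `[^Ss]` consumes one character that is not an "s".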
Writing correct expressions • Exercise: write a Perl regular expression to match the English article “the”:
/the/
/[tT]he/
/\b[tT]he\b/
/[^a-zA-Z][tT]he[^a-zA-Z]/
/(^|[^a-zA-Z])[tT]he[^a-zA-Z]/
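To see why each successive pattern is needed, the five attempts can be run against a few test strings. This is a sketch using Python's `re` instead of Perl; the sample strings and the helper `hits` are made up for illustration.

```python
import re

PATTERNS = [
    r"the",
    r"[tT]he",
    r"\b[tT]he\b",
    r"[^a-zA-Z][tT]he[^a-zA-Z]",
    r"(^|[^a-zA-Z])[tT]he[^a-zA-Z]",
]

# Invented samples: plain use, capitalized, embedded in a word,
# followed by an underscore, and in mid-sentence.
SAMPLES = ["the cat", "The cat", "other", "the_25", "see the cat"]

def hits(pattern):
    """Return the samples in which the pattern matches somewhere."""
    return [s for s in SAMPLES if re.search(pattern, s)]

for pattern in PATTERNS:
    print(f"/{pattern}/ -> {hits(pattern)}")
```

The first pattern misses “The”, the second also fires inside “other”, the third rejects “the_25” because the underscore counts as a word character, the fourth cannot match at the start of a line, and the fifth handles all of these samples.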
A more complex example • Exercise: write a regular expression that will match “any PC with more than 500 MHz and 32 Gb of disk space for less than $1000”:
/\$[0-9]+/
/\$[0-9]+\.[0-9][0-9]/
/(^|\W)\$[0-9]+(\.[0-9][0-9])?\b/
/\b[0-9]+ *(MHz|[Mm]egahertz|GHz|[Gg]igahertz)\b/
/\b[0-9]+ *(Mb|[Mm]egabytes?)\b/
/\b[0-9]+(\.[0-9]+)? *(Gb|[Gg]igabytes?)\b/
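The patterns only find the price, speed, and disk strings; the numeric comparison has to happen outside the regular expression. A minimal sketch in Python's `re` follows. It assumes the dollar sign is escaped, that “32 Gb” means at least 32, and that GHz values should be converted to MHz. The function name `qualifies` and the sample ads are invented.

```python
import re

PRICE = re.compile(r"(?:^|\W)\$([0-9]+(?:\.[0-9][0-9])?)\b")
SPEED = re.compile(r"\b([0-9]+) *(MHz|[Mm]egahertz|GHz|[Gg]igahertz)\b")
DISK = re.compile(r"\b([0-9]+(?:\.[0-9]+)?) *(Gb|[Gg]igabytes?)\b")

def qualifies(ad):
    """True if the ad mentions >500 MHz, >=32 Gb of disk, and a price under $1000."""
    price, speed, disk = PRICE.search(ad), SPEED.search(ad), DISK.search(ad)
    if not (price and speed and disk):
        return False
    mhz = float(speed.group(1))
    if speed.group(2).lower().startswith("g"):
        mhz *= 1000  # convert GHz to MHz
    return (float(price.group(1)) < 1000
            and mhz > 500
            and float(disk.group(1)) >= 32)

print(qualifies("Desktop PC, 600 MHz, 40 Gb disk, only $999.99"))  # True
print(qualifies("Laptop 400 MHz 40 Gb $899"))                      # False
print(qualifies("PC 2 GHz, 64 gigabytes, $1200"))                  # False
```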
Substitutions and memory • Substitutions: s/colour/color/ • Memory (\1, \2, etc. refer back to matches): s/([0-9]+)/<\1>/
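The same two substitutions can be written with Python's `re.sub` (a stand-in for Perl's `s///`, which is what the slide uses). Perl's `s///` without the `/g` flag replaces only the first match, so `count=1` is passed here to mirror that; the input strings are invented.

```python
import re

# s/colour/color/ : plain substitution, first match only
fixed = re.sub(r"colour", "color", "the colour of a colour chart", count=1)

# s/([0-9]+)/<\1>/ : \1 refers back to what the first group matched
tagged = re.sub(r"([0-9]+)", r"<\1>", "the 35 boxes", count=1)

print(fixed)   # the color of a colour chart
print(tagged)  # the <35> boxes
```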
Eliza [Weizenbaum, 1966]
User: Men are all alike
ELIZA: IN WHAT WAY
User: They’re always bugging us about something or other
ELIZA: CAN YOU THINK OF A SPECIFIC EXAMPLE?
User: Well, my boyfriend made me come here
ELIZA: YOUR BOYFRIEND MADE YOU COME HERE
User: He says I’m depressed much of the time
ELIZA: I AM SORRY TO HEAR THAT YOU ARE DEPRESSED
Eliza-style regular expressions
Step 1: replace first person references with second person references
Step 2: use additional regular expressions to generate replies
Step 3: use scores to rank possible transformations
s/.* YOU ARE (depressed|sad) .*/I AM SORRY TO HEAR YOU ARE \1/
s/.* YOU ARE (depressed|sad) .*/WHY DO YOU THINK YOU ARE \1/
s/.* all .*/IN WHAT WAY/
s/.* always .*/CAN YOU THINK OF A SPECIFIC EXAMPLE/
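The three steps can be strung together in a few lines. This is only a sketch in Python's `re`: the pronoun list, the numeric scores, the upper-casing of the reply, and the fallback reply are all assumptions added for illustration, while the four reply patterns are the ones on the slide.

```python
import re

# Step 1: first person -> second person (hypothetical and far from complete).
SWAPS = [
    (r"\bI'm\b", "YOU ARE"),
    (r"\bI am\b", "YOU ARE"),
    (r"\bmy\b", "YOUR"),
    (r"\bme\b", "YOU"),
]

# Steps 2 and 3: (score, pattern, reply template). The scores are made up.
RULES = [
    (3, r".* YOU ARE (depressed|sad) .*", r"I AM SORRY TO HEAR YOU ARE \1"),
    (2, r".* YOU ARE (depressed|sad) .*", r"WHY DO YOU THINK YOU ARE \1"),
    (1, r".* all .*", "IN WHAT WAY"),
    (1, r".* always .*", "CAN YOU THINK OF A SPECIFIC EXAMPLE"),
]

def reply(utterance):
    """Return the highest-scoring reply, or a fallback if no rule matches."""
    text = utterance
    for pattern, replacement in SWAPS:
        text = re.sub(pattern, replacement, text)
    best = None
    for score, pattern, template in RULES:
        match = re.match(pattern, text)
        if match and (best is None or score > best[0]):
            best = (score, match.expand(template))
    return best[1].upper() if best else "PLEASE GO ON"  # fallback is an assumption

print(reply("Men are all alike"))
print(reply("They're always bugging us about something or other"))
print(reply("He says I'm depressed much of the time"))
```

With these rules the third input yields I AM SORRY TO HEAR YOU ARE DEPRESSED, because the higher-scoring of the two matching transformations wins.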
Finite-state automata • Finite-state automata (FSA) • Regular languages • Regular expressions
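Finite-state automata (machines)
Strings in the language: baa! baaa! baaaa! baaaaa! ...
Regular expression: /baa+!/
[Figure: automaton with states q0, q1, q2, q3, q4, where q4 is the final state; transitions q0 -b-> q1 -a-> q2 -a-> q3 -!-> q4, plus a loop on a at q3. Labels in the figure: state, transition, final state]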
Input tape: a b a ! b (the machine starts in state q0)
Finite-state automata • Q: a finite set of N states q0, q1, …, qN−1 • Σ: a finite input alphabet of symbols • q0: the start state • F: the set of final states, F ⊆ Q • δ(q, i): the transition function; given a state q ∈ Q and an input symbol i ∈ Σ, it returns a new state q′ ∈ Q
The FSM toolkit and friends • Developed at AT&T Research (Riley, Pereira, Mohri, Sproat) • Download: http://www.research.att.com/sw/tools/fsm/tech.html and http://www.research.att.com/sw/tools/lextools/ • Tutorial available • 4 useful parts: FSM, Lextools, GRM, Dot (separate) • /data2/tools/fsm-3.6/bin • /data2/tools/lextools/bin • /data2/tools/dot/bin
D-RECOGNIZE
function D-RECOGNIZE(tape, machine) returns accept or reject
  index ← Beginning of tape
  current-state ← Initial state of machine
  loop
    if End of input has been reached then
      if current-state is an accept state then
        return accept
      else
        return reject
    elsif transition-table[current-state, tape[index]] is empty then
      return reject
    else
      current-state ← transition-table[current-state, tape[index]]
      index ← index + 1
  end
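A direct Python transcription of D-RECOGNIZE might look like the sketch below. The transition table encodes the automaton for /baa+!/ with states 0 to 4 and state 4 as the only final state; the dictionary layout and the names `TRANSITIONS` and `d_recognize` are choices made here, not part of the lecture.

```python
# TRANSITIONS[state][symbol] -> next state; a missing entry means "empty" in the table.
TRANSITIONS = {
    0: {"b": 1},
    1: {"a": 2},
    2: {"a": 3},
    3: {"a": 3, "!": 4},
    4: {},
}
START = 0
FINAL = {4}

def d_recognize(tape, transitions=TRANSITIONS, start=START, final=FINAL):
    """Return True (accept) or False (reject) for a deterministic FSA."""
    state = start
    for symbol in tape:
        next_state = transitions[state].get(symbol)
        if next_state is None:  # no transition defined: reject
            return False
        state = next_state
    # End of input reached: accept only in a final state.
    return state in final

for s in ["baa!", "baaaa!", "ba!", "baa", "baa!b"]:
    print(s, d_recognize(s))
```

Only the first two strings are accepted: "ba!" has too few a's, "baa" stops before the final state, and "baa!b" has no transition out of state 4.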
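Adding a failing state
[Figure: the automaton for /baa+!/ (states q0 to q4) extended with a fail state qF; every input symbol with no legal transition from a state (for example a or ! in q0, b or ! in q1) leads to qF]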
Languages and automata • Formal languages: regular languages, non-regular languages • Deterministic vs. non-deterministic FSAs • Epsilon (ε) transitions
Using NFSAs to accept strings • Backup: add markers at choice points, then possibly revisit underexplored markers • Look-ahead: look ahead in input • Parallelism: look at alternatives in parallel
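Of the three strategies, parallelism is the shortest to sketch: instead of a single current state, the recognizer tracks the set of all states the machine could be in. The NFSA below is a hypothetical non-deterministic machine for /baa+!/ (state 2 can either loop on a or move on to state 3); it has no ε-transitions, which would need an extra closure step.

```python
# delta maps (state, symbol) to the SET of possible next states.
NFSA = {
    (0, "b"): {1},
    (1, "a"): {2},
    (2, "a"): {2, 3},  # the non-deterministic choice point
    (3, "!"): {4},
}

def nd_recognize(tape, delta=NFSA, start=0, final=frozenset({4})):
    """Accept if at least one path through the NFSA ends in a final state."""
    current = {start}
    for symbol in tape:
        current = {nxt
                   for state in current
                   for nxt in delta.get((state, symbol), ())}
        if not current:  # every alternative has died out
            return False
    return bool(current & final)

for s in ["baa!", "baaaa!", "ba!", "baa"]:
    print(s, nd_recognize(s))
```

Tracking state sets this way is the same idea that underlies converting an NFSA into an equivalent DFSA.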
More about FSAs • Transducers • Equivalence of DFSAs and NFSAs • Recognition as search: depth-first, breadth-first
Regular languages • Operations on regular languages and FSAs: concatenation, closure, union • Properties of regular languages (closed under concatenation, union (disjunction), intersection, difference, complementation, reversal, Kleene closure)
An exercise • J&M 2.8. Write a regular expression for the language accepted by the NFSA in the Figure.
Morphology and Finite-State Transducers
Morphemes • Stems, affixes • Affixes: prefixes; suffixes; infixes, e.g., hingi (borrow) – humingi (agent) in Tagalog; circumfixes, e.g., sagen – gesagt in German • Concatenative morphology • Templatic morphology (Semitic languages): lmd (learn), lamad (he studied), limed (he taught), lumad (he was taught)
Morphological analysis • rewrites • unbelievably
Inflectional morphology • Tense, number, person, mood, aspect • Five verb forms in English • 40+ forms in French • Six cases in Russian: http://www.departments.bucknell.edu/russian/language/case.html • Up to 40,000 forms in Turkish (you cause X to cause Y to … do Z)
Derivational morphology • Nominalization: computerization, appointee, killer, fuzziness • Formation of adjectives: computational, embraceable, clueless
Finite-state morphological parsing • Cats: cat +N +PL • Cat: cat +N +SG • Cities: city +N +PL • Geese: goose +N +PL • Ducks: (duck +N +PL) or (duck +V +3SG) • Merging: merge +V +PRES-PART • Caught: (catch +V +PAST-PART) or (catch +V +PAST)
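The input/output behavior listed on this slide can be imitated with a small lexicon-and-suffix lookup. This is deliberately not a finite-state transducer, only a sketch of what the parser is expected to return; the word lists, the suffix handling, and the function `parse` are hypothetical and cover just the slide's examples.

```python
# Hypothetical toy lexicon covering only the examples on the slide.
NOUNS = {"cat", "city", "duck", "goose"}
VERBS = {"duck", "merge", "catch"}
IRREGULAR = {
    "geese": [("goose", "+N +PL")],
    "caught": [("catch", "+V +PAST-PART"), ("catch", "+V +PAST")],
}

def parse(word):
    """Return every analysis (stem plus morphological tags) found for a word."""
    word = word.lower()
    analyses = [f"{stem} {tags}" for stem, tags in IRREGULAR.get(word, [])]
    if word in NOUNS:
        analyses.append(f"{word} +N +SG")
    if word.endswith("ies") and word[:-3] + "y" in NOUNS:  # cities -> city
        analyses.append(f"{word[:-3]}y +N +PL")
    if word.endswith("s"):
        stem = word[:-1]
        if stem in NOUNS:
            analyses.append(f"{stem} +N +PL")
        if stem in VERBS:
            analyses.append(f"{stem} +V +3SG")
    if word.endswith("ing"):
        for stem in (word[:-3], word[:-3] + "e"):  # merging -> merge
            if stem in VERBS:
                analyses.append(f"{stem} +V +PRES-PART")
    return analyses

for w in ["Cats", "Cities", "Geese", "Ducks", "Merging", "Caught"]:
    print(w, parse(w))
```

The ambiguous forms ("ducks", "caught") come back with two analyses each, which is the behavior a morphological parser needs to preserve for later disambiguation.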
Principles of morphological parsing • Lexicon • Morphotactics (e.g., plural follows noun) • Orthography (easy → easier) • Irregular nouns: e.g., geese, sheep, mice • Irregular verbs: e.g., caught, ate, eaten
FSA for adjectives • Big, bigger, biggest • Cool, cooler, coolest, coolly • Red, redder, reddest • Clear, clearer, clearest, clearly, unclear, unclearly • Happy, happier, happiest, happily • Unhappy, unhappier, unhappiest, unhappily • What about: unbig, redly, and realest?
Using FSA for recognition • Is a string a legitimate word or not? • Two-level morphology: lexical level + surface level (Koskenniemi 1983) • Finite-state transducers (FST) – used for regular relations • Inversion and composition of FST
Orthographic rules • Beg/begging • Make/making • Watch/watches • Try/tries • Panic/panicked
Combining FST lexicon and rules • Cascades of transducers: the output of one becomes the input of another
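A cascade can be imitated by applying rewrite rules in sequence, each one consuming the previous rule's output. The sketch below uses Python regular-expression substitutions in place of real transducers. It assumes an intermediate form in which "^" marks a morpheme boundary, and the rule set is a simplified, hypothetical one that covers the spelling-change pairs shown earlier (beg/begging, make/making, watch/watches, try/tries, panic/panicked). The consonant-doubling rule in particular overgenerates, since it ignores stress.

```python
import re

# (name, pattern, replacement); "^" is the morpheme boundary in the intermediate form.
RULES = [
    ("e-insertion",        r"(s|z|x|ch|sh)\^s$",                        r"\1es"),
    ("y-replacement",      r"([^aeiou])y\^s$",                          r"\1ies"),
    ("k-insertion",        r"c\^(ed|ing)$",                             r"ck\1"),
    ("e-deletion",         r"e\^(ed|ing)$",                             r"\1"),
    ("consonant doubling", r"([^aeiou][aeiou])([bdgmnpt])\^(ed|ing)$",  r"\1\2\2\3"),
    ("boundary removal",   r"\^",                                       ""),
]

def surface(intermediate):
    """Run the rules as a cascade: each rule rewrites the previous rule's output."""
    form = intermediate
    for _name, pattern, replacement in RULES:
        form = re.sub(pattern, replacement, form)
    return form

for form in ["beg^ing", "make^ing", "watch^s", "try^s", "panic^ed", "cat^s"]:
    print(form, "->", surface(form))
```

A regular form such as "cat^s" passes through every spelling rule unchanged and only loses its boundary marker in the last step.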
Phonetic symbols • IPA • Arpabet • Examples
Using WFST for language modeling • Phonetic representation • Part-of-speech tagging
Some POS statistics • Preposition list from COBUILD • Single-word particles • Conjunctions • Pronouns • Modal verbs
Tagsets for English • Penn Treebank • Other tagsets (see Week 1 slides)
POS ambiguity • Degrees of ambiguity (DeRose 1988) • Rule-based POS tagging • ENGTWOL (Voutilainen et al.) • Sample rule: the Adverbial-That rule (“it isn’t that odd”)
Given input: “that”
if
  (+1 A/ADV/QUANT);
  (+2 SENT-LIM);
  (NOT -1 SVOC/A); (not a verb like “consider”)
then eliminate non-ADV tags
else eliminate ADV tag
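The Adverbial-That rule can be read as a function that prunes the candidate tags of “that” by looking at its neighbors. The sketch below is a loose Python rendering, not ENGTWOL's actual formalism: the token representation (word plus a set of candidate tags), the tag names other than those in the rule, the tiny verb list standing in for SVOC/A, and the treatment of the end of the token list as a sentence limit are all assumptions.

```python
# Hypothetical stand-in for the SVOC/A verb class (verbs like "consider").
SVOC_A_VERBS = {"consider", "considers", "considered"}

def adverbial_that(tokens, i):
    """Prune the candidate tags of tokens[i], which must be the word 'that'.

    tokens is a list of (word, set_of_candidate_tags) pairs.
    """
    word, tags = tokens[i]
    if word.lower() != "that":
        raise ValueError("rule applies only to the word 'that'")
    if "ADV" not in tags:
        return set(tags)  # nothing to decide
    next_tags = tokens[i + 1][1] if i + 1 < len(tokens) else set()
    next_ok = bool(next_tags & {"A", "ADV", "QUANT"})                  # (+1 A/ADV/QUANT)
    limit_ok = i + 2 >= len(tokens) or "SENT-LIM" in tokens[i + 2][1]  # (+2 SENT-LIM)
    prev_svoc = i > 0 and tokens[i - 1][0].lower() in SVOC_A_VERBS     # (-1 SVOC/A)
    if next_ok and limit_ok and not prev_svoc:
        return {"ADV"}        # eliminate non-ADV tags
    return tags - {"ADV"}     # eliminate ADV tag

THAT = {"ADV", "DET", "CS", "PRON"}
sent1 = [("it", {"PRON"}), ("isn't", {"V"}), ("that", THAT), ("odd", {"A"}), (".", {"SENT-LIM"})]
sent2 = [("I", {"PRON"}), ("consider", {"V"}), ("that", THAT), ("odd", {"A"}), (".", {"SENT-LIM"})]

print(adverbial_that(sent1, 2))  # {'ADV'}
print(sorted(adverbial_that(sent2, 2)))  # ['CS', 'DET', 'PRON']
```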