CPSC 503 Computational Linguistics • Lecture 4 • Giuseppe Carenini • CPSC503 Spring 2004
Today 1/23 • Finite State Transducers (FSTs) and Morphological Parsing • Stemming (Porter Stemmer)
Computational problems in Morphology • Recognition: recognize whether a string is an English word (FSA) • Parsing/Generation: word <-> stem, class, lexical features (e.g., lies <-> lie +N +PL or lie +V +3SG) • Stemming: word -> stem
Finite State Transducers (FSTs) • FSA cannot help …… • Need to extend FSA • Add another tape • Add extra symbols to the transitions • On one tape we read “cats”, on the other we write “cat +N +PL” (or vice versa)
FSTs as translators • [Figure: parsing and generation as the two directions of translation between the tapes]
Example • [Figure: FST with states q0 to q7 and arcs l:l, i:i, e:e, +N:ε, +PL:s, +V:ε, +3SG:s] • Transitions (as a translator): • l:l means read an l on one tape and write an l on the other (or vice versa) • +N:ε means read a +N symbol on one tape and write nothing on the other (or vice versa) • +PL:s means read +PL and write an s (or vice versa) • …
Examples (as a translator) • Given the surface tape l i e s, produce the lexical tape • Given the lexical tape l i e +V +3SG, produce the surface tape
Examples (as a recognizer and a generator) • Recognizer: both tapes are given (lexical l i e +V +3SG, surface l i e s) • Generator: both the lexical and the surface tape are produced
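The lie transducer from the Example slide can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's implementation: the arc labels come from the slide, but which states each arc connects, the choice of final states, and the names ARCS, parse, and generate are assumptions. Because the search follows every matching arc, parsing the ambiguous surface form returns both analyses.

```python
EPS = ""  # epsilon: nothing is read or written on that tape

# Each arc is (source state, lexical symbol, surface symbol, target state).
# The wiring of q4..q7 is assumed; the slide only lists the labels.
ARCS = [
    ("q0", "l", "l", "q1"),
    ("q1", "i", "i", "q2"),
    ("q2", "e", "e", "q3"),
    ("q3", "+N", EPS, "q4"),
    ("q4", "+PL", "s", "q6"),
    ("q3", "+V", EPS, "q5"),
    ("q5", "+3SG", "s", "q7"),
]
START, FINALS = "q0", {"q6", "q7"}


def parse(surface):
    """Read the surface tape and return every lexical string the FST can write."""
    results = []

    def walk(state, pos, out):
        if state in FINALS and pos == len(surface):
            results.append(out)
        for src, lex, surf, dst in ARCS:
            # startswith("", pos) is always true, so epsilon arcs consume nothing.
            if src == state and surface.startswith(surf, pos):
                sep = " " if lex.startswith("+") else ""
                walk(dst, pos + len(surf), out + sep + lex)

    walk(START, 0, "")
    return results


def generate(lexical):
    """Read a list of lexical symbols and return every surface string written."""
    results = []

    def walk(state, pos, out):
        if state in FINALS and pos == len(lexical):
            results.append(out)
        for src, lex, surf, dst in ARCS:
            if src == state and pos < len(lexical) and lexical[pos] == lex:
                walk(dst, pos + 1, out + surf)

    walk(START, 0, "")
    return results


print(parse("lies"))                             # ['lie +N +PL', 'lie +V +3SG']
print(generate(["l", "i", "e", "+V", "+3SG"]))   # ['lies']
```

The same arc list serves both directions; only the tape that is read changes.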
FST definition • Q: a finite set of states • I, O: input and output alphabets (which may include ε) • Σ: a finite alphabet of complex symbols i:o, with i ∈ I and o ∈ O • q0: the start state • F: a set of accept/final states (F ⊆ Q) • δ: a transition relation that maps Q × Σ to Q
FSTs can be used as… • Translators: input one string over I, output another over O (or vice versa) • Recognizers: input a string over I × O • Generators: output a string over I × O • Terminology warning!
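To make the definition concrete, here is a minimal sketch of an FST used as a recognizer over i:o pairs. The class name FST, the field names, and the cat example machine are illustrative assumptions; the alphabets I and O are left implicit in the keys of delta, and delta is treated as a deterministic function for simplicity.

```python
from dataclasses import dataclass

EPS = ""  # stands in for epsilon on either tape


@dataclass
class FST:
    states: set   # Q
    start: str    # q0
    finals: set   # F, a subset of Q
    delta: dict   # maps (state, (i, o)) to the next state

    def accepts(self, pairs):
        """Recognizer mode: consume a sequence of i:o pairs from both tapes."""
        state = self.start
        for pair in pairs:
            state = self.delta.get((state, pair))
            if state is None:  # no transition for this pair
                return False
        return state in self.finals


# Hypothetical machine relating lexical "cat +N +PL" to surface "cats".
cat_plural = FST(
    states={"q0", "q1", "q2", "q3", "q4", "q5"},
    start="q0",
    finals={"q5"},
    delta={
        ("q0", ("c", "c")): "q1",
        ("q1", ("a", "a")): "q2",
        ("q2", ("t", "t")): "q3",
        ("q3", ("+N", EPS)): "q4",
        ("q4", ("+PL", "s")): "q5",
    },
)

pairs = [("c", "c"), ("a", "a"), ("t", "t"), ("+N", EPS), ("+PL", "s")]
print(cat_plural.accepts(pairs))  # True
```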
A step back: FSA can represent morphological knowledge • Lexicon: list of stems and affixes, together with basic information about them • Morphotactics: the rules governing the ordering of morphemes • Orthographic rules: model changes in morphemes when they combine
FSA for inflectional morphology of plural • [Figure: FSA with paths for some regular nouns and some irregular nouns]
FST for inflectional morphology of plural • [Figure: FST with paths for some regular nouns and some irregular nouns, including an o:i arc]
Examples • Given the surface tape m i c e, produce the lexical tape • Given the lexical tape c a t +N +PL, produce the surface tape
Problems/Challenges • Ambiguity: one word can correspond to multiple structures • Spelling changes: may occur when two morphemes are combined (inflectionally), e.g., butterfly + -s -> butterflies
Ambiguity • ND (nondeterministic) recognition: multiple paths through a machine may lead to an accept state (it does not matter which path was actually traversed) • In ND parsing the path to an accept state does matter: different paths represent different parses, and different outputs will result • [Figure: FST with states q0 to q7 and arcs l:l, i:i, e:e, +N:ε, +V:ε, +PL:s, +PL:s]
Ambiguity: more complex example • What’s the right parse for Unionizable? • Union-ize-able • Un-ion-ize-able • Each would represent a valid path through an FST for derivational morphology.
Dealing with Morphological Ambiguity • There are a number of ways to deal with this problem • Simply take the first output found • Find all the possible outputs (all paths) and return them all (without choosing) • Bias the search so that only one or a few likely paths are explored • Then use part-of-speech tagging to choose
Spelling Changes • When morphemes are combined inflectionally, the spelling at the boundaries may change • Examples: • E-insertion: when -s is added to a word, -e is inserted if the word ends in -s, -z, -sh, -ch, or -x (e.g., kiss, miss, waltz, bush, watch, rich, box) • Y-replacement: when -s or -ed is added to a word ending in -y, the -y changes to -ie or -i respectively (e.g., try, butterfly)
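The two spelling rules can be sketched as a plain string function. This is only an illustration of the rules as stated on the slide, not a transducer; the function name add_suffix is hypothetical, and the restriction of Y-replacement to a -y preceded by a consonant is an added assumption (the slide does not state it) so that forms like boys are left alone.

```python
import re


def add_suffix(stem, suffix):
    """Attach -s or -ed to a stem, applying E-insertion and Y-replacement."""
    # E-insertion: insert -e before -s after s, z, sh, ch, x.
    if suffix == "s" and re.search(r"(s|z|sh|ch|x)$", stem):
        return stem + "es"
    # Y-replacement: y -> ie before -s, y -> i before -ed.
    # Assumption: only when the y follows a consonant.
    if suffix in ("s", "ed") and re.search(r"[^aeiou]y$", stem):
        return stem[:-1] + ("ies" if suffix == "s" else "ied")
    return stem + suffix


for stem in ["kiss", "waltz", "bush", "watch", "box", "try", "butterfly", "cat"]:
    print(stem, "->", add_suffix(stem, "s"))
print("try ->", add_suffix("try", "ed"))
```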
Solution: Multi-Tape Machines • Add intermediate tape • Use the output of one tape machine as the input to the next • Add intermediate symbols • ^ morpheme boundary • # word boundary
Multi-Level Tape Machines • FST-1 translates between the lexical and the intermediate level • FST-2 handles the spelling changes (due to one rule) to the surface tape
FST-1 for inflectional morphology of plural • [Figure: FST-1 with paths for some regular nouns and some irregular nouns; arc labels include +PL:^s#, +PL:^, o:i, ε:s, ε:#, and #]
Example • Given the lexical tape f o x +N +PL, produce the intermediate tape • Given the lexical tape m o u s e +N +PL, produce the intermediate tape
FST-2 for E-insertion (Intermediate to Surface) • E-insertion: when -s is added to a word, -e is inserted if the word ends in -s, -z, -sh, -ch, or -x, as in fox^s# <-> foxes • [Figure: E-insertion transducer, including a #:ε arc]
Examples • Given the intermediate tape f o x ^ s #, produce the surface tape • Given the intermediate tape b o x ^ i n g #, produce the surface tape
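The two-level cascade (lexical to intermediate, then intermediate to surface) can be imitated with two ordinary functions whose output and input meet at the intermediate tape. This is a sketch under stated assumptions rather than a compiled transducer: the function names, the tiny irregular-noun table, and the +SG handling are hypothetical, and the E-insertion rule is written as a regular-expression rewrite over the ^ and # symbols.

```python
import re

# Hypothetical irregular-noun lexicon: the plural is stored whole.
IRREGULAR_PLURALS = {"mouse": "mice", "goose": "geese"}


def lexical_to_intermediate(lexical):
    """Stand-in for FST-1: 'fox +N +PL' -> 'fox^s#'."""
    stem, *tags = lexical.split()
    if tags == ["+N", "+PL"]:
        if stem in IRREGULAR_PLURALS:
            return IRREGULAR_PLURALS[stem] + "#"
        return stem + "^s#"
    if tags == ["+N", "+SG"]:
        return stem + "#"
    raise ValueError(f"unsupported analysis: {lexical!r}")


def intermediate_to_surface(intermediate):
    """Stand-in for FST-2: apply E-insertion, then erase ^ and #."""
    with_e = re.sub(r"(s|z|sh|ch|x)\^s#", r"\1es#", intermediate)
    return with_e.replace("^", "").replace("#", "")


def generate(lexical):
    """Cascade: the output of the first stage is the input of the second."""
    return intermediate_to_surface(lexical_to_intermediate(lexical))


print(lexical_to_intermediate("fox +N +PL"))   # fox^s#
print(generate("fox +N +PL"))                  # foxes
print(generate("mouse +N +PL"))                # mice
print(intermediate_to_surface("box^ing#"))     # boxing
```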
Where are we?
Final Scheme: Part 1
Final Scheme: Part 2
Intersection(T1, T2) • States of T1 and T2: Q1 and Q2 • States of intersection: Q1 × Q2 • Transitions of T1 and T2: δ1, δ2 • Transitions of intersection: δ3, where δ3((xa, ya), i:c) = (xb, yb) iff δ1(xa, i:c) = xb AND δ2(ya, i:c) = yb
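The product construction for intersection translates almost directly into code. In this minimal sketch each transition function is a Python dict from (state, (i, c)) to the next state; the function name intersect and the toy machines are assumptions, ε is treated as an ordinary symbol, and start and final states are omitted for brevity.

```python
def intersect(delta1, delta2):
    """Build delta3 over pairs of states: both machines must take the same i:c arc."""
    delta3 = {}
    for (xa, label1), xb in delta1.items():
        for (ya, label2), yb in delta2.items():
            if label1 == label2:
                delta3[((xa, ya), label1)] = (xb, yb)
    return delta3


# Toy machines (hypothetical): only the a:b arc is shared.
t1 = {("p0", ("a", "b")): "p1", ("p1", ("c", "d")): "p0"}
t2 = {("r0", ("a", "b")): "r0"}

print(intersect(t1, t2))
# {(('p0', 'r0'), ('a', 'b')): ('p1', 'r0')}
```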
Composition(T1, T2) • States of T1 and T2: Q1 and Q2 • States of composition: Q1 × Q2 • Transitions of T1 and T2: δ1, δ2 • Transitions of composition: δ3, where δ3((xa, ya), i:o) = (xb, yb) iff there exists c such that δ1(xa, i:c) = xb AND δ2(ya, c:o) = yb
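Composition can be sketched the same way: an arc i:c in the first machine is chained with an arc c:o in the second, and the shared symbol c disappears. Here each machine is a set of arcs (source, input, output, target), which allows the result to be nondeterministic; the name compose and the toy arcs are assumptions, and the special handling that ε arcs need in a full implementation is left out.

```python
def compose(arcs1, arcs2):
    """Pair every i:c arc of T1 with every c:o arc of T2 to get an i:o arc."""
    return {
        ((xa, ya), i, o, (xb, yb))
        for (xa, i, c1, xb) in arcs1
        for (ya, c2, o, yb) in arcs2
        if c1 == c2
    }


# Toy machines (hypothetical): T1 rewrites a as b, T2 rewrites b as c.
t1 = {("p0", "a", "b", "p1")}
t2 = {("r0", "b", "c", "r1"), ("r0", "x", "y", "r1")}

print(compose(t1, t2))
# {(('p0', 'r0'), 'a', 'c', ('p1', 'r1'))}
```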
Other important applications of FSTs in NLP • Segmentation: finding word boundaries in text (?!) • Shallow syntactic parsing: e.g., find only noun phrases • Dialogue Act Disambiguation: “right” (IUI-04) • Phonological Rules….
FSTs in Practice • Install an FST package…… (pointers) • Describe your “formal language” (e.g., lexicon, morphotactics, and rules) in a RegExp-like notation (pointer) • Your specification is compiled into an FST • NOTE: FSTs for the morphology of a natural language may have 10^5 to 10^7 states and arcs
Computational problems in Morphology • Recognition: recognize whether a string is an English word (FSA) • Parsing/Generation (FST): word <-> stem, class, lexical features (e.g., lies <-> lie +N +PL or lie +V +3SG) • Stemming: word -> stem
Stemmer • E.g., the Porter algorithm (Appendix B), which is based on a series of sets of simple cascaded rewrite rules: • ATIONAL -> ATE (relational -> relate) • ING -> ε if the stem contains a vowel (motoring -> motor) • Cascade of rules applied to computerization: • -ization -> -ize: computerize • -ize -> ε: computer • Errors occur: organization -> organ, doing -> doe, university -> universe
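The cascade idea can be illustrated with just the four rules quoted on the slide. This is a toy sketch, not the Porter algorithm: the real stemmer has many more rules and extra conditions on the stem, and the names rewrite, strip_ing, and toy_stem are hypothetical.

```python
import re

VOWEL = re.compile(r"[aeiou]")


def rewrite(word, suffix, replacement):
    """Replace a trailing suffix, or return the word unchanged."""
    if word.endswith(suffix):
        return word[: -len(suffix)] + replacement
    return word


def strip_ing(word):
    """ING -> epsilon, but only if the remaining stem contains a vowel."""
    if word.endswith("ing") and VOWEL.search(word[:-3]):
        return word[:-3]
    return word


def toy_stem(word):
    """Apply the rules in sequence; each stage sees the previous stage's output."""
    word = rewrite(word, "ational", "ate")
    word = strip_ing(word)
    word = rewrite(word, "ization", "ize")
    word = rewrite(word, "ize", "")
    return word


for w in ["relational", "motoring", "computerization", "sing", "organization"]:
    print(w, "->", toy_stem(w))
```

The last example reproduces one of the errors listed on the slide: organization is reduced to organ because the rules look only at suffixes, not at meaning.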
Stemming is mainly used in Information Retrieval • Run a stemmer on the documents to be indexed • Run a stemmer on users' queries • Compute similarity between queries and documents (based on the stems they contain)
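A minimal sketch of this pipeline is shown below. Everything specific in it is an assumption made for illustration: the crude suffix stripper stands in for a real stemmer, and Jaccard overlap of stem sets stands in for whatever similarity measure a retrieval system actually uses.

```python
def toy_stem(word):
    """Crude stand-in for a real stemmer: strip a trailing -ing or -s."""
    if word.endswith("ing") and len(word) > 5:
        return word[:-3]
    if word.endswith("s") and len(word) > 3:
        return word[:-1]
    return word


def stems(text):
    """Stem every whitespace-separated token and collect the results as a set."""
    return {toy_stem(token) for token in text.lower().split()}


def similarity(query, document):
    """Jaccard overlap between the stem sets of a query and a document."""
    q, d = stems(query), stems(document)
    union = q | d
    return len(q & d) / len(union) if union else 0.0


print(similarity("motoring clubs", "motor club"))  # 1.0
print(similarity("motoring clubs", "cat"))         # 0.0
```

Without stemming, the first pair would share no tokens at all, which is the motivation for stemming both documents and queries.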
Porter as an FST • The original exposition of the Porter stemmer did not describe it as a transducer but… • Each stage is a separate transducer • The stages can be composed to get one big transducer
Formalisms and associated Algorithms <-> Linguistic Knowledge • State Machines (no prob.): Finite State Automata (and Regular Expressions), Finite State Transducers <-> (English) Morphology • Rule systems (and prob. version), e.g., (Prob.) Context-Free Grammars <-> Syntax • Logical formalisms (First-Order Logics) <-> Semantics • AI planners <-> Pragmatics, Discourse and Dialogue
Next Time • Intro to probability and information theory • On your preferred source read about • Conditional probability • Bayes’ rule • Independence • Entropy • Conditional Entropy and Mutual Information
Lexical to Intermediate Level
FST for inflectional morphology of plural • [Figure: FST with paths for some regular nouns and some irregular nouns]
Foxes
FST Review • FSTs allow us to take an input and deliver a structure based on it • Or… take a structure and create a surface form • Or take a structure and create another structure
Review • In many applications it is convenient to decompose the problem into a set of cascaded transducers, where the output of one feeds into the input of the next.
English Spelling Changes • We use one machine to transduce between the lexical and the intermediate level, and another to handle the spelling changes to the surface tape
FSTs can be used as… • Translators: input one string (a sequence over I), output another one (a sequence over O), or vice versa • Recognizers: input both strings (a sequence over I × O) • Generators: output both strings (a sequence over I × O)