590 likes | 609 Views
Natural Language Processing. Finite State Transducers. Acknowledgement. Material in this lecture derived/copied from Richard Sproat CL46 Lectures Lauri Karttunen LSA lectures 2005. Three Key Concepts. Regular Relations. Finite State Transducers. Computational Morphology.
E N D
Natural Language Processing Finite State Transducers
Acknowledgement • Material in this lecture derived/copied from • Richard Sproat CL46 Lectures • Lauri Karttunen LSA lectures 2005 CSA3180 NLP
Three Key Concepts Regular Relations Finite State Transducers Computational Morphology CSA3180 NLP
Three Key Concepts Regular Relations Finite State Transducers Computational Morphology CSA3180 NLP
A Regular Set L1 ab abab ababab abababab ababababab . . CSA3180 NLP
Two Regular Sets L1 L2 ab abab ababab abababab ababababab . . ba baba bababa babababa bababababa . . CSA3180 NLP
A Regular Relation L1 x L2 L1 L2 ab abab ababab abababab ababababab . . ba baba bababa babababa bababababa . . or {("ab","ba"), ("abab","baba"),...} CSA3180 NLP
Some closure properties for regular relations • Concatenation [R1 R2] • Power (Rn) • Reversal • Inversion (R-1) • Composition: R1 ○ R2 CSA3180 NLP
Concatenation and Power Concatenation R1 = {("a","b")} R2 = {("c","d")} [R1 R2] = {("ac","bd")} Power R1+ = {("a","b"),("aa","bb"), ("aaa","bbb"), ...} CSA3180 NLP
Composition • R1 ○ R2 denotes the composition of relations R1 and R2. • Definition If R1 contains <x,y> And R2 contains <y,z> Then R1 ○ R2 contains <x,z> • R1 and R2 and B must be relations. If either is just a language, it is assumed to abbreviate the identity relation. • R1 ○ R2 is written [R1 .o. R2] in xfst 61 CSA3180 NLP
Closure Properties of Regular Languages and Relations Operation Regular Languages Regular Relations Union yes yes Concatenation yes yes Iteration yes yes Intersection yes no Subtraction yes no Complementation yes no Composition n/a yes CSA3180 NLP
Three Key Concepts Regular Relations Finite State Transducers Computational Morphology CSA3180 NLP
Morphology • Study of word structure and formation in terms of morphemes • Morpheme: smallest independent unit in a word that bear some meaning, such as rabbit and s. • Two kinds of morphology • Inflectional • Derivational CSA3180 NLP
Inflectional+s plural+ed past category preserving productive: always applies (esp. new words, e.g. fax) systematic: same semantic effect Derivational+ment category changingallot+ment not completely productive: detractment* not completely systematic: department Inflectional/DerivationalMorphology CSA3180 NLP
Example: English Noun Inflections CSA3180 NLP
Analysis Generation hang V Past leaf N Pl leave N Pl leave V Sg3 leaves hanged hung Computational morphology CSA3180 NLP
Two Challenges for Computational Morphology • Handling Morphotactics • Words are composed of smaller elements that must be combined in a certain order: • piti-less-ness is English • piti-ness-less is not English • Handling Phonological Alternations • The shape of an element may vary depending on the context • pity is realized as pitiin pitilessness • die becomes dy in dying CSA3180 NLP
Morphological Parsing • The goal of morphological parsing is to find out what morphemes a given word is built from. cats catNoun PL mice mouseNoun PL foxes fox N PL foxes fox V 3SING PRES • Two kinds of information • Lexeme (which dictionary word) • Morphosyntactic information CSA3180 NLP
Morphological Parsing Output Analysis cat N PL Input Word cats Morphological Parser • Issues • Non-determinism • Arbitrary choice of morpheme symbols CSA3180 NLP
Morphological Generation Input Analysis cat N PL Output Word cats Morphological Generator • Issues • Reversibility: how much can be shared • regardless of direction? CSA3180 NLP
Morphology as a Regular Relation lexical language surface language cat cats mice lives . . . cat cat+N+PL mouse+N+PL life+N+PL live+V+3SING . . or {("cat,cat"),("cats","cat+N+PL"),......} CSA3180 NLP
Three Key Concepts Regular Relations Finite State Transducers Computational Morphology CSA3180 NLP
FSA a • Used for • Recognition • Generation CSA3180 NLP
Finite State Transducers • A finite state transducer (FST) is essentially an FSA finite state automaton that works on two (or more) tapes. • The most common way to think about transducers is as a kind of translating machine which works by reading from one tape and writing onto the other. CSA3180 NLP
FST Definition • A 2 way FST is a quintuple (K,s,F,ixo,) where • i, o are input and output alphabets • K is a finite set of states • s K is an initial state • FK are final states • is a transition relation of typeK x i x o x K CSA3180 NLP
FST a upper tape • Used for • Recognition • Generation • Translation lower tape b CSA3180 NLP
A Very Simple Transducer a b Relation { ("a","b") } Notation a:b encodes the transition CSA3180 NLP
A Very Simple Transducer a b also written as a:b CSA3180 NLP
a Symbol Pairs • Symbols vs. symbol pairs • In general, no distinction is made in xfst betweena the language {“a”}a:a the identity relation {(“a”, “a”)} CSA3180 NLP
A Very Simple Transducer upper side a b lower side a:b CSA3180 NLP
Relation { ("a","b"), ("aa","bb"), ...} Notationa:b* N.B. with this notation a and b must be single symbols A (more interesting) Transducer a:b 1 CSA3180 NLP
Transducer have SeveralModes of Operation • generation mode: It writes on both tapes. A string of as on one tape and a string of bs on the other tape. Both strings have the same length. • recognition mode: It accepts when the word on the first tape consists of exactly as many as as the word on the second tape consists of bs. • translation mode (left to right): It reads as from the first tape and writes a b for every a that it reads onto the second tape. • translation mode (right to left): It reads bs from the second tape and writes an a for every b that it reads onto the first tape. CSA3180 NLP
The Basic Idea • Morphology is regular • Morphology is finite state CSA3180 NLP
Morphology is Regular • The relation between the surface forms of a language and the corresponding lexical forms can be described as a regular relation, e.g.{ ("leaf+N+Pl","leaves"),("hang+V+Past","hung"),...} • Regular relations are closed under operations such as concatenation, iteration, union, and composition. • Complex regular relations can be derived from simpler relations. CSA3180 NLP
Morphology is finite-state • A regular relation can be defined using the metalanguage of regular expressions. • [{talk} | {walk} | {work}] • [%+Base:0 | %+SgGen3:s | %+Progr:{ing} | %+Past:{ed}]; • A regular expression can be compiled into a finite-state transducer that implements the relation computationally. CSA3180 NLP
Finite-state transducer +Base: final state +3rdSg:s a t +Progr:i :n :g a l k w o r +Past:e :d initial state Compilation Regular expression • [{talk} | {walk} | {work}] • [%+Base:0 | %+SgGen3:s | %+Progr:{ing} | %+Past:{ed}]; CSA3180 NLP
Generation work+3rdSg --> works +Base: +3rdSg:s a:a t:t +Progr:i :n :g a:a l:l k:k w:w o:o r:r +Past:e :d CSA3180 NLP
Analysis +Base: +3rdSg:s a:a t:t +Progr:i :n :g a:a l:l k:k w:w o:o r:r +Past:e :d talked --> talk+Past CSA3180 NLP
XFST Demo 2 start xfst % xfst xfst[0]: • xfst[0]: regex • [{talk} | {walk} | {work}] • [% +Base:0 | %+SgGen3:s | %+Progr:{ing} | %+Past:{ed}]; compile a regular expression xfst[1]: apply up walked walk+Past apply the result xfst[1]: apply down talk+SgGen3 talks CSA3180 NLP
vouloir +IndP +SG + P3 Finite-state transducer veut citation form inflection codes v o u l o i r +IndP +SG +P3 v e u t inflected form Lexical transducer • Bidirectional: generation or analysis • Compact and fast • Comprehensive systems have been built for over 40 languages: • English, German, Dutch, French, Italian, Spanish, Portuguese, Finnish, Russian, Turkish, Japanese, Korean, Basque, Greek, Arabic, Hebrew, Bulgarian, … CSA3180 NLP
Morphotactics Lexicon Regular Expression Lexicon FST Lexical Transducer (a single FST) Compiler composition Rules Regular Expressions Rule FSTs Alternations f a t +Adj +Comp t e f a t r How lexical transducers are made CSA3180 NLP
fst 1 fst 2 fst n Sequential Model Lexical form Ordered sequence of rewrite rules (Chomsky & Halle ‘68) can be modeled by a cascade of finite-state transducers Johnson ‘72 Kaplan & Kay ‘81 Intermediate form ... Surface form CSA3180 NLP
Sequential application k a N p a n k a m p a n k a m m a n N -> m / _ p p -> m / m _ CSA3180 NLP
N:m 2 p m N:m ? m 0 ? p 1 N N m p 1 ? m 0 ? p:m Sequential application in detail k a N p a n k a m p a n k a m m a n 0 0 0 2 0 0 0 0 0 0 1 0 0 0 CSA3180 NLP
3 0 1 2 Composition N:m p:m k a N p a n k a m m a n N:m m 0 0 0 3 0 0 0 m ? ? p p:m N:m m N ? N N CSA3180 NLP
fst n Parallel Model Lexical form ... fst 2 fst 1 Surface form Set of parallel of two-level rules (constraints) compiled into finite-state automata interpreted as transducers Koskenniemi ‘83 CSA3180 NLP
Koskenniemi 1983 Chomsky&Halle 1968 Lexical form Lexical form rule 1 rule 1 ... rule 2 rule 1 rule n Intermediate form Surface form intersect ... FST rule n Surface form Sequential vs. parallel rules compose CSA3180 NLP
Rewrite Rules vs. Constraints • Two different ways of decomposing the complex relation between lexical and surface forms into a set of simpler relations that can be more easily understood and manipulated. • One approach may be more convenient than the other for particular applications. CSA3180 NLP
b:y a:x c:0 Crossproduct • A .x. B The relation that maps every string in A to every string in B, and vice versa • A:B Same as [A .x. B]. a b c .x. x y [a b c] : [x y] {abc}:{xy} CSA3180 NLP
b a c b:B a:A c:C b:B c:C a:A d:D Composition • A .o. B The relation C such that if A maps x to y and B maps y to z, C maps x to z. {abc} .o. [a:A | b:B | c:C | d:D]* CSA3180 NLP