230 likes | 389 Views
Finite-State Methods. c. a. e. Finite state acceptors (FSAs). Regexps Union, Kleene *, concat, intersect, complement, reversal Determinization, minimization Pumping, Myhill-Nerode. slide courtesy of L. Karttunen (modified). Network. Wordlist. c. l. e. a. r. clear clever ear
E N D
Finite-State Methods 600.465 - Intro to NLP - J. Eisner
c a e Finite state acceptors (FSAs) • Regexps • Union, Kleene *, concat, intersect, complement, reversal • Determinization, minimization • Pumping, Myhill-Nerode 600.465 - Intro to NLP - J. Eisner
slide courtesy of L. Karttunen (modified) Network Wordlist c l e a r clear clever ear ever fat father e v e compile f a t h Example: Unweighted acceptor FSM 17728 states, 37100 arcs 0.6 sec /usr/dict/words 25K words 206K chars 600.465 - Intro to NLP - J. Eisner
slide courtesy of L. Karttunen (modified) Wordlist clear 0 clever 1 ear 2 ever 3 fat 4 father 5 Example: Weighted acceptor Network c/0 l/0 e/0 a/0 r/0 e/2 v/1 e/0 compile f/4 a/0 t/0 h/1 Computes a perfect hash! 600.465 - Intro to NLP - J. Eisner
slide courtesy of L. Karttunen (modified) Wordlist c/0 l/0 e/0 a/0 r/0 clear 0 clever 1 ear 2 ever 3 fat 4 father 5 e/2 v/1 e/0 f/4 a/0 t/0 h/1 Example: Weighted acceptor Network c l e a r 2 6 1 2 1 2 e v e compile f a t h 2 1 2 2 • Successor states partition the path set • Use offsets of successor states as arc weights • Q: Would this work for an arbitrary numbering of the words? • Compute number of paths from each state (Q: how?) 600.465 - Intro to NLP - J. Eisner
the problem of morphology (“word shape”) - an area of linguistics Example: Unweighted transducer VP [head=vouloir,...] V[head=vouloir,tense=Present,num=SG, person=P3] ... veut 600.465 - Intro to NLP - J. Eisner
slide courtesy of L. Karttunen (modified) vouloir +Pres +Sing + P3 Finite-state transducer veut canonical form inflection codes v o u l o i r +Pres +Sing +P3 v e u t inflected form Example: Unweighted transducer VP [head=vouloir,...] V[head=vouloir,tense=Present,num=SG, person=P3] ... veut the relevant path 600.465 - Intro to NLP - J. Eisner
slide courtesy of L. Karttunen vouloir +Pres +Sing + P3 Finite-state transducer veut canonical form inflection codes v o u l o i r +Pres +Sing +P3 v e u t inflected form Example: Unweighted transducer • Bidirectional: generation or analysis • Compact and fast • Xerox sells for about 20 languges including English, German, Dutch, French, Italian, Spanish, Portuguese, Finnish, Russian, Turkish, Japanese, ... • Research systems for many other languages, including Arabic, Malay the relevant path 600.465 - Intro to NLP - J. Eisner
Regular Relation (of strings) • Relation: like a function, but multiple outputs ok • Regular: finite-state • Transducer: automaton w/ outputs • b ? a ? • aaaaa ? a:e b:b {b} {} a:a a:c {ac, aca, acab, acabc} ?:c b:b ?:b • Invertible? • Closed under composition? ?:a b:e 600.465 - Intro to NLP - J. Eisner
Regular Relation (of strings) • Can weight the arcs: vs. • b {b} a {} • aaaaa {ac, aca, acab, acabc} • How to find best outputs? • For aaaaa? • For all inputs at once? a:e b:b a:a a:c ?:c b:b ?:b ?:a b:e 600.465 - Intro to NLP - J. Eisner
Acceptors (FSAs) Transducers (FSTs) c c:z a a:x Unweighted e e:y c:z/.7 c/.7 a:x/.5 a/.5 Weighted .3 .3 e:y/.5 e/.5 Function from strings to ... {false, true} strings numbers (string, num) pairs 600.465 - Intro to NLP - J. Eisner
Sample functions Acceptors (FSAs) Transducers (FSTs) {false, true} strings Grammatical? Markup Correction Translation Unweighted numbers (string, num) pairs How grammatical? Better, how likely? Good markups Good corrections Good translations Weighted 600.465 - Intro to NLP - J. Eisner
abcd abcd f g Functions ab?d 600.465 - Intro to NLP - J. Eisner
Functions ab?d abcd Function composition: f g [first f, then g – intuitive notation, but opposite of the traditional math notation] 600.465 - Intro to NLP - J. Eisner
g 3 4 abgd 2 2 abed abed 8 6 abd abjd ... From Functions to Relations f ab?d abcd 600.465 - Intro to NLP - J. Eisner
From Functions to Relations 4 ab?d abgd 2 abed Relation composition: f g 8 abd ... 600.465 - Intro to NLP - J. Eisner
From Functions to Relations ab?d abed Pick best output Often in NLP, all of the functions or relations involved can be described as finite-state machines, and manipulated using standard algorithms. 600.465 - Intro to NLP - J. Eisner
g 3 4 abgd 2 2 abed abed 8 6 abd abjd ... Inverting Relations f ab?d abcd 600.465 - Intro to NLP - J. Eisner
Inverting Relations g-1 f-1 3 4 ab?d abcd abgd 2 2 abed abed 8 6 abd abjd ... 600.465 - Intro to NLP - J. Eisner
Inverting Relations 4 ab?d abgd 2 abed (f g)-1 = g-1 f-1 8 abd ... 600.465 - Intro to NLP - J. Eisner
slide courtesy of L. Karttunen (modified) c l e a r e v e f a t h Lexical Transducer (a single FST) Lexicon FSA composition Regular Expressions for Rules ComposedRule FSTs Compiler b i g +Adj +Comp g e b i g r one path Building a lexical transducer big | clear | clever | ear | fat | ... Regular Expression Lexicon 600.465 - Intro to NLP - J. Eisner
slide courtesy of L. Karttunen (modified) c l e a r e v e f a t h Lexicon FSA Building a lexical transducer • Actually, the lexicon must contain elements likebig +Adj +Comp • So write it as a more complicated expression:(big | clear | clever | fat | ...) +Adj ( | +Comp | +Sup) adjectives | (ear | father | ...) +Noun (+Sing | +Pl) nouns | ... ... • Q: Why do we need a lexicon at all? big | clear | clever | ear | fat | ... Regular Expression Lexicon 600.465 - Intro to NLP - J. Eisner
slide courtesy of L. Karttunen (modified) être+IndP +SG + P1 suivre+IndP+SG+P1 suivre+IndP+SG+P2 suivre+Imp+SG + P2 Weighted version of transducer: Assigns a weight to each string pair “upper language” 4 payer+IndP+SG+P1 19 20 12 Weighted French Transducer 50 3 suis paie “lower language” paye 600.465 - Intro to NLP - J. Eisner