Building Finite-State Machines 600.465 - Intro to NLP - J. Eisner
Xerox Finite-State Tool
• You'll use it for homework.
• Commercial (we have a license); an open-source clone is Foma.
• One of several finite-state toolkits available; this one is the easiest to use but doesn't have probabilities.
• Usage:
  • Enter a regular expression; it builds an FSA or FST.
  • Now type in an input string.
    • FSA: it tells you whether the string is accepted.
    • FST: it tells you all the output strings (if any).
    • Can also invert the FST to let you map outputs to inputs.
• Could hook it up to other NLP tools that need finite-state processing of their input or output.
Common Regular Expression Operators (in XFST notation)
  (juxtaposition)  concatenation            E F
  * +              iteration                E*, E+
  |                union                    E | F
  &                intersection             E & F
  ~ \ -            complementation, minus   ~E, \x, F-E
  .x.              crossproduct             E .x. F
  .o.              composition              E .o. F
  .u               upper (input) language   E.u   "domain"
  .l               lower (output) language  E.l   "range"
Common Regular Expression Operators: concatenation E F
  EF = {ef : e ∈ E, f ∈ F}
  ef denotes the concatenation of two strings; EF denotes the concatenation of two languages.
  • To pick a string in EF, pick e ∈ E and f ∈ F and concatenate them.
  • To find out whether w ∈ EF, look for at least one way to split w into two "halves," w = ef, such that e ∈ E and f ∈ F.
  A language is a set of strings. It is a regular language if there exists an FSA that accepts all the strings in the language, and no other strings.
  If E and F denote regular languages, then so does EF. (We will have to prove this by finding the FSA for EF!)
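For intuition only, here is a small Python sketch (not part of XFST) of concatenation over finite languages, including the "split w into two halves" membership test described above; the names concat and in_concat are invented for illustration.

```python
# Sketch (invented names, not XFST): concatenation of two finite languages,
# and the "split w into two halves" membership test from the slide.

def concat(E, F):
    """EF = {e + f : e in E, f in F} for finite sets of strings."""
    return {e + f for e in E for f in F}

def in_concat(w, E, F):
    """Is w in EF?  Try every split point w = e + f."""
    return any(w[:i] in E and w[i:] in F for i in range(len(w) + 1))

E = {"ab", "a"}
F = {"c", "bc"}
print(concat(E, F))            # {'abc', 'abbc', 'ac'}  (in some order)
print(in_concat("abc", E, F))  # True (two ways: ab + c and a + bc)
```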
Common Regular Expression Operators: iteration E*, E+
  E* = {e1 e2 … en : n ≥ 0, e1 ∈ E, …, en ∈ E}
  • To pick a string in E*, pick any number of strings in E and concatenate them.
  • To find out whether w ∈ E*, look for at least one way to split w into 0 or more sections, e1 e2 … en, all of which are in E.
  E+ = {e1 e2 … en : n > 0, e1 ∈ E, …, en ∈ E} = E E*
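A similar sketch for iteration, again over finite sets of strings and with invented names; in_star implements the "split w into 0 or more sections, all in E" test (E is assumed to contain only nonempty strings, so the recursion terminates).

```python
# Sketch (invented names): membership in E* and E+ for a finite set E of
# nonempty strings, by trying to split w into sections that are all in E.
from functools import lru_cache

def in_star(w, E):
    @lru_cache(maxsize=None)
    def ok(i):                       # can w[i:] be split into sections from E?
        if i == len(w):
            return True              # empty remainder: n = 0 sections is fine
        return any(w.startswith(e, i) and ok(i + len(e)) for e in E if e)
    return ok(0)

def in_plus(w, E):
    # E+ = E E*: peel off one section from E, then check E*.
    return any(w.startswith(e) and in_star(w[len(e):], E) for e in E if e)

print(in_star("", {"ab", "b"}))       # True  (n = 0)
print(in_star("abbab", {"ab", "b"}))  # True  (ab + b + ab)
print(in_plus("", {"ab", "b"}))       # False (n > 0 required)
```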
Common Regular Expression Operators: union E | F
  E | F = {w : w ∈ E or w ∈ F} = E ∪ F
  • To pick a string in E | F, pick a string from either E or F.
  • To find out whether w ∈ E | F, check whether w ∈ E or w ∈ F.
Common Regular Expression Operators: intersection E & F
  E & F = {w : w ∈ E and w ∈ F} = E ∩ F
  • To pick a string in E & F, pick a string from E that is also in F.
  • To find out whether w ∈ E & F, check whether w ∈ E and w ∈ F.
Common Regular Expression Operators: complementation and minus ~E, \x, F-E
  ~E = {e : e ∉ E} = Σ* − E
  E − F = {e : e ∈ E and e ∉ F} = E & ~F
  \E = Σ − E (any single character not in E)
  Σ is the set of all letters, so Σ* is the set of all strings.
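For finite languages, |, &, and - correspond directly to Python set operations; this illustrative snippet uses a tiny two-letter alphabet SIGMA, whereas real complementation ~E is taken relative to the infinite Σ*, which an FSA can represent but a finite set cannot.

```python
# Sketch: for finite languages, |, &, and - map directly onto Python set
# operations.  SIGMA is a tiny illustrative alphabet; real ~E is taken
# relative to the infinite Sigma*, which an FSA can represent but a
# finite set cannot.

E = {"a", "ab"}
F = {"ab", "b"}
print(E | F)   # union:        {'a', 'ab', 'b'}  (in some order)
print(E & F)   # intersection: {'ab'}
print(E - F)   # minus:        {'a'}   (= E & ~F)

SIGMA = {"a", "b"}
print(SIGMA - {"a"})   # \a : any single character other than 'a' -> {'b'}
```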
Regular Expressions
  A language is a set of strings. It is a regular language if there exists an FSA that accepts all the strings in the language, and no other strings.
  If E and F denote regular languages, then so do EF, etc.
  Regular expression: EF*|(F & G)+
  Syntax (parse tree): the root is |; its left child is the concatenation of E and F*; its right child is + applied to (F & G).
  Semantics: denotes a regular language. As usual, we can build the semantics compositionally, bottom-up. E, F, G must be regular languages.
  As a base case, e denotes {e} (a language containing a single string), so ef*|(f&g)+ is regular.
Regular Expressions for Regular Relations
  A language is a set of strings. It is a regular language if there exists an FSA that accepts all the strings in the language, and no other strings. If E and F denote regular languages, then so do EF, etc.
  A relation is a set of pairs – here, pairs of strings. It is a regular relation if there exists an FST that accepts all the pairs in the relation, and no other pairs. If E and F denote regular relations, then so do EF, etc.
  EF = {(ef, e′f′) : (e, e′) ∈ E, (f, f′) ∈ F}
  Can you guess the definitions for E*, E+, E | F, E & F when E and F are regular relations?
  Surprise: E & F isn't necessarily regular in the case of relations, so it is not supported.
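A hedged sketch of the relation case: a finite relation is just a set of (upper, lower) string pairs, and concatenation applies componentwise as in the definition above. The function name concat_rel is invented.

```python
# Sketch (invented name): a finite relation as a set of (upper, lower)
# string pairs, with EF built componentwise as in the definition above.

def concat_rel(E, F):
    """EF = {(e + f, e2 + f2) : (e, e2) in E, (f, f2) in F}."""
    return {(e + f, e2 + f2) for (e, e2) in E for (f, f2) in F}

E = {("a", "x")}
F = {("b", "yz"), ("", "w")}
print(concat_rel(E, F))   # {('ab', 'xyz'), ('a', 'xw')}
```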
Common Regular Expression Operators: crossproduct E .x. F
  E .x. F = {(e, f) : e ∈ E, f ∈ F}
  Combines two regular languages into a regular relation.
Common Regular Expression Operators: composition E .o. F
  E .o. F = {(e, f) : ∃m. (e, m) ∈ E, (m, f) ∈ F}
  Composes two regular relations into a regular relation.
  As we've seen, this generalizes ordinary function composition.
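Sketches of .x. and .o. on finite sets, following the two definitions above; cross and compose are invented names, not XFST commands.

```python
# Sketch (invented names, finite sets): .x. pairs up two languages;
# .o. joins two relations on the middle string m.

def cross(E, F):
    """E .x. F = {(e, f) : e in E, f in F}."""
    return {(e, f) for e in E for f in F}

def compose(E, F):
    """E .o. F = {(e, f) : there is an m with (e, m) in E and (m, f) in F}."""
    return {(e, f) for (e, m) in E for (m2, f) in F if m == m2}

E = {("cat", "chat"), ("dog", "chien")}
F = {("chat", "Katze"), ("chien", "Hund")}
print(cross({"a"}, {"x", "y"}))   # {('a', 'x'), ('a', 'y')}
print(compose(E, F))              # {('cat', 'Katze'), ('dog', 'Hund')}
```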
Common Regular Expression Operators: upper (input) language E.u ("domain")
  E.u = {e : ∃m. (e, m) ∈ E}
Common Regular Expression Operators: lower (output) language E.l ("range")
  E.l = {m : ∃e. (e, m) ∈ E}
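And the two projections, again for a finite relation represented as a set of pairs (illustrative names upper and lower):

```python
# Sketch (invented names): the .u and .l projections of a finite relation.

def upper(E):               # E.u, the "domain"
    return {e for (e, m) in E}

def lower(E):               # E.l, the "range"
    return {m for (e, m) in E}

E = {("cat", "chat"), ("dog", "chien")}
print(upper(E))   # {'cat', 'dog'}
print(lower(E))   # {'chat', 'chien'}
```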
Acceptors (FSAs) vs. Transducers (FSTs)
  [Figure: example machines, unweighted and weighted, with arcs such as a, c, e (FSA) vs. a:x, c:z, e:y (FST), and weighted versions a/.5, c/.7, e/.5 vs. a:x/.5, c:z/.7, e:y/.5.]
  Function from strings to …
    Unweighted FSA: {false, true}
    Unweighted FST: strings
    Weighted FSA: numbers
    Weighted FST: (string, number) pairs
Weighted Relations
• If we have a language [or relation], we can ask it: Do you contain this string [or string pair]?
• If we have a weighted language [or relation], we ask: What weight do you assign to this string [or string pair]?
• Pick a semiring: all our weights will be in that semiring.
  • Just as for parsing, this makes our formalism & algorithms general.
  • The unweighted case is the boolean semiring {true, false}.
• If a string is not in the language, it has weight 0̄ (the semiring zero).
• If an FST or regular expression can choose among multiple ways to match, use ⊕ to combine the weights of the different choices.
• If an FST or regular expression matches by matching multiple substrings, use ⊗ to combine those different matches.
• Remember, ⊕ is like "or" and ⊗ is like "and"!
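A minimal sketch of the semiring abstraction these bullets appeal to, showing the boolean semiring and the tropical (min, +) semiring often used with the weights in these slides; the Semiring class and constants are invented scaffolding, not any toolkit's API.

```python
# Illustrative scaffolding (not a toolkit API): a semiring bundles oplus
# ("or"-like), otimes ("and"-like), and their identities zero and one.

class Semiring:
    def __init__(self, zero, one, oplus, otimes):
        self.zero, self.one = zero, one
        self.oplus, self.otimes = oplus, otimes

BOOLEAN  = Semiring(False, True, lambda a, b: a or b, lambda a, b: a and b)
TROPICAL = Semiring(float("inf"), 0.0, min, lambda a, b: a + b)   # (min, +)

print(BOOLEAN.zero, TROPICAL.zero)   # weight of a string not in the language
print(TROPICAL.oplus(0.7, 1.2))      # 0.7: combine two alternative matches
print(TROPICAL.otimes(0.7, 1.2))     # 1.9: combine two substring matches
```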
Which Semiring Operators are Needed?
  (juxtaposition)  concatenation            E F
  * +              iteration                E*, E+
  |                union                    E | F      — ⊕ to sum over the 2 choices
  ~ \ -            complementation, minus   ~E, \x, E-F
  &                intersection             E & F      — ⊗ to combine the matches against E and F
  .x.              crossproduct             E .x. F
  .o.              composition              E .o. F
  .u               upper (input) language   E.u   "domain"
  .l               lower (output) language  E.l   "range"
Common Regular Expression Operators: union E | F
  E | F = {w : w ∈ E or w ∈ F} = E ∪ F
  Weighted case: let's write E(w) to denote the weight of w in the weighted language E.
  (E|F)(w) = E(w) ⊕ F(w)   — ⊕ to sum over the 2 choices
Which Semiring Operators are Needed? (continued)
  Concatenation E F needs both ⊕ and ⊗: ⊗ to combine the matches against E and F, ⊕ to sum over the choices of how to split.
Which Semiring Operators are Needed? Concatenation E F (and iteration E*, E+)
  EF = {ef : e ∈ E, f ∈ F}
  Weighted case: we must match two things (⊗), but there's a choice (⊕) about which two things:
  (EF)(w) = ⊕ over all e, f such that w = ef of ( E(e) ⊗ F(f) )
  — so concatenation needs both ⊕ and ⊗.
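A sketch of the weighted | and weighted concatenation formulas, representing a weighted language as a dict from string to weight (missing keys mean the semiring zero) and fixing the tropical semiring, so ⊕ is min and ⊗ is +; the names w_union and w_concat are invented.

```python
# Sketch (invented names): weighted languages as dicts from string to
# weight; missing keys mean the semiring zero.  Tropical semiring, so
# "oplus" is min and "otimes" is +.

INF = float("inf")

def w_union(E, F):
    """(E|F)(w) = E(w) oplus F(w)."""
    return {w: min(E.get(w, INF), F.get(w, INF)) for w in set(E) | set(F)}

def w_concat(E, F):
    """(EF)(w) = oplus over splits w = e f of E(e) otimes F(f)."""
    out = {}
    for e, we in E.items():
        for f, wf in F.items():
            out[e + f] = min(out.get(e + f, INF), we + wf)
    return out

E = {"ab": 0.5, "a": 0.3}
F = {"c": 0.2, "bc": 0.1}
print(w_concat(E, F))   # 'abc' gets min(0.5 + 0.2, 0.3 + 0.1) = 0.4 (up to rounding)
print(w_union(E, F))    # each string keeps its better (smaller) weight
```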
Which Semiring Operators are Needed? (continued)
  Other operators, such as iteration and composition, need both ⊕ and ⊗ as well (why?).
Definition of FSTs
  [Red material shows differences from FSAs.]
  Simple view:
  • An FST is simply a finite directed graph, with some labels.
  • It has a designated initial state and a set of final states.
  • Each edge is labeled with an "upper string" (in Σ*).
  • Each edge is also labeled with a "lower string" (in Δ*).
    • [Upper/lower are sometimes regarded as input/output.]
  • Each edge and final state is also labeled with a semiring weight.
  More traditional definition specifies an FST via these:
  • a state set Q
  • initial state i
  • set of final states F
  • input alphabet Σ (also define Σ*, Σ+, Σ?)
  • output alphabet Δ
  • transition function δ: Q × Σ? → 2^Q
  • output function σ: Q × Σ? × Q → Δ?
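A rough Python rendering of the "traditional definition" as a data structure (arcs carry an upper string, a lower string, and a real-valued weight); this is an illustrative sketch, not a full FST library.

```python
# Illustrative data structure (not a real FST library): the traditional
# definition above, with arcs carrying an upper string, a lower string,
# and a real-valued weight.
from dataclasses import dataclass, field

@dataclass
class FST:
    states: set
    initial: object
    finals: dict                                  # final state -> final weight
    arcs: list = field(default_factory=list)      # (src, upper, lower, weight, dst)

    def add_arc(self, src, upper, lower, weight, dst):
        self.arcs.append((src, upper, lower, weight, dst))

# Tiny example: a one-arc transducer mapping "a" to "x" with weight 0.5.
t = FST(states={0, 1}, initial=0, finals={1: 0.0})
t.add_arc(0, "a", "x", 0.5, 1)
print(t.arcs)   # [(0, 'a', 'x', 0.5, 1)]
```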
How to implement?  (slide courtesy of L. Karttunen, modified)
  (juxtaposition)  concatenation            E F
  * +              iteration                E*, E+
  |                union                    E | F
  ~ \ -            complementation, minus   ~E, \x, E-F
  &                intersection             E & F
  .x.              crossproduct             E .x. F
  .o.              composition              E .o. F
  .u               upper (input) language   E.u   "domain"
  .l               lower (output) language  E.l   "range"
Concatenation (example courtesy of M. Mohri)
  [Figure: construction of the machine for the concatenation of two machines.]
Union (example courtesy of M. Mohri)
  [Figure: construction of the machine for the union of two machines.]
Closure, * (example courtesy of M. Mohri; this example has outputs too)
  [Figure: construction of the closure machine.]
  Why add a new start state 4? Why not just make state 0 final?
Upper language (domain), .u (example courtesy of M. Mohri)
  [Figure: construction of the upper-language machine.]
  Similarly construct the lower language .l; these are also called the input & output languages.
Reversal, .r (example courtesy of M. Mohri)
  [Figure: construction of the reversed machine.]
Inversion, .i (example courtesy of M. Mohri)
  [Figure: construction of the inverted machine, swapping upper and lower labels.]
Intersection (example adapted from M. Mohri)
  First machine (states 0, 1, 2): fat/0.5 (0→0), pig/0.3 (0→1), eats/0 (1→2), sleeps/0.6 (1→2).
  Second machine (states 0, 1, 2): fat/0.2 (0→1), pig/0.4 (1→1), eats/0.6 (1→0), sleeps/1.3 (1→2).
  Their intersection (states are pairs): fat/0.7 ((0,0)→(0,1)), pig/0.7 ((0,1)→(1,1)), eats/0.6 ((1,1)→(2,0)), sleeps/1.9 ((1,1)→(2,2)); final states (2,0)/0.8 and (2,2)/1.3.
Intersection (continued)
  Paths 0→0→1→2 and 0→1→1→0 both accept "fat pig eats", so the new machine must too: along the path (0,0)→(0,1)→(1,1)→(2,0).
Intersection (continued)
  Arcs 0→0 and 0→1 both accept "fat", so the new machine must too: along the arc (0,0)→(0,1).
Intersection (continued)
  Arcs 0→1 and 1→1 both accept "pig", so the new machine must too: along the arc (0,1)→(1,1).
Intersection (continued)
  Arcs 1→2 and 1→2 both accept "sleeps", so the new machine must too: along the arc (1,1)→(2,2).
Intersection (completed)
  The full intersected machine: fat/0.7 ((0,0)→(0,1)), pig/0.7 ((0,1)→(1,1)), eats/0.6 ((1,1)→(2,0)), sleeps/1.9 ((1,1)→(2,2)), with final states (2,0)/0.8 and (2,2)/1.3.
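A sketch of the product construction these slides walk through: intersecting two weighted FSAs in the tropical semiring by pairing states and matching arcs that accept the same word, adding weights. Machines are encoded as (initial, finals, arcs) tuples; the operand final weights below (0.5 for the first machine, 0.3 and 0.8 for the second) are illustrative choices consistent with the result shown, and a real implementation would also restrict to reachable state pairs.

```python
# Sketch (invented names): intersect two weighted FSAs in the tropical
# semiring by pairing up states; matching arcs must accept the same word,
# and their weights add.  A machine is (initial, finals, arcs) with
# finals = {state: final_weight} and arcs = (src, word, weight, dst).

def intersect(m1, m2):
    init1, finals1, arcs1 = m1
    init2, finals2, arcs2 = m2
    arcs = [((p, r), a, w1 + w2, (q, s))
            for (p, a, w1, q) in arcs1
            for (r, b, w2, s) in arcs2
            if a == b]                      # arcs must accept the same word
    finals = {(f1, f2): w1 + w2             # final weights add too
              for f1, w1 in finals1.items()
              for f2, w2 in finals2.items()}
    # (A real implementation would keep only state pairs reachable from
    # the new initial state.)
    return (init1, init2), finals, arcs

m1 = (0, {2: 0.5}, [(0, "fat", 0.5, 0), (0, "pig", 0.3, 1),
                    (1, "eats", 0.0, 2), (1, "sleeps", 0.6, 2)])
m2 = (0, {0: 0.3, 2: 0.8}, [(0, "fat", 0.2, 1), (1, "pig", 0.4, 1),
                            (1, "eats", 0.6, 0), (1, "sleeps", 1.3, 2)])
init, finals, arcs = intersect(m1, m2)
for arc in arcs:
    print(arc)   # e.g. ((0, 0), 'fat', 0.7, (0, 1)) and ((1, 1), 'sleeps', 1.9, (2, 2))
print(finals)    # includes (2, 0): 0.8 and (2, 2): 1.3 (up to float rounding)
```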
What Composition Means
  [Figure: f maps ab?d to abcd, abed, and abjd (with weights); g then maps abcd to abcd, and abed to abed and abd (with weights), but has no pair whose input is abjd.]
What Composition Means
  Relation composition f∘g maps ab?d to abcd (weight 3+4), abed (weight 2+2), and abd (weight 2+8), …
Relation = set of pairs
  f = { ab?d ↦ abcd, ab?d ↦ abed, ab?d ↦ abjd, … }
  g = { abcd ↦ abcd, abed ↦ abed, abed ↦ abd, … }
  g does not contain any pair of the form abjd ↦ …
Relation composition: f∘g
  f∘g = {x ↦ z : ∃m such that x ↦ m ∈ f and m ↦ z ∈ g}, where x, m, z are strings.
  f = { ab?d ↦ abcd, ab?d ↦ abed, ab?d ↦ abjd, … }
  g = { abcd ↦ abcd, abed ↦ abed, abed ↦ abd, … }
  f∘g = { ab?d ↦ abcd, ab?d ↦ abed, ab?d ↦ abd, … }
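A sketch of this definition for weighted relations given as finite dicts of pairs, joining on the middle string and adding weights; the pair weights below loosely follow the 3+4, 2+2, 2+8 sums in the example above, though the slides don't say exactly which weight belongs to which machine.

```python
# Sketch (invented names): compose two weighted relations given as dicts
# from (input, output) pairs to weights.  Join on the middle string m and
# add weights (tropical semiring), keeping the best weight per output pair.

INF = float("inf")

def compose_rel(f, g):
    out = {}
    for (x, m), w1 in f.items():
        for (m2, z), w2 in g.items():
            if m == m2:
                out[(x, z)] = min(out.get((x, z), INF), w1 + w2)
    return out

f = {("ab?d", "abcd"): 3, ("ab?d", "abed"): 2, ("ab?d", "abjd"): 6}
g = {("abcd", "abcd"): 4, ("abed", "abed"): 2, ("abed", "abd"): 8}
print(compose_rel(f, g))
# {('ab?d', 'abcd'): 7, ('ab?d', 'abed'): 4, ('ab?d', 'abd'): 10}
# The abjd pair in f finds no matching input in g, so it contributes nothing.
```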
Intersection vs. Composition
  Intersection: matching arcs pig/0.3 (0→1) and pig/0.4 (1→1) combine into an arc pig/0.7 from (0,1) to (1,1).
  Composition: matching arcs Wilbur:pig/0.3 and pig:pink/0.4 combine into an arc Wilbur:pink/0.7 from (0,1) to (1,1).
Intersection vs. Composition (mismatch)
  Intersection mismatch: arcs elephant/0.4 and pig/0.3 accept different words, so they contribute no arc to the intersection.
  Composition mismatch: arcs Wilbur:pig/0.3 and elephant:gray/0.4 do not match (pig ≠ elephant), so they contribute no arc to the composition.
Composition (example courtesy of M. Mohri)
  [Figure: two transducers and their composition, built up arc by arc on the following slides.]
Composition (continued): a:b .o. b:b = a:b
Composition (continued): a:b .o. b:a = a:a
Composition (continued): a:b .o. b:a = a:a
Composition (continued): b:b .o. b:a = b:a
Composition (continued): a:b .o. b:a = a:a
Composition (continued): a:a .o. a:b = a:b
Composition (continued): b:b .o. a:b = nothing (since the intermediate symbol doesn't match)
Composition (continued): b:b .o. b:a = b:a
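A tiny sketch of the arc-level matching rule these examples illustrate: the output symbol of the first arc must equal the input symbol of the second, otherwise the composition contributes nothing. compose_arc is an invented helper.

```python
# Sketch (invented helper): compose two single arcs, written as
# (input_symbol, output_symbol) pairs, as in the examples above.

def compose_arc(arc1, arc2):
    (in1, out1), (in2, out2) = arc1, arc2
    if out1 != in2:
        return None                  # intermediate symbols don't match
    return (in1, out2)

print(compose_arc(("a", "b"), ("b", "a")))   # ('a', 'a')
print(compose_arc(("b", "b"), ("a", "b")))   # None  ("nothing")
```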