
Building Finite-State Machines

This finite-state toolkit is a simple, user-friendly tool for building and processing regular expressions. It can be used for homework, in commercial projects, and for integration with other NLP tools that require finite-state processing.



  1. Building Finite-State Machines 600.465 - Intro to NLP - J. Eisner

  2. Xerox Finite-State Tool
  • You'll use it for homework …
  • Commercial (we have a license; an open-source clone is Foma)
  • One of several finite-state toolkits available
  • This one is easiest to use but doesn't have probabilities
  • Usage:
    • Enter a regular expression; it builds an FSA or FST
    • Now type in an input string
      • FSA: it tells you whether it's accepted
      • FST: it tells you all the output strings (if any)
    • Can also invert the FST to let you map outputs to inputs
  • Could hook it up to other NLP tools that need finite-state processing of their input or output

  3. Common Regular Expression Operators (in XFST notation)
           concatenation             EF
    * +    iteration                 E*, E+
    |      union                     E | F
    &      intersection              E & F
    ~ \ -  complementation, minus    ~E, \x, F-E
    .x.    crossproduct              E .x. F
    .o.    composition               E .o. F
    .u     upper (input) language    E.u   "domain"
    .l     lower (output) language   E.l   "range"

  4. Common Regular Expression Operators (in XFST notation)
           concatenation             EF
  EF = {ef : e ∈ E, f ∈ F}
  ef denotes the concatenation of 2 strings. EF denotes the concatenation of 2 languages.
  • To pick a string in EF, pick e ∈ E and f ∈ F and concatenate them.
  • To find out whether w ∈ EF, look for at least one way to split w into two "halves," w = ef, such that e ∈ E and f ∈ F.
  A language is a set of strings. It is a regular language if there exists an FSA that accepts all the strings in the language, and no other strings. If E and F denote regular languages, then so does EF. (We will have to prove this by finding the FSA for EF!)
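
A quick way to see both views is to play with finite languages as Python sets; a minimal sketch (the function names are illustrative, not part of XFST):

```python
# Concatenation of two finite languages, viewed as Python sets of strings.
def concat(E, F):
    return {e + f for e in E for f in F}

# Membership test: w is in EF iff some split w = e+f has e in E and f in F.
def in_concat(w, E, F):
    return any(w[:i] in E and w[i:] in F for i in range(len(w) + 1))

E = {"a", "ab"}
F = {"b", ""}
print(concat(E, F))           # {'a', 'ab', 'abb'}
print(in_concat("ab", E, F))  # True: either "a"+"b" or "ab"+""
```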

  5. Common Regular Expression Operators (in XFST notation)
           concatenation             EF
    * +    iteration                 E*, E+
  E* = {e1e2…en : n ≥ 0, e1 ∈ E, …, en ∈ E}
  To pick a string in E*, pick any number of strings in E and concatenate them.
  To find out whether w ∈ E*, look for at least one way to split w into 0 or more sections, e1e2…en, all of which are in E.
  E+ = {e1e2…en : n > 0, e1 ∈ E, …, en ∈ E} = EE*
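
The same splitting idea works for E*; a small recursive sketch over finite languages (again with illustrative names, not XFST):

```python
# Membership in E* for a finite language E: try every way to peel off a
# nonempty prefix that is in E, then check the rest recursively.
def in_star(w, E):
    if w == "":
        return True  # n = 0 sections
    return any(w[:i] in E and in_star(w[i:], E) for i in range(1, len(w) + 1))

E = {"ab", "c"}
print(in_star("abcab", E))  # True:  ab | c | ab
print(in_star("abb", E))    # False: no split into pieces from E
```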

  6. Common Regular Expression Operators (in XFST notation)
           concatenation             EF
    * +    iteration                 E*, E+
    |      union                     E | F
  E | F = {w : w ∈ E or w ∈ F} = E ∪ F
  To pick a string in E | F, pick a string from either E or F.
  To find out whether w ∈ E | F, check whether w ∈ E or w ∈ F.

  7. Common Regular Expression Operators (in XFST notation)
           concatenation             EF
    * +    iteration                 E*, E+
    |      union                     E | F
    &      intersection              E & F
  E & F = {w : w ∈ E and w ∈ F} = E ∩ F
  To pick a string in E & F, pick a string from E that is also in F.
  To find out whether w ∈ E & F, check whether w ∈ E and w ∈ F.

  8. Common Regular Expression Operators (in XFST notation)
           concatenation             EF
    * +    iteration                 E*, E+
    |      union                     E | F
    &      intersection              E & F
    ~ \ -  complementation, minus    ~E, \x, F-E
  ~E = {e : e ∉ E} = Σ* - E
  E - F = {e : e ∈ E and e ∉ F} = E & ~F
  \E = Σ - E   (any single character not in E)
  Σ is the set of all letters; so Σ* is the set of all strings.
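
Since Σ* is infinite, a code sketch can only approximate the complement by bounding string length; the two-letter alphabet and all names below are illustrative:

```python
from itertools import product

SIGMA = {"a", "b"}  # illustrative alphabet

# Sigma* is infinite, so enumerate strings only up to a length bound.
def sigma_star(max_len):
    return {"".join(p) for n in range(max_len + 1)
            for p in product(sorted(SIGMA), repeat=n)}

E = {"a", "ab", "bb"}
F = {"ab", "b"}

complement_E = sigma_star(2) - E              # ~E, restricted to length <= 2
minus = {e for e in E if e not in F}          # E - F  =  E & ~F
print(sorted(complement_E))                   # strings over SIGMA up to length 2, minus E
print(sorted(minus))                          # ['a', 'bb']
```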

  9. Regular Expressions
  A language is a set of strings. It is a regular language if there exists an FSA that accepts all the strings in the language, and no other strings. If E and F denote regular languages, then so do EF, etc.
  Regular expression: EF*|(F & G)+
  Syntax: a parse tree; the top-level | has two children, the concatenation of E with F*, and + applied to (F & G).
  Semantics: denotes a regular language. As usual, we can build the semantics compositionally, bottom-up. E, F, G must be regular languages. As a base case, e denotes {e} (a language containing a single string), so ef*|(f&g)+ is regular.
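
The bottom-up construction can be sketched directly as a recursive evaluator over a toy AST of finite languages; the node encoding and the MAX_LEN truncation of * are assumptions for illustration, not XFST syntax:

```python
# A toy bottom-up evaluator for regex ASTs over finite languages.
# Nodes: ("sym", s), ("cat", x, y), ("or", x, y), ("and", x, y), ("star", x).
# star is truncated at MAX_LEN so the sets stay finite.
MAX_LEN = 4

def eval_regex(node):
    op = node[0]
    if op == "sym":                       # base case: s denotes {s}
        return {node[1]}
    if op == "cat":
        E, F = eval_regex(node[1]), eval_regex(node[2])
        return {e + f for e in E for f in F}
    if op == "or":
        return eval_regex(node[1]) | eval_regex(node[2])
    if op == "and":
        return eval_regex(node[1]) & eval_regex(node[2])
    if op == "star":                      # grow the closure, truncated by length
        E, lang = eval_regex(node[1]), {""}
        changed = True
        while changed:
            new = {w + e for w in lang for e in E if len(w + e) <= MAX_LEN}
            changed = not new <= lang
            lang |= new
        return lang
    raise ValueError(f"unknown operator {op!r}")

# ef* : concatenate {"e"} with the (truncated) closure of {"f"}
print(sorted(eval_regex(("cat", ("sym", "e"), ("star", ("sym", "f"))))))
```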

  10. Regular Expressions for Regular Relations
  A language is a set of strings. It is a regular language if there exists an FSA that accepts all the strings in the language, and no other strings. If E and F denote regular languages, then so do EF, etc.
  A relation is a set of pairs (here, pairs of strings). It is a regular relation if there exists an FST that accepts all the pairs in the relation, and no other pairs. If E and F denote regular relations, then so do EF, etc.
  EF = {(ef, e′f′) : (e, e′) ∈ E, (f, f′) ∈ F}
  Can you guess the definitions for E*, E+, E | F, E & F when E and F are regular relations? Surprise: E & F isn't necessarily regular in the case of relations, so it is not supported.
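
A sketch of this definition of relation concatenation on finite sets of string pairs (illustrative names, not XFST):

```python
# Concatenating two finite string relations (sets of (upper, lower) pairs):
# pick one pair from E and one from F and concatenate them componentwise.
def concat_rel(E, F):
    return {(e + f, e2 + f2) for (e, e2) in E for (f, f2) in F}

E = {("a", "x"), ("ab", "xy")}
F = {("c", "z")}
print(concat_rel(E, F))  # {('ac', 'xz'), ('abc', 'xyz')}
```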

  11. Common Regular Expression Operators (in XFST notation)
           concatenation             EF
    * +    iteration                 E*, E+
    |      union                     E | F
    &      intersection              E & F
    ~ \ -  complementation, minus    ~E, \x, F-E
    .x.    crossproduct              E .x. F
  E .x. F = {(e, f) : e ∈ E, f ∈ F}
  Combines two regular languages into a regular relation.
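
A one-line sketch of crossproduct on finite languages (illustrative names):

```python
# Crossproduct: turn two finite languages into a relation by pairing
# every string of E with every string of F.
def crossproduct(E, F):
    return {(e, f) for e in E for f in F}

print(crossproduct({"cat", "dog"}, {"N"}))  # {('cat', 'N'), ('dog', 'N')}
```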

  12. Common Regular Expression Operators (in XFST notation)
           concatenation             EF
    * +    iteration                 E*, E+
    |      union                     E | F
    &      intersection              E & F
    ~ \ -  complementation, minus    ~E, \x, F-E
    .x.    crossproduct              E .x. F
    .o.    composition               E .o. F
  E .o. F = {(e, f) : ∃m. (e, m) ∈ E, (m, f) ∈ F}
  Composes two regular relations into a regular relation. As we've seen, this generalizes ordinary function composition.
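
A sketch of composition on finite relations, following the ∃m definition above; the toy English-to-French and case-mapping pairs are invented for illustration:

```python
# Composition of two finite relations: (e, f) is in E .o. F iff some middle
# string m has (e, m) in E and (m, f) in F.
def compose(E, F):
    return {(e, f) for (e, m) in E for (m2, f) in F if m == m2}

E = {("cat", "chat"), ("dog", "chien")}      # toy English -> French
F = {("chat", "CHAT"), ("chien", "CHIEN")}   # toy lowercase -> uppercase
print(compose(E, F))  # {('cat', 'CHAT'), ('dog', 'CHIEN')}
```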

  13. Common Regular Expression Operators (in XFST notation)
           concatenation             EF
    * +    iteration                 E*, E+
    |      union                     E | F
    &      intersection              E & F
    ~ \ -  complementation, minus    ~E, \x, F-E
    .x.    crossproduct              E .x. F
    .o.    composition               E .o. F
    .u     upper (input) language    E.u   "domain"
  E.u = {e : ∃m. (e, m) ∈ E}

  14. Common Regular Expression Operators (in XFST notation)
           concatenation             EF
    * +    iteration                 E*, E+
    |      union                     E | F
    &      intersection              E & F
    ~ \ -  complementation, minus    ~E, \x, F-E
    .x.    crossproduct              E .x. F
    .o.    composition               E .o. F
    .u     upper (input) language    E.u   "domain"
    .l     lower (output) language   E.l   "range"
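
Sketches of the two projections on a finite relation (illustrative names):

```python
# Upper (input) and lower (output) projections of a finite relation.
def upper(E):
    return {e for (e, m) in E}   # E.u, the "domain"

def lower(E):
    return {m for (e, m) in E}   # E.l, the "range"

E = {("cat", "chat"), ("dog", "chien")}
print(upper(E))  # {'cat', 'dog'}
print(lower(E))  # {'chat', 'chien'}
```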

  15. Acceptors (FSAs) vs. Transducers (FSTs)
  [Figure: four example machines: an unweighted acceptor (arcs a, c, e), an unweighted transducer (arcs a:x, c:z, e:y), a weighted acceptor (arcs a/.5, c/.7, e/.5, final weight .3), and a weighted transducer (arcs a:x/.5, c:z/.7, e:y/.5, final weight .3)]
  Function from strings to …
  • unweighted acceptor: {false, true}
  • unweighted transducer: strings
  • weighted acceptor: numbers
  • weighted transducer: (string, num) pairs

  16. Weighted Relations
  • If we have a language [or relation], we can ask it: Do you contain this string [or string pair]?
  • If we have a weighted language [or relation], we ask: What weight do you assign to this string [or string pair]?
  • Pick a semiring: all our weights will be in that semiring.
    • Just as for parsing, this makes our formalism & algorithms general.
    • The unweighted case is the boolean semiring {true, false}.
  • If a string is not in the language, it has weight 0 (the semiring's zero).
  • If an FST or regular expression can choose among multiple ways to match, use ⊕ to combine the weights of the different choices.
  • If an FST or regular expression matches by matching multiple substrings, use ⊗ to combine those different matches.
  • Remember, ⊕ is like "or" and ⊗ is like "and"!
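
One way to make the ⊕ / ⊗ roles concrete is to sketch two semirings and a tiny evaluator; everything below (names, tuple encoding) is illustrative:

```python
# Two semirings as (zero, one, oplus, otimes) tuples.  oplus combines
# alternative ways to match ("or"); otimes combines the parts of one match ("and").
BOOLEAN = (False, True, lambda a, b: a or b, lambda a, b: a and b)
TROPICAL = (float("inf"), 0.0, min, lambda a, b: a + b)   # min-plus: weights are costs

def weight_of_choices(semiring, choice_weights):
    zero, one, oplus, otimes = semiring
    total = zero
    for parts in choice_weights:       # each choice is a sequence of matched parts
        w = one
        for p in parts:
            w = otimes(w, p)           # combine the parts of one match
        total = oplus(total, w)        # sum over the alternative matches
    return total

# Two alternative ways to match, each built from two parts:
print(weight_of_choices(TROPICAL, [(0.5, 0.2), (0.3, 0.6)]))       # 0.7 (cheapest)
print(weight_of_choices(BOOLEAN, [(True, False), (True, True)]))   # True
```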

  17. Which Semiring Operators are Needed?
           concatenation             EF
    * +    iteration                 E*, E+
    |      union                     E | F           ⊕ to sum over 2 choices
    ~ \ -  complementation, minus    ~E, \x, E-F
    &      intersection              E & F           ⊗ to combine the matches against E and F
    .x.    crossproduct              E .x. F
    .o.    composition               E .o. F
    .u     upper (input) language    E.u   "domain"
    .l     lower (output) language   E.l   "range"

  18. Common Regular Expression Operators (in XFST notation)
    |      union                     E | F
  E | F = {w : w ∈ E or w ∈ F} = E ∪ F
  Weighted case: Let's write E(w) to denote the weight of w in the weighted language E.
  (E|F)(w) = E(w) ⊕ F(w)      (⊕ to sum over the 2 choices)

  19. Which Semiring Operators are Needed?
           concatenation             EF              need both ⊗ and ⊕
    * +    iteration                 E*, E+
    |      union                     E | F           ⊕ to sum over 2 choices
    ~ \ -  complementation, minus    ~E, \x, E-F
    &      intersection              E & F           ⊗ to combine the matches against E and F
    .x.    crossproduct              E .x. F
    .o.    composition               E .o. F
    .u     upper (input) language    E.u   "domain"
    .l     lower (output) language   E.l   "range"

  20. Which Semiring Operators are Needed?
           concatenation             EF              need both ⊗ and ⊕
    * +    iteration                 E*, E+
  EF = {ef : e ∈ E, f ∈ F}
  Weighted case: must match two things (⊗), but there's a choice (⊕) about which two things:
  (EF)(w) = ⊕ E(e) ⊗ F(f), where ⊕ ranges over all e, f such that w = ef
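
A sketch of this formula in the tropical (min, +) semiring, where ⊕ = min and ⊗ = +; the dict encoding of weighted languages is an assumption for illustration:

```python
# Weighted concatenation in the tropical semiring: add weights within one
# split (the otimes), then take the best split with min (the oplus).
# E and F map strings to weights; missing strings have weight infinity (the zero).
INF = float("inf")

def concat_weight(w, E, F):
    return min(
        (E.get(w[:i], INF) + F.get(w[i:], INF) for i in range(len(w) + 1)),
        default=INF,
    )

E = {"a": 0.5, "ab": 0.2}
F = {"b": 0.3, "": 0.9}
print(concat_weight("ab", E, F))  # 0.8: best of "a"+"b" (0.5+0.3) and "ab"+"" (0.2+0.9)
```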

  21. Which Semiring Operators are Needed?
           concatenation             EF              need both ⊗ and ⊕
    * +    iteration                 E*, E+          both ⊕ and ⊗ (why?)
    |      union                     E | F           ⊕ to sum over 2 choices
    ~ \ -  complementation, minus    ~E, \x, E-F
    &      intersection              E & F           ⊗ to combine the matches against E and F
    .x.    crossproduct              E .x. F
    .o.    composition               E .o. F
    .u     upper (input) language    E.u   "domain"
    .l     lower (output) language   E.l   "range"

  22. Definition of FSTs
  [Red material shows differences from FSAs.]
  • Simple view:
    • An FST is simply a finite directed graph, with some labels.
    • It has a designated initial state and a set of final states.
    • Each edge is labeled with an "upper string" (in Σ*).
    • Each edge is also labeled with a "lower string" (in Δ*).
    • [Upper/lower are sometimes regarded as input/output.]
    • Each edge and final state is also labeled with a semiring weight.
  • More traditional definition specifies an FST via these:
    • a state set Q
    • initial state i
    • set of final states F
    • input alphabet Σ (also define Σ*, Σ+, Σε)
    • output alphabet Δ
    • transition function δ: Q × Σε → 2^Q
    • output function σ: Q × Σε × Q → Δε
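
A sketch of the "simple view" as a small data structure; the field names are illustrative, not taken from XFST or any other toolkit:

```python
from dataclasses import dataclass, field

# A weighted FST as a labeled graph: an initial state, weighted final states,
# and arcs carrying an upper string, a lower string, and a weight.
@dataclass
class FST:
    initial: int
    finals: dict[int, float]                      # final state -> final weight
    arcs: list[tuple[int, str, str, float, int]] = field(default_factory=list)
    # each arc: (source, upper string, lower string, weight, target)

    def add_arc(self, src, upper, lower, weight, dst):
        self.arcs.append((src, upper, lower, weight, dst))

# A two-state transducer mapping "a" to "x" with cost 0.5, then accepting.
t = FST(initial=0, finals={1: 0.0})
t.add_arc(0, "a", "x", 0.5, 1)
print(t.arcs)
```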

  23. How to implement? (slide courtesy of L. Karttunen, modified)
           concatenation             EF
    * +    iteration                 E*, E+
    |      union                     E | F
    ~ \ -  complementation, minus    ~E, \x, E-F
    &      intersection              E & F
    .x.    crossproduct              E .x. F
    .o.    composition               E .o. F
    .u     upper (input) language    E.u   "domain"
    .l     lower (output) language   E.l   "range"

  24. Concatenation (example courtesy of M. Mohri)
  [Figure: two automata and the machine constructed for their concatenation]

  25. Union (example courtesy of M. Mohri)
  [Figure: two automata and the machine constructed for their union (|)]

  26. Closure (example courtesy of M. Mohri; this example has outputs too)
  [Figure: a transducer and its closure under *; the construction adds a new start state 4]
  Why add new start state 4? Why not just make state 0 final?

  27. Upper language (domain), .u (example courtesy of M. Mohri)
  [Figure: a transducer and the acceptor for its upper language]
  Similarly construct the lower language .l. These are also called the input & output languages.

  28. Reversal, .r (example courtesy of M. Mohri)
  [Figure: a transducer and its reversal; the reversed machine has a new start state 4 with an ε:ε/0.7 arc]

  29. Inversion, .i (example courtesy of M. Mohri)
  [Figure: a transducer and its inverse, obtained by swapping the upper and lower symbol on every arc]

  30. Intersection (example adapted from M. Mohri)
  [Figure: two weighted FSAs over the words fat, pig, eats, sleeps; one has arcs fat/0.5, pig/0.3, eats/0, sleeps/0.6 and final state 2/0.8, the other has arcs fat/0.2, pig/0.4, eats/0.6, sleeps/1.3 and final state 2/0.5; their intersection has pair states such as (0,0), (0,1), (1,1), arcs fat/0.7, pig/0.7, eats/0.6, sleeps/1.9 (the weights add), and final states (2,0)/0.8 and (2,2)/1.3]

  31. Intersection (continued)
  [Figure: the same two machines and their intersection]
  Paths 0→0→1→2 and 0→1→1→0 both accept "fat pig eats". So must the new machine: along the path (0,0)→(0,1)→(1,1)→(2,0).

  32. Intersection (continued)
  [Figure: the first product arc, fat/0.7, being added]
  Paths 0→0 and 0→1 both accept "fat". So must the new machine: along the arc (0,0)→(0,1).

  33. Intersection (continued)
  [Figure: the pig/0.7 arc being added]
  Paths 0→1 and 1→1 both accept "pig". So must the new machine: along the arc (0,1)→(1,1).

  34. Intersection (continued)
  [Figure: the sleeps/1.9 arc and the final state (2,2)/1.3 being added]
  Paths 1→2 and 1→2 both accept "sleeps". So must the new machine: along the arc (1,1)→(2,2).

  35. Intersection (the completed machine)
  [Figure: the completed intersection: (0,0) --fat/0.7--> (0,1) --pig/0.7--> (1,1), with (1,1) --eats/0.6--> final state (2,0)/0.8 and (1,1) --sleeps/1.9--> final state (2,2)/1.3]
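
A sketch of the product construction behind these slides, in the tropical semiring where intersecting arcs add their weights; the arc encoding is an assumption for illustration, and unreachable state pairs are not pruned:

```python
# Weighted FSA intersection: states of the new machine are pairs of states,
# matching arcs combine by adding their weights, and so do final weights.
def intersect(arcs1, finals1, arcs2, finals2):
    arcs, finals = {}, {}
    for (q1, word, r1), w1 in arcs1.items():
        for (q2, word2, r2), w2 in arcs2.items():
            if word == word2:                        # labels must match
                arcs[((q1, q2), word, (r1, r2))] = w1 + w2
    for q1, w1 in finals1.items():
        for q2, w2 in finals2.items():
            finals[(q1, q2)] = w1 + w2
    return arcs, finals   # the start state of the product is the pair of start states

# Toy machines loosely echoing the slides' weights.
A = {(0, "fat", 0): 0.5, (0, "pig", 1): 0.3}
B = {(0, "fat", 1): 0.2, (1, "pig", 1): 0.4}
arcs, _ = intersect(A, {}, B, {})
print(arcs)  # {((0, 0), 'fat', (0, 1)): 0.7, ((0, 1), 'pig', (1, 1)): 0.7}
```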

  36. What Composition Means
  [Figure: a transducer f maps the input ab?d to the middle strings abcd, abed, abjd (with weights); a transducer g then maps abcd to abcd, and abed to abed and abd (with weights)]

  37. What Composition Means
  Relation composition: f ∘ g
  [Figure: composing the two mappings sends ab?d directly to abcd, abed, abd, …, adding the weights along each two-step path (3+4, 2+2, 2+8)]

  38. Relation = set of pairs
  f = { ab?d ↦ abcd, ab?d ↦ abed, ab?d ↦ abjd, … }
  g = { abcd ↦ abcd, abed ↦ abed, abed ↦ abd, … }
  g does not contain any pair of the form abjd ↦ …
  [Figure: the same two mappings drawn as weighted arrows]

  39. fg fg = {e↦f: m (e↦m f and m↦fg)} where e, m, f are strings Relation = set of pairs ab?d↦ abcd ab?d ↦ abed ab?d ↦ abjd … abcd ↦ abcd abed ↦ abed abed ↦ abd … ab?d↦ abcd ab?d ↦ abed ab?d ↦ abd … 4 abcd ab?d 2 abed 8 abd ... 600.465 - Intro to NLP - J. Eisner

  40. Intersection vs. Composition
  Intersection: pig/0.3 & pig/0.4 = pig/0.7 (matching arcs combine; the new states are pairs such as (0,1) and (1,1))
  Composition: Wilbur:pig/0.3 .o. pig:pink/0.4 = Wilbur:pink/0.7 (the lower symbol of the first arc must match the upper symbol of the second)

  41. Intersection vs. Composition
  Intersection mismatch: pig/0.3 & elephant/0.4 do not combine, since the labels differ (no pig/0.7 arc results)
  Composition mismatch: Wilbur:pig/0.3 .o. elephant:gray/0.4 do not combine, since the middle symbols pig and elephant differ (no Wilbur:gray/0.7 arc results)

  42. Composition (example courtesy of M. Mohri)
  [Figure: two transducers and their composition]

  43. Composition: a:b .o. b:b = a:b

  44. Composition: a:b .o. b:a = a:a

  45. Composition: a:b .o. b:a = a:a

  46. Composition: b:b .o. b:a = b:a

  47. Composition: a:b .o. b:a = a:a

  48. Composition: a:a .o. a:b = a:b

  49. Composition: b:b .o. a:b = nothing (since the intermediate symbol doesn't match)

  50. Composition: b:b .o. b:a = b:a
