Composition is Our Friend
Wednesday PM
Kenneth R. Beesley
Xerox Research Centre Europe
View composition vertically

    p a t + i n + a d + i m + a b        Underlying form
    e -> i || _ .#.                      Rule 1
    p a t + i n + a d + i m + a b        Intermediate form
    d -> j, t -> c || _ ("+") i          Rule 2
    p a c + i n + a j + i m + a b        Intermediate form
    b -> p, d -> t, g -> k || _ .#.      Rule n
    p a c + i n + a j + i m + a p        Final form
View composition vertically: the composed rules are a single FST

    p a t + i n + a d + i m + a b

    e -> i || _ .#.
    .o.
    d -> j, t -> c || _ ("+") i
    .o.
    b -> p, d -> t, g -> k || _ .#.

    p a c + i n + a j + i m + a p
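The cascade above can be imitated in a few lines of Python. This is only a rough sketch: each rule is approximated with an ordinary regular-expression substitution rather than a real transducer, and the function names (`rule_1`, `rule_2`, `rule_n`, `compose`) are invented for this illustration. It shows that feeding the underlying form through one composed mapping gives the same final form as stepping through the intermediate forms.

```python
import re
from functools import reduce


def rule_1(form):
    # e -> i || _ .#.   (word-final e becomes i)
    return re.sub(r"e$", "i", form)


def rule_2(form):
    # d -> j, t -> c || _ ("+") i   (before i, with an optional + in between)
    return re.sub(r"[dt](?=\+?i)", lambda m: {"d": "j", "t": "c"}[m.group()], form)


def rule_n(form):
    # b -> p, d -> t, g -> k || _ .#.   (word-final devoicing)
    return re.sub(r"[bdg]$", lambda m: {"b": "p", "d": "t", "g": "k"}[m.group()], form)


def compose(*rules):
    """Return one function that applies the rules from top to bottom."""
    def cascade(form):
        return reduce(lambda current, rule: rule(current), rules, form)
    return cascade


single_mapping = compose(rule_1, rule_2, rule_n)
final_form = single_mapping("pat+in+ad+im+ab")
print(final_form)
```

Unlike this function chain, real composition produces one network with no intermediate forms at run time; the sketch only mirrors the input/output behaviour.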
Composition is Our Friend The composition operation is often the key to building, modifying, filtering and testing finite-state systems.
You Can Compose Transducers
• Regular languages (and the networks that encode them) can be unioned, concatenated, intersected, subtracted and complemented.
• Regular relations (and the transducers that encode them) can be unioned and concatenated.
• But you cannot, in general, intersect, complement, or subtract transducers (relations). This is a mathematical restriction: regular relations are not closed under these operations.
• But you can compose transducers.
An Example for the Mathematicians
• Regular relations are not closed under intersection (&), subtraction (-) or complementation (~).
• This means that when you intersect, subtract or complement regular relations, the result may no longer be regular. I.e. the result may no longer be finite state, and so cannot be encoded as a finite-state network.
• The following example is based on intersection.
Intersection of Two Finite-State Relations

FST A: [ a:b ]* [ 0:c ]*
• On the upper side, some number n of as
• On the lower side, n bs, followed by any number of cs

FST B: [ 0:b ]* [ a:c ]*
• On the upper side, some number n of as
• On the lower side, any number of bs, followed by n cs
Attempted Intersection of Two Finite-State Relations (FSTs)

    upper side    lower side
    0             0
    a             bc
    aa            bbcc
    aaa           bbbccc
    aaaa          bbbbcccc

The lower-side language of the resulting relation is b^n c^n. And the b^n c^n language is known to be context-free in power (i.e. beyond finite-state power).
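The intersection can be checked by brute force on bounded samples of the two relations. The sketch below is an illustration only: it enumerates string pairs up to an arbitrary bound (`MAX_LEN`, a name chosen here) instead of building transducers, so it demonstrates the pattern rather than proving non-regularity.

```python
MAX_LEN = 4  # arbitrary bound for this illustration

# FST A: [ a:b ]* [ 0:c ]*  pairs a^n with b^n followed by any number of cs.
relation_a = {
    ("a" * n, "b" * n + "c" * m)
    for n in range(MAX_LEN + 1)
    for m in range(MAX_LEN + 1)
}

# FST B: [ 0:b ]* [ a:c ]*  pairs a^n with any number of bs followed by c^n.
relation_b = {
    ("a" * n, "b" * m + "c" * n)
    for n in range(MAX_LEN + 1)
    for m in range(MAX_LEN + 1)
}

# Only pairs present in both relations survive the intersection.
intersection = relation_a & relation_b

for upper, lower in sorted(intersection, key=lambda pair: len(pair[0])):
    print(repr(upper), repr(lower))
```

Every surviving pair has as many bs as cs on the lower side, which is exactly the b^n c^n pattern that no finite-state network can encode for unbounded n.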
Back Down to Earth • Just be aware that transducers cannot, in general, be intersected, subtracted, or complemented. • But transducers can be unioned, concatenated, and composed. • Composition is often the key operation for modifying, filtering, and combining transducers.
Phonological/Orthographical Rules
"Application" of rules via composition is already familiar to us.

    Lexicon FST (lexc)
    .o.
    Rule 1
    .o.
    Rule 2
    .o.
    Rule n
Orthographical Modification via Composition
Standard German spelling uses ü, ö, ä and ß. An alternative orthography, where these letters are not available, replaces them with “ue”, “oe”, “ae” and “ss” respectively.

    StandardGermanFST      (ü, ö, ä and ß on the lower side, e.g. läßt)
    .o.
    [ ü -> u e , ö -> o e , ä -> a e , ß -> s s ]

The result is ModifiedGermanFST, with ue, oe, ae and ss on the lower side, e.g. laesst.

How would we modify StandardGermanFST to analyze both über and ueber, läßt and laesst and laeßt and lässt?
Composition: top and bottom
If you compose a rule on the bottom of an FST, it modifies only the lower-side language of the FST.

    CoreFST
    .o.
    Rule

If you compose a rule on the top of an FST, it modifies only the upper-side language of the FST.

    Rule
    .o.
    CoreFST
Change a Tagname on the Upper Side via Composition
An example of composition on the upper side ...

    casa[Subst][Masc][Pl]
    "[Subst]" <- "[Noun]"
    .o.
    casa[Noun][Masc][Pl]      Baseform+Tags language
    Core Lexicon
    casas                     surface-word language
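A finite toy model makes the top/bottom distinction concrete. In the sketch below a relation is just a Python set of (upper, lower) string pairs, and `compose` is a helper written for this illustration, not an xfst call. The singular entry in the lexicon is an invented example added alongside the slide's plural form.

```python
def compose(upper_rel, lower_rel):
    """Compose two finite relations given as sets of (upper, lower) pairs.

    A pair (u, l) is in the result when some middle string is the lower
    side of a pair in upper_rel and the upper side of a pair in lower_rel.
    """
    return {
        (u, l)
        for (u, mid_a) in upper_rel
        for (mid_b, l) in lower_rel
        if mid_a == mid_b
    }


# Toy core lexicon: Baseform+Tags on the upper side, surface words below.
core_lexicon = {
    ("casa[Noun][Masc][Pl]", "casas"),
    ("casa[Noun][Masc][Sg]", "casa"),  # hypothetical extra entry
}

# "[Subst]" <- "[Noun]": [Subst] on the upper side, [Noun] on the lower side.
rename_tag = {
    (upper.replace("[Noun]", "[Subst]"), upper)
    for (upper, _) in core_lexicon
}

# Composing the rule on top changes only the upper-side language.
renamed_lexicon = compose(rename_tag, core_lexicon)
print(sorted(renamed_lexicon))
```

The surface words are untouched; only the tag strings on the upper side differ from the core.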
Simple Filtering to Facilitate Testing
Take a “lexical transducer”, remove everything but adjectives. When a simple language is used in composition, it is automatically treated like an identity relation.

    $"[Adj]"
    .o.
    Core Lexicon      (Baseform+Tags language above, surface-word language below)
Simple Filtering II
Take a lexical transducer and remove the adjectives (leave the rest).

    ~$"[Adj]"
    .o.
    Core Lexicon      (Baseform+Tags language above, surface-word language below)
Simple Filtering III
Take an English lexical transducer and restrict it to contain
• only adjectives
• that end in -ly

    $"[Adj]"
    .o.
    Core Lexicon      (Baseform+Tags language above, surface-word language below)
    .o.
    ?* l y

Result on the surface side: friendly, lovely, cowardly, dastardly, …
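The three filters can be mimicked on a small set of (upper, lower) pairs. This is a sketch under simplifying assumptions: the lexicon entries and the helpers `filter_upper` and `filter_lower` are invented for illustration, and composing with a language is modelled as keeping only the pairs whose relevant side belongs to that language.

```python
# Toy lexical transducer: Baseform+Tags above, surface words below.
core_lexicon = {
    ("friendly[Adj]", "friendly"),
    ("lovely[Adj]", "lovely"),
    ("happy[Adj]", "happy"),
    ("quickly[Adv]", "quickly"),
    ("dog[Noun][Pl]", "dogs"),
}


def filter_upper(relation, accepts):
    """Model `Language .o. FST`: keep pairs whose upper side is accepted."""
    return {(u, l) for (u, l) in relation if accepts(u)}


def filter_lower(relation, accepts):
    """Model `FST .o. Language`: keep pairs whose lower side is accepted."""
    return {(u, l) for (u, l) in relation if accepts(l)}


# $"[Adj]" .o. Core          (only adjectives)
adjectives = filter_upper(core_lexicon, lambda u: "[Adj]" in u)

# ~$"[Adj]" .o. Core         (everything except adjectives)
non_adjectives = filter_upper(core_lexicon, lambda u: "[Adj]" not in u)

# $"[Adj]" .o. Core .o. ?* l y   (adjectives whose surface form ends in -ly)
ly_adjectives = filter_lower(adjectives, lambda l: l.endswith("ly"))

print(sorted(l for (_, l) in ly_adjectives))
```

Note that `quickly` is excluded even though it ends in -ly, because the upper-side filter has already removed everything that is not tagged `[Adj]`.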
Mindtuning for Finite-State Development
• Try to imagine all the possible uses/users of your system.
• Try to create a core system that may, by itself, serve nobody, but which, via filtering, may serve in multiple systems.
• If it seems that you have to decide between choice A and choice B, try to create a single core system, with one set of source files, that supports both A and B. This applies to:
  • Language dialects
  • Spelling dialects
  • Spelling relaxations
Language Dialects: equivalent ways to start

Version 1, with the dialect tags declared as multicharacter symbols:

    Multichar_Symbols ^A ^B +Sg +Pl

    LEXICON Root
    Nouns ;

    LEXICON Nouns
    jail^A:jail N ;
    gaol^B:gaol N ;
    dog N ;

    LEXICON N
    +Sg:0 # ;
    +Pl:s # ;

Version 2, with the same entries written as regular expressions:

    LEXICON Root
    Nouns ;

    LEXICON Nouns
    < j a i l %^A:0 > N ;
    < g a o l %^B:0 > N ;
    dog N ;

    LEXICON N
    < %+Sg:0 > # ;
    < %+Pl:s > # ;
One Core, Several Final Products
To leave both American and British words in the lexicon, just remove the dialect tags, mapping them to the empty string.

    0 <- %^A
    .o.
    0 <- %^B
    .o.
    CommonCoreFST
One Core, Several Products
To leave just British (and common) words in the lexicon, filter out the exclusively American words. Two equivalent ways:

    0 <- %^B
    .o.
    ~$[%^A]
    .o.
    CommonCoreFST

    0 <- %^B
    .o.
    ~[?*] <- %^A
    .o.
    CommonCoreFST
One Core, Several Products
To leave just American (and common) words in the lexicon, filter out the exclusively British words. Two equivalent ways:

    0 <- %^A
    .o.
    ~$[%^B]
    .o.
    CommonCoreFST

    0 <- %^A
    .o.
    ~[?*] <- %^B
    .o.
    CommonCoreFST
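The "one core, several products" idea can be sketched with the jail/gaol/dog lexicon as a plain set of (upper, lower) pairs. The helper `build_product` and its parameter names are invented for this illustration; it first drops entries carrying an excluded dialect tag (the `~$[...]` filter) and then deletes the remaining tags from the upper side (the `0 <- ...` rules).

```python
# Toy CommonCoreFST: dialect tags and number tags above, surface words below.
common_core = {
    ("jail^A+Sg", "jail"), ("jail^A+Pl", "jails"),
    ("gaol^B+Sg", "gaol"), ("gaol^B+Pl", "gaols"),
    ("dog+Sg", "dog"), ("dog+Pl", "dogs"),
}


def build_product(core, strip_tags, exclude_tags=()):
    """Derive one product from the core without touching the core itself."""
    product = set()
    for upper, lower in core:
        # ~$[tag]: discard entries that carry an excluded dialect tag.
        if any(tag in upper for tag in exclude_tags):
            continue
        # 0 <- tag: remove the remaining dialect tags from the upper side.
        for tag in strip_tags:
            upper = upper.replace(tag, "")
        product.add((upper, lower))
    return product


both_dialects = build_product(common_core, strip_tags=("^A", "^B"))
british_only = build_product(common_core, strip_tags=("^B",), exclude_tags=("^A",))
american_only = build_product(common_core, strip_tags=("^A",), exclude_tags=("^B",))

print(sorted(british_only))
```

All three products come from the same source data, so adding a new word to the core updates every product at once.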
Vulgar/Slang/Substandard
Use similar feature symbols on the lexical side, e.g.
• ^V for vulgar words
• ^S for slang
• ^D for substandard forms
Then filter them out as necessary, via composition, for each version of the final product.
Spelling Distinctions
If one dialect makes a spelling distinction, and another ignores it, build your core system to show the distinction.

    lingüístico Adj ;

This is the Spanish spelling used in Latin America. Then for Spain, where the ü is not used, modify the core trivially via composition on both sides:

    u <- ü
    .o.
    CommonCoreFST
    .o.
    ü -> u
Spelling Relaxations, Accentuation
Build your core system to reflect formally correct spelling. Then relax that spelling in some versions of your system via composition, e.g. to allow accents to be “dropped”.

    StandardSpanishFST
    .o.
    [ é (->) e , í (->) i , á (->) a , ó (->) o , ú (->) u , ü (->) u ]
Relaxed German, accept ü or ue
Standard German spelling uses ü, ö, ä and ß. You might want to accept them AND also ue, oe, ae and ss.

    StandardGermanFST
    .o.
    [ ü (->) u e , ö (->) o e , ä (->) a e , ß (->) s s ]
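The effect of the optional rules `(->)` can be sketched by enumerating, for each standard word, every spelling in which each special letter is either kept or replaced. This is a toy model, not xfst: `relaxed_spellings` and `build_analyzer` are names invented here, and the two-word lexicon is only a sample.

```python
from itertools import product

# ü (->) u e , ö (->) o e , ä (->) a e , ß (->) s s
RELAXATIONS = {"ü": "ue", "ö": "oe", "ä": "ae", "ß": "ss"}


def relaxed_spellings(word):
    """Return every spelling where each special letter is kept or replaced."""
    choices = [
        (ch, RELAXATIONS[ch]) if ch in RELAXATIONS else (ch,)
        for ch in word
    ]
    return {"".join(parts) for parts in product(*choices)}


def build_analyzer(standard_words):
    """Map each accepted surface spelling back to its standard spelling(s)."""
    analyzer = {}
    for standard in standard_words:
        for surface in relaxed_spellings(standard):
            analyzer.setdefault(surface, set()).add(standard)
    return analyzer


analyzer = build_analyzer(["läßt", "über"])
print(sorted(relaxed_spellings("läßt")))
print(analyzer["ueber"])
```

Because each replacement is optional and independent, mixed spellings such as laeßt and lässt are accepted along with läßt and laesst.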
Summary: About Choices
• When it appears that you have to make a choice (dialect, orthography, register, etc.) between A and B, always try to make a common “core” system that is the basis for
  • Choice A alone
  • Choice B alone
  • Choice A and B
• Composition is often the key to modifying a common core system for a variety of uses.
• The failure to abstract and generalize is a sign of a finite-state beginner.