CSA4050: Advanced Topics in NLP
Computational Morphology IV: xfst
CSA4050: Computational Morphology IV

What is xfst?
• xfst is a general tool for creating and manipulating finite-state networks, both simple automata and transducers.
• xfst and the other Xerox tools employ a notation very close to the one we have been using so far.
• For full documentation on the syntax and semantics of Xerox REs, see http://www.fsmbook.com

Simple Commands
• command line (via babe)> xfst
• define: give a name to an RE
• print: print information
• read: read information
• various stack operations
• file interaction

define command
• define name regexp ;
  xfst[0]: define foo [d o g] | [c a t];
  xfst[0]: define R1 [a | b | c | d];
  xfst[0]: define R2 [d | e | f | g];
  xfst[0]: define R3 [f | g | h | i | j];
  xfst[0]: define baz R1 & R2;

print words
• print words name: see the words in the language called name
  xfst[0]: print words R1
  d
  c
  b
  a
  xfst[0]:

print net
• print net name: see detailed information about the network name.
  xfst[0]: define z R1 & R2;
  xfst[0]: define baz R1 & R2;
  xfst[0]: print net z
  Sigma: a b c d e f g
  Size: 7
  Net: FC370
  Flags: deterministic, pruned, minimized, epsilon_free, loop_free
  Arity: 1
  s0: d -> fs1.
  fs1: (no arcs)
  xfst[0]:

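As a rough sanity check on the network printed above, the languages involved are small enough to model as plain Python sets. This is only a sketch of the semantics, not of how xfst represents networks internally; the variable names simply mirror the definitions on the slides.

```python
# Model each single-symbol language from the define slide as a set of strings.
R1 = {"a", "b", "c", "d"}
R2 = {"d", "e", "f", "g"}

# '&' in a Xerox regular expression is intersection; Python sets use the same operator.
z = R1 & R2

print(sorted(z))  # ['d']
```

The single word d corresponds to the single arc from s0 to the final state fs1 in the printed network.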
Some Properties of Networks
• epsilon free: there are no arcs labelled with the epsilon symbol.
• deterministic: no state has more than one outgoing arc with the same label.
• minimised: there is no other network with exactly the same paths that has fewer states.
• These make sense for FSAs, not necessarily for FSTs.

Equivalent?
[Figure: two transducers, A and B; the arc labels shown are a:0, a and b.]
• no. states?
• no. paths?
• relation encoded?

Remarks
• A and B encode the same relation {<"aa","a">, <"ab","ab">}.
• They are both deterministic and minimal.
• They have different numbers of states.
• Arcs labelled with a pair containing an epsilon on one side can sometimes be redistributed or eliminated, reducing the number of states.
• This situation does not occur with FSAs.

FST Determinism: Sequential vs. Unambiguous
• Unambiguous: for any input there is at most one output.
• Transducer A is unambiguous in either direction.
• Sequential: no state has more than one arc with the same symbol on the input side.
• Transducer A is not sequential in one direction.
• A transducer is sequentiable if the relation it encodes is unambiguous and all the local ambiguities resolve themselves in a fixed number of steps.

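The definition of "sequential" can be checked mechanically. The sketch below applies it literally to an arc list. The topology given for A is an assumption: it is one reading of the figure that is consistent with the relation {<"aa","a">, <"ab","ab">}, and the state names are made up for illustration.

```python
from collections import Counter

# Hypothetical reading of transducer A: arcs are (source, upper, lower, target),
# with "" standing for epsilon (written 0 on the slides).
A = [
    ("s0", "a", "", "s1"),   # a:0
    ("s1", "a", "a", "f"),   # a
    ("s0", "a", "a", "s2"),  # a
    ("s2", "b", "b", "f"),   # b
]

def is_sequential(arcs, side):
    """True if no state has two arcs sharing a symbol on the given side.

    side is 1 for the upper symbol, 2 for the lower symbol.
    """
    counts = Counter((arc[0], arc[side]) for arc in arcs)
    return all(n == 1 for n in counts.values())

print(is_sequential(A, 1))  # upper side taken as input
print(is_sequential(A, 2))  # lower side taken as input
```

Under this reading, state s0 has two arcs with a on the upper side, which is the local ambiguity the slide refers to.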
Basic Stack Operations
• read regex: push a network onto the stack
• print stack: list the items on the stack
• print net: detailed info on the top stack item
• pop stack: remove the top item from the stack
• define name: set name to the value of the top stack item

Stack Operations: intersect net, union net, etc.
• Load the stack with N suitable arguments.
• Ensure that the arguments are pushed onto the stack in the correct (reverse) order.
• Issue the command, e.g. intersect net.
• The arguments are popped from the stack, the operation is performed, and the result is pushed back onto the stack.

Stack Example 1
  xfst[0]: clear stack;
  xfst[0]: read regex [d | c | e | b | w];
  xfst[1]: read regex [b | s | h | w];
  xfst[2]: read regex [s | d | c | f | w];
  xfst[3]: print stack
  xfst[3]: intersect net
  xfst[1]: print stack
  xfst[1]: print net
  xfst[1]: print words
x1

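The stack discipline in this example can be mimicked with a Python list. This is a minimal sketch that models each network as a set of strings; the helper names read_regex and intersect_net are borrowed from the xfst commands for readability and are not an xfst API.

```python
from functools import reduce

stack = []  # the top of the stack is the end of the list

def read_regex(words):
    """Push a language (modelled as a set of strings) onto the stack."""
    stack.append(set(words))

def intersect_net():
    """Pop every language on the stack and push their intersection."""
    args = [stack.pop() for _ in range(len(stack))]
    stack.append(reduce(set.intersection, args))

read_regex("dcebw")   # [d | c | e | b | w]
read_regex("bshw")    # [b | s | h | w]
read_regex("sdcfw")   # [s | d | c | f | w]
intersect_net()
print(len(stack), sorted(stack[-1]))  # 1 ['w']
```

The stack shrinks from three items to one, matching the change of prompt from xfst[3] to xfst[1].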
Stack Example 2
  xfst[0]: clear stack;
  xfst[0]: read regex [e d | i n g | s | []];
  xfst[1]: read regex [t a l k | k i c k];
  xfst[2]: print stack
  xfst[2]: print net
  xfst[2]: print words
  xfst[2]: concatenate net
  xfst[1]: print words
x2/a

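Order matters for concatenation, which is why the suffixes are pushed first. The sketch below assumes that the topmost network is taken as the first operand, in line with the "reverse order" advice on the stack operations slide; languages are again modelled as sets of strings.

```python
from itertools import product

suffixes = {"ed", "ing", "s", ""}   # [e d | i n g | s | []]
stems = {"talk", "kick"}            # [t a l k | k i c k]

# The stems were pushed last, so they sit on top of the stack and come first.
stack = [suffixes, stems]

def concatenate_net(stack):
    """Pop all languages and push their concatenation, topmost operand first."""
    operands = stack[::-1]          # copy of the stack, top item first
    stack.clear()
    stack.append({"".join(parts) for parts in product(*operands)})

concatenate_net(stack)
print(sorted(stack[0]))
```

Pushing the two expressions in the opposite order would put the suffix in front of the stem.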
Creating Relations
• A simple example of a transducer can be shown using the crossproduct operator:
  xfst[0]: clear stack
  xfst[0]: define Y [d o g | c a t];
  xfst[0]: define Z [c h i e n | c h a t];
  xfst[0]: read regex Y .x. Z;
• We can now use apply up and apply down to test the transducer's behaviour.
x3ab

apply up; apply down
• applyup(arg,R) = {x | <x,arg> in R}
• applydown(arg,R) = {x | <arg,x> in R}
  xfst[0]: read regex [d o g | c a t] .x. [c h i e n | c h a t];
  xfst[1]: apply up chien
  dog
  cat
  xfst[1]: apply down cat
  chien
  chat

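The two set definitions translate almost word for word into Python. In this sketch the relation is an explicit set of (upper, lower) pairs, which is feasible only because the languages are finite; the function names are illustrative.

```python
from itertools import product

upper = {"dog", "cat"}
lower = {"chien", "chat"}

# .x. pairs every string of the upper language with every string of the lower one.
R = set(product(upper, lower))

def apply_up(arg, relation):
    """{x | <x,arg> in R}"""
    return {x for (x, y) in relation if y == arg}

def apply_down(arg, relation):
    """{x | <arg,x> in R}"""
    return {y for (x, y) in relation if x == arg}

print(sorted(apply_up("chien", R)))   # ['cat', 'dog']
print(sorted(apply_down("cat", R)))   # ['chat', 'chien']
```

Because the crossproduct relates every upper string to every lower string, cat is paired with chien as well as chat, which is the behaviour the following exercise asks you to correct.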
Exercise for .x.
• What RE would perform the correct translations?
• Define it in xfst.
• Define an RE in xfst which relates the surface forms "sing", "sang" and "sung" to the lexical form "sing".
x3c

Replace Rules
• Xerox RE notation includes replace rules.
• Replace rules do not increase the descriptive power of REs; however, they do provide a powerful abbreviated rule-like notation.
• There are two main types of replace rules: unconditional and conditional.

Unconditional Replace Rules
• The most straightforward kind of unconditional replace rule is: a -> b
• This denotes an FS relation in which every symbol a in the upper language corresponds to a symbol b in the lower language.
• Checkpoint: how does this differ from a:b? What is the FST that computes this relation?

Unconditional Replace e.g.
  xfst[0]: read regex c -> r;
  xfst[1]: apply down cat
  xfst[1]: apply down dog
• Where there is no match, the string is identity mapped.
• The general pattern for simple replace rules is A -> B, where A and B are REs denoting arbitrarily complex languages (not relations).
x4ab

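For a rule between two single symbols, applying it downward behaves like an ordinary global string replacement. The function below is a sketch of that special case only (the name is made up); the general A -> B, where A and B are languages, needs more machinery.

```python
def replace_c_with_r(word):
    """Mimic applying c -> r downward: every c becomes r, everything else is copied."""
    return word.replace("c", "r")

print(replace_c_with_r("cat"))  # rat
print(replace_c_with_r("dog"))  # dog
```

The second call shows the identity mapping for a string that contains no match.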
Definition of A → B
• A -> B = [ no_A [A .x. B] ]* no_A, where no_A = ~$[A - 0]
• N.B. If A does not contain the empty string, ~$[A - 0] = ~$[A]. Otherwise ~$[A] is null, whereas ~$[A - 0] contains at least the empty string.

Conditional Replace Rules
• More complex replace rules can also specify left and right context, as in A -> B || L _ R
• Each lexical substring A is related to a substring B when the left context ends with L and the right context starts with R.
• A, B, L and R are REs denoting languages, not relations.
x4c

Special Cases
• The symbol .#. refers to the absolute beginning or end of the string in left and right contexts. For example: e -> i || .#. p _ r
• Checkpoint: write a replace rule that brings lexical "go" into correspondence with surface "went".

The kaNpat exercise
• Suppose we have a language in which kaNpat is a lexical string consisting of the morpheme kaN concatenated with the suffix pat.
• The nasal N just before p is realised as m.
• p occurring just after an m is realised as m.

kaNpat rules
• We can write the following two rules to account for this behaviour:
  Rule 1. [N -> m || _ p]
• Notice that the left-hand context is empty, meaning that any context will do.
  Rule 2. [p -> m || m _]
• Note that the linguist must keep track of the order in which rules are applied.

Derivation of kammat
  Lexical:      kaNpat
                apply [N -> m || _ p]
  Intermediate: kampat
                apply [p -> m || m _]
  Surface:      kammat
• The first rule feeds the second.
• Checkpoint: what happens if rules are applied in reverse order?

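The derivation can be traced step by step with Python's re module standing in for the two rules. This is only an approximation of the downward direction: the lookahead and lookbehind assertions play the role of the right and left contexts, and the function names are made up.

```python
import re

def rule1(s):
    """N -> m || _ p : N becomes m when the next symbol is p."""
    return re.sub(r"N(?=p)", "m", s)

def rule2(s):
    """p -> m || m _ : p becomes m when the previous symbol is m."""
    return re.sub(r"(?<=m)p", "m", s)

lexical = "kaNpat"
intermediate = rule1(lexical)
surface = rule2(intermediate)
print(lexical, intermediate, surface)  # kaNpat kampat kammat
```

Swapping the two calls is a quick way to explore the checkpoint about reverse ordering.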
Composing the Relations
• Each rule describes a certain relation: call these R1 and R2.
• If R1 maps X to Y and R2 maps Y to Z, then there must exist a single relation which maps directly from X to Z without passing through Y.
• Mathematically, that relation is the composition of R1 and R2.

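On finite relations, composition is a short set comprehension. The sketch below uses one pair from each step of the kammat derivation as toy data; real rule relations are infinite, so this illustrates the definition, not how a compiler computes it.

```python
def compose(r1, r2):
    """Return every (x, z) such that (x, y) is in r1 and (y, z) is in r2 for some y."""
    return {(x, z) for (x, y1) in r1 for (y2, z) in r2 if y1 == y2}

R1 = {("kaNpat", "kampat")}   # effect of the first rule on kaNpat
R2 = {("kampat", "kammat")}   # effect of the second rule on kampat

print(compose(R1, R2))  # {('kaNpat', 'kammat')}
```

The intermediate string kampat no longer appears in the composed relation.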
Composing the Rules
• Each rule is compiled into an FST.
• If Rule 1 compiles to F1, and Rule 2 to F2, then there must be an F3 which computes the composition of F1 and F2.
• Checkpoint: write the RE corresponding to the composition of the original 2 rules.

Testing the kaNpat grammar
• First get the rules onto the stack:
  xfst[0]: read regex [N -> m || _ p] .o. [p -> m || m _];
• Try the following and explain:
  • apply down (kaNpat; kampat; kammat)
  • apply up kammat
• Try the above but with the rules in reverse order.
X5ab

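The same experiments can be rehearsed outside xfst. In the sketch below, grammar is a stand-in for the composed network in the downward direction, built from two re.sub calls. The apply_up helper is a brute-force assumption that works here only because both rules rewrite single symbols to m; it is not how xfst inverts a transducer.

```python
import re
from itertools import product

def grammar(s):
    """Stand-in for [N -> m || _ p] .o. [p -> m || m _], applied downward."""
    s = re.sub(r"N(?=p)", "m", s)       # first rule
    return re.sub(r"(?<=m)p", "m", s)   # second rule, applied to the first rule's output

def apply_up(surface):
    """Brute-force inverse: try N, p or m wherever the surface string has m."""
    choices = [("m", "N", "p") if ch == "m" else (ch,) for ch in surface]
    candidates = ("".join(c) for c in product(*choices))
    return {c for c in candidates if grammar(c) == surface}

for word in ("kaNpat", "kampat", "kammat"):
    print(word, "->", grammar(word))
print(sorted(apply_up("kammat")))
```

Exchanging the two re.sub lines gives a stand-in for the grammar with the rules in reverse order.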
Practical use of xfst
• Regular expression files (text)
  xfst[0]: read regex < regexpfile
• Binary files (compiled networks)
  xfst[1]: save stack binfile
  xfst[0]: load stack binfile
• Scripts (xfst commands)
  xfst[0]: source scriptfile
  % xfst -f myscript
  % xfst -l myscript

A’ is the sequentiable
[Figure: transducers A and A’; the arc labels shown include a:0, a, b, 0:b and b:a.]
• no. states?
• no. paths?
• relation encoded?