Morphological Recognition • We take the sub-lexicon of each stem class and expand the corresponding arc (e.g. the reg-noun arc) with all the morphemes that make up the set of stems in that class (e.g. the reg-noun word class). • This way an FSA is created that can be used for morphological recognition.
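The expansion described above can be sketched in Python. This is a minimal illustration, not the deck's own FSA: the stem lists, the state numbering and the names build_fsa and recognize are assumptions, and only the plural -s suffix is modelled.

```python
# Hypothetical toy sub-lexicons; the stems are illustrative only.
REG_NOUN = {"cat", "dog", "fox"}
IRREG_SG_NOUN = {"goose", "mouse"}
IRREG_PL_NOUN = {"geese", "mice"}

START, STEM_END, WORD_END = 0, 1, 2  # reserved state numbers


def build_fsa():
    """Expand each word-class arc into letter-level arcs.

    Returns (transitions, finals), where transitions maps
    (state, char) -> set of next states (the machine is non-deterministic).
    """
    transitions = {}
    next_id = [3]  # fresh states start after the reserved ones

    def add_path(word, end):
        state = START
        for ch in word[:-1]:
            new = next_id[0]
            next_id[0] += 1
            transitions.setdefault((state, ch), set()).add(new)
            state = new
        transitions.setdefault((state, word[-1]), set()).add(end)

    for stem in REG_NOUN:                       # the reg-noun arc
        add_path(stem, STEM_END)
    for word in IRREG_SG_NOUN | IRREG_PL_NOUN:  # irregular forms take no suffix
        add_path(word, WORD_END)
    transitions.setdefault((STEM_END, "s"), set()).add(WORD_END)  # plural -s
    return transitions, {STEM_END, WORD_END}


def recognize(word, fsa):
    """Return True if the FSA accepts the word."""
    transitions, finals = fsa
    states = {START}
    for ch in word:
        states = {n for s in states for n in transitions.get((s, ch), ())}
        if not states:
            return False
    return bool(states & finals)
```

With these toy lexicons, recognize("cats", build_fsa()) and recognize("geese", build_fsa()) succeed while "gooses" is rejected. The machine also accepts "foxs", because no spelling rules are involved at this stage.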
Two-level Morphology • Ideally, for morphological parsing we would like to input a word and get as output its stem with morphological information, e.g. cats -> cat + N + PL. • Two-level morphology represents a word as the correspondence between the lexical and the surface level.
Finite State Transducer (FST) • An FST is an automaton that we use for performing the mapping between the two levels. • An FST is an automaton with two tapes that recognizes or generates pairs of strings; it therefore defines a relation between strings. • Another view of an FST is as a machine that reads one string and generates another string.
Formal FST definition • Extension of the FSA definition • Q: a finite set of states (q0, q1, q2, …) • Σ: a finite alphabet of complex symbols, i.e. i:o pairs where i is a symbol from the input alphabet and o a symbol from the output alphabet (ε may be part of both the input and output alphabets) • q0: the start state (first state) • F: the set of final states (a subset of Q) • δ(q,i:o): the transition function from states and complex symbols to states. Given a state q and a complex symbol i:o, it returns a new state q’. • e.g. Σ = {a:a, b:b, !:!, a:!, a:b, b:a, a:ε, ε:!}
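The five components above map directly onto a small data structure. The following Python sketch is only an illustration: the class name FST, the method accepts and the two-state machine toy are assumptions, and δ is taken to be deterministic over complex symbols.

```python
from dataclasses import dataclass


@dataclass
class FST:
    states: set    # Q
    alphabet: set  # Sigma: complex symbols as (i, o) pairs
    start: str     # q0
    finals: set    # F, a subset of Q
    delta: dict    # (state, (i, o)) -> next state

    def accepts(self, pairs):
        """Check whether a sequence of i:o pairs leads to a final state."""
        state = self.start
        for pair in pairs:
            if pair not in self.alphabet or (state, pair) not in self.delta:
                return False
            state = self.delta[(state, pair)]
        return state in self.finals


# Hypothetical machine: swaps a and b, then requires a closing "!".
toy = FST(
    states={"q0", "q1"},
    alphabet={("a", "b"), ("b", "a"), ("!", "!")},
    start="q0",
    finals={"q1"},
    delta={
        ("q0", ("a", "b")): "q0",
        ("q0", ("b", "a")): "q0",
        ("q0", ("!", "!")): "q1",
    },
)
```

Here toy.accepts([("a", "b"), ("b", "a"), ("!", "!")]) holds: the machine relates the upper string ab! to the lower string ba!.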
Useful FST Properties • Inversion: The inversion of a transducer simply switches the input and output labels of the transducer (the two tapes). It is therefore very easy to turn an FST from a parser into a generator. • Composition: Given two FSTs, T1 that maps from I to C and T2 that maps from C to O, their composition is a new transducer T1 ∘ T2 that maps from I to O. Therefore, if we have a number of FSTs that run serially, it is possible to build a single new FST that maps from the initial input to the final output.
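Both properties can be tried out on a toy representation. In the sketch below an FST is a triple (arcs, start, finals) with arcs of the form (source, input, output, target); this representation, the function names and the machines T1 and T2 are assumptions made for illustration, and the composition is a plain product construction that assumes ε-free machines.

```python
def invert(fst):
    """Swap the input and output label on every arc."""
    arcs, start, finals = fst
    return ({(s, o, i, d) for (s, i, o, d) in arcs}, start, finals)


def compose(t1, t2):
    """Product construction for epsilon-free transducers.

    An arc i:c of t1 and an arc c:o of t2 combine into an arc i:o.
    """
    arcs1, start1, finals1 = t1
    arcs2, start2, finals2 = t2
    arcs = {
        ((s1, s2), i, o, (d1, d2))
        for (s1, i, c1, d1) in arcs1
        for (s2, c2, o, d2) in arcs2
        if c1 == c2
    }
    finals = {(f1, f2) for f1 in finals1 for f2 in finals2}
    return (arcs, (start1, start2), finals)


def transduce(fst, text):
    """Return the set of output strings the FST relates to text."""
    arcs, start, finals = fst
    configs = {(start, "")}
    for ch in text:
        configs = {
            (d, out + o)
            for (state, out) in configs
            for (s, i, o, d) in arcs
            if s == state and i == ch
        }
    return {out for (state, out) in configs if state in finals}


# Hypothetical one-state machines: T1 maps a->b, b->c; T2 maps b->x, c->y.
T1 = ({(0, "a", "b", 0), (0, "b", "c", 0)}, 0, {0})
T2 = ({(0, "b", "x", 0), (0, "c", "y", 0)}, 0, {0})
```

Running transduce(compose(T1, T2), "ab") yields the same result as feeding the output of T1 into T2, and transduce(invert(T1), "bc") recovers "ab".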
Finite State Transducers • It is convenient to view an FST as having two tapes. • The upper or lexical tape • The lower or surface tape • Each symbol a:b in the FST alphabet expresses how a symbol from one tape is mapped to a symbol on the other tape. • Symbols such as a:a are called default pairs and are represented simply as a.
FST Morphotactics FST for English plural formation. ^ marks a morpheme boundary and # a word boundary.
Combining FST Lexicon and Morphotactics • The two FSTs for lexicon and morphotactics can be cascaded, i.e. the input is run through the lexicon FST and then the output is run through the morphotactics FST. • Based on the composition property, it is possible to compose these two FSTs into a single FST that maps directly from the lexical to the surface level (without any reference to word classes).
Orthographic Rules • The previous FST will accept the word foxs and reject the word foxes. • We need a way to deal with the spelling changes that often take place at morpheme boundaries. This is done by introducing orthographic rules. E.g. for English: • e is inserted after -s, -z, -x, -ch, -sh before -s. • -y becomes -ie before -s. • Formal rule notation: a -> b / c __ d means “rewrite a as b when it occurs between c and d”. • ε -> e / {x,s,z}^ __ s#, i.e. insert e after a morpheme-final x, s or z when the next morpheme is a word-final s.
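The rule notation a -> b / c __ d can be tried out with a small regular-expression helper. This is only a sketch: it uses Python's re module rather than an FST, the function name rewrite is an assumption, and the left context must be a fixed-width pattern because it is compiled into a lookbehind.

```python
import re


def rewrite(a, b, left, right, text):
    """Apply a -> b / left __ right to text.

    a and b are literal strings (use "" for epsilon); left and right are
    regular-expression fragments. left must have a fixed width.
    """
    pattern = f"(?<={left}){re.escape(a)}(?={right})"
    return re.sub(pattern, lambda match: b, text)


def e_insertion(intermediate):
    """epsilon -> e / {x,s,z}^ __ s#  (the -ch and -sh cases are left out)."""
    return rewrite("", "e", r"[xsz]\^", "s#", intermediate)
```

For the intermediate form fox^s# the rule gives fox^es#, while cat^s# is left unchanged because t is not in the left context.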
Orthographic Rules and FST • The spelling rule can be seen as taking a simple concatenation of morphemes (intermediate level) and producing the surface form of the word.
Orthographic Rules and FST • The previous orthographic rule can be represented as an FST.
Orthographic Rules and FST • Transition table for the previous FST.
Combining FST Lexicon and Rules • First the lexicon FST maps between the lexical level and the intermediate level, which is just a concatenation of morphemes. • Then, a number of spelling rule FSTs run in parallel (or as a cascade), mapping from the intermediate level to the surface level. • The lexicon FST and the orthographic rule FSTs form a cascade. This can be run top-down (generation) or bottom-up (parsing).
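The top-down (generation) direction of such a cascade can be sketched with plain functions standing in for the FSTs. Everything named here is an assumption for illustration: the SUFFIXES table, the function names, and the consonant condition on the y rule, which is a refinement not stated in the slides (it keeps boys from becoming boies).

```python
import re

# Hypothetical lexicon stage: maps morphological tags to suffix morphemes.
SUFFIXES = {"+N+SG": "", "+N+PL": "^s", "+V+3SG": "^s"}


def lexicon(lexical):
    """Lexical level -> intermediate level, e.g. fox+N+PL -> fox^s#."""
    stem, plus, tags = lexical.partition("+")
    return stem + SUFFIXES[plus + tags] + "#"


def e_insertion(form):
    """Insert e after x, s or z at a morpheme boundary before a final s."""
    return re.sub(r"(?<=[xsz])\^(?=s#)", "^e", form)


def y_replacement(form):
    """Rewrite y as ie before a final s (assumed: only after a consonant)."""
    return re.sub(r"(?<=[^aeiou])y\^(?=s#)", "ie^", form)


def surface(form):
    """Drop the boundary symbols ^ and #."""
    return form.replace("^", "").replace("#", "")


def generate(lexical):
    """Run the cascade top-down: lexicon, then each spelling rule in turn."""
    form = lexicon(lexical)
    for rule in (e_insertion, y_replacement):
        form = rule(form)
    return surface(form)
```

For example, generate("fox+N+PL") passes through fox^s# and fox^es# before producing foxes, and generate("try+V+3SG") produces tries.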
FST Parsing • Parsing is more complicated than generation because of ambiguity. E.g. foxes may be parsed both as fox+V+3SG and as fox+N+PL. Disambiguation cannot be performed at the lexical level, so both parses should be given by the FST. • Ambiguities also occur during parsing due to ε arcs or multiple possible paths. This is similar to the case of non-deterministic FSAs (NFSA), and similar search techniques must be employed.
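The search over multiple paths and ε arcs can be sketched with an agenda, as for an NFSA. The machine below is a hypothetical, hand-built transducer for the single stem fox; as a shortcut it maps the plural and 3SG tags directly to the surface string es instead of going through an intermediate level, and it assumes there are no ε cycles.

```python
# Arcs are (source, lexical symbol, surface string, target).
# An empty surface string plays the role of epsilon on the surface tape.
ARCS = [
    (0, "f", "f", 1), (1, "o", "o", 2), (2, "x", "x", 3),
    (3, "+N", "", 4), (4, "+SG", "", 6), (4, "+PL", "es", 6),
    (3, "+V", "", 5), (5, "+3SG", "es", 6),
]
FINALS = {6}


def parse(surface):
    """Return every lexical string the transducer pairs with surface."""
    results = set()
    agenda = [(0, 0, "")]  # (state, position in surface, lexical string so far)
    while agenda:
        state, pos, lexical = agenda.pop()
        if state in FINALS and pos == len(surface):
            results.add(lexical)
        for src, lex, surf, dst in ARCS:
            # An arc applies if its surface string matches at the current
            # position; an empty surface string always matches.
            if src == state and surface.startswith(surf, pos):
                agenda.append((dst, pos + len(surf), lexical + lex))
    return results
```

Calling parse("foxes") returns both fox+N+PL and fox+V+3SG, which is the ambiguity discussed above; choosing between them is left to later processing.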