Morphological Recognition

Morphological Recognition • We take the sub-lexicon of each stem class and expand the corresponding arc (e.g. the reg-noun arc) with all the morphemes that make up the set of stems in that word class (here, reg-noun). • This way an FSA is created that can be used for morphological recognition.
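The arc expansion described above can be sketched in Python. This is a minimal illustration, not the FSA from the slides: the word lists, the state numbering, and the names build_noun_fsa and accepts are all assumptions made for the example. Each stem arc is expanded into a chain of letter arcs, and the resulting automaton is run as a nondeterministic recognizer.

```python
from collections import defaultdict
from itertools import count

# Hypothetical toy sub-lexicons (not taken from the slides).
REG_NOUN = {"cat", "dog", "fox"}
IRREG_SG_NOUN = {"goose", "mouse"}
IRREG_PL_NOUN = {"geese", "mice"}


def build_noun_fsa(reg, irreg_sg, irreg_pl):
    """Expand each word-class arc into letter arcs.

    Returns (transitions, start, finals), where transitions maps
    (state, char) to a set of next states.
    """
    start, reg_end, done = 0, 1, 2
    fresh = count(3)  # states 0..2 are reserved above
    trans = defaultdict(set)

    def add_word(src, word, dst):
        # Spell the word out letter by letter between src and dst.
        state = src
        for ch in word[:-1]:
            nxt = next(fresh)
            trans[(state, ch)].add(nxt)
            state = nxt
        trans[(state, word[-1])].add(dst)

    for w in reg:
        add_word(start, w, reg_end)      # reg-noun arc
    for w in irreg_sg | irreg_pl:
        add_word(start, w, done)         # irregular arcs
    trans[(reg_end, "s")].add(done)      # plural -s arc
    return trans, start, {reg_end, done}


def accepts(fsa, word):
    """Simulate the FSA on word, tracking all reachable states."""
    trans, start, finals = fsa
    current = {start}
    for ch in word:
        current = {n for s in current for n in trans.get((s, ch), ())}
        if not current:
            return False
    return bool(current & finals)


fsa = build_noun_fsa(REG_NOUN, IRREG_SG_NOUN, IRREG_PL_NOUN)
print(accepts(fsa, "cats"), accepts(fsa, "geese"), accepts(fsa, "gooses"))
# prints: True True False
```

Because this recognizer knows nothing about spelling changes, it accepts foxs and rejects foxes.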

Two-level Morphology • Ideally, for morphological parsing we would like to input a word and get as output its stem together with its morphological information, e.g. cats -> cat+N+PL. • Two-level morphology represents a word as the correspondence between the lexical level and the surface level.

Finite State Transducer (FST) • An FST is an automaton that we use for performing the mapping between the two levels. • An FST is an automaton with two tapes that recognizes or generates pairs of strings; it therefore defines a relation between strings. • Another view of an FST is as a machine that reads one string and generates another string.

Formal FST definition • Extension of the FSA definition • Q: a finite set of states (q0, q1, q2, …) • Σ: a finite alphabet of complex symbols i:o, where i is a symbol from the input alphabet and o a symbol from the output alphabet (ε may be part of both the input and output alphabets) • q0: the start state (first state) • F: the set of final states (a subset of Q) • δ(q,i:o): the transition function from states and complex symbols to states. Given a state q and a complex symbol i:o, it returns a new state q’. • e.g. Σ = {a:a, b:b, !:!, a:!, a:b, b:a, a:ε, ε:!}
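The five components above map naturally onto a small data structure. The following Python sketch is one possible encoding, not a standard library API: the class name FST, the field names, and the toy machine are assumptions for illustration. Unlike the definition above, δ here returns a set of states so that nondeterministic machines can be expressed, and ε is written as the empty string.

```python
from dataclasses import dataclass

EPS = ""  # stands for ε


@dataclass(frozen=True)
class FST:
    states: frozenset    # Q
    alphabet: frozenset  # Σ, a set of (i, o) pairs
    start: str           # q0
    finals: frozenset    # F
    delta: dict          # δ: (q, (i, o)) -> set of next states

    def transduce(self, text):
        """Return every output string the FST pairs with text.

        Depth-first search over (state, input position, output so far).
        Assumes there is no cycle of ε-input arcs; such a cycle would
        make this simple search loop forever.
        """
        results = set()
        stack = [(self.start, 0, "")]
        while stack:
            state, pos, out = stack.pop()
            if pos == len(text) and state in self.finals:
                results.add(out)
            for (q, (i, o)), nexts in self.delta.items():
                if q != state:
                    continue
                if i == EPS:
                    step = 0          # ε:o consumes no input
                elif text.startswith(i, pos):
                    step = len(i)     # i:o consumes one symbol
                else:
                    continue
                for nxt in nexts:
                    stack.append((nxt, pos + step, out + o))
        return results


# Toy machine built from the pairs b:a, a:b and ε:! of the example alphabet.
toy = FST(
    states=frozenset({"q0", "q1", "q2"}),
    alphabet=frozenset({("b", "a"), ("a", "b"), (EPS, "!")}),
    start="q0",
    finals=frozenset({"q2"}),
    delta={
        ("q0", ("b", "a")): {"q1"},
        ("q1", ("a", "b")): {"q1"},
        ("q1", (EPS, "!")): {"q2"},
    },
)

print(toy.transduce("baa"))  # prints: {'abb!'}
```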

Useful FST Properties • Inversion: The inversion of a transducer simply switches the input and output labels of the transducer (the two tapes). It is therefore very easy to turn an FST from a parser into a generator. • Composition: Given two FSTs, T1 that maps from I to C and T2 that maps from C to O, their composition is a new transducer T1 ∘ T2 that maps from I to O. Therefore, if we have a number of FSTs that run serially, it is possible to build a single new FST that maps from the initial input to the final output.
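Both properties can be sketched on a bare-bones representation. In the Python below an FST is assumed to be a tuple (start, finals, arcs) with arcs of the form (source, input, output, target) and ε written as the empty string; the function names and the two toy transducers are illustrative assumptions. The composition uses a plain product construction, which preserves the string relation for unweighted transducers but may create redundant paths.

```python
EPS = ""  # stands for ε


def invert(fst):
    """Swap the input and output label on every arc."""
    start, finals, arcs = fst
    return start, finals, {(s, o, i, d) for (s, i, o, d) in arcs}


def states_of(fst):
    start, finals, arcs = fst
    return {start} | set(finals) | {x for (s, _, _, d) in arcs for x in (s, d)}


def compose(t1, t2):
    """Product construction for T1 ∘ T2 over paired states (p, q)."""
    s1, f1, a1 = t1
    s2, f2, a2 = t2
    arcs = set()
    for (p, a, b, p2) in a1:
        if b == EPS:
            # T1 outputs nothing, so T2 stays where it is.
            for q in states_of(t2):
                arcs.add(((p, q), a, EPS, (p2, q)))
        else:
            # T1's output b must be consumed by a T2 arc reading b.
            for (q, b2, c, q2) in a2:
                if b2 == b:
                    arcs.add(((p, q), a, c, (p2, q2)))
    for (q, b, c, q2) in a2:
        if b == EPS:
            # T2 reads nothing, so T1 stays where it is.
            for p in states_of(t1):
                arcs.add(((p, q), EPS, c, (p, q2)))
    finals = {(p, q) for p in f1 for q in f2}
    return (s1, s2), finals, arcs


def transduce(fst, text):
    """All outputs for text; assumes no cycle of ε-input arcs."""
    start, finals, arcs = fst
    results, stack = set(), [(start, 0, "")]
    while stack:
        state, pos, out = stack.pop()
        if pos == len(text) and state in finals:
            results.add(out)
        for src, i, o, dst in arcs:
            if src != state:
                continue
            if i == EPS:
                stack.append((dst, pos, out + o))
            elif text.startswith(i, pos):
                stack.append((dst, pos + len(i), out + o))
    return results


# T1 maps I to C: rewrites a as x and deletes b.
T1 = (0, {0}, {(0, "a", "x", 0), (0, "b", EPS, 0)})
# T2 maps C to O: rewrites x as 1 and appends ! at the end.
T2 = (0, {1}, {(0, "x", "1", 0), (0, EPS, "!", 1)})

print(transduce(compose(T1, T2), "ab"))  # prints: {'1!'}
print(transduce(invert(T2), "1!"))       # prints: {'x'}
```

Note that inverting a transducer that deletes symbols yields one with ε-input loops, which this simple search does not handle.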

Finite State Transducers • It is convenient to view an FST as having two tapes. • The upper or lexical tape • The lower or surface tape • Each symbol a:b in the FST alphabet expresses how a symbol from one tape is mapped to a symbol on the other tape. • Symbols such as a:a are called default pairs and are written simply as a.

FST Morphotactics • FST for English plural formation. ^ marks a morpheme boundary and # a word boundary.

FST Lexicon

Combining FST Lexicon and Morphotactics • The two FSTs for lexicon and morphotactics can be cascaded, i.e. the input is run through the lexicon FST and then its output is run through the morphotactics FST. • Based on the composition property, it is possible to compose these two FSTs into a single FST that maps directly from the lexical to the surface level (without any reference to word classes).

Orthographic Rules • The previous FST will accept the word foxs and reject the word foxes. • We need a way to deal with the spelling changes that often take place at morpheme boundaries. This is done by introducing orthographic rules. E.g. for English: • e is inserted after a stem ending in -s, -z, -x, -ch or -sh, before the suffix -s. • -y becomes -ie before -s. • Formal rule notation: a -> b / c __ d means “rewrite a as b when it occurs between c and d”. • ε -> e / {x,s,z}^ __ s# (this formal version covers only the x, s and z cases).
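The effect of the e-insertion rule on intermediate strings can be sketched with a regular-expression substitution. This is a stand-in for illustration only, not an FST: the function names are assumptions, and the pattern covers exactly the contexts in the formal rule (x, s, z), not -ch or -sh.

```python
import re

# ε -> e / {x,s,z}^ __ s#
# The lookbehind checks for x, s or z before the morpheme boundary ^,
# and the lookahead checks that the suffix s and word boundary # follow.
E_INSERTION = re.compile(r"(?<=[xsz])\^(?=s#)")


def apply_e_insertion(intermediate):
    """Insert e after the boundary when the rule's context matches."""
    return E_INSERTION.sub("^e", intermediate)


def to_surface(intermediate):
    """Apply the rule, then erase the boundary symbols ^ and #."""
    return apply_e_insertion(intermediate).replace("^", "").replace("#", "")


print(to_surface("fox^s#"))  # prints: foxes
print(to_surface("cat^s#"))  # prints: cats
```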

Orthographic Rules and FST • The spelling rule can be seen as taking a simple concatenation of morphemes (intermediate level) and producing the surface form of the word.

Orthographic Rules and FST • The previous orthographic rule can be represented as an FST.

Orthographic Rules and FST • Transition table for the previous FST.

Combining FST Lexicon and Rules • First the lexicon FST maps between the lexical level and the intermediate level which is just a concatenation of morphemes. • Then, a number of spelling rule FSTs run in parallel (or as a cascade) mapping from the intermediate level to the surface level. • The lexicon FST and the orthographic rules FST form a cascade. This can be run top-down (generation) or bottom-up (parsing).

FST Parsing • Parsing is more complicated than generation because of ambiguity. E.g. foxes may be parsed both as fox+V+3SG and as fox+N+PL. Disambiguation cannot be performed at the lexical level, so the FST should return both parses. • Ambiguities also arise during parsing because of ε arcs or multiple possible paths. This is similar to the situation for nondeterministic FSAs (NFSAs), and similar search techniques must be employed.
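The ambiguity of foxes can be reproduced with a toy model of the cascade. The Python below is a sketch under stated assumptions: the lexicon, the suffix table, the feature tag +V+INF, and the function names are all hypothetical, and instead of searching an FST it enumerates the finite lexical-to-surface relation top-down and then inverts it as a lookup table.

```python
import re
from collections import defaultdict

# Hypothetical toy lexicon: stem -> word classes it belongs to.
STEMS = {"fox": {"N", "V"}, "cat": {"N"}, "walk": {"V"}}

# Word class -> {lexical features: intermediate-level suffix}.
SUFFIXES = {
    "N": {"+N+SG": "#", "+N+PL": "^s#"},
    "V": {"+V+INF": "#", "+V+3SG": "^s#"},
}


def surface(intermediate):
    """Apply e-insertion, then erase the boundary symbols."""
    s = re.sub(r"(?<=[xsz])\^(?=s#)", "^e", intermediate)
    return s.replace("^", "").replace("#", "")


def build_relation():
    """Generation: enumerate every (lexical, surface) pair."""
    pairs = []
    for stem, classes in STEMS.items():
        for cls in classes:
            for feats, suffix in SUFFIXES[cls].items():
                pairs.append((stem + feats, surface(stem + suffix)))
    return pairs


def parse(word):
    """Parsing: invert the relation and return all lexical forms."""
    index = defaultdict(set)
    for lexical, surf in build_relation():
        index[surf].add(lexical)
    return sorted(index.get(word, ()))


print(parse("foxes"))  # prints: ['fox+N+PL', 'fox+V+3SG']
print(parse("cats"))   # prints: ['cat+N+PL']
```

Enumeration only works because the toy relation is finite and small; a real parser would search the transducer's paths directly.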
