Morphology See Harald Trost, “Morphology”, Chapter 2 of R. Mitkov (ed.), The Oxford Handbook of Computational Linguistics, Oxford: OUP (2004); and D. Jurafsky & J.H. Martin, Speech and Language Processing, Upper Saddle River, NJ: Prentice Hall (2000), Chapter 3 [quite technical]
Morphology - reminder • Internal analysis of word forms • morpheme – allomorphic variation • Words usually consist of a root plus affix(es), though some words can have multiple roots, and some can be single morphemes • lexeme – abstract notion of group of word forms that ‘belong’ together • lexeme ~ root ~ stem ~ base form ~ dictionary (citation) form
Role of morphology • Commonly made distinction: inflectional vs derivational • Inflectional morphology is grammatical • number, tense, case, gender • Derivational morphology concerns word building • part-of-speech derivation • words with related meaning
Inflectional morphology • Grammatical in nature • Does not carry meaning, other than grammatical meaning • Highly systematic, though there may be irregularities and exceptions • Simplifies lexicon, only exceptions need to be listed • Unknown words may be guessable • Language-specific and sometimes idiosyncratic • (Mostly) helpful in parsing
Derivational morphology • Lexical in nature • Can carry meaning • Fairly systematic, and predictable up to a point • Simplifies description of lexicon: regularly derived words need not be listed • Unknown words may be guessable • But … • Apparent derivations have specialised meaning • Some derivations missing • Languages often have parallel derivations which may be translatable
Morphological processes • Affixes: prefix, suffix, infix, circumfix • Vowel change (umlaut, ablaut) • Gemination, (partial) reduplication • Root and pattern • Stress (or tone) change • Sandhi
Morphophonemics • Morphemes and allomorphs • eg {plur}: +(e)s, vowel change, y→ies, f→ves, um→a, ... • Morphophonemic variation • Affixes and stems may have variants which are conditioned by context • eg +ing in lifting, swimming, boxing, raining, hoping, hopping • Rules may be generalisable across morphemes • eg +(e)s in cats, boxes, tomatoes, matches, dishes, buses • Applies to both {plur} (nouns) and {3rd sing pres} (verbs)
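The generalisable +(e)s rule can be sketched as a small Python function (an illustration, not a complete rule set: lexically conditioned cases like tomato→tomatoes are deliberately not modelled):

```python
import re

def add_es(stem: str) -> str:
    """Apply the +(e)s spelling rule: insert e after a sibilant, plain s elsewhere."""
    if re.search(r'(s|x|z|ch|sh)$', stem):
        return stem + 'es'
    return stem + 's'

# The same rule serves both {plur} on nouns and {3rd sing pres} on verbs:
for w in ('cat', 'box', 'match', 'dish', 'bus'):
    print(add_es(w))   # cats boxes matches dishes buses
```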
Morphology in NLP • Analysis vs synthesis • what does dogs mean? vs what is the plural of dog? • Analysis • Need to identify lexeme • Tokenization • To access lexical information • Inflections (etc) carry information that will be needed by other processes (eg agreement useful in parsing; inflections can carry meaning, eg tense, number) • Morphology can be ambiguous • May need other process to disambiguate (eg German –en) • Synthesis • Need to generate appropriate inflections from underlying representation
Morphology in NLP • String-handling programs can be written • More general approach • formalism to write rules which express correspondence between surface and underlying form (eg dogs = dog +{plur}) • Computational algorithm (program) which can apply those rules to actual instances • Especially of interest if the rules (though not the program) are independent of direction: analysis or synthesis
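A direction-independent rule set can be sketched as follows (an invented illustration, not the slides' formalism: each rule is stated once as a lexical/surface correspondence, given here as a regex for each direction, so one table drives both analysis and synthesis):

```python
import re

RULES = [
    # lexical -> surface              surface -> lexical
    (r'(.*[sxz])\+PLUR$', r'\1es',    r'(.*[sxz])es$', r'\1+PLUR'),
    (r'(.*[^aeiou])y\+PLUR$', r'\1ies', r'(.*[^aeiou])ies$', r'\1y+PLUR'),
    (r'(.*)\+PLUR$', r'\1s',          r'(.*)s$', r'\1+PLUR'),
]

def synthesize(underlying: str) -> str:
    """dog+PLUR -> dogs: apply the first matching rule."""
    for lex_pat, surf_rep, _, _ in RULES:
        if re.match(lex_pat, underlying):
            return re.sub(lex_pat, surf_rep, underlying)
    return underlying

def analyze(surface: str):
    """dogs -> dog+PLUR: collect ALL matching rules -- analysis may be ambiguous."""
    return [re.sub(surf_pat, lex_rep, surface)
            for _, _, surf_pat, lex_rep in RULES
            if re.match(surf_pat, surface)]

print(synthesize('dog+PLUR'))   # dogs
print(analyze('boxes'))         # ['box+PLUR', 'boxe+PLUR'] -- needs disambiguation
```

Note how analysis returns multiple candidates for boxes: exactly the ambiguity that a later process (eg the lexicon) must resolve.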
Role of lexicon in morphology • Rules interact with the lexicon • Obviously category information • eg rules that apply to nouns • Note also morphology-related subcategories • eg “-er” verbs in French, rules for gender agreement • Other lexical information can impact on morphology • eg all fish have two forms of the plural (+s and ∅) • in Slavic languages case inflections differ for inanimate and animate nouns
Problems with rules • Exceptions have to be covered • Including systematic irregularities • May be a trade-off between treating something as a small group of irregularities or as a list of unrelated exceptions (eg French irregular verbs, English f→ves) • Rules must not over- or under-generate • Must cover all and only the correct cases • May depend on what order the rules are applied in
Tokenization • The simplest form of analysis is to reduce different word forms into tokens • Also called “normalization” • For example, if you want to count how many times a given ‘word’ occurs in a text • Or you want to search for texts containing certain ‘words’ (e.g. Google)
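A minimal sketch of normalization for counting (the text string and regex tokenizer are invented for the example): without any morphological processing, inflected forms of the same lexeme are counted separately.

```python
import re
from collections import Counter

text = "The dog barked. The dogs bark; a dog barks."
# Normalize: lowercase, then keep alphabetic strings only
# (a crude stand-in for real tokenization).
tokens = re.findall(r"[a-z]+", text.lower())
counts = Counter(tokens)
print(counts['dog'], counts['dogs'])   # 2 1 -- dog and dogs counted apart
```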
Morphological processing • Stemming • String-handling approaches • Regular expressions • Mapping onto finite-state automata • 2-level morphology • Mapping between surface form and lexical representation
Stemming • Stemming is the particular case of tokenization which reduces inflected forms to a single base form or stem • (Recall our discussion of stem ~ base form ~ dictionary form ~ citation form) • Stemming algorithms are basic string-handling algorithms, which depend on rules which identify affixes that can be stripped
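A naive suffix-stripping stemmer can be sketched as below (far cruder than, say, the Porter stemmer; the SUFFIXES list and minimum-stem-length condition are invented for the illustration):

```python
# Affixes that may be stripped, longest first.
SUFFIXES = ['ing', 'ed', 'es', 's']

def stem(word: str) -> str:
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[:-len(suf)]
    return word

print([stem(w) for w in ['lifting', 'boxes', 'dogs', 'hoped']])
# ['lift', 'box', 'dog', 'hop'] -- note hoped -> hop, not hope:
# pure string-stripping cannot restore the deleted e.
```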
Finite state automata • A finite state automaton is a simple and intuitive formalism with straightforward computational properties (so easy to implement) • A bit like a flow chart, but can be used for both recognition (analysis) and generation • FSAs have a close relationship with “regular expressions”, a formalism for expressing strings, mainly used for searching texts, or stipulating patterns of strings
Finite state automata • A bit like a flow chart, but can be used for both recognition and generation • “Transition network” • Unique start point • Series of states linked by transitions • Transitions represent input to be accounted for, or output to be generated • Legal exit-point(s) explicitly identified
Example (Jurafsky & Martin, Figure 2.10; diagram: states q0–q4 linked by transitions b, a, a, a loop on q3, and !) • Loop on q3 means that it can account for infinite-length strings • “Deterministic” because in any state, its behaviour is fully predictable
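The deterministic FSA can be simulated directly from a transition table (a sketch; state names follow the figure):

```python
# Deterministic FSA for strings of the form baa...a!
TRANS = {
    ('q0', 'b'): 'q1',
    ('q1', 'a'): 'q2',
    ('q2', 'a'): 'q3',
    ('q3', 'a'): 'q3',   # the loop on q3: any number of further a's
    ('q3', '!'): 'q4',   # q4 is the legal exit point
}

def accepts(s: str) -> bool:
    state = 'q0'
    for ch in s:
        state = TRANS.get((state, ch))
        if state is None:        # no legal transition: reject
            return False
    return state == 'q4'

print(accepts('baa!'), accepts('baaaa!'), accepts('ba!'))   # True True False
```

Because the table has at most one entry per (state, input) pair, the machine never needs to backtrack: this is what "deterministic" means.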
Non-deterministic FSA (Jurafsky & Martin, Figure 2.18; diagram: states q0–q4 over b, a, a, !, plus an ε arc) • At state q2 with input “a” there is a choice of transitions • We can also have “jump” arcs (ε, or empty transitions), which also introduce non-determinism
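Non-determinism can be handled without backtracking by tracking the set of all reachable states in parallel (a sketch following the figure; ε jump arcs would additionally need an epsilon-closure step, not modelled here):

```python
# Non-deterministic FSA: at q2, input 'a' may stay at q2 or move on to q3.
NFA_TRANS = {
    ('q0', 'b'): {'q1'},
    ('q1', 'a'): {'q2'},
    ('q2', 'a'): {'q2', 'q3'},   # the choice point
    ('q3', '!'): {'q4'},
}

def nfa_accepts(s: str) -> bool:
    states = {'q0'}
    for ch in s:
        # Follow every alternative at once.
        states = set().union(*(NFA_TRANS.get((q, ch), set()) for q in states))
    return 'q4' in states

print(nfa_accepts('baa!'), nfa_accepts('baaa!'), nfa_accepts('ba!'))
```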
An FSA to handle morphology (diagram: states q0–q7 with letter transitions over c, e, x, f, o, s, i, r, y) • Spot the deliberate mistake: overgeneration
Finite State Transducers • A “transducer” defines a relationship (a mapping) between two things • Typically used for “two-level morphology”, but can be used for other things • Like an FSA, but each state transition stipulates a pair of symbols, and thus a mapping
Finite State Transducers • Three functions: • Recognizer (verification): takes a pair of strings and verifies whether the FST can map them onto each other • Generator (synthesis): can generate a legal pair of strings • Translator (transduction): given one string, can generate the corresponding string • Mapping usually between levels of representation • spy+s : spies • Lexical:intermediate fox N P s : fox^s • Intermediate:surface fox^s : foxes
Some conventions • Transitions are marked by “:” • A non-changing transition “x:x” can be shown simply as “x” • Wild-cards are shown as “@” • Empty string shown as “ε”
An example (based on Trost p.42)
• #spy+s# : spies — #:ε s p y:i +:e s #:ε
• #toy+s# : toys — #:ε t o y +:ε s #:ε
• #shelf+s# : shelves — #:ε s h e l f:v +:e s #:ε
• #wife+s# : wives — #:ε w i f:v e +:ε s #:ε
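A transducer path is just a sequence of symbol pairs, and the same path serves both directions. A minimal sketch of the #spy+s# : spies path (using '' for ε):

```python
# The #spy+s# : spies correspondence as an explicit list of
# (lexical, surface) symbol pairs.
PAIRS = [('#', ''), ('s', 's'), ('p', 'p'), ('y', 'i'),
         ('+', 'e'), ('s', 's'), ('#', '')]

def lower(pairs):
    """Synthesis: read the lexical side, emit the surface side."""
    return ''.join(surface for _, surface in pairs)

def upper(pairs):
    """Analysis: read the surface side, emit the lexical side."""
    return ''.join(lexical for lexical, _ in pairs)

print(upper(PAIRS), ':', lower(PAIRS))   # #spy+s# : spies
```

The point is the symmetry: nothing in the pair list privileges one direction over the other.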
Using wild cards and loops • The two FSTs #:ε s p y:i +:e s #:ε and #:ε t o y +:ε s #:ε can be collapsed into a single FST: #:ε, a loop over the wild card @, then either y:i +:e s or y +:ε s, then #:ε
Another example (J&M Fig. 3.9, p.74): lexical:intermediate
(Diagram: regular stems f o x, c a t, d o g run from q0 to q1, then N:ε to q4, then S:# or P:^ s # to the final state q7; irregular singulars g o o s e, s h e e p, m o u s e run to q2, then N:ε and S:#; irregular plurals g o:e o:e s e, s h e e p, m o:i u:ε s:c e run to q3, then N:ε and P:#.)
(Diagram: each stem arc between q0 and q1 is expanded letter by letter through intermediate states — e.g. f o x through s1, s2; c a t through s3, s4; d o g through s5, s6.)
Traces through the FST:
• [0] f:f o:o x:x [1] N:ε [4] P:^ s:s #:# [7] — f o x N P s # : f o x ^ s #
• [0] f:f o:o x:x [1] N:ε [4] S:# [7] — f o x N S : f o x #
• [0] c:c a:a t:t [1] N:ε [4] P:^ s:s #:# [7] — c a t N P s # : c a t ^ s #
• [0] s:s h:h e:e p:p [2] N:ε [5] S:# [7] — s h e e p N S : s h e e p #
• [0] g:g o:e o:e s:s e:e [3] N:ε [5] P:# [7] — g o o s e N P : g e e s e #
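The lexical:intermediate mappings traced above can be sketched procedurally (a simplification: the irregular-plural correspondences, which the FST encodes on its arcs, are stored here in a hypothetical lookup table):

```python
# Hypothetical lookup standing in for the irregular arcs of the FST.
IRREGULAR_PLURALS = {'goose': 'geese', 'sheep': 'sheep', 'mouse': 'mice'}

def lexical_to_intermediate(stem: str, number: str) -> str:
    """Map e.g. ('fox', 'P') -> 'fox^s#' and ('fox', 'S') -> 'fox#'."""
    if number == 'P':
        if stem in IRREGULAR_PLURALS:
            return IRREGULAR_PLURALS[stem] + '#'   # irregular path: P:#
        return stem + '^s#'                        # regular path: P:^ s #
    return stem + '#'                              # singular: S:#

print(lexical_to_intermediate('fox', 'P'))    # fox^s#
print(lexical_to_intermediate('goose', 'P'))  # geese#
```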
Lexical:surface mapping (J&M Fig. 3.14, p.78)
(Diagram: states q0–q5; after z, s or x, the boundary ^:ε may be followed by ε:e before s; elsewhere ^:ε is simply deleted. This implements the e-insertion rule ε → e / {x, s, z} ^ __ s#.)
• f o x N P s # : f o x ^ s #
• c a t N P s # : c a t ^ s #
• [0] f:f [0] o:o [0] x:x [1] ^:ε [2] ε:e [3] s:s [4] #:# [0] — f o x ^ s # : f o x e s #
• [0] c:c [0] a:a [0] t:t [0] ^:ε [0] s:s [0] #:# [0] — c a t ^ s # : c a t s #
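The intermediate:surface step can be sketched as a rewrite over the intermediate tape (a sketch, not the FST itself: the e-insertion rule as a regex, followed by deletion of the boundary symbols):

```python
import re

def intermediate_to_surface(s: str) -> str:
    """Apply e-insertion (eps -> e / {x,s,z} ^ __ s #), then drop ^ and #."""
    s = re.sub(r'([xsz])\^(s#)', r'\1e\2', s)   # fox^s# -> foxes#
    return s.replace('^', '').replace('#', '')

print(intermediate_to_surface('fox^s#'))   # foxes
print(intermediate_to_surface('cat^s#'))   # cats
```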
FST • But you don’t have to draw all these FSTs • They map neatly onto rule formalisms • What is more, they can be generated automatically • Hence a slightly different formalism
FST compiler http://www.xrce.xerox.com/competencies/content-analysis/fsCompiler/fsinput.html
Input: [d o g N P .x. d o g s] | [c a t N P .x. c a t s] | [f o x N P .x. f o x e s] | [g o o s e N P .x. g e e s e]
Compiled network:
s0: c -> s1, d -> s2, f -> s3, g -> s4.
s1: a -> s5.
s2: o -> s6.
s3: o -> s7.
s4: <o:e> -> s8.
s5: t -> s9.
s6: g -> s9.
s7: x -> s10.
s8: <o:e> -> s11.
s9: <N:s> -> s12.
s10: <N:e> -> s13.
s11: s -> s14.
s12: <P:0> -> fs15.
s13: <P:s> -> fs15.
s14: e -> s16.
fs15: (no arcs)
s16: <N:0> -> s12.
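The compiled network above can be run by a tiny interpreter (a sketch; it assumes each lexical symbol is a single character, so the input for dog+N+P is written 'dogNP'). Arcs like "a" mean a:a, <o:e> maps o to e, and <P:0> maps P to the empty string:

```python
# (state, input symbol) -> (next state, output string)
ARCS = {
    ('s0', 'c'): ('s1', 'c'), ('s0', 'd'): ('s2', 'd'),
    ('s0', 'f'): ('s3', 'f'), ('s0', 'g'): ('s4', 'g'),
    ('s1', 'a'): ('s5', 'a'),
    ('s2', 'o'): ('s6', 'o'),
    ('s3', 'o'): ('s7', 'o'),
    ('s4', 'o'): ('s8', 'e'),     # <o:e>
    ('s5', 't'): ('s9', 't'),
    ('s6', 'g'): ('s9', 'g'),
    ('s7', 'x'): ('s10', 'x'),
    ('s8', 'o'): ('s11', 'e'),    # <o:e>
    ('s9', 'N'): ('s12', 's'),    # <N:s>
    ('s10', 'N'): ('s13', 'e'),   # <N:e>
    ('s11', 's'): ('s14', 's'),
    ('s12', 'P'): ('fs15', ''),   # <P:0>
    ('s13', 'P'): ('fs15', 's'),  # <P:s>
    ('s14', 'e'): ('s16', 'e'),
    ('s16', 'N'): ('s12', ''),    # <N:0>
}
FINAL = {'fs15'}

def transduce(lexical: str) -> str:
    """Map a lexical string like 'dogNP' to its surface form."""
    state, out = 's0', []
    for sym in lexical:
        state, piece = ARCS[(state, sym)]
        out.append(piece)
    assert state in FINAL, f'rejected in state {state}'
    return ''.join(out)

for word in ('dogNP', 'catNP', 'foxNP', 'gooseNP'):
    print(word, '->', transduce(word))
# dogNP -> dogs, catNP -> cats, foxNP -> foxes, gooseNP -> geese
```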