Morphology
See Harald Trost, "Morphology", Chapter 2 of R. Mitkov (ed.) The Oxford Handbook of Computational Linguistics, Oxford (2004): OUP
D. Jurafsky & J.H. Martin, Speech and Language Processing, Upper Saddle River NJ (2000): Prentice Hall, Chapter 3 [quite technical]
Morphology - reminder
• Internal analysis of word forms
• morpheme – allomorphic variation
• Words usually consist of a root plus affix(es), though some words can have multiple roots, and some can be single morphemes
• lexeme – abstract notion of a group of word forms that 'belong' together
• lexeme ~ root ~ stem ~ base form ~ dictionary (citation) form
Role of morphology
• Commonly made distinction: inflectional vs derivational
• Inflectional morphology is grammatical
  • number, tense, case, gender
• Derivational morphology concerns word building
  • part-of-speech derivation
  • words with related meaning
Inflectional morphology
• Grammatical in nature
• Does not carry meaning, other than grammatical meaning
• Highly systematic, though there may be irregularities and exceptions
• Simplifies the lexicon: only exceptions need to be listed
• Unknown words may be guessable
• Language-specific and sometimes idiosyncratic
• (Mostly) helpful in parsing
Derivational morphology
• Lexical in nature
• Can carry meaning
• Fairly systematic, and predictable up to a point
• Simplifies description of the lexicon: regularly derived words need not be listed
• Unknown words may be guessable
• But …
  • Apparent derivations may have specialised meaning
  • Some derivations are missing
  • Languages often have parallel derivations which may be translatable
Morphological processes
• Affixes: prefix, suffix, infix, circumfix
• Vowel change (umlaut, ablaut)
• Gemination, (partial) reduplication
• Root and pattern
• Stress (or tone) change
• Sandhi
Morphophonemics
• Morphemes and allomorphs
  • eg {plur}: +(e)s, vowel change, y→ies, f→ves, um→a, ...
• Morphophonemic variation
  • Affixes and stems may have variants which are conditioned by context
  • eg +ing in lifting, swimming, boxing, raining, hoping, hopping
• Rules may be generalisable across morphemes
  • eg +(e)s in cats, boxes, tomatoes, matches, dishes, buses
  • Applies to both {plur} (nouns) and {3rd sing pres} (verbs)
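The +(e)s generalisation above can be sketched as a small rule function. This is a simplified illustration of the regular spelling rules only, not a full treatment of English plurals (it will wrongly apply f→ves to words like "roofs"):

```python
import re

def pluralize(noun):
    """Apply the regular English {plur} spelling rules (simplified sketch)."""
    if re.search(r"(s|x|z|ch|sh)$", noun):
        return noun + "es"                    # boxes, matches, dishes, buses
    if re.search(r"[^aeiou]y$", noun):
        return noun[:-1] + "ies"              # spy -> spies (but toy -> toys)
    if re.search(r"fe?$", noun):
        return re.sub(r"fe?$", "ves", noun)   # shelf -> shelves, wife -> wives
    return noun + "s"                         # cats, dogs

print([pluralize(w) for w in ["cat", "box", "match", "spy", "toy", "wife"]])
# -> ['cats', 'boxes', 'matches', 'spies', 'toys', 'wives']
```

Note that the same suffix-selection logic serves {3rd sing pres} on verbs, which is exactly the cross-morpheme generalisation mentioned above.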
Morphology in NLP
• Analysis vs synthesis
  • what does dogs mean? vs what is the plural of dog?
• Analysis
  • Need to identify the lexeme
    • Tokenization
    • To access lexical information
  • Inflections (etc) carry information that will be needed by other processes (eg agreement is useful in parsing; inflections can carry meaning, eg tense, number)
  • Morphology can be ambiguous
    • May need another process to disambiguate (eg German –en)
• Synthesis
  • Need to generate appropriate inflections from the underlying representation
Morphology in NLP
• String-handling programs can be written
• More general approach:
  • a formalism to write rules which express the correspondence between surface and underlying form (eg dogs = dog + {plur})
  • a computational algorithm (program) which can apply those rules to actual instances
• Especially of interest if the rules (though not the program) are independent of direction: analysis or synthesis
Role of lexicon in morphology
• Rules interact with the lexicon
  • Obviously category information
    • eg rules that apply to nouns
  • Note also morphology-related subcategories
    • eg "er" verbs in French, rules for gender agreement
• Other lexical information can impact on morphology
  • eg all fish have two forms of the plural (+s and ∅)
  • in Slavic languages case inflections differ for animate and inanimate nouns
Problems with rules
• Exceptions have to be covered
  • Including systematic irregularities
  • May be a trade-off between treating something as a small group of irregularities or as a list of unrelated exceptions (eg French irregular verbs, English f→ves)
• Rules must not over- or under-generate
  • Must cover all and only the correct cases
  • May depend on the order in which the rules are applied
Tokenization
• The simplest form of analysis is to reduce different word forms to tokens
  • Also called "normalization"
• For example, if you want to count how many times a given 'word' occurs in a text
• Or you want to search for texts containing certain 'words' (e.g. Google)
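The word-counting use case above can be sketched in a few lines. This minimal normalization just lowercases and splits on non-letters; it deliberately does no morphology, which is why inflected forms still count separately:

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split on non-letters: the simplest normalization."""
    return re.findall(r"[a-z]+", text.lower())

counts = Counter(tokenize("The dog chased the dogs. The dogs ran."))
print(counts["the"])   # 3: 'The' and 'the' are now the same token
print(counts["dogs"])  # 2: but 'dog' and 'dogs' remain distinct tokens
```

The residual dog/dogs split is exactly what stemming (next slides) addresses.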
Morphological processing
• Stemming
• String-handling approaches
  • Regular expressions
  • Mapping onto finite-state automata
• 2-level morphology
  • Mapping between surface form and lexical representation
Stemming
• Stemming is the particular case of tokenization which reduces inflected forms to a single base form or stem
  • (Recall our discussion of stem ~ base form ~ dictionary form ~ citation form)
• Stemming algorithms are basic string-handling algorithms, which depend on rules that identify affixes which can be stripped
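A suffix-stripping stemmer of this kind can be sketched as an ordered rule list. The rules below are hypothetical illustrations in the spirit of Porter-style stemmers, not the actual Porter algorithm, and they handle no exceptions:

```python
# Ordered (suffix, replacement) rules; longer suffixes are tried first.
SUFFIX_RULES = [
    ("sses", "ss"),  # classes -> class
    ("ies", "y"),    # spies -> spy
    ("ing", ""),     # lifting -> lift
    ("ed", ""),      # walked -> walk
    ("s", ""),       # dogs -> dog
]

def stem(word):
    """Strip the first matching suffix, leaving a minimal stem behind."""
    for suffix, repl in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)] + repl
    return word

print([stem(w) for w in ["dogs", "spies", "lifting", "classes"]])
# -> ['dog', 'spy', 'lift', 'class']
```

Note how quickly exceptions bite: stem("swimming") yields "swimm", because stripping +ing also needs the morphophonemic gemination rule discussed earlier.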
Finite state automata
• A finite state automaton is a simple and intuitive formalism with straightforward computational properties (so easy to implement)
• A bit like a flow chart, but can be used for both recognition (analysis) and generation
• FSAs have a close relationship with "regular expressions", a formalism for expressing strings, mainly used for searching texts or stipulating patterns of strings
Finite state automata
• A bit like a flow chart, but can be used for both recognition and generation
• "Transition network"
  • Unique start point
  • Series of states linked by transitions
  • Transitions represent input to be accounted for, or output to be generated
  • Legal exit-point(s) explicitly identified
Example (Jurafsky & Martin, Figure 2.10)
[FSA diagram: states q0 → q1 → q2 → q3 → q4 on the transitions b, a, a, !, with an a-loop on q3]
• The loop on q3 means that it can account for strings of unbounded length
• "Deterministic" because in any state its behaviour is fully predictable
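A deterministic FSA is essentially just a transition table. The sketch below assumes the sheep-language /baa+!/ of Jurafsky & Martin's example (states q0–q4, a-loop on q3, q4 accepting):

```python
# Deterministic FSA for the sheep language baa+! .
TRANSITIONS = {
    ("q0", "b"): "q1",
    ("q1", "a"): "q2",
    ("q2", "a"): "q3",
    ("q3", "a"): "q3",  # loop: accounts for strings of unbounded length
    ("q3", "!"): "q4",  # q4 is the only accepting state
}

def accepts(string, start="q0", final=frozenset({"q4"})):
    """Run the input through the table; reject on any missing transition."""
    state = start
    for symbol in string:
        state = TRANSITIONS.get((state, symbol))
        if state is None:
            return False
    return state in final

print(accepts("baa!"), accepts("baaaa!"), accepts("ba!"))  # True True False
```

Because each (state, symbol) pair has at most one successor, no search is needed; a non-deterministic FSA (next slide) would require backtracking or a set-of-states simulation.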
Non-deterministic FSA (Jurafsky & Martin, Figure 2.18)
[FSA diagram: states q0–q4 on b, a, a, !, with a second a-transition at q2 and an ε jump arc]
• At state q2 with input "a" there is a choice of transitions
• We can also have "jump" arcs (or empty transitions), which also introduce non-determinism
An FSA to handle morphology
[FSA diagram: states q0–q7 with single-letter transitions spelling out stems and their plural endings; not fully recoverable from the extracted text]
• Spot the deliberate mistake: overgeneration
Finite State Transducers
• A "transducer" defines a relationship (a mapping) between two things
• Typically used for "two-level morphology", but can be used for other things
• Like an FSA, but each state transition stipulates a pair of symbols, and thus a mapping
Finite State Transducers
• Three functions:
  • Recognizer (verification): takes a pair of strings and verifies whether the FST is able to map them onto each other
  • Generator (synthesis): can generate a legal pair of strings
  • Translator (transduction): given one string, can generate the corresponding string
• Mapping is usually between levels of representation
  • spy+s : spies
  • Lexical:intermediate — fox N P s : fox^s
  • Intermediate:surface — fox^s : foxes
Some conventions
• Transitions are marked by ":"
• A non-changing transition "x:x" can be shown simply as "x"
• Wild-cards are shown as "@"
• The empty string is shown as "ε"
An example (based on Trost p.42)
#spy+s# : spies — #:ε s p y:i +:e s #:ε
#toy+s# : toys — #:ε t o y +:ε s #:ε
#shelf+s# : shelves — #:ε s h e l f:v +:e s #:ε
#wife+s# : wives — #:ε w i f:v e +:ε s #:ε
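Mappings like #spy+s# : spies can be implemented directly as arcs that pair a lexical symbol with a surface symbol. The arc set below is a toy sketch covering just the spy and toy paths from the example (state numbers are my own, not from the slide):

```python
# A finite-state transducer as (state, lexical, surface, next-state) arcs.
EPS = ""  # the empty string, written ε on the slides

ARCS = [
    (0, "#", EPS, 1),                       # word boundary maps to nothing
    (1, "s", "s", 2), (1, "t", "t", 10),
    (2, "p", "p", 3),
    (3, "y", "i", 4),                       # y:i before the plural's +:e
    (4, "+", "e", 5),
    (5, "s", "s", 6),
    (6, "#", EPS, 7),                       # state 7 is final
    (10, "o", "o", 11),
    (11, "y", "y", 12),                     # toy keeps its y
    (12, "+", EPS, 13),                     # morpheme boundary disappears
    (13, "s", "s", 6),
]

def transduce(lexical, state=0, final=7):
    """Depth-first search for an arc path consuming the lexical string."""
    if not lexical:
        return "" if state == final else None
    for src, lex, surf, dst in ARCS:
        if src == state and lex == lexical[0]:
            rest = transduce(lexical[1:], dst)
            if rest is not None:
                return surf + rest
    return None  # recognition fails: no legal path

print(transduce("#spy+s#"))  # -> spies
print(transduce("#toy+s#"))  # -> toys
```

Run in this direction the FST is a translator; checking a given pair, or enumerating legal pairs, would use the same arcs, which is the direction-independence noted earlier.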
Using wild cards and loops
#:ε s p y:i +:e s #:ε
#:ε t o y +:ε s #:ε
Can be collapsed into a single FST:
[FST diagram: #:ε, then a @ loop over the stem, then either y:i +:e s or y +:ε s, then #:ε]
Another example (J&M Fig. 3.9, p.74): lexical:intermediate
[FST diagram, states q0–q7: regular nouns fox, cat, dog take N:ε then P:^ s # or S:#; sheep takes S:# or P:#; goose/geese via g o:e o:e s e; mouse/mice via m o:i u:ε s:c e]
Paths through the FST and the mappings they define:
[0] f:f o:o x:x [1] N:ε [4] P:^ s:s #:# [7]  →  f o x N P s # : f o x ^ s #
[0] f:f o:o x:x [1] N:ε [4] S:# [7]  →  f o x N S : f o x #
[0] c:c a:a t:t [1] N:ε [4] P:^ s:s #:# [7]  →  c a t N P s # : c a t ^ s #
[0] s:s h:h e:e p:p [2] N:ε [5] S:# [7]  →  s h e e p N S : s h e e p #
[0] g:g o:e o:e s:s e:e [3] N:ε [5] P:# [7]  →  g o o s e N P : g e e s e #
Lexical:surface mapping (J&M Fig. 3.14, p.78)
[FST diagram, states q0–q5: ^:ε elsewhere; after z, s, x the sequence ^:ε ε:e s implements e-insertion]
Rule: ε → e / {x s z} ^ __ s #
[0] f:f [0] o:o [0] x:x [1] ^:ε [2] ε:e [3] s:s [4] #:# [0]  →  f o x ^ s # : f o x e s #
[0] c:c [0] a:a [0] t:t [0] ^:ε [0] s:s [0] #:# [0]  →  c a t ^ s # : c a t s #
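The e-insertion rule ε → e / {x s z} ^ __ s # can also be stated as a direct string rewrite on the intermediate form. This sketch implements just that one rule plus boundary deletion, not the full two-level system:

```python
import re

def e_insertion(intermediate):
    """Map an intermediate form like fox^s# to its surface form.

    Insert e between a stem-final x/s/z and the s of the plural,
    then delete the morpheme boundary ^ and the word boundary #.
    """
    s = re.sub(r"([xsz])\^s#", r"\1es#", intermediate)  # fox^s# -> foxes#
    s = s.replace("^", "")                              # cat^s# -> cats#
    return s.rstrip("#")

print(e_insertion("fox^s#"))  # -> foxes
print(e_insertion("cat^s#"))  # -> cats
```

Unlike the FST, this rewrite only runs in the analysis-to-surface direction; the attraction of the transducer formalism is that the same rule serves both directions.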
FST
• But you don't have to draw all these FSTs
• They map neatly onto rule formalisms
• What is more, these can be generated automatically
• Therefore, a slightly different formalism is used
FST compiler
http://www.xrce.xerox.com/competencies/content-analysis/fsCompiler/fsinput.html
Input:
[d o g N P .x. d o g s ] | [c a t N P .x. c a t s ] | [f o x N P .x. f o x e s ] | [g o o s e N P .x. g e e s e]
Compiled output:
s0: c -> s1, d -> s2, f -> s3, g -> s4.
s1: a -> s5.
s2: o -> s6.
s3: o -> s7.
s4: <o:e> -> s8.
s5: t -> s9.
s6: g -> s9.
s7: x -> s10.
s8: <o:e> -> s11.
s9: <N:s> -> s12.
s10: <N:e> -> s13.
s11: s -> s14.
s12: <P:0> -> fs15.
s13: <P:s> -> fs15.
s14: e -> s16.
fs15: (no arcs)
s16: <N:0> -> s12.