
Morphology-2

Sudeshna Sarkar, Professor, Computer Science & Engineering Department, Indian Institute of Technology Kharagpur.



  1. Morphology-2 Sudeshna Sarkar Professor Computer Science & Engineering Department Indian Institute of Technology Kharagpur

  2. Morphology in NLP • Analysis vs synthesis • What does "dogs" mean? vs what is the plural of "dog"? • Analysis • Need to identify the lexeme • Tokenization • To access lexical information • Inflections (etc.) carry information that other processes will need (e.g. agreement is useful in parsing; inflections can carry meaning such as tense and number) • Morphology can be ambiguous • May need another process to disambiguate (e.g. German -en) • Synthesis • Need to generate appropriate inflections from the underlying representation

  3. Morphological processing • Stemming • String-handling approaches • Regular expressions • Mapping onto finite-state automata • 2-level morphology • Mapping between surface form and lexical representation

  4. Stemming • Stemming is the particular case of tokenization which reduces inflected forms to a single base form or stem • Stemming algorithms are basic string-handling algorithms, which depend on rules which identify affixes that can be stripped
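
A minimal sketch of the rule-based affix stripping described above. The suffix list and the minimum stem length are illustrative assumptions, not a published rule set. Because no lexicon is consulted, the stripper happily produces non-roots.

```python
# Hypothetical suffix rules; a real stemmer uses a much larger, ordered rule set.
SUFFIXES = ["ing", "ed", "es", "er", "s"]
MIN_STEM = 3  # assumed lower bound on the length of what remains after stripping


def strip_suffix(word: str) -> str:
    """Strip the longest matching suffix, provided enough of the word remains."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM:
            return word[: -len(suffix)]
    return word


for w in ["dogs", "foxes", "walking", "duckling", "beer"]:
    print(w, "->", strip_suffix(w))
```

Here "duckling" comes out as "duckl", which is a string but not a root, while "beer" is left alone only because of the length guard.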

  5. Surface and Lexical Forms • The surface level of a word represents the actual spelling of that word. • geliyorum eats cats kitabım • The lexical level of a word represents a simple concatenation of morphemes making up that word. • gel +PROG +1SG • eat +AOR • cat +PLU • kitap +P1SG • Morphological processors try to find correspondences between lexical and surface forms of words. • Morphological recognition/ analysis – surface to lexical • Morphological generation/ synthesis – lexical to surface

  6. Morphological Parsing • Morphological parsing finds the lexical form of a word from its surface form. • cats → cat +N +PLU • cat → cat +N +SG • goose → goose +N +SG or goose +V • geese → goose +N +PLU • gooses → goose +V +3SG • catch → catch +V • caught → catch +V +PAST or catch +V +PP • AsachhilAma → AsA +PROG +PAST +1st (I/we was/were coming) • There can be more than one lexical-level representation for a given word (ambiguity): flies → fly +V +3SG or fly +N +PLU; mAtAla kare

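  7. Formal definition of the problem • Surface form: the word ($w_s$) as it occurs in the text, e.g. [sings]: $w_s \in L \subseteq \Sigma^+$ • Lexical form: the root word(s) ($r_1, r_2, \ldots$) and other grammatical features ($F$), e.g. [sing, v, +sg, +3rd]: $w_l \in \Delta^+$, where $w_l$ consists of one or more roots over $\Sigma$ followed by one or more features from $F$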

  8. Analysis & Synthesis • Morphological Analysis: maps a string from surface form to the corresponding lexical form. $f_{MA}: \Sigma^+ \to \Delta^+$ • Morphological Synthesis: maps a string from lexical form to surface form. $f_{MS}: \Delta^+ \to \Sigma^+$
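
As a sketch of the two mappings, the code below stores a handful of (surface, lexical) pairs taken from the parsing examples and reads the same relation in both directions. The names `PAIRS`, `f_ma` and `f_ms` are illustrative. Because a surface form can be ambiguous, both functions return lists.

```python
from collections import defaultdict

# (surface form, lexical form) pairs; a tiny hand-written sample.
PAIRS = [
    ("cats", "cat +N +PLU"),
    ("cat", "cat +N +SG"),
    ("geese", "goose +N +PLU"),
    ("caught", "catch +V +PAST"),
    ("caught", "catch +V +PP"),
]

analysis = defaultdict(list)   # surface -> lexical forms
synthesis = defaultdict(list)  # lexical -> surface forms
for surface, lexical in PAIRS:
    analysis[surface].append(lexical)
    synthesis[lexical].append(surface)


def f_ma(surface: str) -> list:
    """Morphological analysis: surface form to lexical form(s)."""
    return analysis.get(surface, [])


def f_ms(lexical: str) -> list:
    """Morphological synthesis: lexical form to surface form(s)."""
    return synthesis.get(lexical, [])


print(f_ma("caught"))
print(f_ms("goose +N +PLU"))
```

A lookup table like this cannot cover unseen words, which is the motivation for the rule-based and finite-state machinery that follows.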

  9. Fly + s → flys → flies (y → i rule) • Duckling • Go-getter → get + er • Doer → do + er • Beer → ? • What knowledge do we need? • How do we represent it? • How do we compute with it?

  10. Knowledge needed • Knowledge of stems or roots • Duck is a possible root, not duckl • We need a dictionary (lexicon) • Only some endings go on some words • Do + er – ok • Be + er – not ok • In addition, spelling-change rules that adjust the surface form • Get + er – double the t → getter • Fox + s – insert e → foxes • Fly + s – change y to i and insert e → flies • Chase + ed – drop e → chased
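
The spelling-change rules listed above can be sketched as a small function that joins a stem and a suffix. The exact conditions (for example, doubling only in short consonant-vowel-consonant stems) are simplifying assumptions for illustration and do not cover English fully.

```python
import re

VOWELS = "aeiou"


def attach(stem: str, suffix: str) -> str:
    """Join stem and suffix, applying a few English spelling-change rules."""
    # e-insertion: fox + s -> foxes
    if suffix == "s" and re.search(r"(s|x|z|ch|sh)$", stem):
        return stem + "es"
    # y -> ie after a consonant: fly + s -> flies
    if suffix == "s" and re.search(r"[^aeiou]y$", stem):
        return stem[:-1] + "ies"
    # e-deletion before a vowel-initial suffix: chase + ed -> chased
    if stem.endswith("e") and suffix[0] in VOWELS:
        return stem[:-1] + suffix
    # consonant doubling in short consonant-vowel-consonant stems: get + er -> getter
    if suffix[0] in VOWELS and re.fullmatch(r"[^aeiou]*[aeiou][^aeiouwxy]", stem):
        return stem + stem[-1] + suffix
    return stem + suffix


for stem, suffix in [("get", "er"), ("fox", "s"), ("fly", "s"), ("chase", "ed"), ("do", "er")]:
    print(f"{stem} + {suffix} -> {attach(stem, suffix)}")
```

Note that the function says nothing about which endings are allowed on which stems (Be + er), which still requires the lexicon.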

  11. Put all this in a big dictionary (lexicon)? • Turkish – approx. $600 \times 10^6$ forms • Finnish – $10^7$ • Hindi, Bengali, Telugu, Tamil? • Besides, novel forms can always be constructed • Anti-missile • Anti-anti-missile • Anti-anti-anti-missile • … • Compounding of words – Sanskrit, German

  12. Dictionary • Lemma: lexical unit, “pointer” to lexicon • typically is represented as the “base form”, or “dictionary headword” • possibly indexed when ambiguous/polysemous: • state1 (verb), state2 (state-of-the-art), state3 (government) • from one or more morphemes (“root”, “stem”, “root+derivation”, ...) • Categories: non-lexical • small number of possible values (< 100, often < 5-10)

  13. Morphological Analyzer • Relatively simple for English. • But for many Indian languages, it may be more difficult. • Examples: inflectional and derivational morphology. • Common tools: finite-state transducers • A transducer maps a set/string of symbols to another set/string of symbols

  14. A simpler problem • Linear concatenation of morphemes with possible spelling changes at the boundary and a few irregular cases. • Quite practical assumptions • English, Hindi, Bengali, Telugu, Tamil, French, Turkish … • Exceptions: Semitic languages, Sanskrit

  15. Computational Morphology • Approaches • Lexicon only • Rules only • Lexicon and Rules • Finite-state Automata • Finite-state Transducers

  16. Computational Morphology • Systems • WordNet’s morphy • PCKimmo • Named after Kimmo Koskenniemi, much work done by Lauri Karttunen, Ron Kaplan, and Martin Kay • Accurate but complex • http://www.sil.org/pckimmo/ • Two-level morphology • Commercial version available from InXight Corp. • Background • Chapter 3 of Jurafsky and Martin • A short history of Two-Level Morphology • http://www.ling.helsinki.fi/~koskenni/esslli-2001-karttunen/

  17. Morphological Analyser • To build a morphological analyser we need: • lexicon: the list of stems and affixes, together with basic information about them • morphotactics: the model of morpheme ordering (e.g. the English plural morpheme follows a noun rather than a verb) • orthographic rules: spelling rules that model the changes that occur in a word, usually when two morphemes combine (e.g. fly+s = flies)

  18. Finite State Machines • FSAs are equivalent to regular languages • FSTs are equivalent to regular relations (over pairs of regular languages) • FSTs are like FSAs but with complex labels. • We can use FSTs to transduce between surface and lexical levels.

  19. Can FSAs help? [Figure: FSA with states Q0, Q1, Q2 and arcs labelled reg-noun, plural (-s), irreg-pl-noun, irreg-sg-noun]

  20. What’s this for? [Figure: FSA with states Q0, Q1, Q2, Q3 and arcs labelled un-, ε, adj-root, -er, -est, -ly] • As a regular expression: un? ADJ-ROOT (er | est | ly)?
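
The adjective pattern can be tried out directly with Python's `re` module. The root list below is a hypothetical stand-in for the ADJ-ROOT class; this is only a sketch of the morphotactics and applies no spelling rules.

```python
import re

# Hypothetical adjective roots standing in for the ADJ-ROOT class.
ADJ_ROOTS = ["clear", "cool", "real"]

# un? ADJ-ROOT (er | est | ly)?
PATTERN = re.compile(r"(un)?(" + "|".join(ADJ_ROOTS) + r")(er|est|ly)?")


def is_adjective_form(word: str) -> bool:
    """True if the whole word matches the optional-prefix, root, optional-suffix pattern."""
    return PATTERN.fullmatch(word) is not None


for w in ["unclear", "coolest", "really", "clearun", "uncoolly"]:
    print(w, is_adjective_form(w))
```

The pattern overgenerates: it accepts any combination such as "uncoolly" regardless of whether the form is actually used, because every root is treated alike.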

  21. Morphotactics • The last two examples model parts of English morphotactics • But where is the information about regular and irregular roots? In the LEXICON • Can we include the lexicon in the FSA?

  22. The English Pluralization FSA [Figure: FSA with states Q0, Q1, Q2 and arcs labelled reg-noun, plural (-s), irreg-pl-noun, irreg-sg-noun]

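  23. After adding a mini-lexicon [Figure: the pluralization FSA over states Q0, Q1, Q2, with the word-class arcs expanded into letter-by-letter arcs for a few nouns such as dog and man/men, plus the plural s]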
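
A sketch of what expanding the word classes into letters amounts to: the FSA below is built as a trie over a hypothetical mini-lexicon, with a plural "s" arc added only after regular nouns. The word lists and function names are assumptions for illustration.

```python
def build_plural_fsa(reg_nouns, irreg_sg, irreg_pl):
    """Build a letter-level FSA: returns (transitions, accepting states)."""
    transitions = {}  # (state, letter) -> next state
    accepting = set()
    counter = [0]     # last state id handed out; state 0 is the start state

    def add_word(word):
        state = 0
        for ch in word:
            if (state, ch) not in transitions:
                counter[0] += 1
                transitions[(state, ch)] = counter[0]
            state = transitions[(state, ch)]
        return state

    for noun in reg_nouns:
        accepting.add(add_word(noun))        # singular
        accepting.add(add_word(noun + "s"))  # regular plural
    for noun in list(irreg_sg) + list(irreg_pl):
        accepting.add(add_word(noun))        # irregular forms take no plural s
    return transitions, accepting


def accepts(fsa, word):
    """Run the FSA over the word, one letter per step."""
    transitions, accepting = fsa
    state = 0
    for ch in word:
        state = transitions.get((state, ch))
        if state is None:
            return False
    return state in accepting


fsa = build_plural_fsa(["dog", "bug"], ["man"], ["men"])
for w in ["dog", "dogs", "men", "mans", "do"]:
    print(w, accepts(fsa, w))
```

Recognition takes one dictionary lookup per input letter, so the running time is linear in the length of the word.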

  24. Elegance & Power • FSAs are elegant because • NFA ≡ DFA • Closed under union, intersection, concatenation, complementation • Traversal is always linear in the input size • Well-known algorithms for minimization, determinization, compilation, etc. • They are powerful because they can capture • Linear morphology • Irregularities

  25. But… FSAs are language recognizers/generators. We need transducers to build Morphological Analyzers ($f_{MA}$) and Morphological Synthesizers ($f_{MS}$)

  26. Finite State Transducers [Figure: a finite-state machine mapping between the surface form and the lexical form]

  27. Formal Definition • A 6-tuple $\{\Sigma, \Delta, Q, \delta, q_0, F\}$ • $\Sigma$ is the (finite) set of input symbols • $\Delta$ is the (finite) set of output symbols • $Q$ is the (finite) set of states • $\delta$ is the transition function $Q \times \Sigma \to Q \times \Delta$ • $q_0 \in Q$ is the start state • $F \subseteq Q$ is the set of accepting states
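
The definition translates almost literally into a small data structure. In this sketch the alphabets are left implicit in the keys and values of `delta`, and the state names and the toy machine (which rewrites "e" as "a" so that "men" comes out as "man") are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class FST:
    delta: dict      # (state, input symbol) -> (next state, output symbol)
    start: str       # q0
    accepting: set   # F

    def transduce(self, symbols):
        """Return the output string, or None if the input is rejected."""
        state, output = self.start, []
        for sym in symbols:
            if (state, sym) not in self.delta:
                return None
            state, out = self.delta[(state, sym)]
            output.append(out)
        return "".join(output) if state in self.accepting else None


# Toy machine: copies m, a, n and rewrites e as a.
toy = FST(
    delta={
        ("q0", "m"): ("q1", "m"),
        ("q1", "a"): ("q2", "a"),
        ("q1", "e"): ("q2", "a"),
        ("q2", "n"): ("q3", "n"),
    },
    start="q0",
    accepting={"q3"},
)

print(toy.transduce("men"))
print(toy.transduce("man"))
print(toy.transduce("mn"))
```

Because `delta` is a function of state and input symbol, this machine is deterministic on its input side; outputs of ε would need an extra convention such as the empty string.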

  28. An example FST [Figure: FST over states Q0, Q1, Q2 whose arcs carry input:output pairs, mostly identity pairs such as d:d, o:o, g:g, m:m, a:a, n:n, together with e:a and s:ε]

  29. The Lexicon FST [Figure: FST over states Q0, Q1, Q2, Q3, Q4 with the same letter arcs plus arcs that emit features: s:+Pl, #:+Sg, #:+Pl]

  30. Ways to look at FSTs • Recognizer of a pair of strings • Generator of a pair of strings • Translator from one regular language to another • Computer of a relation – regular relation

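  31. Invertibility • Given $T = \{\Sigma, \Delta, Q, \delta, q_0, F\}$ • Construct $T^{-1} = \{\Delta, \Sigma, Q, \delta^{-1}, q_0, F\}$ such that if $\delta(q, x) \to (q', y)$ then $\delta^{-1}(q, y) \to (q', x)$, where $x \in \Sigma$ and $y \in \Delta$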
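
Inversion amounts to swapping the input and output label on every arc. The sketch below stores the transducer as a set of arcs rather than a function, because the inverse of a deterministic machine can be nondeterministic; the arc set and state names are illustrative assumptions.

```python
def run(arcs, start, accepting, symbols):
    """Return every output string the transducer can produce for the input."""
    configs = {(start, "")}  # (current state, output so far)
    for sym in symbols:
        configs = {
            (dst, out + o)
            for (state, out) in configs
            for (src, i, o, dst) in arcs
            if src == state and i == sym
        }
    return sorted(out for state, out in configs if state in accepting)


def invert(arcs):
    """Swap input and output labels on every arc (src, in, out, dst)."""
    return {(src, o, i, dst) for (src, i, o, dst) in arcs}


# Toy machine: copies m, a, n and rewrites e as a.
ARCS = {
    ("q0", "m", "m", "q1"),
    ("q1", "a", "a", "q2"),
    ("q1", "e", "a", "q2"),
    ("q2", "n", "n", "q3"),
}

print(run(ARCS, "q0", {"q3"}, "men"))
print(run(invert(ARCS), "q0", {"q3"}, "man"))
```

Run forwards, "men" yields a single output; run through the inverted machine, "man" yields both "man" and "men", since two arcs now share the input label "a".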

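  32. Compositionality • $T_1 = \{\Sigma, X, Q_1, \delta_1, q_1, F_1\}$ and $T_2 = \{X, \Delta, Q_2, \delta_2, q_2, F_2\}$ • Define $T_3 = \{\Sigma, \Delta, Q_3, \delta_3, q_3, F_3\}$ such that $Q_3 = Q_1 \times Q_2$, $q_3 = (q_1, q_2)$, and $\delta_3((q, s), i) = ((q', s'), o)$ if $\exists c$ such that $\delta_1(q, i) = (q', c)$ and $\delta_2(s, c) = (s', o)$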
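
The product construction can be sketched in a few lines for deterministic machines without ε symbols. The two toy transducers are assumptions for illustration, and taking the accepting states of the composed machine to be pairs of accepting states is the usual product-construction choice rather than something stated on the slide.

```python
def compose(delta1, delta2):
    """Build delta3 over paired states: feed T1's output symbol into T2."""
    delta3 = {}
    for (q, i), (q_next, c) in delta1.items():
        for (s, c_in), (s_next, o) in delta2.items():
            if c == c_in:
                delta3[((q, s), i)] = ((q_next, s_next), o)
    return delta3


def transduce(delta, start, accepting, symbols):
    """Run a deterministic transducer; None means the input is rejected."""
    state, output = start, []
    for sym in symbols:
        if (state, sym) not in delta:
            return None
        state, out = delta[(state, sym)]
        output.append(out)
    return "".join(output) if state in accepting else None


# T1 (one state, "p"): copies m, a, n and rewrites e as a.
T1 = {("p", "m"): ("p", "m"), ("p", "a"): ("p", "a"),
      ("p", "e"): ("p", "a"), ("p", "n"): ("p", "n")}
# T2 (one state, "r"): upper-cases m, a, n.
T2 = {("r", ch): ("r", ch.upper()) for ch in "man"}

T3 = compose(T1, T2)
start3 = ("p", "r")
accepting3 = {("p", "r")}  # assumed: F3 = F1 x F2

print(transduce(T3, start3, accepting3, "men"))
```

The composed machine maps "men" to "MAN" in a single pass, without ever materialising the intermediate string "man".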

  33. Modeling Orthographic Rules • Spelling changes at morpheme boundaries • bus+s → buses, watch+s → watches • fly+s → flies • make+ing → making • Rules • E-insertion takes place if the stem ends in s, z, ch, sh, etc. • y maps to ie when the pluralization marker s is added

  34. Incorporating Spelling Rules • Spelling rules, each corresponding to an FST, can be run in parallel provided that they are "aligned". • The set of spelling rules is positioned between the surface level and the intermediate level. • Parallel execution of FSTs can be carried out: • by simulation: in this case the FSTs must first be aligned. • by first constructing a single FST corresponding to their intersection.

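  35. Rewrite Rules • Chomsky and Halle (1968) • General form: $a \to b \;/\; \lambda \,\_\_\, \rho$ (rewrite $a$ as $b$ when it occurs between left context $\lambda$ and right context $\rho$) • E-insertion: $\varepsilon \to e \;/\; \{x, s, z, ch, sh, \ldots\}\,\text{\^{}}\;\_\_\;s\#$ • Kaplan and Kay (1994) showed that FSTs can be compiled from general rewrite rules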
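
As a rough illustration of a context-dependent rewrite rule, the sketch below turns "a becomes b between a left and a right context" into a regular-expression substitution. This is a stand-in built on Python's `re` module, not the rule-to-FST compilation itself; `compile_rule` and `to_surface` are hypothetical helper names, and the intermediate forms use "^" for the morpheme boundary and "#" for the word end.

```python
import re


def compile_rule(a, b, left, right):
    """Return a function applying: a -> b / left __ right (contexts are regex strings)."""
    # The left context is captured and re-emitted; the right context is a lookahead.
    pattern = re.compile(f"({left}){re.escape(a)}(?={right})")
    return lambda s: pattern.sub(lambda m: m.group(1) + b, s)


# E-insertion: empty string -> e / {x, s, z, ch, sh} ^ __ s #
e_insertion = compile_rule("", "e", r"(?:x|s|z|ch|sh)\^", r"s#")


def to_surface(intermediate):
    """Apply the rule, then erase the boundary symbols."""
    return e_insertion(intermediate).replace("^", "").replace("#", "")


for form in ["fox^s#", "watch^s#", "dog^s#"]:
    print(form, "->", e_insertion(form), "->", to_surface(form))
```

Forms that do not match the context, such as "dog^s#", pass through the rule unchanged.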

  36. Two-level Morphology (Koskenniemi, 1983) [Figure: lexical level → LEXICON FST → intermediate level → orthographic-rule FSTs (FST1 … FSTn) → surface level]

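  37. A Single FST for MA and MS [Figure: the LEXICON FST relates the lexical form "b u s +N +Pl" to the intermediate form "b u s ^ s #"; the orthographic-rule FSTs (FST1 … FSTn) relate the intermediate form to the surface form "b u s e s"; combined, they give a single Morphology FST relating "b u s +N +Pl" to "b u s e s"]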

  38. Can we do without the lexicon? • Not really! • But for some applications we might need to know the stem only • Surface form → Stem [Stemming] • The Porter stemming algorithm (1980) is a very popular technique that does not use a lexicon.
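
To give a flavour of lexicon-free stemming, here is a sketch of only the first step (step 1a) of the Porter algorithm, which handles plural endings: SSES → SS, IES → I, SS → SS, S → (nothing). The remaining steps of the algorithm are omitted, and the function name is illustrative.

```python
def porter_step_1a(word: str) -> str:
    """Porter step 1a: apply the first matching plural-suffix rule."""
    if word.endswith("sses"):
        return word[:-2]   # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]   # ponies -> poni
    if word.endswith("ss"):
        return word        # caress -> caress
    if word.endswith("s"):
        return word[:-1]   # cats -> cat
    return word


for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", porter_step_1a(w))
```

The output of a stemmer is a key for grouping related forms, not necessarily a real word: "ponies" becomes "poni".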

  39. Derivational Rules

  40. Lexicon & Morphotactics • Typically the list of word parts (lexicon) and the model of ordering can be combined into an FSA which will recognise all the valid word forms. • For this to be possible the word parts must first be classified into sublexicons. • The FSA defines the morphotactics (ordering constraints).

  41. Sublexicons to classify the list of word parts

  42. Towards the Analyser • We can use lexc or xfst to build such an FSA (see lex1.lexc) • To augment this to produce an analysis we must create a transducer Tnum which maps between the lexical level and an "intermediate" level that is needed to handle the spelling rules of English.

  43. Ambiguity • Recall that in non-deterministic recognition multiple paths through a machine may lead to an accept state. • It didn’t matter which path was actually traversed • In FSTs the path to an accept state does matter, since different paths represent different parses and different outputs will result

  44. Ambiguity • What’s the right parse for • Unionizable • Union-ize-able • Un-ion-ize-able • Each represents a valid path through the derivational morphology machine.

  45. Ambiguity • There are a number of ways to deal with this problem • Simply take the first output found • Find all the possible outputs (all paths) and return them all (without choosing) • Bias the search so that only one or a few likely paths are explored

  46. Generativity • Nothing really privileged about the directions. • We can write from one and read from the other or vice-versa. • One way is generation, the other way is analysis

  47. Multi-Level Tape Machines • We use one machine to transduce between the lexical and the intermediate level, and another to handle the spelling changes to the surface tape

  48. Note • A key feature of this machine is that it doesn’t do anything to inputs to which it doesn’t apply. • Meaning that they are written out unchanged to the output tape. • Turns out the multiple tapes aren’t really needed; they can be compiled away.

  49. Overall Scheme • We now have one FST that has explicit information about the lexicon (actual words, their spelling, facts about word classes and regularity). • Lexical level to intermediate forms • We have a larger set of machines that capture orthographic/spelling rules. • Intermediate forms to surface forms

  50. Other Issues • How to formulate the rewrite rules? • How to ensure coverage? • What to do for unknown roots? • Is it possible to learn morphology of a language in supervised/unsupervised manner? • What about non-linear morphology?
