
Natural Language Processing

Learn how words are formed and classified in English morphology, and how finite-state methods model inflectional and derivational processes: regular and irregular verbs, derivational rules, morphological processing requirements, and applications.


Presentation Transcript


  1. Natural Language Processing Lecture Notes 3

  2. Morphology (Ch 3) • Finite-state methods are particularly useful in dealing with a lexicon. • So we’ll switch to talking about some facts about words and then come back to computational methods

  3. English Morphology • Morphology is the study of the ways that words are built up from smaller meaningful units called morphemes • Two classes of morphemes • Stems: the core meaning-bearing units • Affixes: adhere to stems to change their meanings and grammatical functions

  4. Examples • Insubstantial (in- + substantial), trying (try + -ing), unreadable (un- + read + -able)

  5. English Morphology • We can also divide morphology up into two broad classes • Inflectional • Derivational

  6. Word Classes • Things like nouns and verbs • We’ll go into the gory details when we cover POS tagging • Relevant now: how stems and affixes combine depends on word class of the stem

  7. Inflectional Morphology • Inflectional morphology concerns the combination of stems and affixes where the resulting word • Has the same word class as the original • Serves a grammatical purpose different from the original (agreement, tense) • bird → birds • like → likes or liked

  8. Nouns and Verbs (English) • Nouns are simple • Markers for plural and possessive • Verbs are only slightly more complex • Markers appropriate to the tense of the verb

  9. Regulars and Irregulars • Some words “misbehave” • Mouse/mice, goose/geese, ox/oxen • Go/went, fly/flew • The terms regular and irregular will be used to refer to words that follow the rules and those that don’t. (Different meaning than “regular” languages!)

  10. Regular and Irregular Verbs • Regulars… • Walk, walks, walking, walked, walked • Irregulars • Eat, eats, eating, ate, eaten • Catch, catches, catching, caught, caught • Cut, cuts, cutting, cut, cut

  11. Regular Verbs • If you know a regular verb stem, you can predict the other forms by adding a predictable ending and making regular spelling changes (details in the chapter) • The regular class is productive – includes new verbs. • Emailed, instant-messaged, faxed, googled, …
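As a rough illustration of that predictability, here is a minimal Python sketch covering just two of the spelling rules (e-deletion and y-to-ie); the helper name inflect_regular is hypothetical, not from the chapter.

```python
# A minimal sketch of regular verb inflection (hypothetical helper,
# covering only a couple of the spelling rules in the chapter).

def inflect_regular(stem):
    """Return (3sg, progressive, past) forms of a regular verb stem."""
    # e-deletion before -ing/-ed: bake -> baking, baked
    if stem.endswith("e"):
        base = stem[:-1]
        return stem + "s", base + "ing", base + "ed"
    # y -> ie before -s/-ed: try -> tries, tried (but trying)
    if stem.endswith("y") and stem[-2] not in "aeiou":
        return stem[:-1] + "ies", stem + "ing", stem[:-1] + "ied"
    return stem + "s", stem + "ing", stem + "ed"

for verb in ["walk", "email", "google", "try"]:
    print(verb, inflect_regular(verb))
# walk  -> walks, walking, walked
# google -> googles, googling, googled
```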

  12. Derivational Morphology • Quasi-systematicity • Irregular meaning changes • Healthful vs. Healthy • Clue → clueless (lacking understanding) • Art → artless (without guile; not artificial; …) • Changes of word class • Examples • Computerize (V) → Computerization (N) • Appoint (V) → Appointee (N) • Computation (N) → Computational (Adj) • *eatation *spellation *sleepable *scienceless

  13. Morphological Processing Requirements • Lexicon • Word repository • Stems and affixes (with corresponding parts of speech) • Morphotactics • Morpheme ordering • Orthographic Rules • Spelling changes due to affixation • City + -s → cities (not citys)
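A toy Python rendering of these three components may help fix the ideas; every name here (LEXICON, MORPHOTACTICS, apply_orthography) is illustrative, and only the city/cities rule is implemented.

```python
# Toy rendering of the three components (all names illustrative,
# not from the textbook).

# Lexicon: stems and affixes with their classes.
LEXICON = {
    "cat": "reg-noun", "fox": "reg-noun", "city": "reg-noun",
    "goose": "irreg-sg-noun", "geese": "irreg-pl-noun",
    "s": "plural-affix",
}

# Morphotactics: what may follow each class ("#" = end of word).
MORPHOTACTICS = {
    "reg-noun": {"plural-affix", "#"},
    "irreg-sg-noun": {"#"},
    "irreg-pl-noun": {"#"},
    "plural-affix": {"#"},
}

# Orthographic rule: spelling change at the stem/affix boundary.
def apply_orthography(stem, affix):
    if affix == "s" and stem.endswith("y") and stem[-2] not in "aeiou":
        return stem[:-1] + "ies"   # city + s -> cities, not citys
    return stem + affix

print(apply_orthography("city", "s"))  # cities
print(apply_orthography("cat", "s"))   # cats
```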

  14. Morphotactics using FSAs • English nominal inflection • Inputs: cats, goose, geese

  15. Adding the Words • Expand each non-terminal into each stem in its class: reg-noun = cat, dog, … • Then expand each stem into the letters it includes
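A minimal sketch of such an FSA, assuming a hand-rolled segmentation step in place of full letter-by-letter expansion; the state names echo, rather than reproduce, the textbook figure.

```python
# Minimal FSA for English nominal inflection at the morpheme-class
# level (state names illustrative).

TRANSITIONS = {
    ("q0", "reg-noun"): "q1",
    ("q0", "irreg-sg-noun"): "q2",
    ("q0", "irreg-pl-noun"): "q2",
    ("q1", "plural-s"): "q2",
}
ACCEPT_STATES = {"q1", "q2"}   # bare regular nouns are words too

STEMS = {
    "cat": "reg-noun", "fox": "reg-noun",
    "goose": "irreg-sg-noun", "geese": "irreg-pl-noun",
}

def recognize(word):
    """Try each segmentation into stem (+ optional -s), run the FSA."""
    candidates = []
    if word in STEMS:
        candidates.append([STEMS[word]])
    if word.endswith("s") and word[:-1] in STEMS:
        candidates.append([STEMS[word[:-1]], "plural-s"])
    for symbols in candidates:
        state = "q0"
        for sym in symbols:
            state = TRANSITIONS.get((state, sym))
            if state is None:
                break
        if state in ACCEPT_STATES:
            return True
    return False

for w in ["cats", "goose", "geese", "gooses"]:
    print(w, recognize(w))  # True, True, True, False
```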

  16. Derivational Rules

  17. Limitations • FSAs can only tell us whether a word is in the language or not, but what if we want to know more? • What is the stem? • What are the affixes, and of what sort?

  18. Parsing/Generation vs. Recognition • Recognition is usually not quite what we need. • Usually if we find some string in the language we need to find the structure in it (parsing) • Or we have some structure and we want to produce a surface form (production/generation) • Examples • From “cats” to “cat +N +PL” • From “cat +N +PL” to “cats”

  19. Applications • The kind of parsing we’re talking about is normally called morphological analysis • It can either be • An important stand-alone component of an application (spelling correction, information retrieval) • Or simply a step in a processing sequence

  20. Finite State Transducers • Basic idea • Add another tape • Add extra symbols to the transitions • On one tape we read “cats”, on the other we write “cat +N +PL”, or the other way around.

  21. FSTs • Two-level morphology represents a word as a correspondence between the lexical level (the morphemes) and the surface level (the orthographic word) • Parsing maps the surface level to the lexical level • Visualize an FST as a 2-tape FSA which recognizes/generates pairs of strings

  22. Transitions • Arc labels for cat +N +PL : cats are c:c a:a t:t +N:ε +PL:s • c:c means read a c on one tape and write a c on the other • +N:ε means read a +N symbol on one tape and write nothing on the other • +PL:s means read +PL and write an s
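To make the arc labels concrete, here is a toy transducer in Python; the +SG arc and the state names are assumptions, not taken from the slides.

```python
# Toy FST for cat/cats, with arcs written as
# (state, lexical-symbol, surface-symbol, next-state);
# "" plays the role of epsilon.

ARCS = [
    ("q0", "c", "c", "q1"),
    ("q1", "a", "a", "q2"),
    ("q2", "t", "t", "q3"),
    ("q3", "+N", "", "q4"),    # +N:eps
    ("q4", "+PL", "s", "q5"),  # +PL:s
    ("q4", "+SG", "", "q5"),   # +SG:eps (assumed)
]
FINAL = {"q5"}

def generate(lexical_tape):
    """Read symbols from the lexical tape, write the surface tape."""
    state, out = "q0", []
    for sym in lexical_tape:
        for src, lex, surf, dst in ARCS:
            if src == state and lex == sym:
                out.append(surf)
                state = dst
                break
        else:
            return None  # no matching arc: reject
    return "".join(out) if state in FINAL else None

print(generate(["c", "a", "t", "+N", "+PL"]))  # cats
print(generate(["c", "a", "t", "+N", "+SG"]))  # cat
```

Swapping the lexical and surface fields in each arc gives the parsing direction, surface to lexical.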

  23. Typical Uses • Typically, we’ll read from one tape using the first symbol • And write to the second tape using the other symbol • Closure properties of FSTs: inversion and composition • So they may be used in reverse, and they may be cascaded
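A sketch of those two closure properties on toy string relations (sets of string pairs); real FST toolkits invert and compose the transducers themselves rather than enumerated pairs.

```python
# Inversion and composition illustrated on toy string relations.

parse = {("cats", "cat +N +PL"), ("cat", "cat +N +SG")}

# Inversion: swap the two tapes, turning a parser into a generator.
generate = {(lex, surf) for (surf, lex) in parse}
print(("cat +N +PL", "cats") in generate)  # True

# Composition: feed the output of one relation into the next.
def compose(r1, r2):
    return {(a, c) for (a, b1) in r1 for (b2, c) in r2 if b1 == b2}

tag_to_gloss = {("cat +N +PL", "more than one cat")}  # made-up second stage
print(compose(parse, tag_to_gloss))  # {('cats', 'more than one cat')}
```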

  24. Ambiguity • Recall that in non-deterministic recognition multiple paths through a machine may lead to an accept state. • Didn’t matter which path was actually traversed • In FSTs the path to an accept state does matter since different paths represent different parses and different outputs will result

  25. Ambiguity • What’s the right parse for • Unionizable • Union-ize-able • Un-ion-ize-able • Each represents a valid path through the derivational morphology machine.

  26. Ambiguity • There are a number of ways to deal with this problem • Simply take the first output found • Find all the possible outputs (all paths) and return them all (without choosing) • Bias the search so that only one or a few likely paths are explored
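The first two strategies can be made concrete with a naive all-paths search over a toy morpheme lexicon (a sketch, not the lecture's machine). Note that at the surface level ize + able collapses to "izable" by e-deletion, so the sketch lists "izable" as a single unit.

```python
# Naive all-paths morpheme segmentation over a toy lexicon.

MORPHEMES = {"un", "ion", "union", "izable"}

def all_parses(word, prefix=()):
    """Return every way to split word into known morphemes."""
    if not word:
        return [list(prefix)]
    parses = []
    for i in range(1, len(word) + 1):
        if word[:i] in MORPHEMES:
            parses.extend(all_parses(word[i:], prefix + (word[:i],)))
    return parses

parses = all_parses("unionizable")
print(parses)     # [['un', 'ion', 'izable'], ['union', 'izable']]
print(parses[0])  # "take the first output" strategy
```

Biasing the search would mean ordering or pruning the candidate splits instead of enumerating them all.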

  27. Details • Of course, it’s not as easy as • “cat +N +PL” <-> “cats” • As we saw earlier, there are geese, mice and oxen • But there are also spelling/pronunciation changes that go along with inflectional changes (e.g., the plural of fox is foxes, not foxs)

  28. Multi-Tape Machines • Add another tape, and use the output of one tape machine as the input to the next • To handle irregular spelling changes we’ll add intermediate tapes with intermediate symbols

  29. Multi-Level Tape Machines • We use one machine to transduce between the lexical and the intermediate level, and another to handle the spelling changes to the surface tape

  30. Lexical to Intermediate Level

  31. Intermediate to Surface • The “add an e” rule, as in fox^s# → foxes (the ε in the textbook figure is an error in the text)

  32. Succeeds iff the rule applies; removes all ^’s from the output string; useful if < 2 spelling rules apply (^ separates all affixes and stems; another NLP process may care about mid-word morphemes) • Question: suppose q0, q1, q2 were not final states? (again, the ε is an error in the text)
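A plain-string sketch of the lexical-to-intermediate and intermediate-to-surface stages for fox +N +PL, under the assumption that simple rewrites can stand in for the transducers; the function names are hypothetical.

```python
import re

# Cascade sketch for fox +N +PL -> foxes. Stage 1 inserts the ^
# morpheme boundary and the # word boundary; stage 2 applies
# e-insertion and strips the boundary symbols.

def lexical_to_intermediate(stem, tags):
    return stem + ("^s#" if "+PL" in tags else "#")

def intermediate_to_surface(form):
    # e-insertion: fox^s# -> foxes# (after x, s, z, ch, sh)
    form = re.sub(r"(x|s|z|ch|sh)\^s#", r"\1es#", form)
    return form.replace("^", "").replace("#", "")

inter = lexical_to_intermediate("fox", ["+N", "+PL"])
print(inter)                           # fox^s#
print(intermediate_to_surface(inter))  # foxes
print(intermediate_to_surface(
    lexical_to_intermediate("cat", ["+N", "+PL"])))  # cats
```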

  33. Overall Plan

  34. Summing Up • FSTs allow us to take an input and deliver a structure based on it • Or… take a structure and create a surface form • Or take a structure and create another structure

  35. Summing Up • In many applications it’s convenient to decompose the problem into a set of cascaded transducers where • The output of one feeds into the input of the next. • We’ll see this scheme again for deeper semantic processing.

  36. Summing Up • FSTs provide a useful tool for implementing a standard model of morphological analysis (two-level morphology) • Toolkits such as the AT&T FSM Toolkit are available • Other approaches are also used, e.g., the rule-based Porter stemmer and memory-based learning
