CSC 9010 Natural Language Processing. Lecture 3: Morphology, Finite State Transducers. Paula Matuszek, Mary-Angela Papalaskari. Presentation slides adapted from: Marti Hearst (some following Dorr and Habash) http://www.sims.berkeley.edu/courses/is290-2/f04/ and Jim Martin: http://www.cs.colorado.edu/~martin/csci5832.html CSC 9010- NLP - 3: Morphology, Finite State Transducers
Today • Elementary Morphology • Computational morphology • Finite State Transducers • Lexicon-only schemes • Rule-only schemes • Lab: Introduction to NLTK
Morphology • Morphology: • The study of the way words are built up from smaller meaning-bearing units. • Morphemes: • The smallest meaningful units in the grammar of a language. • Contrasts: • Derivational vs. Inflectional • Regular vs. Irregular • Concatenative vs. Templatic (root-and-pattern) • A useful resource: • Glossary of linguistic terms by Eugene Loos • http://www.sil.org/linguistics/GlossaryOfLinguisticTerms/contents.htm
Examples (English) • “unladylike” • 3 morphemes, 4 syllables: un- ‘not’, lady ‘(well behaved) female adult human’, -like ‘having the characteristics of’ • Can’t break any of these down further without distorting the meaning of the units • “technique” • 1 morpheme, 2 syllables • “dogs” • 2 morphemes, 1 syllable: dog, plus -s, a plural marker on nouns
Morpheme Definitions • Root • The portion of the word that: • is common to a set of derived or inflected forms, if any, when all affixes are removed • is not further analyzable into meaningful elements • carries the principal portion of meaning of the words • Stem • The root or roots of a word, together with any derivational affixes, to which inflectional affixes are added. • Affix • A bound morpheme that is joined before, after, or within a root or stem. • Clitic • A morpheme that functions syntactically like a word, but does not appear as an independent phonological word • Spanish: un beso, las aguas • English: Hal’s (genitive marker) • Proto-Indo-European: *kʷe, reflected as -que (Latin), te (Greek), and -ca (Sanskrit)
Inflectional vs. Derivational • Word Classes • Parts of speech: noun, verb, adjective, etc. • Word class dictates how a word combines with morphemes to form new words • Inflection: • Variation in the form of a word, typically by means of an affix, that expresses a grammatical contrast. • Doesn’t change the word class • Usually produces a predictable, non-idiosyncratic change of meaning. • Derivation: • The formation of a new word or inflectable stem from another word or stem.
Inflectional Morphology • Adds: • tense, number, person, mood, aspect • Word class doesn’t change • Word serves new grammatical role • Examples • come is inflected for person and number: The pizza guy comes at noon. • las and rojas are inflected for agreement with manzanas in grammatical gender by -a and in number by -s: las manzanas rojas (‘the red apples’)
Derivational Morphology • Nominalization (formation of nouns from other parts of speech, primarily verbs in English): • computerization • appointee • killer • fuzziness • Formation of adjectives (primarily from nouns) • computational • clueless • embraceable • Difficult cases: • building: from which sense of “build”?
Concatenative Morphology • Morpheme+Morpheme+Morpheme+… • Stems: also called lemma, base form, root, lexeme • hope+ing -> hoping, hop+ing -> hopping • Affixes • Prefixes: anti-, dis- in antidisestablishmentarianism • Suffixes: -ment, -arian, -ism in antidisestablishmentarianism • Infixes: hingi (borrow) – humingi (borrower) in Tagalog • Circumfixes: sagen (say) – gesagt (said) in German • Agglutinative Languages • uygarlaştıramadıklarımızdanmışsınızcasına (Turkish) • uygar+laş+tır+ama+dık+lar+ımız+dan+mış+sınız+casına • Behaving as if you are among those whom we could not cause to become civilized
Templatic Morphology • Roots and Patterns • Example: Hebrew verbs • Root: • Consists of 3 consonants CCC • Carries basic meaning • Template: • Gives the ordering of consonants and vowels • Specifies semantic information about the verb • Active, passive, middle voice • Example: • lmd (to learn or study) • CaCaC -> lamad (he studied) • CiCeC -> limed (he taught) • CuCaC -> lumad (he was taught)
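The root-and-pattern idea on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `interdigitate` is hypothetical, and it assumes the template uses the letter C as the only placeholder for a root consonant, exactly as in the three examples above.

```python
def interdigitate(root, template):
    """Fill each 'C' slot of the template with the next root consonant."""
    consonants = iter(root)
    # Any character other than 'C' (here, the vowels) is copied unchanged.
    return "".join(next(consonants) if slot == "C" else slot for slot in template)


if __name__ == "__main__":
    for template in ("CaCaC", "CiCeC", "CuCaC"):
        print(template, "->", interdigitate("lmd", template))
```

The same root combines with different templates to give the active, intensive, and passive forms, which is why a purely concatenative stem+affix model does not fit these languages well.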
Nouns and Verbs (in English) • Nouns have simple inflectional morphology • cat • cat+s, cat+’s • Verbs have more complex morphology
Nouns and Verbs (in English) • Nouns • Have simple inflectional morphology • Cat/Cats • Mouse/Mice, Ox/Oxen, Goose/Geese • Verbs • More complex morphology • Walk/Walked • Go/Went, Fly/Flew
Regular (English) Verbs
Irregular (English) Verbs
“To love” in Spanish
Syntax and Morphology • Phrase-level agreement • Subject-Verb • John studies hard (STUDY+3SG) • Noun-Adjective • Las vacas hermosas • Sub-word phrasal structures • שבספרינו • ש+ב+ספר+ים+נו • That+in+book+PL+Poss:1PL • ‘Which are in our books’
Phonology and Morphology • Script Limitations • Spoken English has 14 vowels • heed hid hayed head had hoed hood who’d hide how’d taught Tut toy enough • The English alphabet has 5 vowel letters • Use vowel combinations: far fair fare • Consonantal doubling (hopping vs. hoping)
Computational Morphology • Approaches • Lexicon only • Rules only • Lexicon and Rules • Finite-state Automata • Finite-state Transducers • Systems • WordNet’s morphy • PCKimmo • Named after Kimmo Koskenniemi, much work done by Lauri Karttunen, Ron Kaplan, and Martin Kay • Accurate but complex • http://www.sil.org/pckimmo/ • Two-level morphology • Commercial version available from InXight Corp. • Background • Chapter 3 of Jurafsky and Martin • A short history of Two-Level Morphology • http://www.ling.helsinki.fi/~koskenni/esslli-2001-karttunen/
Computational Morphology
WORD       STEM (+FEATURES)*
cats       cat +N +PL
cat        cat +N +SG
cities     city +N +PL
geese      goose +N +PL
ducks      (duck +N +PL) or (duck +V +3SG)
merging    merge +V +PRES-PART
caught     (catch +V +PAST-PART) or (catch +V +PAST)
FSAs and the Lexicon • First we’ll capture the morphotactics • The rules governing the ordering of affixes in a language. • Then we’ll add in the actual words
Simple Rules
Adding the Words
Derivational Rules
Parsing/Generation vs. Recognition • Recognition is usually not quite what we need. • Usually if we find some string in the language we need to find the structure in it (parsing) • Or we have some structure and we want to produce a surface form (production/generation) • Example • From “cats” to “cat +N +PL” and back • Morphological analysis
Finite State Transducers • The simple story • Add another tape • Add extra symbols to the transitions • On one tape we read “cats”, on the other we write “cat +N +PL”, or the other way around.
FSTs
Transitions (path shown in the figure: c:c a:a t:t +N:ε +PL:s) • c:c means read a c on one tape and write a c on the other • +N:ε means read a +N symbol on one tape and write nothing on the other • +PL:s means read +PL and write an s
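To make the two-tape reading concrete, here is a minimal Python sketch of a transducer for the single word on this slide. The state numbers, the names `run`, `generate`, and `parse`, and the extra `+SG:ε` arc (added so that "cat" also has an analysis) are assumptions for illustration, not part of the slides. The same transition table is used in both directions by choosing which side of each pair is read and which is written.

```python
EPS = ""  # stands for ε: nothing is read or written on that tape

# (source state, lexical symbol, surface symbol, target state)
TRANSITIONS = [
    (0, "c", "c", 1),
    (1, "a", "a", 2),
    (2, "t", "t", 3),
    (3, "+N", EPS, 4),
    (4, "+PL", "s", 5),
    (4, "+SG", EPS, 5),  # assumed arc, not shown on the slide
]
FINAL_STATES = {5}


def run(symbols, in_side, out_side):
    """Return every output sequence for `symbols`; side 0 is lexical, 1 is surface."""
    results = []

    def step(state, pos, out):
        if state in FINAL_STATES and pos == len(symbols):
            results.append(out)
        for src, lexical, surface, dst in TRANSITIONS:
            if src != state:
                continue
            pair = (lexical, surface)
            read, write = pair[in_side], pair[out_side]
            emitted = out + [write] if write != EPS else out
            if read == EPS:
                step(dst, pos, emitted)        # ε on the input side consumes nothing
            elif pos < len(symbols) and symbols[pos] == read:
                step(dst, pos + 1, emitted)    # consume one input symbol

    step(0, 0, [])
    return results


def generate(lexical_symbols):
    """Lexical tape to surface tape, e.g. c a t +N +PL -> cats."""
    return ["".join(out) for out in run(lexical_symbols, 0, 1)]


def parse(surface_word):
    """Surface tape to lexical tape, e.g. cats -> c a t +N +PL."""
    return run(list(surface_word), 1, 0)


if __name__ == "__main__":
    print(generate(["c", "a", "t", "+N", "+PL"]))
    print(parse("cats"))
```

Because `run` collects every accepting path instead of stopping at the first, it returns a list of results, which matters as soon as a word has more than one analysis.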
Ambiguity • Recall that in non-deterministic recognition multiple paths through a machine may lead to an accept state. • It didn’t matter which path was actually traversed • In FSTs the path to an accept state does matter, since different paths represent different parses and produce different outputs
Ambiguity • What’s the right parse for “unionizable”? • Union-ize-able • Un-ion-ize-able • Each represents a valid path through the derivational morphology machine.
Ambiguity • There are a number of ways to deal with this problem • Simply take the first output found • Find all the possible outputs (all paths) and return them all (without choosing) • Bias the search so that only one or a few likely paths are explored
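The first two strategies can be illustrated with a small Python sketch that enumerates every way of splitting a word into known morphemes. The inventory `MORPHEMES` is a toy assumption: it lists surface spellings only (so -ize appears as "iz" before -able) and stands in for a real derivational machine.

```python
# Toy inventory of surface morpheme spellings (an assumption for illustration).
MORPHEMES = {"un", "ion", "union", "iz", "able"}


def segmentations(word):
    """Return every way to split `word` into a sequence of known morphemes."""
    if not word:
        return [[]]  # one way to segment the empty string: no morphemes
    parses = []
    for end in range(1, len(word) + 1):
        prefix = word[:end]
        if prefix in MORPHEMES:
            for rest in segmentations(word[end:]):
                parses.append([prefix] + rest)
    return parses


if __name__ == "__main__":
    all_parses = segmentations("unionizable")
    print("all paths:", all_parses)
    print("first found:", all_parses[0] if all_parses else None)
```

Returning all paths leaves the choice to a later component; taking the first one is cheaper but depends entirely on search order, which is why biasing the search toward likely paths is the third option on the slide.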
The Gory Details • Of course, it’s not as easy as • “cat +N +PL” <-> “cats” • As we saw earlier there are geese, mice and oxen • But there are also a whole host of spelling/pronunciation changes that go along with inflectional changes • Cats vs. dogs (the plural is pronounced /s/ in cats but /z/ in dogs) • Multi-tape machines
Multi-Level Tape Machines • We use one machine to transduce between the lexical and the intermediate level, and another to handle the spelling changes to the surface tape
Lexical to Intermediate Level
Intermediate to Surface • The “add an e” rule (e-insertion), as in fox^s# <-> foxes
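As a rough sketch of the intermediate-to-surface step, the e-insertion rule can be written as a regular-expression rewrite in Python. Two assumptions are made here: `^` marks a morpheme boundary and `#` the end of the word, as in fox^s#, and the rule fires after x, s, or z only. A real two-level system would compile this rule into a transducer rather than use a regex.

```python
import re


def intermediate_to_surface(form):
    """Map an intermediate form such as 'fox^s#' to its surface spelling."""
    # e-insertion: insert 'e' after x, s, or z at a morpheme boundary before final 's'.
    form = re.sub(r"([xsz])\^(?=s#)", r"\1e", form)
    # Whatever boundary symbols remain are simply deleted on the surface tape.
    return form.replace("^", "").replace("#", "")


if __name__ == "__main__":
    for form in ("fox^s#", "cat^s#", "kiss^s#"):
        print(form, "->", intermediate_to_surface(form))
```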
Foxes
Foxes
FST Review • FSTs allow us to take an input and deliver a structure based on it • Or… take a structure and create a surface form • Or take a structure and create another structure • In many applications it’s convenient to decompose the problem into a set of cascaded transducers where • The output of one feeds into the input of the next. • We’ll see this scheme again for deeper semantic processing.
Overall Plan
Lexicon-only Morphology • The lexicon lists all surface level and lexical level pairs • No rules … • Analysis/Generation is easy • Very large for English • What about • Arabic or • Turkish or • Chinese?
SURFACE        LEXICAL
acclaim        acclaim $N$
acclaim        acclaim $V+0$
acclaimed      acclaim $V+ed$
acclaimed      acclaim $V+en$
acclaiming     acclaim $V+ing$
acclaims       acclaim $N+s$
acclaims       acclaim $V+s$
acclamation    acclamation $N$
acclamations   acclamation $N+s$
acclimate      acclimate $V+0$
acclimated     acclimate $V+ed$
acclimated     acclimate $V+en$
acclimates     acclimate $V+s$
acclimating    acclimate $V+ing$
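A lexicon-only scheme amounts to a pair of table lookups. The Python sketch below uses a handful of the pairs listed above; the names `PAIRS`, `analyze`, and `generate` are illustrative, and a real lexicon would hold every surface form of every word, which is the scaling problem the slide points at.

```python
from collections import defaultdict

# (surface form, stem, features): a few of the pairs from the listing above.
PAIRS = [
    ("acclaim", "acclaim", "N"),
    ("acclaim", "acclaim", "V+0"),
    ("acclaimed", "acclaim", "V+ed"),
    ("acclaimed", "acclaim", "V+en"),
    ("acclaiming", "acclaim", "V+ing"),
    ("acclaims", "acclaim", "N+s"),
    ("acclaims", "acclaim", "V+s"),
]

ANALYSES = defaultdict(list)  # surface form -> [(stem, features), ...]
SURFACES = {}                 # (stem, features) -> surface form
for surface, stem, features in PAIRS:
    ANALYSES[surface].append((stem, features))
    SURFACES[(stem, features)] = surface


def analyze(surface):
    """Analysis is a lookup; an unknown word simply has no analyses."""
    return ANALYSES.get(surface, [])


def generate(stem, features):
    """Generation is the inverse lookup."""
    return SURFACES.get((stem, features))


if __name__ == "__main__":
    print(analyze("acclaimed"))
    print(generate("acclaim", "V+ing"))
```

There are no rules to get wrong, but also no generalization: a form that was never listed cannot be analyzed at all.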
Stemming vs Morphology • Sometimes you just need to know the stem of a word and you don’t care about the structure. • In fact you may not even care if you get the right stem, as long as you get a consistent string. • This is stemming… it most often shows up in IR applications
Stemming in IR • Run a stemmer on the documents to be indexed • Run a stemmer on users’ queries • Match • This is basically a form of hashing • Example: computerization • -ization -> -ize: computerize • -ize -> ε: computer
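The index-then-match loop can be sketched in Python with a toy stemmer that knows only the two rewrite rules from the example. Everything here is an assumption for illustration: the documents are made up, the names `stem`, `build_index`, and `search` are hypothetical, and a real system would use a full stemmer.

```python
from collections import defaultdict


def stem(word):
    """Toy two-stage stemmer: -ization -> -ize, then -ize -> (nothing)."""
    word = word.lower()
    if word.endswith("ization"):
        word = word[: -len("ization")] + "ize"
    if word.endswith("ize"):
        word = word[: -len("ize")]
    return word


def build_index(docs):
    """Map each stem to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[stem(token)].add(doc_id)
    return index


def search(index, query):
    """Stem the query the same way and return documents matching every term."""
    hits = [index.get(stem(term), set()) for term in query.split()]
    return set.intersection(*hits) if hits else set()


if __name__ == "__main__":
    docs = {  # hypothetical documents
        1: "computerization of records",
        2: "computerize the office",
        3: "a new computer",
        4: "morphology lecture",
    }
    index = build_index(docs)
    print(search(index, "computerization"))
```

The stem acts as a hash key: all that matters is that documents and queries are mapped the same way, not that the key is a real word.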
Porter Stemmer Step 4: Derivational Morphology I: Multiple Suffixes
(m>0) ATIONAL -> ATE     relational -> relate
(m>0) TIONAL -> TION     conditional -> condition, rational -> rational
(m>0) ENCI -> ENCE       valenci -> valence
(m>0) ANCI -> ANCE       hesitanci -> hesitance
(m>0) IZER -> IZE        digitizer -> digitize
(m>0) ABLI -> ABLE       conformabli -> conformable
(m>0) ALLI -> AL         radicalli -> radical
(m>0) ENTLI -> ENT       differentli -> different
(m>0) ELI -> E           vileli -> vile
(m>0) OUSLI -> OUS       analogousli -> analogous
(m>0) IZATION -> IZE     vietnamization -> vietnamize
(m>0) ATION -> ATE       predication -> predicate
(m>0) ATOR -> ATE        operator -> operate
(m>0) ALISM -> AL        feudalism -> feudal
(m>0) IVENESS -> IVE     decisiveness -> decisive
(m>0) FULNESS -> FUL     hopefulness -> hopeful
(m>0) OUSNESS -> OUS     callousness -> callous
(m>0) ALITI -> AL        formaliti -> formal
(m>0) IVITI -> IVE       sensitiviti -> sensitive
(m>0) BILITI -> BLE      sensibiliti -> sensible
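The condition (m>0) refers to Porter's measure: if a stem is viewed as [C](VC)^m[V], with C and V standing for runs of consonants and vowels, then m counts the VC pairs. The Python sketch below implements the measure and this one step only; the names `measure` and `apply_step` are hypothetical, and the sketch assumes that only the longest matching suffix is tested, which is what makes rational come out unchanged.

```python
RULES = [
    ("ational", "ate"), ("tional", "tion"), ("enci", "ence"), ("anci", "ance"),
    ("izer", "ize"), ("abli", "able"), ("alli", "al"), ("entli", "ent"),
    ("eli", "e"), ("ousli", "ous"), ("ization", "ize"), ("ation", "ate"),
    ("ator", "ate"), ("alism", "al"), ("iveness", "ive"), ("fulness", "ful"),
    ("ousness", "ous"), ("aliti", "al"), ("iviti", "ive"), ("biliti", "ble"),
]


def is_consonant(word, i):
    """A letter other than a, e, i, o, u, and other than y preceded by a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Number of VC sequences once runs of consonants/vowels are collapsed."""
    shape = ""
    for i in range(len(stem)):
        kind = "C" if is_consonant(stem, i) else "V"
        if not shape or shape[-1] != kind:
            shape += kind
    return shape.count("VC")


def apply_step(word):
    """Rewrite the longest matching suffix, but only if the remaining stem has m > 0."""
    for suffix, replacement in sorted(RULES, key=lambda rule: -len(rule[0])):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            return stem + replacement if measure(stem) > 0 else word
    return word


if __name__ == "__main__":
    for word in ("relational", "conditional", "rational", "vietnamization"):
        print(word, "->", apply_step(word))
```

For rational, the longest match is ATIONAL, leaving the stem "r" with m = 0, so the rule is blocked and the shorter TIONAL rule is never tried.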
Porter • No lexicon needed • Basically a set of staged sets of rewrite rules that strip suffixes • Handles both inflectional and derivational suffixes • Doesn’t guarantee that the resulting stem is really a stem (see first bullet) • Lack of guarantee doesn’t matter for IR
Porter Stemmer • Errors of Omission (related forms that are not conflated) • European / Europe • analysis / analyzes • matrices / matrix • noise / noisy • explain / explanation • Errors of Commission (unrelated forms that are conflated) • organization / organ • doing / doe • generalization / generic • numerical / numerous • university / universe
Soundex • You work as the Villanova telephone operator. Someone calls looking for: Dr Papalarsky or Dr Matuzka • ???????? What do you type as your query string?
Soundex • Keep the first letter • Drop non-initial occurrences of vowels, h, w and y • Replace the remaining letters with numbers according to group (e.g., b, f, p, and v -> 1) • Replace strings of identical numbers with a single number (333 -> 3) • Drop any numbers beyond a third one
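The steps above can be sketched directly in Python. The slide names only the first letter group; the remaining groups in `GROUPS` are the ones commonly given for Soundex and are an assumption here, as is padding short codes with zeros. The sketch follows the slide's order of steps, so it can differ from standard Soundex implementations on some names (for example, it does not compare the first letter's own code with the digit that follows).

```python
# Letter groups: only "bfpv -> 1" appears on the slide; the rest are the usual Soundex groups.
GROUPS = [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"), ("l", "4"), ("mn", "5"), ("r", "6")]
CODES = {letter: digit for letters, digit in GROUPS for letter in letters}


def soundex(name):
    """Simplified Soundex following the five steps listed above."""
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""
    first, rest = letters[0], letters[1:]
    # Drop non-initial vowels, h, w and y.
    kept = [ch for ch in rest if ch not in "aeiouhwy"]
    # Replace the remaining letters with their group numbers.
    digits = [CODES[ch] for ch in kept if ch in CODES]
    # Collapse runs of identical numbers into one.
    collapsed = []
    for digit in digits:
        if not collapsed or collapsed[-1] != digit:
            collapsed.append(digit)
    # Keep at most three numbers; pad with zeros (an assumption, not on the slide).
    return first.upper() + "".join(collapsed[:3]).ljust(3, "0")


if __name__ == "__main__":
    for name in ("Matuszek", "Matuzka", "Robert", "Rupert"):
        print(name, soundex(name))
```

With this sketch the misspelling Matuzka receives the same code as Matuszek, so a directory keyed by code would still find the right entry.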
Soundex • Effect is to map (hash) all similar sounding transcriptions to the same code. • Structure your directory so that it can be accessed by code as well as by correct spelling • Used for census records, phone directories, author searches in libraries etc.