LING 388: Language and Computers Sandiway Fong Lecture 21: 11/8
Administrivia • Homework #4 • due this Wednesday • November 10th • email • by midnight to sandiway@email.arizona.edu
Last Time • Finished up looking at machine translation (MT) • You have now learnt how to write grammars • You have now also acquired basic techniques for building and modifying grammar-based MT systems • After homework #4, you have demonstrated basic ability to modify and extend the coverage of such systems
Today’s Topics • Turn our attention to a different area: Morphology, Stemming • Important not only in sentence parsing • but also for internet-related applications such as information retrieval (IR)
Morphology • Inflectional Morphology: • basically: no change in category • φ-features (person, number, gender) • Examples: movies, blonde, actress • Irregular examples: appendices, geese • Case • Examples: he/him, who/whom • Comparatives and superlatives • Examples: happier/happiest • Tense • Examples: drive/drives/drove (-ed)/driven
Morphology • Derivational Morphology • basically: category changing • Nominalization • Examples: formalization, informant, informer, refusal, lossage • Deadjectivals • Examples: weaken, happiness, simplify, formalize, slowly, calm • Deverbals • Examples: see nominalizations, readable, employee • Denominals • Examples: formal, bridge, ski, cowardly, useful
Morphology and Semantics • Morphemes: units of meaning • Suffixation • Examples: • x employ y • employee: picks out y • employer: picks out x • x read y • readable: picks out y • Prefixation • Examples: • undo, redo, un-redo, encode, defrost, asymmetric, malformed, ill-formed, pro-Chomsky
Stemming • Normalization Procedure • Inflectional morphology: • cities → city, improves/improved → improve • Derivational morphology: • transformation/transformational → transform • Criterion: • preserve meaning (word senses) • counterexample: organization → organ • Primary application: • information retrieval (IR) • Efficacy questioned: Harman (1991) • Note: Google didn’t use stemming
Stemming • IR-centric view • Applies to open-class lexical items only: • Stop-words are excluded: the, below, being, does • Not full morphology • prefixes generally excluded • (not meaning preserving) • Examples: asymmetric, undo, encoding
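The IR-centric view can be sketched in Python as a small term-normalization step: drop stop-words first, then stem whatever is left. This is only an illustration. `STOP_WORDS` reuses the stop-words listed on the slide, while `TOY_STEMS` and `index_terms` are hypothetical names, and the lookup table stands in for a real stemmer.

```python
# Stop-words from the slide: closed-class items that are never stemmed or indexed.
STOP_WORDS = {"the", "below", "being", "does"}

# Hypothetical stand-in for a real stemmer: a tiny lookup table (an assumption).
TOY_STEMS = {"cities": "city", "improves": "improve", "improved": "improve"}


def index_terms(text, stems=TOY_STEMS, stop_words=STOP_WORDS):
    """Return the normalized index terms for a piece of text."""
    terms = []
    for token in text.lower().split():
        if token in stop_words:
            continue  # stop-words are dropped before stemming
        terms.append(stems.get(token, token))  # unknown words pass through unchanged
    return terms


print(index_terms("The cities below improved"))
```

With this normalization, a query containing "improves" and a document containing "improved" end up sharing the index term "improve".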
Stemming: Methods • Use a dictionary (look-up) • OK for English, not for languages with more productive morphology, e.g. Japanese • Write rules, e.g. Porter Algorithm (Porter, 1980) • Example: • Ends in doubled consonant (not “l”, “s” or “z”), remove last character • hopping → hop • hissing → hiss
Stemming: Methods • Dictionary approach not enough • Example: (Porter, 1991) • routed → route/rout • At Waterloo, Napoleon’s forces were routed • The cars were routed off the highway • Here, the (inflected) verb form is ambiguous
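The ambiguity of "routed" can be made concrete with a minimal Python sketch of dictionary look-up. The lexicon `TOY_LEXICON` and the function `candidate_stems` are hypothetical, and only the -ed suffix is handled. The point is that look-up alone returns two equally valid base forms.

```python
# Hypothetical toy lexicon of base forms (an assumption for illustration).
TOY_LEXICON = {"route", "rout", "walk", "bake"}


def candidate_stems(word, lexicon=TOY_LEXICON):
    """Undo -ed in two ways (strip 'ed', strip 'd') and keep the forms found in the lexicon."""
    candidates = set()
    if word.endswith("ed"):
        for base in (word[:-2], word[:-1]):
            if base in lexicon:
                candidates.add(base)
    return sorted(candidates)


for w in ("walked", "baked", "routed"):
    print(w, candidate_stems(w))
```

For "routed" both "rout" and "route" survive, so choosing between them needs sentence context that a dictionary does not supply.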
Stemming: Errors • Understemming: failure to merge • Example: • adhere/adhesion • Overstemming: incorrect merge • Example: • probe/probable • Claim: -able irregular suffix, root: probare (Lat.) • Mis-stemming: removing a non-suffix (Porter, 1991) • Example: • reply → rep
Stemming: Interaction • Interacts with noun compounding • Example: • operating systems • negative polarity items • For IR, compounds need to be identified first…
Stemming: Porter Algorithm • The Porter Stemmer (Porter, 1980) • URL: • http://www.tartarus.org/~martin/PorterStemmer/ • C, Java, Perl code (among others) • for English • most widely used system: dictionary-free • manually written rules • 5-stage approach to extracting roots • considers suffixes only • may produce non-word roots
Stemming: Porter Algorithm • Rule format: • (condition on stem) suffix1 → suffix2 • In case of conflict, prefer longest suffix match • “Measure” of a word is m in: • (C)(VC)^m(V), where the initial (C) and final (V) are optional • C = sequence of one or more consonants • V = sequence of one or more vowels • Examples: • tree: C(VC)^0V, so m = 0 • troubles: C(VC)^2, so m = 2
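The measure can be computed mechanically, which the following Python sketch illustrates. It assumes lowercase input and follows Porter's (1980) convention, not stated on the slide, that "y" counts as a vowel when it follows a consonant. The function names are hypothetical.

```python
VOWELS = set("aeiou")


def is_consonant(word, i):
    """True if word[i] is a consonant; 'y' is a vowel only after a consonant."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def cv_pattern(word):
    """Collapse a word into alternating runs, e.g. 'troubles' -> 'CVCVC'."""
    pattern = []
    for i in range(len(word)):
        tag = "C" if is_consonant(word, i) else "V"
        if not pattern or pattern[-1] != tag:
            pattern.append(tag)
    return "".join(pattern)


def measure(word):
    """The m in (C)(VC)^m(V): the number of VC pairs in the collapsed pattern."""
    return cv_pattern(word).count("VC")


for w in ("tree", "troubles"):
    print(w, cv_pattern(w), measure(w))
```

Because runs are collapsed first, "tr" in "tree" is a single C and "ou" in "troubles" is a single V, matching the two examples on the slide.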
Stemming: Porter Algorithm • Step 1a: remove plural suffixation (ε = empty string) • SSES → SS (caresses) • IES → I (ponies) • SS → SS (caress) • S → ε (cats) • Step 1b: remove verbal inflection • (m>0) EED → EE (agreed, feed) • (*v*) ED → ε (plastered, bled) • (*v*) ING → ε (motoring, sing)
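Step 1a has no conditions on the stem, so it reduces to an ordered suffix table. The Python sketch below is one possible encoding, with hypothetical names. The rules are listed longest suffix first, so the first match is also the longest match.

```python
# Step 1a rules from the slide, ordered longest suffix first.
STEP_1A_RULES = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]


def step_1a(word):
    """Apply the first (longest) matching plural rule, or return the word unchanged."""
    for suffix, replacement in STEP_1A_RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word


for w in ("caresses", "ponies", "caress", "cats"):
    print(w, "->", step_1a(w))
```

The SS → SS rule looks redundant, but it stops the shorter S rule from turning "caress" into "cares".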
Stemming: Porter Algorithm • Step 1b: (contd. for -ed and -ing rules) • AT → ATE (conflated) • BL → BLE (troubled) • IZ → IZE (sized) • (*doubled c & ¬(*L ∨ *S ∨ *Z)) → single c (hopping, hissing, falling, fizzing) • (m=1 & *cvc) → E (filing, failing, slowing) • Step 1c: Y and I • (*v*) Y → I (happy, sky)
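Steps 1b and 1c can be sketched in Python as follows. This is a minimal sketch, not a reference implementation. It assumes lowercase input, treats "y" as a vowel after a consonant, and, following Porter's (1980) definition of the *cvc condition, requires that the final consonant not be w, x or y (which is why "slowing" does not gain an e). All function names are hypothetical.

```python
VOWELS = set("aeiou")


def is_consonant(word, i):
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Count vowel-run to consonant-run transitions: the m in (C)(VC)^m(V)."""
    m, prev_vowel = 0, False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_vowel:
            m += 1
        prev_vowel = not cons
    return m


def has_vowel(stem):
    """The *v* condition: the stem contains a vowel."""
    return any(not is_consonant(stem, i) for i in range(len(stem)))


def ends_cvc(stem):
    """The *cvc condition, with the final consonant not w, x or y."""
    n = len(stem)
    return (n >= 3 and is_consonant(stem, n - 3) and not is_consonant(stem, n - 2)
            and is_consonant(stem, n - 1) and stem[-1] not in "wxy")


def tidy(stem):
    """Clean-up rules applied after -ed or -ing has been removed."""
    if stem.endswith(("at", "bl", "iz")):
        return stem + "e"
    doubled = len(stem) >= 2 and stem[-1] == stem[-2] and is_consonant(stem, len(stem) - 1)
    if doubled and stem[-1] not in "lsz":
        return stem[:-1]
    if measure(stem) == 1 and ends_cvc(stem):
        return stem + "e"
    return stem


def step_1b(word):
    if word.endswith("eed"):
        stem = word[:-3]
        return stem + "ee" if measure(stem) > 0 else word
    for suffix in ("ed", "ing"):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            return tidy(stem) if has_vowel(stem) else word
    return word


def step_1c(word):
    stem = word[:-1]
    return stem + "i" if word.endswith("y") and has_vowel(stem) else word


for w in ("agreed", "bled", "conflated", "hopping", "hissing", "filing", "slowing"):
    print(w, "->", step_1b(w))
```

The second word in each pair on the slide is a deliberate non-match: "feed", "bled" and "sing" fail the condition on the stem and are returned unchanged.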
Stemming: Porter Algorithm • Step 2: Peel one suffix off for multiple suffixes • (m>0) ATIONAL → ATE (relational) • (m>0) TIONAL → TION (conditional, rational) • (m>0) ENCI → ENCE (valenci) • (m>0) ANCI → ANCE (hesitanci) • (m>0) IZER → IZE (digitizer) • (m>0) ABLI → ABLE (conformabli) - able (step 4) • … • (m>0) IZATION → IZE (vietnamization) • (m>0) ATION → ATE (predication) • … • (m>0) IVITI → IVE (sensitiviti)
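The Step 2 rules listed above can be sketched in Python as a table plus a longest-match loop. Only the rules shown on the slide are included; the elided ones are left out. The sketch assumes lowercase input and that "y" is a vowel after a consonant, and the names `STEP_2_RULES` and `step_2` are hypothetical.

```python
VOWELS = set("aeiou")


def is_consonant(word, i):
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Count vowel-run to consonant-run transitions: the m in (C)(VC)^m(V)."""
    m, prev_vowel = 0, False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_vowel:
            m += 1
        prev_vowel = not cons
    return m


# Only the Step 2 rules that appear on the slide.
STEP_2_RULES = {
    "ational": "ate", "tional": "tion", "enci": "ence", "anci": "ance",
    "izer": "ize", "abli": "able", "ization": "ize", "ation": "ate",
    "iviti": "ive",
}


def step_2(word):
    """Find the longest matching suffix; rewrite it only if the stem has m > 0."""
    for suffix in sorted(STEP_2_RULES, key=len, reverse=True):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if measure(stem) > 0:
                return stem + STEP_2_RULES[suffix]
            return word  # longest match found but condition failed: stop here
    return word


for w in ("relational", "conditional", "rational", "predication"):
    print(w, "->", step_2(w))
```

"rational" shows why the condition matters: its longest match is ATIONAL, the remaining stem "r" has m = 0, and so the word is left alone instead of becoming "rate".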
Stemming: Porter Algorithm • Step 3 (ε = empty string) • (m>0) ICATE → IC (triplicate) • (m>0) ATIVE → ε (formative) • (m>0) ALIZE → AL (formalize) • (m>0) ICITI → IC (electriciti) • (m>0) ICAL → IC (electrical, chemical) • (m>0) FUL → ε (hopeful) • (m>0) NESS → ε (goodness)
Stemming: Porter Algorithm • Step 4: Delete last suffix (ε = empty string) • (m>1) AL → ε (revival) - revive, see step 5 • (m>1) ANCE → ε (allowance, dance) • (m>1) ENCE → ε (inference, fence) • (m>1) ER → ε (airliner, employer) • (m>1) IC → ε (gyroscopic, electric) • (m>1) ABLE → ε (adjustable, mov(e)able) • (m>1) IBLE → ε (defensible, bible) • (m>1) ANT → ε (irritant, ant) • (m>1) EMENT → ε (replacement) • (m>1) MENT → ε (adjustment) • …
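Steps 3 and 4 share one shape: a suffix table plus a threshold on the measure of the remaining stem (m>0 for Step 3, m>1 for Step 4). The Python sketch below uses a single hypothetical helper, `apply_rules`, for both, and includes only the rules shown on the slides. It assumes lowercase input and that "y" is a vowel after a consonant.

```python
VOWELS = set("aeiou")


def is_consonant(word, i):
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Count vowel-run to consonant-run transitions: the m in (C)(VC)^m(V)."""
    m, prev_vowel = 0, False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_vowel:
            m += 1
        prev_vowel = not cons
    return m


STEP_3_RULES = {"icate": "ic", "ative": "", "alize": "al", "iciti": "ic",
                "ical": "ic", "ful": "", "ness": ""}
STEP_4_RULES = {s: "" for s in ("al", "ance", "ence", "er", "ic", "able",
                                "ible", "ant", "ement", "ment")}


def apply_rules(word, rules, min_measure):
    """Rewrite the longest matching suffix if the stem's measure exceeds min_measure."""
    for suffix in sorted(rules, key=len, reverse=True):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if measure(stem) > min_measure:
                return stem + rules[suffix]
            return word  # longest match found but condition failed: stop here
    return word


def step_3(word):
    return apply_rules(word, STEP_3_RULES, 0)


def step_4(word):
    return apply_rules(word, STEP_4_RULES, 1)


for w in ("formalize", "hopeful", "revival", "dance", "replacement", "agreement"):
    print(w, "->", step_3(w) if w in ("formalize", "hopeful") else step_4(w))
```

The m>1 threshold protects short words such as "dance", "fence", "bible" and "ant". It also explains why "agreement" is left untouched: the longest match is EMENT, and the remaining stem "agre" has m = 1.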
Stemming: Porter Algorithm • Step 5a: remove e (ε = empty string) • (m>1) E → ε (probate, rate) • (m=1 & ¬*cvc) E → ε (cease) • Step 5b: ll reduction • (m>1 & *LL) → L (controller, roll)
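A Python sketch of Step 5 follows. It takes the second 5a rule to apply when m = 1, as in Porter's (1980) definition, and uses Porter's version of *cvc in which the final consonant is not w, x or y. It assumes lowercase input, and the function names are hypothetical. The slide's "controller" would reach this step as "controll", after Step 4 has removed -er.

```python
VOWELS = set("aeiou")


def is_consonant(word, i):
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Count vowel-run to consonant-run transitions: the m in (C)(VC)^m(V)."""
    m, prev_vowel = 0, False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_vowel:
            m += 1
        prev_vowel = not cons
    return m


def ends_cvc(stem):
    """The *cvc condition, with the final consonant not w, x or y."""
    n = len(stem)
    return (n >= 3 and is_consonant(stem, n - 3) and not is_consonant(stem, n - 2)
            and is_consonant(stem, n - 1) and stem[-1] not in "wxy")


def step_5a(word):
    """Drop a final e if m > 1, or if m = 1 and the stem does not end in cvc."""
    if word.endswith("e"):
        stem = word[:-1]
        m = measure(stem)
        if m > 1 or (m == 1 and not ends_cvc(stem)):
            return stem
    return word


def step_5b(word):
    """Reduce a final ll to l when the measure is greater than 1."""
    if word.endswith("ll") and measure(word) > 1:
        return word[:-1]
    return word


for w in ("probate", "rate", "cease"):
    print(w, "->", step_5a(w))
for w in ("controll", "roll"):
    print(w, "->", step_5b(w))
```

The *cvc test is what keeps "rate" intact: its stem "rat" has m = 1 but ends consonant-vowel-consonant, so the e stays.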
Stemming: Porter Algorithm • Misses (understemming) • Unaffected: • agreement: (VC)^1VCC - step 4 (m>1) • adhesion • Irregular morphology: • drove, geese • Overstemming • relativity - step 2 • Mis-stemming • wander: C(VC)^1VC