Morphological Processing & Stemming Using FSAs/FSTs
FSAs and Morphology • Can be used to validate/recognize an input string • For example, consider the Spanish conjugation of amar in J&M p. 64 • What would an FSA that recognizes that input look like? (A sketch follows.) [FSA diagram: states 1-6, with arcs for the stem am and endings such as a, s, e, m, …]
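A minimal Python sketch of such a recognizer, written as a hand-built transition table rather than a copy of J&M's figure; the state names and the handful of amar forms covered are illustrative assumptions:

# Character-level FSA (DFA) that accepts a few present-indicative forms of
# Spanish "amar": amo, amas, ama, aman. States and coverage are illustrative.
TRANSITIONS = {
    ("q0", "a"): "q1",
    ("q1", "m"): "q2",   # stem "am" consumed
    ("q2", "o"): "qF",   # amo
    ("q2", "a"): "q3",   # ama...
    ("q3", "s"): "qF",   # amas
    ("q3", "n"): "qF",   # aman
}
ACCEPTING = {"qF", "q3"}  # "ama" itself is also a valid form

def recognizes(word: str) -> bool:
    """Run the automaton over the word; accept only if it ends in an accepting state."""
    state = "q0"
    for ch in word:
        state = TRANSITIONS.get((state, ch))
        if state is None:      # no transition defined: reject
            return False
    return state in ACCEPTING

for w in ["amo", "amas", "ama", "aman", "amx", "am"]:
    print(w, recognizes(w))   # the first four accept, the last two do not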
FSTs and Morphology • An FST could also output information about the input, such as a translation or grammatical info (see the sketch below): [FST diagram: states 1-7, with arcs such as am:love, a:ε, o:ε, e, ε:impf, …]
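A similar sketch for a transducer: it assumes the surface form has already been segmented into morphemes, and the arc labels and feature tags (love, +pres, +1sg, …) are illustrative rather than a standard tagset:

# Symbol-level FST: (state, input morpheme) -> (next state, output); "" plays
# the role of epsilon output. Arc labels follow the am:love spirit of the slide.
ARCS = {
    ("q0", "am"): ("q1", "love"),
    ("q1", "o"):  ("qF", "+pres +1sg"),
    ("q1", "as"): ("qF", "+pres +2sg"),
    ("q1", "a"):  ("qF", "+pres +3sg"),
}
ACCEPTING = {"qF"}

def transduce(morphemes):
    """Map a sequence of input morphemes to a gloss-plus-features string, or None."""
    state, output = "q0", []
    for m in morphemes:
        if (state, m) not in ARCS:
            return None              # no arc for this input: reject
        state, out = ARCS[(state, m)]
        if out:
            output.append(out)
    return " ".join(output) if state in ACCEPTING else None

print(transduce(["am", "o"]))    # love +pres +1sg
print(transduce(["am", "as"]))   # love +pres +2sg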
FSAs and NLP • Why even use FSAs in NLP? • Memory and storage are cheap • Build one large lexicon • List all entries and req’d output, e.g. amo → [love, pres ind], amas → [love, pres impf], ames → [love, pres subj] • Some NLP apps do this (e.g., AZ Noun Phraser (Tolle 2001))
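The big-lexicon approach is essentially a lookup table. A tiny sketch mirroring the three entries above (the feature labels are copied from the slide, not a claim about Spanish grammar):

# Every inflected form is listed explicitly with its required output.
LEXICON = {
    "amo":  ("love", "pres ind"),
    "amas": ("love", "pres impf"),
    "ames": ("love", "pres subj"),
}

def analyze(word):
    # Lookup is trivial and fast, but the table must contain every single form.
    return LEXICON.get(word)

print(analyze("amo"))     # ('love', 'pres ind')
print(analyze("amaba"))   # None: anything unlisted is simply unknown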
FSAs and NLP • For more morphologically complex languages, one big lexicon not feasible • Consider Hungarian and Finnish • One verbal form • Hundreds of possible inflections • Millions of resulting forms • A complete “word” lexicon not feasible • Morphological processing essential
Hungarian • Consider one concept/’word’ in Hungarian: haz (house), hazat (house, object), haznak (of the house), hazzal (with the house), hazza (into a house), hazba (into the house), hazra (to the house), …
Hungarian • Now consider plural inflections: hazak (houses), hazakat (houses, object), hazaknak (of the houses), hazakkal (with the houses), hazakka (into houses), hazakba (into the houses), hazakra (to the houses), …
Hungarian • And possessives: hazaim (my houses), hazaimat (my houses, object), hazaimnak (of my houses), hazaimmal (with my houses), hazaimma (into my houses), hazaimba (into my houses), hazaimra (to my houses), …
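A deliberately naive sketch of why full-form listing breaks down: it just concatenates suffix strings, ignoring Hungarian vowel harmony and consonant assimilation, so the generated strings only approximate real forms, but the combinatorial growth is the point:

STEM = "haz"
NUMBER_POSS = ["", "ak", "aim"]                      # singular, plural, "my" + plural
CASES = ["", "at", "nak", "val", "va", "ba", "ra"]   # a small subset of the cases

forms = [STEM + np + case for np in NUMBER_POSS for case in CASES]
print(len(forms))    # 3 * 7 = 21 forms already, from one stem and a few suffixes
print(forms[:8])     # with all cases, possessives and persons, the count explodes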
Stemming • Used in many IR applications • For building equivalence classes: Connect, Connected, Connecting, Connection, Connections all fall in the same class; the suffixes are irrelevant • Porter Stemmer: simple and efficient • Website: http://www.tartarus.org/~martin/PorterStemmer
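A quick check of the equivalence-class idea, using NLTK's PorterStemmer (this assumes the nltk package is installed; the reference implementation lives at the URL above):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["connect", "connected", "connecting", "connection", "connections"]
# All five should collapse to the same stem, i.e. one equivalence class.
print({w: stemmer.stem(w) for w in words})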
Stemming and Performance • Does stemming help IR performance? • Harman (1991) found that it hurt about as much as it helped • Krovetz (1993) showed that stemming does help • Porter-like algorithms work well with smaller documents • Krovetz proposes that stemming loses information • Derivational morphemes tell us something that helps identify word senses (and helps in IR) • Stemming them = information loss
Evaluating Performance • Measures of stemming performance rely on metrics similar to those used in IR: • Precision: the proportion of selected items the system got right • precision = tp / (tp + fp) • Recall: the proportion of the target items the system selected • recall = tp / (tp + fn) • Rule of thumb: as precision increases, recall drops, and vice versa • These metrics are widely adopted in statistical NLP
Precision and Recall • Take a given stemming task • Suppose there are 100 words that could be stemmed • A stemmer gets 52 of these right (tp) and misses the other 48 (fn) • But it also inadvertently stems 10 words it shouldn’t (fp) • Precision = 52 / (52 + 10) ≈ .84 • Recall = 52 / (52 + 48) = .52
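The same arithmetic as a small sketch, with the slide's counts plugged in:

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

tp, fp, fn = 52, 10, 48               # counts from the stemming example above
print(round(precision(tp, fp), 2))    # 0.84
print(round(recall(tp, fn), 2))       # 0.52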