Morphological Processing & Stemming Using FSAs/FSTs
FSAs and Morphology • Can be used to validate/recognize an input string • For example, consider the Spanish conjugation of amar in J&M p. 64 • What would an FSA that recognizes these forms look like? • [Figure: FSA fragment with states 1 to 6 and arcs labeled am, a, e, s, m, …]
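One way to make the recognizer question concrete is the small character-level FSA below. It is only a sketch: the state numbers, the transition table, and the handful of accepted forms (amo, ama, amas, aman, amamos, ame, ames, amen) are assumptions chosen for illustration and do not reproduce the diagram on the slide or the full J&M table.

```python
# Toy deterministic FSA for a few present-tense forms of Spanish "amar".
# States and accepted forms are illustrative assumptions, not the slide's diagram.
TRANSITIONS = {
    (0, "a"): 1,
    (1, "m"): 2,   # stem "am" has been read
    (2, "o"): 3,   # amo
    (2, "a"): 4,   # ama
    (4, "s"): 5,   # amas
    (4, "n"): 5,   # aman
    (4, "m"): 6,   # ama-m...
    (6, "o"): 7,   # amamo...
    (7, "s"): 5,   # amamos
    (2, "e"): 8,   # ame
    (8, "s"): 5,   # ames
    (8, "n"): 5,   # amen
}
ACCEPTING = {3, 4, 5, 8}


def accepts(word):
    """Return True if the FSA ends in an accepting state after reading word."""
    state = 0
    for ch in word:
        state = TRANSITIONS.get((state, ch))
        if state is None:  # no arc for this symbol: reject
            return False
    return state in ACCEPTING


for w in ["amo", "amas", "amamos", "ames", "am", "amos"]:
    print(w, accepts(w))
```

The machine only says yes or no; it carries no information about what a recognized form means.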
FSTs and Morphology • An FST could output information about the input, such as a translation or grammatical info • [Figure: FST fragment with states 1, 2, 3, …, 7 and arcs labeled am:love, a, e, a:ε, o:ε, ε:impf]
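The idea of pairing each input symbol with an output can be sketched as a tiny transducer. Everything below is an assumption made for illustration: the arc table, the state numbers, and the tag strings (`+pres`, `+ind`, `+1sg`, and so on) are hypothetical and are not the arcs drawn on the slide. An empty output string stands in for ε on the output side.

```python
# Toy FST: (state, input symbol) -> (next state, output string).
# An empty output string plays the role of epsilon. Arcs and tags are
# illustrative assumptions covering only amo, amas and ames.
ARCS = {
    (0, "a"): (1, ""),
    (1, "m"): (2, "love"),               # stem "am" emits the gloss
    (2, "o"): (3, " +pres +ind +1sg"),   # amo
    (2, "a"): (4, " +pres +ind"),        # ama...
    (4, "s"): (5, " +2sg"),              # amas
    (2, "e"): (6, " +pres +subj"),       # ame...
    (6, "s"): (5, " +2sg"),              # ames
}
FINAL = {3, 5}


def transduce(word):
    """Return the analysis string for word, or None if it is not accepted."""
    state, output = 0, []
    for ch in word:
        arc = ARCS.get((state, ch))
        if arc is None:
            return None
        state, emitted = arc
        output.append(emitted)
    return "".join(output) if state in FINAL else None


print(transduce("amo"))   # love +pres +ind +1sg
print(transduce("ames"))  # love +pres +subj +2sg
```

Compared with a plain FSA, the only change is that each arc carries an output, so recognition and analysis happen in the same pass.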
FSAs and NLP • Why even use FSAs in NLP? • Memory and storage are cheap • Build one large lexicon • List all entries and required output • amo: [love, pres ind] • amas: [love, pres ind] • ames: [love, pres subj] • Some NLP apps do this (e.g., AZ Noun Phraser (Tolle 2001))
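The "one large lexicon" alternative amounts to a lookup table. The sketch below assumes a plain Python dictionary with the three entries from the slide; the name `LEXICON`, the helper `analyze`, and the tuple layout are hypothetical, and the tags follow standard Spanish conjugation (amo and amas are both present indicative).

```python
# Full-form lexicon: every surface form is listed with its required output.
# Entry layout (gloss, tense/mood) is an assumption made for this sketch.
LEXICON = {
    "amo": ("love", "pres ind"),
    "amas": ("love", "pres ind"),
    "ames": ("love", "pres subj"),
}


def analyze(word):
    """Look the word up directly; return None for anything not listed."""
    return LEXICON.get(word.lower())


print(analyze("amas"))    # ('love', 'pres ind')
print(analyze("amamos"))  # None: unlisted forms are simply unknown
```

The lookup is trivial, but every inflected form has to be entered by hand, which is what breaks down for morphologically rich languages.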
FSAs and NLP • For more morphologically complex languages, one big lexicon not feasible • Consider Hungarian and Finnish • One verbal form • Hundreds of possible inflections • Millions of resulting forms • A complete “word” lexicon not feasible • Morphological processing essential
Hungarian • Consider one concept/'word' in Hungarian: • haz: house • hazat: house (object) • haznak: of the house • hazzal: with the house • hazza: into a house • hazba: into the house • hazra: to the house • …
Hungarian • Now consider plural inflections: • hazak: houses • hazakat: houses (object) • hazaknak: of the houses • hazakkal: with the houses • hazakka: into houses • hazakba: into the houses • hazakra: to the houses • …
Hungarian • And possessives: • hazaim: my houses • hazaimat: my houses (object) • hazaimnak: of my houses • hazaimmal: with my houses • hazaimma: into my houses • hazaimba: into my houses • hazaimra: to my houses • …
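The multiplicative growth in these slides can be sketched by concatenating morphemes. The code assumes the unaccented spellings used on the slides and only the suffixes that attach without sound changes (-at, -nak, -ba, -ra); the "with" and "into a" cases are left out because their suffixes assimilate to the stem. The slot names are hypothetical.

```python
from itertools import product

STEM = "haz"
# Slot 1: bare stem, plural, or plural with 1sg possessor (as on the slides).
NUMBER_POSSESSIVE = ["", "ak", "aim"]
# Slot 2: no case ending, or one of four case suffixes that need no assimilation.
CASES = ["", "at", "nak", "ba", "ra"]


def generate(stem):
    """Return every stem + slot1 + slot2 combination."""
    return [stem + np + case for np, case in product(NUMBER_POSSESSIVE, CASES)]


forms = generate(STEM)
print(len(forms))  # 3 * 5 = 15 forms from a single stem
print(forms)
```

Each extra slot, or extra value within a slot, multiplies the total, which is why listing every form quickly becomes impractical.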
Stemming • Used in many IR applications • For building equivalence classes • Connect, Connected, Connecting, Connection, Connections: same class; suffixes irrelevant • Porter Stemmer: simple and efficient • Website: http://www.tartarus.org/~martin/PorterStemmer
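Equivalence classes of this kind can be illustrated with a deliberately naive suffix stripper. This is not the Porter algorithm: the suffix list, the minimum stem length, and the function name `toy_stem` are assumptions picked so that the five words above fall into one class.

```python
from collections import defaultdict

# Longest suffixes first so "ions" is tried before "ion" and "s".
SUFFIXES = ("ions", "ion", "ing", "ed", "s")


def toy_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    w = word.lower()
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= min_stem:
            return w[: -len(suffix)]
    return w


words = ["Connect", "Connected", "Connecting", "Connection", "Connections"]
classes = defaultdict(list)
for w in words:
    classes[toy_stem(w)].append(w)

print(dict(classes))  # all five words share the key "connect"
```

A real stemmer adds conditions on the remaining stem so that words like "sing" or "red" are not mangled; the sketch skips all of that.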
Stemming and Performance • Does stemming help IR performance? • Harman 91 indicated that it hurt as much as it helped • Krovetz 93 shows that stemming does help • Porter-like algorithms work well with smaller documents • Krovetz proposes that stemming loses information • Derivational morphemes tell us something that helps identify word senses (and helps in IR) • Stemming them = information loss
Evaluating Performance • Measures of stemming performance rely on metrics similar to those used in IR: • Precision: the proportion of selected items the system got right • precision = tp / (tp + fp) • Recall: the proportion of the target items the system selected • recall = tp / (tp + fn) • Here tp = true positives, fp = false positives, fn = false negatives • Rule of thumb: as precision increases, recall drops, and vice versa • Metrics widely adopted in Stat NLP
Precision and Recall • Take a given stemming task • Suppose there are 100 words that could be stemmed • A stemmer gets 52 of these right (tp) • The remaining 48 count as misses (fn = 100 - 52 = 48) • But it inadvertently stems 10 others (fp) • Precision = 52 / (52 + 10) = .84 • Recall = 52 / (52 + 48) = .52
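The arithmetic in this example can be checked with a few lines of code. The function names are hypothetical, and the sketch assumes that every stemmable word the system did not get right counts as a false negative.

```python
def precision(tp, fp):
    """Proportion of selected items that were correct."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Proportion of target items that were selected."""
    return tp / (tp + fn)


total_targets = 100          # words that could be stemmed
tp = 52                      # stemmed correctly
fp = 10                      # stemmed although they should not have been
fn = total_targets - tp      # 48 targets the stemmer did not get right

print(round(precision(tp, fp), 2))  # 0.84
print(round(recall(tp, fn), 2))     # 0.52
```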