1. Morphology
Reading: Chap 3, Jurafsky & Martin
Instructor: Rada Mihalcea
Note: Some of the material in this slide set was adapted from Christel Kemke (U. Manitoba) slides on morphology
2. Morphology
Morpheme = "minimal meaning-bearing unit in a language"
Morphology handles the formation of words by using morphemes
base form (stem), e.g., believe
affixes (suffixes, prefixes, infixes), e.g., un-, -able, -ly
Morphological parsing = the task of recognizing the morphemes inside a word
e.g., hands, foxes, children
Important for many tasks
machine translation
information retrieval
lexicography
any further processing (e.g., part-of-speech tagging)
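A minimal sketch of morphological parsing for the three example words, assuming a tiny hypothetical lexicon and only the plural suffixes -s and -es. The feature tags (+N, +SG, +PL) are illustrative labels, not something the slides define.

```python
# Toy morphological parser. STEMS and IRREGULAR_PLURALS are a hypothetical
# mini-lexicon made up for this example, not a real resource.
STEMS = {"hand", "fox", "child"}
IRREGULAR_PLURALS = {"children": "child"}


def parse(word):
    """Return (stem, features) or None if the word cannot be analyzed."""
    if word in IRREGULAR_PLURALS:
        return IRREGULAR_PLURALS[word], ["+N", "+PL"]
    if word in STEMS:
        return word, ["+N", "+SG"]
    # Try the longer suffix first so that "foxes" is split as fox + es.
    for suffix in ("es", "s"):
        if word.endswith(suffix) and word[: -len(suffix)] in STEMS:
            return word[: -len(suffix)], ["+N", "+PL"]
    return None


for w in ("hands", "foxes", "children"):
    print(w, parse(w))
```

Words outside the lexicon return None; a real parser needs a much larger lexicon and spelling rules.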
3. Morphemes and Words
Combine morphemes to create words
Inflection
combination of a word stem with a grammatical morpheme
same word class, e.g. clean (verb), clean-ing (verb)
Derivation
combination of a word stem with a grammatical morpheme
Yields different word class, e.g. clean (verb), clean-ing (noun)
Compounding
combination of multiple word stems
Cliticization
combination of a word stem with a clitic
different words from different syntactic categories, e.g. I’ve = I + have
4. Inflectional Morphology
word stem + grammatical morpheme, e.g. cat + s
only for nouns, verbs, and some adjectives
Nouns
plural:
regular: +s, +es
irregular: mouse - mice; ox - oxen
rules for exceptions: e.g. -y -> -ies, as in butterfly - butterflies
possessive: +'s, +'
Verbs
main verbs (sleep, eat, walk)
modal verbs (can, will, should)
primary verbs (be, have, do)
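The noun plural rules on this slide can be sketched as a lookup for irregular forms followed by ordered spelling rules. This is a simplified illustration: the list of endings that take +es and the consonant check before -y are assumptions added here, and the irregular table holds only the two examples from the slide.

```python
# Irregular plurals from the slide; a real list would be much longer.
IRREGULAR = {"mouse": "mice", "ox": "oxen"}
VOWELS = "aeiou"


def pluralize(noun):
    """Form the plural of an English noun with a few ordered rules."""
    if noun in IRREGULAR:
        return IRREGULAR[noun]
    # consonant + y -> ies (butterfly -> butterflies); vowel + y just adds s.
    if noun.endswith("y") and len(noun) > 1 and noun[-2] not in VOWELS:
        return noun[:-1] + "ies"
    # Assumed list of sibilant endings that take +es.
    if noun.endswith(("s", "x", "z", "ch", "sh")):
        return noun + "es"
    return noun + "s"


for n in ("cat", "fox", "butterfly", "mouse", "ox"):
    print(n, "->", pluralize(n))
```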
5. Inflectional Morphology (verbs)
Verb Inflections for:
main verbs (sleep, eat, walk); primary verbs (be, have, do)
Morphological form           Regularly inflected forms
stem                         walk      merge     try      map
-s form                      walks     merges    tries    maps
-ing participle              walking   merging   trying   mapping
past form; -ed participle    walked    merged    tried    mapped

Morphological form           Irregularly inflected forms
stem                         eat       catch      cut
-s form                      eats      catches    cuts
-ing participle              eating    catching   cutting
past form                    ate       caught     cut
-ed participle               eaten     caught     cut
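The two tables can be reproduced with a small generator: regular forms come from spelling rules, and the past and participle of irregular verbs come from a lookup. This is a sketch under stated assumptions. The consonant-doubling test is a crude consonant-vowel-consonant check that suits short stems such as map and cut but would over-apply to longer verbs, and all function names are made up for the example.

```python
VOWELS = "aeiou"

# Irregular past forms and participles from the second table.
IRREGULAR = {
    "eat": {"past": "ate", "ed_participle": "eaten"},
    "catch": {"past": "caught", "ed_participle": "caught"},
    "cut": {"past": "cut", "ed_participle": "cut"},
}


def _doubles_final_consonant(stem):
    # Crude test: consonant + vowel + consonant at the end (map, cut).
    return (len(stem) >= 3
            and stem[-1] not in VOWELS + "wxy"
            and stem[-2] in VOWELS
            and stem[-3] not in VOWELS)


def s_form(stem):
    if stem.endswith("y") and len(stem) > 1 and stem[-2] not in VOWELS:
        return stem[:-1] + "ies"          # try -> tries
    if stem.endswith(("s", "x", "z", "ch", "sh")):
        return stem + "es"                # catch -> catches
    return stem + "s"


def ing_form(stem):
    if stem.endswith("e") and not stem.endswith("ee"):
        return stem[:-1] + "ing"          # merge -> merging
    if _doubles_final_consonant(stem):
        return stem + stem[-1] + "ing"    # map -> mapping
    return stem + "ing"


def ed_form(stem):
    if stem.endswith("e"):
        return stem + "d"                 # merge -> merged
    if stem.endswith("y") and len(stem) > 1 and stem[-2] not in VOWELS:
        return stem[:-1] + "ied"          # try -> tried
    if _doubles_final_consonant(stem):
        return stem + stem[-1] + "ed"     # map -> mapped
    return stem + "ed"


def inflect(stem):
    """Return the rows of the inflection table for one verb stem."""
    irregular = IRREGULAR.get(stem, {})
    regular = ed_form(stem)
    return {
        "stem": stem,
        "s_form": s_form(stem),
        "ing_participle": ing_form(stem),
        "past": irregular.get("past", regular),
        "ed_participle": irregular.get("ed_participle", regular),
    }


for verb in ("walk", "merge", "try", "map", "eat", "catch", "cut"):
    print(inflect(verb))
```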
6. Inflectional Morphology (nouns)
Noun Inflections for:
regular nouns (cat, hand); irregular nouns (child, ox)
Morphological form    Regularly inflected forms
stem                  cat     hand
plural form           cats    hands

Morphological form    Irregularly inflected forms
stem                  child       ox
plural form           children    oxen
7. Inflectional and Derivational Morphology (adjectives)
Adjective Inflections and Derivations:
Affix           Example      Result
prefix un-      unhappy      adjective, negation
suffix -ly      happily      adverb, mode
suffix -er      happier      adjective, comparative
suffix -est     happiest     adjective, superlative
suffix -ness    happiness    noun
plus combinations, like unhappiest, unhappiness.
Distinguish different adjective classes, which can or cannot take certain inflectional or derivational forms, e.g. no negation for big.
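One way to picture adjective classes is a table that records which affixes each stem accepts. The sketch below is illustrative only: the class table, the function name derive, and the two spelling rules (y -> i, consonant doubling) are assumptions made for this example.

```python
VOWELS = "aeiou"

# Hypothetical adjective classes: the affixes each stem is allowed to take.
ADJECTIVE_CLASSES = {
    "happy": {"un-", "-ly", "-er", "-est", "-ness"},
    "big": {"-er", "-est"},  # no negation: "unbig" is rejected
}


def derive(stem, *affixes):
    """Attach the affixes to stem, or return None if one is not licensed."""
    allowed = ADJECTIVE_CLASSES.get(stem, set())
    if any(affix not in allowed for affix in affixes):
        return None
    word = stem
    for affix in affixes:              # suffixes first
        if affix.startswith("-"):
            suffix = affix[1:]
            if word.endswith("y"):
                word = word[:-1] + "i" + suffix          # happy -> happi-
            elif (suffix[0] in VOWELS and len(word) >= 3
                  and word[-1] not in VOWELS
                  and word[-2] in VOWELS
                  and word[-3] not in VOWELS):
                word = word + word[-1] + suffix          # big -> bigg-
            else:
                word = word + suffix
    for affix in affixes:              # then prefixes
        if affix.endswith("-"):
            word = affix[:-1] + word
    return word


print(derive("happy", "un-", "-est"))
print(derive("big", "-er"))
print(derive("big", "un-"))
```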
9. Derivational Morphology (adjectives)
10. Verb Clitics
11. Methods, Algorithms
12. Stemming
Stemming algorithms strip off word affixes
yield stem only, no additional information (like plural, 3rd person etc.)
used, e.g. in web search engines
famous stemming algorithm: the Porter stemmer
13. Stemming
Reduce tokens to “root” form of words to recognize morphological variation.
“computer”, “computational”, “computation” all reduced to same token “compute”
Correct morphological analysis is language specific and can be complex.
Stemming “blindly” strips off known affixes (prefixes and suffixes) in an iterative fashion.
14. Porter Stemmer
Simple procedure for removing known affixes in English without using a dictionary.
Can produce unusual stems that are not English words:
“computer”, “computational”, “computation” all reduced to same token “comput”
May conflate (reduce to the same token) words that are actually distinct.
Does not recognize all morphological derivations.
Typical rules in Porter stemmer
sses -> ss
ies -> i
ational -> ate
tional -> tion
ing -> ε (the suffix is deleted)
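The rules listed above can be applied with a simple first-match loop. This is only an illustration of the rule format, not the Porter stemmer itself: the real algorithm has many more rules, organized in ordered steps with conditions on the remaining stem. The one condition kept here, that -ing is removed only when the remaining stem contains a vowel, is how the Porter algorithm treats that suffix.

```python
VOWELS = "aeiou"

# (suffix, replacement) pairs from the slide. "ational" is listed before
# "tional" so that the longer suffix wins.
RULES = [
    ("sses", "ss"),
    ("ies", "i"),
    ("ational", "ate"),
    ("tional", "tion"),
    ("ing", ""),
]


def apply_rules(word):
    """Apply the first matching rule once and return the result."""
    for suffix, replacement in RULES:
        if not word.endswith(suffix):
            continue
        stem = word[: -len(suffix)]
        # Only strip -ing when a vowel remains, so "sing" is left alone.
        if suffix == "ing" and not any(ch in VOWELS for ch in stem):
            continue
        return stem + replacement
    return word


for w in ("caresses", "ponies", "relational", "conditional", "motoring", "sing"):
    print(w, "->", apply_rules(w))
```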
15. Stemming Problems
16. Tokenization, Word Segmentation
Tokenization or word segmentation
separate out “words” (lexical entries) from running text
expand abbreviated terms
E.g. I’m into I am, it’s into it is
collect tokens forming single lexical entry
E.g. New York marked as one single entry
More of an issue in languages like Chinese
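The three steps above (separate words, expand abbreviated terms, collect multiword entries) can be sketched as follows. The contraction and multiword tables are hypothetical two-entry examples. Note that "it's" can also stand for "it has", so a fixed table like this is a simplification.

```python
import re

# Hypothetical lookup tables, limited to the examples from the slide.
CONTRACTIONS = {"I'm": ["I", "am"], "it's": ["it", "is"]}
MULTIWORD = {("New", "York"): "New York"}


def tokenize(text):
    """Split text into words, expand contractions, merge multiword entries."""
    text = text.replace("’", "'")  # normalize curly apostrophes
    raw = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)

    expanded = []
    for token in raw:
        expanded.extend(CONTRACTIONS.get(token, [token]))

    tokens, i = [], 0
    while i < len(expanded):
        pair = tuple(expanded[i:i + 2])
        if pair in MULTIWORD:
            tokens.append(MULTIWORD[pair])  # two words, one lexical entry
            i += 2
        else:
            tokens.append(expanded[i])
            i += 1
    return tokens


print(tokenize("I'm in New York and it's cold."))
```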
17. Simple Tokenization
Analyze text into a sequence of discrete tokens (words).
Sometimes punctuation (e-mail), numbers (1999), and case (Republican vs. republican) can be a meaningful part of a token.
However, frequently they are not.
Simplest approach is to ignore all numbers and punctuation and use only case-insensitive unbroken strings of alphabetic characters as tokens.
More careful approach:
Separate ? ! ; : “ ‘ [ ] ( ) < >
Care with . - why? when?
Care with … ??
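A short sketch contrasting the two approaches. The first function keeps only case-insensitive alphabetic strings; the second separates the punctuation marks listed above and leaves the period and hyphen attached, since those need extra care. Function names are made up for the example.

```python
import re


def simple_tokens(text):
    """Simplest approach: lowercase runs of letters; drop everything else."""
    return re.findall(r"[a-z]+", text.lower())


def careful_tokens(text):
    """Split off ? ! ; : “ ‘ [ ] ( ) < > but keep . and - inside tokens."""
    spaced = re.sub(r"([?!;:“‘\[\]()<>])", r" \1 ", text)
    return spaced.split()


print(simple_tokens("E-mail me in 1999, Republican!"))
print(careful_tokens("Wait (really)? U.S.A. e-mail!"))
```

The simple version breaks "E-mail" in two and loses "1999", which is exactly the information the slide warns can be meaningful.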
18. Punctuation
Children’s: use language-specific mappings to normalize (e.g. Anglo-Saxon genitive of nouns, verb contractions: won’t -> wo n’t)
State-of-the-art: break up hyphenated sequence.
U.S.A. vs. USA
a.out
19. Numbers
3/12/91
Mar. 12, 1991
55 B.C.
B-52
100.2.86.144
Generally, don’t index as text
Creation dates for docs
20. Lemmatization
Reduce inflectional/derivational forms to base form
Direct impact on vocabulary size
E.g.,
am, are, is -> be
car, cars, car's, cars' -> car
the boy's cars are different colors -> the boy car be different color
How to do this?
Need a list of grammatical rules + a list of irregular words
children -> child, spoken -> speak …
Practical implementation: use WordNet’s morphstr function
Perl: WordNet::QueryData (first returned value from validForms function)
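The "rules + list of irregular words" idea can be sketched without any library. The code below is a toy: the irregular table holds only the examples from this slide, and the single rule (strip a possessive, then a final -s) is deliberately crude and will over-strip words such as "this". It does not use WordNet.

```python
# Irregular forms from the slide; a real list is far longer.
IRREGULAR = {
    "am": "be", "are": "be", "is": "be",
    "children": "child", "spoken": "speak",
}


def lemmatize(token):
    """Map a token to its base form with one lookup and one crude rule."""
    token = token.lower().replace("’", "'")
    if token in IRREGULAR:
        return IRREGULAR[token]
    # Strip possessive markers: car's -> car, cars' -> cars.
    if token.endswith("'s"):
        token = token[:-2]
    elif token.endswith("'"):
        token = token[:-1]
    if token in IRREGULAR:
        return IRREGULAR[token]
    # Crude plural rule: drop a final s, but not from -ss or short words.
    if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        token = token[:-1]
    return token


sentence = "the boy's cars are different colors"
print(" ".join(lemmatize(t) for t in sentence.split()))
```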
21. Morphological Processing
Knowledge
lexical entry: stem plus possible prefixes, suffixes plus word classes, e.g. endings for verb forms (see tables above)
rules: how to combine stem and affixes, e.g. add s to form plural of noun as in dogs
orthographic rules: spelling, e.g. double consonant as in mapping
Processing: Finite State Transducers
take information above and analyze word token / generate word form
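To make the transducer idea concrete, here is a minimal sketch of a deterministic finite state transducer that generates a noun form from a lexical description such as d o g +N +PL. The state names, the tag symbols, and the "@" wildcard for "any letter" are assumptions made for this example. Orthographic rules (foxes, mapping) and the analysis direction, which needs nondeterministic search, are left out.

```python
# Transitions: (state, input symbol) -> (output string, next state).
# "@" is a hypothetical wildcard meaning "any single letter, copied as is".
TRANSITIONS = {
    ("stem", "@"): ("@", "stem"),
    ("stem", "+N"): ("", "noun"),
    ("noun", "+SG"): ("", "end"),
    ("noun", "+PL"): ("s", "end"),
}
FINAL_STATES = {"end"}


def generate(symbols):
    """Run the transducer over lexical symbols; return the surface form."""
    state, output = "stem", []
    for symbol in symbols:
        key = (state, symbol)
        if key not in TRANSITIONS:
            # Fall back to the wildcard, which only matches single letters.
            key = (state, "@")
            if key not in TRANSITIONS or not (len(symbol) == 1 and symbol.isalpha()):
                return None
        out, state = TRANSITIONS[key]
        output.append(symbol if out == "@" else out)
    return "".join(output) if state in FINAL_STATES else None


print(generate(list("dog") + ["+N", "+PL"]))
print(generate(list("dog") + ["+N", "+SG"]))
```

An input that does not end in a final state, such as a bare stem with no tags, is rejected with None.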