Natural Language Processing. Vasile Rus http://www.cs.memphis.edu/~vrus/teaching/nlp. Outline. Announcements Word-level Processing Stemming. Announcements. Paper presentations (PhD students) Project. Language. Language = words grouped according to some rules called a grammar
Natural Language Processing Vasile Rus http://www.cs.memphis.edu/~vrus/teaching/nlp
Outline • Announcements • Word-level Processing • Stemming
Announcements • Paper presentations (PhD students) • Project
Language • Language = words grouped according to some rules called a grammar • Language = words + rules • Rules are too flexible for system developers • Rules are not flexible enough for poets
Language • Dictionary/Lexicon • set of words defined in the language • open (dynamic) • entries in the lexicon are called lemmas • Grammar • set of rules which describe what is allowable in a language • Classic Grammars • meant for humans who know the language • definitions and rules are mainly supported by examples • no (or almost no) formal description tools; cannot be programmed • Explicit Grammars (CFG, Dependency Grammars, ...) • formal description • can be programmed and tested on data (texts)
Levels of (Formal) Description • Speech/Character Recognition • Speech: Phonetics and Phonology • Morphology • Syntax • Semantics • Pragmatics • Discourse • Slide annotations, from the lowest levels upward: levels that handle sounds/graphics, levels that handle words, levels that handle rules for grouping words into legal language constructs
NLP Pipeline • Input: speech -> Phonetic Analysis; text -> Character Recognition • Morphological analysis (this lecture) • Syntactic analysis • Semantic Interpretation • Discourse Processing
Words and their Internal Affairs: Morphology • Words are grouped into classes (also called grammatical categories, syntactic categories, or parts of speech, POS) • The grouping is based mainly on their syntactic and morphological behavior • Noun: words that occur with determiners, take possessives, and (most but not all) occur in plural form • and less on their typical semantic type • Luckily, the classes are semantically coherent to some extent • A word belongs to a category if it passes the substitution test • The sad/intelligent/green/fat bug submerged in my soup. • sad, intelligent, green, and fat all belong to the same class: ADJ
Words and their Internal Affairs: Morphology • Word categories are of two types: • Open categories: accept new members • Nouns • Verbs • Adjectives • Adverbs • Closed or functional categories • Almost fixed membership • Few members • Determiners, prepositions, pronouns, conjunctions, auxiliary verbs?, particles, numerals, etc. • Play an important role in grammar • Any known human language has nouns and verbs!
Phonology and Morphology • Phonology: how sounds are realized in different contexts • Phoneme: a set of closely related speech sounds (phones) regarded as a single sound. For example, the sound of "r" in red, bring, or round is a phoneme. • Historical spelling • night, nite • attention, mission, fish • Script limitations • Spoken English has 14 vowels • heed hid hayed head had hoed hood who’d hide how’d taught Tut toy enough • The English alphabet has 5 vowel letters • Use vowel combinations: far fair fare • Consonantal doubling (hopping vs. hoping)
Syntax and Morphology • Phrase-level agreement • Subject-Verb • John studies hard (STUDY+3SG) • Noun-Adjective • Las vacas hermosas (Spanish: "the beautiful cows"; the article, noun, and adjective all agree in feminine plural)
Morphology: Morphemes • Studies how words are built up from morphemes, the minimal meaning-bearing units • Example: foxes = fox + es • the morphemes could as well be some “ID” numbers: • e.g. fox ~ 2327, es ~ 1278 • Two broad classes of morphemes: • Stem: the main morpheme of a word (supplies the main meaning) • Affixes: provide additional meanings • prefixes: precede the stem • suffixes: follow the stem • infixes: inside the stem • circumfixes: precede and follow the stem
Morphology • Morphemes combine according to some fixed rules • Concatenative morphology • Prefixes and suffixes • Morphemes combined in complex ways • Non-concatenative/templatic morphology • Morphology can be divided up into two broad classes • Inflectional • Derivational
Inflectional Morphology • Inflectional morphology concerns the combination of stems and affixes where the resulting word • Has the same word class as the original • Serves a grammatical/semantic purpose different from the original
Nouns (English) • Nouns are simple (not really) • Markers for plural and possessive • Plural: • Regular plural: affix -s or -es • cat -> cats • thrush -> thrushes • Irregular plural: • mouse -> mice • ox -> oxen
Verbs • Only main and primary (be, have, do) verbs have inflectional affixes (modals do not) • Regular • walk, walks, walking, walked, walked • Irregular • eat, eats, eating, ate, eaten • catch, catches, catching, caught, caught • cut, cuts, cutting, cut, cut
Derivational Morphology • Combination of a word stem with a grammatical morpheme • Results in a word of a different class • Derivational morphology is complex in English • Quasi-systematicity • Irregular meaning change • Changes of word class • Nominalisation • computerize + -ation -> computerization
Derivational Examples • Verb/Adj to Noun
Derivational Examples • Noun/Verb to Adj
Compute • Many paths are possible… • Start with compute • Computer -> computerize -> computerization • Computation -> computational • Computer -> computerize -> computerizable • Compute -> computee
Computational Morphology • Morphological Parsing • Maps surface representation to lexicon entries • Finite State Morphology • Finite State Transducers (FST) • Input/Output • Analysis/Generation
Computational Morphology
  WORD      STEM (+FEATURES)*
  cats      cat +N +PL
  cat       cat +N +SG
  cities    city +N +PL
  geese     goose +N +PL
  ducks     (duck +N +PL) or (duck +V +3SG)
  merging   merge +V +PRES-PART
  caught    (catch +V +PAST-PART) or (catch +V +PAST)
Computational Morphology • The Rules and the Lexicon • General versus Specific • Regular versus Irregular • Accuracy, speed, space • The Morphology of a language • Approaches • Lexicon only • Rules only • Lexicon and Rules • Finite-state transducer: a finite-state automaton that maps between the lexical and surface levels
Lexicon-only Morphology • The lexicon lists all surface level and lexical level pairs • No rules …? • Analysis/Generation is easy • Very large for English • What about Arabic or Turkish? • Chinese?
  SURFACE        LEXICAL
  acclaim        acclaim $N$
  acclaim        acclaim $V+0$
  acclaimed      acclaim $V+ed$
  acclaimed      acclaim $V+en$
  acclaiming     acclaim $V+ing$
  acclaims       acclaim $N+s$
  acclaims       acclaim $V+s$
  acclamation    acclamation $N$
  acclamations   acclamation $N+s$
  acclimate      acclimate $V+0$
  acclimated     acclimate $V+ed$
  acclimated     acclimate $V+en$
  acclimates     acclimate $V+s$
  acclimating    acclimate $V+ing$
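A minimal Python sketch of the lexicon-only approach, assuming a hypothetical in-memory table (`PAIRS`, `LEXICON`) filled with a few of the surface/lexical pairs from the listing above. The names `analyze` and `generate` are illustrative, not from the slides. Analysis is a plain lookup, and an ambiguous surface form simply returns several entries.

```python
from collections import defaultdict

# (surface form, lexical form, tag) triples copied from the slide's listing.
PAIRS = [
    ("acclaim", "acclaim", "N"),
    ("acclaim", "acclaim", "V+0"),
    ("acclaimed", "acclaim", "V+ed"),
    ("acclaimed", "acclaim", "V+en"),
    ("acclaims", "acclaim", "N+s"),
    ("acclaims", "acclaim", "V+s"),
    ("acclamation", "acclamation", "N"),
]

# Hypothetical lexicon: surface form -> list of (lexical form, tag).
LEXICON = defaultdict(list)
for surface, lemma, tag in PAIRS:
    LEXICON[surface].append((lemma, tag))


def analyze(surface):
    """Analysis: look the surface form up; no rules are involved."""
    return list(LEXICON.get(surface, []))


def generate(lemma, tag):
    """Generation: scan the pairs for a matching lexical form and tag."""
    return [s for s, l, t in PAIRS if l == lemma and t == tag]


print(analyze("acclaimed"))
print(generate("acclaim", "V+s"))
```

The cost of this design is visible even in the toy table: every inflected form needs its own row, which is why the slide asks about languages with much richer morphology.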
Lexicon and Rules: FSA Inflectional Morphology • English Noun Lexicon • English Noun Rule
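The slide's lexicon and rule diagrams are not reproduced here, so the following is only a rough Python sketch of the lexicon-plus-rules idea, not the FSA from the lecture. It assumes a tiny hypothetical noun lexicon (`REGULAR_NOUNS`, `IRREGULAR_SG`, `IRREGULAR_PL`) and a single rule, "regular noun + s = plural". Spelling changes such as fox/foxes or city/cities are deliberately left out.

```python
# Hypothetical toy lexicon; the entries are illustrative only.
REGULAR_NOUNS = {"cat", "dog", "duck"}
IRREGULAR_SG = {"goose", "mouse"}
IRREGULAR_PL = {"geese": "goose", "mice": "mouse"}


def analyze_noun(word):
    """Return all noun analyses of `word` as 'stem +N +SG/+PL' strings."""
    analyses = []
    # Lexicon: singular forms are listed directly.
    if word in REGULAR_NOUNS or word in IRREGULAR_SG:
        analyses.append(f"{word} +N +SG")
    # Lexicon: irregular plurals are listed as exceptions.
    if word in IRREGULAR_PL:
        analyses.append(f"{IRREGULAR_PL[word]} +N +PL")
    # Rule: a regular noun followed by -s is a plural.
    if word.endswith("s") and word[:-1] in REGULAR_NOUNS:
        analyses.append(f"{word[:-1]} +N +PL")
    return analyses


for w in ["cats", "cat", "geese", "goose", "blorks"]:
    print(w, analyze_noun(w))
```

Compared with the lexicon-only table, the regular plurals no longer need their own entries; only the irregular forms are stored.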
Rules Only: Stemming • Sometimes you just need to know the stem of a word and you don’t care about the structure • In fact you may not even care if you get the right stem, as long as you get a consistent string • This is stemming… it most often shows up in IR (Information Retrieval) applications • IR: the task of locating most relevant documents in a collection given a query as a set of keywords
Stemming in IR • Run a stemmer on the documents to be indexed • Run a stemmer on users' queries • Match
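The three steps above can be sketched in Python as follows. This is an illustration under stated assumptions: `toy_stem` is a stand-in that applies only four plural-stripping rules (SSES -> SS, IES -> I, SS -> SS, S -> null), not a full stemmer, and the documents, `build_index`, and `search` are hypothetical names invented for the example.

```python
from collections import defaultdict


def toy_stem(word):
    """Stand-in stemmer: plural stripping only, longest suffix first."""
    for suffix, replacement in (("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")):
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word


def build_index(docs):
    """Step 1: stem every document token and record which docs contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[toy_stem(token)].add(doc_id)
    return index


def search(index, query):
    """Steps 2 and 3: stem the query terms, then match (AND of all terms)."""
    result = None
    for token in query.lower().split():
        hits = index.get(toy_stem(token), set())
        result = hits if result is None else result & hits
    return result or set()


docs = {1: "ponies and cats", 2: "the cat sat", 3: "dogs bark"}
index = build_index(docs)
print(search(index, "cats"))
print(search(index, "cat"))
```

Because documents and queries go through the same function, it does not matter that a stem such as `poni` is not a real word; it only has to be consistent on both sides.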
Porter Stemmer • No lexicon needed • Basically a set of staged sets of rewrite rules that strip suffixes • Handles both inflectional and derivational suffixes • Doesn’t guarantee that the resulting stem is really a stem • Lack of guarantee doesn’t matter for IR • In a test on the well-known Cranfield 200 collection it gave an improvement in retrieval performance when compared with a very much more elaborate program which has been in use in IR research in Cambridge since 1971 • DAWSON, J.L. Suffix Removal and Word Conflation. ALLC Bulletin, Michaelmas 1974 p.33-46 • ANDREWS, K. The Development of a Fast Conflation Algorithm for English. Dissertation for the Diploma in Computer Science, Computer Laboratory, University of Cambridge, 1971
Porter Stemmer: Examples
  WORD           STEM
  wear           wear
  wearable       wearabl
  wearer         wearer
  wearied        weari
  wearier        wearier
  weariest       weariest
  wearily        wearili
  weariness      weari
  wearing        wear
  wearisome      wearisom
  wearisomely    wearisom
  wears          wear
  weather        weather
  weathercock    weathercock
  weathercocks   weathercock
  web            web
  Webb           webb
  Webber         webber
  webs           web
  Webster        webster
  Websterville   webstervil
  wedded         wedd
  wedding        wedd
  weddings       wedd
  wedge          wedg
  wedged         wedg
  wedges         wedg
  wedging        wedg
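Definitions • C = consonant = a letter other than A, E, I, O, U, and other than Y preceded by a consonant • V = vowel = any letter that is not a consonant (so Y counts as a vowel when it follows a consonant) • Every word (or part of a word) has one of these forms, where each C or V stands for a run of one or more consonants or vowels: • CVCV … C • CVCV … V • VCVC … C • VCVC … V • C* means zero or more consonants; similarly V* • C+ means one or more consonants; similarly V+ • M = measure = the number of V+C+ sequences in the word: Word = C*(V+C+){M}V*
  M=0   TR, EE, TREE, Y, BY
  M=1   TROUBLE, OATS, TREES, IVY
  M=2   TROUBLES, PRIVATE, OATEN, ORRERY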
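The measure can be computed by labelling each letter as consonant or vowel and counting vowel-to-consonant transitions. A minimal Python sketch, assuming the definitions above; the helper names `cv_pattern` and `measure` are invented for this example.

```python
def cv_pattern(word):
    """Return one 'c' or 'v' mark per letter of `word`."""
    word = word.lower()
    pattern = ""
    for i, ch in enumerate(word):
        # Y is a vowel only when it directly follows a consonant.
        if ch in "aeiou" or (ch == "y" and i > 0 and pattern[i - 1] == "c"):
            pattern += "v"
        else:
            pattern += "c"
    return pattern


def measure(word):
    """M = number of V+C+ sequences = number of 'vc' transitions."""
    return cv_pattern(word).count("vc")


EXAMPLES = {
    0: ["TR", "EE", "TREE", "Y", "BY"],
    1: ["TROUBLE", "OATS", "TREES", "IVY"],
    2: ["TROUBLES", "PRIVATE", "OATEN", "ORRERY"],
}
for expected, words in EXAMPLES.items():
    for w in words:
        print(w, cv_pattern(w), measure(w))
```

Counting `vc` transitions works because every V+C+ sequence contains exactly one point where a vowel run hands over to a consonant run.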
Definitions • Conditions
  *S   stem ends with S (and similarly for the other letters)
  *v*  stem contains a V
  *d   stem ends with a double C, e.g. -DD, -ZZ
  *o   stem ends CVC, where the second C is not W, X or Y, e.g. -WIL, -SOB
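Each condition is a small predicate on the stem. The sketch below is one possible Python rendering, assuming the consonant/vowel definitions from the previous slide; the function names are hypothetical.

```python
def cv_pattern(word):
    """One 'c' or 'v' per letter; Y is a vowel only after a consonant."""
    word = word.lower()
    pattern = ""
    for i, ch in enumerate(word):
        if ch in "aeiou" or (ch == "y" and i > 0 and pattern[i - 1] == "c"):
            pattern += "v"
        else:
            pattern += "c"
    return pattern


def ends_with(stem, letter):
    """*S : stem ends with the given letter."""
    return stem.lower().endswith(letter.lower())


def contains_vowel(stem):
    """*v* : stem contains a vowel."""
    return "v" in cv_pattern(stem)


def ends_double_consonant(stem):
    """*d : stem ends with the same consonant twice."""
    stem = stem.lower()
    return len(stem) >= 2 and stem[-1] == stem[-2] and cv_pattern(stem)[-1] == "c"


def ends_cvc(stem):
    """*o : stem ends consonant-vowel-consonant, last one not W, X or Y."""
    stem = stem.lower()
    return cv_pattern(stem)[-3:] == "cvc" and stem[-1] not in "wxy"


for s in ["wil", "sob", "fail", "snow", "hopp", "hiss", "bl"]:
    print(s, contains_vowel(s), ends_double_consonant(s), ends_cvc(s))
```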
Stemming Rules • The rules for removing a suffix are given in the form (condition) S1 -> S2 • This means: if a word ends with the suffix S1, and the stem before S1 satisfies the given condition, S1 is replaced by S2 • The condition is usually given in terms of m, for example: (m > 1) EMENT -> • Here S1 is `EMENT' and S2 is null. This would map REPLACEMENT to REPLAC, since REPLAC is a word part for which m = 2. • The condition part may also contain expressions with and, or and not • (m>1 and (*S or *T)) tests for a stem with m>1 ending in S or T • (*d and not (*L or *S or *Z)) tests for a stem ending with a double consonant other than L, S or Z
Stemming Rules • In a set of rules written beneath each other, only one is obeyed, and this will be the one with the longest matching S1 for the given word • For example, with
  SSES -> SS
  IES  -> I
  SS   -> SS
  S    ->
CARESSES maps to CARESS since SSES is the longest match for S1. Equally CARESS maps to CARESS (S1=`SS') and CARES to CARE (S1=`S')
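A short Python sketch of the longest-match convention, using the four example rules above. `apply_longest` is a hypothetical helper name, and conditions are omitted because these four rules are unconditional.

```python
# (S1, S2) pairs from the example; an empty S2 means "delete the suffix".
RULES = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]


def apply_longest(word, rules):
    """Apply only the rule whose S1 is the longest suffix of `word`."""
    for s1, s2 in sorted(rules, key=lambda rule: len(rule[0]), reverse=True):
        if word.endswith(s1):
            return word[: len(word) - len(s1)] + s2
    return word  # no rule matched


for w in ["caresses", "caress", "cares", "ponies"]:
    print(w, "->", apply_longest(w, RULES))
```

The rule SS -> SS looks like a no-op, but it matters: it matches before S -> null can fire, which is what protects CARESS from losing its final S.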
Porter Stemmer
Legend: *<S> = stem ends with <S>; *v* = stem contains a V; *d = stem ends with double C; *o = stem ends with CVC, where the second C is not W, X or Y

Step 1: Plural Nouns and Third Person Singular Verbs
  SSES -> SS            caresses -> caress
  IES  -> I             ponies -> poni, ties -> ti
  SS   -> SS            caress -> caress
  S    ->               cats -> cat

Step 2a: Verbal Past Tense and Progressive Forms
      (M>0) EED -> EE   feed -> feed, agreed -> agree
  i.  (*v*) ED  ->      plastered -> plaster, bled -> bled
  ii. (*v*) ING ->      motoring -> motor, sing -> sing

Step 2b: Cleanup (applied only if 2a.i or 2a.ii is successful)
  AT -> ATE                                         conflat(ed) -> conflate
  BL -> BLE                                         troubl(ed) -> trouble
  IZ -> IZE                                         siz(ed) -> size
  (*d and not (*L or *S or *Z)) -> single letter    hopp(ing) -> hop, tann(ed) -> tan, hiss(ing) -> hiss, fizz(ed) -> fizz
  (M=1 and *o) -> E                                 fail(ing) -> fail, fil(ing) -> file
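Steps 2a and 2b can be sketched in Python as below. This is an illustrative reading of the rules on this slide, not a reference implementation; `step2` and `cleanup` are hypothetical names, and the helpers assume the consonant, measure, and condition definitions given earlier.

```python
def cv_pattern(word):
    """One 'c' or 'v' per letter; Y is a vowel only after a consonant."""
    pattern = ""
    for i, ch in enumerate(word):
        if ch in "aeiou" or (ch == "y" and i > 0 and pattern[i - 1] == "c"):
            pattern += "v"
        else:
            pattern += "c"
    return pattern


def measure(stem):
    return cv_pattern(stem).count("vc")


def ends_cvc(stem):
    return cv_pattern(stem)[-3:] == "cvc" and stem[-1] not in "wxy"


def ends_double_consonant(stem):
    return len(stem) >= 2 and stem[-1] == stem[-2] and cv_pattern(stem)[-1] == "c"


def cleanup(stem):
    """Step 2b: repair the stem left after removing -ED or -ING."""
    if stem.endswith(("at", "bl", "iz")):
        return stem + "e"
    if ends_double_consonant(stem) and stem[-1] not in "lsz":
        return stem[:-1]  # double consonant -> single letter
    if measure(stem) == 1 and ends_cvc(stem):
        return stem + "e"
    return stem


def step2(word):
    """Step 2a, followed by Step 2b when -ED or -ING was removed."""
    if word.endswith("eed"):  # longest match, so -ED is not tried
        stem = word[:-3]
        return stem + "ee" if measure(stem) > 0 else word
    for suffix in ("ed", "ing"):
        if word.endswith(suffix):
            stem = word[: len(word) - len(suffix)]
            return cleanup(stem) if "v" in cv_pattern(stem) else word
    return word


for w in ["feed", "agreed", "plastered", "bled", "sing", "hopping", "filing"]:
    print(w, "->", step2(w))
```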
Porter Stemmer Step 3: Y -> I
  (*v*) Y -> I          happy -> happi, sky -> sky
Porter Stemmer Step 4: Derivational Morphology I
  (m>0) ATIONAL -> ATE    relational -> relate
  (m>0) TIONAL  -> TION   conditional -> condition, rational -> rational
  (m>0) ENCI    -> ENCE   valenci -> valence
  (m>0) ANCI    -> ANCE   hesitanci -> hesitance
  (m>0) IZER    -> IZE    digitizer -> digitize
  (m>0) ABLI    -> ABLE   conformabli -> conformable
  (m>0) ALLI    -> AL     radicalli -> radical
  (m>0) ENTLI   -> ENT    differentli -> different
  (m>0) ELI     -> E      vileli -> vile
  (m>0) OUSLI   -> OUS    analogousli -> analogous
  (m>0) IZATION -> IZE    vietnamization -> vietnamize
  (m>0) ATION   -> ATE    predication -> predicate
  (m>0) ATOR    -> ATE    operator -> operate
  (m>0) ALISM   -> AL     feudalism -> feudal
  (m>0) IVENESS -> IVE    decisiveness -> decisive
  (m>0) FULNESS -> FUL    hopefulness -> hopeful
  (m>0) OUSNESS -> OUS    callousness -> callous
  (m>0) ALITI   -> AL     formaliti -> formal
  (m>0) IVITI   -> IVE    sensitiviti -> sensitive
  (m>0) BILITI  -> BLE    sensibiliti -> sensible
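Because every rule in this step has the same condition (m>0), the step lends itself to a table-driven implementation. The Python sketch below is one way to write it, assuming the measure definition given earlier; `STEP4_RULES` and `step4` are names chosen for this example.

```python
def measure(stem):
    """Number of vowel-to-consonant transitions (Y is a vowel after a consonant)."""
    pattern = ""
    for i, ch in enumerate(stem):
        is_vowel = ch in "aeiou" or (ch == "y" and i > 0 and pattern[i - 1] == "c")
        pattern += "v" if is_vowel else "c"
    return pattern.count("vc")


# S1 -> S2 table copied from the slide; all rules share the condition m > 0.
STEP4_RULES = {
    "ational": "ate", "tional": "tion", "enci": "ence", "anci": "ance",
    "izer": "ize", "abli": "able", "alli": "al", "entli": "ent",
    "eli": "e", "ousli": "ous", "ization": "ize", "ation": "ate",
    "ator": "ate", "alism": "al", "iveness": "ive", "fulness": "ful",
    "ousness": "ous", "aliti": "al", "iviti": "ive", "biliti": "ble",
}


def step4(word):
    """Apply the longest matching rule, but only if the stem has m > 0."""
    for s1 in sorted(STEP4_RULES, key=len, reverse=True):
        if word.endswith(s1):
            stem = word[: len(word) - len(s1)]
            if measure(stem) > 0:
                return stem + STEP4_RULES[s1]
            return word  # longest match found but condition failed: stop
    return word


for w in ["relational", "conditional", "rational", "vietnamization"]:
    print(w, "->", step4(w))
```

The `rational` case shows the two conventions interacting: ATIONAL is the longest match, its stem `r` has m = 0, and no shorter rule such as TIONAL is tried afterwards.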
Porter Stemmer Step 5: Derivational Morphology II: More Suffixes
  (m>0) ICATE -> IC       triplicate -> triplic
  (m>0) ATIVE ->          formative -> form
  (m>0) ALIZE -> AL       formalize -> formal
  (m>0) ICITI -> IC       electriciti -> electric
  (m>0) ICAL  -> IC       electrical -> electric
  (m>0) FUL   ->          hopeful -> hope
  (m>0) NESS  ->          goodness -> good
Porter Stemmer Step 6: Derivational Morphology III: Even More Suffixes
  (m>1) AL    ->                    revival -> reviv
  (m>1) ANCE  ->                    allowance -> allow
  (m>1) ENCE  ->                    inference -> infer
  (m>1) ER    ->                    airliner -> airlin
  (m>1) IC    ->                    gyroscopic -> gyroscop
  (m>1) ABLE  ->                    adjustable -> adjust
  (m>1) IBLE  ->                    defensible -> defens
  (m>1) ANT   ->                    irritant -> irrit
  (m>1) EMENT ->                    replacement -> replac
  (m>1) MENT  ->                    adjustment -> adjust
  (m>1) ENT   ->                    dependent -> depend
  (m>1 and (*S or *T)) ION ->       adoption -> adopt
  (m>1) OU    ->                    homologou -> homolog
  (m>1) ISM   ->                    communism -> commun
  (m>1) ATE   ->                    activate -> activ
  (m>1) ITI   ->                    angulariti -> angular
  (m>1) OUS   ->                    homologous -> homolog
  (m>1) IVE   ->                    effective -> effect
  (m>1) IZE   ->                    bowdlerize -> bowdler
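The suffix-removal rules on this slide all require m>1, with one extra condition for ION. A hedged Python sketch follows, assuming the measure definition given earlier; `SUFFIXES` and `strip_suffix` are hypothetical names for this example.

```python
def measure(stem):
    """Number of vowel-to-consonant transitions (Y is a vowel after a consonant)."""
    pattern = ""
    for i, ch in enumerate(stem):
        is_vowel = ch in "aeiou" or (ch == "y" and i > 0 and pattern[i - 1] == "c")
        pattern += "v" if is_vowel else "c"
    return pattern.count("vc")


# Suffixes from the slide; each is deleted when the stem has m > 1.
SUFFIXES = ["al", "ance", "ence", "er", "ic", "able", "ible", "ant", "ement",
            "ment", "ent", "ion", "ou", "ism", "ate", "iti", "ous", "ive", "ize"]


def strip_suffix(word):
    """Delete the longest matching suffix if its condition holds."""
    for s1 in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(s1):
            stem = word[: len(word) - len(s1)]
            ok = measure(stem) > 1
            if s1 == "ion":  # extra condition: stem must end in S or T
                ok = ok and stem.endswith(("s", "t"))
            return stem if ok else word
    return word


for w in ["replacement", "adjustment", "dependent", "adoption", "relate", "activate"]:
    print(w, "->", strip_suffix(w))
```

Sorting by length makes EMENT win over MENT and ENT for `replacement`, which is the longest-match convention applied to overlapping suffixes.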
Porter Stemmer Step 7a: Cleanup
  (m>1) E ->                            probate -> probat, rate -> rate
  (m=1 and not *o) E ->                 cease -> ceas
Step 7b: More Cleanup
  (m>1 and *d and *L) -> single letter  controll -> control, roll -> roll
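The two cleanup rules are small enough to write out directly. The following Python sketch assumes the measure and *o definitions given earlier; `step7a` and `step7b` are illustrative names.

```python
def cv_pattern(word):
    """One 'c' or 'v' per letter; Y is a vowel only after a consonant."""
    pattern = ""
    for i, ch in enumerate(word):
        if ch in "aeiou" or (ch == "y" and i > 0 and pattern[i - 1] == "c"):
            pattern += "v"
        else:
            pattern += "c"
    return pattern


def measure(stem):
    return cv_pattern(stem).count("vc")


def ends_cvc(stem):
    """*o : ends consonant-vowel-consonant, last consonant not W, X or Y."""
    return cv_pattern(stem)[-3:] == "cvc" and stem[-1] not in "wxy"


def step7a(word):
    """Drop a final E if m > 1, or if m = 1 and the stem does not end CVC."""
    if word.endswith("e"):
        stem = word[:-1]
        m = measure(stem)
        if m > 1 or (m == 1 and not ends_cvc(stem)):
            return stem
    return word


def step7b(word):
    """Reduce a final LL to L when m > 1."""
    if measure(word) > 1 and word.endswith("ll"):
        return word[:-1]
    return word


for w in ["probate", "rate", "cease"]:
    print(w, "->", step7a(w))
for w in ["controll", "roll"]:
    print(w, "->", step7b(w))
```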
Some Insights • The algorithm is careful not to remove a suffix when the stem is too short, the length of the stem being given by its measure, m • For example, in the following two lists:
  list A      list B
  ------      ------
  RELATE      DERIVATE
  PROBATE     ACTIVATE
  CONFLATE    DEMONSTRATE
  PIRATE      NECESSITATE
  PRELATE     RENOVATE
-ATE is removed from the list B words, but not from the list A words. • The fact that no attempt is made to identify prefixes can make the results look rather inconsistent • PRELATE does not lose the -ATE, but ARCHPRELATE becomes ARCHPREL • In practice this does not matter too much, because the presence of the prefix decreases the probability of an erroneous conflation
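More Insights • Complex suffixes are removed bit by bit in the different steps
  GENERALIZATIONS -> GENERALIZATION (Step 1) -> GENERALIZE (Step 2) -> GENERAL (Step 3) -> GENER (Step 4)
  OSCILLATORS -> OSCILLATOR (Step 1) -> OSCILLATE (Step 2) -> OSCILL (Step 4) -> OSCIL (Step 5)
• The step numbers in these traces and in the table below follow Porter's original five-step numbering: Steps 1 to 3 of these slides correspond to Porter's Step 1, and the three derivational steps and the final cleanup correspond to Porter's Steps 2, 3, 4, and 5 • In a vocabulary of 10,000 words, the reduction in size of the stem was distributed among the steps as follows:
  Suffix stripping of a vocabulary of 10,000 words
  ------------------------------------------------
  Number of words reduced in step 1:   3597
  Number of words reduced in step 2:    766
  Number of words reduced in step 3:    327
  Number of words reduced in step 4:   2424
  Number of words reduced in step 5:   1373
  Number of words not reduced:         3650
The resulting vocabulary of stems contained 6370 distinct entries (reduced the size of the vocabulary by about one third).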
Morphology: From Morphemes to Lemmas • Lemma: lexical unit, “pointer” to lexicon • Set of lexical forms having the same stem, same major part-of-speech, and same word sense • might as well be a number, but typically is represented as the “base form”, or “dictionary headword” • possibly indexed when ambiguous/polysemous: state1 (verb), state2 (state-of-affairs), state3 (government)
Lexeme, Morpheme, Phoneme • Lexeme: individual entry in the lexicon • Allomorph: a variant form of a morpheme. The meaning remains the same, while the sound can vary (-ed in fished is [t], in buzzed is [d]) • Allophone: one of several similar phones, or speech sounds, that belong to the same phoneme. Each allophone is used in a specific phonetic context.
Summary • Morphology • Stemming • Porter’s Stemmer; see the link to the code on the class web page
Next Time • Part of Speech Tagging