CMSC 723: Intro to Computational Linguistics • February 11, 2003 • Lecture 3: Finite-State Morphology • Prof. Bonnie J. Dorr and Dr. Nizar Habash • TAs: Nitin Madnani and Nate Waisbrot
Plan for Today’s Lecture • Morphology: Definitions and Problems • What is Morphology? • Topology of Morphologies • Approaches to Computational Morphology • Lexicons and Rules • Computational Morphology Approaches • Assignment 2
Morphology • The study of the way words are built up from smaller meaning-bearing units called morphemes • Abstract versus Realized • HOP +PAST -> hop +ed -> hopped -> /hapt/ • Each mapping depends on context
Phonology and Morphology • Phonology vs. Orthography • Historical spelling • night, nite • attention, mission, fish • Script Limitations • Spoken English has 14 vowels • heed hid hayed head had hoed hood who’d hide how’d taught Tut toy enough • English Alphabet has 5 • Use vowel combinations: far fair fare • Consonantal doubling (hopping vs. hoping)
Syntax and Morphology • Phrase-level agreement • Subject-Verb • John studies hard (STUDY+3SG) • Noun-Adjective • Las vacas hermosas • Sub-word phrasal structures • שבספרינו • ש+ב+ספר+ים+נו • That+in+book+PL+Poss:1PL • Which are in our books • Morpheme labels: conj, prep, noun, article, plural, poss
Topology of Morphologies • Concatenative vs. Templatic • Derivational vs. Inflectional • Regular vs. Irregular
Concatenative Morphology • Morpheme+Morpheme+Morpheme+… • Stems: also called lemma, base form, root, lexeme • hope+ing -> hoping, hop+ing -> hopping • Affixes • Prefixes: anti-, dis- in Antidisestablishmentarianism • Suffixes: -ment, -arian, -ism in Antidisestablishmentarianism • Infixes: hingi (borrow) – humingi (borrower) in Tagalog • Circumfixes: sagen (say) – gesagt (said) in German • Agglutinative Languages • uygarlaştıramadıklarımızdanmışsınızcasına • uygar+laş+tır+ama+dık+lar+ımız+dan+mış+sınız+casına • Behaving as if you are among those whom we could not cause to become civilized
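Templatic Morphology • Roots and Patterns • Root: K T B (Arabic ك ت ب, Hebrew כ ת ב) • Arabic pattern ma??uu? with the root filled in: مكتوب maktuub "written" • Hebrew pattern ??uu? with the root filled in: כתוב ktuuv "written"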
Templatic Morphology: Root Meaning • KTB: writing “stuff” • write: كتب / כתב • book: كتاب • spelling: כתיב • library: مكتبة • letter: مكتوب / מכתב • address: כתובת • office: مكتب • writer: كاتب / כתב
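The root-and-pattern idea on the two slides above can be sketched in a few lines of Python. This is only an illustration: the digit-slot notation (`1`, `2`, `3` for the three root consonants) and the function name `interdigitate` are assumptions made here, and the transliterations other than maktuub are supplied for the example rather than taken from the slides.

```python
def interdigitate(root, pattern):
    """Fill the numbered slots 1, 2, 3 of a pattern with the root consonants.

    Hypothetical notation: digits mark root slots; every other
    character of the pattern is copied through unchanged.
    """
    return "".join(
        root[int(ch) - 1] if ch.isdigit() else ch
        for ch in pattern
    )


if __name__ == "__main__":
    root = "ktb"  # K T B, the "writing" root
    # Pattern transliterations below are illustrative assumptions.
    for pattern in ["ma12uu3", "1i2aa3", "1aa2i3", "ma12a3"]:
        print(pattern, "->", interdigitate(root, pattern))
```

The Hebrew form ktuuv also needs a sound change (b to v) that this sketch does not model.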
Inflectional vs. Derivational • Word Classes • Parts of speech: noun, verb, adjectives, etc. • Word class dictates how a word combines with morphemes to form new words
Derivational morphology • Nominalization: computerization, appointee, killer, fuzziness • Formation of adjectives: computational, clueless, embraceable • CatVar: Categorial Variation Database http://clipdemos.umiacs.umd.edu/catvar/
Inflectional morphology • Adds: Tense, number, person, mood, aspect • Word class doesn’t change • Word serves new grammatical role • Five verb forms in English • Other languages have (lots more)
Nouns and Verbs (in English) • Nouns have simple inflectional morphology • cat • cat+s, cat+’s • Verbs have more complex morphology
Regulars and Irregulars • Nouns • Cat/Cats • Mouse/Mice, Ox/Oxen, Goose/Geese • Verbs • Walk/Walked • Go/Went, Fly/Flew
Computational Morphology • Finite State Morphology • Finite State Transducers (FST) • Input/Output • Analysis/Generation
Computational Morphology • WORD -> STEM (+FEATURES)* • cats -> cat +N +PL • cat -> cat +N +SG • cities -> city +N +PL • geese -> goose +N +PL • ducks -> (duck +N +PL) or (duck +V +3SG) • merging -> merge +V +PRES-PART • caught -> (catch +V +PAST-PART) or (catch +V +PAST)
Computational Morphology • The Rules and the Lexicon • General versus Specific • Regular versus Irregular • Accuracy, speed, space • The Morphology of a language • Approaches • Lexicon only • Rules only • Lexicon and Rules • Finite-state Automata • Finite-state Transducers
Lexicon-only Morphology • The lexicon lists all surface level and lexical level pairs • No rules …? • Analysis/Generation is easy • Very large for English • What about Arabic or Turkish? • Chinese?
Surface form     Lexical form
acclaim          acclaim $N$
acclaim          acclaim $V+0$
acclaimed        acclaim $V+ed$
acclaimed        acclaim $V+en$
acclaiming       acclaim $V+ing$
acclaims         acclaim $N+s$
acclaims         acclaim $V+s$
acclamation      acclamation $N$
acclamations     acclamation $N+s$
acclimate        acclimate $V+0$
acclimated       acclimate $V+ed$
acclimated       acclimate $V+en$
acclimates       acclimate $V+s$
acclimating      acclimate $V+ing$
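A lexicon-only analyzer is essentially a pair of lookup tables. The sketch below uses a subset of the entries listed on the slide; the names `analyze` and `generate` are assumptions made for illustration, not part of any tool mentioned in the lecture.

```python
from collections import defaultdict

# (surface form, lexical form) pairs, copied from the slide's listing.
PAIRS = [
    ("acclaim", "acclaim $N$"),
    ("acclaim", "acclaim $V+0$"),
    ("acclaimed", "acclaim $V+ed$"),
    ("acclaimed", "acclaim $V+en$"),
    ("acclaiming", "acclaim $V+ing$"),
    ("acclaims", "acclaim $N+s$"),
    ("acclaims", "acclaim $V+s$"),
]

# One table per direction: surface -> lexical and lexical -> surface.
analysis = defaultdict(list)
generation = defaultdict(list)
for surface, lexical in PAIRS:
    analysis[surface].append(lexical)
    generation[lexical].append(surface)


def analyze(surface):
    """Return every lexical form listed for a surface form."""
    return analysis.get(surface, [])


def generate(lexical):
    """Return every surface form listed for a lexical form."""
    return generation.get(lexical, [])


if __name__ == "__main__":
    print(analyze("acclaimed"))
    print(generate("acclaim $V+s$"))
    print(analyze("cats"))  # unlisted words get no analysis at all
```

Lookup is trivial in both directions, but every inflected form has to be listed explicitly, which is why the slide asks about Arabic or Turkish.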
Lexicon and Rules: FSA Inflectional Morphology • English Noun Lexicon • English Noun Rule
Using FSAs for Recognition: English Nouns and their Inflection
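The figure for this slide is not preserved, so the following is only a sketch of what an FSA recognizer for English noun inflection might look like. The state names, the morpheme classes (`reg-noun`, `irreg-sg-noun`, `irreg-pl-noun`, `plural-s`) and the tiny lexicon are all assumptions chosen for illustration.

```python
# Hypothetical sub-lexicons, one per morpheme class.
LEXICON = {
    "reg-noun": {"cat", "dog"},
    "irreg-sg-noun": {"goose", "mouse", "ox"},
    "irreg-pl-noun": {"geese", "mice", "oxen"},
    "plural-s": {"s"},
}

# Arcs of the automaton: (state, morpheme class) -> next state.
TRANSITIONS = {
    ("q0", "reg-noun"): "q1",
    ("q0", "irreg-sg-noun"): "q2",
    ("q0", "irreg-pl-noun"): "q2",
    ("q1", "plural-s"): "q2",
}
FINAL = {"q1", "q2"}


def classes(morph):
    """Return the morpheme classes whose sub-lexicon contains morph."""
    return [name for name, members in LEXICON.items() if morph in members]


def accepts(word, state="q0"):
    """True if some split of word into morphemes drives the FSA to a final state."""
    if not word:
        return state in FINAL
    for cut in range(1, len(word) + 1):
        morph, rest = word[:cut], word[cut:]
        for cls in classes(morph):
            nxt = TRANSITIONS.get((state, cls))
            if nxt is not None and accepts(rest, nxt):
                return True
    return False


if __name__ == "__main__":
    for w in ["cat", "cats", "geese", "gooses", "oxen"]:
        print(w, accepts(w))
```

A recognizer like this only answers yes or no; it does not return the analysis, and it knows nothing about spelling changes.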
Morphological Parsing • Finite-state automata (FSA) • Recognizer • One-level morphology • Finite-state transducers (FST) • Two-level morphology • PC-Kimmo (Koskenniemi 83) • input-output pair
Terminology for PC-Kimmo • Upper = lexical tape • Lower = surface tape • Characters correspond to pairs, written a:b • If “a:a”, write “a” for shorthand • Two-level lexical entries • # = word boundary • ^ = morpheme boundary • Other = “any feasible pair that is not in this transducer”
Four-Fold View of FSTs • As a recognizer • As a generator • As a translator • As a set relater
Chomsky and Halle Notation • ε → e / { x, s, z } ^ __ s #
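Read as an e-insertion rule (insert e after a morpheme-final x, s or z when the suffix s follows at the end of the word), the notation can be approximated with a regular-expression substitution. This is a sketch, not a transducer: the function name `e_insertion` and the convention of passing the intermediate form as a string such as `fox^s#` are assumptions made here.

```python
import re


def e_insertion(intermediate):
    """Apply e-insertion to an intermediate form, then strip boundary symbols.

    '^' marks a morpheme boundary and '#' the word boundary,
    following the PC-Kimmo terminology slide.
    """
    # Insert 'e' after x, s or z + morpheme boundary, when 's#' follows.
    with_e = re.sub(r"([xsz])\^(?=s#)", r"\1^e", intermediate)
    # Remove the boundary symbols to obtain the surface string.
    return with_e.replace("^", "").replace("#", "")


if __name__ == "__main__":
    for form in ["fox^s#", "kiss^s#", "buzz^s#", "cat^s#"]:
        print(form, "->", e_insertion(form))
```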
FST Properties • Inversion • T⁻¹ = inversion of T • Input/Output switched • Composition • T1 maps I1 to O1 • T2 maps I2 to O2 • T1 ∘ T2 maps I1 to O2
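Treating a transducer as a finite set of (input, output) string pairs makes both properties easy to see. The sketch below assumes that simplification (a real FST encodes a possibly infinite relation), and the example pairs are illustrative.

```python
def invert(t):
    """Swap input and output of every pair: T becomes its inverse."""
    return {(out, inp) for inp, out in t}


def compose(t1, t2):
    """Chain T1 into T2: keep (i1, o2) whenever T1's output equals T2's input."""
    return {(i1, o2) for i1, o1 in t1 for i2, o2 in t2 if o1 == i2}


if __name__ == "__main__":
    # T1: lexical level -> intermediate level (illustrative pairs).
    t1 = {("cat +N +PL", "cat^s#"), ("fox +N +PL", "fox^s#")}
    # T2: intermediate level -> surface level.
    t2 = {("cat^s#", "cats"), ("fox^s#", "foxes")}

    generator = compose(t1, t2)   # lexical -> surface
    analyzer = invert(generator)  # surface -> lexical
    print(sorted(generator))
    print(sorted(analyzer))
```

Inversion is what lets one transducer serve for both analysis and generation.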
FSTs and ambiguity • Kimmo Demo • Parse Example 1: unionizable • union +ize +able • un+ ion +ize +able • Parse Example 2: assess • assess_V • ass_N +ess_N • Parse Example 3: tender • tender_AJ • ten_Num +d_AJ +er_CMP
What to do about Global Ambiguity? • Accept first successful structure • Run parser through all possible paths • Bias the search in some manner
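The three strategies can be compared on a toy segmenter that enumerates every way of splitting a word into known morphs. The morph inventory below is hypothetical and works on surface strings only; the "fewest morphemes" preference is just one possible bias, chosen here for illustration.

```python
def segmentations(word, morphs):
    """Return every split of word into a sequence of items from morphs."""
    if not word:
        return [[]]
    results = []
    for cut in range(1, len(word) + 1):
        head = word[:cut]
        if head in morphs:
            for tail in segmentations(word[cut:], morphs):
                results.append([head] + tail)
    return results


if __name__ == "__main__":
    # Hypothetical morph inventory covering the slide's examples.
    morphs = {"assess", "ass", "ess", "tender", "ten", "d", "er"}
    for word in ["assess", "tender"]:
        all_paths = segmentations(word, morphs)   # run through all paths
        first = all_paths[0]                      # accept first success
        biased = min(all_paths, key=len)          # bias: fewest morphemes
        print(word, all_paths, first, biased)
```

With this search order the first parse found is the implausible one (ass + ess), which shows why accepting the first successful structure is risky.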
Lexicon-Free Morphology: Porter Stemmer • Lexicon-Free FST Approach • By Martin Porter (1980) http://www.tartarus.org/%7Emartin/PorterStemmer/ • Cascade of substitutions given specific conditions • GENERALIZATIONS -> GENERALIZATION -> GENERALIZE -> GENERAL -> GENER • Porter Stemmer Game
Porter Stemmer Definitions • C = consonant = any letter other than A, E, I, O, U, or Y preceded by a consonant • V = vowel = not C • M = Measure: Word = C*(V+C+){M}V* • M=0 TR, EE, TREE, Y, BY • M=1 TROUBLE, OATS, TREES, IVY • M=2 TROUBLES, PRIVATE, OATEN, ORRERY • Conditions • *S - stem ends with S • *v* - stem contains a V • *d - stem ends with double C • -DD, -ZZ • *o - stem ends CVC, where the second C is not W, X or Y • -WIL, -SOB
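The measure M counts the vowel-consonant sequences in a word. A minimal sketch, assuming the consonant definition given on the slide (Y counts as a vowel only when it follows a consonant); the helper names are chosen here for illustration.

```python
VOWELS = set("aeiou")


def is_consonant(word, i):
    """Porter's consonant test for the letter at position i of a lowercase word."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        # y is a consonant word-initially or after a vowel,
        # and a vowel when the preceding letter is a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def cv_form(word):
    """Collapse a word into alternating C/V blocks, e.g. 'troubles' -> 'CVCVC'."""
    word = word.lower()
    blocks = []
    for i in range(len(word)):
        tag = "C" if is_consonant(word, i) else "V"
        if not blocks or blocks[-1] != tag:
            blocks.append(tag)
    return "".join(blocks)


def measure(word):
    """Number of VC sequences, i.e. M in C*(V+C+){M}V*."""
    return cv_form(word).count("VC")


if __name__ == "__main__":
    for w in ["TR", "TREE", "BY", "TROUBLE", "IVY", "TROUBLES", "ORRERY"]:
        print(w, cv_form(w), measure(w))
```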
Condition legend: *<S> = stem ends with <S> • *v* = stem contains a V • *d = stem ends with double C • *o = stem ends with CVC, where the second C is not W, X or Y
Porter Stemmer Step 1: Plural Nouns and Third Person Singular Verbs
SSES -> SS    caresses -> caress
IES -> I    ponies -> poni, ties -> ti
SS -> SS    caress -> caress
S -> ε    cats -> cat
Step 2a: Verbal Past Tense and Progressive Forms
(M>0) EED -> EE    feed -> feed, agreed -> agree
i. (*v*) ED -> ε    plastered -> plaster, bled -> bled
ii. (*v*) ING -> ε    motoring -> motor, sing -> sing
Step 2b: If 2a.i or 2a.ii is successful, Cleanup
AT -> ATE    conflat(ed) -> conflate
BL -> BLE    troubl(ed) -> trouble
IZ -> IZE    siz(ed) -> size
(*d and not (*L or *S or *Z)) -> single letter    hopp(ing) -> hop, tann(ed) -> tan, hiss(ing) -> hiss, fizz(ed) -> fizz
(M=1 and *o) -> E    fail(ing) -> fail, fil(ing) -> file
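Step 1 is an ordered list of suffix rewrites in which the first matching rule wins. A minimal sketch of just that step, assuming lowercase input; the function name `step1` is chosen here for illustration.

```python
# (suffix, replacement) pairs in the order given on the slide.
STEP1_RULES = [
    ("sses", "ss"),
    ("ies", "i"),
    ("ss", "ss"),
    ("s", ""),
]


def step1(word):
    """Apply the first Step 1 rule whose suffix matches; otherwise return word."""
    for suffix, replacement in STEP1_RULES:
        if word.endswith(suffix):
            return word[: -len(suffix)] + replacement
    return word


if __name__ == "__main__":
    for w in ["caresses", "ponies", "ties", "caress", "cats"]:
        print(w, "->", step1(w))
```

The SS -> SS rule looks redundant, but it stops the later S rule from turning caress into cares.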
Porter Stemmer Step 3: Y -> I
(*v*) Y -> I    happy -> happi, sky -> sky
Porter Stemmer Step 4: Derivational Morphology I: Multiple Suffixes
(m>0) ATIONAL -> ATE    relational -> relate
(m>0) TIONAL -> TION    conditional -> condition, rational -> rational
(m>0) ENCI -> ENCE    valenci -> valence
(m>0) ANCI -> ANCE    hesitanci -> hesitance
(m>0) IZER -> IZE    digitizer -> digitize
(m>0) ABLI -> ABLE    conformabli -> conformable
(m>0) ALLI -> AL    radicalli -> radical
(m>0) ENTLI -> ENT    differentli -> different
(m>0) ELI -> E    vileli -> vile
(m>0) OUSLI -> OUS    analogousli -> analogous
(m>0) IZATION -> IZE    vietnamization -> vietnamize
(m>0) ATION -> ATE    predication -> predicate
(m>0) ATOR -> ATE    operator -> operate
(m>0) ALISM -> AL    feudalism -> feudal
(m>0) IVENESS -> IVE    decisiveness -> decisive
(m>0) FULNESS -> FUL    hopefulness -> hopeful
(m>0) OUSNESS -> OUS    callousness -> callous
(m>0) ALITI -> AL    formaliti -> formal
(m>0) IVITI -> IVE    sensitiviti -> sensitive
(m>0) BILITI -> BLE    sensibiliti -> sensible
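The Step 4 rules all share one shape: replace a suffix only if the remaining stem has measure greater than zero. The sketch below implements a handful of them, under the assumption that the longest matching suffix decides whether a rule fires (as in Porter's description of the algorithm); the rule subset and helper names are chosen for illustration.

```python
VOWELS = set("aeiou")


def is_consonant(word, i):
    """Porter's consonant test; y after a consonant counts as a vowel."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Number of vowel-consonant sequences in a lowercase stem."""
    blocks = []
    for i in range(len(stem)):
        tag = "C" if is_consonant(stem, i) else "V"
        if not blocks or blocks[-1] != tag:
            blocks.append(tag)
    return "".join(blocks).count("VC")


# A subset of the Step 4 rules from the slide: suffix -> replacement.
STEP4_RULES = {
    "ational": "ate",
    "tional": "tion",
    "izer": "ize",
    "fulness": "ful",
    "ousness": "ous",
    "biliti": "ble",
}


def step4(word):
    """Rewrite the longest matching suffix if the stem satisfies m > 0."""
    for suffix in sorted(STEP4_RULES, key=len, reverse=True):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if measure(stem) > 0:
                return stem + STEP4_RULES[suffix]
            return word  # longest suffix matched but the condition failed
    return word


if __name__ == "__main__":
    for w in ["relational", "conditional", "rational", "digitizer", "sensibiliti"]:
        print(w, "->", step4(w))
```

The word rational is left alone because its longest matching suffix is ATIONAL, and the remaining stem "r" has measure 0.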
Porter Stemmer Step 5: Derivational Morphology II: More Multiple Suffixes
(m>0) ICATE -> IC    triplicate -> triplic
(m>0) ATIVE -> ε    formative -> form
(m>0) ALIZE -> AL    formalize -> formal
(m>0) ICITI -> IC    electriciti -> electric
(m>0) ICAL -> IC    electrical -> electric
(m>0) FUL -> ε    hopeful -> hope
(m>0) NESS -> ε    goodness -> good
Porter Stemmer Step 6: Derivational Morphology III: Single Suffixes
(m>1) AL -> ε    revival -> reviv
(m>1) ANCE -> ε    allowance -> allow
(m>1) ENCE -> ε    inference -> infer
(m>1) ER -> ε    airliner -> airlin
(m>1) IC -> ε    gyroscopic -> gyroscop
(m>1) ABLE -> ε    adjustable -> adjust
(m>1) IBLE -> ε    defensible -> defens
(m>1) ANT -> ε    irritant -> irrit
(m>1) EMENT -> ε    replacement -> replac
(m>1) MENT -> ε    adjustment -> adjust
(m>1) ENT -> ε    dependent -> depend
(m>1 and (*S or *T)) ION -> ε    adoption -> adopt
(m>1) OU -> ε    homologou -> homolog
(m>1) ISM -> ε    communism -> commun
(m>1) ATE -> ε    activate -> activ
(m>1) ITI -> ε    angulariti -> angular
(m>1) OUS -> ε    homologous -> homolog
(m>1) IVE -> ε    effective -> effect
(m>1) IZE -> ε    bowdlerize -> bowdler
Porter Stemmer Step 7a: Cleanup
(m>1) E -> ε    probate -> probat, rate -> rate
(m=1 and not *o) E -> ε    cease -> ceas
Step 7b: More Cleanup
(m > 1 and *d and *L) -> single letter    controll -> control, roll -> roll
Porter Stemmer • Errors of Omission (related forms not conflated) • European / Europe • analysis / analyzes • matrices / matrix • noise / noisy • explain / explanation • Errors of Commission (unrelated forms conflated) • organization / organ • doing / doe • generalization / generic • numerical / numerous • university / universe