Morphology, Phonology & FSTs
Shallow Processing Techniques for NLP
Ling570
October 12, 2011
Roadmap
• Motivation:
  • Representing words
• A little (mostly English) Morphology
• Stemming
• FSTs & Morphology
• Stemming
• Morphological analysis
• FSTs & Phonology
Words
• Goal: Compact representation of all surface forms in a language
Lexicon
• Goal: Compact representation of all surface forms in a language
• Enumeration:
  • Impractical for morphologically rich languages
  • Descriptively unsatisfying for most languages
• Orthographic variation:
  • fly + er → flier
• Morphological variation:
  • saw + s → saws; fish + s → fish; goose + s → geese
• Phonological variation:
  • dog + s → dog + /z/; fox + s → fox + /IH Z/
Morphological Parsing
• Goal: Take a surface word form and generate a linguistic structure of component morphemes
• A morpheme is the minimal meaning-bearing unit in a language.
  • Stem: the morpheme that forms the central meaning unit in a word
  • Affix: prefix, suffix, infix, circumfix
    • Prefix: e.g., possible → impossible
    • Suffix: e.g., walk → walking
    • Infix: e.g., hingi → humingi (Tagalog)
    • Circumfix: e.g., sagen → gesagt (German)
Combining Morphemes
• Inflection: stem + grammatical morpheme → same class
  • E.g., help + ed → helped
• Derivation: stem + grammatical morpheme → new class
  • E.g., walk + er → walker (N)
• Compounding: multiple stems → new word
  • E.g., doghouse, catwalk, …
• Clitics: stem + clitic
  • I + ’ll → I’ll; he + is → he’s
Inflectional Morphology (Mostly English)
• Relatively simple inflectional system
  • Nouns, verbs, some adjectives
• Noun inflection:
  • Only plural, possessive
  • Non-English languages?
  • Plural: mostly stem + ‘s’; ‘es’ after s, z, sh, ch, x
  • Possessive: singular and irregular plural: + ’s; regular plural and after s, z: ’
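The noun rules above can be sketched in a few lines of Python. This is a minimal illustration, not a complete treatment: the irregular-plural table is a hypothetical three-entry stand-in for a real lexicon, spelling changes such as y → ie are not handled, and the optional bare apostrophe after a singular ending in s or z is not modeled.

```python
import re

# Hypothetical toy table of irregular plurals; a real lexicon would be far larger.
IRREGULAR_PLURALS = {"goose": "geese", "fish": "fish", "child": "children"}


def pluralize(stem: str) -> str:
    """Irregular lookup first; otherwise 'es' after s, z, sh, ch, x; otherwise 's'."""
    if stem in IRREGULAR_PLURALS:
        return IRREGULAR_PLURALS[stem]
    if re.search(r"(s|z|sh|ch|x)$", stem):
        return stem + "es"
    return stem + "s"


def possessive(noun: str, is_regular_plural: bool = False) -> str:
    """Singular and irregular plural nouns take 's; regular plurals take a bare apostrophe."""
    if is_regular_plural:
        return noun + "'"
    return noun + "'s"


if __name__ == "__main__":
    for stem in ("dog", "fox", "church", "goose"):
        print(stem, "->", pluralize(stem))
    print(possessive("children"), possessive("dogs", is_regular_plural=True))
```

The caller has to say whether a noun is a regular plural, because that information comes from the morphological analysis rather than from the spelling alone.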
Verb Inflectional Morphology
• Classes:
  • Main (eat, hit), modal (can, should), primary (be, have)
  • Only main and primary verbs are inflected
• Regular verbs: forms predictable from stem; productive
• Irregular verbs: only about 250, but very frequent
Derivational Morphology
• Relatively complex, common in English
• Nominalization: Verb or Adj + affix → Noun
• Adjectives: Verb or Noun + affix → Adj
Cliticization
• Clitics: between affix and word
  • Like an affix: short, reduced
  • Like a word: act as pronouns, articles, conjunctions, verbs
• In English:
  • Presence is (mostly) unambiguous: marked by the apostrophe (’)
  • Meaning is often ambiguous: e.g., he’s
• More complex in other languages: e.g., Arabic
  • Can prefix (proclitic) article, preposition, conjunction
  • No markers
  • Removal of such clitics is often referred to as light stemming
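For English, the two points above (the apostrophe makes a clitic easy to find, but its meaning stays ambiguous) can be illustrated with a small sketch. The expansion table is a hypothetical, incomplete list chosen for illustration. Possessive ’s is not distinguished from the verbal clitic, and negative n’t is deliberately left alone.

```python
from typing import Dict, List, Tuple

# Hypothetical, incomplete table: clitic -> candidate full forms.
CLITIC_EXPANSIONS: Dict[str, List[str]] = {
    "'ll": ["will"],
    "'s": ["is", "has"],      # ambiguous: "he's" = "he is" or "he has"
    "'re": ["are"],
    "'ve": ["have"],
    "'d": ["would", "had"],   # ambiguous as well
    "'m": ["am"],
}


def split_clitic(token: str) -> Tuple[str, List[str]]:
    """Return (host, candidate expansions); an empty list means no known clitic."""
    normalized = token.replace("\u2019", "'")  # treat the curly apostrophe like the straight one
    host, apostrophe, rest = normalized.partition("'")
    clitic = apostrophe + rest
    if host and clitic in CLITIC_EXPANSIONS:
        return host, CLITIC_EXPANSIONS[clitic]
    return normalized, []


if __name__ == "__main__":
    for token in ("he's", "I'll", "dog", "don't"):
        print(token, "->", split_clitic(token))
```

Choosing between the candidates for he’s needs context, such as the form of the following verb, which this sketch does not attempt.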
Stemming
• Simple type of morphological analysis
• Commonly used in information retrieval (IR)
  • Supports matching using base form
  • E.g., television, televised, televising → televise
  • Typically improves retrieval of short documents. Why?
• Most popular: Porter stemmer (snowball.tartarus.org)
• Task: Given surface form, produce base form
  • Typically, removes suffixes
• Model:
  • Rule cascade
  • No lexicon!
Porter Stemmer
• Rule cascade:
  • Rule form: (condition) PATT1 → PATT2
  • E.g., stem contains vowel: ING → ε
  • ATIONAL → ATE
• Rule partial order:
  • Step 1a: -s
  • Step 1b: -ed, -ing
  • Steps 2-4: derivational suffixes
  • Step 5: cleanup
• Pros: Simple, fast, buildable for a variety of languages
• Cons: Overaggressive and underaggressive
  • Limited in application
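A rule cascade of this shape can be sketched as follows. This is a heavily reduced illustration in the style of the Porter stemmer, not the actual algorithm. It keeps only a handful of rules and drops the measure condition and the cleanup step. The names `STEPS`, `apply_step`, and `stem_word` are hypothetical.

```python
import re
from typing import Callable, List, Tuple

# A rule is (condition on the remaining stem, suffix pattern, replacement).
Rule = Tuple[Callable[[str], bool], str, str]


def always(base: str) -> bool:
    return True


def contains_vowel(base: str) -> bool:
    return re.search(r"[aeiou]", base) is not None


# Steps run in order; within a step, rules are listed longest suffix first.
STEPS: List[List[Rule]] = [
    # Step 1a: plural -s
    [(always, "sses", "ss"), (always, "ies", "i"), (always, "ss", "ss"), (always, "s", "")],
    # Step 1b: -ed, -ing (only if the remaining stem contains a vowel)
    [(contains_vowel, "ed", ""), (contains_vowel, "ing", "")],
    # One derivational rule standing in for steps 2-4
    [(always, "ational", "ate")],
]


def apply_step(word: str, rules: List[Rule]) -> str:
    """Apply the first rule whose suffix matches; if its condition fails, the step does nothing."""
    for condition, suffix, replacement in rules:
        if word.endswith(suffix):
            base = word[: len(word) - len(suffix)]
            return base + replacement if condition(base) else word
    return word


def stem_word(word: str) -> str:
    """Run the word through every step of the cascade in order."""
    word = word.lower()
    for rules in STEPS:
        word = apply_step(word, rules)
    return word


if __name__ == "__main__":
    for w in ("walking", "sing", "relational", "caresses", "televised", "television"):
        print(w, "->", stem_word(w))
```

No lexicon is consulted, so the output need not be a word. In this reduced sketch, televised and televising both become televis while television is left untouched, which shows both overaggressive and underaggressive behavior.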
FST Morphological Analysis
• Focus on English morphology
• FSA acceptor:
  • cats → yes; foxes → yes; childs → no
• FST morphological analyzer:
  • fox + N + Pl → fox^s#
• FST for orthographic rules:
  • fox^s# → foxes#
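The acceptor can be sketched as a small finite-state automaton over morpheme classes. The lexicon, the state names, and the stem/suffix splitting strategy below are assumptions made for illustration. Spelling is not checked here, so this sketch would also accept foxs; ruling that out is the job of the orthographic rules.

```python
from typing import Dict, List, Tuple

# Hypothetical toy lexicon: morpheme -> morpheme class.
LEXICON: Dict[str, str] = {
    "cat": "reg-noun", "fox": "reg-noun", "dog": "reg-noun",
    "child": "irreg-sg-noun", "goose": "irreg-sg-noun",
    "children": "irreg-pl-noun", "geese": "irreg-pl-noun",
    "s": "plural", "es": "plural",
}

# Morphotactic FSA: (state, morpheme class) -> next state.
TRANSITIONS: Dict[Tuple[str, str], str] = {
    ("q0", "reg-noun"): "q1",
    ("q0", "irreg-sg-noun"): "q2",
    ("q0", "irreg-pl-noun"): "q2",
    ("q1", "plural"): "q2",   # only regular nouns may take the plural suffix
}
FINAL_STATES = {"q1", "q2"}


def run_fsa(morphemes: List[str]) -> bool:
    """Follow one transition per morpheme; reject on any missing transition."""
    state = "q0"
    for morpheme in morphemes:
        morpheme_class = LEXICON.get(morpheme)
        if morpheme_class is None or (state, morpheme_class) not in TRANSITIONS:
            return False
        state = TRANSITIONS[(state, morpheme_class)]
    return state in FINAL_STATES


def accepts(word: str) -> bool:
    """Try every stem/suffix split of the word and accept if any split is licensed."""
    for cut in range(1, len(word) + 1):
        stem, suffix = word[:cut], word[cut:]
        if run_fsa([stem] + ([suffix] if suffix else [])):
            return True
    return False


if __name__ == "__main__":
    for w in ("cats", "foxes", "childs", "geese"):
        print(w, "->", "yes" if accepts(w) else "no")
```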
Morphological Analysis Components
• Lexicon: List of stems and affixes
  • E.g., cat: N; -s: Pl
• Morphotactics: Model of morpheme ordering
  • Association with classes, affix ordering
  • E.g., Pl follows N
• Orthographic rules: Spelling rules
  • Changes when morphemes combine
  • E.g., y → ie in try + s
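The orthographic component can be approximated with rewrite rules over an intermediate form in which ^ marks a morpheme boundary and # marks the end of the word. The sketch below uses regular-expression substitutions as a stand-in for a spelling-rule transducer. It covers only e-insertion and y → ie, and the function name `apply_spelling_rules` is hypothetical.

```python
import re


def apply_spelling_rules(intermediate: str) -> str:
    """Map an intermediate form such as 'fox^s#' to a surface form such as 'foxes#'."""
    # E-insertion: add 'e' between a stem ending in s, z, x, ch, sh and the suffix 's'.
    form = re.sub(r"(s|z|x|ch|sh)\^s#", r"\1es#", intermediate)
    # y -> ie: a stem ending in consonant + 'y' before the suffix 's'.
    form = re.sub(r"([^aeiou])y\^s#", r"\1ies#", form)
    # Default: delete any morpheme boundary that no rule consumed.
    return form.replace("^", "")


if __name__ == "__main__":
    for form in ("fox^s#", "try^s#", "cat^s#", "boy^s#"):
        print(form, "->", apply_spelling_rules(form))
```

Keeping the rules separate from the lexicon means a new stem needs no new spelling rule, which is the point of splitting the analysis into components.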
Example
• Goal: foxes → fox + N + Pl
• Surface: foxes
  • Orthographic rules
• Intermediate: fox^s
  • Lexicon + morphotactics
• Lexical: fox + N + Pl
Multiple Levels
• Generation and Analysis
  • Generation: fox + N + Pl → fox^s#; fox^s# → foxes#
  • Analysis: foxes# → fox^s#; fox^s# → fox + N + Pl
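The two directions can be sketched by composing a lexical-to-intermediate mapping with an intermediate-to-surface mapping. Several things here are assumptions: the three-stem lexicon, the `Sg` tag for singular, and above all the way analysis is done. A real FST is inverted by swapping its input and output labels, whereas this sketch simply searches the toy lexicon for lexical forms that generate the given surface string.

```python
import re
from typing import List

# Hypothetical toy lexicon of regular noun stems.
NOUN_STEMS = {"fox", "cat", "dog"}


def lexical_to_intermediate(lexical: str) -> str:
    """'fox + N + Pl' -> 'fox^s#'; 'fox + N + Sg' -> 'fox#'."""
    stem, category, number = [part.strip() for part in lexical.split("+")]
    if stem not in NOUN_STEMS or category != "N" or number not in ("Sg", "Pl"):
        raise ValueError(f"not licensed by the toy lexicon: {lexical!r}")
    return stem + ("^s#" if number == "Pl" else "#")


def intermediate_to_surface(intermediate: str) -> str:
    """Apply e-insertion, then delete the remaining morpheme boundary."""
    form = re.sub(r"(s|z|x|ch|sh)\^s#", r"\1es#", intermediate)
    return form.replace("^", "")


def generate(lexical: str) -> str:
    """Generation: lexical level -> intermediate level -> surface level."""
    return intermediate_to_surface(lexical_to_intermediate(lexical))


def analyze(surface: str) -> List[str]:
    """Analysis by search: every lexical form whose generated surface matches."""
    candidates = [
        f"{stem} + N + {number}"
        for stem in sorted(NOUN_STEMS)
        for number in ("Sg", "Pl")
    ]
    return [lexical for lexical in candidates if generate(lexical) == surface]


if __name__ == "__main__":
    print(generate("fox + N + Pl"))
    print(analyze("foxes#"))
```

Search only works because the toy lexicon is tiny. The appeal of the transducer formulation is that one machine serves both directions without enumerating candidates.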