CS60057 Speech & Natural Language Processing

CS60057 Speech & Natural Language Processing
Autumn 2007
Lecture 3
27 July 2007
Natural Language Processing

Levels of (Formal) Description
• 6 basic levels (more or less explicitly present in most theories), listed top-down:
  • and beyond (pragmatics/logic/...)
  • meaning (semantics)
  • (surface) syntax
  • morphology
  • phonology
  • phonetics/orthography
• Each level has an input and an output representation
• The output from one level is the input to the next (upper) level
• Sometimes levels might be skipped (merged) or split

Phonetics/Orthography
• Input:
  • acoustic signal (phonetics) / text (orthography)
• Output:
  • phonetic alphabet (phonetics) / text (orthography)
• Deals with:
  • Phonetics:
    • consonant & vowel (& others) formation in the vocal tract
    • classification of consonants, vowels, ... in relation to frequencies, shape & position of the tongue and various muscles
    • intonation
  • Orthography: normalization, punctuation, etc.

Phonology
• Input:
  • sequence of phones/sounds (in a phonetic alphabet); or “normalized” text (sequence of (surface) letters in one language’s alphabet) [NB: phones vs. phonemes]
• Output:
  • sequence of phonemes (~ (lexical) letters; in an abstract alphabet)
• Deals with:
  • relation between sounds and phonemes (units which might have some function on the upper level)
  • e.g.: [u] ~ oo (as in book), [æ] ~ a (cat); i ~ y (flies)

Morphology
• Input:
  • sequence of phonemes (~ (lexical) letters)
• Output:
  • sequence of pairs (lemma, (morphological) tag)
• Deals with:
  • composition of phonemes into word forms and their underlying lemmas (lexical units) + morphological categories (inflection, derivation, compounding)
  • e.g. quotations ~ quote/V + -ation (der. V->N) + NNS

(Surface) Syntax
• Input:
  • sequence of pairs (lemma, (morphological) tag)
• Output:
  • sentence structure (tree) with annotated nodes (all lemmas, (morphosyntactic) tags, functions), of various forms
• Deals with:
  • the relation between lemmas & morphological categories and the sentence structure
  • uses syntactic categories such as Subject, Verb, Object, ...
  • e.g.: I/PP1 see/VB a/DT dog/NN ~
    ((I/sg)SB ((see/pres)V (a/ind dog/sg)OBJ)VP)S

Meaning (semantics)
• Input:
  • sentence structure (tree) with annotated nodes (lemmas, (morphosyntactic) tags, surface functions)
• Output:
  • sentence structure (tree) with annotated nodes (semantic lemmas, (morphosyntactic) tags, deep functions)
• Deals with:
  • relation between categories such as “Subject”, “Object” and (deep) categories such as “Agent”, “Effect”; adds other categories
  • e.g. ((I)SB ((was seen)V (by Tom)OBJ)VP)S ~
    (I/Sg/Pat/t (see/Perf/Pred/t) Tom/Sg/Ag/f)

...and Beyond
• Input:
  • sentence structure (tree): annotated nodes (autosemantic lemmas, (morphosyntactic) tags, deep functions)
• Output:
  • logical form, which can be evaluated (true/false)
• Deals with:
  • assignment of objects from the real world to the nodes of the sentence structure
  • e.g.: (I/Sg/Pat/t (see/Perf/Pred/t) Tom/Sg/Ag/f) ~
    see(Mark-Twain[SSN:...],Tom-Sawyer[SSN:...])[Time:bef 99/9/27/14:15][Place:39°19’40”N 76°37’10”W]

Three Views
• Three equivalent formal ways to look at what we’re up to (not including tables):
  • Regular Expressions
  • Finite State Automata
  • Regular Languages
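To make the equivalence concrete, here is a minimal sketch that describes one small regular language twice: once as a regular expression and once as a hand-built finite-state automaton. The example language ("b", then two or more "a", then "!") and all names in the code are illustrative assumptions, not taken from the slides.

```python
import re

# Hypothetical example language: "b", then two or more "a", then "!".
PATTERN = re.compile(r"baa+!")

# The same language as a deterministic finite-state automaton.
# States are 0..4; state 4 is the only accepting state.
TRANSITIONS = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,
    (3, "!"): 4,
}
ACCEPTING = {4}


def regex_accepts(s):
    """Accept s if the whole string matches the regular expression."""
    return PATTERN.fullmatch(s) is not None


def fsa_accepts(s):
    """Accept s if the automaton ends in an accepting state after reading it."""
    state = 0
    for ch in s:
        state = TRANSITIONS.get((state, ch))
        if state is None:  # no arc for this symbol: reject
            return False
    return state in ACCEPTING
```

Both functions should agree on every input string, which is what "equivalent" means here: they are two descriptions of the same regular language.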

Transition
• Finite-state methods are particularly useful in dealing with a lexicon.
• Lots of devices, some with limited memory, need access to big lists of words.
• So we’ll switch to talking about some facts about words and then come back to computational methods.

MORPHOLOGY

Morphology
• Morphology is the study of the ways that words are built up from smaller meaningful units called morphemes (morph = shape, logos = word)
• We can usefully divide morphemes into two classes
  • Stems: the core meaning-bearing units
  • Affixes: bits and pieces that adhere to stems to change their meanings and grammatical functions
    • Prefix: un-, anti-, etc.
    • Suffix: -ity, -ation, etc.
    • Infix: inserted inside the stem
      • Tagalog: um + hingi → humingi
    • Circumfix: precedes and follows the stem
• English doesn’t stack many affixes.
• But Turkish can have words with a lot of suffixes.
• Languages such as Turkish, which tend to string affixes together, are called agglutinative languages.

Surface and Lexical Forms
• The surface level of a word represents the actual spelling of that word.
  • geliyorum, eats, cats, kitabım
• The lexical level of a word represents a simple concatenation of the morphemes making up that word.
  • gel +PROG +1SG
  • eat +AOR
  • cat +PLU
  • kitap +P1SG
• Morphological processors try to find correspondences between lexical and surface forms of words.
  • Morphological recognition/analysis – surface to lexical
  • Morphological generation/synthesis – lexical to surface

Morphology: Morphemes & Order
• Handles what is an isolated form in written text
• Grouping of phonemes into morphemes
  • the sequence deliverables ~ deliver, able and s (3 units)
• Morpheme combination
  • certain combinations/sequences are possible, others are not:
    • deliver+able+s, but not able+deliver+s; noun+s, but not noun+ing
  • typically fixed (in any given language)

Inflectional & Derivational Morphology
• We can also divide morphology up into two broad classes
  • Inflectional
  • Derivational
• Inflectional morphology concerns the combination of stems and affixes where the resulting word
  • has the same word class as the original
  • serves a grammatical/semantic purpose different from the original
• After a combination with an inflectional morpheme, the meaning and class of the actual stem usually do not change.
  • eat / eats, pencil / pencils
• After a combination with a derivational morpheme, the meaning and the class of the actual stem usually change.
  • compute / computer, do / undo, friend / friendly
  • Uygar / uygarlaş, kapı / kapıcı
• Irregular changes may happen with derivational affixes.

Morphological Parsing
• Morphological parsing finds the lexical form of a word from its surface form.
  • cats → cat +N +PLU
  • cat → cat +N +SG
  • goose → goose +N +SG or goose +V
  • geese → goose +N +PLU
  • gooses → goose +V +3SG
  • catch → catch +V
  • caught → catch +V +PAST or catch +V +PP
• There can be more than one lexical-level representation for a given word (ambiguity).
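The examples on this slide can be written down as a small lookup table, which makes the ambiguity explicit: one surface form may map to several lexical forms. This is only a sketch; the table `ANALYSES` and the functions `parse` and `generate` are hypothetical names, and a real analyser would compute these mappings rather than list them.

```python
# Surface form -> list of lexical forms, copied from the slide's examples.
ANALYSES = {
    "cats": ["cat +N +PLU"],
    "cat": ["cat +N +SG"],
    "goose": ["goose +N +SG", "goose +V"],
    "geese": ["goose +N +PLU"],
    "gooses": ["goose +V +3SG"],
    "catch": ["catch +V"],
    "caught": ["catch +V +PAST", "catch +V +PP"],
}


def parse(surface):
    """Surface to lexical: return every analysis of the surface form."""
    return ANALYSES.get(surface, [])


def generate(lexical):
    """Lexical to surface: return every surface form with this analysis."""
    return sorted(s for s, forms in ANALYSES.items() if lexical in forms)
```

For instance, `parse("caught")` returns two analyses, while `parse` on a word missing from the table returns an empty list.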

Morphological Analysis
• Analyzing words into their linguistic components (morphemes).
• Morphemes are the smallest meaningful units of language.
  • cars → car+PLU
  • giving → give+PROG
  • AsachhilAma → AsA+PROG+PAST+1st (“I/we was/were coming”)
• Ambiguity: more than one alternative
  • flies → fly VERB+3SG or fly NOUN+PLU
  • mAtAla kare

Fly + s → flys → flies (y → i rule)
• Duckling
• Go-getter → get + er
• Doer → do + er
• Beer → ?
What knowledge do we need? How do we represent it? How do we compute with it?

Knowledge needed
• Knowledge of stems or roots
  • Duck is a possible root, not duckl
  • We need a dictionary (lexicon)
• Only some endings go on some words
  • Do + er – ok
  • Be + er – not ok
• In addition, spelling change rules that adjust the surface form
  • Get + er – double the t – getter
  • Fox + s – insert e – foxes
  • Fly + s – insert e – flys – y to i – flies
  • Chase + ed – drop e – chased
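The three kinds of knowledge listed above (stems, which endings go on which stems, and spelling changes) can be sketched in a few lines. The table `ALLOWED`, the functions `join` and `build`, and the exact conditions on each spelling rule are simplifying assumptions made for illustration; real English spelling rules have more conditions and exceptions.

```python
import re

VOWELS = "aeiou"

# Hypothetical mini-lexicon: stem -> endings that may attach to it.
ALLOWED = {
    "do": {"er"},
    "get": {"er"},
    "fox": {"s"},
    "fly": {"s"},
    "chase": {"ed"},
    "duck": {"s", "ling"},
    "be": set(),  # "be" is a stem, but "be + er" is not permitted
}


def join(stem, suffix):
    """Attach a non-empty suffix, applying simplified spelling rules."""
    # e-deletion: chase + ed -> chased
    if stem.endswith("e") and suffix[0] in VOWELS:
        return stem[:-1] + suffix
    # y -> i together with e-insertion: fly + s -> flies
    if suffix == "s" and re.search(r"[^aeiou]y$", stem):
        return stem[:-1] + "ies"
    # e-insertion: fox + s -> foxes
    if suffix == "s" and re.search(r"(x|s|z|ch|sh)$", stem):
        return stem + "es"
    # consonant doubling for short stems: get + er -> getter
    if suffix[0] in VOWELS and re.fullmatch(r"[^aeiou]*[aeiou][^aeiouwxy]", stem):
        return stem + stem[-1] + suffix
    return stem + suffix


def build(stem, suffix):
    """Return the surface form, or None if the lexicon forbids the combination."""
    if suffix not in ALLOWED.get(stem, set()):
        return None
    return join(stem, suffix)
```

Here `build("be", "er")` is rejected by the lexicon check before any spelling rule is tried, and `build("duckl", "ing")` is rejected because "duckl" is not a known stem.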

Put all this in a big dictionary (lexicon)
• Turkish – approx. 600 × 10^6 forms
• Finnish – 10^7
• Hindi, Bengali, Telugu, Tamil?
• Besides, novel forms can always be constructed
  • Anti-missile
  • Anti-anti-missile
  • Anti-anti-anti-missile
  • ...
• Compounding of words – Sanskrit, German

Morphology: From Morphemes to Lemmas & Categories
• Lemma: lexical unit, “pointer” to lexicon
  • typically represented as the “base form”, or “dictionary headword”
  • possibly indexed when ambiguous/polysemous:
    • state1 (verb), state2 (state-of-the-art), state3 (government)
  • formed from one or more morphemes (“root”, “stem”, “root+derivation”, ...)
• Categories: non-lexical
  • small number of possible values (< 100, often < 5-10)

Morphology Level: The Mapping
• Formally: $A^+ \to 2^{(L, C_1, C_2, \ldots, C_n)}$
  • $A$ is the alphabet of phonemes ($A^+$ denotes any non-empty sequence of phonemes)
  • $L$ is the set of possible lemmas, uniquely identified
  • $C_i$ are morphological categories, such as:
    • grammatical number, gender, case
    • person, tense, negation, degree of comparison, voice, aspect, ...
    • tone, politeness, ...
    • part of speech (not quite a morphological category, but...)
• $A$, $L$ and $C_i$ are obviously language-dependent

Morphological Analysis (cont.)
• Relatively simple for English.
• But for many Indian languages, it may be more difficult.
  • Examples: inflectional and derivational morphology.
• Common tools: finite-state transducers

Bengali Verb Paradigms

Bengali Verb morphology for one of the paradigms


Finite State Machines
• FSAs recognise exactly the regular languages
• FSTs correspond to regular relations (over pairs of regular languages)
• FSTs are like FSAs but with complex labels.
• We can use FSTs to transduce between surface and lexical levels.

Simple Rules

Adding in the Words

Derivational Rules

Parsing/Generation vs. Recognition
• Recognition is usually not quite what we need.
  • Usually, if we find some string in the language, we need to find the structure in it (parsing)
  • Or we have some structure and we want to produce a surface form (production/generation)
• Example
  • From “cats” to “cat +N +PL” and back

Morphological Parsing
• Given the input cats, we’d like to output cat +N +Pl, telling us that cat is a plural noun.
• Given the Spanish input bebo, we’d like to output beber +V +PInd +1P +Sg, telling us that bebo is the present indicative first person singular form of the Spanish verb beber, ‘to drink’.

Morphological Analyser
To build a morphological analyser we need:
• lexicon: the list of stems and affixes, together with basic information about them
• morphotactics: the model of morpheme ordering (e.g. the English plural morpheme follows a noun rather than a verb)
• orthographic rules: spelling rules that model the changes that occur in a word, usually when two morphemes combine (e.g., fly+s = flies)

Lexicon & Morphotactics
• Typically the list of word parts (lexicon) and the model of ordering can be combined into an FSA which will recognise all the valid word forms.
• For this to be possible the word parts must first be classified into sublexicons.
• The FSA defines the morphotactics (ordering constraints).
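A minimal sketch of this idea: word parts are grouped into sublexicons, and the arcs of a small automaton are labelled with sublexicon names rather than individual letters. The sublexicon names, their contents, and the states `q0` to `q2` are assumptions chosen for illustration; they are not copied from the slides' figures. Spelling rules are deliberately ignored at this stage, so the sketch accepts the concatenation "foxs" rather than "foxes".

```python
# Hypothetical sublexicons classifying the word parts.
SUBLEXICONS = {
    "reg-noun": {"cat", "dog", "fox"},
    "irreg-sg-noun": {"goose", "mouse"},
    "irreg-pl-noun": {"geese", "mice"},
    "plural": {"s"},
}

# Morphotactics: state -> list of (sublexicon on the arc, next state).
ARCS = {
    "q0": [("reg-noun", "q1"), ("irreg-sg-noun", "q2"), ("irreg-pl-noun", "q2")],
    "q1": [("plural", "q2")],
    "q2": [],
}
FINAL = {"q1", "q2"}


def recognise(word, state="q0"):
    """Return True if word is a valid concatenation of morphemes."""
    if word == "":
        return state in FINAL
    for sublexicon, next_state in ARCS[state]:
        for morph in SUBLEXICONS[sublexicon]:
            # Consume one morpheme, then continue from the next state.
            if word.startswith(morph) and recognise(word[len(morph):], next_state):
                return True
    return False
```

The ordering constraint shows up in the arcs: the plural morpheme is reachable only after a regular noun, so "gooses" and "scat" are rejected.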

Sublexicons to classify the list of word parts

FSA Expresses Morphotactics (ordering model)

Towards the Analyser
• We can use lexc or xfst to build such an FSA (see lex1.lexc)
• To augment this to produce an analysis we must create a transducer Tnum which maps between the lexical level and an "intermediate" level that is needed to handle the spelling rules of English.

Three Levels of Analysis

1. Tnum: Noun Number Inflection
• multi-character symbols
• morpheme boundary ^
• word boundary #
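As a rough illustration of what Tnum computes, the sketch below uses a plain Python function in place of a real transducer. It maps a lexical form with multi-character tags to an intermediate form that marks the morpheme boundary with ^ and the word boundary with #. The word lists and the function name `tnum` are hypothetical, and the tag spelling (+N, +SG, +PL) follows the "cat +N +PL" example used elsewhere in the lecture.

```python
# Hypothetical noun lists; a real system would read these from the lexicon.
REGULAR_NOUNS = {"cat", "fox", "fly"}
IRREGULAR_PLURALS = {"goose": "geese"}


def tnum(lexical):
    """Map e.g. 'fox +N +PL' to the intermediate form 'fox^s#'.

    Returns None when the stem or the tag sequence is not covered.
    """
    stem, *tags = lexical.split()
    known = stem in REGULAR_NOUNS or stem in IRREGULAR_PLURALS
    if tags == ["+N", "+SG"] and known:
        return stem + "#"
    if tags == ["+N", "+PL"]:
        if stem in REGULAR_NOUNS:
            # Regular plural: morpheme boundary, plural s, word boundary.
            return stem + "^s#"
        if stem in IRREGULAR_PLURALS:
            # Irregular plural: stored form, no morpheme boundary.
            return IRREGULAR_PLURALS[stem] + "#"
    return None
```

Note that `tnum("fox +N +PL")` yields "fox^s#", not "foxes#": turning the intermediate form into the correct spelling is left to the spelling rules.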

Intermediate Form to Surface
• The reason we need to have an intermediate form is that funny things happen at morpheme boundaries, e.g.
  • cat^s → cats
  • fox^s → foxes
  • fly^s → flies
• The rules which describe these changes are called orthographic rules or "spelling rules".

More English Spelling Rules
• consonant doubling: beg / begging
• y replacement: try / tries
• k insertion: panic / panicked
• e deletion: make / making
• e insertion: watch / watches
• Each rule can be stated in more detail ...

Spelling Rules
• Chomsky & Halle (1968) invented a special notation for spelling rules.
• A very similar notation is embodied in the "conditional replacement" rules of xfst:
  E -> F || L _ R
  which means: replace E with F when it appears between left context L and right context R

A Particular Spelling Rule
This rule does e-insertion:
  ^ -> e || x _ s#
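The effect of this rule can be imitated with a regular-expression substitution, where lookbehind and lookahead play the role of the left and right contexts. This is a sketch of the rule's behaviour in Python, not xfst itself; the function names are hypothetical, and deleting the leftover boundary symbols in `to_surface` is an added assumption about how the surface string is finally produced.

```python
import re


def e_insertion(intermediate):
    """Apply '^ -> e || x _ s#': rewrite ^ as e between x and s#."""
    return re.sub(r"(?<=x)\^(?=s#)", "e", intermediate)


def to_surface(intermediate):
    """Apply e-insertion, then drop any remaining boundary symbols."""
    rewritten = e_insertion(intermediate)
    return rewritten.replace("^", "").replace("#", "")
```

With this, "fox^s#" becomes "foxes", while "cat^s#" is untouched by the rule and simply loses its boundary symbols to give "cats".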

e insertion over 3 levels
The rule corresponds to the mapping between surface and intermediate levels

e insertion as an FST

Incorporating Spelling Rules
• Spelling rules, each corresponding to an FST, can be run in parallel provided that they are "aligned".
• The set of spelling rules is positioned between the surface level and the intermediate level.
• Parallel execution of FSTs can be carried out:
  • by simulation: in this case FSTs must first be aligned.
  • by first constructing a single FST corresponding to their intersection.

Putting it all together
Execution of FSTi takes place in parallel

Kaplan and Kay: FSTi are aligned but separate
The Xerox View: FSTi intersected together

Finite State Transducers
• The simple story
  • Add another tape
  • Add extra symbols to the transitions
  • On one tape we read “cats”, on the other we write “cat +N +PL”, or the other way around.
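The two-tape picture can be sketched with arcs that carry a pair of symbols, one per tape, where the empty string stands for "nothing on this tape". The tiny arc table below covers only the noun "cat" and is an illustrative assumption, as are the function names; the point is that the same table is used in both directions.

```python
# Arcs: (source state, lexical symbol, surface symbol, target state).
# "" means the arc reads or writes nothing on that tape.
ARCS = [
    (0, "c", "c", 1),
    (1, "a", "a", 2),
    (2, "t", "t", 3),
    (3, "+N", "", 4),
    (4, "+PL", "s", 5),
    (4, "+SG", "", 5),
]
FINAL = {5}


def transduce(symbols, read_side, state=0):
    """Read symbols on one tape and return all outputs for the other tape.

    read_side 0 reads the lexical tape; read_side 1 reads the surface tape.
    """
    results = []
    if not symbols and state in FINAL:
        results.append([])
    for arc in ARCS:
        source, target = arc[0], arc[3]
        read, write = arc[1 + read_side], arc[2 - read_side]
        if source != state:
            continue
        if read == "":
            rest = symbols  # arc consumes nothing on the tape being read
        elif symbols and symbols[0] == read:
            rest = symbols[1:]
        else:
            continue
        for tail in transduce(rest, read_side, target):
            results.append(([write] if write else []) + tail)
    return results


def analyse(surface):
    """Surface to lexical, e.g. 'cats' -> ['cat +N +PL']."""
    outputs = transduce(list(surface), read_side=1)
    return ["".join(" " + s if s.startswith("+") else s for s in out)
            for out in outputs]


def generate(lexical):
    """Lexical to surface, e.g. 'cat +N +PL' -> ['cats']."""
    stem, *tags = lexical.split()
    outputs = transduce(list(stem) + tags, read_side=0)
    return ["".join(out) for out in outputs]
```

Because the arcs are symmetric pairs, swapping which tape is read turns the analyser into a generator without changing the table.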
