CS60057 Speech & Natural Language Processing

CS60057 Speech & Natural Language Processing • Autumn 2007 • Lecture 2 • 26 July 2007 • Natural Language Processing

Why is NLP difficult? • Because natural language is highly ambiguous. • Syntactic ambiguity • “The president spoke to the nation about the problem of drug use in the schools from one coast to the other.” has 720 parses. • Ex: • “to the other” can attach to any of the previous NPs (e.g. “the problem”) or to the head verb → 6 places • “from one coast” has 5 places to attach • …

Why is NLP difficult? • Word category ambiguity • book → verb or noun? • Word sense ambiguity • bank → financial institution, building, or river side? • Words can mean more than the sum of their parts • make up a story • Fictitious worlds • People on Mars can fly. • Defining scope • People like ice cream. • Does this mean that all (or some?) people like ice cream? • Language is changing and evolving • I’ll email you my answer. • This new S.U.V. has a compartment for your mobile phone. • Googling, …

Dealing with Ambiguity • Four possible approaches; the first two are: • Tightly coupled interaction among processing levels; knowledge from other levels can help decide among choices at ambiguous levels. • Pipeline processing that ignores ambiguity as it occurs and hopes that other levels can eliminate incorrect structures.

Resolve Ambiguities • We will introduce models and algorithms to resolve ambiguities at different levels. • part-of-speech tagging -- deciding whether duck is a verb or a noun. • word-sense disambiguation -- deciding whether make means create or cook. • lexical disambiguation -- part-of-speech tagging and word-sense disambiguation are two important kinds of lexical disambiguation. • syntactic ambiguity -- whether her and duck belong to the same constituent in “I made her duck” is a syntactic ambiguity, which can be addressed by probabilistic parsing.

Resolve Ambiguities (cont.) • I made her duck • Parse 1: [S [NP I] [VP [V made] [NP her] [NP duck]]] • Parse 2: [S [NP I] [VP [V made] [NP [DET her] [N duck]]]]

Dealing with Ambiguity • Four possible approaches: • Tightly coupled interaction among processing levels; knowledge from other levels can help decide among choices at ambiguous levels. • Pipeline processing that ignores ambiguity as it occurs and hopes that other levels can eliminate incorrect structures. • Syntax proposes/semantics disposes approach • Probabilistic approaches based on making the most likely choices

Models and Algorithms • By models I mean the formalisms that are used to capture the various kinds of linguistic knowledge we need. • Algorithms are then used to manipulate the knowledge representations needed to tackle the task at hand.

Models to Represent Linguistic Knowledge • Different formalisms (models) are used to represent the required linguistic knowledge. • State Machines -- FSAs, HMMs, ATNs, RTNs • Formal Rule Systems -- Context-Free Grammars, Unification Grammars, Probabilistic CFGs • Logic-based Formalisms -- first-order predicate logic, some higher-order logic • Models of Uncertainty -- Bayesian probability theory

Algorithms • Many of the algorithms that we’ll study will turn out to be transducers: algorithms that take one kind of structure as input and output another. • Unfortunately, ambiguity makes this process difficult. This leads us to employ algorithms that are designed to handle ambiguity of various kinds.

Algorithms • In particular: • State-space search • To manage the problem of making choices during processing when we lack the information needed to make the right choice • Dynamic programming • To avoid having to redo work during the course of a state-space search • CKY, Earley, Minimum Edit Distance, Viterbi, Baum-Welch

State Space Search • States represent pairings of partially processed inputs with partially constructed representations. • Goals are inputs paired with completed representations that satisfy some criteria. • As with most interesting problems the spaces are normally too large to exhaustively explore. • We need heuristics to guide the search • Criteria to trim the space

Dynamic Programming • Don’t do the same work over and over. • Avoid this by building and making use of solutions to sub-problems that must be invariant across all parts of the space.
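
As a concrete instance of reusing sub-problem solutions, here is a minimal sketch of Minimum Edit Distance, one of the algorithms named on the previous slide. It assumes unit costs for insertion, deletion and substitution (other cost schemes are common); the function name is ours, chosen for illustration.

```python
def min_edit_distance(source: str, target: str) -> int:
    """Levenshtein distance, assuming cost 1 for insert, delete and substitute."""
    rows, cols = len(source) + 1, len(target) + 1
    # dist[i][j] holds the cheapest way to turn source[:i] into target[:j].
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete all i characters
    for j in range(cols):
        dist[0][j] = j  # insert all j characters
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = 0 if source[i - 1] == target[j - 1] else 1
            # Each cell is computed once from three already-solved sub-problems.
            dist[i][j] = min(
                dist[i - 1][j] + 1,                 # deletion
                dist[i][j - 1] + 1,                 # insertion
                dist[i - 1][j - 1] + substitution,  # match or substitution
            )
    return dist[-1][-1]


print(min_edit_distance("cat", "cats"))
```

The table has one cell per pair of prefixes, so no prefix pair is ever solved twice, which is exactly the "don't do the same work over and over" idea.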

Languages • Languages: 39,000 languages and dialects (22,000 dialects in India alone) • Top languages: Chinese/Mandarin (885M), Spanish (332M), English (322M), Bengali (189M), Hindi (182M), Portuguese (170M), Russian (170M), Japanese (125M) • Source: www.sil.org/ethnologue, www.nytimes.com • Internet: English (128M), Japanese (19.7M), German (14M), Spanish (9.4M), French (9.3M), Chinese (7.0M) • Usage: English (1999: 54%, 2001: 51%, 2003: 46%, 2005: 43%) • Source: www.computereconomics.com

The Description of Language • Language = Words and Rules = Dictionary (vocabulary) + Grammar • Dictionary • set of words defined in the language; open (dynamic) • Traditional - paper based • Electronic - machine-readable dictionaries; can be obtained from paper-based ones • Grammar • set of rules which describe what is allowable in a language • Classic Grammars • meant for humans who know the language • definitions and rules are mainly supported by examples • no (or almost no) formal description tools; cannot be programmed • Explicit Grammar (CFG, Dependency Grammars, Link Grammars, ...) • formal description • can be programmed & tested on data (texts)

Levels of (Formal) Description • 6 basic levels (more or less explicitly present in most theories), listed from highest to lowest: • and beyond (pragmatics/logic/...) • meaning (semantics) • (surface) syntax • morphology • phonology • phonetics/orthography • Each level has an input and output representation • output from one level is the input to the next (upper) level • sometimes levels might be skipped (merged) or split

Phonetics/Orthography • Input: • acoustic signal (phonetics) / text (orthography) • Output: • phonetic alphabet (phonetics) / text (orthography) • Deals with: • Phonetics: • consonant & vowel (& others) formation in the vocal tract • classification of consonants, vowels, ... in relation to frequencies, shape & position of the tongue and various muscles • intonation • Orthography: normalization, punctuation, etc.

Phonology • Input: • sequence of phones/sounds (in a phonetic alphabet); or “normalized” text (sequence of (surface) letters in one language’s alphabet) [NB: phones vs. phonemes] • Output: • sequence of phonemes (~ (lexical) letters; in an abstract alphabet) • Deals with: • relation between sounds and phonemes (units which might have some function on the upper level) • e.g.: [u] ~ oo (as in book), [æ] ~ a (cat); i ~ y (flies)

Morphology • Input: • sequence of phonemes (~ (lexical) letters) • Output: • sequence of pairs (lemma, (morphological) tag) • Deals with: • composition of phonemes into word forms and their underlying lemmas (lexical units) + morphological categories (inflection, derivation, compounding) • e.g. quotations ~ quote/V + -ation(der.V->N) + NNS.

(Surface) Syntax • Input: • sequence of pairs (lemma, (morphological) tag) • Output: • sentence structure (tree) with annotated nodes (all lemmas, (morphosyntactic) tags, functions), of various forms • Deals with: • the relation between lemmas & morphological categories and the sentence structure • uses syntactic categories such as Subject, Verb, Object,... • e.g.: I/PP1 see/VB a/DT dog/NN ~ • ((I/sg)SB ((see/pres)V (a/ind dog/sg)OBJ)VP)S

Meaning (semantics) • Input: • sentence structure (tree) with annotated nodes (lemmas, (morphosyntactic) tags, surface functions) • Output: • sentence structure (tree) with annotated nodes (semantic lemmas, (morphosyntactic) tags, deep functions) • Deals with: • relation between categories such as “Subject”, “Object” and (deep) categories such as “Agent”, “Effect”; adds other categories • e.g. ((I)SB ((was seen)V (by Tom)OBJ)VP)S ~ • (I/Sg/Pat/t (see/Perf/Pred/t) Tom/Sg/Ag/f)

...and Beyond • Input: • sentence structure (tree): annotated nodes (autosemantic lemmas, (morphosyntactic) tags, deep functions) • Output: • logical form, which can be evaluated (true/false) • Deals with: • assignment of objects from the real world to the nodes of the sentence structure • e.g.: (I/Sg/Pat/t (see/Perf/Pred/t) Tom/Sg/Ag/f) ~ • see(Mark-Twain[SSN:...],Tom-Sawyer[SSN:...])[Time:bef 99/9/27/14:15][Place:39°19’40”N 76°37’10”W]

Three Views • Three equivalent formal ways to look at what we’re up to (not including tables): • Regular Expressions • Finite State Automata • Regular Languages
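
To make the equivalence concrete, the sketch below describes one small regular language in two of these ways: as a regular expression and as a hand-built finite state automaton. The language itself ("b", then two or more "a", then "!") is a toy example we assume for illustration; it is not taken from the slides.

```python
import re

# View 1: a regular expression for the assumed toy language.
SHEEP_RE = re.compile(r"baa+!")

# View 2: a deterministic FSA for the same language, as a transition table.
# Keys are (state, input symbol); a missing key means "no transition".
TRANSITIONS = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,   # loop: any number of extra a's
    (3, "!"): 4,
}
ACCEPT_STATES = {4}


def regex_accepts(text: str) -> bool:
    """True if the whole string matches the regular expression."""
    return SHEEP_RE.fullmatch(text) is not None


def fsa_accepts(text: str) -> bool:
    """True if the automaton ends in an accept state after reading the string."""
    state = 0
    for symbol in text:
        state = TRANSITIONS.get((state, symbol))
        if state is None:
            return False
    return state in ACCEPT_STATES


for candidate in ["baa!", "baaaa!", "ba!", "baa", "abaa!"]:
    print(candidate, regex_accepts(candidate), fsa_accepts(candidate))
```

Both functions accept exactly the same strings; the set of strings they accept is the third view, the regular language.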

Transition • Finite-state methods are particularly useful in dealing with a lexicon. • Lots of devices, some with limited memory, need access to big lists of words. • So we’ll switch to talking about some facts about words and then come back to computational methods.

MORPHOLOGY

Morphology • Morphology is the study of the ways that words are built up from smaller meaningful units called morphemes (morph = shape, logos = word) • We can usefully divide morphemes into two classes • Stems: the core meaning-bearing units • Affixes: bits and pieces that adhere to stems to change their meanings and grammatical functions • Prefix: un-, anti-, etc. • Suffix: -ity, -ation, etc. • Infix: inserted inside the stem • Tagalog: um + hingi → humingi • Circumfix: precedes and follows the stem • English doesn’t usually stack many affixes. • But Turkish can have words with a lot of suffixes. • Languages such as Turkish that tend to string affixes together are called agglutinative languages.

Surface and Lexical Forms • The surface level of a word represents the actual spelling of that word. • geliyorum • eats • cats • kitabım • The lexical level of a word represents a simple concatenation of the morphemes making up that word. • gel +PROG +1SG • eat +AOR • cat +PLU • kitap +P1SG • Morphological processors try to find correspondences between the lexical and surface forms of words. • Morphological recognition/analysis – surface to lexical • Morphological generation/synthesis – lexical to surface

Morphology: Morphemes & Order • Handles what is an isolated form in written text • Grouping of phonemes into morphemes • the sequence deliverables ~ deliver, able and s (3 units) • Morpheme Combination • certain combinations/sequencings are possible, others not: • deliver+able+s, but not able+deliver+s; noun+s, but not noun+ing • typically fixed (in any given language)

Inflectional & Derivational Morphology • We can also divide morphology up into two broad classes • Inflectional • Derivational • Inflectional morphology concerns the combination of stems and affixes where the resulting word • Has the same word class as the original • Serves a grammatical/semantic purpose different from the original • After a combination with an inflectional morpheme, the meaning and class of the actual stem usually do not change. • eat / eats • pencil / pencils • After a combination with a derivational morpheme, the meaning and the class of the actual stem usually change. • compute / computer • do / undo • friend / friendly • Uygar / uygarlaş • kapı / kapıcı • Irregular changes may occur with derivational affixes.

Morphological Parsing • Morphological parsing finds the lexical form of a word from its surface form. • cats -- cat +N +PLU • cat -- cat +N +SG • goose -- goose +N +SG or goose +V • geese -- goose +N +PLU • gooses -- goose +V +3SG • catch -- catch +V • caught -- catch +V +PAST or catch +V +PP • There can be more than one lexical-level representation for a given word (ambiguity).
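
The surface-to-lexical mapping on this slide can be written down directly as a table. The sketch below is only a lookup over the slide's own examples, not a real analyzer; the names `ANALYSES`, `parse` and `generate` are ours. It shows that parsing returns a list of analyses, since a surface form may be ambiguous, and that the same table can be read in the other direction for generation.

```python
# Surface form -> all lexical forms, copied from the examples above.
ANALYSES = {
    "cats": ["cat +N +PLU"],
    "cat": ["cat +N +SG"],
    "goose": ["goose +N +SG", "goose +V"],
    "geese": ["goose +N +PLU"],
    "gooses": ["goose +V +3SG"],
    "catch": ["catch +V"],
    "caught": ["catch +V +PAST", "catch +V +PP"],
}


def parse(surface: str) -> list[str]:
    """Surface to lexical: return every known analysis (possibly none)."""
    return ANALYSES.get(surface.lower(), [])


def generate(lexical: str) -> list[str]:
    """Lexical to surface: return every surface form with this analysis."""
    return [surface for surface, forms in ANALYSES.items() if lexical in forms]


print(parse("caught"))
print(generate("goose +N +PLU"))
```

A table like this only covers the words listed in it, which is the limitation the following slides address.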

Morphological Analysis • Analyzing words into their linguistic components (morphemes). • Morphemes are the smallest meaningful units of language. • cars → car+PLU • giving → give+PROG • AsachhilAma → AsA+PROG+PAST+1st (I/we was/were coming) • Ambiguity: more than one alternative • flies → fly(VERB)+3SG or fly(NOUN)+PLU • mAtAla kare

Fly + s → flys → flies (y → i rule) • Duckling • Go-getter → get + er • Doer → do + er • Beer → ? • What knowledge do we need? • How do we represent it? • How do we compute with it?

Knowledge needed • Knowledge of stems or roots • Duck is a possible root, not duckl • We need a dictionary (lexicon) • Only some endings go on some words • Do + er – ok • Be + er – not ok • In addition, spelling change rules that adjust the surface form • Get + er – double the t – getter • Fox + s – insert e – foxes • Fly + s – flys – y to i, insert e – flies • Chase + ed – drop e – chased
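
The four spelling changes listed above can be sketched as a single function that attaches a suffix to a stem. The conditions under which each rule fires (which final letters trigger e-insertion, when a consonant doubles) are simplifying assumptions of ours, not a complete account of English spelling; the doubling test in particular over-applies to longer stems.

```python
VOWELS = set("aeiou")


def attach(stem: str, suffix: str) -> str:
    """Join stem and suffix, applying a few assumed English spelling rules."""
    if suffix == "s":
        # Fox + s: insert e after a sibilant-like ending.
        if stem.endswith(("s", "x", "z", "ch", "sh")):
            return stem + "es"
        # Fly + s: consonant + y becomes i, then insert e.
        if len(stem) > 1 and stem.endswith("y") and stem[-2] not in VOWELS:
            return stem[:-1] + "ies"
        return stem + "s"
    if suffix[0] in VOWELS:
        # Chase + ed: drop the stem-final e before a vowel-initial suffix.
        if stem.endswith("e"):
            return stem[:-1] + suffix
        # Get + er: double the final consonant of a consonant-vowel-consonant stem.
        if (len(stem) >= 3
                and stem[-1] not in VOWELS and stem[-1] not in "wxy"
                and stem[-2] in VOWELS
                and stem[-3] not in VOWELS):
            return stem + stem[-1] + suffix
    return stem + suffix


for stem, suffix in [("get", "er"), ("fox", "s"), ("fly", "s"), ("chase", "ed")]:
    print(stem, "+", suffix, "->", attach(stem, suffix))
```

Note that the function says nothing about which combinations are allowed: it will happily build be + er, so the lexicon knowledge from the first bullets is still needed alongside the spelling rules.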

Put all this in a big dictionary (lexicon) • Turkish – approx. 600 × 10^6 forms • Finnish – 10^7 • Hindi, Bengali, Telugu, Tamil? • Besides, novel forms can always be constructed • Anti-missile • Anti-anti-missile • Anti-anti-anti-missile • … • Compounding of words – Sanskrit, German
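
The anti-missile series shows why a finite list cannot be complete, while a finite-state pattern covers every member of the series. A minimal sketch, assuming the hyphenated lower-case spelling used on the slide (the helper names are ours):

```python
import re

# Zero or more "anti-" prefixes followed by the stem: an unbounded set of forms.
ANTI_FORMS = re.compile(r"(anti-)*missile")


def is_anti_missile_form(word: str) -> bool:
    """True if the word is missile with any number of anti- prefixes."""
    return ANTI_FORMS.fullmatch(word.lower()) is not None


def anti_missile(depth: int) -> str:
    """Build the form with `depth` anti- prefixes."""
    return "anti-" * depth + "missile"


print(is_anti_missile_form("Anti-anti-anti-missile"))
print(is_anti_missile_form(anti_missile(25)))
```

However many forms a dictionary lists, `anti_missile` can always produce one more that the pattern still recognizes.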

Morphology: From Morphemes to Lemmas & Categories • Lemma: lexical unit, “pointer” to lexicon • typically is represented as the “base form”, or “dictionary headword” • possibly indexed when ambiguous/polysemous: • state1 (verb), state2 (state-of-the-art), state3 (government) • from one or more morphemes (“root”, “stem”, “root+derivation”, ...) • Categories: non-lexical • small number of possible values (< 100, often < 5-10)

Morphology Level: The Mapping • Formally: A+ → 2^(L,C1,C2,...,Cn), i.e. each phoneme sequence maps to a set of (lemma, category values) tuples • A is the alphabet of phonemes (A+ denotes any non-empty sequence of phonemes) • L is the set of possible lemmas, uniquely identified • Ci are morphological categories, such as: • grammatical number, gender, case • person, tense, negation, degree of comparison, voice, aspect, ... • tone, politeness, ... • part of speech (not quite morphological category, but...) • A, L and Ci are obviously language-dependent

Morphological Analysis (cont.) • Relatively simple for English. • But for many Indian languages, it may be more difficult. • Examples: inflectional and derivational morphology. • Common tools: finite-state transducers

Simple Rules

Adding in the Words

Derivational Rules

Parsing/Generation vs. Recognition • Recognition is usually not quite what we need. • Usually if we find some string in the language we need to find the structure in it (parsing) • Or we have some structure and we want to produce a surface form (production/generation) • Example • From “cats” to “cat +N +PL” and back

Finite State Transducers • The simple story • Add another tape • Add extra symbols to the transitions • On one tape we read “cats”, on the other we write “cat +N +PL”, or the other way around.

FSTs

Transitions • Arc labels along the path: c:c a:a t:t +N:ε +PL:s • c:c means read a c on one tape and write a c on the other • +N:ε means read a +N symbol on one tape and write nothing on the other • +PL:s means read +PL and write an s
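
The five arcs above are enough to build a tiny working transducer. The sketch below is a minimal, assumed encoding (state numbers, the `transduce` function and its `read_upper` flag are ours): each arc carries an upper (lexical) symbol and a lower (surface) symbol, and the same machine is run in either direction by choosing which side to read.

```python
EPS = ""  # epsilon: nothing is read or written on that tape

# Arcs as (state, upper symbol, lower symbol, next state).
ARCS = [
    (0, "c", "c", 1),
    (1, "a", "a", 2),
    (2, "t", "t", 3),
    (3, "+N", EPS, 4),
    (4, "+PL", "s", 5),
]
FINAL_STATES = {5}


def transduce(symbols, read_upper=True):
    """Return every output symbol list the machine produces for `symbols`."""
    results = []

    def step(state, pos, out):
        if pos == len(symbols) and state in FINAL_STATES:
            results.append(out)
        for src, upper, lower, dst in ARCS:
            if src != state:
                continue
            read, write = (upper, lower) if read_upper else (lower, upper)
            new_out = (out + [write]) if write != EPS else out
            if read == EPS:
                step(dst, pos, new_out)      # move without consuming input
            elif pos < len(symbols) and symbols[pos] == read:
                step(dst, pos + 1, new_out)  # consume one input symbol

    step(0, 0, [])
    return results


# Generation: lexical tape in, surface tape out.
print(transduce(["c", "a", "t", "+N", "+PL"]))
# Analysis: surface tape in, lexical tape out.
print(transduce(list("cats"), read_upper=False))
```

The recursion terminates here because this machine has no cycles; a machine with epsilon loops would need a visited-state check.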

Typical Uses • Typically, we’ll read from one tape using the first symbol on the machine transitions (just as in a simple FSA). • And we’ll write to the second tape using the other symbols on the transitions.

Ambiguity • Recall that in non-deterministic recognition multiple paths through a machine may lead to an accept state. • It didn’t matter which path was actually traversed • In FSTs the path to an accept state does matter, since different paths represent different parses and different outputs will result

Ambiguity • What’s the right parse for • Unionizable • Union-ize-able • Un-ion-ize-able • Each represents a valid path through the derivational morphology machine.

Ambiguity • There are a number of ways to deal with this problem • Simply take the first output found • Find all the possible outputs (all paths) and return them all (without choosing) • Bias the search so that only one or a few likely paths are explored

The Gory Details • Of course, it’s not as easy as • “cat +N +PL” <-> “cats” • As we saw earlier there are geese, mice and oxen • But there are also a whole host of spelling/pronunciation changes that go along with inflectional changes • Cats vs Dogs • Fox and Foxes

Multi-Tape Machines • To deal with this we can simply add more tapes and use the output of one tape machine as the input to the next • So to handle irregular spelling changes we’ll add intermediate tapes with intermediate symbols
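
A cascade of this kind can be sketched with two small functions standing in for the two machines. The intermediate symbols are assumptions of ours: "^" marks a morpheme boundary and "#" marks the end of the word. The first step maps the lexical form to the intermediate tape; the second applies a spelling rule (e-insertion) and removes the markers.

```python
import re


def lexical_to_intermediate(lexical: str) -> str:
    """Stage 1: 'fox +N +PL' -> 'fox^s#' (assumed boundary symbols ^ and #)."""
    stem, *tags = lexical.split()
    if tags == ["+N", "+PL"]:
        return stem + "^s#"
    if tags == ["+N", "+SG"]:
        return stem + "#"
    raise ValueError(f"unsupported tags: {tags}")


def intermediate_to_surface(intermediate: str) -> str:
    """Stage 2: apply e-insertion, then delete the boundary symbols."""
    # Insert e between a sibilant-like ending and the plural s.
    with_e = re.sub(r"(x|s|z|ch|sh)\^s#", r"\1^es#", intermediate)
    return with_e.replace("^", "").replace("#", "")


def generate_surface(lexical: str) -> str:
    """The output of the first stage is the input to the second."""
    return intermediate_to_surface(lexical_to_intermediate(lexical))


for form in ["fox +N +PL", "cat +N +PL", "cat +N +SG"]:
    print(form, "->", lexical_to_intermediate(form), "->", generate_surface(form))
```

Keeping the spelling rule on its own stage means the first stage never has to know about e-insertion, and further rules can be added as further stages.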
