Natural Language Processing Sheldon Liang, Ph D Computer Science Department AI@Azusa Pacific University
AI@Azusa Pacific University Sense, Communicate, Actuate
AI@Azusa Pacific University Natural? • Natural Language? • Refers to the language spoken by people, e.g. English, Chinese, Swahili, as opposed to artificial languages, like C++, Java, etc. • Natural Language Processing • Applications that deal with natural language in one way or another; a subfield of Artificial Intelligence • Computational Linguistics • Doing linguistics on computers • More on the linguistic side than NLP, but closely related
AI@Azusa Pacific University What is Artificial Intelligence? • The use of computer programs and programming techniques to cast light on the principles of intelligence in general and human thought in particular (Boden) • AI is the study of how to do things which at the moment people do better (Rich & Knight) • AI is the science of making machines do things that would require intelligence if done by men. (Minsky)
AI@Azusa Pacific University Why Natural Language Processing? • kJfmmfj mmmvvv nnnffn333 • Uj iheale eleee mnster vensi credur • Baboi oi cestnitze • Coovoel2^ ekk; ldsllk lkdf vnnjfj? • Fgmflmllk mlfm kfre xnnn!
AI@Azusa Pacific University Computers Lack Knowledge! • Computers “see” English text the same way you saw the previous text! • People have no trouble understanding language • Common sense knowledge • Reasoning capacity • Experience • Computers have • No common sense knowledge • No reasoning capacity • Unless we teach them!
AI@Azusa Pacific University Why Natural Language Processing? • Huge amounts of data • Internet = at least 8 billion pages • Intranet • Applications for processing large amounts of texts require NLP expertise • Classify text into categories • Index and search large texts • Automatic translation • Speech understanding • Understand phone conversations • Information extraction • Extract useful information from resumes • Automatic summarization • Condense 1 book into 1 page • Question answering • Knowledge acquisition • Text generation / dialogs
AI@Azusa Pacific University Where does it fit in the CS taxonomy? • Computers & Applications: Databases, Artificial Intelligence, Algorithms, Networking • Artificial Intelligence: Search, Robotics, Natural Language Processing • Natural Language Processing: Information Retrieval, Machine Translation, Language Analysis • Language Analysis: Semantics, Parsing
AI@Azusa Pacific University Situating NLP • NLP sits among: philosophy, linguistics, computer science, communication, math/statistics, psychology/cognitive science
AI@Azusa Pacific University Theoretical foundations • math: statistics, calculus, algebra, modeling • computational paradigms: connectionist, rule-based, cognitively plausible • linguistics: LFG, HPSG, GB, OT, CG, etc. • architectures: stacks, automata, networks, compilers
AI@Azusa Pacific University Some areas of research • Corpora, tools, resources, standards • Language/grammar engineering • Machine (assisted) translation, tools • Language modeling • Lexicography • Speech
AI@Azusa Pacific University Linguistics Essentials
AI@Azusa Pacific University The Description of Language • Language = Words and Rules: Dictionary (vocabulary) + Grammar • Dictionary: set of words defined in the language; open (dynamic) • Traditional: paper based • Electronic: machine readable dictionaries; can be obtained from paper-based • Grammar: set of rules which describe what is allowable in a language • Classic Grammars: meant for humans who know the language • definitions and rules are mainly supported by examples • no (or almost no) formal description tools; cannot be programmed • Explicit Grammar (CFG, Dependency Grammars, Link Grammars,...): formal description; can be programmed & tested on data (texts)
AI@Azusa Pacific University Linguistics Levels of Analysis • Speech • Written language • Phonology: sounds / letters / pronunciation • Morphology: the structure of words • Syntax: how sequences of words are structured • Semantics: meaning of the strings • Interaction between levels, where each level has an input and an output.
AI@Azusa Pacific University Phonetics/Orthography • Input: • acoustic signal (phonetics) / text (orthography) • Output: • phonetic alphabet (phonetics) / text (orthography) • Deals with: • Phonetics: • consonant & vowel (& others) formation in the vocal tract • classification of consonants, vowels, ... in relation to frequencies, shape & position of the tongue and various muscles • intonation • Orthography: normalization, punctuation, etc.
AI@Azusa Pacific University Phonology -- pronunciation • Input: • sequence of phones/sounds (in a phonetic alphabet); or “normalized” text (sequence of (surface) letters in one language’s alphabet) [NB: phones vs. phonemes] • Output: • sequence of phonemes (~ (lexical) letters; in an abstract alphabet) • Deals with: • relation between sounds and phonemes (units which might have some function on the upper level) • e.g.: [u] ~ oo (as in book), [æ] ~ a (cat); i ~ y (flies)
AI@Azusa Pacific University Morphology -- the structure of words • Input: sequence of phonemes (~ (lexical) letters) • Output: • sequence of pairs (lemma, (morphological) tag) • Deals with: • composition of phonemes into word forms and their underlying lemmas (lexical units) + morphological categories (inflection, derivation, compounding) • e.g. quotations ~ quote/V + -ation(der.V->N) + NNS.
AI@Azusa Pacific University ...and Beyond • Input: • sentence structure (tree): annotated nodes (autosemantic lemmas, (morphosyntactic) tags, deep functions) • Output: • logical form, which can be evaluated (true/false) • Deals with: • assignment of objects from the real world to the nodes of the sentence structure • e.g.: (I/Sg/Pat/t (see/Perf/Pred/t) Tom/Sg/Ag/f) ~ see(Mark-Twain[SSN:...],Tom-Sawyer[SSN:...])
AI@Azusa Pacific University Phonology • (Surface ↔ Lexical) Correspondence • “symbol-based” (no complex structures) • Ex.: (stem-final change) • lexical: b a b y + s (+ denotes start of ending) • surface: b a b i e s (phonetic-related: bébì0s) • Arabic: (interfixing, inside-stem doubling) • lexical: kTb+uu+CVCCVC (CVCC...vowel/consonant pattern) • surface: kuttub
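The lexical-to-surface correspondence for "b a b y + s" can be sketched as a single rewrite rule. This is only an illustration: the function name `surface_form` and the one rule it applies (consonant + y becomes ie before the ending s) are assumptions made for this sketch, not a full two-level phonology.

```python
import re


def surface_form(lexical: str) -> str:
    """Map a lexical form such as 'baby+s' to a surface spelling.

    Toy rule only: a stem ending in consonant + 'y' changes 'y' to 'ie'
    before the ending 's'. Otherwise the '+' boundary is simply dropped.
    """
    stem, _, ending = lexical.partition("+")
    if ending == "s" and re.search(r"[^aeiou]y$", stem):
        stem = stem[:-1] + "ie"
    return stem + ending


print(surface_form("baby+s"))  # babies
print(surface_form("boy+s"))   # boys (vowel before 'y', rule does not fire)
```

A realistic system would hold many such rules and apply them in both directions (analysis and generation).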
AI@Azusa Pacific University Phonology Examples • German (umlaut) (satz ~ sentence) • lexical: s A t z + e (A denotes “umlautable” a) • surface: s ä t z e (phonetic: zæce, vs. zac) • Turkish (vowel harmony) • lexical: e v + l A r (~house) • surface: e v l e r
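The Turkish example can be sketched in a few lines: the abstract vowel "A" in the lexical suffix is realized as "e" after a front vowel and as "a" otherwise. The function name `realize_plural` and the second test word are assumptions added for illustration; only the two-way a/e harmony shown on the slide is modeled.

```python
FRONT_VOWELS = set("eiöü")
BACK_VOWELS = set("aıou")


def realize_plural(lexical: str) -> str:
    """Resolve the abstract vowel 'A' in a lexical form like 'ev+lAr'."""
    stem, _, suffix = lexical.partition("+")
    # The last vowel of the stem decides the harmony class of the suffix.
    last_vowel = next(
        (ch for ch in reversed(stem) if ch in FRONT_VOWELS | BACK_VOWELS),
        None,
    )
    harmonized = "e" if last_vowel in FRONT_VOWELS else "a"
    return stem + suffix.replace("A", harmonized)


print(realize_plural("ev+lAr"))     # evler
print(realize_plural("kitap+lAr"))  # kitaplar
```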
AI@Azusa Pacific University Morphology: Morphemes & Order • Scientific study of forms of words • Grouping of phonemes into morphemes • sequence deliverables ~ deliver, able and s (3 units) • could as well be some “ID” numbers: • e.g. deliver ~ 23987, s ~ 12, able ~ 3456 • Morpheme Combination • certain combinations/sequencing possible, others not: • deliver+able+s, but not able+deliver+s; noun+s, but not noun+ing • typically fixed (in any given language)
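The idea that only certain morpheme sequences are allowed can be sketched as a table of permitted transitions between morpheme classes. The class names, the tiny lexicon, and the `ALLOWED` table below are all hypothetical and cover just the examples on the slide.

```python
# Hypothetical morpheme classes for a handful of morphemes.
MORPHEME_CLASS = {
    "deliver": "verb_stem",
    "dog": "noun_stem",
    "able": "deriv_suffix",
    "s": "infl_s",
    "ing": "infl_ing",
}

# Which class may follow which (START marks the beginning of the word).
ALLOWED = {
    ("START", "verb_stem"),
    ("START", "noun_stem"),
    ("verb_stem", "deriv_suffix"),
    ("verb_stem", "infl_s"),
    ("verb_stem", "infl_ing"),
    ("deriv_suffix", "infl_s"),
    ("noun_stem", "infl_s"),
}


def is_valid_sequence(word: str) -> bool:
    """Check a '+'-separated morpheme sequence against the transition table."""
    previous = "START"
    for morpheme in word.split("+"):
        current = MORPHEME_CLASS.get(morpheme)
        if current is None or (previous, current) not in ALLOWED:
            return False
        previous = current
    return True


print(is_valid_sequence("deliver+able+s"))  # True
print(is_valid_sequence("able+deliver+s"))  # False
print(is_valid_sequence("dog+ing"))         # False
```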
AI@Azusa Pacific University The Dictionary (or Lexicon) • Repository of information about words: • Morphological: • description of morphological “behavior”: inflection patterns/classes • Syntactic: • Part of Speech • relations to other words: • subcategorization (or “surface valency frames”) • Semantic: • semantic features • frames • ...and any other! (e.g., translation)
AI@Azusa Pacific University (Surface) Syntax • Input: • sequence of pairs (lemma, (morphological) tag) • Output: • sentence structure (tree) with annotated nodes (all lemmas, (morphosyntactic) tags, functions), of various forms • Deals with: • the relation between lemmas & morphological categories and the sentence structure • uses syntactic categories such as Subject, Verb, Object,... • e.g.: I/PP1 see/VB a/DT dog/NN ~ • ((I/sg)SB ((see/pres)V (a/ind dog/sg)OBJ)VP)S
AI@Azusa Pacific University Issues in Syntax • “the dog ate my homework” - Who did what? • 1. Identify the part of speech (POS) • Dog = noun; ate = verb; homework = noun • English POS tagging: 95% (can be improved!) • Part-of-speech tagging for other languages is almost nonexistent • 2. Identify collocations • mother in law, hot dog • Compositional versus non-compositional collocates
AI@Azusa Pacific University Issues in Syntax • Shallow parsing: “the dog chased the bear” • “the dog” “chased the bear”: subject - predicate • Identify basic structures: NP-[the dog] VP-[chased the bear] • Shallow parsing on new languages • Shallow parsing with little training data
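A minimal sketch of the subject/predicate split for “the dog chased the bear” is given below. It assumes a hand-written tag lookup (`TOY_TAGS`) in place of a trained POS tagger, and a single rule: everything before the first verb is the NP, and the verb onward is the VP. Real shallow parsers learn such chunk boundaries from annotated data.

```python
# Hypothetical tag dictionary standing in for a trained POS tagger.
TOY_TAGS = {"the": "DT", "dog": "NN", "bear": "NN", "chased": "VBD"}


def shallow_parse(sentence: str) -> list[tuple[str, str]]:
    """Split a simple sentence into a subject NP and a predicate VP."""
    tokens = sentence.lower().split()
    tagged = [(token, TOY_TAGS[token]) for token in tokens]
    # Index of the first verb; assumes the sentence contains one.
    verb_index = next(
        i for i, (_, tag) in enumerate(tagged) if tag.startswith("VB")
    )
    return [
        ("NP", " ".join(tokens[:verb_index])),
        ("VP", " ".join(tokens[verb_index:])),
    ]


print(shallow_parse("the dog chased the bear"))
# [('NP', 'the dog'), ('VP', 'chased the bear')]
```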
AI@Azusa Pacific University Issues in Syntax • Full parsing: John loves Mary • Current precision: 85-88% • Helps figure out (automatically) questions like: Who did what and when?
AI@Azusa Pacific University Meaning (semantics) • Input: • sentence structure (tree) with annotated nodes (lemmas, (morphosyntactic) tags, surface functions) • Output: • sentence structure (tree) with annotated nodes (semantic lemmas, (morpho-syntactic) tags, deep functions) • Deals with: • relation between categories such as “Subject”, “Object” and (deep) categories such as “Agent”, “Effect”; adds other cat’s • e.g. ((I)SB ((was seen)V (by Tom)OBJ)VP)S ~ • (I/Sg/Pat/t (see/Perf/Pred/t) Tom/Sg/Ag/f)
AI@Azusa Pacific University Issues in Semantics • Understand language! How? • “plant” = industrial plant • “plant” = living organism • Words are ambiguous • Importance of semantics? • Machine Translation: wrong translations • Information Retrieval: wrong information • Anaphora Resolution: wrong referents
AI@Azusa Pacific University Why Semantics? • The sea is home to millions of plants and animals • English → French [commercial MT system] • Le mer est a la maison de billion des usines et des animaux • French → English • The sea is at the home for billions of factories and animals
AI@Azusa Pacific University Issues in Semantics • How to learn the meaning of words? • From dictionaries: • plant, works, industrial plant -- (buildings for carrying on industrial labor; "they built a large plant to manufacture automobiles") • plant, flora, plant life -- (a living organism lacking the power of locomotion) • Examples: • They are producing about 1,000 automobiles in the new plant • The sea flora consists of 1,000 different plant species • The plant was close to the farm of animals.
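One way to use dictionary entries like the two above is to pick the sense whose entry shares the most words with the sentence (a simplified Lesk-style overlap, named here as an illustration rather than as the method the slides intend). The stopword list, sense labels, and function names below are assumptions made for this sketch; the sense signatures are the synonyms and glosses quoted on the slide.

```python
import re

STOPWORDS = {"a", "an", "the", "of", "to", "for", "on", "in",
             "they", "are", "is", "was"}

# Sense signatures: synonyms plus gloss text from the dictionary entries.
SENSES = {
    "industrial plant": (
        "plant works industrial plant buildings for carrying on industrial "
        "labor they built a large plant to manufacture automobiles"
    ),
    "living organism": (
        "plant flora plant life a living organism lacking the power "
        "of locomotion"
    ),
}


def content_words(text: str, target: str) -> set[str]:
    """Lowercased alphabetic words, minus stopwords and the target itself."""
    words = re.findall(r"[a-z]+", text.lower())
    return {w for w in words if w not in STOPWORDS and w != target}


def simplified_lesk(sentence: str, target: str = "plant"):
    """Return the sense with the largest word overlap, or None if no overlap."""
    context = content_words(sentence, target)
    overlaps = {
        sense: len(context & content_words(signature, target))
        for sense, signature in SENSES.items()
    }
    best = max(overlaps, key=overlaps.get)
    return best if overlaps[best] > 0 else None


print(simplified_lesk("They are producing about 1,000 automobiles in the new plant"))
print(simplified_lesk("The sea flora consists of 1,000 different plant species"))
print(simplified_lesk("The plant was close to the farm of animals"))
```

The third sentence shares no content words with either entry, so the sketch returns None: dictionary overlap alone cannot decide it.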
AI@Azusa Pacific University Issues in Semantics • Learn from annotated examples: • Assume 100 examples containing “plant” previously tagged by a human • Train a learning algorithm • Precisions in the range 60%-70%-(80%) • How to choose the learning algorithm? • How to obtain the 100 tagged examples?
AI@Azusa Pacific University Issues in Learning Semantics • Learning? • Assume a (large) amount of annotated data = training • Assume a new text not annotated = test • Learn from previous experience (training) to classify new data (test) • Decision trees, memory based learning, neural networks • Machine Learning • Which one performs best?
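The train-then-classify loop can be sketched with a small Naive Bayes classifier. Naive Bayes is swapped in here only because it fits in a few lines; the slide itself lists decision trees, memory based learning, and neural networks. The four training sentences and their labels are invented for illustration and stand in for the human-tagged examples.

```python
import math
from collections import Counter, defaultdict

# Invented stand-in for human-annotated training data.
TRAIN = [
    ("the plant manufactures automobiles", "factory"),
    ("workers at the plant went on strike", "factory"),
    ("the plant needs water and sunlight", "organism"),
    ("this plant has green leaves", "organism"),
]


def train(examples):
    """Count words per label and how often each label occurs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocab


def classify(text, model):
    """Pick the label with the highest add-one-smoothed log probability."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        n_words = sum(word_counts[label].values())
        score = math.log(label_counts[label] / total)
        for word in text.split():
            if word in vocab:  # words never seen in training are ignored
                score += math.log(
                    (word_counts[label][word] + 1) / (n_words + len(vocab))
                )
        scores[label] = score
    return max(scores, key=scores.get)


model = train(TRAIN)
print(classify("the plant has leaves and needs sunlight", model))  # organism
print(classify("the workers strike at the plant", model))          # factory
```

Comparing learners then amounts to swapping out `train` and `classify` and measuring accuracy on the same held-out test set.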
AI@Azusa Pacific University Issues in Semantics • Automatic annotation of data • Active learning • Identify only the hard examples • Co-training • Identify the examples where several techniques agree on the semantic tag • Collecting from Web users • Open Mind Word Expert
AI@Azusa Pacific University Problems faced by Natural Language-Understanding Systems
AI@Azusa Pacific University Key NLP problem: ambiguity • Human Language is highly ambiguous at all levels • acoustic level: recognize speech vs. wreck a nice beach • morphological level: saw: to see (past), saw (noun), to saw (present, inf) • syntactic level: I saw the man on the hill with a telescope • semantic level: One book has to be read by every student
AI@Azusa Pacific University Language Model • A formal model about language • Two types • Non-probabilistic • Allows one to compute whether a certain sequence (sentence or part thereof) is possible • Often grammar based • Probabilistic • Allows one to compute the probability of a certain sequence • Often extends grammars with probabilities
AI@Azusa Pacific University Example of Bad Language Model
AI@Azusa Pacific University A Good Language Model • Non-Probabilistic • “I swear to tell the truth” is possible • “I swerve to smell de soup” is impossible • Probabilistic • P(I swear to tell the truth) ~ .0001 • P(I swerve to smell de soup) ~ 0
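Both kinds of model can be sketched with bigram counts. The three-sentence corpus below is invented for illustration, so the probabilities it yields are much larger than the ~.0001 on the slide; the point is only that a sentence made of unseen word pairs gets probability 0 (and so counts as "impossible" in the non-probabilistic reading). No smoothing is applied.

```python
from collections import Counter

# Tiny invented training corpus.
CORPUS = [
    "i swear to tell the truth",
    "i want to tell the truth",
    "i swear to tell the story",
]


def train_bigrams(sentences):
    """Count context words and adjacent word pairs, with boundary markers."""
    contexts, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        contexts.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
    return contexts, bigrams


def sentence_probability(sentence, contexts, bigrams):
    """Unsmoothed bigram probability: product of count(prev, cur) / count(prev)."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    probability = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        if contexts[prev] == 0:
            return 0.0
        probability *= bigrams[(prev, cur)] / contexts[prev]
    return probability


def is_possible(sentence, contexts, bigrams):
    """Non-probabilistic view: a sentence is possible if its probability is > 0."""
    return sentence_probability(sentence, contexts, bigrams) > 0


contexts, bigrams = train_bigrams(CORPUS)
print(sentence_probability("I swear to tell the truth", contexts, bigrams))
print(sentence_probability("I swerve to smell de soup", contexts, bigrams))  # 0.0
```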
AI@Azusa Pacific University Language Model Application • Spelling correction • Mobile phone texting • Speech recognition • Handwriting recognition • Disabled users • …
AI@Azusa Pacific University Speech & Text segmentation • In spoken language, sounds representing successive letters blend into each other • This makes the conversion of the analog signal to discrete characters very difficult • Regarding text segmentation, some written languages like Chinese, Japanese and Thai do not mark word boundaries explicitly • So any significant text parsing requires identifying word boundaries, which is often a non-trivial task
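A common baseline for finding word boundaries is greedy forward maximum matching against a word list. The sketch below uses English text with the spaces removed as a stand-in for an unsegmented script; the lexicon and the function name `max_match` are assumptions for illustration.

```python
# Hypothetical word list; a real system would use a large dictionary.
LEXICON = {"the", "dog", "do", "ate", "my", "home", "work", "homework"}


def max_match(text: str, lexicon: set[str]) -> list[str]:
    """Greedy segmentation: always take the longest lexicon word that fits."""
    max_len = max(map(len, lexicon))
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon:
                break
        else:
            j = i + 1  # no match: emit a single unknown character
        words.append(text[i:j])
        i = j
    return words


print(max_match("thedogatemyhomework", LEXICON))
# ['the', 'dog', 'ate', 'my', 'homework']
```

Greedy matching can pick the wrong boundary when a long word hides a better split, which is one reason the task is non-trivial.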
AI@Azusa Pacific University Word sense disambiguation • Word sense disambiguation is the problem of selecting a sense for a word from a set of predefined possibilities. • Sense Inventory usually comes from a dictionary or thesaurus. • Knowledge intensive methods, supervised learning, and (sometimes) bootstrapping approaches • Word sense discrimination is the problem of dividing the usages of a word into different meanings, without regard to any particular existing sense inventory. • Unsupervised techniques
AI@Azusa Pacific University Word sense disambiguation: Computers versus Humans • Polysemy – most words have many possible meanings. • A computer program has no basis for knowing which one is appropriate, even if it is obvious to a human… • Ambiguity is rarely a problem for humans in their day to day communication, except in extreme cases…
AI@Azusa Pacific University Word sense disambiguation: Ambiguity for a Computer • The fisherman jumped off the bank and into the water. • The bank down the street was robbed! • Back in the day, we had an entire bank of computers devoted to this problem. • The bank in that road is entirely too steep and is really dangerous. • The plane took a bank to the left, and then headed off towards the mountains.
AI@Azusa Pacific University Syntactic ambiguity • There are often multiple possible parse trees for a given sentence. • Choosing the most appropriate one usually requires semantic and contextual information. • Specific problem components here are: • Sentence boundary disambiguation • Imperfect input • Foreign or regional accents etc.
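To make "multiple possible parse trees" concrete, the sketch below counts the parses of the earlier example sentence “I saw the man on the hill with a telescope” with a CKY-style chart. The grammar and lexicon are toy assumptions written for this one sentence; under them, the two prepositional phrases can attach in five different ways.

```python
from collections import defaultdict

# Toy grammar in binary form: (parent, left child, right child).
BINARY_RULES = [
    ("S", "NP", "VP"),
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),
    ("NP", "NP", "PP"),
    ("NP", "Det", "N"),
    ("PP", "P", "NP"),
]

# Toy lexicon: one category per word.
LEXICON = {
    "i": "NP", "saw": "V", "the": "Det", "a": "Det",
    "man": "N", "hill": "N", "telescope": "N",
    "on": "P", "with": "P",
}


def count_parses(sentence: str) -> int:
    """Count distinct parse trees rooted in S using a CKY-style chart."""
    words = sentence.lower().split()
    n = len(words)
    # chart[(start, end, symbol)] = number of trees for that span and symbol
    chart = defaultdict(int)
    for i, word in enumerate(words):
        chart[(i, i + 1, LEXICON[word])] = 1
    for span in range(2, n + 1):
        for start in range(n - span + 1):
            end = start + span
            for mid in range(start + 1, end):
                for parent, left, right in BINARY_RULES:
                    chart[(start, end, parent)] += (
                        chart[(start, mid, left)] * chart[(mid, end, right)]
                    )
    return chart[(0, n, "S")]


print(count_parses("I saw the man"))                                # 1
print(count_parses("I saw the man on the hill with a telescope"))   # 5
```

Choosing among the five trees is where semantic and contextual information (or probabilities attached to the rules) comes in.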
AI@Azusa Pacific University Syntactic ambiguity
AI@Azusa Pacific University Statistical NLP • Statistical NLP uses stochastic, probabilistic and statistical methods to resolve some difficulties of NLP • Methods for disambiguation often involve the use of corpora & Markov models. • Technology for statistical NLP comes from machine learning and data mining, both of which involve learning from data.