CS 8520: Artificial Intelligence
Natural Language Processing Introduction
Paula Matuszek
Spring, 2013
CSC 8520 Spring 2013. Paula Matuszek
Natural Language Processing
• speech recognition
• natural language understanding
• computational linguistics
• psycholinguistics
• information extraction
• information retrieval
• inference
• natural language generation
• speech synthesis
• language evolution
Applied NLP
• Machine translation
• Spelling/grammar correction
• Information retrieval
• Data mining
• Document classification
• Question answering, conversational agents
Natural Language Understanding: Sounds, Symbols, Sense
sound waves → acoustic/phonetic → morphological/syntactic → semantic/pragmatic → internal representation
Where Are the Words? (acoustic/phonetic level)
• "How to recognize speech, not to wreck a nice beach"
• "The cat scares all the birds away"
• "The cat's cares are few"
• Pauses in speech bear little relation to word breaks
• Intonation offers additional clues to meaning
Dissecting Words and Sentences (morphological/syntactic level)
• "The dealer sold the merchant a dog"
• "I saw the Golden bridge flying into San Francisco"
What Does It Mean? (semantic level)
• "I saw Pathfinder on Mars with a telescope"
• "Pathfinder photographed Mars"
• "The Pathfinder photograph from Ford has arrived"
• "When a Pathfinder fords a river it sometimes mars its paint job."
What Does It Mean? (pragmatic level)
• "Jack went to the store. He found the milk in aisle 3. He paid for it and left."
• "Q: Did you read the report?
  A: I read Bob's email."
The Steps in NLP
• Morphology: concerns the way words are built up from smaller meaning-bearing units
• Syntax: concerns how words are put together to form correct sentences, and what structural role each word has
• Semantics: concerns what words mean and how these meanings combine in sentences to form sentence meanings
Taken from ocw.kfupm.edu.sa/user062/ICS48201/NL2introduction.ppt
The Steps in NLP (Cont.)
• Pragmatics: concerns how sentences are used in different situations, and how use affects the interpretation of the sentence
• Discourse: concerns how the immediately preceding sentences affect the interpretation of the next sentence
Taken from ocw.kfupm.edu.sa/user062/ICS48201/NL2introduction.ppt
Some of the Tools
• Regular expressions and finite state automata
• Part-of-speech taggers
• N-grams
• Grammars
• Parsers
• Semantic analysis
Parsing (Syntactic Analysis)
• Assigning a syntactic and logical form to an input sentence
  • uses knowledge about words and word meanings (lexicon)
  • uses a set of rules defining legal structures (grammar)
• Paula ate the apple.
  (S (NP (NAME Paula))
     (VP (V ate)
         (NP (ART the)
             (N apple))))
Taken from ocw.kfupm.edu.sa/user062/ICS48201/NL2introduction.ppt
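The toy parse above can be reproduced with a few lines of top-down parsing. This is an illustrative sketch, not the course's parser: the grammar (S → NP VP, VP → V NP, NP → NAME | ART N) and the four-word lexicon are assumptions read off the slide's parse tree.

```python
# Hypothetical mini-lexicon inferred from the slide's example sentence.
LEXICON = {"Paula": "NAME", "ate": "V", "the": "ART", "apple": "N"}

def parse_np(tokens, i):
    """NP -> NAME | ART N; returns (tree, next_index) or None."""
    if i < len(tokens) and LEXICON.get(tokens[i]) == "NAME":
        return ("NP", ("NAME", tokens[i])), i + 1
    if (i + 1 < len(tokens) and LEXICON.get(tokens[i]) == "ART"
            and LEXICON.get(tokens[i + 1]) == "N"):
        return ("NP", ("ART", tokens[i]), ("N", tokens[i + 1])), i + 2
    return None

def parse_s(tokens):
    """S -> NP VP, where VP -> V NP; returns a nested-tuple parse tree."""
    np = parse_np(tokens, 0)
    if not np:
        return None
    subj, i = np
    if i < len(tokens) and LEXICON.get(tokens[i]) == "V":
        obj = parse_np(tokens, i + 1)
        if obj and obj[1] == len(tokens):
            return ("S", subj, ("VP", ("V", tokens[i]), obj[0]))
    return None

print(parse_s("Paula ate the apple".split()))
```

The nested tuples mirror the bracketed structure on the slide: the sentence splits into a subject NP and a VP, and the VP into a verb and an object NP.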
Word Sense Resolution
• Many words have many meanings or senses
• We need to resolve which of the senses of an ambiguous word is invoked in a particular use of the word
• "I made her duck." (made her a bird for lunch, or made her move her head quickly downwards?)
Taken from ocw.kfupm.edu.sa/user062/ICS48201/NL2introduction.ppt
Reference Resolution
Requires:
• Domain knowledge (registration transaction)
• Discourse knowledge
• World knowledge
Example dialogue:
U: I would like to register in a CSC course.
S: Which number?
U: Make it 8520.
S: Which section?
U: Which section is in the evening?
S: Section 1.
U: Then make it that section.
Taken from ocw.kfupm.edu.sa/user062/ICS48201/NL2introduction.ppt
Stages of NLP
Surface form → Morphological Analysis (uses lexicon; produces stems) → Syntactic Analysis (produces parse tree) → Semantic Analysis (produces internal representation) → Discourse Analysis (resolve references) → Pragmatic Analysis (perform action) → user
Taken from ocw.kfupm.edu.sa/user062/ICS48201/NL2introduction.ppt
Human Languages
• You know ~50,000 words of your primary language, each with several meanings
• A six-year-old knows ~13,000 words
• For the first 16 years we learn about 1 word every 90 minutes of waking time
• Mental grammar generates sentences – virtually every sentence is novel
• 3-year-olds already have 90% of grammar
• ~6,000 human languages – none of them simple!
Adapted from Martin Nowak 2000 – Evolutionary biology of language – Phil. Trans. Royal Society London
Human Spoken Language
• Most complicated mechanical motion of the human body
  • movements must be accurate to within mm
  • synchronized within hundredths of a second
• We can understand up to 50 phonemes/sec (normal speech is 10-15 phonemes/sec)
  • but if a sound is repeated 20 times/sec we hear a continuous buzz!
• All aspects of language processing are involved and manage to keep pace
Adapted from Martin Nowak 2000 – Evolutionary biology of language – Phil. Trans. Royal Society London
Why Language is Hard
• NLP is AI-complete
• Abstract concepts are difficult to represent
• LOTS of possible relationships among concepts
• Many ways to represent similar concepts
• Hundreds or thousands of features/dimensions
Why Language is Easy
• Highly redundant
• Many relatively crude methods provide fairly good results
History of NLP
• Prehistory (1940s, 1950s)
  • automata theory, formal language theory, Markov processes (Turing, McCulloch & Pitts, Chomsky)
  • information theory and probabilistic algorithms (Shannon)
  • Turing test – can machines think?
History of NLP
• Early work:
  • symbolic approach
    • generative syntax – e.g. Transformations and Discourse Analysis Project (TDAP – Harris)
    • AI – pattern matching, logic-based, special-purpose systems
    • Eliza – Rogerian therapist: http://www.manifestation.com/neurotoys/eliza.php3
  • stochastic
    • Bayesian methods
  • early successes – $$$$ grants!
    • by 1966 the US government had spent $20 million on machine translation
  • Critics:
    • Bar-Hillel – no way to disambiguate without deep understanding
    • Pierce, NSF 1966 report – no way to justify the work in terms of practical output
History of NLP
• The middle ages (1970-1990)
  • stochastic
    • speech recognition and synthesis (Bell Labs)
  • logic-based
    • compositional semantics (Montague)
    • definite clause grammars (Pereira & Warren)
  • ad hoc AI-based NLU systems
    • SHRDLU robot in blocks world (Winograd)
    • knowledge representation systems at Yale (Schank)
  • discourse modeling
    • anaphora
    • focus/topic (Grosz et al.)
    • conversational implicature (Grice)
History of NLP
• NLP Renaissance (1990-2000)
  • lessons from phonology & morphology successes:
    • finite-state models are very powerful
    • probabilistic models are pervasive
  • the Web creates new opportunities and challenges
  • practical applications are driving the field again
• 21st century NLP
  • the Web changes everything:
    • much greater use for NLP
    • much more data available
Document Features
• Most NLP is applied to some quantity of unstructured text
• For simplicity, we will refer to any such quantity as a document
• What features of a document are of interest?
• Most common are the actual terms in the document
Tokenization
• Tokenization is the process of breaking up a string of characters into words and other meaningful components (numbers, punctuation, etc.)
• Typically broken up at white space
• A very standard NLP tool
• Language-dependent, and sometimes also domain-dependent:
  • 3,7-dihydro-1,3,7-trimethyl-1H-purine-2,6-dione
• Tokens can also be larger divisions: sentences, paragraphs, etc.
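A minimal tokenization sketch using only the standard library. The regular expression here is an assumption, chosen so that hyphen/comma-connected chemical names like the one above survive as a single token; real tokenizers (NLTK's included) handle many more cases.

```python
import re

def tokenize(text):
    """Split text into word, number, and punctuation tokens.
    \w+(?:[-',.]\w+)* keeps internal hyphens, apostrophes, commas and
    periods inside a token; [^\w\s] emits other punctuation separately."""
    return re.findall(r"\w+(?:[-',.]\w+)*|[^\w\s]", text)

print(tokenize("Call me Ishmael."))
print(tokenize("3,7-dihydro-1,3,7-trimethyl-1H-purine-2,6-dione"))
```

Note how the same pattern that is convenient for chemistry text would be wrong for prose containing "3,7-dihydro" style constructs split across clauses; this is the domain dependence the slide mentions.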
Lexical Analyser
[State diagram: states 0, 1, 2; transitions on A-Z, blank, and EOF]
• Basic idea is a finite state machine
• Triples of: input state, transition token, output state
• Must be very efficient; gets used a LOT
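A sketch of the finite-state idea on the slide: two effective states (outside a word / inside a word), with transitions driven by character class, and a blank acting as the emit/EOF transition. The state encoding is an illustrative assumption, not the slide's exact diagram.

```python
def fsm_tokenize(text):
    """Finite-state word splitter: letters extend the current token,
    anything else (blank, punctuation, EOF) emits it."""
    tokens, current, in_word = [], [], False
    for ch in text + " ":          # trailing blank plays the role of EOF
        if ch.isalpha():           # letter transition: enter/stay "in word"
            current.append(ch)
            in_word = True
        elif in_word:              # blank transition: emit and reset
            tokens.append("".join(current))
            current, in_word = [], False
    return tokens

print(fsm_tokenize("THE CAT SAT"))
```

The per-character loop with no backtracking is what makes this kind of analyser efficient enough to run over every document in a corpus.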
Design Issues for Tokenizer
• Punctuation: treat as whitespace? treat as characters? treat specially?
• Case: fold?
• Digits: assemble into numbers? treat as characters? treat as punctuation?
NLTK Tokenizer
• Natural Language ToolKit
• http://text-processing.com/demo/tokenize/
• Sample text: "Call me Ishmael. Some years ago--never mind how long precisely--having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world."
Document Features
• Once we have a corpus of standardized and possibly tokenized documents, what features of text documents might be interesting?
  • Word frequencies: bag of words (BOW)
  • Language
  • Document length – characters, words, sentences
  • Named entities
  • Parts of speech
  • Average word length
  • Average sentence length
  • Domain-specific stuff
Bag of Words
• Specifically not interested in word order
• Frequency of each possible word in this document
• A very sparse vector!
• In order to assign each count to the correct position, we need to know all the words used in the corpus
  • two-pass
  • reverse index
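The two-pass idea can be sketched directly: one pass over the corpus to fix the vocabulary (and hence each word's position), then one pass per document to fill in counts. The two-document toy corpus is an assumption; the vectors are shown dense for clarity, though in practice they would be stored sparsely.

```python
from collections import Counter

docs = ["the cat sat on the mat", "the dog sat"]   # toy corpus (assumption)

# Pass 1: collect the corpus vocabulary; sorting fixes each word's position.
vocab = sorted({w for d in docs for w in d.split()})

def bow_vector(doc):
    """Pass 2: count words in one document, laid out in vocab order."""
    counts = Counter(doc.split())
    return [counts[w] for w in vocab]

print(vocab)
print(bow_vector(docs[0]))
```

Note that word order is discarded: "the cat sat" and "sat the cat" map to the same vector, which is exactly the bag-of-words trade-off.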
Simplifying BOW
• Do we really want every word?
• Stop words: omit common function words
  • e.g. http://www.ranks.nl/resources/stopwords.html
• Stemming or lemmatization: convert words to a standard form
  • the lemma is a standard word; a stem may not be a word at all
• Synonym matching
• TF*IDF: use the "most meaningful" words
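Stop-word removal is the simplest of these filters. The tiny stop list below is an illustrative assumption; real lists (such as the ranks.nl one linked above) run to a few hundred entries.

```python
# Illustrative stop list (assumption) -- real lists are much longer.
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to", "is"}

def remove_stop_words(tokens):
    """Drop common function words, case-insensitively."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words("the design and construction of the system".split()))
```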
Stemming
• Inflectional stemming: eliminate morphological variants
  • singular/plural, present/past
  • In English: rules plus a dictionary
    • books -> book, children -> child
  • few errors, but many omissions
• Root stemming: eliminate inflections and derivations
  • invention, inventor -> invent
  • much more aggressive
• Which (if either) to use depends on the problem
• http://text-processing.com/demo/stem/
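The "rules plus a dictionary" recipe for inflectional stemming can be sketched in a few lines. The suffix list and exception dictionary here are deliberately tiny assumptions; real stemmers have far larger rule sets, and the length guard only crudely avoids mangling short words.

```python
# Exception dictionary for irregular forms (illustrative assumption).
IRREGULAR = {"children": "child", "geese": "goose"}

def stem(word):
    """Crude inflectional stemmer: dictionary lookup, then suffix rules."""
    w = word.lower()
    if w in IRREGULAR:
        return IRREGULAR[w]
    for suffix in ("ies", "es", "s", "ed", "ing"):
        # only strip if enough of the word remains to be a plausible stem
        if w.endswith(suffix) and len(w) > len(suffix) + 2:
            return w[: -len(suffix)]
    return w

print([stem(w) for w in ["books", "children", "invented"]])
```

Even this toy version shows the slide's "few errors, many omissions" trade-off: conservative rules leave many variants untouched rather than risk bad stems.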
Synonym-Matching
• Map some words to their synonyms
  • http://thesaurus.com/browse/computer
• Problematic in English
  • requires a large dictionary
  • many words have multiple meanings
• In specific domains it may be important
  • biological and chemical domains: http://en.wikipedia.org/wiki/Caffeine
  • any specific domain: Nova, Villanova, V'Nova
TF-IDF
• Word frequency is simple, but:
  • affected by length of document
  • not the best indicator of what a doc is about
    • very common words don't tell us much about differences between documents
    • very uncommon words are typos or idiosyncratic
• Term Frequency * Inverse Document Frequency:
  • tf-idf(j) = tf(j) * idf(j), where idf(j) = log(N / df(j))
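The formula above is short enough to implement directly. The three-document corpus is a toy assumption; note how "the", which appears in every document, gets idf = log(N/N) = 0 and so is weighted away, exactly the behavior the bullets describe.

```python
import math
from collections import Counter

# Toy corpus (assumption): three short documents, already tokenized.
docs = [d.split() for d in
        ["the cat sat", "the dog sat", "the dog barked"]]
N = len(docs)

# df(j): number of documents containing each word (set() avoids double counts).
df = Counter(w for d in docs for w in set(d))

def tfidf(word, doc):
    """tf-idf(j) = tf(j) * idf(j), with idf(j) = log(N / df(j))."""
    tf = doc.count(word)
    return tf * math.log(N / df[word])

print(tfidf("the", docs[0]))   # in all docs: idf = 0, so weight 0
print(tfidf("cat", docs[0]))   # in one doc: positive weight
```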
Language
• Relevant if documents are in multiple languages
• May be known from the source
• Determining the language itself can be considered a form of NLP
• http://translate.google.com/?hl=en&tab=wT
• http://fr.wikipedia.org/
• http://de.wikipedia.org/
Document Counts
• Number of characters, words, sentences
• Average length of words, sentences, paragraphs
• E.g.: clustering documents to determine how many authors have written them, or how many genres are represented
• NLTK + Python makes this easy
Named Entities
• What persons, places, companies are mentioned in documents?
• "Proper nouns"
• One of the most common information extraction tasks
• Combination of rules and a dictionary
  • Example rules:
    • capitalized word not at the beginning of a sentence
    • two capitalized words in a row
    • one or more capitalized words followed by "Inc"
  • Dictionaries of common names, places, major corporations; sometimes called a "gazetteer"
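The example rules plus a gazetteer can be sketched with one regular expression and a sentence-start check. The pattern, the heuristics, and the two-entry gazetteer are illustrative assumptions; real NE recognizers layer many more rules and much larger dictionaries.

```python
import re

# Tiny gazetteer (assumption): known names accepted even at sentence start.
GAZETTEER = {"Villanova", "Acme Inc"}

def find_entities(text):
    """Rule-based NE sketch: runs of capitalized words (optionally ending
    in 'Inc'), kept if multi-word, in the gazetteer, or not sentence-initial."""
    found = set()
    pattern = r"[A-Z][a-z]+(?:\s[A-Z][a-z]+)*(?:\sInc)?"
    for m in re.finditer(pattern, text):
        before = text[:m.start()].rstrip()
        at_sentence_start = before == "" or before.endswith((".", "!", "?"))
        if (" " in m.group() or m.group() in GAZETTEER
                or not at_sentence_start):
            found.add(m.group())
    return found

print(find_entities("She said Paula Matuszek works at Villanova."))
```

The sentence-start check is what filters out "She" while keeping "Villanova", mirroring the first example rule on the slide.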
Parts of Speech
• Part-of-speech (POS) taggers identify nouns, verbs, adjectives, noun phrases, etc.
• The Brill tagger is the best-known rule-based tagger
• More recent work uses machine learning to create taggers from labeled examples
• http://text-processing.com/demo/tag/
• http://cst.dk/online/pos_tagger/uk/index.html
Domain-Specific
• Structure in the document
  • email: To, From, Subject, Body
  • Villanova Campus Currents: News, Academic Notes, Events, Sports, etc.
• Tags in the document
  • the Medline corpus is tagged with MeSH terms
  • Twitter feeds may be tagged with #tags
  • intranet documents might have date, source, department, author, etc.
N-Grams
• The language features above will take you a long way
• But it's not hard to come up with examples where they fail:
  • Dog bites man. vs. Man bites dog.
• Without adding explicit semantics, we can still get substantial additional information by considering sequences of words.
Free Association Exercise
• I am going to say some phrases. Write down the next word or two that occur to you.
  • Microsoft announced a new security ____
  • NHL commissioner cancels rest ____
  • One Fish, ______
  • Colorless green ideas ______
  • Conjunction Junction, what's __________
  • Oh, say, can you see, by the dawn's ______
  • After I finished my homework I went _____.
Human Word Prediction
• Clearly, at least some of us have the ability to predict future words in an utterance.
• How?
  • Domain knowledge
  • Syntactic knowledge
  • Lexical knowledge
Claim
• A useful part of the knowledge needed for word prediction can be captured using simple statistical techniques
• In particular, we'll rely on the notion of the probability of a sequence (a phrase, a sentence)
Applications
• Why do we want to predict a word, given some preceding words?
  • To rank the likelihood of sequences containing various alternative hypotheses, e.g. for automated speech recognition or OCR:
    • Theatre owners say popcorn/unicorn sales have doubled...
  • To assess the likelihood/goodness of a sentence, e.g. for text generation or machine translation:
    • Como mucho pescado. ("I eat a lot of fish.") --> At the most fished.
Real Word Spelling Errors
• They are leaving in about fifteen minuets to go to her house.
• The study was conducted mainly be John Black.
• The design an construction of the system will take more than a year.
• Hopefully, all with continue smoothly in my absence.
• Can they lave him my messages?
• I need to notified the bank of….
• He is trying to fine out.
Example from Dorr, http://www.umiacs.umd.edu/~bonnie/courses/cmsc723-04/lecture-notes/Lecture5.ppt
Language Modeling
• A fundamental tool in NLP
• Main idea:
  • Some words are more likely than others to follow each other
  • You can predict that likelihood fairly accurately
  • In other words, you can build a language model
Adapted from Hearst, http://www.sims.berkeley.edu/courses/is290-2/f04/lectures/lecture4.ppt
N-Grams
• N-grams are sequences of tokens
• The N stands for how many terms are used
  • Unigram: 1 term
  • Bigram: 2 terms
  • Trigram: 3 terms
• You can use different kinds of tokens
  • character-based n-grams
  • word-based n-grams
  • POS-based n-grams
• N-grams give us some idea of the context around the token we are looking at
Adapted from Hearst, http://www.sims.berkeley.edu/courses/is290-2/f04/lectures/lecture4.ppt
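Extracting word-based n-grams is a one-line sliding window; the example sentence is an assumption. The same function works for character-based n-grams if you pass it a string instead of a token list.

```python
def ngrams(tokens, n):
    """All length-n windows over a sequence, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

words = "the cat sat on the mat".split()
print(ngrams(words, 2))   # bigrams
print(ngrams(words, 3))   # trigrams
```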
N-Gram Models of Language
• A language model lets us compute the probability, or likelihood, of a sentence S, P(S)
• N-gram models use the previous N-1 words in a sequence to predict the next word
  • unigrams, bigrams, trigrams, …
• How do we construct or train these language models?
  • Count frequencies in very large corpora
  • Determine probabilities using Markov models, similar to POS tagging
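A minimal bigram model makes the "count frequencies, then estimate probabilities" step concrete: P(S) is approximated as a product of conditional probabilities P(w_i | w_{i-1}), each estimated as count(w_{i-1}, w_i) / count(w_{i-1}). The nine-word corpus is a toy assumption; real models train on very large corpora and smooth the zero counts.

```python
from collections import Counter

# Toy training corpus (assumption).
corpus = "the cat sat on the mat the cat ran".split()
bigrams = Counter(zip(corpus, corpus[1:]))   # counts of adjacent word pairs
unigrams = Counter(corpus[:-1])              # counts of left-hand contexts

def p_next(prev, word):
    """Maximum-likelihood estimate of P(word | prev)."""
    return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

def p_sentence(words):
    """P(S) approximated as the product of bigram probabilities."""
    p = 1.0
    for prev, word in zip(words, words[1:]):
        p *= p_next(prev, word)
    return p

print(p_next("the", "cat"))   # "cat" follows "the" in 2 of 3 occurrences
```

Any unseen pair gets probability zero, which in turn zeroes out the whole sentence; this is exactly why real n-gram models need smoothing.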
Counting Words in Corpora
• What is a word?
  • e.g., are cat and cats the same word?
  • September and Sept?
  • zero and oh?
  • Is _ a word? What about * or ( ?
  • How many words are there in don't? Gonna?
  • In Japanese and Chinese text, how do we identify a word?
• Back to stemming!