Introduction to Natural Language Processing (NLP) Dekang Lin Department of Computing Science University of Alberta lindek@cs.ualberta.ca
Outline • What is NLP? • Applications • Challenges • Linguistics Issues • Course Overview
Textbook • Daniel Jurafsky and James H. Martin, Speech and Language Processing, Prentice-Hall, 2000. • Note errata available on website; check before reading each chapter please
What is Natural Language Processing? • Natural Language Processing • Process information contained in natural language text. • Also known as Computational Linguistics (CL), Human Language Technology (HLT), Natural Language Engineering (NLE) • Can machines understand human language? • Define ‘understand’ • Understanding is the ultimate goal. However, one doesn’t need to fully understand to be useful.
Why Study NLP? • A hallmark of human intelligence. • Text is the largest repository of human knowledge and is growing quickly. • emails, news articles, web pages, IM, scientific articles, insurance claims, customer complaint letters, transcripts of phone calls, technical documents, government documents, patent portfolios, court decisions, contracts, …… • Are we reading any faster than before?
NLP Applications • Question answering • Who is the first Taiwanese president? • Text Categorization/Routing • e.g., customer e-mails. • Text Mining • Find everything that interacts with BRCA1. • Machine (Assisted) Translation • Language Teaching/Learning • Usage checking • Spelling correction • Is that just dictionary lookup?
Challenges in NLP: Ambiguity • Words or phrases can often be understood in multiple ways. • Teacher Strikes Idle Kids • Killer Sentenced to Die for Second Time in 10 Years • They denied the petition for his release that was signed by over 10,000 people. • child abuse expert/child computer expert • Who does Mary love? (three-way ambiguous)
Probabilistic/Statistical Resolution of Ambiguities • When there are ambiguities, choose the interpretation with the highest probability. • Example: how many times do people say • “Mary loves …” • “the Mary love” • Which interpretation has the highest probability?
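A minimal sketch of the counting idea on this slide, in Python. The four-sentence corpus, the reading labels, and the function names are assumptions made up for illustration; a real system would estimate these counts from a large text collection.

```python
from collections import Counter

# Hypothetical toy corpus, for illustration only.
corpus = [
    "mary loves john",
    "mary loves music",
    "john loves mary",
    "the mary love story",
]

def bigram_counts(sentences):
    """Count adjacent word pairs across all sentences."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        counts.update(zip(tokens, tokens[1:]))
    return counts

def pick_interpretation(candidates, counts):
    """Return the reading whose evidence bigram is most frequent,
    together with the relative frequencies of all readings."""
    total = sum(counts[bigram] for bigram in candidates.values())
    if total == 0:
        return None, {}
    probs = {label: counts[bigram] / total for label, bigram in candidates.items()}
    best = max(probs, key=probs.get)
    return best, probs

counts = bigram_counts(corpus)
candidates = {
    "verb reading (Mary loves ...)": ("mary", "loves"),
    "noun reading (the Mary love)": ("mary", "love"),
}
best, probs = pick_interpretation(candidates, counts)
print(best, probs)
```

On this toy corpus the verb reading wins because “mary loves” occurs twice and “mary love” once; with real data the same comparison would be made over far larger counts.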
Challenges in NLP: Variations • Syntactic Variations • I was surprised that Kim lost • It surprised me that Kim lost • That Kim lost surprised me. • The same meaning can be expressed in different ways • Who wrote “The Language Instinct”? • Steven Pinker, an MIT professor and author of “The Language Instinct”, ……
Subareas of Linguistics • Morphology: • structures and patterns in words • analyzes how words are formed from minimal units of meaning, or morphemes, e.g., dogs= dog+s. • Syntax: • structures and patterns in phrases • how phrases are formed by smaller phrases and words
Subareas of Linguistics • Semantics: the meaning of a word or phrase within a sentence • How to represent meaning? • Semantic network? Logic? Policy? • How to construct meaning representation? • Is meaning compositional? • Pragmatics: structures and patterns in discourses • Co-reference resolution • Jane races Mary on weekends. She often beats her. • Implicatures: • How many times do you go skating each week? • Speech acts: • Do you know the time?
Morphology • Morphology is concerned with the internal make-up of words • Input: The fearsome cats attacked the foolish dog • Output: The fear-some cat-s attack-ed the fool-ish dog • Inflectional morphology • Does not change the grammatical category of words: cats/cat-s, attacked/attack-ed • Derivational morphology • May involve changes to grammatical categories: fearsome/fear-some, foolish/fool-ish
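The input/output pair above can be reproduced with a very small suffix splitter, sketched here in Python. The stem list and suffix list are hypothetical and cover only this one sentence; this is not a general morphological analyzer.

```python
# Hypothetical mini-lexicon and suffix list, chosen to cover the example sentence only.
STEMS = {"the", "fear", "cat", "attack", "fool", "dog"}
SUFFIXES = ["some", "ish", "ed", "s"]

def segment(word):
    """Split off one known suffix if what remains is a known stem."""
    lower = word.lower()
    if lower in STEMS:
        return word
    for suffix in SUFFIXES:
        if lower.endswith(suffix) and lower[: -len(suffix)] in STEMS:
            return word[: -len(suffix)] + "-" + suffix
    return word  # unknown word: leave unsegmented

def segment_sentence(sentence):
    return " ".join(segment(w) for w in sentence.split())

result = segment_sentence("The fearsome cats attacked the foolish dog")
print(result)  # The fear-some cat-s attack-ed the fool-ish dog
```

Requiring the remainder to be a known stem is a deliberate choice here: without that check, any word that happens to end in a listed suffix would be split.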
Morphology Is Not as Easy as It May Seem • Examples from Woods et al. 2000 • delegate (de + leg + ate) take the legs from • caress (car + ess) female car • cashier (cashy + er) more wealthy • lacerate (lace + rate) speed of tatting • ratify (rat + ify) infest with rodents • infantry (infant + ry) childish behavior
A Turkish Example [Oflazer & Guzey 1994] • uygarlastiramayabileceklerimizdenmissinizcesine • uygar/civilized las/BECOME tir/CAUS ama/NEG yabil/POT ecek/FUT ler/3PL imiz/POSS-1PL den/ABL mis/NARR siniz/2PL cesine/AS-IF • an adverb meaning roughly “(behaving) as if you were one of those whom we might not be able to civilize.”
Why Not Just Use a Dictionary? • How many words are there in a language? • English: OED 400K entries • Turkish: 600 x 10^6 forms • Finnish: 10^7 forms • New words are being invented all the time • e-mail • IM
Syntax is about Sentence Structures • Sentences have structures and are made up of constituents. • The constituents are phrases. • A phrase consists of a head and modifiers. • The category of the head determines the category of the phrase • e.g., a phrase headed by a noun is a noun phrase
Parsing • Analyze the structure of a sentence • Example: The student put the book on the table • [S [NP [D The] [N student]] [VP [V put] [NP [D the] [N book]] [PP [P on] [NP [D the] [N table]]]]]
Teacher strikes idle kids (two parses) • [S [NP [N Teacher] [N strikes]] [VP [V idle] [NP [N kids]]]] • [S [NP [N Teacher]] [VP [V strikes] [NP [A idle] [N kids]]]]
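The two analyses of “Teacher strikes idle kids” (tag sequences N N V N and N V A N) can be counted mechanically. Below is a rough Python sketch of a CKY-style chart that counts parses. The grammar rules and lexicon are assumptions written just for this headline, not a grammar taken from the course or the textbook.

```python
from collections import defaultdict

# Hypothetical lexicon: each word maps to its possible part-of-speech tags.
LEXICON = {
    "teacher": ["N"],
    "strikes": ["N", "V"],
    "idle": ["V", "A"],
    "kids": ["N"],
}

# Hypothetical binary rules, written as (parent, left child, right child).
BINARY_RULES = [
    ("S", "NP", "VP"),
    ("NP", "N", "N"),
    ("NP", "A", "N"),
    ("VP", "V", "NP"),
]

def count_parses(sentence, start="S"):
    """Count the parse trees rooted in `start` using a CKY-style chart."""
    words = sentence.lower().split()
    n = len(words)
    # chart[(i, j)][category] = number of trees of that category over words[i:j]
    chart = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(words):
        cell = chart[(i, i + 1)]
        for tag in LEXICON.get(word, ()):
            cell[tag] += 1
        # Unary rule NP -> N: a single noun can stand alone as a noun phrase.
        cell["NP"] += cell["N"]
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                left, right = chart[(i, k)], chart[(k, j)]
                for parent, b, c in BINARY_RULES:
                    chart[(i, j)][parent] += left[b] * right[c]
    return chart[(0, n)][start]

n_parses = count_parses("Teacher strikes idle kids")
print(n_parses)  # 2
```

The chart stores counts rather than trees, which keeps the sketch short; recovering the trees themselves would require storing back-pointers in each cell.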
Syntax • Syntax is the study of the regularities and constraints of word order and phrase structure • How words are organized into phrases • How phrases are combined into larger phrases (including sentences).
Course Overview: Background Theories • Linguistics • Syntax • Binding theory • Probability and Information Theory • Markov model • Bayesian network • EM (expectation maximization)
Course Overview: Enabling Technologies • Stemming • Reduce detects, detected, detecting, detect to the same form. • POS Tagging • Determine for each word whether it is a noun, adjective, verb, … • Parsing • sentence → parse tree • Word Sense Disambiguation • orange juice vs. orange coat • Learning from text
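As a rough illustration of the stemming bullet, here is a naive suffix stripper in Python. The suffix list and the minimum stem length are assumptions for this sketch; it is not the Porter stemmer or any other published algorithm, and it will mis-stem many English words.

```python
# Hypothetical suffix list; longer suffixes are tried first.
SUFFIXES = ("ing", "ed", "s")

def stem(word, min_stem_len=3):
    """Strip one suffix if the remaining stem is at least min_stem_len characters."""
    w = word.lower()
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= min_stem_len:
            return w[: -len(suffix)]
    return w

forms = ["detects", "detected", "detecting", "detect"]
stems = {stem(f) for f in forms}
print(stems)  # {'detect'}
```

The minimum stem length guards against stripping short words such as “is” or “red” down to nothing useful.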
Course Overview: Applications • Question Answering • Machine Translation • Text Mining/Information Extraction