Opportunities in Natural Language Processing
Outline • Overview of the field • Why are language technologies needed? • What technologies are there? • What are interesting problems where NLP can and can’t deliver progress? • NL/DB interface • Web search • Product Info, e-mail • Text categorization, clustering, IE • Finance, small devices, chat rooms • Question answering
What is Natural Language Processing? • Natural Language Processing • Process information contained in natural language text. • Also known as Computational Linguistics (CL), Human Language Technology (HLT), Natural Language Engineering (NLE) • Can machines understand human language? • Define ‘understand’ • Understanding is the ultimate goal. However, one doesn’t need to fully understand to be useful.
What is it.. • Analyze, understand and generate human languages just like humans do. • Applying computational techniques to language domain.. • To explain linguistic theories, to use the theories to build systems that can be of social use.. • Started off as a branch of Artificial Intelligence.. • Borrows from Linguistics, Psycholinguistics, Cognitive Science & Statistics. • Make computers learn our language rather than we learn theirs.
Why Study NLP? • A hallmark of human intelligence. • Text is the largest repository of human knowledge and is growing quickly. • emails, news articles, web pages, IM, scientific articles, insurance claims, customer complaint letters, transcripts of phone calls, technical documents, government documents, patent portfolios, court decisions, contracts, …… • Are we reading any faster than before?
Why are language technologies needed? • Many companies would make a lot of money if they could use computer programmes that understood text or speech. Just imagine if a computer could be used for: • answering the phone, and replying to a question • understanding the text on a Web page to decide who it might be of interest to • translating a daily newspaper from Japanese to English (an attempt is made to do this already) • understanding text in journals / books and building an expert system based on that understanding
Dreams?? • Also called Natural Language Processing (Application part) • Show me Star Trek..?? (Talk to your TV set) • Will my computer talk to me like another human ?? • Will the search engine get me exactly what I am looking for?? • Can my PC read the whole newspaper and tell me the important news only..?? • Can my palmtop translate what that Japanese lady is telling me.. ?? • Ahhh.. Can my PC do my English homework ?? • Do you know how our brain processes language ??
NLP Applications • Question answering • Who is the first Taiwanese president? • Text Categorization/Routing • e.g., customer e-mails. • Text Mining • Find everything that interacts with BRCA1. • Machine (Assisted) Translation • Language Teaching/Learning • Usage checking • Spelling correction • Is that just dictionary lookup?
Application areas • Text-to-Speech & Speech recognition • Natural Language Dialogue Interfaces to Databases • Information Retrieval • Information Extraction • Document Classification • Document Image Analysis • Automatic Summarization • Text Proofreading – Spelling & Grammar • Machine Translation • Story understanding systems • Plagiarism detection • Can you think of anything else ??
Big Deal • L = Words + rules + exceptions.. • Ambiguity at all levels.. • We speak different languages.. • And language is a cultural entity.. • So they are not equivalent.. • Highly systematic but also complex.. • Keeps changing.. New words, New rules and New exceptions.. • Source : Electronic texts / Printed texts / Acoustic Speech Signal.. they are noisy.. • Language looks obvious to us.. But it is a Big Deal ☺!
Where does it fit in the CS taxonomy? • Computers: Databases, Artificial Intelligence, Algorithms, Networking • Artificial Intelligence: Search, Robotics, Natural Language Processing • Natural Language Processing: Information Retrieval, Machine Translation, Language Analysis • Language Analysis: Semantics, Parsing
Early days.. • How to measure Intelligence of a Machine? • Turing test – Alan Turing (1950) • A machine can be accepted as intelligent if it can convince a judge that it is human in a teletyped conversation. • ELIZA by Weizenbaum (1966) • Pretends to be a psychiatrist and converses with a user about his problems. • Uses keyword pattern matching • Many users thought the machine really understood their problem. • Many such systems exist now. E.g. Alan, Alice, David • Can such tests be taken as a measure of Intelligence? Debate goes on..
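The keyword pattern matching mentioned for ELIZA can be sketched roughly as follows. This is a minimal illustration, not Weizenbaum's original script: the rules in `RULES`, the `DEFAULT_REPLY` text and the `respond` function are all hypothetical names and patterns chosen for the example.

```python
import re

# Hypothetical keyword rules in the spirit of ELIZA; not the original script.
# Each rule pairs a keyword pattern with a reply template.
RULES = [
    (re.compile(r"\bI need (.+)", re.IGNORECASE), "Why do you need {0}?"),
    (re.compile(r"\bI am (.+)", re.IGNORECASE), "How long have you been {0}?"),
    (re.compile(r"\b(mother|father)\b", re.IGNORECASE), "Tell me more about your {0}."),
]
DEFAULT_REPLY = "Please go on."


def respond(utterance: str) -> str:
    """Return the reply of the first rule whose keyword pattern matches."""
    text = utterance.strip().rstrip(".!?")
    for pattern, template in RULES:
        match = pattern.search(text)
        if match:
            # Echo the captured fragment back inside the canned template.
            return template.format(match.group(1))
    return DEFAULT_REPLY


if __name__ == "__main__":
    print(respond("I am unhappy."))
    print(respond("The weather is nice"))
```

Nothing here models meaning: the program only echoes matched fragments back, which is why users' impression of being understood is so striking.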
Early days.. • SHRDLU • Can understand Natural Language commands. • Developed by Terry Winograd at the MIT AI Lab (1968–70) using Lisp. • Works on a “Blocks World”, a simulated environment in which blocks like coloured cubes, cylinders, pyramids can be moved around, placed over each other, etc. • Understands a bit of anaphora. • Memory to store history. • Successful demonstration of AI.
What’s the world’s most used database? • Oracle? • Excel? • Perhaps, Microsoft Word? • Data only counts as data when it’s in columns? • But there’s oodles of other data: reports, spec. sheets, customer feedback, plans, … • “The Unix philosophy”
“Databases” in 1992 • Database systems (mostly relational) are the pervasive form of information technology providing efficient access to structured, tabular data primarily for governments and corporations: Oracle, Sybase, Informix, etc. • (Text) Information Retrieval systems are a small market dominated by a few large systems providing information to specialized markets (legal, news, medical, corporate info): Westlaw, Medline, Lexis/Nexis • Commercial NLP market basically nonexistent • mainly DARPA work
“Databases” in 2002 • A lot of new things seem important: • Internet, Web search, Portals, Peer-to-Peer, Agents, Collaborative Filtering, XML/Metadata, Data mining • Is everything the same, different, or just a mess? • There is more of everything, it’s more distributed, and it’s less structured. • Large textbases and information retrieval are a crucial component of modern information systems, and have a big impact on everyday people (web search, portals, email)
Linguistic data is ubiquitous • Most of the information in most companies, organizations, etc. is material in human languages (reports, customer email, web pages, discussion papers, text, sound, video) – not stuff in traditional databases • Estimates: 70%, 90% ?? [all depends how you measure]. Most of it. • Most of that information is now available in digital form: • Estimate for companies in 1998: about 60% [CAP Ventures/Fuji Xerox]. More like 90% now?
The problem • When people see text, they understand its meaning (by and large) • When computers see text, they get only character strings (and perhaps HTML tags) • We'd like computer agents to see meanings and be able to intelligently process text • These desires have led to many proposals for structured, semantically marked up formats • But often human beings still resolutely make use of text in human languages • This problem isn’t likely to just go away.
Levels of Language Analysis • Phonology • Morphology • Syntax • Semantics • Pragmatics • Discourse
Phonology • Speech processing • Humans process speech remarkably well. • Speech interface can replace keyboards and monitors. • Convert Acoustic signals to Text. • Phonemes are the smallest recognizable speech unit in a language. • Graphemes are the textual representation. • Phonemes can be identified using their phonetic & spectral features.
Speech – So is it difficult ? • “It's very hard to wreck a nice beach ” • Pronunciation of different speakers • Pace of speech • Speech ambiguity – Homophones • I ate eight cakes • That band is banned • I went to the mall nearby to buy some food • The Finnish were the first ones to finish • I know no James Bond.
Morphology: What is a word? • Morphology is all about the words. • Make more words from less ☺. • Structures and patterns in words • Analyzes how words are formed from minimal units of meaning, or morphemes, e.g., dogs= dog+s. • Words are a sequence of Morphemes. • Morpheme – smallest meaningful unit in a word. Free & Bound. • Inflectional Morphology – Same Part of Speech • Buses = Bus + es • Carried = Carry + ed • Derivational Morphology – Change PoS. • Destruct + ion = Destruction (Noun) • Beauty + ful = Beautiful (Adjective) • Affixes – Prefixes, Suffixes & Infixes • Rules govern the fusion.
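The inflectional examples on this slide (dogs = dog + s, Buses = Bus + es, Carried = Carry + ed) can be sketched as suffix stripping plus a spelling rule. This is a minimal sketch under strong assumptions: `LEXICON`, `RULES` and `analyze` are hypothetical names, and the rule table covers only what the slide's examples need.

```python
# Hypothetical stem lexicon, just enough for the slide's examples.
LEXICON = {"dog", "bus", "carry"}

# Each rule: (surface ending, text restored to the stem, suffix morpheme).
RULES = [
    ("ied", "y", "ed"),  # carried -> carry + ed (y/i spelling change)
    ("es", "", "es"),    # buses   -> bus + es
    ("ed", "", "ed"),
    ("s", "", "s"),      # dogs    -> dog + s
]


def analyze(word):
    """Split a word into (stem, suffix); the suffix is '' if no rule applies."""
    word = word.lower()
    for ending, restored, suffix in RULES:
        if word.endswith(ending):
            stem = word[: len(word) - len(ending)] + restored
            # Accept the split only if the stem is a known word.
            if stem in LEXICON:
                return stem, suffix
    return word, ""


if __name__ == "__main__":
    for example in ["dogs", "Buses", "Carried", "bus"]:
        print(example, "->", analyze(example))
```

Checking candidate stems against a lexicon is what stops "bus" from being split into "bu" + "s"; it is also why this approach needs a word list to begin with.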
Morphology Is not as Easy as It May Seem to be • Examples from Woods et al. 2000 • delegate (de + leg + ate) take the legs from • caress (car + ess [feminine suffix]) female car • cashier (cashy + er) more wealthy • lacerate (lace + rate) speed of tatting • ratify (rat + ify) infest with rodents • infantry (infant + ry) childish behavior
A Turkish Example [Oflazer & Guzey 1994] • uygarlastiramayabileceklerimizdenmissinizcesine • uygar/civilized las/BECOME tir/CAUS ama/NEG yabil/POT ecek/FUT ler/3PL imiz/POSS-1PL den/ABL mis/NARR siniz/2PL cesine/AS-IF • an adverb meaning roughly “(behaving) as if you were one of those whom we might not be able to civilize.”
Why not just Use a Dictionary? • How many words are there in a language? • English: OED 400K entries • Turkish: 600 × 10^6 forms • Finnish: 10^7 forms • New words are being invented all the time • e-mail • IM
Syntax • Words convey meaning. But when they are put together they convey more. • Syntax is the grammatical structure of the sentence. Just like the syntax in programming languages. • structures and patterns in phrases • how phrases are formed by smaller phrases and words • Identifying the structure is the first step towards understanding the meaning of the sentence. • Syntactic Analysis (Parsing) = Process of assigning a parse tree to a sentence. • Constituents, Grammatical relations, subcategorization and dependencies.
Is that all? • Grammar of a language is very complex. • No one can write down the set of all rules that governs the sentence construction. • Naturally the solution is Machine Learning. • Where do they learn from? – Tree banks. E.g. Penn Treebank – Manually annotated trees for sentences (over 2 mil words) from a large Wall Street Journal corpus.
Semantics • What do you mean..? • Words – Lexical Semantics • Sentences – Compositional Semantics • Converting the syntactic structures to semantic format – meaning representation. • Semantics: the meaning of a word or phrase within a sentence • How to represent meaning? • Semantic network? Logic? Policy? • How to construct meaning representation? • Is meaning compositional?
Pragmatics • Pragmatics: structures and patterns in discourses • Sentence standing alone may not mean so much. It may be ambiguous. • What information is contained in the contextual sentences that is not conveyed in the actual sentence? • Discourse / Context makes utterances more complicated. • Implicatures: • How many times do you go skating each week? • Speech acts: • Do you know the time? • Anaphora – Resolving the pronoun’s reference. Co-reference resolution • “I read the book by Dr. Kalam. It was great” • “We gave the monkeys the bananas because they were hungry” • “We gave the monkeys the bananas because they were over-ripe” • Jane races Mary on weekends. She often beats her. • Ellipsis – Incomplete sentences • “What’s your name?” • “Srini, and yours?” • The second sentence is not complete, but what it means can be inferred from the first one.
Why is Natural Language Understanding difficult? • The hidden structure of language is highly ambiguous • Structures for: Fed raises interest rates 0.5% in effort to control inflation (NYT headline 5/17/00)
Challenges in NLP: Ambiguity • Words or phrases can often be understood in multiple ways. • Teacher Strikes Idle Kids • Killer Sentenced to Die for Second Time in 10 Years • They denied the petition for his release that was signed by over 10,000 people. • child abuse expert/child computer expert • Who does Mary love? (three-way ambiguous)
Probabilistic/Statistical Resolution of Ambiguities • When there are ambiguities, choose the interpretation with the highest probability. • Example: how many times do people say • “Mary loves …” • “the Mary love” • Which interpretation has the highest probability?
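Choosing the interpretation with the highest probability can be sketched as a relative-frequency estimate followed by an argmax. The counts in `toy_counts` are made up purely for illustration (real counts would come from a corpus), and `most_probable` is a hypothetical helper name.

```python
def most_probable(counts):
    """Turn raw counts into relative frequencies and return the best reading."""
    total = sum(counts.values())
    probabilities = {reading: n / total for reading, n in counts.items()}
    best = max(probabilities, key=probabilities.get)
    return best, probabilities[best]


# Made-up counts for illustration only; not measured from any corpus.
toy_counts = {"Mary loves ...": 9, "the Mary love": 1}

if __name__ == "__main__":
    reading, probability = most_probable(toy_counts)
    print(reading, probability)
```

With these toy numbers the verb reading wins, but the method is only as good as the counts behind it.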
Challenges in NLP: Variations • Syntactic Variations • I was surprised that Kim lost • It surprised me that Kim lost • That Kim lost surprised me. • The same meaning can be expressed in different ways • Who wrote “The Language Instinct”? • Steven Pinker, an MIT professor and author of “The Language Instinct”, ……
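Parsing • Analyze the structure of a sentence • [Figure: parse tree for “The student put the book on the table”, with constituents S, NP, VP and PP over the tags The/D student/N put/V the/D book/N on/P the/D table/N]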
[Figure: two parse trees for “Teacher strikes idle kids”, each built from S, NP, VP and NP] • Teacher/N strikes/N idle/V kids/N • Teacher/N strikes/V idle/A kids/N
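The two readings of “Teacher strikes idle kids” can be reproduced with a small chart parser. What follows is a sketch of CKY-style parsing under a toy grammar: `BINARY_RULES`, `UNARY_RULES` and `LEXICON` are assumptions made for this one sentence, not a grammar given in the slides.

```python
from collections import defaultdict

# Toy grammar and lexicon, assumed for illustration only.
BINARY_RULES = {
    ("NP", "VP"): ["S"],
    ("V", "NP"): ["VP"],
    ("N", "N"): ["NP"],
    ("A", "N"): ["NP"],
}
UNARY_RULES = {"N": ["NP"]}  # a bare noun can form a noun phrase
LEXICON = {
    "teacher": ["N"],
    "strikes": ["N", "V"],
    "idle": ["V", "A"],
    "kids": ["N"],
}


def parse(words):
    """CKY-style chart parser returning every bracketed S tree for the words."""
    n = len(words)
    # chart[(i, j)][label] holds the bracketed trees spanning words[i:j].
    chart = defaultdict(lambda: defaultdict(list))
    for i, word in enumerate(words):
        for tag in LEXICON[word.lower()]:
            leaf = f"[{tag} {word}]"
            chart[(i, i + 1)][tag].append(leaf)
            for parent in UNARY_RULES.get(tag, []):
                chart[(i, i + 1)][parent].append(f"[{parent} {leaf}]")
    for width in range(2, n + 1):
        for start in range(n - width + 1):
            end = start + width
            for split in range(start + 1, end):
                left_cell = chart[(start, split)]
                right_cell = chart[(split, end)]
                for (left, right), parents in BINARY_RULES.items():
                    for left_tree in left_cell.get(left, []):
                        for right_tree in right_cell.get(right, []):
                            for parent in parents:
                                chart[(start, end)][parent].append(
                                    f"[{parent} {left_tree} {right_tree}]"
                                )
    return chart[(0, n)].get("S", [])


if __name__ == "__main__":
    for tree in parse("Teacher strikes idle kids".split()):
        print(tree)
```

Under this toy grammar the parser finds exactly two S trees, one per reading. A real grammar would yield many more candidates, which is where statistical disambiguation comes in.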
Enabling Technologies • Stemming • Reduce detects, detected, detecting, detect, to the same form. • POS Tagging • Determine for each word whether it is a noun, adjective, verb, ….. • Parsing • sentence parse tree • Word Sense Disambiguation • orange juice vs. orange coat • Learning from text
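The stemming bullet (reduce detects, detected, detecting, detect to the same form) can be sketched with naive suffix stripping. This is a deliberately crude illustration, not a published stemming algorithm; `SUFFIXES`, `MIN_STEM_LENGTH` and `stem` are hypothetical names.

```python
# Suffixes to strip, longest first; an assumed list covering the slide's example.
SUFFIXES = ("ing", "ed", "s")
MIN_STEM_LENGTH = 3  # assumed guard against stripping very short words


def stem(word):
    """Strip one known suffix if enough of the word remains."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: -len(suffix)]
    return word


if __name__ == "__main__":
    print({stem(w) for w in ["detects", "detected", "detecting", "detect"]})
```

Unlike a morphological analyzer, a stemmer like this uses no lexicon, so it is cheap but will happily produce non-words for inputs it was not designed for.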
Translating user needs • User need → User query → Results • For RDB, a lot of people know how to do this correctly, using SQL or a GUI tool • The answers coming out here will then be precisely what the user wanted
Translating user needs • User need → User query → Results • For meanings in text, no IR-style query gives one exactly what one wants; it only hints at it • The answers coming out may be roughly what was wanted, or can be refined • Sometimes!
Translating user needs • User need → NLP query → Results • For a deeper NLP analysis system, the system subtly translates the user’s language • If the answers coming back aren’t what was wanted, the user frequently has no idea how to fix the problem • Risky!
Aim: Practical applied NLP goals Use language technology to add value to data by: • interpretation • transformation • value filtering • augmentation (providing metadata) Two motivations: • The amount of information in textual form • Information integration needs NLP methods for coping with ambiguity and context
Multi-dimensional Meta-data Extraction Knowledge Extraction Vision
Terms and technologies • Text processing • Stuff like TextPad (Emacs, BBEdit), Perl, grep. Semantics and structure blind, but does what you tell it in a nice enough way. Still useful. • Information Retrieval (IR) • Implies that the computer will try to find documents which are relevant to a user while understanding nothing (big collections) • Intelligent Information Access (IIA) • Use of clever techniques to help users satisfy an information need (search or UI innovations)
Terms and technologies • Locating small stuff. Useful nuggets of information that a user wants: • Information Extraction (IE): Database filling • The relevant bits of text will be found, and the computer will understand enough to satisfy the user’s communicative goals • Wrapper Generation (WG) [or Wrapper Induction] • Producing filters so agents can “reverse engineer” web pages intended for humans back to the underlying structured data • Question Answering (QA) – NL querying • Thesaurus/key phrase/terminology generation
Terms and technologies • Big Stuff. Overviews of data: • Summarization • Of one document or a collection of related documents (cross-document summarization) • Categorization (documents) • Including text filtering and routing • Clustering (collections) • Text segmentation: subparts of big texts • Topic detection and tracking • Combines IE, categorization, segmentation
Terms and technologies • Digital libraries • Text (Data) Mining (TDM) • Extracting nuggets from text. Opportunistic. • Unexpected connections that one can discover between bits of human recorded knowledge. • Natural Language Understanding (NLU) • Implies an attempt to completely understand the text … • Machine translation (MT), OCR, Speech recognition, etc. • Now available wherever software is sold!
Problems and approaches • Some places where I see less value • Some places where I see more value • Example tasks: “find all web pages containing the word Liebermann”; “read the last 3 months of the NY Times and provide a summary of the campaign so far”
Natural Language Interfaces to Databases • This was going to be the big application of NLP in the 1980s • > How many service calls did we receive from Europe last month? • I am listing the total service calls from Europe for November 2001. • The total for November 2001 was 1756. • It has been recently integrated into MS SQL Server (English Query) • Problems: need largely hand-built custom semantic support (improved wizards in new version!) • GUIs more tangible and effective?
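The "hand-built custom semantic support" problem can be made concrete with a toy interface that maps one question template to one SQL query. Everything here is an assumption for illustration: the `service_calls` table, its rows, the `PATTERN` template and the `answer` function are hypothetical and unrelated to MS English Query. The sketch also ignores the "last month" part of the question.

```python
import re
import sqlite3

# Hand-built semantic support: one question template mapped to one SQL query.
PATTERN = re.compile(
    r"how many service calls did we receive from (\w+)", re.IGNORECASE
)
SQL = "SELECT COUNT(*) FROM service_calls WHERE region = ?"


def answer(question, connection):
    """Answer the question if it fits the template; otherwise return None."""
    match = PATTERN.search(question)
    if match is None:
        return None  # no hand-built rule covers this question
    region = match.group(1).capitalize()
    (count,) = connection.execute(SQL, (region,)).fetchone()
    return count


def demo_connection():
    """Build an in-memory database with made-up rows for the demo."""
    connection = sqlite3.connect(":memory:")
    connection.execute(
        "CREATE TABLE service_calls (id INTEGER PRIMARY KEY, region TEXT)"
    )
    connection.executemany(
        "INSERT INTO service_calls (region) VALUES (?)",
        [("Europe",), ("Europe",), ("Asia",)],
    )
    return connection


if __name__ == "__main__":
    db = demo_connection()
    print(answer("How many service calls did we receive from Europe last month?", db))
    print(answer("Which customers complained most?", db))
```

Every new kind of question needs another hand-written template and query, which is the scaling problem the slide points to.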
NLP for IR/web search? • It’s a no-brainer that NLP should be useful and used for web search (and IR in general): • Search for ‘Jaguar’ • the computer should know or ask whether you’re interested in big cats [scarce on the web], cars, or, perhaps a molecule geometry and solvation energy package, or a package for fast network I/O in Java • Search for ‘Michael Jordan’ • The basketballer or the machine learning guy? • Search for laptop, don’t find notebook • Google doesn’t even stem: • Search for probabilistic model, and you don’t even match pages with probabilistic models.
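The laptop/notebook and model/models mismatches can be sketched with a tiny inverted index that normalizes terms and expands queries with synonyms. This is a minimal sketch, not how any real search engine works: the `SYNONYMS` table, the crude plural-stripping `normalize` and the sample documents are all assumptions.

```python
from collections import defaultdict

# Hypothetical synonym table; a real system would need a curated or learned one.
SYNONYMS = {"laptop": {"notebook"}, "notebook": {"laptop"}}


def normalize(token):
    """Lowercase and strip a plural 's' (a crude stand-in for stemming)."""
    token = token.lower()
    if token.endswith("s") and len(token) > 3:
        return token[:-1]
    return token


def build_index(documents):
    """Map each normalized term to the ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in text.split():
            index[normalize(token)].add(doc_id)
    return index


def search(query, index):
    """Return ids of documents matching every query term or one of its synonyms."""
    result = None
    for token in query.split():
        term = normalize(token)
        variants = {term} | {normalize(s) for s in SYNONYMS.get(term, set())}
        matches = set().union(*(index.get(v, set()) for v in variants))
        result = matches if result is None else result & matches
    return result or set()


if __name__ == "__main__":
    docs = {
        1: "probabilistic models for parsing",
        2: "cheap notebook deals",
        3: "probabilistic model",
    }
    index = build_index(docs)
    print(search("probabilistic model", index))
    print(search("laptop", index))
```

Normalizing both the index and the query with the same function is what lets "model" match "models"; the synonym table does the same job for laptop and notebook, at the cost of having to build and maintain it.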