Natural Language Processing
Michel Bruley, June 2013
Natural Language Processing (NLP)
• NLP is the branch of computer science focused on developing systems that allow computers to communicate with people using everyday language
• NLP is considered a subfield of artificial intelligence and overlaps significantly with computational linguistics. It is concerned with the interactions between computers and human (natural) languages
• Natural language generation systems convert information from computer databases into readable human language
• Natural language understanding systems convert human language into representations that are easier for computer programs to manipulate
• NLP encompasses both text and speech, but work on speech processing has evolved into a separate field
Where does it fit in the CS* taxonomy?
• Computers: Databases, Artificial Intelligence, Algorithms, Networking
• Artificial Intelligence: Search, Robotics, Natural Language Processing
• Natural Language Processing: Information Retrieval, Machine Translation, Language Analysis
• Language Analysis: Semantics, Parsing
* CS = Computer Science
Why Natural Language Processing?
Applications that process large amounts of text require NLP expertise:
• Classifying text into categories, and indexing and searching large texts: classifying documents by topic, language, or author; spam filtering; information retrieval (relevant, not relevant); sentiment classification (positive, negative)
• Extracting data from text: converting unstructured text into structured data
• Information extraction: discovering the names of people, and the events they participate in, from a document, …
• Automatic summarization: condensing 1 book into 1 page, …
• Speech processing, artificial voice: getting flight information or booking a hotel over the phone, …
• Question answering: finding answers to natural language questions in a text collection or database
• Spelling & grammar correction
• Plagiarism detection
• Automatic translation
• Etc.
The problem
• When people see text, they understand its meaning (by and large)
  According to research, it deosn’t mttaer in what oredr the ltteers in a wrod are, the olny iprmoetnt tihng is that the frist and lsat ltteer are in the rghit pclae. The rset can be a toatl mses and you can sitll raed it wouthit a porbelm. Tihs is bcuseae we do not raed ervey lteter by islelf but the wrod as a wlohe.
• When computers see text, they get only character strings (and perhaps HTML tags)
• We'd like computer agents to see meanings and be able to process text intelligently
• These desires have led to many proposals for structured, semantically marked-up formats
• But human beings often still resolutely make use of text in human languages
• This problem isn’t likely to just go away
Example: Natural language understanding
Raw speech signal
• Speech recognition → sequence of words spoken
• Syntactic analysis, using knowledge of the grammar → structure of the sentence
• Semantic analysis, using information about the meaning of words → partial representation of the meaning of the sentence
• Pragmatic analysis, using information about context → final representation of the meaning of the sentence
Source: Natural language understanding process (Prof. Carolina Ruiz)
Example detail: Syntactic Analysis
• Syntactic analysis organizes the words and phrases of a sentence into a hierarchical structure, which makes it possible to study the sentence's constituents.
• For example, the sentence “the big cat is drinking milk” can be broken up into constituents such as the noun phrase “the big cat” and the verb phrase “is drinking milk”.
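The hierarchical breakdown described above can be sketched with a toy parser. This is a minimal illustration, not a real NLP grammar: the rule set `GRAMMAR`, the word list `LEXICON`, and the function names are all assumptions invented for this one sentence.

```python
# Toy grammar and lexicon: assumptions made for this illustration only.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["Det", "Adj", "N"], ["N"]],
    "VP": [["Aux", "V", "NP"]],
}
LEXICON = {
    "the": "Det", "big": "Adj", "cat": "N", "milk": "N",
    "is": "Aux", "drinking": "V",
}


def parse(symbol, words, start):
    """Yield (tree, end) for every way `symbol` can cover words[start:end]."""
    if symbol not in GRAMMAR:
        # Part-of-speech tag: it must match exactly one word.
        if start < len(words) and LEXICON.get(words[start]) == symbol:
            yield (symbol, words[start]), start + 1
        return
    for rule in GRAMMAR[symbol]:
        for children, end in match_sequence(rule, words, start):
            yield (symbol, *children), end


def match_sequence(symbols, words, start):
    """Yield (subtrees, end) for a sequence of symbols matched left to right."""
    if not symbols:
        yield [], start
        return
    for tree, mid in parse(symbols[0], words, start):
        for rest, end in match_sequence(symbols[1:], words, mid):
            yield [tree] + rest, end


def parse_sentence(sentence):
    """Return the first tree that covers the whole sentence, or None."""
    words = sentence.lower().split()
    for tree, end in parse("S", words, 0):
        if end == len(words):
            return tree
    return None


def bracketed(tree):
    """Render a tree in the usual bracketed notation."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return f"({label} {children[0]})"
    return f"({label} " + " ".join(bracketed(c) for c in children) + ")"


tree = parse_sentence("the big cat is drinking milk")
print(bracketed(tree))
# (S (NP (Det the) (Adj big) (N cat)) (VP (Aux is) (V drinking) (NP (N milk))))
```

Each nested pair of brackets is one constituent. A sentence the toy grammar cannot cover, such as "cat the", makes `parse_sentence` return `None`.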
Why NLP is difficult
• Language is flexible
  • New words, new meanings
  • Different meanings in different contexts
• Language is subtle
  • He arrived at the lecture
  • He chuckled at the lecture
  • He chuckled his way through the lecture
  • **He arrived his way through the lecture
• Language is complex!
Why NLP is difficult
• MANY hidden variables
  • Knowledge about the world
  • Knowledge about the context
  • Knowledge about human communication techniques
  • Example: “Can you tell me the time?”
• Problem of scale
  • Many (infinite?) possible words, meanings, contexts
• Problem of sparsity
  • Statistical analysis is very difficult because most things (words, concepts) have never been seen before
• Long-range correlations
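The sparsity problem can be made concrete with a small word count. This is only a sketch: the "training" text is two sentences borrowed from the definition slide, the new sentence comes from the previous slide, and `tokenize` is a deliberately naive helper written for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Naive tokenizer: lowercase runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


# Tiny sample texts chosen for illustration only.
TRAINING_TEXT = (
    "Natural language generation systems convert information from computer "
    "databases into readable human language. Natural language understanding "
    "systems convert human language into representations that are easier "
    "for computer programs to manipulate."
)
NEW_TEXT = "He chuckled his way through the lecture"

counts = Counter(tokenize(TRAINING_TEXT))
singletons = [word for word, n in counts.items() if n == 1]
unseen = [word for word in tokenize(NEW_TEXT) if word not in counts]

print(f"{len(singletons)} of {len(counts)} distinct words occur only once")
print(f"words in the new sentence never seen before: {unseen}")
```

Even on this tiny sample, many distinct words appear only once, and every word of the new sentence is absent from the counts, so frequency statistics alone say nothing about it.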
Why NLP is difficult
• Key problems:
  • Representation of meaning
  • Language presupposes knowledge about the world
  • Language only reflects the surface of meaning
  • Language presupposes communication between people
Patented Natural Language Processing (NLP) “Reads” Every Communication
• Each data feed is parsed by one or more of the 7 NLP engines
• It is then deconstructed to provide context, subject, and other information about the customer (gender, name, etc.)
• Finally, each identified customer is matched back to the Discovery platform data to gain a full view
Natural language processing (NLP) is the study of the interactions between computers and natural languages (e.g., English, Polish). The crucial challenge that NLP addresses is deriving meaning from human (natural) language input and allowing consumers to analyze the parsed meanings in large volumes.
For Example…
I bought an iPad2 for my mom last week. She loves the weight, but doesn’t like the color. She wishes it came in blue. She says if it came in blue, then she’d buy one for all her friends.
• Entities (brands, people, locations, times, products…)
• Events and relationships (purchasing event, my mom…)
• Sentiment (product specifications)
• Suggestions (feature specifications)
• Intent (to purchase, to leave)
• Geo/Temporal
QUESTION: Why is this a big deal? NLP takes a simple English statement, parses it into the categories above (and more), and VOILA… we have STRUCTURED DATA
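To show what "structured data" can mean here, the sketch below turns the example review into a flat record. It is not the vendor's method: the hand-written regular expressions in `PATTERNS` and the field names are hypothetical, and they only work for this one text (retyped with straight apostrophes). Real NLP engines use linguistic analysis rather than fixed patterns.

```python
import re

TEXT = (
    "I bought an iPad2 for my mom last week. She loves the weight, "
    "but doesn't like the color. She wishes it came in blue. She says "
    "if it came in blue, then she'd buy one for all her friends."
)

# Hypothetical patterns, one per output field (illustration only).
PATTERNS = {
    "product": r"\b(iPad\w*)",
    "purchased_item": r"\bbought an? (\w+)",
    "time": r"\b(last week)\b",
    "positive_sentiment": r"\bloves the (\w+)",
    "negative_sentiment": r"\bdoesn't like the (\w+)",
    "suggestion": r"\bwishes it came in (\w+)",
    "purchase_condition": r"\bif (.+?), then she'd buy\b",
}


def extract(text):
    """Return one structured record; fields with no match are None."""
    record = {}
    for field, pattern in PATTERNS.items():
        match = re.search(pattern, text)
        record[field] = match.group(1) if match else None
    return record


record = extract(TEXT)
for field, value in record.items():
    print(f"{field:20} {value}")
```

The resulting dictionary has one value per category (for instance `positive_sentiment` is `weight` and `suggestion` is `blue`), which is the kind of row that can be loaded into a database table.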
Architecture (diagram; labels only)
• Sources: Attensity Pipeline (real-time annotated social media data feed, 150+ million social and online sources); other unstructured data (emails, surveys, CRM notes, …); customers / sales / other data (ETL)
• ASTER DISCOVERY PLATFORM: Pipeline Connector; ASAS Wrapper; SQL MR NLP; SQL MR Churn Score; Aster “now-structured” data
• Front end: Predictive; Visualization (e.g., Tableau, MSTR)
Aster + Attensity = Competitive Advantage
Attensity’s Semantic Annotation Server (ASAS) capabilities
• Entity Extraction: automatic detection and extraction of more than 35 entities such as Name, Place
• Creates a unique identifier for each entity for cross-reference
• This integration provides types, subtypes, and supertypes (“Savings”, “Checking”, “Investment”)
• Uses Attensity Triples to create context on entities and identify verbs, relationships, and actions
• Anaphora: connects a pronoun (“He”, “Him”) to its subject (George Harrison) without repeating the full name
• Auto Classification: uses custom classification rules to classify articles by content, sort them by relevance, and discover repeated information
• Exhaustive Extraction: applies linguistic principles to extract context, entities, and relationships, similar to how the human mind would
• Voice Tags: identify types of statements and auto-classify them (Question, Intent, Conditional)
• Includes other languages besides English
Structuring Unstructured Data: Process Flow
“The flight was delayed and flight attendant would not give us any new information.”
How Triples are Extracted & Structured
• Database record from a customer survey (original structured data). Columns: date, region, rec?, source. Values: 10-02-06, telephone, 0006, 4
• Notes field, “Why would you recommend/not recommend?”: The flight was delayed and flight attendant would not give us any new information.
• Extract: extract relational facts and triples from the Notes field. Fact/triple: flight : delay (Who/What: flight; Behavior: delay)
• Then fuse: populate a new table (Customer Reactions) with the attribute values and fuse it with the structured data
• Result: the same record with relational facts extracted from the Notes field, combining the original structured data with the newly structured data provided by Attensity
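The extract-then-fuse flow can be sketched as follows. This is a minimal illustration under stated assumptions, not Attensity's extraction logic: the `SUBJECTS` and `BEHAVIORS` word lists, the `Triple` class, and the function names are hypothetical, and only the `date` and `source` fields of the survey record are used.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    who_what: str
    behavior: str


# Hypothetical word lists (illustration only).
SUBJECTS = {"flight", "luggage"}
BEHAVIORS = {"delayed": "delay", "cancelled": "cancel"}


def extract_triples(notes):
    """Pair each behavior word with the most recently seen subject."""
    triples = []
    subject = None
    for word in re.findall(r"[a-z]+", notes.lower()):
        if word in SUBJECTS:
            subject = word
        elif word in BEHAVIORS and subject is not None:
            triples.append(Triple(subject, BEHAVIORS[word]))
    return triples


def fuse(record, triples):
    """Build rows for a 'customer reactions' table: structured fields plus triple."""
    structured = {key: value for key, value in record.items() if key != "notes"}
    return [
        {**structured, "who_what": t.who_what, "behavior": t.behavior}
        for t in triples
    ]


survey_record = {
    "date": "10-02-06",
    "source": "telephone",
    "notes": "The flight was delayed and flight attendant would not "
             "give us any new information.",
}

triples = extract_triples(survey_record["notes"])
rows = fuse(survey_record, triples)
print(rows)
```

Each output row carries the original structured fields next to the extracted `who_what` and `behavior` values, so it can be queried alongside the rest of the customer data.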