Intro to Practical Natural Language Processing Wharton Data Camp Session 8 Agenda • Tasks in NLP • Use NLTK
Quick Overview of Resources For Machine Learning, NLP, and Econometrics: Wharton Specific and Books
Machine Learning/ NLP classes • CIS 520 Machine Learning • CIS 530 NLP • CIS 630 Machine Learning for NLP • There are more classes for theory in STAT/CIS
Awesome Machine Learning Books! • The Elements of Statistical Learning • Hastie, Tibshirani, and Friedman • ML bible 1 • Pattern Recognition and Machine Learning • Chris Bishop • ML bible 2 IF YOU WANT DEEP UNDERSTANDING OF THE MATERIALS • Statistical Learning Theory • Theory of ML Bible 1 • A Probabilistic Theory of Pattern Recognition • Theory of ML Bible 2
If you are going to do any sort of empirical work • STAT 500 – If you have never taken a course in Econometrics • STAT 520 – basic econometrics • STAT 521 – Use of R for applied econometrics (this course went through a major change) • STAT 541 – Andreas Buja on multivariate stat and writing (There exists one and only one required textbook in this course and that’s a writing book) • STAT 542 – Shane Jensen, Bayesian stat (Jensen is the man) • STAT 921 – Dylan Small, observational studies (Required if you are doing any empirical work) • Econ 705-706 for theory
Subjective Econometric Books Recs • William H. Greene is great • “Mostly Harmless Econometrics” is great • Edward Frees’ “Longitudinal and Panel Data: Analysis and Applications in the Social Sciences” IS one of my favorite econometric books • Many more, depending on usage; ask me separately
ML in Business & Combining the two • Data Science for Business: What you need to know about data mining and data-analytic thinking (For Intro & overview) http://data-science-for-biz.com/ • Foster Provost: Great researcher in IS at NYU • 72 Reviews on Amazon - 4.7 average! • Targeted Learning – Springer Series in Statistics (AKA Serious Series) • Incorporate Machine Learning into Causal Inference • UCLA Statisticians
Good Quick Cook-book style NLP books • http://www.nltk.org/ • http://nltk.org/book/ FREE BOOK online • Jurafsky & Martin “Speech and Language Processing” for deep theory • Bing Liu’s two books: http://www.cs.uic.edu/~liub/
There are many tasks that NLP can do and many are hard • Machine translation – Very hard • http://translationparty.com/ Funny • Hilarious Video (Fresh Prince of Bel-Air theme after it was translated several times into different languages) • http://www.youtube.com/watch?v=LMkJuDVJdTw • Sentiment detection • Automatic summarization • Etc
Today • Supervised Learning + NLP • Identifying certain content (this is what we will probably use the most). Content-coding. • A Research Example • Sentiment Analysis Example
Supervised Learning + NLP Given: • a set of texts (corpus), • and labels (together comprising the training set) • A label can be • Certain content exists • Negative/Positive sentiment • etc Goal: • create an algorithm that mimics the labels
Imagine a task • You are an NSA agent OR You are a hacker • You are given a job OR You are on a mission and are looking for fellow hackers • Train an NLP algorithm to tell whether a sentence or short text on the internet contains plans for a hacking or DDoS attack • “Greetings, fellow anons, we have a new target in our movement against RIAA [...] WE WILL NOT TOLERATE ANY LONGER!”
What do we, humans, do in realizing the existence of the content? • “Greetings, fellow anons, we have a new target in our movement against RIAA [...] WE WILL NOT TOLERATE ANY LONGER!” • Key words: target, movement, anons, RIAA, not, tolerate. • bigrams: new target, our movement, against RIAA, not tolerate • Use of upper case and “!” • ETC
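The cues listed on this slide can be turned into numbers with a few lines of Python. This is only a rough sketch: the function name `surface_features` and the feature names are made up for illustration, and the keyword list is simply the one from the slide.

```python
import re

# Keywords copied from the slide; any real project would choose its own list.
KEYWORDS = {"target", "movement", "anons", "riaa", "not", "tolerate"}

def surface_features(text):
    """Turn the cues a human reader notices into numeric attributes."""
    tokens = re.findall(r"[A-Za-z']+", text)
    lowered = [t.lower() for t in tokens]
    bigrams = set(zip(lowered, lowered[1:]))  # adjacent word pairs
    return {
        "keyword_hits": sum(1 for t in lowered if t in KEYWORDS),
        "has_new_target": ("new", "target") in bigrams,
        "has_not_tolerate": ("not", "tolerate") in bigrams,
        # Count fully upper-case words longer than one letter.
        "upper_tokens": sum(1 for t in tokens if len(t) > 1 and t.isupper()),
        "exclamations": text.count("!"),
    }

msg = ("Greetings, fellow anons, we have a new target in our movement "
       "against RIAA [...] WE WILL NOT TOLERATE ANY LONGER!")
features = surface_features(msg)
print(features)
```

Each key in the returned dictionary would become one x-variable for the classifier.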
Narrow and Specific NLP Example • I can only show you one very specific example of NLP today • You need to take at least a machine learning course and an NLP course to be able to do this type of processing comfortably – 2 courses will probably suffice for applying ML + NLP for your research
Overview of 1 Example in NLP: Identifying certain content in text, e.g., positive/negative sentiment • 1. Find text data (short text or a sentence – a review, for example) • 2. Break the sentence down into basic building blocks using NLP techniques I’ll show – the outcome is an ordered list of building blocks • 3. Process the ordered list of building blocks and come up with many sentence-level patterns – these will be the x-variables or sentence-level attributes (e.g., for content = “positive review or not”) • Count the number of times the word “great” occurs: X1 • Count the number of laudatory words: X2 • Etc. Recording certain patterns: Xn • 4. Obtain text data with labels (positive or not): this is called the gold set and comes with y-var {positive, negative} tags • 5. Use machine learning techniques on the gold set to learn the relationship between the X-vars from step 3 and the Y-var from step 4. This part is training the machine learning algorithm.
Basic idea in NLP: identifying certain content in text • Find text data • Breaking Sentence: break the sentence down into understandable building blocks (e.g., words or lemmas) • Sentence Attribute Generation: identify different sentence-attributes just as humans do when reading (many to be explained) • Gold Set Generation: obtain, from a reliable source, a set of training sentences with labels identifying whether each sentence does or does not have certain content (gold data set) • Training: use statistical tools to infer which sentence-attributes are correlated with certain content outcomes, thereby “learning” to identify content in sentences.
NLP uses machine learning • Machine Learning (Classification) • Supervised Learning – given training data with x-vars & y-vars, infer the function “f” in y=f(x). Curve fitting is a basic example of supervised learning. You need labeled training data, i.e., X-Y pairs. • Unsupervised Learning – the problem of finding hidden structure in unlabeled data (just x-vars). E.g., clustering. • NLP uses both; in our context it’s supervised learning
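Since the slide names curve fitting as the basic case of supervised learning, here is a minimal sketch of it: ordinary least squares for a straight line, fitted on made-up X-Y pairs. The data and the names `fit_line` and `predict` are illustrative assumptions, not part of the course material.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept from labeled pairs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Toy training data (made up): the y values follow y = 2x + 1 exactly.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
slope, intercept = fit_line(xs, ys)

def predict(x):
    """The learned function f in y = f(x)."""
    return slope * x + intercept

print(slope, intercept, predict(4.0))
```

Text classification follows the same pattern: the x-vars are sentence attributes and the y-var is a label instead of a number.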
Supervised Learning (figure taken from nltk.org)
Breaking Sentence • Stop-words removal: removing punctuation and words with low information such as the definite article “the” • Tokenizing: the process of breaking a sentence into words, phrases, and symbols or “tokens” • Stemming: the process of reducing inflected words to their root form, e.g., “playing” to “play” • Part-of-speech tagging: determining part-of-speech such as noun • etc
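A toy version of the first three steps, using only the Python standard library, might look like the sketch below. The stop-word list is a tiny illustrative sample and `crude_stem` is a deliberately simplistic suffix stripper, not the Porter stemmer; NLTK provides full versions of each step.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "for", "in", "on", "as", "it", "with"}
SUFFIXES = ("ing", "ed", "s")  # crude stand-in for a real stemming algorithm

def tokenize(sentence):
    """Split into words (keeping '3,200' and 'life-threatening' whole) and punctuation."""
    return re.findall(r"\w+(?:[-,']\w+)*|[^\w\s]", sentence)

def remove_stop_words(tokens):
    """Drop punctuation-only tokens and low-information words."""
    return [t for t in tokens
            if any(ch.isalnum() for ch in t) and t.lower() not in STOP_WORDS]

def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

sentence = "Hurricane Sandy grounded 3,200 flights scheduled for today and tomorrow."
tokens = tokenize(sentence)
content_tokens = remove_stop_words(tokens)
stems = [crude_stem(t) for t in content_tokens]
print(stems)
```

The result is the ordered list of building blocks that the attribute-generation stage consumes.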
Sentence Attribute Generation • Bag of words: collect the words • Counted bag of words: the words plus a count of their occurrences • Bigram: A bigram is formed by two adjacent words (e.g. “Bigram is”, “is formed” are bigrams). • N-gram: the same idea with n adjacent words • Specific keywords (“like”, “love”, “bad”) • Frequency count of certain parts of speech • Record the location of certain words • Count the use of !, ?, etc • SO MANY MORE! • In big projects, engineers develop algorithms to automatically generate attributes!
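As a quick illustration of the first few attributes, the sketch below builds a bag of words, a counted bag of words, and n-grams from a token list. The helper name `ngrams` is chosen here for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n adjacent tokens, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "a bigram is formed by two adjacent words".split()

bag = set(tokens)              # bag of words: which words occur
counted_bag = Counter(tokens)  # counted bag of words: how often each occurs
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(bigrams[:3])
print(counted_bag["bigram"])
```

Each distinct word or n-gram can then serve as one column of the attribute matrix.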
Gold Set Generation • Get example sentences or text data • You tag them • Or get RAs • Or use Amazon Mechanical Turk • Or there may be an existing database • Online tagged corpora • Speaking of databases for NLP: they are not used in this context, but great resources exist • Check out WordNet and FrameNet + more
Training the classifiers • You are done breaking the sentences and generating sentence attributes – these are the x variables • Y-variables are the tags you obtained • Use your favorite ML algorithm or combinations • Regular GLM • SVM • Naïve Bayes • Neural Network • Decision Tree • Conditional Random Forest • Ensemble Learning: Boosting and Bagging • ETC
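To make the training step concrete, here is a small multinomial Naïve Bayes classifier with add-one smoothing, written from scratch. It is a sketch on a made-up four-sentence gold set, with bag-of-words tokens as the x variables; the function names and the data are assumptions for illustration, and in practice a library implementation would be used.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs is a list of (tokens, label) pairs, i.e. the gold set."""
    label_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return label_counts, word_counts, vocab

def predict_nb(model, tokens):
    """Pick the label with the highest log prior plus log likelihood."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in label_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)
        for t in tokens:
            if t in vocab:  # words never seen in training are ignored
                # Add-one (Laplace) smoothing avoids log(0).
                score += math.log((word_counts[label][t] + 1)
                                  / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Made-up gold set for illustration only.
gold = [
    (["great", "phone", "love", "it"], "positive"),
    (["love", "the", "screen"], "positive"),
    (["bad", "battery", "hate", "it"], "negative"),
    (["terrible", "bad", "service"], "negative"),
]
model = train_nb(gold)
print(predict_nb(model, ["love", "great", "screen"]))
print(predict_nb(model, ["bad", "battery"]))
```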
Natural Language Processing Tasks “Hurricane Sandy grounded 3,200 flights scheduled for today and tomorrow, prompted New York to suspend subway and bus service and forced the evacuation of the New Jersey shore as it headed toward land with life-threatening wind and rain. The system, which killed as many as 65 people in the Caribbean on its path north, may be capable of inflicting as much as $18 billion in damage when it barrels into New Jersey tomorrow and knock out power to millions for a week or more, according to forecasters and risk experts.” (Bloomberg article on Sandy) Slides Taken from: Bommarito Consulting
Natural Language Processing Tasks What kind of questions can we ask? • Basic • What is the structure of the text? • Paragraphs • Sentences • Tokens/words • What are the words that appear in this text? • Nouns • Subjects • Direct objects • Verbs • Advanced • What are the concepts that appear in this text? • How does this text compare to other text?
Natural Language Processing Tasks Segmentation and Tokenization (applied to the Bloomberg passage on Hurricane Sandy quoted above) • Segment Types • Paragraphs • Sentences • Tokens
Natural Language Processing Tasks Segmentation and Tokenization But how does it work? • Paragraphs • Two consecutive line breaks • A hard line break followed by an indent • Sentences • Period, except abbreviation, ellipsis within quotation, etc. • Tokens and Words • Whitespace • Punctuation
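The rules on this slide can be approximated with regular expressions. The sketch below is a simplification under stated assumptions: the abbreviation list is a small made-up sample, ellipses and quotations are not handled, and the function names are illustrative.

```python
import re

# Small illustrative abbreviation list (an assumption, far from complete).
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "e.g.", "i.e.", "etc."}

def split_paragraphs(text):
    """Split on two consecutive line breaks, or a line break followed by an indent."""
    parts = re.split(r"\n\s*\n|\n(?=[ \t])", text)
    return [p.strip() for p in parts if p.strip()]

def split_sentences(paragraph):
    """End a sentence at '.', '!' or '?' unless the word is a known abbreviation."""
    sentences, current = [], []
    for chunk in paragraph.split():
        current.append(chunk)
        if chunk.endswith((".", "!", "?")) and chunk.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

def split_tokens(sentence):
    """Split on whitespace and separate punctuation from words."""
    return re.findall(r"\w+(?:[-,']\w+)*|[^\w\s]", sentence)

text = ("Dr. Smith arrived. He left early!\n\n"
        "A second paragraph here.\n  An indented third one.")
paragraphs = split_paragraphs(text)
sentences = split_sentences(paragraphs[0])
tokens = split_tokens(sentences[1])
print(len(paragraphs), sentences, tokens)
```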
Natural Language Processing Tasks Segmentation and Tokenization (applied to the Bloomberg passage on Hurricane Sandy quoted above) • Paragraphs: 2 • Sentences: 2 • Words: 561 • ['Hurricane', 'Sandy', 'grounded', '3,200', 'flights', 'scheduled', 'for', 'today', 'and', 'tomorrow', …]
Natural Language Processing Tasks What kind of questions can we ask? We now have an ordered list of tokens. ['Hurricane', 'Sandy', 'grounded', '3,200', 'flights', 'scheduled', 'for', 'today', 'and', 'tomorrow', …]
Natural Language Processing Tasks Stop Words Removal • Before: the Bloomberg passage on Hurricane Sandy quoted above • After: Hurricane Sandy grounded 3,200 flights scheduled today tomorrow, prompted New York suspend subway bus service forced evacuation New Jersey shore headed toward land life-threatening wind rain. System, killed many 65 people Caribbean path north, may capable inflicting much $18 billion damage barrels New Jersey tomorrow knock power millions week, according forecasters risk experts.
Natural Language Processing Tasks Stop Words Removal + Stemming • Before: the Bloomberg passage on Hurricane Sandy quoted above • After: Hurrican Sandi ground 3,200 flight schedul today tomorrow, prompt New York suspend subway bu servic forc evacu New Jersey shore head toward land life-threaten wind rain. System, kill mani 65 peopl Caribbean path north, may capabl inflict much $18 billion damag barrel New Jersey tomorrow knock power million week, accord forecast risk expert.
Natural Language Processing Part of Speech Tagging (applied to the Bloomberg passage on Hurricane Sandy quoted above) [('Hurricane', 'NNP'), ('Sandy', 'NNP'), ('grounded', 'VBD'), ('3,200', 'CD'), ('flights', 'NNS'), ('scheduled', 'VBN'), ('for', 'IN'), ('today', 'NN'), ('and', 'CC'), ('tomorrow', 'NN'), …] • NNP: Proper noun, singular • NNS: Noun, plural • VBD: Verb, past tense • VBN: Verb, past participle • CD: Cardinal number • IN: Preposition/subordinating conjunction • etc. For more: http://www.mozart-oz.org/mogul/doc/lager/brill-tagger/penn.html
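Once tokens carry tags, a "frequency count of certain part of speech" attribute is a short counting exercise. The sketch below hard-codes the tagged tokens shown on the slide rather than calling a tagger, so it runs without any NLP library; the variable names are illustrative.

```python
from collections import Counter

# Tagged tokens copied from the slide (first ten tokens only).
tagged = [
    ("Hurricane", "NNP"), ("Sandy", "NNP"), ("grounded", "VBD"),
    ("3,200", "CD"), ("flights", "NNS"), ("scheduled", "VBN"),
    ("for", "IN"), ("today", "NN"), ("and", "CC"), ("tomorrow", "NN"),
]

tag_counts = Counter(tag for _, tag in tagged)

# Penn Treebank noun tags start with "NN", verb tags with "VB".
noun_count = sum(c for tag, c in tag_counts.items() if tag.startswith("NN"))
verb_count = sum(c for tag, c in tag_counts.items() if tag.startswith("VB"))
proper_nouns = [word for word, tag in tagged if tag == "NNP"]

print(tag_counts)
print(noun_count, verb_count, proper_nouns)
```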
Let’s go deeper into each stage. Second: Sentence Attribute Generation
Remember one thing • When you read sentences yourself, what do you notice about what you notice? • Make those into attributes! • The goal is to mimic what we humans do
Resources for Gold Set Generation • Yourself • RA: pretty expensive • AMT: Amazon Mechanical Turk • Obtain multiple tags per item and check inter-rater agreement so that the labels are robust
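The slide does not say which agreement statistic to use; one common choice for two raters is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below computes it on made-up tags from two hypothetical raters.

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa: (observed - expected agreement) / (1 - expected agreement)."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(counts_a[l] * counts_b[l] for l in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Made-up tags for eight sentences from two raters.
rater_a = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg"]
rater_b = ["pos", "pos", "neg", "neg", "pos", "neg", "neg", "pos"]
kappa = cohen_kappa(rater_a, rater_b)
print(kappa)
```

A kappa near 1 indicates strong agreement, while a value near 0 means the raters agree about as often as chance would predict.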
Research Question What content attributes of social media messages elicit greater consumer response & engagement? E.g., What’s the comparative effect of informative advertising (product, price information, etc.) vs. persuasive advertising (emotion, humor, etc.) on engagement? Differences across industries? Introduction & Motivation
Sample Messages from Walmart (Dec 2012 - https://www.facebook.com/walmart) • Score an iPad 3 for an iPad2 price! Now at your local store, $50 off the iPad 3. Plus, get a $30 iTunes Gift Card. Offer good through 12/31 or while supplies last. (Product Advertisement + Deal + Product Location + Product Stock Availability) • Rollback with Vizio. Select models have lower prices ranging from $228 for a 32" (diagonal screen size 31.5") LCD TV to $868 on a 55” (diagonal screen size 54.6") LED TV. http://walmarturl.com/10oZ6yS (Product Advertisement + Price info + Brand Mention + Link) • Maria’s mission is helping veterans and their families find employment. Like this and watch Maria’s story. http://walmarturl.com/VzWFlh (Philanthropic Message + Explicit Like solicitation + Link) Data
Data • Post-level panel data on messages posted by many companies from Sep 2011 to July 2012 • Message content • Impressions, likes and comments on a daily basis • Page-level panel data on each page • Page statistics on a daily basis (e.g., Fan number, Industry type) • Aggregate demographics of fans and post viewers (impressions demographics) • After Cleaning: 106,316 unique messages posted by 782 companies • Daily Likes & Comments: 1.3 million rows of post-level snapshots recording about 450 million page fans’ responses.
Variables • Engagement Metric (Dependent Variable) • Variables that affect engagement (Independent Variables) Empirical Strategy
Message Content Tagging • Worker Eligibility Criteria • Must have > 97% accuracy • Must have > 100 previously approved tasks • Location: US only • Criteria for using the input • Question for detecting if the worker is paying attention • Completion duration > 30 seconds (avg took 3 min) • Plus, 5+ more protocols • At least 9 different workers per message + Majority vote • Used to train natural language processing algorithm to tag remaining posts • 7 Statistical classifiers + rule-based method combined by ensemble learning • Greater than 99% accuracy, precision, and recall for most variables (10-fold CV) Data
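Two pieces of this protocol are easy to sketch in code: combining several workers' tags by majority vote, and scoring a classifier's predictions against gold labels with accuracy, precision, and recall. The votes, labels, and function names below are made up for illustration and are not the study's data or code.

```python
from collections import Counter

def majority_vote(votes):
    """Return the most frequent tag among the workers' votes."""
    return Counter(votes).most_common(1)[0][0]

def evaluate(gold, predicted, positive=1):
    """Accuracy, precision, and recall for one binary content variable."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    accuracy = sum(g == p for g, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# Nine hypothetical workers tagging one message (1 = content present).
votes = [1, 1, 0, 1, 1, 0, 1, 1, 0]
label = majority_vote(votes)

# Made-up gold labels and classifier predictions for six messages.
gold = [1, 0, 1, 1, 0, 0]
predicted = [1, 0, 0, 1, 1, 0]
accuracy, precision, recall = evaluate(gold, predicted)
print(label, accuracy, precision, recall)
```

With an odd number of workers and a binary tag, the majority vote can never tie. In 10-fold cross-validation, these metrics would be computed on each held-out fold and then averaged.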
WITH BAD NLP WITH GOOD NLP “COMPUTER, HOT EARL GREY TEA” “COMPUTER, TEA, EARL GREY, HOT”
This Concludes the 2014 Wharton Tech/Data Camp. Please help me and give feedback on this course for improvement. Thank you! http://wharton.qualtrics.com/SE/?SID=SV_agzfeKZvPQD0hUN