Using Machine Learning to Monitor Collaborative Interactions. Carolyn Penstein Rosé, Language Technologies Institute/Human-Computer Interaction Institute. VMT-Basilica (Kumar & Rosé, 2010).
Using Machine Learning to Monitor Collaborative Interactions Carolyn Penstein Rosé Language Technologies Institute/ Human-Computer Interaction Institute
Monitoring Collaboration with Machine Learning Technology
[Diagram labels: Labeled Texts, TagHelper, A Model that can Label More Texts, Unlabeled Texts, Behavior, Time, <Triggered Intervention>]
Download tools at: http://www.cs.cmu.edu/~cprose/TagHelper.html http://www.cs.cmu.edu/~cprose/SIDE.html
TagHelper Tools and SIDE
• TagHelper Tools uses text mining technology to automate annotation of conversational data
• SIDE facilitates rapid prototyping of reporting interfaces for group learning facilitators
[Diagram labels: Define Summaries, Annotate Data, Visualize Annotated Data]
http://www.cs.cmu.edu/~cprose/TagHelper.html http://www.cs.cmu.edu/~cprose/SIDE.html
Important caveat!!
• Machine learning isn’t magic
• But it can be useful for identifying meaningful patterns in your data when used properly
• Proper use requires insight into your data
Naïve Approach: When all you have is a hammer…
[Diagram labels: Data, Target Representation]
Problem: there isn’t one universally best approach!
Slightly less naïve approach: Aimless wandering…
[Diagram labels: Data, Target Representation]
Problem 1: It takes too long!
Problem 2: You might not realize all of the options that are available to you!
Expert Approach: Hypothesis driven
[Diagram labels: Data, Target Representation]
You might end up with the same solution in the end, but you’ll get there faster.
Today we’ll start to learn how!
What is machine learning?
• Automatically or semi-automatically
• Inducing concepts (i.e., rules) from data
• Finding patterns in data
• Explaining data
• Making predictions
[Diagram labels: Data, Learning Algorithm, Model, New Data, Classification Engine, Prediction]
How does machine learning work?
The simplest rule learner will learn to predict whatever is the most frequent result class. This is called the majority class. What will the rule be in this case? It will always predict yes.
A slightly more sophisticated rule learner will find the feature that gives the most information about the result class. What do you think that would be in this case?
Rule format: <Feature Name>: <value> -> <prediction>, <value> -> <prediction>, …
Outlook: Sunny -> No, Overcast -> Yes, Rainy -> Yes
What will be the prediction?
Model: Outlook: Sunny -> No, Overcast -> Yes, Rainy -> Yes
Prediction for the new data: Yes
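The two rule learners described above can be sketched in a few lines of Python. This is only an illustration, not the TagHelper or Weka implementation. The data table from the slide did not survive, so the rows below are an assumption: the familiar 14-row "weather" toy table. The function names (`zero_r`, `one_r`, `predict`) are hypothetical, and this sketch picks the single feature whose rule makes the fewest mistakes on the training rows.

```python
from collections import Counter, defaultdict

# Assumed data: the usual 14-row "weather" toy table (last column = class).
FEATURES = ["outlook", "temperature", "humidity", "windy"]
ROWS = [
    ("sunny", "hot", "high", "false", "no"),
    ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"),
    ("rainy", "mild", "high", "false", "yes"),
    ("rainy", "cool", "normal", "false", "yes"),
    ("rainy", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"),
    ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"),
    ("rainy", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"),
    ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"),
    ("rainy", "mild", "high", "true", "no"),
]


def zero_r(rows):
    """0R: always predict the most frequent class."""
    return Counter(row[-1] for row in rows).most_common(1)[0][0]


def one_r(rows, features):
    """1R: one rule per feature; keep the feature with the fewest errors."""
    best = None
    for i, name in enumerate(features):
        by_value = defaultdict(Counter)
        for row in rows:
            by_value[row[i]][row[-1]] += 1
        # For each value, predict the class seen most often with it.
        rule = {v: c.most_common(1)[0][0] for v, c in by_value.items()}
        errors = sum(sum(c.values()) - max(c.values()) for c in by_value.values())
        if best is None or errors < best[2]:
            best = (name, rule, errors)
    return best


def predict(row, feature, rule, fallback):
    """Apply a 1R rule; fall back to the majority class for unseen values."""
    return rule.get(row[FEATURES.index(feature)], fallback)


majority = zero_r(ROWS)
feature, rule, errors = one_r(ROWS, FEATURES)
new_day = ("rainy", "mild", "high", "false")  # hypothetical new data
prediction = predict(new_day, feature, rule, majority)
print(majority, feature, rule, errors, prediction)
```

On this assumed table, 0R predicts the majority class for everything, while 1R recovers the Outlook rule shown on the slide.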
More Complex Algorithm…
• Two simple algorithms last time
  • 0R: Predict the majority class
  • 1R: Use the most predictive single feature
• Today: Intro to Decision Trees
  • Today we will stay at a high level
  • We’ll investigate more details of the algorithm next time
* Only makes 2 mistakes!
What will it do with this example?
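As a rough illustration of how a tree learner might choose its first split, the sketch below scores each feature by information gain (the entropy-based measure used in ID3-style trees). Two assumptions to flag: the rows are the usual 14-row "weather" toy table, since the slide's own table was not captured, and J48 itself is Weka's C4.5 implementation, which uses a related gain-ratio criterion, not plain gain. The function names are hypothetical.

```python
from collections import Counter
from math import log2

# Assumed data: the usual 14-row "weather" toy table (last column = class).
FEATURES = ["outlook", "temperature", "humidity", "windy"]
ROWS = [
    ("sunny", "hot", "high", "false", "no"),
    ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"),
    ("rainy", "mild", "high", "false", "yes"),
    ("rainy", "cool", "normal", "false", "yes"),
    ("rainy", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"),
    ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"),
    ("rainy", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"),
    ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"),
    ("rainy", "mild", "high", "true", "no"),
]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, column):
    """Entropy of the classes minus the weighted entropy after splitting."""
    groups = {}
    for row in rows:
        groups.setdefault(row[column], []).append(row[-1])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[-1] for row in rows]) - remainder


gains = {name: information_gain(ROWS, i) for i, name in enumerate(FEATURES)}
best = max(gains, key=gains.get)
for name, gain in gains.items():
    print(f"{name}: {gain:.3f}")
print("root split:", best)
```

A tree learner would then repeat the same scoring inside each branch, which is how it can pick up contingencies between features that a single 1R rule cannot express.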
Why is it better?
• Not because it is more complex
  • Sometimes more complexity makes performance worse
• What is different in what the three rule representations assume about your data?
  • 0R
  • 1R
  • Trees
• The best algorithm for your data will give you exactly the power you need
Let’s say you know the rule you are trying to learn is a circle and you have these points. What rule would you learn?
Now let’s say you don’t know the shape. What would you learn then?
If you know the shape, you have fewer degrees of freedom, so there is less room to make a mistake.
Clarification: Concepts as Lines
[Figure omitted: points labeled R, B, S, T, C, and X]
Machine Learning Process Overview
• Get to know your data
  • What distinguishes messages from different categories?
• Represent messages in terms of features
  • Use feature table tab
• Build machine learning model
  • Use machine learning tab
• Learn from mistakes, and try again
  • Use feature analyzer tab
[Diagram labels: Coding, Features]
Algorithms you will use
• Decision Trees (J48): good with small feature sets, can find contingencies between features
• Naïve Bayes: fast, makes decisions based on probabilities
• Support Vector Machines (SMO): makes decisions based on weights, usually works well on text
How do you know when you have coded enough data?
What distinguishes Questions and Statements?
• Not all questions end in a question mark
• Not all WH words occur in questions
• “I” versus “you” is not a reliable predictor
You need to code enough to avoid learning rules that won’t work
Basic Idea
• Represent text as a vector where each position corresponds to a term
• This is called the “bag of words” approach
Term order: Cheese, Cows, Eggs, Hens, Lay, Make
Cows make cheese: 110001
Hens lay eggs: 001110
But “Cheese makes cows.” gets the same representation as “Cows make cheese” (treating “makes” as “make”)!
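The vectors on this slide can be reproduced with a short sketch. The helper names are hypothetical, and `crude_stem` is only a stand-in (it strips a trailing "s") so that "makes" and "make" land on the same term; a real stemmer would be used in practice.

```python
import re


def tokenize(text):
    """Lowercase the text and keep only alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def crude_stem(token):
    """Hypothetical stand-in for a stemmer: strip one trailing 's'."""
    return token[:-1] if token.endswith("s") and len(token) > 3 else token


def terms(text):
    """The set of stemmed terms in a text; word order is discarded here."""
    return {crude_stem(t) for t in tokenize(text)}


def bag_of_words(text, vocabulary):
    """Binary vector: 1 if the term occurs in the text, else 0."""
    present = terms(text)
    return [1 if term in present else 0 for term in vocabulary]


training_texts = ["Cows make cheese", "Hens lay eggs"]
# One vector position per term, in alphabetical order.
vocabulary = sorted(set().union(*(terms(t) for t in training_texts)))

cows = bag_of_words("Cows make cheese", vocabulary)
hens = bag_of_words("Hens lay eggs", vocabulary)
reversed_roles = bag_of_words("Cheese makes cows.", vocabulary)
print(vocabulary)
print(cows, hens, reversed_roles)
```

Because `terms` returns a set, nothing about who did what to whom survives, which is exactly the limitation the slide points out.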
What can’t you conclude from “bag of words” representations?
• Causality: “X caused Y” versus “Y caused X”
• Roles and Mood: “Which person ate the food that I prepared this morning and drives the big car in front of my cat” versus “The person, which prepared food that my cat and I ate this morning, drives in front of the big car.”
  • Who’s driving, who’s eating, and who’s preparing food?
Part of Speech Tagging
http://www.ldc.upenn.edu/Catalog/docs/treebank2/cl93.html
1. CC Coordinating conjunction
2. CD Cardinal number
3. DT Determiner
4. EX Existential there
5. FW Foreign word
6. IN Preposition/subordinating conjunction
7. JJ Adjective
8. JJR Adjective, comparative
9. JJS Adjective, superlative
10. LS List item marker
11. MD Modal
12. NN Noun, singular or mass
13. NNS Noun, plural
14. NNP Proper noun, singular
15. NNPS Proper noun, plural
16. PDT Predeterminer
17. POS Possessive ending
18. PRP Personal pronoun
19. PRP$ Possessive pronoun
20. RB Adverb
21. RBR Adverb, comparative
22. RBS Adverb, superlative
23. RP Particle
24. SYM Symbol
25. TO to
26. UH Interjection
27. VB Verb, base form
28. VBD Verb, past tense
29. VBG Verb, gerund/present participle
30. VBN Verb, past participle
31. VBP Verb, non-3rd person singular present
32. VBZ Verb, 3rd person singular present
33. WDT wh-determiner
34. WP wh-pronoun
35. WP$ Possessive wh-pronoun
36. WRB wh-adverb
Feature Space Design
• Think like a computer!
• Machine learning algorithms look for features that are good predictors, not features that are necessarily meaningful
• Look for approximations
  • If you want to find questions, you don’t need to do a complete syntactic analysis
  • Look for question marks
  • Look for wh-terms that occur immediately before an auxiliary verb
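The two question approximations above might look like this as features. It is a sketch only: the `WH_TERMS` and `AUXILIARIES` word lists are assumptions chosen for illustration, not lists taken from TagHelper, and `question_features` is a hypothetical name.

```python
import re

# Assumed word lists; a real feature extractor would likely use fuller ones.
WH_TERMS = {"who", "what", "when", "where", "why", "which", "how"}
AUXILIARIES = {
    "is", "are", "was", "were", "do", "does", "did",
    "can", "could", "will", "would", "should", "has", "have", "had",
}


def question_features(text):
    """Cheap surface cues for questions; no syntactic analysis involved."""
    tokens = re.findall(r"[a-z']+", text.lower())
    wh_before_aux = any(
        first in WH_TERMS and second in AUXILIARIES
        for first, second in zip(tokens, tokens[1:])
    )
    return {"has_question_mark": "?" in text, "wh_before_aux": wh_before_aux}


examples = [
    "What is the common denominator",
    "you think the answer is 9?",
    "I know what you mean.",
]
for line in examples:
    print(line, question_features(line))
```

Neither cue is reliable on its own, which is the point of the approximation: the learner decides how much weight each one deserves.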
Feature Space Design
• Punctuation can be a “stand in” for mood
  • “you think the answer is 9?”
  • “you think the answer is 9.”
• Bigrams capture simple lexical patterns
  • “common denominator” versus “common multiple”
• POS bigrams capture syntactic or stylistic information
  • “the answer which is …” vs “which is the answer”
• Line length can be a proxy for explanation depth
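A minimal sketch of three of these feature types (punctuation, word bigrams, and line length) is shown below. The feature names and the `surface_features` function are hypothetical, and line length is counted in word tokens here, which is one choice among several. POS bigrams are left out because they need a tagger.

```python
import re


def surface_features(text):
    """Punctuation, word-bigram, and line-length features for one line."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    bigrams = {f"{a}_{b}" for a, b in zip(words, words[1:])}
    return {
        "ends_with_question_mark": text.rstrip().endswith("?"),
        "bigrams": bigrams,
        "line_length": len(words),  # measured in word tokens (an assumption)
    }


asked = surface_features("you think the answer is 9?")
stated = surface_features("you think the answer is 9.")
denominator = surface_features("the common denominator")
multiple = surface_features("the common multiple")
print(asked["ends_with_question_mark"], stated["ends_with_question_mark"])
print(sorted(denominator["bigrams"]), sorted(multiple["bigrams"]))
```

The two "9" lines differ only in the punctuation feature, and the two "common" lines differ only in one bigram, which is what makes these cheap features useful.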
Feature Space Design
• A “contains a non-stop word” feature can predict whether a conversational contribution is contentful
  • “ok sure” versus “the common denominator”
• Removing stop words removes some distracting features
• Stemming allows some generalization
  • Multiple, multiply, multiplication
• Removing rare features is a cheap form of feature selection
  • Features that only occur once or twice in the corpus won’t generalize, so they are a waste of time to include in the vector space
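Stop-word filtering and rare-feature removal can be sketched as follows. Everything named here is an assumption for illustration: the `STOP_WORDS` list (which treats "ok" and "sure" as stop words, as the slide's example implies), the tiny corpus, and the default threshold of 3, which matches "once or twice" above. Stemming is omitted because it needs a real stemmer.

```python
import re
from collections import Counter

# Assumed stop list; real lists are much longer.
STOP_WORDS = {"ok", "sure", "the", "a", "an", "is", "i", "you", "it", "to", "of"}


def tokens(text):
    """Lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def content_words(text):
    """Tokens that are not on the stop list."""
    return [t for t in tokens(text) if t not in STOP_WORDS]


def contains_non_stop_word(text):
    """True if the contribution has at least one contentful token."""
    return bool(content_words(text))


def frequent_terms(corpus, min_count=3):
    """Keep terms seen at least min_count times across the whole corpus."""
    counts = Counter(t for line in corpus for t in content_words(line))
    return {term for term, n in counts.items() if n >= min_count}


corpus = [  # hypothetical chat lines
    "ok sure",
    "the common denominator",
    "the common multiple",
    "a common denominator is 12",
    "common factor",
]
print([contains_non_stop_word(line) for line in corpus])
print(frequent_terms(corpus), frequent_terms(corpus, min_count=2))
```

On a corpus this small almost everything is rare, so the threshold would need to be tuned to the amount of coded data available.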