TagHelper Tools: Conversational Data Analysis Support

TagHelper Tools: Supporting the Analysis of Conversational Data. Carolyn P. Rosé, Language Technologies Institute and Human-Computer Interaction Institute, Carnegie Mellon University

Outline • What is TagHelper tools? • What can TagHelper Tools do for YOU? • How EASY is it to use TagHelper tools? • What are some TagHelper success stories? • What problems are we working on?

What is TagHelper tools?

What is TagHelper tools? • A PSLC Enabling Technology project • Machine learning technology for processing conversational data • Chat data • Newsgroup style conversational data • Short answers and explanations • Goal: automate the categorization of spans of text

What is TagHelper tools? • An add-on to Microsoft Excel • Research Focus: identify and solve text classification problems specific to learning sciences • Types of categories, nature and size of data sets

What can TagHelper tools do for YOU?

Main Uses for TagHelper tools • Supporting data analysis involving conversational data • Triggering interventions • Supporting on-line assessment

Example: Data Analysis

Example: Triggering an Intervention • ST1: well what values do u have for the reheat cycle ? • ST2: for some reason I said temperature at turbine to be like 400 C • Tutor: Let's think about the motivation for Reheat. What process does the steam undergo in the Turbines ? • …

Example: Supporting on-line assessment • Using instructor-assigned ratings as gold standard • Best performance without TagHelper tools: .16 correlation coefficient • Best performance with TagHelper tools: .63 correlation coefficient
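The comparison on this slide can be illustrated with a minimal sketch of how a correlation against gold-standard ratings might be computed. The slide does not say which correlation coefficient was used; Pearson's r is assumed here, and the function name and the ratings are hypothetical, invented only for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and of squared deviations from each mean.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant ratings")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical ratings (not data from the study).
instructor = [1, 2, 3, 4, 5]   # gold-standard ratings
predicted = [2, 1, 4, 3, 5]    # ratings assigned automatically
r = pearson_r(instructor, predicted)
print(round(r, 2))
```

A value near 1 means the automatic ratings rank and scale the answers much as the instructor did; a value near 0 means they carry little information about the instructor's ratings.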

How EASY is it to use TagHelper tools?

Setting Up Your Data

Iterative Process for Using TagHelper tools • Obtain data in natural language form • Iterative process • Decide on a unit of analysis • Single contributions, topic segments, whole messages, etc. • Decide on a set of categories or a rating system • Set up data in Excel • Assign categories to part of your data • Use TagHelper to assign categories to the remaining portion of your data

Training and Testing • Start TagHelper tools by double clicking on the portal.bat icon • You will then see the following tool palette • Train a prediction model on your coded data and then apply that model to uncoded data
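The train-then-apply workflow can be sketched in a few lines. This is not TagHelper's implementation: it assumes a simple multinomial Naive Bayes classifier over lowercase word counts, and the function names, category labels, and example texts are all hypothetical.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase whitespace tokenizer (a deliberately simple assumption)."""
    return text.lower().split()

def train(coded):
    """Build a Naive Bayes model from (text, label) pairs of hand-coded data."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in coded:
        label_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocab.add(word)
    return label_counts, word_counts, vocab

def predict(model, text):
    """Return the most probable label for an uncoded text."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total)  # log prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in tokenize(text):
            if word in vocab:  # words never seen in training are ignored
                # Add-one smoothing keeps unseen (label, word) pairs nonzero.
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical hand-coded rows, as they might sit in two spreadsheet columns.
coded = [
    ("try subtracting x from both sides", "help"),
    ("you should divide both sides by two", "help"),
    ("i am done with this one", "no-help"),
    ("ok thanks", "no-help"),
]
uncoded = ["divide both sides by three", "ok i am done"]

model = train(coded)
predictions = [predict(model, text) for text in uncoded]
print(predictions)
```

With so little training data the predictions are only illustrative; in practice the model's performance on held-out coded data should be checked before trusting its labels on the uncoded portion.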

Loading a File • First click on Add a File • Then select a file

Simplest Usage • Once your file is loaded, you have two options • The first option is to code your data using the default settings • To do this, simply click on “GO!” • The second option is to modify the default settings and then code • We will start with the first option • Note that the performance will not be optimal

Results • Performance on coded data • Results on uncoded data

A slightly more complex case…

Example: Data Analysis

Setting Up Your Data

What are some TagHelper success stories?

Success Story 1: Supporting Data Analysis • Peer tutoring in Algebra LearnLab • Data coded for high-level-help, low-level-help, and no-help • Important predictor of learning (e.g., Webb et al., 2003) • TagHelper achieves agreement of .82 Kappa • Can be used for follow-up studies in same domain * Contributed by Erin Walker
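Agreement figures such as the Kappa value above correct raw agreement for the agreement expected by chance. A minimal sketch follows, assuming Cohen's kappa for two coders (the slide does not name the variant); the function name and the label sequences are hypothetical.

```python
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two equal-length sequences of category labels."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("need two non-empty, equal-length label sequences")
    n = len(coder_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Chance agreement: product of each coder's marginal label proportions.
    freq_a = Counter(coder_a)
    freq_b = Counter(coder_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)

# Hypothetical codes (not the Algebra LearnLab data).
human = ["high", "high", "low", "none", "none", "low", "high", "none"]
model = ["high", "low", "low", "none", "none", "low", "high", "high"]
kappa = cohens_kappa(human, model)
print(round(kappa, 3))
```

Kappa is 1 for perfect agreement and 0 when agreement is no better than chance, which is why it is a stricter measure than percent agreement on skewed category distributions.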

Success Story 2: Triggering Interventions • Collaborative idea generation in the Earth Sciences domain • Chinese TagHelper learns hand-coded topic analysis • Human agreement .84 Kappa • TagHelper performance .7 Kappa • Trained models used in follow-up study to trigger interventions and facilitate data analysis

Example Dialogue * Feedback during idea generation increases both idea generation and learning (Wang et al., 2007)

Unique Ideas and Process Analysis [Figure: number of unique ideas plotted against time stamp; series Nom+N, Nom+F, Real+N, Real+F; conditions Individuals+Feedback, Individuals+NoFeedback, Pairs+Feedback, Pairs+NoFeedback] • Process loss, Pairs vs Individuals: F(1,24)=12.22, p<.005, 1 sigma • Process loss, Pairs vs Individuals: F(1,24)=4.61, p<.05, .61 sigma • Negative effect of feedback: F(1,24)=7.23, p<.05, -1.03 sigma • Positive effect of feedback: F(1,24)=16.43, p<.0005, 1.37 sigma

What problems are we working on?

Interesting Problems • Highly skewed data sets • Very infrequent classes are often the most interesting and important • Careful feature space design helps more than powerful algorithms • Huge problem with non-independence of data points from same student • Off-the-shelf machine learning algorithms not set up for this • New sampling techniques offer promise • “Medium” sized data sets • Contemporary machine learning approaches designed for huge data sets • Supplementing with alternative data sources may help
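One common way to respect the non-independence of data points from the same student during evaluation is to keep all of a student's contributions in the same cross-validation fold. The sketch below illustrates that idea only; it is not necessarily the sampling technique the slide alludes to, and the function name and student IDs are hypothetical.

```python
from collections import defaultdict

def student_folds(student_ids, k):
    """Yield (train_indices, test_indices) with each student confined to one fold."""
    by_student = defaultdict(list)
    for idx, sid in enumerate(student_ids):
        by_student[sid].append(idx)
    if k < 2 or k > len(by_student):
        raise ValueError("k must be between 2 and the number of students")
    # Greedy balancing: place students, largest first, into the smallest fold.
    folds = [[] for _ in range(k)]
    order = sorted(by_student, key=lambda s: (-len(by_student[s]), str(s)))
    for sid in order:
        smallest = min(folds, key=len)
        smallest.extend(by_student[sid])
    for i in range(k):
        test = sorted(folds[i])
        train = sorted(j for f_i, fold in enumerate(folds) if f_i != i for j in fold)
        yield train, test

# Hypothetical student ID for each coded contribution.
student_ids = ["s1", "s1", "s2", "s3", "s3", "s3", "s2", "s4"]
splits = list(student_folds(student_ids, k=2))
for train_idx, test_idx in splits:
    print(train_idx, test_idx)
```

Because no student appears on both sides of a split, the measured performance reflects how well a model generalizes to new students rather than to new contributions from students it has already seen.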

Example Lesson Learned • Problem: Context-oriented coding • Finding: Careful feature space design goes farther than powerful algorithms

Back to Argumentation Data

Sequential Learning • Notes sequential dependencies • Perhaps claims are stated before their warrants • Perhaps counter-arguments are given before new arguments • Perhaps people first build on their partner’s ideas and then offer a new idea

Thread Structure Features • Thread Depth • Best Parent Semantic Similarity [Figure: three threads, each made up of segments Seg1, Seg2, Seg3]
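As an illustration of a thread structure feature, the sketch below computes thread depth as the number of reply hops from a message up to the root of its thread. The slide does not define the feature, so this definition is an assumption, and the function name, the parent-link dictionary, and the message IDs are hypothetical.

```python
def thread_depth(parents, msg_id):
    """Count reply hops from msg_id up to its thread root (root has depth 0)."""
    depth = 0
    seen = {msg_id}
    current = msg_id
    while parents.get(current) is not None:
        current = parents[current]
        if current in seen:
            # Guard against malformed data where replies form a loop.
            raise ValueError("cycle in parent links")
        seen.add(current)
        depth += 1
    return depth

# Hypothetical thread: each message maps to the message it replies to.
parents = {"m1": None, "m2": "m1", "m3": "m2", "m4": "m1"}
depths = {m: thread_depth(parents, m) for m in parents}
print(depths)
```

A feature like this is easy to read off newsgroup-style data, where reply links are explicit, and much harder to recover from chat, where they must be inferred.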

Sequence Oriented Features • Notes whether text is within a certain proximity to quoted material
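A proximity-to-quote feature of this kind might be computed as in the sketch below. It assumes the newsgroup convention that quoted lines start with ">" and measures proximity in lines; the function name, the `window` parameter, and the example message are hypothetical and not taken from TagHelper.

```python
def near_quote_flags(lines, window=1):
    """For each line, report whether an author-written line sits near a quote."""
    quoted = [i for i, line in enumerate(lines) if line.lstrip().startswith(">")]
    quoted_set = set(quoted)
    flags = []
    for i in range(len(lines)):
        if i in quoted_set:
            # Quoted lines are not the author's own text, so they get False.
            flags.append(False)
        else:
            flags.append(any(abs(i - q) <= window for q in quoted))
    return flags

# Hypothetical message body with quoted material marked by ">".
message = [
    "> I think we should raise the temperature",
    "That seems too high to me",
    "",
    "Let's recheck the table",
    "> ok",
]
flags = near_quote_flags(message, window=1)
print(flags)
```

Widening `window` trades precision for coverage: a larger value marks more of the message as a response to the quoted material.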

Context-Based Feature Approach

Sequential Learning

What did we learn? • Intuition confirmed • Different dimensions responded differently to context-based enhancements • Feature-based approach was more effective • Thread structure features were especially informative for Social Modes dimension • Thread structure information is more difficult to extract from chat data • The best results from a similar approach on chat data achieved a kappa of only .45

Special Thanks To: William Cohen Pinar Donmez Jaime Arguello Gahgene Gweon Rohit Kumar Yue Cui Mahesh Joshi Yi-Chia Wang Hao-Chuan Wang Emil Albright Cammie Williams Questions?
