Learn how to set up data, train a model, and evaluate performance using TagHelper tools. Detailed steps and tips are provided for effective text feature extraction.
TagHelper: Basics Part 1 • Carolyn Penstein Rosé, Carnegie Mellon University • Funded through the Pittsburgh Science of Learning Center and the Office of Naval Research, Cognitive and Neural Sciences Division
Outline • Setting up your data • Creating a trained model • Evaluating performance • Using a trained model • Overview of basic feature extraction from text
How do you know when you have coded enough data? • What distinguishes Questions and Statements? • Not all questions end in a question mark • Not all WH words occur in questions • “I” versus “you” is not a reliable predictor • You need to code enough data to avoid learning rules that won’t work
Training and Testing • Start TagHelper tools by double-clicking on the portal.bat icon in your TagHelperTools2 folder • You will then see the following tool palette • The idea is that you will train a prediction model on your coded data and then apply that model to uncoded data • Click on Train New Models
Loading a File • First click on Add a File • Then select a file
Simplest Usage • Click “GO!” • TagHelper will use its default setting to train a model on your coded examples • It will use that model to assign codes to the uncoded examples
More Advanced Usage • The second option is to modify the default settings • You get to the options you can set by clicking on >> Options • After you finish that, click “GO!”
Output • You can find the output in the OUTPUT folder • There will be a text file named Eval_[name of coding dimension]_[name of input file].txt • This is a performance report • E.g., Eval_Code_SimpleExample.xls.txt • There will also be a file named [name of input file]_OUTPUT.xls • This is the coded output • E.g., SimpleExample_OUTPUT.xls
Using the Output file Prefix • If you use the Output file prefix, the text you enter will be prepended to the names of the output files • There will be a text file named [prefix]_Eval_[name of coding dimension]_[name of input file].txt • E.g., Prefix1_Eval_Code_SimpleExample.xls.txt • There will also be a file named [prefix]_[name of input file]_OUTPUT.xls • E.g., Prefix1_SimpleExample_OUTPUT.xls
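The naming patterns on the two Output slides can be sketched in a few lines of Python. This is only an illustration derived from the patterns as written on the slides, not from TagHelper's own code; the function name `expected_output_names` and its parameters are hypothetical.

```python
from pathlib import Path

def expected_output_names(input_file, dimension, prefix=""):
    """Return (performance report name, coded output name).

    Hypothetical helper: it only mirrors the naming patterns stated on
    the slides and does not call TagHelper.
    """
    lead = f"{prefix}_" if prefix else ""
    # The report name keeps the full input file name, extension included.
    report = f"{lead}Eval_{dimension}_{input_file}.txt"
    # The coded output uses the input file name without its extension.
    stem = Path(input_file).stem
    coded = f"{lead}{stem}_OUTPUT.xls"
    return report, coded

print(expected_output_names("SimpleExample.xls", "Code"))
print(expected_output_names("SimpleExample.xls", "Code", prefix="Prefix1"))
```

Both files would be looked for in the OUTPUT folder.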
Performance report • The performance report tells you: • What dataset was used • What the customization settings were • At the bottom of the file are reliability statistics and a confusion matrix that tells you which types of errors are being made
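To make the last bullet concrete, here is a minimal Python sketch of a confusion matrix and one common reliability statistic, Cohen's kappa. The slides do not say which statistics the report contains, so kappa is an assumption here, and the labels and data are made up for illustration.

```python
from collections import Counter

def confusion_matrix(gold, predicted):
    """Count how often each (human code, predicted code) pair occurs."""
    return Counter(zip(gold, predicted))

def cohens_kappa(gold, predicted):
    """Agreement between two codings, corrected for chance agreement."""
    n = len(gold)
    observed = sum(g == p for g, p in zip(gold, predicted)) / n
    gold_counts = Counter(gold)
    pred_counts = Counter(predicted)
    # Chance agreement: probability both codings pick the same label at random.
    expected = sum(gold_counts[c] * pred_counts[c] for c in gold_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # only one label is in use, so agreement is trivially perfect
    return (observed - expected) / (1 - expected)

# Made-up example: Q = question, S = statement.
gold = ["Q", "Q", "S", "S", "S", "Q"]
pred = ["Q", "S", "S", "S", "Q", "Q"]

for (g, p), count in sorted(confusion_matrix(gold, pred).items()):
    print(f"human={g} predicted={p}: {count}")
print("kappa:", round(cohens_kappa(gold, pred), 3))
```

The off-diagonal cells (human=Q predicted=S, and the reverse) are the ones that show which types of errors are being made.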
Output File • The output file contains • The codes for each segment • Note that the segments that were already coded will retain their original code • The other segments will have their automatic predictions • The prediction column indicates the confidence of the prediction
Applying a Trained Model • Select a model file • Then select a testing file
Applying a Trained Model • Testing data should be set up with ? on uncoded examples • Click “GO!” to process the file
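The ? convention can be pictured with a short Python sketch: segments that already carry a code keep it, and only segments marked ? receive a prediction. The row layout (code, text) and the stand-in predictor `toy_predict` are assumptions made for illustration; TagHelper reads spreadsheets and uses a trained model instead.

```python
UNCODED = "?"

def fill_uncoded(rows, predict):
    """rows: list of (code, text) pairs; predict: function mapping text to a code.

    Already coded rows keep their original code; rows marked "?" get a prediction.
    """
    result = []
    for code, text in rows:
        if code.strip() == UNCODED:
            result.append((predict(text), text))
        else:
            result.append((code, text))
    return result

def toy_predict(text):
    # Stand-in for a trained model: guesses from the final character only.
    return "Question" if text.rstrip().endswith("?") else "Statement"

rows = [
    ("Statement", "is that right?"),  # human code is retained, even if a model would disagree
    ("?", "what is 9?"),
    ("?", "it is 9"),
]
for code, text in fill_uncoded(rows, toy_predict):
    print(code, "|", text)
```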
Customizations • To customize the settings: • Select the file • Click on Options
Setting the Language • You can change the default language from English to German • Chinese requires an additional license from Academia Sinica in Taiwan
Preparing to get a performance report • You can decide whether you want TagHelper to prepare a performance report for you (it runs faster when this is disabled)
TagHelper Customizations • Typical classification algorithms • Naïve Bayes • SMO (Weka’s implementation of Support Vector Machines) • J48 (decision trees) • Rules of thumb: • SMO is state-of-the-art for text classification • J48 is best with small feature sets – also handles contingencies between features well • Naïve Bayes works well for models where decisions are made based on accumulating evidence rather than hard and fast rules
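The phrase "accumulating evidence" can be illustrated with a tiny Naïve Bayes classifier in plain Python. This is a teaching sketch with add-one smoothing and made-up training data, not Weka's implementation; the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """examples: list of (label, tokens). Returns (log priors, token counts, vocabulary)."""
    label_counts = Counter(label for label, _ in examples)
    token_counts = defaultdict(Counter)
    vocab = set()
    for label, tokens in examples:
        token_counts[label].update(tokens)
        vocab.update(tokens)
    total = sum(label_counts.values())
    log_prior = {label: math.log(c / total) for label, c in label_counts.items()}
    return log_prior, token_counts, vocab

def classify(model, tokens):
    """Each known token adds a little log-probability evidence to every label's score."""
    log_prior, token_counts, vocab = model
    scores = {}
    for label, prior in log_prior.items():
        denom = sum(token_counts[label].values()) + len(vocab)  # add-one smoothing
        score = prior
        for tok in tokens:
            if tok in vocab:  # tokens never seen in training are ignored
                score += math.log((token_counts[label][tok] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)

# Made-up coded examples: Q = question, S = statement.
training = [
    ("Q", ["what", "is", "the", "answer", "?"]),
    ("Q", ["is", "it", "9", "?"]),
    ("S", ["the", "answer", "is", "9", "."]),
    ("S", ["it", "is", "9", "."]),
]
model = train_naive_bayes(training)
print(classify(model, ["is", "the", "answer", "9", "?"]))
print(classify(model, ["it", "is", "9", "."]))
```

No single token decides the outcome; the label whose summed evidence is largest wins, which is the contrast with the hard and fast rules a decision tree learns.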
TagHelper Customizations • Feature Space Design • Think like a computer! • Machine learning algorithms look for features that are good predictors, not features that are necessarily meaningful • Look for approximations • If you want to find questions, you don’t need to do a complete syntactic analysis • Look for question marks • Look for wh-terms that occur immediately before an auxiliary verb • Look for topics likely to be indicative of questions (if you’re talking about ice cream, and someone mentions flavor without mentioning a specific flavor, it might be a question)
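Two of these approximations, the question mark and the wh-term immediately before an auxiliary verb, can be sketched with a regular expression. The word lists below are short hypothetical samples, not TagHelper's own, and the feature names are made up for illustration.

```python
import re

# Hypothetical, deliberately short word lists.
WH_TERMS = ("what", "who", "whom", "whose", "which", "when", "where", "why", "how")
AUXILIARIES = ("is", "are", "was", "were", "do", "does", "did",
               "can", "could", "will", "would", "should", "has", "have", "had")

# A wh-term followed directly by an auxiliary verb, e.g. "what is", "how do".
WH_BEFORE_AUX = re.compile(
    r"\b(?:%s)\s+(?:%s)\b" % ("|".join(WH_TERMS), "|".join(AUXILIARIES)),
    re.IGNORECASE,
)

def question_features(text):
    """Cheap surface cues for questions; no syntactic analysis involved."""
    return {
        "has_question_mark": "?" in text,
        "wh_before_aux": bool(WH_BEFORE_AUX.search(text)),
    }

print(question_features("What is the common denominator"))
print(question_features("you think the answer is 9?"))
print(question_features("the answer which is 9"))
```

The third call also sets `wh_before_aux`, even though the sentence is a statement. That is the price of an approximation: the learner sees the cue as one predictor among several rather than as a definition of a question.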
TagHelper Customizations • Feature Space Design • Punctuation can be a “stand in” for mood • “you think the answer is 9?” • “you think the answer is 9.” • Bigrams capture simple lexical patterns • “common denominator” versus “common multiple” • POS bigrams capture stylistic information • “the answer which is …” vs “which is the answer” • Line length can be a proxy for explanation depth
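A minimal Python sketch of three of these features (final punctuation, word bigrams, and line length) follows. The tokenizer is a simplification chosen for illustration, and POS bigrams are left out because they need a part-of-speech tagger; none of the names below come from TagHelper.

```python
import re

PUNCTUATION = {"?", ".", "!"}

def tokenize(text):
    """Lowercase words and digits, plus sentence punctuation as separate tokens."""
    return re.findall(r"[a-z0-9]+|[?.!]", text.lower())

def bigrams(tokens):
    """Adjacent token pairs, e.g. ("common", "denominator")."""
    return list(zip(tokens, tokens[1:]))

def surface_features(text):
    tokens = tokenize(text)
    words = [t for t in tokens if t not in PUNCTUATION]
    final = tokens[-1] if tokens and tokens[-1] in PUNCTUATION else None
    return {
        "final_punct": final,       # stand-in for mood
        "bigrams": bigrams(words),  # simple lexical patterns
        "length": len(words),       # rough proxy for explanation depth
    }

print(surface_features("you think the answer is 9?"))
print(surface_features("you think the answer is 9."))
print(surface_features("common denominator")["bigrams"])
```

The first two calls differ only in `final_punct`, which is exactly the information a bag of words without punctuation would lose.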
TagHelper Customizations • Feature Space Design • The “contains non-stop word” feature can be a predictor of whether a conversational contribution is contentful • “ok sure” versus “the common denominator” • The “remove stop words” option removes some distracting features • Stemming allows some generalization • Multiple, multiply, multiplication • Removing rare features is a cheap form of feature selection • Features that only occur once or twice in the corpus won’t generalize, so they are a waste of time to include in the vector space
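These options can be sketched in plain Python. Everything here is an assumption made for illustration: the stop word list is a tiny hypothetical sample, and `crude_stem` is a toy suffix stripper, not the stemmer TagHelper uses.

```python
from collections import Counter

# Hypothetical stop word list; a real list is much longer.
STOP_WORDS = {"the", "a", "an", "is", "it", "ok", "sure", "yes", "no"}

# Longest suffixes first so "multiplication" loses "ication", not just "ation".
SUFFIXES = ("ication", "ation", "ing", "ed", "es", "s", "y", "e")

def contains_non_stop_word(tokens):
    return any(t not in STOP_WORDS for t in tokens)

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

def crude_stem(word):
    """Strip one common suffix, keeping at least three characters of the word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def remove_rare(docs, min_count=3):
    """Drop features seen fewer than min_count times in the whole corpus."""
    counts = Counter(t for doc in docs for t in doc)
    return [[t for t in doc if counts[t] >= min_count] for doc in docs]

print(contains_non_stop_word(["ok", "sure"]))
print(contains_non_stop_word(["the", "common", "denominator"]))
print({crude_stem(w) for w in ["multiple", "multiply", "multiplication"]})
print(remove_rare([["a", "b"], ["a"], ["a", "c"]]))
```

With the default `min_count=3`, features that occur only once or twice are dropped, matching the rule of thumb on the slide.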