TagHelper Basics: Data Setup, Model Training & Performance Evaluation

Learn how to set up your data, train a model, and evaluate its performance using TagHelper tools. Detailed steps and tips are provided for effective feature extraction from text.

Presentation Transcript


  1. TagHelper: Basics Part 1 • Carolyn Penstein Rosé, Carnegie Mellon University • Funded through the Pittsburgh Science of Learning Center and the Office of Naval Research, Cognitive and Neural Sciences Division

  2. Outline • Setting up your data • Creating a trained model • Evaluating performance • Using a trained model • Overview of basic feature extraction from text

  3. Setting Up Your Data

  4. Setting Up Your Data

  5. How do you know when you have coded enough data? • What distinguishes Questions and Statements? • Not all questions end in a question mark • Not all WH words occur in questions • “I” versus “you” is not a reliable predictor • You need to code enough data to avoid learning rules that won’t work

  6. Creating a Trained Model

  7. Training and Testing • Start TagHelper tools by double-clicking on the portal.bat icon in your TagHelperTools2 folder • You will then see the following tool palette • The idea is that you will train a prediction model on your coded data and then apply that model to uncoded data • Click on Train New Models

  8. Loading a File • First click on Add a File • Then select a file

  9. Simplest Usage • Click “GO!” • TagHelper will use its default setting to train a model on your coded examples • It will use that model to assign codes to the uncoded examples

  10. More Advanced Usage • The second option is to modify the default settings • You get to the options you can set by clicking on >> Options • After you finish that, click “GO!”

  11. Output • You can find the output in the OUTPUT folder • There will be a text file named Eval_[name of coding dimension]_[name of input file].txt • This is a performance report • E.g., Eval_Code_SimpleExample.xls.txt • There will also be a file named [name of input file]_OUTPUT.xls • This is the coded output • E.g., SimpleExample_OUTPUT.xls

  12. Using the Output File Prefix • If you use the Output file prefix, the text you enter will be prepended to the names of the output files • There will be a text file named [prefix]_Eval_[name of coding dimension]_[name of input file].txt • E.g., Prefix1_Eval_Code_SimpleExample.xls.txt • There will also be a file named [prefix]_[name of input file]_OUTPUT.xls • E.g., Prefix1_SimpleExample_OUTPUT.xls
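The bracketed naming patterns on the two output slides can be sketched as a small helper. This is only an illustration of the patterns as stated, not TagHelper's own code: the function name `output_file_names` and its parameters are hypothetical, and it assumes the input file name has an extension.

```python
def output_file_names(input_file, dimension, prefix=""):
    """Return (performance_report, coded_output) file names.

    Hypothetical helper that follows the naming patterns on the slides:
      [prefix]_Eval_[coding dimension]_[input file].txt
      [prefix]_[input file]_OUTPUT.xls
    """
    lead = f"{prefix}_" if prefix else ""
    # Split on the last dot: "SimpleExample.xls" -> ("SimpleExample", "xls").
    # Assumes the input file name contains an extension.
    stem, _, ext = input_file.rpartition(".")
    report = f"{lead}Eval_{dimension}_{input_file}.txt"
    coded = f"{lead}{stem}_OUTPUT.{ext}"
    return report, coded


report, coded = output_file_names("SimpleExample.xls", "Code")
print(report)
print(coded)
```

Passing `prefix="Prefix1"` prepends `Prefix1_` to both names.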

  13. Evaluating Performance

  14. Performance report • The performance report tells you: • What dataset was used • What the customization settings were • At the bottom of the file are reliability statistics and a confusion matrix that tells you which types of errors are being made
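To make the last bullet concrete, here is a minimal sketch of how a confusion matrix and one common reliability statistic, Cohen's kappa, can be computed from human codes and predicted codes. The slides do not say which reliability statistics the report contains, so the choice of kappa is an assumption, and the function names are hypothetical.

```python
from collections import Counter


def confusion_matrix(gold, predicted):
    """Rows are human (gold) codes, columns are predicted codes."""
    labels = sorted(set(gold) | set(predicted))
    counts = Counter(zip(gold, predicted))
    return labels, [[counts[(g, p)] for p in labels] for g in labels]


def cohen_kappa(gold, predicted):
    """Agreement between two codings, corrected for chance agreement."""
    n = len(gold)
    observed = sum(g == p for g, p in zip(gold, predicted)) / n
    gold_freq = Counter(gold)
    pred_freq = Counter(predicted)
    # Chance agreement: probability both codings pick the same label at random.
    expected = sum(gold_freq[label] * pred_freq[label] for label in gold_freq) / (n * n)
    if expected == 1:
        return 1.0  # Only one label in use; agreement is trivially perfect.
    return (observed - expected) / (1 - expected)


gold = ["Q", "Q", "S", "S"]
predicted = ["Q", "S", "S", "S"]
labels, matrix = confusion_matrix(gold, predicted)
print(labels, matrix)
print(cohen_kappa(gold, predicted))
```

Off-diagonal cells show which codes are being confused: here one question was predicted to be a statement.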

  17. Output File • The output file contains • The codes for each segment • Note that the segments that were already coded will retain their original code • The other segments will have their automatic predictions • The prediction column indicates the confidence of the prediction

  18. Using a Trained Model

  19. Applying a Trained Model • Select a model file • Then select a testing file

  20. Applying a Trained Model • Testing data should be set up with a “?” in place of the code on uncoded examples • Click “GO!” to process the file
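The convention of marking uncoded examples with ? can be sketched as follows. This is an illustration only: it assumes each row is a (code, text) pair, and `split_coded`, `fill_codes`, and the `predict` callback are hypothetical names, not part of TagHelper.

```python
UNCODED = "?"  # Marker the slides use for examples without a code.


def split_coded(rows):
    """Separate (code, text) rows into training pairs and uncoded texts."""
    coded = [(code, text) for code, text in rows if code.strip() != UNCODED]
    uncoded = [text for code, text in rows if code.strip() == UNCODED]
    return coded, uncoded


def fill_codes(rows, predict):
    """Keep existing codes; call predict(text) only where the code is '?'."""
    return [
        (predict(text) if code.strip() == UNCODED else code, text)
        for code, text in rows
    ]


rows = [("Q", "What is 9?"), ("?", "ok sure"), ("S", "It is 9.")]
coded, uncoded = split_coded(rows)
print(coded, uncoded)
# A stand-in predictor that always answers "S"; a trained model would go here.
print(fill_codes(rows, lambda text: "S"))
```

As in the output file described earlier, rows that were already coded keep their original code and only the ? rows receive a prediction.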

  21. Results

  22. Overview of Basic Feature Extraction from Text

  23. Customizations • To customize the settings: • Select the file • Click on Options

  24. Setting the Language • You can change the default language from English to German • Chinese requires an additional license from Academia Sinica in Taiwan

  25. Preparing to get a performance report • You can decide whether you want TagHelper to prepare a performance report for you (it runs faster when this is disabled)

  26. TagHelper Customizations • Typical classification algorithms • Naïve Bayes • SMO (Weka’s implementation of Support Vector Machines) • J48 (decision trees) • Rules of thumb: • SMO is state-of-the-art for text classification • J48 is best with small feature sets – also handles contingencies between features well • Naïve Bayes works well for models where decisions are made based on accumulating evidence rather than hard and fast rules
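The "accumulating evidence" rule of thumb for Naïve Bayes can be illustrated with a minimal multinomial Naïve Bayes sketch: every token adds a little log-probability to each code's score, and the code with the highest total wins. This is a toy written for illustration with add-one smoothing, not Weka's implementation, and the function names and training examples are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples):
    """examples: list of (label, tokens) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, tokens in examples:
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return label_counts, word_counts, vocab


def classify(model, tokens):
    """Return the label whose accumulated log-probability is highest."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    scores = {}
    for label, count in label_counts.items():
        score = math.log(count / total)  # Prior for this label.
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokens:
            if tok in vocab:  # Ignore tokens never seen in training.
                # Add-one smoothing keeps unseen (label, token) pairs nonzero.
                score += math.log((word_counts[label][tok] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)


examples = [
    ("Q", ["what", "is", "the", "answer", "?"]),
    ("Q", ["do", "you", "agree", "?"]),
    ("S", ["the", "answer", "is", "9", "."]),
    ("S", ["i", "agree", "."]),
]
model = train_naive_bayes(examples)
print(classify(model, ["you", "agree", "?"]))
```

No single token decides the outcome here: "agree" appears under both codes, and the question mark and "you" tip the balance.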

  27. TagHelper Customizations • Feature Space Design • Think like a computer! • Machine learning algorithms look for features that are good predictors, not features that are necessarily meaningful • Look for approximations • If you want to find questions, you don’t need to do a complete syntactic analysis • Look for question marks • Look for wh-terms that occur immediately before an auxiliary verb • Look for topics likely to be indicative of questions (if you’re talking about ice cream, and someone mentions flavor without mentioning a specific flavor, it might be a question)
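The first two approximations for spotting questions can be sketched in a few lines. This is an illustration, not TagHelper's feature extractor: the word lists `WH_TERMS` and `AUXILIARIES` are short hypothetical samples, and the tokenizer is deliberately crude.

```python
import re

# Hypothetical, incomplete word lists used only for this sketch.
WH_TERMS = {"who", "what", "when", "where", "why", "which", "how"}
AUXILIARIES = {"is", "are", "was", "were", "do", "does", "did",
               "can", "could", "will", "would", "should", "has", "have", "had"}


def tokenize(text):
    """Lowercase the text and split it into words and sentence punctuation."""
    return re.findall(r"[a-z0-9']+|[?.!]", text.lower())


def question_features(text):
    tokens = tokenize(text)
    # True if any wh-term is immediately followed by an auxiliary verb.
    wh_before_aux = any(
        first in WH_TERMS and second in AUXILIARIES
        for first, second in zip(tokens, tokens[1:])
    )
    return {"has_question_mark": "?" in tokens, "wh_before_aux": wh_before_aux}


print(question_features("What is the answer"))
print(question_features("you think the answer is 9?"))
print(question_features("I know what you mean."))
```

Each feature is only an approximation. A statement such as "the answer which is 9" also triggers `wh_before_aux`, which is why the learning algorithm needs several features and enough coded data to weigh them.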

  28. TagHelper Customizations • Feature Space Design • Punctuation can be a “stand in” for mood • “you think the answer is 9?” • “you think the answer is 9.” • Bigrams capture simple lexical patterns • “common denominator” versus “common multiple” • POS bigrams capture stylistic information • “the answer which is …” vs “which is the answer” • Line length can be a proxy for explanation depth
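Three of these features (punctuation, word bigrams, and line length) are simple enough to sketch directly. The function name `surface_features` is hypothetical, and measuring line length in words is an assumption, since the slides do not give the unit. POS bigrams are left out because they require a part-of-speech tagger.

```python
import re


def surface_features(text):
    """Extract final punctuation, word bigrams, and line length in words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    # Join each adjacent word pair, e.g. "common_denominator".
    bigrams = [f"{first}_{second}" for first, second in zip(words, words[1:])]
    stripped = text.strip()
    final = stripped[-1] if stripped and stripped[-1] in "?.!" else ""
    return {
        "bigrams": bigrams,
        "final_punctuation": final,
        "line_length": len(words),  # Assumption: length counted in words.
    }


print(surface_features("you think the answer is 9?"))
print(surface_features("you think the answer is 9."))
print(surface_features("common denominator"))
```

The first two calls differ only in `final_punctuation`, which is exactly the mood distinction the slide describes.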

  29. TagHelper Customizations • Feature Space Design • The “contains non-stop word” feature can be a predictor of whether a conversational contribution is contentful • “ok sure” versus “the common denominator” • The “remove stop words” option removes some distracting features • Stemming allows some generalization • Multiple, multiply, multiplication • Removing rare features is a cheap form of feature selection • Features that only occur once or twice in the corpus won’t generalize, so they are a waste of time to include in the vector space
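The stop-word and rare-feature ideas can be sketched as follows. The stop-word list is a tiny hypothetical sample rather than TagHelper's list, and the threshold `min_count=3` is an assumption chosen to match "once or twice". Stemming is omitted because it needs a real stemming algorithm.

```python
from collections import Counter

# Hypothetical sample list; a real stop-word list is much longer.
STOP_WORDS = {"ok", "sure", "the", "a", "an", "is", "of", "to", "and", "yes", "no"}


def contains_non_stop_word(tokens):
    """True if at least one token carries content."""
    return any(tok not in STOP_WORDS for tok in tokens)


def remove_stop_words(tokens):
    return [tok for tok in tokens if tok not in STOP_WORDS]


def drop_rare_features(documents, min_count=3):
    """Remove tokens that occur fewer than min_count times in the corpus."""
    counts = Counter(tok for doc in documents for tok in doc)
    return [[tok for tok in doc if counts[tok] >= min_count] for doc in documents]


print(contains_non_stop_word(["ok", "sure"]))
print(contains_non_stop_word(["the", "common", "denominator"]))
print(remove_stop_words(["the", "common", "denominator"]))

corpus = [["common", "denominator"], ["common", "multiple"], ["common", "factor"]]
print(drop_rare_features(corpus))
```

In the small corpus above only "common" survives, because every other token occurs just once.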

  30. Questions?
