
CoCQA: Co-Training Over Questions and Answers with an Application to Predicting Question Subjectivity Orientation

Baoli Li, Yandong Liu, and Eugene Agichtein, Emory University


Presentation Transcript


  1. CoCQA: Co-Training Over Questions and Answers with an Application to Predicting Question Subjectivity Orientation. Baoli Li, Yandong Liu, and Eugene Agichtein, Emory University

  2. Community Question Answering • An effective way of seeking information from other users • The archive of resolved questions can be searched

  3. Community Question Answering (CQA) • Yahoo! Answers • Users • Asker: post questions • Answerer: post answers • Voter: vote for existing answers • Questions • Subject • Detail • Answers • Answer text • Votes • Archive: millions of questions and answers

  4. User Lifecycle of a Question in CQA • [Flowchart: Choose a category; Compose the question; Open question; Examine answers; Find the answer? If yes: Close question, Choose best answers, Give ratings. If no: Question is closed by system, Best answer is chosen by voters.]

  5. Problem Statement • How can we exploit structure of CQA to improve question classification? • Case Study: Question Subjectivity Prediction • Subjective questions: seek answers containing private states such as personal opinion, judgment, and experience; • Objective questions: are expected to be answered with reliable or authoritative information;

  6. Example Questions • Subjective: • Has anyone got one of those home blood pressure monitors? and if so what make is it and do you think they are worth getting? • Objective: • What is the difference between chemotherapy and radiation treatments?

  7. Motivation • Guiding the CQA engine to process questions more intelligently • Some Applications • Ranking/filtering answers • Improving question archive search • Evaluating answers provided by users • Inferring user intent

  8. Challenges • Challenges in analyzing real online questions: • Typically complex and subjective • Can be ill-phrased and vague • Not enough annotated data

  9. Key Observations • Can we utilize the inherent structure of the CQA interactions, and use the unlimited amounts of unlabeled data to improve classification performance?

  10. Natural Approach: Co-Training • Introduced by • Combining labeled and unlabeled data with co-training, Blum and Mitchell, 1998 • Two views of the data • E.g.: content and hyperlinks in web pages • Provide complementary information for each other • Iteratively construct additional labeled data • Can often significantly improve accuracy

  11. Questions and Answers: Two Views • Example: • Q: Has anyone got one of those home blood pressure monitors? and if so what make is it and do you think they are worth getting? • A: My mom has one as she is diabetic so its important for her to monitor it she finds it useful. • Answers usually match/fit question • My mom… she finds… • Askers can usually identify matching answers by selecting the “best answer”

  12. CoCQA: A Co-Training Framework over Questions and Answers • [Diagram with components: Labeled Data; Unlabeled Data; classifier CQ over questions (Q) and classifier CA over answers (A); Classify; Validation (holdout training data); Stop]

  13. Details of CoCQA implementation • Base classifier • LibSVM • Term Frequency as Term Weight • Also tried Binary, TF*IDF • Select top K examples with highest confidence • Margin value in SVM
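
The loop described on slides 10 to 13 can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the hypothetical `train_centroid` scorer is a simple stand-in for LibSVM, the absolute score stands in for the SVM margin, and a fixed iteration count replaces the validation-based stopping shown in the framework diagram. All function names and the data layout are assumptions made for this example.

```python
from collections import defaultdict

def train_centroid(examples, view):
    """Stand-in linear scorer (NOT LibSVM): difference of per-class mean TF vectors.
    examples: list of (views, label), label +1 (subjective) or -1 (objective)."""
    sums = {+1: defaultdict(float), -1: defaultdict(float)}
    counts = {+1: 0, -1: 0}
    for views, label in examples:
        counts[label] += 1
        for term, tf in views[view].items():
            sums[label][term] += tf
    weights = defaultdict(float)
    for label in (+1, -1):
        for term, total in sums[label].items():
            weights[term] += label * total / max(counts[label], 1)
    return weights

def score(weights, features):
    """Signed score; its absolute value plays the role of the SVM margin."""
    return sum(weights.get(term, 0.0) * tf for term, tf in features.items())

def co_train(labeled, unlabeled, k=2, iterations=3):
    """Each view's classifier labels its top-k most confident unlabeled examples,
    which are then added to the shared labeled set."""
    labeled = list(labeled)
    pool = list(unlabeled)
    for _ in range(iterations):
        if not pool:
            break
        chosen = {}  # pool index -> predicted label; the first view to pick wins
        for view in ("question", "answer"):
            w = train_centroid(labeled, view)
            scored = [(abs(score(w, ex[view])), i, score(w, ex[view]))
                      for i, ex in enumerate(pool)]
            scored.sort(reverse=True)
            for confidence, i, s in scored[:k]:
                if confidence > 0 and i not in chosen:
                    chosen[i] = 1 if s > 0 else -1
        if not chosen:
            break
        labeled.extend((pool[i], label) for i, label in chosen.items())
        pool = [ex for i, ex in enumerate(pool) if i not in chosen]
    return train_centroid(labeled, "question"), train_centroid(labeled, "answer"), labeled
```

Here each example is a dict with a `"question"` and an `"answer"` term-frequency map, mirroring the two views; how the two final classifiers are combined at prediction time is left open, since the slides do not specify it.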

  14. Feature Set • Character 3-grams • has, any, nyo, yon, one… • Words • Has, anyone, got, mom, she, finds… • Word with Character 3-grams • Word n-grams (n<=3, i.e. Wi, WiWi+1, WiWi+1Wi+2) • Has anyone got, anyone got one, she finds it… • Word and POS n-gram (n<=3, i.e. Wi, WiWi+1, Wi POSi+1, POSiWi+1, POSiPOSi+1, etc.) • NP VBP, She PRP, VBP finds…
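
As a rough illustration of the first and fourth feature types above, the sketch below extracts character 3-grams and word n-grams. It assumes whitespace tokenization, lowercasing for character grams, and character grams that stay inside word boundaries (which matches the slide's example "has, any, nyo, yon, one"); the function names are hypothetical and the authors' exact preprocessing is not given in the slides.

```python
def char_trigrams(text):
    """Character 3-grams taken inside each lowercased word (never across spaces)."""
    grams = []
    for word in text.lower().split():
        word = word.strip("?.,!")  # assumption: drop surrounding punctuation
        grams.extend(word[i:i + 3] for i in range(len(word) - 2))
    return grams

def word_ngrams(text, max_n=3):
    """Word n-grams for n = 1..max_n, i.e. Wi, WiWi+1, WiWi+1Wi+2."""
    words = [w.strip("?.,!") for w in text.split()]
    words = [w for w in words if w]
    grams = []
    for n in range(1, max_n + 1):
        grams.extend(" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    return grams

print(char_trigrams("Has anyone got"))
print(word_ngrams("Has anyone got one"))
```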

  15. Overview of Experimental Setup • Datasets • From Yahoo! Answers • Manually labeled via Amazon Mechanical Turk • Metrics • Compare CoCQA to a state-of-the-art semi-supervised method

  16. Dataset • 1,000 Labeled Questions from Yahoo! Answers • 5 categories (Arts, Education, Science, Health, and Sports) • 200 questions from each category • 10,000 Unlabeled Questions from Yahoo! Answers • 2,000 questions from each category • Data available at • http://ir.mathcs.emory.edu/shared

  17. Manual Labeling • Annotated using Amazon’s Mechanical Turk service • Each question was judged by 5 Mechanical Turk workers • 25 questions included in each HIT task • Worker needs to pass the qualification test • Majority vote to derive gold standard • Discarded small fraction (22 out of 1000) of nonsensical questions such as “Upward Soccer Shorts?” and “1+1=?fdgdgdfg” by manual inspection
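
The majority-vote step can be sketched in a few lines. The label strings and the function name are assumptions for illustration; with five judgments over two labels a tie cannot occur, so no tie-breaking rule is needed here.

```python
from collections import Counter

def majority_label(votes):
    """Return the label chosen by most workers for one question."""
    label, _count = Counter(votes).most_common(1)[0]
    return label

# Five hypothetical worker judgments for a single question.
votes = ["subjective", "objective", "subjective", "subjective", "objective"]
print(majority_label(votes))
```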

  18. Example HIT task

  19. Subjectivity Statistics by Category Objective Subjective

  20. Evaluation Metric • Macro-Averaged F-1 • Prediction performance on both subjective questions and objective questions is equally important • F-1 • Averaged over subjective and objective classes
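
For concreteness, macro-averaged F-1 over the two classes can be computed as below. This is a plain sketch of the standard definition (per-class F-1, then an unweighted mean); the label strings and function names are assumptions.

```python
def f1_for_class(gold, predicted, label):
    """F-1 for one class, treating `label` as the positive class."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, predicted) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, predicted) if g == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(gold, predicted, labels=("subjective", "objective")):
    """Unweighted mean of per-class F-1, so both classes count equally."""
    return sum(f1_for_class(gold, predicted, lab) for lab in labels) / len(labels)
```

Because the mean is unweighted, a classifier that does well only on the more frequent class is penalized, which matches the stated goal of treating both classes as equally important.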

  21. Experimental Settings • 5-fold cross validation • Methods Compared: • Supervised: LibSVM (Chang and Lin, 2001) • Generalized Expectation (GE): (Mann and McCallum, 2007) • CoCQA: our method • Base classifier: LibSVM • View 1: question text; View 2: answer text
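
A 5-fold split can be sketched as follows. The slides do not say how folds were formed, so the interleaved assignment here (no shuffling, no stratification by category or class) is purely an assumption, and `k_fold_indices` is a hypothetical helper.

```python
def k_fold_indices(n_items, k=5):
    """Yield (train_idx, test_idx) pairs; fold i holds out every k-th item starting at i."""
    indices = list(range(n_items))
    for fold in range(k):
        test = indices[fold::k]
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test

for train, test in k_fold_indices(10, k=5):
    print(len(train), test)
```

Each labeled question is held out exactly once; reported scores would then be averaged over the five folds.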

  22. F1 for Supervised Learning F1 with different sets of features

  23. Semi Supervised Learning: Adding unlabeled data Comparison between Supervised, GE and CoCQA

  24. CoCQA with varying K(# new examples added in each iteration)

  25. CoCQA for varying # iterations

  26. CoCQA for varying amount of labeled data

  27. Conclusions and Future Work • Problem: Non-topical text classification in CQA • CoCQA: a co-training framework that can exploit information from both question and answers • Case study: subjectivity classification for real questions in CQA • We plan to explore: • more sophisticated features; • related variants of semi-supervised learning; • other applications (Sentiment classification)

  28. Thank you! • Baoli Li: csblli@gmail.com • Yandong Liu: yandong.liu@emory.edu • Eugene Agichtein: eugene@mathcs.emory.edu

  29. Performance of Subjective vs. Objective classes • Subjective class • 80% • Objective class • 60%

  30. Related work • Some related work: • Question Classification: (Zhang and Lee, 2003), (Tri et al., 2006) • Sentiment Analysis: (Pang and Lee, 2004) • (Yu and Hatzivassiloglou, 2003) • (Somasundaran et al., 2007)

  31. Important words for Subjective, Objective classes by Information Gain
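
The slide's word ranking is not reproduced in this transcript, but the criterion itself can be illustrated. The sketch below computes the information gain of a single term as the reduction in class entropy after splitting questions on whether the term is present; the data layout and function names are assumptions, and the toy documents are invented for illustration only.

```python
import math

def entropy(counts):
    """Shannon entropy (bits) of a class-count distribution."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0) if total else 0.0

def information_gain(docs, term):
    """docs: list of (set_of_terms, label). IG = H(label) - H(label | term present or absent)."""
    labels = sorted({label for _, label in docs})

    def class_counts(subset):
        return [sum(1 for _, lab in subset if lab == wanted) for wanted in labels]

    with_term = [d for d in docs if term in d[0]]
    without_term = [d for d in docs if term not in d[0]]
    h_before = entropy(class_counts(docs))
    h_after = sum(len(s) / len(docs) * entropy(class_counts(s))
                  for s in (with_term, without_term))
    return h_before - h_after

# Invented toy data, not taken from the paper's dataset.
docs = [({"think"}, "subjective"), ({"think", "you"}, "subjective"),
        ({"difference"}, "objective"), ({"what"}, "objective")]
print(information_gain(docs, "think"))
```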
