
Yejun Wu & Douglas W. Oard University of Maryland, College Park Ian Soboroff

An Exploratory Study of the W3C Mailing List Test Collection for Retrieval of Emails with Pro/Con Arguments. Yejun Wu & Douglas W. Oard, University of Maryland, College Park; Ian Soboroff, National Institute of Standards and Technology. July 27-28, 2006. CEAS, Mountain View, CA.


Presentation Transcript


  1. An Exploratory Study of the W3C Mailing List Test Collection for Retrieval of Emails with Pro/Con Arguments Yejun Wu & Douglas W. Oard University of Maryland, College Park Ian Soboroff National Institute of Standards and Technology July 27-28, 2006 CEAS, Mountain View, CA

  2. Outline • Build the test collection • Evaluate the test collection (intrinsic evaluation) • Use the test collection (extrinsic evaluation) • Next steps to improve the test collection

  3. W3C Mailing List Corpus. w3c.org web pages (NIST, 6/2004) for mailing lists such as html-tidy@w3c.org, semantic-web@w3c.org, w3c-news@w3c.org, … w3c-rdfcore-wg@w3c.org → Parsing → emails with unique DocIDs (lists-000-9978864, lists-001-0094883, … lists-003-9630221). 174,311 emails, 515MB

  4. IR Test Collection Design. Information seeking: • Documents • Information needs • Interactive process. Flow: Query Formulation → Automatic Search → Docs → Interactive Selection. Measure system: 2 variations (system, user)

  5. IR Test Collection Design. Freeze the user. Test collection: • Documents • Topic statements • Relevance judgments • Metric. Flow: Topic Statement (by Assessors) → Query Formulation → Automatic Search → Docs → Ranked Lists → Evaluation, which uses Relevance Judgments (by Assessors) and an Evaluation Metric (Mean Average Precision)

  6. DOCNO="lists-000-9978864" RECEIVED="Sat Mar 18 08:56:28 2000" ISORECEIVED="20000318135628" SENT="Fri, 10 Mar 2000 13:26:29 -0500 (EST)" ISOSENT="20000310182629" NAME="Kerri Golden" EMAIL="KGolden@Hynet.com" SUBJECT="RTF Word 2000 spec?" ID="C14D28BA032AD3118BE000104B87DDEC18A39A@solomon.hynet.com" EXPIRES="-1" TO="html-tidy@w3.org" We are trying to convert Word 2000 docs to XML. Our converter worked fine for W97 documents, but W2000 has a much different RTF format (tables especially). Does anyone know where I can get a hold of a spec for this version of RTF? thanks Kerri Golden kgolden@hynet.com
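
The header of the message above is a run of KEY="value" pairs, so it can be split mechanically. A minimal sketch, assuming the attribute layout shown on the slide; the function name `parse_header` is hypothetical, and the pattern also tolerates curly quotes.

```python
import re

# KEY="value" pairs; straight or curly quotes are both accepted.
ATTR = re.compile(r'([A-Z]+)=["“”]([^"“”]*)["“”]')

def parse_header(text):
    """Return a dict of the KEY="value" attributes found in text."""
    return {key: value for key, value in ATTR.findall(text)}

sample = ('DOCNO="lists-000-9978864" RECEIVED="Sat Mar 18 08:56:28 2000" '
          'SENT="Fri, 10 Mar 2000 13:26:29 -0500 (EST)" '
          'NAME="Kerri Golden" SUBJECT="RTF Word 2000 spec?" TO="html-tidy@w3.org"')
header = parse_header(sample)
print(header["DOCNO"], "|", header["SUBJECT"])
```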

  7. Topic Statement. TopicID: DS8. Query: html vs. xhtml. Narrative: A relevant message will compare the advantages/disadvantages of the two standards.

  8. Pool Top 50 Docs/Run for Relevance Judgments. 12 teams * 3 runs = 36 runs (Team1 Run1, Team2 Run3, … Team 12 Run2), each a ranked list of ranks 1 to 50, e.g. lists-000-9978864, lists-000-7643767, lists-011-6087388, lists-012-1019722, … lists-008-2365001. The pool averages 529 emails/topic. Researchers act as assessors. Each pooled email (lists-000-9978864, lists-000-7643767, … lists-008-2365001) receives two relevance judgments: Topic and Pro/Con.
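
Pooling as described on this slide amounts to taking the union of the top-ranked documents from every run for a topic. A minimal sketch under that reading; `build_pool` and the toy run contents are illustrative, with only the DocIDs borrowed from the slide.

```python
def build_pool(runs, depth=50):
    """Union of the top-`depth` DocIDs from each ranked run for one topic."""
    pool = set()
    for ranked_docids in runs.values():
        pool.update(ranked_docids[:depth])
    return pool

# Toy runs for a single topic (which DocID came from which run is made up).
runs = {
    "Team1-Run1": ["lists-000-9978864", "lists-000-7643767", "lists-011-6087388"],
    "Team2-Run3": ["lists-009-8065221", "lists-006-2570023", "lists-000-9978864"],
    "Team12-Run2": ["lists-000-7643767", "lists-012-2365001", "lists-004-0205442"],
}
pool = build_pool(runs, depth=2)  # depth=2 only to keep the toy example small
print(sorted(pool))
```

Because runs overlap, the pool is usually much smaller than runs × depth, which is consistent with an average of 529 emails per topic from 36 runs of 50.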

  9. Use of Test Collection: measure systems of ranked retrieval.

     System A, ranked list for topic DS1:

     | Rank | Doc# | Score |
     |---|---|---|
     | 1 | lists-000-9743321 | 0.95 |
     | 2 | lists-000-7456300 | 0.91 |
     | 3 | lists-001-3400432 | 0.88 |
     | 4 | lists-002-6590811 | 0.82 |
     | 5 | lists-004-5566320 | 0.80 |
     | 6 | lists-009-1349620 | 0.77 |
     | 7 | lists-011-0383209 | 0.63 |
     | 8 | lists-005-5201023 | 0.62 |
     | 9 | lists-007-5610095 | 0.55 |
     | 10 | lists-002-3204102 | 0.51 |

     Prec. at each relevant document (Rel?): 1.00, 1.00, 0.60, 0.57, 0.50. Avg. Prec. (AP): 0.73

     AP per topic:

     | Topic | System A AP | System B AP | A-B |
     |---|---|---|---|
     | DS1 | 0.73 | 0.50 | + |
     | DS2 | 0.45 | 0.38 | + |
     | DS3 | 0.56 | 0.36 | + |
     | DS4 | 0.00 | 0.09 | - |
     | DS5 | 0.13 | 0.10 | + |
     | DS6 | 1.00 | 0.83 | + |
     | DS7 | 0.24 | 0.28 | - |
     | DS8 | 0.47 | 0.20 | + |
     | DS9 | 0.53 | 0.41 | + |
     | DS10 | 0.23 | 0.30 | - |
     | MAP | 0.43 | 0.35 | N+=7, N-=3 |

     Difference is not significant (two-tailed, p<0.05).
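
The numbers on this slide can be reproduced in a few lines. A minimal sketch, with two assumptions: the relevant documents sit at ranks 1, 2, 5, 7 and 10 (the only pattern consistent with the listed precision values), and the significance check is an exact sign test on N+ and N-, which the slide does not name explicitly.

```python
from math import comb

def average_precision(rel_flags):
    """AP over a ranked list; rel_flags[i] is True if rank i+1 is relevant.
    Assumes every relevant document appears somewhere in the list."""
    hits, precisions = 0, []
    for rank, rel in enumerate(rel_flags, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def sign_test_two_tailed(n_plus, n_minus):
    """Exact two-tailed sign test p-value under a fair-coin null."""
    n, k = n_plus + n_minus, max(n_plus, n_minus)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Assumed relevance pattern for System A on topic DS1.
rel = [rank in {1, 2, 5, 7, 10} for rank in range(1, 11)]
ap_ds1 = average_precision(rel)

# Per-topic AP values as listed on the slide.
ap_a = [0.73, 0.45, 0.56, 0.00, 0.13, 1.00, 0.24, 0.47, 0.53, 0.23]
ap_b = [0.50, 0.38, 0.36, 0.09, 0.10, 0.83, 0.28, 0.20, 0.41, 0.30]
map_a = sum(ap_a) / len(ap_a)
map_b = sum(ap_b) / len(ap_b)
n_plus = sum(a > b for a, b in zip(ap_a, ap_b))
n_minus = sum(a < b for a, b in zip(ap_a, ap_b))
p_value = sign_test_two_tailed(n_plus, n_minus)
print(round(ap_ds1, 2), round(map_a, 2), n_plus, n_minus, p_value)
```

Under these assumptions the sign test gives p ≈ 0.34 for 7 wins against 3 losses, which agrees with the slide's statement that the difference is not significant at the 0.05 level.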

  10. Emerging Topic Types. Type/Category: Method, tip, solution • Example 1: Query: Annotea installation. Narrative: A relevant message will provide at least a tip on Annotea installation. • Example 2: Query: file upload http. Narrative: A relevant message will discuss methods of doing file uploads using http.

  11. Topic Type Analysis: find categories amenable to pro/con classification

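  12. Measuring Agreement. Two assessors judge the same documents (lists-000-9874732, lists-001-0683001, lists-003-0000221, lists-004-8436200, … lists-002-8833514). Overlap is computed from the regions b, a, c of the two assessors' relevant sets, where a is the part both share. Cohen's Kappa is a chance-corrected overlap computed from the 2×2 contingency table with cells a, b, c, d, marginals a+b, c+d, a+c, b+d, and a+b+c+d=N. Scale: Overlap runs from 0 (none) to 1 (perfect); Kappa runs from -1 (inverse) through 0 (chance) to 1 (perfect).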
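
Both agreement measures follow directly from the four cell counts. A minimal sketch using the standard definitions, assuming a = judged relevant by both assessors, b = by the first only, c = by the second only, d = by neither, and overlap = intersection over union; the slide's own formulas are not preserved, so this cell assignment is an assumption.

```python
def overlap(a, b, c):
    """Intersection over union of the two assessors' relevant sets."""
    union = a + b + c
    return a / union if union else 0.0

def cohens_kappa(a, b, c, d):
    """Chance-corrected agreement from a 2x2 contingency table."""
    n = a + b + c + d
    p_observed = (a + d) / n
    p_chance = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
    if p_chance == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_observed - p_chance) / (1 - p_chance)

# The three reference points on the slide's scale.
print(cohens_kappa(10, 0, 0, 10))   # perfect agreement
print(cohens_kappa(5, 5, 5, 5))     # chance-level agreement
print(cohens_kappa(0, 10, 10, 0))   # inverse agreement
```

Unlike overlap, kappa counts agreement on non-relevant documents (d) and subtracts the agreement expected from the marginals alone.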

  13. Assessor Agreement by Category (Overlap, Kappa). Correlation between overlap and kappa > 0.9, significant at p<0.01

  14. Effect of Disagreement on Ranking. Systems are ranked 1 to 5 under the Primary Judge and again under the Secondary Judge. Important difference in relevance judgment: Kendall's Tau = 1 - 3/5 = 0.4
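
One common form of Kendall's tau is 1 - 2·(discordant pairs)/(total pairs); with 5 systems there are 10 pairs, so 3 discordant pairs give 1 - 6/10 = 1 - 3/5. A minimal sketch under that assumption; the slide does not preserve the actual system orderings, so the two rankings below are hypothetical and chosen only to produce 3 discordant pairs.

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall's tau for two rankings of the same items (no ties).
    rank_a and rank_b map each item to its rank position."""
    pairs = list(combinations(rank_a, 2))
    discordant = sum(
        (rank_a[x] - rank_a[y]) * (rank_b[x] - rank_b[y]) < 0
        for x, y in pairs
    )
    return 1 - 2 * discordant / len(pairs)

# Hypothetical system rankings under each judge's relevance judgments.
primary = {"S1": 1, "S2": 2, "S3": 3, "S4": 4, "S5": 5}
secondary = {"S1": 3, "S2": 2, "S3": 1, "S4": 4, "S5": 5}
tau = kendall_tau(primary, secondary)
print(tau)
```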

  15. Outline • Intrinsic evaluation -- topic type analysis -- inter-assessor agreement analysis • Extrinsic evaluation -- use W3C to evaluate a topic & pro/con system

  16. Experiment Design: Round Robin, 48-fold cross-validation. 48 training topics, each with Pro/Con and Non-Pro/Con documents → Pro/Con feature → Top N terms (N=100). 1 evaluation topic: Topic + Top N terms → Query → INQUERY search → Ranked List → Evaluation against the query relevance set (relevance judgments) → MAP
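
The round-robin design holds out one topic for evaluation and trains on the rest, rotating through all topics. A minimal sketch of the split logic only; `leave_one_topic_out` is a hypothetical helper, and the five topic IDs stand in for the full topic set.

```python
def leave_one_topic_out(topic_ids):
    """Yield (training_topics, evaluation_topic) pairs, one per topic."""
    for held_out in topic_ids:
        training = [t for t in topic_ids if t != held_out]
        yield training, held_out

topics = [f"DS{i}" for i in range(1, 6)]  # small illustrative set
splits = list(leave_one_topic_out(topics))
for training, held_out in splits:
    # Here the pro/con terms would be selected from `training` only,
    # then appended to the query for `held_out` before searching.
    print(held_out, "<-", training)
```

Keeping the evaluation topic out of feature selection is what lets the resulting MAP be read as performance on an unseen topic.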

  17. Compare Two Systems

  18. All Topics

  19. Topic Type A

  20. Topic Type B

  21. Topic Type C

  22. Topic Type D

  23. Topic Type E

  24. Effects of Topic and Topic Types. Overlap, Kappa • Two-way ANOVA • Topic difficulty levels: 27 improved, 16 hurt, 7 unused • Topic types: A, B, C, D, E, F

  25. Conclusion – Test Collection Evaluation • Test collection generally useful • Important differences in judgments • Relevance judgments could be improved • Topic type is a factor in agreement on pro/con relevance • Categories less of a pro/con nature: -- B (method, tip, solution): does not lead to pro/con -- C (discuss an issue): vague • Rocchio-style system: 4.2% improvement in MAP • Major improvements in A and E • Pro/con relevance judgments useful.

  26. Future Work – Better Test Collection Design • Balance topic types: -- half in A -- F (reason, design rationale): 1 topic • Study information needs and search process • Improve the process, e.g., better defining topics for pro/con • Use within-category topics for training -- examine the quality of training data by category • Other classification methods: SVM, Naïve Bayes • Separate models for detecting pros and cons • THANKS!

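  27. Pro/Con Feature Selection

      |  | Topic1 | Topic2 | … | Topic48 |
      |---|---|---|---|---|
      | Pro/Con docs | 20 | 15 | … | 8 |
      | Non Pro/Con docs | 30 | 18 | … | 5 |
      | Topic weight | log(20+1) | log(15+1) | … | log(5+1) |
      | "advantage" TF, Pro/Con docs | 38+1 | 40+1 | … | 30+1 |
      | "advantage" TF, Non Pro/Con docs | 10+1 | 10+1 | … | 28+1 |

      Weight of "advantage":

      $$\log 21 \cdot \log\frac{39/20}{11/30} + \log 16 \cdot \log\frac{41/15}{11/18} + \dots + \log 6 \cdot \log\frac{31/8}{29/5}$$

      The same weight is computed for the other terms ("strength", …, "Microsoft", …, "Html", …, "opinion", …), and the terms are then ranked: 1 advantage, 2 strength, 3 weakness, 4 hate, 5 opinion, … 100 wow.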
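
The arithmetic for the three topics visible on this slide can be checked directly. A minimal sketch, assuming natural logarithms (the slide does not state a base) and using only the numbers shown; the elided topics are not filled in, so the sum is partial and is not the final weight of "advantage".

```python
from math import log

# One row per visible topic:
# (count inside the topic weight, TF in Pro/Con docs, number of Pro/Con docs,
#  TF in Non Pro/Con docs, number of Non Pro/Con docs)
visible_topics = [
    (20, 38, 20, 10, 30),  # Topic1:  log21 * log((39/20) / (11/30))
    (15, 40, 15, 10, 18),  # Topic2:  log16 * log((41/15) / (11/18))
    (5, 30, 8, 28, 5),     # Topic48: log6  * log((31/8)  / (29/5))
]

def topic_contribution(weight_count, tf_pos, n_pos, tf_neg, n_neg):
    """Topic weight times the log ratio of add-one smoothed per-document TFs."""
    topic_weight = log(weight_count + 1)
    ratio = ((tf_pos + 1) / n_pos) / ((tf_neg + 1) / n_neg)
    return topic_weight * log(ratio)

contributions = [topic_contribution(*row) for row in visible_topics]
partial_weight = sum(contributions)  # covers the three visible topics only
print(contributions, partial_weight)
```

Topics where the term is relatively more frequent in Pro/Con documents contribute positively; Topic48, where it is relatively more frequent in Non Pro/Con documents, contributes negatively.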

  28. Feature Selection • Pro/con feature vector term weight: log odds ratio • Pos: Pro/Con relevant documents • Neg: Non Pro/Con relevant documents
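
The formula itself is not preserved on this slide. Going by the worked example on slide 27, the term weight plausibly has the following form; this is a reconstruction under that assumption, and the symbol names are chosen here for illustration.

```latex
w(t) \;=\; \sum_{i} \lambda_i \,
  \log \frac{\bigl(\mathrm{TF}^{\mathrm{Pos}}_{i}(t) + 1\bigr) / |\mathrm{Pos}_i|}
            {\bigl(\mathrm{TF}^{\mathrm{Neg}}_{i}(t) + 1\bigr) / |\mathrm{Neg}_i|}
```

Here the sum runs over training topics i, λ_i is the topic weight shown on slide 27 (for example log(20+1) for Topic1), TF is the frequency of term t in the Pos or Neg documents of topic i, and |Pos_i| and |Neg_i| are the corresponding document counts.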

  29. Rocchio-style Implementation • Appropriate for topic and pro/con retrieval • Baseline classifier to test the utility of the test collection • Expanded query: Q0: initial query; Q1: expanded query • Ri: vectors from positive docs • Si: vectors from negative docs • Two parameters weight the positive and negative document vectors
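
The expanded-query formula is not preserved on this slide. A minimal sketch assuming the textbook Rocchio form, Q1 = Q0 + β·mean(Ri) - γ·mean(Si); the names `beta` and `gamma`, their values, the function name, and the toy vectors are all assumptions rather than the authors' settings.

```python
from collections import defaultdict

def rocchio_expand(q0, positives, negatives, beta=0.75, gamma=0.15):
    """Return Q1 = Q0 + beta * mean(positives) - gamma * mean(negatives).
    All vectors are {term: weight} dicts; beta and gamma are illustrative."""
    q1 = defaultdict(float, q0)
    for docs, sign, coeff in ((positives, 1, beta), (negatives, -1, gamma)):
        for vec in docs:
            for term, weight in vec.items():
                q1[term] += sign * coeff * weight / len(docs)
    return dict(q1)

q0 = {"html": 1.0, "xhtml": 1.0}                        # initial query
positives = [{"advantage": 2.0, "html": 1.0}, {"advantage": 1.0}]
negatives = [{"install": 2.0}]
q1 = rocchio_expand(q0, positives, negatives, beta=0.5, gamma=0.25)
print(q1)
```

Terms frequent in positive documents gain weight and terms frequent in negative documents lose it, which moves the query toward the positive class.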
