An Exploratory Study of the W3C Mailing List Test Collection for Retrieval of Emails with Pro/Con Arguments
Yejun Wu & Douglas W. Oard, University of Maryland, College Park
Ian Soboroff, National Institute of Standards and Technology
July 27-28, 2006, CEAS, Mountain View, CA
2 Outline
• Build the test collection
• Evaluate the test collection (intrinsic evaluation)
• Use the test collection (extrinsic evaluation)
• Next steps to improve the test collection
3 W3C Mailing List Corpus
• Source: w3c.org web pages, NIST crawl (6/2004)
• Mailing lists: html-tidy@w3c.org, semantic-web@w3c.org, w3c-news@w3c.org, …, w3c-rdfcore-wg@w3c.org
• Parsing assigns unique DocIDs: lists-000-9978864, lists-001-0094883, …, lists-003-9630221
• 174,311 emails, 515MB
4 IR Test Collection Design
Information seeking:
• Documents
• Information needs
• Interactive process
Flow: Query Formulation → Automatic Search → Docs → Interactive Selection
Measuring a system: 2 sources of variation (system, user)
5 IR Test Collection Design
Freeze the user.
Test collection:
• Documents
• Topic statements
• Relevance judgments
• Metric
Flow: Topic Statement (by assessors) → Query Formulation → Automatic Search → Docs → Ranked Lists → Evaluation
Evaluation inputs: Relevance Judgments (by assessors), Evaluation Metric (Mean Average Precision)
6 Example email record
DOCNO="lists-000-9978864"
RECEIVED="Sat Mar 18 08:56:28 2000"
ISORECEIVED="20000318135628"
SENT="Fri, 10 Mar 2000 13:26:29 -0500 (EST)"
ISOSENT="20000310182629"
NAME="Kerri Golden"
EMAIL="KGolden@Hynet.com"
SUBJECT="RTF Word 2000 spec?"
ID="C14D28BA032AD3118BE000104B87DDEC18A39A@solomon.hynet.com"
EXPIRES="-1"
TO="html-tidy@w3.org"

We are trying to convert Word 2000 docs to XML. Our converter worked fine for W97 documents, but W2000 has a much different RTF format (tables especially). Does anyone know where I can get a hold of a spec for this version of RTF?

thanks
Kerri Golden
kgolden@hynet.com
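The record above stores its metadata as KEY="value" pairs. A minimal sketch of reading those pairs, assuming that layout holds for every record; `parse_header` and the quote normalization are illustrative helpers, not part of the collection's tooling:

```python
import re

# Matches KEY="value" pairs such as DOCNO="lists-000-9978864".
ATTR = re.compile(r'(\w+)="([^"]*)"')

# Map curly double quotes to straight ones (assumption: some records
# carry typographic quotes introduced by conversion).
QUOTES = {0x201C: '"', 0x201D: '"'}


def parse_header(text):
    """Return a dict of the KEY="value" attributes found in text."""
    return dict(ATTR.findall(text.translate(QUOTES)))


record = (
    'DOCNO="lists-000-9978864\u201d '
    'SUBJECT="RTF Word 2000 spec?" '
    'TO=\u201chtml-tidy@w3.org\u201d'
)
fields = parse_header(record)
print(fields["DOCNO"], "|", fields["SUBJECT"], "|", fields["TO"])
```

The message body is whatever follows the last attribute, so a fuller parser would split header and body first.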
7 Topic Statement
TopicID: DS8
Query: html vs. xhtml
Narrative: A relevant message will compare the advantages/disadvantages of the two standards.
8 Pool Top 50 Docs/Run for Relevance Judgments
• 12 teams × 3 runs = 36 runs (e.g. Team1 Run1, Team2 Run3, …, Team12 Run2)
• Each run contributes its docs at ranks 1, 2, 3, 4, …, 50 (e.g. lists-000-9978864, lists-000-7643767, lists-011-6087388, lists-012-1019722, …)
• The same doc can be returned by several runs (e.g. lists-000-9978864, lists-000-7643767)
• Average 529 emails/topic
• Researchers as assessors
• Relevance judgments carry two labels per doc, Topic and Pro/Con:
  lists-000-9978864 Topic: , Pro/Con:
  lists-000-7643767 Topic: , Pro/Con:
  …
  lists-008-2365001 Topic: , Pro/Con:
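Pooling as described above can be sketched as a union of each run's top-ranked docs. This is a toy illustration: the run names and the depth of 2 are made up for brevity (the slide uses depth 50), and `build_pool` is a hypothetical helper.

```python
def build_pool(runs, depth=50):
    """Union of the top `depth` docs from every run; duplicates collapse."""
    pool = set()
    for ranked_docs in runs.values():
        pool.update(ranked_docs[:depth])
    return pool


# Toy runs reusing DocIDs from the slide; the rankings are illustrative.
runs = {
    "Team1-Run1": ["lists-000-9978864", "lists-000-7643767", "lists-011-6087388"],
    "Team2-Run3": ["lists-009-8065221", "lists-000-9978864", "lists-006-2570023"],
}
pool = build_pool(runs, depth=2)
print(sorted(pool))
```

Because docs shared between runs are judged once, the pool is usually much smaller than runs × depth, which is consistent with an average of 529 emails per topic from 36 runs.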
9 Use of Test Collection
Measure systems of ranked retrieval

System A, Topic DS1:
Rank  Doc#               Score  Rel?  Prec.
1     lists-000-9743321  0.95   yes   1.00
2     lists-000-7456300  0.91   yes   1.00
3     lists-001-3400432  0.88   no
4     lists-002-6590811  0.82   no
5     lists-004-5566320  0.80   yes   0.60
6     lists-009-1349620  0.77   no
7     lists-011-0383209  0.63   yes   0.57
8     lists-005-5201023  0.62   no
9     lists-007-5610095  0.55   no
10    lists-002-3204102  0.51   yes   0.50
Avg. Prec. (AP): 0.73

Per-topic AP, System A vs. System B:
Topic  A AP  B AP  A-B
DS1    0.73  0.50  +
DS2    0.45  0.38  +
DS3    0.56  0.36  +
DS4    0.00  0.09  -
DS5    0.13  0.10  +
DS6    1.00  0.83  +
DS7    0.24  0.28  -
DS8    0.47  0.20  +
DS9    0.53  0.41  +
DS10   0.23  0.30  -
MAP    0.43  0.35  N+=7, N-=3

Difference is not significant at the 0.05 level (two-tailed).
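The numbers on this slide can be reproduced with a short sketch. Two assumptions are labeled in the code: AP is normalized by the number of relevant docs retrieved (the slide's 0.73 matches this when the relevant docs sit at ranks 1, 2, 5, 7 and 10), and the significance test is taken to be a two-tailed sign test, which the N+/N- tallies suggest but the slide does not name.

```python
from math import comb


def average_precision(rels):
    """Mean of precision values at each relevant rank.

    Assumption: normalizes by relevant docs retrieved, not by all
    relevant docs in the collection.
    """
    hits, total = 0, 0.0
    for rank, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def sign_test(a, b):
    """Two-tailed sign test on paired scores (assumed test; ties dropped)."""
    plus = sum(x > y for x, y in zip(a, b))
    minus = sum(x < y for x, y in zip(a, b))
    n, k = plus + minus, max(plus, minus)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return plus, minus, min(1.0, 2 * tail)


# System A, topic DS1: relevant at ranks 1, 2, 5, 7, 10.
ap_ds1 = average_precision([1, 1, 0, 0, 1, 0, 1, 0, 0, 1])

ap_a = [0.73, 0.45, 0.56, 0.00, 0.13, 1.00, 0.24, 0.47, 0.53, 0.23]
ap_b = [0.50, 0.38, 0.36, 0.09, 0.10, 0.83, 0.28, 0.20, 0.41, 0.30]
map_a = sum(ap_a) / len(ap_a)
map_b = sum(ap_b) / len(ap_b)
n_plus, n_minus, p_value = sign_test(ap_a, ap_b)
print(ap_ds1, map_a, map_b, n_plus, n_minus, p_value)
```

With 7 wins and 3 losses the two-tailed p-value is well above 0.05, in line with the slide's conclusion that the difference is not significant.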
10 Emerging Topic Types
Type/Category: Method, tip, solution
• Example 1:
  Query: Annotea installation
  Narrative: A relevant message will provide at least a tip on Annotea installation.
• Example 2:
  Query: file upload http
  Narrative: A relevant message will discuss methods of doing file uploads using http.
11 Topic Type Analysis
Find categories amenable to pro/con classification
12 Measuring Agreement
• Two assessors judge the same docs: lists-000-9874732, lists-001-0683001, lists-003-0000221, lists-004-8436200, …, lists-002-8833514, …
• 2×2 contingency table: cells a, b, c, d; row totals a+b, c+d; column totals a+c, b+d; a+b+c+d = N
• Overlap = a / (a+b+c)
• Cohen's Kappa: chance-corrected overlap
• Scale: Overlap runs from 0 (none) to 1 (perfect); Kappa runs from -1 (inverse) through 0 (chance) to 1 (perfect)
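A small sketch of both agreement measures, using the textbook definition of Cohen's kappa. The cell roles are an assumption: a = both assessors say relevant, b and c = only one does, d = both say non-relevant; the function names are illustrative.

```python
def overlap(a, b, c):
    """Docs both assessors marked relevant over docs either marked relevant."""
    return a / (a + b + c)


def cohens_kappa(a, b, c, d):
    """Cohen's kappa for a 2x2 table (standard definition)."""
    n = a + b + c + d
    p_observed = (a + d) / n
    # Chance agreement from the row and column totals.
    p_expected = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
    return (p_observed - p_expected) / (1 - p_expected)


perfect = cohens_kappa(5, 0, 0, 5)
chance = cohens_kappa(25, 25, 25, 25)
inverse = cohens_kappa(0, 5, 5, 0)
print(perfect, chance, inverse, overlap(5, 0, 0), overlap(0, 5, 5))
```

The three calls hit the endpoints of the scale on the slide: perfect agreement, chance-level agreement, and complete disagreement.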
13 Assessor Agreement by Category
[Chart: Overlap and Kappa by topic category]
Correlation between overlap and kappa > 0.9, significant at p < 0.01
14 Effect of Disagreement on Ranking
[Figure: rankings 1 to 5 under the Primary Judge vs. rankings 1 to 5 under the Secondary Judge]
Important difference in relevance judgment
Kendall's Tau = 1 - 3/5 = 0.4
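One common way to get a value like 1 - 3/5 is tau = 1 - 2·(discordant pairs)/(total pairs): with 5 items there are 10 pairs, so 3 discordant pairs give 0.4. The sketch below assumes that definition; the secondary ranking is hypothetical, chosen only because it has exactly 3 discordant pairs, and is not taken from the slide.

```python
from itertools import combinations


def kendall_tau(rank_a, rank_b):
    """Kendall's tau for two orderings of the same items (no ties)."""
    position_b = {item: i for i, item in enumerate(rank_b)}
    pairs = list(combinations(rank_a, 2))
    # A pair is discordant when rank_b reverses the order given by rank_a.
    discordant = sum(1 for x, y in pairs if position_b[x] > position_b[y])
    return 1 - 2 * discordant / len(pairs)


primary = [1, 2, 3, 4, 5]
secondary = [3, 2, 1, 4, 5]  # hypothetical ordering with 3 swapped pairs
tau = kendall_tau(primary, secondary)
print(tau)
```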
15 Outline
• Intrinsic evaluation
  -- topic type analysis
  -- inter-assessor agreement analysis
• Extrinsic evaluation
  -- use W3C to evaluate a topic & pro/con system
16 Experiment Design: Round Robin
• 48-fold cross-validation: 48 training topics, 1 evaluation topic
• Training topics: Pro/Con vs. Non-Pro/Con docs of each topic → Pro/Con feature → top N terms (N=100)
• Evaluation topic: Topic + Pro/Con feature → Query → INQUERY search → Ranked List
• Evaluation: MAP against the query relevance set (relevance judgments)
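The round-robin design holds out one topic at a time and trains on the rest. A minimal sketch of that split, using four made-up topic IDs rather than the collection's topics; `round_robin` is an illustrative name:

```python
def round_robin(topics):
    """Yield (training_topics, evaluation_topic), holding out each topic once."""
    for i, held_out in enumerate(topics):
        yield topics[:i] + topics[i + 1:], held_out


# Hypothetical topic IDs, for illustration only.
topics = ["T1", "T2", "T3", "T4"]
folds = list(round_robin(topics))
for training, evaluation in folds:
    print(evaluation, "<-", training)
```

Each fold would learn the pro/con feature terms from the training topics only, so the evaluation topic's judgments never leak into its own expanded query.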
24 Effects of Topic and Topic Types
• Two-way ANOVA
• Topic difficulty levels: 27 improved, 16 hurt, 7 unused
• Topic types: A, B, C, D, E, F
25 Conclusion: Test Collection Evaluation
• Test collection is generally useful
• Important differences in judgments
• Relevance judgments could be improved
• Topic type is a factor in agreement on pro/con relevance
• Categories less of a pro/con nature:
  -- B (method, tip, solution): does not lead to pro/con
  -- C (discuss an issue): vague
• Rocchio-style system: 4.2% improvement in MAP
• Major improvements in A and E
• Pro/con relevance judgments are useful
26 Future Work: Better Test Collection Design
• Balance topic types:
  -- half in A
  -- F (reason, design rationale): 1 topic
• Study information needs and search process
• Improve the process
  -- e.g., better defining topics for pro/con
• Use within-category topics for training
  -- examine the quality of training data by category
• Other classification methods: SVM, Naïve Bayes
• Separate models for detecting pros and cons
• THANKS!
Pro/Con Feature Selection

Topic    Pro/Con docs  Non Pro/Con docs  Topic weight  TF("advantage") in Pro/Con docs  TF("advantage") in Non Pro/Con docs
Topic1   20            30                log(20+1)     38+1                             10+1
Topic2   15            18                log(15+1)     40+1                             10+1
…
Topic48  8             5                 log(5+1)      30+1                             28+1

Weight of "advantage":
log21 * log[(39/20) / (11/30)] + log16 * log[(41/15) / (11/18)] + … + log6 * log[(31/8) / (29/5)]

The same weight is computed for every other candidate term ("strength", …, "Microsoft", …, "Html", …, "opinion", …).

Top 100 terms by weight: 1 advantage, 2 strength, 3 weakness, 4 hate, 5 opinion, …, 100 wow
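28 Feature Selection
• Pro/con feature vector
• Term weight: log odds ratio, summed over the training topics i with topic weight w_i:

$$ w(t) = \sum_{i} w_i \cdot \log \frac{(TF_{Pos_i}(t) + 1) / |Pos_i|}{(TF_{Neg_i}(t) + 1) / |Neg_i|} $$

• Pos: Pro/Con relevant documents
• Neg: Non Pro/Con relevant documents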
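The "advantage" example from the feature selection slide can be worked through in a few lines. This sketch covers only the Topic1 and Topic2 terms, takes the topic weights exactly as the slide states them (log(20+1) and log(15+1)), and assumes natural logarithms since the slide does not give a base; the dictionary keys and `term_weight` are illustrative names.

```python
import math


def term_weight(per_topic):
    """Sum of topic_weight * log odds ratio over the given topics."""
    total = 0.0
    for t in per_topic:
        pos_rate = (t["tf_pos"] + 1) / t["n_pos"]  # smoothed TF per Pro/Con doc
        neg_rate = (t["tf_neg"] + 1) / t["n_neg"]  # smoothed TF per Non Pro/Con doc
        total += t["topic_weight"] * math.log(pos_rate / neg_rate)
    return total


# Counts for "advantage" as shown on the slide (Topic1, Topic2 only).
advantage = [
    {"n_pos": 20, "n_neg": 30, "tf_pos": 38, "tf_neg": 10,
     "topic_weight": math.log(20 + 1)},
    {"n_pos": 15, "n_neg": 18, "tf_pos": 40, "tf_neg": 10,
     "topic_weight": math.log(15 + 1)},
]
weight = term_weight(advantage)
print(weight)
```

A term that is more frequent per document in the Pro/Con set gets a positive contribution from that topic, and the add-one smoothing keeps the ratio defined when a term never occurs in one of the sets.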
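29 Rocchio-style Implementation
• Appropriate for topic and pro/con retrieval
• Baseline classifier to test the utility of the test collection
• Expanded query (standard Rocchio form):

$$ Q_1 = Q_0 + \beta \cdot \frac{1}{n_1} \sum_{i=1}^{n_1} R_i - \gamma \cdot \frac{1}{n_2} \sum_{i=1}^{n_2} S_i $$

• Q0: initial query; Q1: expanded query
• Ri: vectors from positive docs
• Si: vectors from negative docs
• β, γ: parameters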
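A minimal sketch of a Rocchio-style query expansion over sparse term vectors, assuming the standard form in which the mean positive vector is added and the mean negative vector is subtracted. The parameter names `beta` and `gamma`, their defaults, and the toy vectors are assumptions for illustration, not values from the study.

```python
from collections import defaultdict


def centroid(vectors):
    """Mean of sparse term vectors given as {term: weight} dicts."""
    mean = defaultdict(float)
    for vector in vectors:
        for term, weight in vector.items():
            mean[term] += weight / len(vectors)
    return mean


def rocchio(q0, positives, negatives, beta=1.0, gamma=1.0):
    """Q1 = Q0 + beta * mean(positives) - gamma * mean(negatives)."""
    q1 = defaultdict(float, q0)
    for term, weight in centroid(positives).items():
        q1[term] += beta * weight
    for term, weight in centroid(negatives).items():
        q1[term] -= gamma * weight
    return dict(q1)


# Toy vectors; terms and weights are made up.
q0 = {"html": 1.0, "xhtml": 1.0}
positives = [{"advantage": 2.0}, {"advantage": 1.0, "html": 1.0}]
negatives = [{"install": 2.0}]
q1 = rocchio(q0, positives, negatives, beta=0.5, gamma=0.25)
print(q1)
```

Some implementations clip negative term weights to zero before searching; this sketch leaves them in place so the effect of the negative docs stays visible.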