A Framework for Human Evaluation of Search Engine Relevance Kamal Ali Chi Chao Chang Yun-fang Juan
Outline • The Problem • Framework of Test Types • Results • Expt. 1: Set-level versus Item-level • Expt. 2: Perceived Relevance versus Landing-Page Relevance • Expt. 3: Editorial Judges versus Panelist Judges • Related Work • Contributions
The Problem • Classical IR: Cranfield technique • Search Engine Experience • Results + UI + Speed + Advertising, spell suggestions ... • Search Engine Relevance • Users see summaries (abstracts), not documents • Heterogeneity of results: images, video, music, ... • User population more diverse • Information needs more diverse (commercial, entertainment) • Human Evaluation costs are high • Holistic alternative: judge set of results
Framework of test types • Dimension 1: Query Category • Images, News, Product Research, Navigational, … • Dimension 2: Modality • Implicit (click behavior) versus Explicit (judgments) • Dimension 3: Judge Class • Live users versus Panelists versus Domain experts • Dimension 4: Granularity • Set-level versus item-level • Dimension 5: Depth • Perceived relevance versus Landing-Page (“Actual”) relevance • Dimension 6: Relativity • Absolute judgments versus Relative judgments
Explored in this work • Dimension 1: Query Category — Random (mixed), Adult, How-to, Map, Weather, Biographical, Tourism, Sports, … • Dimension 2: Modality • Dimension 3: Judge Class • Dimension 4: Granularity • Dimension 5: Depth • Dimension 6: Relativity
Experimental Methodology • Random set of queries taken from Yahoo! Web logs • Per-set: • Judge sees 10 items from a search engine • Gives one judgment for entire set • Order is as supplied by search engine • Per-item • Judge sees one item (document) at a time • Gives one judgment per item • Order is scrambled • Side-by-side • Judge sees 2 sets; sides are scrambled • Within each set, order is preserved
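A minimal sketch of the three presentation conditions described above, under illustrative assumptions (the result identifiers, seed, and helper names are hypothetical, not from the paper):

```python
import random

def per_set_view(results):
    """Per-set: judge sees all 10 items in the engine-supplied order."""
    return list(results)

def per_item_views(results, seed=0):
    """Per-item: judge sees one item at a time, in scrambled order."""
    items = list(results)
    random.Random(seed).shuffle(items)
    return items

def side_by_side_view(results_a, results_b, seed=0):
    """Side-by-side: two sets, sides scrambled, within-set order preserved."""
    sides = [("A", list(results_a)), ("B", list(results_b))]
    random.Random(seed).shuffle(sides)
    return sides

engine1 = [f"e1_result_{i}" for i in range(1, 11)]
engine2 = [f"e2_result_{i}" for i in range(1, 11)]
print(per_set_view(engine1)[:3])
print(per_item_views(engine1)[:3])
print([label for label, _ in side_by_side_view(engine1, engine2)])
```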
Expt. 1: Effect of Granularity: Set-level versus Item-level • Domain/expert editors • Self-selection of queries • Methodology: • Value 1: Per-set judgments, given on a 3-point scale (1 = best, 3 = worst), producing a discrete random variable • Value 2: Item-level judgments; the 10 item-level judgments are rolled up into a single number (using a DCG roll-up function) • DCG values discretized into 3 bins/levels, producing the 2nd discrete random variable • Look at the resulting 3 × 3 contingency matrix • Compute correlation between these variables
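A minimal sketch of this roll-up, assuming a standard log2-discounted DCG, quantile-based binning into 3 levels, and Spearman correlation via SciPy; the paper's exact roll-up function and bin boundaries are not specified here, and the data values are illustrative only:

```python
import numpy as np
from scipy.stats import spearmanr

def dcg(relevances):
    """One common DCG roll-up: rel_i / log2(i + 1), ranks starting at 1."""
    ranks = np.arange(1, len(relevances) + 1)
    return float(np.sum(np.asarray(relevances, dtype=float) / np.log2(ranks + 1)))

def discretize(values, bins=3):
    """Discretize continuous DCG scores into `bins` levels by quantile."""
    edges = np.quantile(values, np.linspace(0, 1, bins + 1)[1:-1])
    return np.digitize(values, edges)  # levels 0 .. bins-1

# Hypothetical data: one set-level grade (1 = best .. 3 = worst) and ten
# item-level relevance grades per query.
set_level = [1, 2, 3, 1, 2]
item_level = [
    [2, 2, 1, 0, 1, 0, 0, 1, 0, 0],
    [1, 0, 1, 0, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
    [2, 2, 2, 1, 1, 1, 0, 0, 1, 0],
    [1, 1, 0, 0, 1, 0, 0, 0, 0, 0],
]

dcg_scores = [dcg(items) for items in item_level]
dcg_bins = discretize(dcg_scores)          # second discrete random variable
# Set-level uses 1 = best, so strong agreement shows up as a negative rho here.
rho, p = spearmanr(set_level, dcg_bins)
print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")
```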
Expt. 1: Effect of Granularity: Set-level versus Item-level: Images • Domain 1: Search at Image site or Image tab • 299 image queries; 2–3 judges per query; 6,856 judgments • 198 queries in common between the set-level and item-level tests • 20 images (in a 4 × 5 matrix) shown per query
Expt. 1: Effect of Granularity: Set-level versus Item-level: Images • Interpretation of Results • Spearman correlation is a middling 0.54 • Look at outlier queries • “Hollow Man” – high set-level score, low item-level score • Most items irrelevant – this explains the low item-level DCG; recall that judges saw images one at a time, in scrambled order • Set-level: the eye can quickly (in parallel) spot a relevant image, and is less sensitive to irrelevant images • The ranking function was poor, leading to a low DCG score – unusual, since set-level judging normally picks out poor ranking
Expt. 2: Effect of Depth: Perceived Relevance vs. Landing Page • Fix granularity at item level • Perceived relevance: • Title, abstract (summary), and URL shown (T.A.U.) • Judgment 1: Relevance of title • Judgment 2: Relevance of abstract • Click on URL to reach the landing page • Judgment 3: Relevance of landing page
Expt. 2: Effect of Depth: Perceived vs. “Actual”: Advertisements • Domain 1: Search at Newssite • Created a compound random variable for Perceived Relevance: AND’d Title Relevance and Abstract Relevance • Correlated it with Landing Page (“Actual”) Relevance • Higher correlation: news titles/abstracts are carefully constructed
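A minimal sketch of the compound perceived-relevance variable, assuming binary per-item judgments and a Pearson correlation; the paper's actual judgment scales and correlation measure may differ, and the data values are hypothetical:

```python
import numpy as np

# Hypothetical per-item judgments (1 = relevant, 0 = not relevant).
title_rel    = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])
abstract_rel = np.array([1, 0, 0, 1, 0, 1, 1, 0, 1, 0])
landing_rel  = np.array([1, 0, 0, 1, 0, 1, 0, 0, 1, 1])

# Compound perceived-relevance variable: title AND abstract both relevant.
perceived = title_rel & abstract_rel

# Correlate perceived relevance with landing-page ("actual") relevance.
corr = np.corrcoef(perceived, landing_rel)[0, 1]
print(f"perceived vs. landing-page correlation = {corr:.2f}")
```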
Expt. 3: Effect of Judge Class: Editors versus Panelists • Methodology: • 1000 randomly selected queries (frequency-biased sampling) • 40 judges, a few hundred panelists • Panelist • Recruitment: long-standing panel • Reward: gift certificate • Remove panelists that completed the test too quickly or missed sentinel questions • Query = *, Modality = Explicit, Granularity = Set-level, Depth = Mixed, Relativity = Relative • Questions: • 1. Does judge class affect the overall conclusion on which search engine is better? • 2. Are there particular types of queries for which significant differences exist?
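A minimal sketch of the panelist filtering step (too-fast completion, missed sentinel questions), with hypothetical thresholds and field names; the paper does not give the exact cutoffs:

```python
from dataclasses import dataclass

@dataclass
class PanelistSession:
    panelist_id: str
    duration_sec: float
    sentinel_correct: int   # sentinel (trap) questions answered correctly
    sentinel_total: int

# Hypothetical thresholds -- the actual cutoffs are not specified in the paper.
MIN_DURATION_SEC = 120.0
MIN_SENTINEL_ACCURACY = 1.0   # require every sentinel question correct

def keep_panelist(s: PanelistSession) -> bool:
    """Drop sessions completed too quickly or with missed sentinel questions."""
    too_fast = s.duration_sec < MIN_DURATION_SEC
    accuracy = s.sentinel_correct / s.sentinel_total if s.sentinel_total else 1.0
    return not too_fast and accuracy >= MIN_SENTINEL_ACCURACY

sessions = [
    PanelistSession("p01", 540.0, 3, 3),
    PanelistSession("p02",  45.0, 3, 3),   # too fast -> removed
    PanelistSession("p03", 600.0, 1, 3),   # missed sentinels -> removed
]
kept = [s.panelist_id for s in sessions if keep_panelist(s)]
print(kept)  # ['p01']
```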
Expt. 3: Effect of Judge Class: Editors versus Panelists • Column percentages, p( P | E ): p(P = eng1) = .28, but p(P = eng1 | E = eng1) = .35 • Lift = .35 / .28 = 1.25 … a modest 25% lift
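A minimal sketch of the lift computation from an editor-vs-panelist contingency table; the counts below are illustrative, not the paper's data:

```python
import numpy as np

# Hypothetical 3x3 contingency counts
# (rows: editor prefers eng1 / no preference / eng2,
#  cols: panelist prefers eng1 / no preference / eng2).
counts = np.array([
    [35, 45, 20],
    [25, 55, 20],
    [22, 50, 28],
])

total = counts.sum()
p_p_eng1 = counts[:, 0].sum() / total                    # p(P = eng1)
p_p_eng1_given_e_eng1 = counts[0, 0] / counts[0].sum()   # p(P = eng1 | E = eng1)
lift = p_p_eng1_given_e_eng1 / p_p_eng1
print(f"p(P=eng1) = {p_p_eng1:.2f}, "
      f"p(P=eng1 | E=eng1) = {p_p_eng1_given_e_eng1:.2f}, lift = {lift:.2f}")
```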
Expt. 3: Effect of Judge Class: Editors versus Panelists • Editor marginal distribution: (.33, .33, .33) • Panelists are less likely to discern a difference: (.25, .50, .25) • When panelists favor Eng1, editors are more likely to favor Eng1 than when panelists favor Eng2.
Expt. 3: Effect of Judge Class: Correlation, Association • Linear model: r² = 0.03 • 3 × 3 categorical model: φ = 0.29 • Test for non-zero association: χ² = 16.3, significant at the 99.5% level
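A minimal sketch of the association tests on a 3 × 3 contingency table, assuming the slide's abbreviated statistics are φ and χ² and taking φ = sqrt(χ²/N); the counts are illustrative only:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 3x3 editor-vs-panelist contingency counts (illustrative only).
counts = np.array([
    [35, 45, 20],
    [25, 55, 20],
    [22, 50, 28],
])

chi2, p_value, dof, expected = chi2_contingency(counts)
n = counts.sum()
phi = np.sqrt(chi2 / n)                                     # phi coefficient
cramers_v = np.sqrt(chi2 / (n * (min(counts.shape) - 1)))   # normalized variant
print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p_value:.4f}")
print(f"phi = {phi:.2f}, Cramér's V = {cramers_v:.2f}")
```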
Expt. 3: Effect of Judge Class:Qualitative Analysis • Editors’ top feedback: • Ranking not as good (precision) • Both equally good (precision) • Relevant sites missing (recall) • Perfect site missing (recall) • Panelists’ top feedback: • Both equally good (precision) • Too general (precision) • Both equally bad (precision) • Ranking not as good (precision) Panelists need to see other SE to penalize poor recall.
Related Work • Mizzaro • Framework with 3 key dimensions of evaluation: • Information needs (expression level) • Information resources (TAU, documents,..) • Information context • High level of disagreement among judges • Amento et al. • Correlation Analysis • Expert Judges • Automated Metrics: in-degree, PageRank, page size, ...
Contributions • Framework of test types • Set-level to item-level correlation: • Middling: the two granularities measure different aspects of relevance • Set-level judging measures aspects missed by per-item judging: poor ranking, duplicates, missed senses • Perceived relevance to “Actual” (landing-page) relevance: • Higher correlation – maybe because the domain is News • Editorial judges versus panelists: • Panelists sit on the fence more • Panelists focus more on precision; they need to see the other SE to penalize poor recall • Panelist methodology and reward structure are crucial