
A Framework for Human Evaluation of Search Engine Relevance



Presentation Transcript


  1. A Framework for Human Evaluation of Search Engine Relevance • Kamal Ali, Chi Chao Chang, Yun-fang Juan

  2. Outline • The Problem • Framework of Test Types • Results • Expt. 1: Set-level versus Item-level • Expt. 2: Perceived Relevance versus Landing-Page Relevance • Expt. 3: Editorial Judges versus Panelist Judges • Related Work • Contributions

  3. The Problem • Classical IR: Cranfield technique • Search Engine Experience • Results + UI + Speed + Advertising, spell suggestions ... • Search Engine Relevance • Users see summaries (abstracts), not documents • Heterogeneity of results: images, video, music, ... • User population more diverse • Information needs more diverse (commercial, entertainment) • Human Evaluation costs are high • Holistic alternative: judge set of results

  4. Framework of test types • Dimension 1: Query Category • Images, News, Product Research, Navigational, … • Dimension 2: Modality • Implicit (click behavior) versus Explicit (judgments) • Dimension 3: Judge Class • Live users versus Panelists versus Domain experts • Dimension 4: Granularity • Set-level versus item-level • Dimension 5: Depth • Perceived relevance versus Landing-Page (“Actual”) relevance • Dimension 6: Relativity • Absolute judgments versus Relative judgments

  5. Explored in this work • Dimension 1: Query Category: Random (mixed), Adult, How-to, Map, Weather, Biographical, Tourism, Sports, … • Dimension 2: Modality

  6. Explored in this work • Dimension 3: Judge Class • Dimension 4: Granularity

  7. Explored in this work • Dimension 5: Depth • Dimension 6: Relativity

  8. Experimental Methodology • Random set of queries taken from Yahoo! Web logs • Per-set: • Judge sees 10 items from a search engine • Gives one judgment for entire set • Order is as supplied by search engine • Per-item: • Judge sees one item (document) at a time • Gives one judgment per item • Order is scrambled • Side-by-side: • Judge sees 2 sets; sides are scrambled • Within each set, order is preserved
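The three presentation modes above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the result identifiers, and the use of a seeded random generator are hypothetical and do not come from the slides.

```python
import random

def per_item_tasks(results, rng):
    """Per-item test: the judge sees one item at a time, in scrambled order."""
    items = list(results)
    rng.shuffle(items)
    return items

def side_by_side_task(set_a, set_b, rng):
    """Side-by-side test: scramble which engine appears on which side,
    but keep each engine's own ranking order intact."""
    sides = [("A", list(set_a)), ("B", list(set_b))]
    rng.shuffle(sides)
    return {"left": sides[0], "right": sides[1]}

# Hypothetical result lists for one query (10 items per engine).
rng = random.Random(0)
engine_a = [f"a{i}" for i in range(1, 11)]
engine_b = [f"b{i}" for i in range(1, 11)]

items = per_item_tasks(engine_a, rng)
task = side_by_side_task(engine_a, engine_b, rng)
```

A per-set task needs no transformation at all: the ten results are shown in the order the search engine supplied them.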

  9. Expt. 1: Effect of Granularity: Set-level versus Item-level • Methodology: • Domain/expert editors • Self-selection of queries • Value 1: Per-set judgments, given on a 3-point scale (1 = best, 3 = worst), thus producing a discrete random variable • Value 2: Item-level judgments; 10 item-level judgments are rolled up to a single number (using DCG roll-up function) • DCG values discretized into 3 bins/levels, thus producing the 2nd discrete random variable • Look at resulting 3 * 3 contingency matrix • Compute correlation between these variables
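A minimal sketch of the roll-up, binning, and contingency steps follows. The slides do not give the DCG formula or the bin boundaries, so the log2 rank discount and the cut-off parameters below are assumptions, and all names are hypothetical.

```python
import math

def dcg(gains):
    """Discounted cumulative gain with a log2 rank discount.
    (An assumed variant; the slides do not state the exact formula.)"""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))

def to_bin(value, low_cut, high_cut):
    """Discretize a DCG value onto the set-level scale: 1 = best, 3 = worst.
    The cut-offs are hypothetical parameters."""
    if value >= high_cut:
        return 1
    if value >= low_cut:
        return 2
    return 3

def contingency(set_levels, item_levels):
    """3 x 3 count matrix: rows = set-level judgment (1..3),
    columns = binned item-level DCG (1..3), one count per query."""
    table = [[0] * 3 for _ in range(3)]
    for s, i in zip(set_levels, item_levels):
        table[s - 1][i - 1] += 1
    return table

# Hypothetical example: two queries, each with item-level gains and a set-level judgment.
item_gains = [[2, 2, 1, 0], [0, 0, 1, 0]]
set_judgments = [1, 3]
binned = [to_bin(dcg(g), low_cut=1.0, high_cut=3.0) for g in item_gains]
table = contingency(set_judgments, binned)
```

The two resulting discrete variables (set-level judgment and binned DCG) can then be compared with any rank or categorical association measure.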

  10. Expt. 1: Effect of Granularity: Set-level versus Item-level: Images • Domain 1: Search at Image site or Image tab • 299 image queries; 2-3 judges per query; 6856 judgments • 198 queries in common between set-level, item-level tests • 20 images (in 4*5 matrix) shown per query

  11. Expt. 1: Effect of Granularity: Set-level versus Item-level: Images • Interpretation of Results • Spearman correlation is a middling 0.54 • Look at outlier queries • “Hollow Man”: high set-level, low item-level scores • Most items irrelevant, which explains the low item-level DCG; recall judges were seeing images one at a time in scrambled order • Set-level: eye can quickly (in parallel) see relevant image; less sensitive to irrelevant images • Set-level: Ranking function was poor, leading to low DCG score. Unusual, since normally set-level picks out poor ranking.
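For reference, Spearman correlation is the Pearson correlation of the ranks of the two variables. The sketch below uses average ranks for ties, which matters here because both variables take only three levels; whether the authors handled ties this way is an assumption, and the function names are hypothetical.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the tie-averaged ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

Called on the per-query set-level judgments and the binned item-level DCG values, `spearman` returns a value in [-1, 1]; the slide reports 0.54 for the image domain.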

  12. Expt. 2: Effect of Depth: Perceived Relevance vs. Landing Page • Fix granularity at Item level • Perceived Relevance: • Title, abstract (summary) and URL shown (T.A.U.) • Judgment 1: Relevance of title • Judgment 2: Relevance of abstract • Click on URL to reach Landing Page • Judgment 3: Relevance of Landing Page

  13. Expt. 2: Effect of Depth: Perceived vs. “Actual”: Advertisements • Domain 1: Search at News site • Created compound random variable for Perceived Relevance: AND’d Title Relevance and Abstract Relevance • Correlated with Landing Page (“Actual”) Relevance • Higher Correlation: News Title/Abstract carefully constructed
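The compound variable can be illustrated as follows. The sketch assumes each judgment has been reduced to a boolean (relevant or not) and uses the phi coefficient as the association measure; the slides say only that the variables were "correlated", so both choices are assumptions, and the sample judgments are invented for illustration.

```python
import math

def perceived(title_ok, abstract_ok):
    """Compound perceived relevance: title AND abstract judged relevant."""
    return [t and a for t, a in zip(title_ok, abstract_ok)]

def phi_binary(x, y):
    """Phi coefficient between two boolean sequences (assumed measure)."""
    n11 = sum(1 for a, b in zip(x, y) if a and b)
    n10 = sum(1 for a, b in zip(x, y) if a and not b)
    n01 = sum(1 for a, b in zip(x, y) if not a and b)
    n00 = sum(1 for a, b in zip(x, y) if not a and not b)
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denom == 0:
        raise ValueError("phi is undefined when a variable is constant")
    return (n11 * n00 - n10 * n01) / denom

# Invented judgments for four result items (not data from the study).
title = [True, True, False, True]
abstract = [True, False, False, True]
landing = [True, False, False, True]

pr = perceived(title, abstract)
association = phi_binary(pr, landing)
```

With these invented judgments the perceived and landing-page variables agree on every item, so the coefficient is at its maximum; real data would fall somewhere below that.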

  14. Expt. 3: Effect of Judge Class: Editors versus Panelists • Methodology: • 1000 randomly selected queries (frequency-biased sampling) • 40 judges, a few hundred panelists • Panelist • Recruitment: long-standing panel • Reward: gift certificate • Remove panelists that completed test too quickly or missed sentinel questions • Query = *, Modality = Explicit, Granularity = Set-level, Depth = Mixed, Relativity = Relative • Questions: • 1. Does judge class affect overall conclusion on which search engine is better? • 2. Are there particular types of queries for which significant differences exist?
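Two steps of this methodology lend themselves to a short sketch: frequency-biased query sampling and panelist filtering. The version below is an assumption-laden illustration: it samples with replacement in proportion to log frequency, and the minimum-time threshold, field layout, and function names are all hypothetical.

```python
import random

def sample_queries(query_counts, k, rng):
    """Frequency-biased sampling: each draw picks a query with probability
    proportional to its log frequency (with replacement; an assumption)."""
    queries = list(query_counts)
    weights = [query_counts[q] for q in queries]
    return rng.choices(queries, weights=weights, k=k)

def keep_panelist(seconds_taken, sentinel_correct, min_seconds):
    """Keep a panelist only if they did not finish too quickly and
    answered every sentinel question correctly."""
    return seconds_taken >= min_seconds and all(sentinel_correct)

# Hypothetical query log counts and panelist records.
rng = random.Random(0)
counts = {"weather": 50, "maps": 30, "rare query": 1}
sample = sample_queries(counts, k=5, rng=rng)

panelists = [
    ("p1", 600, [True, True]),   # kept
    ("p2", 45, [True, True]),    # too fast
    ("p3", 700, [True, False]),  # missed a sentinel question
]
kept = [pid for pid, secs, sentinels in panelists
        if keep_panelist(secs, sentinels, min_seconds=120)]
```

Frequent queries dominate such a sample, which makes the test reflect typical traffic rather than the set of distinct queries.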

  15. Expt. 3: Effect of Judge Class: Editors versus Panelists • Column percentages: p( P | E ) • p(P=eng1) = .28 but p(P=eng1 | E=eng1) = .35 • Lift = .35 / .28 = 1.25, a modest 25% lift
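The lift calculation can be reproduced from a contingency table of editor versus panelist preferences. The counts below are hypothetical, chosen only so that the two probabilities match the slide's .28 and .35; they are not the study's data, and the (eng1, tie, eng2) category ordering is an assumption.

```python
def lift(table, editor_row, panelist_col):
    """Lift of a panelist outcome given an editor outcome:
    p(P = col | E = row) / p(P = col).
    Rows index the editor outcome, columns the panelist outcome."""
    n = sum(sum(row) for row in table)
    p_marginal = sum(row[panelist_col] for row in table) / n
    p_conditional = table[editor_row][panelist_col] / sum(table[editor_row])
    return p_conditional / p_marginal, p_conditional, p_marginal

# Hypothetical counts (NOT from the study); categories assumed to be (eng1, tie, eng2).
table = [
    [35, 45, 20],  # editors favor eng1
    [28, 52, 20],  # editors see no difference
    [21, 50, 29],  # editors favor eng2
]
value, p_cond, p_marg = lift(table, editor_row=0, panelist_col=0)
```

A lift of 1.0 would mean that knowing the editors' preference tells nothing about the panelists' preference; values above 1.0 indicate agreement beyond chance.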

  16. Expt. 3: Effect of Judge Class: Editors versus Panelists • Editor marginal distribution: (.33, .33, .33) • Panelists less likely to discern a difference: (.25, .50, .25) • Given P favors Eng1, E are more likely to favor Eng1 than if P favors Eng2.

  17. Expt. 3: Effect of Judge Class: Correlation, Association • Linear Model: r² = 0.03 • 3*3 Categorical Model: φ = 0.29 • Test for Non-zero Association: χ² = 16.3, 99.5% signif.
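As a sketch of how such association statistics are obtained from a contingency table, the code below computes the Pearson chi-square statistic and a phi coefficient taken as sqrt(chi-square / N). That is one common definition; the slides do not say which form was used, and the function names are hypothetical. For a 3*3 table the test has (3-1)*(3-1) = 4 degrees of freedom, for which the 0.005 critical value is about 14.86, so a statistic of 16.3 is consistent with the stated 99.5% significance.

```python
import math

def chi_square(table):
    """Pearson chi-square statistic for a contingency table of counts.
    Returns (statistic, total count)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat, n

def phi(stat, n):
    """Phi coefficient as sqrt(chi-square / N) (an assumed definition)."""
    return math.sqrt(stat / n)

def degrees_of_freedom(table):
    """(rows - 1) * (columns - 1)."""
    return (len(table) - 1) * (len(table[0]) - 1)

# Critical value of chi-square with 4 degrees of freedom at the 0.005 level.
CRITICAL_DF4_P005 = 14.860
significant = 16.3 > CRITICAL_DF4_P005  # the statistic reported on the slide
```

Applied to the editors-by-panelists 3*3 table, `chi_square` and `phi` give the two categorical figures on the slide; the low r² next to a significant chi-square suggests the association is real but weak and not well captured by a linear model.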

  18. Expt. 3: Effect of Judge Class: Qualitative Analysis • Editors’ top feedback: • Ranking not as good (precision) • Both equally good (precision) • Relevant sites missing (recall) • Perfect site missing (recall) • Panelists’ top feedback: • Both equally good (precision) • Too general (precision) • Both equally bad (precision) • Ranking not as good (precision) • Panelists need to see other SE to penalize poor recall.

  19. Related Work • Mizzaro • Framework with 3 key dimensions of evaluation: • Information needs (expression level) • Information resources (TAU, documents, ...) • Information context • High level of disagreement among judges • Amento et al. • Correlation Analysis • Expert Judges • Automated Metrics: in-degree, PageRank, page size, ...

  20. Contributions • Framework • Set-level to Item-level correlation: • Middling: measuring different aspects of relevance • Set-level judgments measure aspects missed by per-item: poor ranking, duplicates, missed senses • Perceived Relevance to “Actual” relevance • Higher correlation, maybe because domain = News • Editorial judges versus Panelists • Panelists sit on fence more • Panelists focus on precision more, need other SE for recall • Panelist methodology, reward structure is crucial
