
TREC 2009 Review


Presentation Transcript


  1. TREC 2009 Review Lanbo Zhang

  2. 7 tracks • Web track • Relevance Feedback track (RF) • Entity track • Blog track • Legal track • Million Query track (MQ) • Chemical IR track

  3. 67 Participating Groups

  4. The new dataset: ClueWeb09 • 1 billion web pages, in 10 languages, half are in English • Crawled by CMU in Jan. and Feb. 2009 • 5 TB (compressed), 25 TB (uncompressed) • Subset B • 50 million English pages • Includes all Wikipedia pages • The original dataset and the Indri index of subset B are available on our lab machines

  5. Tracks • Web track • Relevance Feedback track (RF) • Entity track • Blog track • Legal track • Million Query track (MQ) • Chemical IR track

  6. Web Track • Two tasks • Adhoc Retrieval Task • Diversity Task • Return a ranked list of pages that together provide complete coverage for a query, while avoiding excessive redundancy in the result list.

  7. Web Track • Topic type 1: ambiguous

  8. Web Track • Topic type 2: faceted

  9. Web Track • Results of adhoc task

  10. Web Track • Results of diversity task

  11. Waterloo at Web track • Two runs • Top 10000 docs in the entire collection • Top 10000 docs in the Wikipedia set • Wikipedia docs as pseudo relevance feedback • Machine learning methods to re-rank the top 20000 docs, and return the top 1000 • Diversity task • A Naïve Bayes classifier designed to re-rank the top 20000 to exclude duplicates

  12. MSRA at Web track • Mining subtopics for a query by • Anchor texts • Search results clusters • Sites of search results • Search results diversification • A greedy algorithm to iteratively select the next best document
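
  The slide only names a greedy algorithm that iteratively selects the next best document; the exact objective is not given. A minimal sketch in the spirit of MMR (maximal marginal relevance), with an assumed relevance/novelty trade-off lambda, might look like this:

```python
# MMR-style greedy diversification sketch (not MSRA's exact objective).
# `relevance` maps doc id -> query relevance score; `sim(a, b)` returns a
# similarity in [0, 1] between two documents; both are assumed to be given.

def greedy_diversify(relevance, sim, k=10, lam=0.5):
    selected = []
    candidates = set(relevance)
    while candidates and len(selected) < k:
        def mmr(d):
            novelty = 0.0 if not selected else max(sim(d, s) for s in selected)
            return lam * relevance[d] - (1 - lam) * novelty
        best = max(candidates, key=mmr)   # pick the next best document
        selected.append(best)
        candidates.remove(best)
    return selected
```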

  13. Tracks • Web track • Relevance Feedback track (RF) • Entity track • Blog track • Legal track • Million Query track (MQ) • Chemical IR track

  14. Relevance Feedback Track • Tasks • Phase 1: find a set of 5 documents that are good for relevance feedback. • Phase 2: develop an RF algorithm to do retrieval based on the relevance judgments of 5 docs.

  15. Results of RF track: Phase 1

  16. Results of RF track: Phase 2

  17. UCSC at RF track • Phase 1: document selection • Clustering top-ranked documents • Transductive Experimental Design (TED) • Phase 2: RF algorithm • Combining different document representations • Title, anchor, heading, document • Incorporating term position information • Phrase match, text window match • Incorporating document similarities to labeled docs
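
  The idea behind the Phase 1 selection is to pick a handful of documents that represent the top-ranked pool well. A rough coverage-based sketch of that idea is below; it is a simplified greedy criterion over cosine similarities, not the exact Transductive Experimental Design objective UCSC used:

```python
import numpy as np

# Greedy selection of k feedback documents that "cover" the top-ranked pool.
# doc_vectors: (n_docs, n_features) matrix of the top-ranked documents.

def select_feedback_docs(doc_vectors, k=5):
    X = doc_vectors / (np.linalg.norm(doc_vectors, axis=1, keepdims=True) + 1e-9)
    sim = X @ X.T                      # cosine similarity between candidates
    covered = np.zeros(len(X))         # current coverage of each pool document
    chosen = []
    for _ in range(k):
        # marginal gain in pool coverage if each candidate were added
        gain = np.maximum(sim, covered).sum(axis=1) - covered.sum()
        for c in chosen:
            gain[c] = -np.inf
        best = int(np.argmax(gain))
        chosen.append(best)
        covered = np.maximum(covered, sim[best])
    return chosen
```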

  18. UMass at RF track • A supervised method to estimate the weights of expanded terms for RF • Training collection: wt10g • Term features given a query: • Term frequency in FB docs and the entire collection • Co-occurrence with query terms • Term proximity to query terms • Document frequency

  19. UMass at RF track • Model: Boosting
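
  A hedged sketch of this training setup is below. The slide only lists the feature types and says a boosting model is used; the concrete feature extraction, the target weights, and the use of scikit-learn's gradient boosting are assumptions for illustration:

```python
from sklearn.ensemble import GradientBoostingRegressor

# Learn weights for candidate expansion terms from features computed over
# the feedback documents (training targets assumed to come from wt10g runs).

def term_features(term, query_terms, fb_docs, collection_stats):
    return [
        sum(doc.count(term) for doc in fb_docs),        # TF in feedback docs
        collection_stats.get(term, {}).get("ctf", 0),   # TF in the collection
        collection_stats.get(term, {}).get("df", 0),    # document frequency
        sum(term in doc and q in doc                    # co-occurrence with query terms
            for doc in fb_docs for q in query_terms),
    ]

model = GradientBoostingRegressor(n_estimators=200, max_depth=3)
# X: one feature row per candidate term, y: its target weight (both assumed)
# model.fit(X, y)
# expansion_weights = model.predict(X_new)
```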

  20. Tracks • Web track • Relevance Feedback track (RF) • Entity track • Blog track • Legal track • Million Query track (MQ) • Chemical IR track

  21. Entity Track • Task • Given an input entity, find the related entities • Return 100 related entities and their homepages

  22. Results of Entity track

  23. Purdue at Entity track • Entity Extraction • Hierarchical Relevance Model • Three levels of relevance: document, passage, entity
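
  The slide describes three levels of relevance but not how they are combined. A minimal sketch of one way to aggregate them, with normalized document and passage scores assumed, is:

```python
# Hierarchical relevance sketch: an entity accumulates evidence from every
# relevant passage of every relevant document it appears in. The multiplicative
# combination is an assumption, not Purdue's exact model.

def entity_score(entity, retrieved):
    # retrieved: list of (doc_score, passages), where passages is a list of
    # (passage_score, entities_in_passage); scores assumed normalized to [0, 1]
    score = 0.0
    for doc_score, passages in retrieved:
        for passage_score, entities in passages:
            if entity in entities:
                score += doc_score * passage_score
    return score
```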

  24. Purdue at Entity track • Homepage Finding for Entities • Logistic Regression model
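
  The slide only states that homepage finding uses a logistic regression model. A hedged sketch with a hypothetical feature set (URL/title matches and URL shape) follows:

```python
from sklearn.linear_model import LogisticRegression

# Classify whether a candidate URL is the homepage of the retrieved entity.
# The features below are illustrative assumptions.

def homepage_features(entity, url, title):
    e = entity.lower()
    return [
        int(e.replace(" ", "") in url.lower()),   # entity name appears in the URL
        int(e in title.lower()),                  # entity name appears in the page title
        url.count("/"),                           # URL depth (homepages tend to be shallow)
        len(url),                                 # URL length
    ]

clf = LogisticRegression(max_iter=1000)
# clf.fit([homepage_features(e, u, t) for e, u, t in labeled_examples], labels)
# p_homepage = clf.predict_proba([homepage_features(entity, url, title)])[:, 1]
```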

  25. Tracks • Web track • Relevance Feedback track (RF) • Entity track • Blog track • Legal track • Million Query track (MQ) • Chemical IR track

  26. Blog Track • Tasks • Faceted Blog Distillation • Top Stories Identification • Collection: Blogs08 • Crawled between 01/14/2008 and 02/10/2009 • 1.3 million unique blogs

  27. Blog Track • Task 1: Faceted Blog Distillation • Given a topic and a facet restriction, find the relevant blogs. • Facets • Opinionated vs. Factual • Personal vs. Official • In-depth vs. Shallow • Topic example

  28. Blog Track • Task 2: Top Stories Identification • Given a date, find the hottest news headlines for that day and select relevant and diverse blog posts for those headlines • News headlines from the New York Times are used • Topic example

  29. Results of Blog track • Faceted Blog Distillation

  30. Results of Blog track • Top Stories Identification • Find the hottest news headlines • Identify the related blog posts

  31. BUPT at Blog track • Faceted Blog Distillation • Scoring function: • The title section of a topic plus automatically selected terms from the DESC and NARR sections • Phrase match • Facet analysis • Opinionated vs. Factual: a sentiment analysis model • Personal vs. Official: the maximum frequency of an organization entity occurring in a blog (Stanford Named Entity Recognizer) • In-depth vs. Shallow: post length • Linear combination of the above two parts
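
  Putting the two parts together, a hedged sketch of the final score is below. The facet heuristics follow the slide (sentiment, organization-entity frequency, post length); the helper functions, weights, and normalizations are assumptions:

```python
# Final blog score: linear combination of topical relevance and a facet score.
# `sentiment(post)` and `org_entity_freq(post)` are assumed helpers (e.g., a
# sentiment classifier and Stanford NER counts per post).

def blog_score(topic_relevance, facet_value, alpha=0.7):
    return alpha * topic_relevance + (1 - alpha) * facet_value

def facet_score(facet, posts, sentiment, org_entity_freq):
    if facet in ("opinionated", "factual"):
        s = sum(sentiment(p) for p in posts) / max(len(posts), 1)
        return s if facet == "opinionated" else 1.0 - s
    if facet in ("personal", "official"):
        o = max((org_entity_freq(p) for p in posts), default=0)
        return o if facet == "official" else 1.0 / (1.0 + o)
    if facet in ("in-depth", "shallow"):
        avg_len = sum(len(p.split()) for p in posts) / max(len(posts), 1)
        return avg_len if facet == "in-depth" else 1.0 / (1.0 + avg_len)
    return 0.0
```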

  32. Univ. of Glasgow at Blog track • Top Stories Identification • The model: • Incorporating evidence from the following days • Using Wikipedia to enrich news headline terms and keeping the top 10 terms for each headline
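
  The headline enrichment step can be sketched as follows. The slide only says Wikipedia is used to enrich headline terms and that the top 10 terms are kept; the lookup function and the tf-idf weighting used here are assumptions:

```python
from collections import Counter

# Enrich a news headline with terms from a related Wikipedia article and keep
# the 10 highest-weighted terms. `wikipedia_text_for` (article lookup) and
# `idf` (term -> idf weight) are assumed inputs.

def enrich_headline(headline, wikipedia_text_for, idf, top_k=10):
    terms = Counter(headline.lower().split())
    for tok in wikipedia_text_for(headline).lower().split():
        terms[tok] += 1
    weighted = {t: tf * idf.get(t, 1.0) for t, tf in terms.items()}
    return sorted(weighted, key=weighted.get, reverse=True)[:top_k]
```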

  33. Tracks • Web track • Relevance Feedback track (RF) • Entity track • Blog track • Legal track • Million Query track (MQ) • Chemical IR track

  34. Legal Track • Tasks • Interactive task (Enron email collection) • Retrieval with topic authorities involved: participants can ask topic authorities to clarify topics and to judge the relevance of sample docs • Batch task (IIT CDIP 1.0) • Retrieval with relevance evidence (RF)

  35. Results of Legal track

  36. Waterloo at Legal track • Interactive task • Phase 1: interactive search and judging • To find a large and diverse set of training examples • Phase 2: interactive learning • To find more potentially relevant documents • Batch task • Run three spam filters on every document: • An on-line logistic regression filter • A Naïve Bayes spam filter • An on-line version of the BM25 RF method
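
  As a rough illustration of the first filter, here is a minimal on-line logistic regression classifier over hashed character 4-gram features, updated one judged document at a time. The feature choice, hashing, and learning rate are assumptions; the slide only names the filter type:

```python
import math

DIM = 1 << 20  # hashed feature space

def features(text):
    # set of hashed character 4-grams of the document text
    return {hash(text[i:i + 4]) % DIM for i in range(len(text) - 3)}

def predict(w, grams):
    z = sum(w[g] for g in grams)
    return 1.0 / (1.0 + math.exp(-z))       # probability of "relevant"

def update(w, grams, label, lr=0.002):
    # label: 1.0 for relevant, 0.0 for not relevant (on-line gradient step)
    err = label - predict(w, grams)
    for g in grams:
        w[g] += lr * err

w = [0.0] * DIM
# for text, label in judged_documents:      # judgments arrive on-line (assumed)
#     update(w, features(text), label)
# score = predict(w, features(new_document))
```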

  37. Tracks • Web track • Relevance Feedback track (RF) • Entity track • Blog track • Legal track • Million Query track (MQ) • Chemical IR track

  38. Million Query Track • Tasks • Adhoc retrieval for 40000 queries • Predict query types • Query intent: Precision-oriented vs. Recall-oriented • Query difficulty: Hard vs. Easy • Precision-oriented • Navigational: Find a specific URL or web page. • Closed: Find a short, unambiguous answer to a specific question. • Resource: Locate a web-based resource or download. • Recall-oriented • Open: Answer an open-ended question, or find all available information about a topic. • Advice: Find advice or ideas regarding a general question or problem. • List: Find a list of results that will help satisfy an open-ended goal.

  39. Results of Million Query track Precision vs. Recall Hard vs. Easy

  40. Northeastern Univ. at MQ track • Query-specific learning to rank • Learn different ranking functions for queries in different classes • Using SVM to classify queries • Training data: MQ 2008 dataset • Features • Document features: document length, TF, IDF, TF*IDF, normalized TF, Robertson’s TF, Robertson’s IDF, BM25, Language Models (Laplace, Dirichlet, JM). • Field features: title, heading, anchor text, and URL • Web graph features
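
  A hedged sketch of the query-specific setup: classify each query with an SVM trained on the MQ 2008 data, then rank with the model learned for that query class. The query features and the per-class rankers are assumed; the slide lists the document, field, and web-graph features used by the rankers themselves:

```python
from sklearn.svm import SVC

# One ranking function per query class, selected by an SVM query classifier.
query_classifier = SVC(kernel="linear")
# query_classifier.fit(query_feature_matrix, query_class_labels)  # MQ 2008 data (assumed)

def rank(query, docs, query_features, rankers):
    qclass = query_classifier.predict([query_features(query)])[0]
    ranker = rankers[qclass]                 # learned ranking function for this class
    return sorted(docs, key=lambda d: ranker(query, d), reverse=True)
```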

  41. Tracks • Web track • Relevance Feedback track (RF) • Entity track • Blog track • Legal track • Million Query track (MQ) • Chemical IR track

  42. Chemical IR Track • Tasks • Technical Survey Task • Retrieve documents in response to each topic given by chemical patent experts • Prior Art Search Task • Find relevant patents with respect to a set of 1000 existing patents

  43. Results of Chemical track

  44. Geneva at Chemical track • Document Representation: • Title, Description, Abstract, Claims, Applicants, Inventors, IPC codes, Patent references • Exploiting Citation Networks • Query expansion using chemical annotations • Filtering based on IPC codes • Re-ranking based on claims
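
  The IPC-based filtering step for prior-art search can be sketched as follows: candidates that share no IPC code with the query patent are left unboosted (or could be dropped). The boost factor and field names are assumptions; the slide only says filtering is based on IPC codes:

```python
# Re-score prior-art candidates by IPC-code overlap with the query patent.
# Each patent is assumed to be a dict with "id", "score", and "ipc_codes".

def ipc_filter(query_patent, candidates, boost=2.0):
    q_codes = set(query_patent["ipc_codes"])
    rescored = []
    for cand in candidates:
        overlap = len(q_codes & set(cand["ipc_codes"]))
        score = cand["score"] * (boost if overlap else 1.0)
        rescored.append((score, cand["id"]))
    return sorted(rescored, reverse=True)
```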
