TREC 2009 Review Lanbo Zhang
7 tracks • Web track • Relevance Feedback track (RF) • Entity track • Blog track • Legal track • Million Query track (MQ) • Chemical IR track
67 Participating Groups
The new dataset: ClueWeb09 • 1 billion web pages, in 10 languages, half are in English • Crawled by CMU in Jan. and Feb. 2009 • 5 TB (compressed), 25 TB (uncompressed) • Subset B • 50 million English pages • Includes all Wikipedia pages • The original dataset and the Indri index of subset B are available on our lab machines
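Since ClueWeb09 is distributed as gzipped WARC files, a first practical step is iterating over its records. Below is a minimal Python sketch assuming a simplified WARC layout (a header block terminated by a blank line, then Content-Length bytes of payload); the segment path is hypothetical, and a production reader should use a proper WARC library:

```python
import gzip

def iter_warc_records(path):
    """Yield (headers, payload) pairs from a gzipped WARC file.
    Simplified: header lines up to a blank line, then Content-Length
    bytes of payload; blank separator lines between records are skipped."""
    with gzip.open(path, "rb") as f:
        while True:
            line = f.readline()
            if not line:
                return  # end of file
            if not line.startswith(b"WARC/"):
                continue  # skip blank separators between records
            headers = {}
            while True:
                line = f.readline().rstrip(b"\r\n")
                if not line:
                    break
                key, _, value = line.partition(b":")
                headers[key.strip().lower()] = value.strip()
            payload = f.read(int(headers.get(b"content-length", b"0")))
            yield headers, payload

# Hypothetical path to one ClueWeb09 segment file
for headers, payload in iter_warc_records("en0000/00.warc.gz"):
    if headers.get(b"warc-type") == b"response":
        print(headers.get(b"warc-trec-id"), len(payload))
        break
```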
Web Track • Two tasks • Adhoc Retrieval Task • Diversity Task • Return a ranked list of pages that together provide complete coverage for a query, while avoiding excessive redundancy in the returned list.
Web Track • Topic type 1: ambiguous
Web Track • Topic type 2: faceted
Web Track • Results of adhoc task
Web Track • Results of diversity task
Waterloo at Web track • Two runs • Top 10000 docs from the entire collection • Top 10000 docs from the Wikipedia set • Wikipedia docs used as pseudo relevance feedback • Machine learning methods to re-rank the top 20000 docs and return the top 1000 (see the sketch below) • Diversity task • A Naïve Bayes classifier to re-rank the top 20000 and exclude duplicates
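As a rough illustration of the pseudo-relevance-feedback idea above (not Waterloo's actual implementation), one can pseudo-label top-ranked Wikipedia pages as positives, train a classifier on them, and re-score the candidate pool; all inputs here are plain document strings:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

def rerank_with_pseudo_feedback(wiki_hits, negatives, candidates, top_k=1000):
    """Train on pseudo-labels (top Wikipedia hits = positive, a sample of
    other docs = negative), then rank candidates by P(relevant | doc)."""
    vec = TfidfVectorizer(max_features=50000)
    X = vec.fit_transform(wiki_hits + negatives)
    y = [1] * len(wiki_hits) + [0] * len(negatives)
    clf = MultinomialNB().fit(X, y)
    scores = clf.predict_proba(vec.transform(candidates))[:, 1]
    return sorted(zip(scores, candidates), reverse=True)[:top_k]
```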
MSRA at Web track • Mining subtopics for a query by • Anchor texts • Search results clusters • Sites of search results • Search results diversification • A greedy algorithm to iteratively select the next best document
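The greedy selection can be sketched as follows: at each step, pick the document with the best trade-off between relevance and coverage of not-yet-covered subtopics (an MMR-style criterion; the representation and weights are illustrative, not MSRA's exact scoring):

```python
def greedy_diversify(docs, k=10, lam=0.5):
    """docs: list of (doc_id, relevance, subtopics), subtopics being a set.
    Greedily select k docs, trading off relevance against novel
    subtopic coverage."""
    selected, covered = [], set()
    pool = list(docs)
    while pool and len(selected) < k:
        def gain(d):
            _, rel, subs = d
            return lam * rel + (1 - lam) * len(subs - covered)
        best = max(pool, key=gain)
        selected.append(best)
        covered |= best[2]
        pool.remove(best)
    return selected

# toy usage: two documents about the fruit, one about the company
docs = [("d1", 0.9, {"apple fruit"}), ("d2", 0.8, {"apple inc"}),
        ("d3", 0.85, {"apple fruit"})]
print([d[0] for d in greedy_diversify(docs, k=2)])  # ['d1', 'd2']
```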
Relevance Feedback Track • Tasks • Phase 1: find a set of 5 documents that are good for relevance feedback. • Phase 2: develop an RF algorithm that does retrieval based on the relevance judgments of those 5 docs.
UCSC at RF track • Phase 1: document selection • Clustering top-ranked documents • Transductive Experimental Design (TED) • Phase 2: RF algorithm • Combining different document representations (see the sketch below) • Title, anchor, heading, document • Incorporating term position information • Phrase match, text window match • Incorporating document similarities to labeled docs
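A minimal sketch of the representation-combination step: per-field retrieval scores (however they are computed) are mixed linearly; the weights here are illustrative, not the tuned values from the UCSC runs:

```python
# Illustrative field weights, not the tuned values from the runs
FIELD_WEIGHTS = {"title": 0.3, "anchor": 0.2, "heading": 0.1, "document": 0.4}

def combined_score(field_scores, weights=FIELD_WEIGHTS):
    """Linearly mix per-field retrieval scores for one document;
    missing fields contribute zero."""
    return sum(weights[f] * field_scores.get(f, 0.0) for f in weights)

print(combined_score({"title": 2.1, "document": 1.4, "anchor": 0.7}))
```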
UMass at RF track • A supervised method to estimate the weights of expanded terms for RF • Training collection: wt10g • Term features given a query: • Term frequency in FB docs and the entire collection • Co-occurrence with query terms • Term proximity to query terms • Document frequency
UMass at RF track • Model: Boosting (see the sketch below)
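A sketch of the supervised term-weighting idea with a boosted regressor (the exact boosting variant is not reproduced here, and the data below is synthetic): each candidate expansion term is described by the features listed above, and the model predicts how useful the term is for feedback.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic stand-in data. Each row describes one candidate expansion term:
# [tf in FB docs, tf in collection, co-occurrence, proximity, df]
rng = np.random.default_rng(0)
X_train = rng.random((200, 5))
y_train = rng.random(200)   # target: observed usefulness on training queries

model = GradientBoostingRegressor(n_estimators=100).fit(X_train, y_train)

# Weight new candidate terms and keep the highest-scoring ones
X_candidates = rng.random((50, 5))
weights = model.predict(X_candidates)
top_term_indices = np.argsort(-weights)[:10]
```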
Entity Track • Task • Given an input entity, find the related entities • Return 100 related entities and their homepages
Purdue at Entity track • Entity Extraction • Hierarchical Relevance Model • Three levels of relevance: document, passage, entity
Purdue at Entity track • Homepage Finding for Entities • Logistic Regression model
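A toy version of homepage finding as binary classification (the feature names and values are hypothetical; Purdue's actual feature set is not reproduced here):

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical features for (entity, candidate URL) pairs
train_feats = [
    {"name_in_url": 1, "name_in_title": 1, "url_depth": 1},
    {"name_in_url": 0, "name_in_title": 0, "url_depth": 4},
]
labels = [1, 0]  # 1 = homepage, 0 = not a homepage

vec = DictVectorizer()
clf = LogisticRegression().fit(vec.fit_transform(train_feats), labels)

candidate = {"name_in_url": 1, "name_in_title": 0, "url_depth": 2}
print(clf.predict_proba(vec.transform([candidate]))[0, 1])
```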
Blog Track • Tasks • Faceted Blog Distillation • Top Stories Identification • Collection: Blogs08 • Crawled between 01/14/2008 and 02/10/2009 • 1.3 million unique blogs
Blog Track • Task 1: Faceted Blog Distillation • Given a topic and a facet restriction, find the relevant blogs. • Facets • Opinionated vs. Factual • Personal vs. Official • In-depth vs. Shallow • Topic example
Blog Track • Task 2: Top Stories Identification • Given a date, find the hottest news headlines for that day and select relevant and diverse blog posts for those headlines • News headlines from the New York Times were used • Topic example
Results of Blog track • Faceted Blog Distillation
Results of Blog track • Top Stories Identification • Find the hottest news headlines • Identify the related blog posts
BUPT at Blog track • Faceted Blog Distillation • Scoring function: the title section of a topic plus automatically selected terms from the DESC and NARR sections • Phrase match • Facet analysis • Opinionated vs. Factual: a sentiment analysis model • Personal vs. Official: the maximum frequency of an organization entity occurring in a blog (Stanford Named Entity Recognizer) • In-depth vs. Shallow: post length • Final score: a linear combination of the topical and facet parts (see the sketch below)
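The final scoring can be sketched as the linear combination named above; here the In-depth vs. Shallow facet signal is approximated from post length, and all weights are illustrative rather than BUPT's tuned values:

```python
def indepth_score(post_length, pivot=500):
    """Toy in-depth-vs-shallow signal: longer posts score closer to 1."""
    return post_length / (post_length + pivot)

def faceted_blog_score(topic_score, facet_score, alpha=0.7):
    """Linear mix of topical relevance and facet confidence;
    alpha is illustrative, not a tuned value."""
    return alpha * topic_score + (1 - alpha) * facet_score

print(faceted_blog_score(1.8, indepth_score(1200)))
```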
Univ. of Glasgow at Blog track • Top Stories Identification • The model: incorporating evidence from posts on the following days • Using Wikipedia to enrich news headline terms, keeping the top 10 terms for each headline (see the sketch below)
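A minimal sketch of the headline-enrichment step, assuming each headline is already paired with a related Wikipedia article (retrieval of the article is omitted): concatenate the two texts, then keep the 10 highest-tf-idf terms.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def enrich_headlines(headlines, wiki_texts, k=10):
    """For each headline, keep the k highest tf-idf terms of the headline
    concatenated with its paired Wikipedia article text."""
    docs = [h + " " + w for h, w in zip(headlines, wiki_texts)]
    vec = TfidfVectorizer(stop_words="english").fit(docs)
    terms = vec.get_feature_names_out()
    return [[terms[i]
             for i in vec.transform([d]).toarray()[0].argsort()[::-1][:k]]
            for d in docs]
```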
Legal Track • Tasks • Interactive task (Enron email collection) • Retrieval with topic authorities involved: participants can ask topic authorities to clarify topics and to judge the relevance of sample docs • Batch task (IIT CDIP 1.0) • Retrieval with relevance evidence (RF)
Waterloo at Legal track • Interactive task • Phase 1: interactive search and judging • To find a large and diverse set of training examples • Phase 2: interactive learning • To find more potentially relevant documents • Batch task • Run three spam-filter-style classifiers on every document: • An on-line logistic regression filter (sketched below) • A Naïve Bayes spam filter • An on-line version of the BM25 RF method
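The on-line logistic regression filter can be sketched as a streaming classifier over hashed character 4-grams, in the spirit of spam-filter-style relevance feedback (an illustration, not Waterloo's actual filter; the hashing scheme and learning rate are arbitrary):

```python
import math

class OnlineLogisticFilter:
    """Tiny online logistic regression over hashed character 4-grams."""

    def __init__(self, n_weights=1 << 20, rate=0.02):
        self.w = [0.0] * n_weights
        self.rate = rate
        self.n = n_weights

    def _features(self, text):
        # hashed 4-gram feature indices
        return {hash(text[i:i + 4]) % self.n for i in range(len(text) - 3)}

    def score(self, text):
        z = sum(self.w[f] for f in self._features(text))
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, text, label):  # label: 1 = relevant, 0 = not
        p = self.score(text)
        for f in self._features(text):
            self.w[f] += self.rate * (label - p)

filt = OnlineLogisticFilter()
filt.update("fraudulent energy trading memo", 1)
filt.update("lunch menu for friday", 0)
print(filt.score("energy trading settlement memo"))
```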
Million Query Track • Tasks • Adhoc retrieval for 40000 queries • Predict query types • Query intent: Precision-oriented vs. Recall-oriented • Query difficulty: Hard vs. Easy • Precision-oriented • Navigational: Find a specific URL or web page. • Closed: Find a short, unambiguous answer to a specific question. • Resource: Locate a web-based resource or download. • Recall-oriented • Open: Answer an open-ended question, or find all available information about a topic. • Advice: Find advice or ideas regarding a general question or problem. • List: Find a list of results that will help satisfy an open-ended goal.
Results of Million Query track • Precision-oriented vs. Recall-oriented • Hard vs. Easy
Northeastern Univ. at MQ track • Query-specific learning to rank • Learn different ranking functions for queries in different classes • Using SVM to classify queries • Training data: MQ 2008 dataset • Features • Document features: document length, TF, IDF, TF*IDF, normalized TF, Robertson’s TF, Robertson’s IDF, BM25, Language Models (Laplace, Dirichlet, JM). • Field features: title, heading, anchor text, and URL • Web graph features
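A toy version of the routing idea above: classify the query with an SVM, then hand it to the ranking function trained for that class. The training queries, labels, and per-class rankers below are all hypothetical stand-ins:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

# Toy training data: queries labeled precision- (P) vs. recall-oriented (R)
queries = ["united airlines home page", "effects of global warming",
           "download adobe reader", "history of the silk road"]
labels = ["P", "R", "P", "R"]

vec = TfidfVectorizer()
clf = SVC(kernel="linear").fit(vec.fit_transform(queries), labels)

# Hypothetical per-class rankers trained separately (details omitted)
rankers = {"P": lambda q: f"precision ranker on {q!r}",
           "R": lambda q: f"recall ranker on {q!r}"}

q = "find cheap flights to boston"
qtype = clf.predict(vec.transform([q]))[0]
print(rankers[qtype](q))
```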
Chemical IR Track • Tasks • Technical Survey Task • Retrieve documents in response to each topic given by chemical patent experts • Prior Art Search Task • Find prior-art patents relevant to each of a set of 1000 existing patents
Geneva at Chemical track • Document Representation: Title, Description, Abstract, Claims, Applicants, Inventors, IPC codes, Patent references • Exploiting Citation Networks • Query expansion using chemical annotations • Filtering based on IPC codes (see the sketch below) • Re-ranking based on claims
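The IPC-code filtering step can be sketched as prefix matching at the subclass level (e.g., 'C07D' from 'C07D 213/30'); the sample codes below are illustrative:

```python
def filter_by_ipc(results, query_ipc_codes, level=4):
    """Keep retrieved patents sharing an IPC prefix with the query patent.
    Codes are compared on their first `level` characters (subclass level)."""
    prefixes = {c[:level] for c in query_ipc_codes}
    return [r for r in results
            if any(c[:level] in prefixes for c in r["ipc_codes"])]

hits = [{"id": "EP1", "ipc_codes": ["C07D 213/30"]},
        {"id": "EP2", "ipc_codes": ["A61K 31/44"]}]
print(filter_by_ipc(hits, ["C07D 401/04"]))  # keeps EP1 only
```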