
  1. Building a Domain-Specific Document Collection for Evaluating Metadata Effects on Information Retrieval
     Walid Magdy, Jinming Min, Johannes Leveling, Gareth Jones
     School of Computing, Dublin City University, Ireland
     LREC 2010, 20 May 2010

  2. Outline
     • CNGL
     • Objective
     • Data collection preparation and overview
     • IR test collection design
     • Baseline experiments
     • Summary

  3. CNGL
     • Centre for Next Generation Localisation (CNGL)
     • 4 universities: DCU, TCD, UCD, and UL
     • Team: 120 PhD students, postdocs, and PIs
     • Supported by Science Foundation Ireland (SFI)
     • 9 industrial partners: IBM, Microsoft, Symantec, …
     • Objective: automation of the localisation process
     • Technologies: MT, AH, IR, NLP, Speech, and Dev.

  4. Objective
     • Create a collection of data that is:
       – suitable for IR tasks
       – suitable for other research fields (AH, NLP)
       – large enough to produce conclusive results
       – associated with defined evaluation strategies
     • Prepare the collection from freely available data:
       – YouTube
       – domain-specific (basketball)
     • Build a standard IR test collection (document set + topic set + relevance assessments)

  5. YouTube video features
     • Document: video URL, video title
     • Features: description, tags, posting user, length, posting date,
       category, comments, responded videos, number of times favorited,
       number of views, number of ratings, related videos

  6. Methodology for crawling data
     • 50 NBA-related queries used to search YouTube
     • First 700 results per query crawled, together with their related videos
     • Crawled pages parsed and metadata extracted
     • Extracted data represented in XML format
     • Non-sport category results filtered out
     • Queries used:
       – NBA, NBA Highlights, NBA All Stars, NBA fights
       – the top-ranked 15 NBA players in 2008 + Jordan + Shaq
       – 29 NBA teams
     (A minimal sketch of this crawl loop follows.)
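
     A minimal sketch of this crawl loop in Python is given below. The helpers fetch_search_results() and parse_video_page() are hypothetical placeholders for the site-specific scraping, which the slides do not show; only the overall flow (search, crawl, filter by category, serialise to XML) follows the methodology above.

         # Sketch of the crawl on slide 6; helper functions are hypothetical stubs.
         import xml.etree.ElementTree as ET

         def fetch_search_results(query, max_results=700):
             """Hypothetical: return URLs of the top search results for a
             YouTube query, together with their related videos."""
             raise NotImplementedError("site-specific scraping omitted")

         def parse_video_page(url):
             """Hypothetical: extract title, description, tags, category, ...
             from a crawled video page."""
             raise NotImplementedError("site-specific parsing omitted")

         def crawl_collection(queries):
             documents = {}
             for query in queries:
                 for url in fetch_search_results(query):
                     if url in documents:                  # de-duplicate across queries
                         continue
                     meta = parse_video_page(url)
                     if meta.get("category") != "Sports":  # drop non-sport results
                         continue
                     documents[url] = meta
             return documents

         def to_xml(doc_id, meta):
             """Serialise one video's metadata as an XML document."""
             root = ET.Element("doc")
             ET.SubElement(root, "docno").text = doc_id
             for field in ("title", "description", "tags", "category"):
                 ET.SubElement(root, field).text = meta.get(field, "")
             return ET.tostring(root, encoding="unicode")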

  7. Data collection overview
     • Crawled video pages: 61,340
     • Max crawled related/responded video pages per video: 20
     • Max crawled comments per video page: 500
     • Each comment associated with the contributing user’s ID
     • Crawled user profiles: ≈250k

  8. XML sample
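
     The transcript does not reproduce the XML sample shown on this slide. The record below is a hypothetical reconstruction based on the fields listed on slide 5; the actual element names and values in the collection may differ.

         <doc>
           <docno>video_00001</docno>
           <url>http://www.youtube.com/watch?v=XXXXXXXXXXX</url>
           <title>Example video title</title>
           <description>Example video description</description>
           <tags>NBA basketball highlights</tags>
           <category>Sports</category>
           <user>example_user</user>
           <length>185</length>
           <postdate>2008-01-01</postdate>
           <views>12345</views>
           <ratings>67</ratings>
           <favorited>8</favorited>
           <comments>
             <comment user="another_user">Example comment text</comment>
           </comments>
           <related>
             <video>Example related video title</video>
           </related>
         </doc>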

  9. Topics creation
     • 40 topics (queries) created
     • Specific topics related to the NBA
     • TREC topic = query (title) + description + narrative
     Example topic:
       <title>Michael Jordan best dunks</title>
       <description>Find the best dunks through the career of Michael Jordan
       in the NBA. It can be a collection of dunks in matches, or a dunk
       contest he participated in.</description>
       <narrative>A relevant video should contain at least one dunk by Jordan.
       Videos of dunks by other players are not relevant, and plays by Jordan
       other than dunks are not relevant either.</narrative>

  10. Relevance assessment
      • 4 indexes created:
        – Title
        – Title + Tags
        – Title + Tags + Description
        – Title + Tags + Description + Related video titles
      • 5 different retrieval models used
      • 20 result lists in total (4 indexes × 5 retrieval models), each containing 60 documents
      • Result lists merged and shown to assessors in random order (a pooling sketch follows below)
      • 122 to 466 documents assessed per topic
      • 1 to 125 relevant documents per topic (avg. = 23)
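
      A minimal sketch of this pooling step, assuming each result list is a ranked sequence of document IDs (the actual assessment tooling is not described in the slides):

         import random

         def build_assessment_pool(result_lists, depth=60, seed=42):
             """Merge the top-`depth` results of every list, de-duplicate,
             and shuffle so assessors cannot infer any system's ranking."""
             pool = set()
             for ranked_ids in result_lists:
                 pool.update(ranked_ids[:depth])
             pool = sorted(pool)                # fixed order before shuffling
             random.Random(seed).shuffle(pool)  # random presentation order
             return pool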

  11. Baseline experiments
      • Searched 4 different indexes:
        – Title
        – Title + Tags
        – Title + Tags + Description
        – Title + Tags + Description + Related video titles
      • Indri retrieval model used to rank results
      • 1,000 results retrieved for each search
      • Mean average precision (MAP) used to compare the results (a sketch of the computation follows below)
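
      Average precision (AP) for a topic is the mean of the precision values at each rank where a relevant document is retrieved, normalised by the number of relevant documents; MAP is the mean of AP over all topics. A minimal sketch (not the official trec_eval implementation):

         def average_precision(ranked_ids, relevant_ids):
             """AP for one topic: mean precision at each relevant hit."""
             hits, precision_sum = 0, 0.0
             for rank, doc_id in enumerate(ranked_ids, start=1):
                 if doc_id in relevant_ids:
                     hits += 1
                     precision_sum += hits / rank
             return precision_sum / len(relevant_ids) if relevant_ids else 0.0

         def mean_average_precision(runs, qrels):
             """runs: topic -> ranked doc IDs; qrels: topic -> set of relevant IDs."""
             aps = [average_precision(runs[t], qrels[t]) for t in qrels]
             return sum(aps) / len(aps)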

  12. Results

  13. Top bigrams in the “Tags” field (see the counting sketch below):
      Kobe Bryant, NBA Basketball, Lebron James, Michael Jordan, Los Angeles,
      All Star, Chicago Bulls, Boston Celtics, Allen Iverson, Angeles Lakers,
      Slam Dunk, Basketball NBA, Dwight Howard, Vince Carter, Dwyane Wade,
      Kevin Garnett, Toronto Raptors, Houston Rockets, Miami Heat, O’Neal,
      Phoenix Suns, Detroit Pistons, Tracy Mcgrady, Yao Ming, Chris Paul,
      Amazing Highlights, New York, Pau Gasol, Cleveland Cavaliers, NBA Amazing

      Summary (new language resource):
      • 61,340 XML documents (video pages) and ≈250,000 user profiles
      • Metadata: tags, comments, ratings, # views
      • IR test set: 40 topics + relevance assessments
      • Potential applications: NER, AH/personalisation, reranking using ML,
        sentiment analysis, multimedia processing
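
      A table like the one above can be reproduced by counting adjacent token pairs in each document’s “Tags” field; a minimal sketch (tokenisation here is simple whitespace splitting, which may differ from the authors’ processing):

         from collections import Counter

         def top_tag_bigrams(tag_fields, n=30):
             """Return the n most frequent adjacent token pairs across
             all 'tags' fields."""
             counts = Counter()
             for tags in tag_fields:
                 tokens = tags.split()
                 counts.update(zip(tokens, tokens[1:]))
             return counts.most_common(n)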

  14. Questions &amp; Answers
      Q: Is this collection available for free?
      A: No.
      Q: Can nothing be provided?
      A: Scripts + topics + relevance assessments (needs updating).
      Q: Any other questions?
      A: …

  15. Thank you
