
Automatic Discovery of Useful Facet Terms

Researchers and reporters spend a significant amount of time searching through news archives. This project aims to automatically discover and extract useful facet terms from news archives to improve search efficiency and relevance.


Presentation Transcript


  1. Automatic Discovery of Useful Facet Terms • Wisam Dakka (Columbia University) • Rishabh Dayal (Columbia University) • Panagiotis G. Ipeirotis (NYU)

  2. Searching the NYT Archive for Book Research

  3. Motivation: News Archive • Accessing and searching news archives is not an easy task • Researchers and reporters spend a large amount of time going through long query results • News archives are huge and span tens of years • Many relevant results • Results on the first page are not more relevant than results on the 5th or 10th page (NYT archive) • Search engines for news archives mainly follow one paradigm: • Search, skim through long results, modify the query, and search again • Goal: Multifaceted Interfaces (MI) over the Newsblaster news archive • Newsblaster archive • About 6 years of news from 24 news sources • Stories are clustered daily into hierarchies of topics and events • Events are threaded over time, summarized, and classified

  4. Motivation: MI for Newsblaster Archive • Our earlier work on multifaceted interfaces has some limitations [CIKM2005]: • Supervised learning: the facets our algorithm can identify must appear in the training set • WordNet hypernyms • WordNet has rather poor coverage of named entities • Free text collections • The quality of the hierarchies built on top of news stories was low

  5. Challenge: Automatic Extraction of the Useful Facets from News Archive • Automatically discover, in an unsupervised manner, a set of candidate facet terms from free text • Automatically group together facet terms that belong to the same facet • Build the appropriate browsing structure for each facet

  6. Intuition: Look for Facet Terms Elsewhere • Pilot study: 100 stories from The NYTimes • Common facets: Location, Institutes, History, People, Social Phenomenon, Markets, Nature, and Event • Sub-facets: Leaders under People, Corporations under Markets • Clear phenomenon: the terms for the useful facets do not usually appear in the news stories • A journalist writing a story about Jacques Chirac will not necessarily use the terms Political Leader, Europe, or France. Such missing terms are tremendously useful for identifying the appropriate facets for the story • We will look for these terms elsewhere • Target: terms that are infrequent in the original collection but frequent in the expanded documents

  7. Context-Aware Expansion [Figure. Example passage: "Murkowski made the announcement three days after BP said it would shut down a Prudhoe Bay oil field after a small leak was found. Energy officials have said pipeline repairs are likely to take months, curtailing Alaskan production into next year." Figure labels: Yahoo Term Extractor, Named Entities; Google, WordNet, Wikipedia; Google Text, WordNet Text, Wiki Text.]

  8. Useful Facet Terms are Elsewhere [Figure. Labels: Original Collection, Context-aware Collection, Infrequent Terms t_i]

  9. Term Frequency Analysis • Frequency-based shifting: because term frequencies are Zipfian, this favors terms that already have high frequencies (inverse problem) • Rank-shifting
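
The slide names the two shift measures without defining them. The sketch below shows one plausible reading, assumed here rather than taken from the slides: the frequency shift is the gain in document frequency after expansion, and the rank shift is the number of rank positions a term climbs. The function names, the tie-breaking rule, and the toy documents are illustrative only.

```python
from collections import Counter

def document_frequency(docs):
    """Count, for each term, how many documents contain it."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    return df

def ranks(df):
    """Rank terms by frequency (1 = most frequent); ties broken alphabetically."""
    ordered = sorted(df, key=lambda t: (-df[t], t))
    return {t: i + 1 for i, t in enumerate(ordered)}

def frequency_shift(df_orig, df_exp):
    """Assumed definition: gain in document frequency after expansion."""
    return {t: df_exp[t] - df_orig.get(t, 0) for t in df_exp}

def rank_shift(df_orig, df_exp):
    """Assumed definition: rank positions gained after expansion.
    Terms absent from the original collection get the worst possible rank."""
    r_orig, r_exp = ranks(df_orig), ranks(df_exp)
    worst = len(r_exp) + 1
    return {t: r_orig.get(t, worst) - r_exp[t] for t in r_exp}

# Toy collections: each document is a list of terms.
original = [
    ["the", "chirac", "visit"],
    ["the", "chirac", "summit"],
    ["the", "bp", "leak"],
]
expanded = [
    original[0] + ["france", "europe"],
    original[1] + ["france", "europe"],
    original[2] + ["alaska"],
]

df_orig = document_frequency(original)
df_exp = document_frequency(expanded)
f_shift = frequency_shift(df_orig, df_exp)
r_shift = rank_shift(df_orig, df_exp)
```

In this toy data the terms added by expansion get positive shifts under both measures, while a term such as "the", which is already frequent and gains nothing, gets zero. On a real collection the raw frequency gain would be largest for terms that were common to begin with, which is the bias the slide attributes to the Zipfian distribution; comparing ranks is one way to reduce it.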

  10. Summary: Candidate Facet Terms • For each document in the database, identify the important terms that are useful for characterizing the contents of the document • For each term in the original database, query the external resource and retrieve the terms that appear in the results. Add the retrieved terms to the original document to create an expanded, "context-aware" document • Analyze the frequency of the terms in both the original and the expanded database, and identify the candidate facet terms
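
The three steps above can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not the authors' implementation: `EXTERNAL_RESOURCE` is a hypothetical in-memory stand-in for lookups against resources such as Wikipedia or WordNet, `important_terms` is a stub for a real term extractor, and the `min_gain` threshold is an invented parameter.

```python
from collections import Counter

# Hypothetical stand-in for an external resource; a real system would
# query a service or knowledge base instead of a dictionary.
EXTERNAL_RESOURCE = {
    "jacques chirac": ["political leader", "france", "europe"],
    "bp": ["corporation", "energy"],
    "prudhoe bay": ["alaska", "oil field"],
}

def important_terms(doc_terms):
    """Step 1 (stub): keep the terms the external resource knows about."""
    return [t for t in doc_terms if t in EXTERNAL_RESOURCE]

def expand(doc_terms):
    """Step 2: append the context terms retrieved for each important term."""
    expanded = list(doc_terms)
    for term in important_terms(doc_terms):
        expanded.extend(EXTERNAL_RESOURCE[term])
    return expanded

def doc_freq(docs):
    """Count, for each term, how many documents contain it."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    return df

def candidate_facet_terms(docs, min_gain=2):
    """Step 3: keep terms whose document frequency grows by at least
    min_gain once the documents are expanded."""
    df_orig = doc_freq(docs)
    df_exp = doc_freq([expand(d) for d in docs])
    return sorted(t for t in df_exp if df_exp[t] - df_orig[t] >= min_gain)

docs = [
    ["jacques chirac", "visit", "bp"],
    ["jacques chirac", "summit"],
    ["bp", "prudhoe bay", "leak"],
]
candidates = candidate_facet_terms(docs)
```

None of the resulting candidates occur in the original toy documents; they only become visible after expansion, which mirrors the observation on slide 6 that useful facet terms are usually missing from the stories themselves.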

  11. Indicative

  12. Research in Progress • Cleaning and filtering • Grouping similar facet terms under one facet • Evaluation • The resulting candidate terms • The resulting hierarchies
