
On Burstiness-Aware Search for Document Sequences

Theodoros Lappas, Benjamin Arai, Manolis Platakis, Dimitrios Kotsakos, Dimitrios Gunopulos. SIGKDD 2009.


Presentation Transcript


  1. On Burstiness-Aware Search for Document Sequences • Theodoros Lappas, Benjamin Arai, Manolis Platakis, Dimitrios Kotsakos, Dimitrios Gunopulos • SIGKDD 2009

  2. Outline • The Problem: How to effectively search through large document sequences (e.g. newspapers) • Previous Work • Using Bursty Terms to identify Events • Modeling Burstiness using Discrepancy Theory • Our Search Framework • Experiments

  3. The Problem • Given a large sequence of documents (e.g. a daily newspaper) and a query of terms, find the documents that discuss major events relevant to the query. • Example: the San Francisco Call, a daily newspaper from the 1900s. • We are given the query <theater, disaster> • Two candidate events are relevant to the query: • The disastrous fire of 1903 at the Iroquois Theater in Chicago • A disastrous performance given by an actor in a local theater • The first event is far more influential, so articles about it should be ranked higher.

  4. Previous Work • Burstiness explored in different domains • Burst Detection - Kleinberg 2002 • Stream clustering - He et al. 2007 • Graph Evolution - Kumar et al. 2003 • Event Detection - Fung et al. 2005 • Nothing on Burstiness-aware Search: • Standard Information Retrieval techniques do not consider the underlying events discussed in the collection. • Event Detection Techniques do not consider user input.

  5. Burstiness • Major events are discussed in numerous articles over an extended timeframe. • The event’s keywords exhibit high-frequency bursts during that timeframe. • Bursty periods: periods of “unusually” high frequency • Unusual? Deviating from an expected baseline. • [Figure: frequency of the term “earthquake” as it appeared in the SF Call, 1908-1909]

  6. Modeling Burstiness using Discrepancy Theory • Discrepancy: used to express and quantify the deviation from the norm • In our case: find the intervals on the timeline where the observed frequency differs the most from the expected frequency • Maximal Interval: one that does not include, and is not included in, an interval of higher score. • MAX-1: Linear-time algorithm for maximal interval extraction.
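As a rough illustration of the discrepancy idea on this slide, the sketch below scores an interval by the summed difference between observed and expected frequency and returns the single highest-scoring interval. It uses a standard maximum-sum scan (Kadane's algorithm), not the MAX-1 algorithm from the talk, whose details are not given in this transcript. The function name and the sum-of-deviations score are assumptions made for the example.

```python
from typing import List, Tuple


def top_discrepancy_interval(
    observed: List[float], expected: List[float]
) -> Tuple[int, int, float]:
    """Return (start, end, score) for the contiguous interval (end inclusive)
    whose summed deviation observed[i] - expected[i] is largest.

    Illustrative stand-in only: a maximum-sum scan, NOT the MAX-1 algorithm.
    """
    if not observed or len(observed) != len(expected):
        raise ValueError("sequences must be non-empty and equally long")

    best = (0, 0, observed[0] - expected[0])
    run_start, run_sum = 0, 0.0
    for i, (obs, exp) in enumerate(zip(observed, expected)):
        # A non-positive running sum cannot help a later interval: restart here.
        if run_sum <= 0:
            run_start, run_sum = i, 0.0
        run_sum += obs - exp
        if run_sum > best[2]:
            best = (run_start, i, run_sum)
    return best


# Hypothetical daily counts for one term against a flat baseline of 3 per day.
observed = [2, 3, 9, 12, 8, 2, 1]
expected = [3] * len(observed)
burst = top_discrepancy_interval(observed, expected)
```

With these made-up counts the burst covers positions 2 through 4, where the term appears far more often than the baseline predicts.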

  7. Baseline - Discussion • The baseline can be dynamic, for example: • Frequency sequence(s) from previous year(s) • Time series decomposition to extract seasonal, trend and irregular components
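One way to read the first option is to take the expected frequency at each time step as the average of the same step in earlier years. The sketch below assumes each year is stored as an equally long list of per-step counts. The function name and the data layout are hypothetical, and the decomposition option is not shown.

```python
from statistics import mean
from typing import Dict, List


def baseline_from_previous_years(
    history: Dict[int, List[float]], year: int
) -> List[float]:
    """Expected frequency per time step: the mean of that step over all
    years strictly before `year`. (Assumed layout: year -> aligned counts.)"""
    earlier = [seq for y, seq in history.items() if y < year]
    if not earlier:
        raise ValueError("no earlier year available to build a baseline")
    # zip(*earlier) groups the values that share the same time step.
    return [mean(step) for step in zip(*earlier)]


# Hypothetical counts for one term over three time steps per year.
history = {1906: [2, 4, 6], 1907: [4, 6, 8], 1908: [30, 5, 7]}
expected_1908 = baseline_from_previous_years(history, 1908)
```

The 1908 counts can then be compared against `expected_1908` to see where the term deviates from its usual level.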

  8. A Diagram of our framework

  9. Phase 1: Preprocessing • Input: a raw document sequence • Output: the set of terms to be monitored • Preprocessing methods: • Stemming, synonym matching, etc. • Stopword removal • Frequency pruning for rare words
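A minimal sketch of two of these steps, stopword removal and frequency pruning, might look as follows. The stopword list, the document-frequency threshold and the function names are assumptions made for the example. Stemming and synonym matching are left out.

```python
import re
from collections import Counter
from typing import Iterable, List, Set

# Tiny illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "of", "in", "and", "at"}


def tokenize(text: str) -> List[str]:
    """Lowercase, keep alphabetic tokens, drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def monitored_terms(docs: Iterable[str], min_doc_freq: int = 2) -> Set[str]:
    """Keep terms that occur in at least `min_doc_freq` documents
    (frequency pruning for rare words; the threshold is an assumption)."""
    doc_freq: Counter = Counter()
    for doc in docs:
        doc_freq.update(set(tokenize(doc)))  # count each term once per document
    return {term for term, n in doc_freq.items() if n >= min_doc_freq}


docs = ["Fire at the theater", "The theater fire inquiry", "A quiet day in port"]
terms = monitored_terms(docs)
```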

  10. Phase 2 – Retrieval of Bursty Intervals • Input: A term • Output: Set of non-overlapping intervals + their burstiness scores • Create the frequency sequence for the term. • Extract bursty intervals using the MAX-1 algorithm
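The first step of this phase, building the frequency sequence for a term, can be sketched as below. The example assumes documents arrive as (publication date, text) pairs and counts, per day, how many documents contain the term. Both the layout and the choice of counting documents rather than occurrences are assumptions. Interval extraction with MAX-1 is not shown.

```python
from datetime import date
from typing import List, Tuple


def frequency_sequence(
    docs: List[Tuple[date, str]], term: str, start: date, end: date
) -> List[int]:
    """Number of documents per day (start..end inclusive) that contain `term`."""
    seq = [0] * ((end - start).days + 1)
    for published, text in docs:
        if start <= published <= end and term in text.lower().split():
            seq[(published - start).days] += 1
    return seq


# Hypothetical mini-corpus.
docs = [
    (date(1906, 4, 18), "earthquake hits city"),
    (date(1906, 4, 18), "Earthquake damage"),
    (date(1906, 4, 19), "harbor news"),
    (date(1906, 4, 20), "earthquake relief"),
]
seq = frequency_sequence(docs, "earthquake", date(1906, 4, 17), date(1906, 4, 20))
```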

  11. Phase 3 – Interval Indexing • Input: Set of bursty intervals for each term • Output: An Index of Intervals • Simple, easily updatable structure • Need to support multi-term queries
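The transcript does not preserve the layout of the index, so the following is only one plausible shape for a simple, updatable structure: a map from each term to its bursty intervals, kept sorted by descending score so that a top-k procedure can read them in order. The class and field names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Interval(NamedTuple):
    start: int    # first time step of the burst
    end: int      # last time step of the burst (inclusive)
    score: float  # burstiness score


class IntervalIndex:
    """Assumed structure: term -> intervals sorted by descending score."""

    def __init__(self) -> None:
        self._lists: Dict[str, List[Interval]] = defaultdict(list)

    def add(self, term: str, interval: Interval) -> None:
        intervals = self._lists[term]
        intervals.append(interval)
        # Re-sort on every insert: simple, and fine for a sketch.
        intervals.sort(key=lambda iv: iv.score, reverse=True)

    def intervals(self, term: str) -> List[Interval]:
        return list(self._lists.get(term, []))


index = IntervalIndex()
index.add("theater", Interval(40, 45, 1.0))
index.add("theater", Interval(10, 20, 5.0))
index.add("disaster", Interval(15, 30, 4.0))
```

A multi-term query then reads one sorted list per query term.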

  12. Inverted Interval Index • Up Next: Query Evaluation

  13. Phase 4: Top-k Evaluation for Multi-Term Queries • Customized version of the Threshold Algorithm (TA) for top-k evaluation. • Standard version: • Terms-to-Documents • Each document either appears in a term’s list or not • Our version (TA*): • Terms-to-Intervals • A bursty interval of a term t1 may overlap multiple intervals of a term t2. • Up Next: Experiments
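To make the terms-to-intervals complication concrete, the sketch below enumerates every combination of one interval per query term, keeps the combinations that share a common time span, and ranks them. This is a brute-force baseline for illustration, not TA*: it has none of the early termination that a threshold algorithm provides. Scoring an overlap by the sum of the interval scores is also an assumption.

```python
import heapq
from itertools import product
from typing import List, Tuple

Interval = Tuple[int, int, float]  # (start, end inclusive, burstiness score)


def top_k_overlaps(
    term_lists: List[List[Interval]], k: int
) -> List[Tuple[float, int, int]]:
    """Return up to k (score, start, end) triples, best first, where each
    triple is the common span of one interval from every term's list."""
    candidates = []
    for combo in product(*term_lists):
        start = max(iv[0] for iv in combo)
        end = min(iv[1] for iv in combo)
        if start <= end:  # the chosen intervals share at least one time step
            candidates.append((sum(iv[2] for iv in combo), start, end))
    return heapq.nlargest(k, candidates)


# Hypothetical bursty intervals for the query <theater, disaster>.
theater = [(10, 20, 5.0), (40, 45, 1.0)]
disaster = [(15, 30, 4.0), (44, 50, 2.0)]
ranked = top_k_overlaps([theater, disaster], k=2)
```

Here the span 15 to 20, where both terms burst strongly, outranks the weaker overlap at 44 to 45.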

  14. Empirical Evaluation • San Francisco Call: a daily newspaper with publication dates between 1900 and 1909 • ~400,000 articles • List of major events from 1900-1909 (from Wikipedia) + a query for each event.

  15. Major Events List

  16. Experiment 1 - Query Expansion • Submit respective query for each event in Major Events List. • Get top interval • Report the 10 terms that appear in the most document titles within the interval
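The reporting step can be sketched as a document-frequency count over titles that fall inside the top interval. The example assumes titles are available as (time step, title) pairs and tokenizes on whitespace. The function name and the `exclude` parameter, for dropping stopwords or the query terms themselves, are hypothetical.

```python
from collections import Counter
from typing import FrozenSet, List, Tuple


def expansion_terms(
    titles: List[Tuple[int, str]],
    start: int,
    end: int,
    n: int = 10,
    exclude: FrozenSet[str] = frozenset(),
) -> List[str]:
    """Return the n terms that appear in the most titles within [start, end]."""
    counts: Counter = Counter()
    for step, title in titles:
        if start <= step <= end:
            # A set counts each term at most once per title.
            counts.update(set(title.lower().split()) - exclude)
    return [term for term, _ in counts.most_common(n)]


# Hypothetical titles keyed by time step.
titles = [
    (1, "Bleriot flight"),
    (2, "Bleriot channel flight"),
    (3, "Bleriot lands"),
    (9, "Harbor news"),
]
top_terms = expansion_terms(titles, start=1, end=3, n=2)
```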

  17. Example 1 • Event: King Umberto I of Italy is assassinated by Italian-born anarchist Gaetano Bressi. • Query: “king assassination” • Top 10 terms: Umberto, july, state, anarchist, italy, unit, Rome, Bressi, general, police

  18. Example 2 • Event: Louis Bleriot is the first man to fly across the English Channel in an aircraft. • Query: “English channel” • Top 10 terms: flight, july, miles, cross, aviator, attempt, return, Bleriot, condition, machine

  19. Experiment 2 – Burst Detection • Submit the respective query for each event in the Major Events List. • Get the top reported interval • Compare it with the actual event date • We use MAX-1 and MAX-2 to extract bursty intervals. • MAX-2: • Re-run MAX-1 on each interval • Obtain a nested structure

  20. Examples • Event: A fire at the Iroquois Theater in Chicago kills 600. • Query: <theater, disaster> • Event: A fire aboard the steamboat General Slocum in New York City’s East River kills 1,021. • Query: <steamboat, disaster>

  21. Conclusion • The 1st efficient end-to-end framework for burstiness-aware search in document sequences. • Future Work: • Evaluate on even larger Corpora • Evaluate on more types of text

  22. Thank you!!!
