On Burstiness-Aware Search for Document Sequences. Theodoros Lappas, Benjamin Arai, Manolis Platakis, Dimitrios Kotsakos, Dimitrios Gunopulos. SIGKDD 2009.
Outline • The Problem: How to effectively search through large document sequences (e.g. newspapers) • Previous Work • Using Bursty Terms to identify Events • Modeling Burstiness using Discrepancy Theory • Our Search Framework • Experiments
The Problem • Given a large sequence of documents (e.g. a daily newspaper) and a query of terms, find documents that discuss major events relevant to the query. • Consider the San Francisco Call, a daily newspaper from the 1900s • We are given the query <theater, disaster> • Two candidate events, relevant to the query: • The disastrous fire of 1903 in the Iroquois Theater in Chicago • A disastrous performance given by an actor in a local theater • The first event is far more influential, so articles on it should be ranked higher.
Previous Work • Burstiness explored in different domains • Burst Detection - Kleinberg 2002 • Stream clustering - He et al. 2007 • Graph Evolution - Kumar et al. 2003 • Event Detection - Fung et al. 2005 • Nothing on Burstiness-aware Search: • Standard Information Retrieval techniques do not consider the underlying events discussed in the collection. • Event Detection Techniques do not consider user input.
Burstiness • Major events are discussed in numerous articles for an extended timeframe. • The event’s keywords exhibit high-frequency bursts during the timeframe • Bursty periods: periods of “unusually” high frequency • Unusual? Deviating from an expected baseline. • Figure: Frequency of the term “earthquake” in the SF Call (1908-1909).
Modeling Burstiness using Discrepancy Theory • Discrepancy: Used to express and quantify the deviation from the norm • In our case: find intervals on the timeline where the observed frequency differs the most from the expected frequency • Maximal Interval: One that does not include and is not included in an interval of higher score. • MAX-1: Linear-Time Algorithm for Maximal Interval Extraction.
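A minimal sketch of the scoring and extraction idea above, under stated assumptions. The per-timestamp score is assumed to be the term's observed share of its total frequency minus an expected share taken from a baseline. Maximal intervals are extracted with the Ruzzo-Tompa maximal-scoring-subsequence procedure, used here as a stand-in because the slides do not spell out MAX-1. The names `burst_scores` and `maximal_intervals` are hypothetical, and this simple version rescans its candidate list instead of keeping the extra bookkeeping needed for linear time.

```python
def burst_scores(observed, expected):
    """Per-timestamp discrepancy: observed share minus expected share.

    Assumption: both sequences are aligned, non-empty, and have positive totals.
    """
    obs_total = sum(observed)
    exp_total = sum(expected)
    return [o / obs_total - e / exp_total for o, e in zip(observed, expected)]


def maximal_intervals(scores):
    """Return (start, end, score) triples, end exclusive, for all maximal
    scoring intervals (Ruzzo-Tompa procedure, simple non-linear variant)."""
    segs = []      # each entry: [start, end, L, R] with L/R = cumulative sums
    total = 0.0    # cumulative score seen so far
    for i, s in enumerate(scores):
        left = total
        total += s
        if s <= 0:
            continue  # non-positive scores never start a candidate
        cur = [i, i + 1, left, total]
        while True:
            # Find the rightmost earlier candidate whose L is below cur's L.
            j = len(segs) - 1
            while j >= 0 and segs[j][2] >= cur[2]:
                j -= 1
            if j < 0 or segs[j][3] >= cur[3]:
                segs.append(cur)
                break
            # Otherwise extend cur leftwards over candidates j..end and retry.
            cur = [segs[j][0], cur[1], segs[j][2], cur[3]]
            del segs[j:]
    return [(a, b, r - l) for a, b, l, r in segs]
```

With a flat baseline, a single spike in the observed sequence yields one interval covering just the spike; intervals with higher scores correspond to stronger deviations from the baseline.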
Baseline - Discussion • Baseline can be dynamic : • frequency sequence(s) from previous year(s) • Time Series Decomposition to extract Seasonal, Trend and Irregular Components
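One way the "previous year(s)" option could look, as a rough sketch only: average the aligned frequency sequences from earlier years and add a small smoothing constant. The function name `baseline_from_history` and the `smoothing` parameter are assumptions for illustration; the decomposition-based baseline is not shown.

```python
def baseline_from_history(previous_years, smoothing=1.0):
    """Average aligned frequency sequences from earlier years.

    previous_years: list of equal-granularity frequency sequences, one per year.
    smoothing (assumed): added so that a timestamp with no past occurrences
    still has a nonzero expected frequency.
    """
    length = min(len(seq) for seq in previous_years)  # truncate to shortest
    n_years = len(previous_years)
    return [
        sum(seq[i] for seq in previous_years) / n_years + smoothing
        for i in range(length)
    ]
```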
Phase 1: Preprocessing • Input: A raw document sequence • Output: The set of terms to be monitored • Preprocessing Methods: • Stemming, Synonym matching, etc. • Stopwords Removal • Frequency Pruning for rare words
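A small sketch of the preprocessing phase, covering only stopword removal and frequency pruning; stemming and synonym matching are left out. The stopword list, the `min_doc_freq` threshold, and the function names are illustrative assumptions, not values from the slides.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (assumption); a real list would be larger.
STOPWORDS = frozenset({"the", "a", "an", "of", "in", "and", "to", "at", "by", "is"})


def tokenize(text):
    """Lowercase, keep alphabetic tokens, and drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def monitored_terms(documents, min_doc_freq=2):
    """Return the terms that occur in at least min_doc_freq documents."""
    doc_freq = Counter()
    for text in documents:
        doc_freq.update(set(tokenize(text)))  # count each term once per document
    return {term for term, n in doc_freq.items() if n >= min_doc_freq}
```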
Phase 2 – Retrieval of Bursty Intervals • Input: A term • Output: Set of non-overlapping intervals + their burstiness scores • Create the frequency sequence for the term. • Extract bursty intervals using the MAX-1 algorithm
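The first step, creating the frequency sequence for a term, might be sketched as follows. Daily granularity and counting term occurrences (as opposed to counting documents that contain the term) are assumptions; `frequency_sequence` is a hypothetical helper name.

```python
from datetime import date


def frequency_sequence(documents, term, start, end):
    """Build one count per day in [start, end] for the given term.

    documents: iterable of (publication_date, list_of_tokens) pairs.
    """
    n_days = (end - start).days + 1
    freq = [0] * n_days
    for day, tokens in documents:
        if start <= day <= end:
            freq[(day - start).days] += tokens.count(term)
    return freq
```

The resulting sequence is what a burst-extraction step would consume, one entry per timestamp.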
Phase 3 – Interval Indexing • Input: Set of bursty intervals for each term • Output: An Index of Intervals • Simple, easily updatable structure • Need to support multi-term queries
Inverted Interval Index Up Next: Query Evaluation
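As a rough illustration of an inverted interval index, the sketch below maps each term to its bursty intervals, kept in descending order of burstiness score so that the highest-scoring intervals are read first. The class and method names are hypothetical, and the layout is an assumption rather than the structure used by the authors.

```python
from collections import defaultdict


class IntervalIndex:
    """Maps a term to a list of (start, end, score) intervals, best score first."""

    def __init__(self):
        self._postings = defaultdict(list)

    def add(self, term, start, end, score):
        """Insert one bursty interval; re-sorting keeps updates simple."""
        postings = self._postings[term]
        postings.append((start, end, score))
        postings.sort(key=lambda iv: iv[2], reverse=True)

    def lookup(self, term):
        """Return a copy of the term's intervals, or an empty list if unseen."""
        return list(self._postings.get(term, []))
```

A multi-term query would call `lookup` once per term and combine the returned lists.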
Phase 4: Top-k Evaluation for Multi-Term Queries • Customized Version of the Threshold Algorithm (TA) for top-k Evaluation. • Standard Version: • Terms-to-Documents • Each document either appears in a term’s list or not • Our Version (TA*): • Terms-to-Intervals • A bursty interval of a term t1 may overlap multiple intervals of a term t2. Up Next: Experiments
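To make the terms-to-intervals point concrete, here is an exhaustive sketch that intersects one interval per query term and ranks the overlaps. It is not the Threshold Algorithm or TA*: it enumerates every combination, with no early termination. Scoring an overlap by the sum of the contributing burstiness scores is an assumption, and `top_k_overlaps` is a hypothetical name.

```python
import heapq
from itertools import product


def top_k_overlaps(interval_lists, k):
    """Rank the time spans where one interval from every term overlaps.

    interval_lists: one list of (start, end, score) per query term, end exclusive.
    Returns up to k (combined_score, start, end) triples, best first.
    """
    candidates = []
    for combo in product(*interval_lists):
        start = max(iv[0] for iv in combo)
        end = min(iv[1] for iv in combo)
        if start < end:  # all chosen intervals share a common span
            candidates.append((sum(iv[2] for iv in combo), start, end))
    return heapq.nlargest(k, candidates)
```

Note that a single interval of one term can contribute to several results, which is the situation the standard terms-to-documents setting never has to handle.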
Empirical Evaluation • San Francisco Call: a daily newspaper with publication dates between 1900 and 1909, ~400,000 articles • List of Major Events from 1900-1909 (from Wikipedia) + query for each event.
Experiment 1 - Query Expansion • Submit respective query for each event in Major Events List. • Get top interval • Report the 10 terms that appear in the most document titles within the interval
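The reporting step could be sketched like this: count, for each term, the number of document titles inside the top interval that contain it, then keep the most frequent terms. The input shape `(timestamp, title_tokens)` and the name `expansion_terms` are assumptions; whether the original query terms are excluded is not stated in the slides, so they are kept here.

```python
from collections import Counter


def expansion_terms(titled_docs, start, end, n=10):
    """Return the n terms appearing in the most titles within [start, end).

    titled_docs: iterable of (timestamp, title_tokens) pairs.
    """
    counts = Counter()
    for ts, title_tokens in titled_docs:
        if start <= ts < end:
            counts.update(set(title_tokens))  # one vote per title
    return [term for term, _ in counts.most_common(n)]
```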
Example 1 • Event: King Umberto I of Italy is assassinated by Italian-born anarchist Gaetano Bressi. • Query: “king assassination” • Top 10 title terms in the top interval: Umberto, july, state, anarchist, italy, unit, Rome, Bressi, general, police
Example 2 • Event: Louis Bleriot is the first man to fly across the English Channel in an aircraft. • Query: “English channel” • Top 10 title terms in the top interval: flight, july, miles, cross, aviator, attempt, return, Bleriot, condition, machine
Experiment 2 – Burst Detection • Submit respective query for each event in Major Events List. • Get top reported interval • Compare with actual event date • We use MAX-1, MAX-2 to extract bursty intervals. • MAX-2 : • Re-run MAX-1 on each interval • Obtain nested structure
Examples • Event: A fire at the Iroquois Theater in Chicago kills 600. • Query: <theater, disaster> • Event: A fire aboard the steamboat General Slocum in New York City’s East River kills 1,021. • Query: <steamboat, disaster>
Conclusion • The 1st efficient end-to-end framework for burstiness-aware search in document sequences. • Future Work: • Evaluate on even larger Corpora • Evaluate on more types of text