On Burstiness-Aware Search for Document Sequences

Theodoros Lappas, Benjamin Arai, Manolis Platakis, Dimitrios Kotsakos, Dimitrios Gunopulos • SIGKDD 2009

Outline • The Problem: How to effectively search through large document sequences (e.g. newspapers) • Previous Work • Using Bursty Terms to identify Events • Modeling Burstiness using Discrepancy Theory • Our Search Framework • Experiments

The Problem • Given a large sequence of documents (e.g. a daily newspaper) and a query of terms, find documents that discuss major events relevant to the query. • Consider the San Francisco Call, a daily newspaper from the 1900s • We are given the query <theater, disaster> • Two candidate events are relevant to the query: • The disastrous fire of 1903 in the Iroquois Theater in Chicago • A disastrous performance given by an actor in a local theater • Clearly the first event is far more influential: articles on this event should be ranked higher!

Previous Work • Burstiness explored in different domains • Burst Detection - Kleinberg 2002 • Stream clustering - He et al. 2007 • Graph Evolution - Kumar et al. 2003 • Event Detection - Fung et al. 2005 • Nothing on Burstiness-aware Search: • Standard Information Retrieval techniques do not consider the underlying events discussed in the collection. • Event Detection Techniques do not consider user input.

Burstiness • Major events are discussed in numerous articles over an extended timeframe. • The event’s keywords exhibit high-frequency bursts during that timeframe. • Bursty periods: periods of “unusually” high frequency • Unusual? Deviating from an expected baseline. • Figure: frequency of the term “earthquake” as it appeared in the SF Call (1908-1909).

Modeling Burstiness using Discrepancy Theory • Discrepancy: used to express and quantify the deviation from the norm • In our case: find intervals on the timeline where the observed frequency differs the most from the expected frequency • Maximal Interval: one that does not include, and is not included in, an interval of higher score. • MAX-1: Linear-Time Algorithm for Maximal Interval Extraction.
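The idea of scoring an interval by its deviation from the baseline can be sketched as follows. This is not MAX-1: it is a plain Kadane-style maximum-sum scan that returns only the single highest-scoring interval, and it assumes the score of an interval is the sum of (observed - expected) over its time steps. The function name and the scoring choice are illustrative assumptions, not taken from the slides.

```python
from typing import List, Tuple


def max_discrepancy_interval(
    observed: List[float], expected: List[float]
) -> Tuple[int, int, float]:
    """Return (start, end, score) of the highest-scoring interval, end inclusive.

    Assumption: score = sum(observed[i] - expected[i]) over the interval.
    """
    if not observed or len(observed) != len(expected):
        raise ValueError("need two non-empty sequences of equal length")

    best_start = best_end = 0
    best_score = observed[0] - expected[0]
    cur_start = 0
    cur_score = 0.0

    for i, (obs, exp) in enumerate(zip(observed, expected)):
        # A non-positive running score cannot help a later interval: restart here.
        if cur_score <= 0:
            cur_start = i
            cur_score = 0.0
        cur_score += obs - exp
        if cur_score > best_score:
            best_start, best_end, best_score = cur_start, i, cur_score

    return best_start, best_end, best_score
```

With a flat baseline of 4 and observed counts [2, 3, 9, 12, 10, 3, 2], the scan picks out the three middle time steps as the burst.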

Baseline - Discussion • Baseline can be dynamic : • frequency sequence(s) from previous year(s) • Time Series Decomposition to extract Seasonal, Trend and Irregular Components
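One way to build a dynamic baseline from earlier years is sketched below. It assumes the frequency sequences from previous years are already aligned position by position (e.g. day 1 with day 1) and simply averages them; the averaging and the function name are assumptions made for illustration, since the slide only states that previous years' sequences can serve as the baseline.

```python
from typing import List, Sequence


def baseline_from_previous_years(history: Sequence[Sequence[float]]) -> List[float]:
    """Average aligned frequency sequences from earlier years, position by position."""
    if not history:
        raise ValueError("need at least one earlier sequence")
    length = len(history[0])
    if any(len(seq) != length for seq in history):
        raise ValueError("sequences must be aligned to the same length")
    return [sum(seq[i] for seq in history) / len(history) for i in range(length)]
```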

A Diagram of our framework

Phase 1: Preprocessing • Input: a raw document sequence • Output: the set of terms to be monitored • Preprocessing methods: • Stemming, synonym matching, etc. • Stopword removal • Frequency pruning for rare words
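A minimal sketch of this phase, covering only stopword removal and frequency pruning (stemming and synonym matching are left out). The tiny stopword list, the `min_doc_freq` threshold, and the function name are hypothetical choices for illustration.

```python
import re
from collections import Counter
from typing import Iterable, Set

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "of", "in", "and", "to"}


def monitored_terms(documents: Iterable[str], min_doc_freq: int = 2) -> Set[str]:
    """Return terms that survive stopword removal and rare-word pruning."""
    doc_freq: Counter = Counter()
    for text in documents:
        # Count each term once per document (document frequency).
        tokens = set(re.findall(r"[a-z]+", text.lower()))
        doc_freq.update(tokens - STOPWORDS)
    return {term for term, df in doc_freq.items() if df >= min_doc_freq}
```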

Phase 2 – Retrieval of Bursty Intervals • Input: A term • Output: Set of non-overlapping intervals + their burstiness scores • Create the frequency sequence for the term. • Extract bursty intervals using the MAX-1 algorithm
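The first step, creating the frequency sequence for a term, might look like the sketch below. It assumes documents are already grouped by time step (for example, one list of article texts per day) and that frequency means the number of documents containing the term; both are assumptions, as is the function name.

```python
from typing import List, Sequence


def frequency_sequence(term: str, docs_per_step: Sequence[Sequence[str]]) -> List[int]:
    """Count, for each time step, how many documents contain the term."""
    term = term.lower()
    return [
        sum(1 for text in docs if term in text.lower().split())
        for docs in docs_per_step
    ]
```

The resulting sequence, together with a baseline, is what an interval extractor would then consume.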

Phase 3 – Interval Indexing • Input: Set of bursty intervals for each term • Output: An Index of Intervals • Simple, easily updatable structure • Need to support multi-term queries

Inverted Interval Index • Up Next: Query Evaluation
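A minimal sketch of such an index, assuming it maps each term to a list of its bursty intervals. Keeping each list sorted by descending score is an assumption made here because it suits threshold-style top-k evaluation; the class and field names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class BurstyInterval(NamedTuple):
    start: int    # first time step of the burst
    end: int      # last time step of the burst (inclusive)
    score: float  # burstiness score


class IntervalIndex:
    """Term -> bursty intervals, each list kept sorted by descending score."""

    def __init__(self) -> None:
        self._postings: Dict[str, List[BurstyInterval]] = defaultdict(list)

    def add(self, term: str, interval: BurstyInterval) -> None:
        postings = self._postings[term]
        postings.append(interval)
        postings.sort(key=lambda iv: iv.score, reverse=True)

    def lookup(self, term: str) -> List[BurstyInterval]:
        # Return a copy so callers cannot disturb the stored order.
        return list(self._postings.get(term, []))
```

Adding a newly detected interval touches only one term's list, which keeps the structure easy to update.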

Phase 4: Top-k Evaluation for Multi-Term Queries • Customized version of the Threshold Algorithm (TA) for top-k evaluation. • Standard version: • Terms-to-Documents • Each document either appears in a term’s list or not • Our version (TA*): • Terms-to-Intervals • A bursty interval of a term t1 may overlap multiple intervals of a term t2. • Up Next: Experiments
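To illustrate why intervals complicate the evaluation, here is an exhaustive baseline rather than TA*: it tries every combination of one interval per query term, keeps the combinations whose intervals share a common overlap, and ranks the overlaps. Scoring an overlap by the sum of the contributing burstiness scores is an assumption, and the names below are hypothetical.

```python
import heapq
from itertools import product
from typing import Dict, List, Sequence, Tuple

Interval = Tuple[int, int, float]  # (start, end inclusive, burstiness score)


def top_k_overlaps(
    index: Dict[str, List[Interval]], query: Sequence[str], k: int
) -> List[Interval]:
    """Return the k best common overlaps across all query terms."""
    if not query:
        return []
    lists = [index.get(term, []) for term in query]
    candidates: List[Interval] = []
    for combo in product(*lists):
        start = max(iv[0] for iv in combo)
        end = min(iv[1] for iv in combo)
        if start <= end:  # all intervals in the combination overlap
            candidates.append((start, end, sum(iv[2] for iv in combo)))
    return heapq.nlargest(k, candidates, key=lambda c: c[2])
```

In the test data below, one interval for "theater" overlaps two different intervals for "disaster", which is exactly the case the standard terms-to-documents setting never produces.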

Empirical Evaluation • San Francisco Call: a daily newspaper; the collection covers publication dates from 1900 to 1909 (~400,000 articles). • List of major events from 1900-1909 (from Wikipedia), plus a query for each event.

Major Events List

Experiment 1 - Query Expansion • Submit respective query for each event in Major Events List. • Get top interval • Report the 10 terms that appear in the most document titles within the interval
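The reporting step can be sketched as below, assuming each title comes with an integer time step and that a term is counted at most once per title. The function name and the data layout are hypothetical; stopword filtering is omitted for brevity.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def expansion_terms(
    titles: Iterable[Tuple[int, str]], start: int, end: int, n: int = 10
) -> List[str]:
    """Return the n terms appearing in the most titles dated within [start, end]."""
    counts: Counter = Counter()
    for day, title in titles:
        if start <= day <= end:
            # A set ensures each term is counted once per title.
            counts.update(set(title.lower().split()))
    return [term for term, _ in counts.most_common(n)]
```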

Example 1 • Event: King Umberto I of Italy is assassinated by Italian-born anarchist Gaetano Bressi. • Query: “king assassination” • Reported terms: Umberto, july, state, anarchist, italy, unit, Rome, Bressi, general, police

Example 2 • Event: Louis Bleriot is the first man to fly across the English Channel in an aircraft. • Query: “English channel” • Reported terms: flight, july, miles, cross, aviator, attempt, return, Bleriot, condition, machine

Experiment 2 – Burst Detection • Submit the respective query for each event in the Major Events List. • Get the top reported interval • Compare with the actual event date • We use MAX-1 and MAX-2 to extract bursty intervals. • MAX-2: • Re-run MAX-1 on each interval • Obtain a nested structure

Examples • Event: A fire at the Iroquois Theater in Chicago kills 600. • Query: <theater, disaster> • Event: A fire aboard the steamboat General Slocum in New York City’s East River kills 1,021. • Query: <steamboat, disaster>

Conclusion • The 1st efficient end-to-end framework for burstiness-aware search in document sequences. • Future Work: • Evaluate on even larger Corpora • Evaluate on more types of text

Thank you!!!
