An Introduction to Detection of News Events Mouli Venkataramani References • James Allan et al., Topic Detection and Tracking Pilot Study: Final Report, Proceedings of the DARPA Broadcast News Transcription and Understanding Workshop, Feb 1998. • Yiming Yang et al., Learning Approaches for Detecting and Tracking News Events
OUTLINE • Importance of news • Terminology • Event Evolution • Patterns in Event Distribution • TDT • Major Tasks • New Event Detection • Clustering • On-Line New Event Detection
Importance of News Examples • A person returns from an extended vacation and needs to find out quickly what happened in the world • A foreign policy specialist wants to study the Asian economic crisis • Query-based retrieval is useful only when one knows precisely the nature of the events or facts one is seeking • Retrieval based on immediate-content-focussed queries is often insufficient for tracking the gradual evolution of events through time
News in Financial World • Impact of news on stock prices is a phenomenon that has been widely studied in the financial world. Examples of news are • Earnings reports • Splits • Merger Talks • Good News/ Bad News
Some Terminology • Topic • A seminal event or activity along with all directly related events and activities • Event • Something that happens at some specific time and place • Event vs. topic • The property of time is what distinguishes an event from the more general topic • Example event • Computer virus detected at British Telecom, March 3, 1993 • Example topic • Computer virus outbreaks
Event Evolution • As an event evolves, new lexical features appear • Example • Oklahoma City bombing
Patterns in Event Distributions • News stories discussing the same event tend to be temporally proximate • A time gap between bursts of topically similar stories is often an indication of different events • Different earthquakes • Airplane accidents • A significant vocabulary shift and rapid changes in term frequency, including previously unseen proper nouns, are typical of stories reporting a new event • Events are typically reported in a relatively brief time window of 1-4 weeks
TDT & The Corpus • TDT • Topic detection and tracking • A corpus of text and transcribed news has been developed to support the TDT study effort • This study corpus spans the period from July 1, 1994 to June 30, 1995 • Includes 16,000 stories, half from Reuters newswire and half from CNN broadcast news • Stories are arranged in chronological order • A set of 25 target events has been identified to support the TDT effort
Tasks in News Detection • [Diagram: News Feeds feed three tasks: Segmentation, Detection (Retro and On-Line), and Tracking]
Tasks Explained • Segmentation • Defined as the task of segmenting a continuous stream of text into its constituent stories, i.e., locating the boundaries between adjacent stories • Detection • Characterized by a lack of knowledge of the event to be detected. Takes one of the following forms • Retrospective detection, where the task is to identify all the events in a corpus of stories • On-line new event detection, where the task is to identify new events in a stream of stories • Tracking • Defined as the task of associating incoming stories with events known to the system
New Event Detection • New event detection is an unsupervised learning task • Retrospective detection discovers previously unidentified events in an accumulated collection • On-line detection flags the onset of new events from live news feeds • The system has no advance knowledge of new events, but has access to unlabeled historical data as a contrast set • The input to on-line detection is the stream of TDT stories in chronological order, simulating real-time incoming documents • The output of on-line detection is a YES/NO decision per document
Clustering in Information Retrieval • Document clustering is an unsupervised process that groups documents with similar content • Clustering methods group documents that share overlapping sets of words • Used effectively in query-based retrieval systems such as web search engines • Improves speed and effectiveness, because the query is matched against the clusters instead of all documents and the best-matching cluster is then returned • Agglomerative clustering and single pass clustering are most commonly used
Clustering Algorithms • Agglomerative clustering (reviewed in class) • Single pass clustering, also called incremental clustering • Documents are processed serially • The representation for the first document becomes the cluster representative for the first cluster • Each subsequent document is matched against all cluster representatives existing at processing time • A given document is assigned to one cluster according to some similarity measure • When a document is assigned to a cluster, the representative for that cluster is recomputed • If a document fails a certain similarity test, it becomes the cluster representative of a new cluster
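The single pass steps above can be sketched in Python. This is a minimal illustration, not the method from the cited papers: the whitespace tokenizer, the raw term-count vectors, the summed-count cluster representative, and the `threshold` value of 0.3 are all assumptions chosen to keep the example short.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def single_pass_cluster(docs, threshold=0.3):
    """Process documents serially and return one cluster id per document."""
    representatives = []  # assumed representative: summed term counts
    labels = []
    for text in docs:
        vec = Counter(text.lower().split())
        # Match the document against every representative that exists so far.
        scores = [cosine(vec, rep) for rep in representatives]
        best = max(range(len(scores)), key=scores.__getitem__, default=None)
        if best is not None and scores[best] >= threshold:
            representatives[best].update(vec)  # recompute the representative
            labels.append(best)
        else:
            representatives.append(Counter(vec))  # document seeds a new cluster
            labels.append(len(representatives) - 1)
    return labels
```

Called on `["earthquake hits city", "earthquake damages city", "stock market rallies"]`, the first two stories share a cluster and the third starts its own. The result depends on the order in which documents arrive, which is a known property of single pass methods.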
Modified Single Pass Clustering • A slightly different version of single pass clustering is to use all the documents for comparison instead of just the cluster representative • Example
On-line New Event Detection • A new document is absorbed by the most similar past cluster if the similarity between the document and that cluster is above a pre-selected clustering threshold • On-line new event detection needs a second threshold, called the novelty threshold. If the maximal similarity score between the current document and any past cluster is below the novelty threshold, the document is labeled "new", meaning that it is the first story of a new event; otherwise it is labeled "old" • Both thresholds are user specified and require tuning • The most important addition is a time penalty. There are two approaches • Uniformly weighted time window • Linear decaying-weight function
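A rough Python sketch of the two-threshold decision follows. It is an illustration under assumptions, not the cited system: the term-count vectors, the function name `detect_new_events`, the default threshold values, and the choice to start a new cluster whenever a story is not absorbed are all hypothetical. The time penalty is left out here.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def detect_new_events(stream, cluster_threshold=0.5, novelty_threshold=0.2):
    """Return a "new"/"old" label for each story, in arrival order."""
    clusters = []  # assumed cluster form: summed term counts of its members
    labels = []
    for text in stream:
        vec = Counter(text.lower().split())
        scores = [cosine(vec, c) for c in clusters]
        best_score = max(scores, default=0.0)
        # Novelty decision: a low maximal similarity marks a first story.
        labels.append("new" if best_score < novelty_threshold else "old")
        # Clustering decision: absorb into the best past cluster, or
        # (assumption) let the story start a cluster of its own.
        if scores and best_score >= cluster_threshold:
            clusters[scores.index(best_score)].update(vec)
        else:
            clusters.append(vec)
    return labels
```

Keeping the two thresholds separate lets a story be labeled "old" without being merged into any cluster, which is one reason both values need tuning.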
New Event Detection (Contd.) • Given the current document (x) in the input stream, we impose a time window of (m) documents prior to (x) and define the time-penalized similarity sim'(x,c) between (x) and any past cluster (c) as follows • Uniformly weighted time window: sim'(x,c) = sim(x,c) if cluster (c) has any member document in the time window, and 0 otherwise • Linear decaying-weight function: sim'(x,c) = (1 - i/m) * sim(x,c) if cluster (c) has any member document in the time window, and 0 otherwise • Where (i) is the number of documents between document (x) and the most recent member document in cluster (c) • sim(x,c) is the usual cosine similarity
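The time penalty can be written as a small Python function. This is a sketch under two assumptions that the slide does not state outright: a cluster with no member inside the window scores 0, and "inside the window" is read as i < m. The name `windowed_similarity` and the `decay` flag are hypothetical.

```python
def windowed_similarity(sim, i, m, decay=True):
    """Time-penalized similarity between document x and cluster c.

    sim:   plain cosine similarity sim(x, c)
    i:     number of documents between x and the most recent member of c
    m:     size of the time window, in documents
    decay: True for the linear decaying weight, False for the uniform window
    """
    if i >= m:
        # Assumed: no member of c falls inside the window, so c is ignored.
        return 0.0
    return (1 - i / m) * sim if decay else sim
```

With m = 10 and a cosine similarity of 0.8, a cluster whose latest member is 5 documents back scores about 0.4 under the linear decay and 0.8 under the uniform window.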
Take Home Message • Event detection, tracking and clustering form an integral part of news detection • The field is relatively young and is very "hot" due to rapid advances in the internet domain • As we saw in the beginning, timely news detection and handling generic queries are important • Methodologies from multivariate statistics form the backbone for all applications