Monitoring Message Streams: Retrospective and Prospective Event Detection
OBJECTIVE: Monitor streams of textualized communication to detect pattern changes and "significant" events
• Motivation: sniffing and monitoring email traffic
TECHNICAL PROBLEM:
• Given a stream of text in any language.
• Decide whether "new events" are present in the flow of messages.
• Event: a new topic, or a topic with an unusual level of activity.
• Retrospective or “Supervised” Event Identification: classification into pre-existing classes.
More Complex Problem: Prospective Detection or “Unsupervised” Learning
• Classes change: new classes appear, or existing classes change meaning
• A difficult problem in statistics
• Recent new C.S. approaches
• Algorithm suggests a new class
• Human analyst labels it and determines its significance
COMPONENTS OF AUTOMATIC MESSAGE PROCESSING
(1) Compression of Text -- to meet storage and processing limitations;
(2) Representation of Text -- put in a form amenable to computation and statistical analysis;
(3) Matching Scheme -- computing similarity between documents;
(4) Learning Method -- build on judged examples to determine characteristics of a document cluster (“event”);
(5) Fusion Scheme -- combine methods (scores) to yield improved detection/clustering.
COMPONENTS OF AUTOMATIC MESSAGE PROCESSING - II These distinctions are somewhat arbitrary. Many approaches to message processing overlap several of these components of automatic message processing.
OUR APPROACH: WHY WE THOUGHT WE COULD DO BETTER THAN STATE OF THE ART:
• Existing methods don’t exploit the full power of the 5 components, the synergies among them, or an understanding of how to apply them to text data.
• David Lewis' method at TREC-2001 used an off-the-shelf support vector machine supervised learner, but tuned it for frequency properties of the data.
• Even so, the combination dominated competing approaches in the TREC-2001 batch filtering evaluation.
OUR APPROACH: WHY WE THOUGHT WE COULD DO BETTER - II:
• Existing methods aim at fitting into available computational resources without paying attention to upfront data compression.
• We hoped to do better by a combination of:
• more sophisticated statistical methods
• sophisticated data compression in a pre-processing stage
• optimization of component combinations
COMPRESSION:
• Reduce the dimension before statistical analysis.
• Recent results: one pass through the data can reduce volume significantly without significantly degrading performance (e.g., using random projections) -- unlike feature-extracting dimension reduction, which can lead to bad results.
We believed that sophisticated dimension-reduction methods in a preprocessing stage, followed by sophisticated statistical tools in a detection/filtering stage, could be a very powerful approach. Our results so far give us some confidence that we were right.
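As an illustration of one-pass volume reduction by random projection, here is a minimal sketch in Python. It is not the project's implementation; the per-term seeding scheme is an assumption chosen so that the full projection matrix is never materialized.

```python
import random

def random_projection(vec, k, seed=0):
    """Project a sparse term-count vector (dict: term id -> count)
    down to k dimensions using pseudo-random Gaussian columns,
    generated on demand per term id, so the projection matrix is
    never stored."""
    out = [0.0] * k
    for term_id, count in vec.items():
        # Deterministic Gaussian column for this term id (an
        # illustrative seeding choice, not the project's scheme)
        rng = random.Random(seed * 1_000_003 + term_id)
        for i in range(k):
            out[i] += count * rng.gauss(0.0, 1.0)
    return out
```

Because each term's column is fixed by its seed, the projection is linear and reproducible: doubling the input vector doubles the output.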
MORE SOPHISTICATED STATISTICAL APPROACHES STUDIED OR TO BE STUDIED:
• Representations: Boolean representations; weighting schemes
• Matching Schemes: Boolean matching; nonlinear transforms of individual feature values
• Learning Methods: new kernel-based methods; more complex Bayes classifiers; boosting
• Fusion Methods: combining scores based on ranks, linear functions, or nonparametric schemes
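One of the fusion ideas listed above, combining scores based on ranks, can be sketched as a toy average-rank fusion (illustrative only, not the project's scheme):

```python
def rank_fuse(score_lists):
    """Combine several per-document score lists by average
    normalized rank (higher score -> higher rank). Returns fused
    scores in [0, 1]; larger means ranked higher overall."""
    n = len(score_lists[0])
    fused = [0.0] * n
    for scores in score_lists:
        # rank 0 = lowest score in this method's ranking
        order = sorted(range(n), key=lambda i: scores[i])
        for rank, i in enumerate(order):
            fused[i] += rank / (n - 1 if n > 1 else 1)
    return [f / len(score_lists) for f in fused]
```

Rank-based fusion ignores the incomparable scales of the individual scoring methods, which is exactly why it is attractive when combining heterogeneous components.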
OUTLINE OF THE APPROACH • Identify best combination of newer methods through careful exploration of variety of tools. • Address issues of effectiveness (how well task is done) and efficiency (in computational time and space) • Use combination of new or modified algorithms and improved statistical methods built on the algorithmic primitives.
THE PROJECT TEAM:
Strong team:
Statisticians: David Madigan, Rutgers Statistics; Ilya Muchnik, Rutgers CS; collaborator Wen-Hua Ju, Avaya Labs
Experts in Information Retrieval, Library Science & Text Classification: Paul Kantor, Rutgers Information and Library Science; David Lewis, private consultant
THE PROJECT TEAM:
Learning Theorists/Operations Researchers: Endre Boros, Rutgers Operations Research
Computer Scientists: Muthu Muthukrishnan, Rutgers CS; Martin Strauss, AT&T Labs; Rafail Ostrovsky, Telcordia Technologies
Decision Theorists/Mathematical Modelers: Fred Roberts, Rutgers Math/DIMACS
Homeland Security Consultants: David Goldschmidt, IDA-CCR
THE PROJECT TEAM: Graduate Students: Andrei Anghelescu, Dmitriy Fradkin Programmers: Alexander Genkin, Vladimir Menkov
DATA SETS USED: • No readily available data set has all the characteristics of data on which we expect our methods to be used • However: Our methods depend essentially only on term frequencies by document: • No spellings • No document or topic names • No information on word order in the document • All entities referenced by numeric id’s only: terms, documents, topics
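In code, a message reduced to this sanitized form is just a map from numeric term ids to frequencies, a minimal sketch:

```python
def to_term_vector(token_ids):
    """Represent a message purely by term frequencies keyed on
    numeric term ids -- no spellings, no word order -- matching
    the sanitized form the methods operate on."""
    vec = {}
    for t in token_ids:
        vec[t] = vec.get(t, 0) + 1
    return vec
```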
DATA SETS USED - II:
• Thus, many available data sets can be used for experimentation.
• Real data could be readily sanitized for our use.
• We are using:
• TREC data, in particular several time-stamped subsets of the data (order 10^5 to 10^6 messages)
• Reuters Corpus Volume 1 (8 × 10^5 messages)
• Medline Abstracts (order 10^7, with human indexing)
OVERVIEW OF PROJECT TO DATE IDA/CCR Workshop/Tutorial on Mining Massive Data Sets and Streams Good way to gear up for the project, review the state of the art in streaming message filtering. Workshop and tutorial held at IDA/CCR, Princeton, in June 2002.
OVERVIEW OF PROJECT TO DATE
Infrastructure Work
Built a common platform for text filtering experiments:
• Modified the CMU Lemur retrieval toolkit to support filtering
• Created a newswire test set with test information needs (250 topics, 240K documents)
• Wrote evaluation and adaptive thresholding software
OVERVIEW OF PROJECT TO DATE
Infrastructure Work - II
Implemented a fundamental adaptive linear classifier (Rocchio)
Benchmarked it using our data sets and submitted to the NIST TREC evaluation
Work of: Anghelescu, Lewis, Menkov, Kantor
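A minimal sketch of an adaptive Rocchio classifier over sparse term-frequency vectors follows. The beta/gamma weights are illustrative defaults, not the benchmarked configuration:

```python
class Rocchio:
    """Adaptive Rocchio linear classifier over sparse vectors
    (dict: term id -> count). The profile is
    beta * centroid(relevant) - gamma * centroid(non-relevant);
    a document's score is its dot product with the profile."""
    def __init__(self, beta=1.0, gamma=0.25):
        self.beta, self.gamma = beta, gamma
        self.pos, self.neg = {}, {}          # summed term counts
        self.n_pos = self.n_neg = 0          # judged doc counts

    def update(self, vec, relevant):
        """Fold one judged example into the running centroids."""
        target = self.pos if relevant else self.neg
        for t, c in vec.items():
            target[t] = target.get(t, 0.0) + c
        if relevant:
            self.n_pos += 1
        else:
            self.n_neg += 1

    def score(self, vec):
        s = 0.0
        for t, c in vec.items():
            p = self.pos.get(t, 0.0) / max(self.n_pos, 1)
            n = self.neg.get(t, 0.0) / max(self.n_neg, 1)
            s += c * (self.beta * p - self.gamma * n)
        return s
```

The classifier is "adaptive" in the filtering sense: each judged document updates the centroids, so the profile tracks the topic as the stream evolves.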
OVERVIEW OF PROJECT TO DATE Reducing Dimensionality by Preprocessing and Matching Interplay between intelligent selection of features and speed-up of computation/reduction of memory and storage. Mix of compression and matching.
Reducing Dimensionality by Preprocessing and Matching - II
Three directions of work involving adaptation of algorithms from theoretical computer science:
• Use of random projections into real subspaces (still promising, though not competitive for our data) (Lewis, Strauss)
• Random projections into Hamming cubes (Lewis, Ostrovsky)
• Efficient discovery of “deviant” cases in a stream of vectorized entities (Lewis, Muthukrishnan)
OVERVIEW OF PROJECT TO DATE
Feature Selection and Batch Filtering
Feature: a number that enters into a scoring function (e.g., frequency of a term in a document)
Feature selection: the process of selecting among existing features, or creating new composite features, to improve classification performance or computational efficiency
Batch filtering: relevance judgments are given up front.
Adaptive filtering: the system “pays” for information about relevance as the process moves along.
OVERVIEW OF PROJECT TO DATE Feature Selection and Batch Filtering - II A combination of compression, representation, and matching. Two approaches: Combinatorial clustering (Anghelescu, Muchnik) Term selection using measures of term effectiveness (Boros)
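The second approach, term selection using measures of term effectiveness, can be sketched with a crude stand-in measure: the absolute difference in a term's document frequency between relevant and non-relevant documents. The project's actual measures differ; this only illustrates the selection step.

```python
def select_terms(docs, labels, k):
    """Keep the k term ids whose document frequency differs most
    between relevant (label 1) and non-relevant (label 0) docs.
    `docs` are lists of numeric term ids."""
    pos_df, neg_df = {}, {}
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    for doc, lab in zip(docs, labels):
        df = pos_df if lab else neg_df
        for t in set(doc):                 # presence, not frequency
            df[t] = df.get(t, 0) + 1

    def effectiveness(t):
        return abs(pos_df.get(t, 0) / max(n_pos, 1)
                   - neg_df.get(t, 0) / max(n_neg, 1))

    terms = set(pos_df) | set(neg_df)
    return sorted(terms, key=effectiveness, reverse=True)[:k]
```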
OVERVIEW OF PROJECT TO DATE
Bayesian Approach to Text Categorization
This is a key part of the learning component.
• Building on recent work on “sparse Bayesian classifiers”
• Using an open-ended search for a linear functional which, when combined with a suitable monotone function, provides a good estimate of the probability that a document is relevant to a topic.
OVERVIEW OF PROJECT TO DATE Bayesian Approach to Text Categorization - II Key is solution to an optimization problem using Expectation Maximization Algorithms. We started with batch learning (Madigan, Lewis, Genkin) We have begun to extend this to online learning. (Here, we sequentially update the posterior distribution of model parameters as each new labeled example arrives.) (Madigan, Ju)
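The online idea can be illustrated with a far simpler model than the sparse Bayesian classifiers above: a Bernoulli Naive Bayes whose Beta(1,1)-smoothed counts are updated sequentially as each labeled example arrives. This is an illustration of sequential posterior updating only, not the project's model:

```python
import math

class OnlineNB:
    """Sequentially updated Bernoulli Naive Bayes: each labeled
    example updates Beta(1,1)-prior counts, so the per-term
    estimates are posterior means that evolve online."""
    def __init__(self):
        self.n = [0, 0]        # documents seen per class
        self.tf = [{}, {}]     # term-presence counts per class

    def update(self, terms, label):
        """Fold one labeled document (list of term ids) into the
        running counts -- the online posterior update."""
        self.n[label] += 1
        for t in set(terms):
            self.tf[label][t] = self.tf[label].get(t, 0) + 1

    def log_odds(self, terms):
        """Log odds that the document is relevant (label 1),
        scoring only the terms present in it."""
        lo = math.log((self.n[1] + 1) / (self.n[0] + 1))
        for t in set(terms):
            p1 = (self.tf[1].get(t, 0) + 1) / (self.n[1] + 2)
            p0 = (self.tf[0].get(t, 0) + 1) / (self.n[0] + 2)
            lo += math.log(p1 / p0)
        return lo
```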
OVERVIEW OF PROJECT TO DATE
Fusion of Methods
Work is in its beginning stages.
Developed computational/visualization tools to facilitate study of relationships among algorithms for the different components.
So far, we are not limiting this to machine learning; we are also involving human intervention.
Work of Kantor.
OVERVIEW OF PROJECT TO DATE
Streaming Data Analysis
Motivated by the need to make decisions about data during an initial scan, as the data “stream by.”
Recent developments by theoreticians and practitioners of algorithms for analyzing streaming data have arisen from applications requiring immediate decision making.
Motivating work: intrusion detection, transactional applications, time series applications.
OVERVIEW OF PROJECT TO DATE Streaming Data Analysis - II At this stage, just some discussions of possible approaches. More work in this direction later. Work of Muthukrishnan
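One-pass decision making of this kind can be illustrated with a toy deviance monitor: it keeps Welford running statistics of a per-message score and flags messages far from the stream's mean, all in a single pass with constant memory. This is a stand-in sketch, not one of the streaming algorithms under discussion:

```python
class StreamMonitor:
    """One-pass (Welford) running mean/variance of a per-message
    statistic; flags a message as deviant when it lies more than
    `z` standard deviations from the running mean."""
    def __init__(self, z=3.0):
        self.z = z
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def observe(self, x):
        """Return True if x is deviant w.r.t. the stream so far,
        then fold x into the running statistics."""
        deviant = False
        if self.n >= 2:
            std = (self.m2 / (self.n - 1)) ** 0.5
            deviant = abs(x - self.mean) > self.z * std + 1e-12
        self.n += 1
        d = x - self.mean
        self.mean += d / self.n
        self.m2 += d * (x - self.mean)
        return deviant
```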
OVERVIEW OF PROJECT TO DATE A Formal Framework for Monitoring Message Streams: Goal: Gain understanding of interrelations among different aspects of adaptive learning -- by building a formal model of decision making Cast Monitoring Message Streams as a multistage decision problem For each message, decide whether or not to send to an analyst
A Formal Framework for Monitoring Message Streams - II • Positive utility for sending an “interesting” message; else negative…but • …positive “value of information” even for negative documents • Use Influence Diagrams as a modeling framework • Key input is the learning curve; building simple learning curve models • BinWorld – discrete model of feature space • (Kantor and Madigan)
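The decision rule at the heart of the model can be sketched numerically. All utility values below are illustrative assumptions, not calibrated inputs from the influence-diagram work:

```python
def send_to_analyst(p_interesting, u_pos=10.0, u_neg=-1.0, voi=0.5):
    """Multistage decision in miniature: forward a message when its
    expected utility is positive -- utility of a hit weighted by the
    estimated probability the message is interesting, plus the cost
    of bothering the analyst with a dud, plus a small
    value-of-information credit for the label we get back even on
    negative documents. All numbers are illustrative."""
    expected = (p_interesting * u_pos
                + (1 - p_interesting) * u_neg
                + voi)
    return expected > 0.0
```

The value-of-information term captures the point on the slide above: even a negative document has positive value, because its label improves the learner, so the effective sending threshold is lower than a pure hit/miss analysis would suggest.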
PROJECT “PRODUCTS” TO DATE • Reports on Algorithms and Experimental Evaluation: draft writeups: • overview report • 16 reports on individual pieces of the project • Research Quality Code: in various stages of development • Dissemination: • June workshop and tutorial • project website • project public meetings and seminars
S.O.W: REMAINDER OF FIRST 12 MONTHS: • Continue to concentrate on supervised learning and detection • Explore NN methods of compression and matching • Extend Bayesian methods to online learning and explore additional novel learning tools • Further develop feature selection methods using term effectiveness and clustering concepts/algorithms • Explore streaming data analysis algorithms • Systematically explore & compare combinations of compression schemes, representations, matching schemes, learning methods, and fusion schemes • Test combinations of methods on common data sets
IMPACT AFTER 12 MONTHS:
• We will have developed innovative methods for classifying accumulated documents in relation to known tasks/targets/themes and for building profiles to track future relevant messages.
• We are optimistic that through end-to-end experimentation we will continue to discover new uses, relevant to each of the component tasks, for recently developed mathematical and statistical methods. We expect to achieve significant improvements in performance on accepted measures that could not be achieved by piecemeal study of one or two component tasks or by simply pushing existing methods further.
S.O.W: YEARS 2 AND 3:
• Still concentrate on new methods for the 5 components.
• Develop research-quality code for the leading identified methods for supervised learning
• Develop the extension to unsupervised learning:
• Detect suspicious message clusters before an event has occurred
• Use generalized stress measures indicating that a significant group of interrelated messages does not fit into the known family of clusters
S.O.W: YEARS 2 AND 3 CONTINUED: • Emphasize “semi-supervised learning” - human analysts help to focus on features most indicative of anomaly or change; algorithms assess incoming documents as to deviation on those features. • Develop new techniques to represent data to highlight significant deviation: • Through an appropriately defined metric • With new clustering algorithms • Building on analyst-designated features • Work on algorithms for streaming data analysis.
IMPACT AFTER THREE YEARS: • prototype code for testing the concepts and a precise system specification for commercial or government development. • we will have extended our analysis to semi-supervised discovery of potentially interesting clusters of documents. • this should allow us to identify potentially threatening events in time for cognizant agencies to prevent them from occurring.
WE ARE OFF TO A GOOD START • The task we face is of great value in forensic activities. • We are bringing to bear on this task a multidisciplinary approach with a large, enthusiastic, and experienced team. • Preliminary results are very encouraging. • Work is needed to make sure that our ideas are of use to analysts.