HOMELAND SECURITY RESEARCH AT DIMACS
Working Group on Adverse Event/Disease Reporting, Surveillance, and Analysis • Health surveillance is a core activity in public health. • Concerns about bioterrorism have drawn attention to new surveillance methods: • OTC drug sales • Subway worker absenteeism • Ambulance dispatches • This creates a need for novel statistical methods for surveillance of multiple data streams.
Working Group on Privacy & Confidentiality of Health Data • Privacy concerns are a major stumbling block to public health surveillance, in particular bioterrorism surveillance. • Challenge: produce anonymous data that is still specific enough for research. • Exploring ways to remove identifiers (Social Security number, telephone number, ZIP code) from data sets. • Exploring ways to aggregate or remove information in data sets.
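As a rough illustration of the two ideas on this slide (removing identifiers and aggregating), here is a minimal sketch. The field names (`ssn`, `phone`, `zip`, `diagnosis`) and the choice to coarsen the ZIP code to its first three digits are assumptions made for the example, not the working group's method.

```python
from collections import Counter

# Hypothetical names of fields treated as direct identifiers.
DIRECT_IDENTIFIERS = {"ssn", "phone"}


def deidentify(record, zip_digits=3):
    """Drop direct identifiers and coarsen the ZIP code (an assumed policy)."""
    out = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    if "zip" in out:
        z = str(out["zip"])
        out["zip"] = z[:zip_digits] + "*" * (len(z) - zip_digits)
    return out


def aggregate_counts(records, keys):
    """Replace individual records with counts per combination of the given keys."""
    return Counter(tuple(r[k] for k in keys) for r in records)


records = [
    {"ssn": "000-00-0001", "phone": "555-0100", "zip": "08854", "diagnosis": "flu"},
    {"ssn": "000-00-0002", "phone": "555-0101", "zip": "08855", "diagnosis": "flu"},
]
clean = [deidentify(r) for r in records]
counts = aggregate_counts(clean, ["zip", "diagnosis"])
```

The tension named in the challenge shows up directly in `zip_digits`: fewer digits give more anonymity but less geographic detail for research.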
Working Group on Analogies between Computer Viruses and Biological Viruses • Can ideas for defending against biological viruses lead to ideas for defending against computer viruses? • Concern about the large gap between the initial time of attack and the implementation of defensive strategies. • “Public health” approach: Once a virus has infected a machine, it tries to connect to as many computers as possible, as fast as possible. A “throttle” limits the rate at which a computer can connect to new computers.
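The throttle idea can be sketched in a few lines. This is a minimal illustration under assumed details: the class name, the "recently contacted" list, and the one-new-host-per-tick release rule are choices made for the example, not a description of any specific deployed system.

```python
from collections import deque


class ConnectionThrottle:
    """Delay connections to hosts that were not contacted recently."""

    def __init__(self, max_new_per_tick=1, recent_size=5):
        self.max_new_per_tick = max_new_per_tick
        self.recent = deque(maxlen=recent_size)  # recently contacted hosts
        self.pending = deque()                   # new hosts waiting for release

    def request(self, host):
        """Return True if the connection may proceed immediately."""
        if host in self.recent:
            return True
        if host not in self.pending:
            self.pending.append(host)
        return False

    def tick(self):
        """Called once per time unit: release at most max_new_per_tick new hosts."""
        released = []
        while self.pending and len(released) < self.max_new_per_tick:
            host = self.pending.popleft()
            self.recent.append(host)
            released.append(host)
        return released
```

Normal traffic, which mostly revisits a small set of hosts, is barely affected, while a virus that tries many new hosts at once sees its requests pile up in `pending`. A long queue is itself a signal that something may be wrong.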
Working Group on Modeling Social Responses to Bioterrorism • Models of the spread of infectious disease commonly assume passive bystanders and rational actors who will comply with health authorities. • It is not clear how well this assumption applies to situations like a bioterrorist attack using smallpox or plague. • Historical reference: the 1947 smallpox outbreak in New York City. • Interdisciplinary group is discussing how to incorporate social behavior into models, models of public health decision making, and risk communication.
The Bioterrorism Sensor Location Problem • Early warning is critical. • This is a crucial factor underlying the government’s plans to place networks of sensors/detectors to warn of a bioterrorist attack. • Example: the BASIS system.
Two Fundamental Problems • Sensor Location Problem (SLP): • Choose an appropriate mix of sensors. • Decide where to locate them for best protection and early warning.
Two Fundamental Problems • Pattern Interpretation Problem (PIP): When sensors set off an alarm, help public health decision makers decide • Has an attack taken place? • What additional monitoring is needed? • What was its extent and location? • What is an appropriate response?
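One simple way to make the Sensor Location Problem concrete is to treat it as a coverage problem: each candidate site covers some set of zones, and a limited number of sensors must cover as many zones as possible. The greedy heuristic below is only a sketch under that assumed formulation; the site names, zone numbers, and the coverage model itself are hypothetical and are not taken from the slides.

```python
def greedy_sensor_placement(coverage, budget):
    """Pick up to `budget` sites, each time taking the site that covers
    the most zones not yet covered (greedy maximum coverage)."""
    chosen = []
    covered = set()
    for _ in range(budget):
        candidates = [s for s in coverage if s not in chosen]
        if not candidates:
            break
        best = max(candidates, key=lambda s: len(coverage[s] - covered))
        if not coverage[best] - covered:
            break  # no remaining site adds coverage
        chosen.append(best)
        covered |= coverage[best]
    return chosen, covered


# Hypothetical candidate sites and the zones each would monitor.
coverage = {"A": {1, 2, 3}, "B": {3, 4}, "C": {4, 5, 6}, "D": {1}}
sites, zones = greedy_sensor_placement(coverage, budget=2)
```

A realistic model would also weigh sensor types, detection delay, and population at risk, which this sketch ignores.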
Monitoring Message Streams: Algorithmic Methods for Automatic Processing of Messages Supported by Interagency KD-D Group
OBJECTIVE: Monitor huge streams of textualized communication to automatically detect pattern changes and "significant" events Motivation: monitoring email traffic
TECHNICAL PROBLEM: • Given stream of text in any language. • Decide whether "new events" are present in the flow of messages. • Event: new topic or topic with unusual level of activity. • Retrospective or “Supervised” Event Identification: Classification into pre-existing classes.
TECHNICAL PROBLEM: • Batch filtering: Given relevant documents up front. • Adaptive filtering: “pay” for information about relevance as process moves along.
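The notion of a "topic with unusual level of activity" can be illustrated with a very simple detector over per-period message counts for one topic. This is a sketch only: the sliding window, the z-score rule, and the floor on the standard deviation are assumptions chosen for the example, not the project's detection method.

```python
from statistics import mean, stdev


def unusual_activity(counts, window=5, threshold=3.0, min_sigma=1.0):
    """Return indices whose count is far above the mean of the preceding window."""
    flagged = []
    for i in range(window, len(counts)):
        history = counts[i - window:i]
        mu = mean(history)
        # Floor the spread so a perfectly flat history does not divide by zero.
        sigma = max(stdev(history), min_sigma)
        if (counts[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged


# Hypothetical daily message counts for one topic.
daily_counts = [10, 11, 9, 10, 10, 11, 40, 10]
spikes = unusual_activity(daily_counts)
```

Note that the spike itself inflates the baseline for the following periods, one reason that simple rules like this are only a starting point.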
MORE COMPLEX PROBLEM: PROSPECTIVE DETECTION OR “UNSUPERVISED” LEARNING • Classes change: new classes appear, or existing classes change meaning. • A difficult problem in statistics. • Recent computer science approaches, “Semi-supervised Learning”: • Algorithm suggests a new class. • Human analyst labels it and determines its significance.
COMPONENTS OF AUTOMATIC MESSAGE PROCESSING (1) Compression of Text -- to meet storage and processing limitations; (2) Representation of Text -- put in form amenable to computation and statistical analysis; (3) Matching Scheme -- computing similarity between documents; (4) Learning Method -- build on judged examples to determine characteristics of document cluster (“event”); (5) Fusion Scheme -- combine methods (scores) to yield improved detection/clustering.
COMPONENTS OF AUTOMATIC MESSAGE PROCESSING - II These distinctions are somewhat arbitrary. Many approaches to message processing overlap several of these components of automatic message processing. Existing methods don’t exploit the full power of the 5 components, synergies among them, and/or an understanding of how to apply them to text data.
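To make the representation and matching components concrete, here is a minimal sketch that represents each message as a bag of term frequencies and matches messages by cosine similarity. Both choices are common defaults assumed for illustration; the slides do not say which representation or matching scheme the project uses.

```python
import math
import re
from collections import Counter


def term_frequencies(text):
    """Representation: lowercase the text and count alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Matching: cosine of the angle between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


m1 = term_frequencies("Outbreak reported near the subway station")
m2 = term_frequencies("Subway station closed after outbreak report")
m3 = term_frequencies("Quarterly earnings exceeded expectations")
```

Here `cosine_similarity(m1, m2)` is positive because the messages share terms, while `cosine_similarity(m1, m3)` is zero because they share none.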
COMPRESSION: • Reduce the dimension before statistical analysis. • We often have just one shot at the data as it comes “streaming by”
COMPRESSION II: • Recent results: a “one-pass” sweep through the data can reduce volume significantly without degrading performance significantly. • We believe that sophisticated dimension reduction methods in a preprocessing stage, followed by sophisticated statistical tools in a detection/filtering stage, can be a very powerful approach. • Our results so far give us some confidence in this approach.
COMPRESSION III: Three directions of work involving adaptation of nearest neighbor (NN) algorithms from theoretical computer science: • Use of random projections into real subspaces. (Still promising, though not competitive for our data.) • Random projections into Hamming cubes. • Efficient discovery of “deviant” cases in a stream of vectorized entities.
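A projection into a Hamming cube can be sketched as follows. This example uses one common construction, the sign of random Gaussian projections, as an assumption; the slides do not specify which projection the project uses, and the function names are hypothetical.

```python
import random


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random Gaussian directions in a dim-dimensional space."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def to_hamming(vec, planes):
    """Map a real vector to a bit string: one bit per direction,
    set to 1 when the projection onto that direction is non-negative."""
    return tuple(
        1 if sum(p * x for p, x in zip(plane, vec)) >= 0 else 0
        for plane in planes
    )


def hamming_distance(a, b):
    """Number of positions where two bit strings differ."""
    return sum(x != y for x, y in zip(a, b))


planes = make_hyperplanes(dim=3, n_bits=16, seed=1)
code = to_hamming([1.0, 2.0, 3.0], planes)
```

Each bit depends only on the direction of the input vector, so vectors pointing the same way get identical codes and opposite vectors differ in every bit. Nearest-neighbor search can then run on short bit strings instead of long real vectors.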
MORE SOPHISTICATED STATISTICAL APPROACHES BEING STUDIED: • Representations: Boolean representations; weighting schemes • Matching Schemes: Boolean matching; nonlinear transforms of individual feature values • Learning Methods: new kernel-based methods; more complex Bayes classifiers; boosting; • Fusion Methods: combining scores based on ranks, linear functions, or nonparametric schemes
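Rank-based fusion, the first fusion option listed, can be illustrated with a small sketch. Averaging rank positions is one simple scheme assumed here for illustration; the method names and document IDs are hypothetical.

```python
def rank_positions(scores):
    """Convert raw scores to rank positions (1 = highest score)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {doc: i + 1 for i, doc in enumerate(ordered)}


def fuse_by_mean_rank(score_tables):
    """Order documents by their mean rank across several scoring methods.
    Only documents scored by every method are kept; ties break by document ID."""
    ranks = [rank_positions(s) for s in score_tables]
    docs = set.intersection(*(set(r) for r in ranks))
    fused = {d: sum(r[d] for r in ranks) / len(ranks) for d in docs}
    return sorted(fused, key=lambda d: (fused[d], d))


# Two hypothetical methods whose scores are on very different scales.
method_a = {"d1": 0.9, "d2": 0.5, "d3": 0.1}
method_b = {"d1": 12.0, "d2": 30.0, "d3": 1.0}
fused_order = fuse_by_mean_rank([method_a, method_b])
```

Working with ranks sidesteps the problem that scores from different methods are not directly comparable, at the cost of discarding how large the score gaps are.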
DATA SETS USED: • No readily available data set has all the characteristics of data on which we expect our methods to be used • However: Many of our methods depend essentially only on term frequencies by document. • Thus, many available data sets can be used for experimentation.
DATA SETS USED II: • TREC (Text Retrieval Conference) data: time-stamped subsets of the data (order 10^5 to 10^6 messages) • Reuters Corpus Vol. 1 (8 x 10^5 messages) • Medline Abstracts (order 10^7 with human indexing)
THE MONITORING MESSAGE STREAMS PROJECT TEAM: • Endre Boros, RUTCOR • Paul Kantor, SCILS • Dave Lewis, Consultant • Ilya Muchnik, DIMACS/CS • S. Muthukrishnan, CS • David Madigan, Statistics • Rafail Ostrovsky, Telcordia Technologies • Fred Roberts, Rutgers • Martin Strauss, AT&T Labs • Wen-Hua Ju, Avaya Labs (collaborator)