
Novelty Detection and Profile Tracking from Massive Data



Presentation Transcript


  1. Novelty Detection and Profile Tracking from Massive Data. Jaime Carbonell, Eugene Fink, Santosh Ananthraman

  2. Motivation: Search for interesting patterns in large data sets

  3. Motivation: Search for interesting patterns in large data sets. Current applications: • Processing of intelligence data • Prediction of “natural” threats. Future applications: • Scientific discoveries • Analysis of business data • … and more …

  4. Outline Main results of the ARGUS project - Approximate matching - Streaming data - Novelty detection More about approximate matching - Records and queries - Search for matches - Experimental results

  5. Large data sets. Large: From a million (10^6) to several billion (10^10) records. Data: Structured records with numbers, strings, and nominal values. Sets: Databases and streams of records. Specific sets: • Hospital admissions (1.7 million records) • Network flow (5 trillion records) • Federal wire (simulated data)

  6. Main results: We have developed a system that addresses three problems: • Retrieval of approximate matches for known patterns • Processing of streaming data • Identification of new patterns and gradual changes in old patterns

  7. Approximate matching: Fast identification of approximate matches in large sets of records. Examples: • Misspelled names • Inexact numbers • Spatial proximity

  8. Streaming data: Continuous search for matches in a stream of new records • Maintain a set of “pending” queries • Identify matches for these queries among incoming records

  9. RETE network: Identify common parts of queries and arrange them into a RETE network, which significantly reduces the matching time • Hundreds to thousands of pending queries • Tens to hundreds of records per second
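The idea of sharing common parts of pending queries can be illustrated with a small sketch. This is not a RETE implementation and is not taken from the ARGUS system; it only shows the underlying idea that a test shared by several queries is evaluated once per incoming record. The class name `SharedConditionMatcher`, the query names, and the record layout are all hypothetical.

```python
class SharedConditionMatcher:
    """Pending queries whose identical attribute tests are evaluated once per record."""

    def __init__(self):
        self.tests = {}    # (attribute, condition) -> predicate, shared across queries
        self.queries = {}  # query name -> set of (attribute, condition) keys

    def add_query(self, name, conditions):
        # conditions maps an attribute to either an inclusive (low, high) range
        # or a frozenset of allowed values (both hashable, so they can be shared).
        keys = set()
        for attr, cond in conditions.items():
            key = (attr, cond)
            self.tests.setdefault(key, self._make_test(attr, cond))
            keys.add(key)
        self.queries[name] = keys

    @staticmethod
    def _make_test(attr, cond):
        if isinstance(cond, tuple):
            low, high = cond
            return lambda rec: low <= rec[attr] <= high
        return lambda rec: rec[attr] in cond

    def match(self, record):
        # Each distinct test runs once, no matter how many queries use it.
        results = {key: test(record) for key, test in self.tests.items()}
        return sorted(name for name, keys in self.queries.items()
                      if all(results[k] for k in keys))


matcher = SharedConditionMatcher()
matcher.add_query("q1", {"age": (20, 40), "dx": frozenset({"asthma", "flu"})})
matcher.add_query("q2", {"age": (20, 40), "sex": frozenset({"female"})})

# A tiny stand-in for a stream of incoming records.
stream = [
    {"sex": "female", "age": 30, "dx": "asthma"},
    {"sex": "male", "age": 50, "dx": "flu"},
]
for rec in stream:
    print(rec, "->", matcher.match(rec))
```

Both queries test the same age range, so the matcher stores three distinct tests instead of four. A full RETE network would also share partial results across combinations of tests, which this sketch leaves out.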

  10. Novelty detection: • Identify “normal” clusters in the historic data • Search for new clusters in the incoming data • Track density changes in the existing clusters

  11. Example: Static event (plot of density versus distance from the center)

  12. Example: New event (two plots of density versus distance)

  13. Example: Hidden event (plot of density versus distance)

  14. Example: Growing event (plot of density versus distance)
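The slides do not say how density is computed or compared, so the following is only a minimal sketch under assumed choices: density is taken as the fraction of points in each distance bin around a cluster center, and a change is flagged when a bin's fraction moves by more than a threshold. The function names, bin width, threshold, and sample points are all hypothetical.

```python
import math


def density_profile(points, center, bin_width, n_bins):
    """Fraction of points in each distance bin around the center.

    The last bin also collects every point farther away than the binned range.
    """
    counts = [0] * n_bins
    for p in points:
        d = math.dist(p, center)
        counts[min(int(d // bin_width), n_bins - 1)] += 1
    total = len(points) or 1  # avoid division by zero for an empty set
    return [c / total for c in counts]


def changed_bins(old_profile, new_profile, threshold):
    """Indices of distance bins whose density changed by more than the threshold."""
    return [i for i, (a, b) in enumerate(zip(old_profile, new_profile))
            if abs(b - a) > threshold]


center = (0.0, 0.0)
historic = [(0, 0), (0.5, 0), (0, 0.5), (1.2, 0)]
incoming = [(0, 0), (0.5, 0), (3.1, 0), (0, 3.2)]

old = density_profile(historic, center, bin_width=1.0, n_bins=4)
new = density_profile(incoming, center, bin_width=1.0, n_bins=4)
print(old, new, changed_bins(old, new, threshold=0.2))
```

In this toy data the incoming points include a group about three units from the center that is absent from the historic data, so the outermost bin is among those flagged. A real detector would need a statistically grounded threshold and a proper clustering step, neither of which is shown here.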

  15. Visualization Display of records, clusters, and queries in two and three dimensions Access to data tables and analysis results

  16. Example: Data and clusters

  17. Example: Density analysis

  18. Information flow

  19. Outline Main results of the ARGUS project - Approximate matching - Streaming data - Novelty detection More about approximate matching - Records and queries - Search for matches - Experimental results

  20. Motivation: Retrieval of relevant records based on partially inaccurate information • Inaccurate records • Inaccurate queries • Incomplete knowledge

  21. Table of records: We specify a table of records by a list of attributes. Example: We can describe patients in a hospital by their sex, age, and diagnosis

  22. Records and queries: A record includes a specific value for each attribute. A query may include lists of values and numeric ranges. Example record: Sex: female; Age: 30; Dx: asthma. Example query: Sex: male, female; Age: 20..40; Dx: asthma, flu

  23. Query types: A point query includes a specific value for each attribute. A region query includes lists of values or numeric ranges. Example point query: Sex: female; Age: 30; Dx: asthma. Example region query: Sex: male, female; Age: 20..40; Dx: asthma, flu

  24. Exact matches: A record is an exact match for a query if every value in the record belongs to the respective range in the query. (Figure: a record and a query region in the space of Sex, Age, and Dx.)
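The definitions of records, point and region queries, and exact matches can be sketched in a few lines. The dictionary layout, the use of a tuple for a numeric range, and the inclusive range endpoints are assumptions made for illustration; the slides do not specify a representation.

```python
def matches_exactly(record, query):
    """True if every queried attribute of the record falls in the query's range or list."""
    for attr, allowed in query.items():
        value = record[attr]
        if isinstance(allowed, tuple):
            # Numeric range (low, high); endpoints are assumed inclusive.
            low, high = allowed
            if not low <= value <= high:
                return False
        elif value not in allowed:
            # List of nominal values, stored here as a set.
            return False
    return True


# The example record and queries from the slides.
record = {"sex": "female", "age": 30, "dx": "asthma"}
region_query = {"sex": {"male", "female"}, "age": (20, 40), "dx": {"asthma", "flu"}}
point_query = {"sex": {"female"}, "age": (30, 30), "dx": {"asthma"}}

print(matches_exactly(record, region_query))
print(matches_exactly(record, point_query))
print(matches_exactly({"sex": "male", "age": 50, "dx": "flu"}, region_query))
```

A point query is written here as a region query whose lists have one value and whose ranges have equal endpoints, so one function covers both query types.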

  25. Approximate matches: A record is an approximate match for a query if it is “close” to the query region. (Figure: a record and a query region in the space of Sex, Age, and Dx.)

  26. Approximate queries: An approximate query includes • Point or region • Distance function • Number of matches • Distance limit
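The four parts of an approximate query can be put together in a short sketch. The slides do not define the distance function, so the one below is an assumption: the gap to a numeric range is added directly, and a nominal value outside the allowed list costs one unit. The function names and sample records are hypothetical.

```python
import heapq


def distance_to_region(record, query):
    """Assumed distance: sum of per-attribute gaps between a record and a query region."""
    total = 0.0
    for attr, allowed in query.items():
        value = record[attr]
        if isinstance(allowed, tuple):
            low, high = allowed
            total += max(low - value, 0, value - high)  # 0 inside the range
        elif value not in allowed:
            total += 1.0  # unit penalty for a nominal mismatch
    return total


def approximate_matches(records, query, max_matches, distance_limit):
    """Up to max_matches records closest to the region, none farther than distance_limit."""
    scored = ((distance_to_region(r, query), i) for i, r in enumerate(records))
    within = [(d, i) for d, i in scored if d <= distance_limit]
    return [(d, records[i]) for d, i in heapq.nsmallest(max_matches, within)]


records = [
    {"sex": "female", "age": 30, "dx": "asthma"},
    {"sex": "male", "age": 45, "dx": "flu"},
    {"sex": "female", "age": 42, "dx": "ulcer"},
    {"sex": "male", "age": 70, "dx": "flu"},
]
query = {"sex": {"male", "female"}, "age": (20, 40), "dx": {"asthma", "flu"}}

for dist, rec in approximate_matches(records, query, max_matches=2, distance_limit=10):
    print(dist, rec)
```

This version scans every record, which is only workable for small tables. In practice the attributes would also need weights or normalization so that a one-year age gap and a diagnosis mismatch are comparable.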

  27. Indexing tree: Maintain a PATRICIA tree of records. Group nodes into fixed-size disk blocks. (Figure: a tree that branches on sex (male, female), then age (30, 40, 50), then diagnosis (asthma, flu, fracture, ulcer); its leaves are the records female, 30, fracture; female, 50, flu; female, 30, ulcer; female, 30, asthma; male, 30, asthma; male, 40, flu.)

  28. Search for matches: Depth-first search for exact matches. Best-first search for approximate matches. (Figure: the indexing tree over sex, age, and diagnosis with the six example records.)
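The two search strategies can be sketched on a simplified tree. The structure below is a plain nested dictionary with one level per attribute, not a PATRICIA tree, and it ignores disk blocks; the attribute order, the `gap` distance, and all function names are assumptions made for illustration.

```python
import heapq
from itertools import count

ATTRS = ("sex", "age", "dx")  # assumed order: one tree level per attribute


def build_tree(records):
    """Nested dicts keyed by attribute value; the last level maps to lists of records."""
    root = {}
    for rec in records:
        node = root
        for attr in ATTRS[:-1]:
            node = node.setdefault(rec[attr], {})
        node.setdefault(rec[ATTRS[-1]], []).append(rec)
    return root


def gap(value, allowed):
    """Distance from a value to an allowed range or value set (0 if inside)."""
    if isinstance(allowed, tuple):
        low, high = allowed
        return max(low - value, 0, value - high)
    return 0 if value in allowed else 1


def exact_search(node, query, depth=0):
    """Depth-first search that descends only into branches satisfying the query."""
    if depth == len(ATTRS):
        return list(node)  # at this depth the node is a list of records
    found = []
    for value, child in node.items():
        if gap(value, query[ATTRS[depth]]) == 0:
            found.extend(exact_search(child, query, depth + 1))
    return found


def best_first_search(root, query, max_matches):
    """Best-first search that always expands the branch with the smallest distance so far."""
    tie = count()  # tie-breaker so the heap never compares nodes
    frontier = [(0, next(tie), 0, root)]
    results = []
    while frontier and len(results) < max_matches:
        dist, _, depth, node = heapq.heappop(frontier)
        if depth == len(ATTRS):
            results.extend((dist, rec) for rec in node)
            continue
        for value, child in node.items():
            d = dist + gap(value, query[ATTRS[depth]])
            heapq.heappush(frontier, (d, next(tie), depth + 1, child))
    return results[:max_matches]


# The six example records from the slides.
rows = [("female", 30, "fracture"), ("female", 50, "flu"), ("female", 30, "ulcer"),
        ("female", 30, "asthma"), ("male", 30, "asthma"), ("male", 40, "flu")]
tree = build_tree([dict(zip(ATTRS, row)) for row in rows])
query = {"sex": {"male", "female"}, "age": (20, 40), "dx": {"asthma", "flu"}}

print(exact_search(tree, query))
print(best_first_search(tree, query, max_matches=4))
```

Because the accumulated distance never decreases along a path, the first records popped from the priority queue are the closest ones, so the search can stop as soon as it has enough matches without visiting the whole tree.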

  29. Performance: Experiments with a database of all patients admitted to Massachusetts hospitals from October 2000 to September 2002 • Twenty-one attributes • 1.7 million records. Use of a Pentium computer: • 2.4 GHz CPU • 1 Gbyte memory • 400 MHz bus

  30. Variables Control variables • Number of records • Memory size • Query type Measurements • Retrieval time

  31. Small memory: Number of records: 100 to 1,670,000. Memory size: 4 MByte. (Plot: retrieval time in msec, from 10 to 1000, versus number of records, from 10^2 to 10^6, for point, range, and approximate queries; curve annotations include lg n, n^0.15, and n^0.5, and a marker is labeled “Available memory”.)

  32. Large memory: Number of records: 1,670,000. Memory size: 64 to 1,024 MByte. (Plot: retrieval time in msec, from 1 to 10,000, versus memory size, from 64 to 1,024 MBytes, for point, range, and approximate queries.)

  33. Scalability: Retrieval time grows as a fractional power (about 0.5) of database size
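As a rough worked example of what a power of about 0.5 implies, write T(n) for the retrieval time with n records and c for an unknown constant (both symbols are introduced here for illustration). Since the exponent is only approximate, the ratios below are order-of-magnitude estimates, not measurements.

```latex
T(n) \approx c\, n^{0.5}
\quad\Longrightarrow\quad
\frac{T(100\,n)}{T(n)} \approx 100^{0.5} = 10,
\qquad
\frac{T(1{,}670{,}000)}{T(100)} \approx 16{,}700^{0.5} \approx 129 .
```

Under this assumption, a hundredfold increase in the number of records costs about a tenfold increase in retrieval time.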

  34. Distributed architecture: Indexing trees on multiple computers. When the system receives a new record, it adds the record to one of the trees. When the system receives a query, it searches all trees in parallel.
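The insert-into-one, search-all pattern can be sketched on a single machine. Everything here is a stand-in: plain lists replace the indexing trees, threads replace the separate computers, and round-robin placement is an assumption, since the slide only says that a record goes to one of the trees. The class name `DistributedIndex` is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


class DistributedIndex:
    """Toy model: each inner list stands in for the indexing tree on one computer."""

    def __init__(self, n_nodes):
        self.nodes = [[] for _ in range(n_nodes)]
        self.next_node = 0

    def add_record(self, record):
        # Assumed placement policy: round-robin over the nodes.
        self.nodes[self.next_node].append(record)
        self.next_node = (self.next_node + 1) % len(self.nodes)

    def search(self, predicate):
        # Search every node concurrently, then merge the partial results.
        def search_node(records):
            return [r for r in records if predicate(r)]

        with ThreadPoolExecutor(max_workers=len(self.nodes)) as pool:
            parts = pool.map(search_node, self.nodes)
            return [r for part in parts for r in part]


index = DistributedIndex(n_nodes=3)
for age in (25, 35, 45, 55, 30):
    index.add_record({"age": age})

hits = index.search(lambda r: 20 <= r["age"] <= 40)
print(sorted(r["age"] for r in hits))
```

Each node holds only part of the data, so every query must reach all nodes, while an insertion touches just one.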

  35. Conclusions: We have developed a set of tools for analysis of massive structured data. Experiments have shown that these tools improve the productivity of intelligence analysts. Future work includes development of more tools and application to other domains.
