
Project Argus Massive Data NIMD PI Meeting December 2, 2004



Presentation Transcript


  1. Project Argus: Massive Data. NIMD PI Meeting, December 2, 2004

  2. Massive Structured Data • Static data • Focus on 10^10 to 10^12 records • Typical record size 100 to 1,000 bytes • Typical collection size between a terabyte and a petabyte • Smaller than large collections that include unstructured data, because field size is much smaller • Streaming data • 1,000 to 5,000 records per second • Approx. 100M to 400M records per day • Static data corresponds to a few years of stream
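
The sizing figures on this slide can be checked with simple arithmetic. The sketch below is only an illustration: the function names are hypothetical, and it assumes decimal units (1 TB = 10^12 bytes, 1 PB = 10^15 bytes) and a constant stream rate.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def records_per_day(rate_per_second):
    """Daily record count for a stream with a constant arrival rate."""
    return rate_per_second * SECONDS_PER_DAY


def collection_bytes(record_count, record_size_bytes):
    """Total size of a static collection of fixed-size records."""
    return record_count * record_size_bytes


def days_to_accumulate(record_count, rate_per_second):
    """Days of streaming needed to build a static collection of the given size."""
    return record_count / records_per_day(rate_per_second)


# Stream volume: 1,000 to 5,000 records per second.
low_daily = records_per_day(1_000)    # 86.4 million, roughly the slide's 100M
high_daily = records_per_day(5_000)   # 432 million, roughly the slide's 400M

# Static collection size: smallest and largest cases on the slide.
smallest = collection_bytes(10**10, 100)     # 10^12 bytes = 1 terabyte
largest = collection_bytes(10**12, 1_000)    # 10^15 bytes = 1 petabyte

# A mid-range collection of 10^11 records at 1,000 records per second.
years = days_to_accumulate(10**11, 1_000) / 365
```

Under these assumptions the mid-range case works out to a little over three years of stream, which is in line with the slide's "a few years"; the extreme combinations of collection size and rate fall well outside that range.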

  3. Approximate Structured Matching [Figure: a range or point query, with regions labeled Exact Match, Near Match, and No Match according to Distance from the query.]

  4. Data Matching and Retrieval • Matcher finds data that matches query exactly or is close to it • Different versions for different data volumes
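
The slides do not say how the matcher measures closeness. As a minimal sketch of the exact / near / no-match distinction, the example below assumes numeric fields, a query given as a per-field range (a point query is a range whose bounds are equal), and a hypothetical distance defined as the sum of per-field gaps to the range. The names `field_distance`, `classify`, and `near_threshold` are illustrative only and are not taken from Project Argus.

```python
def field_distance(value, low, high):
    """Gap between a value and the closed interval [low, high]; 0 if inside."""
    if value < low:
        return low - value
    if value > high:
        return value - high
    return 0


def classify(record, query, near_threshold):
    """Label a record as 'exact', 'near', or 'none' relative to a query.

    query maps field name -> (low, high); near_threshold is an assumed
    cutoff on the summed per-field distance.
    """
    distance = sum(
        field_distance(record[field], low, high)
        for field, (low, high) in query.items()
    )
    if distance == 0:
        return "exact"
    if distance <= near_threshold:
        return "near"
    return "none"


# Range on "age", point query on "amount".
query = {"age": (30, 40), "amount": (100, 100)}

inside = classify({"age": 35, "amount": 100}, query, near_threshold=5)
close = classify({"age": 42, "amount": 100}, query, near_threshold=5)
far = classify({"age": 60, "amount": 100}, query, near_threshold=5)
```

A real matcher over 10^10 or more records would need an index rather than a per-record scan; this sketch only shows the classification itself.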

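  5. Disk-Matcher Experiments [Plot: retrieval time (msec; axis ticks 1, 10, 100) versus number of records (axis ticks 10^2 to 10^6) for approximate, range, and exact queries. Curves are annotated with growth rates n^0.5, n^0.15, and lg n, and a marker indicates available memory.]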

  6. Monitoring Streaming Data [Architecture diagram; component labels: Data Streams, Data Tables, Intermediate Tables, Stream Anomaly Monitoring, Analyst, Query Table, Do_queries, Query Scheduler, Rete Network Generator, Rete Networks, Identified Threats.]

  7. Monitoring Streaming Data • Monitoring structured data streams for anomalies, hazards, or alerts posted by analysts • Alert profiles = continuous persistent queries (10^5) • Daily stream volumes target 10^8+ records • System is optimized for very high selectivity queries • “Needle in a field of haystacks” challenge • Alert profiles can be anything (relational, aggregation, …) • Functions atop DBMS (now), or full DYNAMiX matcher (coming soon) • Based on modified Rete algorithm
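
To make the idea of an alert profile as a continuous persistent query concrete, here is a deliberately naive sketch: each profile is a named predicate that stays registered and is evaluated against every incoming record. The names `monitor` and `profiles` and the record fields are hypothetical. This per-record scan over all profiles is not the method described in the slides, which rely on a modified Rete algorithm to avoid exactly this cost at 10^5 profiles and 10^8+ records per day.

```python
def monitor(stream, profiles):
    """Yield (profile_name, record) for every record that triggers a profile.

    stream   -- any iterable of records (here, plain dicts)
    profiles -- dict mapping profile name -> predicate(record) -> bool
    """
    for record in stream:
        for name, predicate in profiles.items():
            if predicate(record):
                yield name, record


# Two illustrative standing queries posted by an analyst.
profiles = {
    "large_transfer": lambda r: r["amount"] > 10_000,
    "watched_account": lambda r: r["account"] == "A-17",
}

stream = [
    {"account": "B-02", "amount": 250},
    {"account": "A-17", "amount": 90},
    {"account": "C-44", "amount": 50_000},
]

alerts = list(monitor(stream, profiles))
```

Because the queries are highly selective, almost all of the work in this naive loop is wasted on non-matching records, which is the motivation for sharing work across profiles in a Rete-style network.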

  8. Adapted Rete Algorithm [Figure labels: Old Results, New Incremental Results] • (n+Δn)(m+Δm) = n m + Δn m + n Δm + Δn Δm • When Δn and Δm are very small compared to n and m, the Rete time complexity of an incremental join is worst case O(n+m), and with B-trees it drops to O(log n + log m + Δn + Δm)
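
The decomposition above says that once the old result n m has been produced, a new batch only requires the three delta terms. The sketch below illustrates this for an equi-join on a key. It is an assumption-laden example, not the Argus implementation: the class name `IncrementalJoin` is hypothetical, rows are `(key, value)` pairs, and Python dictionaries stand in for the B-tree indexes mentioned on the slide.

```python
from collections import defaultdict


class IncrementalJoin:
    """Maintains an equi-join of two growing tables, emitting only new results."""

    def __init__(self):
        self.left = defaultdict(list)   # key -> values seen so far on the left (n)
        self.right = defaultdict(list)  # key -> values seen so far on the right (m)

    def add(self, new_left, new_right):
        """Insert delta batches and return only the newly created join rows."""
        results = []
        # Term "Δn m": new left rows against the old right rows.
        for key, lv in new_left:
            for rv in self.right.get(key, ()):
                results.append((key, lv, rv))
        for key, lv in new_left:
            self.left[key].append(lv)
        # Terms "n Δm" and "Δn Δm": new right rows against old plus new left rows.
        for key, rv in new_right:
            for lv in self.left.get(key, ()):
                results.append((key, lv, rv))
        for key, rv in new_right:
            self.right[key].append(rv)
        return results


def full_join(left_rows, right_rows):
    """Reference nested-loop join, used only to check the incremental results."""
    return [(lk, lv, rv) for lk, lv in left_rows for rk, rv in right_rows if lk == rk]


join = IncrementalJoin()
batch1 = join.add([(1, "a"), (2, "b")], [(1, "x")])
batch2 = join.add([(1, "c")], [(2, "y"), (1, "z")])

all_left = [(1, "a"), (2, "b"), (1, "c")]
all_right = [(1, "x"), (2, "y"), (1, "z")]
```

Each call touches only the delta rows and the index entries for their keys, so the work per batch depends on the size of the deltas and of their matches rather than on a rescan of n and m. The old results are never recomputed, and the union of all batches equals the full join.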

  9. Finding Novel Patterns in Data • Primary topic of Hypothesis Generation and Tracking paper • Scales well for massive data because algorithms are near linear in number of records, rather than n^2

  10. Need for Suitable Data • Most suitable data is classified or proprietary • Fabricated data does not have “right” distribution • Risk of tailoring solution to fabricated characteristics • Ideal is real data processed to be unclassified, but still retaining relevant characteristics of original
