Project Argus: Massive Data
NIMD PI Meeting, December 2, 2004
Massive Structured Data
• Static data
  • Focus on 10^10 to 10^12 records
  • Typical record size 100 to 1,000 bytes
  • Typical collection size between a terabyte and a petabyte
  • Smaller than large collections that include unstructured data, because field sizes are much smaller
• Streaming data
  • 1,000 to 5,000 records per second
  • Approx. 100M to 400M records per day (the arithmetic is sketched below)
  • The static data corresponds to a few years of stream
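As a sanity check on these figures, the implied collection and daily-stream sizes follow directly from the record counts, record sizes, and arrival rates above; a minimal arithmetic sketch (Python is used here only for the calculation):

```python
# Back-of-the-envelope check of the volumes quoted on this slide.
records_low, records_high = 10**10, 10**12   # record-count range
bytes_low, bytes_high = 100, 1_000           # record-size range in bytes

print(f"smallest collection: {records_low * bytes_low / 1e12:.0f} TB")    # ~1 TB
print(f"largest collection:  {records_high * bytes_high / 1e15:.0f} PB")  # ~1 PB

# Streaming side: 1,000 to 5,000 records/second over one day (86,400 s)
for rate in (1_000, 5_000):
    per_day = rate * 86_400
    print(f"{rate} rec/s -> {per_day / 1e6:.0f}M records/day")  # ~86M to ~432M,
                                                                # i.e. roughly the
                                                                # 100M-400M/day above
```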
Approximate Structured Matching
[Diagram: for a range or point query, records at zero distance are exact matches, records within a small distance are near matches, and records beyond that distance are no match.]
Data Matching and Retrieval
• The matcher finds data that matches a query exactly or is close to it
• Different versions for different data volumes
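To make the exact / near / no-match distinction concrete, here is a minimal sketch of distance-based structured matching; the per-field distance function and the threshold are illustrative assumptions, not the Argus matcher's actual logic:

```python
# Sketch of exact vs. approximate structured matching over dict-shaped records.
# Distance function and threshold are illustrative assumptions only.

def record_distance(query: dict, record: dict) -> float:
    """Sum of per-field mismatches; 0.0 means an exact match on all queried fields."""
    dist = 0.0
    for field, wanted in query.items():
        have = record.get(field)
        if isinstance(wanted, (int, float)) and isinstance(have, (int, float)):
            dist += abs(wanted - have)   # numeric fields: absolute difference
        elif wanted != have:
            dist += 1.0                  # categorical fields: 0/1 mismatch
    return dist

def match(query: dict, records: list, near_threshold: float = 2.0):
    """Split records into exact matches and near matches; the rest are 'no match'."""
    exact, near = [], []
    for rec in records:
        d = record_distance(query, rec)
        if d == 0.0:
            exact.append(rec)
        elif d <= near_threshold:
            near.append(rec)
    return exact, near
```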
Disk-Matcher Experiments
[Plot: retrieval time (msec, roughly 1 to 100) versus number of records (10^2 to 10^6). Exact queries grow roughly as lg n, range queries as n^0.15 lg n, and approximate queries as n^0.5; the available memory is marked.]
Monitoring Streaming Data
[Architecture diagram with components: data streams, stream anomaly monitoring, data tables, intermediate tables, analyst, query table, Do_queries, query scheduler, Rete network generator, Rete networks, identified threats.]
Monitoring Streaming Data
• Monitoring structured data streams for anomalies, hazards, or alerts posted by analysts
• Alert profiles = continuous persistent queries (~10^5) (a toy sketch follows this list)
• Daily stream volumes target 10^8+ records
• The system is optimized for very highly selective queries
  • A "needle in a field of haystacks" challenge
• Alert profiles can be anything (relational, aggregation, ...)
• Functions atop a DBMS (now), or the full DYNAMiX matcher (coming soon)
• Based on a modified Rete algorithm
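For intuition, a toy sketch of continuous persistent alert profiles evaluated against an incoming record stream; in the real system profiles compile into Rete networks, so the plain predicate functions, field names, and thresholds below are illustrative assumptions only:

```python
# Toy sketch: continuous, persistent alert profiles over a record stream.
# Each profile is just a predicate here; Argus compiles profiles into Rete networks.
from typing import Callable, Dict, Iterable

Profile = Callable[[dict], bool]

def monitor(stream: Iterable[dict], profiles: Dict[str, Profile]):
    """Yield (profile_name, record) each time an incoming record matches a profile."""
    for record in stream:
        for name, predicate in profiles.items():
            if predicate(record):
                yield name, record

# Example: two highly selective (hypothetical) profiles
profiles = {
    "large_wire_transfer": lambda r: r.get("type") == "wire" and r.get("amount", 0) > 1e6,
    "flagged_origin":      lambda r: r.get("origin") in {"XYZ"},
}

stream = [{"type": "wire", "amount": 2e6, "origin": "ABC"},
          {"type": "cash", "amount": 50, "origin": "XYZ"}]
for name, rec in monitor(stream, profiles):
    print(name, rec)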
Adapted Rete Algorithm
• (n + Δn)(m + Δm) = n·m + Δn·m + n·Δm + Δn·Δm
• When Δn and Δm are very small compared to n and m, the Rete time complexity of the incremental join is worst case O(n + m), and using B-trees it improves to O(log n + log m + Δn + Δm) (sketched below)
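The identity above is what makes the join incremental: the old result n ⋈ m is already materialized, so only the three delta terms need to be computed when new rows arrive. A minimal sketch of that idea, using hash indexes in place of the B-trees mentioned above (all names and structure here are illustrative, not the Argus implementation):

```python
# Sketch of an incremental equi-join: new rows are joined only against the
# already-indexed rows on the other side, so work scales with the deltas.
from collections import defaultdict

class IncrementalJoin:
    def __init__(self, key: str):
        self.key = key
        self.left_index = defaultdict(list)   # key -> left rows seen so far  (n)
        self.right_index = defaultdict(list)  # key -> right rows seen so far (m)
        self.results = []                     # accumulated join results (n ⋈ m)

    def add_left(self, rows):                 # Δn arrives
        for row in rows:
            k = row[self.key]
            # join each new left row only against existing right rows
            self.results += [(row, r) for r in self.right_index[k]]
            self.left_index[k].append(row)

    def add_right(self, rows):                # Δm arrives
        for row in rows:
            k = row[self.key]
            self.results += [(l, row) for l in self.left_index[k]]
            self.right_index[k].append(row)
```

Processing Δn then Δm in sequence covers all three delta terms: Δn joins against the stored m, Δm joins against the stored n plus the just-inserted Δn.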
Finding Novel Patterns in Data
• Primary topic of the Hypothesis Generation and Tracking paper
• Scales well to massive data because the algorithms are near linear in the number of records, rather than O(n^2)
Need for Suitable Data
• The most suitable data is classified or proprietary
• Fabricated data does not have the "right" distribution
  • Risk of tailoring the solution to fabricated characteristics
• The ideal is real data processed to be unclassified while still retaining the relevant characteristics of the original