CISA: Continually Improving Stream Analysis
Nancy McMillan, Doug Mooney, Dave Burgoon
March 14, 2003
Agenda • Background and Overview • Architecture • Algorithms • Results
MURALS: Multiple Use Real-time Analytics for Large Scale Data
• Major information technology initiative
• Objective: Develop intellectual property addressing the challenges created by:
  • Data generation/collection at previously unimaginable rates
  • Growing expectation that real-time decision-making is feasible and necessary for competitive advantage
  • Dramatic increase in the data-to-information ratio
  • Compelling need for balance between result precision and timeliness
• Sponsored development of two technologies
  • InfoRes: Addresses IT issues associated with real-time querying of very large relational databases
  • CISA: Addresses IT issues associated with real-time analysis of high-volume (varying arrival speed) stream data
Background: Our problem space
• Many data sources supplying stream data
• Stream data can be summarized by a set of features/summary statistics over some time window
• Each data source needs to be continually classified or characterized
• Classification/characterization of a single data source may depend on data from other data sources
• Examples:
  • Computers connecting to a firewall
  • Sensor networks
Internet Security Example: Who is trying to inappropriately access a company’s network?
• There are 19 firewalls recording connections in a log file
  • Date/Time
  • Source and Destination IP addresses
  • Protocol
  • Action (Accept/Drop/Decrypt/..)
  • Service
  • Rule
• Inbound and outbound connections and warnings over a six-day period in July 2002 were logged
  • but connections from site-to-site VPNs are not
  • only externally initiated connections are being analyzed
  • more data (6 days in September) were provided later
The Problem: The faster data arrives, the more processing power is required for real-time analysis. For what data arrival rate should the system be designed?
• Every data arrival initiates some tasks (store data, recalculate features, update decisions, etc.), each of which requires computational time
• Systems designed for gushing data waste resources when data trickles
• Systems designed for slower data flow fail when data arrives too fast
• More sophisticated analysis techniques (better features, decision algorithms, etc.) require more computational time, but can provide better answers
• Analytics designed for gushing data don’t provide the best answer possible when data trickles
• Analytics designed for slower data flow don’t provide timely answers when data arrives too fast
The CISA Answer: A precision-speed trade-off
• When the data arrives more slowly than the system design rate, the best possible answer is provided
  • All data is considered.
  • Best analysis techniques are used.
• As the data flows faster than the system design rate, the accuracy and/or precision of the solution degrades smoothly.
• System achieves precision-speed trade-off through:
  • Architecture
    • Answer not based on all current data
    • Requires feedback from algorithm so most important data is considered
  • Algorithms
    • Partial/approximate solutions provided
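The architecture side of the trade-off can be illustrated with a small sketch. It is not the CISA implementation: the `Task` fields, the "lower priority value runs first" convention, and the idea of a fixed processing budget per cycle are all assumptions made for illustration.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Task:
    priority: float                      # lower value = analyzed sooner (assumed convention)
    source: str = field(compare=False)   # hypothetical data-source identifier
    cost: float = field(compare=False)   # assumed processing cost, arbitrary units


def process_within_budget(tasks, budget):
    """Run tasks in priority order until the next one no longer fits the budget.

    Returns (processed, deferred) lists of source names. The deferred tasks
    are the data the current answer is not based on.
    """
    heap = list(tasks)
    heapq.heapify(heap)
    processed = []
    while heap and heap[0].cost <= budget:
        task = heapq.heappop(heap)
        budget -= task.cost
        processed.append(task.source)
    # Whatever remains is returned in priority order for a later cycle.
    deferred = [heapq.heappop(heap).source for _ in range(len(heap))]
    return processed, deferred
```

When the budget covers every task, nothing is deferred and the answer uses all data; as arrivals outpace the budget, only the lowest-priority work is left out.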
Architecture and Algorithm Overview: How CISA achieves precision-speed trade-off
• Architecture
  • Assign analysis tasks to asynchronously operating objects
    • storage, characterization, decision-making, and visualization
  • Prioritize analysis tasks associated with each new piece of data
    • Data likely to impact analysis is analyzed sooner
• Algorithm
  • Use incremental algorithms where possible
    • Update previous answer with new data rather than re-analyze all data
  • Stop or modify iterative or multi-step algorithms before completion when new data arrivals need to enter algorithm
    • Partial/approximate solutions provided
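As a generic illustration of "update previous answer with new data rather than re-analyze all data", the sketch below keeps a running mean and variance using Welford's method. The class name and the choice of statistic are assumptions for illustration; the slides do not say which incremental updates CISA uses.

```python
class RunningStats:
    """Mean and sample variance updated one observation at a time (Welford's method)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        # Fold one new value into the previous answer; no stored history needed.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Sample variance; undefined (returned as 0.0 here) for fewer than 2 values.
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0
```

Each update costs constant time regardless of how much data has already arrived, which is what makes this style of algorithm suitable for stream data.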
Agenda • Background and Overview • Architecture • Algorithms • Results
Internet Security Example Architecture: Diagram
[Diagram labels: Java; Access database; JMS object communication; SAS Analytics]
Advantages / Issues: Related to rapid prototyping decisions
• JMS
  • Advantages
    • Asynchronous
    • Prioritized Lists
    • Open Source / Off-the-shelf
    • Platform Independent
  • Issues
    • Slow: system resources, "thrashing", db, (network speeds)
    • JMS Implementations vary slightly
• Access
  • Advantages
    • Easy communication with Java
    • Easily and quickly developed data storage and feature calculation
  • Issues
    • Slow
    • Not available on many platforms
Agenda • Background and Overview • Architecture • Algorithms • Results
Candidate CISA Algorithms: A very broad group of statistical methods…
• Feature characteristics
  • Relies on more than one feature
  • Some of the individual features take time to compute or measure
  • Meaningful nested "sub-algorithms" can be built on increasing sets of features
• Data source characteristics
  • The algorithm can efficiently update its current solution when feature values for only a small group of source objects change
  • There is a natural method for prioritizing objects
Construction Methodologies: General
• Feature Priority
  • Order features (statically)
  • Create series of nested models that use an increasing number of features
  • Develop a function to assign priorities based on feature order and current object classification
• Data Source Priority
  • Order data sources (dynamically)
  • Assign priorities based on uncertainty of classification or cost of misclassification
  • Incremental algorithms are usually essential
• Combinations of Both
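One way to picture data source priority is sketched below. The uncertainty measure (ratio of the distance to the nearest centroid over the distance to the second nearest) and the per-source cost weights are assumptions chosen for illustration; the slides do not specify how uncertainty or misclassification cost is measured.

```python
import math


def uncertainty(point, centroids):
    """Ratio nearest/second-nearest centroid distance; near 1.0 means ambiguous.

    Requires at least two centroids.
    """
    distances = sorted(math.dist(point, c) for c in centroids)
    if distances[1] == 0:
        return 1.0  # the point coincides with two centroids: fully ambiguous
    return distances[0] / distances[1]


def rank_sources(points, centroids, cost=None):
    """Order source names so the most uncertain (or costly) come first.

    points: dict of source name -> feature tuple
    cost:   optional dict of source name -> hypothetical misclassification cost
    """
    cost = cost or {}
    score = {
        name: uncertainty(p, centroids) * cost.get(name, 1.0)
        for name, p in points.items()
    }
    return sorted(score, key=score.get, reverse=True)
```

Because the ranking depends on the current centroids, it changes as the classification changes, which is the sense in which the ordering is dynamic.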
Construction Methodologies: Examples
• Feature Priority: Decompose an algorithm into subalgorithms that use subsets of features. Prioritize feature computation.
  • Example: Decision tree using X1, X2, …, Xn
    • Prioritize order of Xi computation based on tree structure
    • Use pruned trees to classify: {X1}, {X1, X2}, {X1, X2, X3}, …, {X1, X2, …, Xn}
• Data Source Priority:
  • Example: Cluster analysis (all features needed)
    • Objects with incomplete feature sets get higher priority
    • Objects with more uncertain classifications get higher priority
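The pruned-tree idea in the feature priority example can be sketched as a hand-written tree that stops at whatever depth the available features allow. The feature names `x1` to `x3`, the thresholds, and the class labels below are all hypothetical; they are not the rules used in the firewall study.

```python
FEATURE_ORDER = ["x1", "x2", "x3"]  # static feature priority, cheapest first (assumed)


def usable_depth(features):
    """Count how many leading features in FEATURE_ORDER have been computed."""
    depth = 0
    for name in FEATURE_ORDER:
        if name not in features:
            break
        depth += 1
    return depth


def classify(features):
    """Return (label, level), where level is the number of features actually used."""
    depth = usable_depth(features)
    if depth == 0:
        return "unclassified", 0
    if features["x1"] < 0.5:
        return "normal", 1          # true leaf: deeper features cannot change it
    if depth == 1:
        return "suspicious", 1      # tree pruned to {X1}
    if features["x2"] < 0.3:
        return "suspicious", 2
    if depth == 2:
        return "scan", 2            # tree pruned to {X1, X2}
    label = "port scan" if features["x3"] > 10 else "ip scan"
    return label, 3                 # full tree {X1, X2, X3}
```

A source classified at level 1 or 2 only because of missing features is a natural candidate for having its remaining features computed next.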
Agenda • Background and Overview • Architecture • Algorithms • Results
Internet Security Example: Who is trying to inappropriately access the company’s network?
• There are 19 firewalls recording connections in a log file
  • Date/Time
  • Source and Destination IP addresses
  • Protocol
  • Action (Accept/Drop/Decrypt/..)
  • Service
  • Rule
• Inbound and outbound connections and warnings over a six-day period in July 2002 were logged
  • but connections from site-to-site VPNs are not
  • only externally initiated connections are being analyzed
  • more data (6 days in September) were provided later
External Network Connectors: Summary statistics/features
• Quickly calculated features
  • % Drop
  • % Accept
  • Hits/Sec
  • # Hits
• More time consuming features
  • # Different Services
  • Different Services/Hit
  • # Different IPs
  • Different IPs/Hit
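A minimal sketch of how these features might be computed for one external source over a time window. The record layout, the dictionary keys, and the exact definitions (for example, Hits/Sec as hits divided by the time span between first and last hit) are assumptions; the slides name the features but do not define them.

```python
def summarize(records):
    """Compute summary features for one source.

    records: list of (timestamp_seconds, dest_ip, service, action) tuples,
             a hypothetical simplification of the firewall log fields.
    """
    hits = len(records)
    if hits == 0:
        return {}
    span = max(r[0] for r in records) - min(r[0] for r in records)
    n_ips = len({r[1] for r in records})        # distinct destination IPs
    n_services = len({r[2] for r in records})   # distinct services
    return {
        # Quickly calculated: simple counters
        "hits": hits,
        "pct_drop": 100.0 * sum(r[3] == "Drop" for r in records) / hits,
        "pct_accept": 100.0 * sum(r[3] == "Accept" for r in records) / hits,
        "hits_per_sec": hits / span if span > 0 else float(hits),
        # More time consuming: require tracking distinct values
        "n_services": n_services,
        "services_per_hit": n_services / hits,
        "n_ips": n_ips,
        "ips_per_hit": n_ips / hits,
    }
```

The counters in the first group can be updated per record, while the distinct-value features need a set per source, which is one plausible reason they take longer to maintain.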
Dates: 7/21/02 - 7/27/02
• Slow Port and IP Scans (N=3)
  • High Services
  • High Number of IPs
  • High Number of Hits
  • Low Hits/Sec
  • Large Drop %
• Suspicious (N=4636)
  • Large Drop %
  • Medium IP/Hit
  • Low everything else
• Fast IP Address Scans (N=10)
  • Low Services
  • High Number of Hits
  • High IP/Hit
  • High Number of Hits/Sec
  • Large Drop %
  • Mostly Foreign
  • Represent 40% of External Connections
• Normal (N=7828)
  • High Accept %
• Suspicious-Too Early to Tell (N=8055)
  • Large Drop %
  • High IP/Hit
  • Few Hits
• Port Scans (N=36)
  • High Services
  • Large Drop %
External Network Connectors: Classifications
• 70%-80% of IPs stay in the same group from day to day.
External Network Connectors: Rule-based, feature priority classification algorithm
[Figure; surviving label: Priority]
Precision-Speed Trade-off: Expected results
[Plot: percentage (0 to 100 %) versus connections per second. Series: correctly classified, same level algorithm; correctly classified, different level algorithm; consistently classified; inconsistently classified]
External Network Connectors: Dynamic, data source priority algorithm
• Traditional cluster analysis (e.g., K-means) is time consuming on large datasets
• Incremental clustering algorithm required for reasonable performance
• Our approach:
  • After first cluster analysis, use centroid locations to seed the next analysis
  • Used the SAS procedure FASTCLUS for proof-of-concept purposes
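The seeding idea can be sketched with a plain K-means loop that accepts starting centroids. This is a pure-Python illustration under assumed names and parameters, not the SAS FASTCLUS procedure used in the proof of concept.

```python
import math


def kmeans(points, seeds, max_iter=100):
    """Lloyd-style K-means started from the given seed centroids.

    Returns (centroids, iterations_used). Passing the previous run's
    centroids as seeds is the warm start described above.
    """
    centroids = [tuple(s) for s in seeds]
    for iteration in range(1, max_iter + 1):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members;
        # an empty cluster keeps its previous centroid.
        updated = [
            tuple(sum(axis) / len(members) for axis in zip(*members))
            if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
        if updated == centroids:
            return centroids, iteration
        centroids = updated
    return centroids, max_iter
```

A typical use would cluster the first window from arbitrary seeds, then pass the returned centroids as `seeds` for the next window. If the data has shifted only slightly, the loop tends to converge in fewer iterations, though that is not guaranteed.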
Outlier
Dates: 8/11/02 - 8/17/02
• Outlier: n = 1 (0.32% of connections)
  • Extremely high services
  • China
Dates: 8/11/02 - 8/17/02
• Cluster 0: n = 5207 (10.11% of connections)
  • High Accept %
  • Mix Max Hits
  • Mix IP/Hit
• Cluster 1: n = 2561 (17.16% of connections)
  • High Drop %
  • Medium IP/Hit
• Cluster 2: n = 7 (50.35% of connections)
  • High Drop %
  • High Num Hits
  • High Num IPs
  • High Max Hits/Sec
• Cluster 3: n = 180 (17.81% of connections)
  • High Services and/or Max Hits/Sec
  • Mixed
• Cluster 4: n = 4 (1.42% of connections)
  • High Drop %
  • High Services
  • 94.5% of connections from Korea
  • 1 of 4 IPs from Korea
  • Average 23 sec between hits
• Cluster 5: n = 5104 (2.82% of connections)
  • High IP/Hit
  • High Drop %
[Figure: cluster plot with regions labeled Cluster 0 through Cluster 5]
External Network Connector Classifications: Dashboard report
[Report columns: Drop %; Service/Hit; IPs/Hit; Max Hit/Sec; IPs Scanned; Services Scanned; % of Sources; % Connections]
External Network Connector Classifications: Outlier report
[Report columns: Drop %; Service/Hit; IPs/Hit; Max Hit/Sec; IPs Scanned; Services Scanned]