
CISA

CISA. Continually Improving Stream Analysis Nancy McMillan Doug Mooney Dave Burgoon March 14, 2003. Agenda. Background and Overview Architecture Algorithms Results. MURALS: Multiple Use Real-time Analytics for Large Scale Data. Major information technology initiative



  1. CISA: Continually Improving Stream Analysis • Nancy McMillan • Doug Mooney • Dave Burgoon • March 14, 2003

  2. Agenda • Background and Overview • Architecture • Algorithms • Results

  3. MURALS: Multiple Use Real-time Analytics for Large Scale Data • Major information technology initiative • Objective: Develop intellectual property addressing the challenges created by: • Data generation/collection at previously unimaginable rates • Growing expectation that real-time decision-making is feasible and necessary for competitive advantage • Dramatic increase in the data to information ratio • Compelling need for balance between result precision and timeliness • Sponsored development of two technologies • InfoRes: Addresses IT issues associated with real-time querying of very large relational databases • CISA: Addresses IT issues associated with real-time analysis of high volume (varying arrival speed) stream data

  4. Background: Our problem space • Many data sources supplying stream data • Stream data can be summarized by a set of features/summary statistics over some time window • Each data source needs to be continually classified or characterized • Classification/characterization of a single data source may depend on data from other data sources • Examples: • Computers connecting to a firewall • Sensor networks

  5. Internet Security Example: Who is trying to inappropriately access a company’s network? • There are 19 firewalls recording connections in a log file • Date/Time • Source and Destination IP addresses • Protocol • Action (Accept/Drop/Decrypt/..) • Service • Rule • Inbound and outbound connections and warnings over a six-day period in July 2002 were logged • but connections from site-to-site VPNs were not • only externally initiated connections are being analyzed • more data (6 days in September) were provided later

  6. The Problem: The faster data arrives, the more processing power is required for real-time analysis. • Every data arrival initiates some tasks (store data, recalculate features, update decisions, etc.), each of which requires computational time • Systems designed for gushing data waste resources when data trickles • Systems designed for slower data flow fail when data arrives too fast • More sophisticated analysis techniques (better features, decision algorithms, etc.) require more computational time, but can provide better answers • Analytics designed for gushing data don’t provide the best answer possible when data trickles • Analytics designed for slower data flow don’t provide timely answers when data arrives too fast • To what data arrival rate should the system be designed?

  7. The CISA Answer: A precision-speed trade-off • When the data arrives more slowly than the system design rate, the best possible answer is provided • All data is considered. • Best analysis techniques are used. • As the data flows faster than the system design rate the accuracy and/or precision of the solution degrades smoothly. • System achieves precision-speed trade-off through: • Architecture • Answer not based on all current data • Requires feedback from algorithm so most important data is considered • Algorithms • Partial/approximate solutions provided

  8. Architecture and Algorithm Overview: How CISA achieves the precision-speed trade-off • Architecture • Assign analysis tasks to asynchronously operating objects: storage, characterization, decision-making, and visualization • Prioritize analysis tasks associated with each new piece of data • Data likely to impact analysis is analyzed sooner • Algorithm • Use incremental algorithms where possible • Update previous answer with new data rather than re-analyze all data • Stop or modify iterative or multi-step algorithms before completion when new data arrivals need to enter the algorithm • Partial/approximate solutions provided
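
The two ideas on this slide (a prioritized task list and incremental updates) can be illustrated with a minimal Python sketch. It is not the CISA implementation, which the slides describe as Java objects communicating over JMS. The names `PrioritizedTaskList`, `RunningMean`, `process`, and the `budget` parameter are hypothetical, and a running mean stands in for whatever features the real system maintains.

```python
import heapq
import itertools


class PrioritizedTaskList:
    """Hypothetical prioritized task list: a lower number is handled sooner."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker keeps arrival order

    def push(self, priority, source_id, value):
        heapq.heappush(self._heap, (priority, next(self._counter), source_id, value))

    def pop(self):
        _, _, source_id, value = heapq.heappop(self._heap)
        return source_id, value

    def __len__(self):
        return len(self._heap)


class RunningMean:
    """Incremental feature: updated per arrival instead of re-analyzing all data."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0

    def update(self, x):
        self.n += 1
        self.mean += (x - self.mean) / self.n


def process(tasks, features, budget):
    """Handle at most `budget` tasks, most important first; the rest stay queued."""
    handled = 0
    while tasks and handled < budget:
        source_id, value = tasks.pop()
        features.setdefault(source_id, RunningMean()).update(value)
        handled += 1
    return handled


# Usage: three arrivals, but only enough time to handle two of them.
tasks = PrioritizedTaskList()
tasks.push(2, "10.0.0.5", 4.0)   # less important source
tasks.push(1, "10.0.0.9", 10.0)  # more important source
tasks.push(1, "10.0.0.9", 20.0)
features = {}
handled = process(tasks, features, budget=2)
```

With a budget of two, only the higher-priority source is updated and the remaining arrival waits. This is one way an answer can be "not based on all current data" while still reflecting the data most likely to matter.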

  9. Agenda • Background and Overview • Architecture • Algorithms • Results

  10. CISA Architectural Components (diagram)

  11. Internet Security Example Architecture (diagram) • Java • Access database • JMS object communication • SAS analytics

  12. Advantages / Issues: Related to rapid prototyping decisions • JMS • Advantages • Asynchronous • Prioritized lists • Open source / off-the-shelf • Platform independent • Issues • Slow: system resources, "thrashing", db, (network speeds) • JMS implementations vary slightly • Access • Advantages • Easy communication with Java • Easily and quickly developed data storage and feature calculation • Issues • Slow • Not available on many platforms

  13. Agenda • Background and Overview • Architecture • Algorithms • Results

  14. Candidate CISA Algorithms: A very broad group of statistical methods… • Feature characteristics • Relies on more than one feature • Some of the individual features take time to compute or measure • Meaningful nested "sub-algorithms" can be built on increasing sets of features • Data source characteristics • The algorithm can efficiently update its current solution when feature values for only a small group of source objects change • There is a natural method for prioritizing objects

  15. Construction Methodologies: General • Feature Priority • Order features (statically) • Create series of nested models that use an increasing number of features • Develop a function to assign priorities based on feature order and current object classification • Data Source Priority • Order data sources (dynamically) • Assign priorities based on uncertainty of classification or cost of misclassification • Incremental algorithms are usually essential • Combinations of Both
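
As a rough illustration of data source priority, the Python sketch below ranks sources by how uncertain their current classification is. The uncertainty measure (ratio of the nearest to the second-nearest centroid distance) and the names `classification_uncertainty`, `order_sources`, and `costs` are assumptions made for this example; the slides do not say how uncertainty or misclassification cost was measured.

```python
import math


def classification_uncertainty(point, centroids):
    """Ratio of nearest to second-nearest centroid distance, in [0, 1].

    A value near 1 means the source sits between two classes (uncertain);
    a value near 0 means it sits close to a single class.
    """
    distances = sorted(math.dist(point, c) for c in centroids)
    if distances[1] == 0.0:
        return 1.0
    return distances[0] / distances[1]


def order_sources(sources, centroids, costs=None):
    """Return source ids, most urgent first (uncertainty times assumed cost)."""
    costs = costs or {}

    def priority(source_id):
        uncertainty = classification_uncertainty(sources[source_id], centroids)
        return uncertainty * costs.get(source_id, 1.0)

    return sorted(sources, key=priority, reverse=True)


# Usage: "b" lies halfway between the two classes, so it is re-examined first.
centroids = [(0.0, 0.0), (10.0, 0.0)]
sources = {"a": (1.0, 0.0), "b": (5.0, 0.0), "c": (9.5, 0.0)}
ranking = order_sources(sources, centroids)
```

Because the ordering depends on the current feature values, it changes as data arrives, which matches the "dynamically" ordered data sources on the slide.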

  16. Construction Methodologies: Examples • Feature Priority: Decompose an algorithm into subalgorithms that use subsets of features. Prioritize feature computation. • Example: Decision tree using X1, X2, …, Xn • Prioritize order of Xi computation based on tree structure • Use pruned trees to classify: {X1}, {X1, X2}, {X1, X2, X3}, …, {X1, X2, …, Xn} • Data Source Priority: • Example: Cluster analysis (all features needed) • Objects with incomplete feature sets get higher priority • Objects with more uncertain classifications get higher priority
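
The pruned-tree idea can be sketched in Python as a classifier that descends only as far as the available features allow. The feature names, thresholds, and labels here are invented for illustration and are not the tree used in the presentation.

```python
# Hypothetical static feature order: cheapest feature first.
FEATURE_ORDER = ["drop_pct", "n_services", "ips_per_hit"]


def classify(features):
    """Classify with the deepest nested model whose features are available.

    Returns (label, level), where level is the number of features consulted.
    Thresholds and labels are placeholders, not values from the slides.
    """
    if "drop_pct" not in features:
        return "unknown", 0
    if features["drop_pct"] < 50.0:
        return "normal", 1
    if "n_services" not in features:
        return "suspicious", 1  # answer from the tree pruned to {X1}
    if features["n_services"] >= 20:
        return "port scan", 2
    if "ips_per_hit" not in features:
        return "suspicious", 2  # answer from the tree pruned to {X1, X2}
    if features["ips_per_hit"] >= 0.5:
        return "ip scan", 3
    return "suspicious", 3


def next_feature(features):
    """Next feature to compute, following the static order; None when done."""
    for name in FEATURE_ORDER:
        if name not in features:
            return name
    return None


# Usage: the answer sharpens as more features become available.
early = classify({"drop_pct": 90.0})
later = classify({"drop_pct": 90.0, "n_services": 2, "ips_per_hit": 0.9})
```

When data arrives quickly, only the first feature may be ready and the coarse label is returned; when there is time, the remaining features are computed and the fuller tree is used.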

  17. Feature Priority Construction: Decision tree example

  18. Agenda • Background and Overview • Architecture • Algorithms • Results

  19. Internet Security Example: Who is trying to inappropriately access the company’s network? • There are 19 firewalls recording connections in a log file • Date/Time • Source and Destination IP addresses • Protocol • Action (Accept/Drop/Decrypt/..) • Service • Rule • Inbound and outbound connections and warnings over a six-day period in July 2002 were logged • but connections from site-to-site VPNs were not • only externally initiated connections are being analyzed • more data (6 days in September) were provided later

  20. External Network Connectors: Summary statistics/features • Quickly calculated features • % Drop • % Accept • Hits/Sec • # Hits • More time consuming features • # Different Services • Different Services/Hit • # Different IPs • Different IPs/Hit
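
A minimal Python sketch of how these summary features might be computed for one external source over a time window follows. The `Record` layout is hypothetical, and two definitions are assumptions: Hits/Sec is taken as hits divided by the span between first and last hit, and "Different IPs" is taken to mean distinct destination addresses.

```python
from collections import namedtuple

# Hypothetical log record for one connection attempt (time in seconds).
Record = namedtuple("Record", "time source dest service action")


def quick_features(records):
    """Counts and ratios that need only simple tallies."""
    n = len(records)
    drops = sum(1 for r in records if r.action == "Drop")
    accepts = sum(1 for r in records if r.action == "Accept")
    span = max(r.time for r in records) - min(r.time for r in records)
    return {
        "hits": n,
        "drop_pct": 100.0 * drops / n,
        "accept_pct": 100.0 * accepts / n,
        "hits_per_sec": n / span if span > 0 else float(n),
    }


def slow_features(records):
    """Distinct counts, which require tracking a set of values per source."""
    n = len(records)
    services = {r.service for r in records}
    dests = {r.dest for r in records}
    return {
        "n_services": len(services),
        "services_per_hit": len(services) / n,
        "n_ips": len(dests),
        "ips_per_hit": len(dests) / n,
    }


# Usage: four hits from a single external source.
window = [
    Record(0, "1.2.3.4", "10.0.0.1", "http", "Drop"),
    Record(1, "1.2.3.4", "10.0.0.2", "http", "Drop"),
    Record(2, "1.2.3.4", "10.0.0.3", "ssh", "Drop"),
    Record(4, "1.2.3.4", "10.0.0.3", "ssh", "Accept"),
]
quick = quick_features(window)
slow = slow_features(window)
```

Splitting the features into two functions mirrors the slide's two groups, so a scheduler could run the quick group on every arrival and defer the slower group when data is arriving fast.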

  21. Dates: 7/21/02 - 7/27/02 • N=3: Slow Port and IP Scans (High Services; High Number of IPs; High Number of Hits; Low Hits/Sec; Large Drop %) • N=4636: Suspicious (Large Drop %; Medium IP/Hit; Low everything else) • N=10: Fast IP Address Scans (Low Services; High Number of Hits; High IP/Hit; High Number of Hits/Sec; Large Drop %; Mostly Foreign; Represent 40% of External Connections) • N=7828: Normal (High Accept %) • N=8055: Suspicious-Too Early to Tell (Large Drop %; High IP/Hit; Few Hits) • N=36: Port Scans (High Services; Large Drop %)

  22. External Network Connectors: Classifications • 70%-80% of IPs stay in the same group from day to day.

  23. External Network Connectors: Rule-based, feature priority classification algorithm (diagram; label: Priority)

  24. Precision-Speed Trade-off: Expected results (plot of percentage, 0 to 100 %, against connections per second; series: correctly classified, same level algorithm; correctly classified, different level algorithm; consistently classified; inconsistently classified)

  25. Precision-Speed Trade-off: Observed results

  26. External Network Connectors: Dynamic, data source priority algorithm • Traditional cluster analysis (e.g., K-means) is time consuming on large datasets • Incremental clustering algorithm required for reasonable performance • Our approach: • After first cluster analysis, use centroid locations to seed the next analysis • Used the SAS procedure FASTCLUS for proof-of-concept purposes
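
The seeding approach can be sketched with plain Lloyd-style k-means iterations in Python. This is a stand-in written for illustration, not the SAS FASTCLUS procedure used in the proof of concept; the function name `kmeans`, the toy data, and the `max_iter` default are assumptions.

```python
import math


def kmeans(points, seeds, max_iter=10):
    """Plain Lloyd iterations starting from the given seed centroids.

    Returns (centroids, iterations_used). An empty cluster keeps its seed.
    """
    centroids = [tuple(s) for s in seeds]
    for iteration in range(1, max_iter + 1):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)), key=lambda i: math.dist(p, centroids[i])
            )
            clusters[nearest].append(p)
        updated = [
            tuple(sum(coord) / len(members) for coord in zip(*members))
            if members
            else centroids[i]
            for i, members in enumerate(clusters)
        ]
        if updated == centroids:  # assignments are stable, so stop early
            return centroids, iteration
        centroids = updated
    return centroids, max_iter


# First analysis starts from arbitrary seeds.
day1 = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centroids1, iters1 = kmeans(day1, seeds=[(0.0, 0.0), (1.0, 0.0)])

# Next analysis is seeded with the previous centroids after new data arrives.
day2 = day1 + [(0.0, 1.0), (10.0, 1.0)]
centroids2, iters2 = kmeans(day2, seeds=centroids1)
```

In this toy case the seeded run converges in a single pass because the new points do not move the centroids. Lowering `max_iter` is one way to stop an iterative algorithm before completion and accept an approximate answer when new data is waiting.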

  27. Dates: 8/11/02 - 8/17/02 • Outlier: n=1 (0.32% of connections) • Extremely high services • China

  28. Dates: 8/11/02 - 8/17/02 • Cluster 0: n = 5207 (10.11% of connections): High Accept %; Mix Max Hits; Mix IP/Hit • Cluster 1: n = 2561 (17.16% of connections): High Drop %; Medium IP/Hit • Cluster 2: n = 7 (50.35% of connections): High Drop %; High Num Hits; High Num IPs; High Max Hits/Sec • Cluster 3: n = 180 (17.81% of connections): High Services and/or Max Hits/Sec; Mixed • Cluster 4: n = 4 (1.42% of connections): High Drop %; High Services; 94.5% of connections from Korea; 1 of 4 IPs from Korea; Average 23 sec between hits • Cluster 5: n = 5104 (2.82% of connections): High IP/Hit; High Drop %

  29. External Network Connector Classifications: Dashboard report (columns: Drop %, Service/Hit, IPs/Hit, Max Hit/Sec, IPs Scanned, Services Scanned, % of Sources, % Connections)

  30. External Network Connector Classifications: Outlier report (columns: Drop %, Service/Hit, IPs/Hit, Max Hit/Sec, IPs Scanned, Services Scanned)
