The Mathematics and Algorithmics of Process Detection George Cybenko Dartmouth College Hanover NH 03755 USA gvc@dartmouth.edu IPAM 27-7-2005 Cybenko
Acknowledgements
Active Members: George Bakos, Alex Barsamian, Marion Bates, Vincent Berk, Wayne Chung, Valentino Crespi (Cal State LA), George Cybenko, Ian deSouza, Annarita Giani, Doug Madory, Glenn Nofsinger, Yong Sheng, William Stearns
Alumni: Naomi Fox (UMass, Ph.D. student), Hrithik Govardhan (Rocket), Robert Gray (BAE Systems), Diego Hernando (UIUC, Ph.D. student), Guofei Jiang (NEC Research), Alex Jordan (BAE Systems), Han Li (China), Josh Peteet (Greylock Partners), Chris Roblee (LLNL)
Research Support: DARPA, DHS, ARDA, ISTS, I3P, AFOSR, Microsoft
Outline
• Background and basics
• Software and Applications
• Theory
• Summary
An Example of a Process
A “Process” Model
[Figure: state diagram with states 1 and 2, labeled with the observables a and b]
Two states: { 1 , 2 }
Two observables: { a , b }
Legal transitions between states are depicted by arrows. When occupying a state, the process emits an observable. All states are initial/start states and there are no terminal states.
Some legal sequences of observables: abbab, bababbb, abbb
Some illegal sequences of observables: aa, baab
Further reading: Automata Theory, Regular Languages, etc.
A More Complex Process
Another “Process” Model
[Figure: state diagram with states 3, 1, 2; the labels a , c appear twice and b once]
Three states: { 1 , 2 , 3 }
Three observables: { a , b , c }
Some legal sequences of observables: abab, babaccab, ab
Some illegal sequences of observables: bb, baabb
Problem: Given a sequence of observations, is it legal? If so, which states could the process be in?
Solution:
1 Read the first observable and mark the states that emit it.
2 Read the next observable, z.
3 New marked states = (states reachable from old marked states) intersected with (states that could have emitted z).
4 If there are no new marked states, the sequence is illegal; otherwise go to 2.
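The marking procedure on this slide can be sketched in a few lines of Python. The emission sets below follow the probabilities given on the HMM slide (states 1 and 3 emit a or c, state 2 emits b). The transition arrows are not recoverable from the text, so the `NEXT` table is an assumption chosen only to agree with the slide's legal and illegal example sequences; the function names are hypothetical as well.

```python
# Emission sets follow the deck; the arrows in NEXT are an ASSUMPTION
# picked to be consistent with the example sequences on the slide.
EMITS = {1: {"a", "c"}, 2: {"b"}, 3: {"a", "c"}}
NEXT = {1: {2}, 2: {1, 3}, 3: {1, 3}}


def possible_states(observations):
    """Return the states the process may occupy after the whole sequence.

    An empty result means the sequence is illegal for this model.
    """
    marked = set()
    for i, z in enumerate(observations):
        can_emit = {s for s, symbols in EMITS.items() if z in symbols}
        if i == 0:
            # Step 1: every state is a start state, so mark all emitters.
            marked = can_emit
        else:
            # Step 3: reachable from the old marks AND able to emit z.
            reachable = set().union(*(NEXT[s] for s in marked))
            marked = reachable & can_emit
        if not marked:
            # Step 4: nothing is marked, so the sequence is illegal.
            return set()
    return marked


def is_legal(observations):
    return bool(possible_states(observations))
```

Under the assumed arrows, `is_legal("abab")` and `is_legal("babaccab")` hold while `is_legal("bb")` and `is_legal("baabb")` do not, which matches the examples listed on the slide. The work per observation is bounded by the number of arrows in the model.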
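Two Simple Processes
[Figure: two copies of the two-state model. Model Instance A has states A1, A2; Model Instance B has states B1, B2; each is labeled with the observables a and b]
aabb is a legal observation sequence.
A1 B1 A2 A2 , A1 B1 A2 B2 , B1 A1 B2 B2 , ... are all legal state sequences. Each one is a hypothesis that splits into one track per model instance:
| hypothesis | track of Instance A | track of Instance B |
| A1 B1 A2 A2 | A1 A2 A2 | B1 |
| A1 B1 A2 B2 | A1 A2 | B1 B2 |
| B1 A1 B2 B2 | A1 | B1 B2 B2 |
We can reduce this to a single process....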
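Multiple Process Representation
[Figure: Model Instance A (states A1, A2) and Model Instance B (states B1, B2), each labeled with the observables a and b]
Transition matrix of one model instance: $M = \begin{pmatrix} 0 & 1 \\ 1 & 1 \end{pmatrix}$
Product model of two instances: $M \times M = \begin{pmatrix} 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 1 \\ 0 & 1 & 0 & 1 \\ 1 & 1 & 1 & 1 \end{pmatrix}$
If the observation sequence is aaaaaa and multiple copies of the model are allowed, then n copies give a product model of size $2^n$.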
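A minimal sketch of how the product model grows, assuming that the slide's "M x M" denotes the Kronecker product of the single-instance transition matrix with itself. That reading is an assumption; the helper names `kron` and `product_model` are hypothetical.

```python
# Single-instance transition matrix from the slide (rows/columns: state 1, 2).
M = [[0, 1],
     [1, 1]]


def kron(a, b):
    """Kronecker product of two square matrices given as lists of rows."""
    n, m = len(a), len(b)
    return [[a[i // m][j // m] * b[i % m][j % m] for j in range(n * m)]
            for i in range(n * m)]


def product_model(copies):
    """Transition matrix of `copies` instances of M stepped jointly
    (ASSUMED reading of the slide's product construction)."""
    result = [[1]]
    for _ in range(copies):
        result = kron(result, M)
    return result


two = product_model(2)      # 4 x 4 matrix
six = product_model(6)      # 64 x 64 matrix: 2**6 joint states
```

Each added instance doubles the number of joint states, which is the exponential cost that the later summary slide attributes to reducing multiple process detection to single process detection.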
Multistage Process Model
[Figure: state diagram with states Start/Normal, Scanned, Infected, Data Access, Exfiltration; legend: potential malicious activity, potential normal activity]
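Extensions: Hidden Markov Model (HMM)
Add probabilities.
Emission probabilities: p(a|1) = 0.8 , p(c|1) = 0.2 ; p(b|2) = 1 ; p(a|3) = 0.8 , p(c|3) = 0.2
[Figure: the three-state diagram (states 3, 1, 2) with transition probabilities 0.8, 1, 0.5, 0.5, 0.2 on its arrows]
Take logs of the probabilities so that finding the most likely state sequence becomes a shortest path problem, which dynamic programming solves (Viterbi algorithm).
[Figure: trellis of k copies of the states, one copy per time step t=0, t=1, t=2, t=3, ..., t=k-2, t=k-1]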
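The shortest-path view of the Viterbi algorithm can be sketched as follows. Emission probabilities are the ones on the slide. The slide lists the transition probabilities 0.8, 1, 0.5, 0.5, 0.2 but the text does not say which arrow carries which value, so the `TRANS` table is an assumption; a uniform start distribution is also assumed and therefore left out of the cost.

```python
import math

# Emission probabilities as given on the slide.
EMIT = {1: {"a": 0.8, "c": 0.2}, 2: {"b": 1.0}, 3: {"a": 0.8, "c": 0.2}}
# ASSUMED placement of the slide's transition probabilities on arrows.
TRANS = {1: {2: 1.0}, 2: {1: 0.5, 3: 0.5}, 3: {3: 0.8, 1: 0.2}}


def viterbi(observations):
    """Return (cost, path): the smallest total -log probability and the
    state sequence achieving it, or (inf, []) if no path explains the data."""
    if not observations:
        return math.inf, []
    first = observations[0]
    # best[s] = (cost, path) of the cheapest path that ends in state s.
    best = {s: (-math.log(e[first]), [s]) for s, e in EMIT.items() if first in e}
    for z in observations[1:]:
        step = {}
        for s, (cost, path) in best.items():
            for nxt, p in TRANS[s].items():
                if z not in EMIT[nxt]:
                    continue
                # Edge length = -log(transition prob) - log(emission prob).
                c = cost - math.log(p) - math.log(EMIT[nxt][z])
                if nxt not in step or c < step[nxt][0]:
                    step[nxt] = (c, path + [nxt])
        best = step
    if not best:
        return math.inf, []
    return min(best.values())


cost, path = viterbi("acb")
probability = math.exp(-cost)
```

Because logarithms turn products of probabilities into sums, the most probable state sequence is the shortest path through the trellis, and only one best path per state has to be kept at each time step.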
Hidden Process Models
[Figure: layered diagram of example event streams. Layers, in slide order: observations missed, noise added, unlabelled (this is what we see); observations are interleaved; observations related to state sequences; underlying (hidden) state spaces of Model 1 through Model n]
Terminology and Summary
Processes have states. The states are hidden.
States emit observables that are possibly not unique to a state.
Observables are not labeled, can be noisy and might be dropped.
Multiple processes might be instantiated.
The problem is to determine which processes are possible and which states those processes can be in.
Multiple process detection can be reduced to single process detection at the expense of exponential growth.
Tracks are associations of observations to processes.
Hypotheses are consistent tracks that explain all the observables.
Discrete Source Separation Problem (viz. Blind Source Separation, “Cocktail Party” Problem)
Catalog of Processes/Models. Process/Model Example: 3 states + transition probabilities; n observable events: a,b,c,d,e,…; Pr( state | observable event ) given/known
Observed event sequence: ….abcbbbaaaababbabcccbdddbebdbabcbabe….
[Figure: the event sequence annotated with “A Track” and “A Hypothesis”]
Which combination of which process models “best” accounts for the observations? This is what we want to compute.
Events not associated with a known process are “anomalies”.
A Simple Example of Process Detection
• a,b,c,d are events that can be observed
• states A, B, C, D, E, F are hidden
• observe a sequence of events
• Which process or combination of processes explains the observed events?
Sequence: Hypotheses
  ab: NW | RF
  abab: (NW & NW) | (RF & NW) ...
  ababc: (NW & RF) | (NW & NW)
  ababcc: NW & NW
NETWORK WORM MODEL (NW) (a,b,c,d ICMP traffic levels): states A { a }, B { b }, C { b , c }, D { c , d }
ROUTER FAILURE MODEL (RF): states E { a }, F { b }
  E,F = 0
  repeat
    read event e
    if e==a then E
    if E and e==b then F
  until F
Two models; states have different semantics; sets of observables intersect. What is the “diagnosis”?
Detecting a Process Using Rules
WORM MODEL (a,b,c,d ICMP traffic levels): states A { a }, B { b }, C { b , c }, D { c , d }
  A,B,C,D = 0
  repeat
    read event e
    if e==a then A
    if A and e==b then B
    if B and (e==b or e==c) then C
    if C and (e==c or e==d) then D
  until D
ROUTER FAILURE MODEL: states E { a }, F { b }
  E,F = 0
  repeat
    read event e
    if e==a then E
    if E and e==b then F
  until F
What does “ab” mean? (Process ambiguity)
What does “ac” mean? (Missed detections)
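The two rule sets on this slide can be run directly, which makes the two questions concrete. This is a sketch only: the tuple encoding and the name `run_rules` are hypothetical, and it assumes each rule is tested against the flags as they stood before the current event, so that a single event advances a model by at most one stage.

```python
# Each rule: (flag that must already be set or None, accepted events, flag to set).
WORM_RULES = [
    (None, {"a"}, "A"),
    ("A", {"b"}, "B"),
    ("B", {"b", "c"}, "C"),
    ("C", {"c", "d"}, "D"),
]
ROUTER_RULES = [
    (None, {"a"}, "E"),
    ("E", {"b"}, "F"),
]


def run_rules(rules, events):
    """Apply the rules to the event stream and return the flags that are set."""
    flags = set()
    final_flag = rules[-1][2]
    for e in events:
        # ASSUMPTION: conditions see the flags from before this event.
        before = frozenset(flags)
        for needed, accepted, to_set in rules:
            if (needed is None or needed in before) and e in accepted:
                flags.add(to_set)
        if final_flag in flags:
            break  # "until D" / "until F"
    return flags


ambiguous = (run_rules(WORM_RULES, "ab"), run_rules(ROUTER_RULES, "ab"))
missed = run_rules(WORM_RULES, "ac")
```

For "ab" the worm rules reach B while the router-failure rules complete, so both explanations remain open (process ambiguity). For "ac" the worm rules stall at A because the b event was never seen (missed detection).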
Rules for Process Disambiguation
WORM MODEL (a,b,c,d ICMP traffic levels): states A { a }, B { b }, C { b , c }, D { c , d }
  A,B,C,D = 0
  repeat
    read event e
    if e==a then A
    if A and e==b then B
    if B and (e==b or e==c) then C
    if C then (E=0, F=0)
    if C and (e==c or e==d) then D
  until D
ROUTER FAILURE MODEL: states E { a }, F { b }
  E,F = 0
  repeat
    read event e
    if e==a then E
    if E and e==b then F
  until F
Cannot decide which process is instantiated until more data arrives.
Rules for Missed Detections
WORM MODEL (a,b,c,d ICMP traffic levels): states A { a }, B { b }, C { b , c }, D { c , d }
  A,B,C,D = 0
  repeat
    read event e
    if e==a then A
    if A and e==b then B
    if A and e==c then C,D
    if A and e==d then D
    if B and (e==b or e==c) then C
    if C then (E=0, F=0)
    if C and (e==c or e==d) then D
    if D then (E=0, F=0)
  until D
This clearly does not scale and does not lead to manageable sets/systems of rules.
Complexity of Rule-Based Systems for Multiple Process Detection
• m process models, each with n states
• Potentially as few as mn state transitions in the original models
• Potentially need to add:
  • $O(m^2 n^2)$ rules for disambiguation
  • $O(m n^2)$ rules for missed detections
• These are “overhead” processing steps that can be done generically, not by the decision tree or rule set
• Process Query System software handles this overhead processing
Approaches to Detecting Processes
• Aristotelian - Traditional information retrieval is based on specification of a query in terms of Boolean expressions based on record fields, e.g., SQL ( name = “smith” & age > 20 & age < 40 ) + rule-based logics + decision trees, etc.
• Newtonian - Next generation process detection requires retrieval based on specification of a set of discrete, dynamic processes, e.g., descriptions of a Hidden Markov Model, Hidden Petri Net, weak models, FSMs, attack trees, etc.
Main Concept: Move from an Aristotelian to a Newtonian Paradigm.
Examples of Process Detection Problems
• Is there an unusual pattern of computer network events, host activities, system calls, etc? (Network and computer security)
• Is a complex infrastructure (telecom, electricity, financial networks) operating normally or in a failure mode? (Critical Infrastructures)
• Is my software operating normally? (Autonomic computing)
• What biological pathways/processes are engaged? (Molecular Biology)
• Is there an unusual pattern of document accesses within an enterprise document control system? (Insider Threat Detection)
• Does a group of unusual transactions constitute a threat? (Homeland Security)
• Has the physical border/perimeter been breached? (National and industrial physical intrusion detection)
• Is there a large ground vehicle convoy moving towards our position? (Tactical military)
• What’s going on around me? (Human Cognitive Processing)
IMPORTANT: All are “adversarial” situations, not cooperative, so the observations are not necessarily labeled for easy identification and association with a process!
Related Disciplines
• Multiple Target Tracking
• Hidden Markov Models
• Linear State Space Systems
• “Weak” Models
Software and Applications
• Sensor networks
• Airborne plume detection
• Cyber security
• Server pool management
• Dynamics of social networks*
• Genomics and biological pathways*
• Human situation awareness*
*In process or planned.
Process Query Systems (PQS)
• Process Query Systems solve the Discrete Source Separation Problem in a generic way:
  • inputs
    • a sequence of unlabelled observations (stream, logfiles, etc)
    • a collection of process models
  • outputs
    • estimates of which processes produced those observations
    • estimates of which states those processes are in
• Basic theory and technology has been developed by the PQS team at Dartmouth
• Now being applied to a variety of applications
Algorithms/Operations of PQS
[Figure: processing loop, recursive in time, with numbered stages labeled: Build or Learn Models; Subscribed Data Arrives; Update Tracks Within Hypotheses (Viterbi / Kalman / NDFA, etc) and Create New Hypotheses; Manage Hypotheses (MHT); Evaluate Solutions and Process Outputs. A hypothesis pool holds Hypothesis 1 through Hypothesis n, each made up of tracks]
Software: Process Query System
One platform, many applications
[Figure: a Generic Process Query System underneath application boxes labeled DISCUS, Vehicle Tracking, Cyberlog Analysis, Attacks on utilities, PQSnet.net, Computer Security, Plume detection, Sensor networks, Robust Server Pooling, with sponsor tags DHS, DARPA, ARDA]
The COBOL and pre-PQS Analogy
Pre-SQL Programs (interwoven logic):
  … application logic statement 1; application logic statement 2; file management statement 1; record management statement 1; file management statement 2; record management statement 2; application logic statement 3; record management statement 3; file management statement 3; application logic statement 4; …
Post-SQL Programs:
  User responsibility (application logic): … application logic statement 1; application logic statement 2; SQL statement 1; application logic statement 3; SQL statement 2; application logic statement 4; …
  + System responsibility (database management system): … file management operation 1; record management operation 1; file management operation 2; record management operation 2; record management operation 3; file management operation 3; …
Current Process Detection Programs (interwoven logic):
  … model logic statement 1; model logic statement 2; sensor access statement 1; state estimate statement 1; sensor access statement 2; state estimate statement 2; model logic statement 3; sensor access statement 3; state estimate statement 3; model logic statement 4; …
PQS-based Programs:
  User responsibility (model description): … model description statement 1; model description statement 2; model description statement 3; model description statement 4; …
  + System responsibility (process query system): … sensor access statement 1; state estimate statement 1; sensor access statement 2; state estimate statement 2; sensor access statement 3; state estimate statement 3; …
Computer Security Example (V. Berk and N. Fox). Funded by ARDA and DHS
Network Security
• Objective: Detect, disambiguate, and predict the course of concerted network attacks in an enterprise class network.
• Why: Problem domain demands the power of PQS
  • Hundreds of “processes” occurring at once
  • Lots of missed observations and noise
  • All commercial technology focuses on collection and presentation of data
  • Existing correlation efforts very weak at best
Goal of PQS in network monitoring
• Create a system that quickly and accurately correlates related activity.
• Assist a security analyst in deciding:
  • What activity is irrelevant.
  • What activity needs attention and further investigation.
SENSORS INTEGRATED
| SENSOR | DESCRIPTION |
| DIB:s | Dartmouth ICMP-T3 Bcc: System |
| Snort | Signature Matching IDS |
| IPtables | Linux Netfilter firewall, log based |
| Samba | SMB server - file access reporting |
| Weblog | IIS, Apache, SSL error logs, … |
| Tripwire | Host filesystem integrity checker |
| CovChan | Timing Covert Channel Detection |
| US-agent | Userspace host monitoring agent |
The slide also has a SCOPE column with the values Global, Network, and Host.
Multistage Process Model
[Figure: state diagram with states Start/Normal, Scanned, Infected, Data Access, Exfiltration; legend: potential malicious activity, potential normal activity]
PQS-Net Testbed at Dartmouth (www.pqsnet.net)
[Figure: network diagram. The Internet and Dartmouth connect to the PQS-Net ISTS DMZ (WWW, Mail, WS). Sensors DIB:s, CovChan, IPTables, Snort, SaMBa, and US-Agent feed Models and PQS. Attack hosts at 172.18.12.32-38: Skaion, custom exploits, Core Impact™, normal traffic, covert channels, worms. WinXP/LINUX targets at 192.168.24.192/26]
PQS-Net supply chain
Tier 1 Models
• Focus on individual host status
• Report on status changes
Tier 2 Models
• Focus on correlating host activity
• Report chains of events
Tier 1 Output:
  Mon Feb 21 20:06:17 2005 000000 131.58.63.160 (hostile) recon on 100.10.20.4 SNORT 469 proto: 1
  Mon Feb 21 20:30:24 2005 000000 138.158.170.45 (hostile) attacked 100.10.20.4 ERRORLOG 400 proto: 6 dport: 443
[Figure: pipeline from sensors (sensor data) to the Tier 1 Tracker (attack steps) to the Tier 2 Tracker (attack sequences and scores, the Tier 2 output) to the analyst’s front-end]
Example Scenario
[Figure: network diagram showing the Internet and hosts A, B, C, D, E]
Example Scenario, continued
[Figure: network diagram showing hosts B, D, E]
Fish Tracking (Kinematic Tracking). A. Jordan, W. Chung, V. Crespi. Funded by DARPA and DHS
Real-time Fish Tracking
• Objective: Track the fish in the fish tank
• Why: Very strong example of the power of PQS
  • Fish swim very quickly and erratically
  • Lots of missed observations
  • Lots of noise
  • Classical Kalman filters don’t work (non-linear movement and acceleration)
  • “Easier” than getting permission to track people (we mistakenly thought)
Fish Tracking Details
• 5-gallon tank with 2 red platys named Bubble and Squeak
• Camera generates a stream of “centroids”: for each frame, a series of (X,Y) pairs is generated.
• Model describes the kinematics of a fish: the model evaluates whether new (X,Y) pairs could belong to the same fish, based on measured position, momentum, and predicted next position. This way, multiple “tracks” are formed, one for each object.
• Model was built in under 3 days!!!
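A minimal sketch of the gating idea on this slide: predict where each tracked object should appear next and accept a centroid only if it lands close enough to a prediction. The constant-velocity prediction, the greedy nearest-prediction assignment, the `gate` parameter, and all names are assumptions made for illustration; the deck does not describe the actual kinematic model, and PQS maintains multiple hypotheses rather than a single greedy assignment.

```python
import math


def predict(track):
    """Constant-velocity guess for the next (x, y) of a track (ASSUMED model)."""
    if len(track) < 2:
        return track[-1]
    (x1, y1), (x2, y2) = track[-2], track[-1]
    return (2 * x2 - x1, 2 * y2 - y1)


def update(tracks, centroids, gate):
    """Attach each centroid of one frame to the nearest predicted position
    within `gate`; centroids that match nothing start new tracks."""
    predictions = [predict(t) for t in tracks]
    claimed = set()
    for c in centroids:
        best, best_d = None, gate
        for i, p in enumerate(predictions):
            d = math.dist(c, p)
            if i not in claimed and d <= best_d:
                best, best_d = i, d
        if best is None:
            tracks.append([c])          # unexplained centroid: new track
        else:
            tracks[best].append(c)      # consistent with this track's motion
            claimed.add(best)
    return tracks


tracks = [[(0.0, 0.0), (1.0, 0.0)], [(10.0, 10.0), (10.0, 9.0)]]
tracks = update(tracks, [(10.2, 8.1), (2.1, 0.1), (50.0, 50.0)], gate=1.0)
```

In this toy frame the first two centroids extend the two existing tracks and the third, far from every prediction, opens a new one. A track that receives no centroid in a frame is simply left unchanged, which is one crude way to tolerate a missed observation.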
Autonomic Server Monitoring (C. Roblee, V. Berk). Funded by DHS, ARDA
Autonomic Server Monitoring
• Objective: Detect and predict deteriorating service situations
• Why: Another strong example of the power of PQS
  • Software and hardware are buggy and vulnerable
  • Hot market, large profits for “The ONE” application
  • Very ambiguous observations
  • Sys-admins also want vacation
The Environment
• Hundreds of servers and services
• Various non-intrusive sensors check for:
  • CPU load
  • Memory footprint
  • Process table (forking behavior)
  • Disk I/O
  • Network I/O
  • Service query response times
  • Suspicious network activities (e.g., Snort)
• Models describe the kinematics of failures and attacks: the model evaluates load balancing problems, memory leaks, suspicious forking behavior (like /bin/sh), service hiccups correlated with network attacks…
Server Compromise Model: Generic Attack Scenario
1. Snort NIDS sensor output
  . . . Nov 21 20:57:16 [10.0.0.6] snort: [1:613:7] SCAN myscan [Classification: attempted-recon] [Priority: 2]: {TCP} 212.175.64.248-> 10.0.0.24 . . .
2. Monitored host sensor output (system level)
  Current system record for host 10.0.0.24 (10 records):
  Average memory over previous 10 samples: 251.000
  Average CPU over previous 10 samples: 0.970
  | time | mem used | CPU load | num procs | flag |
  ----------------------------------------------------------------------------------
  | 1101094903 | 251 | 0.970 | 64 | |
  | 1101094911 | 252 | 0.820 | 64 | |
  | 1101094920 | 251 | 0.920 | 64 | |
  | 1101094928 | 251 | 0.930 | 64 | |
  | 1101094937 | 251 | 0.870 | 65 | |
  | 1101094946 | 251 | 0.970 | 65 | |
  | 1101094955 | 251 | 0.820 | 65 | |
  | 1101094964 | 253 | 1.220 | 65 | ! |
  | 1101094973 | 255 | 1.810 | 65 | ! |
  | 1101094982 | 258 | 2.470 | 65 | ! |
3. PQS Tracker Output
  Last Modified: Mon Nov 21 21:01:03
  Model Name: server_compromise1
  Likelihood: 0.9182
  Target: 10.0.0.24
  Optimal Response: SIGKILL proc 6992
[Figure: timeline t0, t1, t2, t3, t4 with observations o1, o2, o3, o1 followed by the response SIGKILL]
Experimental Results
[Figure: two panels, “No Tracking” and “Tracking”, plotting successful requests and system memory consumed; panel captions, in slide order: 210,000 requests serviced, 380,000 requests serviced]
Theory
• Process Query System frameworks offer a principled approach that enables
  • understanding how distinguishable models (attack and failure) are
  • developing a notion of processes that are “trackable,” given models and sensing infrastructure (i.e., a “sampling theory”)
Hypothesis Growth
A “hypothesis” is a consistent assignment of events to processes and/or states (i.e., each event is assigned to only one process instance).
Given a set of “hypotheses” for an event stream of length k-1, update the hypotheses to length k to explain the new event. NP-Complete in general.
Need to prune the pool of hypotheses, keeping the most suitable.
[Figure: branching paths over time. An individual path is a “track”, i.e., one process instance; consistent tracks form a “hypothesis”]
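The update from length k-1 to length k can be sketched as follows. A hypothesis is represented as a list of tracks and a track as a list of events; the new event either joins an existing track or starts a new one. The `consistent` predicate stands in for a process model, and the pruning score (prefer fewer tracks) is purely an assumption for illustration, as are the function names.

```python
def extend(hypotheses, event, consistent=lambda track: True):
    """Grow every hypothesis by one event in all consistent ways."""
    out = []
    for h in hypotheses:
        for i, track in enumerate(h):
            grown = track + [event]
            if consistent(grown):
                out.append(h[:i] + [grown] + h[i + 1:])
        if consistent([event]):
            out.append(h + [[event]])   # the event starts a new process instance
    return out


def prune(hypotheses, keep):
    """Keep the `keep` hypotheses with the fewest tracks (ASSUMED score)."""
    return sorted(hypotheses, key=len)[:keep]


# With no model constraint the pool sizes are 1, 2, 5, 15, ...
pool = [[]]
sizes = []
for e in "abab":
    pool = extend(pool, e)
    sizes.append(len(pool))

# With a constraint in the spirit of the two-state example ("aa" is illegal
# within one instance), two a's in a row force two process instances.
no_aa = lambda track: "aa" not in "".join(track)
forced = extend(extend([[]], "a", no_aa), "a", no_aa)
```

Without pruning or model constraints the pool grows very quickly, which is why the slide calls for keeping only the most suitable hypotheses.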
Models and Hypothesis Growth
“Weak” model: FSM with “emission” vectors
Emission for state i = 0/1 vector of sensor reports, e.g. obs(i) = ( 0 , 1 , 1 , 0 , 0 , 1 , 1 )
Observation vector at time t collected by sensors, e.g. sensors(t) = ( 0 , 1 , 1 , 1 , 1 , 1 , 0 )
Possible states at time t are determined by:
  P = { i | Hamming_distance( obs(i) , sensors(t) ) <= HD }
  R = { i | some j is possible at time t - 1 and i is reachable from j }
  $P \cap R$ is the set of possible states at time t
Number of hypotheses at time t recursively computed as above.
Theorem: For a fixed value of HD, the worst-case number of hypotheses at time t is either polynomial or exponential in t. (Crespi, Cybenko, Jiang 2004)
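One step of this weak-model update can be sketched as below. The emission vector of state 1 and the sensor vector are the examples on the slide; state 2, the reachability table `REACH`, and the function names are hypothetical additions so that the example has something to filter.

```python
def hamming(u, v):
    """Number of positions where two equal-length 0/1 vectors differ."""
    return sum(x != y for x, y in zip(u, v))


# State 1 uses the slide's example vector; state 2 and REACH are ASSUMED.
OBS = {1: (0, 1, 1, 0, 0, 1, 1), 2: (1, 0, 0, 1, 1, 0, 0)}
REACH = {1: {1, 2}, 2: {1}}


def possible_now(possible_before, sensors_t, hd):
    """States possible at time t: P (close enough emission vector)
    intersected with R (reachable from a state possible at time t - 1)."""
    P = {i for i, vec in OBS.items() if hamming(vec, sensors_t) <= hd}
    R = {i for j in possible_before for i in REACH[j]}
    return P & R


sensors_t = (0, 1, 1, 1, 1, 1, 0)          # the slide's example reading
distance = hamming(OBS[1], sensors_t)       # differs in three positions
```

With the slide's two vectors the distance is 3, so state 1 survives when HD is 3 or more and is ruled out for smaller thresholds. Raising HD tolerates more sensor noise at the price of admitting more possible states, and therefore more hypotheses.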
[Figure: two plots with axes “Longer tracking time” and “More noise (worse model)”, annotated “Nice Demo!!” and “Ouch!!!”]
[Figure: plot with axes “Longer tracking time” and “More noise (worse model)”, divided into regions labeled “Excellent Models and Sensor Coverage”, “Acceptable Models and Sensor Coverage”, and “Poor Models and Sensor Coverage”]
Basic Idea Behind the Proof
[Figure: trellis of N states repeated at time t, time t+1, time t+2, ..., time k]
If there are never two distinct paths from any node to itself over any period of observation, there is a simple injective mapping (i.e., a unique labeling) of the paths into {0, 1, ... , k} x {0, 1, ... , k} x ... x {0, 1, ... , k} (2N factors). So the number of paths is < $(k+1)^{2N}$.
The label for each path is, for each state, the time the path first occupies that state and the time it last occupies that state.
Basic Idea Behind the Proof
[Figure: trellis of N states repeated at time t, time t+1, time t+2, ..., time k]
Process dynamics (i.e., what is reachable from each state in a time step) + observations + noise threshold determine a “trellis”.
If there are two distinct paths from one node to itself over some period of time, the number of distinct paths grows exponentially by repeating the construct.
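The dichotomy can be observed on two toy trellises by counting paths with dynamic programming. Both graphs are hypothetical, and the sketch assumes the same arcs apply at every time step, that is, observations and noise threshold have already been folded into one fixed reachability relation.

```python
def count_paths(arcs, steps):
    """Number of distinct state sequences that take `steps` transitions,
    starting from any state."""
    counts = {s: 1 for s in arcs}           # one zero-step path per state
    for _ in range(steps):
        nxt = {s: 0 for s in arcs}
        for s, c in counts.items():
            for t in arcs[s]:
                nxt[t] += c                 # extend every path ending in s
        counts = nxt
    return sum(counts.values())


# No state has two distinct same-length paths back to itself.
CHAIN = {0: [0, 1], 1: [1]}
# State 0 returns to itself in two steps via 0->0->0 and via 0->1->0.
LOOP = {0: [0, 1], 1: [0]}

chain_growth = [count_paths(CHAIN, k) for k in range(6)]
loop_growth = [count_paths(LOOP, k) for k in range(6)]
```

The first graph gives k + 2 paths after k steps, which is polynomial growth. The second gives Fibonacci numbers, which grow exponentially, because the two distinct round trips through state 0 can be chained in any order.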
Relationship to Spectral Radius
• Classical spectral radius: $\rho(A) = |\lambda_{\max}|$
• Joint spectral radius of a set $S = \{A_1, \ldots, A_n\}$ of matrices: $\rho(S) = \lim_{t \to \infty} \max_{B_k \in S,\ 0 < k < t+1} \rho\left(\prod_{k=1}^{t} B_k\right)^{1/t}$
• Hypothesis growth is polynomial iff $\rho(S) \le 1$
• Deciding whether $\rho(S) \le 1$ for real or rational matrices is undecidable (Tsitsiklis and Blondel, 2000)
• If S consists of 0-1 matrices, the question is decidable but NP-hard.
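A brute-force sketch of how the joint spectral radius can be bounded numerically. It swaps the spectral radius of each product for its infinity norm, which is a different but standard route: for any product length t, the largest norm among all length-t products, raised to the power 1/t, is an upper bound on the joint spectral radius, and the bound tightens as t grows. The example matrices and function names are hypothetical, and the enumeration costs |S|^t products, so it only suits tiny cases.

```python
from itertools import product


def matmul(a, b):
    """Product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def inf_norm(m):
    """Maximum absolute row sum (a submultiplicative matrix norm)."""
    return max(sum(abs(x) for x in row) for row in m)


def jsr_upper_bound(matrices, t):
    """max over all length-t products of ||B_1 ... B_t||^(1/t)."""
    best = 0.0
    for combo in product(matrices, repeat=t):
        p = combo[0]
        for m in combo[1:]:
            p = matmul(p, m)
        best = max(best, inf_norm(p))
    return best ** (1.0 / t)


POLY = [[[1, 1], [0, 1]]]        # path counts grow linearly
EXPO = [[[1, 1], [1, 0]]]        # path counts grow like Fibonacci numbers

poly_bounds = [jsr_upper_bound(POLY, t) for t in (2, 4, 8)]
expo_bound = jsr_upper_bound(EXPO, 8)
```

For the first set the bounds decrease toward 1 as t grows, in line with polynomial growth. For the second the bound stays above the golden ratio (about 1.618), which is the spectral radius of that matrix, in line with exponential growth.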