Cognitive Support for Intelligent Survivability Management



  1. Cognitive Support for Intelligent Survivability Management Dec 18, 2007

  2. Outline • Project Summary and Progress Report • Goals/Objectives • Changes • Current status • Technical Details of Ongoing Tasks • Event Interpretation • Response Selection • Rapid Response (ILC) • Learning Augmentation • Simulated test bed (simulator) • Next steps • Development and Integration • Red team evaluation

  3. Project Summary & Progress Report Partha Pal

  4. Background • Outcome of DARPA OASIS Dem/Val program • Survivability architecture • Protection, detection and reaction (defense mechanisms) • Synergistic organization of overlapping defense & functionality • Demonstrated in the context of an AFRL exemplar (JBI) • With knowledge about the architecture, human defenders can be highly effective • Even against a sophisticated adversary with significant inside access and privilege (learning exercise runs) • The survivability architecture provides the dials and knobs, but an intelligent control loop, in the form of human experts, was needed for managing them (managing = making effective decisions) • What was this knowledge? How did the human defenders use it? Can the intelligent control loop be automated?

  5. Incentives and Obstacles • Incentives • Narrowing of the qualitative gap in “automated” cyber-defense decision making • Self-managed survivability architecture • Self-regenerative systems • Next generation of adaptive system technology: from hard-coded adaptation rules to cognitive rules to evolutionary ones • Obstacles (at various levels) • Concept: insight (of a sort), but no formalization • Implementation: architecture, tool capability & choice • Evaluation: how to create a reasonably complex context & wide range of incidents (a real system?) • Evaluation: how to quantify and validate usefulness and effectiveness, and how to measure technological advancement

  6. CSISM Objectives • Design and implement an automated cyber-defense decision making mechanism • Expert level proficiency • Drive a typical defense enabled system • Effective, reusable, easy to port and retarget • Evaluate it in a wider context and scope • Nature and type of events and observations • Size and complexity of the system • Readiness for a real system context • Understanding the residual issues & challenges

  7. Main Problem • Making sense of low-level information (alerts, observations) to drive low-level defense mechanisms (block, isolate, etc.) such that higher-level objectives (survive, continue to operate) are achieved • Doing it as well as human experts • And also as well as in other disciplines • Additional difficulties • Rapid and real-time decision-making and response • Uncertainty due to incomplete and imperfect information • Widely varying operating conditions (no alerts to 100s of alerts per second) • New symptoms and changes in adversary’s strategy

  8. For Example…. • Consider a missing protocol message alert • Observable: a system-specific alert • A accuses B of omission • Interpretation • A is not dead (it reported the alert) • Is A lying? (corrupt) • B is dead • B is not dead, just behaving badly (corrupt) • A and B cannot communicate • Refinement (depending on what else we know about the system, the attacker objective, ...) • Other communications between A and B • A service is dead if its host is dead • OS platform and likelihood of multi-platform exploits • Response selection • Now or later? • Many options • dead(svc) => restart(svc) | restart(host of svc) • cannot-communicate(host1, host2) => ping | retry operation • corrupt(host) => reboot(host) | block(host) | quarantine(host) • Now consider a large number of hosts, a sequence of alerts, various adversary objectives, and trying to keep the mission going
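
  The hypothesis-to-response options above can be read as a simple lookup table. A minimal Java sketch of that reading (names are hypothetical; the actual CSISM rules live in Soar/Jess, not in code like this):

      import java.util.List;
      import java.util.Map;

      public class ResponseOptions {
          // Candidate responses per hypothesis, mirroring the rules above:
          //   dead(svc)                  => restart(svc) | restart(host of svc)
          //   cannot-communicate(h1,h2)  => ping | retry operation
          //   corrupt(host)              => reboot(host) | block(host) | quarantine(host)
          static final Map<String, List<String>> OPTIONS = Map.of(
              "dead",               List.of("restart(svc)", "restart(host-of-svc)"),
              "cannot-communicate", List.of("ping", "retry-operation"),
              "corrupt",            List.of("reboot(host)", "block(host)", "quarantine(host)"));

          public static void main(String[] args) {
              System.out.println(OPTIONS.get("corrupt")); // [reboot(host), block(host), quarantine(host)]
          }
      }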

  9. Approach [Diagram: a stream of events and observations is interpreted into hypotheses about the system; responses turn hypotheses into actions on the system; learning modifies parameters or policy; the system’s reactions feed back into the stream] • Multiple levels of reasoning • Varying spatial and temporal scope • Different techniques • The main control loop is partitioned into 2 main parts: event interpretation & response selection

  10. Concrete Goals • A working prototype integrating • Policy-based reactive (cyber-defense) response • “Cognitive” control loop for system-wide (cyber-defense) event interpretation and response • Learning augmentation to modify defense parameters and policies • Achieve expert level proficiency • In making appropriate cyber-defense decisions • Evaluation by • “ground truth”: ODV operator responses to symptoms caused by red team • Program metrics

  11. Current State • Accomplished quite a bit in 1 year • KR and reasoning framework for handling cyber-defense events well developed • Proof-of-concept capability demonstrated for various components at multiple levels • OLC, ILC, Learner and Simulator • E.g., Prover9, Soar (various iterations) • Began integration and tackling incidental issues • Evaluation ongoing (internal + external) • Slightly behind in terms of response implementation and integration • Various reasons (inherent complexity, and the fact that it is very hard to debug the reasoning mechanism) • Longer term issues: confidence in such a cognitive engine? Is a system-wide scope really tenable? Is it possible to build better debugging support? • Taken mitigating steps (see next)

  12. Significant Changes • Recall the linear flow using various types of knowledge? That was what we were planning in June. This evolved, and the actual flow looks like the following: [Diagram: alerts and observations are processed into accusations & evidence (translation & map down); a constraint network of accusations and evidence is built and refined; the network is pruned via coherence and proof; garbage collection cleans up; the stages draw on knowledge about attacker goals (bin 3), bad behavior (bin 1), info flow (bin 2), and protocols and scenarios (bin 4)]

  13. Significant Changes (contd) • Response mechanism • Do it in Jess/Java instead of Soar • Issues • Getting the state accessible to Jess/Java • Viewers • Dual purpose: usability and debugging • Was: rule driven (write a Soar rule to produce what to display) • Now: get the state from Soar and process it

  14. Schedule • Midterm release (Aug 2007) [done] • Red team visit (Early 2008) • Next release (Feb 2008) • Code freeze (April 2008) • Red team exercises (May/June 2008)

  15. Event Interpretation and Response (OLC) Franklin Webber

  16. OLC Overall Goals • Interpret alerts and observations • (sometimes lack of observations triggers alerts) • Find appropriate response • (sometimes it may decide that no response is necessary) • Housekeep • Keep history • Clean up

  17. OLC Components [Diagram: Event Interpretation passes accusations and evidence to Response Selection, which emits responses; Summary, Learning, and History components support both]

  18. Event Interpretation Main Objectives: • Essential Event Interpretation • Interpreting events in terms of hypotheses and models • Uses deduction and coherence to decide which hypotheses are candidates for response • Incidental Undertakings • Protecting the interpretation mechanisms from attack: flooding and resource consumption • Current status and plans • Note that items with a * are in progress Event interpretation creates candidate hypotheses which can be responded to.

  19. Event Interpretation Decision Flow [Diagram: a generator turns incoming events into hypotheses; theorem proving yields claims and coherence yields dilemmas, both of which flow to Response Selection; learning, history, and a summary feed into the process]

  20. Knowledge Representation • Turn very specific knowledge into an intermediate form amenable to reasoning • e.g. “Q2SM sent a malformed Spread Message” -> “Q2SM is Corrupt” • Create a graph of inputs and intermediate states to enable reasoning about the whole system • Accusations and Evidence • Hypotheses • Constraints between them • Use the graph to enable deduction via proof and to perform a coherence search • Specific system inputs are translated into a reusable intermediate form which is used for reasoning.
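
  A minimal sketch of such a graph as a Java data structure (illustrative only; the actual representation lives inside the reasoning engine):

      import java.util.ArrayList;
      import java.util.List;

      public class ReasoningGraph {
          enum Kind { ACCUSATION, EVIDENCE, HYPOTHESIS }

          static class Node {
              final Kind kind; final String label;
              Node(Kind k, String l) { kind = k; label = l; }
          }

          // A weighted constraint between two nodes: a positive weight means the
          // nodes cohere (should cluster together), a negative weight means they conflict.
          static class Constraint {
              final Node a, b; final int weight;
              Constraint(Node a, Node b, int w) { this.a = a; this.b = b; weight = w; }
          }

          public static void main(String[] args) {
              Node acc = new Node(Kind.ACCUSATION, "Q2SM sent a malformed Spread message");
              Node hyp = new Node(Kind.HYPOTHESIS, "corrupt(Q2SM)");
              List<Constraint> constraints = new ArrayList<>();
              constraints.add(new Constraint(acc, hyp, 100)); // accusation supports hypothesis
              System.out.println(constraints.size() + " constraint(s) in the graph");
          }
      }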

  21. Preparing to Reason • Observations and Alerts are transformed into Accusations and Evidence • Currently translation is done in Soar but may move outside to keep the translation and reasoning separate* • Alert (notification of an anomalous event) -> Accusation (a generic alert) • Observation (notification of an expected event) -> Evidence (a generic observation) • Alerts and Observations are turned into Accusations and Evidence that can be reasoned about.

  22. Alerts and Accusations • By using accusations, the universe of bad behavior used in reasoning is limited, with limited loss of fidelity • The five accusations below are representative of attacks in the system • Value: accused sent malformed data • Policy: accused violated a security policy • Timing: accused sent well-formed data at the wrong time • Omission: expected data was never received from accused • Flood: accused is sending much more data than expected • CSISM uses 5 types of accusations to reason about a potentially infinite number of bad actions that could be reported.
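
  A Java sketch of the translation step (the keyword matching is hypothetical; the real mapping is knowledge-driven and currently done in Soar):

      public class AlertTranslation {
          // The five generic accusation types from the slide.
          enum Accusation { VALUE, POLICY, TIMING, OMISSION, FLOOD }

          // Hypothetical keyword-based translation from a system-specific alert.
          static Accusation translate(String alert) {
              if (alert.contains("malformed")) return Accusation.VALUE;
              if (alert.contains("missing"))   return Accusation.OMISSION;
              if (alert.contains("flood"))     return Accusation.FLOOD;
              if (alert.contains("early") || alert.contains("late")) return Accusation.TIMING;
              return Accusation.POLICY; // default bucket for rule violations
          }

          public static void main(String[] args) {
              System.out.println(translate("Q2SM sent a malformed Spread message")); // VALUE
          }
      }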

  23. Evidence* • While accusations capture unexpected behavior, evidence is used for expected behavior • Evidence limits the universe of expected behavior used in reasoning, with limited loss of fidelity • Alive: The subject is alive • Timely: The subject participated in a timely exchange of information • Specific “historical” data about interactions is used by the OLC, just not in event interpretation • CSISM uses two types of evidence to represent the occurrence of expected actions for event interpretation.

  24. Hypotheses • When an accusation is created, a set of hypotheses is proposed that explains the accusation • For example, a value accusation means either the accuser or the accused is corrupt, and that the accuser is not dead • The following hypotheses (both positive and negative) can be proposed • Dead: Subject is dead; fail-stop failure • Corrupt: Subject is corrupt • Communication-Broken: Subject has lost connectivity • Flooded: Subject is starved of critical resources • OR: a meta-hypothesis that one of a number of related hypotheses is true • Accusations lead to hypotheses about the cause of the accusation.
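
  A Java sketch of proposing hypotheses from an accusation, following the value and omission examples from these slides (the rule bodies are illustrative simplifications):

      import java.util.List;

      public class HypothesisGenerator {
          // value: either accuser or accused is corrupt, and the accuser is not dead.
          // omission: accused dead or corrupt, accuser lying (corrupt), or comms broken.
          static List<String> propose(String type, String accuser, String accused) {
              if (type.equals("value")) {
                  return List.of(
                      "OR(corrupt(" + accuser + "), corrupt(" + accused + "))",
                      "NOT dead(" + accuser + ")");
              }
              if (type.equals("omission")) {
                  return List.of(
                      "NOT dead(" + accuser + ")",
                      "OR(dead(" + accused + "), corrupt(" + accused + "), corrupt("
                          + accuser + "), communication-broken(" + accuser + "," + accused + "))");
              }
              return List.of();
          }

          public static void main(String[] args) {
              propose("omission", "A", "B").forEach(System.out::println);
          }
      }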

  25. Reasoning Structure • Hypotheses, Accusations, and Evidence are connected using constraints • The resulting graph is used for • Coherence search • Proving system facts [Diagram: an accusation is linked by +100 constraints to an OR node over the hypotheses that could explain it (host dead, host corrupt, comm broken); mutually exclusive hypotheses are linked by -400 constraints] • A graph is created to enable reasoning about hypotheses.
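
  A minimal Java sketch of scoring a cluster assignment against such weighted constraints (the +100/-400 weights follow the slide; everything else is illustrative):

      import java.util.Map;

      public class CoherenceScore {
          record Constraint(String a, String b, int weight) {}

          // A positive constraint is satisfied when both nodes are accepted together
          // or rejected together; a negative one is satisfied when they disagree.
          static int score(Constraint[] cs, Map<String, Boolean> accepted) {
              int total = 0;
              for (Constraint c : cs) {
                  boolean together = accepted.get(c.a()).equals(accepted.get(c.b()));
                  if ((c.weight() > 0) == together) total += Math.abs(c.weight());
              }
              return total;
          }

          public static void main(String[] args) {
              Constraint[] cs = {
                  new Constraint("accusation", "host-dead", 100),
                  new Constraint("accusation", "host-corrupt", 100),
                  new Constraint("host-dead", "host-corrupt", -400) }; // mutually exclusive
              System.out.println(score(cs, Map.of(
                  "accusation", true, "host-dead", true, "host-corrupt", false))); // 500
          }
      }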

  26. Proofs about the System • The OLC needs to derive as much certain information as it can, but it needs to do this very quickly. The OLC does model-theoretic reasoning to find hypotheses that are theorems (i.e., always true) or necessarily false • For example, it can assume the attacker has a single platform exploit, and consider each platform in turn, finding which hypotheses are true or false in all cases. Then it can assume the attacker has exploits for two platforms and repeat the process • A hypothesis can be proven true or proven false or have an unknown proof status • Claims: Hypotheses that are proven true “Claims” are definite candidates for response
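
  A sketch of the enumeration argument in Java: a hypothesis counts as proven true if it holds under every single-platform-exploit scenario (the platform names and the toy model are assumptions for illustration):

      import java.util.List;
      import java.util.function.BiPredicate;

      public class PlatformProof {
          // A hypothesis is a theorem if it holds no matter which one platform
          // the attacker can exploit; otherwise its proof status stays unknown.
          static boolean provenTrue(List<String> platforms,
                                    BiPredicate<String, String> holdsIf,
                                    String hypothesis) {
              return platforms.stream().allMatch(p -> holdsIf.test(hypothesis, p));
          }

          public static void main(String[] args) {
              List<String> platforms = List.of("linux", "windows", "solaris");
              // Toy model: "notDead(A)" holds in every scenario because A reported an alert.
              BiPredicate<String, String> model = (h, p) -> h.equals("notDead(A)");
              System.out.println(provenTrue(platforms, model, "notDead(A)")); // true
              System.out.println(provenTrue(platforms, model, "corrupt(B)")); // false: unknown
          }
      }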

  27. Coherence • Coherence partitions the system into clusters that make sense together • For example, for a single accusation either the accuser or the accused may be corrupt but these hypotheses will cluster apart • Responses can be made on the basis of the partition, or partition membership when a proof is not available* In the absence of provable information coherence may enable actions to be taken.

  28. Protection and Cleanup • Without oversight resources can be overwhelmed • Due to flooding: we rate limit incoming messages * • Excessive information accumulation • We take two approaches to mitigate excessive information accumulation* • Removing outdated information by making it inactive • If some remedial action has cleared up a past problem • If new information makes previous information outdated or redundant • If old information contradicts new information • If an inconsistency occurs we remove low-confidence information until the inconsistency is removed • When resources are very constrained more drastic measures are taken • Hypotheses that have not been acted upon for some time will be removed, along with related accusations Resources are reclaimed and managed to prevent uncontrolled data loss or corruption.
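
  The flood-control idea could be realized with a standard token-bucket limiter; a minimal Java sketch (the parameters are illustrative, not CSISM's actual settings):

      public class AlertRateLimiter {
          private final double ratePerSec, burst;
          private double tokens;
          private long lastNanos = System.nanoTime();

          AlertRateLimiter(double ratePerSec, double burst) {
              this.ratePerSec = ratePerSec; this.burst = burst; this.tokens = burst;
          }

          // Refill tokens for elapsed time, then spend one token per accepted alert.
          synchronized boolean tryAccept() {
              long now = System.nanoTime();
              tokens = Math.min(burst, tokens + (now - lastNanos) / 1e9 * ratePerSec);
              lastNanos = now;
              if (tokens >= 1.0) { tokens -= 1.0; return true; }
              return false; // drop or queue the alert
          }

          public static void main(String[] args) {
              AlertRateLimiter limiter = new AlertRateLimiter(100, 10); // 100 alerts/s, burst 10
              System.out.println(limiter.tryAccept());
          }
      }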

  29. Current Status and Future Plans • Knowledge Representation • Accusation translation is implemented • May need to change to better align with the evidences • Evidence implementation in process • Will leverage the code and structure for accusation generation • Use of coherence partition in response selection: ongoing • Protection and Cleanup are being implemented • Flood control development is ongoing • The active/inactive distinction is designed and ready to implement • Drastic hypothesis removal is still being designed • Much work has been accomplished; work still remains.

  30. Response Selection • Main Objectives: • Decide promptly how to react to an attack • Block the attack in most situations • Make “gaming” the system difficult • Reaction based on high-confidence event interpretation • History of responses is taken into account when selecting the next response • Not necessarily deterministic

  31. Response Selection Decision Flow [Diagram: claims and dilemmas from Event Interpretation feed a propose step that yields potentially useful responses; a prune step narrows these to the responses actually issued; learning, history, and a summary inform both steps]

  32. Response Terminology • A response is an abstract OLC action, described generically • Example: quarantine(X), where X could be a host, file, process, memory segment, network segment, etc. • A response will be carried out in a sequence of response steps • Steps for quarantine(X) && isHost(X) include • Reconfigure process protection domains on X • Reconfigure firewall local to X • Reconfigure firewalls remote to X • Steps for quarantine(X) && isFile(X) include • Mark file non-executable • Take specimen then delete • A command is the input to actuators that implement a single response step • Use “/sbin/iptables” to reconfigure software firewalls • Use ADF Policy Server commands to reconfigure ADF cards • Use tripwire commands to scan file systems [Diagram: responses specialize into steps, and each step maps down to commands]
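
  The response -> step -> command hierarchy, sketched as Java records (the command strings are placeholders, not the actual actuator inputs):

      import java.util.List;

      public class ResponseHierarchy {
          record Command(String actuator, String input) {}
          record Step(String description, List<Command> commands) {}
          record Response(String name, List<Step> steps) {}

          public static void main(String[] args) {
              // quarantine(X) specialized for isHost(X); "X" stands for the target host.
              Response quarantineHost = new Response("quarantine(X) && isHost(X)", List.of(
                  new Step("Reconfigure firewall local to X", List.of(
                      new Command("/sbin/iptables", "-A INPUT -s X -j DROP"))),
                  new Step("Reconfigure firewalls remote to X", List.of(
                      new Command("ADF Policy Server", "<policy update for X>")))));
              quarantineHost.steps().forEach(s -> System.out.println(s.description()));
          }
      }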

  33. Kinds of Response • Refresh – e.g., start from checkpoint • Reset – e.g., start from scratch • Isolate -- permanent • Quarantine/unquarantine -- temporary • Downgrade/upgrade – services and resources • Ping – check liveness • Move – migrate component The DPASA design used all of these except ‘move’. The OLC design has similar emphasis.

  34. Response Selection Phases • Phase I: propose • Set of claims (hypotheses that are likely true) implies set of possibly useful responses • Phase II: prune • Discard lower priority • Discard based on history • Discard based on lookahead • Choose between incompatible alternatives • Choose unpredictably if possible • Learning algorithm will tune Phase II parameters

  35. Example • Event interpretation claims “Q1PSQ is corrupt” • Relevant knowledge: • PSQ is not checkpointable • Propose: • (A) Reset Q1PSQ, i.e., reboot, or • (B) Quarantine Q1PSQ using firewall, or • (C) Isolate Quad 1 • Prune: • Reboot has already been tried, so discard (A) • Q1PSQ is not critical, so no need to discard (B) • Prefer (B) to (C) because more easily reversible, but override if too many previous anomalies in Quad 1 • Learning • Modify the definition of “too many” used when pruning (B)
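
  The same propose/prune example written out as straight-line Java for concreteness (the thresholds and history flags are invented for illustration; "too many" is the learned parameter the slide mentions):

      import java.util.ArrayList;
      import java.util.List;

      public class ProposeAndPrune {
          public static void main(String[] args) {
              // Claim from event interpretation: corrupt(Q1PSQ); PSQ is not checkpointable.
              List<String> proposed = new ArrayList<>(List.of(
                  "reset(Q1PSQ)", "quarantine(Q1PSQ)", "isolate(Quad1)"));

              boolean rebootAlreadyTried = true;   // from response history
              int quad1Anomalies = 2, tooMany = 5; // "too many" is tuned by learning

              if (rebootAlreadyTried) proposed.remove("reset(Q1PSQ)"); // discard (A)
              // Prefer the more easily reversible quarantine (B) unless Quad 1 has
              // seen too many prior anomalies, in which case escalate to (C).
              String choice = quad1Anomalies < tooMany ? "quarantine(Q1PSQ)" : "isolate(Quad1)";
              proposed.removeIf(r -> !r.equals(choice));
              System.out.println("Selected response: " + choice);
          }
      }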

  36. Using Lookahead for Pruning • Event interpretation provides an intelligent guess about the attacker’s capability • OLC rules encode knowledge about attacker’s possible goals • Lookahead estimates the potential future state, given assumptions about capability, goals, and response selection • If response X has better potential future than Y, favor X
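
  A one-step lookahead chooser as a minimal Java sketch: score the estimated future state reached by each response and keep the best (the outcome strings and scoring function are stand-ins for the OLC's actual estimates):

      import java.util.Map;
      import java.util.function.ToIntFunction;

      public class LookaheadPruning {
          // Pick the response whose estimated future state scores highest.
          static String best(Map<String, String> futureStateOf, ToIntFunction<String> health) {
              return futureStateOf.entrySet().stream()
                  .max((x, y) -> Integer.compare(
                      health.applyAsInt(x.getValue()),
                      health.applyAsInt(y.getValue())))
                  .get().getKey();
          }

          public static void main(String[] args) {
              Map<String, String> outcomes = Map.of(
                  "quarantine(host)", "attack contained, host offline",
                  "block(host)",      "attack slowed, host degraded");
              // Toy scoring function standing in for the lookahead estimate.
              ToIntFunction<String> health = s -> s.contains("contained") ? 10 : 5;
              System.out.println(best(outcomes, health)); // quarantine(host)
          }
      }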

  37. Status • Design • Rules for proposing responses encoded in first-order logic • Corresponding pruning rules described in English • Implementation • Mounting responses for given hypotheses prototyped in Soar • Actual response logic is being moved outside Soar • Risk mitigation step • Some specific to a particular Learning Exercise run • Much less complete than Event Interpretation, but we are getting there…

  38. Fast Containment Response and Policies Michael Atighetchi (On behalf of Steve Harp)

  39. Inner Loop Controller (ILC) Objectives • Goal: attempt to contain and correct problems at the earliest stage possible • Policy Driven: implement policies and tactics from the OLC on a single host • Autonomous: high-speed response can work when disconnected from the OLC by an attack or failure • Flexible: policies can be updated at any time • Adaptive: use learned characteristics of host and monitored services to tune the policy • Low impact on mission: able to back out of defensive decisions when warranted

  40. Current ILC Policies • Generic... • Excessive resource consumption • Memory, CPU load • Tampering with component files • Unexpected service termination • Unexpected network connections • Specialized... • JBI client registration failure • ADF Agent failure modes • SELinux enforcement status

  41. Policy Example [Diagram: a failed file integrity check on local file F1 asserts “file F1: status=corrupted”, raising a File Tampering Problem; the policy blocks the network and suspends affected services, then pursues a File Cleanup subgoal (quarantine, restore file), and finally removes the block/suspend reasons]
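
  The real policy is a set of Jess rules firing on working-memory facts; for illustration, here is the same containment-then-cleanup-then-back-out sequence as a Java sketch (method names are hypothetical):

      import java.util.ArrayDeque;
      import java.util.Deque;

      public class FileTamperPolicy {
          public static void main(String[] args) {
              String f1 = "F1"; // local file whose integrity check failed
              Deque<String> plan = new ArrayDeque<>();
              // Containment first, then cleanup, then back out of containment
              // (the "low impact on mission" objective from slide 39).
              plan.add("blockNetwork()");
              plan.add("suspendAffectedServices(" + f1 + ")");
              plan.add("quarantine(" + f1 + ")");
              plan.add("restoreFile(" + f1 + ")");
              plan.add("removeBlockSuspendReasons()");
              plan.forEach(System.out::println);
          }
      }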

  42. ILC: Recent Work • ILC Configuration • Made this driven by an (ontology) model of DPASA • No need for configuration files for each host • ILC runs on each simulated host, automatically loads required policy components • Integration • Hitherto ILC has been (unit) tested stand-alone • Initial integration with the Jess DPASA simulator complete, broadening support for multiple policies • Adjustments to API to match simulator

  43. ILC: Current Status • ILC policy to handle various applications • Model driven configuration • Metrics • Rules: 94; Functions: 134; Frames: 24; Globals: 20 • Base reaction time (in unit test): ~4 ms (measuring the inference part only) • Target reaction time: < 100 ms

  44. ILC: Ongoing Work • Complete integration with the rest of the CSISM framework: • DPASA Simulator • ILC-OLC interaction • Designed; integration: TBD • Testing • Verify correct reactions in simulator to various simulated attacks • Measure reaction times

  45. Learning Augmentation Michael Atighetchi (On behalf of Karen Haigh)

  46. Learning Augmentation: Motivation • Why learning? • Extremely difficult to capture all the complexities of the system, particularly interactions among activities • The system is dynamic (static configuration gets out of date) • Adaptation is the key to survival • Core Challenge, comparing training regimes: • Offline Training: + good data, + complex environment, - dynamic system • Online Training: - unknown data, + complex environment, + dynamic system • Human: + good data, - complex environment, - dynamic system • CSISM’s Experimental Sandbox: + good data (self-labeled), + complex environment, + dynamic system • Very hard for the adversary to “train” the learner! • Sandbox approach successfully tried in SRS phase 1

  47. Development Plan for Learning in CSISM • Responses under normal conditions (Calibration) • Important first step because it learns how to respond to normal conditions • Shown at the June PI meeting • Situation-dependent responses under attack conditions • Multi-stage attacks • Since June

  48. Calibration Results for all Registration Times (June 07 PI meeting) [Plot, Beta=0.0005: two “shoulder” points indicate the learned upper and lower limits; as more observations are collected, the estimates become more confident of the range of expected values, i.e., tighter estimates around the observations]
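
  A minimal Java sketch of what such a calibration estimator might look like: bounds expand to cover new observations and slowly tighten so the band stays close to the data (Beta=0.0005 follows the slide; the update rule itself is an assumption, not the actual learner):

      public class Calibration {
          private double lo = Double.MAX_VALUE, hi = -Double.MAX_VALUE;
          private final double beta;

          Calibration(double beta) { this.beta = beta; }

          void observe(double x) {
              // Expand to cover the new observation...
              lo = Math.min(lo, x);
              hi = Math.max(hi, x);
              // ...then shrink the band slightly so stale extremes age out.
              double mid = (lo + hi) / 2;
              lo += beta * (mid - lo);
              hi -= beta * (hi - mid);
          }

          public static void main(String[] args) {
              Calibration c = new Calibration(0.0005);
              for (double t : new double[]{120, 140, 95, 130}) c.observe(t); // registration times
              System.out.printf("expected range: [%.1f, %.1f]%n", c.lo, c.hi);
          }
      }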

  49. Multistage Attacks • Multistage attacks involve a sequence of actions that span multiple hosts and take multiple steps to succeed • A sequence of actions with causal relationships: an action A must occur to set up the initial conditions for action B; action B would have no effect without previously executing action A • Challenge: identify which observations indicate the necessary and sufficient elements of an attack (credit assignment) • Incidental observations are either • side effects of normal operations, or • chaff explicitly added by an attacker to divert the defender • Not yet handled: • Concealment (e.g., to remove evidence) • Probabilistic actions (e.g., to improve probability of success)

  50. Architectural Schema for Learning of Attack Theories and Situation-Dependent Responses [Diagram: sequences of observations and actions from CSISM sensors (ILC, IDS), ending in failure of the protected system (only some observations are essential), feed an Attack Theory Experimenter and a Defense Measures Experimenter running in a “Sandbox”; the outputs are viable attack theories and viable defense strategies and detection rules]
