Cognitive Support for Intelligent Survivability Management (CSISM)
Partha Pal, Franklin Webber, Paul Benjamin, Steve Harp
Washington, DC, Dec 11, 2006
NSWC, Dahlgren Division
Outline
• Background (~10)
  • Self-regenerative Systems
  • CSISM Starting Point and Motivation
  • CSISM Team Configuration
• CSISM Approach
  • Overview (~15)
  • Knowledge Representation (~15)
  • Outer Layer Processing (Soar) (~15)
  • Inner Layer Processing (Policy based) (5-10)
  • Learning Augmentation (10-15)
• Expected Results (5-10)
  • Evaluation Approach
  • Metrics
• Programmatic Aspects (2)
  • Schedule and Milestones
• Conclusion
Self-Regenerative Systems
Regenerative: retain level of service and improve defense. Static and dynamic use of artificial diversity; use of wide-area distribution; automated interpretation of observations and response selection, augmented by learning from past experience.
Survivable (3rd Gen.): graceful degradation. Adaptive response limited to static use of diversity and policy; event interpretation and response selection by human experts.
[Figure: level of service over time after the start of a focused attack, comparing undefended, survivable (3rd Gen.), and regenerative systems]
Our Focus: automated interpretation of observations and response selection.
Survivability Event-Interpretation and Decision-Making Landscape
Example 1: DPASA Learning Exercise Run 6
Event interpretation/decision-making in the blue room:
+25: Invite Combat Ops from Quad 2: no client GUI, HB in Quad 3 only, no HB in Quad 2; Quads 1 and 4 saw HB for a few seconds. Q3DC reports an encryption error. (The inviting quad is suspected of giving bad references.)
+27: Combat Ops log: indeed it got only Q3's references right. Q2AP log: the AP of the inviting quad (Q2AP) did not hear back from Combat Ops, is running an unknown process, and its enforcement is turned off. (Q2AP is definitely participating in the attack.)
+36: (Action: block Q2AP's client-side NIC and invite Combat Ops from Quad 4): only Quad 3 reports it as attached. (So Q4SM is also corrupt; only Quad 3's reference is correct.) However, no Combat Ops client GUI yet.
+39: (Action: block all 4 of Q2AP's NICs): the suspect cannot interact with core hosts now.
+42: Observe that Q2AP was rebooted (by the red team). Invite Combat Ops from Q3SM: this also fails. (Q3SM also has the bad reference then.)
+51: Invite Combat Ops from Q1SM and watch logs. Fails again.
+62: Current hypothesis: all SMs got the bad reference. But from whom? Q2AP, the host already identified as compromised, can only talk to Q2SM, and that too via Q2DC. (Action: block Q2SM's NIC and refresh AP references from the others.)
+64: LocalAPRebooted pressed for Q1, Q3, and Q4; re-invite Combat Ops: succeeds.
+72~104: Attempt to accelerate JEES's and EDC's publication by stopping and restarting them.
+148: End (scenario completed).
Real explanation: The bad references sent from Q2SM were accepted by other SMs and sent to clients. The bad references affected heartbeats from the registry but not the LCs. Because IP addresses are used to look up signing keys, PSQ operation was also impacted. JEES and EDC bogged down because the system was running with 3 quads (Q2AP was blocked), they got some bad AP references, and the PSQ has an exponential back-off. The attack code exploited a known flaw in the SM implementation, a fix for which was developed but not permitted in this exercise. Nevertheless, this run survived!
Example 2: DPASA Learning Exercise Run 19
Events in the blue room:
+3: SM suggests rebooting Q2DC: veracity trying to go to vendor home page; ignored.
+8: PSQ halted; Q2AP cannot reach Q2PSQ. Try to look at the log; seems to fail.
+9: LC on Q3PSQ alerts of high load.
+10: All SMs show failure to poll MySQLs.
+42: Corrupt PSQ.jar file detected on Q1PSQ. (Assume this was the attack source.) (Action: Q1PSQ's NIC is blocked.) Q2PSQ unreachable (high load). (Action: reboot.) All 4 quads still show failure to poll their MySQLs.
+44: (Action: fresh package to Q1PSQ; restart it from a clean state.)
+53: MySQL and PSQ processes restarted on all four PSQ servers; things progress a bit (WxHaz publishes; CAF, MAF queries) and then we are in the same state again: PSQs halt. (Q1PSQ [the previous suspect] is blocked, so one of these three clients must be the trigger. WxHaz is the prime suspect, implicated by a large IO alert.)
+54: (Action: restart all PSQs.) Q3 and Q4 SMs got killed accidentally! Q3 and Q4 won't have the clients' registration information, which Q1 and Q2 would have and which is needed for PSQ. On the other hand, Q1 and Q2 PSQ servers do not have the published IOs. Some participation from all quads is needed to make progress. (In a very bad situation! Need to restart entire quads.) (Action: restart quads 3 and 4.)
+73: WxHaz client restarted as a test; it registers successfully to quads 3 and 4. Polling of MySQL still failing. Q2PSQ is known to be slow; trying to find what is wrong with Q1PSQ, which was cleansed before, and with the Q3PSQ and Q4PSQ hosts.
+80: (No harm in restarting Q4PSQ.) (Action: Q1PSQ and Q4PSQ hosts restarted.)
+89: (Could it be waiting for a quorum?) (Action: set quorum size to 2; remove Q1PSQ from the PSQ group. Rehydrate EDC.) EDC starts processing the cached TNL.
+93: All clients successfully re-registered.
+94: Quad 1 isolated using the QIS, after blocking Q1AP's client-side NIC.
+145: MySQL on Q3PSQ and Q4PSQ still locked. (Action: restart MySQL.) Fails. (Action: block external ADF NIC on Q4AP.)
+165: End (blue conceded; scenario not completed).
Real explanation: The MySQL servers were the target of the attack. Although attack code was found on Q1PSQ, it was likely a decoy or leftover. The more likely source was one of WxHaz, CAF, or MAF (shown in yellow). All PSQ hosts (shown in red) were affected: MySQL servers were not responding to polls and in some cases could not even be stopped and restarted, halting PSQ operation. Java, and specifically JDBC, provided good built-in protection against buffer overflow attacks; otherwise the red team would have been able to run more damaging attacks. Operator error dug the system into a deeper hole: the operators' focus did not consider the possibility of WxHaz etc. as the attack source.
CSISM Starting Point
• Note the issues:
  • Expert operators (costly; limits the attraction of emerging survivability technologies and approaches)
  • Subject to human limitations (speed, coverage, alertness of the operator on that day, etc.)
BAA Challenges:
• Need to detect 50% of all attacks and generate effective responses automatically
  -- DPASA detected and responded to most of the cases, but needed human expertise
  -- SRS 1 efforts achieved 10% efficiency under narrowly defined conditions
• Respond within 250 ms, but comprehensive exploration takes time
• False positives/incomplete observation lead to bad decisions
  -- Observations and models will always be imperfect
  -- DPASA and SRS 1 efforts all had false positives
• Repetition and/or attack variants diminish the value of defense
CSISM Starting Point Contd.
• Better survivability management → self-healing
  • Can we automate the interpretation and decision making shown in DPASA,
  • without significantly sacrificing the level and quality of the decisions?
• Building up resistance against attack repetition and variants → cognitive immunity
  • Can we use learning techniques to improve the interpretation and decision-making capability over time?
• What will it take to demonstrate this at system scale,
  • but without having to build an entire defense-enabled system like DPASA?
• Ultimately, smarter use of available information about the system → smarter, better managed system
CSISM Team Configuration
• BBN
  • Survivability management information and architecture
  • Knowledge representation and partitioning
  • Learning
  • Experimental platform and integration
• Adventium
  • Policy-based fast containment response
  • Learning
• Pace University
  • Cognitive Engine
CSISM Approach: Overview
• Interpret observed events and decide appropriate action
  • The space in which events are interpreted: knowledge about the system; symptoms and their probable causes; response capabilities, their applicability and impacts, etc.
  • Process reported events as the intelligent human operators did
    • Interpret: consider multiple possibilities in parallel
    • Decide: consider a number of actions and potential counter-actions before choosing
    • Prune and prioritize
  • Requirements
    • Good KR, effective partitioning
    • Fast processing engine
    • A fast containment response to buy time for the cognitive processing
• Build up resistance against attack repetition and variants
  • Detection: experiments to generalize and test attacks extended over multiple hosts and steps
  • Response: learn situation-dependent utility of responses
  • Requirements
    • Supervisory access to the reported events, interpretations, and decisions made based on them
    • Defend the inner control loop and the outer cognitive processing engine
CSISM Outer Cognitive Processing
• Parallel exploration of multiple sets of facts and rules that are consistent with observations, including those that are not definitive indicators of attack effects
• Risk-benefit analyses from both the defense and adversary points of view to decide whether and how to respond
• Use of encoded generic cyber-defense insight to prune and guide exploration
• Addition of an active learning component to generalize attack characteristics and refine response utility via experimentation
[Diagram: CSISM outer cognitive loop over a policy-based inner control loop]
CSISM Inner Control Loop
• Despite pruning and guiding based on expert insights, the deep deliberation can be slow
• Therefore, a fast-acting, policy-based local control loop is used to augment the CSISM cognitive outer control loop (a sketch of the two-loop interplay follows)
• More about knowledge representation, outer cognitive processing, the ILC, and learning in the next sections...
[Diagram: CSISM outer cognitive loop over a policy-based inner control loop]
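To ground the two-loop idea, here is a minimal sketch in Python, with invented event types, policy entries, and timings (none of this is CSISM code): the inner loop applies a static containment policy within the event-handling call, while the outer loop deliberates over the same event stream asynchronously.

```python
import queue
import threading
import time

# Hypothetical two-loop skeleton: FAST_POLICIES stands in for the ILC's
# static policy; the slow outer loop stands in for cognitive deliberation.
FAST_POLICIES = {
    "unknown_process": "isolate_process",   # illustrative event -> containment
    "unknown_file": "delete_file",
}

event_q = queue.Queue()          # events handed off to the outer loop

def inner_loop(event):
    """React immediately using static policy, then defer to deliberation."""
    action = FAST_POLICIES.get(event["type"])
    if action:
        print(f"ILC: {action} on {event['host']}")   # containment buys time
    event_q.put(event)

def outer_loop():
    """Slow path: interpret the event and choose a considered response."""
    while True:
        event = event_q.get()
        time.sleep(0.5)          # stands in for deep, pruned exploration
        print(f"OLC: deliberated response for {event['type']} on {event['host']}")

threading.Thread(target=outer_loop, daemon=True).start()
inner_loop({"type": "unknown_process", "host": "hostA"})
time.sleep(1)                    # give the outer loop time to respond
```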
Key Innovations
• Use of rules, facts, and operators about the system as the "internal world model" of the cognitive mechanism for cyber-defense event interpretation and response
  -- A useful "knowledge model" organized in partitioned classes is easier to build than an executable model or specification of similar utility
  -- Parallel exploration of multiple possible hypotheses in the knowledge space enables better interpretation of uncertain indicators, imperfect sensor reports, and precursor events
  -- Expert insight is also encoded as additional knowledge to help eliminate unlikely hypotheses
• Response selection based on predictive risk-payoff analyses from both the defense and adversary points of view
  -- Survivability-management decision-making is a two-person zero-sum game between the defense and the adversary
  -- Look-ahead and payoff analysis will provide a better defensive response than analyses that consider only the defense's utility
• Use of experiment-based learning to derive generalized characteristics of multi-step attacks
  -- Experiment with hypotheses that consider time and space along with other axes of vulnerability; no prior art in this area
• Learn and refine situation-dependent utility of responses by continuous observation
  -- Expected to have a wider impact than observation-based learning done in the past
Knowledge Representation
• Extract knowledge from Red Team encounters; attempt to generalize
• Goals
  • Separate generic, reusable knowledge from system-specific knowledge
  • Include both cyber-defense and cyber-attack knowledge, to permit look-ahead
  • Encode enough detail to estimate the relative goodness of alternatives in most situations; expect a steep "learning curve" for CSISM
Kinds of Knowledge
• Symptomatic: possible explanations for a given anomalous event
  • Both generic and system-specific
• Relational: constraints that reinforce or eliminate possible explanations
  • Mostly system-specific
• Teleological: possible attacker goals and actions that may be used to accomplish the goals
  • Mostly generic
• Reactive: possible defensive countermeasures for a given attack
  • Both generic and system-specific
Symptomatic Knowledge (a code sketch of this mapping follows)
• Example: component A claims component B is unresponsive
  • B might have crashed or not yet started up;
  • A might be corrupt;
  • No communication path between A and B may exist;
  • Communication might be blocked by flooding;
  • A might have bad data about B's location.
• Example: A claims B sent malformed data
  • Either A or B may be corrupt;
  • Data may have been corrupted in transit.
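As an illustration of how symptomatic knowledge might be tabulated, here is a minimal sketch in Python (the predicate names and dictionary encoding are our own shorthand, not CSISM's Soar encoding): each symptom yields every explanation consistent with it, and all are kept alive for parallel exploration.

```python
# Illustrative encoding of the two examples above; in CSISM this knowledge
# would live in Soar rules rather than a Python dictionary.
SYMPTOM_RULES = {
    "unresponsive(A, B)": [
        "crashed(B)", "not_started(B)", "corrupt(A)",
        "no_path(A, B)", "flooded_path(A, B)", "bad_location_data(A, B)",
    ],
    "malformed_data(A, B)": [
        "corrupt(A)", "corrupt(B)", "corrupted_in_transit(A, B)",
    ],
}

def explain(symptom):
    """Return every explanation consistent with the reported symptom."""
    return SYMPTOM_RULES.get(symptom, ["unknown_cause"])

print(explain("unresponsive(A, B)"))   # all six hypotheses stay in play
```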
Relational Knowledge (a pruning sketch follows)
• Examples:
  • The system comprises services P, Q, and R.
  • Service R is implemented by server A.
  • Server A runs on host H.
  • Host H runs Linux.
  • A communication path exists between A and B, via C and D, but no path exists between A and X.
  • Communication between P and Q is blocked by security policy in switch S.
  • Data sent by A to B is digitally signed using a private key on a smart card.
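Relational knowledge earns its keep by pruning symptomatic hypotheses. A minimal sketch, again with invented predicate names: structural facts eliminate explanations they contradict.

```python
# Illustrative facts drawn from the examples above.
FACTS = {
    "path_exists": {("A", "B")},       # A reaches B (via C and D)
    "signed_traffic": {("A", "B")},    # data from A to B is digitally signed
}

def prune(hypotheses, src, dst):
    keep = []
    for h in hypotheses:
        if h == f"no_path({src}, {dst})" and (src, dst) in FACTS["path_exists"]:
            continue   # a known communication path contradicts this hypothesis
        if (h == f"corrupted_in_transit({src}, {dst})"
                and (src, dst) in FACTS["signed_traffic"]):
            continue   # signing makes silent in-transit corruption implausible
        keep.append(h)
    return keep

print(prune(["no_path(A, B)", "corrupt(A)", "corrupted_in_transit(A, B)"],
            "A", "B"))                 # only corrupt(A) survives
```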
Teleological Knowledge (a goal-tree sketch follows)
• Example goal: corrupt the system
  • Subgoal: corrupt one of the system's services
• Example goal: corrupt a Byzantine fault-tolerant service with N servers
  • Subgoal: corrupt more than N/3 servers
  • Subgoal: corrupt inter-server communication
• Example goal: corrupt a server
  • Subgoal: exploit OS vulnerability
• Example goal: corrupt communication
  • Subgoal: establish man-in-the-middle
• Example goal: block communication
  • Subgoal: flood a subset of links
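The goal/subgoal structure above is naturally a tree; the sketch below (names abbreviated from the slide, encoding our own) shows how walking it supports look-ahead about what an attacker may try next.

```python
# Attacker goal tree transcribed from the examples above.
GOAL_TREE = {
    "corrupt_system": ["corrupt_service"],
    "corrupt_service": ["corrupt_bft_service", "corrupt_server"],
    "corrupt_bft_service": ["corrupt_more_than_N_over_3_servers",
                            "corrupt_interserver_comm"],
    "corrupt_server": ["exploit_os_vulnerability"],
    "corrupt_communication": ["establish_man_in_the_middle"],
    "block_communication": ["flood_subset_of_links"],
}

def expand(goal, depth=0):
    """Depth-first walk of the subgoals that could serve an attacker goal."""
    print("  " * depth + goal)
    for sub in GOAL_TREE.get(goal, []):
        expand(sub, depth + 1)

expand("corrupt_system")
```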
Reactive Knowledge (a selection sketch follows)
• Example attack: corrupt component A
  • Countermeasure: restart A
  • Countermeasure: restart A from checkpoint
  • Countermeasure: reboot A's host
  • Countermeasure: quarantine A's host
• Example attack: flooding LAN L from host H
  • Countermeasure: restart H
  • Countermeasure: quarantine H
  • Countermeasure: quarantine L
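Reactive knowledge can be read as a ranked menu per diagnosed attack. A minimal sketch, where the ordering (least to most disruptive) is our own assumption:

```python
# Countermeasure menus from the examples above, least disruptive first
# (the ordering is an assumption, not taken from the CSISM slides).
COUNTERMEASURES = {
    "corrupt_component": ["restart", "restart_from_checkpoint",
                          "reboot_host", "quarantine_host"],
    "flooding": ["restart_source_host", "quarantine_source_host",
                 "quarantine_lan"],
}

def next_response(attack, already_tried):
    """Escalate past countermeasures that have already failed."""
    for option in COUNTERMEASURES.get(attack, []):
        if option not in already_tried:
            return option
    return "escalate_to_operator"

print(next_response("flooding", {"restart_source_host"}))
```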
Parameterization
• Each of the 4 kinds of knowledge will include various adjustable parameters
  • Relative likelihoods
  • Confidence levels
  • Correlations
• Parameter adjustments will be part of the learning process (a sketch of one such adjustment follows)
• Potential weakness: an attacker may exploit estimates of "likelihood" and "confidence"
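One plausible shape for such parameter adjustment, shown purely as a sketch (the update rule and constants are our assumptions): likelihoods move toward outcomes that experiments confirm and are renormalized so the alternatives keep competing.

```python
# Illustrative starting likelihoods over competing explanations.
likelihood = {"corrupt(A)": 0.30, "crashed(B)": 0.50, "flooded_path(A, B)": 0.20}

def update(hypothesis, confirmed, rate=0.2):
    """Nudge toward 1.0 on confirmation, toward 0.0 on refutation."""
    target = 1.0 if confirmed else 0.0
    likelihood[hypothesis] += rate * (target - likelihood[hypothesis])
    total = sum(likelihood.values())          # renormalize the alternatives
    for h in likelihood:
        likelihood[h] /= total

update("crashed(B)", confirmed=True)
print(likelihood)   # crashed(B) gains mass; the others shrink proportionally
```

The slide's caveat applies to exactly these numbers: an adversary who can predict the update rule can try to steer it.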
Estimates of Complexity
• Symptomatic knowledge: ~200 Soar rules
• Relational knowledge: ~1000 Soar rules
• Teleological knowledge: ~25 Soar rules
• Reactive knowledge: ~40 Soar rules
Building a Cognitive Controller
• Using the Soar cognitive architecture
  • Problem Spaces contain types of knowledge
  • Subgoaling searches alternatives
  • Chunking speeds up performance
  • Works with large rulesets in real time
• Leveraging experience in cybersecurity (VMSoar) and robotics (ADAPT) at Pace University
Evaluating Hypotheses with Soar
[Diagram: events and current knowledge about the system structure and state feed hypothesis generation, which explores possible future paths of network behavior; selected actions flow to the Inner Loop Control]
Evaluating Hypotheses with Soar
• Soar will reason about hypotheses
  • consistent with what is known and encoded about the system,
  • considering the adversary's potential counter-actions in response to the defense's actions before choosing an action (see the look-ahead sketch below)
• The knowledge is very incomplete and uncertain
• Soar's goal is to direct the system to gather necessary information until enough is known to take action
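The counter-action reasoning can be pictured as a tiny two-player game, matching the zero-sum framing on the Key Innovations slide. A minimal sketch (this is not Soar code; actions and payoffs are invented): the defender scores each candidate action by its worst case under the adversary's best reply.

```python
# (defense_action, adversary_counter) -> payoff to the defense; invented values.
PAYOFF = {
    ("restart_A", "reinfect_A"): -2,
    ("restart_A", "do_nothing"): +3,
    ("quarantine_host", "reinfect_A"): +1,
    ("quarantine_host", "do_nothing"): -1,   # availability cost, no attacker gain
}

def best_defense():
    actions = {a for a, _ in PAYOFF}
    def worst_case(a):
        # assume the adversary picks the counter that hurts the defense most
        return min(p for (act, _), p in PAYOFF.items() if act == a)
    return max(actions, key=worst_case)

print(best_defense())   # quarantine_host: worst case -1 beats restart_A's -2
```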
Challenges
• Encoding the various types of reasoning:
  • Teleological
  • Reactive
  • Relational
  • Symptomatic
• Ensuring anytime capability
• Integration with the inner control loop
• Building the network model
• Evaluating the effectiveness of the knowledge
• Designing the human/Soar interaction
• Defending the Soar-based engine via redundancy
Policy-Based Inner Loop Control
• Implement policies and tactics from the OLC on a single host
  • provide feedback to the OLC
• Act autonomously
  • high-speed execution of a reactive defense plan
  • work when disconnected from the OLC
• Low impact on mission
  • able to back out of defensive decisions when warranted
Inner Loop Application Defense
Individual applications are governed by a Security Policy Layer, which may include multiple mechanisms such as:
• SELinux, CSA, systrace,
• Java application security policy,
• Local (software) firewall
Application policies, static and dynamic, are communicated by the Outer Loop Control.
[Diagram: App 1 behind the security policy layer, backed by a Policy Db]
ILC – Application Interaction
Sense application conditions from:
• the instrumented application
• the operating system (process resource status)
• policy enforcement and audit mechanisms
The ILC can force application recovery (a sketch of the escalation ladder follows):
• signal the application to checkpoint its state
• restore application state from a saved checkpoint
• destroy and rebuild from the saved/initial state
[Diagram: ILC controller and factory exchanging signals/commands and sensed conditions with App1; checkpoint/restore backed by a Checkpoint Db]
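The recovery options above form an escalation ladder. A minimal sketch, with all hook names hypothetical (a real ILC would invoke the platform's own checkpoint/restore mechanisms):

```python
# Escalation ladder from the slide: least disruptive recovery first.
RECOVERY_LADDER = [
    "signal_checkpoint",          # ask the instrumented app to save state
    "restore_from_checkpoint",    # restore application state from saved
    "rebuild_from_clean_image",   # destroy and rebuild from saved/initial state
]

def recover(app, hooks):
    """Try each step in order; report the first that succeeds."""
    for step in RECOVERY_LADDER:
        if hooks[step](app):
            return step
    return "escalate_to_OLC"

# Demo hooks: checkpointing fails (app too corrupt), restore succeeds.
demo_hooks = {s: (lambda app, ok=(s != "signal_checkpoint"): ok)
              for s in RECOVERY_LADDER}
print(recover("App1", demo_hooks))   # -> restore_from_checkpoint
```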
ILC – OLC Interaction
The OLC communicates policies comprising reactive plans to the ILCs, and receives status information about the application. Example policy rules (sketched in code below):
1. If a new file not specified in the policy appears, then delete that file.
2. If a new process not specified in the policy is detected, then isolate that process.
[Diagram: Outer Loop Control linked to an ILC's controller, factory, and ILC Policy Db]
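The two numbered rules translate almost directly into code. A minimal sketch, with the allowed-file and allowed-process sets as invented placeholders for the ILC Policy Db:

```python
# Hypothetical policy content; a real ILC would read this from its Policy Db.
POLICY = {
    "allowed_files": {"/opt/app1/app.jar", "/opt/app1/config.xml"},
    "allowed_processes": {"app1", "ilc"},
}

def enforce(observed_files, observed_processes):
    actions = []
    for f in observed_files - POLICY["allowed_files"]:
        actions.append(("delete_file", f))        # rule 1: unknown file
    for p in observed_processes - POLICY["allowed_processes"]:
        actions.append(("isolate_process", p))    # rule 2: unknown process
    return actions

print(enforce({"/opt/app1/app.jar", "/tmp/evil.sh"}, {"app1", "nc"}))
```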
ILC Self-Defense
• Local autonomous recovery from unexpected fatal errors and attacks
• Redundant ILCs
  • "warm spares" (see the failover sketch below)
• Other defenses
  • Hardware watchdog
  • Hardened ILC databases
[Diagram: an ILC with warm-spare ILCs, controller, factory, and a watchdog timer]
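One way a warm spare could decide to take over, sketched with invented timings (the slide's hardware watchdog plays the analogous role for the primary itself):

```python
import time

HEARTBEAT_TIMEOUT = 2.0          # seconds of silence before failover (invented)
last_heartbeat = time.monotonic()

def primary_beats():
    """Called by the primary ILC while it is alive."""
    global last_heartbeat
    last_heartbeat = time.monotonic()

def spare_check():
    """Called periodically by a warm spare."""
    if time.monotonic() - last_heartbeat > HEARTBEAT_TIMEOUT:
        print("spare ILC: primary silent, taking over")
    else:
        print("spare ILC: primary healthy")

primary_beats()
spare_check()                    # primary healthy
time.sleep(2.1)
spare_check()                    # primary silent, taking over
```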
Learning Augmentation: Motivation
• Adaptation is key to survival; we made some headway in using adaptation in defense for things like
  • graceful degradation
  • recovery
• Time to explore how to adapt for
  • improving the defensive posture
    • better knowledge (about the attacks or attacker), better policies
  • improving how the system responds to symptoms
    • better connection between response actions and their triggers
• Learning techniques are enablers for the next level of enhancements in adaptive defense
• (Long-term goal) "Cognitive immunity": the ability to be safe/better defended against future attacks (repeats/variants) based on past experience
Approaches to Learning Defenses
• Passive observation
  • Detecting attacks: relatively well explored (example: anomaly detection); work remains to be done to detect attacks extended over multiple hosts and steps.
  • Responding to attacks: not well explored; CSISM innovation: situation-dependent utility of responses.
• Experiment (e.g., in sandbox / tester / laboratory)
  • Detecting attacks: examples: Cortex, HACQIT; CSISM innovation: use experiments to generalize and test multi-step attacks.
  • Responding to attacks: not well explored; probably difficult to achieve outside narrowly defined problems.
Multistep Attacks: Example
• Hypothetical example attack observables:
  • Send a malformed request to service S on host H that renders S temporarily unavailable
  • Exploit a faulty script in web service W on H to install attack tools in the /tmp directory on H
  • Client C attempts to use S but fails...
  • Advertise a phony service S' on host H'
  • Client C attempts to use S': "success"
  • S' serves bogus data to client C
The Multistep Learning Problem
• Goal: detect and generalize multi-step attacks
  • May extend across hosts and services
  • Time scale may be chosen by the attacker
  • Immunize against broad classes of related multi-step attacks
• Challenges:
  • Which observations are necessary and sufficient for attack success? Eliminate side effects and chaff added by the attacker to divert attention
  • What are the most reliable observations?
  • What are the parameter boundaries?
• Approach: experimentation in a realistic off-line sandbox
  • Observe, hypothesize, generalize, experiment, and validate
Learning Multistep Attacks: Approach
• Observables
  • Incoming traffic and sensor reports
• Techniques (a candidate-sequence sketch follows)
  • Form a candidate set of sequences
    • Frequent-pattern analysis and anomaly detection
  • Identify suspicious elements within patterns
    • Axes of vulnerability and modeling of normal activity
  • Identify key elements within patterns
    • Design experiments to update models of the most important axes, determine which factors interact, and find order dependencies
• Potential output
  • New Soar rules or modification of existing ones
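The "form candidate sequences" step might start as simply as mining repeated event n-grams from sensor logs. A sketch with invented event names echoing the multistep example two slides back; real frequent-pattern analysis would be considerably richer:

```python
from collections import Counter
from itertools import islice

# Invented event log: the three-step prefix of the multistep example repeats.
log = ["malformed_req_S", "web_exploit_W", "client_fail_S",
       "phony_advertise_Sprime", "client_connect_Sprime", "bogus_data_C",
       "malformed_req_S", "web_exploit_W", "client_fail_S"]

def ngrams(seq, n):
    """All length-n windows of the sequence."""
    return zip(*(islice(seq, i, None) for i in range(n)))

candidates = Counter(ngrams(log, 3))
for pattern, count in candidates.most_common(2):
    print(count, pattern)   # repeated windows become multi-step candidates
```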
Situation-Dependent Action Problem
• Example attack/response sequence:
  • Corrupt service-S libraries discovered on host H1 (primary server)
  • Defender attempts to restart H1
  • Corrupt service-S libraries discovered on host H2 (backup server)
  • Defender attempts to restart H2
  • H1 fails to restart
  • H2 fails to restart
  • All clients stalled :-(
• We took some actions, but definitively lost this round; next time we must try something different...
Situation-Dependent Action Utilities
• Learn tradeoffs among potential responses
  • Complex and time-varying domain: the appropriateness of responses changes (e.g., don't take risks with the backup when the primary is down)
• Actions have different costs and benefits under different conditions:
  • Learn Utility(Action | Conditions) = f(Cost, Benefit)
  • Conditions include descriptions of users, attack elements, system performance, etc.
  • Cost includes the effort to mount a response and the impact on availability
• Measuring the effect of responses is hard:
  • Complex domain, rarely identical situations, non-deterministic actions/effects (e.g., "restart" sometimes fails)
  • Use an exploration mode to observe similar conditions
• A sketch of one way to estimate these utilities follows
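Here is one minimal way to learn Utility(Action | Conditions), sketched as a running average per (condition, action) pair with an exploration mode; conditions, actions, and payoffs are all invented:

```python
import random
from collections import defaultdict

utility = defaultdict(float)   # (condition, action) -> estimated utility
counts = defaultdict(int)

def choose(condition, actions, epsilon=0.1):
    """Mostly exploit the best-known action; explore with probability epsilon."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: utility[(condition, a)])

def observe(condition, action, payoff):
    """Incremental mean: each observed outcome refines the estimate."""
    key = (condition, action)
    counts[key] += 1
    utility[key] += (payoff - utility[key]) / counts[key]

# From the previous slide's lost round: restarting while the primary is down
# proved costly, so the estimates should steer us elsewhere next time.
observe("primary_down", "restart_backup", -5.0)
observe("primary_down", "quarantine_backup", +2.0)
print(choose("primary_down", ["restart_backup", "quarantine_backup"], epsilon=0))
```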
Evaluation Approach
• Goal: develop and demonstrate "system scale" technology
  • Ease of construction (of the test/demo harness)
  • Flexibility for extension and scaling (in terms of nodes, processes, defense mechanisms)
  • Ease of injection (of symptoms or attack effects)
  • Ease of interfacing (what is visible to and controlled by the ILC and OLC)
• Possibilities include
  • A real system (e.g., DPASA): costly and complex to build, as well as to create attack scenarios for
  • A compressed configuration of a real system: loss of some key information; attack scenarios are harder to build
  • Virtualization/emulation: easier to set up, easier to create all kinds of attack scenarios
  • Stubbed-object based
  • Rule based
Metrics
• BAA metrics
  • Detect 50% of attacks and generate effective responses
  • Respond within 250 ms
  • For every 10 effective responses generated, generate no more than 1 response in error
  • If a response impacts availability negatively, demonstrate a learning process that converges on effective responses that optimally preserve availability
• Internal metrics
  • As good as or better than the DPASA operators for the DPASA attacks (in decision making; note we are not creating new defenses)
  • Speed of response
  • Accuracy of response
  • Successfulness of response
Expected Impact
• Answers to key questions about cognitive approaches in cyber-defense
  -- Can parallel evaluation of multiple possibilities adequately address the problem of incomplete or imperfect information?
  -- Can a small number of expert-level insights adequately prune or guide the search and exploration for effective cyber-defense responses?
Short term
• Expert-level survivability-management decision-making without expert involvement at runtime
  -- Can enhance 3rd-generation survivable systems
  -- An essential building block for the next generation (self-regenerative) of survivable systems
• A key milestone toward achieving the long-term goal of systems that can defend themselves
Long term
• Smarter and better managed systems: systems that make better use of available information about themselves
  -- Deliver better QoS
  -- Better utilization of resources
  -- Reduced cost of ownership, etc.
Schedule and Milestones
Conclusion
• Leverage prior red team experiences, including DPASA
  • Understanding the problem of survivability management
  • Understanding how human operators behaved
  • Realistic system context
• Human defenders' behavior and cognitive processing
  • The Soar style seems to be a good match
  • Application of a cognitive mechanism (KR and processing) to a very specialized domain
    • Further narrowed by heuristic knowledge
    • Complemented by fast policy-based containment response
• Rapid prototyping and experimentation
• Explore learning as supervisory augmentation of adaptive behavior
• Better survivability management enables usability of emerging survivability techniques
  • Smarter security administration at affordable cost
• Intelligent use of available information about a system → smarter and better managed systems
  • More productive systems, lower cost of ownership, etc.