Self-defending software: Automatically patching security vulnerabilities
Michael Ernst, University of Washington
Demo: automatically fix previously unknown security vulnerabilities in COTS software.
Goal: Security for COTS software Requirements: • Preserve functionality • Protect against unknown vulnerabilities • COTS (commercial off-the-shelf) software Key ideas: • Learn from failure • Leverage the benefits of a community
1. Preserve functionality • Maintain continuity: application continues to operate despite attacks • For applications with high availability requirements • Important for mission-critical applications • Web servers, air traffic control, communications • Technique: create a patch (repair the application) • Previous work: crash the program when attacked • Loss of data, overhead of restart, attack recurs
2. Unknown vulnerabilities • Proactively prevent attacks via unknown vulnerabilities • “Zero-day exploits” • No pre-generated signatures • No time for human reaction • Works for bugs as well as attacks
3. COTS software • No modification to source or executables • No cooperation required from developers • No source information (no debug symbols) • Cannot assume built-in survivability features • x86 Windows binaries
4. Learn from failure • Each attack provides information about the underlying vulnerability • Detect all attacks (of given types) • Prevent negative consequences • First few attacks may crash the application • Repairs improve over time • Eventually, the attack is rendered harmless • Similar to an immune system
5. Leverage a community • Application community = many machines running the same application (a monoculture) • Convenient for users, sysadmins, hackers • Problem: A single vulnerability can be exploited throughout the community • Solution: Community members collaborate to detect and resolve problems • Amortized risk: a problem in some leads to a solution for all • Increased accuracy: exercise more behaviors during learning • Shared burden: distribute the computational load
Evaluation • Adversarial Red Team exercise • Red Team was hired by DARPA • Had access to all source code, documents, etc. • Application: Firefox (v1.0) • Exploits: malicious JavaScript, GC errors, stack smashing, heap buffer overflow, uninitialized memory • Results: • Detected all attacks, prevented all exploits • For 7/10 exploits, generated a patch that maintained functionality • After an average of 8 attack instances • No false positives • Low steady-state overhead
Collaborative learning approach
• Monitoring: detect attacks; pluggable detector, does not depend on learning
• Learning: infer normal behavior (a model of behavior) from successful executions
• Repair: propose, evaluate, and distribute patches based on the behavior of the attack
• All modes run on each member of the community; modes are temporally and spatially overlapped
(Diagram: normal execution feeds Learning, which builds the model of behavior checked during Monitoring; an attack or bug triggers Repair.)
Structure of the system
• Community machines communicate with a central server (the server may be replicated, distributed, etc.)
• Encrypted, authenticated communication
• Threat model does not (yet!) include malicious nodes
Learning: observe normal behavior
• Server distributes learning tasks to the community machines
Learning: generalize observed behavior
• Clients do local inference
• Clients send inference results to the server
• Server generalizes (merges) the results
• Example: one client infers copy_len < buff_size, another infers copy_len = buff_size; the server merges these into copy_len ≤ buff_size (see the sketch below)
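As a concrete illustration of the merge step, here is a minimal C sketch, assuming each client summarizes which comparison relations held on its observations as a bitmask. The encoding and the names (merge, describe) are hypothetical, not the system's actual protocol; the point is that the server keeps the weakest relation covering every client's observations.

```c
#include <stdio.h>

/* Hypothetical encoding: each bit records that a relation was observed. */
enum { REL_LT = 1, REL_EQ = 2, REL_GT = 4 };

/* Merge per-client observations: the union of observed relations. */
static int merge(const int *client_masks, int n) {
    int merged = 0;
    for (int i = 0; i < n; i++)
        merged |= client_masks[i];
    return merged;
}

static const char *describe(int mask) {
    switch (mask) {
        case REL_LT:          return "copy_len < buff_size";
        case REL_EQ:          return "copy_len = buff_size";
        case REL_LT | REL_EQ: return "copy_len <= buff_size";
        case REL_GT:          return "copy_len > buff_size";
        case REL_GT | REL_EQ: return "copy_len >= buff_size";
        case REL_LT | REL_GT: return "copy_len != buff_size";
        default:              return "no constraint";
    }
}

int main(void) {
    /* One client saw only '<'; another saw only '='. */
    int clients[] = { REL_LT, REL_EQ };
    printf("merged: %s\n", describe(merge(clients, 2)));  /* prints <= */
    return 0;
}
```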
Monitoring: detect attacks
• Assumption: few false positives
• Detectors used in our research: code injection (Memory Firewall) and memory corruption (Heap Guard); many other possibilities exist
• Detector collects information and terminates the application
Monitoring: what was the effect of the attack?
• Instrumentation continuously evaluates the learned behavior
• Clients send the difference in behavior (violated constraints, e.g., copy_len ≤ buff_size) to the server
• Server correlates constraints to the attack
Repair: propose a set of patches for each behavior correlated to the attack
• Server generates a set of patches; e.g., for the correlated constraint copy_len ≤ buff_size, the candidate patches are:
• Set copy_len = buff_size
• Set copy_len = 0
• Set buff_size = copy_len
• Return from the procedure
(A source-level sketch of these candidates follows.)
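The candidate patches amount to small enforcement checks at the vulnerable program point. The C sketch below shows them in source form for the running copy_len/buff_size example; this is illustrative only, since the actual system injects the equivalent checks into the x86 binary rather than editing source.

```c
#include <stddef.h>
#include <string.h>

/* The vulnerable copy, guarded by candidate enforcement patches.
 * Only one patch is active at a time; the others are shown as comments. */
void guarded_copy(char *buff, size_t buff_size,
                  const char *src, size_t copy_len) {
    if (copy_len > buff_size) {      /* constraint violated: attack likely */
        copy_len = buff_size;        /* Patch 1: truncate the copy */
        /* copy_len = 0; */          /* Patch 2: suppress the copy */
        /* buff_size = copy_len; */  /* Patch 3: trust the length (unsafe) */
        /* return; */                /* Patch 4: return from the procedure */
    }
    memcpy(buff, src, copy_len);
}

int main(void) {
    char buff[8];
    /* Oversized copy is truncated instead of smashing the buffer. */
    guarded_copy(buff, sizeof buff, "AAAAAAAAAAAAAAAA", 16);
    return 0;
}
```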
Repair: distribute patches to the community
• Server sends different candidate patches (Patch 1, Patch 2, Patch 3, …) to different community members
• Initial ranking: Patch 1: 0, Patch 2: 0, Patch 3: 0, …
Repair: evaluate patches
• Success = no detector is triggered (the detector is still running on clients)
• When attacked, clients send the outcome to the server
• Server ranks the patches; e.g., Patch 1 failed and Patch 3 succeeded, yielding the ranking Patch 3: +5, Patch 2: 0, Patch 1: -5, … (see the sketch below)
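A minimal sketch of the server-side ranking step, assuming a simple additive score with the +5/-5 increments from the example above; the actual scoring policy is not detailed here.

```c
#include <stdio.h>

#define NPATCHES 3

/* Hypothetical additive scores, one per candidate patch. */
static int score[NPATCHES];

/* Client report: which patch it ran and whether a detector fired. */
static void report(int patch, int detector_triggered) {
    score[patch] += detector_triggered ? -5 : +5;
}

int main(void) {
    report(0, 1);  /* Patch 1 failed: a detector fired under attack */
    report(2, 0);  /* Patch 3 survived an attack, no detector fired */
    for (int i = 0; i < NPATCHES; i++)
        printf("Patch %d: %+d\n", i + 1, score[i]);
    return 0;
}
```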
Repair: redistribute the best patches
• Server redistributes the most effective patches (here Patch 3, ranked +5) to the whole community
Outline • Overview • Learning: create model of normal behavior • Monitoring: determine properties of attacks • Repair: propose and evaluate patches • Evaluation: adversarial Red Team exercise • Conclusion
Dynamic invariant detection
• Idea: generalize observed program executions
• Initial candidates = all possible instantiations of constraint templates over program variables
• Observed data quickly falsifies most candidates
• Many optimizations for accuracy and speed: data structures, static analysis, statistical tests, …
• Example: the candidates copy_len < buff_size, copy_len ≤ buff_size, copy_len = buff_size, copy_len ≥ buff_size, copy_len > buff_size, copy_len ≠ buff_size are checked against the observation copy_len = 22, buff_size = 42; only copy_len < buff_size, copy_len ≤ buff_size, and copy_len ≠ buff_size survive it (a sketch of this loop follows)
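The core falsification loop is simple. Below is a minimal C sketch restricted to binary comparison templates over a single variable pair, with made-up observations; the real system instantiates many more templates over many variables and adds the optimizations listed above.

```c
#include <stdio.h>

/* Comparison templates over a pair of variables (x, y). */
typedef int (*template_fn)(long x, long y);
static int lt(long x, long y) { return x < y; }
static int le(long x, long y) { return x <= y; }
static int eq(long x, long y) { return x == y; }
static int ge(long x, long y) { return x >= y; }
static int gt(long x, long y) { return x > y; }
static int ne(long x, long y) { return x != y; }

static template_fn templates[] = { lt, le, eq, ge, gt, ne };
static const char *names[] = { "<", "<=", "=", ">=", ">", "!=" };
#define NT (sizeof templates / sizeof templates[0])

int main(void) {
    int alive[NT];
    for (size_t i = 0; i < NT; i++) alive[i] = 1;

    /* Observed (copy_len, buff_size) pairs; each falsifies candidates. */
    long obs[][2] = { {22, 42}, {7, 7} };
    for (size_t o = 0; o < 2; o++)
        for (size_t i = 0; i < NT; i++)
            if (alive[i] && !templates[i](obs[o][0], obs[o][1]))
                alive[i] = 0;

    /* After both observations, only <= survives. */
    for (size_t i = 0; i < NT; i++)
        if (alive[i])
            printf("copy_len %s buff_size\n", names[i]);
    return 0;
}
```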
Quality of inference results • Not sound • Overfitting if observed executions are not representative • Does not affect attack detection • For repair, mitigated by correlation step • Continued learning improves results • Not complete • Templates are not exhaustive • Useful! • Sound in practice • Complete enough
Enhancements to learning • Relations over variables in different parts of the program • Flow-dependent constraints • Distributed learning across the community • Filtering of constraint templates, variables
Learning for Windows x86 binaries • No static or dynamic analysis to determine validity of values and pointers • Use expressions computed by the application • Can be more or fewer variables (e.g., loop invariants) • Static analysis to eliminate duplicated variables • New constraint templates (e.g., stack pointer) • Determine code of a function from its entry point
Outline • Overview • Learning: create model of normal behavior • Monitoring: determine properties of attacks • Detector details • Learning from failure • Repair: propose and evaluate patches • Evaluation: adversarial Red Team exercise • Conclusion
Detecting attacks (or bugs)
• Code injection (Determina Memory Firewall): triggers if control jumps to code that was not in the original executable
• Memory corruption (Heap Guard): triggers if sentinel values are overwritten
• Both have low overhead and no false positives
• Goal: detect problems close to their source
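To make the sentinel idea concrete, here is a small C sketch of a Heap Guard-style check, assuming a wrapper allocator that plants a canary word after each block and verifies it on free. The names and layout are hypothetical; the deployed detector works on unmodified binaries rather than through a source-level wrapper.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define CANARY 0xDEADBEEFCAFEF00DULL

/* Allocate n bytes plus a trailing sentinel word. */
void *guarded_malloc(size_t n) {
    unsigned char *p = malloc(n + sizeof(unsigned long long));
    if (!p) return NULL;
    unsigned long long canary = CANARY;
    memcpy(p + n, &canary, sizeof canary);  /* plant sentinel after block */
    return p;
}

/* Check the sentinel; a corrupted canary means a heap overflow occurred. */
void guarded_free(void *p, size_t n) {
    unsigned long long canary;
    memcpy(&canary, (unsigned char *)p + n, sizeof canary);
    if (canary != CANARY) {
        fprintf(stderr, "Heap Guard: sentinel overwritten, terminating\n");
        abort();  /* detector fires: collect info, terminate application */
    }
    free(p);
}

int main(void) {
    char *buf = guarded_malloc(8);
    strcpy(buf, "overflow!");  /* writes 10 bytes into an 8-byte block */
    guarded_free(buf, 8);      /* detector triggers here */
    return 0;
}
```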
Other possible attack detectors
• Non-application-specific: crash (page fault, assertions, etc.), infinite loop
• Application-specific checks: tainting, data injection, heap consistency, user complaints
• Application behavior: different failures/attacks, unusual system calls or code execution, violation of learned behaviors
• Together these form a collection of detectors with high false positives
• Non-example: social engineering
Learning from failures Each attack provides information about the underlying vulnerability • That it exists • Where it can be exploited • How the exploit operates • What repairs are successful
Monitoring: where did the attack happen?
• Detector collects information and terminates the application
• Detector maintains a shadow call stack (e.g., main → process_record → read_input → scanf)
• Client sends the attack information to the server (see the sketch below)
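A minimal C sketch of a shadow call stack, using the call chain from the example above. Instrumentation pushes a frame on function entry and pops it on exit, so the detector can report where the attack happened; the function names and reporting format are illustrative only.

```c
#include <stdio.h>

/* A tiny shadow call stack: instrumentation pushes on entry, pops on
 * exit, so the detector can report where an attack happened. */
#define MAXDEPTH 64
static const char *shadow[MAXDEPTH];
static int depth;

static void enter(const char *fn) { if (depth < MAXDEPTH) shadow[depth++] = fn; }
static void leave(void)           { if (depth > 0) depth--; }

/* Called by the detector when it fires. */
static void report_attack(void) {
    fprintf(stderr, "attack detected; shadow call stack:\n");
    for (int i = depth - 1; i >= 0; i--)
        fprintf(stderr, "  %s\n", shadow[i]);
}

static void scanf_like(void)     { enter("scanf"); report_attack(); leave(); }
static void read_input(void)     { enter("read_input"); scanf_like(); leave(); }
static void process_record(void) { enter("process_record"); read_input(); leave(); }

int main(void) {
    enter("main");
    process_record();
    leave();
    return 0;
}
```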
Monitoring: extra checking in attacked code
• Server generates instrumentation that evaluates the learned constraints at the targeted code locations (e.g., main, process_record, …)
• Server sends the instrumentation to all clients
• Clients install the instrumentation
Monitoring: what was the effect of the attack?
• Instrumentation continuously evaluates the inferred behavior
• Clients send the difference in behavior (violated constraints, e.g., copy_len ≤ buff_size) to the server
• Server correlates constraints to the attack; here copy_len ≤ buff_size is correlated
Correlating attacks and constraints Evaluate only constraints at the attack sites • Low overhead A constraint is correlated to an attack if: • The constraint is violated iff the attack occurs Create repairs for each correlated constraint • There may be multiple correlated constraints • Multiple candidate repairs per constraint
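A sketch of the correlation test in C, assuming each monitored run is summarized as a pair (attack detected?, constraint violated?); the iff criterion then says the two flags must agree on every observed run. The data layout is hypothetical.

```c
#include <stdio.h>

/* One observation: did the detector fire, and was the constraint violated? */
typedef struct { int attacked; int violated; } run_t;

/* A constraint is correlated to the attack if, over all observed runs,
 * it was violated exactly when the attack occurred (violated iff attacked). */
static int correlated(const run_t *runs, int n) {
    for (int i = 0; i < n; i++)
        if (runs[i].attacked != runs[i].violated)
            return 0;
    return 1;
}

int main(void) {
    run_t history[] = {
        {0, 0},  /* normal run, constraint held   */
        {1, 1},  /* attack run, constraint broken */
        {0, 0},
    };
    printf("correlated: %d\n", correlated(history, 3));  /* prints 1 */
    return 0;
}
```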
Outline • Overview • Learning: create model of normal behavior • Monitoring: determine properties of attacks • Repair: propose and evaluate patches • Evaluation: adversarial Red Team exercise • Conclusion
Repair: distribute patches to the community
• Success = no detector is triggered
• Patch evaluation uses additional detectors (e.g., crash, difference in attack)
• Server sends different patches (Patch 1, Patch 2, Patch 3, …) to different members; initial ranking: Patch 1: 0, Patch 2: 0, Patch 3: 0, …
Attack example • Target: JavaScript system routine (written in C++) • Casts its argument to a C++ object, calls a virtual method • Does not check type of the argument • Attack supplies an “object” whose virtual table points to attacker-supplied code • A constraint at the method call is correlated • Constraint: JSRI address target is one of a known set • Possible repairs: • Skip over the call • Return early • Call one of the known valid methods • No repair
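The repair options above correspond to guarding the indirect (JSRI) call site with a whitelist of call targets observed during learning. Here is a C sketch with hypothetical method names and the skip-the-call repair; in the real system the equivalent check is injected at the x86 call instruction.

```c
#include <stddef.h>
#include <stdio.h>

typedef void (*method_fn)(void);

static void known_method_a(void) { puts("method a"); }
static void known_method_b(void) { puts("method b"); }
static void attacker_code(void)  { puts("attacker-supplied code"); }

/* Learned constraint: the JSRI target is one of a known set. */
static method_fn known_targets[] = { known_method_a, known_method_b };
#define NKNOWN (sizeof known_targets / sizeof known_targets[0])

/* Guarded indirect call: only dispatch to targets seen during learning.
 * Repair options when the check fails: skip the call (shown), return
 * early, or redirect to one of the known valid methods. */
static void guarded_jsri(method_fn target) {
    for (size_t i = 0; i < NKNOWN; i++)
        if (target == known_targets[i]) {
            target();
            return;
        }
    /* Unknown target: likely an attacker-supplied vtable; skip the call. */
    fprintf(stderr, "blocked indirect call to unknown target\n");
}

int main(void) {
    guarded_jsri(known_method_a);  /* in the learned set: allowed */
    guarded_jsri(attacker_code);   /* not in the learned set: blocked */
    return 0;
}
```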
Enabling a repair • Repair may depend on constraint checking • if (!(copy_len ≤ buff_size)) copy_len = buff_size; • If constraint is not violated, no need to repair • If constraint is violated, an attack is (probably) underway • Repair does not depend on the detector • Should fix the problem before the detector is triggered • Repair is not identical to what a human would write • A stopgap until an official patch is released • Unacceptable to wait for human response
Evaluating a patch
• In-field evaluation: no attack detector is triggered; no crashes or other behavior deviations
Pre-validation, before distributing the patch:
• Replay the attack: no need to wait for a second attack, and it exactly reproduces the problem; but recording the log is expensive, the log terminates abruptly, irrevocable effects must be prevented, and it delays distribution of good patches
• Run the program's test suite: may be too sensitive; not available for COTS software
Outline • Overview • Learning: create model of normal behavior • Monitoring: determine properties of attacks • Repair: propose and evaluate patches • Evaluation: adversarial Red Team exercise • Conclusion
Red Team • Red Team attempts to break our system • Hired by DARPA; 10 engineers • (Researchers = “Blue Team”) • Red Team created 10 Firefox exploits • Each exploit is a webpage • Firefox executes arbitrary code • Malicious JavaScript, GC errors, stack smashing, heap buffer overflow, uninitialized memory
Rules of engagement
• Firefox 1.0, focusing on its most security-critical components
• Blue Team may not tune the system to known vulnerabilities
• No access to a community for learning (to focus on the research aspects)
• Red Team may only attack Firefox and the Blue Team infrastructure
• No pre-infected community members (future work)
• Red Team has access to the implementation and all Blue Team documents & materials
• Red Team may not discard attacks thwarted by the system
Results • Blue Team system: • Detected all attacks, prevented all exploits • Repaired 7/10 attacks • After an average of 5.5 minutes and 8 attacks • Handled attack variants • Handled simultaneous & intermixed attacks • Suffered no false positives • Monitoring/repair overhead was not noticeable
Results: Caveats
• Blue Team fixed a bug with reusing filenames
• Communications infrastructure (from a subcontractor) failed during the last day
• Learning overhead is high; optimizations are possible, and the work can be distributed over the community
Outline • Overview • Learning: create model of normal behavior • Monitoring: determine properties of attacks • Repair: propose and evaluate patches • Evaluation: adversarial Red Team exercise • Conclusion
Related work
• Distributed detection: Vigilante [Costa] (for worms; proof of exploit), live monitoring [Kıcıman], statistical bug isolation [Liblit]
• Learning (lots of prior work): typically for anomaly detection, e.g., FSMs for system calls
• Repair [Demsky]: requires a specification; not scalable