RABA’s Red Team Assessments

RABA’s Red Team Assessments QuickSilver 14 December 2005

Agenda • Tasking for this talk… • Projects Evaluated • Approach / Methodology • Lessons Learned and Validations Achieved • The Assessments • General Strengths / Weaknesses • AWDRAT (MIT) • Success Criteria • Assessment Strategy • Strengths / Weaknesses • LRTSS (MIT) • QuickSilver / Ricochet (Cornell) • Steward (JHU)

The Tasking “Lee would like a presentation from the Red Team perspective on the experiments you've been involved with. He's interested in a talk that's heavy on lessons learned and benefits gained. Also of interest would be red team thoughts on strengths and weaknesses of the technologies involved. Keeping in mind that no rebuttal would be able to take place beforehand, controversial observations should be either generalized (i.e., false positives as a problem across several projects) or left to the final report.” -- John Frank e-mail (November 28, 2005)

Specific Teams We Evaluated • Architectural-Differencing, Wrappers, Diagnosis, Recovery, Adaptive Software and Trust Management (AWDRAT) • October 18-19, 2005 • MIT • Learning and Repair Techniques for Self-Healing Systems (LRTSS) • October 25, 2005 • MIT • QuickSilver / Ricochet • November 8, 2005 • Cornell University • Steward • December 9, 2005 • JHU

Basic Methodology • Planning • Present High Level Plan at July PI Meeting • Interact with White Team to schedule • Prepare Project Overview • Prepare Assessment Plan • Coordinate with Blue Team and White Team • Learning • Study documentation provided by team • Conference Calls • Visit with Blue Team day prior to assessment • Use system, examine output, gather data • Test • Formal De-Brief at end of Test Day

Lessons Learned (and VALIDATIONS achieved)

Validation / Lessons Learned • Consistent Discontinuity of Expectations • Scope of the Assessment + Success Criteria • Boiling it down to “Red Team Wins” or “Blue Team Wins” on each test required significant clarity • Unique to these assessments because the metrics were unique • Lee/John instituted an assessment scope conference call halfway through • we think that helped a lot • Scope of Protection for the systems • Performers’ Assumptions vs. Red Team’s Expectations • In all cases, we wanted to see a more holistic approach to the security model • We assert each program needs to define its security policy • And especially document what it assumes will be protected / provided by other components or systems

LL: Scope of Protection

Validation / Lessons Learned • More time would have helped A LOT • Longer Test Period (2-3 day test vice 1 day test) • Having an evening to digest then return to test would have allowed more effective additional testing and insight • We planned an extra 1.5 days for most, and that was very helpful • We weren’t rushing to get on an airplane • We could reduce the data and come back for clarifications if needed • We could defer non-controversial tests to the next day to allow focus with Government present • More Communication with Performers • Pre-Test Site/Team Visit (~2-3 weeks prior to test) • Significant help in preparing testing approach • The half-day that we implemented before the test was crucial for us • More conference calls would have helped, too • Hard to balance against performers’ main focus, though

Validation / Lessons Learned • A Series of Tests Might Be Better • Perhaps one day of tests similar to what we did • Then a follow-up test a month or two later as prototypes matured • With the same test team to leverage understanding of system gained • We Underestimated the Effort in Our Bid • Systems were more unique and complex than we anticipated • 20-25% more hours would have helped us a lot in data reduction • Multi-talented team proved vital to success • We had programming (multi-lingual), traditional red team, computer security, systems engineering, OS, system admin, network engineering, etc. talent present for each test • Highly tailored approach proved appropriate and necessary • Using more traditional network-oriented Red Team Assessment approach would have failed

The Assessments

Overall Strengths / Weaknesses of Projects • Strengths • Teams worked hard to support our assessments • The technologies are exciting and powerful • Weaknesses • Most Suffered a Lack of System Documentation • We understand there is a balance to strike – these are, after all, essentially research prototypes • Really limited ability to prepare for assessment • All are Prototypes -- stability needed for deterministic test results • All provide incomplete security / protection almost by definition • Most Suffered a Lack of Configuration Management / Control • Test “Harnesses” far from optimal for Red Team use • Of course, they are oriented around supporting the development • But, we’re fairly limited in using other tools due to the uniqueness of the technologies

AWDRAT Assessment October 18-19, 2005

Success Criteria AWDRAT (MIT) • The target application can successfully and/or correctly perform its mission • The AWDRAT system can • detect an attacked client’s misbehavior • interrupt a misbehaving client • reconstitute a misbehaving client in such a way that the reconstituted client is not vulnerable to the attack in question • The AWDRAT system must • Detect / Diagnose at least 10% of attacks/root causes • Take effective corrective action on at least 5% of the successfully identified compromises/attacks
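The two numeric thresholds above can be read as a simple scoring rule. The following is a minimal sketch of that reading, not the White Team's official scoring: the `AttackResult` record and the `score` function are hypothetical names, and it assumes the 5% corrective-action rate is measured against detected attacks only, as the slide wording suggests.

```python
from dataclasses import dataclass


@dataclass
class AttackResult:
    """One Red Team attack and how the defended system responded (hypothetical record)."""
    name: str
    detected: bool   # attack or root cause was detected / diagnosed
    corrected: bool  # effective corrective action was taken


def score(results, detect_threshold=0.10, correct_threshold=0.05):
    """Compute detection and correction rates and compare them to the thresholds.

    Assumption: the correction rate uses detected attacks as its denominator.
    """
    total = len(results)
    detected = [r for r in results if r.detected]
    corrected = [r for r in detected if r.corrected]
    detect_rate = len(detected) / total if total else 0.0
    correct_rate = len(corrected) / len(detected) if detected else 0.0
    return {
        "detect_rate": detect_rate,
        "correct_rate": correct_rate,
        "meets_criteria": (detect_rate >= detect_threshold
                           and correct_rate >= correct_threshold),
    }
```

Under this reading, a system that detects nothing fails both thresholds, because the correction rate is defined as zero when there are no detections.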

Assessment Strategy AWDRAT (MIT) • Denial of Service • aimed at disabling or significantly modifying the operation of the application to an extent that mission objectives cannot be accomplished • attacks using buffer-overflow and corrupted data injection to gain system access • False Negative Attacks • a situation in which a system fails to report an occurrence of anomalous or malicious behavior • Red Team hoped to perform actions that would fall "under the radar". We targeted the modules of AWDRAT that support diagnosis and detection. • False Positive Attacks • system reports an occurrence of malicious behavior when the activity detected was non-malicious • Red Team sought to perform actions that would excite AWDRAT's monitors. Specifically, we targeted the modules supporting diagnosis and detection. • State Disruption Attacks • interrupt or disrupt AWDRAT's ability to maintain its internal state machines • Recovery Attacks • disrupt attempts to recover or regenerate a misbehaving client • target the Adaptive Software and Recovery and Regeneration modules in an attempt to allow a misbehaving client to continue operating

Strengths / Weaknesses AWDRAT (MIT) • Strengths • With a reconsideration of system’s scope of responsibility, we anticipate the system would have performed far better in the tests • We see great power in the concept of wrapping all the functions • Weaknesses • Scope of Responsibility / Protection far too Limited • Need to Develop Full Security Policy • Single points of failure • Application-Specific Limitations • Application Model Issues • Incomplete – by design? • Manually Created • Limited Scope • Doesn’t really enforce multi-layered defense

LRTSS Assessment October 25, 2005

Success Criteria LRTSS (MIT) • The instrumented Freeciv server does not core dump under a condition in which the uninstrumented Freeciv server does core dump • The LRTSS system can • Detect a corruption in a data structure that causes an uninstrumented Freeciv server to exit • Repair the data corruption in such a way that the instrumented Freeciv server can continue running • The LRTSS system must • Detect / Diagnose at least 10% of attacks/root causes • Take effective corrective action on at least 5% of the successfully identified compromises/attacks

Assessment Strategy LRTSS (MIT) • Denial of Service • Aimed at disabling or significantly modifying the operation of the Freeciv server to an extent that mission objectives cannot be accomplished • In this case, not achieving mission objectives is defined as the Freeciv server exits or dumps core • Attacks using buffer-overflow, corrupted data injection, and resource utilization • Various data corruptions aimed at causing the server to exit • Formulated the attacks by targeting the uninstrumented server first, then running the same attack against the instrumented server • State Disruption Attacks • interrupt or disrupt LRTSS's ability to maintain its internal state machines
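The "uninstrumented first, then instrumented" formulation amounts to a differential test. A minimal sketch of how each attack could be classified follows; the `Outcome` names and the `classify` function are illustrative assumptions, and treating an attack that does not crash the uninstrumented server as inconclusive is our assumption rather than a stated scoring rule.

```python
from enum import Enum


class Outcome(Enum):
    BLUE_TEAM_WINS = "instrumented server survived an attack that crashes the baseline"
    RED_TEAM_WINS = "instrumented server exited or dumped core as well"
    INCONCLUSIVE = "attack did not crash the uninstrumented baseline"


def classify(uninstrumented_crashed: bool, instrumented_crashed: bool) -> Outcome:
    """Classify one attack run against both server builds.

    Assumption: an attack only counts as a valid test case if it makes the
    uninstrumented server exit or dump core.
    """
    if not uninstrumented_crashed:
        return Outcome.INCONCLUSIVE
    if instrumented_crashed:
        return Outcome.RED_TEAM_WINS
    return Outcome.BLUE_TEAM_WINS
```

For example, `classify(True, False)` describes the case the success criteria ask for: the same corruption crashes the uninstrumented Freeciv server but not the instrumented one.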

Strengths / Weaknesses LRTSS (MIT) • Strengths • Performs very well under simple data corruptions • (that would cause the system to crash under normal operation) • Performs well under a large number of these simple data corruptions • (200 to 500 corruptions are repaired successfully) • Learning and Repair algorithms well thought out • Weaknesses • Scope of Responsibility / protection too limited • Complex Data Structure Corruptions not handled well • Secondary Relationships are not protected against • Pointer Data Corruptions not entirely tested • Timing of Check and Repair Cycles not optimal • Description of “Mission Failure” as core dump may be excessive

QuickSilver Assessment November 8, 2005

Success Criteria QuickSilver (Cornell) • Ricochet can successfully and/or correctly perform its mission • “Ricochet must consistently achieve a fifteen-fold reduction in latency (with benign failures) for achieving consistent values of data shared among one hundred to ten thousand participants, where all participants can send and receive events." • Per client direction, elected to use average latency time as the comparative metric • Ricochet’s Average Recovery demonstrates 15-fold improvement over SRM • Additional constraint levied requiring 98% update saturation (imposing the use of the NACK failover for Ricochet)
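The comparative metric above reduces to a ratio of average latencies plus a delivery floor. The sketch below shows that arithmetic only; the function names are hypothetical, and interpreting "98% update saturation" as the fraction of sent updates that were eventually delivered is an assumption on our part.

```python
from statistics import mean


def fold_reduction(baseline_latencies_ms, candidate_latencies_ms):
    """Ratio of average baseline latency (e.g. SRM) to average candidate latency (e.g. Ricochet)."""
    return mean(baseline_latencies_ms) / mean(candidate_latencies_ms)


def meets_latency_criterion(baseline_latencies_ms, candidate_latencies_ms,
                            delivered, sent,
                            required_fold=15.0, required_saturation=0.98):
    """Check the fold-reduction target and the saturation floor together.

    Assumption: saturation = delivered updates / sent updates.
    """
    saturation = delivered / sent
    fold = fold_reduction(baseline_latencies_ms, candidate_latencies_ms)
    return fold >= required_fold and saturation >= required_saturation
```

Because the metric uses averages, a few very slow recoveries can be masked by many fast ones; that is a property of the chosen metric, not of this sketch.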

Assessment Strategy QuickSilver (Cornell) • Scalability Experiments – test scalability in terms of number of groups per node and number of nodes per group. Here, no node failures will be simulated, and no packet losses will be induced (aside from those that occur as a by-product of normal network traffic). • Baseline Latency • Group Scalability • Large Repair Packet Configuration • Large Data Packet Storage Configuration • Simulated Node Failures – simulate benign node failures. • Group Membership Overhead / Intermittent Network Failure • Simulated Packet Losses – introduce packet loss into the network. • High Packet Loss Rates • Node-driven Packet Loss • Network-driven Packet Loss • Ricochet-driven Packet Loss • High Ricochet Traffic Volume • Low Bandwidth Network • Simulated Network Anomalies – simulate benign routing and network errors that might exist on a deployed network. The tests will establish whether or not the protocol is robust in its handling of typical network anomalies, as well as those atypical network anomalies that may be induced by an attacker. • Out of Order Packet Delivery • Packet Fragmentation • Duplicate Packets • Variable Packet Sizes
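Several of the anomalies listed above (loss, duplication, out-of-order delivery) can be illustrated with a small in-memory model. This is a toy sketch for intuition only, not the tooling used in the assessment; `perturb` and its parameters are hypothetical, and real tests would act on live network traffic rather than a Python list.

```python
import random


def perturb(packets, loss=0.0, duplicate=0.0, reorder_window=0, seed=0):
    """Return a perturbed copy of a packet sequence.

    loss           -- probability that each packet is dropped
    duplicate      -- probability that each surviving packet is delivered twice
    reorder_window -- if > 1, packets are shuffled within consecutive windows
                      of this size to model out-of-order delivery
    seed           -- fixed seed so a test run can be repeated exactly
    """
    rng = random.Random(seed)
    delivered = []
    for packet in packets:
        if rng.random() < loss:
            continue  # packet lost
        delivered.append(packet)
        if rng.random() < duplicate:
            delivered.append(packet)  # duplicate delivery
    if reorder_window > 1:
        for start in range(0, len(delivered), reorder_window):
            window = delivered[start:start + reorder_window]
            rng.shuffle(window)
            delivered[start:start + reorder_window] = window
    return delivered
```

Seeding the generator matters for the same reason the slides ask for stable prototypes: a perturbation that cannot be replayed makes it hard to attribute a failure to a specific anomaly.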

Strengths / Weaknesses QuickSilver (Cornell) • Strengths • Appears to be very resilient when operating within its assumptions • Very stable software • Significant performance gains over SRM • Weaknesses • FEC orientation: the statistical focus obscures valuable data regarding complete packet delivery • Batch-oriented Test Harness • Impossible to perform interactive attacks • Very limited insight into blow-by-blow performance • Metrics collected were very difficult to understand fully

STEWARD Assessment December 9, 2005

Success Criteria Steward (JHU) • The STEWARD system must: • Make progress in the system when under attack. • Progress is defined as the eventual global ordering, execution, and reply to any request which is assigned a sequence number within the system • Maintain a consistency of data replicated on each of the servers in the system

Assessment Strategy Steward (JHU) • Data Integrity Attacks – attempts to create an inconsistency in the data replicated on the various servers in the network • Arbitrarily Execute Updates • Multiple Pre-Prepare Messages using Same Sequence Numbers and Different Request Data • Spurious Prepare, Null Messages • Suppressed Checkpoint Messages • Prematurely Perform Garbage Collection • Invalid Threshold Signature • Protocol State Attacks – attacks focused on interrupting or disrupting STEWARD's ability to maintain its internal state machines • Certificate Threshold Validation Attack • Replay Attack • Manual Exploit of Client or Server • Validation Activities – tests we will perform to verify that STEWARD can endure up to five Byzantine faults while maintaining a three-fold reduction in latency with respect to BFT • Byzantine Node Threshold • Benchmark Latency • Progress Attacks – attacks we will launch to prevent STEWARD from progressing to a successful resolution of an ordered client request • Packet Loss • Packet Delay • Packet Duplication • Packet Re-ordering • Packet Fragmentation • View Change Message Flood • Site Leader Stops Assigning Sequence Numbers • Site Leader Assigns Non-Contiguous Sequence Numbers • Suppressed New-View Messages • Consecutive Pre-Prepare Messages in Different Views • Out of Order Messages • Byzantine Induced Failover • Note: We did not try to validate or break the encryption algorithms.
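One of the data integrity attacks above sends multiple pre-prepare messages that reuse a sequence number with different request data. The sketch below shows the general kind of check a replica in a Byzantine fault-tolerant protocol can apply against that attack. It is an illustration of the idea only and does not describe STEWARD's actual implementation; `PrePrepareLog` and its method are hypothetical.

```python
import hashlib


class PrePrepareLog:
    """Remember the first request digest seen for each (view, sequence number)."""

    def __init__(self):
        self._seen = {}  # (view, seq) -> SHA-256 hex digest of the request

    def accept(self, view: int, seq: int, request: bytes) -> bool:
        """Return True if the message is consistent with what was seen before.

        A repeat of the same request for the same (view, seq) is accepted;
        a different request for an already-used (view, seq) is rejected.
        """
        digest = hashlib.sha256(request).hexdigest()
        first_seen = self._seen.setdefault((view, seq), digest)
        return first_seen == digest
```

A short usage example: after `log = PrePrepareLog()`, the call `log.accept(0, 1, b"update A")` succeeds, while a later `log.accept(0, 1, b"update B")` is rejected because the sequence number is already bound to a different request in that view.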

Strengths / Weaknesses Steward (JHU) • Strengths • First system that assumes and actually tolerates corrupted components (Byzantine attack) • Blue Team spent extensive time up front in analysis, design and proof of the protocol – it was clear in the performance • System was incredibly stable and resilient • We did not compromise the system • Weaknesses • Limited Scope of Protection • Relies on external entity to secure and manage keys which are fundamental to the integrity of the system • STEWARD implicitly and completely trusts the client • Client-side attacks were out of scope of the assessment

Going Forward • White Team will generate definitive report on this Red Team Test activity • It will have the official scoring and results • RABA (Red Team) will generate a test report from our perspective • We will publish to: • PI for the Project • White Team (Mr. Do) • DARPA (Mr. Badger)

Questions or Comments Any Questions, Comments, or Concerns?
