SRS Architecture Study
Partha Pal, Franklin Webber
Outline
• Study goals
• SRS Technologies
• Top down
• Bottom up
• Strawman
• Issues, challenges
Self-Regenerative Survivable System:
• Self: organic decision making
• Regenerative: better than graceful degradation or simple recovery; reversing the trend
[Figure: level of service over time from the start of a focused attack, with curves for level of service without attack, regenerative, survivable (OASIS Dem/Val) and undefended systems]
Study Plan
Understand how to incorporate the new technologies in a distributed information system that not only tolerates the effects of cyber-attacks, but also attempts to stop and reverse the loss of resources and capabilities.
3rd generation assumptions are still valid: absolute prevention, and accurate and on-time detection, are impossible to achieve.
Bottom up:
• Start with the new (SRS) capabilities, build a partial architectural framework, and then see what other capabilities, mechanisms and services are needed to complete an architecture that
  • offers a high level of resistance to attacks (protection),
  • improves visibility of attacker activity and attack effects (detection), and
  • is able to adapt to changes caused by the attacker (reaction)
• This is a study in the abstract, leading to an abstract architecture that will need a concrete context to realize
Top down:
• Start with a high watermark survivability architecture, identify where SRS capabilities could benefit it, and re-organize the architecture to integrate the selected capabilities, mechanisms and services
• If the high watermark is implemented then it provides a concrete context, but “grandfathering” may impact the choice and integration of new capabilities
Combine and contrast the abstract architecture with the more concrete case to create a Strawman Self-Regenerative Survivable System Architecture that balances the pros and cons of both approaches.
Summary of SRS Technology Study
Process
• Sent a questionnaire to each original SRS project (i.e., all except Asbestos)
• General outline:
  • Claims
  • Key Capabilities
  • Benefits and Other Distinguishing Factors
  • Assumptions
  • Use Cases and Interface Issues
• Customized for issues we thought especially important or were confused about
• All responded; some very quickly, some needed gentle prodding. Thank you!
General Observations
• Varying degrees of maturity
• Some projects started with existing technology
• At least half of the projects offer multiple technologies that could be used independently
• Less overlap than we expected: many technologies seem complementary
• Unsurprisingly, not a lot of support for integration
Biologically-Inspired Diversity Projects
• Genesis
  • A toolkit offering a variety of transformations
  • Based on Strata and is portable
• DAWSON
  • A toolkit offering a variety of transformations
  • Based on Windows DLLs
• Comparisons
  • Some overlap in randomization techniques
  • Genesis also offers highly-attack-resistant runtime transformations that incur Strata’s overhead
  • DAWSON also offers Windows-specific transformations
  • May be combined but value and difficulty are unclear
Cognitive Immunity and Self-Healing Projects
• Learning and Repair
  • Daikon: learns program constraints from a set of traces
  • Kvasir: monitors program to create traces for Daikon
  • Archie: checks program constraints at runtime
  • Repair Tool: repairs damage to conform to constraints
  • Tools existed before SRS but are being improved
• RMPL (Concurrent Model-Based Execution)
  • A language expressing temporal properties without fully specifying an order of execution, and probabilistic assumptions and choices
  • An executive that plans, dispatches methods and replans when necessary
Cognitive Immunity and Self-Healing, cont’d
• AWDRAT
  • Language to specify behavior (Architectural Model)
  • Language to describe Method Selection Metadata
  • Tools to instrument Java to monitor and control behavior
  • An executive that
    • Detects anomalies by Architectural Differencing
    • Combines other observations to update a Trust Model
    • Selects methods to maximize utility and/or minimize costs
• Cortex
  • A “taste-tester” framework for redundant components
  • Scyllarus: situation assessment
  • CIRCA: generates controllers from models
Cognitive Immunity and Self-Healing, cont’d
• Comparisons
  • Learning and Repair tools are complementary to others
  • Cortex learning by taste testing is also complementary
  • AWDRAT and RMPL address some of the same issues but:
    • AWDRAT is middleware to defend existing applications; RMPL is a language and environment for building new applications
    • Geared to different application domains:
      • RMPL: embedded/autonomous vehicle systems
      • AWDRAT: information processing systems
  • AWDRAT’s Trust Modeling is complementary to others
Granular Scalable Redundancy Projects
• Steward
  • Scalable support for Byzantine fault-tolerant state-machine replication
  • BFT-like protocol for LANs
  • Paxos-like protocol for WANs
  • Library for threshold crypto
• CMU
  • Byzantine fault-tolerant data storage using scalable asynchronous protocols
    • read/write (R/W)
    • query/update (Q/U)
• QuickSilver
  • Tempest (time-critical; probabilistic; SlingShot protocol)
  • QuickSilver (scale to many groups; virtual synchronous protocol)
  • Cayuga (efficient automata for searching publication histories)
  • ChunkySpread (dynamic IP multicast)
Granular Scalable Redundancy, cont’d
• Comparisons among protocols
  • Significantly different attack (fault) models
  • Significantly different assumptions about applications
  • CMU’s Q/U protocol makes the weakest assumptions about the attacker but places more restrictions on the application than Steward, SlingShot or QuickSilver
Reasoning about Insider Threat Projects
• PMOP
  • Framework for monitoring operator behavior, recognizing and blocking bad actions
• HDSM (High-Dimensional Search and Modeling)
  • Insider Modeler and Analyzer, currently used offline
  • Search engine for high-dimensional space of sensor data
  • Response Engine
• Asbestos
  • New x86 OS with efficient support for trustworthy isolation in hosts and processes running untrusted code
Reasoning about Insider Threat, cont’d
• Comparisons
  • All are complementary to each other
  • PMOP seems to be AWDRAT’s Architectural Differencing applied to operators rather than components
  • HDSM’s search engine is complementary to other SRS technologies but the Response Engine overlaps in scope with AWDRAT executives
Top Down Approach
What can we learn about the architecture of SRS systems by trying to transform a high watermark survivable system into an SRS system?
• SRS Technologies: understanding of their
  • Capabilities
  • Assumptions
  • Limitations
  • Maturity
• DPASA Architecture applied to the JBI Exemplar used in OASIS Dem/Val, and its limitations and shortcomings, as identified by:
  • Developers’ experiences
  • Testing and validation
  • Out of lab deployment
  • Multiple red team exercises
• Our study found that there is a sizable intersection that pushes the high watermark more towards an SRS system!
• Much better than finding that the technologies do not address the identified problems, or that even if they do, the “self” and “regenerative” aspects bring no gain
These changes are incremental improvements over the current DPASA architecture. Changing the architecture substantially (e.g., implementing JBI CAPI using QuickSilver) without appropriate forethought is not likely to lead to a more survivable system, because the system will lose the well-tested interaction of existing protection, detection and adaptive response mechanisms.
Limitations and Shortcomings of the DPASA Architecture
• Recovery supported only for some key components
• Availability seems to be the most attractive target for the adversary
• Interpretation of observation, deduction and decision making require expertise
• More options for adaptive response
• Lack of support for improving the system on the fly
The last three are more tightly inter-related among themselves and more SRS oriented, but SRS technologies may help in all but the last one.
Improving Recovery
SM and PSQ are redundant and maintain some replicated state
[Diagram labels: full recovery; restart with state loss]
• State:
  • Partially implemented: some clients and some PSQ (those committed to MySQL)
• Connection:
  • Reasonably handled
• Group view:
  • PSQ:
    • View among servers: handled well
    • View of servers from clients: takes a long time
  • SM:
    • Dependent on Spread: could be broken in a bad way
• Improvement possibilities
  • Need “safe” state transfer or carry over
• Can SRS technologies help?
  • Replace Spread transmitter?
  • Implement the (in memory) data structures maintained by PSQ servers as Q/U objects using CMU protocol?
  • Clients and DC: use Asbestos for protecting check-pointed state?
SRS technologies provide supporting infrastructure
• Self: who makes the decision to recover (or not to recover), and when?
• Regenerative: recovering to “operational” without any other “changes” is still in the realm of “delaying the eventual degradation”
Some Details
q1sm wants to multicast message M: q1sm signs M and hands it to its XMTR, which returns success only if all XMTRs in the group acknowledge receiving M.
A combination of managed switches and ADF policies defines who can talk to whom, and over which port and protocol.
[Diagram: SMs q1sm to q4sm, each with Voting, Sig Vrfy, SXMTR and SsXMTR blocks, connected over the SPREAD GCS, with Steward or QuickSilver shown as the alternative]
• It is not clear whether the unavailability observed is purely an implementation problem, but switching over to a Steward or QuickSilver transport may still be advantageous:
  • Maintaining the state machine replication abstraction is advantageous for state recovery
  • Simpler XMTR
  • Can handle more quads
The way a client’s PSQ messages are handled by our PSQ servers is similar to using CMU’s Q/U protocols: imagine the subscription info as a Q/U object, replicated at each PSQ server, part of which is maintained in memory. One difference is that instead of the client, one PSQ server acts as its proxy.
[Diagram: PSQ servers q1psq to q4psq and DCs q1dc to q4dc connected by sockets; in the Q/U variant each PSQ server holds Q/U Objects behind Voting and Sig Vrfy blocks, and a Q/U client carries the client’s PSQ request using the Q/U protocol and object synching]
Using the Q/U object abstraction and associated protocol will help state recovery of a restarted PSQ server, because different clients may have interacted with different quads while the recovering quad was down.
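The all-or-nothing acknowledgement rule described for the XMTR can be sketched as follows. This is only an illustration under stated assumptions: the `Xmtr` class, the `multicast` function and the use of an HMAC as the "signature" are hypothetical stand-ins, since the slides do not name the signature scheme or the XMTR interface.

```python
import hashlib
import hmac


class Xmtr:
    """Hypothetical stand-in for one quad's transmitter (XMTR)."""

    def __init__(self, name, reachable=True):
        self.name = name
        self.reachable = reachable
        self.delivered = []

    def receive(self, message, signature):
        # An unreachable XMTR never acknowledges.
        if not self.reachable:
            return False
        self.delivered.append((message, signature))
        return True


def sign(key, message):
    # Assumption: HMAC-SHA256 stands in for the real signature scheme.
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def multicast(key, message, group):
    """Sign the message, hand it to every XMTR, succeed only if all acknowledge."""
    signature = sign(key, message)
    acks = [xmtr.receive(message, signature) for xmtr in group]
    return all(acks)
```

With four reachable XMTRs `multicast` returns `True`; a single silent XMTR is enough to make it return `False`, which is one way the Spread-dependent SM group view "could be broken in a bad way".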
Making Availability Compromises More Difficult
[Diagram: quad 4 (q4sm, q4ps, q4cor, q4psq, q4dc, q4NIDS, q4ap) behind PIX firewalls, with the Wing Ops LAN (CombOPS, WNIDS), the AMC CONUS LAN (MAF, ANIDS), hubs and an attacker network. Annotations: from outside; need privileged access on inside host(s); redundant and fail-over special hardware; monitor all legs; diversity (costly); taste tester (T)]
• Unavailability triggered by corruption:
  • Non-redundant and homogeneous perimeter (PIX FW Routers)
  • Corrupt references
  • Attacks on Java: serialization bombs, garbage collection/lease
  • SQL Injection
• Dynamic diversity using Genesis? May not be a memory exploit? Taste tester?
• Not brute force DOS!
Availability cont’d
Corrupt references:
• Flaw: Q1SM’s unsolicited statement about the other quads’ IP addresses is believed by everybody
• Q1’s IP=127.0.0.1, Q2’s IP=127.0.0.1, Q3’s IP=127.0.0.1, Q4’s IP=127.0.0.1
• A registering client gets 127.0.0.1 for all quads
• Code/implementation problem
Attacks on Java: serialization bombs, garbage collection/lease mechanisms
• Send a serialized packet with a huge number in the size field
• The JVM tries to allocate the memory given in the size field and throws an OOME (OutOfMemoryError)
• Variations:
  • Send a number of such packets to consume available memory
  • Packets may or may not be well formed, but to keep the memory allocated, “serialization” must succeed
  • Tell an RMI server that a client reference has been released
    • Need to guess the client side refs
Possible defenses:
• Create variants of the JVM or other libraries using Genesis or DAWSON tools
• Enforce a size rule?
• Use AOP to implement a check before allocation?
• Use the Daikon toolset to learn the max size of serialized packets, enforce it as an invariant and fix when violated?
[Diagram labels: q2sm, q3sm, q4sm]
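The "enforce a size rule" and "check before allocation" ideas can be illustrated with a length-prefixed frame reader that rejects an oversized declared length before reading or allocating anything. This is a Python analogy, not the DPASA Java code: the 4-byte big-endian header, the `read_frame` name and the 64 KiB limit are all assumptions made for the sketch (the slide suggests the real limit could be learned, e.g. with Daikon).

```python
import io
import struct

# Assumed limit for illustration; the slides do not give a value.
MAX_PAYLOAD_BYTES = 64 * 1024


class FrameTooLarge(ValueError):
    """Raised when the declared size violates the size rule."""


def read_frame(stream, max_bytes=MAX_PAYLOAD_BYTES):
    """Read one length-prefixed frame, checking the size field first."""
    header = stream.read(4)
    if len(header) != 4:
        raise EOFError("truncated length header")
    (declared,) = struct.unpack(">I", header)
    # The check happens before any buffer of the declared size is requested.
    if declared > max_bytes:
        raise FrameTooLarge(f"declared size {declared} exceeds {max_bytes}")
    payload = stream.read(declared)
    if len(payload) != declared:
        raise EOFError("truncated payload")
    return payload
```

A frame whose header claims roughly 2 GB is rejected from the four header bytes alone, so the receiver never attempts the allocation that would exhaust memory.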
Availability cont’d
SQL Injection
• MySQL in all quads, but on different OS
[Diagram labels: Client, DC, PSQ Proxy, PSQ Server, RDB; encrypted message containing data D; session check; roles and other checks; query created from D by PSQ; taste tester (T) at the PSQ or at the RDB]
• Improving Detection and Response
  • Add taste tester
  • Two possibilities: at the PSQ level or at the RDB level
  • DPASA has the notion of escrow
  • Cost…
• Improving Prevention (& detection)
  • Strictly control what is executed on the RDB
  • Vet D
  • Create a white list
  • Applicability, Extendibility …
• Use diverse DBs (hoping some will behave differently)
• Can SRS diversity techniques help?
  • Genesis tainting?
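The prevention ideas on this slide (vet D, create a white list, strictly control what is executed on the RDB) might look like the sketch below. It is an illustration only: the stdlib `sqlite3` module stands in for the MySQL RDB, and the `subscriptions` table, the topic names and the `vet` rules are hypothetical.

```python
import re
import sqlite3

# Hypothetical white list of topics a client may subscribe to.
ALLOWED_TOPICS = {"weather", "targets"}
# Hypothetical rule for what a vetted subscriber name may contain.
SAFE_NAME = re.compile(r"[A-Za-z0-9_-]{1,64}")


def make_db():
    """Create an in-memory database standing in for the RDB."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE subscriptions (topic TEXT, subscriber TEXT)")
    return conn


def vet(topic, subscriber):
    """Reject data D unless it passes the white list and the format rule."""
    if topic not in ALLOWED_TOPICS:
        raise ValueError("topic not on white list")
    if not SAFE_NAME.fullmatch(subscriber):
        raise ValueError("subscriber name failed vetting")
    return topic, subscriber


def subscribe(conn, topic, subscriber):
    """Build the query from vetted data using bound parameters only."""
    topic, subscriber = vet(topic, subscriber)
    conn.execute(
        "INSERT INTO subscriptions (topic, subscriber) VALUES (?, ?)",
        (topic, subscriber),
    )
```

The statement text is fixed and the data only ever travels as bound parameters, so what executes on the database does not depend on the content of D; vetting adds a second, independent check in front of it.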
More Organic Decision Making
[Diagram: three stages. Detection: Arch differencing, deviation from spec. Interpretation: models, JHU A-DAGs, deductive analysis, hypothesis testing; HDSM? Cortex. Response selection: RMPL? Cortex]
Example: GUI up, but cannot subscribe
• Observations:
  • Q1sm invites Combat Ops, but does not see all heartbeats
  • Q2SM sees heartbeats from 4 out of 5 Combat Ops components
  • Q3sm shows some missing heartbeats from Combat Ops
  • Q4sm same as Q3SM
  • No significant alerts in Emerald
• Interpretation (DPASA Operators):
  • Combat Ops got bad references for Q1, Q3 and Q4?
  • Most likely not all at once
• Response (DPASA Operators):
  • Try to push the right references
  • Try refreshing these first
  • If that fails, try refreshing with q3 blocked?
Organic Decision Making: within the system, by the system
Issues to be addressed by the architecture
• At what granularity does the cost outweigh the benefits?
• Most DPASA implemented components have some of these in “code”; should they be made explicit?
• Should we add these as architectural elements at key components: SM, PS, PSQ and LC?
More Maneuvering Room for Defense
• Beyond restart process, reboot, and graceful degradation (block or isolate, reduce quorum size, etc.)
  • More spares, distributed widely
    • (Scalable redundancy)
  • Restart a variant
    • (Genesis, DAWSON)
  • Reboot a new system
    • (Asbestos?)
  • Change transport
    • (from QuickSilver to SlingShot, accept the weaker guarantees)
• Carrying over state and keys?
• SRS technologies provide the infrastructure or mechanisms, but what about the management?
  • Policies, decision making: when to restart a variant, when to reboot with what restrictions, which transport?
• SRS cognitive capabilities (reasoning about the system) will likely fall short in reasoning about SRS technologies
Improving the System on the Fly
• Even if improvement-causing changes are identified along with the right time to apply them, the system must be “architected” to take the changes
  • Authorized vs unauthorized changes
  • Risk of automation: a new attack avenue
• Different kinds of “change”
  • Code changes
    • Restart: state and key issues
  • Policy or configuration changes
    • IP Tables, ADF, rate limiting, size checking
    • Hooks exist, can be done manually
  • Protocol/transport changes
This is an architecture and implementation issue; the solution will likely be dependent on the technologies being used.
A Futuristic DPASA++ System
[Diagram: the DPASA deployment extended with additional quads. Each quad has sm, ps, cor, psq, dc, NIDS and ap components (shown for q1 to q4 and q7). Client LANs: Wing Ops LAN, ENV LAN, PLANNING LAN, AMC CONUS LAN, each with an LC. Clients and sensors shown: CombOPS, SCRBT, TAP, TAPDB, WxHaz, ChemHaz, AODB, AODBSVR, Target, SWDIST, EDC, CAF, MAF, JEES, WNIDS, ENIDS, PNIDS, ANIDS. Additional labels: Emerald, Auto-action, Arch Difference, HD Search]
Color code:
• Removal of existing component/feature
• Enhancement of existing component/feature
• Addition of new component/feature
Changes:
• Enhanced SMs: eliminate advisors, more decision support interfaces
• More quads (PSQ/SM: Scalable Redundancy)
• Diverse variants of JVM and libraries
• OS support of isolation: keys, check-pointed data, etc.
• LCs enhanced with Arch Diff and Cognitive Executive
• Use Genesis, DAWSON, Asbestos, RMPL/AWDRAT technologies
• Taste testers: at key service providers such as PSQ (using existing redundancy) and maybe even at the perimeter router
Bottom Up Approach: Self-Regeneration Feedback Loop
[Diagram labels: Controller, Application; service specification, service deviation, resource allocation, service]
• “service” may include the app’s
  • functional correctness and/or
  • quality of service delivery
Feedback Loop Including Resources
[Diagram labels: Controller (knowledge, analysis, strategy), Application, three Resources; service specification, service deviation, service measurement (from the Application and from the Resources), resource allocation, resource configuration, service]
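One pass around such a loop can be sketched in a few lines. Everything here is a simplifying assumption made for illustration: service is reduced to a single throughput number, the "strategy" is just swapping in spares, and the class names (`ServiceSpec`, `Resource`, `Application`, `Controller`) are hypothetical rather than taken from any SRS project.

```python
from dataclasses import dataclass


@dataclass
class ServiceSpec:
    min_throughput: float  # hypothetical service specification


@dataclass
class Resource:
    name: str
    capacity: float
    healthy: bool = True


class Application:
    def __init__(self, resources):
        self.resources = list(resources)

    def measure(self):
        """Service measurement: capacity delivered by healthy resources."""
        return sum(r.capacity for r in self.resources if r.healthy)


class Controller:
    def __init__(self, spec, spares):
        self.spec = spec
        self.spares = list(spares)

    def deviation(self, measured):
        """Service deviation: shortfall against the specification."""
        return max(0.0, self.spec.min_throughput - measured)

    def step(self, app):
        """Measure, analyze, then reallocate resources; return remaining shortfall."""
        shortfall = self.deviation(app.measure())
        # Resource configuration: drop resources known to have failed.
        app.resources = [r for r in app.resources if r.healthy]
        # Resource allocation: add spares until the specification is met.
        while shortfall > 0 and self.spares:
            app.resources.append(self.spares.pop())
            shortfall = self.deviation(app.measure())
        return shortfall
```

A remaining shortfall of zero corresponds to regeneration back to the specified level of service; a positive value means the controller has run out of spares and the system is only degrading gracefully.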
Using SRS Technologies in Feedback Loop
• Service specification:
  • RMPL, AWDRAT, Daikon
• Service measurement:
  • Archie, RMPL, Architectural Differencing, PMOP
• Resource configuration:
  • Genesis, DAWSON, Repair Tool, Cortex, HDSM
• Resource allocation:
  • RMPL, AWDRAT
• Controller:
  • Knowledge: Cortex
  • Analysis: Trust Modeling, HDSM
  • Strategy: RMPL, AWDRAT
Using SRS Technologies for Distributivity
• Self-Regenerative System will likely distribute
  • Application and/or
  • Resources and/or
  • Controller
• For coordinating distributed redundant application services and resources
  • Steward, Q/U, R/W, QuickSilver (virtual synchrony)
• For coordinating distributed redundant controllers
  • SlingShot (probabilistic time-critical)
Design Choices for Feedback Loops
• Hierarchy
  • Loops may be placed within application components, resources, and/or controllers of larger loops
  • Loops may share resources and/or controllers
  • Controllers often share data:
    • Synthesized from lower layers
    • Inherited from higher layers
  • Trade speed for smarts:
    • small loops are fast and dumb; large loops are slow and aware
• Coordination
  • Replicated controllers allow easier analysis of defensive properties
  • Autonomous, decentralized controllers reduce the cost of coordination
Example: Multiple Components, Nested and Distributed Controllers, Shared Resources
[Diagram: three Controllers, two Components and five Resources]
Design Rules-Of-Thumb
• Use purely local reaction only when accurate self-accusation is possible
  • “Organic” decision-making
  • Examples: if uncaught exception, restart thread; if seg fault, start new variant
• Controller scope should follow some boundary defined by access controls
  • Examples: a LAN bounded by firewalls
• For every resource, some controller scope should monitor all its uses
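The first rule's example, "if uncaught exception, restart thread", can be sketched as a small supervisor. The `supervise` function and its `max_restarts` bound are hypothetical; the point is only that the evidence of failure (the thread's own uncaught exception) is local and unambiguous, so the reaction can be local too.

```python
import threading


def supervise(task, max_restarts=3):
    """Run task in a thread and restart it after an uncaught exception.

    Returns the number of restarts that were needed. Gives up with a
    RuntimeError once max_restarts restarts have already been used.
    """
    restarts = 0
    while True:
        failure = []

        def run():
            try:
                task()
            except Exception as exc:  # the thread accuses itself
                failure.append(exc)

        worker = threading.Thread(target=run)
        worker.start()
        worker.join()
        if not failure:
            return restarts
        if restarts == max_restarts:
            raise RuntimeError("giving up after repeated failures") from failure[0]
        restarts += 1
```

The bound on restarts marks where purely local reaction stops being appropriate: past it, the decision would have to move to a controller with wider scope.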
Natural Architectural Fragments
• Use AWDRAT, RMPL, or Cortex as Controller framework
  • entire system or a significant subsystem and/or
  • one object or process
• Use Genesis or DAWSON to create alternate method implementations used in AWDRAT or RMPL
• Use Asbestos to compartmentalize data for multiple clients in Q/U protocol or multiple groups in QuickSilver protocol
• Construct a Unified Communication Service from multicast protocols
  • Runtime selection of alternate communication protocol with different properties
• Apply Learning and Repair technology to other SRS components
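Runtime selection among protocols with different properties could be organized roughly as below. This is a sketch of the idea only: the `UnifiedCommunicationService` class, its methods and the property names are hypothetical, and the properties attached to QuickSilver and SlingShot merely paraphrase the descriptions given earlier in the deck (virtual synchrony versus probabilistic, time-critical delivery).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transport:
    name: str
    properties: frozenset


class UnifiedCommunicationService:
    """Hypothetical registry that picks a transport by required properties."""

    def __init__(self, transports):
        self._transports = list(transports)
        self._unavailable = set()

    def mark_unavailable(self, name):
        self._unavailable.add(name)

    def select(self, *requirement_sets):
        """Try each requirement set in order of preference.

        Later sets express weaker guarantees the caller is willing to accept.
        """
        for required in requirement_sets:
            for transport in self._transports:
                if transport.name in self._unavailable:
                    continue
                if set(required) <= transport.properties:
                    return transport
        raise LookupError("no available transport meets any requirement set")


def default_service():
    # Property names are assumptions made for this sketch.
    return UnifiedCommunicationService([
        Transport("QuickSilver", frozenset({"virtual_synchrony"})),
        Transport("SlingShot", frozenset({"time_critical", "probabilistic"})),
    ])
```

Calling `select({"virtual_synchrony"}, {"time_critical"})` returns QuickSilver while it is available and falls back to SlingShot once QuickSilver is marked unavailable, which mirrors the "change transport, accept the weaker guarantees" response option.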
Conclusion
• Various SRS technologies would have allowed improvement to our DPASA system defenses.
• Taken collectively, SRS technologies address most parts of the problem of self-regenerative control.
• Underlying SRS ideas seem sound but many implementations are immature.
• SRS technologies do not show how to distribute and scale self-regenerative control loops.
Placeholder for Strawman FixIt
• Componentization of defense
  • Protection, detection and adaptation
  • Organic decision making
• Unified Communication Service
• Architecture:
  • Organizing defense-enabled components over the UCS substrate
    • Layered vs monolithic
    • Loose confederation vs logical centralization
    • (DPASA is layered and logically centralized)
  • Deliberative inter-component adaptations