The Insider Threat in Scalable Distributed Systems: Algorithms, Metrics, Gaps Yair Amir Distributed Systems and Networks lab Johns Hopkins University www.dsn.jhu.edu ACM STC’07
Acknowledgement • Johns Hopkins University • Claudiu Danilov, Jonathan Kirsch, John Lane • Purdue University • Cristina Nita-Rotaru, Josh Olsen, Dave Zage • Hebrew University • Danny Dolev • Telcordia Technologies • Brian Coan
This Talk in Context Scalable Information Access & Communication • High availability (80s – 90s) • Benign faults, accidental errors, crashes, recoveries, network partitions, merges. • Fault tolerant replication as an important tool. • Challenges – consistency, scalability and performance. • Security (90s – 00s) • Attackers are external. • Securing the distributed system and the network. • Crypto+ as an important tool. • Survivability (00s – …) • Millions of compromised computers: there is always a chance the system will be compromised. • Let's start the game when parts of the system are already compromised. • Can the system still achieve its goal, and under what assumptions? • Challenges – assumptions, scalability, performance.
Trends: Information Access & Communication • Networks become one • From one's network to one Internet. • Therefore, the environment inherently becomes increasingly hostile. • Stronger adversaries => weaker models. • Benign faults – mean time to failure, fault independence • Fail-stop, crash-recovery, network partitions-merges. • Goals: high availability, consistency (safety, liveness). • External attacks – us versus them • Eavesdropping, replay attacks, resource-consumption DoS. • Goals: keep them out. Authentication, Integrity, Confidentiality. • Insider attacks – the enemy is us • Byzantine behavior • Goals: safety, liveness, (performance?)
The Insider Threat • Networks are already hostile! • 250,000 new zombie nodes per day. • Very likely that some of them are part of critical systems. • Insider attacks are a real threat, even for well-protected systems. • Challenges: • Service level: Can we provide “correct” service? • Network level: Can we “move” the bits? • Client level: Can we handle “bad” input?
The Insider Threat in Scalable Systems • Service level: Byzantine Replication • Hybrid approach: few trusted components, everything else can be compromised. • Symmetric approach: No trusted component, compromise up to some threshold. • Network level: Byzantine Routing • Flooding “solves” the problem. • “Stable” networks - some limited solutions, good starting point [Awerbuch et al. 02] • “Dynamic” networks – open problem. • Client level:? • Input replication – not feasible in most cases. • Recovery after the fact – Intrusion detection, tracking and backtracking [Chen et al. 03]. • Open question – is there a better approach?
Outline • Context and trends • Various levels of the insider threat problem • Service level problem formulation • Relevant background • Steward: First scalable Byzantine replication • A bit on how it works • Correctness • Performance • Tradeoffs • Composable architecture • A bit on how it works • BLink – Byzantine link protocol • Performance and optimization • Theory hits reality • Limitation of existing correctness criteria • Proposed model and metrics • Summary
Service Level: Problem Formulation [Figure: a site on a Wide Area Network, with clients and server replicas 1, 2, 3, …, N] • Servers are distributed in sites, over a Wide Area Network. • Clients issue requests to servers, then get back answers. • Some servers can act maliciously. • Wide area connectivity is limited and unstable. • How to get good performance and guarantee correctness? • What is correctness?
Relevant Prior Work • Byzantine Agreement • Byzantine generals [Lamport et al. 82], [Dolev 83] • Replication with benign faults • 2-phase commit [Eswaran, Gray et al. 76] • 3-phase commit [Skeen, Stonebraker 82] • Paxos [Lamport 98] • Hybrid architectures • Hybrid Byzantine tolerant systems [Correia, Verissimo et al. 04] • Symmetric approaches for Byzantine-tolerant replication • BFT [Castro, Liskov 99] • Separating agreement from execution [Yin, Alvisi et al. 03] • Fast Byzantine consensus [Martin, Alvisi 05] • Byzantine-tolerant storage using erasure codes [Goodson, Reiter et al. 04]
Background: Paxos and BFT [Figure: Paxos message flow among a client and servers 0–2: request, proposal, accept, reply; BFT message flow: request, pre-prepare, prepare, commit, reply] • Paxos [Lamport 98] • Ordering coordinated by an elected leader. • Two rounds among servers during the normal case (Proposal and Accept). • Requires 2f+1 servers to tolerate f benign faults. • BFT [Castro, Liskov 99] • Extends Paxos into the Byzantine environment. • One additional round of communication, plus crypto. • Requires 3f+1 servers to tolerate f Byzantine servers.
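A minimal sketch of the sizing behind these bullets, assuming nothing beyond the numbers on the slide (function names are illustrative only):

```python
# Replica and quorum sizes needed to tolerate f faults under the
# crash-fault (Paxos) and Byzantine (BFT) models.

def paxos_sizes(f: int) -> tuple[int, int]:
    """Paxos: 2f+1 replicas, majority quorum of f+1."""
    n = 2 * f + 1
    return n, f + 1

def bft_sizes(f: int) -> tuple[int, int]:
    """BFT: 3f+1 replicas, quorum of 2f+1, so any two quorums
    intersect in at least f+1 replicas (at least one correct one)."""
    n = 3 * f + 1
    return n, 2 * f + 1

if __name__ == "__main__":
    for f in range(1, 6):
        print(f, paxos_sizes(f), bft_sizes(f))
```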
Background: Threshold Crypto • Practical Threshold Signatures [Shoup 2000] • Each participant receives a secret share. • Each participant signs a certain message with its share, and sends the signed message to a combiner. • Out of k valid signed shares, the combiner creates a (k, n) threshold signature. • A (k, n) threshold signature • Guarantees that at least k participants signed the same message with their share. • Can be verified with simple RSA operations. • Combining the shares is fairly expensive. • Signature verification is fast.
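A hedged sketch of the combiner's role in this flow. The actual cryptography is abstracted behind two callables (verify_share, combine) that a Shoup-style threshold RSA implementation would supply; they are placeholders, not a real library API:

```python
from typing import Callable, Dict

def combine_threshold_signature(
    message: bytes,
    partial_sigs: Dict[int, bytes],               # server id -> signed share
    k: int,
    verify_share: Callable[[bytes, int, bytes], bool],
    combine: Callable[[bytes, Dict[int, bytes]], bytes],
) -> bytes:
    """Keep only valid signed shares, require at least k of them, and
    combine them into one (k, n) threshold signature on `message`.
    Combining is the expensive step; verifying the resulting signature
    is an ordinary, cheap RSA verification done by the receiver."""
    valid = {sid: sig for sid, sig in partial_sigs.items()
             if verify_share(message, sid, sig)}
    if len(valid) < k:
        raise ValueError(f"only {len(valid)} valid shares, need {k}")
    chosen = dict(list(valid.items())[:k])
    return combine(message, chosen)
```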
Steward: First Byzantine Replication Scalable to Wide Area Networks [DSN 2006] [Figure: a site on a Wide Area Network, with clients and server replicas 1, 2, 3, …, N] • Each site acts as a trusted unit that can crash or partition. • Within each site: Byzantine-tolerant agreement (similar to BFT). • Masks f malicious faults in each site. • Threshold signatures prove agreement to other sites. • … that is optimally intertwined with … • Between sites: light-weight, fault-tolerant protocol (similar to Paxos). • There is no free lunch: we pay with more hardware. • 3f+1 servers in each site.
Outline • Context and trends • Various levels of the insider threat problem • Service level problem formulation • Relevant background • Steward: First scalable Byzantine replication • A bit on how it works • Correctness • Performance • Tradeoffs • Composable architecture • A bit on how it works • BLink – Byzantine link protocol • Performance and optimization • Theory hits reality • Limitation of existing correctness criteria • Proposed model and metrics • Summary
Main Idea 1: Common Case Operation [Figure: Byzantine ordering inside the leader site, a threshold-signed proposal (2f+1), and threshold-signed accepts (2f+1) from the other sites] • A client sends an update to a server at its local site. • The update is forwarded to the leader site. • The representative of the leader site assigns order in agreement and issues a threshold signed proposal. • Each site issues a threshold signed accept. • Upon receiving a majority of accepts, servers in each site "order" the update. • The original server sends a response to the client.
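A highly simplified sketch of this common case, seen from the leader site. This is an assumed reconstruction, not Steward's code: `local_bft`, `sites`, and `threshold_sign` are hypothetical handles standing in for the local Byzantine agreement, the other sites, and (2f+1)-threshold signing.

```python
def order_update_common_case(update, local_bft, sites, threshold_sign):
    # 1. The leader site's replicas agree on a sequence number for the update.
    seq = local_bft.assign_sequence(update)
    # 2. That agreement is proven to other sites with a threshold-signed Proposal.
    proposal = threshold_sign(("proposal", seq, update))
    # 3. Each site answers with a threshold-signed Accept for the Proposal.
    accepts = [site.handle_proposal(proposal) for site in sites]
    # 4. A majority of sites (including the leader site) globally orders the update.
    total_sites = len(sites) + 1
    if len(accepts) + 1 > total_sites // 2:
        return ("globally_ordered", seq, update)
    return ("awaiting_accepts", seq, update)
```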
Steward Hierarchy Benefits • Reduces the number of messages sent on the wide area network. • O(N²) → O(S²) – helps both in throughput and latency. • Reduces the number of wide area crossings. • BFT-based protocols require 3 wide area crossings. • Paxos-based protocols require 2 wide area crossings. • Optimizes the number of local Byzantine agreements. • A single agreement per update at the leader site. • Potential for excellent performance. • Increases system availability • (2/3 of total servers + 1) → (a majority of sites). • Read-only queries can be answered locally.
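A back-of-the-envelope sketch of the O(N²) → O(S²) reduction, using assumed all-to-all message counts rather than measured numbers:

```python
# Wide-area message exchanges per ordered update: a flat all-to-all protocol
# over N replicas versus a hierarchical protocol over S sites.

def flat_wide_area_messages(n_replicas: int) -> int:
    # All-to-all rounds among N replicas cross the wide area: O(N^2).
    return n_replicas * (n_replicas - 1)

def hierarchical_wide_area_messages(n_sites: int) -> int:
    # Only threshold-signed messages between site pairs cross the wide area: O(S^2).
    return n_sites * (n_sites - 1)

if __name__ == "__main__":
    # Scale of the talk's experiments: 80 replicas spread over 5 sites.
    print(flat_wide_area_messages(80))         # 6320 wide-area messages
    print(hierarchical_wide_area_messages(5))  # 20 wide-area messages
```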
Steward Hierarchy Challenges • Each site has a representative that: • Coordinates the Byzantine protocol inside the site. • Forwards packets in and out of the site. • One of the sites acts as the leader in the wide area protocol • The representative of the leading site is the one assigning sequence numbers to updates. • Messages coming out of a site during leader election are based on communication between 2f+1 (out of 3f+1) servers inside the site. • There can be multiple sets of 2f+1 servers. • In some instances, multiple correct but different site messages can be issued by a malicious representative. • It is sometimes impossible to completely isolate a malicious server's behavior inside its own site. • How do we select and change representatives in agreement? • How do we select and change the leader site in agreement? • How do we transition safely when we need to change them?
Main Idea 2: View Changes • Sites change their local representatives based on timeouts. • The leader site representative has a larger timeout. • This allows it to contact at least one correct representative at the other sites. • After changing enough leader site representatives, servers at all sites stop participating in the protocol, and elect a different leading site.
Correctness Criteria • Safety: • If two correct servers order an update with the same sequence i, then these updates are identical. • Liveness: • If there exists a set of a majority of sites, each consisting of at least 2f+1 correct, connected servers, and a time after which all sites in the set are connected, then if a client connected to a site in the set proposes an update, some correct server at a site in the set eventually orders the update.
Intuition Behind a Proof • Safety: • Any agreement (ordering or view change) involves a majority of sites, and 2f+1 servers in each. • Any two majorities intersect in at least one site. • Any two sets of 2f+1 servers in that site intersect in at least f+1 servers (which means at least one correct server). • That correct server will not agree to order two different updates with the same sequence. • Liveness: • A correct representative or leader site cannot be changed by f local servers. • The selection of different timeouts ensures that a correct representative of the leader site has enough time to contact correct representatives at other sites.
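A small sanity check of the quorum-intersection step in the safety argument: any two sets of 2f+1 servers out of 3f+1 intersect in at least f+1 servers, hence in at least one correct server. This is an illustration only, not part of the actual proof.

```python
from itertools import combinations

def min_intersection(f: int) -> int:
    n = 3 * f + 1
    quorums = list(combinations(range(n), 2 * f + 1))
    return min(len(set(a) & set(b)) for a in quorums for b in quorums)

if __name__ == "__main__":
    for f in (1, 2):
        assert min_intersection(f) == f + 1
        print(f"f={f}: any two quorums of size {2*f+1} share at least {f+1} servers")
```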
Testing Environment • Platform: Dual Intel Xeon 3.2 GHz 64-bit CPUs, 1 GByte RAM, Linux Fedora Core 4. • Library relies on OpenSSL: • Used OpenSSL 0.9.7a, Feb 2003. • Baseline operations: • RSA 1024-bit sign: 1.3 ms, verify: 0.07 ms. • 1024-bit modular exponentiation: ~1 ms. • Generating a 1024-bit RSA key: ~55 ms.
Symmetric Wide Area Network • Synthetic network used for analysis and understanding. • 5 sites, each of which is connected to all other sites with equal bandwidth/latency links. • One fully deployed site of 16 replicas; the other sites are emulated by one computer each. • Total – 80 replicas in the system, emulated by 20 computers. • 50 ms wide area links between sites. • Varied wide area bandwidth and the number of clients.
Write Update Performance • Symmetric network. • 5 sites. • BFT: • 16 replicas total. • 4 replicas in one site, 3 replicas in each other site. • Up to 5 faults total. • Steward: • 16 replicas per site. • Total of 80 replicas (four sites are emulated). Actual computers: 20. • Up to 5 faults in each site. • Update only performance (no disk writes).
Read-only Query Performance • 10 Mbps on wide area links. • 10 clients inject mixes of read-only queries and write updates. • None of the systems was limited by bandwidth. • Performance improves between a factor of two and more than an order of magnitude. • Availability: Queries can be answered locally, within each site.
Wide-Area Scalability • Selected 5 Planetlab sites, in 5 different continents: US, Brazil, Sweden, Korea and Australia. • Measured bandwidth and latency between every pair of sites. • Emulated the network on our cluster, both for Steward and BFT. • 3-fold latency improvement even when bandwidth is not limited. (How come?)
Non-Byzantine Comparison • Based on a real experimental network (CAIRN). • Several years ago we benchmarked benign replication on this network. • Modeled on our cluster, emulating bandwidth and latency constraints, both for Steward and BFT. [Figure: CAIRN topology – MITPC (Boston), UDELPC (Delaware), TISWPC (San Jose), ISEPC and ISEPC3 (Virginia), ISIPC and ISIPC4 (Los Angeles); wide-area links of 1.4–38.8 ms and 1.42–9.81 Mbits/sec, local links 100 Mb/s, <1 ms]
CAIRN Emulation Performance • Steward is limited by bandwidth at 51 updates per second. • 1.8 Mbps can barely accommodate 2 updates per second for BFT. • Earlier experimentation with benign-fault 2-phase commit protocols achieved up to 76 updates per second [Amir et al. 02].
Steward: Approach Tradeoffs • Excellent performance • Optimized based on intertwined knowledge among global and local protocols. • Highly complex • Complex correctness proof. • Complex implementation. • Limited model does not translate well to wide area environment needs • Global benign protocol over local Byzantine. • “What if the whole site is compromised?” • Partially addressed by implementing 4 different protocols: Byzantine/Benign, Byzantine/Byzantine, Benign/Benign, Benign/Byzantine (Steward). • “Different sites have different security profiles…”
A Composable Approach [SRDS 2007] • Use a clean two-level hierarchy to maintain scalability. • Clean separation of the local and global protocols. • Message complexity remains O(Sites²). • Use state machine based logical machines to achieve a customizable architecture. • Free substitution of the fault tolerance method used in each site and among the sites. • Use efficient wide-area communication to achieve high performance. • Byzantine Link (BLink) protocol for inter-logical-machine communication.
Outline • Context and trends • Various levels of the insider threat problem • Service level problem formulation • Relevant background • Steward: First scalable Byzantine replication • A bit on how it works • Correctness • Performance • Tradeoffs • Composable architecture • A bit on how it works • BLink – Byzantine link protocol • Performance and optimization • Theory hits reality • Limitation of existing correctness criteria • Proposed model and metrics. • Summary
Building a Logical Machine [Figure: two sites, A and B, each running a Local-Area Protocol to implement a Logical Machine; the Wide-Area Protocol runs between the logical machines over BLink] • A single instance of the wide-area replication protocol runs among a group of logical machines (LMs), one in each site. • Logical machines behave like single physical machines with respect to the wide-area protocol. • Logical machines send threshold-signed wide-area messages via BLink. • Each logical machine is implemented by a separate instance of a local state machine replication protocol. • Physical machines in each site locally order all wide-area protocol events: • Wide-area message reception events. • Wide-area protocol timeout events. • Each logical machine executes a single stream of wide-area protocol events.
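A hedged sketch of the logical-machine idea, under the assumption (from the bullets above) that every wide-area protocol event is first ordered by the local replication protocol. It is a structural illustration, not the actual implementation:

```python
class LogicalMachine:
    """One logical machine, implemented by the replicas of a single site.
    local_order(event) submits an event to the site's local replication
    protocol (e.g., BFT); once the site agrees on its position, that protocol
    calls deliver_ordered(event) on every replica in the same order."""

    def __init__(self, local_order, wide_area_protocol):
        self.local_order = local_order
        self.wide_area = wide_area_protocol      # deterministic state machine

    def on_wide_area_message(self, msg):
        # Reception of a wide-area message is itself an event to be ordered.
        self.local_order(("recv", msg))

    def on_timeout(self, timer_id):
        # So are wide-area protocol timeouts.
        self.local_order(("timeout", timer_id))

    def deliver_ordered(self, event):
        # Every correct replica executes the identical event stream, so the
        # site behaves like one machine with respect to the wide-area protocol.
        self.wide_area.handle(event)
```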
A Composable Architecture • Clean separation and free substitution • We can choose the local-area protocol deployed in each site, and the wide-area protocol deployed among sites. • Trade performance for fault tolerance • Protocol compositions: wide area / local area • Paxos on the wide area: Paxos/Paxos, Paxos/BFT • BFT on the wide area: BFT/Paxos, BFT/BFT
An Example: Paxos/BFT [Figure: five logical machines LM1–LM5 over the Wide-Area Network, with LM1 as the leader site; each logical machine is built from physical machines and connected to the others by BLink logical links; a client attaches to one site]
Paxos/BFT in Action [Animated figure: logical machines LM1–LM5, with LM1 as the leader; each step below is shown on its own slide] • Update initiation from Client. • Local Ordering of Update, Threshold Signing of Update. • Forwarding of Update to Leader LM via BLink. • Local Ordering of Update, Threshold Signing of Proposal. • Dissemination of Proposal via BLink. • Local Ordering of Proposal, Threshold Signing of Accept. • Dissemination of Accepts via BLink. • Local Ordering of Accepts, Global Ordering of Proposal. • Reply to Client.
The BLink Protocol • Faulty servers can block communication into and out of logical machines. • Redundant message sending is not feasible in wide-area environments. • Our approach: the BLink protocol • Outgoing wide-area messages are normally sent only once. • Four sub-protocols, depending on the fault tolerance method in the sending and receiving logical machines: • (Byzantine, Byzantine), (Byzantine, benign) • (benign, Byzantine), (benign, benign) • This talk: (Byzantine, Byzantine)
Constructing Logical Links [Figure: a BLink logical link between a sending and a receiving logical machine, built from many virtual links] • Logical links are constructed from sets of virtual links. • Each virtual link contains: • A forwarder from the sending logical machine. • A peer from the receiving logical machine. • Virtual links are constructed via a mapping function. • At a given time, the LM delegates wide-area communication responsibility to one virtual link on each logical link. • Virtual links suspected of being faulty are replaced according to a selection order.
Intuition: A Simple Mapping [Figure: servers 0–6 of the sending LM paired with servers 0–6 of the receiving LM; faulty servers marked with X] • F = 2, N = 3F+1 = 7. • Servers 0 and 1 from the Sending LM and Servers 2 and 3 from the Receiving LM are faulty. • Mapping function: • Virtual link i consists of the servers with id i mod N. • Selection order: • Cycle through virtual links in sequence (1, 2, 3, …). • Two important metrics: • Ratio of correct to faulty virtual links. • Worst-case number of consecutive faulty virtual links. • With the simple mapping: • At least 1/3 of the virtual links are correct. • The adversary can block at most 2F consecutive virtual links. • With a more sophisticated mapping: • At least 4/9 of the virtual links are correct. • The adversary can block at most 2F consecutive virtual links.
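An illustrative sketch of the simple i mod N mapping on this slide's example. It assumes, as a reconstruction, that virtual link i pairs the forwarder and peer with the same id; it is not the BLink implementation.

```python
def virtual_links(n: int):
    """Virtual link i pairs forwarder i mod N (sending LM)
    with peer i mod N (receiving LM)."""
    return [(i % n, i % n) for i in range(n)]

def link_is_correct(link, faulty_senders, faulty_receivers):
    fwd, peer = link
    return fwd not in faulty_senders and peer not in faulty_receivers

if __name__ == "__main__":
    F = 2
    N = 3 * F + 1                       # 7 servers per logical machine
    faulty_senders = {0, 1}             # the slide's example
    faulty_receivers = {2, 3}
    good = [link_is_correct(l, faulty_senders, faulty_receivers)
            for l in virtual_links(N)]
    print(good)           # [False, False, False, False, True, True, True]
    print(sum(good), N)   # 3 of 7 virtual links are correct (>= 1/3)
    # Worst case here: the adversary blocks 2F = 4 consecutive virtual links (0..3).
```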
Architectural Comparison • Paxos/BFT vs. Steward • Same level of fault tolerance • Paxos/BFT locally orders all wide-area protocol events, Steward orders events only when necessary. • Paxos/BFT achieves about 2.5 times lower throughput than Steward. • Difference is the cost of providing customizability! • Protocols were CPU-limited. • Relative maximum throughput corresponds to the number of expensive cryptographic operations.
Performance Optimizations • Computational Bottlenecks: • 1. Ordering all message reception events. • 2. Threshold signing outgoing messages. • Solutions: • Aggregate local ordering: batching • Aggregate threshold signing: Merkle trees • Use a single threshold signature for many outgoing messages. • Outgoing messages contain additional information needed to verify the threshold signature.
Merkle Hash Trees • Use a single threshold signature for many outgoing wide-area messages. • Each leaf contains the digest of a message to be sent. • Each interior node contains the digest of the concatenation of its two children. • Threshold signature is computed on the root hash. [Figure: Merkle tree over messages m1 … m8 – leaves Ni = D(mi); interior nodes such as N1-2 = D(N1 || N2) and N1-4 = D(N1-2 || N3-4); root N1-8 = D(N1-4 || N5-8), which is threshold signed]
Example: Sending Message m4 • Outgoing message contains additional information needed to verify the signature: • The message itself. • The siblings of the nodes on the path from m4 to the root hash. • The signature on the root hash. • Send: m4 || N3 || N1-2 || N5-8, plus the signature on the root hash. • To verify, use the digests to reconstruct the root hash, then verify the threshold signature. [Figure: the same Merkle tree, with the path from N4 to the root and its siblings N3, N1-2 and N5-8 highlighted]
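A hedged sketch of this technique: build a Merkle tree over a batch of outgoing messages, sign only the root (here that signing step is left to the threshold scheme and not shown), and let each message carry the sibling digests needed to rebuild the root. It assumes a power-of-two batch size and uses SHA-256 as the digest for illustration.

```python
import hashlib

def d(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_tree(messages: list[bytes]) -> list[list[bytes]]:
    """Return tree levels, from leaves (digests of messages) up to the root."""
    levels = [[d(m) for m in messages]]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([d(prev[i] + prev[i + 1]) for i in range(0, len(prev), 2)])
    return levels

def auth_path(levels, index: int) -> list[bytes]:
    """Sibling digests along the path from leaf `index` to the root."""
    path = []
    for level in levels[:-1]:
        path.append(level[index ^ 1])   # the other child of the same parent
        index //= 2
    return path

def root_from_path(message: bytes, index: int, path: list[bytes]) -> bytes:
    """Recompute the root hash from a message and its authentication path."""
    node = d(message)
    for sibling in path:
        node = d(node + sibling) if index % 2 == 0 else d(sibling + node)
        index //= 2
    return node

if __name__ == "__main__":
    msgs = [f"m{i}".encode() for i in range(1, 9)]    # m1 .. m8
    levels = build_tree(msgs)
    root = levels[-1][0]                              # threshold-sign this once
    path = auth_path(levels, 3)                       # sending m4 (leaf index 3)
    assert root_from_path(msgs[3], 3, path) == root   # receiver's reconstruction
```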