Scalable Trusted Computing: Engineering challenge, or something more fundamental? Ken Birman, Cornell University
Cornell Quicksilver Project • Krzys Ostrowski: The key player • Ken Birman, Danny Dolev: Collaborators and research supervisors • Mahesh Balakrishnan, Maya Haridasan, Tudor Marian, Amar Phanishayee, Robbert van Renesse, Einar Vollset, Hakim Weatherspoon: Offered valuable comments and criticisms
Trusted Computing • A vague term with many meanings… • For individual platforms, integrity of the computing base • Availability and exploitation of TPM h/w • Proofs of correctness for key components • Security policy specification, enforcement • Scalable trust issues arise mostly in distributed settings
System model • A world of • Actors: Sally, Ted, … • Groups: Sally_Advisors = {Ted, Alice, …} • Objects: travel_plans.html, investments.xls • Actions: Open, Edit, … • Policies: • (Actor, Object, Action) → { Permit, Deny } • Places: Ted_Desktop, Sally_Phone, …
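A minimal sketch of this model as Python data types (the names Decision, Actor, Place, Groups, and Policy are illustrative, not part of any Quicksilver API):

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, FrozenSet, Tuple

class Decision(Enum):
    PERMIT = "Permit"
    DENY = "Deny"

@dataclass(frozen=True)
class Actor:
    name: str                      # e.g. "Sally", "Ted"

@dataclass(frozen=True)
class Place:
    name: str                      # e.g. "Ted_Desktop", "Sally_Phone"

# Groups are named sets of actors, e.g. Sally_Advisors = {Ted, Alice, ...}
Groups = Dict[str, FrozenSet[Actor]]

# A policy maps (actor, object, action) to Permit or Deny
Policy = Dict[Tuple[str, str, str], Decision]
```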
Rules • If Emp.place ∈ Secure_Places and Emp ∈ Client_Advisors then Allow Open Client_Investments.xls • Can Ted, working at Ted_Desktop, open Sally_Investments.xls? • … yes, if Ted_Desktop ∈ Secure_Places
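As a toy illustration, the rule and the question above might be checked like this (the contents of Secure_Places and the advisor group are assumed for the example):

```python
def may_open(emp: str, place: str,
             secure_places: set, client_advisors: set) -> bool:
    """The rule on this slide: an advisor working from a secure place
    may open the client's investments file."""
    return place in secure_places and emp in client_advisors

# Can Ted, working at Ted_Desktop, open Sally_Investments.xls?
secure_places = {"Ted_Desktop"}          # hypothetical contents of Secure_Places
sally_advisors = {"Ted", "Alice"}        # hypothetical contents of Client_Advisors
print(may_open("Ted", "Ted_Desktop", secure_places, sally_advisors))   # True
```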
Miscellaneous stuff • Policy changes all the time • Like a database receiving updates • E.g. as new actors are added, old ones leave the system, etc • … and they have a temporal scope • Starting at time t=19 and continuing until now, Ted is permitted to access Sally’s file investments.xls
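A tiny sketch of a temporally scoped permission, assuming a simple numeric clock:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class TimedPermission:
    actor: str
    obj: str
    action: str
    start: float                    # e.g. t = 19
    end: Optional[float] = None     # None means "continuing until now"

def holds_at(p: TimedPermission, t: float) -> bool:
    return p.start <= t and (p.end is None or t <= p.end)

# "Starting at time t=19 and continuing until now, Ted is permitted to
#  access Sally's file investments.xls"
grant = TimedPermission("Ted", "investments.xls", "Open", start=19)
print(holds_at(grant, 42))          # True
```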
Order dependent decisions • Consider rules such as: • Only one person can use the cluster at a time. • The meeting room is limited to three people • While people lacking clearance are present, no classified information can be exposed • These are sensitive to the order in which conflicting events occur • Central “clearinghouse” decides what to allow based on order in which it sees events
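For example, the meeting-room rule could be enforced by a toy clearinghouse whose answers depend purely on event order (the names and the limit are illustrative):

```python
class MeetingRoomClearinghouse:
    """Toy clearinghouse for the rule "the meeting room is limited to
    three people": the decision depends on the order in which it sees
    enter/leave events."""
    LIMIT = 3

    def __init__(self):
        self.occupants = set()

    def handle(self, event: str, person: str) -> str:
        if event == "enter":
            if len(self.occupants) >= self.LIMIT:
                return "Deny"
            self.occupants.add(person)
            return "Permit"
        if event == "leave":
            self.occupants.discard(person)
            return "Permit"
        return "Deny"

ch = MeetingRoomClearinghouse()
for ev, who in [("enter", "Sally"), ("enter", "Ted"),
                ("enter", "Alice"), ("enter", "Bob")]:
    print(who, ch.handle(ev, who))   # Bob is denied only because his event arrived fourth
```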
Goal: Enforce policy [Diagram: a "Read investments.xls" request is checked against the Policy Database before the data is returned]
… reduction to a proof • Each time an action is attempted, the system must develop a proof that the action should either be allowed or blocked • For example, it might use the BAN logic • For the sake of argument, let's assume we know how to do all this on a single machine
Implications of scale • We’ll be forced to replicate and decentralize the policy enforcement function • For ownership: Allows “local policy” to be stored close to the entity that “owns” it • For performance and scalability • For fault-tolerance
Decentralized policy enforcement [Diagram, original scheme: a single Policy Database checks the "Read investments.xls" request and returns the data]
Decentralized policy enforcement [Diagram, new scheme: the "Read investments.xls" request can be checked by either of two replicated databases, Policy DB 1 or Policy DB 2]
So… how do we decentralize? • Consistency: the bane of decentralization • We want a system to behave as if all decisions occur in a single “rules” database • Yet want the decisions to actually occur in a decentralized way… a replicated policy database • System needs to handle concurrent events in a consistent manner
So… how do we decentralize? • More formally: any run of the decentralized system should be indistinguishable from some run of a centralized system • Analogy: database 1-copy serializability
But this is a familiar problem! • Database researchers know it as the atomic commit problem. • Distributed systems people call it: • State machine replication • Virtual synchrony • Paxos-style replication • … and because of this we know a lot about the question!
… replicated data with abcast • Closely related to the “atomic broadcast” problem within a group • Abcast sends a message to all the members of a group • Protocol guarantees order, fault-tolerance • Solves consensus… • Indeed, a dynamic policy repository would need abcast if we wanted to parallelize it for speed or replicate it for fault-tolerance!
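A sketch of why: if every replica applies the same updates in the abcast order, the replicas stay identical (the abcast function below is only a stand-in for a real atomic broadcast protocol):

```python
from typing import Dict, List, Tuple

# An update sets one (actor, object, action) entry to "Permit" or "Deny"
Update = Tuple[Tuple[str, str, str], str]

class PolicyReplica:
    """Applies updates only in the order abcast delivers them, so every
    replica holds identical policy state."""
    def __init__(self):
        self.policy: Dict[Tuple[str, str, str], str] = {}

    def deliver(self, update: Update) -> None:
        key, decision = update
        self.policy[key] = decision

def abcast(replicas: List[PolicyReplica], update: Update) -> None:
    # Stand-in for a real atomic broadcast layer, which would also
    # provide fault-tolerance and a total order under concurrency.
    for r in replicas:
        r.deliver(update)

replicas = [PolicyReplica(), PolicyReplica()]
abcast(replicas, (("Ted", "investments.xls", "Open"), "Permit"))
```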
A slight digression • Consensus is a classical problem in distributed systems • N processes • They start execution with inputs {0,1} • Asynchronous, reliable network • At most 1 process fails by halting (crash) • Goal: protocol whereby all “decide” same value v, and v was an input
Distributed Consensus [Cartoon: "Jenkins, if I want another yes-man, I'll build one!" (Lee Lorenz, Brent Sheppard)]
Asynchronous networks • No common clocks or shared notion of time (local ideas of time are fine, but different processes may have very different “clocks”) • No way to know how long a message will take to get from A to B • Messages are never lost in the network
Fault-tolerant protocol • Collect votes from all N processes • At most one is faulty, so if one doesn’t respond, count that vote as 0 • Compute majority • Tell everyone the outcome • They “decide” (they accept outcome) • … but this has a problem! Why?
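In code, the naive protocol looks like this; the comment spells out where it breaks (the process names and votes are made up for the example):

```python
from typing import Dict, Optional

def naive_consensus(responses: Dict[str, Optional[int]]) -> int:
    """The protocol on this slide: collect votes from all N processes,
    count a missing vote as 0, decide the majority. `responses` maps a
    process id to its 0/1 vote, or None if no reply arrived in time."""
    votes = [v if v is not None else 0 for v in responses.values()]
    return 1 if sum(votes) > len(votes) / 2 else 0

# The flaw: asynchrony makes a slow process indistinguishable from a
# crashed one, so when the true vote is nearly a tie, counting a slow
# voter as 0 can flip the outcome.
print(naive_consensus({"p1": 1, "p2": 1, "p3": None}))   # decides 1, ignoring p3's late vote
```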
What makes consensus hard? • Fundamentally, the issue revolves around membership • In an asynchronous environment, we can’t detect failures reliably • A faulty process stops sending messages but a “slow” message might confuse us • Yet when the vote is nearly a tie, this confusing situation really matters
Some bad news • FLP result shows that fault-tolerant consensus protocols always have non-terminating runs. • All of the mechanisms we discussed are equivalent to consensus • Impossibility of non-blocking commit is a similar result from database community
But how bad is this news? • In practice, these impossibility results don't hold up so well • Both define "impossible" as "not always possible" • In fact, under a probabilistic model, the FLP scenario has probability zero • … must ask: does a probability-zero result even hold in a "real system"? • Indeed, people build consensus-based systems all the time…
Solving consensus • Systems that "solve" consensus often use a membership service • This GMS functions as an oracle, a trusted status-reporting function • The consensus protocol then becomes a kind of 2-phase protocol that runs over the output of the GMS • It is known precisely when such a solution will be able to make progress
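A structural sketch of that idea, with many real concerns (leader failure, message loss, quorum rules) deliberately omitted; the GMS class and the send callable here are stand-ins, not real Quicksilver interfaces:

```python
class GMS:
    """Stand-in for a membership oracle: it reports the current official
    view and a view id that changes when membership changes."""
    def __init__(self, members):
        self.view = list(members)
        self.view_id = 1

def two_phase_update(gms: GMS, send, value) -> bool:
    """Leader-side skeleton of a 2-phase protocol run over the GMS output:
    phase 1 asks every member of the current view to accept `value`,
    phase 2 tells them to commit. If the GMS reports a new view in the
    middle, the leader restarts in that view."""
    while True:
        view_id, members = gms.view_id, list(gms.view)
        acks = [send(m, ("prepare", view_id, value)) for m in members]
        if gms.view_id != view_id:
            continue                          # membership changed: retry in the new view
        if all(acks):
            for m in members:
                send(m, ("commit", view_id, value))
            return True
        return False
```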
More bad news • Consensus protocols don't scale! • Isis (virtual synchrony) new view protocol • Selects a leader; normally 2 phases; 3 if the leader dies • Each phase is a 1-to-n multicast followed by an n-to-1 convergecast (can tolerate n/2-1 failures) • Paxos decree protocol • Basic protocol has no leader and could have rollbacks with probability linear in n • Faster-Paxos is isomorphic to the Isis view protocol (!) • … both are linear in group size • Regular Paxos might be O(n²) because of rollbacks
Work-arounds? • Only run the consensus protocol in the "group membership service" or GMS • It has a small number of members, like 3-5 • They run a protocol like the Isis one • Track membership (and other "global" state) on behalf of everything in the system as a whole • Scalability of consensus won't matter
But this is centralized • Recall our earlier discussion • Any central service running on behalf of the whole system will become burdened if the system gets big enough • Can we decentralize our GMS service?
GMS in a large system [Diagram: global events are inputs to the GMS; its output is the official record of events that mattered to the system]
Hierarchical, federated GMS • Quicksilver V2 (QS2) constructs a hierarchy of GMS state machines • In this approach, each "event" is associated with some GMS that owns the relevant official record [Diagram: a hierarchy of GMS0, GMS1, GMS2]
Delegation of roles • One (important) use of the GMS is to track membership in our rule enforcement subsystem • But “delegate” responsibility for classes of actions to subsystems that can own and handle them locally • GMS “reports” the delegation events • In effect, it tells nodes in the system about the system configuration – about their roles • And as conditions change, it reports new events
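One way to picture this, as a sketch only (the event categories and node names are invented for illustration):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DelegationEvent:
    """An entry in the GMS's official record handing a category of
    events to the subsystem that will own it locally."""
    category: str        # e.g. "kingston_door_access"
    delegate: str        # e.g. "kingston_scanners"

class Node:
    def __init__(self, name: str):
        self.name = name
        self.roles = set()

    def on_gms_event(self, ev: DelegationEvent) -> None:
        # The GMS reports delegation events; each node learns its roles from them.
        if ev.delegate == self.name:
            self.roles.add(ev.category)

scanners = Node("kingston_scanners")
scanners.on_gms_event(DelegationEvent("kingston_door_access", "kingston_scanners"))
print(scanners.roles)    # {'kingston_door_access'}
```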
Delegation [Cartoon: "In my capacity as President of the United States, I authorize John Pigg to oversee this nation's banks." "Thank you, sir! You can trust me."]
Delegation [Diagram: GMS0, GMS1, and a policy subsystem, illustrating delegation down the hierarchy]
Delegation example • IBM might delegate the handling of access to its Kingston facility to the security scanners at the doors • Events associated with Kingston access don’t need to pass through the GMS • Instead, they “exist” entirely within the group of security scanners
… giving rise to pub/sub groups • Our vision spawns lots and lots of groups that own various aspects of trust enforcement • The scanners at the doors • The security subsystems on our desktops • The key management system for a VPN • … etc • A nice match with publish-subscribe
Publish-subscribe in a nutshell • Publish(“topic”, message) • Subscribe(“topic”, handler) • Basic idea: • Platform invokes handler(message) each time a topic match arises • Fancier versions also support history mechanisms (lets joining process catch up)
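An in-process sketch of this interface, including a crude history mechanism; it is not the QSM API, just an illustration:

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

class PubSub:
    """Minimal in-process sketch; a real system would deliver over the
    network and offer stronger reliability guarantees."""
    def __init__(self):
        self.handlers: DefaultDict[str, List[Callable]] = defaultdict(list)
        self.history: DefaultDict[str, List[object]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable, catch_up: bool = False) -> None:
        self.handlers[topic].append(handler)
        if catch_up:                               # crude "history mechanism"
            for msg in self.history[topic]:
                handler(msg)

    def publish(self, topic: str, message: object) -> None:
        self.history[topic].append(message)
        for handler in self.handlers[topic]:       # invoke handler(message) on a topic match
            handler(message)

bus = PubSub()
bus.subscribe("policy/rules", lambda m: print("rule update:", m))
bus.publish("policy/rules", "Jeff will have access to the secret archives")
```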
Publish-subscribe in a nutshell • Concept first mentioned by Willy Zwaenepoel in a paper on multicast in the V system • First implementation was Frank Schmuck’s Isis “news” tool • Later re-invented in TIB message bus • Also known as “event notification”… very popular
Other kinds of published events • Changes in the user set • For example, IBM hired Sally. Jeff left his job at CIA. Halliburton snapped him up • Or the group set • Jeff will be handling the Iraq account • Or the rules • Jeff will have access to the secret archives • Sally is no longer allowed to access them
But this raises problems • If “actors” only have partial knowledge • E.g. the Cornell library door access system only knows things normally needed by that door • … then we will need to support out-of-band interrogation of remote policy databases in some cases
A Scalable Trust Architecture [Diagram: a GMS hierarchy tracks configuration events; a pub/sub framework carries role delegation; a master enterprise policy DB (the central database tracking overall policy) feeds slave systems that apply policy, each with knowledge limited to locally useful policy; together these form the enterprise policy system for some company or entity]
A Scalable Trust Architecture • Enterprises talk to one another when decisions require non-local information [Diagram: Cornell University and its PeopleSoft system send an inquiry to the FBI, which answers with policy]
Open questions? • Minimal trust • A problem reminiscent of zero-knowledge • Example: • FBI is investigating reports of zombies in Cornell’s Mann Library… Mulder is assigned to the case. • The Cornell Mann Library must verify that he is authorized to study the situation • But does FBI need to reveal to Cornell that the Cigarette Man actually runs the show?
Other research questions • Pub-sub systems are organized around topics, to which applications subscribe • But in a large-scale security policy system, how would one structure these topics? • Topics are like file names – “paths” • But we still would need an agreed upon layout
Practical research question • “State transfer” is the problem of initializing a database or service when it joins the system after an outage • How would we implement a rapid and secure state transfer, so that a joining security policy enforcement module can quickly come up to date? • Once it’s online, the pub-sub system reports updates on topics that matter to it
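One plausible shape for such a transfer, reusing the PolicyReplica and PubSub sketches from earlier slides (securing the snapshot itself is left out):

```python
def state_transfer(live_replica, joiner, bus, topic="policy/rules"):
    """Buffer updates first so nothing falls into the gap, install a
    snapshot, then replay the buffer. Re-applied updates are harmless
    here because deliver() is last-writer-wins; a real transfer would
    also authenticate and integrity-check the snapshot, and would drop
    the buffering subscription once the replay is done."""
    buffered = []
    bus.subscribe(topic, buffered.append)          # 1. capture concurrent updates
    joiner.policy = dict(live_replica.policy)      # 2. install a snapshot
    for u in buffered:                             # 3. replay anything that arrived meanwhile
        joiner.deliver(u)
    bus.subscribe(topic, joiner.deliver)           # 4. thereafter, apply updates directly
```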
Practical research question • Designing secure protocols for inter-enterprise queries • This could draw on the secured Internet transaction architecture • A hierarchy of credential databases • Used to authenticate enterprises to one another so that they can share keys • They employ the keys to secure "queries"
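As a toy example of that last step, a query could be authenticated with an HMAC once the enterprises share a key; the key value and query fields here are invented for illustration:

```python
import hashlib
import hmac
import json

def sign_query(shared_key: bytes, query: dict) -> dict:
    """Attach an HMAC so the receiving enterprise can check that the
    query came from the peer it already shares a key with. The key
    exchange itself (via the credential hierarchy) happens elsewhere."""
    body = json.dumps(query, sort_keys=True).encode()
    tag = hmac.new(shared_key, body, hashlib.sha256).hexdigest()
    return {"query": query, "mac": tag}

def verify_query(shared_key: bytes, signed: dict) -> bool:
    body = json.dumps(signed["query"], sort_keys=True).encode()
    expected = hmac.new(shared_key, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signed["mac"])

key = b"per-enterprise-pair session key"           # hypothetical shared key
msg = sign_query(key, {"from": "Cornell", "to": "FBI",
                       "ask": "is agent Mulder authorized for this case?"})
print(verify_query(key, msg))                      # True
```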
Recap? • We’ve suggested that scalable trust comes down to “emulation” of a trusted single-node rule enforcement service by a distributed service • And that service needs to deal with dynamics such as changing actor set, object set, rule set, group membership
Recap? • Concerns that any single node • Would be politically unworkable • Would impose a maximum capacity limit • Won’t be fault-tolerant • … pushed for a decentralized alternative • Needed to make a decentralized service emulate a centralized one
Recap? • This led us to recognize that our problem is an instance of an older problem: replication of a state machine or an abstract data type • The problem reduces to consensus… and hence is impossible • … but we chose to accept “Mission Impossible: V”
… Impossible? Who cares! • We decided that the impossibility results were irrelevant to real systems • Federation addressed by building a hierarchy of GMS services • Each supported by a group of servers • Each GMS owns a category of global events • Now can create pub/sub topics for the various forms of information used in our decentralized policy database • … enabling decentralized policy enforcement
QS2: A work in progress • We’re building Quicksilver, V2 (aka QS2) • Under development by Krzys Ostrowski at Cornell, with help from Ken Birman, Danny Dolev (HUJI) • Some parts already exist and can be downloaded now: • Quicksilver Scalable Multicast (QSM) • Focus is on reliable and scalable message delivery even with huge numbers of groups or severe stress on the system