Scalable Self-Repairing Publish/Subscribe Robbert van Renesse Ken Birman Werner Vogels Cornell University
Background • ISIS, Horus, Ensemble systems • Strong properties (for replicated data) • Adaptive (changing network/app behavior) • Problems: only as fast as the slowest receiver; “Jim Gray effect”; no IP Multicast
New Direction • Probabilistically Strong Guarantees • Randomized protocols • Compartmentalization • No reliance on IP multicast, clock sync • Auto-configuration, self-repair JBI
Three Main Components • Astrolabe: aggregation service • SelectCast: dissemination service • Bimodal Multicast: end-to-end reliability
Aggregation • Ability to summarize information from distributed sources. • aka data fusion in sensor networks. • The basis for scalability! • Standard service in databases. • Why not in distributed systems?
Examples • Barrier Synchronization • Voting • Resource Location • Multicast Routing
Astrolabe • Astrolabe takes continuous snapshots of the global state of a distributed system, and aggregates this information in user-specified ways.
Four Design Principles • Scalability through Hierarchy • Flexibility through Mobile SQL • Robustness through p2p Gossip • Security through Certificates
DNS-like Domain Hierarchy • Domains identified by path names • Each domain has an attribute list
MIB • Each domain has an attribute list called “MIB” (management information base). • MIBs of internal domains generated by aggregating child domains’ MIBs.
Domain Table • No servers for any domain: a MIB is replicated on all hosts in its domain! • Each host maintains not only the MIBs of its own domains, but also those of its sibling domains. • Sibling MIBs organized in “domain tables”.
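A minimal sketch of which domain tables a single host would replicate, assuming slash-separated path names. The path `/usa/cornell/pc3` and the function name `domain_tables` are hypothetical, chosen only for illustration.

```python
def domain_tables(host_path: str) -> list[str]:
    """Return the domains whose child tables a host replicates.

    A host at /a/b/c keeps one table per ancestor domain: the table for
    "/" (rows: /a and its siblings), the table for /a (rows: /a/b and its
    siblings), and the table for /a/b (rows: /a/b/c and its siblings).
    """
    parts = [p for p in host_path.split("/") if p]
    tables = ["/"]  # every host keeps the root table
    for i in range(1, len(parts)):
        tables.append("/" + "/".join(parts[:i]))
    return tables


# Hypothetical host three levels below the root.
print(domain_tables("/usa/cornell/pc3"))
```

Each returned domain corresponds to one table whose rows are the MIBs of that domain's children, so the host sees its own domain and its siblings at every level.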
Aggregation • SQL query “summarizes” data • Dynamically changing query output is visible domain-wide (like a spreadsheet) • [Figure: Domain1 and Domain2]
Example queries • SELECT SUM(nmembers) AS nmembers • SELECT MAX(depth) + 1 AS depth • SELECT MIN(minl) AS minl (minimum load) • … • Functions gossiped with everything else.
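The three example queries can be sketched as one function that builds a parent MIB from its children's MIBs. This is an illustrative sketch, not Astrolabe's SQL engine; it assumes each MIB is a plain dictionary and that a leaf host reports `nmembers = 1` and `depth = 0`.

```python
def aggregate(child_mibs: list[dict]) -> dict:
    """Compute a parent MIB from child MIBs, mirroring the example queries."""
    return {
        "nmembers": sum(m["nmembers"] for m in child_mibs),  # SUM(nmembers)
        "depth": max(m["depth"] for m in child_mibs) + 1,    # MAX(depth) + 1
        "minl": min(m["minl"] for m in child_mibs),          # MIN(minl)
    }


def leaf(load: float) -> dict:
    """MIB of a single host (assumed leaf values)."""
    return {"nmembers": 1, "depth": 0, "minl": load}


domain1 = aggregate([leaf(0.7), leaf(0.3)])
domain2 = aggregate([leaf(0.9), leaf(0.5), leaf(0.6)])
root = aggregate([domain1, domain2])
print(root)
```

Because every aggregate is computed only from the rows one level below, the same function applies unchanged at each level of the hierarchy.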
Aggregation • O(log n) info per host • [Figure: Domain1 and Domain2]
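A rough worked example of the O(log n) claim, assuming a balanced hierarchy with a fixed branching factor. The slides give no branching factor; the value 64 below is purely an assumption.

```python
def rows_per_host(n_hosts: int, branching: int) -> int:
    """Rows a host stores: one table of `branching` rows per hierarchy level."""
    depth = 0
    capacity = 1
    # Smallest depth such that branching ** depth >= n_hosts.
    while capacity < n_hosts:
        capacity *= branching
        depth += 1
    return depth * branching


for n in (4_096, 1_000_000):
    print(n, rows_per_host(n, branching=64))
```

Under this assumption the number of levels, and therefore the number of rows a host holds, grows with the logarithm of the number of hosts rather than with the number of hosts itself.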
Other Examples • Which are the three lowest loaded hosts? • Which domains contain hosts with an out-of-date virus database? • Do >30% of hosts measure elevated radiation? • Which domains contain subscribers interested in some topic? • Where is the nearest logging server?
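The first question above ("three lowest loaded hosts") composes across levels: each domain keeps only its own three best candidates, and a parent merges its children's short lists. A minimal sketch, assuming each candidate is a `(load, host)` tuple and using hypothetical host names:

```python
import heapq


def lowest3(candidates: list[tuple[float, str]]) -> list[tuple[float, str]]:
    """Keep the three (load, host) pairs with the smallest load."""
    return heapq.nsmallest(3, candidates)


# Leaf level: each domain summarizes its own hosts.
domain1 = lowest3([(0.9, "a"), (0.2, "b"), (0.5, "c"), (0.7, "d")])
domain2 = lowest3([(0.1, "e"), (0.8, "f")])

# Parent level: merge the children's summaries, never the raw host lists.
root = lowest3(domain1 + domain2)
print(root)
```

The parent never needs more than three rows from each child, which is what keeps the per-level state bounded.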
Epidemic or Gossip Protocols • Used to keep domain tables up-to-date • Randomized Communication between (nearby) hosts: • Fast (latency grows O(log n)) • Hard to stop (robust even in the face of Denial-of-Service attacks) • Probabilistically Real-Time guarantees on latency (based on epidemiological analysis).
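The logarithmic latency can be illustrated with a toy simulation. This sketch uses simple push gossip over a flat set of hosts, which is a simplification assumed here for illustration; it is not the Astrolabe protocol, which exchanges domain tables between hosts.

```python
import random


def rounds_to_spread(n_hosts: int, seed: int = 0) -> int:
    """Count gossip rounds until an update at host 0 reaches every host."""
    rng = random.Random(seed)
    informed = {0}
    rounds = 0
    while len(informed) < n_hosts:
        # Each host informed at the start of the round pushes to one random peer.
        for _ in range(len(informed)):
            informed.add(rng.randrange(n_hosts))
        rounds += 1
    return rounds


for n in (16, 256, 4096):
    print(n, rounds_to_spread(n))
```

The informed set can at most double per round, so at least log2(n) rounds are needed; in this model the simulated counts stay within a small multiple of that bound.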
How it works… [Figure labels: SQL, gossip]
SelectCast • Disseminate messages through Astrolabe hierarchy • (Application-level) routers selected through domain aggregation: SELECT FIRST(3, routers) AS routers, MIN(minload) AS minload ORDER BY minload • Exploit heterogeneity, don’t hide it!
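One possible reading of the router-selection query, sketched in Python. The interpretation of `FIRST(3, routers)` as "the first three router names after ordering child rows by `minload`" is an assumption, as are the row layout and host names.

```python
def select_routers(child_rows: list[dict], k: int = 3) -> dict:
    """Pick up to k routers for a domain from its children's rows.

    Assumed semantics: order child rows by minload, concatenate their
    router lists in that order, keep the first k, and report the
    smallest minload.
    """
    ordered = sorted(child_rows, key=lambda row: row["minload"])
    routers = [host for row in ordered for host in row["routers"]][:k]
    return {"routers": routers, "minload": ordered[0]["minload"]}


children = [
    {"routers": ["h1", "h2"], "minload": 0.4},
    {"routers": ["h3"], "minload": 0.1},
    {"routers": ["h4", "h5"], "minload": 0.6},
]
print(select_routers(children))
```

Choosing routers by load is one way to exploit heterogeneity: lightly loaded hosts are preferred as forwarders instead of treating all hosts alike.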
Filtering (Pub/Sub) • SQL condition on each message • For example: • MIN(version) < 3 • MAX(radiation) > 300 • OR(subject) // BLOOM FILTERS • TRUE • Generalization of topic based publishing
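The `OR(subject)` condition with Bloom filters can be sketched as follows. The filter size, the number of hash functions, and the use of SHA-256 are all assumptions made for illustration; the slides do not specify them.

```python
import hashlib

BITS = 64    # assumed filter width
HASHES = 3   # assumed number of hash functions


def topic_bits(topic: str) -> int:
    """Bit mask a subscriber sets in its `subject` attribute for one topic."""
    mask = 0
    for i in range(HASHES):
        digest = hashlib.sha256(f"{i}:{topic}".encode()).digest()
        mask |= 1 << (int.from_bytes(digest[:4], "big") % BITS)
    return mask


def aggregate_or(masks: list[int]) -> int:
    """OR(subject): a domain's filter is the bitwise OR of its children's."""
    result = 0
    for m in masks:
        result |= m
    return result


def may_have_subscriber(domain_mask: int, topic: str) -> bool:
    """Forward a message into a domain only if all topic bits are set."""
    bits = topic_bits(topic)
    return (domain_mask & bits) == bits


hosts = [topic_bits("weather"), topic_bits("sports")]
domain = aggregate_or(hosts)
print(may_have_subscriber(domain, "weather"))
```

A Bloom filter can report false positives but never false negatives, so a message may occasionally be forwarded into a domain with no interested subscriber, but it is never withheld from one that has a subscriber.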
Scalability • Latency, memory use, CPU load, and load on network links all grow O(log N) and are independent of update rate. • Highly robust to omission and crash failures. • Confirmed by analysis, simulation, and experiment. • O(1) lookup for most useful queries.
Real vs. Simulation [Figure panels: The real thing; Simulation]
Membership • Domain failure detected when its attributes are no longer being updated. • Domains discovered (and partitions repaired) through gossip, occasional broadcast and multicast, and configuration. • Special precautions for domains separated by firewalls and NAT boxes.
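Failure detection by stale attributes can be sketched as a timeout check over a domain table. The timeout value, the timestamps, and the domain names below are hypothetical; the slides do not say how the threshold is chosen.

```python
def live_domains(last_update: dict[str, float], now: float, timeout: float) -> set[str]:
    """Return domains whose attributes were refreshed within `timeout` seconds."""
    return {name for name, stamp in last_update.items() if now - stamp <= timeout}


# Hypothetical table: domain name -> time its MIB was last refreshed.
table = {"/usa/cornell": 100.0, "/usa/mit": 40.0, "/europe": 95.0}
print(sorted(live_domains(table, now=105.0, timeout=30.0)))
```

A domain that drops out of the live set is treated as failed; it can reappear later once gossip, broadcast, or configuration rediscovers it.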
Security • Integrated PKI: integrity, no confidentiality; prevents “Sybil” attacks • Remove outliers: summarize in a robust way • Compartmentalize: exploit domain hierarchy
Bimodal Multicast • Probabilistic end-to-end reliability • Uses IP Multicast or SelectCast for initial dissemination • Runs a background gossip protocol to repair message loss • Performance improves with scale: buffering load is shared
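The two phases can be illustrated with a toy simulation: a lossy initial multicast followed by gossip rounds that repair the gaps. This is a simplified sketch under stated assumptions (independent loss per host and message, unbounded buffers, and a loop that runs until every host has every message), not the actual protocol, which offers probabilistic rather than absolute delivery.

```python
import random


def bimodal(n_hosts: int, n_msgs: int, loss: float, seed: int = 0) -> int:
    """Return the number of gossip rounds needed to repair a lossy multicast."""
    rng = random.Random(seed)
    all_msgs = set(range(n_msgs))

    # Phase 1: unreliable dissemination; each host misses a message with prob. `loss`.
    received = [
        {m for m in all_msgs if rng.random() >= loss} for _ in range(n_hosts)
    ]
    received[0] = set(all_msgs)  # the sender holds every message

    # Phase 2: each round, every host gossips to one random peer,
    # and the peer obtains whatever messages it is missing.
    rounds = 0
    while any(r != all_msgs for r in received):
        for host in range(n_hosts):
            peer = rng.randrange(n_hosts)
            received[peer] |= received[host]
        rounds += 1
    return rounds


print(bimodal(n_hosts=100, n_msgs=20, loss=0.3))
```

Because any host holding a message can supply it to a peer, the repair work is spread across the group instead of falling on the original sender.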
Work in Progress • Evaluate scalability and performance: emulation, simulation, deployment • Improve support for low-power apps: self-configuration • Improve expressiveness: pattern matching