OceanStore Status and Directions
ROC/OceanStore Retreat 1/16/01
John Kubiatowicz
University of California at Berkeley
Questions about ubiquitous information: • Where is persistent information stored? • Want: Geographic independence for availability, durability, and freedom to adapt to circumstances • How is it protected? • Want: Encryption for privacy, signatures for authenticity, and Byzantine commitment for integrity • Can we make it indestructible? • Want: Redundancy with continuous repair and redistribution for long-term durability • Is it hard to manage? • Want: automatic optimization, diagnosis and repair
Everyone’s Data, One Utility • Millions of servers, billions of clients …. • 1000-YEAR durability (excepting fall of society) • Maintains Privacy, Access Control, Authenticity • Incrementally Scalable (“Evolvable”) • Self Maintaining! • Not quite peer-to-peer: • Utilizing servers in infrastructure • Some computational nodes more equal than others
Want Automatic Maintenance • Can’t possibly manage billions of servers by hand! • System should: • Be fault-tolerant (high MTTF) • Repair itself (low MTTR through adaptation) • Incorporate new elements • Can we guarantee data is available for 1000 years? • New servers added from time to time • Old servers removed from time to time • Everything just works • Many components with geographic separation • System not disabled by natural disasters • Can adapt to changes in demand and regional outages • Gain in stability through statistics
OceanStore Assumptions • Untrusted Infrastructure: • The OceanStore is composed of untrusted components • Only ciphertext within the infrastructure • Responsible Party: • Some organization (e.g., a service provider) guarantees that your data is consistent and durable • Not trusted with the content of data, merely its integrity • Mostly Well-Connected: • Data producers and consumers are connected to a high-bandwidth network most of the time • Exploit multicast for quicker consistency when possible • Promiscuous Caching: • Data may be cached anywhere, anytime
This Talk: making it real! (Or: you will hear reality from my students)
The Path of an OceanStore Update
[Diagram: clients, second-tier caches, multicast trees, and inner-ring servers along the path of an update]
Important Components: • Data Object: (Distribution-enabled data format) • Must support copy-on-write and versioning efficiently • Must allow sparse population of data in caches • Must smoothly interface with archive • Inner Ring: (Byzantine Agreement) • Check write access control • Serialize updates / resolve micro-conflicts • Sign result with Threshold Signature • Erasure code result and send fragments • Second Tier Server: (Promiscuous Caches) • Serve local clients • Tie itself into Dissemination tree • Apply updates that it receives through tree • Decision point for caching policies: tentative vs. committed
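Purely as illustration, here is a minimal Java sketch of the inner-ring steps listed above (access check, serialization, signing, erasure coding). All class and method names (InnerRingSketch, thresholdSign, erasureCode) are hypothetical, and the last two steps are stubs: the real inner ring runs Byzantine agreement, produces genuine threshold signatures, and uses a true erasure code rather than naive striping.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class Update {
    final String clientId;
    final byte[] payload;
    Update(String clientId, byte[] payload) { this.clientId = clientId; this.payload = payload; }
}

class InnerRingSketch {
    private long nextVersion = 1;                        // serialization point for one object
    private final List<String> writers = new ArrayList<>();

    InnerRingSketch(List<String> authorizedWriters) { writers.addAll(authorizedWriters); }

    /** Apply one update: ACL check, serialize, "sign", and fragment the result. */
    byte[][] applyUpdate(Update u) {
        if (!writers.contains(u.clientId))               // 1. check write access control
            throw new SecurityException("write denied for " + u.clientId);
        long version = nextVersion++;                    // 2. serialize: assign a total order
        byte[] signed = thresholdSign(u.payload, version);   // 3. threshold-sign result (stub)
        return erasureCode(signed, 4);                   // 4. erasure-code into fragments (stub)
    }

    // Stand-in for a Byzantine-agreed threshold signature over (payload, version).
    private byte[] thresholdSign(byte[] payload, long version) {
        byte[] out = Arrays.copyOf(payload, payload.length + 8);
        for (int i = 0; i < 8; i++) out[payload.length + i] = (byte) (version >>> (8 * i));
        return out;
    }

    // Stand-in for erasure coding: naive striping into n fragments, no real redundancy.
    private byte[][] erasureCode(byte[] data, int n) {
        int chunk = (data.length + n - 1) / n;
        byte[][] frags = new byte[n][];
        for (int i = 0; i < n; i++) {
            int from = Math.min(i * chunk, data.length);
            int to = Math.min(from + chunk, data.length);
            frags[i] = Arrays.copyOfRange(data, from, to);
        }
        return frags;
    }

    public static void main(String[] args) {
        InnerRingSketch ring = new InnerRingSketch(List.of("alice"));
        byte[][] frags = ring.applyUpdate(new Update("alice", "new version contents".getBytes()));
        System.out.println(frags.length + " fragments produced");
    }
}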
Implementation Framework
[Diagram: communicating stages (Consistency, Location & Routing, Archival) with introspection modules, a dispatch layer and thread scheduler, asynchronous network and disk, all atop the Java Virtual Machine and operating system]
• Event-driven implementation model in Java • Divided into a sequence of communicating “stages” • Communication between stages in the form of “snoopable” messages • > 100,000 lines of Java, comments, test scripts • Substantially functioning!
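A minimal sketch, with hypothetical names, of the event-driven "stage" model described above: each stage owns an event queue and a handler, and a dispatcher thread moves messages through it. It illustrates the style, not the actual OceanStore code base.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

interface EventHandler { void handle(Object event, StageSketch self); }

class StageSketch implements Runnable {
    private final String name;
    private final BlockingQueue<Object> queue = new LinkedBlockingQueue<>();
    private final EventHandler handler;

    StageSketch(String name, EventHandler handler) { this.name = name; this.handler = handler; }

    /** Other stages (or the network) enqueue "snoopable" messages here. */
    void enqueue(Object event) { queue.add(event); }

    @Override public String toString() { return "stage:" + name; }

    @Override public void run() {
        try {
            while (true) {
                Object event = queue.take();       // block until a message arrives
                handler.handle(event, this);       // process it inside this stage
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();    // allow clean shutdown
        }
    }

    public static void main(String[] args) throws InterruptedException {
        StageSketch consistency = new StageSketch("consistency",
            (event, self) -> System.out.println(self + " saw: " + event));
        Thread t = new Thread(consistency);
        t.setDaemon(true);                         // let the JVM exit after main finishes
        t.start();
        consistency.enqueue("update #1");          // messages flow between stages as events
        Thread.sleep(100);                         // give the stage time to drain its queue
    }
}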
GUIDs for Naming
[Diagram: naming hierarchy reached via an out-of-band “root link”; directory entries such as Myfile, Foo, Bar, and Baz, where each link is an AGUID]
• Unique, location-independent identifiers: • Every version of every unique entity has a permanent Version-GUID (VGUID): a hash over its content • Versioning supports time-travel • Each object has a permanent (version-independent) Archival-GUID (AGUID) • Signed associations between AGUIDs and latest VGUIDs are produced by the inner ring (called Heartbeats) • Naming hierarchy: • Users map from names to AGUIDs via a hierarchy of OceanStore objects
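A small sketch of the content-hash idea behind VGUIDs, using a hypothetical GuidSketch class; the choice of SHA-1 here is illustrative. The point is that identical content always yields the same location-independent, unforgeable name.

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class GuidSketch {
    /** Hash the content to obtain a version GUID; identical content yields identical GUIDs. */
    static String versionGuid(byte[] content) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-1").digest(content);
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) hex.append(String.format("%02x", b));
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError("SHA-1 unavailable", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(versionGuid("version 1 of Myfile".getBytes()));
    }
}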
Data Object Structure: all about flexibility and validation
[Diagram: a log object of update entries (d1–d9, d'8, d'9) spanning versions V7–V11; checkpoints (e.g., at V6 and V11) are B-trees of indirect blocks over data blocks; blocks and sets of log entries are the units of coding, erasure-coded into fragments (the unit of archival storage), each set covered by a verification tree, with GUIDs (e.g., GUID of d1, GUID of d'8) naming the blocks]
Status: Data Object Development • Second-tier replica support: functional • Second-tier caches can hold multiple versions • Tie themselves into multicast trees • Several dissemination-tree algorithms explored • Updates forwarded from inner ring through trees • Complete B-tree object structure developed • Data blocks named with unforgeable hashes • Hashes can point to archival fragments/live blocks • Supports copy-on-write • Top block defines complete version • Missing blocks filled in from archive or other replicas • Update commits with distributed threshold signatures • Byzantine commitment not quite integrated into prototype • Traffic generator for testing
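To make the hash-named-block and copy-on-write points concrete, here is a toy sketch (hypothetical names, in-memory store): each block's GUID is a hash of its contents, and a new version's top block simply lists the GUIDs of its children, so modifying one data block changes only the GUIDs along its path to the root and unmodified blocks are shared between versions.

import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class CopyOnWriteSketch {
    static final Map<String, byte[]> store = new HashMap<>();   // GUID -> block contents

    /** Store a block under the hash of its contents and return that GUID. */
    static String put(byte[] block) throws Exception {
        byte[] d = MessageDigest.getInstance("SHA-1").digest(block);
        StringBuilder hex = new StringBuilder();
        for (byte b : d) hex.append(String.format("%02x", b));
        String guid = hex.toString();
        store.put(guid, block);
        return guid;
    }

    /** Build a new top block (a new version) from a list of data-block GUIDs. */
    static String topBlock(List<String> childGuids) throws Exception {
        return put(String.join(",", childGuids).getBytes());
    }

    public static void main(String[] args) throws Exception {
        List<String> v1 = new ArrayList<>();
        v1.add(put("data block 1".getBytes()));
        v1.add(put("data block 2".getBytes()));
        String v1Top = topBlock(v1);

        // Copy-on-write: only block 2 changes; block 1's GUID (and storage) is shared.
        List<String> v2 = new ArrayList<>(v1);
        v2.set(1, put("data block 2, revised".getBytes()));
        String v2Top = topBlock(v2);

        System.out.println("v1 = " + v1Top + "\nv2 = " + v2Top);
    }
}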
The Dissemination Process
[Diagram: network monitoring probes, introspection, and human input feed a model builder; the resulting model and set type drive a set creator, whose dissemination sets direct the disseminators that place fragments]
Achieving Low MTTR: Global Heartbeats • Trigger repair when the level of redundancy falls too low • Continuous sweep (slowly over time)
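A hedged sketch of the repair trigger, with made-up threshold values and names: a slow sweep counts the fragments still reachable for each object and flags any object whose redundancy has dropped below a repair threshold.

import java.util.HashMap;
import java.util.Map;

class RepairSweepSketch {
    static final int REQUIRED_FRAGMENTS = 16;   // assumed target redundancy level
    static final int REPAIR_THRESHOLD  = 12;    // trigger repair below this count

    /** One pass of the continuous sweep: objectId -> number of fragments still reachable. */
    static void sweep(Map<String, Integer> survivingFragments) {
        for (Map.Entry<String, Integer> e : survivingFragments.entrySet()) {
            if (e.getValue() < REPAIR_THRESHOLD) {
                // A real system would reconstruct the object from surviving fragments,
                // re-encode it, and redistribute fresh fragments to new servers.
                System.out.println("repair " + e.getKey() + ": only " + e.getValue()
                        + " of " + REQUIRED_FRAGMENTS + " fragments remain");
            }
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> status = new HashMap<>();
        status.put("objA", 15);
        status.put("objB", 9);     // below threshold: would trigger repair
        sweep(status);
    }
}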
Status: Archival Infrastructure • Archival fragments generated by the inner ring • Multi-stage-based implementation at the inner ring • Storage servers hold fragments • Caching servers (second-tier replicas) hold data objects • Independence analysis (mostly there) • Node discovery technique exists • Analysis of long-running reliability data • Dissemination-set creator: initial versions • Storage servers (naïve but functional): • Initial implementation: cache + object store • Ongoing tuning efforts • Redesign in the works
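For illustration only, a toy erasure code showing the fragment idea: the data is split into n-1 equal pieces plus one XOR parity piece, so any single lost fragment can be rebuilt from the rest. The archival layer described above uses far stronger codes that tolerate many losses; this sketch only shows the encode/rebuild interface.

import java.util.Arrays;

class ParityCodeSketch {
    /** Encode data into n fragments: n-1 data pieces plus 1 XOR parity piece. */
    static byte[][] encode(byte[] data, int n) {
        int pieceLen = (data.length + (n - 2)) / (n - 1);        // ceil(len / (n-1))
        byte[] padded = Arrays.copyOf(data, pieceLen * (n - 1));
        byte[][] frags = new byte[n][pieceLen];
        for (int i = 0; i < n - 1; i++)
            System.arraycopy(padded, i * pieceLen, frags[i], 0, pieceLen);
        for (int i = 0; i < n - 1; i++)                          // parity = XOR of data pieces
            for (int j = 0; j < pieceLen; j++)
                frags[n - 1][j] ^= frags[i][j];
        return frags;
    }

    /** Rebuild a single missing fragment (index `lost`) from the surviving ones. */
    static byte[] rebuild(byte[][] frags, int lost) {
        byte[] out = new byte[frags[(lost + 1) % frags.length].length];
        for (int i = 0; i < frags.length; i++)
            if (i != lost)
                for (int j = 0; j < out.length; j++) out[j] ^= frags[i][j];
        return out;
    }

    public static void main(String[] args) {
        byte[][] frags = encode("archival object contents".getBytes(), 5);
        byte[] recovered = rebuild(frags, 2);                    // pretend fragment 2 was lost
        System.out.println(Arrays.equals(recovered, frags[2]));  // prints true
    }
}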
Location Independent Routing • Paradigm: Routing • Route messages to objects by GUID regardless of location • Fast, probabilistic search for “routing cache”: • Built from attenuated bloom filters • Approximation to gradient search • Redundant Plaxton Mesh used for underlying routing infrastructure: • Randomized data structure with locality properties • Redundant, insensitive to faults, and repairable • Amenable to continuous adaptation to adjust for: • Changing network behavior • Faulty servers • Denial of service attacks • Tomorrow: 3 talks on Routing
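A minimal sketch of an attenuated Bloom filter for one neighbor link, with illustrative parameters: level i summarizes GUIDs believed reachable within i+1 hops along that link, and a query returns the shallowest matching level, approximating the gradient search mentioned above (false positives are possible, false negatives are not).

import java.util.BitSet;

class AttenuatedBloomSketch {
    static final int BITS = 1024;
    final BitSet[] levels;                       // levels[i]: objects believed ~i+1 hops away

    AttenuatedBloomSketch(int depth) {
        levels = new BitSet[depth];
        for (int i = 0; i < depth; i++) levels[i] = new BitSet(BITS);
    }

    private static int[] hashes(String guid) {   // two simple hash positions per key
        int h = guid.hashCode();
        return new int[] { Math.floorMod(h, BITS), Math.floorMod(h * 31 + 17, BITS) };
    }

    void add(String guid, int level) {
        for (int pos : hashes(guid)) levels[level].set(pos);
    }

    /** Returns the shallowest level that may contain the GUID, or -1 if none match. */
    int query(String guid) {
        for (int i = 0; i < levels.length; i++) {
            boolean match = true;
            for (int pos : hashes(guid)) match &= levels[i].get(pos);
            if (match) return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        AttenuatedBloomSketch link = new AttenuatedBloomSketch(3);
        link.add("guid-of-Myfile", 1);           // object believed two hops down this link
        System.out.println(link.query("guid-of-Myfile"));   // prints 1
    }
}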
Status: Location Independent Routing • Basic Tapestry infrastructure is operational • Single-path static routing: works • Multi-path adaptive routing: mostly there • Dynamic Integration of new nodes: implemented • Network adaptation almost there (Patchwork) • Framework for Measurement of network properties • Periodic beacons measure loss and network latency • Exploitation of Differences in nodes: • Brocade backbone supplement to Tapestry: Improves routing • Differentiation in service experiments ongoing • Theoretical Results on Tapestry • Construction/Analysis of Dynamic Integration Algorithms • Voluntary/involuntary node deletion algorithms • View of Tapestry as data structure for solving nearest neighbor • Attenuated Bloom Filters are operational • Implemented and functional • Optimizes short-distance routing infrastructure!
Introspection: The New Architectural Creed
[Diagram: monitor → compute → adapt cycle]
• Using Moore’s law gains for something other than performance • Examples: • Online algorithmic validation • Model building for data rearrangement • Availability • Better prefetching • Extreme Durability (1000-year time scale?) • Use of erasure coding and continuous repair • Stability through Statistics • Use of redundancy to gain more predictable behavior • Systems version of Thermodynamics! • Continuous Dynamic Optimization of other sorts
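A toy sketch of the monitor/compute/adapt cycle, with hypothetical names: events are observed off the critical path into a simple model (here, per-object access counts), and an adaptation pass nominates hot objects for additional replication.

import java.util.HashMap;
import java.util.Map;

class IntrospectionSketch {
    private final Map<String, Integer> accessCounts = new HashMap<>();   // the "model"

    /** Monitor: record an observed access event. */
    void observe(String objectGuid) {
        accessCounts.merge(objectGuid, 1, Integer::sum);
    }

    /** Compute + adapt: periodically pick hot objects and trigger extra replication. */
    void adapt(int hotThreshold) {
        accessCounts.forEach((guid, count) -> {
            if (count >= hotThreshold)
                System.out.println("replicate " + guid + " closer to its readers");
        });
    }

    public static void main(String[] args) {
        IntrospectionSketch i = new IntrospectionSketch();
        for (int k = 0; k < 5; k++) i.observe("guid-A");
        i.observe("guid-B");
        i.adapt(3);     // only guid-A crosses the threshold
    }
}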
Status: Introspection • Development of OIL framework for introspection: this framework is operational • Collection facilities can observe all events in the system • Multiple aggregation models available • Example 1: Clustering for prefetching • Currently builds a hidden Markov model of access patterns utilizing the OIL framework • Almost there: • Use models to better prefetch objects • Placement of replicas assisted by Bloom filters (almost) • Example 2: Observation of network behavior • Framework for observation of network latencies • Adaptation of network topology: almost there • Example 3: Grammar building for prefetching • Experiment in introspection at the processor level • Talk later today about this (Mark Whitney)
Status: Medium Scale Test and Emulation • Two medium clusters from IBM SUR Grant • Each cluster has 21 servers: • Each with two 1 GHz processors • One GByte of RAM, 73 GB of disk • 1 Gb switch per cluster • MIRNET switch • Plan to have continuous OceanStore components running in approximately 1 month • Emulation technology: currently works • Able to emulate a large-scale network by simulating network latencies • Multiple OceanStore nodes emulated per physical node
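A sketch of the latency-emulation idea, assuming a hypothetical EmulatedLinkSketch class: deliveries between emulated nodes are simply scheduled after a configured per-link delay, so many OceanStore nodes on one machine can behave as if separated by a wide-area network.

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

class EmulatedLinkSketch {
    private static final ScheduledExecutorService timer =
            Executors.newSingleThreadScheduledExecutor();

    /** Deliver a message to the destination handler after an emulated latency. */
    static void send(Object message, long latencyMillis, Consumer<Object> destination) {
        timer.schedule(() -> destination.accept(message), latencyMillis, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        long start = System.currentTimeMillis();
        send("update #1", 80,                       // emulate an 80 ms wide-area link
             msg -> System.out.println(msg + " arrived after "
                     + (System.currentTimeMillis() - start) + " ms"));
        Thread.sleep(200);
        timer.shutdown();
    }
}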
Day Dreams? (Becoming real) • NFS File system built in OceanStore (Exists) • Still have to integrate ACLs • Update to latest prototype • Windows Installable File system (Planning) • “USB Keys” hold cryptographic keys and personal identity • Automatic downloading and verification of filesystem • IMAP OceanStore gateway (Planning) • Lotus Notes Domino Server • Exploring use of work flow on top of OceanStore
OceanStore Conclusions • OceanStore: everyone’s data, one big utility • Global Utility model for persistent data storage • Very Soon: Working OceanStore cluster!!!! • Event-driven programming in Java • You will hear about components today and tomorrow • OceanStore assumptions: • Untrusted infrastructure with a responsible party • Mostly connected with conflict resolution • Continuous on-line optimization
For more info: • OceanStore vision paper for ASPLOS 2000 “OceanStore: An Architecture for Global-Scale Persistent Storage” • OceanStore paper on Maintenance (IEEE IC): “Maintenance-Free Global Data Storage” • Both available on OceanStore web site: http://oceanstore.cs.berkeley.edu/