OceanStore Global-Scale Persistent Storage

John Kubiatowicz, University of California at Berkeley

Presentation Transcript


  1. OceanStore: Global-Scale Persistent Storage • John Kubiatowicz, University of California at Berkeley

  2. Context: Project Endeavour, an Interdisciplinary, Technology-Centered Team • Alex Aiken, PL • Eric Brewer, OS • John Canny, AI • David Culler, OS/Arch • Joseph Hellerstein, DB • Michael Jordan, Learning • Anthony Joseph, OS • Randy Katz, Nets • John Kubiatowicz, Arch • James Landay, UI • Jitendra Malik, Vision • George Necula, PL • Christos Papadimitriou, Theory • David Patterson, Arch • Kris Pister, MEMS • Larry Rowe, MM • Alberto Sangiovanni-Vincentelli, CAD • Doug Tygar, Security • Robert Wilensky, DL/AI

  3. Endeavour Goals: • Enhancing human understanding • Help people to interact with information, devices, and people - exploit Moore’s law growth in everything • Enable new approaches for problem solving & learning • Figure of merit: how effectively we amplify and leverage human intellect • Enabling and exploiting ubiquitous computing • Small devices, sensors, smart materials, cars, etc • New methods for design, construction, and administration of ultra-scale systems • “Planetary-scale” Information Utilities • Infrastructure is transparent and always active • Extensive use of redundancy of hardware and data • Devices that negotiate their interfaces automatically • Elements that tune, repair, and maintain themselves

  4. Endeavour Maxims • Exploit Moore’s law growth for better behavior • Use of “excess” capacity for better human interface • Personal Information Mgmt is the Killer App • Not corporate processing but management, analysis, aggregation, dissemination, filtering for the individual • Automated extraction and organization of daily activities to assist people • Time to move beyond the Desktop • Community computing: infer relationships among information, delegate control, establish authority • Information Technology as a Utility • Continuous service delivery, on a planetary-scale, on top of a highly dynamic information base

  5. Endeavour Approach • “Fluid”, Network-Centric System Software: partitioning and management of state between soft and persistent state; data processing placement and movement; component discovery and negotiation; flexible capture, self-organization, and re-use of information • Information Devices: beyond desktop computers to MEMS sensors/actuators with capture/display to yield enhanced activity spaces • Information Utility • Information Applications: High Speed/Collaborative Decision Making and Learning; Augmented “Smart” Spaces: Rooms and Vehicles • Design Methodology: User-centric Design with HW/SW Co-design; formal methods for safe and trustworthy decomposable and reusable components

  6. OceanStore Context: Ubiquitous Computing • Computing everywhere: • Desktop, Laptop, Palmtop • Cars, Cellphones • Shoes? Clothing? Walls? • Connectivity everywhere: • Rapid growth of bandwidth in the interior of the net • Broadband to the home and office • Wireless technologies such as CDMA, satellite, laser • Rise of the thin-client metaphor: • Services provided by interior of network • Incredibly thin clients on the leaves • MEMS devices -- sensors+CPU+wireless net in 1 mm³ • Mobile society: people move and devices are disposable

  7. Questions about information: • Where is persistent information stored? • 20th-century tie between location and content outdated (we all survived the Feb 29th bug -- let’s move on!) • In world-scale system, locality is key • How is it protected? • Can disgruntled employee of ISP sell your secrets? • Can’t trust anyone (how paranoid are you?) • Can we make it indestructible? • Want our data to survive “the big one”! • Highly resistant to hackers (denial of service) • Wide-scale disaster recovery • Is it hard to manage? • Worst failures are human-related • Want automatic (introspective) diagnosis and repair

  8. First Observation: Want Utility Infrastructure • Mark Weiser from Xerox: Transparent computing is the ultimate goal • Computers should disappear into the background • In storage context: • Don’t want to worry about backup • Don’t want to worry about obsolescence • Need lots of resources to make data secure and highly available, BUT don’t want to own them • Outsourcing of storage already becoming popular • Pay monthly fee and your “data is out there” • Simple payment interface: one bill from one company

  9. Second Observation: Need wide-scale deployment • Many components with geographic separation • System not disabled by natural disasters • Can adapt to changes in demand and regional outages • Gain in stability through statistics • Difference between thermodynamics and mechanics: surprising stability of temperature and pressure given 10^30 molecules with highly variable behavior! • Wide-scale use and sharing also requires wide-scale deployment • Bandwidth increasing rapidly, but latency bounded by speed of light • Handling many people with same system leads to economies of scale

  10. OceanStore: Everyone’s data, One big Utility “The data is just out there” • Separate information from location • Locality is only an optimization (an important one!) • Wide-scale coding and replication for durability • All information is globally identified • Unique identifiers are hashes over names & keys • Single uniform lookup interface replaces: DNS, server location, data location • No centralized namespace required (such as SDSI)

  11. Basic Structure: Irregular Mesh of “Pools”

  12. Amusing back of the envelope calculation (courtesy Bill Bolosky, Microsoft) • How many files in the OceanStore? • Assume 10^10 people in world • Say 10,000 files/person (very conservative?) • So 10^14 files in OceanStore! • If 1 gig files (not likely), get about 10^23 bytes, on the order of 1 mole of bytes! Truly impressive number of elements… … but small relative to physical constants
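The arithmetic on this slide can be checked in a few lines. This is only a sketch using the slide's own guesses as inputs; reading the "mole" as a count of bytes rather than files, and the variable names, are assumptions made here for illustration.

```python
# Back-of-the-envelope estimate; every input is the slide's own guess.
PEOPLE = 10**10
FILES_PER_PERSON = 10**4
BYTES_PER_FILE = 10**9      # "1 gig" per file, which the slide itself calls unlikely
AVOGADRO = 6.022e23         # items per mole

total_files = PEOPLE * FILES_PER_PERSON
total_bytes = total_files * BYTES_PER_FILE
moles_of_bytes = total_bytes / AVOGADRO

print(f"files: {total_files:.0e}")
print(f"bytes: {total_bytes:.0e}")
print(f"moles of bytes: {moles_of_bytes:.2f}")
```

Under these inputs the byte count lands within an order of magnitude of Avogadro's number, which is the point of the joke.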

  13. Utility-based Infrastructure • Service provided by confederation of companies • Monthly fee paid to one service provider • Companies buy and sell capacity from each other [Figure labels: Canadian OceanStore, Sprint, AT&T, IBM, Pac Bell]

  14. Outline • Motivation • Properties of the OceanStore and Assumptions • Specific Technologies and approaches: • Conflict resolution on encrypted data • Replication and Deep archival storage • Naming and Data Location • Introspective computing for optimization and repair • Economic models • Conclusion

  15. Ubiquitous Devices → Ubiquitous Storage • Consumers of data move, change from one device to another, work in cafes, cars, airplanes, the office, etc. • Properties REQUIRED for OceanStore storage substrate: • Strong Security: data encrypted in the infrastructure; resistance to monitoring and denial of service attacks • Coherence: too much data for naïve users to keep coherent “by hand” • Automatic replica management and optimization: huge quantities of data cannot be managed manually • Simple and automatic recovery from disasters: probability of failure increases with size of system • Utility model: world-scale system requires cooperation across administrative boundaries

  16. State of the Art? • Widely deployed systems: NFS, AFS (/DFS) • Single “regions” of failure, caching only at endpoints • Cleartext exposed at various levels of system • Compromised server → all data on server compromised • Mobile computing community: Coda, Ficus, Bayou • Small scale, fixed coherence mechanism • Not optimized to take advantage of high-bandwidth connections between server components • Cleartext also exposed at various levels of system • Web caching community: Inktomi, Akamai • Specialized, incremental solutions • Caching along client/server path, various bottlenecks • Database Community: • Interfaces not usable by legacy applications • ACID update semantics not always appropriate

  17. OceanStore Assumptions • Untrusted Infrastructure: • The OceanStore is composed of untrusted components • Only ciphertext within the infrastructure • Information must not be “leaked” over time • Principal Party: • There is one organization that is financially responsible for the integrity of your data • Mostly Well-Connected: • Data producers and consumers are connected to a high-bandwidth network most of the time • Exploit multicast for quicker consistency when possible • Promiscuous Caching: • Data may be cached anywhere, anytime • Operations Interface with Conflict Resolution: • Applications employ an operations-oriented interface, rather than a file-systems interface • Coherence is centered around conflict resolution

  18. OceanStore Technologies I: Naming and Data Location • Requirements: • System-level names should help to authenticate data • Route to nearby data without global communication • Don’t inhibit rapid relocation of data • OceanStore approach: Two-level search with embedded routing • Underlying namespace is flat and built from secure cryptographic hashes (160-bit SHA-1) • Search process combines quick, probabilistic search with slower guaranteed search • Long-distance data location and routing are integrated • Every source/destination pair has multiple routing paths • Continuous, on-line optimization adapts for hot spots, denial of service, and inefficiencies in routing
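A flat namespace built from 160-bit SHA-1 hashes might look like the sketch below. The slides only say that identifiers are hashes over names and keys; the exact byte layout, the function names `object_guid`, `block_guid`, and `verify_block`, and the idea of naming immutable blocks by their content hash are all assumptions made for illustration.

```python
import hashlib

def object_guid(owner_public_key: bytes, name: str) -> bytes:
    """Hypothetical object GUID: SHA-1 over the owner's key and a readable name."""
    h = hashlib.sha1()
    # Length-prefix the key so different (key, name) splits cannot hash alike.
    h.update(len(owner_public_key).to_bytes(4, "big"))
    h.update(owner_public_key)
    h.update(name.encode("utf-8"))
    return h.digest()  # 20 bytes = 160 bits

def block_guid(content: bytes) -> bytes:
    """Hypothetical GUID for an immutable block: the SHA-1 of its content."""
    return hashlib.sha1(content).digest()

def verify_block(guid: bytes, content: bytes) -> bool:
    """A reader can check data from an untrusted server against the name it asked for."""
    return block_guid(content) == guid

guid = object_guid(b"alice-public-key-bytes", "notes/todo.txt")
print(guid.hex())
```

Because the name is derived from the key or the content, a name of this kind helps authenticate whatever an untrusted node returns for it, which is the first requirement listed on the slide.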

  19. Universal Location Facility • Takes 160-bit unique identifier (GUID) and returns the nearest object that matches [Figure labels: Universal Name; Name OID; Active Data; Floating Replica; Root Structure; Commit Logs; Update OID; Checkpoint OID; Version OID; Archive versions: Version OID1, Version OID2, Version OID3; Archival copy or snapshot (erasure coded); Global Object Resolution]

  20. Some current results: • Have a working algorithm for local search • Uses attenuated Bloom filters • Performs search by passing messages from node to node. All state kept in messages! • Updates filters through semi-chaotic passing of information between neighbors • Resembles compiler dataflow algorithm • Can be shown to converge • Have candidate for “backing store” index • Randomized data structure with locality properties • Every document has multiple roots in the OceanStore • Searches “close” to copy tend to find copy quickly • Redundant, insensitive to faults, and repairable • Investigating algorithms to continually adapt routing structure to adjust for faults and denial of service
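The local search idea can be sketched as follows, assuming the usual reading of an attenuated Bloom filter: each link carries an array of filters, where level i summarizes objects about i hops beyond that neighbor. The filter size `M`, hash count `K`, the fixed-point loop in `build_filters`, and the `next_hop` rule are illustrative assumptions, not the project's actual algorithm; in particular the slide keeps search state in messages, which this sketch does not model.

```python
import hashlib

M = 256  # bits per filter (assumed)
K = 3    # hash positions per GUID (assumed)

def bit_positions(guid: bytes):
    digest = hashlib.sha1(guid).digest()
    return [int.from_bytes(digest[2 * i:2 * i + 2], "big") % M for i in range(K)]

def bloom_of(guids) -> int:
    """Pack a Bloom filter into a Python int used as a bit vector."""
    bits = 0
    for g in guids:
        for pos in bit_positions(g):
            bits |= 1 << pos
    return bits

def may_contain(bits: int, guid: bytes) -> bool:
    return all((bits >> pos) & 1 for pos in bit_positions(guid))

def build_filters(graph, stored, depth):
    """graph: node -> neighbors; stored: node -> set of GUIDs held locally."""
    local = {n: bloom_of(stored[n]) for n in graph}
    # filters[n][nb][i]: objects about i hops beyond neighbor nb, as seen from n
    filters = {n: {nb: [0] * depth for nb in graph[n]} for n in graph}
    changed = True
    while changed:  # iterate to a fixed point, much like a dataflow analysis
        changed = False
        for n in graph:
            for nb in graph[n]:
                new = [local[nb]] + [0] * (depth - 1)
                for level in range(1, depth):
                    for nn in graph[nb]:
                        if nn != n:
                            new[level] |= filters[nb][nn][level - 1]
                if new != filters[n][nb]:
                    filters[n][nb] = new
                    changed = True
    return filters

def next_hop(filters_at_node, guid):
    """Pick the neighbor whose filter matches at the shallowest level, if any."""
    best = None
    for nb, levels in filters_at_node.items():
        for level, bits in enumerate(levels):
            if may_contain(bits, guid):
                if best is None or level < best[0]:
                    best = (level, nb)
                break
    return best[1] if best else None

graph = {"a": ["b", "e"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"], "e": ["a"]}
guid = hashlib.sha1(b"some object").digest()
stored = {n: set() for n in graph}
stored["d"].add(guid)
filters = build_filters(graph, stored, depth=3)
print(next_hop(filters["a"], guid))
```

Bits are only ever set, never cleared, so the update loop must terminate. A miss at every level (for example when the object lies farther away than `depth` hops) is where the slower guaranteed search would take over.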

  21. OceanStore Technologies II: Rapid Update in an Untrusted Infrastructure • Requirements: • Scalable coherence mechanism which can operate directly on encrypted data without revealing information • Handle Byzantine failures • Rapid dissemination of committed information • OceanStore Approach: • Operations-based interface using conflict resolution • Modeled after Xerox Bayou → update packets include predicate/update pairs which operate on encrypted data • Use of oblivious function techniques to perform this update • Use of incremental cryptographic techniques • User signs updates and principal party signs commits • Committed data multicast to clients
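The shape of a Bayou-style update packet can be illustrated with a small sketch. It works on plaintext only, which is a deliberate simplification: the encrypted-data and oblivious-function parts of the slide are not modeled. The `Update` class, `apply_update`, and the room-booking example are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

State = Dict[str, str]
Predicate = Callable[[State], bool]
Action = Callable[[State], State]

@dataclass
class Update:
    client: str
    pairs: List[Tuple[Predicate, Action]]  # tried in order

def apply_update(state: State, update: Update) -> Tuple[State, bool]:
    """Apply the first action whose predicate holds; otherwise the update aborts."""
    for predicate, action in update.pairs:
        if predicate(state):
            return action(dict(state)), True
    return state, False

def book(slot: str, who: str) -> Tuple[Predicate, Action]:
    # Predicate: the slot is still free. Action: record the booking.
    return (lambda s: slot not in s, lambda s: {**s, slot: who})

state: State = {}
state, ok1 = apply_update(state, Update("alice", [book("10am", "alice")]))
state, ok2 = apply_update(state, Update("bob", [book("10am", "bob"), book("11am", "bob")]))
state, ok3 = apply_update(state, Update("carol", [book("10am", "carol")]))
print(state, ok1, ok2, ok3)
```

Bob's update carries its own fallback, so the conflict with Alice resolves without asking Bob again, while Carol's update has no alternative and aborts.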

  22. Tentative Updates: Epidemic Dissemination

  23. Committed Updates: Multicast Dissemination

  24. Our State of the Art • Have techniques for protecting metadata • Uses encryption and signatures to provide protection against substitution attacks • Provides “secure pointer” technology • Have a working scheme that can do some forms of conflict resolution directly on encrypted data • Uses new technique for searching on encrypted data • Can be generalized to perform optimistic concurrency, but at cost in performance and possibly privacy • Byzantine assumptions for update commitment: • Signatures on update requests from clients • Compromised servers are unable to produce valid updates • Uncompromised second-tier servers can make consistent ordering decision with respect to tentative commits • Use of threshold cryptography in inner-tier of servers • Signatures on update stream from inner-tier • Use of chained MACs to reduce overhead

  25. OceanStore Technologies III: High-Availability and Disaster Recovery • Requirements: • Handle diverse, unstable participants in OceanStore • Mitigate denial of service attacks • Eliminate backup as independent (and fallible) technology • Flexible “disaster recovery” for everyone • OceanStore Approach: • Use of erasure-codes to provide stable storage for archival copies and snapshots of live data • Mobile replicas are self-contained centers for logging and conflict resolution • Version-based update for painless recovery • Continuous introspection repairs data structures and degree of redundancy

  26. Floating Replicas and “Deep Archival Coding” • Floating Replicas are per-object virtual servers • Complete copy of data • logging for updates/conflict resolution • Interaction with other centers to keep data consistent • May appear and disappear like bubbles • Erasure coded fragments provide very stable store • Multi-level codes spread over 1000s of nodes • Could lose 1/2 of nodes and still recover data • Archive: old versions of data and checkpoints • Inactive data may only be in erasure-coded form
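The claim that half the nodes can be lost without losing data follows from any rate-1/2 erasure code. The sketch below uses a toy Reed-Solomon-style code over the prime field GF(257), chosen here only because it fits in a few lines; the slides do not say which codes OceanStore uses, and `encode`, `decode`, and `interpolate_at` are illustrative names. It needs Python 3.8 or later for the modular inverse via `pow`.

```python
P = 257  # smallest prime above 255, so every byte value is a field element

def interpolate_at(points, x):
    """Evaluate at x the unique polynomial (mod P) through the given (xi, yi) points."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num = den = 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, -1, P)) % P
    return total

def encode(data: bytes, n: int):
    """Turn k data bytes into n fragments (x, y); the first k carry the data itself."""
    k = len(data)
    assert k <= n <= P
    base = list(enumerate(data))
    # Fragment values live in 0..256, so they are kept as ints rather than bytes.
    return [(x, interpolate_at(base, x)) for x in range(n)]

def decode(fragments, k: int) -> bytes:
    """Rebuild the k data bytes from any k surviving fragments."""
    assert len(fragments) >= k
    chosen = fragments[:k]
    return bytes(interpolate_at(chosen, x) for x in range(k))

data = b"OceanStore"
fragments = encode(data, n=2 * len(data))   # rate 1/2: 10 data + 10 redundant
survivors = fragments[len(data):]           # lose every original fragment
recovered = decode(survivors, len(data))
print(recovered)
```

Any `k` of the `n` fragments determine the same polynomial, so it does not matter which half of the nodes disappears.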

  27. Floating Replica and Deep Archival Coding [Figure: three floating replicas, each holding a full copy, a version list (Ver1: 0x34243, Ver2: 0x49873, Ver3: …), and conflict resolution logs, backed by erasure-coded fragments]

  28. Structure of Archival Checkpoints • All blocks and fragments signed • “Copy on Write” behavior • Older metablocks fragmented also • NOTE: Each block needs a GUID [Figure labels: Checkpoint Reference (GUID); Checkpoint Reference (Later Version); Metadata; Redo Logs; Blocks; Unit of Archival Storage; Fragments; Unit of Coding]

  29. Proactive Self-Maintenance • Continuous testing and repair of information • Slow sweep through all information to make sure there are sufficient erasure-coded fragments • Continuous reevaluation of risk and redistribution of data • Slow sweep and repair of metadata/search trees • Continuous online self-testing of HW and SW • Detects flaky, failing, or buggy components via: • fault injection: triggering hardware and software error handling paths to verify their integrity/existence • stress testing: pushing HW/SW components past normal operating parameters • scrubbing: periodic restoration of potentially “decaying” hardware or software state • Automates preventive maintenance

  30. OceanStore Technologies IV: Introspective Optimization • Requirements: • Reasonable job on global-scale optimization problem • Take advantage of locality whenever possible • Sensitivity to limited storage and bandwidth at endpoints • Repair of data structures, increasing of redundancy • Stability in chaotic environment → Active Feedback • OceanStore Approach: • Introspective monitoring and analysis of relationships to cluster information by relatedness • Time-series analysis of user and data motion • Rearrangement and replication in response to monitoring • Clustered prefetching: fetch related objects • Proactive prefetching: get data there before needed • Rearrangement in response to overload and attack

  31. Example: Client Introspection • Client observer and optimizer components • greedy agents working on the behalf of the client • Watches client activity/combines with historical info • Performs clustering and time-series analysis • forwards results to infrastructure (privacy issues!) • Monitoring of state of network to adapt behavior • Typical Actions: • cluster related files together • prefetch files that will be needed soon • Create/destroy floating replicas
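A client-side observer of this kind could start from something as simple as co-access counting, sketched below. This stands in for the clustering and time-series analysis the slide mentions and is not the project's method; `co_access_counts`, `prefetch_candidates`, the `window` parameter, and the file names are hypothetical.

```python
from collections import Counter, defaultdict

def co_access_counts(trace, window=3):
    """Count how often two objects appear within `window` accesses of each other."""
    related = defaultdict(Counter)
    for i, obj in enumerate(trace):
        for other in trace[max(0, i - window):i]:
            if other != obj:
                related[obj][other] += 1
                related[other][obj] += 1
    return related

def prefetch_candidates(related, obj, limit=2):
    """Objects most often seen near `obj`: candidates to cluster with it or prefetch."""
    return [name for name, _ in related[obj].most_common(limit)]

trace = ["a.tex", "a.bib", "a.tex", "fig.png", "mail", "a.tex", "a.bib"]
related = co_access_counts(trace, window=1)
print(prefetch_candidates(related, "a.tex", limit=1))
```

Only summaries such as these counts would need to be forwarded to the infrastructure, which is one way to read the slide's remark about privacy.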

  32. OceanStore Technologies V: The oceanic data market • Properties: • Utility providers have resources (storage and bandwidth) • Clients use resources both directly and indirectly • Use of data storage and bandwidth on demand • Data movement “on behalf” of users • Some customers are more important than others • Techniques that we are exploring (very preliminary) • Data market driven by principal party • Tradeoff between performance (replication) and cost • Secure signatures on data packets permit: • Accounting of bandwidth and CPU utilization • Access control policies (Bays in OceanStore nomenclature) • Use of challenge-response protocols (similar to zero-knowledge proofs) to demonstrate possession of data
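One simple way to demonstrate possession of data is a nonce-based challenge, sketched below. It is not a zero-knowledge proof and is not claimed to be the protocol the project explored; `respond`, `make_challenges`, `audit`, and the use of SHA-1 over nonce plus data are assumptions made for illustration.

```python
import hashlib
import hmac
import secrets

def respond(nonce: bytes, data: bytes) -> bytes:
    """What a provider that really holds the data can compute for a fresh nonce."""
    return hashlib.sha1(nonce + data).digest()

def make_challenges(data: bytes, count: int):
    """Before handing data over, the owner precomputes single-use (nonce, answer) pairs."""
    challenges = []
    for _ in range(count):
        nonce = secrets.token_bytes(16)
        challenges.append((nonce, respond(nonce, data)))
    return challenges

def audit(challenge, provider_answer) -> bool:
    """Send the nonce to the provider and compare its answer in constant time."""
    nonce, expected = challenge
    return hmac.compare_digest(expected, provider_answer(nonce))

data = b"archived fragment contents"
challenges = make_challenges(data, count=2)
honest = lambda nonce: respond(nonce, data)
cheater = lambda nonce: respond(nonce, data[:-1])  # silently dropped a byte
print(audit(challenges[0], honest), audit(challenges[1], cheater))
```

The owner keeps only the small challenge list, not the data, and each nonce should be used once so that a provider cannot replay an old answer.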

  33. Two-Phase Implementation: • This term: Read-Mostly Prototype • Construction of data location facility • Initial introspective gathering of tacit info and adaptation • Initial archival techniques (use of erasure codes) • Unix file-system interface under Linux (“legacy apps”) • Later?: Full Prototype • Final conflict resolution and encryption techniques • More sophisticated tacit info gathering and rearrangement • Final object interface and integration with Endeavour applications • Wide-scale deployment via NTON and Internet-2

  34. OceanStore Conclusion • The time is now for a Universal Data Utility • Ubiquitous computing and connectivity are (almost) here! • Confederation of utility providers is right model • OceanStore holds all data, everywhere • Local storage is a cache on global storage • Provides security in an untrusted infrastructure • Large scale system has good statistical properties • Use of introspection for performance and stability • Quality of individual servers enhances reliability • Exploits economies of scale to: • Provide high-availability and extreme survivability • Lower maintenance cost: • Self-diagnosis and repair • Insensitivity to technology changes: just unplug one set of servers, plug in others
