
Neptune: Scalable Replication Management and Programming Support for Cluster-based Network Services

Neptune is a scalable clustering architecture that provides replication management and programming support for large-scale network services with persistent data. It offers a flexible programming model and replica consistency support to address availability and performance tradeoffs.



Presentation Transcript


  1. Neptune: Scalable Replication Management and Programming Support for Cluster-based Network Services
     Kai Shen, Tao Yang, Lingkun Chu, JoAnne L. Holliday, Douglas K. Kuschner, and Huican Zhu
     Department of Computer Science, University of California, Santa Barbara
     http://www.cs.ucsb.edu/research/Neptune

  2. Motivations
     • Availability, incremental scalability, and manageability are key requirements for building large-scale network services.
     • These requirements are challenging to meet for services with frequent persistent data updates.
     • Existing solutions for managing persistent data:
        • Pure data partitioning: no availability guarantee; bad at dealing with runtime hot-spots.
        • Disk-sharing: inherently unscalable; single point of failure.
        • Replication provided by database vendors: tied to specific database systems; inflexible in consistency.
     USITS 2001, San Francisco

  3. Neptune Project Goal
     • Design a scalable clustering architecture for aggregating and replicating network services with persistent data.
     • Provide a simple and flexible programming model to shield the complexity of data replication, service discovery, load balancing, and failover management.
     • Provide flexible replica consistency support to address availability and performance tradeoffs for different services.

  4. Related Work
     • TACC, MultiSpace: infrastructure support for cluster-based network services.
     • DDS: distributed persistent data structure for network services.
     • Porcupine: cluster-based email service (with commutative updates).
     • Bayou: weak consistency for wide-area applications.
     • BEA Tuxedo: platform middleware supporting transactional RPC.

  5. Outline
     • Motivations & Related Work
     • System Architecture and Assumptions
     • Replica Consistency and Failure Recovery
     • System Implementation and Service Deployments
     • Experimental Studies

  6. Partitionable Network Services
     Characteristics of network services:
     • Information independence. Service data can be divided into independent categories (e.g. discussion groups).
     • User independence. Data accessed by different users tend to be independent (e.g. email service).
     Neptune targets partitionable network services:
     • Service data can be divided into independent partitions.
     • Each service access can be delivered independently on a single partition; or
     • Each service access can be aggregated from sub-services, each of which can be delivered independently on a single partition.

  7. Conceptual Architecture for a Neptune Service Cluster

  8. Neptune Components
     Neptune components on the client and server side:
     • Neptune Server Module: starts, regulates, and terminates registered service instances, and maintains replica data consistency.
     • Neptune Client Module: provides location-transparent access for application service clients.

  9. Programming Interfaces
     Request/Response communications:
     • Client-side API (called by service clients):
        NeptuneCall (CltHandle, Service, Partition, SvcMethod, Request, Response);
     • Service Interface (abstract interface that application services implement):
        SvcMethod (SvcHandle, Partition, Request, Response);
     Stream-based communications:
     • Neptune sets up a bi-directional stream between the service client and the service instance.
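The request/response interface above can be pictured with a small in-process sketch. This is only an illustration in Python: `ToyNeptuneClient`, `register`, and the `add_message` service are hypothetical names, the client object itself plays the role of CltHandle, and no networking or replica selection is modelled. Only the argument order of NeptuneCall and SvcMethod follows the slide.

```python
from typing import Callable, Dict, Tuple

# Hypothetical signature for a service method, mirroring
# SvcMethod(SvcHandle, Partition, Request, Response) from the slide.
ServiceMethod = Callable[[object, int, dict, dict], int]


class ToyNeptuneClient:
    """In-process stand-in for the Neptune client module (assumption)."""

    def __init__(self) -> None:
        # (service name, method name) -> registered service method
        self._methods: Dict[Tuple[str, str], ServiceMethod] = {}

    def register(self, service: str, name: str, method: ServiceMethod) -> None:
        self._methods[(service, name)] = method

    def neptune_call(self, service: str, partition: int, svc_method: str,
                     request: dict, response: dict) -> int:
        # A real client module would locate a replica of the partition;
        # this sketch simply dispatches to the registered function.
        method = self._methods[(service, svc_method)]
        svc_handle = None  # placeholder for a real service handle
        return method(svc_handle, partition, request, response)


# Toy service state: one message list per partition.
messages: Dict[int, list] = {}


def add_message(svc_handle: object, partition: int,
                request: dict, response: dict) -> int:
    messages.setdefault(partition, []).append(request["text"])
    response["count"] = len(messages[partition])
    return 0  # 0 means success in this sketch


client = ToyNeptuneClient()
client.register("discussion", "AddMessage", add_message)
reply: dict = {}
status = client.neptune_call("discussion", 3, "AddMessage",
                             {"text": "hello"}, reply)
```

The service author writes only `add_message`; the choice of which replica serves partition 3 would be hidden behind the client call.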

  10. Assumptions
     • All system modules follow the fail-stop failure model.
     • Network partitions do not occur inside the service cluster. Neptune does allow persistent data to survive all-node failures.
     • Atomic execution is supported if each underlying service module ensures atomicity in a stand-alone configuration.

  11. Neptune Replica Consistency Model
     A service access is called a write if it changes the state of persistent data; otherwise it is called a read.
     • Level 1: Write-anywhere replication for commutative writes. Writes are accepted at any replica and propagated to peers. E.g. message board (append-only).
     • Level 2: Primary-secondary replication for ordered writes. Writes are only accepted at the primary node, then ordered and propagated to secondaries.
     • Level 3: Primary-secondary replication with staleness control. Soft time-based staleness bound and progressive version delivery. This is not strong consistency, because writes complete independently at each replica.
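A minimal sketch of why level 1 can accept writes anywhere while level 2 needs a primary to fix an order. The function names and the sequence-number scheme are assumptions made for illustration, not Neptune's implementation.

```python
# Level 1 sketch: appends to a message board commute, so two replicas
# that receive the same writes in different orders hold the same set.
def apply_appends(writes):
    board = set()
    for message in writes:
        board.add(message)
    return board


appends = ["msg-a", "msg-b", "msg-c"]
replica_a = apply_appends(appends)
replica_b = apply_appends(list(reversed(appends)))


# Level 2 sketch: overwrites do not commute. Here the primary stamps each
# write with a sequence number (an assumption of this sketch) and every
# secondary applies writes in sequence order, whatever the arrival order.
def apply_ordered(stamped_writes):
    state = {}
    for _seq, key, value in sorted(stamped_writes):
        state[key] = value
    return state


stamped = [(1, "price", 10), (2, "price", 12)]
secondary_1 = apply_ordered(stamped)
secondary_2 = apply_ordered(list(reversed(stamped)))  # arrived out of order
```

Without the sequence numbers, the two secondaries could end with different prices depending on arrival order.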

  12. Soft Time-based Staleness Bound
     • Semantics: each read is serviced at a replica that is at most x seconds stale compared to the primary.
     • Important for services such as on-line auctions.
     • Implementation:
        • Each replica periodically announces its data version;
        • The Neptune client module directs requests only to replicas with a fresh enough version.
     • The bound is soft: it depends on network latency, announcement frequency, and intermittent packet losses.
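One way a client module could decide which announced versions are "fresh enough" is sketched below. The slide does not say how versions map to time, so this sketch assumes the client remembers when it heard each version from the primary; `StalenessFilter` and its methods are hypothetical names.

```python
import bisect
from typing import Dict, List


class StalenessFilter:
    """Keeps replicas whose version covers the primary's state as of
    (now - bound) seconds ago. The history scheme is an assumption."""

    def __init__(self, bound_seconds: float) -> None:
        self.bound = bound_seconds
        self._times: List[float] = []   # arrival times, ascending
        self._versions: List[int] = []  # primary version heard at each time

    def record_primary(self, now: float, version: int) -> None:
        self._times.append(now)
        self._versions.append(version)

    def required_version(self, now: float) -> int:
        if not self._versions:
            return 0
        # Last primary announcement heard at or before (now - bound).
        idx = bisect.bisect_right(self._times, now - self.bound)
        # With no history that old, fall back to the earliest known
        # version, which is the conservative choice.
        return self._versions[idx - 1] if idx > 0 else self._versions[0]

    def fresh_replicas(self, now: float, announced: Dict[str, int]) -> List[str]:
        need = self.required_version(now)
        return sorted(name for name, v in announced.items() if v >= need)


f = StalenessFilter(bound_seconds=5.0)
f.record_primary(0.0, 10)
f.record_primary(4.0, 12)
f.record_primary(8.0, 15)
eligible = f.fresh_replicas(10.0, {"a": 15, "b": 12, "c": 11})
```

Because announcements can be delayed or lost, the client's view of both the primary and the replicas lags reality, which is why the bound is only soft.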

  13. Progressive Version Delivery
     • From each client’s point of view:
        • Writes are always seen by subsequent reads.
        • Versions delivered for reads are progressive.
     • Important for services like on-line auctions.
     • Implementation:
        • Each replica periodically announces its data version;
        • Each service invocation returns a version number for the service client to keep as a session variable;
        • The Neptune client module directs a read to a replica with an announced version >= all the previously returned versions.
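The session-variable rule above can be sketched in a few lines. `Session`, `observe`, and `eligible` are hypothetical names for this illustration; integer version numbers are also an assumption.

```python
from typing import Dict, List


class Session:
    """Tracks the highest version any earlier invocation returned."""

    def __init__(self) -> None:
        self.last_seen = 0

    def observe(self, returned_version: int) -> None:
        # Keep the maximum, so the requirement never moves backwards.
        self.last_seen = max(self.last_seen, returned_version)

    def eligible(self, announced: Dict[str, int]) -> List[str]:
        # A read may go only to replicas at or beyond every version
        # this session has already been shown.
        return sorted(r for r, v in announced.items() if v >= self.last_seen)


session = Session()
session.observe(7)  # e.g. version returned by the client's own write
session.observe(5)  # an older version returned later does not lower the bar
targets = session.eligible({"a": 6, "b": 7, "c": 9})
```

Replica "a" is skipped because reading from it could hide the client's own write at version 7.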

  14. Failure Recovery
     Each replica maintains a REDO log for each data partition. The log has two portions:
     • Committed portion: completed writes;
     • Uncommitted portion: writes received but not yet completed.
     Three-phase recovery for primary-secondary replication (level-2 & level-3):
     • Synchronize with the underlying service module;
     • Recover missed writes from the current primary;
     • Resume normal operations.
     Only phase one is necessary for write-anywhere replication (level-1).
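The three recovery phases can be sketched as follows. This is a simplified illustration under stated assumptions: writes carry sequence numbers, the service module can report which sequence numbers it applied, and uncommitted writes it did not apply are simply re-fetched from the primary. `ReplicaLog` and `recover` are hypothetical names.

```python
from typing import List, Set, Tuple

Write = Tuple[int, str]  # (sequence number, payload), an assumed format


class ReplicaLog:
    def __init__(self) -> None:
        self.committed: List[Write] = []    # completed writes
        self.uncommitted: List[Write] = []  # received but not yet completed
        self.serving = False

    def recover(self, applied_by_service: Set[int],
                primary_log: List[Write]) -> None:
        # Phase 1: synchronize with the underlying service module.
        # Writes the service did apply move to the committed portion.
        for write in self.uncommitted:
            if write[0] in applied_by_service:
                self.committed.append(write)
        self.uncommitted = []
        self.committed.sort()

        # Phase 2: recover missed writes from the current primary,
        # i.e. everything after the last committed sequence number.
        last = self.committed[-1][0] if self.committed else 0
        for write in primary_log:
            if write[0] > last:
                self.committed.append(write)

        # Phase 3: resume normal operations.
        self.serving = True


log = ReplicaLog()
log.committed = [(1, "a")]
log.uncommitted = [(2, "b"), (3, "c")]
log.recover(applied_by_service={2},
            primary_log=[(1, "a"), (2, "b"), (3, "c"), (4, "d")])
```

For level-1 replication only the first phase would run, since there is no primary to pull missed writes from.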

  15. Outline
     • Motivations & Related Work
     • System Architecture and Assumptions
     • Replica Consistency and Failure Recovery
     • System Implementation and Service Deployments
     • Experimental Studies

  16. Prototype System Implementation on a Linux cluster
     • Service availability and node runtime workload are announced through IP Multicast.
        • Announcements are multicast once a second;
        • They are kept as soft state that expires in five seconds.
     • Service instances can run either as processes or as threads in the Neptune server runtime environment.
     • Each Neptune server module maintains a process/thread pool and a waiting queue.
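The soft-state handling of announcements can be sketched without any networking. `SoftStateTable` is a hypothetical name, and time is passed in explicitly so the example is deterministic; the five-second expiry is taken from the slide.

```python
from typing import Dict, List, Tuple


class SoftStateTable:
    """Remembers the latest announcement per node and treats a node as
    unavailable once its entry is older than the expiry period."""

    def __init__(self, ttl_seconds: float = 5.0) -> None:
        self.ttl = ttl_seconds
        self._last_heard: Dict[str, Tuple[float, float]] = {}  # node -> (time, load)

    def on_announcement(self, node: str, now: float, load: float) -> None:
        # Each announcement refreshes the entry; nothing is ever
        # explicitly deleted when a node fails.
        self._last_heard[node] = (now, load)

    def live_nodes(self, now: float) -> List[str]:
        return sorted(node for node, (heard, _load) in self._last_heard.items()
                      if now - heard <= self.ttl)


table = SoftStateTable()
table.on_announcement("node-a", now=0.0, load=0.2)
table.on_announcement("node-b", now=0.0, load=0.7)
table.on_announcement("node-a", now=3.0, load=0.3)  # node-b has gone silent
alive = table.live_nodes(now=6.0)
```

With one announcement per second and a five-second expiry, a node drops out of the table only after several consecutive announcements are missed, so a single lost packet does not mark it as failed.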

  17. Experience with Service Deployments
     • On-line discussion group
        • View message headers, view message, and add message.
        • All three consistency levels can be applied.
     • Auction
        • Level 3 consistency with staleness control is used.
     • Persistent cache
        • Store key-value pairs (e.g. caching query results).
        • Level 2 consistency (primary-secondary) is used.
     Fast prototyping and implementation without worrying about replication/clustering complexities.

  18. Experimental Settings for Performance Evaluation
     • Synthetic workloads:
        • 10% and 50% write percentages;
        • Balanced workload to assess best-case scalability;
        • Skewed workload to evaluate the impact of runtime hot-spots.
     • Metric: maximum throughput at which at least 98% of client requests are completed within 2 seconds.
     • Evaluation environment:
        • Linux cluster with dual 400MHz Pentium IIs, 512MB/1GB memory, dual 100Mb/s Ethernet interfaces.
        • Lucent P550 Ethernet switch with 22Gb/s backplane bandwidth.
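The throughput metric can be stated as a small check. This is a sketch of the definition only: the function names, the shape of `runs` (offered request rate mapped to observed latencies in seconds), and the sample numbers are made up for illustration and are not measurements from the study.

```python
from typing import Dict, List


def meets_target(latencies: List[float], deadline: float = 2.0,
                 percent: int = 98) -> bool:
    """True if at least `percent`% of requests finished within `deadline`.
    Requests that never completed can be recorded as float('inf')."""
    if not latencies:
        return False
    on_time = sum(1 for t in latencies if t <= deadline)
    # Integer arithmetic avoids floating-point rounding at the boundary.
    return on_time * 100 >= percent * len(latencies)


def max_throughput(runs: Dict[int, List[float]]) -> int:
    """Highest offered rate whose run still meets the target, else 0."""
    passing = [rate for rate, latencies in runs.items() if meets_target(latencies)]
    return max(passing) if passing else 0


# Illustrative data only: 100 requests per run.
runs = {
    200: [0.5] * 100,
    400: [1.0] * 98 + [3.0] * 2,   # exactly 98% on time
    600: [1.5] * 97 + [4.0] * 3,   # 97% on time, fails the target
}
best = max_throughput(runs)
```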

  19. Scalability under Balanced Workload
     • NoRep is about twice as fast as Rep=4 under 50% writes.
     • Insignificant performance difference across the three consistency levels under balanced workload.

  20. Skewed Workload
     • Each skewed workload consists of requests chosen from a set of partitions according to a Zipf distribution.
     • Define the workload imbalance factor as the proportion of the requests directed to the most popular partition.
     • For a 16-partition service, an imbalance factor of 1/16 indicates a completely balanced workload.
     • An imbalance factor of 1 means all requests are directed to one partition.
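The two endpoints quoted above follow directly from the definition. The sketch below assumes a Zipf distribution in which the partition of popularity rank i receives requests in proportion to 1/i^s; the exponent `s` is an assumption of this sketch, since the slide does not say how the skew was varied.

```python
def imbalance_factor(partitions: int, s: float) -> float:
    """Share of requests that go to the most popular partition when
    rank i is chosen with probability proportional to 1 / i**s."""
    weights = [1.0 / (rank ** s) for rank in range(1, partitions + 1)]
    return weights[0] / sum(weights)


balanced = imbalance_factor(16, 0.0)    # uniform: every weight is 1
classic = imbalance_factor(16, 1.0)     # weights 1, 1/2, ..., 1/16
extreme = imbalance_factor(16, 50.0)    # nearly all requests hit rank 1
```

With s = 0 the factor is exactly 1/16, and as s grows it approaches 1, matching the balanced and single-partition cases on the slide.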

  21. Impact of Workload Imbalance on Replication Degrees
     • Replication provides dynamic load-sharing for runtime hot-spots (Rep=4 could be up to 3 times as fast as NoRep).
     Settings: 10% writes; level-2 consistency; 8 nodes.

  22. Impact of Workload Imbalance on Consistency Levels
     Settings: 10% writes; Rep degree 4; 8 nodes.
     • Modest performance difference:
        • Up to 12% between level-2 and level-3;
        • Up to 9% between level-1 and level-2.

  23. Failure Recovery for Primary-secondary Replication
     • Graceful performance degradation.
     • Performance drop after the three-node failure.
     • Errors and timeouts trailing each recovery (write recovery and sync overhead).

  24. Conclusions
     Contributions:
     • Scalable replication for cluster-based network services; multi-level consistency with staleness control.
     • A simple programming model to shield replication and clustering complexities from application service authors.
     Evaluation results:
     • Replication improves performance for runtime hot-spots.
     • Performance of level 3 consistency is competitive.
     • Level 2/3 carries extra overhead during failure recovery.
     http://www.cs.ucsb.edu/research/Neptune
