380 likes | 513 Views
Simplifying Cluster-Based Internet Service Construction with SDDS. Steven D. Gribble Qualifying Exam Proposal April 19th, 1999 Committee Members: Eric Brewer, David Culler, Joseph Hellerstein, and Marti Hearst. Clusters for Internet Services. Previous observation (TACC, Inktomi, NOW):
E N D
Simplifying Cluster-Based Internet Service Construction with SDDS Steven D. Gribble Qualifying Exam Proposal April 19th, 1999 Committee Members: Eric Brewer, David Culler, Joseph Hellerstein, and Marti Hearst
Clusters for Internet Services • Previous observation (TACC, Inktomi, NOW): • clusters of workstations are a natural platform for constructing Internet services • Internet service properties • support large, rapidly growing user populations • must remain highly available, and cost-effective • Clusters offer a tantalizing solution • incremental scalability: cluster grows with service • natural parallelism: high performance platform • software and hardware redundancy: fault-tolerance
Problem • Internet service construction on clusters is hard • load balancing, process management, communications abstractions, I/O balancing, fail-over and restart, … • toolkits proposed to help (TACC, AS1, River, … ) • Even harder if shared, persistent state is involved • data partitioning, replication, and consistency, interacting with storage subsystem, … • solutions not geared to clustered services • use (distributed) RDBMS: expensive, powerful semantic guarantees, generality at cost of performance • use network/distributed FS: overly general, high overhead (e.g. double buffering penalties). Fault-tolerance? • roll your own custom solution: not reusable, complex
Idea / Hypothesis • It is possible to: • isolate clustered services from vagaries of state mgmt., • to do so with adequately general abstractions, • to build those abstractions in a layered fashion (reuse), • and to exploit clusters for performance, and simplicity. • Use Scalable Distributed Data Structures (SDDS) • take conventional data structure (hash table, tree, log, … ) • partition it across nodes in a cluster • parallel access, scalability, … • replicate partitions within replica groups in cluster • availability in face of failures, further parallelism • store replicas on disk
Why SDDS? • 1st year ugrad software engineering principle: • separation of concerns • decouple persistency/consistency logic from rest of service • simpler (and cleaner!) service implementations • service authors understand data structures • familiar behavior and interfaces from single-node case • should enable rapid development of new services • structure and access patterns are self-evident • access granularity manifestly a structure element • coincidence of logical and physical data units (ALF/ADU) • cf. file systems, SQL in RDBMS, VM pages in DSM
Challenges (== Research Agenda) • Overcoming complexities of distributed systems • data consistency, data distribution, request load balancing, hiding network latency and OS overhead, … • ace up the sleeve: cluster wide area • single, controlled administrative domain • engineer to (probabilistically) avoid network partitions • use low-latency, high-throughput SAN (5 s, 40-120 MB/s) • predictable behavior, controlled heterogeneity • “I/O is still a problem” • plenty of work on fast network I/O, some on fast disk I/O • less work bridging network disk I/O in cluster environment
High-level plan (10000’ view) 1) Implement several SDDS’s • hypothesis: log, tree, and hash table are sufficient • investigate different consistency levels and mechanisms • experiment with scalability mechanisms • Get top and bottom interfaces correct • top interface: particulars of hash, tree, log implementation • specialization hooks? • bottom interface: segment-based cluster I/O layer • filtered streams between disks, network, memory 2) Evaluate by deploying real services • goal: measurable reduction of complexity • port existing Ninja services to SDDS platform • goal: ease of adoption by service authors • have co-researchers implement new services on SDDS
Outline • Prototype description • distributed hash table implementation • “Parallelisms” service and lessons • SDDS design and research issues • Evaluation methodology • Segment-based I/O layer for clusters • Related work • Pro Forma timeline
Outline • Prototype description • distributed hash table implementation • “Parallelisms” service and lessons • SDDS design and research issues • Evaluation methodology • Segment-based I/O layer for clusters • Related work • Pro Forma timeline
Prototype hash table Client Client Client Client Client Client Service FE Service FE Service FE hash library hash library hash library SAN storage “brick” storage “brick” storage “brick” storage “brick” storage “brick” storage “brick” • storage bricks provide local, network-accessible hash tables • interaction with distrib. hash table through abstraction libraries • “C”, Java APIs available • partitioning, mirrored replication logic in each library • distrib. table semantics • handles node failures • no consistency • or transactions, on-line recovery, etc.
Storage bricks Thread Pool RPC marshalling Communications abstractions Communications abstractions Hash table primitives TCP layer TCP layer AM layer AM layer virt / phys virt / phys MMAP management Argument marshalling Worker pool: one thread dispatched per request. • Individual nodes are storage bricks • consistent, atomic, network accessible operations on a local hash table • uses MMAP to handle data persistence • no transaction support • Clients communicate to set of storage bricks using RPC marshalling layer Local hash table implementations Messaging, event queue MMAP region management, and alloc(), free() impl. Transport specific comm. and naming storage brick Service application logic Virtual to physical node names, inter-node hashing Service Frontend
Parallelisms service • Provides “relevant site” information given a URL • an inversion of Yahoo! directory • Parallelisms: builds index of all URLs, returns other URLs in same topics • read-mostly traffic, nearly no consistency requirements • large database of URLs • ~ gigabyte of space for 1.5 million URLs and 80000 topics • Service FE itself is very simple • 400 semicolons of C • 130 for app-specific logic • 270 for threads, HTTP munging, … • hash table code: 4K semicolons of C http://ninja.cs.berkeley.edu/~demos/ paralllelisms/parallelisms.html
Some Lessons Learned • mmap() simplified implementation, but at a price • service “working sets” naturally apply • no pointers: breaks usual linked list and hash table libraries • little control over the order of writes, so cannot guarantee consistency if crashes occur • if node goes down, may incur a lengthy sync before restart • same for abstraction libraries: simplicity with a cost • each storage brick could be totally independent • because policy embedded in abstraction libraries • bad for administration and monitoring • no place to “hook in” to get view of complete table • each client makes isolated decisions • load balancing and failure detection
More lessons learned • service simplicity premise seems valid • Parallelisms service code devoid of persistence logic • Parallelisms front-ends contain only session state • no recovery necessary if they fail • interface selection is critical • originally, just supported put(), get(), remove() • wanted to support java.util.hashtable subclass • needed enumerations, containsKey(), containsObject() • to efficiently support required significant replumbing • thread subsystem was troublesome • JDK has its own, and it conflicted. Had to remove threads from client-side abstraction library.
Outline • Prototype description • distributed hash table implementation • “Parallelisms” service and lessons • SDDS design and research issues • Evaluation methodology • Segment-based I/O layer for clusters • Related work • Pro Forma timeline
SDDS goal: simplicity • hypothesis: simplify construction of services • evidence: Parallelisms • distributed hash table prototype: 3000 lines of “C” code • service: 400 lines of “C” code, 1/3 of which is service-specific • evidence: Keiretsu service • instant messaging service between heterogeneous devices • crux of service is in sharing of binding/routing state • original: 131 lines of Java SDDS version: 80 lines of Java • management/operational aspects • to be successful, authors must want to adopt SDDSs • simple to incorporate and understand • operational management must be nearly transparent • node fail-over and recovery, logging, etc. behind the scenes • plug-n-play extensibility to add capacity
SDDS goal: generality Web server read-mostlyHT: static documents, cache. L: hit tracking Proxy cache HT: soft-state cache. L: hit tracking Search engine HT: query cache, doc. table. HT, T: indexes. L: crawl data, hit log PIM server HT: repository HT, T: indexes. L: hit log, WAL for re-filing etc. • potential criticism of SDDSs: • no matter which structures you provide, some services simply can’t be built with only those primitives • response: pick a basis to enable many interesting services • log, hash table, and tree: our guess at a good basis • layered-model will allow people to develop other SDDS’s • allow GiST-style specialization hooks?
SDDS Research Ideas: Consistency • consistency / performance tradeoffs • stricter consistency requirements imply worse performance • we know some intended services have weaker requirements • rejected alternatives: • built strict consistency, and force people to use • investigate “extended transaction models” • our choice: pick small set of consistency guarantees • level 0 (atomic but not isolated operations) • level 3 (ACID) • get help with this one - it’s a bear
SDDS Research Ideas: Consistency (2) • replica management • what mechanism will we use between replicas? • 2 phase commit for distributed atomicity • log-based on-line recovery • exploiting cluster properties • low network latency fast 2 phase commit • especially relative to WAN latency for Internet services • given good UPS, node failures independent • “commit” to memory of peer in group, not to disk • (probabilistically) engineer away network partitions • unavailable failure, therefore consensus algo. not needed
SDDS Research Ideas: load management • data distribution affects request distribution • start simple: static data distribution (except extensibility) • given request, lookup or hash to determine partition • optimizations • locality aware request dist. (LARD) within replicas • if no failures, replicas further “partition” data in memory • “front ends” often colocated with storage nodes • front end selection based on data distribution knowledge • smart clients (Ninja redirector stubs..?) • Issues • graceful degradation: RED/LRP techniques to drop requests • given many simultaneous requests, service ordering policy?
Incremental Scalability • logs and trees have a natural solution • pointers are ingrained in these structures • use the pointers to (re)direct structures onto new nodes
Incremental Scalability - Hash Tables • hash table is the tricky one • why? mapping is done by client-side hash functions • unless table is chained, no pointers inside hash structure • need to change client-side functions to scale structure • Litwin’s linear hashing? • client-side hash function evolves over time • clients independently discovery when to evolve functions • Directory-based map? • move hashing into infrastructure (inefficient) • or, have infrastructure inform clients when to change function • AFS-style registration and callbacks?
Getting the Interfaces Right • upper interfaces: sufficient generality • setting the bar for functionality (e.g. java.util.hashtable) • opportunity: reuse of existing software (e.g. Berkeley DB) • lower interfaces: use a segment-based I/O layer? • log, tree: natural sequentiality, segments make sense • hash table is much more challenging • aggregating small, random accesses into large, sequential ones • rely on commits to other nodes’ memory • periodically dump deltas to disk LFS-style
Outline • Prototype description • distributed hash table implementation • “Parallelisms” service and lessons • SDDS design and research issues • Evaluation methodology • Segment-based I/O layer for clusters • Related work • Pro Forma timeline
Evaluation: use real services • metrics for success 1) measurable reduction in complexity to author Internet svcs. 2) widespread adoption of SDDS by Ninja researchers 1) port/reimplement existing Ninja services • Keiretsu, Ninja Jukebox, the multispace Log service • explicitly demonstrate code reduction & performance boon 2) convince people to use SDDS for new services • NinjaMail, Service Discovery Service, ICEBERG services • will force us to tame operational aspects of SDDS • goal: as simple to use SDDS as single-node, non-persistent case
Outline • Prototype description • distributed hash table implementation • “Parallelisms” service and lessons • SDDS design and research issues • Evaluation methodology • Segment-based I/O layer for clusters • Related work • Pro Forma timeline
Segment layer (motivation) • it’s all about disk bandwidth & avoiding seeks • 8 ms random seek, 25-80 MB/s throughput • must read 320 KB per seek to break even • build disk abstraction layer based on segments • 1-2 MB regions on disk, read and written in their entirety • force upper layers to design with this in mind • small reads/writes treated as uncommon failure case • SAN throughput is comparable to disk throughput • stream from disk to network and saturate both channels • stream through service-specific filter functions • selection, transformation, … • apply lessons from high-performance networks
Segment layer challenges • thread and event model • lowest level model dictates entire application stack • dependency on particular thread subsystem is undesirable • asynchronous interfaces are essential • especially for Internet services w/ thousands of connections • potential model: VIA completion queues • reusability for many components • toughest customer: Telegraph DB • dictate write ordering, be able to “roll back” mods for aborts • if content is paged, make sure don’t overwrite on disk • no mmap( )!
Segment Implementation Plan • Two versions planned • one version using POSIX syscalls and vanilla filesystem • definitely won’t perform well (copies to handle shadowing) • portable to many platforms • good for prototyping and getting API right • one version on Linux with kernel modules for specialization • I/O-lite style buffer unification • use VIA or AM for network I/O • modify VM subsystem for copy-on-write segments, and/or paging dirty data to separate region
Outline • Prototype description • distributed hash table implementation • “Parallelisms” service and lessons • SDDS design and research issues • Evaluation methodology • Segment-based I/O layer for clusters • Related work • Pro Forma timeline
Related work • (S)DSM • structural element is a better atomic unit than page • fault tolerance as goal • Distributed/networked FS [NFS, AFS, xFS, LFS, ..] • FS more general, has less chance to exploit structure • often not in clustered environment (except xFS, Frangipani) • Litwin SDDS [LH, LH*, RP, RP*] • significant overlap in goals • but little implementation experience • little exploitation of cluster characteristics • consistency model not clear
Related Work (continued) • Distributed & Parallel Databases [R*, Mariposa, Gamma, …] • different goal (generality in structure/queries, xacts) • stronger and richer semantics, but at cost • both $$ and performance • Fast I/O research [U-Net, AM, VIA, IO-lite, fbufs, x-kernel, …] • network and disk subsystems • main results: get OS out of way, avoid (unnecessary) copies • use results in our fast I/O layer • Cluster platforms [TACC, AS1, River, Beowulf, Glunix, …] • many issues we aren’t tackling • harvesting idle resources, process migration, single-system view • our work is mostly complementary
Outline • Prototype description • distributed hash table implementation • “Parallelisms” service and lessons • SDDS design and research issues • Evaluation methodology • Segment-based I/O layer for clusters • Related work • Pro Forma timeline
Pro Forma Timeline • Phase 1 (0-6 months) • preliminary distributed log and extensible hash table SDDS • weak consistency for hash table, strong for log • prototype design/implementation of segment I/O layer • both POSIX and linux-optimized versions • Phase 2 (6-12 months) • measure/optimize SDDSs, improve hash table consistency • port existing services to these SDDS’s, help with new svcs. • Phase 3 (12-18 months) • add tree/treap SDDS to repertoire • incorporate lessons learned from phase 2 services • measure, optimize, measure, optimize, … , write, graduate.
Summary • SDDS hypothesis • simplification of cluster-based Internet service construction • possible to achieve generality, simplicity, efficiency • exploit properties of clusters to our benefit • build in layered fashion on segment-based I/O substrate • Many interesting research issues • consistency vs. efficiency, replica management, load balancing and request distribution, extensibility, specialization, … • Measurable success metric • simplification of existing Ninja services • adoption for use in construction new Ninja services
Taxonomy of Clustered Services simplify the job of constructing these classes of services Goal: Stateless Soft-state Persistent State • high availability and • completeness • perhaps consistency • persistence necessary • high availability • perhaps consistency • persistence is an • optimization State Mgmt. Requirements Little or none TACC distillers TACC aggregators Inktomi search engine River modules AS1 servents or RMX Parallelisms Examples Scalable PIM apps Video Gateway Squid web cache HINDE mint
Performance • Bulk-loading of database dominated by disk access time • can achieve 1500 inserts per second per node on 100 Mb/s Ethernet cluster, if hash table fits in memory (dominant cost is messaging layer) • otherwise, degrades to about 30 inserts per second (dominant cost is disk write time) • In steady state, all nodes operate primarily out of memory, as the working set is fully paged in • similar principle to research Inktomi cluster • handles hundreds of queries per s. on 4 node cluster w/ 2 FE’s
Deliverables Ninja Svcs Telegraph HT HT’ log tree SDDS primitives I/O layer = I will build = I will help build/influence = I will spectate