Detour: Distributed Systems Techniques & Case Studies I

randir

Presentation Transcript


  1. Detour: Distributed Systems Techniques & Case Studies I • Distributing (Logically) Centralized SDN Controllers • The NIB needs to be maintained by multiple (distributed) SDN controllers • Multiple SDN controllers may need to concurrently read or write the same shared state • Distributed State Management Problem! • Look at three case studies from distributed systems • Google File System (GFS) • Amazon’s Dynamo • Yahoo!’s PNUTS CSci8211: Distributed System Techniques & Case Studies: I

  2. Distributed Data Stores & Consistency Models Availability & Performance vs. Consistency Trade-offs • Traditional (Transactional) Database Systems: • Query Model: more expressive query language, e.g., SQL • ACID Properties: Atomicity, Consistency, Isolation and Durability • Efficiency: very expensive to implement at large scale! • Many real Internet applications/systems do not require strong consistency, but require high availability • Google File System: many reads, few writes (mostly appends) • Amazon’s Dynamo: simple query model, small data objects, but needs to be “always writable” at massive scale • Yahoo!’s PNUTS: databases with relaxed consistency for web apps requiring more than “eventual consistency” (e.g., ordered updates) Implicit/Explicit Assumptions: Applications often can tolerate or know best how to handle inconsistencies (if they happen rarely), but care more about availability & performance

  3. Data Center and Cloud Computing • Data center: large server farms + data warehouses • not simply for web/web services • managed infrastructure: expensive! • From web hosting to cloud computing • individual web/content providers: must provision for peak load • Expensive, and typically resources are under-utilized • web hosting: third party provides and owns the (server farm) infrastructure, hosting web services for content providers • “server consolidation” via virtualization [Figure: virtualization stack of App, Guest OS, and VMM; the App and Guest OS are under client web service control]

  4. Cloud Computing • Cloud computing and cloud-based services: • beyond web-based “information access” or “information delivery” • computing, storage, … • Cloud Computing: NIST Definition "Cloud computing is a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction." • Models of Cloud Computing • “Infrastructure as a Service” (IaaS), e.g., Amazon EC2, Rackspace • “Platform as a Service” (PaaS), e.g., Microsoft Azure • “Software as a Service” (SaaS), e.g., Google

  5. Data Centers: Key Challenges • With thousands of servers within a data center: • How to write applications (services) for them? • How to allocate resources, and manage them? • in particular, how to ensure performance, reliability, availability, … • Scale and complexity bring other key challenges • with thousands of machines, failures are the default case! • load-balancing, handling “heterogeneity,” … • Data center (server cluster) as a “computer” • “super-computer” vs. “cluster computer” • a single “super-high-performance” and highly reliable computer vs. a “computer” built out of thousands of “cheap & unreliable” PCs • Pros and cons?

  6. Google Scale and Philosophy • Lots of data • copies of the web, satellite data, user data, email and USENET, Subversion backing store • Workloads are large and easily parallelizable • No commercial system big enough • couldn’t afford it if there was one • might not have made appropriate design choices • But truckloads of low-cost machines • 450,000 machines (NYTimes estimate, June 14th 2006) • Failures are the norm • Even reliable systems fail at Google scale • Software must tolerate failures • Which machine an application is running on should not matter • Firm believers in the “end-to-end” argument • Care about perf/$, not absolute machine perf

  7. Typical Cluster at Google [Figure: cluster-wide services (Cluster Scheduling Master, Lock Service, GFS Master) alongside Machines 1, 2, and 3; each machine runs Linux, a Scheduler Slave, and a GFS Chunkserver, and the machines also host BigTable Servers, a BigTable Master, and User Tasks]

  8. Google: System Building Blocks • Google File System (GFS): • raw storage • (Cluster) Scheduler: • schedules jobs onto machines • Lock service: • distributed lock manager • also can reliably hold tiny files (100s of bytes) w/ high availability • Bigtable: • a multi-dimensional database • MapReduce: • simplified large-scale data processing • ....

  9. Google File System Key Design Considerations • Component failures are the norm • hardware component failures, software bugs, human errors, power supply issues, … • Solutions: built-in mechanisms for monitoring, error detection, fault tolerance, automatic recovery • Files are huge by traditional standards • multi-GB files are common, billions of objects • most writes (modifications or “mutations”) are “append” • two types of reads: large # of “stream” (i.e., sequential) reads, with small # of “random” reads • High concurrency (multiple “producers/consumers” on a file) • atomicity with minimal synchronization • Sustained bandwidth more important than latency

  10. GFS Architectural Design • A GFS cluster: • a single master + multiple chunkservers per master • running on commodity Linux machines • A file: a sequence of fixed-size chunks (64 MB each) • labeled with 64-bit unique global IDs, • stored at chunkservers (as “native” Linux files, on local disk) • each chunk mirrored across (default 3) chunkservers • master server: maintains all metadata • name space, access control, file-to-chunk mappings, garbage collection, chunk migration • why only a single master? (with read-only shadow masters) • simple, and only answers chunk location queries from clients! • chunkservers (“slaves” or “workers”): • interact directly with clients, perform reads/writes, …

  11. GFS Architecture: Illustration • GFS clients • consult master for metadata • typically ask for multiple chunk locations per request • access data from chunkservers Separation of control and data flows

  12. Chunk Size and Metadata • Chunk size: 64 MB • fewer chunk location requests to the master • client can perform many operations on a chunk • reduce overhead to access a chunk • can establish persistent TCP connection to a chunkserver • fewer metadata entries • metadata can be kept in memory (at master) • in-memory data structures allow fast periodic scanning • some potential problems with fragmentation • Metadata • file and chunk namespaces (files and chunk identifiers) • file-to-chunk mappings • locations of a chunk’s replicas

  13. Chunk Locations and Logs • Chunk location: • the master does not keep a persistent record of chunk locations • it polls chunkservers at startup, and uses heartbeat messages to monitor chunkservers: simplicity! • because of chunkserver failures, it is hard to keep a persistent record of chunk locations • on-demand approach vs. coordination • on-demand wins when changes (failures) are frequent • Operation logs • maintain a historical record of critical metadata changes • Namespace and mapping • for reliability and consistency, replicate operation log on multiple remote machines (“shadow masters”)

  14. Clients and APIs • GFS not transparent to clients • requires clients to perform certain “consistency” verification (using chunk id & version #), make snapshots (if needed), … • APIs: • open, delete, read, write (as expected) • append: at least once, possibly with gaps and/or inconsistencies among clients • snapshot: quickly create copy of file • Separation of data and control: • Issues control (metadata) requests to master server • Issues data requests directly to chunkservers • Caches metadata, but does no caching of data • no consistency difficulties among clients • streaming reads (read once) and append writes (write once) don’t benefit much from caching at client

  15. System Interaction: Read • Client sends master: • read(file name, chunk index) • Master’s reply: • chunk ID, chunk version#, locations of replicas • Client sends “closest” chunkserver w/replica: • read(chunk ID, byte range) • “closest” determined by IP address on simple rack-based network topology • Chunkserver replies with data
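
The read request above is addressed by chunk index, so a client first has to turn a file byte range into per-chunk pieces. A minimal sketch of that translation, assuming the 64 MB chunk size from the slides; the function name `chunk_requests` and its tuple layout are hypothetical and not part of any GFS client API:

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB fixed chunk size, as stated in the slides


def chunk_requests(offset, length):
    """Split a file byte range into (chunk_index, start_in_chunk, nbytes) pieces.

    Hypothetical helper: each piece maps to one read(file name, chunk index)
    lookup at the master and one read(chunk ID, byte range) at a chunkserver.
    """
    pieces = []
    end = offset + length
    while offset < end:
        index = offset // CHUNK_SIZE          # which chunk holds this byte
        start = offset % CHUNK_SIZE           # position inside that chunk
        nbytes = min(CHUNK_SIZE - start, end - offset)
        pieces.append((index, start, nbytes))
        offset += nbytes
    return pieces
```

A range that straddles a chunk boundary yields two pieces, which is why clients typically ask the master for several chunk locations in one request.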

  16. System Interactions: Write and Record Append • Write and Record Append (atomic) • slightly different semantics: record append is “atomic” • The master grants a chunk lease to a chunkserver (primary), and replies back to client • Client first pushes data to all chunkservers • pushed linearly: each replica forwards as it receives • pipelined transfer: 13 MB/second with 100 Mbps network • Then issues a write/append to primary chunkserver • Primary chunkserver determines the order of updates to all replicas • in record append: primary chunkserver checks to see whether record append would exceed maximum chunk size • if yes, pad the chunk (and ask secondaries to do the same), and then ask client to append to the next chunk
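
The primary's decision for a record append can be sketched as follows. This is only an illustration of the padding rule on the slide; the function name, the return values, and the rejection of oversized records are assumptions made for the example:

```python
CHUNK_SIZE = 64 * 1024 * 1024  # maximum chunk size (64 MB, per the slides)


def primary_record_append(chunk_used, record_len):
    """Decide what the primary does with a record append.

    chunk_used: bytes already written in the current chunk.
    Returns ("append", offset) when the record fits, or
    ("pad_and_retry", pad_bytes) when the chunk must be padded and the
    client asked to retry on the next chunk.
    """
    if record_len > CHUNK_SIZE:
        # Assumption for this sketch: a record can never span chunks.
        raise ValueError("record larger than a chunk")
    if chunk_used + record_len > CHUNK_SIZE:
        return ("pad_and_retry", CHUNK_SIZE - chunk_used)
    return ("append", chunk_used)
```

Because the primary picks the offset, every replica stores the record at the same position, and a client retry after a failure is what produces the "at least once" behavior.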

  17. Leases and Mutation Order • Lease: • 60-second timeouts; can be extended indefinitely • extension requests are piggybacked on heartbeat messages • after a timeout expires, master can grant new leases • Use leases to maintain consistent mutation order across replicas • Master grants a lease to one of the replicas -> Primary • Primary picks serial order for all mutations • Other replicas follow the primary order

  18. Consistency Model • Changes to namespace (i.e., metadata) are atomic • done by single master server! • Master uses log to define global total order of namespace-changing operations • Relaxed consistency • concurrent changes are consistent but “undefined” • defined: after a data mutation, the file region is consistent and all clients see the entire mutation • an append is atomically committed at least once • occasional duplications • All changes to a chunk are applied in the same order to all replicas • Use version number to detect missed updates

  19. Master Namespace Management & Logs • Namespace: files and their chunks • metadata maintained as “flat names”, no hard/symbolic links • full path name to metadata mapping • with prefix compression • Each node in the namespace has associated read-write lock (-> a total global order, no deadlock) • concurrent operations can be properly serialized by this locking mechanism • Metadata updates are logged • logs replicated on remote machines • take global snapshots (checkpoints) to truncate logs (but checkpoints can be created while updates arrive) • Recovery • Latest checkpoint + subsequent log files
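
Recovery as "latest checkpoint + subsequent log files" amounts to replaying only the log entries newer than the checkpoint. A minimal sketch, assuming a hypothetical log format of `(sequence_number, op, path, chunk_ids)` tuples; the real GFS log format is not described in the slides:

```python
def recover(checkpoint, log):
    """Rebuild the namespace from a checkpoint plus later log entries.

    checkpoint: (last_seq, namespace) where namespace maps a full path
                to its list of chunk IDs.
    log:        iterable of (seq, op, path, chunk_ids) entries; op is
                "create" or "delete" (hypothetical op names).
    """
    last_seq, namespace = checkpoint
    state = dict(namespace)                      # never mutate the checkpoint
    for seq, op, path, chunk_ids in sorted(log, key=lambda e: e[0]):
        if seq <= last_seq:
            continue                             # already reflected in checkpoint
        if op == "create":
            state[path] = list(chunk_ids)
        elif op == "delete":
            state.pop(path, None)
    return state
```

Entries at or below the checkpoint's sequence number are skipped, which is what allows the log to be truncated once a checkpoint is durable.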

  20. Replica Placement • Goals: • Maximize data reliability and availability • Maximize network bandwidth • Need to spread chunk replicas across machines and racks • Higher priority to re-replicating chunks with lower replication factors • Limited resources spent on replication

  21. Other Operations • Locking operations • one lock per path, can modify a directory concurrently • to access /d1/d2/leaf, need to lock /d1, /d1/d2, and /d1/d2/leaf • each thread acquires: a read lock on a directory & a write lock on a file • totally ordered locking to prevent deadlocks • Garbage Collection: • simpler than eager deletion due to • unfinished replicated creation, lost deletion messages • deleted files are hidden for three days, then they are garbage collected • combined with other background (e.g., take snapshots) ops • safety net against accidents
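
The locking rule above (read locks on the ancestor directories, a write lock on the leaf, always taken in one global order) can be sketched like this. The function name is hypothetical, and plain lexicographic path order stands in for whatever total order the master actually uses:

```python
def locks_for(path, leaf_mode="write"):
    """Return the (path, mode) locks needed to operate on `path`.

    Ancestors get read locks and the leaf gets `leaf_mode`. The list is
    sorted so that every thread acquires locks in the same total order,
    which rules out deadlock (sorting by path is an assumption here).
    """
    parts = path.strip("/").split("/")
    prefixes = ["/" + "/".join(parts[:i]) for i in range(1, len(parts) + 1)]
    locks = [(p, "read") for p in prefixes[:-1]] + [(prefixes[-1], leaf_mode)]
    return sorted(locks)
```

Two threads creating different files in the same directory both take only a read lock on that directory, so they can proceed concurrently.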

  22. Fault Tolerance and Diagnosis • Fast recovery • Master and chunkserver are designed to restore their states and start in seconds regardless of termination conditions • Chunk replication • Data integrity • A chunk is divided into 64-KB blocks • Each with its checksum • Verified at read and write times • Also background scans for rarely used data • Master replication • Shadow masters provide read-only access when the primary master is down
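
The per-block integrity check can be illustrated with a short sketch. The slides only say that each 64 KB block has a checksum; the use of CRC-32 via Python's `zlib.crc32` and the function names below are assumptions for the example:

```python
import zlib

BLOCK_SIZE = 64 * 1024  # 64 KB blocks, per the slides


def block_checksums(chunk: bytes):
    """Compute one checksum per 64 KB block (CRC-32 is an assumption)."""
    return [zlib.crc32(chunk[i:i + BLOCK_SIZE])
            for i in range(0, len(chunk), BLOCK_SIZE)]


def corrupted_blocks(chunk: bytes, stored_checksums):
    """Return the indices of blocks whose data no longer matches."""
    current = block_checksums(chunk)
    return [i for i, (now, stored) in enumerate(zip(current, stored_checksums))
            if now != stored]
```

Checking at block granularity means a read only has to verify the blocks that overlap the requested byte range, and a mismatch pinpoints which replica data to discard and re-fetch.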

  23. GFS: Summary • GFS is a distributed file system that supports large-scale data processing workloads on commodity hardware • GFS explores different points in the design space • Component failures as the norm • Optimize for huge files • Success: used actively by Google to support search service and other applications • But performance may not be good for all apps • assumes read-once, write-once workload (no client caching!) • GFS provides fault tolerance • Replicating data (via chunk replication), fast and automatic recovery • GFS has a simple, centralized master that does not become a bottleneck • Semantics not transparent to apps (“end-to-end” principle?) • Must verify file contents to avoid inconsistent regions, repeated appends (at-least-once semantics)

  24. Highlights of Dynamo • Dynamo: key-value data store at massive scale • Used to maintain users’ shopping cart info • Key Design Goals: highly available and resilient at massive scale, while also meeting SLAs! • i.e., all customers have good experience, not simply most! • Target Workload & Usage Scenarios: • simple read/write operations to a (relatively small) data item uniquely identified by a key; e.g., usually less than 1 MB • services must be able to configure Dynamo to consistently achieve their latency and throughput requirements • used by internal services: non-hostile environments • System Interface: • get(key), put(key, context, object)

  25. Amazon Service-Oriented Arch

  26. Dynamo: Techniques Employed

  27. Dynamo: Key Partitioning & Replication & Sloppy Quorum for Read/Write • # of key replicas >= N • (here N = 3) • Each key is associated with a preference list of N ranked (virtual) nodes • Sloppy Quorum: • R + W > N • each read is handled by • a (read) coordinator • -- any node in the ring is fine • Each write is also handled by a (write) coordinator • -- highest ranked available node in the preference list • read via get(): read from all N replicas; success if receiving R responses • write via put(): write to all N replicas; success if receiving W-1 “write OK” acks
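
The partitioning and quorum rules above can be sketched together. This is a simplified illustration under stated assumptions: it places one position per physical node on an MD5 hash ring (no virtual nodes, no failure handling or hinted handoff), and all function names are hypothetical:

```python
import hashlib
from bisect import bisect_right


def ring_position(name):
    """Map a node name or key to a position on the hash ring."""
    return int(hashlib.md5(name.encode()).hexdigest(), 16)


def preference_list(key, nodes, n=3):
    """Return the first n distinct nodes clockwise from the key's position."""
    ring = sorted((ring_position(node), node) for node in nodes)
    positions = [pos for pos, _ in ring]
    start = bisect_right(positions, ring_position(key)) % len(ring)
    return [ring[(start + i) % len(ring)][1] for i in range(min(n, len(ring)))]


def quorums_overlap(n, r, w):
    """R + W > N guarantees a read quorum intersects the latest write quorum."""
    return r + w > n


def read_succeeds(responses, r):
    return responses >= r


def write_succeeds(acks_from_other_replicas, w):
    # The coordinator's own local write counts as one of the W,
    # so only W - 1 acks from other replicas are needed.
    return acks_from_other_replicas >= w - 1
```

With N = 3, choosing R = 2 and W = 2 satisfies the overlap condition, while R = 1 and W = 1 trades that guarantee for lower latency.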

  28. Dynamo: Vector Clock version evolution of an object over time
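
How vector clocks distinguish a newer version from a conflicting one can be shown with a short sketch. Clocks are modeled here as plain dictionaries from node name to counter; that representation and the function names are assumptions for illustration, not Dynamo's actual data structures:

```python
def increment(clock, node):
    """Return a copy of `clock` with `node`'s counter advanced by one."""
    new = dict(clock)
    new[node] = new.get(node, 0) + 1
    return new


def descends(a, b):
    """True if clock `a` has seen every update recorded in clock `b`."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def compare(a, b):
    """Classify two versions of an object by their vector clocks."""
    if a == b:
        return "equal"
    if descends(a, b):
        return "a supersedes b"
    if descends(b, a):
        return "b supersedes a"
    return "concurrent"  # divergent branches: the application must reconcile
```

For example, two writes handled by different coordinators starting from the same version produce clocks that are "concurrent", and a later write that has read both can merge the clocks so that it supersedes each of them.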

  29. Highlights of PNUTS • PNUTS: massively parallel and geographically distributed database system for Yahoo!’s web apps • data storage organized as hashed or ordered tables • hosted, centrally managed, geographically distributed service with automated load-balancing & fail-over • Target Workload • managing session states, content meta-data, user-generated content such as tags & comments, etc. for web applications • Key Design Goals: • scalability • response time and geographic scope • high availability and fault tolerance • relaxed consistency guarantees • more than eventual consistency supported by GFS & Dynamo

  30. PNUTS Overview • Data model and Features • expose a simple relational model to users, & support single-table scans with predicates • include: scatter-gather ops, async. notification, bulk loading • Fault Tolerance • Redundancy at multiple levels: data, meta-data, serving components, etc. • Leverage consistency model to support highly available reads & writes even after failure or partition • Pub-Sub Message System: topic-based YMB (msg. broker) • Record-level Mastering: writing synchronously to all copies is expensive! • make all high latency ops asynchronous: allow local writes, and use record-level mastering to serve all requests locally • Hosting: hosted service shared by many applications

  31. PNUTS Data & Query Model • A simplified relational data model • data organized into tables of records with attributes • in addition to typical data types, allow “blob” data type • schemas are flexible: • allow new attribute addition at any time without halting query or update activities; • records not required to have values for all attributes • each record has a primary key: delete(key)/update(key) • Query language: PNUTS supports • selection and projection from a single table • both hashed (for point access) and ordered tables (for scan) • get(key), multi-get(list-of-keys), scan(range[, predicate]) • no support for “complex” queries, e.g., “join” or “group-by” • in the near future, provide interface to Hadoop, Pig Latin, …

  32. PNUTS Consistency Model • Applications typically manipulate one record at a time • PNUTS provides per-record timeline consistency • all replicas of a given record apply all updates to the record in the same order (one replica designated as “master”) • A range of APIs with varying levels of consistency guarantees (record versions are labeled v.generation.version) • write • test-and-set-write(required-version) • Future: i) bundled updates • ii) “more” relaxed consistency to cope w/ major (regional data center) failures • read-any • read-critical(required-version) • read-latest
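
The difference between these calls is easiest to see on a single record's timeline. The sketch below is a toy model, not the PNUTS API: it keeps the master's ordered updates in a list, represents a replica simply by the version it has applied so far, and names the weakest read `read_any` after the call of that name in the PNUTS paper; the class and method signatures are hypothetical:

```python
class RecordTimeline:
    """Toy model of one record under per-record timeline consistency."""

    def __init__(self, value):
        self.updates = [value]          # updates[i] holds version i + 1

    @property
    def latest_version(self):
        return len(self.updates)

    def write(self, value):
        """Append to the timeline in the master's order; return new version."""
        self.updates.append(value)
        return self.latest_version

    def test_and_set_write(self, required_version, value):
        """Write only if the record is still at required_version."""
        if required_version != self.latest_version:
            return None                 # someone else wrote first
        return self.write(value)

    def read_any(self, replica_version):
        """Possibly stale, but always some consistent version on the timeline."""
        return replica_version, self.updates[replica_version - 1]

    def read_latest(self):
        return self.latest_version, self.updates[-1]

    def read_critical(self, replica_version, required_version):
        """Return a version at least as new as required_version."""
        if replica_version >= required_version:
            return self.read_any(replica_version)
        return self.read_latest()       # replica too stale: go to the master
```

A client that has just written version 2 can pass that number to `read_critical` to be sure it reads its own write, while `test_and_set_write` gives optimistic concurrency control for read-modify-write sequences.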

  33. PNUTS System Architecture • Tables are partitioned into tablets, each tablet stored on one server per region • each tablet: ~100s of MB to a few GB • Planned scale: 1000 servers per region, 1000 tablets each • key: 100 bytes → interval mapping table: 100s of MB of RAM • tablets of ~500 MB → a database of ~500 TB [Figure: interval mappings]
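
The sizing figures on this slide follow from simple arithmetic. The worked version below uses only the numbers given above and counts one 100-byte boundary key per tablet; any per-entry overhead beyond the key is ignored, which is an assumption:

```latex
\begin{align*}
\text{tablets per region} &= 1000 \text{ servers} \times 1000 \text{ tablets/server} = 10^{6} \\
\text{interval mapping table} &\approx 10^{6} \times 100 \text{ bytes} = 10^{8} \text{ bytes} = 100 \text{ MB} \\
\text{database size} &\approx 10^{6} \times 500 \text{ MB} = 5 \times 10^{8} \text{ MB} = 500 \text{ TB}
\end{align*}
```

So the full key-to-tablet mapping fits comfortably in the RAM of a single machine, which is why routing lookups can be served from memory.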

  34. Interval Mappings [Figure: interval mappings for an ordered table and for a hashed table]

  35. PNUTS: Other Features • Yahoo! Message Broker (YMB) • topic-based pub/sub system • together w/ PNUTS: Yahoo! Sherpa data service platform • YMB and Wide-Area Data Replication • Data updates are considered “committed” once they are published to YMB • YMB asynchronously propagates each update to different regions and applies it to all replicas • YMB provides “logging” and guarantees all published messages will be delivered to all subscribers • YMB logs are purged only after PNUTS verifies that the updates have been applied • Consistency via YMB and Mastership • YMB provides partial ordering of published messages • per-record mastering: updates are directed to the master first, then propagate to other replicas via publishing to YMB • Recovery: can survive storage unit failures; tablet boundaries sync’ed across tablet replicas; recover a lost tablet by copying a remote replica
