Industrial Systems Imranul Hoque and Sonia Jahid CS 525
ACMS: The Akamai Configuration Management System A. Sherman, P. Lisiecki, A. Berkheimer, J. Wein NSDI 2005
What is Akamai? • Akamai is a CDN • Founded by: Daniel Lewin, Tom Leighton, Jonathan Seelig, Preetish Nijhawan • Customers include: Yahoo, Google, Microsoft, Apple, Xerox, AOL, MTV … • Trivia: • Akamai is a Hawaiian word meaning intelligent • D. Lewin was aboard AA Flight 11 during 9/11 • Al-Jazeera was Akamai’s customer from March 28, 2003 to April 2, 2003
How Does Akamai Work? Image source: Wikipedia
Challenges • 15,000 servers • 1200+ different networks • 60+ countries • Customers want to maintain close control over: • HTML cache timeouts • whether to allow cookies • whether to store session data • Configuration files must be propagated quickly
Challenges (2) • Single server vs. server farm • A non-trivial fraction of servers may be down at any time • Servers are widely dispersed • Configuration changes are generated from widely dispersed places • A server recovering from failure needs to be brought up to date quickly
Requirements • High Fault Tolerance and Availability • Should have multiple entry points for accepting and storing configuration updates • Efficiency and Scalability • Must deliver updates within a reasonable time • Persistent Fault-Tolerant Storage • Must store updates and deliver them asynchronously to unavailable machines once they become available • Correctness • Should order updates correctly • Acceptance Guarantee • An accepted update submission must be propagated to the Akamai CDN
Architecture (diagram): publishers submit configuration updates to one of several ACMS Storage Points, which replicate and agree on each update before propagating it to the Akamai CDN • Steps: Publish • Accept & Upload • Replication • Agreement • Propagation
Quorum-based Replication • An update must be replicated to and agreed upon by a quorum of Storage Points (SPs) • Quorum = majority • Each SP maintains connectivity by exchanging liveness messages with its peers • The Network Operations Command Center (NOCC) observes these statistics • A red alert is raised if a majority of SPs fail to report pairwise connectivity to a quorum
Quorum-based Replication (2) • Acceptance algorithm • Two phases: Replication and Agreement • Replication phase • The accepting SP creates a temporary file with a unique filename (e.g., foo.A.1234) • Replicates this file to a quorum of SPs • If successful, starts the agreement phase • Agreement phase • Vector Exchange (VE) protocol
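A minimal sketch of what the replication phase could look like, assuming hypothetical `store`/`push` helpers on each storage point and a simple-majority quorum (none of these names come from the paper):

```python
# Hypothetical sketch of ACMS's replication phase (not the actual Akamai code).
# Assumes each storage point exposes store(name, data) locally and an RPC-like
# push(name, data) to peers, and that a quorum is a simple majority of all SPs.

import uuid

def replicate_to_quorum(accepting_sp, peers, filename, data):
    """Replicate an update to a quorum of storage points.

    Returns the temporary unique filename on success, or None if fewer
    than a quorum of storage points stored a copy.
    """
    # Unique temporary name, in the spirit of the paper's "foo.A.1234" example.
    temp_name = "%s.%s.%s" % (filename, accepting_sp.name, uuid.uuid4().hex[:4])
    total = len(peers) + 1            # all storage points, including this one
    quorum = total // 2 + 1

    accepting_sp.store(temp_name, data)
    stored = 1                        # the accepting SP keeps its own copy
    for peer in peers:
        try:
            peer.push(temp_name, data)        # assumed RPC; may fail or time out
            stored += 1
        except ConnectionError:
            continue                          # tolerate unreachable peers
        if stored >= quorum:
            return temp_name                  # enough replicas; start agreement
    return None                               # no quorum reached; reject the update
```

On success, the accepting SP would then run the agreement phase (Vector Exchange) for this temporary file.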
Vector Exchange (example) • Three SPs A, B, C each keep a bit vector with one bit per SP for the update • The accepting SP broadcasts the vector with only its own bit set (A sends 1,0,0) • Each SP that receives a vector sets its own bit, stores the vector, and rebroadcasts it (1,1,0 and 1,0,1, then 1,1,1) • Once an SP sees a quorum of bits set, it considers the update agreed upon
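A hedged sketch of the Vector Exchange idea; the class, method names, and message handling are illustrative, not ACMS's actual implementation:

```python
# Illustrative sketch of Vector Exchange (VE): one bit per storage point,
# agreement once a quorum of bits is set.

class StoragePoint:
    def __init__(self, index, total):
        self.index = index            # this SP's position in the bit vector
        self.total = total            # total number of storage points
        self.vectors = {}             # update UID -> known bit vector
        self.agreed = set()           # UIDs this SP knows are agreed upon

    def quorum(self):
        return self.total // 2 + 1

    def on_vector(self, uid, vector):
        """Handle a VE broadcast: merge it with what we know, set our own bit,
        and declare agreement once a quorum of bits is set."""
        known = self.vectors.get(uid, [0] * self.total)
        merged = [x | y for x, y in zip(known, vector)]
        merged[self.index] = 1                    # set our own bit
        self.vectors[uid] = merged
        if sum(merged) >= self.quorum():
            self.agreed.add(uid)                  # quorum reached: agreed
        return merged                             # value to rebroadcast to peers

# The accepting SP (A) starts by processing an all-zero vector itself and
# broadcasting the result; C has not yet received anything in this toy run.
a, b, c = [StoragePoint(i, 3) for i in range(3)]
v = a.on_vector("foo.A.1234", [0, 0, 0])     # A's vector: [1, 0, 0]
v = b.on_vector("foo.A.1234", v)             # B's vector: [1, 1, 0] -> quorum of 2
print(b.agreed)                               # {'foo.A.1234'}
```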
Recovery via Index Merging • The acceptance algorithm guarantees that at least a quorum of SPs stores each update • SPs must sync up on any updates they missed • The recovery protocol is known as Index Merging • Configuration files are organized in a tree: the Index Tree • Configuration files are split into groups • A Group Index file lists the UIDs of the latest agreed-upon updates for each file in the group • The Root Index file lists all Group Index files with their latest modification timestamps
Recovery via Index Merging (2) • In each round a Storage Point: • picks a random set of (Q-1) other SPs [Q = majority] • downloads and parses the index files from those SPs • on detecting a more recent UID for a file, updates its tree and downloads the file from one of its peers • To avoid frequent parsing: • SPs remember the timestamps of one another's index files • and use HTTP If-Modified-Since (IMS) requests
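An illustrative sketch of one index-merging round under these assumptions: each SP serves its root index over HTTP, index entries map a filename to a (UID, URL) pair, UIDs compare by recency, and `parse_index`/`download` are hypothetical helpers:

```python
# Sketch of one index-merging round; helper names and index format are
# assumptions for illustration, not the paper's on-the-wire format.

import random
import urllib.error
import urllib.request

def merge_round(self_sp, all_peers, quorum_size):
    peers = random.sample(all_peers, quorum_size - 1)     # Q-1 random peers
    for peer in peers:
        req = urllib.request.Request(peer.root_index_url)
        # HTTP If-Modified-Since avoids re-downloading an unchanged index.
        if peer.root_index_url in self_sp.last_seen:
            req.add_header("If-Modified-Since", self_sp.last_seen[peer.root_index_url])
        try:
            with urllib.request.urlopen(req, timeout=5) as resp:
                remote_index = self_sp.parse_index(resp.read())       # assumed parser
                self_sp.last_seen[peer.root_index_url] = resp.headers.get("Date", "")
        except urllib.error.HTTPError as err:
            if err.code == 304:
                continue                     # index unchanged since last round
            raise
        except OSError:
            continue                         # peer unreachable: try the next one
        for fname, (uid, url) in remote_index.items():
            local_uid = self_sp.index.get(fname, (None, None))[0]
            if local_uid is None or uid > local_uid:
                self_sp.download(fname, url)          # fetch the newer version
                self_sp.index[fname] = (uid, url)     # and update our index tree
```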
Data Delivery • Receivers run on each of 15,000 nodes and check for configuration file updates • Receivers learn about latest configuration files in the same way SPs merge their index files • Receivers are interested in a subset of the index tree that describes their subscription • Receivers periodically query the snapshots on the SPs to learn of any updates • If the updates match any subscriptions, receivers download the files via HTTP-IMS requests • Optimized download due to Akamai caching
Operational Experience • Prototype version of ACMS • Consisted of a single primary accepting Storage Point replicating submissions to a few secondary SPs • Drawback: single point of failure • Quorum assumption • 36 instances of an SP being disconnected from the quorum for more than 10 minutes due to network outages during January-September 2004 • In all instances there was an operating quorum among the other SPs
Evaluation • 14,276 total file submissions with 5 SPs over a 48 hour period • Average file size: 121 KB
Evaluation (2) • Figure: propagation times over a random sample of 250 nodes • Tail-of-propagation measurement: • another random sample, of 300 machines, over a 4-day period • looked at propagation of short files (< 20 KB) • 99.8% of the time a file was received within 2 minutes of becoming available • 99.96% of the time within 4 minutes
Evaluation (3) • Scalability • Front-end scalability is dominated by replication • For 5 SPs and an average file size of 121 KB, VE overhead is 0.4% of the replication bandwidth • For 15 SPs, VE overhead is 1.2% • For larger numbers of SPs, consistent hashing can be used to split the actual storage
Discussion • Fault-tolerant replication • Distributed file systems: Coda, Pangaea, Bayou • All of these attempt to improve availability at the expense of consistency • ACMS must provide a high level of consistency • The two-phase acceptance algorithm used by ACMS is similar in nature to Two-Phase Commit • VE was inspired by the concept of vector clocks and uses a quorum-based approach similar to Paxos and BFS • Comparison with software update systems • LCFG and Novadigm: span a single network or a few networks • Windows Update: updates can be delayed; centralized
Bigtable: A Distributed Storage System for Structured Data F. Chang, J. Dean, S. Ghemawat, W. Hsieh, D. Wallach, M. Burrows, T. Chandra, A. Fikes, R. Gruber OSDI 2006
What is Bigtable? • A distributed storage system for managing structured data • A sparse, distributed, persistent, multidimensional sorted map • Designed to scale to very large amounts of data • Implemented and used by Google • Web indexing • Google Earth • Orkut • Etc.
Is It a Database? • Doesn't support a full relational data model • Each value in Bigtable is an uninterpreted array of bytes • Doesn't speak SQL • Wouldn't pass the ACID test (only single-row operations are atomic)
Data Model • Indexed by: (row: string, column: string, time: int64) → string (an uninterpreted array of bytes)
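A toy illustration of this three-dimensional map as a nested Python dict, loosely modeled on the paper's Webtable example (the concrete rows, columns, timestamps, and helper below are illustrative):

```python
# Toy model of Bigtable's data model:
# (row key, column key, timestamp) -> uninterpreted bytes.

table = {
    "com.cnn.www": {                        # row key (reversed URL)
        "contents:":         {6: b"<html>...v6", 5: b"<html>...v5"},
        "anchor:cnnsi.com":  {9: b"CNN"},
        "anchor:my.look.ca": {8: b"CNN.com"},
    },
}

def read(row, column, timestamp=None):
    """Return the value at (row, column) with the largest timestamp not
    exceeding `timestamp` (or the most recent version if None)."""
    versions = table[row][column]
    candidates = [t for t in versions if timestamp is None or t <= timestamp]
    return versions[max(candidates)]

print(read("com.cnn.www", "contents:"))      # most recent version (t=6)
print(read("com.cnn.www", "contents:", 5))   # version as of timestamp 5
```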
Data Model: Rows • Row keys are arbitrary strings, up to 64 KB in size • Reads/writes under a single row key are atomic • Data is maintained in lexicographic order by row key • The row range of a table is dynamically partitioned into tablets • A tablet (100-200 MB) is the unit of distribution and load balancing • Tablets allow efficient reads of short row ranges
Data Model: Column Families • Column keys (family:qualifier, where the qualifier is optional) are grouped into column families • A table has a small number (at most in the hundreds) of distinct families, but an unbounded number of columns
Data Model: Timestamps • Each cell may contain multiple versions of the data, indexed by timestamp • Timestamps are assigned by Bigtable or by client applications • The client specifies either: keep the last n versions of a cell, or keep only versions recent enough (e.g., written in the last 7 days) • Supports a garbage-collection mechanism for stale versions
Building Blocks • Built on several other pieces of Google infrastructure • The distributed Google File System (GFS) • A cluster management system • Handles machine failures, monitors machine status, manages resources, schedules jobs • The SSTable file format • Provides a persistent, ordered, immutable map from keys to values • Internally contains a sequence of blocks, typically 64 KB each • A block index is used to look up blocks
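A rough sketch of an SSTable-style lookup under that description: sorted blocks plus an in-memory block index, so a lookup binary-searches the index and then reads at most one block. The layout and names are simplifications, not Google's implementation:

```python
# Sketch of an SSTable-style lookup: binary-search the in-memory block index,
# then read a single block.

import bisect

class SSTable:
    def __init__(self, block_index, read_block):
        # block_index: sorted list of (last_key_in_block, offset, length)
        self.block_index = block_index
        self.read_block = read_block       # callable: (offset, length) -> dict

    def get(self, key):
        last_keys = [entry[0] for entry in self.block_index]
        i = bisect.bisect_left(last_keys, key)    # first block that may hold key
        if i == len(self.block_index):
            return None                            # key beyond the table's range
        _, offset, length = self.block_index[i]
        block = self.read_block(offset, length)    # one read (or block-cache hit)
        return block.get(key)                      # None if absent in the block

# Tiny usage with an in-memory stand-in for the on-disk blocks:
blocks = {0: {"apple": b"1", "cat": b"2"}, 1: {"dog": b"3", "zebra": b"4"}}
index = [("cat", 0, 64), ("zebra", 1, 64)]         # (last key, offset, length)
sst = SSTable(index, lambda off, length: blocks[off])
print(sst.get("dog"))    # b"3"
print(sst.get("ant"))    # None
```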
Building Blocks (2) • The Chubby lock service: • Provides directories and small files that can be used as locks • A client maintains a session with the Chubby service • Bigtable uses Chubby to: • Ensure there is at most one active master at a time • Store the bootstrap location of Bigtable data • Discover live tablet servers • If Chubby becomes unavailable for an extended period, Bigtable becomes unavailable
Implementation • Three major components: • A library linked into every client • One master server • Assigns tablets to tablet servers (TS) • Detects the addition and expiration of tablet servers • Balances tablet-server load • Performs garbage collection of files in GFS • Handles schema changes
Implementation (2) • Many tablet servers • Each manages a set of tablets (ten to a thousand tablets per server) • Handles read/write requests to its tablets • Splits tablets that have grown too large
Tablet Location • A three-level hierarchy: a Chubby file points to the root tablet, the root tablet points to METADATA tablets, and METADATA tablets point to user tablets • The client library caches tablet locations • If the client cache is empty, locating a tablet takes 3 network round-trips • If the cache is stale, up to 6 round-trips • Clients also prefetch tablet locations
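A hedged sketch of that lookup path with client-side caching; `chubby_root_location` and `read_location_row` are stand-ins for real RPCs, and the cache here is keyed per row for simplicity (the real cache is keyed by tablet row range):

```python
# Sketch of the three-level tablet-location lookup; the helpers below are
# illustrative stubs, not Bigtable's actual API.

def chubby_root_location():
    """Stub: read the root tablet's location from a well-known Chubby file."""
    return "tabletserver-1"

def read_location_row(server, key):
    """Stub: in reality, a Bigtable read of a location row on `server`."""
    return "tabletserver-for-%s" % (key,)

location_cache = {}   # (table, row_key) -> tablet server address

def locate_tablet(table, row_key):
    """Walk Chubby -> root tablet -> METADATA tablet to find the tablet server
    holding `row_key`, caching the result for later lookups."""
    cache_key = (table, row_key)
    if cache_key in location_cache:
        return location_cache[cache_key]                  # cache hit: 0 round-trips

    root = chubby_root_location()                                    # round-trip 1
    meta = read_location_row(root, ("METADATA", table, row_key))     # round-trip 2
    server = read_location_row(meta, (table, row_key))               # round-trip 3

    location_cache[cache_key] = server
    return server

print(locate_tablet("Webtable", "com.cnn.www"))
```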
Tablet Assignment • A tablet server acquires an exclusive lock on a uniquely named file in a specific Chubby directory when it starts • The master periodically polls each TS about the status of its lock • If a TS reports that it has lost its lock (or the master cannot reach it), the master tries to acquire that lock itself; if it succeeds, Chubby is alive and the TS is presumed dead or unreachable • The master then deletes the TS's file and moves the tablets assigned to it into the set of unassigned tablets
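A sketch of that health check from the master's side, assuming hypothetical `try_acquire`/`delete` Chubby helpers and a `report_lock_status` RPC (not the real interfaces):

```python
# Illustrative sketch of the master's tablet-server health check.

def check_tablet_server(master, chubby, ts):
    """Reassign a tablet server's tablets if it has lost its Chubby lock."""
    try:
        holds_lock = ts.report_lock_status()      # periodic poll over RPC (assumed)
    except ConnectionError:
        holds_lock = False                        # unreachable: treat as lock lost

    if holds_lock:
        return                                    # server is healthy; nothing to do

    # Try to grab the server's lock ourselves: success means Chubby is alive,
    # so the tablet server itself must be dead or partitioned away.
    if chubby.try_acquire(ts.lock_file):
        chubby.delete(ts.lock_file)               # ensure the TS can never serve again
        orphaned = master.assignments.pop(ts, set())
        master.unassigned_tablets.update(orphaned)
```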
Tablet Assignment (2) • When a master is started by the cluster management system: • It grabs a unique master lock in Chubby • Finds live tablet servers via Chubby • Discovers existing tablet assignments from the tablet servers • Scans the METADATA table to learn the full set of tablets • Adds tablets not already assigned to the unassigned set • The set of tablets changes when tablets are split or merged, or when tables are created or deleted
Tablet Serving • A tablet's persistent state (a commit log plus a set of SSTables) is stored in GFS • Recent updates are stored in memory in a sorted buffer, the memtable • Reads and writes are performed after an authorization check
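A simplified sketch of this read/write path: writes append to the commit log and update the memtable; reads merge the memtable with the tablet's SSTables. Class and method names here are illustrative only:

```python
# Simplified tablet read/write path (not Bigtable's code).

class Tablet:
    def __init__(self, sstables, commit_log):
        self.memtable = {}            # sorted buffer of recent writes
        self.sstables = sstables      # immutable on-disk maps, newest first
        self.commit_log = commit_log  # append-only log in GFS (here: a list)

    def write(self, key, value):
        self.commit_log.append((key, value))   # redo record first, for recovery
        self.memtable[key] = value             # then apply to the memtable

    def read(self, key):
        # Merged view: memtable first (newest data), then newer-to-older SSTables.
        if key in self.memtable:
            return self.memtable[key]
        for sstable in self.sstables:
            if key in sstable:
                return sstable[key]
        return None

t = Tablet(sstables=[{"a": b"old"}], commit_log=[])
t.write("a", b"new")
print(t.read("a"))   # b"new" (memtable shadows the older SSTable value)
```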
Compactions • Minor compaction: freezes the memtable and writes it out as an SSTable, shrinking tablet-server memory usage • Merging compaction: merges a few SSTables and the memtable into a new SSTable • Major compaction: a merging compaction that rewrites all SSTables into exactly one SSTable • An SSTable produced by a major compaction contains no deleted data
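A sketch of a merging compaction in that spirit: combine the memtable and a few SSTables (all treated here as dicts, ordered from newest to oldest) into one new sorted SSTable, keeping only the newest value per key; this is not Bigtable's code:

```python
# Sketch of a merging compaction: the newest source wins for each key.

def merging_compaction(memtable, sstables):
    """memtable and sstables are dicts, ordered newest (memtable) to oldest.
    Returns the new SSTable's contents as a sorted list of (key, value)."""
    sources = [memtable] + list(sstables)
    merged = {}
    for source in reversed(sources):   # iterate oldest-first...
        merged.update(source)          # ...so newer values overwrite older ones
    return sorted(merged.items())      # SSTables are sorted by key

new_sstable = merging_compaction({"b": 2}, [{"a": 1, "b": 1}, {"c": 0}])
print(new_sstable)   # [('a', 1), ('b', 2), ('c', 0)]
```

A major compaction would additionally drop deletion markers here, which is why its single output SSTable contains no deleted data.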
Refinements • Locality groups: group multiple column families together, e.g., the language and checksum families in the Webtable example • Compression: clients control whether the SSTables for a locality group are compressed (applied per block) • Caching: tablet servers use a scan cache (key/value pairs) and a block cache (SSTable blocks read from GFS) • Bloom filters: let a tablet server ask whether an SSTable might contain data for a given row/column pair, avoiding disk reads for most lookups of nonexistent rows or columns
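A minimal Bloom filter sketch to illustrate the last point: a compact probabilistic set that answers "definitely not present" or "possibly present". The sizing, hashing, and key format below are illustrative, not Bigtable's:

```python
# Minimal Bloom filter: check it before reading an SSTable from disk.

import hashlib

class BloomFilter:
    def __init__(self, size_bits=8192, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

# Per-SSTable usage: add each row/column key at write time, check at read time.
bf = BloomFilter()
bf.add("com.cnn.www/contents:")
print(bf.might_contain("com.cnn.www/contents:"))   # True (possibly present)
print(bf.might_contain("com.example/anchor:x"))    # almost certainly False
```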
Performance Evaluation • Metric: number of 1000-byte values read/written per second per tablet server • N tablet servers, with N varied • Tablet servers, master, test clients, and GFS servers all ran on the same set of machines • Row keys were partitioned into 10N equal-sized ranges
Discussion • Provides a client API in C++ • Achieves high availability and performance • Works well in practice: more than 60 Google applications use it • A good and flexible industrial storage system
MapReduce: Simplified Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat OSDI 2004
Motivation • Special-purpose computations at Google process large amounts of raw data: • Crawled documents • Web request logs • The output is: • Inverted indices • A representation of the graph structure of web documents • The number of pages crawled per host • The most frequent queries in a given day • The input data is large, so the computation must be distributed across many machines
Motivation (2) • Issues: • How to parallelize the computation • How to distribute the data • How to handle failures • All of these issues obscure the originally simple computation with large amounts of complex code • Solution: MapReduce
MapReduce • A programming model and associated implementation for processing and generating large data sets • Consists of Map and Reduce functions • Inspired by the map and reduce primitives of Lisp and other functional languages • Map is applied to the input to compute a set of intermediate key/value pairs • Reduce combines the derived data appropriately • Allows large computations to be parallelized easily
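A hedged word-count sketch of this model in Python (the paper describes a C++ interface); the tiny sequential driver stands in for the distributed runtime, and the documents match the worked example on the next slides:

```python
# Word count expressed as user-supplied map and reduce functions.

from collections import defaultdict

def map_fn(doc_name, contents):
    """Map: emit an intermediate (word, 1) pair for each word."""
    for word in contents.split():
        yield (word, 1)

def reduce_fn(word, counts):
    """Reduce: sum all counts emitted for the same word."""
    return (word, sum(counts))

def run_mapreduce(documents):
    """Tiny sequential driver standing in for the distributed runtime."""
    intermediate = defaultdict(list)
    for name, contents in documents.items():
        for key, value in map_fn(name, contents):
            intermediate[key].append(value)          # group by intermediate key
    return dict(reduce_fn(k, v) for k, v in sorted(intermediate.items()))

docs = {"t1": "it is what it is", "t2": "what is it", "t3": "it is a banana"}
print(run_mapreduce(docs))   # {'a': 1, 'banana': 1, 'is': 4, 'it': 4, 'what': 2}
```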
Example • Text 1: it is what it is • Text 2: what is it • Text 3: it is a banana • Map tasks are assigned to workers • Example inspired by the authors' original OSDI presentation
Example (2): Map • Worker 1: • (it 1), (is 1), (what 1), (it 1), (is 1) • Worker 2: • (what 1), (is 1), (it 1) • Worker 3: • (it 1), (is 1), (a 1), (banana 1)
Example (3): Reduce Input • Worker 1: • (a 1) • Worker 2: • (banana 1) • Worker 3: • (is 1), (is 1), (is 1), (is 1) • Worker 4: • (it 1), (it 1), (it 1), (it 1) • Worker 5: • (what 1), (what 1)
Example (4): Reduce Output • Worker 1: • (a 1) • Worker 2: • (banana 1) • Worker 3: • (is 4) • Worker 4: • (it 4) • Worker 5: • (what 2)
Execution Overview (2) • Master data structures • State and identity of the worker for each map and reduce task • Locations and sizes of the intermediate files produced by completed map tasks • Fault tolerance • The master pings workers periodically • In-progress tasks on a failed worker are reset to idle and rescheduled; completed map tasks are re-executed as well (their output lives on the failed worker's local disk), while completed reduce tasks are not (their output is in the global file system) • Assumes that master failure is unlikely
Execution Overview (3) • Locality • MapReduce takes the location information of the input files into account and attempts to schedule each map task on or near a machine holding a replica of its input • Task granularity • M and R should be much larger than the number of workers, for dynamic load balancing and faster failure recovery • R is often constrained by users because the output of each reduce task is a separate file • Backup tasks • Some machines may take an unusually long time (stragglers) • Near the end of the job, the master schedules backup executions of the remaining in-progress tasks
Refinements • Partitioning function • By default, key k is sent to reduce worker hash(k) mod R • Users can specify e.g. hash(Hostname(k)) mod R to keep URLs from the same host in the same output file • Combiner function • Data can be partially merged on the map worker before being sent over the network • Skipping bad records • Optionally skip the few records on which user code deterministically crashes • Local execution • Run the computation sequentially on a local machine for debugging • Status information • Shows progress, errors, and output files
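A sketch of a custom partitioner and a word-count combiner under those descriptions; `zlib.crc32` is used as a stand-in hash so the partition is stable across runs, and all names here are illustrative:

```python
# Partitioning functions and a combiner for the word-count example.

from collections import defaultdict
from urllib.parse import urlparse
import zlib

def hash_partition(key, R):
    # Stable hash (crc32) so the same key always lands on the same reducer.
    return zlib.crc32(key.encode()) % R

def hostname_partition(url_key, R):
    # Variant in the spirit of hash(Hostname(k)) mod R: URLs from the same
    # host end up in the same reduce partition, hence the same output file.
    return zlib.crc32(urlparse(url_key).netloc.encode()) % R

def combine(pairs):
    """Combiner for word count: pre-sum counts per key on the map worker,
    shrinking the data shipped to reducers."""
    partial = defaultdict(int)
    for word, count in pairs:
        partial[word] += count
    return list(partial.items())

print(hash_partition("banana", 5))
print(hostname_partition("http://example.com/a/b", 5))
print(combine([("it", 1), ("is", 1), ("it", 1)]))   # [('it', 2), ('is', 1)]
```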