
ACMS: The Akamai Configuration Management System


Presentation Transcript


  1. ACMS: The Akamai Configuration Management System Author’s slides edited by Brian Cho

  2. The Akamai Platform • Akamai operates a Content Delivery Network of 15,000+ servers distributed across 1,200+ ISPs in 60+ countries • Web properties (Akamai’s customers) use these servers to bring their web content and applications closer to the end-users

  3. Problem: configuration and control • Even with the widely distributed platform, customers need to maintain control over how their content is served • Customers need to configure their service options with the same ease and flexibility as if it were a centralized, locally hosted system

  4. Problem (cont) • Customer profiles include hundreds of parameters. For example: • Cache TTLs • Allow lists • Whether cookies are accepted • Whether application sessions are stored • In addition, internal Akamai services require dynamic reconfigurations (mapping, load-balancing, provisioning services)

  5. Why is this difficult? • 15,000 servers must synchronize to the latest configurations within a few minutes • Some servers may be down or “partitioned-off” at the time of reconfiguration • A server that comes up after some downtime must re-synchronize quickly • Configuration may be initiated from anywhere on the network and must reach all other servers • Discussion: Many more locations (1,200+ ISPs in 60+ countries) than e.g. MapReduce, GFS • Discussion: Why is this easy?

  6. Outline • High-level overview of the functioning configuration system • Distributed protocols that guarantee fault-tolerance (based on earlier literature) • Operational experience and evaluation

  7. Assumptions • Configuration files may vary in size from a few hundred bytes to 100MB • Submissions may originate from anywhere on the Internet • Configuration files are submitted in their entirety (no diffs)

  8. System Requirements • High availability – system must be up 24x7 and accessible from various points on the network • Fault-tolerant storage of configuration files for asynchronous delivery • Efficient delivery – configuration files must be delivered to the “live” edge servers quickly • Recovery – edge servers must “recover” quickly • Consistency – for a given configuration file the system must synchronize to a “latest” version • Security – configuration files must be authenticated and encrypted • Discussion: Why is this easy? • No strict atomic commit required • Limited re-ordering allowed

  9. Proposed Architecture: Two Subsystems • Front-end – a small collection of Storage Points responsible for accepting, storing, and synchronizing configuration files • Back-end – reliable and efficient delivery of configuration files to all of the edge servers; leverages the Akamai CDN

  10. System flow (Publishers → Storage Points → 15,000 Edge Servers) • 1. A publisher transmits a file to a Storage Point • 2. The Storage Points store, synchronize, and upload the new file on local web servers • 3. Edge servers download the new file from the SPs via the CDN

  11. Front-end fault-tolerance • Mitigate distributed communication failures • Implement agreement protocol on top of replication • Vector Exchange: a quorum-based agreement scheme • No dependence on a single Storage Point • Eliminate dependence on any given network – SPs are hosted by distinct ISPs

  12. Quorum Requirement • Define a quorum as a majority (e.g. 3 out of 5 SPs) • A quorum of SPs must agree on a submission • Every future majority overlaps with the earlier majority that agreed on a file • If there is no quorum of alive and communicating SPs, pending agreements halt until a quorum is reestablished
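
A minimal sketch of the majority rule and the overlap property it relies on (the SP names and the is_quorum helper are illustrative, not from the slides):

```python
from itertools import combinations

# Hypothetical Storage Point names; the slide's example is 3 out of 5.
SPS = ["SP-A", "SP-B", "SP-C", "SP-D", "SP-E"]
QUORUM = len(SPS) // 2 + 1  # majority: 3 of 5

def is_quorum(acked):
    """True if the SPs that acknowledged a submission form a majority."""
    return len(set(acked)) >= QUORUM

# Any two majorities drawn from the same SP set share at least one member,
# which is why a future quorum always overlaps the one that agreed on a file.
assert all(set(a) & set(b)
           for a in combinations(SPS, QUORUM)
           for b in combinations(SPS, QUORUM))

print(is_quorum(["SP-A", "SP-C", "SP-E"]))  # True
print(is_quorum(["SP-B", "SP-D"]))          # False: no quorum, agreements halt
```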

  13. Accepting a file • A publisher contacts an accepting SP • The accepting SP replicates a temporary file to a majority of SPs • If replication succeeds the accepting SP initiates an agreement algorithm called Vector Exchange • Upon success the accepting SP “accepts” and all SPs upload the new file
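
A toy sketch of that accept flow; the StoragePoint class, method names, and the stubbed-out agreement callable are all invented for illustration (the agreement step itself is sketched under the Vector Exchange example below):

```python
import uuid

class StoragePoint:
    """Toy Storage Point, just enough state to illustrate the accept flow."""
    def __init__(self, name, reachable=True):
        self.name = name
        self.reachable = reachable
        self.temp = {}       # uid -> bytes, replicated but not yet agreed on
        self.accepted = {}   # uid -> bytes, agreed on and visible for delivery

    def store_temp(self, uid, data):
        if not self.reachable:
            return False
        self.temp[uid] = data
        return True

    def promote(self, uid):
        self.accepted[uid] = self.temp.pop(uid)

def accept_file(accepting_sp, other_sps, data, run_vector_exchange):
    """Accepting-SP flow: replicate to a majority, agree, then upload."""
    uid = str(uuid.uuid4())
    acks = int(accepting_sp.store_temp(uid, data))
    acks += sum(sp.store_temp(uid, data) for sp in other_sps)
    majority = (len(other_sps) + 1) // 2 + 1
    if acks < majority:
        return None                       # replication failed: reject submission
    if not run_vector_exchange(uid):      # quorum agreement (see the VE sketch)
        return None
    for sp in (accepting_sp, *other_sps):
        if uid in sp.temp:
            sp.promote(uid)               # the agreed file becomes downloadable
    return uid

# Usage with the agreement step stubbed out:
sps = [StoragePoint(name) for name in "ABCDE"]
print(accept_file(sps[0], sps[1:], b"<config/>", run_vector_exchange=lambda uid: True))
```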

  14. Vector Exchange (based on vector clocks) • For each agreement SPs exchange a bit vector. • Each bit corresponds to “commitment” status of a corresponding SP. • Once a majority of bits are set “agreement” takes place • When any SP “learns” of an agreement it can upload the submission

  15. Vector Exchange: an example (SPs A–E) • “A” initiates and broadcasts a vector: A:1 B:0 C:0 D:0 E:0 • “C” sets its own bit and re-broadcasts: A:1 B:0 C:1 D:0 E:0 • “D” sets its bit and re-broadcasts: A:1 B:0 C:1 D:1 E:0 • Any SP learns of the “agreement” when it sees a majority of bits set.
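
The bit-vector bookkeeping in this example can be replayed in a few lines; a simplified sketch that ignores the broadcast transport, retries, and message loss:

```python
SPS = ["A", "B", "C", "D", "E"]
MAJORITY = len(SPS) // 2 + 1   # 3 of 5

def new_vector(initiator):
    """Fresh agreement vector: one commitment bit per Storage Point."""
    return {sp: int(sp == initiator) for sp in SPS}

def receive(vector, me):
    """An SP sets its own bit and would then rebroadcast the updated vector."""
    return {**vector, me: 1}

def agreed(vector):
    """Any SP that sees a majority of bits set has learned of the agreement."""
    return sum(vector.values()) >= MAJORITY

# Replaying the slide's example: A initiates, then C and D set their bits.
v = new_vector("A")     # A:1 B:0 C:0 D:0 E:0
v = receive(v, "C")     # A:1 B:0 C:1 D:0 E:0
print(agreed(v))        # False: only 2 of 5 bits set
v = receive(v, "D")     # A:1 B:0 C:1 D:1 E:0
print(agreed(v))        # True:  3 of 5 bits set, agreement reached
```

Because each SP only ever sets its own bit, seeing any copy of the vector with a majority of bits set is enough for an SP to learn that the submission was agreed on.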

  16. Vector Exchange Guarantees • If a submission is accepted, at least a majority of SPs have stored and agreed on the submission • The agreement is never lost by a future quorum. Q: Why? • A: any future quorum contains at least one SP that saw an initiated agreement. • VE borrows ideas from Paxos, BFS [Castro, Liskov] • Weaker: cannot implement a state machine with VE • VE offers simplicity, flexibility

  17. Recovery Routine • Each SP runs a recovery routine continuously to query other SPs for “missed” agreements. • Recovery allows • SPs that experience downtime to recover state • Termination of VE messages once agreement occurs • Snapshot is a hierarchical index structure that describes latest versions of all accepted files
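
A minimal sketch of what one recovery pass might look like, modeling snapshots as flat filename-to-version maps (the real snapshot is a hierarchical index, as described in the backup slides):

```python
def recovery_round(my_snapshot, peer_snapshots):
    """One pass of the recovery loop: find agreements this SP has missed.

    Snapshots are modeled here as flat {filename: version} dicts; the real
    snapshot is a hierarchical index (see the backup slide on snapshots).
    """
    missed = {}
    for peer in peer_snapshots:
        for fname, version in peer.items():
            if version > my_snapshot.get(fname, -1):
                missed[fname] = max(version, missed.get(fname, -1))
    return missed   # files and versions to fetch and record as agreed

# Hypothetical usage: this SP was down and is behind on customer-b.xml.
mine  = {"customer-a.xml": 7, "customer-b.xml": 3}
peers = [{"customer-a.xml": 7, "customer-b.xml": 5}]
print(recovery_round(mine, peers))   # {'customer-b.xml': 5}
```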

  18. Back-end: Delivery • Processes on edge servers subscribe to specific configurations via their local Receiver process • Receivers periodically query the snapshots on the SPs to learn of any updates • If the updates match any subscriptions, the Receivers download the files via HTTP IMS (If-Modified-Since) requests
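
A rough sketch of the Receiver polling pattern, assuming a hypothetical snapshot URL and a JSON {filename: url} snapshot format; the slides only state that updates are fetched with HTTP IMS requests:

```python
import requests   # third-party HTTP client, used only to illustrate the pattern

SNAPSHOT_URL = "https://acms-sp.example.net/snapshot.json"   # hypothetical

def poll_once(session, subscriptions, last_modified=None):
    """One Receiver polling round: check the snapshot, fetch matching updates."""
    headers = {"If-Modified-Since": last_modified} if last_modified else {}
    resp = session.get(SNAPSHOT_URL, headers=headers, timeout=10)
    if resp.status_code == 304:
        return last_modified                # snapshot unchanged since last poll
    snapshot = resp.json()                  # assumed format: {filename: file_url}
    for fname, url in snapshot.items():
        if fname in subscriptions:
            session.get(url, timeout=10)    # download the updated config file
    return resp.headers.get("Last-Modified", last_modified)

# A Receiver would run this on a short interval, e.g. roughly every 30 seconds:
#   lm = None
#   while True:
#       lm = poll_once(requests.Session(), {"customer-a.xml"}, lm)
#       time.sleep(30)
```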

  19. Delivery (continued) • Delivery is accelerated via the CDN • Local Akamai caches • Hierarchical download • Optimized overlay routing • Delivery scales with the growth of the CDN • Akamai caches use a short TTL (on the order of 30 seconds) for the configuration files

  20. Operational Experience • Quorum Assumption • 36 instances of an SP disconnected from the quorum for more than 10 minutes due to network outages during Jan-Sep of 2004 • In all instances there was an operating quorum of other SPs • Shorter network outages do occur (e.g. two several-minute outages between a pair of SPs over a 6-day period) • Permanent Storage – files may get corrupted • The NOCC recorded 3 instances of file corruption on the SPs over a 6-month period – use an md5 hash when writing state files
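
One simple way to implement that integrity check (the on-disk layout here is invented; the slides only say an md5 hash is used when writing state files):

```python
import hashlib

def write_state_file(path, data):
    """Store the payload together with its md5 digest (layout is invented)."""
    digest = hashlib.md5(data).hexdigest().encode()
    with open(path, "wb") as f:
        f.write(digest + b"\n" + data)

def read_state_file(path):
    """Return the payload, or raise if the stored digest no longer matches."""
    with open(path, "rb") as f:
        stored, _, data = f.read().partition(b"\n")
    if hashlib.md5(data).hexdigest().encode() != stored:
        raise IOError(f"corrupted state file: {path}")
    return data
```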

  21. Operational Experience - safeguards • To prevent CDN-wide outages due to a corrupted configuration some files are “zoned” • Publish a file to a set of edge servers = zone 1 • If the system processes the file successfully, publish to zone 2, etc… • Receivers failover from CDN to SPs • Recovery = backup for VE – useful in building state on a fresh SP
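
A sketch of a zoned rollout loop under assumed zone sizes, soak time, and health check; the slides only describe the zone-by-zone progression:

```python
import time

# Hypothetical zones: each is a subset of edge servers that receives the file
# before the next, larger subset does.
ZONES = [
    [f"edge-{i:05d}" for i in range(0, 100)],        # zone 1: small canary set
    [f"edge-{i:05d}" for i in range(100, 2000)],     # zone 2
    [f"edge-{i:05d}" for i in range(2000, 15000)],   # zone 3: remaining servers
]

def publish_zoned(config_file, publish, healthy, soak_seconds=60):
    """Publish zone by zone, halting if any zone fails to process the file."""
    for n, zone in enumerate(ZONES, start=1):
        publish(config_file, zone)           # make the file visible to this zone
        time.sleep(soak_seconds)             # hypothetical soak time per zone
        if not healthy(zone):                # e.g. error rates, NOCC alarms
            raise RuntimeError(f"zoned rollout halted at zone {n}")
```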

  22. File Stats • Configuration file sizes range from a few hundred bytes to 100MB. The average file size is around 121KB. • Submission time dominated by replication to SPs (may take up to 2 minutes for very large files) • 15,000 files submitted over 48 hours

  23. Propagation Time • Randomly sampled 250 edge servers to measure propagation time. • 55 seconds on avg. • Dominated by cache TTL and polling intervals

  24. Propagation vs. File Sizes • Mean and 95th percentile propagation time vs. file size • 99.95% of updates arrived within 3 minutes • The rest delayed due to temporary connectivity issues

  25. Scalability • Front-end scalability is dominated by replication • With 5 SPs and a 121KB avg. file size, Vector Exchange overhead is 0.4% of bandwidth • With 15 SPs, overhead is 1.2% • For a larger footprint, hashing can be used to pick a set of SPs for each configuration file • Back-end scalability • Cacheability grows as the CDN penetrates more ISPs • Reachability of edge machines inside remote ISPs improves with more alternate paths
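
One way the per-file SP selection could work is rendezvous-style hashing; this is an assumption, since the slides only say "use hashing to pick a set of SPs" (SP hostnames are invented):

```python
import hashlib

# Hypothetical larger SP fleet; the slides evaluate 5 and 15 SPs.
ALL_SPS = [f"sp{i:02d}.example.net" for i in range(15)]

def sps_for_file(filename, k=5):
    """Pick k SPs to be responsible for one configuration file.

    Rendezvous-style hashing (an assumption, not from the slides): every SP is
    scored against the filename and the k best scores win, so any party can
    compute the same SP set with no extra coordination.
    """
    score = lambda sp: hashlib.md5(f"{filename}:{sp}".encode()).hexdigest()
    return sorted(ALL_SPS, key=score)[:k]

print(sps_for_file("customer-a.xml"))   # the same 5 SPs every time for this file
```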

  26. Conclusion • ACMS uses a set of distributed algorithms that ensure a high level of fault tolerance • The quorum-based system allows operators to ignore transient faults and gives them more time to react to significant Storage Point failures • ACMS is a core subsystem of the Akamai CDN that customers rely on to administer content

  27. Discussion • Are parts of the system generalizable? • What if the front-end implemented a state machine? • General-purpose coordination service (e.g. Chubby, ZooKeeper) • What is missing from the front-end for implementing a state machine? • No agreement on ordering

  28. Backup Slides

  29. Recovery Optimization: Snapshots • Snapshot is a hierarchical index structure that describes latest versions of all accepted files • Each SP updates its own snapshot when it learns of agreements • As part of the recovery process an SP queries snapshots on other SPs • Side-effect: snapshots are also used by the edge servers (back-end) to detect changes.
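
A toy illustration of a two-level snapshot index; the real structure and its fields are not specified at this level of detail in the slides:

```python
import hashlib
import json

def build_snapshot(accepted):
    """Toy two-level index over accepted files ({"group/name": version}).

    A peer SP (or an edge-server Receiver) compares the per-group digests
    first and only drills into groups whose digest changed.
    """
    tree = {}
    for path, version in accepted.items():
        group, _, name = path.partition("/")
        tree.setdefault(group, {})[name] = version
    digests = {g: hashlib.md5(json.dumps(files, sort_keys=True).encode()).hexdigest()
               for g, files in tree.items()}
    return {"digests": digests, "tree": tree}

snap = build_snapshot({"maps/region.xml": 12, "maps/load.xml": 4, "custA/opts.xml": 9})
print(snap["digests"])   # compare these per-group digests, then drill down
```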

  30. Operational Experience – we rely heavily on the Network Operations Control Center for early fault detection

  31. Tail of Propagation • Another random sample of 300 edge servers over a 4 day period • Measured propagation of small files (under 20KB) • 99.8% of the time file is received within 2 minutes • 99.96% of the time file is received within 4 minutes
