OnCall: Defeating Traffic Spikes with a Free-Market Application Cluster

James Norris • Keith Coleman • Armando Fox • George Candea
Stanford University

Motivation

CNN.com: September 11, 2001
• 4x traffic in a single day
• 8x traffic on second day
• Offline for 2.5 hours, diminished service afterwards
• Forced to borrow servers from sister AOL-TW websites
[Chart: page views; values shown: 40 M, 162.4 M, 337.4 M]

Slashdot, etc.
• Slashdot Effect: knocks out sites (often at the worst possible time)
• Variable traffic: ticket sales, contests, online fashion shows, etc.

What to do?

One Option: Overprovision
+ Works for steady-state fluctuations (but is it optimal?)
– Too expensive for spike conditions (8x servers for CNN)
Think about it: it is like having a fixed-size buffer
• “Can only support 1000 entries” → lame
• Stanford Axess: “Sorry, 49 people already logged in”
• And in steady state there is so much waste
So what do we do? Use dynamic allocation

What is OnCall?
OnCall is a cluster management system designed to multiplex several (possibly competing) dynamic web applications onto a single cluster.
Goal: make spike handling possible while providing useful resource guarantees to all apps

OnCall: Overview
• Marketplace of Applications: applications rent and lend computing resources according to pre-defined market policies
• Generic Platform: based on VMs → application generic → fast app swapping

Marketplace

Market Rounds
Offline
• Each application is assigned ownership of G computers at a fixed price (or rate)
Online
• Determine market equilibrium price, P, by querying each application
• Calculate new allocation sizes at price P
• Adjust allocations, moving computers from sellers to buyers
• Repeat every time quantum, t

Offline Market: “G”
• Each app “owns” G nodes
• Resource guarantees: an app never has to sell; no matter what the price or what other apps’ demands, it is guaranteed use of its G nodes
• Can lend by choice (if there are renters at the desired price)
• Can rent extra nodes (if it needs to and/or can afford to)

Online Market (10 nodes in cluster)
The Marketplace queries each application's Policy at candidate prices:
• “How many nodes do you want for $5 each?” → 7 nodes, 5 nodes, 2 nodes. 7 + 5 + 2 = 14, but I only have 10 nodes!
• “How many nodes do you want for $10 each?” → 5 nodes, 3 nodes, 2 nodes. 5 + 3 + 2 = 10. Perfect!

Online Market: Policies
Inputs from Marketplace:
• Price P
• Performance stats: CPU usage, disk I/O, etc.
Inputs from Application:
• Application inputs: time of day, historical usage
Output:
• # of computers desired at price P

Example Market Policy
• For each round, application A computes the number of nodes, n, it needs to handle current traffic
• Ex: Application A has a price threshold of $6:
  • If (P < $6), A will ask for n nodes
  • If (P ≥ $6), A will only ask for min(n, G) nodes; it can’t afford to rent extras
• Cases shown: n < G (no spike) and n > G (spike)
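The threshold policy above could be sketched roughly as follows. The function name and parameter names are illustrative assumptions, not taken from OnCall itself; only the $6 threshold and the min(n, G) rule come from the slide.

```python
def threshold_policy(price: float, needed: int, guaranteed: int,
                     threshold: float = 6.0) -> int:
    """Return the number of nodes an app requests at the given price.

    needed     -- n, nodes required to handle current traffic
    guaranteed -- G, nodes the app owns outright
    threshold  -- price at or above which the app stops renting extras
    """
    if price < threshold:
        # Cheap enough: ask for everything current traffic requires.
        return needed
    # Too expensive: never ask for more than the guaranteed G nodes.
    return min(needed, guaranteed)


if __name__ == "__main__":
    # Spike (n > G): demand drops to G once the price reaches the threshold.
    print(threshold_policy(5.0, needed=14, guaranteed=10))
    print(threshold_policy(6.0, needed=14, guaranteed=10))
    # No spike (n < G): the app asks for n regardless of price.
    print(threshold_policy(8.0, needed=4, guaranteed=10))
```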

Finding the Equilibrium
• Sample points along the different policy functions
• Determine the price at which the total number of nodes desired by all apps equals the total number of nodes available on the cluster
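A minimal sketch of the sampling search, under stated assumptions: each policy is modeled as a function from price to nodes requested, demand does not grow as the price rises, and the search accepts the lowest sampled price whose total demand fits the cluster (exact equality may not occur on a coarse price grid). The names `find_equilibrium`, `policies`, and `prices` are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

# A policy maps a price to the number of nodes the app wants at that price.
Policy = Callable[[float], int]


def find_equilibrium(policies: Sequence[Policy], cluster_size: int,
                     prices: Sequence[float]) -> Tuple[float, List[int]]:
    """Return (price, demands) for the lowest sampled price that clears."""
    for price in sorted(prices):
        demands = [policy(price) for policy in policies]
        if sum(demands) <= cluster_size:
            return price, demands
    raise ValueError("no sampled price brings demand within the cluster size")


if __name__ == "__main__":
    # Illustrative step-shaped policies for three apps on a 10-node cluster.
    policies = [
        lambda p: 7 if p < 10 else 5,
        lambda p: 5 if p < 10 else 3,
        lambda p: 2,
    ]
    print(find_equilibrium(policies, cluster_size=10, prices=[5.0, 10.0]))
```

With these illustrative policies, demand at $5 is 14 nodes and does not fit, while demand at $10 is exactly 10 nodes, so the search stops there.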

Notes and Assumptions
• Homogeneity assumption: the cluster is assumed to be homogeneous; all nodes are rented at the same price (for simplicity)
• Swapping costs: there is a time delay cost in starting up / shutting down an app on a node. If a rental contract is renewed, the app runs on the same node.
• “P” only for extras: apps only pay price P for nodes above and beyond their own G. Ex: using 40, G = 30 → 40 - 30 = 10 nodes at price P
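The "P only for extras" rule amounts to a one-line calculation. The sketch below assumes a hypothetical helper name and an illustrative price of $5; only the 40-node, G = 30 example comes from the slide.

```python
def rental_cost(nodes_used: int, guaranteed: int, price: float) -> float:
    """Cost for one round: only nodes beyond the guaranteed G are paid for."""
    extras = max(0, nodes_used - guaranteed)
    return extras * price


if __name__ == "__main__":
    # Using 40 nodes with G = 30 means renting 10 extras at price P.
    print(rental_cost(40, 30, price=5.0))
    # Staying within G costs nothing extra.
    print(rental_cost(25, 30, price=5.0))
```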

Platform

Platform Overview
[Diagram components:]
• Internet
• L7 Load Balancers
• Cluster nodes running VMMs, OnCall Responders, and Application VMs
• Cluster node running VMM with OnCall Manager & Marketplace
• Network Attached Storage containing Application VM capsules

Runtime Operation
Runtime cycle repeats every t:
• Marketplace calculates the equilibrium price (and thus application allocations)
• Manager assigns apps to physical nodes (minimizing shutdowns and startups)
• Manager signals Responders to shut down the old app and start the new app, as necessary
• At end of round, Manager gathers new usage stats and reports them to the Market Policies
• Repeat
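One way the Manager's assignment step could minimize shutdowns and startups is a two-pass greedy scheme: keep each app on nodes where it already runs, up to its new allocation, then hand the remaining nodes to apps that still need more. This is a sketch under the homogeneity assumption, not OnCall's actual algorithm; `assign_nodes` and its argument names are hypothetical.

```python
from typing import Dict, List, Optional


def assign_nodes(current: Dict[str, Optional[str]],
                 target: Dict[str, int]) -> Dict[str, Optional[str]]:
    """Map each node to an app (or None), changing as few nodes as possible.

    current -- node name -> app currently running there (None if idle)
    target  -- app name -> number of nodes allocated for the next round
    """
    remaining = dict(target)
    assignment: Dict[str, Optional[str]] = {}
    free: List[str] = []

    # Pass 1: leave an app in place while it still has allocation left.
    for node, app in current.items():
        if app is not None and remaining.get(app, 0) > 0:
            assignment[node] = app
            remaining[app] -= 1
        else:
            free.append(node)

    # Pass 2: give freed nodes to apps whose allocation is not yet met.
    for app, count in remaining.items():
        for _ in range(count):
            if not free:
                raise ValueError("allocations exceed the number of nodes")
            assignment[free.pop()] = app

    # Any node still free stays idle this round.
    for node in free:
        assignment[node] = None
    return assignment


if __name__ == "__main__":
    current = {"n1": "A", "n2": "A", "n3": "B", "n4": None}
    print(assign_nodes(current, {"A": 1, "B": 3}))
```

Because every app keeps as many of its existing nodes as its new allocation permits, no other assignment can change fewer nodes when all nodes are interchangeable.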

Does this work?

Simulation Testbed
Three Simulations, Four Traits
• Spike handling under unconstrained resources
• Spike handling under constrained resources
• Resource guarantees
• Fast server activation
U.C. Berkeley X Cluster
• 30 nodes (double CNN.com)
• Dual 1 GHz PIII, 1.5 GB RAM
• VMware GSX Server on Linux

Sim 1: Spike Handling
• G = 10 for both apps
• App 1 handles spikes, App 2 makes $$
• Notice: lag time between node assigned → node active

Sim 2: Resource Constraints
• G1 = 12, G2 = 6, G3 = 12
• App 1 has a higher budget than App 2, but both spike
• App 1 handles spikes, App 2 sees its guarantee, App 3 makes $$
• App 2 buys more when App 1’s spike subsides

Sim 3: Fast Activation
• OnCall Optimal: load VMs from suspended state
• OnCall Limited: load VMs from shutdown state
• Standard with OS: OS already installed on node
• Standard without OS: must install OS first
Significance:
• Worst case, > 2x improvement
• When a spike lasts only 30 minutes, this is significant
• If you can start up quickly, an accurate predictor is not critical

More on Markets

Marketplace Optimality
What is “optimal”? Under resource constraints, those applications with the most utility to derive from the use of additional nodes are given those nodes.
Utility Curves
• A curve specifies the dollar value an application derives from possessing a certain number of nodes for a specific time quantum
• Trivially: utility curves are always monotonically non-decreasing (i.e. it is never worse to own more nodes at a given total cost)
• To be optimal: marginal utility curves are always monotonically non-increasing (i.e. every additional node is worth the same as or less than the one before)
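The two conditions on utility curves can be checked mechanically. The sketch below assumes a curve is given as a list where entry k is the dollar value of holding k nodes for one time quantum; the function names and the sample curves are illustrative, not from the talk.

```python
from typing import List, Sequence


def marginal_utilities(utility: Sequence[float]) -> List[float]:
    """Value added by each additional node: utility[k+1] - utility[k]."""
    return [later - earlier for earlier, later in zip(utility, utility[1:])]


def satisfies_optimality_conditions(utility: Sequence[float]) -> bool:
    """True if utility is non-decreasing and marginal utility non-increasing."""
    marginal = marginal_utilities(utility)
    non_decreasing = all(m >= 0 for m in marginal)
    diminishing = all(b <= a for a, b in zip(marginal, marginal[1:]))
    return non_decreasing and diminishing


if __name__ == "__main__":
    # Each extra node is worth less than the one before: conditions hold.
    print(satisfies_optimality_conditions([0, 10, 18, 24, 28]))
    # The second node is worth more than the first: the second condition fails.
    print(satisfies_optimality_conditions([0, 5, 15, 20]))
```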

Marketplace Fairness
Markets are optimal if they are free and fair
Anti-competitive behavior
• Monopoly/Oligopoly
• Aggressive tactics
Fairness through Regulation
• Ensure enough distinct owners → no monopoly
• Fine or ban an app that engages in overtly anti-competitive behavior

Competitive vs Cooperative
• Competitive environments. Ex: ASP, where app owners may be in competition
• Cooperative environments. Ex: search engine, Yahoogle
Quick Case Study
• App 1: Paid web search (very high value in low latency)
• App 2: Ad-supported web search (high value in low latency)
• App 3: Crawler (latency OK, starvation not)
For each app, model the utility of running at a given time
Benefit: if you add an app, you just need to model that app, not remodel the whole system

Profit Through Efficiency
“Shut Down” App
• ASP shuts down servers when it can buy them for less than the cost of keeping them running (A/C, utilities, etc.)
• ASP can then add additional capacity and sell only when profitable

Future Work

• VM caching: cache VMs to local disk (speculatively or as read from NAS)
• Fault tolerance: add master-backup fault tolerance to the OnCall Manager
• Performance statistics: provide market policies with additional statistics (e.g. end-to-end response time)
• Scalable data layer: add support for scalable persistent stores that would allow replication on the data tier
• Multiplexing: study trade-offs of running several applications on one node

Questions?
