Scheduling Your Network Connections Mor Harchol-Balter Carnegie Mellon University Joint work with Bianca Schroeder
FCFS vs. PS vs. SRPT [diagram: three queues of jobs] Q: Which minimizes mean response time? ("size" = service requirement; load ρ < 1)
Q: Which best represents scheduling in web servers: FCFS, PS, or SRPT? ("size" = service requirement; load ρ < 1)
IDEA: How about using SRPT instead of PS in web servers? [diagram: clients 1-3 send "Get File 1/2/3" over the Internet to the WEB SERVER (Apache) running on the Linux O.S.]
Immediate Objections 1) Can't assume known job size. But many servers receive mostly static web requests ("GET FILE"). For static requests the server knows the file size, so it approximately knows the service requirement of the request. 2) But the big jobs will starve ...
Outline of Talk (THEORY and IMPLEMENT) 1) "Analysis of SRPT Scheduling: Investigating Unfairness" 2) "Size-based Scheduling to Improve Web Performance" 3) "Web servers under overload: How scheduling can help" www.cs.cmu.edu/~harchol/
THEORY SRPT has a long history ... 1966 Schrage & Miller derive M/G/1/SRPT response time: 1968 Schrage proves optimality 1979 Pechinkin & Solovyev & Yashkov generalize 1990 Schassberger derives distribution on queue length BUT WHAT DOES IT ALL MEAN?
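The Schrage-Miller result can be stated, in modern notation (an editorial addition, not the 1966 paper's own symbols), with F the job-size CDF, λ the arrival rate, and ρ(x) = λ∫₀ˣ t dF(t) the load due to jobs of size at most x:

```latex
E[T(x)]_{\mathrm{SRPT}}
  \;=\;
  \underbrace{\frac{\lambda\!\left(\int_0^x t^2\,dF(t) \;+\; x^2\,(1-F(x))\right)}
                   {2\,(1-\rho(x))^2}}_{\text{waiting time}}
  \;+\;
  \underbrace{\int_0^x \frac{dt}{1-\rho(t)}}_{\text{residence time}}
```

A job of size x waits only behind work from jobs it cannot beat (sizes ≤ x), then runs at a rate degraded only by smaller arrivals.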
THEORY SRPT has a long history (cont.) 1990 - 97 7-year long study at Univ. of Aachen under Schreiber SRPT WINS BIG ON MEAN! 1998, 1999 Slowdown for SRPT under adversary: Rajmohan, Gehrke, Muthukrishnan, Rajaraman, Shaheen, Bender, Chakrabarti, etc. SRPT STARVES BIG JOBS! Various O.S. books: Silberschatz, Stallings, Tanenbaum: warn about starvation of big jobs ... Kleinrock's Conservation Law: "Preferential treatment given to one class of customers is afforded at the expense of other customers."
Unfairness Question: Let load ρ = 0.9 and let the job-size distribution be G = Bounded Pareto(α = 1.1, max = 10^10). Question: Which queue does the biggest job prefer, PS or SRPT?
THEORY Our Analytical Results (M/G/1): All-Can-Win Theorem: Under workloads with the heavy-tailed (HT) property, ALL jobs, including the very biggest, prefer SRPT to PS, provided the load is not too close to 1. Almost-All-Win-Big Theorem: Under workloads with the HT property, 99% of all jobs perform orders of magnitude better under SRPT. Counter-intuitive: even the biggest job prefers SRPT to PS!
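For reference, the PS side of these comparisons is the classic M/G/1/PS result, in which expected response time is exactly proportional to size:

```latex
E[T(x)]_{\mathrm{PS}} \;=\; \frac{x}{1-\rho}
```

So "job x prefers SRPT" means E[T(x)] under SRPT is at most x/(1-ρ).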
What's Heavy-Tail? [log-log plot of the fraction of jobs with CPU duration > x vs. duration x (secs): Pr{Life > x} = 1/x. Berkeley Unix process CPU lifetimes [HD96]]
What's the Heavy-Tail property? Defn: a distribution is heavy-tailed if Pr{X > x} ~ x^(-α), 0 < α < 2. Many real-world workloads are well-modeled by a truncated HT distribution. Key property (HT Property): "Largest 1% of jobs comprise half the load."
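The HT property can be checked in closed form for the Bounded Pareto distribution used later in the talk (α = 1.1 and max = 10^10 are from the unfairness slide; the 512 B minimum size is an assumed value, since the slides do not give one):

```python
def top_tail_load_fraction(alpha=1.1, k=512.0, p=1e10, tail=0.01):
    """For BoundedPareto(alpha, k, p), return (x_t, frac): the size cutoff
    x_t with Pr{X > x_t} = tail, and the fraction of total load contributed
    by jobs larger than x_t. Closed form, no sampling."""
    kp = (k / p) ** alpha
    # Invert the CDF F(x) = (1 - (k/x)**alpha) / (1 - kp) at 1 - tail:
    x_t = k / (tail + (1.0 - tail) * kp) ** (1.0 / alpha)
    # int_x^p t f(t) dt is proportional to x**(1-alpha) - p**(1-alpha):
    def load_above(x):
        return x ** (1.0 - alpha) - p ** (1.0 - alpha)
    return x_t, load_above(x_t) / load_above(k)

# With these (partly assumed) parameters, the largest 1% of jobs
# carry roughly half of the total load, as the HT property states.
x_t, frac = top_tail_load_fraction()
```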
THEORY Our Analytical Results (M/G/1): All-Distributions-Win Theorem: If load ρ < 0.5, then for every job size distribution, ALL jobs prefer SRPT to PS. Bounding-the-Damage Theorem: For any load ρ, for every job size distribution, and for every size x, E[T(x)]_SRPT < (1 + ρ/(2(1-ρ))) E[T(x)]_PS. (E.g., at ρ = 0.9 no job's expected response time under SRPT exceeds 5.5 times its PS value.)
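The M/G/1 comparison behind these theorems can be evaluated numerically. The sketch below uses exponential job sizes (mean 1) purely because their truncated moments have closed forms; this is not the heavy-tailed distribution of the talk, and the function name and parameters are editorial. It integrates the Schrage-Miller SRPT response-time formula and compares against the PS closed form E[T] = E[S]/(1-ρ):

```python
from math import exp

def srpt_vs_ps(lam=0.9, n=4000, xmax=20.0):
    """Mean response time in an M/M/1 queue (mean job size 1, arrival
    rate lam, load rho = lam) under PS (closed form) and under SRPT
    (numerical integration of the Schrage-Miller formula)."""
    h = xmax / n
    xs = [i * h for i in range(n + 1)]
    F = [1.0 - exp(-x) for x in xs]                          # job-size CDF
    f = [exp(-x) for x in xs]                                # job-size density
    rho = [lam * (1.0 - exp(-x) * (1.0 + x)) for x in xs]    # load of jobs <= x
    m2 = [2.0 - exp(-x) * (x * x + 2 * x + 2) for x in xs]   # int_0^x t^2 f(t) dt
    # Residence time R(x) = int_0^x dt / (1 - rho(t)), trapezoid rule:
    R = [0.0] * (n + 1)
    for i in range(1, n + 1):
        R[i] = R[i - 1] + 0.5 * h * (1 / (1 - rho[i - 1]) + 1 / (1 - rho[i]))
    def T(i):  # E[T(x_i)] under SRPT: waiting time + residence time
        x = xs[i]
        wait = lam * (m2[i] + x * x * (1 - F[i])) / (2 * (1 - rho[i]) ** 2)
        return wait + R[i]
    # E[T]_SRPT = int T(x) f(x) dx, trapezoid rule:
    srpt = sum(0.5 * h * (T(i - 1) * f[i - 1] + T(i) * f[i])
               for i in range(1, n + 1))
    ps = 1.0 / (1.0 - lam)                                   # E[T]_PS = E[S]/(1-rho)
    return srpt, ps
```

At ρ = 0.9 the PS mean is 10 and the SRPT mean comes out noticeably smaller, even without heavy tails; heavy-tailed sizes widen the gap further.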
IMPLEMENT From theory to practice: What does SRPT mean within a Web server? • Many devices: where to do the scheduling? • No longer one job at a time.
IMPLEMENT Server's Performance Bottleneck: The site buys a limited fraction of its ISP's bandwidth. [diagram: clients 1-3 send "Get File 1/2/3" through the rest of the Internet and the ISP to the WEB SERVER (Apache) on the Linux O.S.] We model the bottleneck by limiting the bandwidth on the server's uplink.
IMPLEMENT Network/O.S. insides of a traditional Web server: [diagram: Clients 1-3 are served from Sockets 1-3, which feed the network card through the bottleneck] Sockets take turns draining --- FAIR = PS.
IMPLEMENT Network/O.S. insides of our improved Web server: [diagram: Sockets 1-3 feed the network card through S/M/L priority queues, draining in order 1st/2nd/3rd] The socket corresponding to the file with the smallest remaining data gets to feed first.
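The "smallest remaining data feeds first" rule is SRPT over sockets. A toy sketch (names and chunk size are illustrative, not kernel code) using a min-heap keyed on remaining bytes:

```python
import heapq

def drain_srpt(sockets, chunk=1):
    """Drain sockets SRPT-style: at each step the socket with the least
    remaining bytes sends one chunk. `sockets` maps socket name -> bytes
    remaining. Returns the order in which sockets finish."""
    heap = [(remaining, name) for name, remaining in sockets.items()]
    heapq.heapify(heap)                 # min-heap: least remaining on top
    finished = []
    while heap:
        remaining, name = heapq.heappop(heap)
        remaining -= chunk              # this socket feeds the link
        if remaining <= 0:
            finished.append(name)       # file fully sent
        else:
            heapq.heappush(heap, (remaining, name))
    return finished

# The smallest file completes first, the largest last:
drain_srpt({"client1": 5000, "client2": 300, "client3": 40})
```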
Experimental Setup [diagram: switched LAN of 200-client Linux machines behind WAN emulators, connected to the Apache Web server on the Linux O.S.] Implementation of SRPT-based scheduling: 1) Modifications to Linux O.S.: 6 priority levels. 2) Modifications to Apache Web server. 3) Priority algorithm design.
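With only 6 kernel priority levels, exact SRPT must be approximated by binning remaining bytes into bands. The cutoffs below are hypothetical placeholders (the slides do not give them); in practice they would be tuned to the server's file-size distribution:

```python
# Hypothetical byte cutoffs separating the 6 priority bands.
CUTOFFS = [1_000, 5_000, 20_000, 100_000, 500_000]

def priority_level(bytes_remaining):
    """Map remaining response bytes to one of 6 priority levels
    (0 = highest). Fewer bytes remaining -> higher priority,
    approximating SRPT with a fixed number of bands."""
    for level, cutoff in enumerate(CUTOFFS):
        if bytes_remaining <= cutoff:
            return level
    return len(CUTOFFS)     # largest responses get the lowest priority (5)
```

As a connection drains, its remaining-byte count shrinks and its level is recomputed, so long responses migrate toward higher priority as they near completion.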
Experimental Setup (cont.) Servers: Apache and Flash; 10 Mbps and 100 Mbps uplinks. Clients: switched LAN of 200 Linux machines behind WAN emulators, plus geographically-dispersed clients. Workload generators: trace-based, Surge, open system, partly-open. Trace-based workload: 1,000,000 requests; requested file sizes 41 B - 2 MB; distribution of requested file sizes has the HT property. Load < 1 and transient overload. Other effects: initial RTO; user abort/reload; persistent connections; etc.
Preliminary Comments • Job throughput, byte throughput, and bandwidth utilization were the same under SRPT and FAIR scheduling. • The same set of requests complete. • No additional CPU overhead under SRPT scheduling. • The network was the bottleneck in all experiments.
Results: Mean Response Time [plot: mean response time (sec) vs. load; FAIR vs. SRPT]
Results: Mean Slowdown [plot: mean slowdown vs. load; FAIR vs. SRPT]
Mean Response Time vs. Size Percentile [plot at load = 0.8: mean response time (ms) vs. percentile of request size; FAIR vs. SRPT]
Summary so far ... • SRPT scheduling yields significant improvements in Mean Response Time at the server. • Negligible starvation. • No CPU overhead. • No drop in throughput.
More questions … • So far we only showed LAN results. Are the effects of SRPT in a WAN as strong? • So far we only showed load < 1. What happens under SRPT vs. FAIR when the server runs under transient overload? -> new analysis -> implementation study
WAN EMU results: Propagation delay has an additive effect and reduces the improvement factor. [plot: FAIR vs. SRPT]
WAN EMU results: Loss has a quadratic effect and reduces the improvement factor a lot. [plot: FAIR vs. SRPT]
WAN results: geographically-dispersed clients [plots at load = 0.9 and load = 0.7]
Overload - 5 minute overview [cartoon: person asleep ("Zzzzz...") under overload]
Q: What happens under overload? A: A buildup in the number of connections. [plot: FAIR vs. SRPT] Q: What happens to response time?
Web server under overload: When the SYN queue reaches its limit, the server drops all new connection requests. [diagram: clients connect to the server's SYN queue and ACK queue, which feed the Apache processes]
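The SYN-queue behavior can be sketched as a bounded queue: requests beyond the limit are dropped, and the server accepts at a fixed rate. This is a toy model with illustrative parameters, not real kernel backlog semantics (which involve SYN cookies, retransmits, etc.):

```python
from collections import deque

def syn_queue(arrivals, limit=128, accept_per_tick=1):
    """Toy SYN-queue model: each tick, arrivals[t] new connection requests
    arrive; requests arriving while the queue is at `limit` are dropped;
    the server moves `accept_per_tick` requests onward (toward the ACK
    queue) per tick. Returns (final queue length, total drops)."""
    q = deque()
    dropped = 0
    for n in arrivals:
        for _ in range(n):
            if len(q) >= limit:
                dropped += 1            # queue full: request is dropped
            else:
                q.append(object())
        for _ in range(min(accept_per_tick, len(q))):
            q.popleft()                 # server accepts a connection
    return len(q), dropped

# Sustained overload (2 arrivals/tick vs. 1 accept/tick) fills the
# queue, after which every excess request is dropped:
syn_queue([2] * 300, limit=128)
```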
Transient Overload: the load alternates between periods with ρ > 1 and periods with ρ < 1.
Transient Overload - Baseline [plot: mean response time; FAIR vs. SRPT]
Transient overload: response time as a function of job size [plot: FAIR vs. SRPT] Small jobs win big! Big jobs aren't hurt! WHY?
FACTORS
• Baseline case
• WAN propagation delays: RTT 0 - 150 ms
• WAN loss: 0 - 15%
• WAN loss + delay: RTT 0 - 150 ms, loss 0 - 15%
• Persistent connections: 0 - 10 requests/conn.
• Initial RTO value: RTO = 0.5 sec - 3 sec
• SYN cookies: ON/OFF
• User abort/reload: abort after 3 - 15 sec, with 2, 4, 6, 8 retries
• Packet length: 536 - 1500 bytes
• Realistic scenario: RTT = 100 ms; loss = 5%; 5 requests/conn.; RTO = 3 sec; packet length = 1500 B; user aborts after 7 sec and retries up to 3 times.
Transient Overload - Realistic [plot: mean response time; FAIR vs. SRPT]
Conclusion • SRPT scheduling is a promising solution for reducing mean response time seen by clients, particularly when the load at server bottleneck is high. • SRPT results in negligible or zero unfairness to large requests. • SRPT is easy to implement. • Results corroborated via implementation and analysis.