Benchmarking Swift
Eamonn O’Toole, Mark Seger
Agenda • Benchmarking with HP’s getput • Procedure, tools and operation • Case study • Selecting servers for HP’s public cloud
The Benchmarking Bible • Scripts work best for repeatability • Both for load generation and measurement • Test from the bottom of the stack up • Longer runs tend to reduce cache effects • The middle of the test is as important as the duration • Avoid changing more than 1 thing at a time • It will take as long as it will take • There’s no such thing as a coincidence!
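To make the "scripts work best for repeatability" point concrete, a minimal wrapper along these lines is usually enough; the directory layout, object count and run shape here are made up for illustration, and the collectl/getput flags are the same ones used later in this deck (verify them on your versions):

#!/bin/bash
# repeatable-run.sh - record system metrics around a single getput run
RUNDIR=/tmp/bench/$(date +%Y%m%d-%H%M%S)   # one directory per run
mkdir -p $RUNDIR
collectl -f $RUNDIR &                      # start background recording
COLPID=$!
./getput.py -cc -oo -n1000 -s1k -tp,g,d | tee $RUNDIR/getput.out
kill $COLPID                               # stop recording; replay later with collectl -p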
Size Matters • Large Objects • IOPS are small, so pay attention to MB/sec • These use a lot of bandwidth, so make sure the network is wide enough • They also use a lot of CPU, so you could need ~1 core/stream/client • Small Objects • MB/sec is low, so pay attention to IOPS • Network bandwidth is less of a concern, but latency is • CPU requirements are relatively low as well
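A back-of-the-envelope illustration of the size trade-off above (round figures, not measurements):

# Large objects: a single 1GB PUT stream at ~100 MB/s fills most of a 1GigE
# link (1 Gb/s is roughly 115 MB/s), so a few streams saturate the network
# long before the disks do.
# Small objects: 1000 x 1KB PUTs/s is only ~1 MB/s of payload, so the network
# is nearly idle and the limits are IOPS and per-request latency.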
Collectl • Developed about a dozen years ago • Open Source on sourceforge • Collects fine-grained metrics • CPU, Disk, Network, Memory and more • Process level, including I/O • Can generate stats in real-time or record for later playback • In playback mode can summarize metrics for each process • Colplot generates plots for visualizing overall performance
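A minimal record-and-playback cycle looks something like the following; these are standard collectl options, but verify them against your installed version:

# Record CPU, disk, network and per-process data to a directory
$ collectl -scdnZ -f /var/log/collectl &

# Later: play the recording back, or emit plot-format files for colplot
$ collectl -p /var/log/collectl/*.raw.gz -scdn
$ collectl -p /var/log/collectl/*.raw.gz -scdn -P -f /tmp/plots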
Getput Tools • Designed exclusively for Swift benchmarking • Lots of options for simulating lots of behaviors • Puts, Gets, Deletes • Object sizes • Number of clients • Number of processes • Level of container sharing • Options for running tests • Ranges for numbers of objects, processes and clients • Define pre/post test initialization/analysis scripts • The complete list is beyond the scope of this talk
Getting Started • Need swift credentials exported to your environment (see the example after this slide) • If swift stat works, getput will work, and if it doesn't, it won't!

Simple put, get, del
$ ./getput.py -cc -oo -n1 -s1k -tp,g,d
Rank Test Clts Proc OSize Start    End      MB/Sec  Ops Ops/Sec Errs Latency Median LatRange
   0 put     1    1    1k 20:28:04 20:28:04   0.01    1    7.30    0   0.137  0.137 0.14-00.14
   0 get     1    1    1k 20:28:04 20:28:04   0.13    1  132.40    0   0.008  0.008 0.01-00.01
   0 del     1    1    1k 20:28:04 20:28:04   0.06    1   66.32    0   0.015  0.015 0.02-00.02

Multiple sizes, multiple numbers of processes
$ ./getput.py -cc -oo -n1 -s1k,2k -tp --procs 1,2
Rank Test Clts Proc OSize Start    End      MB/Sec  Ops Ops/Sec Errs Latency Median LatRange
   0 put     1    1    1k 20:32:55 20:32:55   0.02    1   19.92    0   0.050  0.050 0.05-00.05
   0 put     1    1    2k 20:32:55 20:32:55   0.07    1   37.50    0   0.027  0.027 0.03-00.03
Rank Test Clts Proc OSize Start    End      MB/Sec  Ops Ops/Sec Errs Latency Median LatRange
   0 put     1    2    1k 20:32:55 20:32:55   0.03    2   28.82    0   0.071  0.084 0.06-00.08
   0 put     1    2    2k 20:32:56 20:32:56   0.21    2  109.18    0   0.019  0.022 0.02-00.02

Note that 1KB PUTs are a lot slower than 2KB PUTs
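The credentials getput needs are the usual OpenStack environment variables; something like the following, with placeholder values for your own cloud:

$ export OS_AUTH_URL=https://identity.example.com/v2.0   # your Keystone endpoint
$ export OS_TENANT_NAME=my-project
$ export OS_USERNAME=my-user
$ export OS_PASSWORD=my-password
$ swift stat    # if this works, getput will work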
Watching with collectl

Large object upload
$ ./getput.py -cc -oo -n1 -s1g -tp
Rank Test Clts Proc OSize Start    End      MB/Sec  Ops Ops/Sec Errs Latency Median LatRange
   0 put     1    1    1g 20:43:51 20:44:05  77.57    1    0.08    0  13.201 13.201 13.20-13.20

Network rate is NOT smooth

collectl
#          <----CPU[HYPER]-----><----------Disks-----------><----------Network---------->
#Time      cpu sys  inter  ctxsw KBRead Reads KBWrit Writes  KBIn  PktIn  KBOut PktOut
20:52:28     4   0    440    217      0     0     12      1    13     26      3     24
20:52:29    13   3    454    100      0     0     44     11     0      3      0      2
20:52:30    10   1  14913  30949      0     0      0      0   841  14253  57030  39605
20:52:31    12   1  20892  44930      0     0      0      0  1221  20841  82666  57154
20:52:32    12   1  21092  44454      0     0      0      0  1248  21296  82315  56894
20:52:33    11   1  19808  40054      0     0      0      0  1162  19839  76518  52928
20:52:34     6   0  16505  33347      0     0      0      0   927  15824  69085  47908
20:52:35     7   0  17832  34715      0     0      0      0  1028  17541  67448  46858
20:52:36     6   0  20819  42114      0     0      0      0  1219  20785  80389  55628
20:52:37     9   0  10210  20885      0     0      0      0   591  10080  40941  28290
20:52:38     6   0  20067  39984      0     0     12      1  1160  19802  75784  52552
20:52:39     8   0  21208  44885      0     0     56     14  1263  21552  82416  56985
20:52:40    12   1  18289  36995      0     0      0      0  1073  18311  71868  49758
20:52:41     8   0  20044  37608      0     0      0      0  1223  20872  94048  64743
20:52:42     8   0  17100  28888      0     0      0      0   850  14503  91449  62516
20:52:43    12   0  19396  35053      0     0      0      0  1143  19512  92891  63792
20:52:44     6   0   5005   6023      0     0      0      0   178   3025  25813  17467
20:52:45     6   0    364    142      0     0      0      0     0      2      0      2
20:52:46     4   0    188     72      0     0      0      0     0      1      0      1
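The collectl display above is just the interactive CPU/disk/network view started alongside the getput run; an invocation along these lines (standard flags, but check your version) produces it:

# One-second samples of CPU, disk and network, with a timestamp column
$ collectl -scdn -oT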
Running a Benchmark

$ gpsuite --suite 1kobjs
Test Clts Proc OSize Start    End      MB/Sec   Ops Ops/Sec Errs Latency Median LatRange
put     1    1    1k 11:36:24 11:38:24   0.02  2081   17.34    0   0.058  0.052 0.01-00.78
get     1    1    1k 11:38:54 11:39:11   0.12  2081  119.78    0   0.008  0.007 0.01-00.27
del     1    1    1k 11:39:41 11:40:15   0.06  2081   60.50    0   0.017  0.011 0.01-00.75
put     1    2    1k 11:40:45 11:42:45   0.03  4030   33.58    0   0.060  0.052 0.01-01.03
get     1    2    1k 11:43:15 11:43:31   0.25  4030  258.12    0   0.008  0.007 0.01-00.25
del     1    2    1k 11:44:01 11:44:33   0.12  4030  126.27    0   0.016  0.011 0.01-00.76
put     1    4    1k 11:45:03 11:47:03   0.06  7864   65.50    0   0.061  0.052 0.01-00.97
get     1    4    1k 11:47:33 11:47:48   0.50  7864  514.76    0   0.008  0.007 0.01-00.22
del     1    4    1k 11:48:18 11:49:01   0.21  7864  210.04    0   0.019  0.011 0.01-00.84
put     1    8    1k 11:49:31 11:51:31   0.12 14711  122.56    0   0.065  0.052 0.01-00.99
get     1    8    1k 11:52:01 11:52:16   0.95 14711  975.96    0   0.008  0.007 0.01-00.25
del     1    8    1k 11:52:46 11:53:37   0.29 14711  298.07    0   0.027  0.011 0.01-01.23
put     1   16    1k 11:54:07 11:56:07   0.24 29435  245.23    0   0.065  0.052 0.01-01.33
get     1   16    1k 11:56:37 11:56:52   1.88 29435 1927.82    0   0.008  0.007 0.01-00.26
del     1   16    1k 11:57:23 11:58:31   0.45 29435  459.14    0   0.035  0.012 0.01-00.96
put     1   32    1k 11:59:01 12:01:01   0.38 46277  385.58    0   0.083  0.053 0.01-01.04
get     1   32    1k 12:01:31 12:01:44   3.58 46277 3662.58    0   0.009  0.007 0.01-00.62
del     1   32    1k 12:02:14 12:03:40   0.54 46277  549.55    0   0.058  0.012 0.01-03.43
put     1   48    1k 12:04:11 12:06:11   0.51 62605  521.51    0   0.092  0.054 0.01-01.49
get     1   48    1k 12:06:41 12:06:56   4.41 62605 4520.88    0   0.011  0.007 0.01-00.53
del     1   48    1k 12:07:26 12:09:07   0.63 62605  640.82    0   0.075  0.021 0.01-02.23

• PUTs scale linearly through 16 processes; rate increases are slower at 32 and 48
• GETs look really good through 32 processes and slow down a bit at 48
• DELs had some irregular latencies in the upper range
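gpsuite drives sweeps like this from a suite definition; the same shape of run can be sketched directly with getput using only the flags shown earlier (the object count and settle time are arbitrary, and gpsuite's own config format is not reproduced here):

#!/bin/bash
# hypothetical sweep: 1KB PUT/GET/DEL at increasing process counts
for p in 1 2 4 8 16 32 48; do
    ./getput.py -cc -oo -n2000 -s1k -tp,g,d --procs $p
    sleep 30   # let the cluster settle between steps
done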
Example of getput maxing out
Note – this cluster only had 1 object server

$ gpsuite --suite 1kobjs
Test Clts Proc OSize Start    End      MB/Sec   Ops Ops/Sec Errs Latency LatRange
put     1    1    1k 16:43:50 16:48:50   0.02  6116   20.39    0   0.049 0.01-01.08
put     8   32    1k 16:50:17 16:55:18   0.24 62506  207.68    0   0.154 0.01-03.84
put     8  128    1k 16:56:51 17:01:56   0.08 32430  107.51    0   1.191 0.01-07.12
put     8  256    1k 17:03:40 17:08:45   0.08 35056  115.50    0   2.216 0.01-09.40
put     8  512    1k 17:09:36 17:14:44   0.16 44663  147.10    0   3.481 0.01-308.52
put     8 1024    1k 17:15:41 17:20:49   0.16 44179  145.98    0   7.122 0.01-308.87

Wow!
• Look at the latencies growing in both average and range
• Also notice we've hit the wall at a little under 150 IOPS
• BUT Swift did keep on chugging along
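A quick sanity check on those numbers: once throughput is pinned near 150 PUTs/s, average latency is roughly the number of outstanding requests divided by that rate (Little's Law), which is exactly what the table shows:

# 1024 procs / ~146 ops/s  ->  ~7.0 s  (table reports 7.122 s)
#  512 procs / ~147 ops/s  ->  ~3.5 s  (table reports 3.481 s)
# Extra concurrency past the wall only buys queueing delay, not throughput.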
Selecting servers for HP’s public cloud • Get a better understanding of Swift performance and optimise the hardware/Swift combination • Two different hardware configurations • 12-disk data servers • Dedicated proxy servers • Data servers host account/container/object services • 5:1 data-servers:proxy-servers • 60-disk data servers • Dedicated proxy servers • Data servers host object services only • Container/account services on separate servers from object services • 1:1 data-servers:proxy-servers • Concentrate on transaction rates, especially PUTs of small objects (1KB to 10KB) • Most objects in production are small (50% <= 20KB) • A high transaction rate exercises the CPU, container & proxy services
Configuration 1
• Proxy servers
  • 12 physical cores, 2666MHz
  • 96GB RAM
  • 10 GigE
  • 2*2TB 7200 RPM drives (mirror)
  • ½ U width
• Data servers
  • 12 physical cores, 2666MHz
  • 24GB RAM
  • 1 GigE
  • 12*2TB 7200RPM drives
  • 1U high
  • Run object, container & account services
[Diagram: one proxy server (Disk 1, Disk 2) in front of data servers 1–5, each with disks 1–12]
First set of measurements: “idle” system • Idle: no external PUTs, GETs, DELETEs etc • This system has 123K containers & 17M objects per data-server • Measurements were taken with different services turned on and off (shown in the graph) • Significant “idle” CPU load • The biggest contributor to “idle” CPU burn is the container replicator
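One way to attribute that “idle” CPU burn to individual Swift daemons is collectl's process-level data; a rough example follows (the interval syntax here is from memory, so treat it as an assumption and check collectl's documentation):

# Sample process data every 5 seconds and watch the container replicator
$ collectl -sZ -i 1:5 -oT | grep swift-container-replicator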
Observations on Configuration 1 measurements • Idle CPU burn is 34%, increasing to 38% at the maximum-achieved PUT rate (approx 338 PUTs/s) • Container services are the major CPU hogs • The small amount of memory hurts performance - most reads go to disk as opposed to cache • The major source of reads is the object auditor • Object server reads grow approx. linearly with the PUT rate (6x as much read as write for 1KB PUTs) • Running the container service in conjunction with the object service hurts I/O - the container data flushes object data from cache
Conclusions from Configuration 1 measurements • PUT throughput (1KB) is limited by READ IOPs • Keep container and object services separate • The object service consumes relatively little CPU • Large amounts of RAM for buffer cache will help increase performance
Configuration 2
• Proxy/Container & Account servers
  • Same server type for proxy services & container/account services
  • 12 physical cores, 2666MHz
  • 96-192GB RAM
  • 10 GigE
  • 4*1TB 7200 RPM disks, in a variety of RAID configurations
  • ½ U width
• Object servers
  • 12 physical cores, 2666MHz
  • 96GB RAM
  • 10 GigE
  • 60*2TB 7200 RPM disks
  • 4.3U high
• Note
  • Used many combinations of server & Swift services
  • Used many variations of server details – e.g. RAID config
  • Report results for a specific server/Swift service config
[Diagram: proxy/account & container servers (disks 1–4) alongside object servers (disks 1–60)]
Observations on Configuration 2 measurements • We achieved a maximum throughput of approx 1600 PUTs/s using 1KB objects, and 2000 PUTs/s using 4KB objects • Dramatic jump in CPU usage particularly on the Object-server for the 2000 PUTs/s run • Benefiting from hyperthreading • On the Proxy/Account&Container-server, the dominant processes are the proxy-server and the container-server. • All reads are satisfied from cache on Object-server and Proxy/Account&Container-server
Conclusions from Configuration 2 measurements • Massive increase in operation throughput • 5x System 1 (per rack U) • Proxy services and account/container services can coexist • Object auditing time probably an issue with 60 disks • Estimate over 200 days for auditor to walk that many disks on “full” system • Possible solution: parallel object auditor • Patch under review https://review.openstack.org/#/c/59778/ • Next steps • Detailed object auditor measurements • Large container measurements using SSDs, striped disks
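The 200-day estimate is easy to sanity-check; the audit rate used below is an assumption (see files_per_second / bytes_per_second in the [object-auditor] section of object-server.conf), so treat this as a sketch:

# 60 disks * 2 TB = 120 TB to walk per object server
# at an assumed 10 MB/s audit rate:
#   120 * 1024 * 1024 MB / 10 MB/s = ~12.6M s = ~145 days
# small files also hit the files-per-second limit, pushing a full walk
# well past 200 days unless the auditor runs in parallel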
Links
• collectl: http://collectl.sourceforge.net/
• getput: https://github.com/markseger/getput