Understanding Disk I/O By Charles Pfeiffer (888) 235-8916 CJPfeiffer@RemoteControlDBA.com www.RemoteControlDBA.com
Agenda • Arrive 0900 – 0910 • Section 1 0910 – 1000 • Break 1000 – 1010 • Section 2 1010 – 1100 • Break 1100 – 1110 • Section 3 1110 – 1200 • Break 1200 – 1330 • Section 4 1330 – 1420 • Break 1420 – 1430 • Section 5 1430 – 1520 • Break 1520 – 1530 • Q&A 1530 – 1630
Section 1 • General Information • RAID • Throughput v. Response Time
Who Is This Guy? • Been an independent consultant for 11 years • Sun Certified Systems Administrator • Oracle Certified Professional • Taught Performance and Optimization class at Learning Tree • Taught UNIX Administration class at Virginia Commonwealth University • Primarily focus on complete system performance analysis and tuning
What Is He Talking About? • Disks are horrible! • Disks are slow! • Disks are a real pain to tune properly! • Multiple interfaces and points of bottlenecking! • What is the best way to tune disk IO? Avoid it! • Disks are sensitive to minor changes! • Disks don’t play well in the SAN Box! • You never get what you pay for! • Thankfully, disks are cheap!
What Is He Talking About? (continued) • Optimize IO for specific data transfers • Small IO is easy, based on response time • Improved with parallelism, depending on IOps • Improved with better quality disks • Large IO is much more difficult • Increase transfer size. Larger IO slows response time! • Spend money on quantity not quality. Stripe wider! • You don’t get what you expect (label spec) • You don’t even come close!
Where Do Vendors Get The Speed Spec From? • 160 MBps capable does not mean 160 MBps sustained • Achieved in optimal conditions • Perfectly sized and contiguous disk blocks • Streamline disk processing • Achieved via a disk-to-disk transfer • No OS or FileSystem
What Do I Need To Know? • What is good v. bad? • What are realistic expectations in different cases? • How can you get the real numbers for yourself? • What should you do to optimize your IO?
Why Do I Care? • IO is the slowest part of the computer • IO improves slower than other components • CPU performance doubles every year or two • Memory and disk capacity double every year or two • Disk IO throughput doubles every 10 to 12 years! • A cheap way to gain performance • Disks are bottlenecks! • Disks are cheap. SANs are not, but disk arrays are!
What Do Storage Vendors Say? • Buy more controllers • Sure, if you need them • How do you know what you need? • Don’t just buy them to see if it helps • Buy more disks • Average SAN disk performs at < 1% • 50 disks performing at 1% = ½ disk • Try getting 20 disks to perform at 5% instead (= 1 whole disk)
What Do Storage Vendors Say? (continued) • Buy more cache • Sure, but it's expensive • Get all you can get out of the cheap disks first • Fast response time is good • Not if you are moving large amounts of data • Large transfers shouldn't get super-fast response time • Fast response time means you are doing small transfers
What Do Storage Vendors Say? (continued) • Isolate the IO on different subsystems • Just isolate the IO on different disks • Disks are the bottleneck, not controllers, cache, etc. • Again, expensive. Make sure you are maximizing the disks first.
What Do Storage Vendors Say? (continued) • Remove hot spots • Yes, but don't do this blindly! • Contiguous blocks reduce IOps • Balance contention (waits) v. IOps (requests) carefully! • RAID-5 is best • No it's not, it's just easier for them!
The Truth About SAN • SAN = scalability • Yeah, but internal disk capacity has caught up • SAN != easy to manage • SAN = performance • Who told you that lie? • SAN definitely != performance
The Truth About SAN (continued) • But I can stripe wider and I have cache, so performance must be good • You share IO with everyone else • You have little control over what is on each disk • Hot Spots v. Fragmentation • Small transfer sizes • Contention
How Should I Plan? • What do you need? • Quick response for small data sets • Move large chunks of data fast • A little of both • Corvettes v. Dump Trucks • Corvettes get from A to B fast • Dump Trucks get a ton of dirt from A to B fast
RAID Performance Penalties • Loss of performance for RAID overhead • Applies against each disk in the RAID • The penalties are: • RAID-0 = None • 1, 0+1, 10 = 20% • 2 = 10% • 3, 30 = 25% • 4 = 33% • 5, 50 = 43%
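The penalty table above can be turned into a quick back-of-the-envelope throughput estimate. This is a sketch only: the 60 MBps per-disk sustained rate is an assumed figure, not from the slides.

```shell
# Sketch: effective aggregate throughput of a 6-disk RAID-5 set,
# assuming 60 MBps sustained per disk (illustrative number) and
# applying the 43% RAID-5 penalty against each disk.
awk 'BEGIN {
  disks = 6; per_disk = 60; penalty = 0.43
  printf "%.1f MBps\n", disks * per_disk * (1 - penalty)
}'
# -> 205.2 MBps
```

Swap in the penalty for your RAID level (e.g. 0.20 for RAID-1/0+1/10) to compare layouts before buying hardware.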
Popular RAID Configurations • RAID-0 (Stripe or Concatenation) • Don't concatenate unless you have to • No fault-tolerance, great performance, cheap • RAID-1 (Mirror) • Great fault-tolerance, no performance gain, expensive • RAID-5 (Stripe With Parity) • Medium fault-tolerance, low performance gain, cheap
Popular RAID Configurations (continued) • RAID-0+1 (Two or more stripes, mirrored) • Great performance/fault-tolerance, expensive • RAID-10 (Two or more mirrors, striped) • Great performance/fault-tolerance, expensive • Better than RAID-0+1 • Not all hardware/software offer it yet
RAID-10 Is Better Than RAID-0+1 • Given: six disks • RAID-0+1 • Stripe disks one through three (Stripe A) • Stripe disks four through six (Stripe B) • Mirror stripe A to stripe B • Lose Disk two. Stripe A is gone • Requires you to rebuild the stripe
RAID-10 Is Better Than RAID-0+1 • RAID-10 • Mirror disk one to disk two • Mirror disk three to disk four • Mirror disk five to disk six • Stripe across the three mirrors • Lose disk two. Just disk two is gone • Only requires you to rebuild disk two as a submirror
Common Throughput Speeds (MBps) • Serial = 0.014 • IDE = 16.7, Ultra IDE = 33 • USB1 = 1.5, USB2 = 60 • Firewire = 50 • ATA/100 = 12.5, SATA = 150, Ultra SATA = 187.5
Common Throughput Speeds (MBps) (continued) • FW SCSI = 20, Ultra SCSI = 40, Ultra3 SCSI = 80, Ultra160 SCSI = 160, Ultra320 SCSI = 320 • Gb Fiber = 120, 2Gb Fiber = 240, 4Gb Fiber = 480
Expected Throughput • Vendor specs are maximum (burst) speeds • You won’t get burst speeds consistently • Except for disk-to-disk with no OS (e.g. EMC BCV) • So what should you expect? • Fiber = 80% as best-case in ideal conditions • SCSI = 70% as best-case in ideal conditions • Disk = 60% as best-case in ideal conditions • But even that is before we get to transfer size
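Those derating rules of thumb are easy to apply directly to a vendor label spec. A minimal sketch, using the Ultra320 SCSI spec from the previous slide and the 70% SCSI best-case factor above:

```shell
# Sketch: derate a vendor burst spec to a realistic best case.
# Rules of thumb from the slides: fiber 80%, SCSI 70%, disk 60%.
spec=320   # Ultra320 SCSI label spec, in MBps
awk -v spec="$spec" 'BEGIN {
  printf "SCSI best case: %.0f MBps\n", spec * 0.70
}'
# -> SCSI best case: 224 MBps
```

Remember this is still a best case in ideal conditions, before transfer size and fragmentation take their cut.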
BREAK See you in 10 minutes
Section 2 • Transfer Size • Mkfile • Metrics
Transfer Size • Amount of data moved in one IO • Must be contiguous block IO • Fragmentation carries a large penalty! • Device IOps limits restrict throughput • Maximum transfer size allowed is different for different file systems and devices • Is Linux good or bad for large IO?
Transfer Size Limits • Controllers = Unlimited • Disks and W2K3 NTFS = 2 MB • Remember the vendor Speed Spec • W2K NTFS, VxFS and UFS = 1 MB
Transfer Size Limits (continued) • NT NTFS and ext3 = 512 KB • ext2 = 256 KB • FAT16 = 128 KB • Old Linux = 64 KB • FAT = 32 KB
So Linux Is Bad?! • Again, what are you using the server for? • Transactional (OLTP) DB = fine • Web server, small file share = fine • DW, large file share = Might be a problem!
Good Transfer Sizes • Small IO / Transactional DB • Should be 8K to 128K • Tend to average 8K to 32K • Large IO / Data Warehouse • Should be 64K to 1M • Tend to average 16K to 64K • Not very proportional compared to Small IO! • And it takes some tuning to get there!
Find Your AVG Transfer Size • iostat -exn (from a live Solaris server):
                     extended device statistics              ---- errors ---
   r/s  w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b s/w h/w trn tot device
   2.8  1.1  570.7  365.3  0.0  0.1    2.9   19.0   1   3   0   0   0   0 d10
• AVG transfer size = (kr/s + kw/s) / (r/s + w/s)
• (570.7 + 365.3) / (2.8 + 1.1) = 240K
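The (kr/s + kw/s) / (r/s + w/s) arithmetic can be scripted with awk. The sample fields below mirror the d10 line on the slide; on a real box you would pipe `iostat -exn` output in instead.

```shell
# Sketch: average transfer size from iostat-style fields.
# Field order assumed: r/s w/s kr/s kw/s (as on the slide).
echo "2.8 1.1 570.7 365.3" |
awk '{ printf "avg transfer = %.0fK\n", ($3 + $4) / ($1 + $2) }'
# -> avg transfer = 240K
```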
Find Your AVG Transfer Size (continued) • PerfMon
Find Your AVG Transfer Size (continued) • AVG Disk Bytes / AVG Disk Transfers • Allow PerfMon to run for several minutes • Look at the average field for Disk Bytes/sec • Look at the average field for Disk Transfers/sec
The mkfile Test • Simple, low-overhead write of a contiguous (as much as possible) empty file • Windows has no direct equivalent; get cygwin/SFU on Windows to run the same test • 'time mkfile 100m /mountpoint/testfile' • Real is total elapsed (wall-clock) time • Sys is time spent in the kernel (writing blocks) • User is time spent in user space (keyboard/monitor interaction, application code)
The mkfile Test (continued) • User time should be minimal • Time spent in user space, not in the kernel • Not interacting with hardware • Waiting for user input, etc. • Unless it's waiting for you to respond to a prompt, like to overwrite a file
The mkfile Test (continued) • System time should be ~80% of real time • Time spent in kernel (system) space • Interacting with hardware • Doing what you asked: reading/writing disk blocks, etc. • Real - (System + User) = WAIT • Any time not directly accounted for by the kernel is time spent waiting for a resource • Usually this is waiting for disk access
The mkfile Test (continued) • Common causes for waits • Resource contention (disk or non-disk) • Disks are too busy • Need wider stripes • Not using all of the disks in a stripe • Disks repositioning • Many small transfers due to fragmentation • Bad block/stripe/transfer sizes
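The mkfile test above can be approximated on boxes without mkfile. This is a sketch, assuming bash on a Linux/cygwin system: dd writing zeros is a rough stand-in for mkfile's contiguous empty-file write, and the path `/tmp/iotest.$$` is an arbitrary choice.

```shell
# Sketch of the mkfile test without mkfile (bash assumed).
# dd writes a 100 MB file of zeros, as contiguous as the
# filesystem allows; `time` reports real/user/sys.
f=/tmp/iotest.$$
{ time dd if=/dev/zero of="$f" bs=1M count=100 2>/dev/null ; } 2>&1
ls -l "$f"
rm -f "$f"
# Read the result as: WAIT = real - (sys + user).
# A large gap usually means time spent waiting on the disks.
```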
The Right Block Size • Smaller for small IO, bigger for large IO • The avg size of data written to disk per individual write • In most cases you want to be at one extreme • As big as you can for large IO / as small as you can for small IO • Balance performance v. wasted space. Disks are cheap! • Is there an application block size? • OS block size should be <= app block size
More iostat Metrics • iostat -exn (from a live Solaris server):
                     extended device statistics              ---- errors ---
   r/s  w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b s/w h/w trn tot device
   2.8  1.1  570.7  365.3  0.0  0.1    2.9   19.0   1   3   0   0   0   0 d10
• %w (wait) = 1. Should be <= 10.
• %b (busy) = 3. Should be <= 60.
• asvc_t = 19 (ms response time). Most argue this should be <= 5, 10 or 20 with today's technology. Again, response v. throughput.
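Those three thresholds are easy to check in bulk across devices. A sketch, assuming the `iostat -exn` column layout shown above; the d10 line is from the slide, and d11 is a made-up hot device for illustration.

```shell
# Sketch: flag devices breaking the rules of thumb
# (%w > 10, %b > 60, asvc_t > 20). d11 is hypothetical.
printf '%s\n' \
  "2.8 1.1 570.7 365.3 0.0 0.1 2.9 19.0 1 3 0 0 0 0 d10" \
  "40.0 35.0 5120.0 4096.0 2.1 6.5 28.0 86.0 15 95 0 0 0 0 d11" |
awk '$9 > 10 || $10 > 60 || $8 > 20 {
  print $15 " needs attention: %w=" $9 " %b=" $10 " asvc_t=" $8
}'
# -> d11 needs attention: %w=15 %b=95 asvc_t=86.0
```

Healthy devices like d10 produce no output, so on a quiet system the report is empty.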
iostat On Windows • Not so easy • PerfMon can get you %b • Physical Disk > % Disk Time • Not available in cygwin or SFU • So what do you do for %w or asvc_t? • Not much • You can ID wait issues as demonstrated later • Depend on the array/SAN tools
vmstat Metrics • vmstat (from a live Linux server):
   procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
    r  b  w   swpd   free buff cache  si  so   bi  bo   in  cs us sy id wa
    0  0  0 163608  77620    0     0   3   1    1   0    5  11  1  3 96  0
• b + w = blocked/waiting processes
• Should be <= # of logical CPUs
• us(er) v. sy(stem) CPU time
vmstat Metrics (continued) • Is low CPU idle bad? • Low is not 0 • Idle cycles = money wasted • Need to be able to process all jobs at peak • Don’t need to be able to process all jobs at peak and have idle cycles for show! • Better off watching the run/wait/block queues • Run queue should be <= 4 * # of logical CPUs
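The two queue rules of thumb (run queue <= 4 × logical CPUs, blocked/waiting <= logical CPUs) can be checked against a vmstat data line. A sketch, using the sample line from the slide; the CPU count of 2 is an assumed value for illustration.

```shell
# Sketch: check vmstat's r (run queue) and b+w (blocked/waiting)
# columns against the rules of thumb. cpus=2 is assumed.
cpus=2
echo "0 0 0 163608 77620 0 0 3 1 1 0 5 11 1 3 96 0" |
awk -v cpus="$cpus" '{
  if      ($1 > 4 * cpus)   print "run queue too deep: " $1
  else if ($2 + $3 > cpus)  print "too many blocked/waiting: " ($2 + $3)
  else                      print "queues OK"
}'
# -> queues OK
```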
vmstat On Windows • Cygwin works (b/w consolidated to b)
vmstat On Windows (continued) • PerfMon • System time = 100% - (idle time + user time)
vmstat on Windows (continued) • PerfMon • Run Queue is per processor (<=4) • Block/Wait queue is blocking queue length
Additional Metrics • Do not swap! • On UNIX you should never swap • Use your native OS commands to verify • Don’t trust vmstat • On Windows some swap is OK • Use PerfMon to check Pages/sec. • Should be <= 100 • Use ‘free’ in cygwin
Additional Metrics (continued) • Network IO issues will make your server appear slow • 'netstat -in' displays errors/collisions • Collisions are common on auto-negotiate networks • Hard-set the switch and server link speed/mode • Use 'net statistics workstation' on Windows
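A collision count means little without relating it to output packets. A sketch of that ratio; the counter values are made-up samples, and on a real box you would pull Opkts/Collis out of `netstat -in` (column positions vary by OS).

```shell
# Sketch: collision rate from netstat -in-style counters.
# opkts/collis are sample values, not from a real interface.
opkts=1500000
collis=45000
awk -v o="$opkts" -v c="$collis" 'BEGIN {
  printf "collision rate: %.1f%%\n", 100 * c / o
}'
# -> collision rate: 3.0%
```

A sustained rate anywhere near this level is a strong hint to hard-set speed/duplex on both the switch port and the server NIC.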