I/O Performance Analysis and Tuning: From the Application to the Storage Device Henry Newman Instrumental, Inc. MSP Area Computer Measurement Group March 2, 2006
Tutorial Goal • To provide the attendee with an understanding of the history and techniques used in I/O performance analysis. “Knowledge is Power” - Sir Francis Bacon
Agenda • Common terminology • I/O and applications • Technology trends and their impact • Why knowing the data path is important • Understanding the data path • Application performance • Performance analysis • Examples of I/O performance issues • Summary
Common Terminology Using the same nomenclature
The Data Path • Before you can measure system efficiency, it is important to understand how the H/W and S/W work together end-to-end or along the “data path”. • Let’s review some of the terminology along the data path…
Terminology/Definitions • DAS • Direct Attached Storage • SAN • Storage Area Network • NAS • Network Attached Storage shared via TCP/IP • SAN Shared File System • File system that supports shared data between multiple servers • Sharing is accomplished via a metadata server or distributed lock manager
[Diagram: Direct Attached Storage — clients on a Local Area Network (LAN) connect to Servers 1 through N, each running an application and file system; each server attaches through its own F/C switch to its own disks]
[Diagram: Storage Area Network — clients on a Local Area Network (LAN) connect to Servers 1 and 2, each running an application and file system; both servers attach through a shared F/C switch to a RAID controller and its disks]
[Diagram: Network Attached Storage — clients and application Servers 1 through N on a Local Area Network (LAN) access a NAS server whose O/S and file system own the disks]
File System Terminology • File System Superblock • Describes the layout of the file system • The location and designation of volumes being used • The type of file system, layout, and parameters • The location of file system metadata within the file system and other file system attributes • File System Metadata • This is the data which describes the layout of the files and directories within a file system • File System Inode • This is file system data which describes the location, access information and other attributes of files and directories
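As a small illustration (a hypothetical sketch, not part of the original tutorial), the POSIX stat() call returns the inode attributes a file system keeps for a file; st_blksize also hints at the I/O size the file system prefers.

/* Hypothetical sketch: reading inode attributes with stat(2).
   Shows that the inode, not the file data, holds size, block
   allocation, and preferred block size information. */
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
    struct stat sb;
    const char *path = (argc > 1) ? argv[1] : "/etc/hosts";

    if (stat(path, &sb) != 0) {
        perror("stat");
        return 1;
    }
    printf("inode:           %llu\n", (unsigned long long)sb.st_ino);
    printf("size (bytes):    %lld\n", (long long)sb.st_size);
    printf("blocks (512 B):  %lld\n", (long long)sb.st_blocks);
    printf("preferred block: %ld\n",  (long)sb.st_blksize);
    return 0;
}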
Other Terms • HSM - Hierarchical Storage Management • Management of files that are viewed as if they are on disk within the file system, but are generally stored on a near-line or off-line device such as tape, SATA, optical, and/or other lower-performance media • LUN - Logical Unit Number • Term used in the SCSI protocol to address a logical unit presented by a target, such as a disk drive, a RAID array volume, and/or a tape drive
Other Terms (cont.) • Volume Manager (VM) • Manages multiple LUNs grouped into a file system • Can usually stripe data across volumes or concatenate volumes (filling one volume before moving on to fill the next) • Striping usually allocates a fixed stripe size to each LUN before allocating to the next LUN (see the sketch below) • For well-tuned systems this stripe size is the RAID allocation (stripe width) or a multiple of it
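To make the striping arithmetic concrete, here is a minimal sketch of how a volume manager might map a byte offset onto a stripe group; the 512 KB stripe size and 8-LUN count are illustrative assumptions, not values from the tutorial.

/* Hypothetical sketch of striped allocation arithmetic: given a byte
   offset into a striped volume, compute which LUN the data lands on
   and the offset within that LUN. */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    const uint64_t stripe_size = 512 * 1024;  /* 512 KB per LUN per pass */
    const uint64_t lun_count   = 8;           /* e.g. 8 data LUNs        */
    uint64_t offsets[] = { 0, 262144, 524288, 4194304 };

    for (size_t i = 0; i < sizeof(offsets) / sizeof(offsets[0]); i++) {
        uint64_t off    = offsets[i];
        uint64_t stripe = off / stripe_size;        /* which stripe unit     */
        uint64_t lun    = stripe % lun_count;       /* round-robin over LUNs */
        uint64_t in_lun = (stripe / lun_count) * stripe_size
                          + off % stripe_size;      /* offset inside the LUN */
        printf("offset %10llu -> LUN %llu, LUN offset %llu\n",
               (unsigned long long)off, (unsigned long long)lun,
               (unsigned long long)in_lun);
    }
    return 0;
}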
Round-Robin File System Allocation [Diagram: a file system stripe group of LUNs used to match bandwidth; files 1–8 are allocated round-robin, each file placed whole on the next LUN in turn]
Round-Robin File System [Diagram: the stripe group populated with three files (File 1, File 2, File 3), shown again after File 3 is removed]
Striped File System Allocation • With stripe allocation, all writes go to all devices based on the allocation within the volume manager • Each file is allocated not on a single disk, but across all disks [Diagram: files 1–8 striped across every LUN in the group]
Striped File System Allocation [Diagram: the group populated with three striped files; the free space is fragmented after File 3 is removed]
Microsoft NTFS Layout [Diagram: a newly formatted NTFS volume — boot area, MFT, free space, metadata, free space] • Data and metadata are mixed and can easily become fragmented • Head seeks on the disks are a big issue • Given the different access patterns for data (long-block sequential) and metadata (short-block random)
SAN Shared File System (SSFS) • The ability to share data between systems directly attached to the same devices • Accomplished through SCSI connectivity and a specialized file system and/or communications mechanism • Fibre Channel • iSCSI • Other communications methods
SAN Shared File System (SSFS) • Different types of SAN file systems allow multiple writers to the same file system, and even the same file open from more than one machine • POSIX semantics were never designed with shared file systems in mind • Let’s take a look at 2 different types of SSFS…
[Diagram: Centralized Metadata SSFS — clients on a Local Area Network (LAN), which carries the metadata traffic, connect to one metadata server and several client file system servers; all servers attach through an F/C switch to a RAID controller and disks]
[Diagram: Distributed Metadata SSFS — the LAN carries local data traffic for the file system (depends on the implementation); each server runs an application, a lock manager, and the file system, and all attach through an F/C switch to a RAID controller and disks]
More on SSFS • Metadata server approaches do not scale as well as distributed lock managers for client counts over 64 • Lustre and GPFS are some examples of distributed metadata • Panasas scales similarly to the distributed metadata approaches, but views files as objects
Definition - Direct I/O • I/O which bypasses the server’s memory-mapped (page) cache and goes directly to disk • Some file systems can automatically switch between paged I/O and direct I/O depending on I/O size • Some file systems require special attributes to force direct I/O for specific files or directories, or enable it via an API • Emerging technologies often call data movement directly to the device “Direct Memory Addressing” or “DMA” • Similar to what is done with MPI communications
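As a minimal sketch (assuming Linux, where direct I/O is requested with the O_DIRECT open flag; other platforms use mount options or platform-specific calls), this is how an application might ask for direct I/O. The file name is illustrative.

/* Minimal sketch: requesting direct I/O with O_DIRECT on Linux.
   Some file systems or platforms expose this differently, and some
   switch between paged and direct I/O automatically. */
#define _GNU_SOURCE            /* O_DIRECT is a GNU/Linux extension */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("testfile.dat", O_RDWR | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) {
        perror("open with O_DIRECT");   /* EINVAL if unsupported */
        return 1;
    }
    /* Reads and writes on fd now bypass the page cache, but buffers,
       offsets, and lengths must satisfy the alignment rules covered
       in the "well formed I/O" slides. */
    close(fd);
    return 0;
}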
Direct I/O Improvements • Direct I/O is similar to “raw” I/O used in database transactions • CPU usage for direct I/O • Can be as little as 5% of paged I/O • Direct I/O improves performance • If the data is written to disk and not reused by the application • Direct I/O is best used with large requests • This might not improve performance for some file systems
Well Formed I/O • The application, operating system, file system, volume manager and storage device all have an impact on I/O • I/O that is “well formed” must satisfy requirements from all of these areas for the I/O to move efficiently • I/O that is well formed reads and writes data on multiples of the basic blocks of these devices
Well Formed I/O and Metadata • If file systems do not separate data and metadata and their space is co-located • Data alignment suffers because metadata is interspersed with data • Large I/O requests are not necessarily allocated sequentially • File systems allocate data based on internal allocation algorithms • Multiple write streams prevent sequential allocation
Well Formed & Direct I/O from the OS • Even if you use the O_DIRECT option, I/O cannot move from user space to the device unless it begins and ends on 512-byte boundaries • On some systems additional OS requirements are mandated • Page alignment • 32 KB requests, often related to page alignment • Just because memory is aligned does not mean that the file system or RAID is aligned • These are out of your control
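A minimal sketch of a request that satisfies these rules, assuming a 4 KB page size and 512-byte device blocks; the file name and 256 KB request size are illustrative assumptions.

/* Minimal sketch: a page-aligned buffer and a request size that is a
   multiple of 512 bytes, so a direct I/O write can move straight from
   user space to the device. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const size_t request = 256 * 1024;      /* multiple of 512 and of 4 KB */
    void *buf = NULL;

    /* Align the buffer to the assumed 4 KB page size. */
    int rc = posix_memalign(&buf, 4096, request);
    if (rc != 0) {
        fprintf(stderr, "posix_memalign failed: %d\n", rc);
        return 1;
    }
    memset(buf, 0, request);

    int fd = open("direct.dat", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) { perror("open"); free(buf); return 1; }

    /* Offset 0 and length 256 KB both fall on 512-byte boundaries,
       so the request is "well formed" from the OS point of view. */
    ssize_t n = write(fd, buf, request);
    if (n < 0)
        perror("write");

    close(fd);
    free(buf);
    return 0;
}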
Well Formed & Direct I/O from Device • Just because data is aligned in the OS does not mean it is aligned for the device • I/O for disk drives must begin and end on 512 byte boundaries • And of course you have the RAID alignment issues • More on this later
Volume Managers (VMs) • For many file systems, the VMs control the allocation to each device • VMs often have different allocation sizes than the file system • Making read/write requests equal to or in multiples of the VM allocation generally improves performance • Some VMs have internal limits that prevent large numbers of I/O requests from being queued
Device Alignment • Almost all modern RAID devices have a fixed allocation per device • Ranges from 4 KB to 512 KB are common • File systems have the same alignment issue with RAID controllers that memory has with the operating system (see the sketch below)
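As an illustration of matching requests to the RAID allocation, here is a small sketch that rounds a request size up to an assumed 256 KB per-device segment (stripe unit); the segment size is a placeholder, not a recommendation.

/* Hypothetical sketch: rounding a request size up to the RAID
   controller's per-device allocation so requests stay aligned. */
#include <stdio.h>
#include <stdint.h>

static uint64_t round_up(uint64_t n, uint64_t unit)
{
    return (n + unit - 1) / unit * unit;
}

int main(void)
{
    const uint64_t segment = 256 * 1024;          /* assumed RAID segment */
    uint64_t requests[] = { 4096, 100000, 262144, 1000000 };

    for (size_t i = 0; i < sizeof(requests) / sizeof(requests[0]); i++)
        printf("request %8llu -> issue %8llu (aligned to segment)\n",
               (unsigned long long)requests[i],
               (unsigned long long)round_up(requests[i], segment));
    return 0;
}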
Direct I/O Examples (Well Formed) • Any I/O request that begins and ends on a 512-byte boundary is well formed* • Example: a request of 262,144 bytes that begins at offset 0 • *Well formed in terms of the disk, not the RAID
Direct I/O Examples (Not Well Formed) • I/O that is not well formed can be broken by some file systems into well formed parts and non-well-formed parts • Example: a request that begins at byte 1 and ends at byte 262,145 • 1st request: bytes 0–512, of which the 511 requested bytes are moved through the system buffer • 2nd request: bytes 512–262,144, moved directly • 3rd request: bytes 262,144–262,656, buffered to return the remaining bytes
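The split above can be reproduced with a few lines of arithmetic; this sketch simply prints the buffered head, direct middle, and buffered tail for the example request (it illustrates the idea, not any particular file system's implementation).

/* Hypothetical sketch of how a request that is not 512-byte aligned
   splits into a buffered head, a direct middle, and a buffered tail. */
#include <stdio.h>
#include <stdint.h>

#define BLOCK 512ULL

int main(void)
{
    uint64_t start = 1;        /* request begins at byte 1   */
    uint64_t end   = 262145;   /* and ends at byte 262,145   */

    uint64_t head_end   = (start + BLOCK - 1) / BLOCK * BLOCK; /* 512    */
    uint64_t tail_start = end / BLOCK * BLOCK;                 /* 262144 */

    printf("head (buffered): bytes %llu-%llu (%llu bytes)\n",
           (unsigned long long)start, (unsigned long long)head_end,
           (unsigned long long)(head_end - start));
    printf("middle (direct): bytes %llu-%llu (%llu bytes)\n",
           (unsigned long long)head_end, (unsigned long long)tail_start,
           (unsigned long long)(tail_start - head_end));
    printf("tail (buffered): bytes %llu-%llu (%llu bytes)\n",
           (unsigned long long)tail_start, (unsigned long long)end,
           (unsigned long long)(end - tail_start));
    return 0;
}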
Well Formed I/O Impact • Having I/O that is not well formed causes • Significant overhead in the kernel to read data that is not aligned • The impact depends on other factors such as page alignment • Other impacts on the RAID depend on the controller’s allocation and alignment
I/O and Applications What is the data path?
What Happens with I/O • I/O can take different paths within the operating system depending on the type of I/O request • These different paths have a dramatic impact on performance • There are two ways applications issue I/O, and each takes a different path • C library buffered I/O • System calls
I/O Data Flow Example [Diagram: data flows from program space through the C library buffer, the page cache, the file system cache (some systems), and on to storage] • (1) Raw I/O: no file system, or direct I/O • (2) All I/O under the file system: read/write calls • (3) File system metadata and data: most file systems • (4) As data ages, it is moved from the page cache to storage • (5) Data is moved to the file system cache on some systems • All data goes through the system buffer cache • High overhead as data must compete with user operations for system cache
C Library Buffered • Library buffer size • The size of the stdio.h buffer; generally this is between 1,024 bytes and 8,192 bytes and can be changed on some systems by calls to setvbuf() • Moving data via the C library requires multiple memory moves and/or memory remapping to copy data from user space to the library buffer to the storage device • Library I/O generally has much higher overhead than system calls because the small request sizes force many more system calls
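A minimal sketch of enlarging the stdio buffer with setvbuf(); the 1 MB buffer size and file name are illustrative assumptions, not recommendations from the tutorial.

/* Minimal sketch: enlarging the stdio buffer so that buffered library
   I/O issues fewer, larger underlying system calls. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const size_t bufsize = 1024 * 1024;   /* 1 MB stdio buffer */
    char *iobuf = malloc(bufsize);
    if (iobuf == NULL)
        return 1;

    FILE *fp = fopen("output.dat", "w");
    if (fp == NULL) { free(iobuf); return 1; }

    /* Must be called after fopen() and before the first I/O on fp. */
    if (setvbuf(fp, iobuf, _IOFBF, bufsize) != 0)
        perror("setvbuf");

    for (int i = 0; i < 100000; i++)
        fprintf(fp, "record %d\n", i);    /* flushed in large chunks */

    fclose(fp);      /* flushes the buffer; free iobuf only after this */
    free(iobuf);
    return 0;
}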
C Library I/O Performance • If I/O is random and the buffer is bigger than the request • More data will be read than is needed • If I/O is sequential • If the buffer is bigger than the request, data is read ahead • If the buffer is smaller than the request, multiple system calls will be required • The buffer helps only when it is larger than the request • It needs to be significantly larger given the extra overhead to move the data or remap the pages
System Calls • UNIX system calls are generally more efficient for random or sequential I/O • The exception is sequential I/O with small requests, where C library I/O with a large setvbuf() buffer can win • System calls allow you to perform asynchronous I/O • Control returns to the program immediately, and you wait for the acknowledgment only when you need the data to be on the device
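A minimal sketch of asynchronous I/O using the POSIX AIO interface (aio_write/aio_suspend, typically compiled with -lrt on Linux); the file name and 64 KB request size are illustrative assumptions.

/* Minimal sketch: issue a write asynchronously, keep working, then
   wait for completion only when the result is needed. */
#include <aio.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    static char buf[64 * 1024];
    memset(buf, 'x', sizeof(buf));

    int fd = open("async.dat", O_WRONLY | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    struct aiocb cb;
    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf    = buf;
    cb.aio_nbytes = sizeof(buf);
    cb.aio_offset = 0;

    if (aio_write(&cb) != 0) { perror("aio_write"); return 1; }

    /* ... overlap computation with the in-flight I/O here ... */

    const struct aiocb *const list[] = { &cb };
    aio_suspend(list, 1, NULL);               /* wait for completion    */
    if (aio_error(&cb) == 0) {
        ssize_t n = aio_return(&cb);          /* bytes actually written */
        printf("wrote %zd bytes asynchronously\n", n);
    }

    close(fd);
    return 0;
}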
Vendor Libraries • Some vendors have custom libraries • That manage data alignment • That provide circular asynchronous buffering • That allow readahead • Cray, IBM and SGI all have libraries which can significantly improve I/O performance for some applications • There is currently no standard in this area • There is an effort by DOE to develop similar technology for Linux
Technology Trends and Their Impact What is changing and what is not
Block Device History • The concept of block devices has been around for a long time…at least 35 years • A block device is a data storage or transfer device that manipulates data in groups of a fixed size • For example, a disk whose data storage size is usually 512 bytes for SCSI devices
SCSI Technology History • The SCSI standard has been in place for a long time as well • There is an excellent historical account of SCSI at http://www.pcguide.com/ref/hdd/if/scsi/over.htm • Though the SCSI history is interesting and many companies have helped launch the technology • The SCSI standard was published in 1986 • Which makes it roughly 20 years old
Changes Have Been Limited • Since the advent of block devices and the SCSI protocol, modest changes have been made to support • Interface changes, new device types, and some changes for error recovery and performance • Nothing has really changed in the basic concepts of the protocol • Currently there is no communication regarding data topology between block devices and SCSI • Although one new technology has promise - more on OSD later
[Chart: ~Relative latency for data access — approximate minimum and maximum increases in access time moving from CPU registers through L1 cache, L2 cache, memory, disk, NAS, and tape; about 12 orders of magnitude overall]
[Chart: ~Relative bandwidth for data — approximate minimum and maximum bandwidth (GB/sec) from CPU registers through L1 cache, L2 cache, memory, disk, NAS, and tape; about 6 orders of magnitude overall]
[Chart: Disk read/write performance increases, 1977–2005 (log scale) — relative improvement in CPU speed, disk drive transfer rate, RPM, seek+latency, capacity, and RAID disk read/write rates]
[Chart: Bandwidth per GB of capacity (MB/sec) — a single 1977 disk delivers roughly 37.5 MB/sec per GB of capacity, while a 2005 single 300 GB disk and 300 GB FC and SATA RAID 4+1 / 8+1 configurations deliver only about 0.08–0.33 MB/sec per GB]
[Chart: 4 KB IOPS for a single device — comparing a 1977 CDC Cyber 819 disk, a 2005 300 GB Seagate Cheetah 10K.7, and a 2005 400 GB SATA drive; values shown range from roughly 270 to 1,655 I/Os per second]