290 likes | 403 Views
SAN for the DBA. Dave Wentzel Competitive Advantage Business System – Architect Deloitte and Touche LLP 3/21/2003. Objectives. To allow a SQL/Oracle DBA to speak knowledgeably to a SAN engineer To understand from a DBA’s perspective when the SAN *might* be a bottleneck
E N D
SAN for the DBA Dave Wentzel Competitive Advantage Business System – Architect Deloitte and Touche LLP 3/21/2003
Objectives • To allow a SQL/Oracle DBA to speak knowledgeably to a SAN engineer • To understand from a DBA’s perspective when the SAN *might* be a bottleneck • Based on this presentation, you will NOT • Become a SAN engineer overnight • Be given access to any SAN environments (not even sandbox!!) • Pass the SNIA • Focused on SQL Server/CAB and Oracle/SAP • Will point out issues we’ve seen consulting with DOL and SharePoint • No other customers will be brought up • Sorry I don’t know Linux/AIX. SAP guys…when are you migrating to Oracle on Windows?
Terminologies • SAN: storage area network • DAS: direct attached storage • LUN: Logical Unit Number • HBA: Host Bus Adapter • NAS: network attached storage. NEVER use this for a db…it’s basically a file share. Oracle supports it with NFS on NAS…still…don’t use it • iSCSI SAN: new technology…SCSI commands are sent over a standard TCP/IP network. • FibreChannel • MPIO (MultiPath IO)
SAN Vendors We Use In House • EMC • Symmetrix (UK hosting center) • Virtualized (discussed later) • CABS EURO replica runs here • SAP (UK firm and central Europe (I think)) • Clariion (Glen Mills) • CABS master runs here • SAP HR/FI (global, not US) • HP • HP EVA (Australia and Glen Mills Sandbox) • CABS Americas replica runs here • HP XP (Glen Mills Sandbox) • We don’t use this for prod anymore (discussed later) • Sandbox SAP
Disk “Types” • IDE • Popular on desktops. Not in any of our SANs • Can only handle one outstanding IO request per channel. • SATA • We aren’t using this…yet • Newer models have TCQ (tagged command q’ing). Allows for serving of out-of-order command requests. Big perf boost. • Will soon be cheaper than SCSI and even available in desktops. • SCSI • Supports multiple IO requests
Disk “Speeds” and IOPs • What is IO’s per second? • These are only approximations. Assume 7200 RPM drives/serial reads
Explanations • SCSI has faster spindle speeds, which results in better IOPS • SATA gives more capacity and pricing should be better in the future • RAID will bring your IOPS higher.
Sector Size vs Stripe Size • sector size: smallest physical storage unit on the disk. This is determined by the disk manufacturer but is generally 512 bytes. • stripe size: the unit of data that is written and accessed from a disk in a RAID system. This is configurable by the SAN folks. Smaller stripe sizes allows data to be distributed to more disks and increases I/O parallelism. 64KB (a sql server extent) is the lower limit. 64KB or 256KB should work for most sql server workloads according to MS. 256 allows for more efficient read-ahead. • DB_BLOCK_SIZE should be specified in init.ora • We recommend 64 KB stripe sizes only. We have run into restore issues (sql server) otherwise.
Track Alignment • NTFS volumes need to be aligned with the SAN track sectors. • Demo • Check this BEFORE you sign off on the server install. It can’t be changed after SQL Server is installed. • Not one system was setup correctly when I checked. • Performance improvement of 10-20% theoretically. • No Dynamic disks in NTFS…only Basic • Process: • Diskpar –i <drive number> • Look for StartingOffset and divide by 4096 • If this isn’t a whole number integer send the system back to the data center and tell them to sector align the disks. Eventually they’ll learn. • Most common…StartingOffset is 32256…divide by 4096 = 7.875 • We can correct this for them using “diskpar –s <drive number>…but I’d rather “let them know we are watching” • Never sector align a formatted drive or a drive without a tested backup
NTFS configuration • Basic Disks only • NTFS allocation unit should be 64KB (this is the cluster size in diskmgmt.msc)
RAID • This isn’t a RAID tutorial…but rather what is absolutely essential to know. • RAID 5 (striping with parity) • Let’s all agree to stop recommending this • Proof? http://www.baarf.com • Only protects against single disk failure • RAID 1 (mirroring) • Resilient to multi disk failures. Good for tran logs • RAID 10 (1+0) (mirrored striped set) • This is what we should recommend. Excellent read/write performance. • We are using RAID 50 (mirrored RAID 5) in Glen Mills but are getting pushback when we need more disk. So, again, let’s recommend RAID 10
HBAs • QueueDepth setting. • Changing this can increase performance (depending) but you need to test. There are only so many IOPS a disk subsytem can handle so you don't want to set this too high otherwise other hosts on the SAN will suffer performance problems. If you have few hosts then up'ing the QD may be helpful. • Always consider using multiple HBAs if available. If you do make sure they are on different buses to prevent bus saturation. PCI slots have different bus speeds so make sure you plug your HBAs into the fastest slot. • MPIO
“Virtualized” SANs • I quote from the EMC guy’s email: “mixing data and logs on the same spindle is unlikely to have a measurable performance impact due to the large cache in the SAN. In practice you are more likely to be hit by a busy file server or print server sharing database spindles than by log and data sharing. Again, our monitoring software can detect these hotspots, and our data movement software can shift critical files to LUNs on 'cooler' spindles. ” • Anyone see a problem with this?
Data Separation • EMC whitepapers still preach data/log separation…so should we, regardless of what a SAN engineer tells us. • Doesn’t matter if the engineer tells us that behind the fabric everything is “virtualized” • SAN Engineer vs Storage Administrator • I’ve noticed separation is becoming less attainable because the DBAs are simply another group of users, and the SAN folks loathe the notion of making special provisions for any group, perceiving it to be something that would make their lives miserable. • Tough luck!!!
LUNs and PerformanceMaking the Best of Bad Situations • Assume you have 1 drawer with 14 drives in it on the Clariion • Setup with 3 luns, each with 4 drives a piece and allocate all the space on those drives to each separate lun • then the lun and the drive letter are basically the same thing.. • but you are limited to only 4 drives for each drive letter. That is not that many heads. • What would be better? • Assume 14 drives…one big lun with separate drive letters within that lun • More heads going after the data. • This won’t always help, but I am pretty sure that SQL can use multiple IO requests to the drives if it sees it as different drive letters. So you still might get some help by doing it that way. • This definitely works with Oracle. • Try to always avoid setting it up so that if your database server needs 150 GB, and we get 150GB on one or two phyysical drives (not LUNs). That is what has happened to the HP XP.. I didn’t get any say so in it and you can imagine how poorly the disk was performing. The higher ups are starting to see the errors of their ways and I think/hope we will be changing soon from the one size fits all model. Performance was so bad we managed to have new HBAs installed for CABS to the Clariion and migrated data from one SAN to another during a few Month End Rollovers. We need to avoid this in the future.
Cache • Glen Mills Sym has 8GB Cache. CABS has 3 SQL Servers connected to it…each with 4 GB RAM. SAP FI/HR have 2 Oracle boxes…each with 4GB RAM. 10 add’l hosts are connected with unknown specs. • If data is needed on one SQL Server and it is not in buffer cache (about 2.1 GB on our servers) then what are the odds it will be found in the Sym cache? Not good. • Cache should be overweighted to write cache • The SAN engineers will complain about this…tell them MS supports it if the controller cache is battery backed. (Search for SQL Server WAL protocol for proof).
LUN to Drive Letters • Yes, we’ve run out of drive letters in a few places due to small LUNs using Veritas Clustering. We are now recommending mount points (and moving to Windows Clustering). One problem with mount points is the lack of support to “easily” determine free/used disk space both programmatically (xp_fixeddrives shows 0 free/used) and from the Windows Explorer GUI. • This happened not just with CABS, but with DOL and SharePoint. • We were asking for 20GB LUNs…this was the problem. Now asking for 100GB…but now we are getting blowback because we didn’t plan properly. • ALWAYS format each LUN as a single volume and NEVER create multiple Windows logical partitions on a LUN. Don’t over-abstract! • Use Basic Disks…never Dynamic (track alignment issues mentioned earlier)
Monitoring Performance – SQL Server • Sysprocesses.waittype (use sp_catchprocesses please) will show PAGEIOLATCH_xx. This is a disk to memory transfer wait. • Please identify these for us using DBCC PAGE. We may be able to fix the problem by moving data structures to new filegroups • PerfMon Counter Buffer Manager: Page Life Expectancy will drop off as well. • We can solve some of this with better db/app design, indexes, relieving “hot spots”, better code…but we can only do so much. • Rule of thumb…if there is a PAGEIOLATCH_xx wait…and you have never seen the call execute with similarly poor performance…then it *might* be IO related
PAGEIOLATCH_UP • This is a special type of disk to memory transfer wait…a page io latch in update mode. This is contention against the PFS, GAM, and SGAM. Will show as waitresource x:1:y or x:2:y or x:3:y. • This means a lack of IO file parallelism within the file group. Solution is to add more files to the filegroup. Can even be on the same LUN!!! We just need more files to spread the Windows File Locks around. • We are working to fix this and to utilize more filegroups and files…but we aren’t there yet • Especially occurs with…
TempDB • Follow these rules • Segregated RAID 10 (or 5 as a last resort) LUN • Tempdev files of equal size • No autogrowth • At least 4 files. • This avoids 2:1:x and 2:2:x and 2:3:x waitresource for PAGEIOLATCH_UP waits
PerfMon/Windows • You need a baseline Avg (or Current…doesn’t matter) Disk Q Length. Use that baseline to compare with times of trouble. The SAN is abstracting the disk to Windows so the standard Windows advice of Q length < 2 * # spindles doesn’t apply
Oracle • Statspack and ADDM are the best tools • Sql_trace is too fine grained so we can’t run 24x7. • There is NO reason ADDM can’t be enabled. • Consider researching the Advisory Framework (I have no experience)
Oracle ASM • First and foremost: ASM is NOT a volume manager. Still must put thought into where the data/redo/archivelogs/control files are placed. If you “set it and forget it” with ASM then you can forget about performance. (Phil Demo) • Multiple ASMs won’t buy you better performance…you need more LUNs for that. • ASMs will probably be considered just another abstraction bottleneck in years to come but good for departmental servers. • General rule…one tablespace per LUN. Similar to one filegroup/LUN in SQL Server. • statspack • On AIX…use sar logging at 5 min intervals. Use the “hotspots” xls file to import the data (Phil Demo) • On Linux use iostat (Carlo Demo) • Consider NOT using Oracle load balancing of datafiles over mount points…AIX load balances IO across the vpath • Don’t overthink this…just make your mount points map 1:1 with your LUNs. This equates to mapping 1:1 Windows volumes to LUNs.
Oracle ADDM • This is showing excellent promise. • Ensure the AWR snapshots are running…ADDM will auto analyze and ACTUALLY detects root cause disk. • ADDM identifies each symptom with drill down to root cause…NICE • Saw this on the EVA with sapbw • Doesn’t seem to be a big overhead.
Other Oracle Tools • Goal is to reduce the “DB Time” throughput metric. • But this includes CPU and waits too • So we need to focus on DBIO_EXPECTED. This is the avg microseconds needed to read one Oracle block • Our SAN vendors are telling us we should get at least 10 ms (10000 microsecs) …so… • EXECUTE DBMS_ADVISOR.SET_DEFAULT_TASK_PARAMETER(‘ADDM’,’DBIO_EXPECTED’,10000); • Now ADDM will note automagically when you have exceeded that and will focus on IO subsystem for root cause.
Demo Using Two Saved AWRs • Trial Balance from a few months back on gsp • Wow…who says Oracle is difficult to use!!!
Oracle Waits Useful queries: V$SESSION_WAIT_CLASS A, V$SESSION B where B.SID = A.SID And B.USERNAME IS NOT NULL --and— V$SESSION_WAIT_HISTORY --and— V$SESS_TIME_MODEL…look for User I/O wait time ---and the new one in 10g— Select user_io_wait_time from v$SQLAREA
Questions to Ask Your SAN Guy When You Think You Have IO Trouble • What is the make/model of the SAN? • Go research it • Is it virtualized? • How is data segregated? • SQL: Data/Logs/tempdb/quorum drive/system dbs/etc • Oracle: Tblspaces/redo/archivelog/controlfiles/etc • How are the LUNs setup? • Spindles to LUN ratio • Inquire on the HBA/switch/fabric topology and do your research. • Multipath I/O? (MPIO). Should be set to round robin or least q depth. Jumbo frames should also be used. • Are the disks track aligned?