Oracle9i Real Application Clusters on Commodity-based Cluster Technology: A Case Study
Kevin Closson, Sr. Staff Engineer, Database Engineering, PolyServe Incorporated, kevinc@polyserve.com
Agenda • Answers and Questions • Project Description • Project Goals • System and Test Environment Configuration • Hardware • Linux Distribution/Kernel Release • Oracle • Lessons Learned • Performance Observations • Clustered File System Considerations
Questions and Answers • Intel-based servers? Can those processors really handle Oracle? • What about high I/O rates? • Huh? An 8-node cluster? Isn't that a management nightmare? • Really now, Linux? That isn't a stable enough operating system, is it? • Linux can't be right for Oracle; it doesn't scale to large processor counts, right? • What on earth could I possibly do with the primary release of Oracle9i Real Application Clusters (9.0.1)? Doesn't it make sense to hold off until 9iR2? • What can I achieve without a high-speed interconnect, really?
Project Description • Case Study • OLTP workload analysis on an 8-node Intel-based cluster running Oracle9i RAC on Red Hat Linux • Goals • To conduct a preliminary assessment of Oracle9i Real Application Clusters • Determine "Barriers to Entry" • To study new Oracle9i features and how to use them to improve performance, manageability and uptime • Assess Linux/SAN technology interoperability
System Configuration - Hardware • 8 Dell PowerEdge 1550 dual-processor (900MHz) servers, each with: • 1GB physical memory • 2 × 100BaseT NICs • 1 Gigabit Ethernet NIC (Broadcom BCM5700) • 1 QLogic Fibre Channel HBA • Brocade 16-port FC switch • 288 GB SAN storage subsystem • RAID 1+0
Test System Overview (diagram): eight dual-CPU Intel servers running Oracle9i RAC, linked by a 100BaseT public network (PolyServe secondary), a 100BaseT PolyServe primary interconnect, and Gigabit Ethernet for the Oracle9i RAC interconnect; storage is a 288 GB SAN behind a Brocade Silkworm 2250 16-port FC switch.
System Configuration - Software • Red Hat Linux 7.2 • Red Hat kernel version 2.4.7-10 • glibc-2.2.4-13 • gcc-2.96-98 • binutils-2.11.90.0.8-9 • Oracle9i Release 1 (9.0.1) with Real Application Clusters • Oracle "Clusterware" for Linux • OCMS • GCS • GES
Goals – Ascertain Barriers to Entry • Oracle9i Interconnect Requirement • It seemed to be commonly accepted that Gigabit Ethernet is required for Oracle9i Real Application Clusters • Tested both 100BaseT and Gigabit Ethernet for Oracle's interconnect • Configured via the DefinedNodes parameter in $ORACLE_HOME/oracm/admin/nmcfg.ora (see the sketch after this list) • At all OLTP scale points, 100BaseT and Gigabit Ethernet provided the same throughput! • Why? • Same latencies • With OLTP, Oracle's demand on the interconnect is mostly for lock converts and shipping consistent-version row pieces (not entire blocks) • A credit to Oracle's utilization of interconnect bandwidth • This testing did not include IPQO, so no interconnect-based barrier to entry was found • For early-access Oracle9i RAC OLTP testing, 100BaseT appears to be a safe bet for up to 8 dual-processor nodes
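For illustration only, a rough sketch of the interconnect setting. Only the DefinedNodes parameter and the file location come from this slide; the hostnames, the space-separated format, and the omission of the other OCMS parameters (quorum disk, ports, and so on) are assumptions that should be checked against the 9.0.1 cluster manager documentation.

# $ORACLE_HOME/oracm/admin/nmcfg.ora (fragment, not a complete file)
# Hostnames that resolve to the chosen interconnect; pointing them at the
# 100BaseT or the Gigabit interfaces selects which network RAC uses.
DefinedNodes=rac1 rac2 rac3 rac4 rac5 rac6 rac7 rac8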
Goals – Study New Oracle9i Features • Oracle Managed Files • All testing done on tablespaces created using OMF • Tablespaces created at the default 100MB and grown via AUTOEXTEND (see the sketch below) • Dynamic SGA • Not all cache buffers need to be allocated at instance boot time • This feature is an excellent enhancement for availability infrastructure • "Standby instances" don't have to be pre-sized and idle • Saves physical memory on "secondary" nodes
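A minimal sketch of both features; the directory path, tablespace name and cache size are illustrative, not values from the test.

-- Oracle Managed Files: with db_create_file_dest set, CREATE TABLESPACE
-- needs no DATAFILE clause; Oracle names the file and creates it at the
-- default 100MB with AUTOEXTEND enabled.
ALTER SYSTEM SET db_create_file_dest = '/u02/oradata/PROD';
CREATE TABLESPACE warehouse;

-- Dynamic SGA: the buffer cache can be resized while the instance is up,
-- within the limit set by sga_max_size.
ALTER SYSTEM SET db_cache_size = 256M;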
Performance Observations • Testing based on OLTP workloads with varying read:write ratios • Same database, different transaction mixes • Fixed pseudo-user count from 1 through 8 nodes • Tests the ability of the same number of users to get more work done as hardware is added • Completely baseline configuration • Version 9.0.1 from OTN, no patches even! • Minimal init.ora (only 17 parameters set!) • Very minimal schema tuning • Added freelist groups, not much else (see the sketch after this list) • No partitioning of any sort • No data-dependent request routing • No "application partitioning" • Simple tables, not physically partitioned
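The slide does not show the actual DDL; as a hypothetical illustration of the freelist-group tuning it mentions, a table might be pre-created with one freelist group per instance (the table name, columns and counts here are invented).

-- One freelist group per RAC instance reduces contention on the segment's
-- free-space management when many instances insert rows concurrently.
CREATE TABLE orders (
  order_id   NUMBER,
  cust_id    NUMBER,
  order_date DATE
)
STORAGE (FREELIST GROUPS 8 FREELISTS 4);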
OLTP Workload Specifics • Read:write ratio and I/O rate per transaction: • Reporting: 99:1, 1.3 I/Os per transaction • Customer Service: 76:24, 2.5 • Order Entry: 70:30, 9 • Maintenance: 60:40, 15
Reporting Workload (chart: TPS vs. node count, 1 through 6 nodes, y-axis 0 to 1,800 TPS) • Read:write ratio: 99:1 • I/O rate per transaction: 1.3
Customer Service Workload (chart: TPS vs. node count, 1 through 6 nodes, y-axis 0 to 1,500 TPS) • Read:write ratio: 76:24 • I/O rate per transaction: 2.5
Order Entry Workload (chart: TPS vs. node count, 1 through 6 nodes, y-axis 0 to 600 TPS) • Read:write ratio: 70:30 • I/O rate per transaction: 9
Maintenance Workload (chart: TPS vs. node count, 1 through 6 nodes, y-axis 0 to 360 TPS) • Read:write ratio: 60:40 • I/O rate per transaction: 15
Lesson Learned – Validate the FC Switch • All Fibre Channel switches may be created equal, but… • Use a non-Oracle test to determine FC switch capability • Conduct single-node random and sequential I/O tests using dd(1) or similar (a sketch follows this list) • Conduct a multiple-node concurrent test • Practical example of a poorly behaved switch • 1GB file, fixed number of random 4K pread() calls*, raw disk access:

node1: TIME 81.10  OPS/sec: 3945.75  MB/sec: 15.4
node2: TIME 101.29 OPS/sec: 3159.25  MB/sec: 12.3
node3: TIME 72.03  OPS/sec: 4442.59  MB/sec: 17.4
node4: TIME 72.04  OPS/sec: 4441.98  MB/sec: 17.4
node5: TIME 72.05  OPS/sec: 4441.36  MB/sec: 17.3
node6: TIME 72.06  OPS/sec: 4440.74  MB/sec: 17.3
node7: TIME 72.53  OPS/sec: 4411.97  MB/sec: 17.2
node8: TIME 72.76  OPS/sec: 4398.02  MB/sec: 17.2

*Program available upon request
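A rough single-node sequential check with dd(1), assuming raw devices are already bound as described on the raw(8) slide later; the device name and size are illustrative. Run the same command from every node at once for the concurrent test and compare elapsed times.

# Sequential read of roughly 1GB from a raw LUN; repeat on each node, then
# on all nodes concurrently, and watch for outliers like node1/node2 above.
time dd if=/dev/raw/raw1 of=/dev/null bs=1024k count=1024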
Lessons Learned – Proper Statistics Collection • Use Statspack • Statspack does track multiple-instance statistics • Produces a per-instance report • Post-process for aggregate cluster-wide stats • Use gv$ performance views • "Sampling" a single instance yields misleading readings • An important example, the global physical I/O profile:

select tablespace_name,
       sum(phyrds)  reads,
       sum(phywrts) writes
from   dba_data_files, gv$filestat
where  dba_data_files.file_id = gv$filestat.file#
group  by tablespace_name
order  by reads desc;
Global I/O Profile

TABLESPACE        READS     WRITES
----------------------------------
WAREHOUSE        632344     624621
WHR_IDX          475200        794
PRODUCT          482780      59523
CUSTOMER         128967     136494
ITM_IDX          107937     121074
ORDERS            52678      66485
NAME_IDX          24728       3746
ORD_IDX           29663      27842
CUS_IDX            6452       2384
ITEM              17435      72617
CSTATE_IDX         1028       1554
ROLL_1                4      56131
ROLL_2                3      61119
STATE_IDX          2982       4217
----------------------------------
TOTALS:         1962460    1238977

TOTAL I/O        3201437
I/O per second      3557
READ PCT              62
WRITE PCT             38
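In the same spirit, cluster-wide activity can be rolled up from gv$sysstat rather than sampled from one instance's v$sysstat; this query is a sketch, and the statistic name is just one common example rather than one taken from the slides.

-- Aggregate a statistic across all instances, then show the per-instance split.
select name, sum(value) cluster_total
from   gv$sysstat
where  name = 'physical reads'
group  by name;

select inst_id, value
from   gv$sysstat
where  name = 'physical reads'
order  by inst_id;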
Lessons Learned – Linux Kernel • Track regressions in the Linux kernel • 2.4.7 to 2.4.9 kernel regression: 80% degradation for shared-memory-based I/O due to a virtual memory bug • Red Hat Bug #59748 • Establish an environment for quick regression analysis before deploying new kernel revisions
Lesson Learned – DBWR Performance • Async I/O is not supported on Linux • This is not really a problem • CPU-to-DBWR ratio • Huge SMPs cannot survive without async I/O for DBWR • Configure DBWR I/O slaves • Caveats • Async I/O in Linux will be available soon (reportedly in the 2.5 kernel) • Seek out a vendor that supports Oracle Disk Manager • PolyServe is implementing ODM through a partnership with Oracle
Lesson Learned – DBWR I/O Slaves • Error during instance startup:

Sat Jan 9 10:26:30 2002
Errors in file /usr1/oracle/rdbms/log/dbw0_22830.trc:
ORA-01114: IO error writing block to file 1 (block # 300)
ORA-01110: data file 1: '/usr1/oracle/rw/DATA/benchSYSTEM'
ORA-27064: skgfdisp: cannot perform async I/O to file
DBW0: terminating instance due to error 1242

• Workaround – explicitly disable async I/O if slaves are used:

dbwr_io_slaves = 20
disk_asynch_io = false
_lgwr_async_io = false
_dbwr_async_io = false
Lesson Learned – DBWR I/O Slaves • Estimated DBWR slave I/O capability • DBWR I/O slaves are synchronous I/O handling processes • Process name: "ora_i101_PROD1", etc. • Total slave write capacity is the number of slaves multiplied by the reciprocal of the average disk write service time • Example: given an average service time of 12ms, 20 DBWR slaves can handle roughly 1,600 writes per second ((1 / 0.012) * 20) • If suffering "free buffer waits", try adding more slaves. This tactic does top out, however.
Lessons Learned – Buffer Overhead for Cache Fusion • SGA disk block buffers carry additional overhead with Cache Fusion • Buffers are pinned while current-version data is sent to requesting nodes • Buffer pin duration is reported as the 'global cache current block send time' statistic (see the query sketch below) • Most of the latency measured by this statistic is wrapped up in the processes involved being in the runnable state; that is, the measurement includes time that processes wait for CPU, and this was a very heavily utilized cluster • Only a very small portion of this time is actual interconnect data transfer • It may be necessary to add buffers once Real Application Clusters is in use
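A sketch of checking that statistic per instance; the statistic name is the one cited on the slide, while the assumption that 9i reports these times in centiseconds should be verified against the documentation.

-- Pin-duration statistic per instance via the gv$ views.
select inst_id, value
from   gv$sysstat
where  name = 'global cache current block send time'
order  by inst_id;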
Lesson Learned – Linux raw(8) Binding • To use raw partitions with Oracle, each /dev/sd* device must be bound with the raw(8) command • Ensure scripts are THOROUGHLY TESTED • Watch for "device slip" • Bindings are not persistent across system reboots (a sketch for making them persistent follows below) • Example of a simple script error that could be disastrous; both raw devices end up bound to the same partition, so the datafile added on /dev/raw/raw3 silently lands on the device already in use behind /dev/raw/raw1:

# raw /dev/raw/raw1 /dev/sdb1
/dev/raw/raw1: bound to major 8, minor 17
# raw /dev/raw/raw3 /dev/sdb1
/dev/raw/raw3: bound to major 8, minor 17
# raw -qa
/dev/raw/raw1: bound to major 8, minor 17
/dev/raw/raw3: bound to major 8, minor 17
$ sqlplus '/ as sysdba' <<EOF
alter tablespace system add datafile '/dev/raw/raw3' SIZE 50M;
EOF
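One way to make bindings survive a reboot, assuming the stock Red Hat 7.2 init scripts are in place; the device names here are illustrative.

# /etc/sysconfig/rawdevices lists one "raw device  block device" pair per line;
# the rawdevices service re-applies the bindings at boot.
#   /dev/raw/raw1  /dev/sdb1
#   /dev/raw/raw2  /dev/sdc1
chkconfig rawdevices on
service rawdevices restart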
Clustered File System Considerations • A good CFS should take care of the following: • SAN physical device abstraction • Based on WWN, a spindle or LUN does not need to be dealt with as some raw device • Disks are imported and treated as managed entities • Eliminates concerns about device slippage, raw(8) binding, etc. • Allow direct I/O for Oracle files in the clustered file system • Control files, redo logs, tablespaces, etc. • Shared Oracle Home directory • $ORACLE_HOME contains some 100,000 files under many directories • With a shared Oracle Home, only one copy of the Oracle product is needed • Ease of management • Patches, config, sheer mass, etc. • OCMS presumes the lack of a CFS* *A Tips and Techniques guide documenting a workaround is available
Clustered File System Considerations • Use a non-Oracle "micro" benchmark* with an I/O profile similar to Oracle's to ensure the cost of I/O operations to and from the SAN is nearly the same when: • Accessed through raw devices (bound with the raw(8) command) • Accessed through the clustered file system • No serializing anomalies • Allow concurrent writes to files • Direct, completely unbuffered I/O • External caching in the page cache is a NO-NO with RAC • Commensurate processor overhead • Some filesystem implementations consume inordinate amounts of processor bandwidth in the I/O code path *Sample program available upon request
Micro Benchmark (chart): random 4K pread(), fixed workload of 72 processes each executing 5,000 random I/Os; completion time in seconds (0 to 100 scale) compared for 2-node and 4-node runs on the CFS versus raw devices.
Summary • Oracle9i Real Application Clusters, in its primary release, delivers tremendous horizontal scaling, and it's stable • "Barriers to Entry" are minimal given a Linux approach • A good clustered file system is recommended • All OLTP performance data in this presentation was collected from a cluster with Oracle executing from a shared Oracle Home in the PolyServe Clustered Filesystem • The datafiles, redo logs, control files and RAC quorum disk were also stored in the PolyServe Clustered Filesystem • An interesting quote…
What the Analysts Say…. • “PolyServe is delivering on the dream of true clustered file systems. The Oracle9i integration is a perfect application for this technology, and users will see an immediate benefit.” • Steve Duplessie, Enterprise Storage Group