Scalable, Fault-Tolerant NAS for Oracle - The Next Generation Kevin Closson Chief Software Architect Oracle Platform Solutions, Polyserve Inc
The Un-”Show Stopper” • NAS for Oracle is not “file serving”; let me explain… • Think of GbE NFS I/O paths from the Oracle servers to the NAS device that are totally direct, with no VLAN-style indirection. • In these terms, NFS over GbE is just a protocol, as is FCP over FibreChannel. • The proof is in the numbers (see the sketch below). • A single dual-socket/dual-core AMD server running Oracle10gR2 can push 273MB/s of large I/Os (scattered reads, direct path read/write, etc.) over triple-bonded GbE NICs! • Compare that to the infrastructure and HW cost of 4Gb FCP (~450MB/s, but you need 2 cards for redundancy). • OLTP over modern NFS with GbE is not a challenging I/O profile. • However, not all NAS devices are created equal by any means
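For a rough sanity check of those numbers, here is a minimal back-of-envelope sketch in Python. The ~90% usable-payload figure per GbE link is an assumption for the example; the 273MB/s and ~450MB/s figures come from the slide above.

```python
# Back-of-envelope check of the GbE-vs-FCP bandwidth claim above.
GBE_RAW_MBPS = 125        # 1 Gbit/s is roughly 125 MB/s raw
EFFICIENCY = 0.90         # assumed usable payload per link (TCP/NFS overhead)
BONDED_NICS = 3           # triple-bonded GbE, as on the slide

gbe_ceiling = GBE_RAW_MBPS * EFFICIENCY * BONDED_NICS
measured = 273            # MB/s pushed by one dual-socket/dual-core AMD server
fcp_4gb = 450             # MB/s quoted for 4Gb FCP (2 cards needed for redundancy)

print(f"3x bonded GbE ceiling ~ {gbe_ceiling:.0f} MB/s")
print(f"measured 273 MB/s is about {measured / gbe_ceiling:.0%} of that ceiling")
print(f"4Gb FCP is quoted at ~{fcp_4gb} MB/s per path")
```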
Agenda • Oracle on NAS • NAS Architecture • Proof of Concept Testing • Special Characteristics
Oracle on NAS • Connectivity • A Fantasyland Dream Grid™ would be nearly impossible with a FibreChannel switched fabric, for instance: • 128 nodes == 256 HBAs and 2 switches each with 256 ports just for the servers; then you still have to work out the storage paths (see the port-count sketch below) • Simplicity • NFS is simple. Anyone with a pulse can plug in Cat-5 and mount filesystems. • MUCH MUCH MUCH MUCH MUCH simpler than: • Raw partitions for ASM • Raw or OCFS2 for CRS • Oracle Home? Local Ext3 or UFS? • What a mess • Supports shared Oracle Home, shared APPL_TOP too • Not simpler than a Certified Third Party Cluster Filesystem, but that is a different presentation • Cost • FC HBAs are always going to be more expensive than NICs • Ports on enterprise-level FC switches are very expensive
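The port arithmetic behind that 128-node example, sketched below; the two-HBAs-per-node redundancy figure comes from the slide, and storage-side ports are deliberately left out, just as the slide notes.

```python
# Port math for the 128-node FibreChannel fabric example above.
nodes = 128
hbas_per_node = 2                     # dual-pathed HBAs for redundancy
hbas = nodes * hbas_per_node          # 256 HBAs
server_switch_ports = hbas            # spread across 2 redundant switches

print(f"{nodes} nodes -> {hbas} HBAs and {server_switch_ports} FC switch ports "
      "just for the servers, before any storage paths are cabled")
# The NFS alternative needs only commodity GbE NICs and Ethernet switch ports.
```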
Oracle on NAS • NFS Client Improvements • Direct I/O • open() with O_DIRECT works with the Linux NFS client, the Solaris NFS client, and likely others (see the sketch below) • Oracle Improvements • init.ora: filesystemio_options=directIO • No async I/O on NFS, but look at the numbers • The Oracle runtime checks mount options • Caveat: it doesn’t always get it right, but at least it tries (OSDS) • Don’t be surprised to see Oracle offer a platform-independent NFS client • NFS v4 will bring more improvements
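As a concrete illustration of what “open() with O_DIRECT” means on a Linux NFS client, here is a minimal sketch; the datafile path is hypothetical and the 4 KiB buffer size is an assumption about the alignment the mount requires.

```python
import mmap
import os

BLOCK = 4096  # assumed alignment/transfer size for O_DIRECT on this mount

# Hypothetical datafile on an NFS mount; O_DIRECT bypasses the client page cache,
# which is what filesystemio_options=directIO asks of Oracle's own file I/O.
fd = os.open("/u04/oradata/bench/users01.dbf", os.O_RDONLY | os.O_DIRECT)

buf = mmap.mmap(-1, BLOCK)      # anonymous mmap yields page-aligned memory
nread = os.readv(fd, [buf])     # direct read into the aligned buffer
print(f"read {nread} bytes without touching the NFS client cache")
os.close(fd)
```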
NAS Architecture • Single-headed Filers • Clustered Single-headed Filers • Asymmetrical Multi-headed NAS • Symmetrical Multi-headed NAS
NAS Architecture: Single-headed Filer (diagram: a GigE network in front of a single filer presenting filesystems /u01, /u02, /u03)
Oracle Servers Accessing a Single-headed Filer: I/O Bottleneck (diagram: a single Oracle database server has the same, or more, bus bandwidth as the entire filer, so the one head serving /u01, /u02, /u03 becomes the I/O bottleneck)
Oracle Servers Accessing a Single-headed Filer: Single Point of Failure (diagram: the database tier is highly available through failover HA, DataGuard, RAC, etc., yet the single filer head serving /u01, /u02, /u03 remains a single point of failure)
Architecture: Cluster of Single-headed Filers (diagram: one filer presents /u01 and /u02, its partner presents /u03; the cross paths become active only after failover)
Oracle Servers Accessing a Cluster of Single-headed Filers (diagram: the same filer pair, now with the Oracle database servers attached; the failover paths stay passive until needed)
Architecture: Cluster of Single-headed Filers (diagram: what if /u03 I/O saturates the filer that serves it?)
Filer I/O Bottleneck, Resolution == Data Migration (diagram: a new filesystem /u04 is added and some of the “hot” data is migrated from /u03 to /u04)
Data Migration Remedies the I/O Bottleneck (diagram: the “hot” data now lives on /u04, but the filer serving /u04 is a NEW single point of failure)
Summary: Single-headed Filers • Cluster to mitigate S.P.O.F • Clustering is a pure afterthought with filers • Failover Times? • Long, really really long. • Transparent? • Not in many cases. • Migrate data to mitigate I/O bottlenecks • What if the data “hot spot” moves with time? The Dog Chasing His Tail Syndrome • Poor Modularity • Expanded by pairs for data availability • What’s all this talk about CNS?
Asymmetrical Multi-headed NAS (SAN Gateway) Architecture (diagram: Oracle database servers in front of three active NAS heads and three failover heads, each owning its own “pool of data” on a FibreChannel SAN). Note: some variants of this architecture support M:1 Active:Standby, but that doesn’t really change much.
Asymmetrical NAS Gateway Architecture • Really not much different from clusters of single-headed filers: • A 1 NAS head to 1 filesystem relationship • Migrate data to mitigate I/O contention • Failover is not transparent • But: • More modular • No need to scale up by pairs
Symmetric vs Asymmetric EFS-CG (diagram: in the symmetric EFS-CG, every NAS head can serve /Dir1/File1, /Dir2/File2, and /Dir3/File3; in the asymmetric design, each file is reachable only through the one NAS head that owns its filesystem)
Enterprise File Services Clustered Gateway Component Overview • Cluster Volume Manager • RAID 0 • Expand Online • Fully Distributed, Symmetric Cluster Filesystem • The embedded filesystem is a fully distributed, symmetric cluster filesystem • Virtual NFS Services • Filesystems are presented through Virtual NFS Services • Modular and Scalable • Add NAS heads without interruption • All filesystems can be presented for read/write through any/all NAS heads
EFS-CG Clustered Volume Manager • RAID 0 striping over LUNs that are themselves RAID 1, which implements S.A.M.E. (Stripe And Mirror Everything) • Expand online: add LUNs and grow the volume • Up to a 16TB single volume
The EFS-CG Filesystem • All NAS devices have embedded operating systems and file systems, but the EFS-CG is: • Fully symmetric • Distributed Lock Manager • No metadata server or lock server • A general-purpose clustered file system • Standard C Library and POSIX support • Journaled, with online recovery • Proprietary format, but it uses standard Linux file system semantics and system calls, including flock() and fcntl(), clusterwide (see the locking sketch below) • Expand a single filesystem online up to 16TB; up to 254 filesystems in the current release.
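To make the clusterwide locking point concrete, here is a minimal sketch of the standard flock() usage the slide refers to; the file path is hypothetical and simply needs to live on an EFS-CG filesystem mounted by the client.

```python
import fcntl
import os

# Hypothetical file on an EFS-CG mount; per the slide, flock()/fcntl() semantics
# are honored clusterwide, so this exclusive lock is visible from every NAS head.
fd = os.open("/u02/shared/counter.dat", os.O_RDWR | os.O_CREAT, 0o644)
fcntl.flock(fd, fcntl.LOCK_EX)          # exclusive, cluster-visible lock
try:
    os.write(fd, b"updated by one node at a time\n")
    os.fsync(fd)                        # flush before releasing the lock
finally:
    fcntl.flock(fd, fcntl.LOCK_UN)
    os.close(fd)
```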
Scalability: Single Filesystem Export Using x86 Xeon-based NAS Heads (Old Numbers) (chart: aggregate throughput in MB/s versus cluster size in NAS heads: 1 head, 123; 2, 246; 4, 493; 6, 739; 8, 986; 9, 1,084; 10, 1,196; the approximate single-headed filer limit is marked for comparison). HP StorageWorks Clustered File System is optimized for both READ and WRITE performance.
Virtual NFS Services • Specialized virtual host IPs • Filesystem groups are exported through a VNFS • VNFS failover and rehosting are 100% transparent to the NFS client • Including active file descriptors, file locks (e.g., fcntl/flock), etc.
Enterprise File Services Clustered Gateway (diagram: Oracle database servers mount /u01 through /u04 via virtual NFS services such as vnfs1, vnfs1b, vnfs2b, and vnfs3b, which are hosted across the NAS heads of the gateway)
EFS-CG Proof of Concept • Goals • Use Oracle10g (10.2.0.1) with a single high performance filesystem for the RAC database and measure: • Durability • Scalability • Virtual NFS functionality
EFS-CG Proof of Concept • The 4 filesystems presented by the EFS-CG were: • /u01. This filesystem contained all Oracle executables (e.g., $ORACLE_HOME) • /u02. This filesystem contained the Oracle10gR2 clusterware files (e.g., OCR, CSS) plus some datafiles and External Tables for ETL testing • /u03. This filesystem was lower-performance space used for miscellaneous tests such as disk-to-disk backup • /u04. This filesystem resided on a high-performance volume that spanned two storage arrays. It contained the main benchmark database
EFS-CG P.O.C. Parallel Tablespace Creation • All datafiles created in a single exported filesystem • Proof of multi-headed, single filesystem write scalability
EFS-CG P.O.C. Full Table Scan Performance • All datafiles located in a single exported filesystem • Proof of multi-headed, single filesystem sequential I/O scalability
EFS-CG P.O.C. OLTP Testing • OLTP Database based on an Order Entry Schema and workload • Test areas • Physical I/O Scalability under Oracle OLTP • Long Duration Testing
EFS-CG P.O.C. OLTP Workload Transaction Avg Cost * Averages with RAC can be deceiving; be aware of CR sends
EFS-CG Handles all OLTP I/O Types Sufficiently—no Logging Bottleneck
Long Duration Stress Test • Benchmarks do not prove durability • Benchmarks are “sprints”: typically 30-60 minute measured runs (e.g., TPC-C) • This long-duration stress test was no benchmark by any means • Ramp OLTP I/O up to roughly 10,000 transfers/sec • Run non-stop until the aggregate I/O breaks through 10 billion physical transfers • That is 10,000 physical I/O transfers per second for every second of nearly 12 days (see the arithmetic sketch below)
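A quick check of that duration claim, using only the figures quoted on the slide:

```python
# 10 billion physical transfers at a sustained 10,000 transfers per second.
total_transfers = 10_000_000_000
rate_per_sec = 10_000

seconds = total_transfers / rate_per_sec      # 1,000,000 seconds
days = seconds / 86_400
print(f"{seconds:,.0f} seconds ~ {days:.1f} days of non-stop I/O")   # ~11.6 days
```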