Implementing ASM Without HW RAID, A User’s Experience Luca Canali, CERN Dawid Wojcik, CERN UKOUG, Birmingham, December 2008
Outline • Introduction to ASM • Disk groups, fail groups, normal redundancy • Scalability and performance of the solution • Possible pitfalls, sharing experiences • Implementation details, monitoring, and tools to ease ASM deployment
Architecture and main concepts • Why ASM? • Provides the functionality of a volume manager and a cluster file system • Raw access to storage for performance • Why ASM-provided mirroring? • Allows the use of lower-cost storage arrays • Allows mirroring across storage arrays • Arrays are not single points of failure • Array (HW) maintenance can be done in a rolling fashion • Stretch clusters
ASM and cluster DB architecture • Oracle architecture of redundant low-cost components (diagram: servers, SAN, storage)
Files, extents, and failure groups (diagrams: files and extent pointers; failgroups and ASM mirroring)
ASM disk groups • Example: HW = 4 disk arrays with 8 disks each • An ASM diskgroup is created using all available disks • The end result is similar to a file system on RAID 1+0 • ASM allows mirroring across storage arrays • Oracle RDBMS processes access the storage directly (raw disk access) (diagram: one ASM diskgroup mirrored across Failgroup1 and Failgroup2, striped within each failgroup)
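For illustration, a normal-redundancy diskgroup mirrored across two arrays can be created as in the sketch below; the diskgroup, failgroup, and device path names are hypothetical examples following the naming scheme shown later, not the exact commands used at CERN.

-- Minimal sketch: normal redundancy with one failgroup per storage array.
-- Diskgroup name, failgroup names, and device paths are hypothetical.
CREATE DISKGROUP DATA_DG1 NORMAL REDUNDANCY
  FAILGROUP rstor901 DISK '/dev/mpath/rstor901_1p1', '/dev/mpath/rstor901_2p1'
  FAILGROUP rstor902 DISK '/dev/mpath/rstor902_1p1', '/dev/mpath/rstor902_2p1';
-- ASM mirrors each extent between the two failgroups, so an entire
-- storage array can be lost without losing the diskgroup.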
Performance and scalability • ASM with normal redundancy • Stress tested for CERN’s use cases • Scales and performs well
Case Study: the largest cluster I have ever installed, RAC5 • The test used: 14 servers • Multipathed fibre channel • 8 FC switches: 4 Gbps (10 Gbps uplink) • Many spindles: 26 storage arrays (16 SATA disks each)
Case Study: I/O metrics for the RAC5 cluster • Measured, sequential I/O • Read: 6 GB/s • Read-write: 3+3 GB/s • Measured, small random I/O • Read: 40K IOPS (8 KB read ops) • Note: • 410 SATA disks, 26 HBAs on the storage arrays • Servers: 14 x 4+4 Gbps HBAs, 112 cores, 224 GB of RAM
How the test was run • A custom SQL-based DB workload: • IOPS: randomly probe a large table (several TBs) via many parallel query slaves (each reads a single block at a time) • MBPS: read a large (several TBs) table with parallel query • The test table used for the RAC5 cluster was 5 TB in size • Created inside a disk group of 70 TB
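The exact test queries are not in the slides; a minimal sketch of the two workload types, assuming a hypothetical multi-TB table TEST_BIG with a numeric key ID and an index TEST_BIG_PK, might look like this:

-- Hypothetical sketch of the two workloads (table, column, and index names assumed).
-- MBPS: full scan with parallel query, driving large sequential reads.
SELECT /*+ FULL(t) PARALLEL(t, 16) */ COUNT(*) FROM test_big t;

-- IOPS: many concurrent sessions each probe random keys, so every access
-- is a small (8 KB) single-block random read.
SELECT /*+ INDEX(t test_big_pk) */ COUNT(*)
FROM   test_big t
WHERE  id = TRUNC(DBMS_RANDOM.VALUE(1, 1e9));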
Possible pitfalls • Production Stories • Sharing experiences • 3 years in production, 550 TB of raw capacity
Rebalancing speed • Rebalancing is performed (and mandatory) after space management operations • Typically after HW failures (restore mirror) • Goal: balanced space allocation across disks • Not based on performance or utilization • ASM instances are in charge of rebalancing • Scalability of rebalancing operations? • In 10g serialization wait events can limit scalability • Even at maximum speed rebalancing is not always I/O bound
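The rebalance speed can be controlled per operation and per ASM instance; a short sketch, with DATA_DG1 as an assumed diskgroup name:

-- Run the pending rebalance of this diskgroup at maximum speed
-- (POWER ranges from 0 to 11 in 10g; higher values use more slaves).
ALTER DISKGROUP DATA_DG1 REBALANCE POWER 11;

-- Default power used by this ASM instance for future rebalances.
ALTER SYSTEM SET asm_power_limit = 5;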
VLDB and rebalancing • Rebalancing operations can move more data than expected • Example: • 5 TB (allocated): ~100 disks, 200 GB each • A disk is replaced (diskgroup rebalance) • The total I/O workload is 1.6 TB (8x the disk size!) • How to see this: query v$asm_operation; the column EST_WORK keeps growing during the rebalance • The issue: excessive repartnering
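A running rebalance can be followed from the ASM instance with a query along these lines; watching EST_WORK grow is what reveals the extra repartnering work:

-- Progress of the current rebalance, as seen from the ASM instance.
-- EST_WORK is the estimated number of allocation units still to be moved.
SELECT group_number, operation, state, power,
       sofar, est_work, est_rate, est_minutes
FROM   v$asm_operation;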
Rebalancing issues wrap-up • Rebalancing can be slow • Many hours for very large disk groups • Associated risk: a 2nd disk failure while rebalancing • Worst case: loss of the diskgroup if partner disks fail
Fast Mirror Resync • ASM 10g with normal redundancy does not allow offlining part of the storage • A transient error in a storage array can cause several hours of rebalancing to drop and add disks • It is a limiting factor for scheduled maintenance • 11g has a new feature, ‘fast mirror resync’ • Great feature for rolling interventions on HW
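In 11g this is driven by the diskgroup’s disk repair time and by offlining/onlining the affected failgroup around the intervention; a sketch with assumed names DATA_DG1 and RSTOR901:

-- Requires the 11.1 diskgroup compatibility attributes.
ALTER DISKGROUP DATA_DG1 SET ATTRIBUTE 'disk_repair_time' = '8h';

-- Take one storage array offline for a rolling HW intervention;
-- ASM tracks changed extents instead of dropping the disks.
ALTER DISKGROUP DATA_DG1 OFFLINE DISKS IN FAILGROUP RSTOR901;

-- After the maintenance, only the stale extents are resynchronized.
ALTER DISKGROUP DATA_DG1 ONLINE DISKS IN FAILGROUP RSTOR901;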
ASM and filesystem utilities • Only a few tools can access ASM • asmcmd, dbms_file_transfer, XDB, FTP • Limited operations (no copy, rename, etc.) • Require open DB instances • File operations are difficult in 10g • 11g asmcmd has the copy command
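As an example of what is already possible in 10g, dbms_file_transfer can copy a file out of ASM, but only through an open database instance and pre-created directory objects; the directory paths and the ASM file name below are assumptions for illustration:

-- 10g-style copy out of ASM via an open database instance.
-- Directory paths and the ASM file name are hypothetical examples.
CREATE DIRECTORY asm_src AS '+DATA_DG1/orcl/datafile';
CREATE DIRECTORY fs_dst  AS '/backup/orcl';

BEGIN
  DBMS_FILE_TRANSFER.COPY_FILE(
    source_directory_object      => 'ASM_SRC',
    source_file_name             => 'users.272.657641117',
    destination_directory_object => 'FS_DST',
    destination_file_name        => 'users01.dbf');
END;
/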
ASM and corruption • ASM metadata corruption • Can be caused by ‘bugs’ • One case in production after a disk eviction • Physical data corruption • ASM automatically fixes most corruption on the primary extent • Typically when doing a full backup • Secondary extent corruption goes undetected until a disk failure or rebalance exposes it
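Since most corruption surfaces while RMAN reads every block during a full backup, a simple follow-up check is to query the corruption view; a minimal sketch:

-- Blocks reported corrupt by the last RMAN backup or validate run.
-- Per the behaviour described above, a corrupt primary extent is usually
-- repaired from its mirror and disappears from this view afterwards.
SELECT file#, block#, blocks, corruption_type
FROM   v$database_block_corruption;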
Disaster recovery • Corruption issues were fixed by using a physical standby to move to ‘fresh’ storage • Our experience: for HA, disaster recovery is also needed • Standby DB • On-disk (flash) copy of the DB
Storage deployment • Current storage deployment for Physics Databases at CERN • SAN, FC (4 Gb/s) storage enclosures with SATA disks (8 or 16) • Linux x86_64, no ASMLib, device mapper instead (naming persistence + HA) • Over 150 FC storage arrays (production, integration and test) and ~2000 LUNs exposed • Biggest DB over 7 TB (more to come when the LHC starts – estimated growth up to 11 TB/year)
Storage deployment • ASM implementation details • Storage in JBOD configuration (1 disk -> 1 LUN) • Each disk partitioned at the OS level • 1st partition – 45% of the disk size – outer sectors – faster part of the disk – short stroke • 2nd partition – the rest – inner sectors – slower part of the disk – full stroke
Storage deployment • Two diskgroups created for each cluster • DATA – data files and online redo logs – outer part of the disks • RECO – flash recovery area destination – archived redo logs and on-disk backups – inner part of the disks • One failgroup per storage array (diagram: DATA_DG1 and RECO_DG1 spanning Failgroup1–Failgroup4)
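Once the two diskgroups exist, the databases are simply pointed at them; a minimal sketch (diskgroup names taken from the diagram, the recovery area size is an assumption):

-- New data files and online redo logs go to DATA; the flash recovery
-- area (archived redo logs and on-disk backups) goes to RECO.
ALTER SYSTEM SET db_create_file_dest = '+DATA_DG1' SCOPE=BOTH;
ALTER SYSTEM SET db_recovery_file_dest_size = 2000G SCOPE=BOTH;
ALTER SYSTEM SET db_recovery_file_dest = '+RECO_DG1' SCOPE=BOTH;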
Storage management • Setting up the SAN in a JBOD configuration – many steps, can be time consuming • Storage level • logical disks • LUNs • mappings • FC infrastructure – zoning • OS – creating the device mapper configuration • multipath.conf – name persistency + HA
Storage management • Storage manageability • DBAs set up the initial configuration • ASM – extra maintenance in case of storage maintenance (disk failure) • Problems • How to quickly set up the SAN configuration • How to manage disks and keep track of the mappings: physical disk -> LUN -> Linux disk -> ASM disk, e.g. SCSI [1:0:1:3] & [2:0:1:3] -> /dev/sdn & /dev/sdax -> /dev/mpath/rstor901_3 -> ASM – TEST1_DATADG1_0016
Storage management • Solution • Configuration DB – repository of FC switches, port allocations and all SCSI identifiers for all nodes and storage arrays • Big initial effort • Easy to maintain • High ROI • Custom tools • Tools to identify • SCSI (block) device <-> device mapper device <-> physical storage and FC port • Device mapper device <-> ASM disk • Automatic generation of the device mapper configuration
Storage management SCSI id (host,channel,id) -> storage name and FC port SCSI ID -> block device-> device mapper name and status -> storage name and FC port [ ~]$ lssdisks.py The following storages are connected: * Host interface 1: Target ID 1:0:0: - WWPN: 210000D0230BE0B5 - Storage: rstor316, Port: 0 Target ID 1:0:1: - WWPN: 210000D0231C3F8D - Storage: rstor317, Port: 0 Target ID 1:0:2: - WWPN: 210000D0232BE081 - Storage: rstor318, Port: 0 Target ID 1:0:3: - WWPN: 210000D0233C4000 - Storage: rstor319, Port: 0 Target ID 1:0:4: - WWPN: 210000D0234C3F68 - Storage: rstor320, Port: 0 * Host interface 2: Target ID 2:0:0: - WWPN: 220000D0230BE0B5 - Storage: rstor316, Port: 1 Target ID 2:0:1: - WWPN: 220000D0231C3F8D - Storage: rstor317, Port: 1 Target ID 2:0:2: - WWPN: 220000D0232BE081 - Storage: rstor318, Port: 1 Target ID 2:0:3: - WWPN: 220000D0233C4000 - Storage: rstor319, Port: 1 Target ID 2:0:4: - WWPN: 220000D0234C3F68 - Storage: rstor320, Port: 1 SCSI Id Block DEV MPath name MP status Storage Port ------------- ---------------- -------------------- ---------- ------------------ ----- [0:0:0:0] /dev/sda - - - - [1:0:0:0] /dev/sdb rstor316_CRS OK rstor316 0 [1:0:0:1] /dev/sdc rstor316_1 OK rstor316 0 [1:0:0:2] /dev/sdd rstor316_2 FAILED rstor316 0 [1:0:0:3] /dev/sde rstor316_3 OK rstor316 0 [1:0:0:4] /dev/sdf rstor316_4 OK rstor316 0 [1:0:0:5] /dev/sdg rstor316_5 OK rstor316 0 [1:0:0:6] /dev/sdh rstor316_6 OK rstor316 0 . . . . . . Custom made script
Storage management device mapper name -> ASM disk and status [ ~]$ listdisks.py DISK NAME GROUP_NAME FG H_STATUS MODE MOUNT_S STATE TOTAL_GB USED_GB ---------------- ------------------ ------------- ---------- ---------- ------- -------- ------- ------ ----- rstor401_1p1 RAC9_DATADG1_0006 RAC9_DATADG1 RSTOR401 MEMBER ONLINE CACHED NORMAL 111.8 68.5 rstor401_1p2 RAC9_RECODG1_0000 RAC9_RECODG1 RSTOR401 MEMBER ONLINE CACHED NORMAL 119.9 1.7 rstor401_2p1 -- -- -- UNKNOWN ONLINE CLOSED NORMAL 111.8 111.8 rstor401_2p2 -- -- -- UNKNOWN ONLINE CLOSED NORMAL 120.9 120.9 rstor401_3p1 RAC9_DATADG1_0007 RAC9_DATADG1 RSTOR401 MEMBER ONLINE CACHED NORMAL 111.8 68.6 rstor401_3p2 RAC9_RECODG1_0005 RAC9_RECODG1 RSTOR401 MEMBER ONLINE CACHED NORMAL 120.9 1.8 rstor401_4p1 RAC9_DATADG1_0002 RAC9_DATADG1 RSTOR401 MEMBER ONLINE CACHED NORMAL 111.8 68.5 rstor401_4p2 RAC9_RECODG1_0002 RAC9_RECODG1 RSTOR401 MEMBER ONLINE CACHED NORMAL 120.9 1.8 rstor401_5p1 RAC9_DATADG1_0001 RAC9_DATADG1 RSTOR401 MEMBER ONLINE CACHED NORMAL 111.8 68.5 rstor401_5p2 RAC9_RECODG1_0006 RAC9_RECODG1 RSTOR401 MEMBER ONLINE CACHED NORMAL 120.9 1.8 rstor401_6p1 RAC9_DATADG1_0005 RAC9_DATADG1 RSTOR401 MEMBER ONLINE CACHED NORMAL 111.8 68.5 rstor401_6p2 RAC9_RECODG1_0007 RAC9_RECODG1 RSTOR401 MEMBER ONLINE CACHED NORMAL 120.9 1.8 rstor401_7p1 RAC9_DATADG1_0000 RAC9_DATADG1 RSTOR401 MEMBER ONLINE CACHED NORMAL 111.8 68.6 rstor401_7p2 RAC9_RECODG1_0001 RAC9_RECODG1 RSTOR401 MEMBER ONLINE CACHED NORMAL 120.9 1.8 rstor401_8p1 RAC9_DATADG1_0004 RAC9_DATADG1 RSTOR401 MEMBER ONLINE CACHED NORMAL 111.8 68.6 rstor401_8p2 RAC9_RECODG1_0004 RAC9_RECODG1 RSTOR401 MEMBER ONLINE CACHED NORMAL 120.9 1.8 rstor401_CRS1 rstor401_CRS2 rstor401_CRS3 rstor402_1p1 RAC9_DATADG1_0015 RAC9_DATADG1 RSTOR402 MEMBER ONLINE CACHED NORMAL 111.8 59.9 . . . . . . Custom made script
Storage management [ ~]$ gen_multipath.py # multipath default configuration for PDB defaults { udev_dir /dev polling_interval 10 selector "round-robin 0" . . . } . . . multipaths { multipath { wwid 3600d0230006c26660be0b5080a407e00 alias rstor916_CRS } multipath { wwid 3600d0230006c26660be0b5080a407e01 alias rstor916_1 } . . . } Custom made script device mapper alias – naming persistency and multipathing (HA) SCSI [1:0:1:3] & [2:0:1:3] ->/dev/sdn & /dev/sdax ->/dev/mpath/rstor916_1
Storage monitoring • ASM-based mirroring means that Oracle DBAs need to be alerted of disk failures and evictions • Dashboard – global overview – custom solution – RACMon • ASM level monitoring • Oracle Enterprise Manager Grid Control • RACMon – alerts on missing disks and failgroups, plus dashboard • Storage level monitoring • RACMon – LUNs’ health and storage configuration details – dashboard
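The RACMon checks themselves are not shown in the slides; a minimal sketch of the kind of ASM-level query such alerting can be built on, run against the ASM instance:

-- Disks that ASM can no longer see or that are offline; any row here
-- is a candidate for a DBA alert (sketch of a RACMon-style check).
SELECT name, failgroup, mount_status, mode_status, header_status, state
FROM   v$asm_disk
WHERE  mount_status = 'MISSING'
   OR  mode_status  = 'OFFLINE';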
Storage monitoring • ASM instance level monitoring • Storage level monitoring (screenshots: alert for a new failing disk on RSTOR614; new disk installed on RSTOR903, slot 2)
Conclusions • Oracle ASM diskgroups with normal redundancy • Used at CERN instead of HW RAID • Performance and scalability are very good • Allows the use of low-cost HW • Requires more admin effort from the DBAs than high-end storage • 11g has important improvements • Custom tools to ease administration
Q&A Thank you • Links: • http://cern.ch/phydb • http://www.cern.ch/canali