
Implementing ASM Without HW RAID, A User’s Experience


Presentation Transcript


  1. Implementing ASM Without HW RAID, A User’s Experience • Luca Canali, CERN • Dawid Wojcik, CERN • UKOUG, Birmingham, December 2008

  2. Outline • Introduction to ASM • Disk groups, fail groups, normal redundancy • Scalability and performance of the solution • Possible pitfalls, sharing experiences • Implementation details, monitoring, and tools to ease ASM deployment

  3. Architecture and main concepts • Why ASM? • Provides the functionality of a volume manager and a cluster file system • Raw access to storage for performance • Why ASM-provided mirroring? • Allows the use of lower-cost storage arrays • Allows mirroring across storage arrays • Arrays are not single points of failure • Array (HW) maintenance can be done in a rolling fashion • Stretch clusters

  4. ASM and cluster DB architecture • Oracle architecture of redundant low-cost components (diagram: servers, SAN, storage arrays)

  5. Files, extents, and failure groups • (Diagrams: files and extent pointers; failgroups and ASM mirroring)

  6. ASM disk groups • Example: HW = 4 disk arrays with 8 disks each • An ASM diskgroup is created using all available disks • The end result is similar to a file system on RAID 1+0 • ASM allows mirroring across storage arrays • Oracle RDBMS processes access the storage directly (raw disk access) • (Diagram: ASM diskgroup, mirroring across Failgroup1 and Failgroup2, striping within each failgroup)
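For illustration, a minimal SQL sketch of how such a diskgroup can be created from the ASM instance, with one failgroup per storage array so that ASM mirrors extents across arrays (the diskgroup name and device paths are hypothetical, not the actual CERN configuration):

CREATE DISKGROUP data_dg1 NORMAL REDUNDANCY
  FAILGROUP failgroup1 DISK        -- disks from the first storage array
    '/dev/mpath/rstor901_1p1',
    '/dev/mpath/rstor901_2p1'
  FAILGROUP failgroup2 DISK        -- disks from the second storage array
    '/dev/mpath/rstor902_1p1',
    '/dev/mpath/rstor902_2p1';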

  7. Performance and scalability • ASM with normal redundancy • Stress-tested for CERN’s use cases • Scales and performs well

  8. Case Study: the largest cluster I have ever installed, RAC5 • The test used: 14 servers

  9. Multipathed fiber channel • 8 FC switches: 4Gbps (10Gbps uplink)

  10. Many spindles • 26 storage arrays (16 SATA disks each)

  11. Case Study: I/O metrics for the RAC5 cluster • Measured, sequential I/O • Read: 6 GB/sec • Read-write: 3+3 GB/sec • Measured, small random I/O • Read: 40K IOPS (8 KB read ops) • Note: • 410 SATA disks, 26 HBAs on the storage arrays • Servers: 14 x 4+4 Gbps HBAs, 112 cores, 224 GB of RAM

  12. How the test was run • A custom SQL-based DB workload: • IOPS: randomly probe a large table (several TB) via several parallel query slaves (each reads a single block at a time) • MBPS: read a large (several TB) table with parallel query • The test table used for the RAC5 cluster was 5 TB in size • Created inside a disk group of 70 TB
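As a rough sketch of the two workload shapes described above (table name, column names, and the parallel degree are illustrative, not the actual CERN test code):

-- MBPS test: full scan of a multi-TB table with parallel query
SELECT /*+ FULL(t) PARALLEL(t, 16) */ COUNT(*) FROM test_big_table t;

-- IOPS test: many concurrent sessions, each probing random rows by key,
-- so every probe costs roughly one single-block (8 KB) read
SELECT payload FROM test_big_table WHERE id = :random_id;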

  13. Possible pitfalls • Production Stories • Sharing experiences • 3 years in production, 550 TB of raw capacity

  14. Rebalancing speed • Rebalancing is performed (and mandatory) after space management operations • Typically after HW failures (restore mirror) • Goal: balanced space allocation across disks • Not based on performance or utilization • ASM instances are in charge of rebalancing • Scalability of rebalancing operations? • In 10g serialization wait events can limit scalability • Even at maximum speed rebalancing is not always I/O bound
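A rebalance starts automatically after a disk is added or dropped; its power limit can also be raised manually, as in this sketch (the diskgroup name and power value are illustrative):

-- raise the rebalance power for an ongoing or future rebalance (0-11 in 10g)
ALTER DISKGROUP data_dg1 REBALANCE POWER 8;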

  15. Rebalancing, an example

  16. VLDB and rebalancing • Rebalancing operations can move more data than expected • Example: • 5 TB (allocated): ~100 disks, 200 GB each • A disk is replaced (diskgroup rebalance) • The total I/O workload is 1.6 TB (8x the disk size!) • How to see this: query v$asm_operation; the EST_WORK column keeps growing during the rebalance • The issue: excessive repartnering
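The rebalance progress can be followed from the ASM instance with a query along these lines (a sketch; EST_WORK is the column observed to keep growing):

-- one row per active rebalance operation in the local ASM instance
SELECT group_number, operation, state, power,
       sofar, est_work, est_rate, est_minutes
  FROM v$asm_operation;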

  17. Rebalancing issues wrap-up • Rebalancing can be slow • Many hours for very large disk groups • Associated risk • A 2nd disk failure while rebalancing • Worst case: loss of the diskgroup because partner disks fail

  18. Fast Mirror Resync • ASM 10g with normal redundancy does not allow offlining part of the storage • A transient error in a storage array can cause several hours of rebalancing to drop and re-add disks • This is a limiting factor for scheduled maintenance • 11g has the new ‘fast mirror resync’ feature • Great feature for rolling interventions on HW
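A sketch of how the 11g feature can be used for a rolling hardware intervention (diskgroup name, failgroup name, and repair window are illustrative):

-- keep offlined disks for up to 4 hours instead of dropping them
ALTER DISKGROUP data_dg1 SET ATTRIBUTE 'disk_repair_time' = '4h';
-- take a whole failgroup (one storage array) offline for maintenance
ALTER DISKGROUP data_dg1 OFFLINE DISKS IN FAILGROUP failgroup2;
-- ... hardware intervention on the array ...
-- bring the disks back; only the changed extents are resynchronized
ALTER DISKGROUP data_dg1 ONLINE DISKS IN FAILGROUP failgroup2;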

  19. ASM and filesystem utilities • Only a few tools can access ASM files • asmcmd, dbms_file_transfer, XDB, FTP • Limited operations (no copy, rename, etc.) • Require open DB instances • File operations are difficult in 10g • 11g asmcmd has the copy command
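In 10g one workaround is to copy files through the database with DBMS_FILE_TRANSFER, roughly as in this sketch (directory objects and file names are examples only):

CREATE DIRECTORY asm_src AS '+TEST1_DATADG1/orcl/datafile';
CREATE DIRECTORY fs_dst  AS '/tmp/asm_copies';
BEGIN
  DBMS_FILE_TRANSFER.COPY_FILE(
    source_directory_object      => 'ASM_SRC',
    source_file_name             => 'users.283.657641191',  -- example ASM file name
    destination_directory_object => 'FS_DST',
    destination_file_name        => 'users_copy.dbf');
END;
/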

  20. ASM and corruption • ASM metadata corruption • Can be caused by bugs • One case in production after a disk eviction • Physical data corruption • ASM automatically fixes most corruption on primary extents • Typically when doing a full backup • Secondary extent corruption goes undetected until a disk failure or rebalance exposes it

  21. Disaster recovery • Corruption issues were fixed by using a physical standby to move to ‘fresh’ storage • For HA, our experience is that disaster recovery is needed • Standby DB • On-disk (flash) copy of the DB

  22. Implementation details

  23. Storage deployment • Current storage deployment for Physics Databases at CERN • SAN, FC (4 Gb/s) storage enclosures with SATA disks (8 or 16) • Linux x86_64, no ASMLib, device mapper instead (naming persistence + HA) • Over 150 FC storage arrays (production, integration and test) and ~2000 LUNs exposed • Biggest DB over 7 TB (more to come when LHC starts – estimated growth up to 11 TB/year)

  24. Storage deployment • ASM implementation details • Storage in JBOD configuration (1 disk -> 1 LUN) • Each disk partitioned at the OS level • 1st partition – 45% of the disk size – faster part of the disk – short stroke (outer sectors) • 2nd partition – the rest – slower part – full stroke (inner sectors)

  25. Storage deployment • Two diskgroups created for each cluster • DATA – data files and online redo logs – outer part of the disks • RECO – flash recovery area destination – archived redo logs and on-disk backups – inner part of the disks • One failgroup per storage array • (Diagram: DATA_DG1 and RECO_DG1 spanning Failgroup1, Failgroup2, Failgroup3, Failgroup4)

  26. Storage management • SAN setup in JBOD configuration – many steps, can be time-consuming • Storage level • Logical disks • LUNs • Mappings • FC infrastructure – zoning • OS – creating the device mapper configuration • multipath.conf – name persistency + HA

  27. Storage management • Storage manageability • DBAs set up the initial configuration • ASM – extra maintenance in case of storage maintenance (disk failure) • Problems • How to quickly set up the SAN configuration • How to manage disks and keep track of the mappings: physical disk -> LUN -> Linux disk -> ASM disk • Example: SCSI [1:0:1:3] & [2:0:1:3] -> /dev/sdn & /dev/sdax -> /dev/mpath/rstor901_3 -> ASM – TEST1_DATADG1_0016
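The ASM side of this mapping can be read from the ASM instance; a sketch of the kind of query such custom tools can build on (the views are standard, the output formatting is up to the tool):

-- device mapper path -> ASM disk name, failgroup and diskgroup
SELECT d.path, d.name, d.failgroup, g.name AS diskgroup_name
  FROM v$asm_disk d
  LEFT JOIN v$asm_diskgroup g ON g.group_number = d.group_number
 ORDER BY d.path;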

  28. Storage management • Solution • Configuration DB – repository of FC switches, port allocations and all SCSI identifiers for all nodes and storage arrays • Big initial effort • Easy to maintain • High ROI • Custom tools • Tools to identify • SCSI (block) device <-> device mapper device <-> physical storage and FC port • Device mapper device <-> ASM disk • Automatic generation of the device mapper configuration

  29. Storage management • Custom-made script lssdisks.py • SCSI id (host,channel,id) -> storage name and FC port • SCSI id -> block device -> device mapper name and status -> storage name and FC port

[ ~]$ lssdisks.py
The following storages are connected:
 * Host interface 1:
   Target ID 1:0:0: - WWPN: 210000D0230BE0B5 - Storage: rstor316, Port: 0
   Target ID 1:0:1: - WWPN: 210000D0231C3F8D - Storage: rstor317, Port: 0
   Target ID 1:0:2: - WWPN: 210000D0232BE081 - Storage: rstor318, Port: 0
   Target ID 1:0:3: - WWPN: 210000D0233C4000 - Storage: rstor319, Port: 0
   Target ID 1:0:4: - WWPN: 210000D0234C3F68 - Storage: rstor320, Port: 0
 * Host interface 2:
   Target ID 2:0:0: - WWPN: 220000D0230BE0B5 - Storage: rstor316, Port: 1
   Target ID 2:0:1: - WWPN: 220000D0231C3F8D - Storage: rstor317, Port: 1
   Target ID 2:0:2: - WWPN: 220000D0232BE081 - Storage: rstor318, Port: 1
   Target ID 2:0:3: - WWPN: 220000D0233C4000 - Storage: rstor319, Port: 1
   Target ID 2:0:4: - WWPN: 220000D0234C3F68 - Storage: rstor320, Port: 1

SCSI Id       Block DEV        MPath name           MP status  Storage            Port
------------- ---------------- -------------------- ---------- ------------------ -----
[0:0:0:0]     /dev/sda         -                    -          -                  -
[1:0:0:0]     /dev/sdb         rstor316_CRS         OK         rstor316           0
[1:0:0:1]     /dev/sdc         rstor316_1           OK         rstor316           0
[1:0:0:2]     /dev/sdd         rstor316_2           FAILED     rstor316           0
[1:0:0:3]     /dev/sde         rstor316_3           OK         rstor316           0
[1:0:0:4]     /dev/sdf         rstor316_4           OK         rstor316           0
[1:0:0:5]     /dev/sdg         rstor316_5           OK         rstor316           0
[1:0:0:6]     /dev/sdh         rstor316_6           OK         rstor316           0
. . .

  30. Storage management • Custom-made script listdisks.py • Device mapper name -> ASM disk and status

[ ~]$ listdisks.py
DISK             NAME               GROUP_NAME     FG        H_STATUS MODE   MOUNT_S STATE  TOTAL_GB USED_GB
---------------- ------------------ -------------- --------- -------- ------ ------- ------ -------- -------
rstor401_1p1     RAC9_DATADG1_0006  RAC9_DATADG1   RSTOR401  MEMBER   ONLINE CACHED  NORMAL    111.8    68.5
rstor401_1p2     RAC9_RECODG1_0000  RAC9_RECODG1   RSTOR401  MEMBER   ONLINE CACHED  NORMAL    119.9     1.7
rstor401_2p1     --                 --             --        UNKNOWN  ONLINE CLOSED  NORMAL    111.8   111.8
rstor401_2p2     --                 --             --        UNKNOWN  ONLINE CLOSED  NORMAL    120.9   120.9
rstor401_3p1     RAC9_DATADG1_0007  RAC9_DATADG1   RSTOR401  MEMBER   ONLINE CACHED  NORMAL    111.8    68.6
rstor401_3p2     RAC9_RECODG1_0005  RAC9_RECODG1   RSTOR401  MEMBER   ONLINE CACHED  NORMAL    120.9     1.8
rstor401_4p1     RAC9_DATADG1_0002  RAC9_DATADG1   RSTOR401  MEMBER   ONLINE CACHED  NORMAL    111.8    68.5
rstor401_4p2     RAC9_RECODG1_0002  RAC9_RECODG1   RSTOR401  MEMBER   ONLINE CACHED  NORMAL    120.9     1.8
rstor401_5p1     RAC9_DATADG1_0001  RAC9_DATADG1   RSTOR401  MEMBER   ONLINE CACHED  NORMAL    111.8    68.5
rstor401_5p2     RAC9_RECODG1_0006  RAC9_RECODG1   RSTOR401  MEMBER   ONLINE CACHED  NORMAL    120.9     1.8
rstor401_6p1     RAC9_DATADG1_0005  RAC9_DATADG1   RSTOR401  MEMBER   ONLINE CACHED  NORMAL    111.8    68.5
rstor401_6p2     RAC9_RECODG1_0007  RAC9_RECODG1   RSTOR401  MEMBER   ONLINE CACHED  NORMAL    120.9     1.8
rstor401_7p1     RAC9_DATADG1_0000  RAC9_DATADG1   RSTOR401  MEMBER   ONLINE CACHED  NORMAL    111.8    68.6
rstor401_7p2     RAC9_RECODG1_0001  RAC9_RECODG1   RSTOR401  MEMBER   ONLINE CACHED  NORMAL    120.9     1.8
rstor401_8p1     RAC9_DATADG1_0004  RAC9_DATADG1   RSTOR401  MEMBER   ONLINE CACHED  NORMAL    111.8    68.6
rstor401_8p2     RAC9_RECODG1_0004  RAC9_RECODG1   RSTOR401  MEMBER   ONLINE CACHED  NORMAL    120.9     1.8
rstor401_CRS1
rstor401_CRS2
rstor401_CRS3
rstor402_1p1     RAC9_DATADG1_0015  RAC9_DATADG1   RSTOR402  MEMBER   ONLINE CACHED  NORMAL    111.8    59.9
. . .

  31. Storage management • Custom-made script gen_multipath.py • Device mapper alias – naming persistency and multipathing (HA) • SCSI [1:0:1:3] & [2:0:1:3] -> /dev/sdn & /dev/sdax -> /dev/mpath/rstor916_1

[ ~]$ gen_multipath.py
# multipath default configuration for PDB
defaults {
    udev_dir          /dev
    polling_interval  10
    selector          "round-robin 0"
    . . .
}
. . .
multipaths {
    multipath {
        wwid   3600d0230006c26660be0b5080a407e00
        alias  rstor916_CRS
    }
    multipath {
        wwid   3600d0230006c26660be0b5080a407e01
        alias  rstor916_1
    }
    . . .
}

  32. Storage monitoring • ASM-based mirroring means that Oracle DBAs need to be alerted of disk failures and evictions • Dashboard – global overview – custom solution – RACMon • ASM level monitoring • Oracle Enterprise Manager Grid Control • RACMon – alerts on missing disks and failgroups, plus dashboard • Storage level monitoring • RACMon – LUNs’ health and storage configuration details – dashboard
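As a sketch of the kind of ASM-level check such alerting can be based on (the actual RACMon implementation is not shown in the slides):

-- disks that are missing or offline in any mounted diskgroup
SELECT group_number, name, failgroup, path, mount_status, mode_status
  FROM v$asm_disk
 WHERE mount_status = 'MISSING' OR mode_status = 'OFFLINE';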

  33. Storage monitoring • ASM instance level monitoring • Storage level monitoring • (Screenshots: new failing disk on RSTOR614; new disk installed on RSTOR903, slot 2)

  34. Conclusions • Oracle ASM diskgroups with normal redundancy • Used at CERN instead of HW RAID • Performance and scalability are very good • Allows the use of low-cost HW • Requires more admin effort from the DBAs than high-end storage • 11g has important improvements • Custom tools ease administration

  35. Q&A Thank you • Links: • http://cern.ch/phydb • http://www.cern.ch/canali
