A Closer Look inside Oracle ASM UKOUG Conference 2007 Luca Canali, CERN IT
Outline • Oracle ASM for DBAs • Introduction and motivations • ASM is not a black box • Investigation of ASM internals • Focus on practical methods and troubleshooting • ASM and VLDB • Metadata, rebalancing and performance • Lessons learned from CERN’s production DB services
ASM • Oracle Automatic Storage Management • Provides the functionality of a volume manager and filesystem for Oracle (DB) files • Works with RAC • Oracle 10g feature aimed at simplifying storage management • Together with Oracle Managed Files and the Flash Recovery Area • An implementation of the S.A.M.E. (Stripe And Mirror Everything) methodology • Goal of increasing performance and reducing cost
ASM for a Clustered Architecture • Oracle architecture of redundant low-cost components • [Diagram: servers connected through a SAN to the storage]
ASM Disk Groups • Example: HW = 4 disk arrays with 8 disks each • An ASM diskgroup is created using all available disks • The end result is similar to a file system on RAID 1+0 • ASM allows mirroring across storage arrays • Oracle RDBMS processes access the storage directly (raw disk access) • [Diagram: ASM diskgroup with mirroring across Failgroup1 and Failgroup2, striping within each failgroup]
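As an illustration, a normal-redundancy diskgroup mirroring across two arrays can be created roughly like this (the diskgroup name, failgroup names and disk paths are examples only, not the exact CERN setup):

create diskgroup TEST1_DATADG1 normal redundancy
  failgroup failgroup1 disk '/dev/mpath/itstor417*p1'   -- disks from array 1
  failgroup failgroup2 disk '/dev/mpath/itstor419*p1';  -- disks from array 2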
Files, Extents, and Failure Groups • [Diagrams: files and extent pointers; failgroups and ASM mirroring]
ASM Is not a Black Box • ASM is implemented as an Oracle instance • Familiar operations for the DBA • Configured with SQL commands • Info in V$ views • Logs in udump and bdump • Some ‘secret’ details hidden in X$ tables and ‘underscore’ parameters
ASM Parameters • Notable ASM instance parameters:

*.asm_diskgroups='TEST1_DATADG1','TEST1_RECODG1'
*.asm_diskstring='/dev/mpath/itstor*p*'
*.asm_power_limit=5
*.shared_pool_size=70M
*.db_cache_size=50M
*.large_pool_size=50M
*.processes=100
More ASM Parameters • Underscore parameters • Several undocumented parameters • Typically they don’t need tuning • Exception: _asm_ausize and _asm_stripesize • May need tuning for VLDB in 10g • New in 11g: diskgroup attributes • Exposed in V$ASM_ATTRIBUTE; most notable: • disk_repair_time • au_size • X$KFENV shows the ‘underscore’ attributes (see the query sketch below)
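On an 11g ASM instance the diskgroup attributes can be inspected with a simple query; a minimal sketch:

select group_number, name, value
from v$asm_attribute
where name in ('au_size', 'disk_repair_time');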
ASM Storage Internals • ASM disks are divided into Allocation Units (AU) • Default size 1 MB (_asm_ausize) • Tunable as a diskgroup attribute in 11g • ASM files are built as a series of extents • Extents are mapped to AUs using a file extent map • When using ‘normal redundancy’, 2 mirrored extents are allocated, each on a different failgroup • RDBMS read operations access only the primary extent of a mirrored couple (unless there is an IO error) • In 10g the ASM extent size = AU size
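For reference, the AU size in use can be checked per diskgroup from the ASM instance; a simple sketch:

select group_number, name, allocation_unit_size
from v$asm_diskgroup;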
ASM Metadata Walkthrough • Three examples follow showing how to read data directly from ASM • Motivations: • Build confidence in the technology, i.e. ‘get a feeling’ of how ASM works • It may prove useful one day for troubleshooting a production issue
Example 1: Direct File Access 1/2 • Goal: reading ASM files with OS tools, using metadata information from X$ tables • Example: find the 2 mirrored extents of the RDBMS spfile

sys@+ASM1> select GROUP_KFFXP Group#, DISK_KFFXP Disk#, AU_KFFXP AU#, XNUM_KFFXP Extent#
           from X$KFFXP
           where number_kffxp=(select file_number from v$asm_alias where name='spfiletest1.ora');

    GROUP#      DISK#        AU#    EXTENT#
---------- ---------- ---------- ----------
         1         16      17528          0
         1          4      14838          0
Example 1: Direct File Access 2/2 • Find the disk path

sys@+ASM1> select disk_number, path from v$asm_disk
           where group_number=1 and disk_number in (16,4);

DISK_NUMBER PATH
----------- ------------------------------------
          4 /dev/mpath/itstor417_1p1
         16 /dev/mpath/itstor419_6p1

• Read data from disk using ‘dd’

dd if=/dev/mpath/itstor419_6p1 bs=1024k count=1 skip=17528 | strings
Example 2: A Different Way • A different metadata table to reach the same goal of reading ASM files directly from the OS:

sys@+ASM1> select GROUP_KFDAT Group#, NUMBER_KFDAT Disk#, AUNUM_KFDAT AU#
           from X$KFDAT
           where fnum_kfdat=(select file_number from v$asm_alias where name='spfiletest1.ora');

    GROUP#      DISK#        AU#
---------- ---------- ----------
         1          4      14838
         1         16      17528
Example 3: Yet Another Way • Using the internal package dbms_diskgroup

declare
  fileType varchar2(50);
  fileName varchar2(50);
  fileSz   number;
  blkSz    number;
  hdl      number;
  plkSz    number;
  data_buf raw(4096);
begin
  fileName := '+TEST1_DATADG1/TEST1/spfiletest1.ora';
  dbms_diskgroup.getfileattr(fileName, fileType, fileSz, blkSz);
  dbms_diskgroup.open(fileName, 'r', fileType, blkSz, hdl, plkSz, fileSz);
  dbms_diskgroup.read(hdl, 1, blkSz, data_buf);
  dbms_output.put_line(data_buf);
end;
/
DBMS_DISKGROUP • Can be used to read/write ASM files directly • It’s an Oracle internal package • Does not require an RDBMS instance • 11g’s asmcmd cp command uses dbms_diskgroup
File Transfer Between OS and ASM • The supported tools (10g) • RMAN • DBMS_FILE_TRANSFER • FTP (XDB) • WebDAV (XDB) • They all require an RDBMS instance (see the DBMS_FILE_TRANSFER sketch below) • In 11g, all the above plus asmcmd • cp command • Works directly with the ASM instance
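For example, DBMS_FILE_TRANSFER can copy a file between ASM and the local filesystem through directory objects; a minimal sketch run on an RDBMS instance, where the directory names and destination file name are made up for illustration:

create or replace directory asm_dir as '+TEST1_DATADG1/TEST1/datafile';
create or replace directory os_dir  as '/tmp';

begin
  dbms_file_transfer.copy_file(
    source_directory_object      => 'ASM_DIR',
    source_file_name             => 'USERS.268.612033477',
    destination_directory_object => 'OS_DIR',
    destination_file_name        => 'users_copy.dbf');
end;
/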
Strace and ASM 1/3 • Goal: understand strace output when using ASM storage • Example: read64(15,"#33\0@\"..., 8192, 473128960)=8192 • This is a read operation of 8 KB from file descriptor 15 at offset 473128960 • What are the segment name, type, file# and block#?
Strace and ASM 2/3 • From /proc/<pid>/fd I find that FD=15 is /dev/mpath/itstor420_1p1 • This is disk 20 of diskgroup 1 (from v$asm_disk) • From x$kffxp I find the ASM file# and extent#: • Note: offset 473128960 = 451 MB + 27 * 8KB

sys@+ASM1> select number_kffxp, xnum_kffxp from x$kffxp
           where group_kffxp=1 and disk_kffxp=20 and au_kffxp=451;

NUMBER_KFFXP XNUM_KFFXP
------------ ----------
         268         17
Strace and ASM 3/3 • From v$asm_alias I find the file alias for file 268: USERS.268.612033477 • From the v$datafile view I find the RDBMS file#: 9 • From dba_extents I finally find the owner and segment name relative to the original IO operation (datafile block# = extent# * AU size / block size + block offset within the extent = 17*1024*1024/8192 + 27):

sys@TEST1> select owner, segment_name, segment_type from dba_extents
           where file_id=9
           and 27 + 17*1024*1024/8192 between block_id and block_id+blocks;

OWNER SEGMENT_NAME SEGMENT_TYPE
----- ------------ ------------
SCOTT EMP          TABLE
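The two lookups summarized above can be written as queries; a sketch using the names from this example:

-- on the ASM instance: alias of ASM file 268 in diskgroup 1
select name from v$asm_alias where group_number=1 and file_number=268;

-- on the RDBMS instance: datafile number of that ASM file
select file#, name from v$datafile where upper(name) like '%USERS.268.612033477%';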
Investigation of Fine Striping • An application: finding the layout of fine-striped files • Explored using strace of an oracle session executing ‘alter system dump logfile ..’ • Result: round-robin distribution over 8 x 1MB extents, fine striping size = 128KB (1MB/8) • [Diagram: 128KB stripes A0..A7 and B0..B7 laid out round robin across 8 AUs of 1 MB]
Metadata Files • ASM diskgroups contain ‘hidden files’ • Not listed in V$ASM_FILE (file# < 256) • Details are available in X$KFFIL • In addition the first 2 AUs of each disk are marked as file#=0 in X$KFDAT • Example (10g):

GROUP# FILE# FILESIZE_AFTER_MIRR RAW_FILE_SIZE
------ ----- ------------------- -------------
     1     1             2097152       6291456
     1     2             1048576       3145728
     1     3           264241152     795869184
     1     4             1392640       6291456
     1     5             1048576       3145728
     1     6             1048576       3145728
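A quick way to see that the metadata files are hidden: V$ASM_FILE only starts at file number 256 (a simple check, run on the ASM instance):

select min(file_number), max(file_number)
from v$asm_file
where group_number=1;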
ASM Metadata 1/2 • File#0, AU=0: disk header (disk name, etc), Allocation Table (AT) and Free Space Table (FST) • File#0, AU=1: Partner Status Table (PST) • File#1: File Directory (files and their extent pointers) • File#2: Disk Directory • File#3: Active Change Directory (ACD) • The ACD is analogous to a redo log, where changes to the metadata are logged. • Size=42MB * number of instances Source: Oracle Automatic Storage Management, Oracle Press Nov 2007, N. Vengurlekar, M. Vallath, R.Long
ASM Metadata 2/2 • File#4: Continuing Operation Directory (COD). • The COD is analogous to an undo tablespace. It maintains the state of active ASM operations such as disk or datafile drop/add. The COD log record is either committed or rolled back based on the success of the operation. • File#5: Template directory • File#6: Alias directory • 11g, File#9: Attribute Directory • 11g, File#12: Staleness registry, created when needed to track offline disks
ASM Rebalancing • Rebalancing is performed (and mandatory) after space management operations, and can also be triggered manually (see the sketch below) • Goal: balanced space allocation across disks • Not based on performance or utilization • ASM spreads every file across all disks in a diskgroup • ASM instances are in charge of rebalancing • Extent pointer changes are communicated to the RDBMS • The RDBMS’ ASMB process keeps an open connection to ASM • This can be observed by running strace against ASMB • In RAC, extra messages are passed between the cluster ASM instances • LMD0 of the ASM instances is very active during rebalance
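For illustration, a rebalance can be started explicitly and its speed controlled with the power clause (the diskgroup name and power value are examples only):

alter diskgroup TEST1_DATADG1 rebalance power 5;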
ASM Rebalancing and VLDB • Performance of Rebalancing is important for VLDB • An ASM instance can use parallel slaves • RBAL coordinates the rebalancing operations • ARBx processes pick up ‘chunks’ of work. By default they log their activity in udump • Does it scale? • In 10g serialization wait events can limit scalability • Even at maximum speed rebalancing is not always I/O bound
ASM Rebalancing Performance • Tracing ASM rebalancing operations • 10046 trace of the ARBx processes • oradebug setospid … • oradebug event 10046 trace name context forever, level 12 • Process the log files (in bdump) with orasrp (tkprof will not work) • Main wait events from my tests with RAC (6 nodes) • DFS lock handle • Waiting for CI level 5 (cross instance lock) • Buffer busy wait • ‘unaccounted for’ • enq: AD - allocate AU • enq: AD - deallocate AU • log write(even) • log write(odd)
ASM Single Instance Rebalancing • Single instance rebalance • Faster in RAC if you can rebalance with only 1 node up (I have observed 20% to 100% speed improvements) • Buffer busy wait can be the main event • It seems to depend on the number of files in the diskgroup • Diskgroups with a small number of (large) files have more contention (ARBx processes operate concurrently on the same file) • Only seen in tests with 10g • 11g has improvements regarding rebalancing contention
Rebalancing, an Example • [Chart omitted] • Data: D. Wojcik, CERN IT
Rebalancing Workload • When ASM mirroring is used (e.g. with normal redundancy) • Rebalancing operations can move more data than expected • Example: • 5 TB (allocated): ~100 disks, 200 GB each • A disk is replaced (diskgroup rebalance) • The total IO workload is 1.6 TB (8x the disk size!) • How to see this: query v$asm_operation, the column EST_WORK keeps growing during rebalance • The issue: excessive repartnering
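The growing EST_WORK can be watched during the rebalance; a minimal monitoring sketch:

select group_number, operation, state, power, sofar, est_work, est_rate, est_minutes
from v$asm_operation;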
ASM Disk Partners • ASM diskgroup with normal redundancy • Two copies of each extent are written to different ‘failgroups’ • Two ASM disks are partners: • When they have at least one extent set in common (they are the 2 sides of a mirror for some data) • Each ASM disk has a limited number of partners • Typically 10 disk partners, listed in X$KFDPARTNER • Helps to reduce the risk associated with 2 simultaneous disk failures
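A commonly used sketch for counting partners per disk; the X$KFDPARTNER column names are internal, so treat them as assumptions that may vary by version:

select disk, count(number_kfdpartner) partners
from x$kfdpartner
where grp=1
group by disk
order by disk;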
Free and Usable Space • When ‘ASM mirroring’ is used not all of the free space should be used • V$ASM_DISKGROUP.USABLE_FILE_MB: • Amount of free space that can be safely utilized taking mirroring into account, while still being able to restore redundancy after a disk failure • It is calculated for the worst-case scenario; in any case it is a best practice not to let it go negative (it can) • This can be a problem when deploying a small number of large LUNs and/or failgroups • (See the headroom query sketch below)
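A quick way to check the headroom, for example:

select name, total_mb, free_mb, required_mirror_free_mb, usable_file_mb
from v$asm_diskgroup;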
Fast Mirror Resync • ASM 10g with normal redundancy does not allow part of the storage to be taken offline • A transient error in a storage array can cause several hours of rebalancing to drop and add disks • It is a limiting factor for scheduled maintenance • 11g has a new feature, ‘fast mirror resync’ • Redundant storage can be put offline for maintenance • Changes are accumulated in the staleness registry (file#12) • Changes are applied when the storage is back online (see the 11g sketch below)
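A minimal 11g sketch of how this can be used; the diskgroup name, failgroup name and repair time are examples only:

alter diskgroup TEST1_DATADG1 set attribute 'disk_repair_time' = '12h';

-- take one failgroup offline for scheduled maintenance
alter diskgroup TEST1_DATADG1 offline disks in failgroup failgroup1;

-- when the storage is back, only the stale extents are resynced
alter diskgroup TEST1_DATADG1 online all;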
Read Performance, Random I/O • IOPS measured with SQL (synthetic test): 8675 IOPS, ~130 IOPS per disk • Destroking: only the external part of the disks is used • [Chart omitted]
Read Performance, Sequential I/O • Limited by the HBAs: 4 x 2 Gb (measured with parallel query) • [Chart omitted]
Implementation Details • Multipathing • Linux Device Mapper (2.6 kernel) • Block devices • RHEL4 and 10gR2 allow skipping the raw device mapping • External half of the disks used for data disk groups • JBOD config • No HW RAID • ASM used to mirror across disk arrays • HW: • Storage arrays (Infortrend): FC controller, SATA disks • FC (Qlogic): 4Gb switch and HBAs (2Gb in older HW) • Servers: 2 CPUs, 4GB RAM, Oracle 10.2.0.3 on RHEL4, RAC of 4 to 8 nodes
Q&A • Links: • http://cern.ch/phydb • http://twiki.cern.ch/twiki/bin/view/PSSGroup/ASM_Internals • http://www.cern.ch/canali