Disk Failures

Disk Failures Xiaqing He ID: 204 Dr. Lin

Content
1) Focus: how to recover from disk crashes. The common term is RAID, "redundant array of independent disks".
2) Several schemes to recover from disk crashes:
• Mirroring: RAID level 1
• Parity checks: RAID level 4
• An improvement: RAID level 5
• Multiple disk crashes: RAID level 6

1) Mirroring
• The simplest scheme for recovering from disk crashes
• How does mirroring work?
-- make two or more copies of the data on different disks
• Benefits:
-- the data survives if one disk fails
-- data can be divided among several disks, allowing access to several blocks at once

1) Mirroring (cont'd)
• With mirroring, when can data be lost?
-- only if the second (mirror) disk crashes while the first disk's crash is still being repaired
• Probability. Suppose:
-- each disk has a mean time to failure of 10 years
-- so one of the two disks fails on average every 5 years
-- replacing the failed disk takes 3 hours = 1/2,920 year
So:
-- the probability that the mirror disk fails during the replacement = 1/10 * 1/2,920 = 1/29,200
-- the rate of data loss with mirroring = 1/5 * 1/29,200 = 1/146,000 per year, i.e. a mean time to data loss of 146,000 years
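The arithmetic on this slide can be checked with exact fractions. This is only a sketch of the slide's estimate; the variable names are illustrative and not part of the original deck.

```python
from fractions import Fraction

# Figures taken from the slide: 10-year mean time to failure, 3-hour repair.
disk_failure_rate = Fraction(1, 10)          # failures per disk per year
pair_failure_rate = 2 * disk_failure_rate    # one of two disks fails: 1/5 per year
repair_window = Fraction(3, 24 * 365)        # 3 hours expressed in years

# Chance that the surviving disk also fails while the first is being replaced.
p_mirror_fails_during_repair = disk_failure_rate * repair_window

# Rate of data loss per year, and the corresponding mean time to data loss.
data_loss_rate = pair_failure_rate * p_mirror_fails_during_repair
mean_years_to_data_loss = 1 / data_loss_rate

print(repair_window, p_mirror_fails_during_repair, data_loss_rate)
```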

2) Parity Blocks
• Why change?
-- disadvantage of mirroring: it needs as many redundant disks as data disks
• What's new?
-- RAID level 4 uses only one redundant disk
• How does this one redundant disk work?
-- modulo-2 sum
-- the jth bit of the redundant disk is the modulo-2 sum of the jth bits of all the data disks
• Example

2) Parity Blocks (cont'd): example
Data disks:
• Disk 1: 11110000
• Disk 2: 10101010
• Disk 3: 00111000
Redundant disk:
• Disk 4: 01100010
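A minimal sketch of the modulo-2 sum, using the three data blocks from this example. The helper name `parity_block` is hypothetical; a modulo-2 sum of bits is the same as a bitwise XOR.

```python
from functools import reduce

def parity_block(blocks):
    """Return the bitwise modulo-2 sum (XOR) of equal-length bit strings."""
    width = len(blocks[0])
    value = reduce(lambda a, b: a ^ b, (int(block, 2) for block in blocks))
    return format(value, f"0{width}b")

data_disks = ["11110000", "10101010", "00111000"]
redundant = parity_block(data_disks)
print(redundant)  # 01100010
```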

2) RAID 4 (cont'd)
• Reading
-- the same as reading a block from any disk
• Writing
1) change the block on the data disk;
2) change the corresponding block of the redundant disk
• Why?
-- the redundant disk must continue to hold the parity checks for the corresponding blocks of all the data disks

2) RAID 4 (cont'd): writing
For a total of N data disks:
1) Naïve way:
• read the corresponding blocks of all N data disks and compute their modulo-2 sum;
• rewrite the block of the redundant disk with that modulo-2 sum
2) Better way:
• take the modulo-2 sum of the old and new versions of the data block that was rewritten;
• flip the bits of the redundant block in the positions where that modulo-2 sum has 1's

2) RAID 4 (cont'd): writing example
• Data disks:
• Disk 1: 11110000
• Disk 2: 10101010 => 01100110
• Disk 3: 00111000
• To do:
• Modulo-2 sum of the old and new versions of disk 2: 11001100
• So we need to flip positions 1, 2, 5 and 6 of the redundant disk.
• Redundant disk:
• Disk 4: 01100010 => 10101110
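The "better way" of writing can be sketched as follows, using the blocks from this example. `xor_bits` is an illustrative helper, not something the slides define.

```python
def xor_bits(a, b):
    """Bitwise modulo-2 sum of two equal-length bit strings."""
    return format(int(a, 2) ^ int(b, 2), f"0{len(a)}b")

old_block = "10101010"   # disk 2 before the write
new_block = "01100110"   # disk 2 after the write
old_parity = "01100010"  # disk 4 before the write

# Bits that differ between the old and new data block.
delta = xor_bits(old_block, new_block)

# Positions (counted from 1, left to right) to flip on the redundant disk.
changed_positions = [i + 1 for i, bit in enumerate(delta) if bit == "1"]

# Flipping those positions is the same as XOR-ing the parity with delta.
new_parity = xor_bits(old_parity, delta)
print(delta, changed_positions, new_parity)
```

Only the rewritten data disk and the redundant disk are touched, regardless of how many data disks there are.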

2) RAID 4 (cont'd): failure recovery
• The redundant disk crashes:
-- swap in a new one and recompute its contents from all the data disks
• One of the data disks crashes:
-- swap in a new one;
-- recompute its contents from the other disks, i.e. the remaining data disks and the redundant disk
• How to recompute? (The rule is the same in both cases, which is what makes the next improvement possible.)
-- take the modulo-2 sum of the corresponding bits of all the other disks
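As a sketch of the recovery rule, suppose disk 2 of the earlier parity example is lost. The block values come from that example; the helper name `xor_all` is an assumption made for illustration.

```python
from functools import reduce

def xor_all(blocks):
    """Bitwise modulo-2 sum of equal-length bit strings."""
    blocks = list(blocks)
    value = reduce(lambda a, b: a ^ b, (int(block, 2) for block in blocks))
    return format(value, f"0{len(blocks[0])}b")

# Disk 2 has crashed; disk 4 is the redundant (parity) disk.
surviving = {1: "11110000", 3: "00111000", 4: "01100010"}

# The same rule applies whichever disk failed: XOR all the others.
recovered_disk2 = xor_all(surviving.values())
print(recovered_disk2)  # 10101010
```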

3) An Improvement: RAID 5
• Why is an improvement needed?
-- shortcoming of RAID level 4: the redundant disk is a bottleneck, because every update of any data disk must also read and write the redundant disk
• Principle of RAID level 5 (RAID 5):
-- treat each disk as the redundant disk for some of the blocks
• Why is this feasible?
-- the rule of failure recovery is the same for the redundant disk and for a data disk: "take the modulo-2 sum of the corresponding bits of all the other disks"
-- so there is no need to treat one disk as the redundant disk and the others as data disks

3) RAID 5 (cont'd)
• How do we decide for which blocks a given disk acts as the redundant disk?
-- if there are n+1 disks numbered 0 to n, treat the ith cylinder of disk j as redundant if j is the remainder when i is divided by n+1
• Example

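3) RAID 5 (cont'd): example
n = 3, so there are 4 disks:
• The first disk, numbered 0, is redundant for its cylinders 4, 8, 12, …
• The second disk, numbered 1, is redundant for its cylinders 1, 5, 9, …
• The third disk, numbered 2, is redundant for its cylinders 2, 6, 10, …
• …
Suppose all 4 disks are equally likely to be written. For any one of the 4 disks, the probability of being involved in a given write is:
• 1/4 + 3/4 * 1/3 = 1/2 (it is either the disk holding the block being written, or else the redundant disk for that block)
• In general, with m disks in total: 1/m + (m-1)/m * 1/(m-1) = 2/m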
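The placement rule and the load estimate can be sketched as below. The function names `redundant_disk` and `write_load` are hypothetical, and cylinders are listed from 1 only to match the numbers on the slide.

```python
from fractions import Fraction

def redundant_disk(cylinder, num_disks):
    """Disk that holds the parity for this cylinder: the remainder rule."""
    return cylinder % num_disks

num_disks = 4  # n + 1 disks with n = 3, numbered 0..3

# Cylinders 1..12 grouped by the disk that is redundant for them.
cylinders_by_disk = {
    disk: [c for c in range(1, 13) if redundant_disk(c, num_disks) == disk]
    for disk in range(num_disks)
}

def write_load(m):
    """Probability that a given disk takes part in a write, with m disks in total."""
    return Fraction(1, m) + Fraction(m - 1, m) * Fraction(1, m - 1)

print(cylinders_by_disk)
print(write_load(4))
```

Under these assumptions each disk is involved in half of the writes with four disks, compared with the redundant disk of RAID 4, which is involved in every write.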

4) Coping with multiple disk crashes
• RAID 6: can deal with any number of disk crashes, provided enough redundant disks are used
• Example: a system of seven disks (four data disks, numbered 1-4, and three redundant disks, numbered 5-7)
• How to set up the 3*7 matrix? (Why 3 rows? Because there are 3 redundant disks.)
1) every column is a different combination of three 0's and 1's, excluding the combination of three 0's;
2) the column of each redundant disk has a single 1;
3) the column of each data disk has at least two 1's

4) Coping with multiple disk crashes (cont'd)
• Reading:
• read from the data disks and ignore the redundant disks
• Writing:
• change the data disk
• change the corresponding bits of all the redundant disks whose rows have a 1 in that data disk's column

4) Coping with multiple disk crashes (cont'd)
• In a system with 4 data disks and 3 redundant disks, how can up to 2 disk crashes be corrected?
• Suppose disks a and b have failed:
• find some row r of the 3*7 matrix in which the columns for a and b differ (say a has 0 and b has 1);
• compute the correct b by taking the modulo-2 sum of the corresponding bits of all the disks other than b that have 1 in row r;
• after recovering b, compute the correct a from the other disks, all of which are now available
• Example

4) Coping with multiple disk crashes (cont'd): example
3*7 matrix
                data disks    redundant disks
disk number     1  2  3  4    5  6  7
row 1           1  1  1  0    1  0  0
row 2           1  1  0  1    0  1  0
row 3           1  0  1  1    0  0  1

4) Coping with multiple disk crashes (cont'd): example
First block of all the disks
disk  contents
1)    11110000
2)    10101010
3)    00111000
4)    01000001
5)    01100010
6)    00011011
7)    10001001
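The redundant blocks 5-7 can be derived from the data blocks 1-4 as sketched below. The matrix rows in the code are an assumption: they follow the three construction rules stated earlier and were chosen so that they reproduce the redundant blocks listed on this slide.

```python
from functools import reduce

# Assumed 3*7 matrix; columns are disks 1..7, one row per redundant disk.
MATRIX = [
    [1, 1, 1, 0, 1, 0, 0],  # row 1: disk 5 is the parity of disks 1, 2, 3
    [1, 1, 0, 1, 0, 1, 0],  # row 2: disk 6 is the parity of disks 1, 2, 4
    [1, 0, 1, 1, 0, 0, 1],  # row 3: disk 7 is the parity of disks 1, 3, 4
]

data = {1: "11110000", 2: "10101010", 3: "00111000", 4: "01000001"}

def xor_all(blocks):
    """Bitwise modulo-2 sum of equal-length bit strings."""
    blocks = list(blocks)
    value = reduce(lambda a, b: a ^ b, (int(block, 2) for block in blocks))
    return format(value, f"0{len(blocks[0])}b")

# Each redundant disk is the modulo-2 sum of the data disks with 1 in its row.
redundant = {
    5 + row_index: xor_all(data[d] for d in data if row[d - 1] == 1)
    for row_index, row in enumerate(MATRIX)
}
print(redundant)
```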

4) Coping with multiple disk crashes (cont'd): example
Two disks crash (disks 2 and 5):
disk  contents
1)    11110000
2)    ????????
3)    00111000
4)    01000001
5)    ????????
6)    00011011
7)    10001001

4) Coping with multiple disk crashes (cont'd): example
In row 2 of the 3*7 matrix, disks 2 and 5 have different values: disk 2 has 1 and disk 5 has 0. So:
• compute the first block of disk 2 as the modulo-2 sum of the corresponding bits of disks 1, 4 and 6;
• then compute the first block of disk 5 as the modulo-2 sum of the corresponding bits of disks 1, 2 and 3
disk  contents
1)    11110000
2)    ???????? => 10101010
3)    00111000
4)    01000001
5)    ???????? => 01100010
6)    00011011
7)    10001001
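The two-step recovery can be sketched as follows. The matrix rows are an assumption consistent with the example blocks, and `recover_two` is a hypothetical helper written for illustration, not an implementation taken from the slides.

```python
from functools import reduce

# Assumed 3*7 matrix; columns are disks 1..7.
MATRIX = [
    [1, 1, 1, 0, 1, 0, 0],
    [1, 1, 0, 1, 0, 1, 0],
    [1, 0, 1, 1, 0, 0, 1],
]

def xor_all(blocks):
    """Bitwise modulo-2 sum of equal-length bit strings."""
    blocks = list(blocks)
    value = reduce(lambda a, b: a ^ b, (int(block, 2) for block in blocks))
    return format(value, f"0{len(blocks[0])}b")

def recover_two(blocks, a, b):
    """Simulate losing disks a and b, then rebuild both from the survivors."""
    known = {d: v for d, v in blocks.items() if d not in (a, b)}
    # Find a row where the two failed columns differ; x has the 1, y the 0.
    row = next(r for r in MATRIX if r[a - 1] != r[b - 1])
    x, y = (a, b) if row[a - 1] == 1 else (b, a)
    # y is not in this row, so every other disk with a 1 here is available.
    known[x] = xor_all(known[d] for d in range(1, 8) if row[d - 1] == 1 and d != x)
    # With x restored, any row containing y can be used to rebuild y.
    row_y = next(r for r in MATRIX if r[y - 1] == 1)
    known[y] = xor_all(known[d] for d in range(1, 8) if row_y[d - 1] == 1 and d != y)
    return known

blocks = {1: "11110000", 2: "10101010", 3: "00111000", 4: "01000001",
          5: "01100010", 6: "00011011", 7: "10001001"}
recovered = recover_two(blocks, 2, 5)
print(recovered[2], recovered[5])
```

Because every pair of columns differs in at least one row, the first search always succeeds for any two failed disks.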
