
Disk Failures




  1. Disk Failures. Xiaqing He, ID: 204. Dr. Lin

  2. Content
     1) Focus: how to recover from disk crashes. The common term is RAID, "redundant array of independent disks".
     2) Several schemes to recover from disk crashes:
        • Mirroring: RAID level 1;
        • Parity checks: RAID level 4;
        • Improvement: RAID level 5;
        • RAID level 6.

  3. 1) Mirroring
     • The simplest scheme for recovering from disk crashes.
     • How does mirroring work?
       -- Make two or more copies of the data on different disks.
     • Benefits:
       -- the data survives if one disk fails;
       -- reads can be divided among the disks, so several blocks can be accessed at once.

  4. 1) Mirroring (cont.)
     • With mirroring, when can data be lost?
       -- Only if the second (mirror) disk crashes while the first crashed disk is being replaced.
     • Probability. Suppose:
       -- one disk: mean time to failure = 10 years;
       -- two disks: one of them fails on average once every 5 years;
       -- replacing a failed disk takes 3 hours = 1/8 day = 1/2,920 year.
       So:
       -- the probability that the mirror disk fails during the replacement = 1/10 * 1/2,920 = 1/29,200;
       -- the probability of data loss per year = 1/5 * 1/29,200 = 1/146,000, i.e. a mean time to data loss of about 146,000 years.
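
The arithmetic on this slide can be checked with exact fractions. This is only a sketch: the figures come from the slide, while the variable names are illustrative and a 365-day year is assumed.

```python
from fractions import Fraction

# Figures from the slide; names are illustrative, and a 365-day year is assumed.
mttf_years = 10                          # mean time to failure of one disk
replace_years = Fraction(3, 24 * 365)    # 3 hours expressed in years

# Chance that the surviving disk fails during the replacement window.
p_mirror_fails = Fraction(1, mttf_years) * replace_years

# With two disks, one of them fails on average every mttf_years / 2 years.
first_failure_rate = Fraction(2, mttf_years)

p_loss_per_year = first_failure_rate * p_mirror_fails
mean_years_to_loss = 1 / p_loss_per_year

print(replace_years)       # 1/2920
print(p_mirror_fails)      # 1/29200
print(p_loss_per_year)     # 1/146000
print(mean_years_to_loss)  # 146000
```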

  5. 2) Parity Blocks
     • Why change?
       -- Disadvantage of mirroring: it uses as many redundant disks as data disks.
     • What's new?
       -- RAID level 4 uses only one redundant disk.
     • How does this one redundant disk work?
       -- modulo-2 sum (bitwise exclusive OR);
       -- the jth bit of the redundant disk is the modulo-2 sum of the jth bits of all the data disks.
     • Example

  6. 2) Parity Blocks (cont.): Example
     Data disks:
     • Disk 1: 11110000
     • Disk 2: 10101010
     • Disk 3: 00111000
     Redundant disk:
     • Disk 4: 01100010
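
The modulo-2 sum in this example is a bitwise XOR. The sketch below recomputes the redundant block from the three data blocks; `parity_block` is a hypothetical helper name, and blocks are represented as bit strings purely for readability.

```python
from functools import reduce
from operator import xor


def parity_block(blocks):
    """Return the bitwise modulo-2 sum (XOR) of equal-length bit strings."""
    width = len(blocks[0])
    total = reduce(xor, (int(b, 2) for b in blocks))
    return format(total, f"0{width}b")


data_disks = ["11110000", "10101010", "00111000"]
print(parity_block(data_disks))  # 01100010
```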

  7. 2) RAID 4 (cont.)
     • Reading
       -- the same as reading a block from any disk.
     • Writing
       1) change the data disk;
       2) change the corresponding block of the redundant disk.
     • Why?
       -- the redundant disk holds the parity checks for the corresponding blocks of all the data disks.

  8. 2) RAID 4 (cont.): Writing
     For a total of N data disks:
     1) Naive way:
        • read the corresponding blocks of the N data disks and compute their modulo-2 sum;
        • rewrite the redundant block with that sum.
     2) Better way:
        • take the modulo-2 sum of the old and new versions of the data block being rewritten;
        • flip the bits of the redundant block at the positions where that sum has 1's.

  9. 2) RAID 4 (cont.): Writing example
     • Data disks:
       -- Disk 1: 11110000
       -- Disk 2: 10101010 => 01100110
       -- Disk 3: 00111000
     • To do:
       -- modulo-2 sum of the old and new versions of disk 2: 11001100;
       -- so we need to flip positions 1, 2, 5, 6 of the redundant disk.
     • Redundant disk:
       -- Disk 4: 01100010 => 10101110
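
The "better way" of writing can be sketched in a few lines: only the old and new data block and the old redundant block are needed. The helper name `xor_bits` is an assumption made for this illustration.

```python
def xor_bits(a, b):
    """Bitwise modulo-2 sum of two equal-length bit strings."""
    return format(int(a, 2) ^ int(b, 2), f"0{len(a)}b")


old_data, new_data = "10101010", "01100110"   # disk 2 before and after the write
old_parity = "01100010"                        # disk 4 before the write

change = xor_bits(old_data, new_data)          # positions that differ
new_parity = xor_bits(old_parity, change)      # flip those positions on disk 4
positions = [i + 1 for i, bit in enumerate(change) if bit == "1"]

print(change)      # 11001100
print(positions)   # [1, 2, 5, 6]
print(new_parity)  # 10101110
```

The result agrees with recomputing the parity from scratch over the three data disks after the write.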

  10. 2) RAID 4 (cont.): Failure recovery
      • The redundant disk crashes:
        -- swap in a new disk and recompute its contents from all the data disks.
      • One of the data disks crashes:
        -- swap in a new disk;
        -- recompute its contents from the other disks, i.e. the remaining data disks and the redundant disk.
      • How to recompute? (The rule is the same in both cases, which is what makes the improvement in RAID 5 possible.)
        -- take the modulo-2 sum of the corresponding bits of all the other disks.
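
Because the recovery rule is the same for a data disk and for the redundant disk, one function covers both cases. A minimal sketch, using the block contents from the earlier parity example; `recover` is a hypothetical name.

```python
def recover(blocks, failed):
    """Rebuild blocks[failed] as the modulo-2 sum of all the other blocks."""
    width = len(next(b for i, b in enumerate(blocks) if i != failed))
    total = 0
    for i, block in enumerate(blocks):
        if i != failed:
            total ^= int(block, 2)
    return format(total, f"0{width}b")


# Disks 1-3 are data disks, disk 4 is the redundant disk (list indexes are 0-based).
disks = ["11110000", None, "00111000", "01100010"]   # disk 2 has crashed
print(recover(disks, failed=1))  # 10101010

disks = ["11110000", "10101010", "00111000", None]   # redundant disk has crashed
print(recover(disks, failed=3))  # 01100010
```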

  11. 3) An Improvement: RAID 5
      • Why is an improvement needed?
        -- Shortcoming of RAID level 4: the redundant disk is a bottleneck, because every update of any data disk must also read and write the redundant disk.
      • Principle of RAID level 5:
        -- treat each disk as the redundant disk for some of the blocks.
      • Why is this feasible?
        -- The rule for failure recovery is the same for the redundant disk and for a data disk: "take the modulo-2 sum of the corresponding bits of all the other disks".
        -- So there is no need to treat one disk as the redundant disk and the others as data disks.

  12. 3) RAID 5 (cont.)
      • How do we decide for which blocks a given disk acts as the redundant disk?
        -- If there are n+1 disks numbered 0 to n, treat the ith cylinder of disk j as redundant if j is the remainder when i is divided by n+1.
      • Example

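  13. 3) RAID 5 (cont.): Example
      n = 3, so there are 4 disks:
      • the first disk, numbered 0, is redundant for cylinders 4, 8, 12, ...;
      • the second disk, numbered 1, is redundant for cylinders 1, 5, 9, ...;
      • the third disk, numbered 2, is redundant for cylinders 2, 6, 10, ...;
      • ...
      Suppose all 4 disks are equally likely to be written. The fraction of writes in which any one disk is involved is:
      • 1/4 + 3/4 * 1/3 = 1/2 (it holds the data block with probability 1/4; otherwise it is the redundant disk for that block with probability 1/3);
      • in general, with m disks: 1/m + (m-1)/m * 1/(m-1) = 2/m.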
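
The remainder rule and the write-load formula can both be checked directly. This is a sketch under the slide's assumptions (cylinders numbered from 1, all disks equally likely to be written); the function names are illustrative.

```python
from fractions import Fraction


def redundant_disk(cylinder, num_disks):
    """Disk that acts as the redundant disk for the given cylinder."""
    return cylinder % num_disks


def write_share(num_disks):
    """Fraction of writes that involve any one particular disk."""
    m = num_disks
    return Fraction(1, m) + Fraction(m - 1, m) * Fraction(1, m - 1)


num_disks = 4  # n = 3, so n + 1 = 4 disks numbered 0..3
for disk in range(3):
    cylinders = [c for c in range(1, 13) if redundant_disk(c, num_disks) == disk]
    print(disk, cylinders)
# 0 [4, 8, 12]
# 1 [1, 5, 9]
# 2 [2, 6, 10]

print(write_share(4))  # 1/2
```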

  14. 4) Coping with multiple disk crashes
      • RAID 6: can deal with any number of disk crashes if enough redundant disks are used.
      • Example: a system of seven disks (four data disks, numbered 1-4, and three redundant disks, numbered 5-7).
      • How do we set up the 3*7 matrix? (Why 3 rows? There are 3 redundant disks.)
        1) every column is a different combination of three 0's and 1's, excluding the combination of all three 0's;
        2) the column of each redundant disk has a single 1;
        3) the column of each data disk has at least two 1's.

  15. 4) Coping with multiple disk crashes (cont.)
      • Reading:
        -- read from the data disks and ignore the redundant disks.
      • Writing:
        -- change the data disk;
        -- change the corresponding bits of all the redundant disks that have a 1 in a row where the data disk also has a 1.

  16. 4) Coping with multiple disk crashes (cont.)
      • In a system with 4 data disks and 3 redundant disks, how can up to 2 disk crashes be corrected?
      • Suppose disks a and b have failed:
        -- find some row r of the 3*7 matrix in which the columns for a and b differ (say a has 0 and b has 1);
        -- compute the correct b by taking the modulo-2 sum of the corresponding bits of all the disks other than b that have 1 in row r;
        -- once b is restored, compute the correct a in the same way, using any row in which a has 1, with all other disks now available.
      • Example

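  17. 4) Coping with multiple disk crashes (cont.): Example
      3*7 matrix (entries reconstructed from the disk contents and recovery steps in this example):

                     data disks    redundant disks
      disk number    1  2  3  4    5  6  7
      row 1          1  1  1  0    1  0  0
      row 2          1  1  0  1    0  1  0
      row 3          1  0  1  1    0  0  1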

  18. 4) Coping with multiple disk crashes (cont.): Example
      First block of each disk:
      disk   contents
      1)     11110000
      2)     10101010
      3)     00111000
      4)     01000001
      5)     01100010
      6)     00011011
      7)     10001001
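
The contents of the three redundant disks follow from the four data disks once the 3*7 matrix is fixed. The sketch below assumes the matrix whose rows cover disks {1, 2, 3, 5}, {1, 2, 4, 6} and {1, 3, 4, 7}; this is an assumption, chosen because it satisfies the three rules for the matrix and reproduces the block contents listed on this slide. The names `MATRIX` and `redundant_blocks` are illustrative.

```python
# Assumed 3*7 matrix: one row per redundant disk, one column per disk 1..7.
MATRIX = [
    [1, 1, 1, 0, 1, 0, 0],
    [1, 1, 0, 1, 0, 1, 0],
    [1, 0, 1, 1, 0, 0, 1],
]


def redundant_blocks(data_blocks):
    """Compute the blocks of disks 5-7 from the blocks of data disks 1-4."""
    width = len(data_blocks[0])
    result = []
    for row in MATRIX:
        total = 0
        for col in range(4):            # only the data-disk columns
            if row[col]:
                total ^= int(data_blocks[col], 2)
        result.append(format(total, f"0{width}b"))
    return result


data = ["11110000", "10101010", "00111000", "01000001"]
print(redundant_blocks(data))  # ['01100010', '00011011', '10001001']
```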

  19. 4) Coping with multiple disk crashes (cont.): Example
      Two disks crash:
      disk   contents
      1)     11110000
      2)     ????????
      3)     00111000
      4)     01000001
      5)     ????????
      6)     00011011
      7)     10001001

  20. 4) Coping with multiple disk crashes (cont.): Example
      In row 2 of the 3*7 matrix, disks 2 and 5 have different values: disk 2 has 1 and disk 5 has 0. So:
      • compute the first block of disk 2 as the modulo-2 sum of the corresponding bits of disks 1, 4, 6;
      • then compute the first block of disk 5 as the modulo-2 sum of the corresponding bits of disks 1, 2, 3.
      disk   contents
      1)     11110000
      2)     ???????? => 10101010
      3)     00111000
      4)     01000001
      5)     ???????? => 01100010
      6)     00011011
      7)     10001001
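
The two-step recovery can be sketched as a small routine. It assumes the 3*7 matrix whose rows cover disks {1, 2, 3, 5}, {1, 2, 4, 6} and {1, 3, 4, 7}, which is consistent with the steps on this slide; `recover_two` and `xor_row` are hypothetical names, and disk indexes in the code are 0-based.

```python
# Assumed 3*7 matrix: one row per redundant disk, one column per disk 1..7.
MATRIX = [
    [1, 1, 1, 0, 1, 0, 0],
    [1, 1, 0, 1, 0, 1, 0],
    [1, 0, 1, 1, 0, 0, 1],
]


def xor_row(row, blocks, skip):
    """Modulo-2 sum of the blocks of all disks with a 1 in this row, except `skip`."""
    total = 0
    for col, bit in enumerate(row):
        if bit and col != skip:
            total ^= blocks[col]
    return total


def recover_two(blocks, a, b):
    """Restore the blocks of failed disks a and b (0-based indexes)."""
    blocks = list(blocks)
    # Step 1: find a row where the two columns differ; the disk with the 1
    # can be rebuilt from the surviving disks in that row.
    for row in MATRIX:
        if row[a] != row[b]:
            first, second = (a, b) if row[a] else (b, a)
            blocks[first] = xor_row(row, blocks, first)
            break
    # Step 2: every column has at least one 1, so some row covers the other disk.
    row = next(r for r in MATRIX if r[second])
    blocks[second] = xor_row(row, blocks, second)
    return blocks


contents = ["11110000", None, "00111000", "01000001", None, "00011011", "10001001"]
as_ints = [None if c is None else int(c, 2) for c in contents]
restored = [format(v, "08b") for v in recover_two(as_ints, a=1, b=4)]
print(restored[1])  # 10101010
print(restored[4])  # 01100010
```

Because all seven columns of the matrix are different, a row in which the two failed disks differ always exists, so the loop in step 1 always finds one.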
