Solid State Storage (SSS) System Error Recovery

Solid State Storage (SSS) System Error Recovery LHO 08 For NASA Langley Research Center

Background • NASA Langley Research Center is building a system to record streaming video and other data when the Space Shuttle docks with the Space Station. • This data will be used to develop algorithms that will enable the next generation of the space station to perform autonomous docking. • Due to the harsh environment in space, the data will be stored in a RAID array of solid state SATA drives with the capability of recovering data even if two drives fail. • This Solid State Storage (SSS) system is being developed at VCU. • We will look at the portion of the system that deals with drive error recovery.

Proposed SSS system overview [Figure: system overview diagram; one connection is labeled "To data recorder".]

SSS Data Recovery • The Solid State Storage (SSS) system will consist of six solid state data drives. The discussion will be directed to this specific configuration. • The data will be sector striped across these six drives. • A modified RAID 6 system capable of recovering data from two corrupted sectors in a stripe is proposed. • Optimized for long single-thread transfers that are multiples of the entire stripe.

RAID 5 • To illustrate concepts and implications, consider a RAID 5 implementation. • RAID 5 uses a striped array with rotating parity. • Optimized for short, multithreaded transfers. • Capable of recovering from a single drive failure.

RAID 5 system consisting of three data drives and rotating parity. Four stripes for sectors A, B, C, and D are shown.

Rotating Parity • Why rotating parity? • The following steps are necessary to update a single data sector in a stripe. • The old data sector and the parity sector for the stripe must be read. • Compute the new parity using the new data sector, old data sector, and old parity. • Write the new data sector and new parity sector. • Thus, to write to a data sector, both the data sector and the parity sector must be read and written. • Since there are many data drives, a fixed parity drive would be accessed much more frequently than a data drive. • This excessive access of a single parity drive is avoided by rotating parity across all drives.
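The read-modify-write steps above can be sketched in Python. This is a minimal illustration, not the SSS implementation: the function name update_parity and the 4-byte "sectors" are assumptions made for brevity. The new parity is the old parity XORed with both the old and the new data sector.

```python
def update_parity(old_data: bytes, new_data: bytes, old_parity: bytes) -> bytes:
    """Return the new parity sector after one data sector changes."""
    return bytes(p ^ o ^ n for p, o, n in zip(old_parity, old_data, new_data))

# Three data sectors (4 bytes each, for brevity) and their parity.
d = [bytes([1, 2, 3, 4]), bytes([5, 6, 7, 8]), bytes([9, 10, 11, 12])]
parity = bytes(a ^ b ^ c for a, b, c in zip(*d))

# Update sector 1: read old data and old parity, compute, then write both.
new_d1 = bytes([0xAA, 0xBB, 0xCC, 0xDD])
new_parity = update_parity(d[1], new_d1, parity)
d[1] = new_d1  # overwrite the old sector only after the parity is computed

# The incremental result matches a parity recomputed from scratch.
full_parity = bytes(a ^ b ^ c for a, b, c in zip(*d))
```

Every single-sector write touches the parity sector, which is why a fixed parity drive becomes the bottleneck under random small writes.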

Rotating parity not needed in SSS • The SSS is required to store long data streams, not random sectors. • Make the size of these streams a multiple of the stripe size. • An entire stripe with parity will be buffered. • The entire stripe with parity will be simultaneously written to all drives. • It is not necessary to first read the drives. • The SSS will always read and write entire stripes. • Easier to implement. • Faster access.

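Parity Parity encoding is given by P = D_0 ⊕ D_1 ⊕ D_2 ⊕ D_3 ⊕ D_4 ⊕ D_5, where D_i represents a data byte in a sector on drive i and ⊕ is bitwise exclusive OR. If both sides of the above equation are exclusive-ORed with P ⊕ D_5, then D_5, for example, can be recovered by D_5 = P ⊕ D_0 ⊕ D_1 ⊕ D_2 ⊕ D_3 ⊕ D_4.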
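A short Python sketch of the parity encoding and the recovery of one known-bad drive. It assumes six data drives indexed 0 to 5 and uses 8-byte sectors for brevity; the helper name xor_sectors is hypothetical.

```python
from functools import reduce

def xor_sectors(sectors):
    """Byte-wise XOR of a list of equal-length sectors."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*sectors))

# Six data sectors D0..D5 with arbitrary contents.
data = [bytes([i * 16 + j for j in range(8)]) for i in range(6)]
p = xor_sectors(data)

# Suppose drive 5 is known to be bad: XOR the parity with the survivors.
survivors = data[:5]
recovered_d5 = xor_sectors(survivors + [p])
```

The same call recovers any one sector, provided we know which one is missing.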

Parity problem • Using parity, it is easy to recover data on a single drive if we know that drive is bad. • We may have data corruption on a drive without the entire drive failing. • Such corruption is undetectable based on parity alone. • Propose to include a 32-bit CRC in each sector. • Simple to implement. • Less than 1% overhead. • In RAID 6, this will ensure that, as long as a stripe has no more than two bad sectors, the data in that stripe can be recovered.
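A sketch of how a per-sector CRC locates the corrupted sectors in a stripe. Several things here are assumptions and not from the slides: the 512-byte payload size, the use of the CRC-32 polynomial implemented by Python's zlib.crc32, storing the CRC after the payload, and the helper names seal, is_intact, and bad_drives.

```python
import struct
import zlib

SECTOR_DATA_BYTES = 512  # assumed payload size; the slides do not state it
CRC_BYTES = 4            # 32-bit CRC

def seal(payload: bytes) -> bytes:
    """Append a little-endian CRC-32 of the payload."""
    return payload + struct.pack("<I", zlib.crc32(payload))

def is_intact(sector: bytes) -> bool:
    """Recompute the CRC and compare it with the stored one."""
    payload, stored = sector[:-CRC_BYTES], sector[-CRC_BYTES:]
    return struct.pack("<I", zlib.crc32(payload)) == stored

def bad_drives(stripe):
    """Indices of the sectors in a stripe whose CRC check fails."""
    return [i for i, sector in enumerate(stripe) if not is_intact(sector)]

# One stripe across six drives, then a single flipped bit on drive 3.
stripe = [seal(bytes([i]) * SECTOR_DATA_BYTES) for i in range(6)]
corrupted = bytearray(stripe[3])
corrupted[100] ^= 0x01
stripe[3] = bytes(corrupted)

flagged = bad_drives(stripe)               # [3]
overhead = CRC_BYTES / SECTOR_DATA_BYTES   # 0.0078125, i.e. under 1%
```

Once the CRC has identified which sectors are bad, parity recovery can treat them as known erasures.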

Key Conclusions • Write data as entire stripes. • Use a fixed parity drive. • Include a sector CRC.

RAID 6 (modified) • Use two fixed parity drives (P and Q). • Data can be recovered if two sectors in a stripe are corrupted. • P parity is the same as RAID 5 (simple XOR). • Easy to encode and easy to recover data. • Q parity is more complicated.

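Q parity encoding The Q parity is a Reed-Solomon code given by Q = (g_0 ⊗ D_0) ⊕ (g_1 ⊗ D_1) ⊕ ... ⊕ (g_(n-1) ⊗ D_(n-1)), where ⊗ is Galois Field (GF) multiplication and g_i is a constant. For i < 8 it turns out that g_i = 2^i. For larger i, it is not as simple. For example, g_8 = 29. But for the SSS application, with six data drives, Q simplifies to Q = D_0 ⊕ (2 ⊗ D_1) ⊕ (4 ⊗ D_2) ⊕ (8 ⊗ D_3) ⊕ (16 ⊗ D_4) ⊕ (32 ⊗ D_5). The problem is how to compute the GF multiplication.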
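A minimal Python sketch of the Q encoding for one byte position. It assumes the field polynomial x^8 + x^4 + x^3 + x^2 + 1 (0x11D), which is consistent with g_8 = 29, and it multiplies by shift-and-XOR instead of lookup tables. The names gf_mul and q_parity are hypothetical.

```python
POLY = 0x11D  # x^8 + x^4 + x^3 + x^2 + 1 (assumed; consistent with g_8 = 29)

def gf_mul(a: int, b: int) -> int:
    """Multiply two GF(2^8) elements by shift-and-XOR."""
    result = 0
    while b:
        if b & 1:
            result ^= a      # add (XOR) the current multiple of a
        a <<= 1              # a = a (x) 2
        if a & 0x100:
            a ^= POLY        # reduce modulo the field polynomial
        b >>= 1
    return result

def q_parity(data_bytes) -> int:
    """Q for one byte position; valid for fewer than 8 data drives."""
    q = 0
    for i, d in enumerate(data_bytes):
        q ^= gf_mul(1 << i, d)   # g_i = 2^i for i < 8
    return q

# g_8 is 2 multiplied by itself eight times in the field.
g8 = 1
for _ in range(8):
    g8 = gf_mul(g8, 2)           # ends at 29

q = q_parity([0x11, 0x22, 0x33, 0x44, 0x55, 0x66])
```

With fewer than eight data drives, every coefficient is a single-bit constant, so each term needs at most a few shift-and-reduce steps.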

GF multiplication • In ordinary arithmetic, multiplication can be accomplished by summing the logs and taking the inverse log. • GF multiplication is typically accomplished using lookup tables to find the GF log and inverse log. The addition is modulo 255. See Xilinx application note XAPP731 “Hardware Accelerator for RAID 6 Parity Generation / Data Recovery Controller”.
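The table-based method can be sketched as follows. This is an illustration under stated assumptions, not the tables from XAPP731: the tables are generated here from the polynomial 0x11D with generator 2, and the names GF_LOG, GF_ILOG, and gf_mul are hypothetical.

```python
POLY = 0x11D  # x^8 + x^4 + x^3 + x^2 + 1 (assumed field polynomial)

GF_LOG = [0] * 256    # GF_LOG[0] is unused: log of zero is undefined
GF_ILOG = [0] * 255   # GF_ILOG[e] = 2^e in the field

value = 1
for exponent in range(255):
    GF_ILOG[exponent] = value
    GF_LOG[value] = exponent
    value <<= 1               # multiply by 2
    if value & 0x100:
        value ^= POLY         # reduce modulo the field polynomial

def gf_mul(a: int, b: int) -> int:
    """Multiply by adding logs modulo 255 and taking the inverse log."""
    if a == 0 or b == 0:
        return 0              # special case: zero has no log
    return GF_ILOG[(GF_LOG[a] + GF_LOG[b]) % 255]

product = gf_mul(0x02, 0x80)  # 0x1D
```

The tables cost 511 bytes in total and turn each multiplication into two lookups, one addition, and one more lookup.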

Examples

Examples Note: AB = 0 if A = 0 or B = 0. This is a special case and cannot be computed using logs. It is also worth noting that A1 = A. This does follow from using logs since logGF(0x01) = 0.

Elaboration on Galois Field Mathematics • Évariste Galois (1832) • Established many of the ideas of group theory. • Left only sixty pages of mathematical writings. • Mortally wounded in a duel at age 20. • Most of his major contributions stem from a letter written the night before the duel. • His work has had great impact. • Provides a powerful tool for investigating fundamental mathematical problems. • Roots of algebraic equations. • GF theory provides a simple proof that an angle cannot be trisected using only a compass and an unmarked straightedge. • This had baffled mathematicians since the time of Euclid. • Recently applied to computer design and data-communication systems.

Galois Field Mathematics • A Galois Field is an algebraic structure <G, ⊕, ⊗> where G is a set consisting of 2^n elements, ⊕ is addition mod 2 (bitwise XOR) and ⊗ is GF multiplication. The math is similar to ordinary arithmetic. • ⊕ and ⊗ are commutative and associative. • ⊗ distributes over ⊕, such that A ⊗ (B ⊕ C) = (A ⊗ B) ⊕ (A ⊗ C). • We are only concerned with GF(2^8), where the set G has 256 elements. We will use a hex byte to specify the elements. • Then A ⊕ A = 0x00, A ⊗ 0x00 = 0x00, A ⊗ 0x01 = A

GF(2^8) • The GF log lookup tables are generated based on what in GF theory is called a primitive polynomial. Primitive polynomials have certain properties that lead to the error correction techniques. • GF(2^8) is generated using the primitive polynomial x^8 + x^4 + x^3 + x^2 + 1. • This is the same primitive polynomial used to determine the feedback path for an 8-bit maximum-count linear feedback shift register (LFBSR). • The LFBSR can be used to perform GF multiplication.

The 8-bit LFBSR [Figure: shift register with stages Q0 Q1 Q2 Q3 Q4 Q5 Q6 Q7, then redrawn in reverse order so that the most significant bit is at the left.] A shift has the same effect as ⊗ 2. In VHDL: Q <= Q(6) & Q(5) & Q(4) & (Q(3) XOR Q(7)) & (Q(2) XOR Q(7)) & (Q(1) XOR Q(7)) & Q(0) & Q(7);

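LFBSR shift (⊗ 2) against the coefficients of the primitive polynomial
Polynomial coefficient   Before shift   After shift (⊗ 2)
1 (x^8)
0 (x^7)                  X7             X6
0 (x^6)                  X6             X5
0 (x^5)                  X5             X4
1 (x^4)                  X4             X3 ⊕ X7
1 (x^3)                  X3             X2 ⊕ X7
1 (x^2)                  X2             X1 ⊕ X7
0 (x^1)                  X1             X0
1 (x^0)                  X0             X7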
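The shift can be checked bit by bit in Python. This sketch mirrors the VHDL assignment, with the register held as a list of bits where q[0] is Q0; the helper names to_bits, from_bits, lfsr_shift, and times2 are hypothetical.

```python
def to_bits(byte: int):
    """Split a byte into bits, least significant first (q[0] = Q0)."""
    return [(byte >> i) & 1 for i in range(8)]

def from_bits(q) -> int:
    """Pack a list of bits (q[0] = Q0) back into a byte."""
    return sum(bit << i for i, bit in enumerate(q))

def lfsr_shift(q):
    """One shift: Q7 feeds back into Q0 and is XORed into Q2, Q3, Q4."""
    fb = q[7]
    return [fb, q[0], q[1] ^ fb, q[2] ^ fb, q[3] ^ fb, q[4], q[5], q[6]]

def times2(byte: int) -> int:
    """GF(2^8) multiplication by 2, done as a single register shift."""
    return from_bits(lfsr_shift(to_bits(byte)))

# Shifting 0x01 eight times walks through 2^1 .. 2^8.
state = 0x01
powers = []
for _ in range(8):
    state = times2(state)
    powers.append(state)
# powers == [2, 4, 8, 16, 32, 64, 128, 29]
```

The eighth shift produces 29 (0x1D), the same value quoted for g_8, because the overflow bit folds back through the polynomial taps.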

Galois Field Division
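Division is the inverse of GF multiplication, and it is the operation a two-sector recovery needs. The following is a minimal Python sketch, assuming division is done with the same log and inverse-log tables (subtracting logs modulo 255) and the polynomial 0x11D. The names gf_div and recover_two are hypothetical, and the two-sector solve shown is the textbook RAID 6 algebra, not necessarily the SSS implementation.

```python
POLY = 0x11D  # x^8 + x^4 + x^3 + x^2 + 1 (assumed field polynomial)
LOG, ILOG = [0] * 256, [0] * 255
v = 1
for e in range(255):
    ILOG[e], LOG[v] = v, e
    v = (v << 1) ^ (POLY if v & 0x80 else 0)   # multiply by 2 and reduce

def gf_mul(a: int, b: int) -> int:
    return 0 if a == 0 or b == 0 else ILOG[(LOG[a] + LOG[b]) % 255]

def gf_div(a: int, b: int) -> int:
    """a divided by b: subtract logs modulo 255."""
    if b == 0:
        raise ZeroDivisionError("GF division by zero")
    return 0 if a == 0 else ILOG[(LOG[a] - LOG[b]) % 255]

def recover_two(data, p, q, x, y):
    """Recover data bytes x and y (x != y) from P, Q and the survivors."""
    a, b = p, q
    for i, d in enumerate(data):
        if i not in (x, y):
            a ^= d                      # a ends as D_x XOR D_y
            b ^= gf_mul(1 << i, d)      # b ends as 2^x*D_x XOR 2^y*D_y
    gx, gy = 1 << x, 1 << y             # g_i = 2^i, valid for i < 8
    dx = gf_div(b ^ gf_mul(gy, a), gx ^ gy)
    return dx, a ^ dx

# One byte position across six data drives, with P and Q.
data = [0x11, 0x22, 0x33, 0x44, 0x55, 0x66]
p = q = 0
for i, d in enumerate(data):
    p ^= d
    q ^= gf_mul(1 << i, d)

damaged = list(data)
damaged[1] = damaged[4] = 0             # sectors flagged as bad
restored = recover_two(damaged, p, q, 1, 4)
```

The solve follows from two equations in two unknowns: with A = D_x ⊕ D_y and B = (g_x ⊗ D_x) ⊕ (g_y ⊗ D_y), D_x = (B ⊕ (g_y ⊗ A)) / (g_x ⊕ g_y) and D_y = A ⊕ D_x.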
