Data Corruption in the Enterprise Jim Williams HEPiX Fall 2007
Agenda • What is Data Integrity? • End to End Data Integrity • Existing E2E Data Integrity • Limitations of Today’s E2E Data Integrity • Future Work
What is Data Integrity? Definition • Data corruption is defined as the non-malicious loss of data resulting from component failure (hardware or software) or inadvertent administrative action • Corruption events are low-frequency but high-impact
What is Data Integrity? Causes • Operating system bugs: core O/S, device drivers • Storage hardware and firmware bugs: HBAs, arrays, disks • Administrative errors: system administrators, database administrators
What is Data Integrity? What happens after data corruption? • Find a “good” copy of the lost data… • Block recovery, usually from offline storage • Requires a highly trained skill set • Often involves extended downtime • High-intensity situation, best avoided What if it could have been avoided?
What is Data Integrity? Remediation • Adding protection metadata to data • Universally done at the component level • CRC, parity, reference tags • Proprietary protection metadata • The T10 Protection Information Model standard provides for protection metadata across system components
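The idea of attaching a CRC and a reference tag to each block can be sketched in a few lines. The example below is a minimal illustration modeled on the T10 Protection Information layout as commonly described (8 bytes per 512-byte sector: a 2-byte guard CRC, a 2-byte application tag, and a 4-byte reference tag derived from the block address). The function names `protect` and `verify` are hypothetical, and the field handling is an assumption for illustration, not a conformant implementation of the standard.

```python
import struct

SECTOR_SIZE = 512  # assumed sector size for this sketch


def crc16_t10dif(data: bytes) -> int:
    """Bitwise CRC-16 with polynomial 0x8BB7, initial value 0, no reflection."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x8BB7) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def protect(sector: bytes, lba: int, app_tag: int = 0) -> bytes:
    """Append 8 bytes of protection metadata (guard, app tag, ref tag)."""
    if len(sector) != SECTOR_SIZE:
        raise ValueError("sector must be exactly 512 bytes")
    guard = crc16_t10dif(sector)      # detects corrupted contents
    ref_tag = lba & 0xFFFFFFFF        # detects misdirected reads and writes
    return sector + struct.pack(">HHI", guard, app_tag, ref_tag)


def verify(protected: bytes, expected_lba: int) -> bool:
    """Recompute the guard and compare the reference tag to the expected address."""
    sector, meta = protected[:SECTOR_SIZE], protected[SECTOR_SIZE:]
    guard, _app_tag, ref_tag = struct.unpack(">HHI", meta)
    return guard == crc16_t10dif(sector) and ref_tag == (expected_lba & 0xFFFFFFFF)
```

Any component that handles the block can call `verify` without knowing what the data means. The guard catches changed contents, while the reference tag catches a correct block that ended up at the wrong address.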
What is Data Integrity? Why is storage data corruption different? • TCP end-to-end data integrity is pretty good • Applications can use proprietary means for end-to-end data integrity • Applications do not control both ends of the path between application and storage • Short-term application failures are much less costly than data loss
What is Data Integrity? • At the storage level, there are two kinds of data corruption • Latent sector errors • Silent data corruption • From a storage device's perspective, it is usually better to return no data than to return the wrong data
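The difference between the two kinds of corruption can be illustrated with a small simulation. In this sketch a latent sector error surfaces as an explicit read failure, whereas silent corruption returns data that looks fine and is only caught because the caller kept an independent checksum. `SimulatedDisk`, `LatentSectorError`, and `classify_read` are hypothetical names made up for this illustration, and `zlib.crc32` stands in for whatever checksum a real stack would use.

```python
import zlib


class LatentSectorError(IOError):
    """The device knows it cannot return the data and says so."""


class SimulatedDisk:
    def __init__(self):
        self.blocks = {}         # lba -> bytes
        self.unreadable = set()  # lbas that fail on read

    def write(self, lba: int, data: bytes) -> None:
        self.blocks[lba] = data

    def read(self, lba: int) -> bytes:
        if lba in self.unreadable:
            raise LatentSectorError(f"cannot read lba {lba}")
        return self.blocks[lba]


def classify_read(disk: SimulatedDisk, lba: int, expected_crc: int) -> str:
    """Report what kind of outcome a read produced."""
    try:
        data = disk.read(lba)
    except LatentSectorError:
        return "latent sector error"    # no data returned, failure is visible
    if zlib.crc32(data) != expected_crc:
        return "silent data corruption"  # wrong data returned without any error
    return "ok"
```

The first failure mode is the preferable one: the caller learns immediately that recovery is needed. The second is only detected here because a checksum was stored outside the device.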
Existing E2E Data Integrity E2E Data Integrity prevents, not simply detects, corruption • By itself, the checksum in an Oracle data block only allows the Oracle RDBMS to detect, when the data block is read, that something in the storage stack corrupted the data. • However, if the storage device understands the Oracle data block structure, then the storage device can prevent corrupted data from being PERMANENTLY written! This is the idea behind Oracle HARD
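The shift from detection to prevention can be sketched as a storage target that validates an embedded checksum before it commits a write. The block layout below (a 4-byte CRC-32 header followed by the payload) is an assumption invented for this example and is not the actual Oracle data block format. `make_block`, `block_is_valid`, and `ValidatingStore` are likewise hypothetical names.

```python
import struct
import zlib

HEADER_SIZE = 4  # assumed layout: 4-byte big-endian CRC-32, then payload


def make_block(payload: bytes) -> bytes:
    """Application side: embed a checksum in the block before sending it down."""
    return struct.pack(">I", zlib.crc32(payload)) + payload


def block_is_valid(block: bytes) -> bool:
    """Recompute the checksum over the payload and compare with the header."""
    if len(block) < HEADER_SIZE:
        return False
    (stored,) = struct.unpack(">I", block[:HEADER_SIZE])
    return stored == zlib.crc32(block[HEADER_SIZE:])


class ValidatingStore:
    """Storage side: understands the block layout and refuses bad writes."""

    def __init__(self):
        self.blocks = {}

    def write(self, lba: int, block: bytes) -> None:
        if not block_is_valid(block):
            # Reject before committing, so the previous good copy survives.
            raise ValueError(f"checksum mismatch, write to lba {lba} rejected")
        self.blocks[lba] = block
```

Because the check runs on the write path, a block damaged somewhere between the application and the device never overwrites the last good copy, and the application gets an immediate error it can retry.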
Existing E2E Data Integrity Oracle HARD
Existing E2E Data Integrity T10 Protection Information Model
Existing E2E Data Integrity Comparison
Limitations of Today’s E2E Data Integrity • T10: does not span to the application; does not address host-oriented failures; computationally expensive to implement on the host • Oracle HARD: does not span to the disk drive; proprietary; Oracle-oriented (DB block structure)
Future Work • Data Integrity Initiative (DII) • Oracle, Emulex, LSI, and Seagate have come together to address the problem of data corruption • Announced DII technology demo at SNW Spring 07 • DII is turning over the reins to SNIA • SNIA Data Integrity Task Force (DITF) kicked off in October • Open to new members: http://www.snia.org/apps/org/workgroup/data_integrity/
Future Work • Enhanced integrity checking • Operating system (I/O stack): passing of protection metadata through the stack • Application to operating system: file I/O extensions for protection metadata • HBA and driver: validation and translation of protection metadata
Future Work • Important studies • Data Integrity [Bernd Panzer-Steindel CERN/IT] • Disk replacements [Schroeder, Gibson FAST’07] • Disk replacements & SMART data [Pinheiro et al., FAST’07] • Latent sector errors [Bairavasundaram et al., Sigmetrics’07] • Disk Failures in the Real World [L. Bairavasundaram, G. Goodson, B. Schroeder, A. Arpaci-Dusseau, R. Arpaci-Dusseau, FAST’08]