1 / 19

Combining Chunk Boundary Calculations and Signature Calculation for Deduplication

Combining Chunk Boundary Calculations and Signature Calculation for Deduplication. Witold Litwin Université Paris Dauphine Darrell Long University of California Santa Cruz Thomas Schwarz Universidad Católica del Uruguay.

senwe
Download Presentation

Combining Chunk Boundary Calculations and Signature Calculation for Deduplication

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Combining Chunk Boundary Calculations and Signature Calculation for Deduplication Witold Litwin Université Paris Dauphine Darrell Long University of California Santa Cruz Thomas Schwarz Universidad Católica del Uruguay 10th International Information and Telecommunication Technologies Conference, (I2TS), Dec. 2011, Florianopolis, Br. IEEE Latin America Transactions, vol. 10(1), 2011.

  2. Deduplication • How not to store the same data twice • Breaks data into chunks, calculates signature of chunks, uses signature to check if chunk is already stored • If the chunk is already stored, create a file manifest that allows reconstructing the file • The chunk is replaced by a pointer to the chunk stored elsewhere • Has sometimes horrible read performance, as files need to be reconstructed • Is very appropriate for write heavy loads (backup, archival) • Is used for web-based storage

  3. Deduplication • Is part of storage solutions by all major providers • Has impressive compression rates for backup workload • Up to 20:1 Zhu, Li and Patterson, 2008 • Up to 30:1 Mandagere, Zhou, Smith, and Uttamchandi, 2008

  4. Deduplication • Can be for streams (backup workloads) or for files (archival) • In the case for files, use a complete file hash to discover identical files • Breaks incoming data into chunks • Typical value is 4KB or 8KB • Calculates the signature of each chunk (hash) • Looks up chunk signatures in a data base • Create a file manifest: • Describes where chunks are • Stores new chunks in the system, stores manifest, updates information on chunk signatures

  5. Deduplication incoming stream or file Chunker chunks Calculate Signatures File Manifest: chunk 1 is at … chunk 2 is at … … File Print: file signature chunk sig 1 chunk sig 2 … Look up chunks and create file manifest

  6. Deduplication • Suffers from two bottlenecks • I/O bottleneck: Needs to lookup chunk signatures in a database • Proposals: Bloom filter, adaptive caching, extreme binning • Needs to scan file twice: • To calculate chunk boundaries • To calculate chunk signatures • Suffers from a leap of faith: • There is no time to verify chunk identity byte-by-byte • Rely on chunk signature identity to identify identical chunks • Advent of MD5 and SHA1 convinced people that the risk of falsely identifying chunks is acceptable

  7. Our proposal • Use the calculation of chunk boundaries to strengthen the collision resilience of the chunk signatures

  8. Chunk Boundary Calculation • Why not fixed-sized chunks: • Small changes to the file cannot be found Fixed sized sliding window Previous Chunk sig  0 sig = 0 Chunk Boundary Chunk Boundary

  9. Chunk Boundary Calculation • Context defined chunk definition: • A small, local change affects in all likelihood only one chunk • If it is located on a chunk boundary, it affects two chunks • Deduplication ratio is much higher

  10. Chunk Boundary Calculation • Need to calculate a signature of a sliding window • Can use “rolling hashes” • When moving the window one letter to the right, can calculate the signature of the new window using: • The signature of the old sliding window • The character on the left (leaving the sliding window) • The character on the right (entering the sliding window)

  11. Chunk Boundary Calculation • Can use Rabin Fingerprints (O. Rabin, 1989) • Or Algebraic Signatures (Litwin, Schwarz, 2004) • We use the latter • Because we invented them • Because they are marginally better than Rabin Fingerprints for our purpose • Both allow cumulative calculation of a signature of the chunk seen so far

  12. Adding to the Chunk Signature • Chunk signature is MD5 or SHA1 • There are attacks using artificial collisions, but they are theoretical so far • There is a small, but positive collision probability • Two different chunks share the same signature value • Dedup then destroys / alters the later file • To keep dedup acceptable, need to have the resulting data loss orders of magnitude less than losses from other sources

  13. Adding to the Chunk Signature

  14. Adding to the Chunk Signature • Conditions: • Want x nines assurance against having any collision in a storage system of size N • The number of nines is given by the other failure sources • The figure uses x = 6 • Conclusion: • Adding one byte to the chunk signature increases the possible size of the data set by 101.20412 (for big enough x) • Example: MD5 has 16B. • At six nines, maximum number of chunks is 8.6921011 • With two bytes more, it is 2.225  1014 • Changes from hundreds of petabytes to tens of exabytes

  15. Flatness • Signature is flat if the probability of any text to have a certain signature is constant • Measuring flatness is difficult • No results known for MD5 or SHA-1, though support for almost perfect flatness • Algebraic Signatures: • Are perfectly flat for complete random input • Are very flat for experiment undertaken • Short words (taken from password list)

  16. Flatness of Algebraic Signatures

  17. Flatness of Algebraic Signatures

  18. Flatness of Algebraic Signatures

  19. Conclusions • Can reuse chunk boundary calculations to strengthen collision resistance of chunk signatures • Causes no additional calculation costs

More Related