SAFE: Structure-Aware File and Email Deduplication for Cloud-based Storage Systems
Daehee Kim, Sejun Song, Baek-Young Choi
University of Missouri-Kansas City
Cloud Storage (Dropbox, Google Drive, …)
e.g. remote backup: anywhere, anytime
• Network: high network bandwidth consumption
• Server: large storage consumption
• Client: high uploading overhead
[Figure: individuals and employees (sales, marketing) uploading to cloud storage]
Data Deduplication
• Deduplication granularity
  • File-level
  • Sub-file-level
    • Fixed-size chunk
    • Variable-size chunk
• Deduplication location
  • Server-based: traditionally on high-capacity servers
  • Client-based: limited by the client capacity
File-Level Deduplication
[Figure: the index of each file is looked up in an index table. A file with a unique index goes to data storage; a file with a duplicate index is not stored again.]
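A minimal sketch of file-level deduplication, assuming the index is a SHA-1 digest of the whole file (the slides do not name the hash function). The function name `file_level_dedup` and the sample file names are hypothetical.

```python
import hashlib


def file_level_dedup(files):
    """Deduplicate whole files by content digest.

    files: dict mapping file name -> bytes.
    Returns (index_table, duplicates), where index_table maps a digest to the
    first file name stored with that content.
    """
    index_table = {}
    duplicates = []
    for name, data in files.items():
        # Assumption: SHA-1 of the full content serves as the file index.
        digest = hashlib.sha1(data).hexdigest()
        if digest in index_table:
            duplicates.append(name)      # duplicate index: do not store again
        else:
            index_table[digest] = name   # unique index: store the file
    return index_table, duplicates


files = {
    "a.txt": b"nice people",
    "b.txt": b"good papers",
    "c.txt": b"nice people",
}
index_table, duplicates = file_level_dedup(files)
print(len(index_table), duplicates)
```

One index entry per file keeps the index small, but a single changed byte makes the whole file look unique.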
Sub-File Level: Fixed-Size Chunk (Fixed-Size Block Deduplication)
e.g. granularity: 15-byte fixed size
File1: nice people, good papers, and good conference, ……
  chunks: nice people, go | od papers, and | good conference | ……
File2: welcome, nice people, good papers, and good conference, ……
  chunks: welcome, nice p | eople, good pap | ers, and good c | ……
• Offset shifting problem: no redundancies found
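The 15-byte example can be reproduced with a short sketch. The helper name `fixed_chunks` is hypothetical; the two strings are the File1 and File2 contents from the slide, without the trailing ellipsis.

```python
def fixed_chunks(data, size=15):
    """Split data into consecutive chunks of `size` bytes (the last may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


file1 = b"nice people, good papers, and good conference"
file2 = b"welcome, " + file1  # same content, shifted by a 9-byte prefix

chunks1 = fixed_chunks(file1)
chunks2 = fixed_chunks(file2)
shared = set(chunks1) & set(chunks2)

print(chunks1)
print(chunks2)
print(shared)  # empty: the prefix shifts every boundary in file2
```

Because boundaries depend only on the offset, inserting bytes at the front changes every later chunk.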
Sub-File Level: Variable-Size Chunk (Variable-Size Block Deduplication)
e.g. matching pattern: “go”
File1: nice people, good papers, and good conference, ……
  chunks: nice people, go | od papers, and go | ……
File2: welcome, nice people, good papers, and good conference, ……
  chunks: welcome, nice people, go | od papers, and go | ……
• Boundaries are based on content, not on fixed offsets, so the chunk "od papers, and go" matches in both files
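A sketch of the pattern-based boundary rule from the slide, cutting a chunk right after every occurrence of "go". This literal-pattern rule is only the slide's illustration; practical variable-size chunkers usually place boundaries with a rolling hash instead. The helper name `pattern_chunks` is hypothetical.

```python
def pattern_chunks(data, pattern=b"go"):
    """Cut a chunk boundary right after each occurrence of `pattern`."""
    chunks = []
    start = 0
    pos = data.find(pattern)
    while pos != -1:
        end = pos + len(pattern)
        chunks.append(data[start:end])
        start = end
        pos = data.find(pattern, start)
    if start < len(data):
        chunks.append(data[start:])  # trailing bytes after the last boundary
    return chunks


file1 = b"nice people, good papers, and good conference"
file2 = b"welcome, " + file1

chunks1 = pattern_chunks(file1)
chunks2 = pattern_chunks(file2)
shared = set(chunks1) & set(chunks2)

print(chunks1)
print(chunks2)
print(shared)  # every chunk after the first is shared
```

Only the first chunk of File2 differs, since the inserted prefix no longer moves the later boundaries.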
Deduplication: Comparisons
• Deduplication ratio: File-level < Fixed size << Variable size (right is better)
• Processing time: File-level < Fixed size <<<< Variable size (right is worse)
• Index overhead: File-level << Fixed size, Variable size (right is worse)
• File-level end: good for client-based; variable-size end: good for server-based
• Current cloud storage systems: client-based
  • JustCloud, Mozy: file-level deduplication
  • Dropbox: large fixed-size block deduplication (4 MB)
Objective
Develop an efficient client-based deduplication that achieves:
• High deduplication ratio
• Low network traffic
• Low processing time
• Low index overhead
Outline
• Motivation, Background, and Goal
• Observations and Approach
• Design
• Evaluation
• Conclusion
Observations
[Example] An email consists of meta data, a body (text), and attachments (text, pdf, docx, images, …). A PDF page (<</Type/Page/ …>>) contains objects such as:
  Image object: <</Type/.. Image/.. Filter/.. Length>> <stream>Encoded image<endstream>
  Text object: <</Filter/ .. /Length >> <stream>Encoded text<endstream>
• A structured file can be decomposed into various objects
• Decomposition is fast and avoids the shifting problem
• e.g. compressed files (zip, rar, ..), document files (pdf, doc, ppt, docx, pptx), emails
Observations
• A large number of structured files exist in cloud-based storage systems
[Figure: dataset]
Our Approach (SAFE)
• Apply object-based deduplication to structured files
  • Decompose a file into objects
  • Find redundancies based on the decomposed objects
  • Combine small meta data into a single object (to reduce index size)
• Apply file-level deduplication to redundant files
  • Speeds up processing and keeps the index small
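The object-based steps can be sketched for zip-style containers such as docx and pptx. This is an illustration under stated assumptions, not the authors' parser: each archive member is treated as one object, SHA-1 is the index, and `SMALL_BYTES`, `make_archive`, `decompose`, and `object_dedup` are hypothetical names with an arbitrary 64-byte threshold.

```python
import hashlib
import io
import zipfile

SMALL_BYTES = 64  # hypothetical threshold: smaller members are merged into one object


def make_archive(members):
    """Build an in-memory zip archive from a dict of member name -> bytes."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, data in members.items():
            zf.writestr(name, data)
    return buf.getvalue()


def decompose(archive_bytes):
    """Split an archive into objects; small members are combined into one object."""
    objects, small = [], []
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        for name in zf.namelist():
            data = zf.read(name)
            (small if len(data) < SMALL_BYTES else objects).append(data)
    if small:
        objects.append(b"".join(small))  # one index entry instead of many
    return objects


def object_dedup(archives):
    """Return (total object count, index of unique objects keyed by digest)."""
    index, total = {}, 0
    for archive in archives:
        for obj in decompose(archive):
            total += 1
            index.setdefault(hashlib.sha1(obj).hexdigest(), obj)
    return total, index


image = b"\x89PNG" + bytes(200)  # stand-in for an embedded image shared by both files
doc1 = make_archive({"meta.xml": b"<title>v1</title>",
                     "body.xml": b"<p>" + b"a" * 100 + b"</p>",
                     "img.png": image})
doc2 = make_archive({"meta.xml": b"<title>v2</title>",
                     "body.xml": b"<p>" + b"b" * 100 + b"</p>",
                     "img.png": image})

total, index = object_dedup([doc1, doc2])
print(total, len(index))  # the shared image is stored once
```

File-level deduplication would find nothing here, because the two archives differ as whole files, while the object view exposes the shared image.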
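SAFE Architecture
[Figure: SAFE architecture. Components: email parser (splits emails into meta, pdf, img, … parts), file-level dedup, "Structured?" check, file parser with structure library, object manager, object-level dedup, store manager.]
• Files and emails pass through file-level dedup; processing ends for a redundant file
• A unique file is classified as structured or unstructured
• A structured file is decomposed into objects by the file parser, using the structure library
• The object manager hands (index, object) pairs to object-level dedup: all object indexes go in, unique object indexes come back
• The store manager receives the unique objects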
SAFE in Cloud Storage
[Figure: the client runs SAFE (file-level dedup, then object-level dedup). Object indexes go from the client to the server, the indexes of unique objects come back, and only the unique objects are uploaded.]
Setup
• Compared deduplication schemes
  • File-level (like JustCloud, Mozy)
  • Fixed block (4 MB, like Dropbox)
  • Variable block (8 KB average chunk size)
• Collected real data sets
  • Structured files (docx, pptx, and pdf)
  • From the file systems and emails of five graduate students in the same department
  • File system: 4 GB; emails: 2.5 GB
Evaluation Metrics
• Overhead
  • Processing time: relative to File-level
  • Index size: relative to File-level
• Performance
  • Deduplication ratio: space savings from removing redundancies
    • ((InputData - ConsumedStorage) / InputData) * 100
  • Network traffic: size of data transferred to storage over the network, in bytes
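The deduplication ratio formula translates directly into code. The function name `dedup_ratio` and the byte counts in the example are hypothetical, chosen only to illustrate the arithmetic; they are not measurements from the evaluation.

```python
def dedup_ratio(input_bytes, consumed_bytes):
    """Space savings in percent: ((InputData - ConsumedStorage) / InputData) * 100."""
    if input_bytes <= 0:
        raise ValueError("input_bytes must be positive")
    return (input_bytes - consumed_bytes) / input_bytes * 100


# Hypothetical example: 4000 bytes of input occupying 2000 bytes after deduplication.
ratio = dedup_ratio(4000, 2000)
print(ratio)
```

A ratio of 0 means nothing was saved, and values closer to 100 mean most of the input was redundant.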
Deduplication Ratio
• SAFE achieves about 30% to 60%
• SAFE is 2 times higher than File-level
• SAFE is as good as variable-size block deduplication (Block-V) for email datasets
• SAFE is even higher than Block-V for file system datasets
[Figure: deduplication ratio for file system datasets and email datasets, annotated x1.5 and x2]
Network Traffic
• SAFE has the lowest traffic for both datasets
• SAFE is 15% and 30% lower than file-level deduplication (File) and fixed-size block deduplication (Block-F) for both datasets
[Figure: network traffic for file system datasets and email datasets, annotated 15% and 30%]
Processing Time
• SAFE is hundreds of times faster than Block-V
• SAFE is as fast as File-level
[Figure: processing time for file system datasets and email datasets, annotated "hundreds of times"]
Index Size
• Proportional to the number of unique blocks (40 B per index)
  • e.g. for 4000 emails, index sizes are 0.1 MB (file-level) and 1.3 MB (SAFE)
• 2 to 3 times smaller in SAFE (1.3 MB) than in Block-V (3.7 MB)
  • Block-V has an 8 KB block size on average
• 2 times larger for the file system datasets than for the email datasets
  • SAFE produces multiple decomposed objects per file
  • The file system dataset has more pdf files (a pdf file can be decomposed into more objects than a docx file)
[Figure: index size for file system datasets and email datasets]
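The index-size relation can be checked with a little arithmetic. The sketch assumes 1 MB = 10^6 bytes and that the reported sizes are rounded, so the entry counts it derives are rough implied values rather than figures from the slides. The function names are hypothetical.

```python
INDEX_ENTRY_BYTES = 40  # per-index size stated on the slide


def index_size_mb(unique_entries):
    """Index size in MB (assuming 1 MB = 10**6 bytes) for a number of unique entries."""
    return unique_entries * INDEX_ENTRY_BYTES / 1e6


def implied_entries(size_mb):
    """Approximate number of index entries implied by a reported index size."""
    return round(size_mb * 1e6 / INDEX_ENTRY_BYTES)


file_level = implied_entries(0.1)  # reported file-level index size
safe = implied_entries(1.3)        # reported SAFE index size
block_v = implied_entries(3.7)     # reported Block-V index size

print(file_level, safe, block_v)
print(block_v / safe)  # between 2 and 3, matching the "2 to 3 times" claim
```

Since every entry costs the same 40 B, the size ratios between schemes are simply the ratios of their unique entry counts.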
Conclusions
Developed an efficient structure-aware client-based deduplication (SAFE):
• High deduplication ratio: as good as Block-V
• Low network traffic: as good as Block-V
• Low processing time: hundreds of times faster than Block-V
• Low index overhead: 2 to 3 times less than Block-V
• Future work: extend to incorporate more structured file types
Thank you! Questions?
{daehee.kim, sjsong, choiby}@umkc.edu