SAFE : Structure-Aware File and Email Deduplication for Cloud-based Storage Systems

Daehee Kim, Sejun Song, Baek-Young Choi
University of Missouri-Kansas City

Cloud Storage (Dropbox, Google Drive, …)
• e.g. remote backup: anywhere, anytime
• Network: high network bandwidth consumption
• Server: large storage consumption
• Client: high uploading overhead
[Figure: cloud storage users: Individual, Employee (Sales), Employee (Marketing)]

Data deduplication
• Deduplication granularity
  • File-level
  • Sub-file-level
    • Fixed-size chunk
    • Variable-size chunk
• Deduplication location
  • Server-based: traditionally on the high-capacity servers
  • Client-based: limited by the client capacity

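File-Level (File-Level Deduplication)
[Figure: control and data paths between the index table and storage. A unique index is added to the index table and its data is stored; a duplicate index is rejected (X).]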
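The index-table lookup on this slide can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the class name `FileLevelStore`, the in-memory dictionary, and the choice of SHA-1 as the index are all assumptions made for the example.

```python
import hashlib


class FileLevelStore:
    """Toy store that keeps one copy per distinct file content (hypothetical)."""

    def __init__(self) -> None:
        # index (SHA-1 hex digest, an assumed choice) -> stored file data
        self.index_table: dict[str, bytes] = {}

    def put(self, data: bytes) -> tuple[str, bool]:
        """Return (index, stored); stored is False when the file is a duplicate."""
        index = hashlib.sha1(data).hexdigest()
        if index in self.index_table:
            return index, False  # duplicate index: nothing new is stored
        self.index_table[index] = data
        return index, True  # unique index: the data goes to storage


store = FileLevelStore()
first = store.put(b"quarterly report")
second = store.put(b"quarterly report")  # same content, uploaded again
```

The second `put` finds the index already present, so only one copy of the data is kept.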

Sub-File Level: Fixed-Size Chunk (Fixed-Size Block Deduplication)
e.g. granularity: 15-byte fixed size
File1: nice people, good papers, and good conference, ……
  chunks: "nice people, go" | "od papers, and " | "good conference" | ……
File2: welcome, nice people, good papers, and good conference, ……
  chunks: "welcome, nice p" | "eople, good pap" | "ers, and good c" | ……
Offset shifting problem: no redundancies found
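The offset shifting problem can be reproduced with the two strings from the slide. The helper name `fixed_chunks` is made up for this sketch; the 15-byte granularity is the one used in the example above.

```python
def fixed_chunks(data: bytes, size: int) -> list[bytes]:
    """Cut data at fixed offsets: 0, size, 2*size, ..."""
    return [data[i:i + size] for i in range(0, len(data), size)]


file1 = b"nice people, good papers, and good conference"
file2 = b"welcome, " + file1  # same content with a 9-byte prefix

chunks1 = fixed_chunks(file1, 15)
chunks2 = fixed_chunks(file2, 15)

# Every boundary in file2 is shifted by the prefix, so no chunk matches.
shared = set(chunks1) & set(chunks2)
```

Although `file2` contains all of `file1`, `shared` is empty because the prefix moves every chunk boundary.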

Sub-File Level: Variable-Size Chunk (Variable-Size Block Deduplication)
e.g. matching pattern: “go”
File1: nice people, good papers, and good conference, ……
  chunks: "nice people, go" | "od papers, and go" | ……
File2: welcome, nice people, good papers, and good conference, ……
  chunks: "welcome, nice people, go" | "od papers, and go" | ……
The chunk "od papers, and go" is the same in both files.
Boundaries are based on content, not on a fixed offset.
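The same two strings show how content-defined boundaries survive the shift. This sketch cuts after each literal occurrence of the pattern, exactly as in the slide's toy example; the function name `pattern_chunks` is hypothetical, and production systems typically detect boundaries with a rolling hash rather than a literal pattern.

```python
def pattern_chunks(data: bytes, pattern: bytes) -> list[bytes]:
    """Cut data right after every occurrence of pattern."""
    chunks, start = [], 0
    while True:
        pos = data.find(pattern, start)
        if pos == -1:
            break
        end = pos + len(pattern)
        chunks.append(data[start:end])
        start = end
    if start < len(data):
        chunks.append(data[start:])  # trailing bytes after the last boundary
    return chunks


file1 = b"nice people, good papers, and good conference"
file2 = b"welcome, " + file1

chunks1 = pattern_chunks(file1, b"go")
chunks2 = pattern_chunks(file2, b"go")
shared = set(chunks1) & set(chunks2)
```

Only the first chunk differs between the two files; every chunk after the first boundary is found as a duplicate.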

Deduplication: Comparisons
(left side: good for client-based; right side: good for server-based)
• Deduplication ratio: File-level < Fixed size << Variable size (better)
• Processing time: File-level < Fixed size <<<< Variable size (worse)
• Index overhead: File-level << Fixed size, Variable size (worse)
• Current cloud storage systems
  • Client-based
  • JustCloud, Mozy: file deduplication
  • Dropbox: large fixed-size block deduplication (4 MB)

Objective
Develop an efficient client deduplication that achieves
• High deduplication ratio
• Low network traffic
• Low processing time
• Less index overhead

Outline
• Motivation, Background, and Goal
• Observations and Approach
• Design
• Evaluation
• Conclusion

Observations
• A structured file can be decomposed into various objects
• Fast decomposition without the shifting problem
• e.g. compressed files (zip, rar, ..), document files (pdf, doc, ppt, docx, pptx), emails
[ Example ] email: meta, body (text), attachments (text, pdf, docx, images, …)
[ Example ] pdf page: <</Type/Page/ …>>
  Image object: <</Type/.. Image/.. Filter/.. Length>> <stream>Encoded image<endstream>
  Text object: <</Filter/ .. /Length >> <stream>Encoded text<endstream>
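The email example can be illustrated with Python's standard `email` package. This is a sketch under stated assumptions, not the SAFE email parser: the helper names `build_email` and `decompose_email`, the addresses, and the choice of which headers count as "meta" are all made up for the example.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def build_email(subject: str, attachment: bytes) -> bytes:
    """Create a small test message (every value here is made up)."""
    msg = EmailMessage()
    msg["From"] = "alice@example.com"
    msg["To"] = "bob@example.com"
    msg["Subject"] = subject
    msg.set_content("Please find the slides attached.")
    msg.add_attachment(attachment, maintype="application",
                       subtype="pdf", filename="slides.pdf")
    return msg.as_bytes()


def decompose_email(raw: bytes) -> dict[str, list[bytes]]:
    """Split a raw email into meta, body, and attachment objects."""
    msg = message_from_bytes(raw, policy=policy.default)
    # Assumption: only these three headers are treated as the meta object.
    meta = "\n".join(f"{name}: {msg[name]}"
                     for name in ("From", "To", "Subject")).encode()
    body = msg.get_body(preferencelist=("plain",))
    body_bytes = body.get_payload(decode=True) if body is not None else b""
    attachments = [part.get_payload(decode=True)
                   for part in msg.iter_attachments()]
    return {"meta": [meta], "body": [body_bytes], "attachments": attachments}


pdf = b"%PDF-1.4 placeholder bytes"
email1 = decompose_email(build_email("slides", pdf))
email2 = decompose_email(build_email("slides (resend)", pdf))
```

The two messages differ as whole files because their subjects differ, but the decoded attachment object is byte-identical in both, so it can be detected as redundant.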

Observations
• A large number of structured files exist in cloud-based storage systems
[Figure: dataset]

Our Approach (SAFE)
• Apply object-based deduplication for structured files
  • Decompose a file into objects
  • Find redundancies based on decomposed objects
  • Combine small-sized metadata into an object (to reduce index sizes)
• Apply file-level deduplication for redundant files
  • Speed up and small index sizes
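The object-based steps above can be sketched for zip-based containers (docx and pptx files are zip archives). This is an illustration only and not the authors' parser: the function names, the 64-byte `SMALL` threshold for merging metadata, and SHA-1 as the index are assumptions.

```python
import hashlib
import io
import zipfile

SMALL = 64  # hypothetical threshold (bytes) below which members are merged


def make_container(parts: dict[str, bytes]) -> bytes:
    """Build an in-memory zip archive standing in for a docx-like file."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, data in parts.items():
            zf.writestr(name, data)
    return buf.getvalue()


def decompose(container: bytes) -> list[bytes]:
    """Split a container into objects; small members become one object."""
    objects, small = [], []
    with zipfile.ZipFile(io.BytesIO(container)) as zf:
        for name in sorted(zf.namelist()):
            data = zf.read(name)
            (small if len(data) < SMALL else objects).append(data)
    if small:
        objects.append(b"".join(small))  # one index instead of many
    return objects


def dedup(files: list[bytes]) -> dict[str, bytes]:
    """Keep one copy of each distinct object across all files."""
    store: dict[str, bytes] = {}
    for f in files:
        for obj in decompose(f):
            store.setdefault(hashlib.sha1(obj).hexdigest(), obj)
    return store


image = bytes(range(256)) * 4  # 1 KiB object shared by both documents
doc_a = make_container({"media/img.png": image, "document.xml": b"A" * 200,
                        "core.xml": b"<title>a</title>", "app.xml": b"<app/>"})
doc_b = make_container({"media/img.png": image, "document.xml": b"B" * 200,
                        "core.xml": b"<title>b</title>", "app.xml": b"<app/>"})
store = dedup([doc_a, doc_b])
```

Each document decomposes into three objects (image, text, merged metadata). The two documents differ as whole files, yet the store ends up with five objects instead of six because the image is kept once.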

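SAFE Architecture
[Figure: SAFE architecture. Components: Email parser (meta, pdf, img), File-level dedup, Structure Library, File parser, Object manager, Object-level dedup, Store manager. Flow labels: Files and Emails enter; a redundant file ends at file-level dedup; a unique file is checked by "Structured?" and continues as a structured file or an unstructured file; objects, all object indexes, unique object indexes, and (index, object) pairs pass between the Object manager, Object-level dedup, and the Store manager.]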

SAFE in Cloud Storage
[Figure: the Client runs SAFE (file-level dedup, then object-level dedup). Labels on the Client-Server exchange: Indexes (objects), Indexes (unique objects), unique objects.]
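One plausible reading of the exchange in this figure is: the client sends the indexes of its objects, the server answers with the indexes it does not yet hold, and the client uploads only those objects. The sketch below models that reading in memory; the `Server` class, `client_backup` function, and SHA-1 index are hypothetical and no real network protocol is implied.

```python
import hashlib


class Server:
    """In-memory stand-in for the storage server (hypothetical)."""

    def __init__(self) -> None:
        self.objects: dict[str, bytes] = {}

    def unique_indexes(self, indexes: list[str]) -> list[str]:
        """Return the indexes the server has not stored yet."""
        return [i for i in indexes if i not in self.objects]

    def upload(self, objects: dict[str, bytes]) -> None:
        self.objects.update(objects)


def client_backup(server: Server, objects: list[bytes]) -> int:
    """Upload only unique objects; return the object bytes actually sent."""
    by_index = {hashlib.sha1(o).hexdigest(): o for o in objects}
    wanted = server.unique_indexes(list(by_index))  # indexes go out first
    payload = {i: by_index[i] for i in wanted}
    server.upload(payload)                          # then unique objects only
    return sum(len(o) for o in payload.values())


server = Server()
sent_first = client_backup(server, [b"a" * 100, b"b" * 50])
sent_second = client_backup(server, [b"a" * 100, b"c" * 10])
```

In the second backup the 100-byte object is already on the server, so only the 10-byte object crosses the network.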

Setup
• Compared deduplications
  • File-level (like JustCloud, Mozy)
  • Fixed block (4 MB, like Dropbox)
  • Variable block (8 KB average chunk size)
• Collected real data sets
  • Structured files (docx, pptx, and pdf)
  • From file system and emails of five graduate students in the same department
  • File system: 4 GB, emails: 2.5 GB

Evaluation Metrics
• Overhead
  • Processing time: processing time relative to File-level
  • Index size: index size relative to File-level
• Performance
  • Deduplication ratio: space savings by removing redundancies
    • ((InputData - ConsumedStorage) / InputData) * 100
  • Network traffic: size of data transferred to storage over the network, in bytes
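The deduplication ratio formula above translates directly into code. The input and consumed sizes in the example call are invented numbers chosen only to show the arithmetic; they are not results from the evaluation.

```python
def dedup_ratio(input_bytes: int, consumed_bytes: int) -> float:
    """Space savings in percent: ((InputData - ConsumedStorage) / InputData) * 100."""
    return (input_bytes - consumed_bytes) / input_bytes * 100


# Hypothetical example: 4000 MB of input that occupies 2400 MB after deduplication.
example = dedup_ratio(4000, 2400)
```

With these made-up sizes, 1600 MB out of 4000 MB is saved, which is a ratio of about 40%.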

Deduplication Ratio
• is about 30% to 60% in SAFE
• is 2 times higher in SAFE than in “File-level”
• is as good in SAFE as variable-size block deduplication (Block-V) for email datasets
• is even higher in SAFE than Block-V for file system datasets
[Figure: deduplication ratio for file system datasets and email datasets; annotations: x1.5, x2]

Network Traffic
• is the lowest in SAFE for both datasets
• is 15% and 30% lower in SAFE than file-level deduplication (File) and fixed-size block deduplication (Block-F) for both datasets
[Figure: network traffic for file system datasets and email datasets; annotations: 15%, 30%]

Processing Time
• is hundreds of times faster in SAFE than in Block-V
• is as fast in SAFE as in File-level
[Figure: processing time for file system datasets and email datasets; annotation on both: hundreds of times]

Index Size
• Is proportional to the number of unique blocks (40 B per index)
  • e.g. for 4000 emails, index sizes are 0.1 MB (file-level) and 1.3 MB (SAFE)
• Is 2 to 3 times smaller in SAFE (1.3 MB) than in Block-V (3.7 MB)
  • Block-V has an 8 KB block size on average
• Is 2 times larger in file system than in email datasets
  • SAFE has multiple decomposed objects for a file
  • e.g. the file system dataset has more pdf files (a pdf file can be decomposed into more objects than docx)
[Figure: index size for file system datasets and email datasets]
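Since the index size is proportional to the number of unique blocks at 40 B per index, the sizes quoted for the 4000-email example can be turned into approximate entry counts. This is a rough back-of-the-envelope sketch; it assumes 1 MB = 1,000,000 bytes, which the slide does not state.

```python
INDEX_ENTRY_BYTES = 40  # from the slide: 40 B per index


def index_entries(index_size_mb: float) -> int:
    """Approximate number of index entries, assuming 1 MB = 10**6 bytes."""
    return round(index_size_mb * 1_000_000 / INDEX_ENTRY_BYTES)


file_level = index_entries(0.1)  # file-level index for the 4000 emails
safe = index_entries(1.3)        # SAFE index
block_v = index_entries(3.7)     # Block-V index
ratio = 3.7 / 1.3                # Block-V index size relative to SAFE
```

Under that assumption the three indexes hold roughly 2,500, 32,500, and 92,500 entries, and Block-V's index is about 2.8 times the size of SAFE's, in line with the "2 to 3 times" figure.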

Conclusions
Developed an efficient structure-aware client-based deduplication (SAFE)
• High deduplication ratio: as good as Block-V
• Low network traffic: as good as Block-V
• Low processing time
  • hundreds of times faster than Block-V
• Less index overhead
  • 2 to 3 times less than Block-V
• Future work
  • Extend to incorporate more structured file types

Thank you! Questions?
{daehee.kim, sjsong, choiby}@umkc.edu
