Understanding the Benefits and Costs of Deduplication Mahmoud Abaza, and Joel Gibson School of Computing and Information Systems, Athabasca University mahmouda@athabascau.ca
Questions to ask… • What is deduplication? • Why is it important to understand? • Do all vendors implement deduplication the same way? • How much reduction in physical disk storage can be expected, if any?
…more questions to ask • What are the advantages and disadvantages, including risk? • Is it worth it to my IT budget? • Is deduplication strictly a business tool, or could it benefit home users?
Types of deduplication • File-based (example: Microsoft's SIS system) • Block-based (a digital signature is computed for each block) • Delta encoding (storing one full file plus only the differences between it and related files)
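The block-based approach can be sketched as follows. This is a minimal illustration, not any vendor's implementation: the fixed 4 KB block size, SHA-256 signatures, and the in-memory dictionary acting as the block store are all assumptions for demonstration.

```python
import hashlib

BLOCK_SIZE = 4096  # fixed-size blocks; real systems may use variable-size chunking

def dedup_store(data: bytes, store: dict) -> list:
    """Split data into blocks, keep one copy per unique block signature,
    and return the list of signatures (the file's 'recipe')."""
    recipe = []
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        sig = hashlib.sha256(block).hexdigest()
        if sig not in store:      # new block: store it once
            store[sig] = block
        recipe.append(sig)        # duplicate block: just reference it
    return recipe

def restore(recipe: list, store: dict) -> bytes:
    """Rebuild the original file from its recipe of signatures."""
    return b"".join(store[sig] for sig in recipe)

store = {}
file_a = b"A" * 8192 + b"B" * 4096   # two files that share most blocks
file_b = b"A" * 8192 + b"C" * 4096
recipe_a = dedup_store(file_a, store)
recipe_b = dedup_store(file_b, store)
# 24 KB of logical data, but only 3 unique 4 KB blocks physically stored
```

The key design point is that the signature, not the data itself, is used to detect duplicates, so identical blocks are stored exactly once no matter how many files contain them.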
Deduplication side • Client-side deduplication (deduplication performed before copying to the storage array) • Target-side deduplication (deduplication that occurs on a backup set after it has been copied to the storage array)
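The advantage of the client side can be sketched as below. This is a hypothetical simulation, not a real backup protocol: the dictionary stands in for the target array's signature index, and the client "transmits" only blocks whose signatures the target does not already hold.

```python
import hashlib

def client_side_backup(blocks, server_index):
    """Client-side dedup sketch: hash each block locally and send only
    blocks whose signature the target does not already have.
    Returns the number of bytes that would cross the network."""
    sent_bytes = 0
    for block in blocks:
        sig = hashlib.sha256(block).hexdigest()
        if sig not in server_index:
            server_index[sig] = block  # simulate transfer + store on target
            sent_bytes += len(block)
    return sent_bytes

server = {}                              # the target array's signature index
monday = [b"base" * 1024] * 4 + [b"logs-mon" * 512]
sent_day1 = client_side_backup(monday, server)
tuesday = [b"base" * 1024] * 4 + [b"logs-tue" * 512]  # mostly unchanged
sent_day2 = client_side_backup(tuesday, server)
# sent_day2 < sent_day1: only the changed block travels on the second run
```

Target-side deduplication would instead ship every block first and eliminate duplicates on the array, trading network bandwidth for simpler clients.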
Target-side Deduplication Process • In-line processing (deduplication happens while data is being ingested into the storage system) • Post-processing (data is first written to disk, then checked for duplicate copies)
Inline Processing • Advantage: reduces the amount of overall disk I/O • Disadvantage: slower ingestion, since data must be checked for duplicates before it is written
Post Processing • Advantage: multiple hosts and CPUs can be involved to make the process fast • Disadvantage: requires a large pool of temporary storage, plus heavy disk I/O
How much reduction in physical disk storage can be expected, if any? • It depends on the type of data. Case studies: • Using Data Domain LLC's technology, TiVo was able to achieve "data compression rates of 30 to 1 consistently." • A study of SIS found that "for 4 weeks of full backups," file-based deduplication "achieves 87% of the savings of block-based."
Experimental Results A deduplication algorithm was run against real-world data on a personal workstation. We chose to back up a set of folders containing mostly software downloads, music, photos, and videos – a real challenge, since these files are typically already compressed.
Home-Based Deduplication Results • Run #1 - Initial Backup • New files added to backup: 15 935 • Total size of files: 98.8 GB • Physical disk space used for backup: 85.5 GB • Time to process: 03:13:47 hh:mm:ss • Run #2 - Second Backup • New files added to backup: 57 • Size of files: 105 MB • Physical disk space used for backup: 83.7 MB • Time to process: 00:01:49 hh:mm:ss
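The space savings implied by the two runs above can be worked out directly from the reported figures (units taken as given; the percentages and ratios below are derived, not reported in the experiment):

```python
def dedup_stats(logical, physical):
    """Return (space savings in %, deduplication ratio) for one backup run,
    given the logical data size and the physical space it occupied."""
    savings = (1 - physical / logical) * 100
    ratio = logical / physical
    return round(savings, 1), round(ratio, 2)

# Run #1: 98.8 GB of files stored in 85.5 GB of physical space
run1 = dedup_stats(98.8, 85.5)   # roughly 13.5% saved, about 1.16 : 1
# Run #2: 105 MB of new files stored in 83.7 MB of physical space
run2 = dedup_stats(105, 83.7)    # roughly 20.3% saved, about 1.25 : 1
```

These modest ratios are consistent with the dataset described: already-compressed media files leave little duplicate data to eliminate, far from the 30:1 rates reported for backup-heavy enterprise workloads.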
Conclusion Deduplication can mean different things to different vendors, but the basic premise is the same – eliminate duplicate data.