DeDu: Building a Deduplication Storage System over Cloud Computing. Speaker: Yen-Yi Chen, MA190104. Date: 2013/05/28. This paper appears in: Computer Supported Cooperative Work in Design (CSCWD), 2011 15th International Conference on. Date of Conference: 8-10 June 2011. Author(s): Zhe Sun, Jun Shen, Fac. of Inf., Univ. of Wollongong, Wollongong, NSW, Australia; Jianming Yong, Fac. of Bus., Univ. of Southern Queensland, Toowoomba, QLD, Australia
Outline • Introduction • Two issues to be addressed • Deduplication • Theories and approaches • System design • Simulations and Experiments • Conclusions
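Introduction • Rise of cloud computing and distributed system architectures • Information explosion and massive data volumes • Rising cost of storage equipment • Need to increase data transfer while easing network bandwidth usage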
Introduction • System name: DeDu • Front-end: deduplication application • Back-end: Hadoop Distributed File System (HDFS) and HBase
Two issues to be addressed • How does the system identify duplicates? Hash functions: MD5 and SHA-1 • How does the system manage the data? HDFS and HBase
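The first issue can be illustrated with a minimal sketch, assuming Python's standard `hashlib` module. Concatenating the MD5 and SHA-1 digests into one key is an assumption of this example, not a detail stated on the slide, and `fingerprint` and `is_duplicate` are hypothetical helper names.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Return a combined MD5 + SHA-1 hex digest for the given bytes.

    Concatenating the two digests is an assumption of this sketch.
    """
    md5 = hashlib.md5(data).hexdigest()    # 32 hex characters
    sha1 = hashlib.sha1(data).hexdigest()  # 40 hex characters
    return md5 + sha1


def is_duplicate(data: bytes, seen: set) -> bool:
    """Report whether identical content was seen before; record it if not."""
    key = fingerprint(data)
    if key in seen:
        return True
    seen.add(key)
    return False
```

Using two independent hash functions makes it less likely that two different files map to the same key than with either function alone.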
Deduplication [Figure: data stores holding data chunks labelled A, B, and C] • 1. Data chunks are evaluated to determine a unique signature for each. • 2. Signature values are compared to identify all duplicates. • 3. Duplicate data chunks are replaced with pointers to a single stored chunk, saving storage space.
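The three steps above can be sketched as follows. This is an illustrative in-memory example, not the DeDu implementation: the use of SHA-1 as the signature, the names `deduplicate` and `restore`, and the sample chunks are all assumptions.

```python
import hashlib


def deduplicate(chunks):
    """Store each distinct chunk once and return (store, pointers)."""
    store = {}     # signature -> the single stored copy of the chunk
    pointers = []  # one signature per input chunk, in original order
    for chunk in chunks:
        signature = hashlib.sha1(chunk).hexdigest()  # step 1: compute signature
        if signature not in store:                   # step 2: compare signatures
            store[signature] = chunk
        pointers.append(signature)                   # step 3: keep only a pointer
    return store, pointers


def restore(store, pointers):
    """Rebuild the original chunk sequence by following the pointers."""
    return [store[signature] for signature in pointers]


chunks = [b"A", b"C", b"A", b"C", b"B", b"B"]
store, pointers = deduplicate(chunks)
```

Here six input chunks are reduced to three stored chunks, and `restore(store, pointers)` rebuilds the original sequence.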
Theories and approaches • A. Architecture of source data and link files • B. Architecture of the deduplication cloud storage system
System design • Data organisation • Storage of the files • Access to the files • Deletion of files
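The storage, access, and deletion operations listed above can be sketched with a small in-memory model. This is a hypothetical illustration, not DeDu's implementation: plain dictionaries stand in for HDFS and HBase, the class and method names are invented for the example, and the reference count used to decide when a source file can be removed is an assumption of this sketch.

```python
import hashlib


class DedupStore:
    """Toy model: one stored source per distinct content, one link per file name."""

    def __init__(self):
        self.sources = {}   # content hash -> file content (stands in for HDFS)
        self.links = {}     # file name -> content hash (stands in for link files)
        self.refcount = {}  # content hash -> number of links pointing at it

    def store(self, name, data):
        """Store a file; identical content is kept only once."""
        if name in self.links:
            self.delete(name)  # replacing a file drops its old link first
        key = hashlib.sha1(data).hexdigest()
        if key not in self.sources:
            self.sources[key] = data
            self.refcount[key] = 0
        self.refcount[key] += 1
        self.links[name] = key
        return key

    def access(self, name):
        """Follow the link for a file name to its stored content."""
        return self.sources[self.links[name]]

    def delete(self, name):
        """Remove a link; drop the source only when no link refers to it."""
        key = self.links.pop(name)
        self.refcount[key] -= 1
        if self.refcount[key] == 0:
            del self.refcount[key]
            del self.sources[key]
```

Counting references is one way to make deletion safe: removing one user's file must not remove content that another file name still points to.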
Conclusions • 1. The fewer the data nodes, the higher the writing efficiency but the lower the reading efficiency. • 2. The more the data nodes, the lower the writing efficiency but the higher the reading efficiency. • 3. When a single file is large, the time to calculate its hash value is higher but the transmission cost is low. • 4. When a single file is small, the time to calculate its hash value is lower but the transmission cost is high.