220 likes | 415 Views
Identifying File Similarity in Large Data Sets by Modulo File Length. Yongtao Zhou, Yuhui Deng, Xiaoguang Cheng, Junjie Xie Department of Computer Science, Jinan University, Guangzhou, 510632, P. R.China. Agenda. Motivation Challenges Simhash Traditional sampling algorithm(TSA)
E N D
Identifying File Similarity in Large Data Sets by Modulo File Length Yongtao Zhou, Yuhui Deng, Xiaoguang Cheng, Junjie Xie Department of Computer Science, Jinan University, Guangzhou, 510632, P. R.China
Agenda • Motivation • Challenges • Simhash • Traditional sampling algorithm(TSA) • Position-aware similarity algorithm(PAS) • Evaluation • Conclusion
Motivation • The explosive growth of data • IDC: 4ZBbytes data in 2014, 50% growth in contrast to 2012 • Data more and more complex • 5V: Volume, Variety, Velocity, Value, and Veracity • Structured, semi-structured, and unstructured data • File similarity detection is very important to data management • Cluster similar data in data mining • Find similar file in backup • Employ similarity to enhance the cache hierarchy in clouds • …………
Challenges • Reducing the computing overhead of similarity detection • Similarity detection is I/O bound and CPU bound task • Require lots of memory space and frequently access disk • The compute overhead increase with the growth of data sets • Reducing the time of similarity detection • Some applications requiring real time and high throughput • Achieving both the efficiency and accuracy
Simhash • Fingerprints of similar files differ in a small number of bit positions • Determine the similarity of files by working out their Hamming distance
Traditional sampling algorithm(TSA) • TSA is described in algorithm 1 by using pseudo-code • Transforming similarity detection problem into set intersection problem
Traditional sampling algorithm(TSA) • TSA is simple and fixed overhead, but it is very sensitive to file modifications • We chosen n = 6, Lenc = 1KB.
Position-aware similarity algorithm(PAS) • PAS is described in algorithm 2 by using pseudo-code
Position-aware similarity algorithm(PAS) • We chosen T as 28KB, n = 6, Lenc = 1KB
Evaluation environment • Ubuntu operation system (kernel version is 2.6.32) • VirtualBox virtual machine software(4.3.8.r92456) • 1GB memory • 2.0 GHz Intel(R) Pentium(R) CPU
Data set • Data set D1 • 11.5GB, 2756 files • Table 1 summarizes the profile of D1 • Figure 4 show the distribution of file size • Data set D2 • 14 txt files • 128MB Table 1. The profile of data set D1 Figure 4. The file size distribution of data set D1
The principle of parameters selection • Comparing the detection probability of PAS against the actual of matching chunks. • Splitting file into chunks, then maps these chunks into fingerprint by using MD5 hash function and get fingerprint set Finger(A). The actual portion of matching chunk fingerprint of file A and file B is described as follows:
Sampling position impact factor T • The impact of T on the detection probability (Lenc = 32byte, N = 8, T = 2KB, 8KB, 32KB, 128KB, 512KB) • We take T = 512K.
PAS compare to Simhash: Time overhead The size of file are 2MB, 5MB and 10MB, respectively Data set D1
PAS compare to Simhash: CPU and memory usage CPU and memory utilization of PAS and simhash with data set D1
Conclusion • PAS is very effective in detecting file similarity • The time overhead, CPU and memory occupation of PAS are much less than that of simhash We believe that the PAS algorithm is applicable.