Identifying File Similarity in Large Data Sets by Modulo File Length

Identifying File Similarity in Large Data Sets by Modulo File Length Yongtao Zhou, Yuhui Deng, Xiaoguang Cheng, Junjie Xie Department of Computer Science, Jinan University, Guangzhou, 510632, P. R.China

Agenda • Motivation • Challenges • Simhash • Traditional sampling algorithm(TSA) • Position-aware similarity algorithm(PAS) • Evaluation • Conclusion

Motivation • The explosive growth of data • IDC: 4ZBbytes data in 2014, 50% growth in contrast to 2012 • Data more and more complex • 5V: Volume, Variety, Velocity, Value, and Veracity • Structured, semi-structured, and unstructured data • File similarity detection is very important to data management • Cluster similar data in data mining • Find similar file in backup • Employ similarity to enhance the cache hierarchy in clouds • …………

Challenges • Reducing the computing overhead of similarity detection • Similarity detection is I/O bound and CPU bound task • Require lots of memory space and frequently access disk • The compute overhead increase with the growth of data sets • Reducing the time of similarity detection • Some applications requiring real time and high throughput • Achieving both the efficiency and accuracy

Simhash • Fingerprints of similar files differ in a small number of bit positions • Determine the similarity of files by working out their Hamming distance

Traditional sampling algorithm(TSA) • TSA is described in algorithm 1 by using pseudo-code • Transforming similarity detection problem into set intersection problem

Traditional sampling algorithm(TSA) • TSA is simple and fixed overhead, but it is very sensitive to file modifications • We chosen n = 6, Lenc = 1KB.

Position-aware similarity algorithm(PAS)

Position-aware similarity algorithm(PAS) • PAS is described in algorithm 2 by using pseudo-code

Position-aware similarity algorithm(PAS) • We chosen T as 28KB, n = 6, Lenc = 1KB

Evaluation environment • Ubuntu operation system (kernel version is 2.6.32) • VirtualBox virtual machine software(4.3.8.r92456) • 1GB memory • 2.0 GHz Intel(R) Pentium(R) CPU

Data set • Data set D1 • 11.5GB, 2756 files • Table 1 summarizes the profile of D1 • Figure 4 show the distribution of file size • Data set D2 • 14 txt files • 128MB Table 1. The profile of data set D1 Figure 4. The file size distribution of data set D1

The principle of parameters selection • Comparing the detection probability of PAS against the actual of matching chunks. • Splitting file into chunks, then maps these chunks into fingerprint by using MD5 hash function and get fingerprint set Finger(A). The actual portion of matching chunk fingerprint of file A and file B is described as follows:

Sampling position impact factor T • The impact of T on the detection probability (Lenc = 32byte, N = 8, T = 2KB, 8KB, 32KB, 128KB, 512KB) • We take T = 512K.

PAS and Simhash parameters configuration

PAS compare to Simhash: Time overhead The size of file are 2MB, 5MB and 10MB, respectively Data set D1

PAS compare to Simhash: CPU and memory usage CPU and memory utilization of PAS and simhash with data set D1

PAS compare to Simhash: Precision and Recall

Conclusion • PAS is very effective in detecting file similarity • The time overhead, CPU and memory occupation of PAS are much less than that of simhash We believe that the PAS algorithm is applicable.

Thanks!

Identifying File Similarity in Large Data Sets by Modulo File Length

Identifying File Similarity in Large Data Sets by Modulo File Length

Presentation Transcript

Data File Structures

Data File Handling in C++

Example Data File

Data File Corruption

Data File Handling

Generic Entity Resolution: Identifying Real-World Entities in Large Data Sets

Data sets SAS, flat file, etc.

Large Scale File Systems

using large data sets

DATA FILE HANDLING

DATA FILE HANDLING IN C++

Very large data sets

HTS data file

Send A Large File - Antzip.com

using large data sets

using large data sets

Manipulating Large Data Sets

Large File Transfer Free Sendbig.com