Value-Based Program Characterization and Its Application to Software Plagiarism Detection. ICSE 2011 Yoon-Chan Jhi, Xinran Wang, Sencun Zhu, Peng Liu, Dinghao Wu Penn State University Xiaoqi Jia State Key Laboratory of Information Security, Institute of Software,
Value-Based Program Characterization and Its Application to Software Plagiarism Detection • ICSE 2011 • Yoon-Chan Jhi, Xinran Wang, Sencun Zhu, Peng Liu, Dinghao Wu (Penn State University) • Xiaoqi Jia (State Key Laboratory of Information Security, Institute of Software, Chinese Academy of Sciences) • Presented by Park Yeongseong, Embedded Lab.
Contents • Introduction • State of the art • Core values • Design • Experiment • Discussion • Conclusion • Q&A
Introduction • Identifying identical or similar code is very important • Previous work • Static source code comparison – C1 • Static executable code comparison – C2 • Dynamic control flow based methods – C3 • Dynamic API based methods – C4
Introduction • Three highly desired requirements • R1 – Resiliency • R2 – Ability to work directly on binary executables • R3 – Platform independence • However, no existing category satisfies all three requirements • Static source code comparison (C1) does not satisfy R1, R2 • Static executable code comparison (C2) does not satisfy R1 • Dynamic control flow based methods (C3) do not satisfy R1, R3 • Dynamic API based methods (C4) do not satisfy R3
Introduction • A new approach: core-values • Tested with • 5 optimization options (-O0 ~ -O3, -Os) • 3 compilers (GCC, TCC, WCC) • KlassMaster, Thicket, and Loco/Diablo obfuscators
State of the art • Code obfuscation techniques • Data obfuscation, control obfuscation, layout obfuscation, and preventive transformations • Indirect branches, control-flow flattening, function-pointer aliasing • Static analysis based plagiarism detection • String-based • AST-based • Token-based • PDG-based • Birthmark-based
State of the art • Dynamic analysis based plagiarism detection • Whole program path (WPP) based • Sequence of API function calls birthmark (EXESEQ) • Frequency of API function calls birthmark (EXEFREQ) • System call based birthmark
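The difference between the two API-call birthmarks listed above can be illustrated with a minimal sketch. The function names and the toy traces are hypothetical and are not taken from the original tools; the sketch only shows that a sequence birthmark keeps call order while a frequency birthmark discards it.

```python
from collections import Counter

def exeseq_birthmark(api_calls):
    # Order-preserving: the API names exactly as they were observed.
    return tuple(api_calls)

def exefreq_birthmark(api_calls):
    # Order-insensitive: how many times each API name was called.
    return Counter(api_calls)

# Hypothetical traces: same calls, two of them reordered.
trace_a = ["open", "read", "read", "close"]
trace_b = ["open", "read", "close", "read"]

same_sequence = exeseq_birthmark(trace_a) == exeseq_birthmark(trace_b)
same_frequency = exefreq_birthmark(trace_a) == exefreq_birthmark(trace_b)
print(same_sequence, same_frequency)
```

In this toy case reordering the calls changes the sequence birthmark but leaves the frequency birthmark unchanged.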
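Core values • Runtime values • The output operands of the machine instructions executed • Core values • Constructed from runtime values • Eliminate non-core values • If a runtime value of a program is not derived from the program's input, it is not a core-value of that program • If a runtime value of a program is not in the set of runtime values of a semantically equivalent version of the program run on the same input, it is not a core-value of that program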
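The two elimination rules can be sketched as a simple filter. This is an illustration under stated assumptions, not the authors' implementation: it assumes each runtime value already carries a flag saying whether it was derived from the program input, and that the runtime values of a second, semantically equivalent build are available as a set. The names `eliminate_non_core`, `trace`, and `reference_values` are hypothetical.

```python
def eliminate_non_core(trace, reference_values):
    """Keep only the values that survive both elimination rules.

    trace: list of (value, derived_from_input) pairs in execution order
           (assumed to come from some input-tracking mechanism).
    reference_values: set of runtime values seen in an equivalent build
                      of the same program on the same input (assumption).
    """
    kept = []
    for value, derived_from_input in trace:
        if not derived_from_input:
            continue  # rule 1: not derived from the input
        if value not in reference_values:
            continue  # rule 2: absent from the equivalent build
        kept.append(value)
    return kept

# Toy data, made up for illustration only.
trace = [(42, True), (7, False), (99, True), (13, True)]
reference_values = {42, 13, 7}
core_candidates = eliminate_non_core(trace, reference_values)
print(core_candidates)  # [42, 13]
```

The result is a list of core-value candidates in execution order: 7 is dropped by the first rule and 99 by the second.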
Design - Value Sequence Extraction • Not all values associated with the execution of a program are core-values • Value-updating instructions • Related to the program's semantics
Design - Value Sequence Refinement and Similarity Metric • To refine value sequences • Sequential refinement – reduction rate 16%~34% • Optimization-based refinement – 5 optimization options • Address removal – exclude pointer values
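Address removal and a sequence similarity score can be sketched as follows. Both parts rest on assumptions that the slide does not spell out: pointer values are recognized here by checking hypothetical mapped address ranges, and the similarity score is a longest-common-subsequence (LCS) length normalized by the shorter sequence, chosen as one plausible metric rather than the metric confirmed by the source.

```python
def remove_addresses(values, address_ranges):
    # Drop values that fall inside any assumed mapped address range.
    def looks_like_address(v):
        return any(lo <= v < hi for lo, hi in address_ranges)
    return [v for v in values if not looks_like_address(v)]

def lcs_length(a, b):
    # Classic dynamic-programming LCS, keeping only one previous row.
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def similarity(a, b):
    # Assumed normalization: LCS length over the shorter sequence length.
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / min(len(a), len(b))

# Toy value sequences and address range, made up for illustration.
ranges = [(0x08048000, 0x08100000)]
original = remove_addresses([3, 0x08049F00, 10, 25, 7], ranges)
suspect = remove_addresses([3, 10, 0x0804A100, 99, 25, 7], ranges)
score = similarity(original, suspect)
print(original, suspect, score)
```

Pointer values differ between builds and runs, which is why removing them before comparison lets the remaining values line up in this toy case.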
Experiment • Intel Quad-Core 2.00 GHz CPU • 4 GB RAM • Linux machine • QEMU 0.9.1 • Questions • Is the method resilient to obfuscation? • Does it avoid false accusation? • Are the results credible?
Experiment - Obfuscation tools (resiliency) • Obfuscation techniques • SandMark, KlassMaster: Java bytecode obfuscators • Test application: JLex • Lexical analyzer generator
Experiment - Similar Programs (false accusation) • Test applications • 5 individual XML parsers: expat, libxml2, Parsifal, rxp, xercesc
Experiment - Different Programs (credible) • Test applications • bzip2, gzip, oggenc, 9 of 11 programs • Result • Similarity scores between 0 and 0.27 • zip and gzip similarity scores are 1.0 • Same compression algorithm: deflate • zip and bzip2 similarity scores are 0.01 to 0.03 • Different compression algorithm: block sorting
Conclusion • Introduces a novel approach to the dynamic characterization of executable programs • The value-based method successfully discriminates 34 plagiarisms obfuscated with SandMark, KlassMaster, and Thicket