Detect software plagiarism in multithreaded programs using dynamic birthmarks optimized for thread scheduling nuances. Explore a novel approach for effective plagiarism detection.
Plagiarism Detection for Multithreaded Software Based on Thread-Aware Software Birthmarks Zhenzhou Tian zztian@stu.xjtu.edu.cn MOE Key Lab for Intelligent Networks and Network Security Xi’an Jiaotong University, China 2020/1/4
Outline • Introduction • Thread-Aware Birthmark Methods • Evaluation • Unsolved Problems & Future Work
Introduction • Software plagiarism has been a serious threat to the healthy development of the software industry • Licenses are violated for commercial interests or unwittingly • Weak code protection awareness • Powerful automated code obfuscation tools • Software is distributed in binary form
Introduction • A series of methods have been proposed for plagiarism detection • Software Watermarking • Inserts extra data • “a sufficiently determined attacker will eventually be able to defeat any watermark” • Static and Dynamic Software Birthmarks • Dynamic birthmarks are more resilient to semantics-preserving code obfuscations
Introduction • A series of methods have been proposed for plagiarism detection • Software Watermarking • Static and Dynamic Software Birthmarks • The increasingly popular trend towards multithreaded programming brings a new challenge to existing dynamic birthmark methods • Existing dynamic birthmarks remain optimized for sequential programs • They neglect the effect of thread scheduling • Two executions of a single program under the same input can be very different, rendering the existing methods ineffective
Introduction • DKISB: dynamic key instruction sequence birthmark • SCSSB: system call short sequence birthmark
Introduction • Contributions: • Two thread-aware dynamic birthmarks, TW-DKISB and TW-SCSSB, are proposed to detect software plagiarism • Operate directly on binary executables • Not limited to specific operating systems and languages • Resilient to various automated obfuscation techniques (29 different obfuscation techniques in SandMark)
Introduction • Contributions: • A prototype is implemented using the Pin instrumentation framework, and extensive experiments are conducted. • A suite of benchmarks is compiled for researchers to conduct experiments and present their findings http://labs.xjtudlc.com/labs/benchmark.html
Software Birthmark • A set of characteristics extracted from a program that reflects intrinsic properties of the program, and which can be used to identify the program uniquely. • Two types: static and dynamic software birthmarks • Dynamic birthmark as defined by Myles
Thread-Aware Dynamic Software Birthmark • Predetermining a thread schedule is very difficult • Instead of enforcing a thread schedule, try to shield executions from its influence
Thread-Aware Dynamic Software Birthmarks • Main Idea: Split then Aggregate • Execution order within each thread is relatively stable. • Project the trace onto thread IDs to obtain sub-traces, and extract a slice birthmark from each • Aggregate all slice birthmarks. • Different traces of a program under the same input yield the same slices
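The "split" step can be illustrated with a minimal sketch. It assumes a trace is a list of (thread ID, event) pairs; the event names and the two interleavings are hypothetical and not taken from the slides.

```python
from collections import defaultdict

def project_by_thread(trace):
    """Split an interleaved trace of (thread_id, event) pairs into per-thread sub-traces."""
    slices = defaultdict(list)
    for tid, event in trace:
        slices[tid].append(event)  # order inside each thread is preserved
    return dict(slices)

# Two hypothetical interleavings of the same two threads under the same input.
run_a = [(1, "open"), (2, "read"), (1, "write"), (2, "close")]
run_b = [(2, "read"), (1, "open"), (2, "close"), (1, "write")]

# The whole traces differ, but the per-thread slices are identical.
assert run_a != run_b
assert project_by_thread(run_a) == project_by_thread(run_b)
```

The whole-program traces differ between the two runs, while the projected slices match, which is the property the split step relies on.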
Slice Birthmark & Program Birthmark • Diagram labels: K-Gram, SAM, SSM, Slice Birthmarks
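A rough sketch of how k-gram slice birthmarks might be built and combined. The slide only shows labels, so the details here are assumptions: a slice birthmark is taken to be a frequency map of k-grams, and `aggregate_sum` (a hypothetical name) adds up the counts of all slices into one program birthmark. Keeping the list of slice birthmarks unmerged would be the set-style alternative.

```python
from collections import Counter

def kgram_birthmark(events, k=3):
    """Count every window of k consecutive events in one thread slice."""
    return Counter(tuple(events[i:i + k]) for i in range(len(events) - k + 1))

def aggregate_sum(slice_birthmarks):
    """Assumed aggregation: add the k-gram counts of all slices into one birthmark."""
    total = Counter()
    for bm in slice_birthmarks:
        total.update(bm)
    return total

# Hypothetical per-thread slices (thread ID -> event sequence).
slices = {1: ["a", "b", "c", "a", "b"], 2: ["b", "c", "a"]}
slice_bms = [kgram_birthmark(events, k=2) for events in slices.values()]
program_bm = aggregate_sum(slice_bms)
```

Because each slice is processed independently, the result does not depend on how the scheduler interleaved the threads.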
Thread-Aware Birthmark based Plagiarism Detection 5 main modules: • DAM: monitoring and recording • PP: constitute valid traces • BG: extract thread-aware birthmarks • BSC: calculate similarity scores • PD: determine detection result
Dynamic Analysis Module • Monitors the execution of a program using Pin • DKISExtractor: performs dynamic taint analysis to identify and record key instructions • SysTracer: records every system call executed
Pre-Processor & Birthmark Generator • Pre-Processor: filters out noise and extracts valid traces • Birthmark Generator: generates TW-DKISBs and TW-SCSSBs using the implemented SA and SS models
Similarity Calculator & Plagiarism Decider • Similarity Calculator: four similarity metrics
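The slide does not list the four metrics. Cosine similarity is named later in the evaluation; the other three below (Jaccard, Dice, containment) are common choices for comparing k-gram birthmarks and are shown here purely as an assumption. Birthmarks are taken to be k-gram frequency maps.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine of the angle between two k-gram frequency vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def jaccard(a, b):
    """Shared k-grams divided by all distinct k-grams (counts ignored)."""
    union = a.keys() | b.keys()
    return len(a.keys() & b.keys()) / len(union) if union else 0.0

def dice(a, b):
    """Twice the shared k-grams divided by the sum of both set sizes."""
    size = len(a) + len(b)
    return 2 * len(a.keys() & b.keys()) / size if size else 0.0

def containment(a, b):
    """Fraction of a's distinct k-grams that also occur in b."""
    return len(a.keys() & b.keys()) / len(a) if a else 0.0

# Hypothetical birthmarks over single-symbol "k-grams" for brevity.
p = Counter({"x": 2, "y": 1})
q = Counter({"x": 1, "z": 1})
scores = {f.__name__: f(p, q) for f in (cosine, jaccard, dice, containment)}
```

All four return a value in [0, 1], so they can feed the same threshold-based decision step.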
Similarity Calculator & Plagiarism Decider • Similarity Calculator: bipartite matching
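One plausible reading of the bipartite matching step, sketched under assumptions: thread IDs are not comparable across programs, so slice birthmarks of the two programs are paired one-to-one such that the total similarity is maximal. Normalizing by the larger slice count is an assumption, and the brute-force search is only suitable for a handful of threads (a Hungarian-algorithm implementation would replace it at scale).

```python
from itertools import permutations

def best_matching_score(sim):
    """sim[i][j] is the similarity between slice i of program P and slice j of program Q.

    Returns the maximum total similarity over all one-to-one pairings,
    divided by the larger number of slices (assumed normalization).
    """
    rows = len(sim)
    cols = len(sim[0]) if sim else 0
    if rows == 0 or cols == 0:
        return 0.0
    if rows > cols:
        # Transpose so that every row can be assigned a distinct column.
        sim = [list(col) for col in zip(*sim)]
        rows, cols = cols, rows
    best = max(
        sum(sim[i][j] for i, j in enumerate(perm))
        for perm in permutations(range(cols), rows)
    )
    return best / cols

# Hypothetical slice-to-slice similarities; the threads appear in swapped order.
swapped = [[0.1, 0.9],
           [0.8, 0.2]]
score = best_matching_score(swapped)  # pairs slice 0 with 1 and slice 1 with 0
```

Pairing slices by position would give (0.1 + 0.2) / 2 here, while the matching recovers the intended correspondence.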
Similarity Calculator & Plagiarism Decider • Similarity Calculator • Decision Maker
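The decision rule itself is not spelled out on the slide. A common threshold scheme for birthmarks, assumed here, uses a single ɛ in [0, 0.5]: scores of at least 1 - ɛ indicate plagiarism, scores of at most ɛ indicate independence, and anything in between is inconclusive. The default ɛ = 0.25 is an illustrative value only.

```python
def decide(similarity, epsilon=0.25):
    """Three-way decision on a similarity score in [0, 1] (assumed rule)."""
    if not 0 <= epsilon <= 0.5:
        raise ValueError("epsilon must lie in [0, 0.5]")
    if similarity >= 1 - epsilon:
        return "plagiarism"
    if similarity <= epsilon:
        return "independent"
    return "inconclusive"

verdicts = [decide(s) for s in (0.9, 0.5, 0.1)]
```

A smaller ɛ makes the tool more conservative: fewer pairs are classified, and more fall into the inconclusive band.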
Evaluation • A high-quality birthmark keeps the ratio of false classifications low for a given threshold ɛ • Two properties to check: resilience and credibility
Evaluating Resilience Property • Resilience to different compilers and optimization levels • Statistical differences for 20 versions of pigz • Similarity scores between binaries of pigz
Evaluating Resilience Property • Resilience to specialized obfuscation tools • Cosine similarity between ConGzip and its 29 SandMark-obfuscated versions
Evaluating Resilience Property • Resilience to specialized obfuscation tools • Allatori, DashO, Jshrink, ProGuard and RetroGuard • Resilience to Allatori-series obfuscation tools
Evaluating Credibility Property • Similarity between independently implemented programs • 6 compression programs: Lbzip, lrzip, pbzip2, pigz, plzip and rar • 5 audio players: cmus, mocp, mp3blaster, mplayer and sox • 10 web browsers: arora, chromium, dillo, dooble, epiphany, firefox, konqueror, luakit, midori and SeaMonkey • Credibility evaluation of TW-SCSSBs using 10 web browsers
Comparing with Traditional Birthmarks • Performance Evaluation Metric • By varying ɛ from 0 to 0.5, an F-Measure curve can be drawn • AUC: area under the F-Measure curve • Detection Criteria
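A minimal sketch of the F-Measure curve and its AUC. The slide does not define precision and recall, so the definitions below are assumptions: precision is the share of correct decisions among the pairs actually classified, recall is the share of correct decisions among all pairs, and inconclusive pairs count as unclassified. The labelled scores are made up for illustration.

```python
def f_measure(pairs, epsilon):
    """pairs: list of (similarity, is_plagiarism) with known ground truth."""
    decided = correct = 0
    for sim, is_plag in pairs:
        if sim >= 1 - epsilon:      # classified as plagiarism
            decided += 1
            correct += bool(is_plag)
        elif sim <= epsilon:        # classified as independent
            decided += 1
            correct += not is_plag
    if correct == 0:
        return 0.0
    precision = correct / decided
    recall = correct / len(pairs)
    return 2 * precision * recall / (precision + recall)

def auc(pairs, steps=50):
    """Trapezoidal area under the F-Measure curve for epsilon in [0, 0.5]."""
    eps = [0.5 * i / steps for i in range(steps + 1)]
    f = [f_measure(pairs, e) for e in eps]
    return sum((f[i] + f[i + 1]) / 2 * (eps[i + 1] - eps[i]) for i in range(steps))

# Hypothetical labelled similarity scores.
pairs = [(0.95, True), (0.05, False), (0.5, True)]
area = auc(pairs)
```

Since ɛ ranges over [0, 0.5] and the F-Measure is at most 1, the area is bounded by 0.5; a larger area means the birthmark classifies correctly across more threshold settings.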
Comparing with Traditional Birthmarks • F-Measure curves for TW-SCSSB_SA, TW-SCSSB_SS, and SCSSB
Unsolved Problems & Future Work • Problems • Partial and library plagiarism are not handled • The tool is preliminary • The impact of K is not evaluated • Future Work • Conduct experiments using other kinds of tools, such as shelling tools (Upx, ASProtect, etc.), and on real plagiarism cases • Improve our method to support partial plagiarism detection • Evaluate the effect of K on detection ability • Develop a relatively mature tool