220 likes | 388 Views
Phoenix-Based Clone Detection using Suffix Trees. Robert Tairas. http://www.cis.uab.edu/tairasr/clones. Advisor: Dr. Jeff Gray. ACM Southeast Conference Melbourne, FL March 11, 2006. Code Clones. A sequence of statements that are duplicated in multiple locations in a program. Source Code.
E N D
Phoenix-Based Clone Detection using Suffix Trees Robert Tairas http://www.cis.uab.edu/tairasr/clones Advisor: Dr. Jeff Gray ACM Southeast Conference Melbourne, FL March 11, 2006
Code Clones • A sequence of statements that are duplicated in multiple locations in a program Source Code ClonedCode _____ _______ _________ ___ ______ ___ ______ ______ _____ ________ ___ _______________ ____ _______ ______ _________ _____ _______ ________ ___ _______________ _______ _____ _________ _______ ___ ______ ________ __________ _______ ____ _______ ________ ___ _______________ ____ ______ _________ _____ ____ ________ ____ __________ _______ ___ ________ ___ _______________ _____ ________ ___ _______________ ______ _______ ______ _______ _________ _______ ___ ________ ___ _____________ ________ ___ _______________
Clones in Source Code • Copy-and-paste parts of code from one location to another • The copied code alreadyworks correctly • No time to be efficient • Research shows that5-10% of large scalecomputer programsare clones (Baxter, 98) Source Code _____ _______ _________ ___ ______ ___ ______ ______ _____ ________ ___ _______________ ____ _______ ______ _________ _____ _______ ________ ___ _______________ _______ _____ _________ _______ ___ ______ ________ __________ _______ ____ _______ ________ ___ _______________ ____ ______ _________ _____ ____ ________ ____ __________ _______ ___ ________ ___ _______________ _____ ________ ___ _______________ ______ _______ ______ _______ _________ _______ ___ ________ ___ _____________
Clones in Source Code • Dominant decomposition: A block of statements that performs a function/concern dominates another block • The two concerns crosscut each other • One concern will have to yield to the other • Related to Aspect Oriented Programming (AOP)
Clones in Source Code • logging in org.apache.tomcat • red shows lines of code that handle logging • not in just one place • not even in a small number of places
But first we need to find the clones Clone Dilemma • Maintenance • To update code that is cloned will require all clones to be updated • Restructure/refactor • Separate into aspects
Contribution: Automated Clone Detection • Searches for exact matching function level clones utilizing suffix tree structures in the Microsoft Phoenix framework Microsoft Phoenix Clone Detector Source Code Report of Clones Suffix Trees
Types of Clones Original code int main() { int x = 1; int y = x + 5; return y; } int func1() { int x = 1; int y = x + 5; return y; } int func2() { int p = 1; int q = p + 5; return q; } int func3() { int s = 1; int t = s + 5; s++; return t; } Exact match Exact match, with only the variable names differing Near exact match As defined in an experiment comparing existing clone detection techniques at the 1st International Workshop on Detection of Software Clones (02)
What is Phoenix? • Next-Generation Framework for • building Compilers • building Software Analysis Tools • Basis for Microsoft compilers for 10+ years More information: http://research.microsoft.com/phoenix Note: Contents of this slide courtesy of John Lefor at Microsoft Research
Compilers Tools Browser Visualizer Lint HL Opts HL Opts HL Opts LL Opts LL Opts LL Opts Code Gen Code Gen Formatter Obfuscator Refactor Xlator Profiler SecurityChecker Phx APIs Phoenix Core AST IR Syms Types CFG SSA assembly Native Image C++ IR C++AST Phx AST Profile C++ PREfast Lex/Yacc C# VB C++ Delphi Cobol Eiffel Tiger Note: This slide courtesy of John Lefor at Microsoft Research
abgf$ $ f$ gf$ bgf$ Suffix Trees • A suffix tree of a string is a tree where each suffix of the stringis represented by a path from the root to a leaf • In bioinformatics it is used to search for patterns in DNA or protein sequences Example: suffix tree for abgf$
Another Suffix Tree Example Suffix tree for abcebcf$ 12345678 $ abcebcf$ 8 f$ bc c 1 ebcf$ 7 ebcf$ ebcf$ 4 f$ f$ 2 6 3 5 Leaf numbers: The number indicates the starting position of the suffix from the left of the string.
bc ebcf$ f$ 5 Another Suffix Tree Example Suffix tree for abcebcf$ 12345678 $ abcebcf$ 8 f$ c 1 ebcf$ bcebcf$ 7 ebcf$ 4 f$ 2 6 3 Leaf numbers: The number indicates the starting position of the suffix from the left of the string.
# abgf 2,5 $abgf# bgf f gf # 1,5 $abgf# 2,1 # # # $abgf# $abgf# $abgf# 2,2 1,1 2,4 2,3 1,4 1,3 1,2 Another Suffix Tree Example Two identical strings (abgf) separated by unique terminating characters Suffix tree for abgf$abgf# Leaf numbers: The first number indicates the string. The second number indicates the starting position of the suffix in that string.
Abstract Syntax Tree Nodes int func2() { return y; } int func1() { return x; } FUNCDEFN FUNCDEFN COMPOUND COMPOUND RETURN RETURN SYMBOL SYMBOL Note: Node names are Phoenix-defined.
FUNCDEFN COMPOUND RETURN SYMBOL FUNCDEFN COMPOUND RETURN SYMBOL # abgf 2,5 $abgf# bgf f gf # 1,5 $abgf# 2,1 # # # $abgf# $abgf# $abgf# 2,2 1,1 2,4 2,3 1,4 1,3 1,2 Remember This? Suffix tree for abgf$abgf# a b g f $ a b g f # For exact function matching, we’re looking for suffix tree nodes of edges, where the edges include all the AST nodes of a function. Leaf numbers: The first number indicates the function. The second number indicates the starting position of the suffix in that function.
int func1() { int x = 3; int y = x + 5; return y; } int func2() { int x = 1; int y = y + 5; return y; } i32, 1 x, i32 i32 x, i32 i32, 5 y, i32 y, i32 False Positives Original code int main() { int x = 1; int y = x + 5; return y; } FUNCDEFN COMPOUND DECLARATION CONSTANT DECLARATION PLUS SYMBOL CONSTANT RETURN SYMBOL
Phoenix Phases • Processes are divided into “phases” • Custom phases can be inserted to perform tasks such as software analysis • Phases are inserted through “plug-ins” in the form of a library (DLL) module MicrosoftPhoenix Plug-in Custom Phase Clone Detection Phase
example.c C/C++ Front-end csclones.cs C# example.ast csclones.dll Phoenix Back-end Report Clone Detector in Phoenix
Limitations and Future Work • Looks only for exact matches • Currently working on a process called hybrid dynamic programming, which includes the use of suffix trees (k-difference inexact matching) • Looks only at the function level • Enable multiple levels clone detection • Higher: statement level; Lower: program level • Recognizes only C nodes • Coverage for other languages, such as C++ and C# • Another approach: language independent
Thank you…Questions? Phoenix-Based Clone Detection using Suffix Trees http://www.cis.uab.edu/tairasr/clones