Improving Programmer Productivity via Mining Program Source Code
Mining SE Data
• MAIN GOAL
  • Transform static record-keeping SE data to active data
  • Make SE data actionable by uncovering hidden patterns and trends
• Data sources shown: mailings, Bugzilla, code repository, CVS, execution traces
T. Xie Mining Program Source Code
Overview of Mining SE Data
• Software engineering tasks helped by data mining: programming, defect detection, testing, debugging, maintenance, …
• Data mining techniques: classification, association/patterns, clustering, …
• Software engineering data: code bases, change history, program states, structural entities, bug reports/NL, …
Overview of Mining SE Data
[Figure: publications from 1999 to 2007 (venues: ASE, ICSE, FSE, PLDI, POPL, OSDI, SOSP, OOPSLA, ISSTA, KDD), grouped by the software engineering data they mine: code bases, change history, program states, structural entities, bug reports/NL, …]
Overview of Mining SE Data
[Figure: publications from 1999 to 2007 (venues: KDD, ICSE, ASE, FSE, PLDI, POPL, OOPSLA, ISSTA, SOSP, OSDI), grouped by the software engineering task helped by data mining: programming, defect detection, testing, debugging, maintenance, …]
Sample Projects on Mining Program Source Code
Some Recent Trends
• Data: dynamic execution data + static code bases
• Task: productivity (programming) + quality (defect detection, testing, debugging)
• Mining algorithm: simple ones (association rule) + frequent itemset/subsequence/partial order/subgraph
• Data scope: local repositories → public repositories with code search engines
Mining API Usage Patterns
• How should an API be used correctly?
• An API may serve multiple functionalities
• Different styles of API usage
• MAPO: “I know what method call I need, but I don’t know how to write code before and after this method call” [Xie&Pei MSR 06]
Example Task -- MAPO
• “instrument the bytecode of a Java class by adding an extra method to the class”
• org.apache.bcel.generic.ClassGen
    public void addMethod(Method m)
First Try: ClassGen Java API Doc
addMethod
    public void addMethod(Method m)
    Add a method to this class.
    Parameters: m - method to add
Second Try: Code Search Engine
MAPO Approach
• Analyze code segments relevant to a given API and disclose the inherent usage patterns
• Input: an API characterized by a method, class, or package
• Code search engine: used to search relevant source files from open source repositories
• Frequent sequence miner: uses BIDE [Wang&Han 04] to mine closed sequential patterns from extracted method-call sequences
• Output: a short list of frequent API usage patterns related to the API
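The notion of a closed sequential pattern used in the mining step can be illustrated with a small sketch. This is not BIDE: it is a brute-force stand-in that enumerates every subsequence, so it is only usable on short toy sequences. The function names and the sample sequences are assumptions made for illustration.

```python
from itertools import combinations

def subsequences(seq):
    """All distinct non-empty subsequences of seq (order kept, gaps allowed)."""
    found = set()
    for size in range(1, len(seq) + 1):
        for idxs in combinations(range(len(seq)), size):
            found.add(tuple(seq[i] for i in idxs))
    return found

def is_subsequence(small, big):
    """True if small can be obtained from big by deleting elements."""
    it = iter(big)
    return all(item in it for item in small)

def mine_closed_patterns(sequences, min_support):
    """Brute-force closed sequential pattern mining (NOT the BIDE algorithm)."""
    # Support of a pattern = number of input sequences that contain it.
    support = {}
    for seq in sequences:
        for cand in subsequences(seq):
            support[cand] = support.get(cand, 0) + 1
    frequent = {p: s for p, s in support.items() if s >= min_support}
    # A pattern is closed if no longer pattern containing it has the same support.
    return {
        p: s
        for p, s in frequent.items()
        if not any(len(q) > len(p) and sq == s and is_subsequence(p, q)
                   for q, sq in frequent.items())
    }

# Hypothetical call sequences, abbreviated to short names.
calls = [
    ("setMaxStack", "setMaxLocals", "getMethod", "addMethod"),
    ("setMaxStack", "setMaxLocals", "getMethod", "addMethod", "dispose"),
    ("getMethod", "addMethod"),
]
closed = mine_closed_patterns(calls, min_support=2)
```

With a minimum support of 2, only two patterns survive: the four-call sequence (support 2) and `getMethod, addMethod` (support 3). Every shorter pattern is absorbed by a longer one with the same support, which is why a closed-pattern miner returns a short list.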
Sequence Extraction
• Method sequences: extracted from Java source files returned from code search engines
Source code:
    public void generateStubMethod(ClassGen c) {
        InstructionList il = new InstructionList();
        MethodGen m = genFromISList(il);
        m.setMaxStack();
        m.setMaxLocals();
        c.addMethod(m.getMethod());
        System.out.println("…");
        …
    }
Call sequence:
    InstructionList.<init>()
    genFromISList(InstructionList)
    MethodGen.setMaxStack()
    MethodGen.setMaxLocals()
    MethodGen.getMethod()
    ClassGen.addMethod(Method)
    PrintStream.println(String)
    …
Sequence Preprocessing
• Remove common Java library calls
• Inline callees of the same class
• Remove sequences that contain no query words: ClassGen and addMethod
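The first and third preprocessing steps can be sketched as a simple filter. This is a minimal sketch under stated assumptions: the prefix list standing in for "common Java library calls" is invented, calls are represented as plain strings, and the inlining step is omitted because it needs the callee bodies.

```python
def preprocess(sequences, query_words,
               library_prefixes=("PrintStream.", "String.", "System.")):
    """Drop library calls, then drop sequences that mention no query word."""
    kept = []
    for seq in sequences:
        # Remove calls to common Java library classes (assumed prefix list).
        calls = [c for c in seq if not c.startswith(library_prefixes)]
        # Keep the sequence only if at least one query word occurs in it.
        if any(word in call for call in calls for word in query_words):
            kept.append(calls)
    return kept

raw = [
    ["InstructionList.<init>()", "genFromISList(InstructionList)",
     "MethodGen.setMaxStack()", "MethodGen.setMaxLocals()",
     "MethodGen.getMethod()", "ClassGen.addMethod(Method)",
     "PrintStream.println(String)"],
    ["PrintStream.println(String)", "InstructionList.dispose()"],
]
cleaned = preprocess(raw, ["ClassGen", "addMethod"])
```

Here the first sequence is kept without its `PrintStream.println(String)` call, and the second is discarded because it mentions neither `ClassGen` nor `addMethod`.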
Frequent Seq Postprocessing
• Remove sequences that contain no query words: ClassGen and addMethod
• Compress consecutive calls of the same method into one, e.g., abbba → aba
• Remove duplicate frequent sequences after the compression, e.g., aba, aba → aba
• Reduce a seq if it is a subseq of another, e.g., aba, abab → abab
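The four postprocessing steps compose naturally into one pass. The sketch below is an illustration under assumptions, not the tool's code: patterns are any iterables of call names, and a query word matches when it occurs as a substring of a call name.

```python
from itertools import groupby

def is_subsequence(small, big):
    """True if small can be obtained from big by deleting elements."""
    it = iter(big)
    return all(item in it for item in small)

def postprocess(patterns, query_words):
    # 1. Drop patterns that mention none of the query words.
    relevant = [p for p in patterns
                if any(w in call for call in p for w in query_words)]
    # 2. Compress consecutive repeats of the same call: abbba -> aba.
    compressed = [tuple(key for key, _ in groupby(p)) for p in relevant]
    # 3. Remove duplicates created by the compression, keeping first-seen order.
    unique = list(dict.fromkeys(compressed))
    # 4. Drop a pattern if it is a subsequence of another remaining pattern.
    return [p for p in unique
            if not any(p != q and is_subsequence(p, q) for q in unique)]

# The single-letter examples from the slide, with "a" as the query word.
result = postprocess(["abbba", "aba", "abab", "cd"], ["a"])
# result == [("a", "b", "a", "b")]
```

`cd` is dropped in step 1, `abbba` and `aba` both become `aba` and collapse into one entry, and `aba` is finally absorbed by `abab`.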
Tool Architecture
[Figure: tool architecture; the code search engine is, e.g., koders.com]
Sample Mined API Sequence
    InstructionList.<init>()
    InstructionFactory.createLoad(Type, int)
    InstructionList.append(Instruction)
    InstructionFactory.createReturn(Type)
    InstructionList.append(Instruction)
    MethodGen.setMaxStack()
    MethodGen.setMaxLocals()
    MethodGen.getMethod()
    ClassGen.addMethod(Method)
    InstructionList.dispose()
Mining API Usage Patterns
• MAPO: “I know what method call I need, but I don’t know how to write code before and after this method call” [Xie&Pei MSR 06]
• Apiartor: “I know what possible set of APIs I need, but I don’t know what needs to be used and in what order” [Acharya et al. FSE 07]
Usage Patterns as Partial Order
(a) Example code:
    #include <abcdef.h>
    void p() { b(); c(); }
    void q() { c(); b(); }
    void r() { e(); f(); }
    void s() { f(); e(); }
    int main() {
        int i, j, k;
        a();
        if (i == 1) {
            f(); e(); c();
            exit();
        } else {
            if (j == 1) p(); else q();
            d();
            if (k == 1) r(); else s();
        }
    }
(b) Static program traces:
    1  a f e c
    2  a b c d e f
    3  a c b d e f
    4  a b c d f e
    5  a c b d f e
(c) Frequent subseq patterns:
    a b d e
    a b d f
    a c d e
    a c d f
(d) Frequent partial order R: a, then b and c (in either order), then d, then e and f (in either order)
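One way to see how a partial order summarizes the traces is to keep only the "x before y" relations that hold in every frequent trace and then drop the relations implied by transitivity. The sketch below does this for traces 2 to 5; it is a simplification for illustration, not the mining algorithm of Acharya et al., and it assumes every trace contains the same events exactly once.

```python
from itertools import permutations

def common_precedences(traces):
    """Pairs (x, y) such that x occurs before y in every trace."""
    pairs = set(permutations(set(traces[0]), 2))
    for trace in traces:
        position = {event: i for i, event in enumerate(trace)}
        pairs = {(x, y) for (x, y) in pairs if position[x] < position[y]}
    return pairs

def transitive_reduction(pairs):
    """Drop (x, z) whenever some y satisfies x < y and y < z."""
    events = {event for pair in pairs for event in pair}
    return {(x, z) for (x, z) in pairs
            if not any((x, y) in pairs and (y, z) in pairs for y in events)}

# Traces 2-5 of the example (trace 1, "a f e c", is the infrequent branch).
traces = ["abcdef", "acbdef", "abcdfe", "acbdfe"]
edges = transitive_reduction(common_precedences(traces))
# edges == {("a","b"), ("a","c"), ("b","d"), ("c","d"), ("d","e"), ("d","f")}
```

The remaining edges say that a precedes b and c, both precede d, and d precedes e and f, while b/c and e/f stay unordered, which matches the partial order R described in part (d).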
Apiartor Overview
[Figure: Apiartor architecture. Labels: Source Code, User-specified APIs, Trigger Generator, Related APIs, Triggers, Model Checker, Trace Generator, Traces, Scenario Extractor, Independent Scenarios, Miner, Frequent Usage Scenarios, Partial Orders, Specification Extractor, Specifications]
Example Partial Orders
[Figure: a usage scenario around the XOpenDisplay API as a partial order. Specifications are shown with dotted lines. Node labels, in extraction order: XOpenDisplay, XCreateWindow, XCreateGC, XGetWindowAttributes, XSelectInput, XMapWindow, XSetForeground, XGetBackground, XChangeWindowAttributes, XNextEvent, XMapWindow, XGetAtomName, XFreeGC, XCloseDisplay]
Mining API Usage Patterns
• MAPO: “I know what method call I need, but I don’t know how to write code before and after this method call” [Xie&Pei MSR 06]
• Apiartor: “I know what possible set of APIs I need, but I don’t know what needs to be used and in what order” [Acharya et al. FSE 07]
• PARSEWeb: “I know what type of object I need, but I don’t know how to write the code to get the object” [Thummalapenta&Xie ASE 07]
Example Task - OpenJMS
Sun Java Message Services API Spec
• Query: “javax.jms.QueueConnectionFactory -> javax.jms.QueueSender”
• PARSEWeb Solution:
    FileName:0_UserBean.java MethodName:ingest Rank:1 NumberOfOccurrences:23
    Confidence:True
    Path:
    1  javax.jms.QueueConnectionFactory,createQueueConnection() ReturnType:javax.jms.QueueConnection
    2  javax.jms.QueueConnection,createQueueSession(boolean,javax.jms.Session.AUTO_ACKNOWLEDGE) ReturnType:javax.jms.QueueSession
    3  javax.jms.QueueSession,createSender(javax.jms.Queue) ReturnType:javax.jms.QueueSender
PARSEWeb Overview
[Figure: PARSEWeb architecture. Labels: Query, Query Splitter, Code Search Engine, Open Source Repositories, Code Downloader, Local Source Code Repository, Code Analyzer, Method Invocation Sequences, Sequence Miner, Clustered Method Invocation Sequences, Final Method Invocation Sequences]
Code Analyzer
• Collect [Source -> Destination] method sequences invoked by each public method
• Deal with local method calls by inlining methods
• Deal with conditionals/loops by traversing control flow graphs
• Resolve types in sequences
• Challenges: downloaded files are partial
• Solutions: heuristics are developed
Type Heuristics
• Heuristic 1: The return type of a method-invocation statement contained in an initialization expression is the same as the type of the declared variable.
    e.g., QueueConnection connect;
          QueueSession session = connect.createQueueSession(false,int)
• Heuristic 2: The return type of an outermost method invocation contained in a return statement is the same as the return type of the enclosing method declaration.
    e.g., public int test() { ... return connect.createQueueSession(false,int); }
Sequence Miner
• Candidate sequences produced by the code analyzer may be too many
Solutions:
• Cluster similar sequences
  • Clustering heuristics are developed
• Rank sequences
  • Ranking heuristics are developed
Clustering Heuristics
• Heuristic 1: Method-invocation sequences with the same set of statements can be considered similar, even if the statements are in a different order. e.g., “2 3 4 5” and “2 4 3 5”
• Heuristic 2: Method-invocation sequences differing by a given cluster precision value can be considered similar. e.g., “8 9 6 7” and “8 6 10 7” can be considered similar under cluster precision value one.
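Both heuristics can be expressed as one set comparison. The sketch below rests on an assumption the slide does not spell out: two sequences "differ by" k when each has at most k statements that the other lacks. The greedy grouping against the first member of each cluster is likewise only an illustrative choice.

```python
def similar(seq_a, seq_b, precision=0):
    """Compare statement sets; precision 0 is Heuristic 1, k > 0 is Heuristic 2."""
    a, b = set(seq_a), set(seq_b)
    # Assumed distance: the larger of the two one-sided set differences.
    return max(len(a - b), len(b - a)) <= precision

def cluster(sequences, precision=0):
    """Greedy clustering: join the first cluster whose representative is similar."""
    clusters = []
    for seq in sequences:
        for group in clusters:
            if similar(group[0], seq, precision):
                group.append(seq)
                break
        else:
            clusters.append([seq])
    return clusters

# The statement-number examples from the slide.
groups = cluster([[2, 3, 4, 5], [2, 4, 3, 5], [8, 9, 6, 7], [8, 6, 10, 7]],
                 precision=1)
# groups == [[[2, 3, 4, 5], [2, 4, 3, 5]], [[8, 9, 6, 7], [8, 6, 10, 7]]]
```

With precision 0 the last two sequences would fall into separate clusters, since statement 9 and statement 10 each appear in only one of them.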
Ranking Heuristics
• Heuristic 1: Higher frequency -> Higher rank
• Heuristic 2: Shorter length -> Higher rank
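The two ranking heuristics amount to a two-key sort. In this sketch, applying frequency first and using length only to break ties is an assumption; the slide does not say how the two heuristics are combined. The candidate tuples are made-up placeholders.

```python
def rank(candidates):
    """Sort (sequence, frequency) pairs: most frequent first, then shortest."""
    return sorted(candidates, key=lambda c: (-c[1], len(c[0])))

ranked = rank([
    (("a", "b", "c"), 5),
    (("a", "b"), 5),
    (("x",), 2),
])
# ranked == [(("a", "b"), 5), (("a", "b", "c"), 5), (("x",), 2)]
```

The two candidates seen five times come first, with the shorter one ahead; the single-call candidate is last despite being shortest because it is less frequent.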
Query Splitter
• Lack of code samples in the results of code search engines
• Code samples are split among different files
Solution:
• Split the user query into multiple queries
• Compose the results for each split query
Query Splitting Example
1. User query: “org.eclipse.jface.viewers.IStructuredSelection -> java.io.ObjectInputStream”
   Results: None
2. Query: “java.io.ObjectInputStream”
   Results: the 3 most used sources are java.io.InputStream, java.io.ByteArrayInputStream, java.io.FileInputStream
3. Three queries to be fired:
   “org.eclipse.jface.viewers.IStructuredSelection -> java.io.InputStream” Results: 1
   “org.eclipse.jface.viewers.IStructuredSelection -> java.io.ByteArrayInputStream” Results: 5
   “org.eclipse.jface.viewers.IStructuredSelection -> java.io.FileInputStream” Results: None
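The split-and-compose flow can be sketched as a fallback around the normal query. Everything here is hypothetical scaffolding: `search` and `most_used_sources` stand in for the sequence miner and the code-sample statistics, the type names `Src`, `Mid1`, `Mid2`, `Dst` and the call names in the toy index are placeholders, and composing by simple concatenation is an assumption about how split results are joined.

```python
def split_and_compose(source, destination, search, most_used_sources, top_k=3):
    """Answer 'source -> destination', falling back to split queries."""
    direct = search(source, destination)
    if direct:
        return direct
    composed = []
    # Try the types most often used to construct the destination type.
    for mid in most_used_sources(destination)[:top_k]:
        first_half = search(source, mid)        # source -> intermediate type
        second_half = search(mid, destination)  # intermediate -> destination
        for head in first_half:
            for tail in second_half:
                composed.append(head + tail)
    return composed

# Toy index with placeholder names (not taken from any real library).
_index = {
    ("Src", "Mid1"): [["Src.toMid1()"]],
    ("Mid1", "Dst"): [["Dst.<init>(Mid1)"]],
    ("Mid2", "Dst"): [["Dst.<init>(Mid2)"]],
}

def toy_search(source, destination):
    return _index.get((source, destination), [])

def toy_sources(destination):
    return ["Mid1", "Mid2"]

result = split_and_compose("Src", "Dst", toy_search, toy_sources)
# result == [["Src.toMid1()", "Dst.<init>(Mid1)"]]
```

The direct query finds nothing, the split through `Mid1` succeeds on both halves, and the split through `Mid2` contributes nothing because no sample leads from `Src` to `Mid2`.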
Eclipse Plugin
Evaluations
• Real Programming Problems: To address problems posted in developer forums.
• Real Projects: To show that solutions recommended by PARSEWeb are
  • available in real projects
  • on average better than solutions recommended by the related tools PROSPECTOR, Strathcona, and Google Code Search Engine
Jakarta BCEL User Forum
• Jakarta BCEL user forum, 2001
  Problem: “How to disassemble java byte code”
  Query: “Code -> Instruction”
  Solution Sample Code:
    Code code;
    InstructionList il = new InstructionList(code.getCode());
    Instruction[] ins = il.getInstructions();
Dev2Dev Newsgroups
• Dev2Dev Newsgroups, 2006
  Problem: “how to connect db by sessionBean”
  Query: javax.naming.InitialContext -> java.sql.Connection
  Solution Sequence:
    FileName:3_AddressBean.java MethodName:getNextUniqueKey Rank:1 NumberOfOccurrences:34
    javax.naming.InitialContext,lookup(java.lang.String) ReturnType:javax.sql.DataSource
    javax.sql.DataSource,getConnection() ReturnType:java.sql.Connection
Challenges in Mining Code
• Sometimes too few data samples
• Scalability is usually not an issue
• Static code bases vs. change histories
• Data preparation/preprocessing
• Related to traditional program analysis
• Pattern postprocessing (filtering and ranking)
• Heuristics play important roles
• Demand-driven mining vs. any gold mining
• Programming vs. bug finding
Conclusion
• Mining various types of software engineering data to aid software engineering tasks
• Mining program source code to improve programmer productivity
  • MAPO: mining API usage patterns for a given API
  • Apiartor: mining API usage patterns for a given set of APIs
  • PARSEWeb: mining API usage patterns for input-output-type queries
Questions?
• Mining Software Engineering Data Bibliography: http://ase.csc.ncsu.edu/dmse/
  • What software engineering tasks can be helped by data mining?
  • What kinds of software engineering data can be mined?
  • How are data mining techniques used in software engineering?
  • Resources
Demand-Driven Or Not
Code vs. Non-Code
Static vs. Dynamic
Snapshot vs. Changes