Component Rank: Relative Significance Rank for Software Component Search
Katsuro Inoue, Reishi Yokomori, Hikaru Fujiwara, Tetsuo Yamamoto, Makoto Matsushita, and Shinji Kusumoto
Osaka University

SourceForge
• Large open source software development web site
• Version control, communication support, ...
Hosted Projects: 60,888
Registered Users: 613,792

Motivation
• Numerous software systems are being developed day by day
• Similar components (libraries, portions of code, abstracted algorithms, ...) might be developed independently in different projects
• Key factor for high productivity and reliability in today’s software development: reuse
• Exploring large software libraries is not easy
• Little support for searching components
• Consistent management by hand is difficult

Automated Component Library
• Collect software components eagerly without preserving their inherent structures
• Analyze relations among components by using various analysis techniques
• Rank the components based on their significance
• Answer user’s queries according to the rank
Component Rank Model

Component Graph
[Figure: components A through I, drawn from System X and System Y, shown as nodes; each directed edge is a use relation between two components.]

Weight of Nodes
[Figure: the component graph of System X and System Y (nodes A through I) with a weight between 0.05 and 0.2 on each node.]
• Sum of all node weights = 1 ... (1)
• The weight of a node represents the significance of the node

Weights of Edges
[Figure: example nodes A and B; a node of weight 0.2 passes 0.05 to each of four outgoing edges, each with d = 1/4.]
d: distribution ratio
• Node weight is distributed to each outgoing edge
• Edge weights are collected at the destination node
• Sum of all outgoing edge weights = origin node weight ... (2)
• Sum of all incoming edge weights = destination node weight ... (3)

Definition of Weights
• Under constraints (1)~(3), we have a system of simultaneous equations:
$W = D^{t} W$
W: node weight vector
D^t: transposed matrix of distribution ratios

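The step from constraints (1)~(3) to the matrix equation can be written out as below. The subscripted symbols $w_i$, $d_{ij}$, and $e_{ij}$ are notation assumed here for illustration; the slides only name W and D.

```latex
% w_i: weight of node v_i;  d_{ij}: distribution ratio of edge v_i -> v_j
% e_{ij} = d_{ij} w_i: weight of edge v_i -> v_j (assumed notation)
\begin{align*}
\sum_i w_i &= 1 && \text{(1)} \\
\sum_j e_{ij} = \sum_j d_{ij}\, w_i &= w_i && \text{(2), i.e. } \sum_j d_{ij} = 1 \\
\sum_i e_{ij} = \sum_i d_{ij}\, w_i &= w_j && \text{(3)}
\end{align*}
% Writing (3) for every j at once, with D = (d_{ij}) and W = (w_1, \dots, w_n)^{t}:
\[ W = D^{t} W \]
```
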
Propagating Weights
[Figure: four successive snapshots of a three-node graph (A, B, C). Node weights start at about 1/3 each and are propagated along the edges step by step until they settle at node weights of 0.4, 0.2, and 0.4.]
• Stable weight assignment
• Next-step weights are the same as the previous ones
• Component Rank: order of nodes sorted by weight

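The propagation loop can be sketched in a few lines of Python. The edge set below (A uses B and C, B uses C, C uses A) is an assumption inferred from the numbers in the figure, not something the slides state, and the names `propagate` and `stable_weights` are hypothetical.

```python
# Assumed three-node graph (inferred from the figure, not stated in the slides):
# A uses B and C, B uses C, C uses A.
EDGES = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}


def propagate(weights, edges):
    """One step: split each node weight equally over its outgoing edges,
    then sum the incoming edge weights at each destination node."""
    nxt = {node: 0.0 for node in weights}
    for src, targets in edges.items():
        share = weights[src] / len(targets)  # equal distribution ratio d
        for dst in targets:
            nxt[dst] += share
    return nxt


def stable_weights(edges, tol=1e-9, max_steps=10_000):
    """Propagate from uniform weights until the assignment stops changing."""
    nodes = list(edges)
    weights = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(max_steps):
        nxt = propagate(weights, edges)
        if max(abs(nxt[n] - weights[n]) for n in nodes) < tol:
            return nxt
        weights = nxt
    raise RuntimeError("weights did not stabilize")


weights = stable_weights(EDGES)
# Component rank: nodes ordered by descending weight.
ranking = sorted(weights, key=weights.get, reverse=True)
```

Under this assumed graph the weights settle near A = 0.4, B = 0.2, C = 0.4, so B ranks last and A and C are tied.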
Markov Model
[Figure: component graph with a probability on each node.]
• The component rank model can be regarded as a Markov chain of the user's focus
• The user's focus moves from one component to another along a use relation at fixed time intervals
• Node weight represents the probability that the user's focus is on that node in the infinite future

Adjustment to Software Products (1): Pseudo Use Relation
[Figure: three-node example (A, B, C).]
• Weight computation does not always converge
• Add a pseudo edge from a node to another node if there is no 'real' edge between them
• Distribution ratios: pseudo edges << real edges

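A minimal sketch of why pseudo edges help, under a stated assumption: real edges share a fraction p of a node's weight and the remaining 1 - p is spread equally over all nodes. This mixing rule is a simplification chosen for illustration; the slides only say that pseudo edges get much smaller ratios than real ones. The function names are hypothetical.

```python
def distribution_with_pseudo_edges(edges, p=0.85):
    """Return ratios d[src][dst]. Real edges share a fraction p of the source
    weight; the remaining 1 - p is spread equally over all nodes (assumed rule).
    A node with no real outgoing edge spreads its whole weight equally."""
    nodes = list(edges)
    n = len(nodes)
    ratios = {}
    for src in nodes:
        targets = edges[src]
        if targets:
            row = {dst: (1.0 - p) / n for dst in nodes}
            for dst in targets:
                row[dst] += p / len(targets)
        else:
            row = {dst: 1.0 / n for dst in nodes}
        ratios[src] = row
    return ratios


def propagate(weights, ratios):
    """One propagation step using explicit distribution ratios."""
    nxt = {node: 0.0 for node in weights}
    for src, row in ratios.items():
        for dst, d in row.items():
            nxt[dst] += weights[src] * d
    return nxt


CYCLE = {"A": ["B"], "B": ["C"], "C": ["A"]}  # hypothetical pure cycle
start = {"A": 1.0, "B": 0.0, "C": 0.0}

# Real edges only (p = 1): the weight just rotates around the cycle.
plain = distribution_with_pseudo_edges(CYCLE, p=1.0)
w = start
for _ in range(3):
    w = propagate(w, plain)  # after three steps w equals the start again

# With pseudo edges the same start settles on a stable assignment.
mixed = distribution_with_pseudo_edges(CYCLE, p=0.85)
v = start
for _ in range(200):
    v = propagate(v, mixed)
```

In this example the plain propagation never stabilizes from the chosen start, while the version with pseudo edges converges to equal weights for the three nodes.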
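Adjustment to Software Products (2): Clustering Components
[Figure: a component graph with nodes A through G is converted into a clustered component graph in which A and D are merged into one node AD, and B and F into one node BF.]
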
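One way to build a clustered component graph is sketched below. It assumes pairwise similarity scores are already available (the input format, the union-find grouping, and the choice to drop edges inside a cluster are all assumptions for illustration), and the edges and scores are hypothetical data.

```python
def cluster_components(nodes, similarity, t=0.8):
    """Group components whose pairwise similarity is at least t (union-find).
    Returns a map from each component to its cluster label, e.g. "AD"."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for (a, b), score in similarity.items():
        if score >= t:
            parent[find(a)] = find(b)

    clusters = {}
    for n in nodes:
        clusters.setdefault(find(n), []).append(n)
    return {n: "".join(sorted(members))
            for members in clusters.values() for n in members}


def clustered_graph(edges, label):
    """Lift use relations to clusters, dropping edges inside a cluster."""
    graph = {label[n]: set() for n in edges}
    for src, targets in edges.items():
        for dst in targets:
            if label[src] != label[dst]:
                graph[label[src]].add(label[dst])
    return graph


# Hypothetical use relations and similarity scores.
EDGES = {"A": ["B"], "B": ["C"], "C": [], "D": ["F"],
         "E": ["G"], "F": ["C"], "G": ["C"]}
SIMILARITY = {("A", "D"): 0.9, ("B", "F"): 0.85, ("C", "G"): 0.4}

label = cluster_components(EDGES, SIMILARITY)
graph = clustered_graph(EDGES, label)
```

With these inputs A and D collapse into AD and B and F into BF, while C and G stay separate because their score is below t.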
Prototype System
Input: .java files (one .java file = one component)
• Extract use relations: inheritance, method call, attribute access, abstract class impl.
• Measure similarity by SMMT (SMMT measures similarity by a clone detection technique)
• Cluster similar components (similarity criterion t = 0.8: 80% of statements are the same)
• Construct the clustered component graph
• Compute node weights (equal distribution ratios d to outgoing edges; weight ratio p between real and pseudo edges: 0.85)
• De-cluster to the original components
Output: component ranks

Experiment 1: JDK 1.3.0
• 575,000 lines, 1877 components
• 7 minutes on a PC (Pentium IV, 2 GHz, 2 GB)
rank  class name                     weight   note
1     java.lang.Object               0.16126  superclass of all classes
2     java.lang.Class                0.08712
3     java.lang.Throwable            0.05510  superclass of any error or exception
4     java.lang.Exception            0.03103
5     java.io.IOException            0.01343
6     java.lang.StringBuffer         0.01214
7     java.lang.SecurityManager      0.01169
8     java.io.InputStream            0.01027
9     java.lang.reflect.Field        0.00948
10    java.lang.reflect.Constructor  0.00936
...   ...
1256  sunw.util.EventListener        0.00011
...   ...
1256  (these 622 classes are not used by any other classes)
• Very general and core classes: ranked high
• Specific and independent classes: ranked low

Experiment 2: Collection of SE Tools and Libraries
• CK metrics measurement tools, component rank system
• ANTLR, JAMA, Caffe Cappuccino
• 582 components
rank  class name                                                  weight
1     antlr.Token                                                 0.10727
2     antlr.debug.Event                                           0.06189
2     antlr.debug.NewLineEvent                                    0.06189
4     antlr.collections.impl.Vector                               0.05434
5     jp.gr.java_conf.keisuken.text.html.HtmlParameter            0.05246
6     jp.gr.java_conf.keisuken.net.server.ServerProperties        0.03699
7     Jama.Matrix                                                 0.01564
8     jp.gr.java_conf.keisuken.util.IntegerArray                  0.01390
8     jp.gr.java_conf.keisuken.util.LongArray                     0.01390
10    jp.ac.osaka_u.es.ics.iip_lab.metrics.parser.IdentifierInfo  0.01365
...   ...
418   cktool_new.examples.Main                                    0.00050
Indicator of generality and specialty w.r.t. usage from other classes

Experiment 3: Application to Industry
• Daiwa computer: a mid-sized software company in Osaka
• Shared Java application framework for web-based data management
• Framework plus 5 applications built on the framework
• 1538 components, 339 clustered nodes
• Classes in the framework and definitions of data structures are ranked high

Experiment 4: Document Processing Tools and Libraries
• JEDIT, jext, Enhydra, saxon, phex, JDK, etc. (7171 components)
• Perform string search by grep command with keyword getNodetype
order (sorted by rank)  class name                           weight
1(67)                   enhydra3.1 ... dom.Node              0.029110
2(169)                  saxon7_0 ... saxon.om.NodeInfo       0.000969
3(275)                  saxon7_0 ... saxon.pattern.NodeTest  0.000437
4(316)                  enhydra3.1 ... dom.DocumentImpl      0.000368
5(355)                  saxon7_0 ... saxon.pattern.Pattern   0.000324
6(382)                  saxon7_0 ... saxon.Controller        0.000296
7(437)                  enhydra3.1 ... xslt.XSLTEngineImpl   0.000241
8(446)                  enhydra3.1 ... dom.ElementImpl       0.000235
9(500)                  saxon7_0 ... saxon.style.StyleElement 0.000202
10(506)                 saxon7_0 ... saxon.tree.NodeImpl     0.000198
...                     ...
125(4441)               enhydra3.1 ... FuncID                0.000029
...                     ...
125(4441)
Annotation: method definitions for obtaining node kinds in the DOM tree
We can easily find the core definitions of classes

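The search step can be approximated as a keyword filter followed by a sort on component weight. The sketch below is an illustration only: it assumes case-insensitive substring matching and reads an entry such as 1(67) as the position among the hits followed by the overall rank. The component names, source snippets, and weights are hypothetical.

```python
def ranked_search(keyword, sources, weights):
    """Return (hit_order, overall_rank, name) for every component whose source
    contains keyword (case-insensitive), ordered by descending weight.
    Ties in weight are not given a shared rank in this sketch."""
    by_weight = sorted(weights, key=weights.get, reverse=True)
    overall = {name: i + 1 for i, name in enumerate(by_weight)}
    needle = keyword.lower()
    hits = [name for name in by_weight if needle in sources[name].lower()]
    return [(i + 1, overall[name], name) for i, name in enumerate(hits)]


# Hypothetical components, source snippets, and precomputed weights.
SOURCES = {
    "lib.dom.Node": "int getNodeType();",
    "lib.util.Logger": "void log(String message);",
    "lib.dom.ElementImpl": "public int getNodeType() { return ELEMENT; }",
    "app.Main": "if (node.getNodeType() == 1) { handle(node); }",
}
WEIGHTS = {
    "lib.util.Logger": 0.40,
    "lib.dom.Node": 0.35,
    "lib.dom.ElementImpl": 0.20,
    "app.Main": 0.05,
}

hits = ranked_search("getNodetype", SOURCES, WEIGHTS)
for order, rank, name in hits:
    print(f"{order}({rank}) {name}")
```

Sorting the hits by weight pushes the widely used definition ahead of the application code that merely calls it.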
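Discussion 1: Weight Computation
[Figure: the same five-node graph (A through E) weighted under both models.]
node  Reference Count Model  Component Rank Model
A     0.6                    0.33
B     0.2                    0.31
C     0.2                    0.30
D     0                      0.03
E     0                      0.03
• Reference Count Model: fragile to locally-made references, which may not be important globally
• Component Rank Model: more stable to local references
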
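The two models can be compared on a small graph. The sketch below assumes a five-node graph in which C, D, and E use A, A uses B, and B uses C; this edge set is an inference from the numbers in the figure, not something the slides state. It also assumes that pseudo edges spread a 1 - p share of every node's weight equally over all nodes, with p = 0.85.

```python
# Assumed graph (inferred from the figure): C, D and E use A; A uses B; B uses C.
EDGES = {"A": ["B"], "B": ["C"], "C": ["A"], "D": ["A"], "E": ["A"]}


def reference_count_weights(edges):
    """Weight of a node = its share of all incoming references."""
    counts = {n: 0 for n in edges}
    for targets in edges.values():
        for dst in targets:
            counts[dst] += 1
    total = sum(counts.values())
    return {n: c / total for n, c in counts.items()}


def component_rank_weights(edges, p=0.85, steps=200):
    """Propagate weights with an assumed pseudo-edge share of 1 - p.
    Requires every node to have at least one real outgoing edge."""
    nodes = list(edges)
    n = len(nodes)
    w = {x: 1.0 / n for x in nodes}
    for _ in range(steps):
        # Pseudo edges: each node receives (1 - p) / n, since the weights sum to 1.
        nxt = {x: (1.0 - p) / n for x in nodes}
        for src, targets in edges.items():
            for dst in targets:
                nxt[dst] += p * w[src] / len(targets)
        w = nxt
    return w


rc = reference_count_weights(EDGES)
cr = component_rank_weights(EDGES)
```

Under these assumptions the reference count gives A three times the weight of B and C, while the propagated weights of A, B, and C come out close to each other, because B and C inherit significance from A.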
Discussion 2: Clustering Policy (1)
• Eliminate the effect of simply duplicated components
[Figure: original components A and B, their copies, and other components X and Y. After clustering, A, B, X, and Y each have weight 0.25: the same weight arrangement as the case with no duplicated components.]

Discussion 2: Clustering Policy (2)
• Count only reused components that are not simple duplicates
[Figure: original components, modified copies, and other components. After clustering, the nodes are A, X, B, C, and Y; A's weight (0.3) is higher than the others.]

Discussion 3: Similarity Criterion and Pseudo Use Relation
• Similarity criterion t: 0.8
• Resulting ranks are fairly insensitive to t
• Some inherently different components fall into the same cluster if t is less than 0.8
• Pseudo use relation ratio p: 0.85
• Resulting ranks are stable for p between 0.75 and 0.95

Related Works
• Markov models of documentation traversal
• Influence Weight: impact factor of journal publications through incoming references
• Page Rank: weight of HTML pages on the Internet through incoming web links
Explicit use relations; no clustering (important for software products)
• Measuring reusability of components or interfaces
• Use various characteristic metrics
• Indirect indicator of reusability
• Our approach directly reflects usage of components

SPARS-J: Software Product Archiving, Analyzing and Retrieving System for Java
[Figure: SPARS-J Software Component Searcher, with parts labeled Internet / Corporate Repositories, Component Collector, Analyzer and Evaluator, Component Archive, and Query Handler.]

Conclusion & Future Work
• Component Rank: a novel model for software components
• Prototype system for Java
• Application to various collections of Java programs: promising results
• Developing SPARS-J
• Statistical evaluation (recall & precision)
• Practical evaluation using SPARS-J
• Other models (weight distribution, similarity, ...)

Global Analysis of Software Data
[Figure: diagram with the labels Data on the Internet, Subsidiary Company Data, Company-Wide Project Data, Collection, Data Analysis, and Feedback.]

Weight Computation by Eigenvector
• W is the eigenvector of D^t for eigenvalue 1
• A math package can be used for the eigenvector computation, but it is generally slower than the propagation computation
$W = D^{t} W$
W: node weight vector
D^t: transposed matrix of distribution ratios
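As a small worked instance, assume a three-node graph in which A uses B and C, B uses C, and C uses A, with equal distribution ratios on each node's outgoing edges. The edge set and the subscripted symbols are assumptions chosen for illustration. Solving for the eigenvector of eigenvalue 1 by hand gives:

```latex
% Rows of D are origin nodes A, B, C; columns are destination nodes A, B, C.
\[
D = \begin{pmatrix} 0 & \tfrac12 & \tfrac12 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{pmatrix},
\qquad
W = D^{t} W \;\Longleftrightarrow\;
\begin{cases}
w_A = w_C \\
w_B = \tfrac12\, w_A \\
w_C = \tfrac12\, w_A + w_B
\end{cases}
\]
% Normalizing with constraint (1):
\[
w_A + w_B + w_C = 1 \;\Longrightarrow\; W = (0.4,\; 0.2,\; 0.4)^{t}
\]
```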