MCS680: Foundations Of Computer Science

Case Study: Automatic Techniques For Software Modularization

    /* graph is passed as an array of `size` row pointers,
       because `int graph[][]` is not a valid C parameter */
    int MSTWeight(int **graph, int size)
    {
        int i, j;                          /* O(1) */
        int weight = 0;                    /* O(1) */
        for (i = 0; i < size; i++)         /* n iterations: O(n) */
            for (j = 0; j < size; j++)     /* n iterations: O(n) */
                weight += graph[i][j];
        return weight;
    }

Running Time = 2·O(1) + O(n^2) = O(n^2)

Brian Mitchell (bmitchel@mcs.drexel.edu) - Drexel University MCS680-FCS

Introduction
• This topic reinforces the concepts of set and graph theory by demonstrating a current research area
  • Algorithms for Automatic Software Modularization
• This research was conducted by Drexel faculty:
  • Brian Mitchell
  • Spiros Mancoridis
  • Chris Rorres

Software Engineering Problem
• Software maintenance is an arduous task because of the difficulties associated with understanding the intricate relationships that exist between the source code components
  • Design document is inaccurate
  • Original system architect/designer is no longer available for consultation
• With no mechanism for gaining insight into the system design and structure, the software maintenance practitioner is often forced to modify the source code without a thorough understanding of the system's organization
• Also, heavily used software systems change rapidly
  • Use of an “ad-hoc” maintenance approach will negatively affect the system design

Software Engineering Problem
• Software engineers have long known of the difficulties associated with maintaining software systems whose only current documentation is the source code
  • Leads to decay in the design due to source code changes that are made without an understanding of the system structure
• The size of modern-day software systems is beyond a programmer's cognitive ability to determine the effect of a local change on the entire system
• Changes made to the source code without an understanding of its organization usually contradict one or more aspects of the original design
• Goal is to give the programmer a tool that visualizes the modularization of the system

Other Work In Field
• Top-Down Approaches
  • Tools such as “Rigi” and “Arch” have been developed to perform a modularization of a software system
    • Still require somebody familiar with the system to provide feedback and/or set system-specific parameters
• Bottom-Up Approaches
  • Software Reflection Model
    • Used to capture and exploit the differences that exist between the actual source code organization and the designer's high-level model of the system's modularization
    • Streamline learning process
  • The Orphan Adoption Problem
    • Given the name of a new software resource (an orphan), this tool emits as output the name of the subsystem that has been chosen as the parent for the orphan

Our Automatic Modularization Tool
• Implements algorithms that we developed that
  • Are fully automatic
  • Recursively generate a hierarchical view of the system organization based solely on information extracted from the source code
• Fully automatic techniques are not only useful to programmers who lack familiarity with the system, but can also be used by the system architect to compare the documented modularization with the one created by our tool and learn from the differences

Software System Organization
• Software systems contain a finite set of software components and a collection of relationships that govern how the software components interact with each other
• Typical software components
  • Classes, Modules
  • Variables, Macros
  • Structures
• Typical software relationships
  • Import
  • Export
  • Inherit
• Can represent the system structure as a resource dependency graph
  • The information required to build this graph can be obtained by parsing the source code
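A resource dependency graph of this kind can be sketched as a set of directed edges. The example below is a minimal illustration only: the function name `build_dependency_graph`, the input format and the module names are hypothetical, and a real tool would obtain the relationships by parsing source code.

```python
def build_dependency_graph(imports):
    """Return (nodes, edges) for a directed resource dependency graph.

    `imports` maps each module name to the modules it depends on.
    An edge (a, b) means "a uses a resource provided by b".
    """
    nodes = set(imports)
    edges = set()
    for module, deps in imports.items():
        for dep in deps:
            nodes.add(dep)            # a dependency may have no entry of its own
            edges.add((module, dep))
    return nodes, edges


# Hypothetical compiler-like system, for illustration only.
imports = {
    "main": ["scanner", "parser"],
    "parser": ["scanner", "codegen"],
    "codegen": ["symtab"],
    "scanner": ["symtab"],
}
nodes, edges = build_dependency_graph(imports)
```

Storing edges as `(source, target)` pairs keeps the direction of each dependency, which matters once intra-cluster and inter-cluster edges are counted.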

Example Resource Dependency Graph: Plan9
• The following resource dependency graph was automatically generated by scanning the source code from the file system of the Plan9 operating system
• Access to source code provided by AT&T Labs

Goals of Research
• Goal of our research is to automatically partition the components of a system into clusters that maximize cohesion and minimize coupling
• The clusters, once discovered, represent a higher-level abstraction of the system's organization by grouping related software components into subsystems
• Each subsystem contains a collection of modules that either
  • Cooperate to perform some high-level function in the overall system
    • Scanner, parser, code generator
  • Provide a set of related services that are used throughout the system
    • Import Library
    • File manager, memory manager

Automatically Modularized Visualization of Plan9 OS
• The following graph was derived by our clustering utility
• Formal definitions for cohesion, coupling and modularization quality must now be developed in order to illustrate our process

Architecture of our Clustering Environment
[Figure: processing pipeline. In the "Parse Source Code" stage, the CIA utility scans the source code modules and generates an XREF database; an Awk script queries and formats the database; the clustering engine reads the result and generates a DOT file; the DOTTY utility reads the DOT file and displays the clustered graph.]

Quantifying Cohesion
• Cohesion is an indication of the strength of the relationships that exist between modules that are grouped into a cluster.
  • High cohesion = Strong Encapsulation.
• We define cohesion (H) as a measurement of intra-edge dependencies between the components in a particular cluster.
• Formally, the cohesion $H_i$ of cluster $i$ consisting of $N_i$ components and $\mu_i$ intra-edge dependencies is:

$$H_i = \frac{\mu_i}{N_i^2}$$

• This measurement is the fraction of the maximum possible number of intra-edge dependencies, which is $N_i^2$.
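The cohesion measure can be sketched in a few lines. This assumes the reading suggested by the slide text, namely the number of intra-cluster edges divided by N_i squared; the function name `cohesion` and the sample edges are illustrative.

```python
def cohesion(cluster, edges):
    """Fraction of the N**2 possible directed intra-cluster edges that exist."""
    cluster = set(cluster)
    # mu counts the edges whose two endpoints both lie inside the cluster
    mu = sum(1 for u, v in edges if u in cluster and v in cluster)
    return mu / len(cluster) ** 2


# Hypothetical edges: M1, M2 and M3 form a cycle, and M3 also uses M4.
edges = [("M1", "M2"), ("M2", "M3"), ("M3", "M1"), ("M3", "M4")]
h = cohesion({"M1", "M2", "M3"}, edges)   # 3 intra-edges out of 9 possible
```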

Quantifying Coupling
• Coupling (C) is a measurement of inter-edge dependencies between the components of two distinct clusters
• The coupling $C_{i,j}$ between clusters $i$ and $j$, consisting of $N_i$ and $N_j$ components respectively, with $\varepsilon_{i,j}$ inter-edge dependencies is:

$$C_{i,j} = \begin{cases} 0 & i = j \\ \dfrac{\varepsilon_{i,j}}{2 N_i N_j} & i \neq j \end{cases}$$

• This measurement is the fraction of the maximum number of inter-edge dependencies between clusters $i$ and $j$, which is $2 N_i N_j$
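A matching sketch for coupling follows. It assumes the normalizing maximum is 2·N_i·N_j, that is, every possible directed edge in either direction between the two clusters; that constant, the function name `coupling` and the sample data are assumptions made for illustration.

```python
def coupling(cluster_i, cluster_j, edges):
    """Fraction of the 2*Ni*Nj possible directed edges between two clusters."""
    ci, cj = set(cluster_i), set(cluster_j)
    if ci == cj:
        return 0.0                      # a cluster is not coupled to itself
    # eps counts edges that cross between the clusters, in either direction
    eps = sum(1 for u, v in edges
              if (u in ci and v in cj) or (u in cj and v in ci))
    return eps / (2 * len(ci) * len(cj))


# Hypothetical edges: only M3 -> M4 crosses between the two clusters.
edges = [("M1", "M2"), ("M2", "M3"), ("M3", "M1"), ("M3", "M4")]
c = coupling({"M1", "M2", "M3"}, {"M4"}, edges)   # 1 edge out of 2*3*1 possible
```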

Modularization Quality
• Modularization Quality (MQ) is defined as the measurement of the “goodness” of a particular system modularization.
• Specifically, the MQ of a modularization of $k$ clusters, where $H_i$ is the cohesion of the $i$th cluster and $C_{i,j}$ is the coupling between the $i$th and $j$th clusters, is:

$$MQ = \begin{cases} \dfrac{1}{k}\sum_{i=1}^{k} H_i \;-\; \dfrac{1}{k(k-1)/2}\sum_{1 \le i < j \le k} C_{i,j} & k > 1 \\ H_1 & k = 1 \end{cases}$$

• This measurement shows the trade-off between cohesion and coupling by
  • Rewarding many small highly-cohesive clusters
  • Penalizing too many inter-edges
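The trade-off can be made concrete with a small sketch. It assumes MQ is the mean cohesion minus the mean coupling over the k(k-1)/2 cluster pairs, with cohesion taken as intra-edges / N_i² and coupling as inter-edges / (2·N_i·N_j); these formulas, the function name `mq` and the sample graph are assumptions for illustration.

```python
def mq(partition, edges):
    """Modularization quality: mean cohesion minus mean coupling (assumed formulas)."""
    clusters = [set(c) for c in partition]
    k = len(clusters)
    where = {m: i for i, c in enumerate(clusters) for m in c}
    intra = [0] * k          # intra-edge count per cluster
    inter = {}               # inter-edge count per unordered cluster pair
    for u, v in edges:
        i, j = where[u], where[v]
        if i == j:
            intra[i] += 1
        else:
            pair = (min(i, j), max(i, j))
            inter[pair] = inter.get(pair, 0) + 1
    avg_h = sum(intra[i] / len(c) ** 2 for i, c in enumerate(clusters)) / k
    if k == 1:
        return avg_h         # a single cluster has no coupling term
    total_c = sum(n / (2 * len(clusters[i]) * len(clusters[j]))
                  for (i, j), n in inter.items())
    return avg_h - total_c / (k * (k - 1) / 2)


# Hypothetical five-module system with two natural groups.
edges = [("M1", "M2"), ("M2", "M3"), ("M3", "M1"),
         ("M4", "M5"), ("M5", "M4"), ("M3", "M4")]
grouped = [{"M1", "M2", "M3"}, {"M4", "M5"}]
single = [{"M1", "M2", "M3", "M4", "M5"}]
```

Under these assumptions the grouped partition scores (1/3 + 1/2)/2 - 1/12 = 1/3, while putting everything in one cluster scores 6/25.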

Modularization Quality Example
[Figure: eight modules, M1 to M8, grouped into Subsystem 1, Subsystem 2 and Subsystem 3.]

Partitions of a Set
• Must construct a data model to represent a partition (a clustering) of a software system
• Consider the source code organization for system S.
  • S = {M1, M2, …, Mn}
• Let a collection Π = {A1, A2, …, Ak} be a set of non-empty subsets such that each Ai ⊆ S. Π is a partition of S if:
  • The subsets are a covering of S: A1 ∪ A2 ∪ … ∪ Ak = S
  • The subsets are mutually exclusive: Ai ∩ Aj = ∅ whenever i ≠ j
• Each subset Ai is called a cluster of the partition
• A partition of S into k non-empty clusters is called a k-partition of S
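The two conditions translate directly into a check. The sketch below is illustrative; the function name `is_partition` is hypothetical.

```python
def is_partition(subsets, s):
    """True if `subsets` are non-empty, cover `s`, and are mutually exclusive."""
    subsets = [set(a) for a in subsets]
    s = set(s)
    if any(not a for a in subsets):
        return False                     # clusters must be non-empty
    union = set().union(*subsets)
    covering = union == s                # every module appears in some cluster
    # If no element is shared, the cluster sizes add up to the size of the union.
    exclusive = sum(len(a) for a in subsets) == len(union)
    return covering and exclusive


S = {"M1", "M2", "M3"}
ok = is_partition([{"M1"}, {"M2", "M3"}], S)
```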

Number of k-Partitions of a Set
• Let S be a set of n elements. The number of k-partitions of an n-set satisfies the recurrence equation:

$$S_{n,k} = \begin{cases} 1 & k = 1 \text{ or } k = n \\ S_{n-1,k-1} + k\,S_{n-1,k} & \text{otherwise} \end{cases}$$

• The entries $S_{n,k}$ are called Stirling numbers (of the second kind)
• Stirling numbers govern the number of k-partitions of a set.
• Stirling numbers grow exponentially with respect to the size of S.
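As a quick worked check, using the standard recurrence for Stirling numbers of the second kind, the number of ways to split a 4-element set into 2 clusters comes out as 7:

```latex
\begin{aligned}
S_{n,k} &= S_{n-1,k-1} + k\,S_{n-1,k}, \qquad S_{n,1} = S_{n,n} = 1 \\
S_{3,2} &= S_{2,1} + 2\,S_{2,2} = 1 + 2 \cdot 1 = 3 \\
S_{4,2} &= S_{3,1} + 2\,S_{3,2} = 1 + 2 \cdot 3 = 7
\end{aligned}
```

This agrees with the direct count: each of the 4 elements goes into one of two clusters, neither cluster may be empty, and the two clusters are unordered, which gives 2^3 - 1 = 7.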

Clustering: Optimal Solution
• Algorithm
  • Let S = {M1, M2, …, Mn}, where each Mi is a module in the software system
  • Let G be the graph representing the relationships between the modules in S
  • Generate every partition of set S
  • Evaluate MQ for each partition
  • The partition with the largest MQ is the optimal solution
• The algorithm works well for sets of up to 15 elements; beyond that, the number of partitions becomes too large to enumerate in a reasonable time frame
• Clearly, sub-optimal techniques must be employed for large sets
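The exhaustive search above can be sketched as follows. The MQ formula used here (mean cohesion minus mean coupling, with cohesion = intra-edges / N_i² and coupling = inter-edges / (2·N_i·N_j)) is an assumption, and the names `all_partitions`, `mq` and `optimal_clustering` and the sample graph are illustrative.

```python
def all_partitions(items):
    """Yield every partition of the list `items` as a list of lists."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for smaller in all_partitions(rest):
        # place `first` into each existing cluster in turn ...
        for i, cluster in enumerate(smaller):
            yield smaller[:i] + [[first] + cluster] + smaller[i + 1:]
        # ... or into a new cluster of its own
        yield [[first]] + smaller


def mq(partition, edges):
    """Assumed MQ: mean cohesion minus mean coupling over all cluster pairs."""
    k = len(partition)
    where = {m: i for i, c in enumerate(partition) for m in c}
    intra, inter = [0] * k, {}
    for u, v in edges:
        i, j = where[u], where[v]
        if i == j:
            intra[i] += 1
        else:
            pair = (min(i, j), max(i, j))
            inter[pair] = inter.get(pair, 0) + 1
    avg_h = sum(intra[i] / len(c) ** 2 for i, c in enumerate(partition)) / k
    if k == 1:
        return avg_h
    total_c = sum(n / (2 * len(partition[i]) * len(partition[j]))
                  for (i, j), n in inter.items())
    return avg_h - total_c / (k * (k - 1) / 2)


def optimal_clustering(modules, edges):
    """Evaluate MQ for every partition and keep the best one."""
    return max(all_partitions(list(modules)), key=lambda p: mq(p, edges))


# Hypothetical five-module system (52 partitions to examine).
modules = ["M1", "M2", "M3", "M4", "M5"]
edges = [("M1", "M2"), ("M2", "M3"), ("M3", "M1"),
         ("M4", "M5"), ("M5", "M4"), ("M3", "M4")]
best = optimal_clustering(modules, edges)
```

The generator visits each partition exactly once, so its running time grows with the partition counts, which is why the approach stops being practical at around 15 modules.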

How many k-partitions are there?
• The following table gives the total number of partitions (summed over all k) of a system that has N modules.

   N   Number of partitions
   1   1
   2   2
   3   5
   4   15
   5   52
   6   203
   7   877
   8   4140
   9   21147
  10   115975
  11   678570
  12   4213597
  13   27644437
  14   190899322
  15   1382958545
  16   10480142147
  17   82864869804
  18   682076806159
  19   5832742205057
  20   51724158235372
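The table entries can be reproduced from the Stirling recurrence by summing over every possible cluster count k. The sketch below uses the standard recurrence for Stirling numbers of the second kind; the function names are illustrative.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def stirling(n, k):
    """Number of ways to partition an n-element set into k non-empty clusters."""
    if k < 1 or k > n:
        return 0
    if k == 1 or k == n:
        return 1
    # The n-th element either forms a new cluster on its own,
    # or joins one of the k clusters of a partition of the other n - 1 elements.
    return stirling(n - 1, k - 1) + k * stirling(n - 1, k)


def total_partitions(n):
    """Total number of partitions of an n-element set, over all k."""
    return sum(stirling(n, k) for k in range(1, n + 1))


table = {n: total_partitions(n) for n in range(1, 21)}
```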

Sub-Optimal Modularization Strategy
• The search space required for enumerating all possible partitions is too large for most software systems
• We need to develop a search strategy that quickly discovers an acceptable sub-optimal clustering
• Generic Sub-Optimal Algorithm
  • Construct a resource dependency graph G that represents the relationships between the modules in S.
  • Generate uniformly distributed random clusterings of S. We use a combinatorial algorithm to accomplish this task because our sub-optimal techniques require the generation of many random clusterings.
  • Iteratively improve a randomly generated clustering, by measuring its MQ, until no further improvement is possible. This task is accomplished by heuristically moving modules in S between the generated clusters.
  • Repeat this process until an acceptable sub-optimal result is determined.

Neighboring Partition
• We need a way to improve a partition's MQ
• We define a partition NP to be a neighbor of a partition P if and only if:
  • NP is exactly the same as P except that a single element of P is in a different cluster in partition NP
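A neighbor generator might look like the sketch below. One point is an assumption rather than something the definition settles: the sketch also lets a module move into a brand-new singleton cluster, in addition to moving into another existing cluster. The function name `neighbors` is illustrative.

```python
def neighbors(partition):
    """Yield every partition obtained from `partition` by moving one module."""
    clusters = [frozenset(c) for c in partition]
    for i, source in enumerate(clusters):
        others = clusters[:i] + clusters[i + 1:]
        for module in sorted(source):
            # what is left of the source cluster (dropped if it becomes empty)
            rest = [source - {module}] if len(source) > 1 else []
            # move `module` into each other existing cluster
            for j, target in enumerate(others):
                yield others[:j] + [target | {module}] + others[j + 1:] + rest
            # assumption: `module` may also start a new cluster of its own
            if rest:
                yield others + rest + [frozenset({module})]


p = [{"a", "b"}, {"c"}]
found = list(neighbors(p))
```

The same partition can be produced by more than one move (for example, splitting either member out of a two-element cluster), so callers that need distinct neighbors should deduplicate.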

Generic Sub-Optimal Algorithm
• Algorithm
  • Let S = {M1, M2, …, Mn}, where each Mi is a module in the software system
  • Let G be the graph representing the relationships between the modules in S
  • Generate a random partition P of set S
  • If possible, find a neighboring partition NP that has an improved MQ over P
  • If an improved neighboring partition is found
    • Let P = NP
  • Repeat the previous two steps until no improved neighbor can be found
  • P is the sub-optimal solution
• A variety of algorithms for finding sub-optimal solutions are possible, depending on how “improved” is defined

Steepest-Ascent Hill Climbing (SAHC) Algorithm
• Algorithm
  • Let S = {M1, M2, …, Mn}, where each Mi is a module in the software system
  • Let G be the graph representing the relationships between the modules in S
  • Generate a random partition P of set S
  • Repeat
    • Find the best neighboring partition BNP of P
    • If MQ(BNP) > MQ(P)
      • Let P = BNP
  • Until no further “improved” BNPs can be found
  • P is the sub-optimal solution
• BNP may be expensive to calculate
  • All neighboring partitions of P must be examined
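The loop above can be sketched as follows, under several stated assumptions: MQ is mean cohesion minus mean coupling (cohesion = intra-edges / N_i², coupling = inter-edges / (2·N_i·N_j)), a neighbor may also be formed by moving a module into a new singleton cluster, and the starting partition comes from a simple random assignment that is not uniformly distributed over partitions. All function names and the sample graph are illustrative.

```python
import random


def mq(partition, edges):
    """Assumed MQ: mean cohesion minus mean coupling over all cluster pairs."""
    k = len(partition)
    where = {m: i for i, c in enumerate(partition) for m in c}
    intra, inter = [0] * k, {}
    for u, v in edges:
        i, j = where[u], where[v]
        if i == j:
            intra[i] += 1
        else:
            pair = (min(i, j), max(i, j))
            inter[pair] = inter.get(pair, 0) + 1
    avg_h = sum(intra[i] / len(c) ** 2 for i, c in enumerate(partition)) / k
    if k == 1:
        return avg_h
    total_c = sum(n / (2 * len(partition[i]) * len(partition[j]))
                  for (i, j), n in inter.items())
    return avg_h - total_c / (k * (k - 1) / 2)


def neighbors(partition):
    """Yield partitions that differ from `partition` by moving one module."""
    clusters = [frozenset(c) for c in partition]
    for i, source in enumerate(clusters):
        others = clusters[:i] + clusters[i + 1:]
        for module in sorted(source):
            rest = [source - {module}] if len(source) > 1 else []
            for j, target in enumerate(others):
                yield others[:j] + [target | {module}] + others[j + 1:] + rest
            if rest:  # assumption: a module may start a new singleton cluster
                yield others + rest + [frozenset({module})]


def random_partition(modules, rng):
    """Simple random start; NOT uniform over all partitions."""
    k = rng.randint(1, len(modules))
    buckets = {}
    for m in modules:
        buckets.setdefault(rng.randrange(k), set()).add(m)
    return [frozenset(b) for b in buckets.values()]


def sahc(modules, edges, rng):
    """Steepest ascent: always move to the best neighbor while it improves MQ."""
    p = random_partition(modules, rng)
    while True:
        best = max(neighbors(p), key=lambda q: mq(q, edges), default=None)
        if best is None or mq(best, edges) <= mq(p, edges):
            return p          # local maximum: no neighbor is strictly better
        p = best


modules = ["M1", "M2", "M3", "M4", "M5"]
edges = [("M1", "M2"), ("M2", "M3"), ("M3", "M1"),
         ("M4", "M5"), ("M5", "M4"), ("M3", "M4")]
result = sahc(modules, edges, random.Random(0))
```

Each step scans every neighbor, which is the cost the slide warns about. The result is only guaranteed to be a local maximum of MQ.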

Next-Ascent Hill Climbing (NAHC) Algorithm
• Algorithm
  • Let S = {M1, M2, …, Mn}, where each Mi is a module in the software system
  • Let G be the graph representing the relationships between the modules in S
  • Generate a random partition P of set S
  • Repeat
    • Find a better neighboring partition bNP of P
    • If MQ(bNP) > MQ(P)
      • Let P = bNP
  • Until no further “improved” bNPs can be found
  • P is the sub-optimal solution
• A bNP is discovered by randomly searching the set of neighboring partitions until a partition with a higher MQ is found
  • Usually, not all NPs will have to be examined
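A next-ascent variant can be sketched in the same way: neighbors are visited in random order and the first one with a higher MQ is accepted. The assumptions are the MQ formula (mean cohesion minus mean coupling, with cohesion = intra-edges / N_i² and coupling = inter-edges / (2·N_i·N_j)), the extra move into a new singleton cluster, and a simple non-uniform random start; all names and the sample graph are illustrative.

```python
import random


def mq(partition, edges):
    """Assumed MQ: mean cohesion minus mean coupling over all cluster pairs."""
    k = len(partition)
    where = {m: i for i, c in enumerate(partition) for m in c}
    intra, inter = [0] * k, {}
    for u, v in edges:
        i, j = where[u], where[v]
        if i == j:
            intra[i] += 1
        else:
            pair = (min(i, j), max(i, j))
            inter[pair] = inter.get(pair, 0) + 1
    avg_h = sum(intra[i] / len(c) ** 2 for i, c in enumerate(partition)) / k
    if k == 1:
        return avg_h
    total_c = sum(n / (2 * len(partition[i]) * len(partition[j]))
                  for (i, j), n in inter.items())
    return avg_h - total_c / (k * (k - 1) / 2)


def neighbors(partition):
    """Yield partitions that differ from `partition` by moving one module."""
    clusters = [frozenset(c) for c in partition]
    for i, source in enumerate(clusters):
        others = clusters[:i] + clusters[i + 1:]
        for module in sorted(source):
            rest = [source - {module}] if len(source) > 1 else []
            for j, target in enumerate(others):
                yield others[:j] + [target | {module}] + others[j + 1:] + rest
            if rest:  # assumption: a module may start a new singleton cluster
                yield others + rest + [frozenset({module})]


def random_partition(modules, rng):
    """Simple random start; NOT uniform over all partitions."""
    k = rng.randint(1, len(modules))
    buckets = {}
    for m in modules:
        buckets.setdefault(rng.randrange(k), set()).add(m)
    return [frozenset(b) for b in buckets.values()]


def nahc(modules, edges, rng):
    """Next ascent: accept the first randomly visited neighbor that improves MQ."""
    p = random_partition(modules, rng)
    while True:
        current = mq(p, edges)
        candidates = list(neighbors(p))
        rng.shuffle(candidates)
        better = next((q for q in candidates if mq(q, edges) > current), None)
        if better is None:
            return p          # every neighbor was examined and none improves MQ
        p = better


modules = ["M1", "M2", "M3", "M4", "M5"]
edges = [("M1", "M2"), ("M2", "M3"), ("M3", "M1"),
         ("M4", "M5"), ("M5", "M4"), ("M3", "M4")]
result = nahc(modules, edges, random.Random(0))
```

For brevity the sketch materializes the full neighbor list before shuffling. Only the MQ evaluations stop early, and a full scan happens just once, at the final local maximum.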

A Genetic Algorithm Framework
• Our experimentation with the SAHC and NAHC algorithms has shown that, given an initial random starting partition:
  • The algorithms will converge to a local maximum
  • However, not all initial partitions converge to an acceptable result
• Therefore we must either:
  • Run the experiment many times using different initial partitions and pick the experiment that results in the largest MQ
  • Or, devise an approach that works with a population of randomly generated initial partitions and concurrently improves them until all of the initial samples converge
    • The partition in the final population with the largest MQ is the sub-optimal solution
• This approach lends itself to being implemented with a Genetic Algorithm

Genetic Algorithms
• Genetic algorithms were first developed by John Holland et al. at the University of Michigan
• Genetic algorithms have been applied to many problems that involve exploring large search spaces
• Characteristics of GAs
  • Combine survival-of-the-fittest techniques with a structured and randomized information exchange
  • Facilitate innovative algorithms that parallel the natural selection process
  • GAs are more than a randomized search: they exploit historical data to speculate on new information that is expected to yield improved results

Genetic Search Sub-Optimal Clustering Algorithm
• Algorithm
  • Let S = {M1, M2, …, Mn}, where each Mi is a module in the software system
  • Let G be the graph representing the relationships between the modules in S
  • Generate an initial population of random partitions of set S
  • Repeat
    • Randomly select a percentage of partitions from the population and improve them using the SAHC or NAHC technique
    • Generate a new population (from the current one) by using a biased wheel that favors partitions with larger MQ
    • Let the new population become the current population
  • Until no improvement is seen for t generations, until the population has converged, or until the maximum number of generations has been executed
  • The partition P in the final generation with the largest MQ is the sub-optimal solution
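The biased-wheel step can be sketched as fitness-proportional sampling. The weighting below is an assumption: because MQ can be negative, each score is shifted by 1 so that weights are non-negative (this presumes MQ lies in [-1, 1]). The function name `biased_wheel` and the sample population are illustrative.

```python
import random


def biased_wheel(population, scores, rng):
    """Draw a new population of the same size, favoring higher-scoring members."""
    # Assumption: scores are MQ values in [-1, 1]; shift so weights are >= 0.
    weights = [s + 1.0 for s in scores]
    if sum(weights) == 0:
        weights = [1.0] * len(population)   # degenerate case: sample uniformly
    return rng.choices(population, weights=weights, k=len(population))


# Hypothetical population of three partitions, labeled by name only.
population = ["A", "B", "C"]
scores = [-1.0, 0.0, 1.0]                   # assumed MQ values
next_generation = biased_wheel(population, scores, random.Random(1))
```

With these scores, "A" has weight 0 and is never selected, and "C" is drawn about twice as often as "B". Sampling with replacement lets strong partitions appear several times in the next generation.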

Agglomerative Clustering
• The previous algorithms discovered subsystems based on the graph that was formed by recovering the relationships that existed between the source code components
• In most systems, however, we are interested in finding a hierarchy of subsystems that captures the higher-order relationships that exist in the software
• Wrapping our algorithms with an agglomerative clustering engine solves this problem

Agglomerative Clustering Algorithm
• Algorithm
  • Let S = {M1, M2, …, Mn}
  • Let G be the resource dependency graph
  • Let Q be a queue
  • Repeat
    • Find a maximal partition (Pmax) of S using the Optimal, SAHC or NAHC algorithm
    • Save partition Pmax on Q
    • Now let S = {C1, C2, …, Ck}, where each Ci is a cluster in Pmax
    • Build a new graph G by treating each cluster in Pmax as a single element. Furthermore, if there is at least one edge between any two clusters in Pmax, then there is an edge between their representative nodes in G
  • Until Pmax has coalesced into a single cluster
  • Q contains a hierarchy of partitions
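The wrapper can be sketched independently of the clustering algorithm it calls. In the sketch below, `cluster_fn` stands for whichever algorithm is plugged in (Optimal, SAHC or NAHC); `pair_up` is a trivial hypothetical stand-in used only to make the example run, and the "no progress" guard is an addition that prevents an endless loop if a clustering step merges nothing.

```python
def collapse(partition, edges):
    """Build the next-level graph: one node per cluster, and one edge per
    ordered pair of distinct clusters that are connected by at least one edge."""
    where = {m: i for i, c in enumerate(partition) for m in c}
    new_edges = {(where[u], where[v]) for u, v in edges if where[u] != where[v]}
    return list(range(len(partition))), sorted(new_edges)


def agglomerate(modules, edges, cluster_fn):
    """Return the list of partitions found at each level of the hierarchy."""
    levels = []
    nodes = list(modules)
    while True:
        p = cluster_fn(nodes, edges)
        levels.append(p)
        # stop once coalesced into one cluster (or if no merging occurred)
        if len(p) == 1 or len(p) == len(nodes):
            break
        nodes, edges = collapse(p, edges)
    return levels


def pair_up(nodes, edges):
    """Hypothetical stand-in clusterer: groups sorted nodes two at a time."""
    nodes = sorted(nodes)
    return [nodes[i:i + 2] for i in range(0, len(nodes), 2)]


modules = ["M1", "M2", "M3", "M4", "M5"]
edges = [("M1", "M2"), ("M2", "M3"), ("M3", "M1"),
         ("M4", "M5"), ("M5", "M4"), ("M3", "M4")]
levels = agglomerate(modules, edges, pair_up)
```

After the first level, nodes are integer cluster indices into the previous level's partition, so the returned list can be read bottom-up as a hierarchy.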

Where to Get the Clustering Engine
• We have implemented and applied the clustering engines to many examples
• The system can be downloaded on the Web from the Drexel University Software Engineering Research Group (SERG) homepage at:
  • http://www.mcs.drexel.edu/~serg
• The clustering engine was developed using the Java 1.1 programming language

Compiler Example

Boxer (Autolayout Utility) Example
