Modeling the Search Landscape of Metaheuristic Software Clustering Algorithms

Modeling the Search Landscape of Metaheuristic Software Clustering Algorithms. Dagstuhl – Software Architecture. Brian S. Mitchell, bmitchell@drexel.edu or http://www.mcs.drexel.edu/~bmitchel. Department of Computer Science, College of Engineering, Drexel University, Philadelphia, PA, 19104 USA

Understanding Large Systems is HARD (1) Manual Analysis is Tedious and Error Prone (2) Source Code Analysis Approaches Create Large Repositories (3) Software Clustering Approaches Create Abstract Representations. Example: RedHat Linux 7.1. Kernel: 1,400 modules, 2.5M LOC. System: 350K modules, 30M LOC. Languages: > 19 (including scripting) [http://www.dwheeler.com/sloc]

Bunch Tool. Software Clustering Requires a Representation... ...A Clustering Algorithm... ...And a way to Represent Results... Researchers Have Examined Many Different Approaches for Software Clustering

Search-Based Software Clustering with Bunch Bunch Uses Metaheuristic Search Algorithms for Software Clustering

Bunch Example: The MDG, The Random Start Point, The Solution

Evaluating Bunch’s Results • Observation: Bunch produces similar results • This is desirable, but • This is unexpected considering the use of metaheuristic search algorithms • Some evaluation has been done • “Good Enough” via empirical studies • Similarity Analysis [WCRE01, ICSM01] • Comparing to spectral clustering techniques [WCRE02] Bunch Produces a “Family” of Related Results. We were intrigued to investigate why Bunch’s results are consistently similar

The Search Landscape. Can Modeling the Search Space be useful for Evaluation? MDG, Bunch Tool, Clustering Results, Search Landscape Modeler. Structural Landscape: What are some common properties, if any, in the MDG partitions? Similarity Landscape: How similar are the contents of the MDG partitions? Cluster a System Many Times, Look for Patterns in the Clustering Results that Provide Insight into the Search Space

The Structural Landscape – What do we Expect? The Structural Landscape is Modeled using a Series of Views • MQ vs Number of Clusters: We expect to see a relationship between MQ and the number of clusters. Both MQ and the number of clusters in the partitioned MDG should not vary widely across clustering runs. • Intra-Edge Density (Comparing Bunch’s Final Results against the Initial Random Partitioned MDG): We expect a good result to produce a high percentage of intraedges (edges that start and end in the same cluster) consistently. • MQ Value: We expect repeated clustering runs to produce similar MQ results. • Number of Clusters: We expect that the number of clusters remains relatively consistent across multiple clustering runs.
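The intra-edge density view can be illustrated with a small sketch. It follows the slide's definition of an intraedge (an edge that starts and ends in the same cluster); the function name `intra_edge_percentage`, the edge-list representation of the MDG, and the `cluster_of` mapping are illustrative assumptions, not part of the Bunch tool.

```python
from typing import Dict, Hashable, Iterable, Tuple

Edge = Tuple[Hashable, Hashable]


def intra_edge_percentage(edges: Iterable[Edge],
                          cluster_of: Dict[Hashable, Hashable]) -> float:
    """Percentage of MDG edges whose endpoints share a cluster.

    `edges` is the MDG as (source, target) pairs and `cluster_of`
    maps each node to a cluster label (both are hypothetical
    representations chosen for this sketch).
    """
    edge_list = list(edges)
    if not edge_list:
        return 0.0  # no edges: treat the density as zero
    intra = sum(1 for u, v in edge_list if cluster_of[u] == cluster_of[v])
    return 100.0 * intra / len(edge_list)


# Toy MDG with four modules split into two clusters.
mdg = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")]
partition = {"a": 0, "b": 0, "c": 1, "d": 1}
density = intra_edge_percentage(mdg, partition)
```

Computing this value for both the random start partition and the final partition of each run gives the comparison named in the view.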

The Similarity Landscape – What do we Expect? [Figure: a cluster containing nodes a, b, c; edges inside the cluster are Intra-Edges, edges to other clusters are Inter-Edges.] • Create a counter C<u,v> for each edge, initialize to zero • Cluster a system many times. For each run: • For each edge, increment C<u,v> if <u,v> is an intraedge • After all runs, determine P<u,v>, which is the percentage of times that each <u,v> appears as an intraedge • Aggregate the P<u,v> based on the level of agreement: None, Low, Medium, High. Our Expectations: LARGE Dissimilarity, MODERATE Dissimilarity, NOT Similar, VERY Similar
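The counting procedure above can be sketched as follows. The counter C<u,v> and percentage P<u,v> come from the slide; the function names, the data representation, and especially the bucket boundaries (`low_max`, `high_min`) are assumptions made for illustration, since the slide does not state the thresholds used for the Low, Medium, and High levels.

```python
from collections import Counter
from typing import Dict, Hashable, List, Sequence, Tuple

Edge = Tuple[Hashable, Hashable]
Partition = Dict[Hashable, Hashable]  # node -> cluster label


def intra_edge_agreement(edges: Sequence[Edge],
                         runs: List[Partition]) -> Dict[Edge, float]:
    """Return P<u,v>: percentage of runs in which <u,v> is an intraedge."""
    if not runs:
        raise ValueError("at least one clustering run is required")
    counts = {edge: 0 for edge in edges}       # C<u,v>, initialized to zero
    for cluster_of in runs:                    # one partition per run
        for u, v in edges:
            if cluster_of[u] == cluster_of[v]:  # <u,v> is an intraedge
                counts[(u, v)] += 1
    return {edge: 100.0 * c / len(runs) for edge, c in counts.items()}


def agreement_level(p: float, low_max: float = 10.0,
                    high_min: float = 75.0) -> str:
    """Bucket P<u,v>; the default thresholds are assumed, not from the slides."""
    if p == 0.0:
        return "Zero"
    if p <= low_max:
        return "Low"
    if p < high_min:
        return "Medium"
    return "High"


def similarity_landscape(p_values: Dict[Edge, float]) -> Dict[str, float]:
    """Percentage of edges falling into each agreement level."""
    levels = Counter(agreement_level(p) for p in p_values.values())
    total = len(p_values)
    return {name: 100.0 * levels[name] / total
            for name in ("Zero", "Low", "Medium", "High")}


# Toy example: two edges, four clustering runs.
mdg = [("a", "b"), ("b", "c")]
runs = [
    {"a": 0, "b": 0, "c": 1},
    {"a": 0, "b": 0, "c": 0},
    {"a": 1, "b": 1, "c": 0},
    {"a": 0, "b": 0, "c": 1},
]
p = intra_edge_agreement(mdg, runs)
landscape = similarity_landscape(p)
```

In this toy run, edge <a,b> is an intraedge in every run and <b,c> in one run of four, so under the assumed thresholds they land in the High and Medium levels respectively.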

Case Study We also looked at 6 randomly generated MDGs

Structural Landscape (1) The independent samples were ordered by MQ to highlight some relationships that would not be obvious otherwise.

Structural Landscape (2)

Structural Landscape (3) – Random MDGs

Structural Landscape (4) – Random MDGs

Structural Landscape - Observations • There was significant commonality across the clustering results • Many desirable aspects • A lot of commonality between the random and open source systems • Some additional variability in the MQ vs Cluster Size relationship for the random MDGs • More variability in the clustering results for the random graphs with higher edge densities

Similarity Landscape (1) [Bar chart: percentage of edges (0 to 100) at each agreement level (Zero, Low, Medium, High), for Open Source Systems and Random MDGs.]

Similarity Landscape (2) [Bar chart: percentage of edges (0 to 100) at each agreement level (Zero, Low, Medium, High), for Open Source Systems, Random MDGs - Low, and Random MDGs - High.]

Observations – Similarity Landscape • Open Source systems exhibited expected trends • High dissimilarity and high similarity • Low medium similarity • Random MDGs had much higher medium similarity, and almost no high-similarity • We think that this might be due to isomorphism in the clustering results • Why: The variability in the number of clusters with similar MQ that we observed from the structural landscape

Conclusions • Ideally, evaluation can be performed by comparing Bunch’s results to a benchmark • Not possible – Graph partitioning is NP-Hard • Empirical feedback indicates that the results are “good enough” • Up to this point in time, no investigation has been performed on why Bunch produces consistent results • The Search Landscape model provided a lot of intuition into Bunch’s behavior • We examined both the structural and similarity aspects of the search landscape • The Search Landscape approach seems appropriate for modeling other metaheuristic search algorithms
