Connected Components in Software Networks

Miloš Savić, Mirjana Ivanović, Miloš Radovanović
Department of Mathematics and Informatics, Faculty of Science, University of Novi Sad

Content
• Introduction
• Data collection
• Experiments and results
• Conclusions

Introduction - software networks -
• Two levels of software complexity:
  - internal complexity of software entities (classes, functions, ...)
  - structural complexity of dependencies between entities
• Class collaboration networks:
  - nodes: classes/interfaces
  - links: OO relationships
• Static call graphs:
  - nodes: functions/procedures
  - links: call-return relationships

Introduction - connected components -
• Connected component: set of mutually reachable nodes
• Giant connected component: contains the vast majority of nodes
• Directed networks:
  - strongly connected components
  - weakly connected components
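The two notions for directed networks can be illustrated with a minimal sketch. It assumes the usual definitions (strong: mutual reachability along link direction; weak: reachability when direction is ignored). The function names and the five-node example network are hypothetical, and the quadratic-time approach is chosen for readability rather than speed.

```python
def reachable(start, adjacency):
    """Return every node reachable from `start` by following `adjacency`."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def components(nodes, edges, strong):
    """Partition `nodes` into strongly (strong=True) or weakly connected components."""
    succ, pred = {}, {}
    for src, dst in edges:
        succ.setdefault(src, set()).add(dst)
        pred.setdefault(dst, set()).add(src)
    # Ignoring direction means every link can be walked both ways.
    both = {n: succ.get(n, set()) | pred.get(n, set()) for n in nodes}
    assigned, result = set(), []
    for node in nodes:
        if node in assigned:
            continue
        if strong:
            # Mutually reachable = reachable forwards AND backwards.
            component = reachable(node, succ) & reachable(node, pred)
        else:
            component = reachable(node, both)
        assigned |= component
        result.append(component)
    return result


# Hypothetical network: A, B and C depend on each other in a cycle, E is isolated.
nodes = ["A", "B", "C", "D", "E"]
edges = [("A", "B"), ("B", "C"), ("C", "A"), ("C", "D")]
weak = components(nodes, edges, strong=False)
strong = components(nodes, edges, strong=True)
```

Here the weak components are {A, B, C, D} and {E}, while D drops out of the strong component {A, B, C} because nothing leads back from it.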

Introduction - theory of complex networks -
• Random graphs:
  - Poisson degree distribution
  - ER model (static + uniform attachment)
• Scale-free networks:
  - power-law degree distribution
  - BA model (growth + preferential attachment)
• Exponential networks:
  - exponential degree distribution
  - Model A (growth + uniform attachment)
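The three mechanisms can be sketched as small generators. This is an illustrative sketch and not the generators used for the experiments. It assumes undirected links, a complete seed core of m + 1 nodes for the growing models, and the fixed-link-count variant of the ER model. The names `grow_network` and `er_network` are hypothetical.

```python
import random


def grow_network(n, m, preferential, seed=0):
    """Grow a network in which every new node adds m links to existing nodes.

    preferential=True  -> BA-style: targets drawn proportionally to degree
    preferential=False -> Model A: targets drawn uniformly
    """
    rng = random.Random(seed)
    # Assumed seed core: a complete graph on m + 1 nodes.
    edges = [(i, j) for i in range(m + 1) for j in range(i)]
    # Each node appears here once per incident link, so a uniform draw
    # from this list is a degree-proportional draw over nodes.
    endpoints = [v for edge in edges for v in edge]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            if preferential:
                targets.add(rng.choice(endpoints))
            else:
                targets.add(rng.randrange(new))
        for target in sorted(targets):
            edges.append((new, target))
            endpoints.extend((new, target))
    return edges


def er_network(n, num_links, seed=0):
    """Static model: a fixed node set with uniformly chosen distinct links."""
    rng = random.Random(seed)
    edges = set()
    while len(edges) < num_links:
        a, b = rng.sample(range(n), 2)
        edges.add((min(a, b), max(a, b)))
    return sorted(edges)
```

Calling `grow_network(200, 2, preferential=True)` and `grow_network(200, 2, preferential=False)` gives two networks with the same number of nodes and links that differ only in the attachment rule, which is the sense in which sampled networks are "comparable" in this sketch.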

Introduction - motivations -
• Model A: test the complementary cumulative in/out/total degree distributions of giant weakly connected components against a power-law and an exponential distribution
• “robust yet fragile”: investigate the topological stability of giant weakly connected components
• “hierarchical small-worlds, scale-free networks from optimal design”: determine the size of strongly connected components
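A complementary cumulative distribution can be computed directly from a degree sequence. The sketch below also compares the two candidate shapes with a crude heuristic: a power law is a straight line on log-log axes, an exponential is a straight line on semi-log axes, so the R² of each linearised fit is reported. The slides do not state which fitting procedure was used, so this heuristic and the names `ccdf`, `r_squared` and `compare_fits` are assumptions.

```python
import math
from collections import Counter


def ccdf(degrees):
    """Return sorted (k, P(K >= k)) pairs for a degree sequence."""
    n = len(degrees)
    counts = Counter(degrees)
    result, remaining = [], n
    for k in sorted(counts):
        result.append((k, remaining / n))
        remaining -= counts[k]
    return result


def r_squared(xs, ys):
    """Squared Pearson correlation, i.e. R^2 of a straight-line fit."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy * sxy / (sxx * syy)


def compare_fits(degrees):
    """Return (R^2 on log-log axes, R^2 on semi-log axes) for the CCDF."""
    points = [(k, p) for k, p in ccdf(degrees) if k > 0]  # log(0) is undefined
    log_p = [math.log(p) for _, p in points]
    power_law = r_squared([math.log(k) for k, _ in points], log_p)
    exponential = r_squared([k for k, _ in points], log_p)
    return power_law, exponential
```

A higher second value suggests the exponential shape describes the sequence better. This is a visual-fit proxy only, not a statistical test.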

Data collection
• Class collaboration networks:
  - Ant, Tomcat, Lucene, JavaCC, JDK
  - extractor: Yaccne
• Static call graphs:
  - gcc, kernel component of the Linux kernel
  - extractor: Doxygen + our .dot aggregator

Experiments and results - giant weakly connected components -
Comparable networks sampled from the ER, BA and Model A models contain a GWCC.

Experiments and results - degree distribution of GWCCs -

Experiments and results - implications -
• Theoretical implications: a model that can reproduce the connectivity pattern characteristic of software systems
• Related to software engineering:
  - in-degree = degree of class/function reuse
  - out-degree = degree of class/function aggregation

Experiments and results - theoretical implications -
• Superposition model: growth + preferential attachment for out-going links + uniform attachment for in-coming links
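One possible reading of the superposition model is sketched below. The slide names only the three ingredients, so the details are assumptions: each new node emits `d_out` links to targets chosen proportionally to total degree, receives `d_in` links from sources chosen uniformly among existing nodes, and the network starts from a directed ring. The function name and parameters are hypothetical.

```python
import random


def superposition_network(n, d_out, d_in, seed=0):
    """Grow a directed network; returns a list of (source, target) links."""
    rng = random.Random(seed)
    core = max(d_out, d_in) + 1
    # Assumed seed: a directed ring, so every core node already has links.
    edges = [(i, (i + 1) % core) for i in range(core)]
    # One entry per link endpoint: a uniform draw is degree-proportional.
    endpoints = [v for edge in edges for v in edge]
    for new in range(core, n):
        # Out-going links: preferential attachment.
        targets = set()
        while len(targets) < d_out:
            targets.add(rng.choice(endpoints))
        # In-coming links: uniform attachment over existing nodes.
        sources = rng.sample(range(new), d_in)
        new_links = [(new, t) for t in sorted(targets)]
        new_links += [(s, new) for s in sorted(sources)]
        for src, dst in new_links:
            edges.append((src, dst))
            endpoints.extend((src, dst))
    return edges
```

Under this reading, in-degrees grow by a rich-get-richer rule while out-degrees beyond the initial `d_out` accumulate uniformly, which is the asymmetry the slide's model is meant to capture.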

Experiments and results - analytical solution of the superposition model -
• Continuum approach: “Mean field theory for scale-free random networks” (Barabási et al., ’99)
• Din/Dout: number of in-coming/out-going links introduced by each node
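For orientation, the continuum results of the cited paper for the two ingredient mechanisms are restated below in that paper's notation (m links per new node, m_0 initial nodes, t_i the time at which node i is added). This is background on the method and not the solution of the superposition model itself.

```latex
% Growth + preferential attachment (BA model)
\frac{\partial k_i}{\partial t} = \frac{k_i}{2t}
\;\Rightarrow\;
k_i(t) = m\left(\frac{t}{t_i}\right)^{1/2},
\qquad
P(k) \simeq \frac{2m^2}{k^3}

% Growth + uniform attachment (Model A)
\frac{\partial k_i}{\partial t} = \frac{m}{m_0 + t - 1}
\;\Rightarrow\;
k_i(t) = m\left[\ln\frac{m_0 + t - 1}{m_0 + t_i - 1} + 1\right],
\qquad
P(k) \simeq \frac{e}{m}\exp\!\left(-\frac{k}{m}\right)
```

Preferential attachment yields a power-law tail and uniform attachment an exponential one, which is why combining the two rules can give different shapes for in-degree and out-degree.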

Experiments and results - implications related to SE -
• First combinatorial principle of graph theory: Avg(reuse) = Avg(aggregation)
• But:
  - Dispersion(reuse) → ∞ as N → ∞
  - Dispersion(aggregation) ~ Avg(aggregation)^2
• Conclusions:
  1. Software systems exhibit a characteristic scale of code aggregation, but there is no characteristic scale of code reuse.
  2. Highly reused entities tend to be more reused.
  3. Predictability of code reuse and unpredictability of code aggregation as software systems evolve.
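The equality of the two averages follows from every link contributing exactly one out-going end and one in-coming end, while the dispersions are free to differ. A minimal sketch on a hypothetical five-class network, in which four classes all use a single utility class:

```python
from statistics import mean, pvariance


def degree_statistics(nodes, edges):
    """Average and variance of in-degree (reuse) and out-degree (aggregation)."""
    in_deg = {n: 0 for n in nodes}
    out_deg = {n: 0 for n in nodes}
    for src, dst in edges:
        out_deg[src] += 1  # src aggregates dst
        in_deg[dst] += 1   # dst is reused by src
    reuse, aggregation = list(in_deg.values()), list(out_deg.values())
    return {
        "avg_reuse": mean(reuse),
        "avg_aggregation": mean(aggregation),
        "var_reuse": pvariance(reuse),
        "var_aggregation": pvariance(aggregation),
    }


# Hypothetical example: classes A-D each reference the class Util.
nodes = ["A", "B", "C", "D", "Util"]
edges = [("A", "Util"), ("B", "Util"), ("C", "Util"), ("D", "Util")]
stats = degree_statistics(nodes, edges)
```

Both averages equal 4 links / 5 nodes = 0.8, but the variance of reuse (2.56) is far larger than the variance of aggregation (0.16), because all of the reuse is concentrated on one class.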

Experiments and results - topological stability of GWCCs -
• Experiments:
  - removal of one node: to check the existence of articulation points
  - successive removal of preferential (highest-degree) nodes: to check the fragility
  - successive removal of nodes at random: to check the robustness
• After each removal, the size of the largest weakly connected component is measured
• fc-pref/fc-rnd: critical fraction of nodes that need to be removed in order to destroy the giant weakly connected component when the preferential/random node removal scheme is applied
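The successive-removal experiments can be sketched as follows. Two details are not given on the slide and are assumptions here: degrees are recomputed after every removal, and the giant component counts as destroyed once the largest weakly connected component holds less than a `threshold` share of the surviving nodes. The function names are hypothetical.

```python
import random


def largest_wcc_size(nodes, edges):
    """Size of the largest component when link direction is ignored."""
    neighbours = {n: set() for n in nodes}
    for a, b in edges:
        if a in neighbours and b in neighbours:
            neighbours[a].add(b)
            neighbours[b].add(a)
    seen, best = set(), 0
    for start in neighbours:
        if start in seen:
            continue
        seen.add(start)
        size, stack = 0, [start]
        while stack:
            node = stack.pop()
            size += 1
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        best = max(best, size)
    return best


def critical_fraction(nodes, edges, preferential, threshold=0.5, seed=0):
    """Fraction of nodes removed when the giant component breaks, else None."""
    rng = random.Random(seed)
    alive, total, removed = set(nodes), len(nodes), 0
    while len(alive) > 1:
        live = [(a, b) for a, b in edges if a in alive and b in alive]
        if largest_wcc_size(alive, live) < threshold * len(alive):
            return removed / total
        if preferential:
            # Attack: remove the node with the highest current degree.
            degree = {n: 0 for n in alive}
            for a, b in live:
                degree[a] += 1
                degree[b] += 1
            victim = max(sorted(degree), key=degree.get)
        else:
            # Random failure: remove a uniformly chosen surviving node.
            victim = rng.choice(sorted(alive))
        alive.remove(victim)
        removed += 1
    return None  # the giant component was never lost
```

On a star with one hub and ten leaves, `critical_fraction(nodes, edges, preferential=True)` returns 1/11: removing the hub alone shatters the network, the extreme case of fragility under preferential removal.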

Experiments and results - articulation points -
• Software networks contain APs: [2.91% - 15.50%] of network size
• BA model:
  - Dtotal: number of links introduced by each node
  - Dtotal = 1 ⇒ num(AP) in the range [31% - 35.4%]
  - Dtotal > 1 ⇒ num(AP) = 0
• BAU model:
  - Dtotal is not a constant but a random variable such that P{Dtotal = 1} > 0
  - the modification does not affect the scale-free properties of degree distributions and produces APs
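The single-node removal check can be sketched by brute force: a node is an articulation point if deleting it increases the number of components. Link direction is ignored, which is an assumption matching the weakly connected setting. The function names and the example network are hypothetical, and a linear-time DFS method would be preferable for large networks.

```python
def count_components(nodes, edges):
    """Number of connected components when link direction is ignored."""
    neighbours = {n: set() for n in nodes}
    for a, b in edges:
        if a in neighbours and b in neighbours:  # skip links to removed nodes
            neighbours[a].add(b)
            neighbours[b].add(a)
    seen, count = set(), 0
    for start in neighbours:
        if start in seen:
            continue
        count += 1
        seen.add(start)
        stack = [start]
        while stack:
            node = stack.pop()
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
    return count


def articulation_points(nodes, edges):
    """Nodes whose removal splits the network into more components."""
    base = count_components(nodes, edges)
    points = []
    for candidate in nodes:
        rest = [n for n in nodes if n != candidate]
        if count_components(rest, edges) > base:
            points.append(candidate)
    return points


# Hypothetical example: a chain A-B-C attached to a triangle C-D-E.
nodes = ["A", "B", "C", "D", "E"]
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("E", "C")]
points = articulation_points(nodes, edges)
```

Here B and C are articulation points, while D and E are not because the triangle offers an alternative path around each of them. The same reasoning explains the BA result: when every node introduces one link the network is a tree, so every non-leaf node is an articulation point.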

Experiments and results - preferential node removal -
• Software networks are extremely vulnerable: fc(software network) < fc(BAU) < fc(EXP) < fc(RND)

Experiments and results - random node removal -
• Software networks (except Linux) never lose their GWCCs
• The same holds for comparable networks generated by the theoretical models
• The Linux static call graph is a scale-free network that is sensitive to random errors: fc(Linux) < fc(RND) < fc(EXP) < fc(BAU)
• Large real-world networks: fc(RND) < fc(rw-net)

Experiments and results - strongly connected components -
• Linux: SCCs as a minor effect
• Other networks: no GSCC, but relatively large SCCs
• A topological sort cannot be made ⇒ there is no elegant systematic testing strategy
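Why cyclic dependencies block a topological sort can be seen with Kahn's algorithm: it repeatedly emits a node with no unprocessed incoming links, and it stalls as soon as only nodes inside a cycle remain. The function name and the example call graphs below are hypothetical.

```python
from collections import deque


def topological_order(nodes, edges):
    """Return an order in which every link points forward, or None if a cycle exists."""
    indegree = {n: 0 for n in nodes}
    successors = {n: [] for n in nodes}
    for src, dst in edges:
        successors[src].append(dst)
        indegree[dst] += 1
    ready = deque(n for n in nodes if indegree[n] == 0)
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for nxt in successors[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    # Nodes on a cycle never reach in-degree 0, so they are never emitted.
    return order if len(order) == len(nodes) else None


nodes = ["main", "parse", "lex"]
acyclic = topological_order(nodes, [("main", "parse"), ("parse", "lex")])
cyclic = topological_order(nodes, [("main", "parse"), ("parse", "lex"), ("lex", "parse")])
```

In the acyclic case the result is main, parse, lex, and reversing it would give one bottom-up order in which each function is tested after the functions it calls (an assumed reading of "systematic testing strategy"). With the mutual dependency between parse and lex no such order exists and the function returns None.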

Largest strongly connected component in GCC’s giant weakly connected component containing 116 mutually reachable nodes

Conclusions
• Out-degree sequences of software networks can be better modeled with an exponential distribution than with a power-law
• Scale-free software networks contain articulation points
• Software networks are extremely vulnerable to the removal of the highest-degree nodes and (except Linux) share the same level of robustness as comparable networks generated by theoretical models

Conclusions
• The Linux static call graph is an intriguing example of a scale-free network that does not display tolerance against random errors
• Software networks contain relatively large cyclic dependencies: substructures that do not reflect optimal design and hierarchical small-worldliness
