Analysis and Modeling of the Open Source Software Community Yongqin Gao, Greg Madey Computer Science & Engineering University of Notre Dame Vincent Freeh Computer Science Dept. NCSU NAACSOS Conference Pittsburgh, PA June 25, 2003 Supported in part by the National Science Foundation – Digital Science & Technology
Outline • Overview • Data collection • Network modeling • Topological statistical analysis • Conclusion
Overview • What is OSS • Free to use and distribute • Unlimited users and usage • Source code available and modifiable • Potential advantages over commercial software • Higher quality • Faster development • Lower cost • Our goal • Understanding the OSS phenomenon • Approach • SourceForge is the source of our empirical data • Modeling as a social network • Analysis of topological statistics
Data Collection (Monthly) • Web crawler (scripts) • Python • Perl • AWK • Sed • Monthly • Since Jan 2001 • ProjectID • DeveloperID • Almost 2 million records • Relational database
PROJ|DEVELOPER
8001|dev348
8001|dev8972
8001|dev9922
8002|dev27650
8005|dev31351
8006|dev12409
8007|dev19935
8007|dev4262
8007|dev36711
8008|dev8972
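The PROJ|DEVELOPER records above can be loaded into two lookup tables, one per side of the membership relation. This is a minimal sketch, not the crawler or database code used in the study; the function name `load_memberships` is hypothetical, and only the ten sample records shown on the slide are used.

```python
from collections import defaultdict

# Sample records copied from the slide; the full table is not reproduced here.
RECORDS = """\
PROJ|DEVELOPER
8001|dev348
8001|dev8972
8001|dev9922
8002|dev27650
8005|dev31351
8006|dev12409
8007|dev19935
8007|dev4262
8007|dev36711
8008|dev8972
"""


def load_memberships(text):
    """Return (project -> developers, developer -> projects) as dicts of sets."""
    proj_devs = defaultdict(set)
    dev_projs = defaultdict(set)
    lines = text.strip().splitlines()
    for line in lines[1:]:  # skip the header row
        proj, dev = line.split("|")
        proj_devs[proj].add(dev)
        dev_projs[dev].add(proj)
    return dict(proj_devs), dict(dev_projs)


proj_devs, dev_projs = load_memberships(RECORDS)
```

In this sample, dev8972 appears under both project 8001 and project 8008, which is the kind of shared membership that links projects together.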
Modeling as a collaboration network • What is a collaboration network • A social network representing collaborative relationships • Examples: movie actor network and scientist collaboration network • Differences of the SourceForge collaboration network • Detachment • Virtual collaboration • Voluntary • Global • Bipartite property of the collaboration network
[Figure: SourceForge developer network, OSS Developer Network (part). Developers are nodes / projects are links. Shown: 24 developers, 5 projects (7597, 6882, 7028, 9859, 15850), 2 hub developers, 1 cluster.]
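A network in which developers are nodes and shared projects are links can be derived from the bipartite membership data by a one-mode projection. The sketch below assumes that construction; the function name `project_developers` and the toy memberships are hypothetical and are not taken from the figure.

```python
from itertools import combinations


def project_developers(proj_devs):
    """Link every pair of developers who share at least one project."""
    adj = {}
    for devs in proj_devs.values():
        for d in devs:
            adj.setdefault(d, set())  # keep solo developers as isolated nodes
        for a, b in combinations(sorted(devs), 2):
            adj[a].add(b)
            adj[b].add(a)
    return adj


# Hypothetical toy memberships (project -> developers), for illustration only.
toy = {
    "p1": {"devA", "devB", "devC"},
    "p2": {"devB", "devD"},
    "p3": {"devE"},
}
dev_net = project_developers(toy)
```

Swapping the roles of the two columns (developer -> projects) gives the project network in the same way.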
Topological analysis • Statistics inspected • Diameter • Average degree • Clustering coefficient • Degree distribution • Cluster size distribution • Relative size of major cluster • Fitness and life cycle • Evolution of these statistics
Diameter of developer network vs. time • Diameter: the average length of the shortest paths between all pairs of vertices • For the developer network (30,000 to 70,000 nodes), the values are between 6 and 8
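The diameter as defined here (an average of shortest-path lengths) can be computed with one breadth-first search per vertex. The following is a sketch under two assumptions that the slides do not state: the graph is unweighted, and pairs in different clusters are skipped rather than counted as infinite. Function names are hypothetical.

```python
from collections import deque


def bfs_distances(adj, source):
    """Hop counts from source to every vertex reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def average_path_length(adj):
    """Mean shortest-path length over all connected vertex pairs."""
    total = 0
    pairs = 0
    for s in adj:
        for t, d in bfs_distances(adj, s).items():
            if t != s:
                total += d
                pairs += 1
    return total / pairs if pairs else 0.0


# Toy path graph a - b - c - d (illustrative only).
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
avg_len = average_path_length(path)
```

Running a search from every vertex costs on the order of V times (V + E), so for networks with tens of thousands of nodes a sampled set of source vertices may be a practical substitute.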
Diameter of project network vs. time • For the project network (20,000 to 50,000 nodes), the values are between 6 and 7 • Diameter decreases with time for both the developer network and the project network
Average degree vs. time • The values for developer network are between 7 and 8 • The values for project network are just between 3 and 4
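Average degree is the mean number of links per node. A minimal sketch, assuming the network is stored as an adjacency dictionary (node -> set of neighbors); the function name and the toy graph are hypothetical.

```python
def average_degree(adj):
    """Mean number of neighbors per vertex (equals 2E / V for a simple graph)."""
    if not adj:
        return 0.0
    return sum(len(nbrs) for nbrs in adj.values()) / len(adj)


# Toy path graph a - b - c - d: degrees are 1, 2, 2, 1.
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
avg_deg = average_degree(path)
```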
Degree distribution (developers) • Power law in the developer degree distribution • R² = 0.9714
Degree distribution (projects) • Power law in the project degree distribution • R² = 0.9838
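One common way to check for a power law and obtain an R² value is a least-squares line fit on log-log axes. The slides do not say which fitting procedure produced the reported values, so the sketch below is an assumption, and the synthetic counts are made up to follow an exact power law with exponent 2.

```python
import math


def fit_power_law(counts):
    """Fit log10(count) = slope * log10(degree) + intercept; return (slope, intercept, r2).

    `counts` maps degree -> number of vertices with that degree.
    Degrees must be positive, since log(0) is undefined.
    """
    xs = [math.log10(k) for k in counts]
    ys = [math.log10(c) for c in counts.values()]
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    r2 = (sxy * sxy) / (sxx * syy) if syy else 1.0
    return slope, intercept, r2


# Synthetic data: count = 1000 * degree^-2 (illustrative, not SourceForge data).
synthetic = {1: 1000.0, 2: 250.0, 4: 62.5, 5: 40.0, 10: 10.0}
slope, intercept, r2 = fit_power_law(synthetic)
```

The negated slope estimates the power-law exponent. A straight-line fit on log-log axes is simple but sensitive to noise in the tail, so an R² close to 1 is suggestive rather than conclusive.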
Cluster size distribution • Cluster size distribution of the developer network • R² with the major cluster included is 0.7426 • R² with the major cluster excluded is 0.9799
Relative size of major cluster vs. time • The relative size of the major cluster increases steadily • It appears to converge slowly toward a fixed percentage of around 35% • This may be an indication of how the network evolves
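Cluster sizes and the relative size of the major cluster can be obtained from a connected-components pass. This sketch assumes that "relative size" means the fraction of all nodes belonging to the largest component; function names and the toy graph are hypothetical.

```python
def component_sizes(adj):
    """Sizes of all connected components, largest first."""
    seen = set()
    sizes = []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        size = 0
        while stack:  # iterative depth-first traversal of one component
            u = stack.pop()
            size += 1
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        sizes.append(size)
    return sorted(sizes, reverse=True)


def major_cluster_fraction(adj):
    """Fraction of all vertices that belong to the largest component."""
    sizes = component_sizes(adj)
    return sizes[0] / sum(sizes) if sizes else 0.0


# Toy graph with three clusters: {a, b, c}, {d, e}, {f}.
toy = {
    "a": {"b"}, "b": {"a", "c"}, "c": {"b"},
    "d": {"e"}, "e": {"d"},
    "f": set(),
}
sizes = component_sizes(toy)
fraction = major_cluster_fraction(toy)
```

Applying this to each monthly snapshot would give the time series plotted on this slide, and the list of sizes is the input for the cluster size distribution.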
Existence of fitness • Tracking the development of individual projects can verify the existence of the “young upcomer” phenomenon • We tracked every project that was new in July 2001 from then until now (1,660 projects in total) • Maximum monthly growth per project is 13, while average monthly growth per project is just 0.3639
Summary of results • Power law rules • Degree distributions, cluster size distribution • Average degree increasing with time • Diameter decreasing with time • Clustering coefficient decreasing with time • Fitness exists in SourceForge • Projects show life cycle behavior
Conclusion • Studying the SourceForge collaboration network can help us understand the OSS community • We investigated not only the topological statistics but also the evolution of these statistics • Simulation is needed for further investigation of the SourceForge collaboration network
Terminology • Degree • The number of edges connected to a given vertex • Degree distribution • The distribution of degrees across the network • Cluster • A connected component of the network • Diameter • Average length of the shortest paths between all pairs of vertices • Clustering coefficient (CC) • CC_i: the number of links actually present among the neighbors of vertex i, as a fraction of the total possible number of links among those neighbors • CC: the average of CC_i over all vertices in the network
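The clustering coefficient definition can be written out directly. In this sketch, vertices with fewer than two neighbors are given CC_i = 0, which is an assumption (the slides do not say how such vertices are handled); function names and the toy graph are hypothetical.

```python
def local_clustering(adj, v):
    """CC_i: links present among v's neighbors / links possible among them."""
    nbrs = adj[v]
    k = len(nbrs)
    if k < 2:
        return 0.0  # assumption: undefined cases count as zero
    # Each neighbor-neighbor link is seen from both ends, so halve the count.
    links = sum(1 for a in nbrs for b in adj[a] if b in nbrs) // 2
    return 2 * links / (k * (k - 1))


def average_clustering(adj):
    """CC: mean of CC_i over all vertices."""
    if not adj:
        return 0.0
    return sum(local_clustering(adj, v) for v in adj) / len(adj)


# Toy graph: triangle a-b-c with a pendant vertex d attached to c.
toy = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
cc_c = local_clustering(toy, "c")
cc = average_clustering(toy)
```

Here vertex c has three neighbors with one link among them (a-b) out of three possible, so CC_c is 1/3, and the network average is (1 + 1 + 1/3 + 0) / 4.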