Analysis and Modeling of the Open Source Software Community

Analysis and Modeling of the Open Source Software Community Yongqin Gao, Greg Madey Computer Science & Engineering University of Notre Dame Vincent Freeh Computer Science Dept. NCSU NAACSOS Conference Pittsburgh, PA June 25, 2003 Supported in part by


Presentation Transcript


  1. Analysis and Modeling of the Open Source Software Community Yongqin Gao, Greg Madey Computer Science & Engineering University of Notre Dame Vincent Freeh Computer Science Dept. NCSU NAACSOS Conference Pittsburgh, PA June 25, 2003 Supported in part by the National Science Foundation – Digital Science & Technology

  2. Outline • Overview • Data collection • Network modeling • Topological statistical analysis • Conclusion

  3. Overview • What is OSS? • Free to use and distribute • Unlimited users and usage • Source code available and modifiable • Potential advantages over commercial software • Higher quality • Faster development • Lower cost • Our goal • Understanding the OSS phenomenon • Approach • SourceForge is the source of our empirical data • Modeling it as a social network • Analysis of topological statistics

  4. Data Collection (Monthly) • Web crawler (scripts) • Python • Perl • AWK • Sed • Monthly • Since Jan 2001 • ProjectID • DeveloperID • Almost 2 million records • Relational database

     PROJ|DEVELOPER
     8001|dev348
     8001|dev8972
     8001|dev9922
     8002|dev27650
     8005|dev31351
     8006|dev12409
     8007|dev19935
     8007|dev4262
     8007|dev36711
     8008|dev8972
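The PROJ|DEVELOPER records above can be read into two lookup tables, one per side of the project/developer relation. The following is a minimal sketch, not the authors' crawler or database schema; the function name `load_memberships` and the in-memory `SAMPLE` string are assumptions made for illustration, and the rows are the ones shown on the slide.

```python
from collections import defaultdict

# Sample rows copied from the slide; the real data set has almost 2 million records.
SAMPLE = """PROJ|DEVELOPER
8001|dev348
8001|dev8972
8001|dev9922
8002|dev27650
8005|dev31351
8006|dev12409
8007|dev19935
8007|dev4262
8007|dev36711
8008|dev8972"""


def load_memberships(text):
    """Parse pipe-delimited rows into project->developers and developer->projects maps."""
    project_devs = defaultdict(set)
    dev_projects = defaultdict(set)
    for line in text.splitlines()[1:]:  # skip the header row
        line = line.strip()
        if not line:
            continue
        project_id, developer_id = line.split("|")
        project_devs[project_id].add(developer_id)
        dev_projects[developer_id].add(project_id)
    return project_devs, dev_projects


project_devs, dev_projects = load_memberships(SAMPLE)
print(sorted(dev_projects["dev8972"]))  # projects that dev8972 belongs to
```

Keeping both maps makes it cheap to build either the developer network or the project network later.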

  5. Modeling as a collaboration network • What is a collaboration network? • A social network representing collaborative relationships • Examples: movie actor network and scientist collaboration network • How the SourceForge collaboration network differs • Detachment • Virtual collaboration • Voluntary • Global • Bipartite property of collaboration networks

  6. Collaboration network - bipartite

  7. SourceForge developer network: OSS Developer Network (Part) • [Figure: network diagram of developer nodes dev[45] through dev[99], linked through Projects 7597, 6882, 7028, 9859 and 15850] • Developers are nodes / Projects are links • 24 Developers • 5 Projects • 2 hub Developers • 1 Cluster
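"Developers are nodes / Projects are links" corresponds to projecting the bipartite project/developer relation onto developers: two developers are adjacent when they share at least one project. The sketch below assumes a plain dictionary of sets as the graph representation; `developer_network` and the sample `memberships` are illustrative names, not part of the original study.

```python
from itertools import combinations


def developer_network(project_devs):
    """Project a project->developers map onto an undirected developer graph."""
    adj = {}
    for devs in project_devs.values():
        for dev in devs:
            adj.setdefault(dev, set())  # keep developers who work alone
        for a, b in combinations(sorted(devs), 2):
            adj[a].add(b)
            adj[b].add(a)
    return adj


# Hypothetical memberships, using IDs from the sample rows on slide 4.
memberships = {
    "8001": {"dev348", "dev8972", "dev9922"},
    "8002": {"dev27650"},
    "8008": {"dev8972"},
}
dev_adj = developer_network(memberships)
print(sorted(dev_adj["dev8972"]))
```

Swapping the roles of the two ID columns gives the project network, where projects are nodes and shared developers are links.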

  8. Topological analysis • Statistics inspected • Diameter • Average degree • Clustering coefficient • Degree distribution • Cluster size distribution • Relative size of major cluster • Fitness and life cycle • Evolution of these statistics

  9. Diameter of developer network vs. time • Diameter here is the average shortest-path length over all pairs of vertices • The values for the developer network (30,000 – 70,000) are between 6 and 8
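The diameter as defined on this slide (an average of shortest-path lengths) can be computed with one breadth-first search per vertex. This is a minimal sketch that assumes an unweighted graph stored as a dictionary of neighbor sets and averages only over pairs that are actually connected; how the original study handled disconnected pairs is not stated in the slides.

```python
from collections import deque


def bfs_distances(adj, source):
    """Return hop counts from source to every vertex reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def average_path_length(adj):
    """Average shortest-path length over all connected ordered pairs."""
    total = 0
    pairs = 0
    for source in adj:
        for target, d in bfs_distances(adj, source).items():
            if target != source:
                total += d
                pairs += 1
    return total / pairs if pairs else 0.0


# Toy path graph a - b - c (hypothetical data).
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
print(average_path_length(path))
```

One BFS per vertex costs O(V·(V+E)) overall, so for networks with tens of thousands of vertices a sampled subset of sources may be a practical approximation.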

  10. Diameter of project network vs. time • The values for the project network (20,000 – 50,000) are between 6 and 7 • Diameter decreases over time for both the developer network and the project network

  11. Average degree vs. time • The values for developer network are between 7 and 8 • The values for project network are just between 3 and 4
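For reference, the average degree reported here is the mean number of neighbors per vertex, which equals twice the edge count divided by the vertex count. A short sketch under the assumption that the graph is an undirected dictionary of neighbor sets (the toy graph is hypothetical):

```python
def average_degree(adj):
    """Mean number of neighbors per vertex in an undirected graph."""
    if not adj:
        return 0.0
    return sum(len(neighbors) for neighbors in adj.values()) / len(adj)


# Toy path graph a - b - c: degrees are 1, 2, 1.
toy = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
print(average_degree(toy))
```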

  12. Clustering coefficient of developer network vs. time

  13. Clustering coefficient of project network vs. time

  14. Degree distribution (developers) • Power law in developer degree distribution • R² = 0.9714

  15. Degree distribution (projects) • Power law in project degree distribution • R² = 0.9838
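The slides report R² values for the power-law fits without stating the fitting procedure. One common approach, assumed here, is an ordinary least-squares line through the (degree, count) points on log-log axes; the slope estimates the power-law exponent and R² measures how well the line fits. The function name and the synthetic data are illustrative only.

```python
import math


def fit_power_law(points):
    """Least-squares fit of log10(count) = intercept + slope * log10(degree).

    points: iterable of (degree, count) pairs with degree > 0 and count > 0.
    Returns (slope, intercept, r_squared).
    """
    xs = [math.log10(k) for k, _ in points]
    ys = [math.log10(c) for _, c in points]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1.0 - ss_res / ss_tot


# Synthetic counts following count = 1000 * degree^-2 exactly (not SourceForge data).
synthetic = [(k, 1000.0 * k ** -2) for k in (1, 2, 4, 8)]
slope, intercept, r2 = fit_power_law(synthetic)
print(slope, r2)
```

A log-log regression is simple but can be biased by the noisy tail of the distribution, so it should be read as a rough check rather than a rigorous test of power-law behavior.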

  16. Cluster size distribution • Cluster size distribution of the developer network • R² with major cluster is 0.7426 • R² without major cluster is 0.9799

  17. Relative size of major cluster vs. time • The relative size of the major cluster increases steadily • It appears to converge slowly to a fixed percentage of around 35% • May be an indication of the network evolution
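Both cluster statistics (the size distribution and the relative size of the major cluster) follow from the connected components of the network. The sketch below assumes a dictionary-of-sets graph and treats the largest component as the major cluster; the function names and toy graph are hypothetical.

```python
from collections import Counter


def component_sizes(adj):
    """Return the size of every connected component (cluster)."""
    seen = set()
    sizes = []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        size = 0
        while stack:  # iterative depth-first traversal of one component
            u = stack.pop()
            size += 1
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        sizes.append(size)
    return sizes


def major_cluster_fraction(adj):
    """Fraction of all vertices that belong to the largest cluster."""
    sizes = component_sizes(adj)
    return max(sizes) / sum(sizes) if sizes else 0.0


# Toy graph: clusters of size 3, 2 and 1.
toy = {
    "a": {"b"}, "b": {"a", "c"}, "c": {"b"},
    "d": {"e"}, "e": {"d"},
    "f": set(),
}
size_distribution = Counter(component_sizes(toy))  # cluster size -> number of clusters
print(size_distribution, major_cluster_fraction(toy))
```

Running this on each monthly snapshot would give the time series of the major cluster's relative size.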

  18. Existence of fitness • Tracking the development of individual projects can verify the existence of the “young upcomer” phenomenon • We tracked every project that was new in July 2001 from then until now (1660 projects in total) • Maximal monthly growth per project is 13, while average monthly growth per project is just 0.3639

  19. Life cycle of project

  20. Summary of results • Power laws hold • Degree distributions, cluster size distribution • Average degree increases with time • Diameter decreases with time • Clustering coefficient decreases with time • Fitness exists in SourceForge • Projects show life cycle behavior

  21. Conclusion • Studying the SourceForge collaboration network can help us understand the OSS community • We investigated not only the topological statistics but also the evolution of these statistics • Simulation is needed for further investigation of the SourceForge collaboration network

  22. Thank you

  23. Terminology • Degree • The number of edges connected to a given vertex • Degree distribution • The distribution of degrees throughout a network • Cluster • A connected component of the network • Diameter • Average length of shortest paths between all pairs of vertices • Clustering coefficient (CC) • CCi: the number of links actually present among the vertices in the neighborhood of vertex i, as a fraction of the total possible number of links among them • CC: average of all CCi in a network
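The clustering coefficient definition above translates directly into code. This sketch assumes an undirected dictionary-of-sets graph and assigns CCi = 0 to vertices with fewer than two neighbors; that convention is an assumption, since the slides do not say how such vertices were treated.

```python
from itertools import combinations


def local_clustering(adj, v):
    """CCi: links present among v's neighbors divided by links possible."""
    neighbors = adj[v]
    k = len(neighbors)
    if k < 2:
        return 0.0  # assumed convention for vertices with fewer than 2 neighbors
    links = sum(1 for u, w in combinations(neighbors, 2) if w in adj[u])
    return 2.0 * links / (k * (k - 1))


def average_clustering(adj):
    """CC: average of CCi over all vertices."""
    if not adj:
        return 0.0
    return sum(local_clustering(adj, v) for v in adj) / len(adj)


# Toy graph: triangle a-b-c with an extra vertex d attached to a.
toy = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
print(local_clustering(toy, "a"), average_clustering(toy))
```

In the toy graph, vertex a has three neighbors with one link among them out of three possible, so its CCi is 1/3.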
