
An Investigation into the Free/Open Source Software Phenomenon using Data Mining, Social Network Theory, and Agent-Based

An Investigation into the Free/Open Source Software Phenomenon using Data Mining, Social Network Theory, and Agent-Based Greg Madey Computer Science & Engineering University of Notre Dame UIUC - NSF Workshop on Continuous (Re)Design of Open Source Software




Presentation Transcript


  1. An Investigation into the Free/Open Source Software Phenomenon using Data Mining, Social Network Theory, and Agent-Based • Greg Madey, Computer Science & Engineering, University of Notre Dame • UIUC - NSF Workshop on Continuous (Re)Design of Open Source Software, University of Illinois, Urbana-Champaign, October 8-9, 2003 • This research was partially supported by the US National Science Foundation, CISE/IIS-Digital Society & Technology, under Grant No. 0222829

  2. Contributors • Vincent Freeh, Computer Science, North Carolina State University (Principal Investigator) • Yongqin Gao, Computer Science and Engineering, University of Notre Dame (Graduate Student) • Jeff Goett, University of Notre Dame (REU Student) • Chris Hoffman, University of Notre Dame (REU Student) • Nadir Kiyanclar, University of Notre Dame (REU Student) • Greg Madey, Computer Science & Engineering, University of Notre Dame (Principal Investigator) • Patrick McGovern, Director SourceForge.net, VA Software (Industrial Collaborator) • Carlos Siu, University of Notre Dame (REU Student) • Renee Tynan, Department of Management, College of Business, University of Notre Dame (Principal Investigator) • Jin Xu, Computer Science & Engineering, University of Notre Dame (Graduate Student)

  3. Outline • Research approach • Tools and definitions: Agents, models, simulations, collaborative social networks, computer experiments • Data collection and analysis • Example research question • Simulation • Computer experiments • Results

  4. One Approach to Researching F/OSSD • Online data • Screen scraping • Database dumps • Modeling • Social network theory • Evolutionary assumptions • Simulation • Verification and validation • Computer experiments • Variation of Classical Scientific Method

  5. Classical Scientific Method • Observe the world • Identify a puzzling phenomenon • Generate a falsifiable hypothesis (K. Popper) • Design and conduct an experiment with the goal of disproving the hypothesis • If the experiment “fails”, then the hypothesis is accepted (until replaced) • If the experiment “succeeds”, then reject hypothesis, but additional insight into the phenomenon may be obtained and steps 2-3 repeated

  6. The Computer Experiment

  7. Agent-Based Simulation as a Component of the Scientific Method • Modeling (Hypothesis) • Agent-Based Simulation (Experiment) • Observation

  8. Agent-Based Simulation as a Component of the Scientific Method • Modeling (Hypothesis): Social Network Model of F/OSS • Agent-Based Simulation (Experiment): Grow Artificial SourceForge • Observation: Analysis of SourceForge Data

  9. Agent-Based Modeling and Simulation • Conceptual models of a phenomenon • Simulations are computer implementations of the conceptual models • Agents in models and simulations are distinct entities (instantiated objects) • Tend to be simple, but with large numbers of them (thousands, or more) - i.e., swarm intelligence • Contrasted with higher level AI “intelligent agents” • Foundations in complexity theory • Self-organization • Emergence

  10. Collaborative Social Networks • Research-paper co-authorship, small world phenomenon, e.g., Erdos number (Barabasi 2001, Newman 2001) • Movie actors, small world phenomenon, e.g., Kevin Bacon number (Watts 1999, 2003) • Interlocking corporate directorships • Terrorist networks • Open-source software developers (Madey et al., AMCIS 2002) • Collaborators are the nodes of a graph, and collaborative relationships are its edges => a framework to model data/phenomenon

  11. SourceForge • VA Software • Part of OSDN • Started 12/1999 • Collaboration tools • 70,000 Projects • 90,000 Developers • 700,000 Registered Users

  12. Savannah • SourceForge Software? • Free Software Foundation • 1,600 Projects • 16,000 Registered Users

  13. Observations • Web mining • Web crawler (scripts) • Python • Perl • AWK • Sed • Monthly • Since Jan 2001 • ProjectID • DeveloperID • Almost 2 million records • Relational database
      PROJ|DEVELOPER
      8001|dev378
      8001|dev8975
      8001|dev9972
      8002|dev27650
      8005|dev31351
      8006|dev12509
      8007|dev19395
      8007|dev4622
      8007|dev35611
      8008|dev8975
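The PROJ|DEVELOPER records on this slide can be loaded into two lookup tables, one per side of the project/developer relationship. The following is a minimal Python sketch, not the authors' crawler or database code; the function name `load_memberships` and the in-memory string `RECORDS` are assumptions made for illustration.

```python
from collections import defaultdict

# Sample records copied from the slide (PROJ|DEVELOPER).
RECORDS = """\
8001|dev378
8001|dev8975
8001|dev9972
8002|dev27650
8005|dev31351
8006|dev12509
8007|dev19395
8007|dev4622
8007|dev35611
8008|dev8975
"""


def load_memberships(text):
    """Return (project -> developers, developer -> projects) from pipe-delimited rows."""
    devs_by_project = defaultdict(set)
    projects_by_dev = defaultdict(set)
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        project_id, developer_id = line.split("|")
        devs_by_project[int(project_id)].add(developer_id)
        projects_by_dev[developer_id].add(int(project_id))
    return devs_by_project, projects_by_dev


devs_by_project, projects_by_dev = load_memberships(RECORDS)
```

In this sample, dev8975 appears in both project 8001 and project 8008, which is the kind of shared membership that links projects into a network.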

  14. Collaboration Networks Adapted from Newman, Strogatz and Watts, 2001

  15. F/OSS Developers - Collaboration Social Network • Developers are nodes / Projects are links • 24 Developers • 5 Projects (7597, 6882, 7028, 9859, 15850) • 2 Linchpin Developers • 1 Cluster

  16. Topological Analysis of the Data • Statistics inspected • Diameter • Average degree • Clustering coefficient • Degree distribution • Cluster size distribution • Relative size of major cluster • Fitness and life cycle • Evolution of these statistics • Dual networks • developer network and project network
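The dual networks mentioned above can be derived from one membership table: developers are linked when they share a project, and projects are linked when they share a developer. The sketch below assumes this projection rule, which matches the "developers are nodes / projects are links" description; the function names and the three-project `memberships` sample are illustrative only.

```python
from itertools import combinations

# project -> developers; a small subset of the sample records from slide 13
memberships = {
    8001: {"dev378", "dev8975", "dev9972"},
    8007: {"dev19395", "dev4622", "dev35611"},
    8008: {"dev8975"},
}


def developer_network(devs_by_project):
    """Developers are nodes; two developers are linked if they share a project."""
    graph = {}
    for devs in devs_by_project.values():
        for dev in devs:
            graph.setdefault(dev, set())  # keep solo developers as isolated nodes
        for a, b in combinations(sorted(devs), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def project_network(devs_by_project):
    """Projects are nodes; two projects are linked if they share a developer."""
    graph = {project: set() for project in devs_by_project}
    for p, q in combinations(sorted(devs_by_project), 2):
        if devs_by_project[p] & devs_by_project[q]:
            graph[p].add(q)
            graph[q].add(p)
    return graph


dev_net = developer_network(memberships)
proj_net = project_network(memberships)
```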

  17. Terminology • Diameter: average length of the shortest paths between all pairs of vertices • Degree: the count of edges connected to a given vertex • Average degree: average of the degrees of all vertices in the network • Cluster: a connected component of the network • Clustering coefficient (CC) • CCi: for vertex i, the number of links actually present among the vertices in its neighborhood, as a fraction of the total possible number of such links • CC: average of all CCi in a network • Degree distribution: the distribution of degrees throughout a network • Major cluster: the largest cluster in the network
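These definitions can be made concrete on a small undirected graph stored as an adjacency dictionary. The sketch below follows the slide's wording, including its use of "diameter" for the average shortest path length; restricting that average to connected pairs is an assumption, and the `example` graph is invented for illustration.

```python
from collections import deque


def average_degree(graph):
    """Mean number of edges per vertex."""
    return sum(len(nbrs) for nbrs in graph.values()) / len(graph)


def bfs_distances(graph, source):
    """Shortest path length from source to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in graph[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist


def clusters(graph):
    """Connected components, each returned as a set of vertices."""
    seen, result = set(), []
    for v in graph:
        if v not in seen:
            component = set(bfs_distances(graph, v))
            seen |= component
            result.append(component)
    return result


def local_cc(graph, v):
    """CCi: links present among v's neighbors over links possible."""
    nbrs = list(graph[v])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for i in range(k) for j in range(i + 1, k)
                if nbrs[j] in graph[nbrs[i]])
    return links / (k * (k - 1) / 2)


def clustering_coefficient(graph):
    """CC: average of CCi over all vertices."""
    return sum(local_cc(graph, v) for v in graph) / len(graph)


def diameter(graph):
    """Average shortest path length over connected pairs (the slide's 'diameter')."""
    total = pairs = 0
    for v in graph:
        dist = bfs_distances(graph, v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


# Invented example: a triangle a-b-c with d hanging off c, plus a separate pair e-f.
example = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c"}, "e": {"f"}, "f": {"e"},
}
major_cluster = max(clusters(example), key=len)
```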

  18. Degree Distribution: Developers

  19. Degree Distribution: Projects

  20. Diameter of Developer Network vs. Time • Network size increased from 30,000 to 70,000

  21. Diameter of Project Network vs. Time • Network size increased from 20,000 to 50,000. • Diameter decreasing with time both for developer network and project network

  22. Clustering Coefficient of Developer Network vs. Time

  23. Clustering Coefficient of Project Network vs. Time

  24. Cluster Size Distribution • R2 with major cluster is 0.7426 • R2 without major cluster is 0.9799
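One common way to obtain R2 values like these is a least-squares line through the log-log plot of the distribution. The sketch below assumes that is what R2 means here, which the slides do not state; the data points are synthetic and are not the SourceForge measurements.

```python
import math


def power_law_fit(xs, ys):
    """Fit log10(y) = slope * log10(x) + intercept; return (slope, intercept, r_squared)."""
    lx = [math.log10(x) for x in xs]
    ly = [math.log10(y) for y in ys]
    n = len(lx)
    mean_x, mean_y = sum(lx) / n, sum(ly) / n
    sxx = sum((a - mean_x) ** 2 for a in lx)
    syy = sum((b - mean_y) ** 2 for b in ly)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)  # squared correlation in log-log space
    return slope, intercept, r_squared


# Synthetic cluster sizes and counts that follow count = 1000 * size^-2 exactly.
sizes = [1, 2, 4, 8]
counts = [1000 * s ** -2 for s in sizes]
slope, _, r2_without = power_law_fit(sizes, counts)

# Adding a single very large cluster (hypothetical size 10000, count 1) pulls the fit down.
_, _, r2_with = power_law_fit(sizes + [10000], counts + [1])
```

A single outlying point such as one very large cluster lowers R2, which is consistent with the pattern reported on this slide.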

  25. Relative Size of Major Cluster vs. Time • Increase of the relative size of the major cluster • Approaching steady-state?

  26. An Example Research Question • What processes can explain the evolution of the project and developer social networks? • Randomly growing network (Erdos-Renyi, 1960)? • Evolving network with preferential attachment (Barabasi-Albert, 1999)? • Evolving network with preferential attachment and fitness (Barabasi-Albert, 2001)? • Others?
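The contrast between random growth and preferential attachment can be sketched with a toy growth process in which each new node attaches to one existing node. This is a deliberate simplification and not the models from the cited papers: uniform attachment stands in for the random case, and the function name `grow_network` and its parameters are hypothetical.

```python
import random


def grow_network(n_nodes, preferential, seed=0):
    """Grow a network one node at a time; return the list of node degrees."""
    rng = random.Random(seed)
    degrees = [1, 1]      # start from a single edge between nodes 0 and 1
    endpoints = [0, 1]    # each node is listed once per incident edge
    for new in range(2, n_nodes):
        if preferential:
            # Picking a random edge endpoint selects a node with
            # probability proportional to its current degree.
            target = rng.choice(endpoints)
        else:
            # Every existing node is equally likely to be chosen.
            target = rng.randrange(new)
        degrees.append(1)
        degrees[target] += 1
        endpoints.extend([new, target])
    return degrees


random_degrees = grow_network(5000, preferential=False)
preferential_degrees = grow_network(5000, preferential=True)
```

Comparing `max(random_degrees)` with `max(preferential_degrees)` shows the qualitative difference: preferential attachment tends to produce a few highly connected hubs, the signature of a heavy-tailed degree distribution.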

  27. Computer Experiments • Agent-based simulations • Java programs using Swarm class library • Validation (docking) exercises using Java/Repast • Grow artificial SourceForges (Epstein & Axtell, 1996) • Parameterized with observed data, e.g., developer behaviors • Join rates • New project additions • Leave projects • Evaluation of multiple models (hypotheses) • Verification/validation

  28. Cycles of Modeling & Simulation • Modeling (Hypothesis): Social Network Models ER => BA => BA+Fitness => BA+Dynamic Fitness • Agent-Based Simulation (Experiment): Grow Artificial SourceForge • Observation: Analysis of SourceForge Data (Degree Distribution, Average Degree, Diameter, Clustering Coefficient, Cluster Size Distribution)

  29. Model for SourceForge • ABM based on bipartite graph • Model description • Agent: developer • Behaviors: Create, join, abandon and idle • Preference: developer’s and project’s • Fitness • Four models in iterations • ER, BA, BA with constant fitness and BA with dynamic fitness • Comparison of empirical and simulated data
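The four developer behaviors on a bipartite developer/project graph can be sketched as a small simulation loop. This is not the authors' Swarm or Repast implementation: the probabilities, the one-new-developer-per-step arrival rule, and the size-based join preference are all hypothetical choices made for illustration, and fitness is omitted.

```python
import random


def simulate(steps, p_create=0.1, p_join=0.3, p_abandon=0.05,
             preferential=True, seed=0):
    """Return (project -> developers, developer -> projects) after `steps` rounds."""
    rng = random.Random(seed)
    members = {0: {0}}  # project id -> set of developer ids
    joined = {0: {0}}   # developer id -> set of project ids
    for _ in range(steps):
        joined[len(joined)] = set()  # hypothetical rule: one new developer per step
        for dev in list(joined):
            r = rng.random()
            if r < p_create:
                # Create: start a new project with this developer as first member.
                project = len(members)
                members[project] = {dev}
                joined[dev].add(project)
            elif r < p_create + p_join:
                # Join: pick a project the developer is not yet in.
                candidates = [p for p in members if dev not in members[p]]
                if candidates:
                    weights = [(len(members[p]) + 1) if preferential else 1
                               for p in candidates]
                    project = rng.choices(candidates, weights=weights)[0]
                    members[project].add(dev)
                    joined[dev].add(project)
            elif r < p_create + p_join + p_abandon:
                # Abandon: leave one current project, if any.
                if joined[dev]:
                    project = rng.choice(sorted(joined[dev]))
                    members[project].discard(dev)
                    joined[dev].discard(project)
            # Otherwise the developer idles this round.
    return members, joined


members, joined = simulate(50)
```

Setting `preferential=False` makes every candidate project equally likely, a simplified stand-in for the random case, so the same loop can be rerun under different attachment hypotheses and its output compared with observed statistics.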

  30. ER Model – Degree Distribution • The simulated degree distribution is a normal distribution, while the empirical data follow a power law • Fit Fails!

  31. ER Model - Diameter • Average degree is decreasing while it is increasing in empirical data • Diameter is increasing while it is decreasing in empirical data • Fit Fails!

  32. ER Model – Clustering Coefficient • Clustering coefficient is relatively low (under 0.3), while it is around 0.7 in empirical data. • Fit Fails!

  33. ER Model – Cluster Size Distribution • Power law distribution with R2 as 0.6667 (0.9653 without the major cluster) while R2 in empirical data is 0.7426 (0.9799 without the major cluster) • The actual distribution is different from empirical data • Fit Fails!

  34. BA Model – Degree Distribution • Power laws in degree distributions, similar to empirical data (o for simulated data and x for empirical data). • For developer distribution: simulated data has R2 as 0.9798 and empirical data has R2 as 0.9714. • For project distribution: simulated data has R2 as 0.6650 and empirical data has R2 as 0.9838. • Partial Fit!

  35. BA Model – Diameter and Clustering Coefficient • Small diameter and high clustering coefficient like empirical data • Diameter and clustering coefficient are both decreasing like empirical data • Good Fit!

  36. BA Model with Constant Fitness • Power laws in degree distributions, similar to empirical data (o for simulated data and x for empirical data). • For developer distribution: simulated data has R2 as 0.9742 and empirical data has R2 as 0.9714. • For project distribution: simulated data has R2 as 0.7253 and empirical data has R2 as 0.9838. • Improved fit!

  37. Discovery: Project Life Cycle

  38. BA Model with Dynamic Fitness • Power laws in degree distribution, similar to empirical data (o for simulated data and x for empirical data). • For developer distribution: simulated data has R2 as 0.9695 and empirical data has R2 as 0.9714. • For project distribution: simulated data has R2 as 0.8051 and empirical data has R2 as 0.9838. • Somewhat better fit!

  39. Models of the F/OSS Social Network (Alternative Hypotheses) • General model features • Agents are nodes on a graph (developers or projects) • Behaviors: Create, join, abandon and idle • Edges are relationships (joint project participation) • Growth of network: random or types of preferential attachment, formation of clusters • Fitness • Network attributes: diameter, average degree, degree distribution, clustering coefficient • Four specific models • ER (random graph) - (1960) • BA (preferential attachment) - (1999) • BA (+ constant fitness) - (2001) • BA (+ dynamic fitness) - (2003)

  40. Summary

  41. Summary • Why Agent-Based Modeling and Simulation? • Can be used as components of the Scientific Method • A research approach for studying socio-technical systems • Case study: F/OSS - Collaboration Social Networks • SourceForge conceptual models: ER, BA, BA with constant fitness and BA with dynamic fitness. • Simulations • Computer experiments that tested conceptual models • Provided insight into the phenomenon under study and guided data mining of collected observations

  42. Questions • Validity of approaches • Social networks • Simulation • Value/Utility of approaches • Applicability to other areas of F/OSS research • Project sites, e.g., Mozilla.org • Individual projects, e.g., Linux kernel

  43. Thank you
