430 likes | 703 Views
An Investigation into the Free/Open Source Software Phenomenon using Data Mining, Social Network Theory, and Agent-Based Greg Madey Computer Science & Engineering University of Notre Dame UIUC - NSF Workshop on Continuous (Re)Design of Open Source Software
E N D
An Investigation into the Free/Open Source Software Phenomenon using Data Mining, Social Network Theory, and Agent-Based Greg MadeyComputer Science & EngineeringUniversity of Notre Dame UIUC - NSF Workshop on Continuous (Re)Design of Open Source Software University of Illinois, Urbana-Champaign October 8-9, 2003 This research was partially supported by the US National Science Foundation, CISE/IIS-Digital Society & Technology, under Grant No. 0222829
Contributors • Vincent Freeh, Computer Science, North Carolina State University (Principal Investigator) • Yongqin Gao, Computer Science and Engineering, University of Notre Dame (Graduate Student) • Jeff Goett, University of Notre Dame (REU Student) • Chris Hoffman, University of Notre Dame (REU Student) • Nadir Kiyanclar, University of Notre Dame (REU Student) • Greg Madey, Computer Science & Engineering, University of Notre Dame (Principal Investigator) • Patrick McGovern, Director SourceForge.net, VA Software (Industrial Collaborator) • Carlos Siu, University of Notre Dame (REU Student) • Renee Tynan, Department of Management, College of Business, University of Notre Dame (Principal Investigator) • Jin Xu, Computer Science & Engineering, University of Notre Dame (Graduate Student)
Outline • Research approach • Tools and definitions: Agents, models, simulations, collaborative social networks, computer experiments • Data collection and analysis • Example research question • Simulation • Computer experiments • Results
One Approach to Researching F/OSSD • Online data • Screen scraping • Database dumps • Modeling • Social network theory • Evolutionary assumptions • Simulation • Verification and validation • Computer experiments • Variation of Classical Scientific Method
Classical Scientific Method • Observe the world • Identify a puzzling phenomenon • Generate a falsifiable hypothesis (K. Popper) • Design and conduct an experiment with the goal of disproving the hypothesis • If the experiment “fails”, then the hypothesis is accepted (until replaced) • If the experiment “succeeds”, then reject hypothesis, but additional insight into the phenomenon may be obtained and steps 2-3 repeated
Agent-Based Simulation as a Component of the Scientific Method Modeling (Hypothesis) Agent -Based Simulation (Experiment) Observation
Agent-Based Simulation as a Component of the Scientific Method Modeling (Hypothesis) Social Network Model of F/OSS Agent -Based Simulation (Experiment) Observation Analysis of Grow Artificial SourceForge SourceForge Data
Agent-Based Modeling and Simulation • Conceptual models of a phenomenon • Simulations are computer implementations of the conceptual models • Agents in models and simulations are distinct entities (instantiated objects) • Tend to be simple, but with large numbers of them (thousands, or more) - i.e., swarm intelligence • Contrasted with higher level AI “intelligent agents” • Foundations in complexity theory • Self-organization • Emergence
Collaborative Social Networks • Research-paper co-authorship, small world phenomenon, e.g., Erdos number (Barabasi 2001, Newman 2001) • Movie actors, small world phenomenon, e.g., Kevin Bacon number (Watts 1999, 2003) • Interlocking corporate directorships • Terrorist Networks • Open-source software developers (Madey et al, AMCIS 2002) • Collaborators are nodes in a graph, and collaborative relationship are the edges of the graph => a framework to model data/phenomenon
SourceForge • VA Software • Part of OSDN • Started 12/1999 • Collaboration tools • 70,000 Projects • 90,000 Developers • 700,00 Registered Users
Savannah • SourceForge Software? • Free Software Foundation • 1,600 Projects • 16,000 Registered Users
Observations • Web mining • Web crawler (scripts) • Python • Perl • AWK • Sed • Monthly • Since Jan 2001 • ProjectID • DeveloperID • Almost 2 million records • Relational database PROJ|DEVELOPER 8001|dev378 8001|dev8975 8001|dev9972 8002|dev27650 8005|dev31351 8006|dev12509 8007|dev19395 8007|dev4622 8007|dev35611 8008|dev8975
Collaboration Networks Adapted from Newman, Strogatz and Watts, 2001
dev[72] dev[67] dev[52] dev[65] dev[70] dev[57] 7597 dev[46] 6882 dev[47] dev[64] dev[45] dev[99] 7597 dev[46] 7597 dev[46] dev[52] dev[72] dev[67] 7597 dev[46] 6882 dev[47] dev[47] dev[55] dev[55] dev[55] 7597 dev[46] 7028 dev[46] dev[70] 7597 dev[46] 7028 dev[46] dev[57] dev[45] dev[99] dev[51] 7597 dev[46] 7028 dev[46] 6882 dev[47] 6882 dev[58] dev[61] dev[51] dev[79] dev[47] 7597 dev[46] dev[58] dev[58] dev[46] 9859 dev[46] dev[54] 15850 dev[46] dev[58] 9859 dev[46] dev[79] dev[58] 9859 dev[46] dev[49] dev[53] 9859 dev[46] 15850 dev[46] dev[59] dev[56] 15850 dev[46] dev[83] 15850 dev[46] dev[48] dev[53] dev[56] dev[83] dev[48] F/OSS Developers - Collaboration Social Network Developers are nodes / Projects are links 24 Developers 5 Projects Project 7597 2 Linchpin Developers 1 Cluster dev[64] Project 6882 Project 7028 dev[61] dev[54] dev[49] dev[59] Project 9859 Project 15850
Topological Analysis of the Data • Statistics inspected • Diameter • Average degree • Clustering coefficient • Degree distribution • Cluster size distribution • Relative size of major cluster • Fitness and life cycle • Evolution of these statistics • Dual networks • developer network and project network
Terminology • Diameter • Average length of shortest paths between all pairs of vertices • Degree • The count of edges connected to given vertex • Average degree • Average of the degrees of all vertices in the network • Cluster • The connected components of the network • Clustering coefficient (CC) • CCi: Fraction representing the number of links actually present relative to the total possible number of links among the vertices in its neighborhood. • CC: average of all CCi in a network • Degree distribution • The distribution of degrees throughout a network • Major cluster • The largest cluster in the network
Diameter of Developer Network vs. Time • Network size increased from 30,000 to 70,000
Diameter of Project Network vs. Time • Network size increased from 20,000 to 50,000. • Diameter decreasing with time both for developer network and project network
Cluster Size Distribution • R2 with major cluster is 0.7426 • R2 without major cluster is 0.9799
Relative Size of Major Cluster vs. Time • Increase of the relative size of the major cluster • Approaching steady-state?
An Example Research Question • What processes can explain the evolution of the project and developer social networks? • Randomly growing network (Erdos-Reyni, 1960)? • Evolving network with preferential attachment (Barabasi-Albert, 1999)? • Evolving network with preferential attachment and fitness (Barabasi-Albert, 2001)? • Others?
Computer Experiments • Agent-based simulations • Java programs using Swarm class library • Validation (docking) exercises using Java/Repast • Grow artificial SourceForge’s (Epstein & Axtell, 1996) • Parameterized with observed data, e.g., developer behaviors • Join rates • New project additions • Leave projects • Evaluation of multiple models (hypotheses) • Verification/validation
Cycles of Modeling & Simulation Modeling (Hypothesis) Social Network Models ER => BA => BA+Fitness => BA+Dynamic Fitness Agent -Based Simulation (Experiment) Observation Degree Distribution Average Degree Diameter Clustering Coefficient Cluster Size Distribution Analysis of Grow Artificial SourceForge SourceForge Data
Model for SourceForge • ABM based on bipartite graph • Model description • Agent: developer • Behaviors: Create, join, abandon and idle • Preference: developer’s and project’s • Fitness • Four models in iterations • ER, BA, BA with constant fitness and BA with dynamic fitness • Comparison of empirical and simulated data
ER Model – Degree Distribution • Degree distribution is normal distribution while it is power law in empirical data • Fit Fails!
ER Model - Diameter • Average degree is decreasing while it is increasing in empirical data • Diameter is increasing while it is decreasing in empirical data • Fit Fails!
ER Model – Clustering Coefficient • Clustering coefficient is relatively low under 0.3 while it is around 0.7 in empirical data. • Fit fails!
ER Model – Cluster Size Distribution • Power law distribution with R2 as 0.6667 (0.9653 without the major cluster) while R2 in empirical data is 0.7426 (0.9799 without the major cluster) • The actual distribution is different from empirical data • Fit Fails!
BA Model – Degree Distribution • Power laws in degree distributions, similar to empirical data (o for simulated data and x for empirical data). • For developer distribution: simulated data has R2 as 0.9798 and empirical data has R2 as 0.9714. • For project distribution: simulated data has R2 as 0.6650 and empirical data has R2 as 0.9838. • Partial Fit!
BA Model – Diameter and Clustering Coefficient • Small diameter and high clustering coefficient like empirical data • Diameter and clustering coefficient are both decreasing like empirical data • Good Fit!
BA Model with Constant Fitness • Power laws in degree distributions, similar to empirical data (o for simulated data and x for empirical data). • For developer distribution: simulated data has R2 as 0.9742 and empirical data has R2 as 0.9714. • For project distribution: simulated data has R2 as 0.7253 and empirical data has R2 as 0.9838. • Improved fit!
BA Model with Dynamic Fitness • Power laws in degree distribution, similar to empirical data (o for simulated data and x for empirical data). • For developer distribution: simulated data has R2 as 0.9695 and empirical data has R2 as 0.9714. • For project distribution: simulated data has R2 as 0.8051 and empirical data has R2 as 0.9838. • Somewhat better fit!
Models of the F/OSS Social Network(Alternative Hypotheses) • General model features • Agents are nodes on a graph (developers or projects) • Behaviors: Create, join, abandon and idle • Edges are relationships (joint project participation) • Growth of network: random or types of preferential attachment, formation of clusters • Fitness • Network attributes: diameter, average degree, degree distribution, clustering coefficient • Four specific models • ER (random graph) - (1960) • BA (preferential attachment) - (1999) • BA ( + constant fitness) - (2001) • BA ( + dynamic fitness) - (2003)
Summary • Why Agent-Based Modeling and Simulation? • Can be used as components of the Scientific Method • A research approach for studying socio-technical systems • Case study: F/OSS - Collaboration Social Networks • SourceForge conceptual models: ER, BA, BA with constant fitness and BA with dynamic fitness. • Simulations • Computer experiments that tested conceptual models • Provided insight into the phenomenon under study and guided data mining of collected observations
Questions • Validity of approaches • Social networks • Simulation • Value/Utility of approachs • Applicability to other areas of F/OSS research • Project sites, e.g., Mozilla.org • Individual projects, e.g., Linux kernel