Computational Discovery in Evolving Complex Networks Yongqin Gao Advisor: Greg Madey
Outline • Background • Methodology for Computational Discovery • Problem Domain – OSS Research • Process I: Data Mining • Process II: Network Analysis • Process III: Computer Simulation • Process IV: Research Collaboratory • Contributions • Conclusion and Future Work
Background • Network research is gaining more attention • Internet • Communication networks • Social networks • Software developer networks • Biological networks • Understanding evolving complex networks • Goal I: Search • Goal II: Prediction • Computational scientific discovery
Problem Domain • Open Source Software Movement • What is OSS • Free to use, modify, and distribute; source code is available and modifiable • Potential advantages over commercial software: high quality, fast development, low cost • Why study OSS (Goal) • Software engineering: new development and coordination methods • Open content: a model for other forms of open, shared collaboration • Complexity: a successful example of self-organization/emergence
Problem Domain • SourceForge.net community • The largest OSS development community • 134,751 registered projects • 1,439,773 registered users
Problem Domain • Our Data Set • 25 monthly dumps since January 2003. • 460 GB in total, growing at 25 GB/month. • Each dump has about 100 tables. • The largest table has up to 30 million records. • Experiment Environment • Dual Xeon 3.06 GHz, 4 GB memory, 2 TB storage • Linux 2.4.21-40.ELsmp with PostgreSQL 8.1
Related Research • OSS research • W. Scacchi, “Free/open source software development practices in the computer game community”, IEEE Software, 2004. • K. Crowston, H. Annabi and J. Howison, “Defining open source software project success”, 24th International Conference on Information Systems, Seattle, 2003. • Complex networks • L.A. Adamic and B.A. Huberman, “Scaling behavior of the world wide web”, Science, 2000. • M.E.J. Newman, “Clustering and preferential attachment in growing networks”, Physical Review E, 2001.
Process I: Data Mining • Related Research: • S. Chawla, B. Arunasalam and J. Davis, “Mining open source software (OSS) data using association rules network”, PAKDD, 2003. • D. Kempe, J. Kleinberg and E. Tardos, “Maximizing the spread of influence through a social network”, SIGKDD, 2003. • C. Jensen and W. Scacchi, “Data mining for software process discovery in open source software development communities”, Workshop on Mining Software Repositories, 2004.
Process I: Data Mining • Pipeline: Raw data → Data Preparation → Database → Data Purging → Relevant data → Feature Selection → Algorithm Application → Knowledge
Process I: Data Mining • Data Preparation • Data discovery • Locating the information • Data characterization • Activity features: user categorization • Network features • Data assembly • Data Purging • Handling data inconsistency • Unifying date representations by loading into a single repository • Handling data pollution • Removing “inactive” projects • Feature Selection • Used to remove dependent or insignificant features • NMF (Non-negative Matrix Factorization)
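The NMF step can be illustrated with a minimal NumPy sketch using Lee–Seung multiplicative updates. The activity matrix, its size, and the rank below are toy assumptions, not the actual SourceForge data or feature schema.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy project-by-feature activity matrix (rows: projects, cols: features).
V = rng.random((20, 6)) + 0.01   # non-negative by construction

def nmf(V, k, iters=500, eps=1e-9):
    """Lee-Seung multiplicative updates: factor V ~ W @ H, all non-negative."""
    n, m = V.shape
    W = rng.random((n, k)) + 0.1
    H = rng.random((k, m)) + 0.1
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

W, H = nmf(V, k=2)

# Features with the largest loadings in H dominate each latent factor;
# dependent features tend to load on the same factor and can be pruned.
top_features = np.argsort(-H, axis=1)[:, :3]
reconstruction_error = np.linalg.norm(V - W @ H)
```

In practice a library implementation (e.g. scikit-learn's `NMF`) would replace this hand-rolled loop; the sketch only shows the mechanics.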
Process I: Data Mining • Result I • Significant features • By feature selection, we can identify the significant feature set describing the projects. • Activity features: “file_releases”, “followup_msg”, “support_assigned”, “feature_assigned” and task related features • Network features: “degrees”, “betweenness” and “closeness”
Process I: Data Mining • Distribution-based clustering (Christley, 2005) • Clustering according to the distribution of the feature set instead of the values of individual features • We assume every entity (project) has an underlying distribution over the feature set (activity features) • Using a statistical hypothesis test • Non-parametric test • Fisher’s contingency-table test is used • Joachim Krauth, “Distribution-free statistics: an application-oriented approach”, Elsevier Science Publishers, 1988.
Process I: Data Mining • Procedure:
While (still unclustered entities)
  Put all unclustered entities into one cluster
  While (some entities not yet pairwise compared)
    A = pick an entity from the cluster
    For each other entity B in the cluster not yet compared to A
      Run statistical test on A and B
      If the result is significant
        Remove B from the cluster
• Worst-case complexity: O(n²)
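The procedure above can be sketched in Python. As a stand-in for Fisher's contingency-table test, this sketch uses a chi-square statistic against a fixed critical value (df = 2, alpha = 0.05); the count vectors are toy data, not SourceForge projects.

```python
def chi2_stat(a, b):
    """Chi-square statistic for a 2-row contingency table built from two
    count vectors over the same features."""
    total_a, total_b = sum(a), sum(b)
    grand = total_a + total_b
    stat = 0.0
    for ca, cb in zip(a, b):
        col = ca + cb
        if col == 0:
            continue
        ea = col * total_a / grand   # expected count, row a
        eb = col * total_b / grand   # expected count, row b
        stat += (ca - ea) ** 2 / ea + (cb - eb) ** 2 / eb
    return stat

def cluster(data, differs):
    """Pairwise-test clustering from the slide: repeatedly pick a pivot and
    eject every entity whose distribution differs significantly from it."""
    clusters = []
    pool = list(range(len(data)))
    while pool:
        members = list(pool)
        i = 0
        while i < len(members):
            a = members[i]
            members = members[:i + 1] + [
                b for b in members[i + 1:] if not differs(data[a], data[b])
            ]
            i += 1
        clusters.append(members)
        pool = [e for e in pool if e not in members]
    return clusters

data = [
    [50, 2, 1],    # unbalanced activity distributions
    [48, 3, 2],
    [20, 18, 19],  # balanced activity distributions
    [22, 20, 17],
]
CRITICAL = 5.991  # chi-square critical value, df = 2, alpha = 0.05
clusters = cluster(data, lambda a, b: chi2_stat(a, b) > CRITICAL)
```

The outer loop mirrors the slide's worst-case O(n²) pairwise comparisons.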
Process I: Data Mining • Result II • Unsupervised learning • The distribution-based method was used to cluster the project history by activity distribution • Clusters are identified by ID and the results are shown in the table • High support and confidence in evaluation
Process I: Data Mining • Two sample distributions from different categories • Unbalanced feature distribution → could be “unpopular” • Balanced feature distribution → could be “popular”
Process I: Data Mining • Discoveries in Process I • Significant feature set selection • Network features are important • Further inspection in the next process • Distribution-based predictor • Based on the activity feature distribution • Predicts “popularity” from the balance of the activity feature distribution • Benefit of these discoveries • For collaboration-based communities, these discoveries can help optimize resource allocation.
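One way to sketch the distribution-based predictor is to score the balance of a project's activity features with normalized entropy. The threshold and the sample vectors below are illustrative assumptions; a real predictor would be fit on labeled project data.

```python
import math

def balance_score(counts):
    """Normalized Shannon entropy of an activity-count vector: 1.0 means
    perfectly balanced activity, values near 0 mean one feature dominates."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(len(counts))

def predict_popular(counts, threshold=0.8):
    # Threshold is illustrative; in practice it would be fit on labeled data.
    return balance_score(counts) >= threshold

balanced = [40, 35, 38, 42]      # activity spread across features
unbalanced = [95, 2, 1, 2]       # one feature dominates
```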
Process II: Network Analysis • Why network analysis • Assess the importance of the network measures to the whole network and to individual entities in the network • Inspect the developing patterns of these network measures • Network analysis • Structure analysis • Centrality analysis • Path analysis
Process II: Network Analysis • Related research: • P. Erdös and A. Rényi, “On random graphs”, Publicationes Mathematicae, 1959. • D.J. Watts and S.H. Strogatz, “Collective dynamics of small-world networks”, Nature, 1998. • A.L. Barabási and R. Albert, “Emergence of scaling in random networks”, Science, 1999. • Y. Gao, “Topology and evolution of the open source software community”, Master Thesis, 2003.
Process II: Network Analysis • Structure Analysis • Understanding the influence of the network structure on individual entities in the network • Inspected measures • Approximate diameter • Approximate clustering coefficient • Component distribution
Process II: Network Analysis • Conversion among C-NET, P-NET and D-NET
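The conversion can be sketched as one-mode projections of the bipartite C-NET: developers linked through shared projects give D-NET, and projects linked through shared developers give P-NET. The affiliations below are toy data, not the SourceForge dataset.

```python
from itertools import combinations

# C-NET: bipartite project-developer affiliations (illustrative names).
cnet = {
    "p1": {"alice", "bob"},
    "p2": {"bob", "carol"},
    "p3": {"dave"},
}

def project_onto_developers(cnet):
    """D-NET: developers linked iff they share at least one project."""
    edges = set()
    for members in cnet.values():
        for a, b in combinations(sorted(members), 2):
            edges.add((a, b))
    return edges

def project_onto_projects(cnet):
    """P-NET: projects linked iff they share at least one developer."""
    edges = set()
    for p, q in combinations(sorted(cnet), 2):
        if cnet[p] & cnet[q]:
            edges.add((p, q))
    return edges

dnet = project_onto_developers(cnet)
pnet = project_onto_projects(cnet)
```

At SourceForge scale the same projections would be computed with a graph library rather than nested loops; the sketch only shows the mapping.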
Process II: Network Analysis • Result I • Approximate Diameters • D-NET: between 5 and 7 while network size ranged from 151,803 to 195,744. • P-NET: between 6 and 8 while network size ranged from 123,192 to 161,798. • Approximate Clustering Coefficient • D-NET: between 0.85 and 0.95 • P-NET: between 0.65 and 0.75
Process II: Network Analysis • Result I
Process II: Network Analysis • Centrality Analysis • Understanding the importance of individual entities to the global network structure • Inspected measures: • Average Degrees • Degree Distributions • Betweenness • Closeness
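Two of these measures, average degree and closeness, can be sketched on a toy network with plain BFS; betweenness would additionally need Brandes' algorithm and is omitted here. The graph below is illustrative, not a SourceForge network.

```python
from collections import deque

# Toy undirected developer network as an adjacency list (illustrative).
graph = {
    "a": ["b", "c"],
    "b": ["a", "c", "d"],
    "c": ["a", "b"],
    "d": ["b"],
}

def average_degree(g):
    """Mean number of neighbors per node."""
    return sum(len(nbrs) for nbrs in g.values()) / len(g)

def closeness(g, source):
    """Closeness centrality: reachable nodes divided by the sum of their
    shortest-path distances from `source` (BFS, unweighted edges)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in g[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    others = len(dist) - 1
    return others / sum(dist.values()) if others else 0.0
```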
Process II: Network Analysis • Result II • Average Degrees • Developer degree in C-NET: 1.4525 • Project degree in C-NET: 1.7572 • Developer degree in D-NET: 12.3100 • Project degree in P-NET: 3.8059
Process II: Network Analysis • Result II (Degree distributions in C-NET)
Process II: Network Analysis • Result II (Degree distributions in D-NET and P-NET)
Process II: Network Analysis • Result II • Average Betweenness • P-NET: 2.669 × 10⁻⁴ • Average Closeness • P-NET: 4.143 × 10⁻⁶ • These two measures normally yield very small values in large networks (N > 10,000).
Process II: Network Analysis • Path Analysis • Understanding the developing patterns of the network structure and individual entities in the network • Inspected measures: • Active Developer Percentage • Average Degrees • Diameters • Clustering coefficients • Betweenness • Closeness
Process II: Network Analysis • Result III (Active entities)
Process II: Network Analysis • Result III (Average degrees in C-NET)
Process II: Network Analysis • Result III (Average degrees in D-NET and P-NET)
Process II: Network Analysis • Result III (Diameters in D-NET and P-NET)
Process II: Network Analysis • Result III (Clustering coefficients for D-NET and P-NET)
Process II: Network Analysis • Result III (Average betweenness and closeness for P-NET)
Process II: Network Analysis • Discoveries in Process II: • Measures from structure analysis and centrality analysis all indicate very high connectivity of the network. • Measures from path analysis reveal the developing patterns of these measures (life-cycle behavior). • Benefits of these discoveries • High connectivity in a network is an important feature for information propagation and fault tolerance. Understanding this discovery can help us improve our practices in collaboration and communication networks. • Understanding the developing patterns of these network measures gives us a way to monitor network development and to improve the network if necessary.
Process III: Computer Simulation • Related Research: • P.J. Kiviat, “Simulation, technology, and the decision process”, ACM Transactions on Modeling and Computer Simulation, 1991. • A.L. Barabási and R. Albert, “Emergence of scaling in random networks”, Science, 1999. • R. Axtell, R. Axelrod, J.M. Epstein and M.D. Cohen, “Aligning simulation models: A case study and results”, Computational and Mathematical Organization Theory, 1996. • Y. Gao, “Topology and evolution of the open source software community”, Master Thesis, 2003.
Process III: Computer Simulation • Iterative simulation method • Empirical dataset • Model • Simulation • Verification and validation • More measures • More methods
Process III: Computer Simulation • Previous iterated models (master thesis): • Adapted ER Model • BA Model • BA Model with fitness • BA Model with dynamic fitness • Iterated models in this study • Improved Model Four (Model I) • Constant user energy (Model II) • Dynamic user energy (Model III)
Process III: Computer Simulation • Model I • Realistic stochastic procedures. • New developers arrive each time step according to a Poisson distribution • Initial project fitness drawn from a log-normal distribution • Updated procedure for the weighted project pool (for preferential selection of projects).
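A minimal stdlib sketch of these stochastic procedures follows. The arrival rate, log-normal parameters, and the 70/30 join-versus-create split are illustrative assumptions, not the fitted values from the study.

```python
import math
import random

random.seed(42)

projects = []   # each project: {"fitness": float, "degree": int}
links = []      # (developer, project index) memberships

def poisson(lam):
    """Knuth's method for Poisson-distributed arrival counts (stdlib only)."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= limit:
            return k
        k += 1

def step(arrival_rate=2.0):
    """One time step: Poisson-many new developers arrive; each joins an
    existing project preferentially or creates a new one."""
    for _ in range(poisson(arrival_rate)):
        dev = len(links)
        if projects and random.random() < 0.7:
            # Preferential selection from the weighted project pool:
            # weight by fitness times (degree + 1).
            weights = [p["fitness"] * (p["degree"] + 1) for p in projects]
            idx = random.choices(range(len(projects)), weights=weights)[0]
        else:
            # New project with log-normally distributed initial fitness.
            projects.append({"fitness": random.lognormvariate(0.0, 1.0),
                             "degree": 0})
            idx = len(projects) - 1
        projects[idx]["degree"] += 1
        links.append((dev, idx))

for _ in range(200):
    step()
```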
Process III: Computer Simulation • Average degrees
Process III: Computer Simulation • Diameter and Clustering Coefficient
Process III: Computer Simulation • Betweenness and Closeness
Process III: Computer Simulation • Degree Distributions
Process III: Computer Simulation • Deficit in the measures
Process III: Computer Simulation • Model II • New addition: user energy. • User energy • The “fitness” parameter for the user • Every time a new user is created, an energy level is randomly generated for that user • The energy level decides whether the user takes an action during each time step.
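The user-energy mechanism can be sketched as an action gate drawn once per user. The uniform energy distribution, population size, and step count below are illustrative assumptions.

```python
import random

random.seed(7)

class User:
    """Model II sketch: each user draws a fixed energy level at creation;
    at every time step the energy gates whether the user acts at all."""
    def __init__(self):
        self.energy = random.random()   # constant over the user's lifetime
        self.actions = 0

    def maybe_act(self):
        # Higher-energy users act in a larger fraction of time steps.
        if random.random() < self.energy:
            self.actions += 1
            return True
        return False

users = [User() for _ in range(100)]
for _ in range(50):
    for u in users:
        u.maybe_act()

total_actions = sum(u.actions for u in users)
```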
Process III: Computer Simulation • Degree distributions for Model II
Process III: Computer Simulation • Deficit in the measures