Mining Email Social Networks in OSS
Christian Bird, Prem Devanbu, Alex Gourley, and Michael Gertz, Department of Computer Science
Anand Swaminathan, Graduate School of Management
University of California, Davis
Motivation • The social process is an important but hard-to-study aspect of any software engineering effort • It can be studied in many stable and mature OSS projects • Nearly all communication takes place over the Internet • Records of both communication and development activity are freely available
Apache Communication and Development (since 1996) • 100,000+ messages on dev mailing list • 70,000 CVS commits to files
It is widely believed that OSS communities form a hierarchy. (Image from “Socialization in an Open Source Community”, Nicolas Ducheneaut.) Can we use social network analysis to examine these OSS communities?
Social Networks • A network consisting of actors and their social ties to each other. • Example: a network of who dated whom in high school (image courtesy of Mark Newman).
[Figure: Alice and Bob joined by one directed link and three undirected links. Diagram labels: Mailing List (post, respond), Python (contribute), foo.c (commit), Bug Report (submit, resolve).]
Related Work
• Xu, Gao, Christley, and Madey looked at developers who worked on the same projects.
• Crowston & Howison treated the co-occurrence of developers on a bug report as a social link.
• Lopez, Gonzalez-Barahona, & Robles created networks of developers and modules via CVS data.
• We believe that responses to emails indicate a strong social link.
Issues with Mailing List Analysis • Extracting conversation threads • Rationalizing Timestamps • Identifying targets in a broadcast medium • Resolving Email Aliases • Extracting Content
Email Aliases • 2,544 distinct email aliases have been used on the Apache dev mailing list since 1996. • Many of these email addresses belong to the same people. • The following email addresses were all used by Joe Orton: jeo101@york.ac.uk, joe@orton.demon.co.uk, joe@light.plus.com, jorton@redhat.com, joe@manyfish.co.uk
Email Alias Analysis
Each email alias is a <name, address> tuple. Often the name is empty.
1. Preprocess the name and address.
  • Remove commas, reordering the name parts (“orton, joe” -> “joe orton”)
  • Normalize whitespace and remove punctuation and common prefixes/suffixes (Mr., jr., etc.)
  • Remove common email terms (list, admin, root)
2. Use heuristics and fuzzy matching (Levenshtein edit distance) to determine which email aliases are similar.
  • name-name: “joe orton” vs. “joe e. orton”
  • email-email: “jorton@foo.com” vs. “jorton@bar.org”
  • name-email: “joe orton” vs. “jorton@foo.com”
3. Manually post-process the aliases marked as similar to remove the many false positives.
4. Use a similar process to map CVS accounts to email aliases.
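The preprocessing and fuzzy-matching steps can be sketched roughly as follows. This is a minimal illustration, not the authors' tool: the word lists, the 0.8 similarity threshold, the candidate username patterns, and all function names are assumptions made for the example.

```python
import re

# Hypothetical word lists; the slides give only a few examples of each.
TITLES = {"mr", "mrs", "ms", "dr", "jr", "sr"}
EMAIL_TERMS = {"list", "admin", "root"}


def normalize_name(name: str) -> str:
    """Lowercase, reorder "last, first", strip punctuation, titles, email terms."""
    name = name.lower().strip()
    if "," in name:
        last, _, first = name.partition(",")
        name = f"{first} {last}"
    # Assumes ASCII names; anything that is not a letter becomes a space.
    name = re.sub(r"[^a-z\s]", " ", name)
    words = [w for w in name.split() if w not in TITLES | EMAIL_TERMS]
    return " ".join(words)


def levenshtein(a: str, b: str) -> int:
    """Edit distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance scaled to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def name_matches_email(name: str, address: str, threshold: float = 0.8) -> bool:
    """Name-email heuristic: compare the local part to likely usernames."""
    local = re.sub(r"[^a-z]", "", address.lower().split("@")[0])
    parts = normalize_name(name).split()
    if len(parts) < 2:
        return False
    first, last = parts[0], parts[-1]
    candidates = {first[0] + last, first + last, last + first[0]}
    return any(similarity(local, c) >= threshold for c in candidates)


print(normalize_name("Orton, Joe"))
print(similarity("joe orton", normalize_name("Joe E. Orton")))
print(name_matches_email("joe orton", "jorton@foo.com"))
```

A loose threshold like this flags many pairs that are not the same person, which is why a manual review pass follows the automated matching.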
Alias Results • 2,544 email aliases used • 2,008 unique “identities” identified • Many of the high-volume participants had a large number of aliases
Creating the Email Social Network • Each email message has a message ID. • A response message contains an “In-Reply-To” header, which includes the message ID of the previous message. • If Joe posts a message and Bob responds, this indicates information flow, and we create a directed tie from Joe to Bob. • We have built a tool that creates a directed, valued adjacency matrix of participants from our mailing list database for any time period.
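One way to picture the construction is the sketch below. It assumes each message has already been reduced to a (sender, message_id, in_reply_to) record with aliases resolved; the record layout, the function name, the sample messages, and the choice to ignore self-replies are illustrative assumptions, not details of the authors' tool.

```python
from collections import defaultdict


def build_reply_network(messages):
    """Return {(poster, responder): reply_count} for a list of message records.

    Each record is (sender, message_id, in_reply_to); in_reply_to is None
    for a message that starts a thread.
    """
    author_of = {msg_id: sender for sender, msg_id, _ in messages}
    ties = defaultdict(int)
    for responder, _, parent_id in messages:
        poster = author_of.get(parent_id)
        # Skip thread starters, replies to messages we never saw,
        # and (by assumption) people replying to themselves.
        if poster is None or poster == responder:
            continue
        ties[(poster, responder)] += 1  # directed tie: poster -> responder
    return dict(ties)


# Hypothetical sample data.
messages = [
    ("joe", "<1@list>", None),
    ("bob", "<2@list>", "<1@list>"),
    ("joe", "<3@list>", "<2@list>"),
    ("bob", "<4@list>", "<1@list>"),
]
network = build_reply_network(messages)
print(network)
```

The counts play the role of the "valued" entries in the adjacency matrix; restricting the input records to a date range would give the network for a chosen time period.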
Intro to Social Network Metrics • In-degree – The number of links whose head is connected to a particular actor • Out-degree – The number of links whose tail is connected to a particular actor • Geodesic – A shortest path between two actors • Betweenness – The number of geodesics that a particular actor lies on.
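As a small illustration of the two degree definitions, the sketch below counts heads and tails over a list of directed links written as (tail, head) pairs. The edge list and the function name are made up for the example.

```python
from collections import Counter


def degrees(edges):
    """Return (in_degree, out_degree) counters for directed (tail, head) links."""
    in_deg, out_deg = Counter(), Counter()
    for tail, head in edges:
        out_deg[tail] += 1  # the link leaves its tail
        in_deg[head] += 1   # the link arrives at its head
    return in_deg, out_deg


# Hypothetical links: joe -> bob, joe -> ann, ann -> bob.
edges = [("joe", "bob"), ("joe", "ann"), ("ann", "bob")]
in_deg, out_deg = degrees(edges)
print(in_deg["bob"], out_deg["joe"])
```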
Example [Figure: example network of 12 numbered actors, highlighting actors with high out-degree, high in-degree, and high betweenness]
Betweenness more formally
For a given vertex i:
$B(i) = \sum_{s \neq i \neq t} \frac{\sigma_{st}(i)}{\sigma_{st}}$
• where $\sigma_{st}$ is the number of geodesics between s and t
• and $\sigma_{st}(i)$ is the number of those paths passing through vertex i
• It is common to normalize the values so that all betweenness scores sum to 1
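The quantity defined through σst and σst(i) can be computed with Brandes' algorithm, sketched below for an unweighted directed graph. The slides do not say which algorithm or library was used, so this is only an illustration; the graph representation (a dict of successor lists without duplicate links) and the sample graph are assumptions.

```python
from collections import deque


def betweenness(graph):
    """Brandes' algorithm on an unweighted directed graph {node: [successors]}."""
    nodes = set(graph) | {v for succ in graph.values() for v in succ}
    score = {v: 0.0 for v in nodes}
    for s in nodes:
        # Breadth-first search from s, counting geodesics (sigma) to each node.
        sigma = {v: 0 for v in nodes}
        dist = {v: -1 for v in nodes}
        preds = {v: [] for v in nodes}
        sigma[s], dist[s] = 1, 0
        order, queue = [], deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph.get(v, []):
                if dist[w] < 0:  # first time w is reached
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:  # v precedes w on a geodesic
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating path fractions.
        delta = {v: 0.0 for v in nodes}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    return score


def normalize(score):
    """Scale scores so that they sum to 1 (all zeros if there is nothing to scale)."""
    total = sum(score.values())
    return {v: (x / total if total else 0.0) for v, x in score.items()}


# Hypothetical graph: two equally short routes from a to c, then c -> e.
graph = {"a": ["b", "d"], "b": ["c"], "d": ["c"], "c": ["e"]}
raw = betweenness(graph)
print(normalize(raw))
```

In this sample graph, c sits on every geodesic that reaches e from another actor, so it ends up with the largest score, while b and d each carry half of the a-to-c and a-to-e paths.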
Everybody likes a pretty picture! This is the social network of some of the most active participants on the Apache developer mailing list. Each link indicates at least 150 messages between participants. Ryan Bloom has high betweenness in this network. Of the participants shown, he has the highest number of source file commits.
The distributions of in-degree and out-degree both exhibit power-law character
Status of Developers vs. Non-Developers • The largest difference is in betweenness
Correlation between communication and development • High correlation between betweenness and source file changes • Lower correlation between betweenness and document file changes • Similar relationship for in- and out-degree.
Observations from the network • The mailing list activity reflects a typical social network. • Developers are the “key social brokers”. • More active developers tend to be more important. • The results are robust: Postgres showed similar results.
Topics of future research • Visualization of software and social data • Who becomes a developer? • Relationship between communication and collaboration networks • Network Evolution • Conway’s Law
[Figure: Average In-Degree; y-axis: Avg In-Degree, x-axis: Months]