Living Network Centrality • Thomas Krichel, Long Island University & Novosibirsk State University • 5 May 2010
sponsors • Nikos Askitas of IZA for inviting me today. • Vincent Bertone Jr of Miteq Corp. for the computation support. We have an 8-CPU machine that runs the calculations and temporarily hosts the web service. • A previous version of these slides, prepared for a meeting exactly 4 years ago, was joint work with Nisa Bakkalbaşı.
structure of this talk • background on RePEc • RePEc author service • centrality as an incentive device • back to basics • results using the RePEc author service • implementation challenges
RePEc essence and history • It is an open-access abstracting and indexing database about economics. • It goes back to 1993, when Thomas Krichel started to build indexes of printed and online working papers in economics. • Now it also covers journal articles and some other publication types such as books and book chapters.
what is interesting about RePEc • Large • Unfunded • Relational • Evaluation oriented
RePEc is large • Over 550 archives contribute document data to the collection. • There are about 350k items described, more than in arXiv.org at some recent count. • There are about 10 different user services that use RePEc data or process it further.
RePEc is unfunded • While there are some sponsors for parts of RePEc, neither data collection nor service provision is externally sponsored. • Most data about publications come from dedicated RePEc archives based at • economics departments at universities • other research centers • some specialized administrative units such as central banks. • Services are mainly run by amateurs.
RePEc is relational • RePEc does not only register documents but also researchers and their institutions. • Institutions are centrally registered by one volunteer, Christian Zimmermann. • People register with the RePEc Author Service (RAS). More about this later.
RePEc is evaluation-oriented • Since we have identified authors, we can aggregate evaluative measures over authors and institutions. • Recently, Christian Zimmermann has built a battery of 22 different indicators for individuals. • This is a very rich dataset for scientometric exercises. Any questions?
RAS history and essence • It goes back to 1999, when Thomas Krichel directed work by Markus Johannes Richard Klink to build a special author registration web interface. • In 2002 the Open Society Institute contributed $50k to develop generic software to implement services such as RAS. • The software was written by Ivan V. Kurmanov. • It is called ACIS (Academic Contribution Information System).
how does RAS work? • Authors contact RAS to let RePEc know what papers they have written. • Registrants create and maintain a personal profile. • Registrants create and maintain a name variations profile. • RAS creates and maintains a contributions profile. • Once an initial profile is defined, ACIS has a mechanism called ARPU that alerts authors about documents being added to their profile. • The contributions profile contains the names of all documents.
what is interesting about RAS? • Registration of authors solves all problems of trying to identify authors by their names. • There are many ways to represent the same name, e.g. Bruno Van Pottelsbergh De la Potterie, proceedings page 128. Some RAS registrants have even longer names! • Many different authors may share the same name, or the same way in which a name can be represented. • Solving these problems "manually" is very expensive and only feasible for small sets of authors.
but RAS is not complete • Bakkalbasi and Krichel (2006), http://openlib.org/home/krichel/papers/elba.pdf (the Elba paper), have shown that, at their time of writing, • roughly every third RePEc document has at least one registered author, • roughly every fourth RePEc authorship is captured by RAS. • These figures are not likely to change very rapidly: • RAS gets more registrants. • RePEc gets more documents.
RAS and co-authorship • In the Elba paper there is a conjecture that the fact that author A is registered does not significantly increase the chance that the co-authors of A are registered. • This can not be formally shown without labouring through attempts to identify authors by name. • One indication is that the graph formed by co-author relationships in RAS is not dense. This has been found in recent work by Nisa Bakkalbasi.
registration incentive on co-authors • To get authors to register, we need good incentives. • In conventional indicators (Zimmermann's 22), the position of an author depends only on the author's own actions. • If we use co-authorship, we can devise rankings that depend on co-authorship. • If we have such a ranking, authors will have an incentive to get their co-authors to register.
imagine a RAS-CIS • A RAS Collaboration Information System should be built. • RAS-CIS could show the registrants • local information about shortest paths • network summaries via centrality indices. • The summary information will improve as more collaborators of the author register.
two tasks to build RAS-CIS • We have to select the measures to calculate and develop the tools to calculate them. This is what the paper is about. • We have to build an interface that allows intuitive access to that data. The data would have to be updated. • Since there has been no similar service before, this is a hard task. But it is not done here.
the job here • We calculate different centrality rankings of authors. • We compare the rankings among themselves. • We want to select the measure that is best to use in a web-based collaboration centrality ranking service. • RAS-CIS is still to be built fully. But I have built a running version under the title collec.repec.org for the meeting today.
collaboration graph • From a social networking perspective, collaboration establishes a graph structure. • RAS authors are the nodes. • Collaboration, i.e. common claim(s) of the same paper, forms the edges between nodes. • If there is no common paper claimed by two authors, no edge exists between their nodes. • Specific results depend on how the edge length is calculated from the collaboration structure.
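As an aside, here is a minimal sketch of how such a collaboration graph could be assembled, assuming a hypothetical %claims hash that maps each paper handle to the registered authors who claim it (the real ACIS data structures will differ):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical input: paper handle => list of registered authors claiming it.
my %claims = (
    'RePEc:xxx:wpaper:1' => [ 'pkr1', 'pba1' ],
    'RePEc:xxx:wpaper:2' => [ 'pkr1', 'pzi1', 'pba1' ],
);

# Build an undirected collaboration graph: two authors are adjacent
# if they have claimed at least one paper in common.
my %graph;    # author => { co-author => number of common papers }
for my $paper ( keys %claims ) {
    my @authors = @{ $claims{$paper} };
    for my $i ( 0 .. $#authors ) {
        for my $j ( $i + 1 .. $#authors ) {
            $graph{ $authors[$i] }{ $authors[$j] }++;
            $graph{ $authors[$j] }{ $authors[$i] }++;
        }
    }
}

# List each author's co-authors.
for my $author ( sort keys %graph ) {
    print "$author: ", join( ' ', sort keys %{ $graph{$author} } ), "\n";
}
```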
graph components • If there is a path between one author A and another author B along collaboration arcs, we say that A and B belong to the same component of the collaboration graph. • It is commonly observed in real-world networks that the largest component is quite large. It usually has more than 50% of all nodes and is therefore known as the giant component. • Most centrality measures are only meaningful for the members of the giant component.
face the force of facts in 2010 • 24,000 registrants are found in RAS. • ????? registrants (70% of registrants) are authors, i.e. they have claimed at least one paper. • ???? registrants (66% of authors) are co-authors, i.e. authors who have collaborated with at least one other RAS author. • 16,000 registrants (83% of co-authors) are in the giant component.
the RAS nodes • 16k authors is still a rather large network. • There are at least 16k × 16k / 2 ≈ 128 million shortest paths between the authors, and many more other paths. • Calculating a full set of shortest paths takes 8 days on an 8-CPU machine.
network type • Between any two nodes, there is an edge if the authors have ever collaborated. • But the length of the edge depends on how you view the strength of the collaboration. • Different edge lengths lead to different networks. • We introduce three networks in the following three slides.
network 1: binary network • In the binary network, the collaboration strength between any two authors is one if the two authors have claimed at least one common paper in RAS. The collaboration strength is zero otherwise. • The edge length is the inverse of the collaboration strength. • If the collaboration strength is zero, there is no edge between the two nodes. • We use an algorithm by Newman to do the calculations.
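Newman's method builds on repeated breadth-first searches, since every edge in the binary network has length one. A minimal single-source sketch, reusing the %graph adjacency shape from the earlier sketch (the full Newman computation also tracks the number of shortest paths, which is omitted here):

```perl
use strict;
use warnings;

# Breadth-first search from one author; returns shortest distances
# (in number of edges) to every reachable author in the binary network.
sub bfs_distances {
    my ( $graph, $start ) = @_;
    my %dist  = ( $start => 0 );
    my @queue = ($start);
    while (@queue) {
        my $node = shift @queue;
        for my $neighbour ( keys %{ $graph->{$node} } ) {
            next if exists $dist{$neighbour};
            $dist{$neighbour} = $dist{$node} + 1;
            push @queue, $neighbour;
        }
    }
    return \%dist;
}
```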
network 2: symmetric weighted network • In the symmetric weighted network, for each paper that two authors have claimed in common, we increment the collaboration strength between the two authors by one over the number of authors on that paper minus one. • As a result, the total collaboration strength of an author equals the number of co-authored papers. • We used the Dijkstra algorithm to find the shortest paths. This finds only one shortest path per pair.
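A sketch of that weighting under the 1/(n-1) convention described above, together with the inverse-strength edge lengths fed to Dijkstra; the input shape is the hypothetical one from the first sketch:

```perl
use strict;
use warnings;

# Symmetric weighted network: for every paper claimed in common,
# each pair of its authors gains 1/(n-1), where n is the number of
# authors on that paper. Edge length is the inverse of the strength.
sub weighted_edges {
    my ($claims) = @_;    # paper handle => arrayref of author handles
    my ( %strength, %length );
    for my $paper ( keys %$claims ) {
        my @authors = @{ $claims->{$paper} };
        my $n = scalar @authors;
        next if $n < 2;
        for my $i ( 0 .. $#authors ) {
            for my $j ( $i + 1 .. $#authors ) {
                $strength{ $authors[$i] }{ $authors[$j] } += 1 / ( $n - 1 );
                $strength{ $authors[$j] }{ $authors[$i] } += 1 / ( $n - 1 );
            }
        }
    }
    for my $a ( keys %strength ) {
        for my $b ( keys %{ $strength{$a} } ) {
            $length{$a}{$b} = 1 / $strength{$a}{$b};
        }
    }
    return ( \%strength, \%length );
}
```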
network 3: random walk network • In this type of network, we norm the collaboration strength of each author to one. • This generates an asymmetric network where inward edges are shorter for important authors who have written more papers. • This type of measure is used in SNA to measure prestige. • We used the Dijkstra algorithm to find the shortest paths. This finds only one shortest path per pair.
centrality measures • For each network, we can look at two centrality measures. • closeness centrality: a node is more central if it has a shorter average shortest path to all other nodes. • betweenness centrality: a node is more central if it lies on more of the shortest paths leading from one node to another. • These centrality measures rank authors from the most central to the least central.
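In standard notation, writing $d(v,u)$ for the shortest-path distance between $v$ and $u$, $N$ for the number of nodes in the component, $\sigma_{st}$ for the number of shortest paths from $s$ to $t$, and $\sigma_{st}(v)$ for the number of those that pass through $v$, the two measures are usually defined as

$$ C(v) = \frac{N-1}{\sum_{u \neq v} d(v,u)}, \qquad B(v) = \sum_{s \neq v \neq t} \frac{\sigma_{st}(v)}{\sigma_{st}}. $$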
notation for centrality measures • BIC closeness centrality in the binary network • BIB betweenness centrality in the binary network • SYC closeness centrality in the symmetric weighted network • SYB betweenness centrality in the symmetric weighted network • RWC closeness centrality in the random walk network • RWB betweenness centrality in the random walk network
pair-wise Spearman rank correlation from paper of 4 years ago

      BIC   BIB   SYC   SYB   RWC   RWB
BIC   1     .60   .90   .52   .89   .30
BIB   .60   1     .54   .81   .61   .57
SYC   .90   .54   1     .54   .91   .23
SYB   .52   .81   .54   1     .56   .42
RWC   .87   .61   .91   .56   1     .41
RWB   .30   .57   .23   .42   .41   1
comments • All three closeness measures produce very similar rankings. • SYB and BIB are close, but RWB is quite far from both of them. • Overall, the choice between betweenness and closeness seems to be more important than the choice between models. This was a surprise to us. BIC and BIB correlate at only .60; the other closeness/betweenness pairs are even lower.
adding the number of documents • We can add the number of documents as an additional ranking criterion, NDO. We get

      NDO   BIC   BIB   SYC   SYB   RWC   RWB
NDO   1     .68   .55   .71   .60   .70   .19

• Overall, the weighted network appears to be best correlated with the number of documents. This should come as no surprise.
why add this alien number NDO? • We can think of NDO as the simplest indication of the personal fame of an author. • If we want to incentivize authors to climb the ranks of a collaboration centrality ranking, we need people at the top whom they actually recognize. • Remember Groucho Marx: "I'll never join a club that accepts me as a member". • Thus the symmetric weighted network appears appealing.
symmetric weighted network • If we use the symmetric weighted values in an interface, the numbers that come out for closeness are not intuitive, because the total lengths are fractions. • But the fact that there should be much less path multiplicity makes the presentation simpler. • However, the paths may be longer (in simple counts of intermediate nodes) than in the binary model.
RAS-CIS • The most difficult aspect is to build the interface, since no similar service exists at this time. • Updating can not be done instantaneously, but it ought to be close to it. • If the contributions profile of an author changes, we can recalculate her paths. • We can also recalculate the paths of her co-authors. • But then we end up with an overall network that is no longer symmetric.
more work • RAS authorship data are a high-quality dataset that is easy to use. • It is not widely used at this point. • Note in particular that much of the data affecting collaboration has not been worked on: • affiliation data • journal/series data • subject classification data. • New ideas and partnerships welcome!
More history • In September 2006 I started to work on a document that would describe a general software system to maintain centrality calculations and an interface. • This is the Metz paper at http://openlib.org/home/krichel/work/metz.html • It was first implemented by Dmitri Ishkov, but in a way that I did not like. • I have recently been rewriting the software and the spec. After 4 years it has become a hard-hat area again.
basic ideas • Software written in Perl for mod_fcgi. • It can support a number of networks but does not automate the addition and removal of networks. • Computational intensity is controlled by crontab entries. • Perl manipulates XML structures (nuclea). All presentation work is done through XSLT. • Very limited use of database technology.
key concept • A source contains network data. These are descriptions of nodes. • A nettype is a type of network. The nettype determines the structure of the network, i.e. the numbers in the edges matrix. • Every network has a source and a nettype. All icanis functions are parameterized by them.
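Purely as an illustration of that parameterization (the actual icanis layout may differ), a network could be addressed by a per-(source, nettype) directory:

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper: every icanis function works on one network,
# identified by its source (the node data) and its nettype (the rule
# that turns the source into an edges matrix).
sub network_dir {
    my ( $base, $source, $nettype ) = @_;
    return File::Spec->catdir( $base, $source, $nettype );
}

# e.g. the RAS data with the weighted symmetric nettype
my $dir = network_dir( '/var/lib/icanis', 'ras', 'mans' );
```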
nodes table • There is a single table for all nodes, with the attributes • name • homepage • node_tist • path_tist • closeness • closeness_rank • betweenness • betweenness_rank • In addition, there are URL and nodepage attributes that can be generated using a configuration Perl module.
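As an illustration only, one node record could be held as a Perl structure with exactly these attributes; the sample values and the reading of the *_tist fields as timestamps are guesses:

```perl
# Illustrative node record with the attributes listed above;
# node_tist and path_tist are presumably timestamps of the last
# node update and the last path calculation.
my $node = {
    name             => 'Thomas Krichel',
    homepage         => 'http://openlib.org/home/krichel/',
    node_tist        => 1273017600,
    path_tist        => 1273017600,
    closeness        => 0,
    closeness_rank   => 0,
    betweenness      => 0,
    betweenness_rank => 0,
};
```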
path calculations • All software that I know of can basically calculate the paths from a single start point to all other nodes, as specified in the edges matrix at the time of calculation. • Although the Metz paper specified some crude instructions for an incremental recalculation of paths, • I have completely abandoned that approach.
path data store • Historically, our attempts to feed paths into a database came to a sad end. • Now the paths are held in files, one per node. • This implies that the same information is held twice on disk. • At path search time, the software determines the more recent source of path data and uses that one.
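A sketch of that freshness check, assuming (illustratively) that each node's path file is named after the node handle inside one directory; the real file layout may differ:

```perl
use strict;
use warnings;

# A path between two nodes is recorded in both nodes' path files,
# so at search time we use whichever of the two files is more recent.
sub fresher_path_file {
    my ( $paths_dir, $node_a, $node_b ) = @_;
    my $file_a  = "$paths_dir/$node_a";    # illustrative layout
    my $file_b  = "$paths_dir/$node_b";
    my $mtime_a = -e $file_a ? ( stat $file_a )[9] : 0;
    my $mtime_b = -e $file_b ? ( stat $file_b )[9] : 0;
    return $mtime_a >= $mtime_b ? $file_a : $file_b;
}
```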
closeness update • Closeness centrality can be calculated knowing information about a single node only. • Closeness is therefore immediately updated.
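A minimal sketch of such a per-node update, assuming the path calculation leaves behind a hash of shortest-path lengths from the node to every other reachable node:

```perl
use strict;
use warnings;

# Closeness needs only the node's own shortest-path lengths, so it
# can be refreshed immediately after that node's path calculation.
sub closeness {
    my ($dist) = @_;    # other node handle => shortest-path length
    my @lengths = values %$dist;
    return 0 unless @lengths;
    my $sum = 0;
    $sum += $_ for @lengths;
    return scalar(@lengths) / $sum;    # inverse of the average distance
}
```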
betweenness update • This is particularly hard because the basis of the calculation is all paths. • Icanis uses a construct called the inter file. At path calculation time, a file called inter, in a directory determined by the handle of the starting node, is created or updated. • It contains, for each node, the number of times it has been seen as an intermediary on the paths for this starting node. • This enables easy betweenness calculations.
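A sketch of what maintaining the inter file could look like, assuming the path data is available as a hash from destination node to the list of intermediate nodes on its path; the file format and directory layout are illustrative:

```perl
use strict;
use warnings;

# Write the "inter" file once the paths from one starting node are
# known: count how often every other node appears as an intermediary
# on those paths. Summing these files over all starting nodes yields
# the betweenness counts.
sub write_inter_file {
    my ( $dir, $paths ) = @_;    # destination => arrayref of intermediate nodes
    my %seen;
    for my $destination ( keys %$paths ) {
        $seen{$_}++ for @{ $paths->{$destination} };
    }
    open my $fh, '>', "$dir/inter" or die "cannot write $dir/inter: $!";
    print {$fh} "$_ $seen{$_}\n" for sort keys %seen;
    close $fh;
}
```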
ranking updates • The paths database contains not only values for node criteria, but also their rankings. • The ranking for each criterion is updated when the static ranking pages are calculated. The node pages refer to the rankings as calculated at that time.
node visualization • One interface problem is the choice of representation of a node. • All registrants can give us a homepage address, but it is optional and may not be up to date. • We have the node_page, an internal page of the icanis implementation. • We have the URL, an address of an external service we can link to for node information.
static html pages • icanis tries to rely as much as possible on file-based responses. • This is implemented for all browsable components. • There is one file per node. • There is one file per batch of criteria ranks. The batch size is given as a run-time parameter at page renewal. • Making paths browsable seems difficult. At this time only a search is supported.
RAS implementation • RAS data forms a source called “ras”. Currently an implementation with the nettype “mans” (weighted symmetric network) exists at the address http://collec.repec.org/ras/mans. • A search for nodes is still to be done. At the moment nodes can only be browsed. Destinations for paths can be searched.
problem with mans • The results lend themselves to a paradox: the shortest path between two collaborators who have written a paper together can avoid their direct edge and run through intermediaries. • This is the Joseph Pearlman / Thomas Sargent problem. I have observed it with these two authors.
a new binary network • Since this is to appeal to humans rather than to computers, it appears best to return to a binary representation of edges. • Since binary networks tend to produce a vast array of multiple shortest paths, the mans edge lengths can be used to eliminate all paths that do not have the shortest length in the weighted network. • We can take a random selection from the rest.
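A sketch of that selection rule, assuming the candidate binary shortest paths are given as lists of node handles and the mans edge lengths as a nested hash (names are illustrative):

```perl
use strict;
use warnings;
use List::Util qw(min);

# Among equally short binary paths between two nodes, keep only those
# with minimal total mans length, then pick one of them at random.
sub pick_path {
    my ( $paths, $mans_length ) = @_;   # arrayref of paths, hashref of edge lengths
    my %total;
    for my $path (@$paths) {
        my $sum = 0;
        $sum += $mans_length->{ $path->[$_] }{ $path->[ $_ + 1 ] }
            for 0 .. $#$path - 1;
        $total{$path} = $sum;
    }
    my $best     = min values %total;
    my @shortest = grep { $total{$_} == $best } @$paths;
    return $shortest[ int rand @shortest ];
}
```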
other applications • This technology can be extended to many domains. • For example, I did some analysis using RePEc data on the centrality of JEL classification categories, using the relationships of classification numbers in actual economics papers. • But that's a topic for another day!