Link Prediction in Co-Authorship Network • Le Nhat Minh (A0074403N) • Supervisor: Dongyuan Lu
Introduction • Link prediction • Predict connections that are likely to form in the future within the network • Co-authorship network • A network of collaborations among researchers, scientists, and academic writers
Introduction • Potential applications • Recommend experts or groups of researchers to an individual researcher.
Outline • Problem Background • Related Work • Workflow • Conclusion • Result Analysis • Research plan
Problem Background • What connects researchers? • Given an instance of a co-authorship network: • A researcher is connected to another if they have collaborated on at least one paper. [Figure: example graph with nodes X and Y and year labels 2001 and 2004]
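The linking rule above can be sketched in a few lines. This is a minimal illustration, assuming papers arrive as `(year, authors)` tuples; that input format, the function name `build_coauthor_graph`, and the toy data are hypothetical and not taken from the slides.

```python
from collections import defaultdict
from itertools import combinations


def build_coauthor_graph(papers):
    """Map each unordered author pair to the sorted years of their joint papers.

    `papers` is an iterable of (year, [author, ...]) tuples (assumed format).
    """
    edges = defaultdict(list)
    for year, authors in papers:
        # Every pair of distinct authors on the same paper gets a link.
        for a, b in combinations(sorted(set(authors)), 2):
            edges[frozenset((a, b))].append(year)
    return {pair: sorted(years) for pair, years in edges.items()}


# Toy data for illustration only.
papers = [(2001, ["X", "Y"]), (2004, ["X", "Y"]), (2004, ["Y", "Z"])]
graph = build_coauthor_graph(papers)
```

Here `X` and `Y` end up connected by one link that records both years, while `X` and `Z` stay unconnected because they never share a paper.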
Problem Background • How to predict the link? • Based on criteria: • Co-authorship network topology • Researcher’s personal information • Researcher’s papers • Boost link prediction performance • Recommended links should be relevant to the authors’ interests, or at least represent collaborations that are feasible for the researchers.
Related Work • Link prediction problems in social networks • Liben-Nowell, D., & Kleinberg, J., 2007 • Bliss, C. A., Frank, M. R., Danforth, C. M., & Dodds, P. S., 2013 • In social networks, interactions among users are very dynamic, with: • Creation of new links within a few days • Deletion or replacement of existing links • The two kinds of network present different features: • Characteristics of an individual researcher: citations, affiliations, institutions, ... • Characteristics of a person: marital status, age, workplace, …
Three mainstream approaches for link prediction: • Similarity based estimation • Liben‐Nowell, D., & Kleinberg, J., 2007 • Maximum likelihood estimation • Murata, T., & Moriyasu, S., 2008 • Guimerà, R., & Sales-Pardo, M., 2009 • Supervised Learning model • Pavlov, M., & Ichise, R., 2007 • Al Hasan, M., Chaoji, V., Salem, S., & Zaki, M., 2006
Similarity Based Estimation • Use metrics to estimate the proximity of each pair of researchers • Rank the pairs of researchers by those proximities • The top-ranked pairs are the likely recommendations.
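The score-then-rank procedure on this slide could look roughly like the sketch below. It assumes the graph is stored as a dictionary of neighbor sets and uses a simple common-neighbor count as the proximity; the helper names and the toy graph are illustrative only.

```python
from itertools import combinations


def rank_candidate_pairs(adj, score, k):
    """Return the k highest-scoring pairs that are not yet linked.

    adj:   {node: set of neighbours}
    score: function (adj, x, y) -> number, higher means closer
    """
    candidates = [
        (x, y) for x, y in combinations(sorted(adj), 2) if y not in adj[x]
    ]
    ranked = sorted(candidates, key=lambda p: score(adj, p[0], p[1]), reverse=True)
    return ranked[:k]


def common_neighbors(adj, x, y):
    # Number of collaborators shared by x and y.
    return len(adj[x] & adj[y])


# Toy graph for illustration only.
adj = {
    "A": {"B", "C"},
    "B": {"A", "C", "D"},
    "C": {"A", "B", "D"},
    "D": {"B", "C"},
    "E": set(),
}
top = rank_candidate_pairs(adj, common_neighbors, k=1)
```

Any other proximity measure with the same `(adj, x, y)` signature can be passed in place of `common_neighbors`.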
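Similarity Based Estimation • Network-structure-based measurements • Some conventions: $\Gamma(x)$ denotes the set of neighbors of node $x$, and $|\Gamma(x)|$ its degree
Similarity Based Estimation • Common Neighbor: $\mathrm{score}(x, y) = |\Gamma(x) \cap \Gamma(y)|$ [Figure: nodes X and Y]
Similarity Based Estimation • Jaccard’s coefficient: $\mathrm{score}(x, y) = \dfrac{|\Gamma(x) \cap \Gamma(y)|}{|\Gamma(x) \cup \Gamma(y)|}$ [Figure: nodes X and Y]
Similarity Based Estimation • Preferential Attachment: $\mathrm{score}(x, y) = |\Gamma(x)| \cdot |\Gamma(y)|$ [Figure: nodes X and Y]
Similarity Based Estimation • Adamic/Adar: $\mathrm{score}(x, y) = \sum_{z \in \Gamma(x) \cap \Gamma(y)} \dfrac{1}{\log |\Gamma(z)|}$ [Figure: nodes X and Y with common neighbor Z]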
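The four local measures named on these slides can be sketched as follows, assuming the standard definitions from Liben-Nowell & Kleinberg (2007) and a graph stored as a dictionary of neighbor sets. The toy graph is made up for illustration.

```python
import math


def common_neighbors(adj, x, y):
    # Size of the intersection of the two neighbour sets.
    return len(adj[x] & adj[y])


def jaccard(adj, x, y):
    # Shared neighbours divided by all neighbours of either node.
    union = adj[x] | adj[y]
    return len(adj[x] & adj[y]) / len(union) if union else 0.0


def preferential_attachment(adj, x, y):
    # Product of the two degrees.
    return len(adj[x]) * len(adj[y])


def adamic_adar(adj, x, y):
    # Shared neighbours weighted by inverse log-degree; a shared neighbour
    # is linked to both x and y, so its degree is at least 2 and log > 0.
    return sum(1.0 / math.log(len(adj[z])) for z in adj[x] & adj[y])


# Toy undirected graph for illustration only.
adj = {
    "X": {"Z", "W"},
    "Y": {"Z", "W", "V"},
    "Z": {"X", "Y"},
    "W": {"X", "Y", "V"},
    "V": {"Y", "W"},
}
```

Adamic/Adar differs from the plain common-neighbor count in that a shared neighbor with few collaborators contributes more than a shared neighbor with many.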
Similarity Based Estimation • Shortest Path: • The minimum number of edges connecting two nodes. • PageRank: • A random walk on the graph that assigns each node the probability of being reached. The proximity between a pair of nodes can be determined by the sum of the two nodes’ PageRank scores.
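Both global measures can be sketched without external libraries. The version below assumes an unweighted, undirected graph stored as neighbor sets; the damping factor of 0.85 and the fixed iteration count are conventional choices, not values given in the slides.

```python
from collections import deque


def shortest_path_length(adj, source, target):
    """Breadth-first search; returns None when the nodes are disconnected."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for nb in adj[node]:
            if nb == target:
                return dist + 1
            if nb not in seen:
                seen.add(nb)
                queue.append((nb, dist + 1))
    return None


def pagerank(adj, damping=0.85, iterations=50):
    """Power iteration; nodes without neighbours spread their rank uniformly."""
    n = len(adj)
    rank = {v: 1.0 / n for v in adj}
    for _ in range(iterations):
        new = {v: (1.0 - damping) / n for v in adj}
        for v, nbrs in adj.items():
            targets = nbrs if nbrs else adj.keys()
            share = damping * rank[v] / len(targets)
            for u in targets:
                new[u] += share
        rank = new
    return rank


def pagerank_proximity(rank, x, y):
    # Proximity of a pair as the sum of the two PageRank scores.
    return rank[x] + rank[y]


# Toy graph: a path A-B-C-D plus an isolated node E.
adj = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}, "E": set()}
rank = pagerank(adj)
```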
Maximum Likelihood Estimation • Predefine specific rules for a network • Requires prior knowledge of the network • The likelihood of any unconnected pair forming a link is calculated according to those rules.
Supervised Learning Model • Construct multi-dimensional feature vectors • Feed these vectors to classifiers to optimize a target function (training the model) • Link prediction becomes a binary classification problem
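To make the binary-classification framing concrete, here is a minimal sketch that trains a logistic-regression classifier by stochastic gradient descent on toy feature vectors. Logistic regression is only a stand-in chosen for brevity; it is not one of the classifiers listed in the related work, and the two features (common-neighbor count, Jaccard coefficient) and their values are invented for illustration.

```python
import math


def train_logistic(features, labels, lr=0.1, epochs=1000):
    """Fit weights w and bias b by minimising log-loss with plain SGD."""
    w = [0.0] * len(features[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))  # predicted link probability
            err = p - y                     # gradient of log-loss w.r.t. z
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b


def predict(w, b, x):
    # Label 1 means "link", label 0 means "no link".
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if z >= 0 else 0


# Toy feature vectors: [common neighbours, Jaccard coefficient].
features = [[4, 0.8], [3, 0.6], [0, 0.0], [1, 0.1]]
labels = [1, 1, 0, 0]
w, b = train_logistic(features, labels)
```

Each candidate pair of researchers becomes one row of `features`, and the classifier's output for an unseen pair is the link prediction.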
Supervised Learning Model • Related work (Al Hasan, M., Chaoji, V., Salem, S., & Zaki, M., 2006) using: • Decision Tree • SVM (Linear Kernel) • k-nearest neighbors • Multilayer Perceptron • Naive Bayes • Bagging • Combine many classifiers (Pavlov, M., & Ichise, R., 2007) • Decision stump + AdaBoost • Decision Tree + AdaBoost • SMO + AdaBoost
Summary • Similarity based estimation • Does not perform very well • Maximum likelihood • Depends on the network • Supervised learning model • Performs better than similarity based estimation
Workflow • [Figure: workflow diagram with stages labeled Classifier, Model, and Features]
Graph Description • Co-authorship graph: • Undirected graph G(V, E) • Node or Vertex (Author) • Author ID • Author Name • Link or Edge (Co-authorship) • Pair of author IDs • List of publication years, each followed by the paper title (Ex: 2004: “Introduction to …”)
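One possible in-memory layout for this graph description is sketched below. The class and field names are hypothetical, and the paper titles in the example are placeholders.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Author:
    """A node of the graph."""
    author_id: int
    name: str


@dataclass
class CoauthorEdge:
    """An undirected link between two authors."""
    author_ids: frozenset                       # pair of author IDs
    papers: list = field(default_factory=list)  # (year, title) tuples

    def add_paper(self, year, title):
        self.papers.append((year, title))
        self.papers.sort()  # keep papers in chronological order


# Placeholder data for illustration only.
x = Author(1, "Researcher X")
y = Author(2, "Researcher Y")
edge = CoauthorEdge(frozenset((x.author_id, y.author_id)))
edge.add_paper(2004, "Placeholder title B")
edge.add_paper(2001, "Placeholder title A")
```

Using a `frozenset` for the ID pair makes the edge independent of author order, which matches the undirected graph.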
Setting up data • The dataset is separated into two time spans: 2000 – 2010 and 2010 – 2013 • The first is for training, the latter is for testing. • Currently, there are 134,307 researchers in the 2000 – 2013 network. • Remove authors who do not appear in the testing period, leaving 104,265 researchers
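The split and the author filter could be sketched as below. The slides give both spans with 2010 as a boundary, so this sketch assumes that 2010 itself belongs to the testing span; that choice, the function names, and the toy data are assumptions.

```python
def split_by_period(edges, first_year=2000, split_year=2010, last_year=2013):
    """Split {pair: [years]} into training and testing link dictionaries.

    Assumption: years in [first_year, split_year) are training,
    years in [split_year, last_year] are testing.
    """
    train, test = {}, {}
    for pair, years in edges.items():
        train_years = [y for y in years if first_year <= y < split_year]
        test_years = [y for y in years if split_year <= y <= last_year]
        if train_years:
            train[pair] = train_years
        if test_years:
            test[pair] = test_years
    return train, test


def drop_inactive_authors(train, test):
    """Keep only training links whose two authors both appear in testing."""
    active = set().union(*test)  # every author seen in the testing period
    return {pair: years for pair, years in train.items() if pair <= active}


# Toy data for illustration only.
edges = {
    frozenset(("A", "B")): [2003, 2011],
    frozenset(("B", "C")): [2005],
    frozenset(("A", "C")): [2012],
    frozenset(("C", "D")): [2004],
}
train, test = split_by_period(edges)
filtered_train = drop_inactive_authors(train, test)
```

In the toy data, author `D` never publishes in the testing span, so the `C`-`D` link is dropped from training.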
Setting up data • Choose a subset from 104,265 researchers • Experiment on 937 researchers
Baseline Features • Extract features from the network structure: • Local similarity • Common Neighbor • Adamic/Adar • Preferential Attachment • Jaccard’s coefficient • Global similarity • Shortest Path • PageRank
Baseline Features • Feature for co-authorship network • Keyword matching (Cohen, S., & Ebel, L., 2013) • A suggested metric that measures textual relevancy using a TF-IDF-based function.
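As a rough illustration of a TF-IDF-based relevancy score, the sketch below builds one TF-IDF vector per author from keyword tokens and compares two authors by cosine similarity. This is a generic formulation and not necessarily the function used by Cohen & Ebel (2013); the keyword lists are invented.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: {author: list of keyword tokens} -> {author: {term: weight}}."""
    n = len(docs)
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))  # document frequency of each term
    vectors = {}
    for author, tokens in docs.items():
        tf = Counter(tokens)
        vectors[author] = {
            term: (count / len(tokens)) * math.log(n / df[term])
            for term, count in tf.items()
        }
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Invented keyword lists for illustration only.
docs = {
    "A": ["graph", "link", "prediction"],
    "B": ["link", "prediction", "network"],
    "C": ["protein", "folding"],
}
vectors = tfidf_vectors(docs)
```

Authors who share rare keywords score higher than authors who share only common ones, and authors with no keywords in common score zero.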
Proposed Features • Productivity of the authors • Observe the “history” of an author • For example, at a particular node A: • i=0: T0 = 2000, n=3, m=1 • i=1: T1 = 2004, n=4, m=2 • i=2: T2 = 2005, n=6, m=2 • i=3: T3 = 2006, n=7, m=3 • n: No. of shared papers • m: No. of collaborators
Proposed Features • Productivity of the authors Observe the “history” of an author The “productivity” of node A: α : a constant to assign the weight of each time period
Training set • Set up training data • With n nodes, there are n(n−1)/2 possible links. • Among those, distinguish two kinds of links • Positive links: links that appear in the training years. • Negative links: the remaining links, which do not exist in the training years. Note: Avoid biased training by balancing the number of instances with true and false labels. • Classify all the non-existent links • Compare with the testing data
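A balanced training set along these lines could be assembled as in the sketch below. Random undersampling of the negative links is one common way to balance the labels; the slides do not say which method was used, so treat it as an assumption, along with the function name and toy data.

```python
import random
from itertools import combinations


def build_training_pairs(nodes, train_links, seed=0):
    """Return (pair, label) tuples with equal numbers of 1s and 0s.

    nodes:       iterable of node IDs
    train_links: set of frozenset pairs linked in the training years
    """
    # All n(n-1)/2 unordered pairs of an undirected graph.
    all_pairs = [frozenset(p) for p in combinations(sorted(nodes), 2)]
    positives = [p for p in all_pairs if p in train_links]
    negatives = [p for p in all_pairs if p not in train_links]
    # Assumption: balance by randomly undersampling the negative links.
    rng = random.Random(seed)
    sampled = rng.sample(negatives, min(len(positives), len(negatives)))
    return [(p, 1) for p in positives] + [(p, 0) for p in sampled]


# Toy data for illustration only.
nodes = ["A", "B", "C", "D", "E"]
train_links = {frozenset(("A", "B")), frozenset(("B", "C")), frozenset(("C", "D"))}
examples = build_training_pairs(nodes, train_links)
```

With 5 nodes there are 10 possible links; 3 are positive, so 3 of the 7 negative links are sampled to match.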
Experimental Results • New links to predict: 57 links • Measurement of performance • Precision: $P = \dfrac{TP}{TP + FP}$ • Recall: $R = \dfrac{TP}{TP + FN}$ • Harmonic mean: $F_1 = \dfrac{2PR}{P + R}$
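The three performance measures can be computed from the set of predicted links and the set of links that actually appear, as in this short sketch using the standard definitions. The sample sets are placeholders and not results from the experiment.

```python
def precision_recall_f1(predicted, actual):
    """Standard precision, recall and their harmonic mean (F1)."""
    predicted, actual = set(predicted), set(actual)
    tp = len(predicted & actual)  # correctly predicted links
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(actual) if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Placeholder link IDs for illustration only.
precision, recall, f1 = precision_recall_f1({1, 2, 3, 4}, {3, 4, 5})
```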
Result Analysis • Possible reasons • Features • Small set of data – sampling problem • Instances of the negative links used for training
Research Plan • Use a weighted graph with parameters: • No. of papers • No. of neighbors • No. of citations • Focus on features that specifically target the co-authorship network: • Citations • Institutions • Enlarge the experimental dataset
Thank you
References • Adamic, L. A., & Adar, E. (2003). Friends and neighbors on the web. Social Networks, 25(3), 211-230. • Al Hasan, M., Chaoji, V., Salem, S., & Zaki, M. (2006). Link prediction using supervised learning. In SDM’06: Workshop on Link Analysis, Counter-terrorism and Security. • Liben-Nowell, D., & Kleinberg, J. (2007). The link-prediction problem for social networks. Journal of the American Society for Information Science and Technology, 58(7), 1019-1031. • Pavlov, M., & Ichise, R. (2007). Finding experts by link prediction in co-authorship networks. FEWS, 290, 42-55. • Murata, T., & Moriyasu, S. (2008). Link prediction based on structural properties of online social networks. New Generation Computing, 26(3), 245-257. • Guimerà, R., & Sales-Pardo, M. (2009). Missing and spurious interactions and the reconstruction of complex networks. Proceedings of the National Academy of Sciences, 106(52), 22073-22078. • Bliss, C. A., Frank, M. R., Danforth, C. M., & Dodds, P. S. (2013). An evolutionary algorithm approach to link prediction in dynamic social networks. arXiv preprint arXiv:1304.6257. • Cohen, S., & Ebel, L. (2013). Recommending collaborators using keywords. In Proceedings of the 22nd International Conference on World Wide Web Companion (pp. 959-962).
The number of links per year in the training set is greater than in the testing set: • In the testing period, only “new” collaborations are considered. • Any collaboration between researchers who already have a link is disregarded.
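The rule for which testing-period links count can be expressed as a set difference. This is a small sketch assuming links are stored as `frozenset` pairs; the sample pairs are invented.

```python
def new_links(train_links, test_links):
    """Links in the testing period whose authors had no link during training."""
    return set(test_links) - set(train_links)


# Invented pairs for illustration only.
train_links = {frozenset(("A", "B")), frozenset(("B", "C"))}
test_links = {frozenset(("A", "B")), frozenset(("A", "C"))}
targets = new_links(train_links, test_links)
```

Here the repeated `A`-`B` collaboration is disregarded, and only `A`-`C` remains as a link to predict.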
Proposed Feature • The reasons for proposing this feature: • Keep track of each researcher’s tendencies • Give a “bonus” to researchers who tend to collaborate with “new” colleagues rather than “old” ones • Also give a high score to prolific researchers (based on the number of published papers)
Stochastic Block Model • Guimerà, R., & Sales-Pardo, M., 2009
Stochastic Block Model • [Figure: example network with nodes 1 to 7, X, and Y] • The reliability of an individual link is: