Link Prediction in Co-Authorship Network

Link Prediction in Co-Authorship Network. Le Nhat Minh (A0074403N). Supervisor: Dongyuan Lu. Introduction. Link prediction: predict future connections within the network scope. Co-authorship network: a network of collaborations among researchers, scientists, and academic writers.


Presentation Transcript


  1. Link Prediction in Co-Authorship Network Le Nhat Minh ( A0074403N) Supervisor: Dongyuan Lu

  2. Introduction • Link prediction • Predict future connections within the network scope • Co-authorship network • A network of collaborations among researchers, scientists, and academic writers

  3. Introduction • Potential applications • Recommend experts or groups of researchers to an individual researcher.

  4. Outline • Problem Background • Related Work • Workflow • Conclusion • Result Analysis • Research plan

  5. Problem Background • What connects researchers? • Given an instance of a co-authorship network: • A researcher is connected to another if they have collaborated on at least one paper. (Figure: researchers X and Y, with collaboration years 2001 and 2004.)
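
A minimal sketch of how such a network could be built from paper records, assuming each record is a (year, title, author IDs) tuple. The records and the function name `build_coauthorship_graph` are hypothetical and not taken from the presentation's dataset.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy records: (publication year, title, list of author IDs).
papers = [
    (2001, "Paper A", ["X", "Y"]),
    (2004, "Paper B", ["X", "Y", "Z"]),
    (2005, "Paper C", ["Z", "W"]),
]

def build_coauthorship_graph(papers):
    """Return (neighbors, edge_papers) for an undirected co-authorship graph."""
    neighbors = defaultdict(set)      # author -> set of co-authors
    edge_papers = defaultdict(list)   # (a, b) with a < b -> [(year, title), ...]
    for year, title, authors in papers:
        # Every pair of authors on the same paper gets an edge.
        for a, b in combinations(sorted(set(authors)), 2):
            neighbors[a].add(b)
            neighbors[b].add(a)
            edge_papers[(a, b)].append((year, title))
    return neighbors, edge_papers

neighbors, edge_papers = build_coauthorship_graph(papers)
print(sorted(neighbors["X"]))
print(edge_papers[("X", "Y")])
```

Storing each edge under a sorted pair keeps the graph undirected: (X, Y) and (Y, X) map to the same key.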

  6. Problem Background • How to predict a link? • Based on these criteria: • Co-authorship network topology • Researcher’s personal information • Researcher’s papers • Boost link prediction performance • A recommended link should be relevant to the authors’ interests, or at least represent a feasible collaboration.

  7. Related Work • Link prediction problems in social networks • Liben‐Nowell, D., & Kleinberg, J., 2007 • Bliss, C. A., Frank, M. R., Danforth, C. M., & Dodds, P. S., 2013 • In a social network, interactions among users are very dynamic, with: • Creation of new links within a few days • Deletion or replacement of existing links • The two kinds of network present different features • Characteristics of an individual researcher: citations, affiliations, institutions, ... • Characteristics of a person: marital status, age, workplace, …

  8. Three mainstream approaches for link prediction: • Similarity based estimation • Liben‐Nowell, D., & Kleinberg, J., 2007 • Maximum likelihood estimation • Murata, T., & Moriyasu, S., 2008 • Guimerà, R., & Sales-Pardo, M., 2009 • Supervised Learning model • Pavlov, M., & Ichise, R., 2007 • Al Hasan, M., Chaoji, V., Salem, S., & Zaki, M., 2006

  9. Similarity Based Estimation • Use metrics to estimate the proximity of each pair of researchers • Rank the pairs of researchers by those proximities • The top-ranked pairs are the likely recommendations.

  10. Similarity Based Estimation • Network structure based measurement Some conventions:

  11. Similarity Based Estimation • Common Neighbor (figure: nodes X and Y)

  12. Similarity Based Estimation • Jaccard’s coefficient (figure: nodes X and Y)

  13. Similarity Based Estimation • Preferential Attachment (figure: nodes X and Y)

  14. Similarity Based Estimation • Adamic/Adar (figure: nodes X, Y, and Z)
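
The four local measures named on these slides have standard definitions in the literature (Liben‐Nowell & Kleinberg, 2007). Writing Γ(x) for the set of neighbors of x (a notation assumed here): Common Neighbor is |Γ(x) ∩ Γ(y)|, Jaccard's coefficient is |Γ(x) ∩ Γ(y)| / |Γ(x) ∪ Γ(y)|, Preferential Attachment is |Γ(x)| · |Γ(y)|, and Adamic/Adar sums 1 / log |Γ(z)| over the common neighbors z. The sketch below implements these textbook forms on a hypothetical toy graph; it may differ in detail from the formulas shown on the original slides.

```python
import math
from itertools import combinations

# Hypothetical toy graph: author -> set of co-authors (undirected).
graph = {
    "X": {"A", "B", "Z"},
    "Y": {"B", "Z", "C"},
    "Z": {"X", "Y"},
    "A": {"X"},
    "B": {"X", "Y"},
    "C": {"Y"},
}

def common_neighbors(g, x, y):
    return len(g[x] & g[y])

def jaccard(g, x, y):
    union = g[x] | g[y]
    return len(g[x] & g[y]) / len(union) if union else 0.0

def preferential_attachment(g, x, y):
    return len(g[x]) * len(g[y])

def adamic_adar(g, x, y):
    # A common neighbor of two distinct nodes has degree >= 2, so log(degree) > 0.
    return sum(1.0 / math.log(len(g[z])) for z in g[x] & g[y])

def rank_pairs(g, score):
    """Rank all currently unconnected pairs by a similarity score, best first."""
    pairs = [(x, y) for x, y in combinations(sorted(g), 2) if y not in g[x]]
    return sorted(pairs, key=lambda p: score(g, *p), reverse=True)

print(common_neighbors(graph, "X", "Y"), jaccard(graph, "X", "Y"))
print(rank_pairs(graph, adamic_adar)[:3])
```

Ranking only the unconnected pairs mirrors the recommendation setting: existing collaborations are not candidates.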

  15. Similarity Based Estimation • Shortest Path: • The minimum number of edges connecting two nodes. • PageRank: • A random walk on the graph assigns each node the probability that it is reached. The proximity of a pair of nodes can be taken as the sum of the two nodes’ PageRank scores.
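
A rough sketch of the two global measures, assuming an unweighted undirected graph stored as an adjacency dictionary. Shortest path uses breadth-first search; PageRank uses plain power iteration with a damping factor of 0.85, a common default that is an assumption here rather than a value from the presentation.

```python
from collections import deque

# Hypothetical toy graph: a simple path A - B - C - D.
path_graph = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}

def shortest_path_length(g, source, target):
    """Minimum number of edges between two nodes, or None if unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for nb in g[node]:
            if nb == target:
                return dist + 1
            if nb not in seen:
                seen.add(nb)
                queue.append((nb, dist + 1))
    return None

def pagerank(g, damping=0.85, iterations=50):
    """Power-iteration PageRank; scores sum to 1."""
    n = len(g)
    rank = {v: 1.0 / n for v in g}
    for _ in range(iterations):
        new_rank = {v: (1.0 - damping) / n for v in g}
        for v, nbrs in g.items():
            if nbrs:
                share = damping * rank[v] / len(nbrs)
                for nb in nbrs:
                    new_rank[nb] += share
            else:
                # Isolated node: spread its rank uniformly over all nodes.
                for u in g:
                    new_rank[u] += damping * rank[v] / n
        rank = new_rank
    return rank

def pagerank_proximity(rank, x, y):
    # Pair score as described on the slide: sum of the two nodes' PageRank.
    return rank[x] + rank[y]

rank = pagerank(path_graph)
print(shortest_path_length(path_graph, "A", "D"))
print(pagerank_proximity(rank, "A", "D"), pagerank_proximity(rank, "B", "C"))
```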

  16. Maximum Likelihood Estimation • Predefine specific rules for the network • Requires prior knowledge of the network • The likelihood of any non-connected pair is calculated according to those rules.

  17. Supervised Learning Model • Construct multi-dimensional feature vectors • Feed these vectors to classifiers to optimize a target function (training the model) • Link prediction becomes a binary classification problem
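
To illustrate the idea, here is a small sketch that turns a pair of researchers into a feature vector and classifies it with a k-nearest-neighbor vote. The graph, the labeled training vectors, and the choice of three features are all hypothetical; the presentation's actual feature set and classifiers may differ.

```python
import math

# Hypothetical toy graph: author -> set of co-authors.
graph = {
    "X": {"A", "B", "Z"},
    "Y": {"B", "Z", "C"},
    "Z": {"X", "Y"},
    "A": {"X"},
    "B": {"X", "Y"},
    "C": {"Y"},
}

def feature_vector(g, x, y):
    """[common neighbors, Jaccard's coefficient, preferential attachment]."""
    common = g[x] & g[y]
    union = g[x] | g[y]
    return [
        float(len(common)),
        len(common) / len(union) if union else 0.0,
        float(len(g[x]) * len(g[y])),
    ]

def knn_predict(train_X, train_y, query, k=3):
    """Majority vote among the k training vectors closest to the query."""
    by_distance = sorted(zip(train_X, train_y),
                         key=lambda item: math.dist(item[0], query))
    votes = [label for _, label in by_distance[:k]]
    return 1 if sum(votes) * 2 > len(votes) else 0

# Hypothetical labeled examples: 1 = link formed later, 0 = no link.
train_X = [[2.0, 0.5, 9.0], [3.0, 0.6, 12.0], [0.0, 0.0, 2.0], [0.0, 0.0, 1.0]]
train_y = [1, 1, 0, 0]

query = feature_vector(graph, "X", "Y")
print(query, knn_predict(train_X, train_y, query))
```

In practice the features would usually be scaled first, since preferential attachment can dominate the distance.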

  18. Supervised Learning Model • Related work (Al Hasan, M., Chaoji, V., Salem, S., & Zaki, M., 2006) using: • Decision Tree • SVM (Linear Kernel) • K-nearest neighbor • Multilayer Perceptron • Naive Bayes • Bagging • Combine many classifiers (Pavlov, M., & Ichise, R., 2007) • Decision stump + AdaBoost • Decision Tree + AdaBoost • SMO + AdaBoost

  19. Summary • Similarity based estimation • Does not perform well • Maximum likelihood • Depends on the network • Supervised learning model • Performs better than similarity based estimation

  20. Workflow (diagram: Features, Classifier, Model)

  21. Graph Description • Co-authorship graph: • Undirected graph G(V, E) • Node or Vertex (Author) • Author ID • Author Name • Link or Edge (Co-authorship) • Pair of author IDs • List of publication years, each followed by the paper title (e.g. 2004: “Introduction to …”)

  22. Setting up data • The dataset is separated into two time spans: 2000 – 2010 and 2010 – 2013 • The first is for training, the latter is for testing. • Currently, there are 134,307 researchers in the 2000 – 2013 network. • Remove authors who do not appear in the testing period, leaving 104,265 researchers
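
One way this split could look in code, under stated assumptions: the slide lists 2010 in both spans, so the sketch arbitrarily assigns 2010 to the testing period, and the paper records and helper names are hypothetical.

```python
# Hypothetical records: (year, title, author IDs).
papers = [
    (2003, "P1", ["a", "b"]),
    (2008, "P2", ["b", "c"]),
    (2009, "P3", ["d", "e"]),
    (2011, "P4", ["a", "c"]),
    (2012, "P5", ["b", "c", "f"]),
]

# Assumption: the boundary year 2010 goes to the testing period.
TRAIN_YEARS = range(2000, 2010)   # 2000..2009
TEST_YEARS = range(2010, 2014)    # 2010..2013

def split_by_period(papers, train_years, test_years):
    train = [p for p in papers if p[0] in train_years]
    test = [p for p in papers if p[0] in test_years]
    return train, test

def authors_of(papers):
    return {a for _, _, authors in papers for a in authors}

def restrict_to(papers, keep):
    """Drop authors outside `keep`; drop papers left with fewer than two authors."""
    result = []
    for year, title, authors in papers:
        kept = [a for a in authors if a in keep]
        if len(kept) >= 2:
            result.append((year, title, kept))
    return result

train, test = split_by_period(papers, TRAIN_YEARS, TEST_YEARS)
active = authors_of(test)            # authors who appear in the testing period
train = restrict_to(train, active)   # remove everyone else from training
print(sorted(active), [title for _, title, _ in train])
```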

  23. Setting up data • Choose a subset from 104,265 researchers • Experiment on 937 researchers

  24. Baseline Features • Extract features from the network structure: • Local similarity • Common Neighbor • Adamic/Adar • Preferential Attachment • Jaccard’s coefficient • Global similarity • Shortest Path • PageRank

  25. Baseline Features • Feature for co-authorship network • Keyword matching (Cohen, S., & Ebel, L., 2013): a suggested metric that measures textual relevancy using a TF-IDF based function.
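
The exact function from Cohen & Ebel is not reproduced in the slides. As a generic stand-in, the sketch below scores two authors by the cosine similarity of TF-IDF vectors built from their paper titles. The author documents and the specific TF-IDF weighting are assumptions for illustration only.

```python
import math
from collections import Counter

# Hypothetical: each author is represented by the words of their paper titles.
author_docs = {
    "X": "link prediction in social networks".split(),
    "Y": "supervised link prediction for graphs".split(),
    "Z": "protein folding simulation".split(),
}

def tfidf_vectors(docs):
    """Map each author to {term: tf * idf}, with idf = log(N / document frequency)."""
    n = len(docs)
    df = Counter(term for words in docs.values() for term in set(words))
    vectors = {}
    for name, words in docs.items():
        tf = Counter(words)
        vectors[name] = {t: (c / len(words)) * math.log(n / df[t])
                         for t, c in tf.items()}
    return vectors

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

vecs = tfidf_vectors(author_docs)
print(cosine(vecs["X"], vecs["Y"]), cosine(vecs["X"], vecs["Z"]))
```

Authors X and Y share the terms "link" and "prediction", so they score above zero, while X and Z share nothing.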

  26. Proposed Features • Productivity of the authors: observe the “history” of an author • For example, at a particular node A: • i=0: T0 = 2000, n=3, m=1 • i=1: T1 = 2004, n=4, m=2 • i=2: T2 = 2005, n=6, m=2 • i=3: T3 = 2006, n=7, m=3 • n: no. of shared papers • m: no. of collaborators

  27. Proposed Features • Productivity of the authors: observe the “history” of an author • The “productivity” of node A: • α: a constant that assigns the weight of each time period

  28. Training set • Set up training data • With n nodes, there are n(n-1)/2 possible links. • Among those, separate two kinds of link: • Positive links: links that appear in the training years. • Negative links: the remaining links, which do not exist in the training years. Note: avoid biased training by balancing the number of instances with true and false labels. • Classify all the non-existent links • Compare with the testing data
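
The labeling and balancing steps can be sketched as follows. Random undersampling of the negative links is one common way to balance the classes and is an assumption here; the slides do not say which balancing method was used. Node names and edges are hypothetical.

```python
import random
from itertools import combinations

def build_training_pairs(nodes, train_edges, seed=0):
    """Label every node pair and balance positives against sampled negatives.

    `train_edges` is a set of (a, b) tuples with a < b.
    """
    all_pairs = set(combinations(sorted(nodes), 2))   # n(n-1)/2 candidate links
    positives = sorted(all_pairs & train_edges)        # links seen in training years
    negatives = sorted(all_pairs - train_edges)        # pairs with no link
    rng = random.Random(seed)
    # Undersample negatives so both labels have the same number of instances.
    sampled = rng.sample(negatives, min(len(positives), len(negatives)))
    return [(p, 1) for p in positives] + [(p, 0) for p in sampled]

nodes = ["a", "b", "c", "d", "e"]
train_edges = {("a", "b"), ("b", "c"), ("c", "d")}
training = build_training_pairs(nodes, train_edges)
print(training)
```

With 5 nodes there are 10 candidate pairs, 3 of them positive, so 3 of the 7 negatives are sampled.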

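  29. Experimental Results • New links to predict: 57 links • Measurement of performance • Precision: correctly predicted links / all predicted links • Recall: correctly predicted links / all new links • Harmonic mean: 2 · Precision · Recall / (Precision + Recall)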
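
A short sketch of these three measures using their standard definitions, with predicted and actual links held as sets of pairs. The example pairs are hypothetical and unrelated to the 57 links in the experiment.

```python
def evaluate(predicted, actual):
    """Return (precision, recall, f1) for sets of predicted and actual new links."""
    tp = len(predicted & actual)   # correctly predicted links
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)   # harmonic mean
    return precision, recall, f1

predicted = {("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")}
actual = {("a", "b"), ("c", "d"), ("d", "e")}
precision, recall, f1 = evaluate(predicted, actual)
print(precision, recall, f1)
```

Here 2 of the 4 predictions are correct and 2 of the 3 actual links are found, giving precision 1/2, recall 2/3, and a harmonic mean of 4/7.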

  30. Result Analysis • Possible reasons • Features • Small set of data – sampling problem • Instances of the negative links used for training

  31. Research Plan • Use a weighted graph with parameters: • No. of papers • No. of neighbors • No. of citations • Focus on features that specifically target the co-authorship network: • Citations • Institutions • Enlarge the experimental dataset • Thank you

  32. References • Adamic, L. A., & Adar, E. (2003). Friends and neighbors on the web. Social networks, 25(3), 211-230. • Al Hasan, M., Chaoji, V., Salem, S., & Zaki, M. (2006). Link prediction using supervised learning. In SDM’06: Workshop on Link Analysis, Counter-terrorism and Security. • Liben‐Nowell, D., & Kleinberg, J. (2007). The link‐prediction problem for social networks. Journal of the American society for information science and technology, 58(7), 1019-1031. • Pavlov, M., & Ichise, R. (2007). Finding Experts by Link Prediction in Co-authorship Networks. FEWS, 290, 42-55. • Murata, T., & Moriyasu, S. (2008). Link prediction based on structural properties of online social networks. New Generation Computing, 26(3), 245-257. • Guimerà, R., & Sales-Pardo, M. (2009). Missing and spurious interactions and the reconstruction of complex networks. Proceedings of the National Academy of Sciences, 106(52), 22073-22078. • Bliss, C. A., Frank, M. R., Danforth, C. M., & Dodds, P. S. (2013). An Evolutionary Algorithm Approach to Link Prediction in Dynamic Social Networks. arXiv preprint arXiv:1304.6257. • Cohen, S., & Ebel, L. (2013). Recommending collaborators using keywords. In Proceedings of the 22nd international conference on World Wide Web companion 959-962.

  33. The number of links per year in the training set is greater than in the testing set: • In the testing period, only “new” collaborations are considered. • Any collaboration between researchers who already have a link is disregarded.
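
In set terms, this rule amounts to subtracting the training links from the testing links. A minimal sketch with hypothetical pairs:

```python
def normalize(pairs):
    """Store each undirected pair in sorted order so (a, b) equals (b, a)."""
    return {tuple(sorted(p)) for p in pairs}

train_edges = normalize([("a", "b"), ("b", "c")])
test_edges = normalize([("b", "a"), ("c", "d"), ("a", "c")])

# Only collaborations that did not exist in the training period count as new.
to_predict = test_edges - train_edges
print(sorted(to_predict))
```

The repeated collaboration between a and b is dropped, which is why the testing period contributes fewer links per year.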

  34. Results with different classifiers

  35. Proposed Feature • The reasons for proposing this feature: • Keep track of each researcher’s tendency • Give a “bonus” to researchers who tend to collaborate with “new” colleagues rather than “old” ones • Also give a high score to prolific researchers (based on the number of published papers)

  36. Stochastic Block Model • Guimerà, R., & Sales-Pardo, M., 2009

  37. Stochastic Block Model • (Figure: example network with nodes 1 to 7, X, and Y) • The reliability of an individual link is:
