Informetric methods seminar • Tutorial 2: Using Matlab for network construction, ranking, clustering, topic modeling, and path finding • Erjia Yan
Contents • Network construction • Ranking • Clustering • Topic modeling • Path finding
From data to networks • Bibliographical data
Web of Science format • The paper-to-paper citation network is the base • Web of Science cited-reference format: First Author, Year of Publication, Abbreviated Journal Name, Volume Number, Beginning Page Number • Example: AANESTAD M, 2011, J STRATEGIC INF SYST, V20, P161 • All of these fields are available through the “full record + cited references” download option • Some newer records also include a DOI; removing the DOI from the cited references gives a better match
Citation matching • For each citing paper, extract these fields and format them in the Web of Science cited-reference format • Citing papers and cited references now share the same format • Using these two fields, construct an internal citation network that contains only the cited references that match a citing paper in the data set
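The matching step can be sketched in Matlab as follows. This is only an illustration under stated assumptions: the records, the second author and journal, and the DOI are made up, and the variable names (paperKey, citedRef, C) are not from the slides.

```matlab
% Hypothetical citing-paper records (fields as in a Web of Science export).
author  = {'AANESTAD M', 'SMITH J'};
year    = [2011, 2013];
journal = {'J STRATEGIC INF SYST', 'J INFORMETR'};
volume  = [20, 7];
page    = [161, 45];

% Format each citing paper as a cited-reference string.
n = numel(author);
paperKey = cell(n, 1);
for i = 1:n
    paperKey{i} = sprintf('%s, %d, %s, V%d, P%d', ...
        author{i}, year(i), journal{i}, volume(i), page(i));
end

% Hypothetical citation pairs: index of the citing paper and the raw reference.
citing   = [2; 2];
citedRef = {'AANESTAD M, 2011, J STRATEGIC INF SYST, V20, P161, DOI 10.1000/example'; ...
            'DOE A, 1999, SCIENTOMETRICS, V44, P1'};

% Drop a trailing DOI so that both sides have the same format.
citedRef = regexprep(citedRef, ',\s*DOI\s.*$', '');

% Keep only references that match a paper in the data set (internal citations).
[isInternal, citedIdx] = ismember(citedRef, paperKey);

% C(i,j) = 1 means paper i cites paper j.
C = sparse(citing(isInternal), citedIdx(isInternal), 1, n, n);
```

Here only the first reference survives the match, so C has a single nonzero entry at (2,1).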
Procedures • If you can write an app for this, it would be great! • Otherwise, you can follow these steps: • Convert the data into citation pairs • Use Access to construct the network • Create a table of citing papers • Import the converted citation pairs into Access • Use a query to extract the pairs whose papers are in the table • You now have the node information and the link information • Import both into Matlab
Adjacency matrices • We now have paper-to-paper citation networks, but to construct, for instance, author-to-author citation or author co-citation networks, we need adjacency matrices • Figure: a matrix with papers as rows and authors as columns; a cell value of 1 at (i,j) indicates that paper i is written by author j
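A small sketch of how the paper-author matrix can turn a paper-level network into author-level networks. The toy data and variable names are assumptions, and the co-citation aggregation shown is one possible choice rather than the method used in the seminar.

```matlab
% C(i,j) = 1: paper i cites paper j (3 hypothetical papers).
C = sparse([2 3 3], [1 1 2], 1, 3, 3);

% A(i,j) = 1: paper i is written by author j (3 papers, 2 authors).
% Paper 1 is by author 1, paper 2 by author 2, paper 3 by both.
A = sparse([1 2 3 3], [1 2 1 2], 1, 3, 2);

% Author-to-author citation counts: entry (a,b) is the number of
% citations from papers by author a to papers by author b.
% Self-citations appear on the diagonal.
authorCitation = A' * C * A;

% Paper co-citation: entry (p,q) is the number of papers citing both p and q.
paperCocit = C' * C;
paperCocit = paperCocit - diag(diag(paperCocit));   % drop the diagonal

% Author co-citation, aggregated from paper co-citation.
authorCocit = A' * paperCocit * A;

disp(full(authorCitation));
disp(full(authorCocit));
```

In this toy case paper 3 cites one paper by each author, so the two authors are co-cited once.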
Procedures • Convert into • Add to the beginning of the file • Use Txt2Pajek on the linkage file • Import the edge section of the .net file into Matlab • Select M(1:n,n+1:m), where m is the number of columns; this selection is our author-paper adjacency matrix
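The last two steps might look like this in Matlab. The edge list is made up, and the sketch assumes that nodes 1..n are of one type and nodes n+1..m are of the other, which is what makes the M(1:n,n+1:m) block meaningful.

```matlab
% Hypothetical edge section of a .net file: one "source target" pair per row.
edges = [1 4; 2 4; 2 5; 3 5];

n = 3;                  % assumed: nodes 1..n form the first group
total = max(edges(:));  % total number of nodes in the two-mode network

% Square matrix over all nodes; M(i,j) = 1 if there is an edge from i to j.
M = sparse(edges(:,1), edges(:,2), 1, total, total);

% Select the off-diagonal block that links the two groups.
m = size(M, 2);         % m is the column size, as on the slide
A = M(1:n, n+1:m);

disp(full(A));
```

With these four edges, A is a 3-by-2 matrix with rows [1 0], [1 1], and [0 1].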
PageRank • By David Gleich of Purdue University • http://www.mathworks.com/matlabcentral/fileexchange/11613-pagerank • pagerank(M,options) • options.c: the teleportation coefficient [double | {0.85}] • options.v: the personalization vector [vector | {uniform: 1/n}]
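To show what the two options control, here is a minimal power-iteration sketch in base Matlab. It does not call or reproduce the File Exchange package; the toy network, the variable names, and the handling of papers without outgoing citations (their weight is redistributed according to v) are assumptions.

```matlab
% Toy citation network: C(i,j) = 1 means paper i cites paper j.
C = sparse([1 2 3 3 4], [2 3 1 2 3], 1, 4, 4);
n = size(C, 1);

c = 0.85;              % teleportation coefficient (same role as options.c)
v = ones(n, 1) / n;    % personalization vector (same role as options.v)

% Row-normalize C so that each paper splits its vote among its references.
outdeg = full(sum(C, 2));
dangling = (outdeg == 0);
outdeg(dangling) = 1;                    % avoid division by zero
P = spdiags(1 ./ outdeg, 0, n, n) * C;

% Power iteration until the scores stop changing.
x = v;
for iter = 1:1000
    xnew = c * (P' * x) + c * sum(x(dangling)) * v + (1 - c) * v;
    converged = norm(xnew - x, 1) < 1e-10;
    x = xnew;
    if converged
        break;
    end
end

disp(x');   % PageRank scores; they sum to 1
```

Paper 4 is never cited, so its score comes from teleportation alone, (1 - c)/n.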
Built-in functions • K-means • IDX = kmeans(X,k) • http://www.mathworks.com/help/stats/kmeans.html • Hierarchical clustering • http://www.mathworks.com/help/stats/hierarchical-clustering.html
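A short usage sketch for both built-in approaches. It assumes the Statistics and Machine Learning Toolbox is installed; the data are random points generated for illustration, and the 'average' linkage is an arbitrary choice.

```matlab
% Requires the Statistics and Machine Learning Toolbox.
rng(1);                                  % fixed seed so the run is repeatable
X = [randn(20, 2); randn(20, 2) + 10];   % two well-separated groups of points
k = 2;

% K-means: IDX(i) is the cluster assigned to row i of X.
IDX = kmeans(X, k);

% Hierarchical clustering: build the tree, then cut it into k clusters.
Z = linkage(X, 'average');
T = cluster(Z, 'maxclust', k);

% Compare the two partitions (labels may be permuted between methods).
disp([IDX T]);
```

Cluster labels are arbitrary, so the two methods can agree on the grouping while numbering the clusters differently.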
Modularity-based clustering • By MIT Strategic Engineering • http://strategic.mit.edu/downloads.php?page=matlab_networks • [modules,module_hist,Q] = newmangirvan(adj,k) • [groups_hist,Q]=newman_comm_fast(adj)
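The Q returned by these functions is a modularity score. As a rough illustration of what it measures, the sketch below computes Newman's modularity for a given partition in base Matlab. It does not call the toolbox; the toy network (two triangles joined by one edge) and the variable names are assumptions, and the network is taken to be undirected and unweighted.

```matlab
% Two triangles (1-2-3 and 4-5-6) joined by the edge 3-4.
edges = [1 2; 1 3; 2 3; 4 5; 4 6; 5 6; 3 4];
n = 6;
A = sparse(edges(:,1), edges(:,2), 1, n, n);
A = A + A';                        % symmetric adjacency matrix

groups = [1 1 1 2 2 2]';           % candidate partition: one group per triangle

k = full(sum(A, 2));               % node degrees
twoM = sum(k);                     % twice the number of edges

% Modularity matrix: observed links minus links expected by chance.
B = A - (k * k') / twoM;

% Sum B over all pairs of nodes that share a group.
same = bsxfun(@eq, groups, groups');
Q = full(sum(B(same))) / twoM;

fprintf('Q = %.4f\n', Q);          % 5/14, about 0.3571
```

A higher Q means more links fall inside groups than expected from the degrees alone.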
VOSviewer clustering • By Nees van Eck and Ludo Waltman of Leiden University • http://www.vosviewer.com/relatedsoftware/ • A variant of the modularity-based clustering technique • [X, cluster_size, V] = VOS_clustering(A, P)
Matlab Topic Modeling Toolbox • By Mark Steyvers of University of California Irvine • http://psiexp.ss.uci.edu/research/programs_data/toolbox.htm • Input: a bag-of-words representation containing the number of times each word occurs in each document
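A bag-of-words representation can be built in base Matlab along these lines. The three documents are made up, and the slide does not show the exact input variables the toolbox expects, so this only illustrates the counting step.

```matlab
% Hypothetical documents (for example, titles of papers).
docs = {'citation network analysis', ...
        'topic model of citation data', ...
        'network topic network'};

% Collect every token together with the index of its document.
words = {};
docIdx = [];
for d = 1:numel(docs)
    tokens = strsplit(lower(docs{d}), ' ');
    words = [words, tokens];                          %#ok<AGROW>
    docIdx = [docIdx, d * ones(1, numel(tokens))];    %#ok<AGROW>
end

% Map each token to an index in the sorted vocabulary.
[vocab, ~, wordIdx] = unique(words);

% counts(d,w) = number of times word w occurs in document d.
counts = sparse(docIdx(:), wordIdx(:), 1, numel(docs), numel(vocab));

disp(vocab);
disp(full(counts));
```

Word order within a document is discarded; only the counts are kept.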
Bioinformatics toolbox • http://www.mathworks.com/help/bioinfo/ref/graphshortestpath.html • [dist, path, pred]=graphshortestpath(G,S,T) • finds the shortest path from node S to node T in graph G • [dist] = graphallshortestpaths(G) • finds the shortest paths between all pairs of nodes in graph G; dist is a matrix holding the shortest-path distance for each pair of nodes
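A small usage sketch of the two functions. It assumes the Bioinformatics Toolbox is installed; the weighted toy graph and variable names are made up for illustration.

```matlab
% Requires the Bioinformatics Toolbox.
% W(i,j) > 0 is the weight of the directed edge from node i to node j.
W = sparse([1 1 2 3], [2 3 4 4], [1 4 1 1], 4, 4);

% Shortest path from node 1 to node 4.
[dist, path, pred] = graphshortestpath(W, 1, 4);
disp(dist);   % 2
disp(path);   % 1 2 4

% Shortest-path distance between every pair of nodes.
% D(i,j) is Inf when node j cannot be reached from node i.
D = graphallshortestpaths(W);
disp(D);
```

In newer Matlab releases, graph and digraph objects in base Matlab provide shortestpath and distances, which may be an alternative when the toolbox is not available.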