This study analyzes scholarly impact in the field of the Semantic Web through citation analysis, focusing on top journals and top researchers. The analysis draws on data from Scopus (1975-2009) and Web of Science (1960-2009).
Measuring Scholarly Impact in the field of the Semantic Web: Productivity, Top Journals, Top Researchers. Data: 44,157 papers with 651,673 citations from Scopus (1975-2009), and 22,951 papers with 571,911 citations from WOS (1960-2009)
Impact through citation • Top Journals • Top Researchers
Rising Stars • In WOS, M. A. Harris (Gene Ontology-related research), T. Harris (design and implementation of programming languages) and L. Ding (Swoogle, a Semantic Web search engine) are ranked as the top three authors with the highest increase in citations. • In Scopus, D. Roman (Semantic Web Services), J. De Bruijn (logic programming) and L. Ding (Swoogle) are ranked as the top three with the most significant increase in number of citations. Ding, Y. (2010). Semantic Web: Who is who in the field. Journal of Information Science, 36(3): 335-356.
Section 1 Data collection
Steps • Step 1: • Data collection • Using journals • Using keywords • Example • INFORMATION RETRIEVAL, INFORMATION STORAGE and RETRIEVAL, QUERY PROCESSING, DOCUMENT RETRIEVAL, DATA RETRIEVAL, IMAGE RETRIEVAL, TEXT RETRIEVAL, CONTENT BASED RETRIEVAL, CONTENT-BASED RETRIEVAL, DATABASE QUERY, DATABASE QUERIES, QUERY LANGUAGE, QUERY LANGUAGES, and RELEVANCE FEEDBACK.
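If you prefer to assemble the keyword search programmatically, a minimal Python sketch is shown below; it simply joins the example keywords above into one Web of Science topic query (TS= is the WOS advanced-search field tag for Topic; the variable names are illustrative).

    # Build a Web of Science topic query (TS=) from the example keyword list.
    keywords = [
        "INFORMATION RETRIEVAL", "INFORMATION STORAGE and RETRIEVAL",
        "QUERY PROCESSING", "DOCUMENT RETRIEVAL", "DATA RETRIEVAL",
        "IMAGE RETRIEVAL", "TEXT RETRIEVAL", "CONTENT BASED RETRIEVAL",
        "CONTENT-BASED RETRIEVAL", "DATABASE QUERY", "DATABASE QUERIES",
        "QUERY LANGUAGE", "QUERY LANGUAGES", "RELEVANCE FEEDBACK",
    ]

    # Quote each phrase and join with OR; paste the result into the WOS advanced search box.
    query = "TS=(" + " OR ".join('"%s"' % k for k in keywords) + ")"
    print(query)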
Web of Science • Go to the IU Web of Science portal • http://libraries.iub.edu/resources/wos • For example: • Select the Core Collection • Search "Information Retrieval" as a Topic, for all years
Python • Download Python: https://www.python.org/downloads/ • In order to run Python flawlessly, you might have to change certain environment settings in Windows. • In short, the navigation path to the dialog is: • My Computer ‣ Properties ‣ Advanced ‣ Environment Variables • In this dialog, you can add or modify User and System variables. To change System variables, you need non-restricted access to your machine (i.e. Administrator rights). • User variable: C:\Program Files (x86)\Python27\Lib; • Or check it from the command line using "set" and "echo %path%"
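Once the environment variables are set, a quick sanity check from Python itself confirms which interpreter is running and what the PATH contains (a minimal sketch; the exact paths printed will differ per machine):

    # Quick sanity check: which Python is running, and is it on the PATH?
    import os
    import sys

    print(sys.executable)          # full path of the interpreter being used
    print(sys.version)             # version string, e.g. 2.7.x
    print(os.environ.get("PATH"))  # same value as "echo %path%" on Windows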
Python Script for conversion (conwos1.py)

#!/usr/bin/env python
# encoding: utf-8
"""
conwos.py
Convert Web of Science tab-delimited export files into two TSV tables:
paper.tsv (one row per paper) and reference.tsv (one row per cited reference).
"""

import sys
import os
import re

paper = 'paper.tsv'
reference = 'reference.tsv'
defsource = 'source'          # default name of the folder holding the WOS .txt exports


def main():
    global defsource
    source = raw_input('What is the name of the source folder?\n')
    if len(source) < 1:
        source = defsource
    files = os.listdir(source)
    fpaper = open(paper, 'w')
    fref = open(reference, 'w')
    uid = 0
    for name in files:
        if name[-3:] != "txt":                  # only process the WOS .txt exports
            continue
        fil = open('%s\\%s' % (source, name))
        print '%s is processing...' % name
        first = True
        for line in fil:
            line = line[:-1]                    # drop the trailing newline
            if first == True:                   # skip the header row of each file
                first = False
            else:
                uid += 1
                record = str(uid) + "\t"
                refs = ""
                elements = line.split('\t')
                for i in range(len(elements)):
                    element = elements[i]
                    if i == 1:                  # author field: keep up to five authors
                        authors = element.split('; ')
                        for j in range(5):
                            if j < len(authors):
                                record += authors[j] + "\t"
                            else:
                                record += "\t"
                    elif i == 29:               # cited-references field
                        refs = element
                        refz = getRefs(refs)
                        for ref in refz:
                            fref.write(str(uid) + "\t" + ref + "\n")
                        continue                # references go to reference.tsv only
                    record += element + "\t"
                fpaper.write(record[:-1] + "\n")
        fil.close()
    fpaper.close()
    fref.close()


def getRefs(refs):
    """Parse a semicolon-separated cited-reference string into
    author / year / source / volume / page fields."""
    refz = []
    reflist = refs.split('; ')
    for ref in reflist:
        record = ""
        segs = ref.split(", ")
        author = ""
        ind = -1
        if len(segs) == 0:
            continue
        for seg in segs:
            ind += 1
            if isYear(seg):                     # everything before the year is the author
                record += author[:-2] + "\t" + seg + "\t"
                break
            else:
                author += seg + ", "
        ind += 1
        if ind < len(segs):                     # source title (unless it is a volume or page)
            if not isVol(segs[ind]) and not isPage(segs[ind]):
                record += segs[ind] + "\t"
                ind += 1
            else:
                record += "\t"
        else:
            record += "\t"
        if ind < len(segs):                     # volume, e.g. V36
            if isVol(segs[ind]):
                record += segs[ind][1:] + "\t"
                ind += 1
            else:
                record += "\t"
        else:
            record += "\t"
        if ind < len(segs):                     # starting page, e.g. P335
            if isPage(segs[ind]):
                record += segs[ind][1:] + "\t"
                ind += 1
            else:
                record += "\t"
        else:
            record += "\t"
        if record[0] != "\t":                   # keep only references with a parsed author
            refz.append(record[:-1])
    return refz


def isYear(episode):
    pattern = '^\d{4}$'
    regx = re.compile(pattern)
    match = regx.search(episode)
    if match != None:
        return True


def isVol(episode):
    pattern = '^V\d+$'
    regx = re.compile(pattern)
    match = regx.search(episode)
    if match != None:
        return True


def isPage(episode):
    pattern = '^P\d+$'
    regx = re.compile(pattern)
    match = regx.search(episode)
    if match != None:
        return True


if __name__ == '__main__':
    main()
Convert output to database • Using the Python script conwos1.py • Output: paper.tsv, reference.tsv
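Before loading the files into Access, it can help to spot-check them; the sketch below simply counts rows and prints the first record of each file (an optional check, not part of conwos1.py):

    # Spot-check the converted TSV files before importing them into Access.
    import csv

    for filename in ("paper.tsv", "reference.tsv"):
        with open(filename) as handle:
            rows = list(csv.reader(handle, delimiter="\t"))
        print(filename)
        print("  rows: %d" % len(rows))
        if rows:
            print("  first record: %s" % rows[0][:6])  # show the first few fields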
Convert output to database • Paper.tsv
Convert output to database • Reference.tsv
Load them to Access • Import the TSV files via External Data in Access
Access Tables • Paper table
Access Tables • Citation table
Section 2 Productivity & impact
Productivity • Top Authors • Find duplicate records (Query template)
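If you want to sanity-check the Access result outside the database, the same count can be approximated directly on paper.tsv; a minimal sketch follows (the position of the five author columns is an assumption based on the conwos1.py output and may need adjusting):

    # Count papers per author from paper.tsv to approximate the "top authors" query.
    import csv
    from collections import Counter

    AUTHOR_COLUMNS = slice(2, 7)   # adjust to wherever the five author columns sit in your paper.tsv

    counts = Counter()
    with open("paper.tsv") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            for author in row[AUTHOR_COLUMNS]:
                if author.strip():
                    counts[author.strip()] += 1

    for author, n in counts.most_common(20):   # 20 most productive authors
        print("%s\t%d" % (author, n))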
Productivity • Top Journals • Find duplicate records (Query template)
Productivity • Top Organizations • Find duplicate records (Query template)
Impact • Highly cited authors • Find duplicate records (Query template)
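A citation-side counterpart can be sketched on reference.tsv, counting how often each cited first author appears (this assumes the reference.tsv layout written by conwos1.py, with the citing-paper id in column 1 and the cited author in column 2; it does not reproduce any author-name disambiguation you may do in Access):

    # Count citations per cited author from reference.tsv.
    import csv
    from collections import Counter

    cited = Counter()
    with open("reference.tsv") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) > 1 and row[1].strip():
                cited[row[1].strip()] += 1   # column 2: cited (first) author

    for author, n in cited.most_common(20):  # 20 most highly cited authors
        print("%s\t%d" % (author, n))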
Impact • Highly cited journals • Find duplicate records (Query template)
Impact • Highly cited articles • Find duplicate records (Query template)
Other indicators • What other indicators could measure productivity and impact? • Time • Journal impact factor • Journal category • Keyword • … • Think about these in more depth: what new indicators would you propose?
Section 3 Author co-citation network
Top 100 highly cited authors • First select the set of authors for whom you want to build the matrix • Select the top 100 highly cited authors
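As an illustration of what the co-citation matrix contains, the sketch below pairs up authors that are cited together by the same citing paper, restricted to the top 100 cited authors (again assuming the reference.tsv layout from conwos1.py, citing-paper id in column 1 and cited author in column 2; the same matrix can equally be produced with Access queries):

    # Build an author co-citation matrix for the top 100 cited authors.
    import csv
    from collections import Counter, defaultdict
    from itertools import combinations

    # 1. Collect the set of authors cited by each paper, and total citation counts.
    cited_by_paper = defaultdict(set)
    citation_counts = Counter()
    with open("reference.tsv") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) > 1 and row[1].strip():
                paper_id, author = row[0], row[1].strip()
                cited_by_paper[paper_id].add(author)
                citation_counts[author] += 1

    # 2. Keep the 100 most highly cited authors.
    top = [a for a, _ in citation_counts.most_common(100)]
    index = {a: i for i, a in enumerate(top)}

    # 3. Count co-citations: two authors cited together by the same paper.
    matrix = [[0] * len(top) for _ in top]
    for authors in cited_by_paper.values():
        present = sorted(a for a in authors if a in index)
        for a, b in combinations(present, 2):
            i, j = index[a], index[b]
            matrix[i][j] += 1
            matrix[j][i] += 1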
Section 4 Clustering
Clustering Analysis • Aim: group items so that items within a cluster are similar to one another and different from items outside the cluster. • In other words, maximize similarity within clusters and difference between clusters. • Items are called cases in SPSS. • There are no dependent variables in cluster analysis.
Clustering Analysis • The degree of similarity or dissimilarity is measured by the distance between cases • Euclidean distance measures the length of a straight line between two cases • The variables used to compute distances should be on the same measurement scale. • If they are on different measurement scales: • transform them to the same scale, • or create a distance matrix first
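For reference, the straight-line (Euclidean) distance described above can be computed directly; a minimal sketch with two made-up cases:

    # Euclidean distance between two cases (rows of equally scaled values).
    import math

    case_a = [3.0, 10.0, 2.0]
    case_b = [1.0,  6.0, 5.0]

    distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(case_a, case_b)))
    print(distance)  # length of the straight line between the two cases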
Clustering • Hierarchical clustering does not require deciding the number of clusters in advance and is well suited to a small set of cases • K-means requires the number of clusters up front and is better suited to a large set of cases
Hierarchical Clustering: Data • Data. • The variables can be quantitative, binary, or count data. • Scaling of variables is an important issue: differences in scaling may affect your cluster solution(s). • If your variables have large differences in scaling (for example, one variable is measured in dollars and the other is measured in years), you should consider standardizing them (this can be done automatically by the Hierarchical Cluster Analysis procedure).
Hierarchical Clustering: Data • Case Order • Cluster solution may depend on the order of cases in the file. • You may want to obtain several different solutions with cases sorted in different random orders to verify the stability of a given solution.
Hierarchical Clustering: Data • Assumptions. • The distance or similarity measures used should be appropriate for the data analyzed. • Also, you should include all relevant variables in your analysis. • Omission of influential variables can result in a misleading solution. Because hierarchical cluster analysis is an exploratory method, results should be treated as tentative until they are confirmed with an independent sample.
Hierarchical Clustering: Method • Nearest neighbor or single linkage • The dissimilarity between cluster A and B is represented by the minimum of all possible distances between cases in A and B • Furthest neighbor or complete linkage • The dissimilarity between cluster A and B is represented by the maximum of all possible distances between cases in A and B • Between-groups linkage or average linkage • The dissimilarity between cluster A and B is represented by the average of all possible distances between cases in A and B • Within-groups linkage • The dissimilarity between cluster A and B is represented by the average of all the possible distances between the cases within a single new cluster determined by combining cluster A and B.
Hierarchical Clustering: Method • Centroid clustering • The dissimilarity between cluster A and B is represented by the distance between the centroid for the cases in cluster A and the centroid for the cases in cluster B. • Ward's method • The dissimilarity between cluster A and B is represented by the "loss of information" from joining the two clusters, with this loss of information being measured by the increase in the error sum of squares. • Median clustering • The dissimilarity between cluster A and cluster B is represented by the distance between the SPSS-determined median for the cases in cluster A and the median for the cases in cluster B. • All three of these methods (centroid, Ward's, and median) should use squared Euclidean distance rather than Euclidean distance.
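SPSS is the tool used in this exercise, but the same linkage choices exist in other packages; the sketch below uses SciPy purely as an illustration of how single, complete, average, centroid, median and Ward linkage are selected on the same toy data. As the slide above notes, the centroid, median and Ward methods should be paired with squared Euclidean distance.

    # Hierarchical clustering with different linkage methods (SciPy illustration, not SPSS).
    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    rng = np.random.default_rng(0)
    data = rng.random((20, 5))   # 20 cases, 5 variables (toy data)

    for method in ("single", "complete", "average", "centroid", "median", "ward"):
        tree = linkage(data, method=method)                  # build the cluster hierarchy
        labels = fcluster(tree, t=3, criterion="maxclust")   # cut the tree into 3 clusters
        print("%-8s %s" % (method, labels))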
Measure for Interval • Euclidean distance. The square root of the sum of the squared differences between values for the items. This is the default for interval data. • Squared Euclidean distance. The sum of the squared differences between the values for the items. • Pearson correlation. The product-moment correlation between two vectors of values. • Cosine. The cosine of the angle between two vectors of values. • Chebychev. The maximum absolute difference between the values for the items. • Block. The sum of the absolute differences between the values of the item. Also known as Manhattan distance. • Minkowski. The pth root of the sum of the absolute differences to the pth power between the values for the items. • Customized. The rth root of the sum of the absolute differences to the pth power between the values for the items.
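For comparison, most of these interval measures map onto standard distance functions; the sketch below computes them with SciPy's pdist (note that SciPy's "correlation" and "cosine" metrics are distances, i.e. one minus the corresponding similarity, whereas SPSS treats Pearson correlation and cosine as similarity measures):

    # Interval distance measures between cases, computed with SciPy for comparison.
    import numpy as np
    from scipy.spatial.distance import pdist

    cases = np.array([[1.0, 2.0, 3.0],
                      [2.0, 4.0, 1.0],
                      [0.0, 1.0, 5.0]])

    for metric in ("euclidean", "sqeuclidean", "correlation",
                   "cosine", "chebyshev", "cityblock"):
        print("%-12s %s" % (metric, pdist(cases, metric=metric)))

    # Minkowski with p = 3: the pth root of the sum of absolute differences to the pth power.
    print("%-12s %s" % ("minkowski", pdist(cases, metric="minkowski", p=3)))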
Transform values • Z scores. Values are standardized to z scores, with a mean of 0 and a standard deviation of 1. • Range -1 to 1. Each value for the item being standardized is divided by the range of the values. • Range 0 to 1. The procedure subtracts the minimum value from each item being standardized and then divides by the range. • Maximum magnitude of 1. The procedure divides each value for the item being standardized by the maximum of the values. • Mean of 1. The procedure divides each value for the item being standardized by the mean of the values. • Standard deviation of 1. The procedure divides each value for the variable or case being standardized by the standard deviation of the values.
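These transformations are easy to reproduce outside SPSS if you want to standardize before exporting; a minimal NumPy sketch of the six options, applied column by column (the values are made up):

    # Column-wise standardization options before clustering (NumPy sketch).
    import numpy as np

    values = np.array([[10.0, 200.0],
                       [12.0, 150.0],
                       [ 9.0, 400.0]])

    z_scores   = (values - values.mean(axis=0)) / values.std(axis=0, ddof=1)  # Z scores: mean 0, sd 1
    range_m1_1 = values / np.ptp(values, axis=0)                              # Range -1 to 1: divide by the range
    range_0_1  = (values - values.min(axis=0)) / np.ptp(values, axis=0)       # Range 0 to 1
    max_mag_1  = values / np.abs(values).max(axis=0)                          # Maximum magnitude of 1
    mean_1     = values / values.mean(axis=0)                                 # Mean of 1
    sd_1       = values / values.std(axis=0, ddof=1)                          # Standard deviation of 1

    print(z_scores)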