Usage-based models of science: applications to community mapping and scholarly assessment

Johan Bollen
Digital Library Research & Prototyping Team, Los Alamos National Laboratory - Research Library
jbollen@lanl.gov
Acknowledgements: Herbert Van de Sompel (LANL), Marko A. Rodriguez (LANL), Ryan Chute (LANL), Lyudmila L. Balakireva (LANL), Aric Hagberg (LANL), Luis Bettencourt (LANL)
Research supported by the Andrew W. Mellon Foundation.

Scholarly evaluation
• Qualitative/subjective:
  • Peer review
  • Tenure committees
  • Networks
• Quantitative:
  • Citation counts
  • Many proposals, little clarity

Impact evaluation from citation data
[Figure: Impact Factor diagram; labels: Journal x, All (2003), 2001, 2002, 2003; scale from low influence to high influence]
Citation data:
• Gold standard of scholarly evaluation
• Citation = scholarly influence
• Extracted from published materials
• Main bibliometric data source for scholarly evaluation
The IF is part of the Journal Citation Reports (JCR); JCR = citation graph.
2005 journal citation network:
• Approximately 8,560 journals
• 6,370,234 weighted citation edges
Impact Factor = mean 2-year citation rate
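The “mean 2-year citation rate” can be written out as the standard Impact Factor definition. The symbols $C_j$ and $N_j$ are notation introduced here for illustration and do not appear in the slides: $C_j(y;\,t)$ is the number of citations received in year $y$ by items that journal $j$ published in year $t$, and $N_j(t)$ is the number of citable items journal $j$ published in year $t$.

```latex
\mathrm{IF}_j(y) \;=\; \frac{C_j(y;\,y-1) + C_j(y;\,y-2)}{N_j(y-1) + N_j(y-2)}
```

For the figure's example, $y = 2003$: citations made in 2003 to what Journal x published in 2001 and 2002, divided by the number of articles Journal x published in those two years.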

Citation graph: other metrics?
Possibility to calculate other metrics of impact. How about PageRank?
• IF = normalized in-degree in citation graph
  • Popularity
  • Favors review journals
• Are all citations created equal?
  • Transfer citer influence to citee
  • Normalize edge weight
  • Modulate transfer for citer influence
  • Indicator of node prestige
Pinski, G., & Narin, F. (1976). Citation influence for journal aggregates of scientific publications: theory, with application to the literature of physics. Information Processing and Management, 12(5), 297-312.
Chen, P., Xie, H., Maslov, S., & Redner, S. (2007). Finding scientific gems with Google. Journal of Informetrics, 1(1), arxiv.org/abs/physics/0604130.

Popularity vs. prestige
rho = 0.61
Outliers reveal differences in aspects of “status”:
• IF ~ general popularity
• PR ~ prestige, influence
Johan Bollen, Marko A. Rodriguez, and Herbert Van de Sompel. Journal status. Scientometrics, 69(3), December 2006 (DOI: 10.1007/s11192-006-0176-z)
Philip Ball. Prestige is factored into journal ratings. Nature 439, 770-771, February 2006 (doi:10.1038/439770a)
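A correlation such as rho = 0.61 between two journal rankings can be computed along the following lines. This is a minimal sketch that assumes rho denotes Spearman's rank correlation (the slide does not say so explicitly); the function names and the toy scores are hypothetical and are not MESUR data.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy scores for five hypothetical journals (illustrative values only).
impact_factor = [1.0, 2.0, 3.0, 4.0, 5.0]
pagerank_score = [0.02, 0.01, 0.04, 0.03, 0.05]
rho = spearman_rho(impact_factor, pagerank_score)
```

Journals far from the diagonal of the two rank vectors are the outliers the slide refers to: popular but not prestigious, or the reverse.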

Domain specific

A few issues…
On the basis of the citation graph, many alternatives:
• Normalized citation statistics
• Social network indicators (centrality)
• Semantic web hybrids
• Which one to choose?
Semantics:
• Validity: does the indicator express what it is intended to express?
• Reliability: sensitivity to changes in network structure?
Another question: need they be based on citation data?
• Other means of expressing “influence”
• Readership, services requested, etc.
• Usage!

Usage data
Citation data pertain to 4 levels in the scholarly communication process:
• Community: authors of journal articles.
• Artifacts: journal articles.
• Data: citation data (+1 year publication delay).
• Metrics: mean citation rate reigns supreme.
• Scale: expensive to extract.
However, for usage data:
• Community: all users, including most authors.
• Artifacts: all that is accessible.
• Data: recorded upon publication.
• Metrics: a range of web and Web 2.0 inspired metrics, e.g. clickstream and data mining.
• Scale: automatically recorded at point of service.
Hence, various initiatives focused on usage data: COUNTER, IRS, SUSHI, CiteBase. But where are the metrics?

Challenges to usage-based metrics
Usage data is here:
• Routinely recorded by library, publisher and aggregator services
• Large-scale and longitudinal
• Highly detailed:
  • Requester
  • Referent
  • Sessions
  • Service type
The development of usage-based metrics has lagged. Here’s why:
• Multiple communities
• Multiple collections (artifacts)
• Data: usage data limited to particular sub-communities and collections of artifacts.
• Metrics: various metrics studied. Different results because of sample, collection or metric definition? Aspects of scholarly status?

Our experience: divergence and convergence
Convergence!
• Is this guaranteed?
• To what? A common baseline?
• What we do know:
  • Institutional perspective can be contrasted to baseline.
  • As aggregation increases in size, so does value.
  • Cross-validation is key.

MESUR (pronounced “measure”): Metrics from Scholarly Usage of Resources
• Andrew W. Mellon Foundation funded study of usage-based metrics (2006-2008)
• Executed at the Digital Library Research and Prototyping team, Los Alamos National Laboratory Research Library
Objectives:
1. Create a model of the scholarly communication process.
2. Create a large-scale reference data set (semantic network) that relates all relevant bibliographic, citation and usage data according to (1).
3. Characterize the reference data set.
4. Survey usage-based metrics on the basis of the reference data set.

The MESUR project
• Johan Bollen (LANL): Principal investigator
• Herbert Van de Sompel (LANL): Architectural consultant
• Marko Rodriguez (LANL): PhD student (Computer Science, UCSC)
• Ryan Chute (LANL): Software development and database management
• Lyudmila Balakireva (LANL): Database management and HCI
• Aric Hagberg (LANL): Mathematical and statistical consultant
• Luis Bettencourt (LANL): Mathematical and statistical consultant
“The Andrew W. Mellon Foundation has awarded a grant to Los Alamos National Laboratory (LANL) in support of a two-year project that will investigate metrics derived from the network-based usage of scholarly information. The Digital Library Research & Prototyping Team of the LANL Research Library will carry out the project. The project's major objective is enriching the toolkit used for the assessment of the impact of scholarly communication items, and hence of scholars, with metrics that derive from usage data.”

Project data flow and work plan. [Figure: data flow diagram with stages numbered 1 to 4]

Project timeline. We are here!

Presentation structure: an update on the MESUR project
• Usage data characterization
• Analysis
  • Usage graphs
  • Metrics analysis
• Results
• Discussion


A tapestry of usage data providers
Each represents a different, and possibly overlapping, sample of the scholarly community.
Institutions:
• Institutional communities
• Many collections
Aggregators:
• Many communities
• Many collections
Publishers:
• Many communities
• Publisher collection
Main players:
• Individual institutions
  • Link resolver data
  • EZproxy
• Aggregators
  • Ad hoc formats
  • COUNTER reports
• Publishers
  • Ad hoc
  • COUNTER reports

Negotiation results
• Data: > 1B usage events and 1B citations
  • At this point, 247,083,481 usage events loaded
  • Another +1,000,000,000 on the way
• Documents: > 50M documents
• Journals: 326,000
  • Includes newspapers, magazines
  • Professional magazines
  • Obscure material
• Community: > 100M users and authors combined

Data acquired: timelines
Span:
• Majority: -1 years
• Some minor 2002-2003 data
Sharing models:
• Historical data for period [t-x, t]
• Periodical updates
Main issues:
• Restoration of archives
• Digital preservation issues
• All data fields intact
• Integration of various sources of usage data

Data flow
http://www.mesur.org/schemas/2007-01/mesur/

An ontology of the scholarly communication process?
[Figure: an ontology layer (journal publishes document; person authors document; document cites document; person uses document) connected by isA links to an instance layer (journal1, document1 to document4, person1 to person5), with the instance relations grouped into bibliographic data, citation data and usage data]

Modeling the scholarly communication process: the MESUR ontology
Previous efforts: the ScholOnto project, the ABC ontology, Virginia Tech (Goncalves, 2002), Web Scholars, etc.
Requirements:
- Combined representation of usage data with bibliographic and citation data
- Fine granularity
- Pragmatism in modeling
Basic concepts:
- OWL, RDF/XML representation
- Three basic notions: Documents, Agents and Contexts
- Context: n-ary relationship between documents and agents. Subclassed into Events and States to express an action, e.g. “Uses”, vs. a continuous state, e.g. “hasImpact”

Modeling the scholarly communication process: the MESUR ontology
http://www.mesur.org/schemas/2007-01/mesur/
Examples:
• An author (Agent) publishes (Context:Event) an article (Document)
• A user (Agent) uses (Context:Event) a journal (Document)
Based on the OntologyX5 framework developed by Rights.com
Rodriguez, Bollen & Van de Sompel. A Practical Ontology for the Large-Scale Modeling of Scholarly Artifacts and their Usage. JCDL07
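The three notions (Documents, Agents, Contexts) and the Event/State split can be mirrored in a few Python classes to make the two examples concrete. This is only an illustrative analogue, not the OWL schema published at the URL above; every class field, the date and the impact value below are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Agent:
    name: str


@dataclass(frozen=True)
class Document:
    identifier: str


@dataclass(frozen=True)
class Context:
    """An n-ary relationship between agents and documents."""
    agents: tuple
    documents: tuple


@dataclass(frozen=True)
class Event(Context):
    """A context that expresses an action, e.g. 'publishes' or 'uses'."""
    action: str = ""
    timestamp: str = ""  # hypothetical field; ISO date string in this sketch


@dataclass(frozen=True)
class State(Context):
    """A context that expresses a continuous state, e.g. 'hasImpact'."""
    relation: str = ""
    value: float = 0.0  # hypothetical field


author = Agent("author1")
user = Agent("user1")
article = Document("article1")
journal = Document("journal1")

# "An author (Agent) publishes (Context:Event) an article (Document)"
publishes = Event(agents=(author,), documents=(article,),
                  action="publishes", timestamp="2007-01-15")
# "A user (Agent) uses (Context:Event) a journal (Document)"
uses = Event(agents=(user,), documents=(journal,),
             action="uses", timestamp="2007-01-16")
# A state attached to a document; the value is made up for illustration.
impact = State(agents=(), documents=(journal,), relation="hasImpact", value=1.5)
```

Keeping the relationship itself as an object (rather than a bare edge) is what allows a date, a service type or a metric value to be attached to it.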

MESUR’s usage data representation framework
Assumptions:
• Sessions identify sequences of events (same user - same document)
• Documents are tied to aggregate request objects
• Request objects consist of a series of service requests and the date and time at which each request took place
Implications:
• Sequence is preserved
• Most usage data and statistics can be reconstructed from the framework
• Lends itself to XML and RDF formats
• Permits request type filtering
Examples:
• COUNTER stats = aggregate request counts for each document (journal) grouped by date/time (month)
• Usage graph: overlay sessions with same document pairs
Status:
• Out of 13 MESUR providers so far, only 3 natively follow this model.
• The usage data of another 8 contain the necessary information for conversion.
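The claim that COUNTER-style statistics can be reconstructed from session-level data can be illustrated with a small aggregation. This is a sketch under assumptions: the tuple layout, the service type names and the sample events are hypothetical and do not reflect any provider's actual log format.

```python
from collections import Counter
from datetime import datetime

# Hypothetical usage events: (session_id, journal, service_type, timestamp).
events = [
    ("s1", "journal1", "fulltext", datetime(2006, 1, 3, 9, 15)),
    ("s1", "journal2", "abstract", datetime(2006, 1, 3, 9, 17)),
    ("s2", "journal1", "fulltext", datetime(2006, 1, 20, 14, 2)),
    ("s3", "journal1", "fulltext", datetime(2006, 2, 1, 8, 40)),
]


def monthly_counts(events, service_types=None):
    """Aggregate request counts per (journal, 'YYYY-MM').

    If service_types is given, only requests of those types are counted
    (the request type filtering mentioned on the slide).
    """
    counts = Counter()
    for _session, journal, service, when in events:
        if service_types is not None and service not in service_types:
            continue
        counts[(journal, when.strftime("%Y-%m"))] += 1
    return counts


all_requests = monthly_counts(events)
fulltext_only = monthly_counts(events, service_types={"fulltext"})
```

The aggregation discards session identifiers and ordering, which is why the reverse step (from monthly counts back to sessions) is not possible.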

Implications for structural analysis of usage data
• Sequence preservation allows:
  • Reconstruction of user behavior
  • Usage graphs!
• Statistics do not allow this type of analysis BUT are useful for:
  • Validating results
  • Rankings

How to generate a usage graph
Documents are associated by co-occurrence in the same session:
• Same session, same user: common interest
• Frequency of co-occurrence in same session: degree of relationship
• Normalized: conditional probability
Usage data:
• Works for journals and articles
• Anything for which usage was recorded
Options:
• Strict pair-wise sequence?
• All within session?
• Take “distance” into account?
Note: not something we invented. Association rule learning in data mining. Beer and diapers!
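The construction can be sketched as follows, covering two of the options listed (strict consecutive pairs versus all pairs within a session). This is an illustration under assumptions, not the MESUR pipeline: edges are treated as directed, self-pairs are skipped, normalization divides by the total outgoing weight of the first document, and the session data are hypothetical.

```python
from collections import defaultdict


def build_usage_graph(sessions, consecutive_only=True):
    """Count document pairs that co-occur within the same session.

    sessions: list of document-identifier lists, each in request order.
    Returns {(doc_a, doc_b): raw co-occurrence frequency}.
    """
    weights = defaultdict(int)
    for docs in sessions:
        if consecutive_only:
            pairs = zip(docs, docs[1:])
        else:
            pairs = ((a, b) for i, a in enumerate(docs) for b in docs[i + 1:])
        for a, b in pairs:
            if a != b:  # ignore repeated requests for the same document
                weights[(a, b)] += 1
    return dict(weights)


def normalize(weights):
    """Turn raw frequencies into estimates of P(doc_b follows | doc_a)."""
    out_totals = defaultdict(int)
    for (a, _b), w in weights.items():
        out_totals[a] += w
    return {(a, b): w / out_totals[a] for (a, b), w in weights.items()}


# Hypothetical sessions over three journals.
sessions = [["j1", "j2", "j3"], ["j1", "j2"], ["j1", "j3"]]
consecutive = build_usage_graph(sessions)
all_pairs = build_usage_graph(sessions, consecutive_only=False)
probabilities = normalize(consecutive)
```

Taking “distance” into account, the third option on the slide, would amount to down-weighting pairs that are further apart in the session instead of counting each pair as 1.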

Usage graphs
[Figure: usage graph edge between journal1 and journal2]
MESUR graph created:
• 200M usage events
• Usage restricted to 2006
• Journals restricted to the 7,600 journals in the 2004 JCR
• Pair-wise sequences
• Within session, only consecutive pairs
• Raw frequency weights
Network analysis now on-going:
• Network properties
• Clustering

Lay of the land: flow of information.

Metric types
Note:
• Metrics can be calculated both on citation and usage data
• Structural metrics require graphs:
  • Citation graph, e.g. 2004 JCR
  • Usage graph, e.g. created by MESUR

Frequentist metrics
• Raw counts:
  • Count number of citations to document or journal
  • Count number of times document or journal was accessed
• Normalized:
  • Journal Impact Factor: number of citations to journal, divided by number of articles published in journal
  • Usage Impact Factor: number of requests for journal or article, divided by number of articles published in journal
Johan Bollen. Usage Impact Factor: the effects of sample characteristics on usage-based impact metrics. Journal of the American Society for Information Science and Technology, 59(1), 2008

Structural metrics calculated from usage graph
Classes of metrics:
• Degree: in-degree, out-degree
• Shortest path: closeness, betweenness, Newman
• Random walk: PageRank, eigenvector
• Distribution: in-degree entropy, out-degree entropy, bucket entropy
Each can be defined to take weights into account, e.g. by means of a weighted shortest path definition.
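Two of the simpler classes, degree and distribution, can be sketched on a weighted edge dictionary. The entropy definition used here (Shannon entropy, base 2, of a node's incoming edge-weight distribution) is an assumption about what “in-degree entropy” means; the slides do not give the formula. Function names and the toy graph are hypothetical.

```python
import math


def in_degree(edges, node, weighted=False):
    """Number of incoming edges, or their summed weight if weighted=True."""
    incoming = [w for (_src, dst), w in edges.items() if dst == node]
    return sum(incoming) if weighted else len(incoming)


def in_degree_entropy(edges, node):
    """Shannon entropy (bits) of the node's incoming edge-weight distribution.

    High values mean incoming weight is spread evenly over many sources;
    a node fed by a single source scores 0.
    """
    incoming = [w for (_src, dst), w in edges.items() if dst == node and w > 0]
    total = sum(incoming)
    if total == 0:
        return 0.0
    return -sum((w / total) * math.log2(w / total) for w in incoming)


# Toy weighted graph: {(source, target): weight}.
edges = {("a", "c"): 1, ("b", "c"): 1, ("a", "b"): 3}
```

Swapping the roles of source and target in both functions gives the out-degree and out-degree entropy variants.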

Social network metrics: different aspects of impact I
• Degree metrics: degree centrality, in-degree/IF
• Shortest path metrics: closeness centrality, betweenness centrality

Social network metrics: different aspects of impact II
Random walk metrics, e.g. PageRank
• Basic idea:
  • Random walkers follow edges
  • + Probability of random teleportation
  • Visitation numbers converge ~ PageRank
  • “Stationary probability distribution”
[Figure: PageRank illustration, from wikipedia.org]
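The random-walk idea can be sketched as a power iteration over a weighted, directed edge dictionary. This is a generic textbook formulation, not the code used for the MESUR metrics; the damping value 0.85, the uniform handling of nodes without outgoing edges, and the toy graph are assumptions of this sketch.

```python
def pagerank(edges, damping=0.85, tol=1e-12, max_iter=500):
    """Weighted PageRank by power iteration.

    edges: {(source, target): weight}. With probability `damping` a walker
    follows an outgoing edge chosen in proportion to its weight; otherwise
    it teleports to a uniformly random node. Walkers on nodes without
    outgoing edges always teleport.
    """
    nodes = sorted({n for edge in edges for n in edge})
    n = len(nodes)
    out_weight = {node: 0.0 for node in nodes}
    for (src, _dst), w in edges.items():
        if w > 0:
            out_weight[src] += w

    rank = {node: 1.0 / n for node in nodes}
    for _ in range(max_iter):
        dangling = sum(rank[node] for node in nodes if out_weight[node] == 0)
        base = (1 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for (src, dst), w in edges.items():
            if w > 0:
                new_rank[dst] += damping * rank[src] * w / out_weight[src]
        change = sum(abs(new_rank[node] - rank[node]) for node in nodes)
        rank = new_rank
        if change < tol:
            break
    return rank


# Toy graph: a and b point at each other, c points at a.
toy_edges = {("a", "b"): 1, ("b", "a"): 1, ("c", "a"): 1}
toy_rank = pagerank(toy_edges)
```

The returned values sum to 1 and approximate the stationary visitation probabilities of the walk. In the toy graph, c receives no incoming edges and keeps only the teleportation share.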

List of metrics
Citation-based metrics: JCR 2004
CITE-BE, CITE-ID, CITE-IE, CITE-IF, CITE-OD, CITE-OE, CITE-PG, CITE-UBW, CITE-UBW-UN, CITE-UCL, CITE-UCL-UN, CITE-UNM, CITE-UNM-UN, CITE-UPG, CITE-UPR, CITE-WBW, CITE-WBW-UN, CITE-WCL, CITE-WCL-UN, CITE-WID, CITE-WNM, CITE-WNM-UN, CITE-WOD, CITE-WPR
Usage-based metrics: MESUR 2006
USES-BE, USES-ID, USES-IE, USES-OD, USES-OE, USES-PG, USES-UBW, USES-UBW-UN, USES-UCL, USES-UCL-UN, USES-UNM, USES-UNM-UN, USES-UPG, USES-UPR, USES-WBW, USES-WBW-UN, USES-WCL, USES-WCL-UN, USES-WID, USES-WNM, USES-WNM-UN, USES-WOD, USES-WPR
Set of metrics calculated on MESUR data set
Usage graph creation: Wenzhong Zhao
Metrics: Marko Rodriguez and Aric Hagberg

Overlaps and discrepancies
Rankings and correlation structure will reveal components of the notion of “scholarly impact” across citation and usage data.

Citation rankings
2004 Impact Factor:
rank  value     journal
1     49.794    CANCER
2     47.400    ANNU REV IMMUNOL
3     44.016    NEW ENGL J MED
4     33.456    ANNU REV BIOCHEM
5     31.694    NAT REV CANCER
Citation PageRank:
rank  value     journal
1     0.0116    SCIENCE
2     0.0111    J BIOL CHEM
3     0.0108    NATURE
4     0.0101    PNAS
5     0.006     PHYS REV LETT
Betweenness:
rank  value     journal
1     0.076     PNAS
2     0.072     SCIENCE
3     0.059     NATURE
4     0.039     LECT NOTES COMPUT SC
5     0.017     LANCET
Closeness:
rank  value     journal
1     7.02e-05  PNAS
2     6.72e-05  LECT NOTES COMPUT SC
3     6.43e-05  NATURE
4     6.37e-05  SCIENCE
5     6.37e-05  J BIOL CHEM
In-degree:
rank  value     journal
1     3448      SCIENCE
2     3182      NATURE
3     2913      PNAS
4     2190      LANCET
5     2160      NEW ENGL J MED
In-degree entropy:
rank  value     journal
1     9.849     LANCET
2     9.748     SCIENCE
3     9.701     NEW ENGL J MED
4     9.611     NATURE
5     9.526     JAMA

Usage rankings
2004 Impact Factor:
rank  value     journal
1     49.794    CANCER
2     47.400    ANNU REV IMMUNOL
3     44.016    NEW ENGL J MED
4     33.456    ANNU REV BIOCHEM
5     31.694    NAT REV CANCER
PageRank:
rank  value     journal
1     0.0016    SCIENCE
2     0.0015    NATURE
3     0.0013    PNAS
4     0.0010    LECT NOTES COMPUT SC
5     0.0008    J BIOL CHEM
Betweenness:
rank  value     journal
1     0.035     SCIENCE
2     0.032     NATURE
3     0.020     PNAS
4     0.017     LECT NOTES COMPUT SC
5     0.006     LANCET
In-degree:
rank  value     journal
1     4195      SCIENCE
2     4019      NATURE
3     3562      PNAS
4     2438      J BIOL CHEM
5     2432      LECT NOTES COMPUT SC
Closeness:
rank  value     journal
1     0.670     SCIENCE
2     0.665     NATURE
3     0.644     PNAS
4     0.591     LECT NOTES COMPUT SC
5     0.587     BIOCHEM BIOPH RES CO
In-degree entropy:
rank  value     journal
1     9.364     MED HYPOTHESES
2     9.152     PNAS
3     9.027     LIFE SCI
4     8.939     LANCET
5     8.858     INT J BIOCHEM CELL B

Metrics relationship

Metrics relationships
• Citation and usage metrics reveal entirely different patterns
• Citation metrics split into 2 sections:
  • Degree metrics (right)
  • Shortest path and random walk (left)
• Usage metrics split into 4 clusters:
  • Degree metrics
  • PageRank and entropy
  • Closeness
  • Betweenness
• The usage pattern may be caused by:
  • Noise in usage graph
  • Higher density of usage/nodes

Hierarchical cluster analysis
[Figure: cluster analysis plot; labels: Citation PageRank, Usage degree, Usage closeness, Usage betweenness, Usage PageRank, Citation closeness, Citation degree, Citation betweenness, Impact Factor]

MESUR: an update
• Usage data:
  • Creation of the single largest reference data set of usage, citation and bibliographic data
  • +1,000,000,000 usage events loaded in next month
  • Usage data obtained from multiple publishers, aggregators and institutions
  • Infrastructure for a continued research program in this domain
  • Results will guide scholarly evaluation and may help produce standards for usage data representation
• Usage graphs:
  • An adequate data model for item-level usage data naturally leads to this
  • Reduced distortion compared to raw usage: structure counts, not raw hits
  • Several options on how to create: MESUR investigates these options
• Metrics:
  • Frequentist and structural metrics
  • Each can represent different facets of scholarly impact
  • Simple metrics can produce adequate results. Law of diminishing returns?
  • Hybrid metrics based on triple store functionality
  • Note increasing convergence of usage metrics to citation metrics as sample increases.
• Reference data set will provide years of exciting research:
  • Let me know what you think.

Some relevant publications
• Johan Bollen, Herbert Van de Sompel, and Marko A. Rodriguez. Towards usage-based impact metrics: first results from the MESUR project. In Proceedings of the Joint Conference on Digital Libraries, Pittsburgh, June 2008.
• Marko A. Rodriguez, Johan Bollen, and Herbert Van de Sompel. A Practical Ontology for the Large-Scale Modeling of Scholarly Artifacts and their Usage. In Proceedings of the Joint Conference on Digital Libraries, Vancouver, June 2007.
• Johan Bollen and Herbert Van de Sompel. Usage Impact Factor: the effects of sample characteristics on usage-based impact metrics. (cs.DL/0610154)
• Johan Bollen and Herbert Van de Sompel. An architecture for the aggregation and analysis of scholarly usage data. In Joint Conference on Digital Libraries (JCDL2006), pages 298-307, June 2006.
• Johan Bollen and Herbert Van de Sompel. Mapping the structure of science through usage. Scientometrics, 69(2), 2006.
• Johan Bollen, Marko A. Rodriguez, and Herbert Van de Sompel. Journal status. Scientometrics, 69(3), December 2006. (arxiv.org:cs.DL/0601030)
• Johan Bollen, Herbert Van de Sompel, Joan Smith, and Rick Luce. Toward alternative metrics of journal impact: a comparison of download and citation data. Information Processing and Management, 41(6):1419-1440, 2005.
