Attribution and Credit: beyond print & citations Johan Bollen Indiana University

Attribution and Credit: beyond print & citations Johan Bollen Indiana University School of Informatics and Computing Center for Complex Networks and System Research jbollen@indiana.edu Acknowledgements: Herbert Van de Sompel (LANL), Marko A. Rodriguez (LANL), Ryan Chute (LANL), Lyudmila L. Balakireva (LANL), Aric Hagberg (LANL), Luis Bettencourt (LANL) Research supported by the NSF and the Andrew W. Mellon Foundation.

Low influence High influence Journal x All (2003) 2001 2002 2003 Impact evaluation from citation data. • Citation data: • Golden standard of scholarly evaluation • Citation = scholarly influences. • Extracted from published materials. • Main bibliometric data source for scholarly evaluation. • IF is part of Journal Citation Reports (JCR) • JCR = citation graph • 2005 journal citation network • +- 8,560 journals • 6,370,234 weighted citation edges • Impact Factor = mean 2 year citation rate

Clearly something’s wrong Digital era: digital scholars Technology has changed Scholarly communication has changed Habits and minds have changed Scholars have changed But scholarly review and assessment? Still stuck in the paper era: Peer review, print, citations, journals… Party like it’s 1899!

The scientific process: the importance of early indicators (Egghe &Rousseau, 2000; Wouters, 1997) (Brody, Harnad, & Carr 2006), • Usage data • Scale, cf. Elsevier downloads (+1B) vs. Wos citations (650M) • Immediate, early stages • Variety of resources and actors • Citation: final products • Publication delays • Focus on publications • Focus on authors

* Recorded for all digital scholarly content, i.e. papers, journals, preprints, blog postings, data, chemical structures, software, … • Not just for ~ 10,000 journals • Not only for peer-reviewed, scholarly articles • Fine-grained: • Information on types of user behavior, sequences, timing, context, clickstreams, … • All users of scholarly information, not only of scholarly authors • Students • Practitioners • Scholars in domains with different citation practices • Interactions are recorded starting immediately after publication • Not after read, published, cited, citation DB (publication delays) • No need to retrieve from publication after the fact • Very large-scale data: • Billions of interactions recorded for millions of users by tens of thousands of scholarly communication services The Promise of Usage Data

Representativeness: • Recorded for particular service • Recorded for particular user community • Challenge: large-scale aggregation of representative data • Attribution and credit: • Citation is explicit, intentional expression of influence • Usage is behavioral, implicit measurement of “attention” • Challenge: turn clickstreams into attribution, credit, and influence • Community acceptance: • Unproven in terms of metrics, services, etc. • Lack of applications and community services • Challenge: creation of framework for community services • This presentation: overview of our work in (1), (2) and (3) Challenges

Presentation structure MESUR’s work on usage data aggregation Turning clickstreams into credit and attribution Community-services Discussion

Timeline and development • 2006-2010 (Los Alamos National Laboratory and Indiana University): • MESUR project: funded by Andrew W. Mellon Foundation • Models of scholarly communication • Large-scale aggregation of usage data • Large-scale survey of usage-based metrics • 2010-present (Indiana University) • MESUR 2.0: funded by NSF and Andrew W. Mellon Foundation • Work on development of community-supported, open and sustainable framework with NISO (Todd Carpenter)

MESUR data flow http://www.mesur.org/schemas/2007- 01/mesur/

Modeling the scholarly communication process:the MESUR ontology. Marko. A Rodrguez, Johan Bollen and Herbert Van de Sompel. A Practical On- tology for the Large-Scale Modeling of Scholarly Artifacts and their Usage, In Proceedings of the Joint Conference on Digital Libraries 2007, Vancouver, June 2007 Examples: An author (Agent) publishes (Context:Event) an article (Document) A user (Agent) uses (Context:Event) a journal (Document) Based on OntologyX5 framework developed by Rights.com

MESUR usage DB • 2006-2010: Collaborating publishers, aggregators and institutional consortia: • BMC, Blackwell, UC, CSU (23), EBSCO, ELSEVIER, EMERALD, INGENTA, JSTOR, LANL, MIMAS/ZETOC, THOMSON, UPENN (9), UTEXAS • Scale: • > 1,000,000,000 usage events, and growing… • +50M articles, +-100,000 serials • Period: 2002-2010…

From usage to clickstreams • Minimal requirements for all usage data • Unique usage events (article level) • Fields: unique session ID, date/time, unique document ID and/or metadata, request type • Note difference with usage statistics

Same session ~ documents relatedness Same session, same user: common interest Frequency of co-occurrence = degree of relationship Normalized: conditional probability (P(A|B) != P(B|A), directional Clickstream, breadcrumbs, paths, flow… Usage data is on article level: Works for journals and articles In fact, anything for which usage was recorded Note: not something we invented: association rule learning in data mining. Beer and diapers! How to generate a usage network.

Johan Bollen, Herbert Van de Sompel, Aric Hagberg,Luis Bettencourt, Ryan Chute, Marko A. Rodriguez, Lyudmila Balakireva. Clickstream data yields high-resolution maps of science. PLoS One, February 2009.

Betweenness centrality : Number of geodesics between vi and vj PageRank PR(vi): PageRank of node vi O(vj): out-degree of journal vj N: number of nodes in network L: dampening factor Network science for impact metrics.

Set of metrics calculated on MESUR data set

The MESUR Metrics Map BETWEENNESS PAGERANK(S) USAGE METRICS TOTAL CITES Johan Bollen, Herbert Van de Sompel, Aric Hagberg and Ryan Chute. A Principal Component Analysis of 39 Scientific Impact Measures. PLoS ONE, June 2009. URL: http://dx.plos.org/10.1371/journal.pone.0006022. RATE METRICS

MESUR Mapping and ranking services

Seeking increasing community involvement “ Registration is now open for "Scholarly Evaluation Metrics: Opportunities and Challenges", a one-day NSF-funded workshop that will take place in the Renaissance Washington DC Hotel on Wednesday, December 16th 2009. Participation in this workshop is limited to 50 people. Registration is free at http://informatics.indiana.edu/scholmet09/registration.html. The topic of the workshop is the future of scholarly assessment approaches, including organizational, infrastructural, and community issues. The overall goal is to identify requirements for novel assessment approaches, several of which have been proposed in recent years, to become acceptable to community stakeholders including scholars, academic and research institutions, and funding agencies. The impressive group of speakers and panelists for the workshop includes representatives from each of these constituencies. Further details are available at http://informatics.indiana.edu/scholmet09/announcement.html Workshop organizers: Johan Bollen (jbollen@indiana.edu), Herbert Van de Sompel (hvdsomp@gmail.com) and Ying Ding (dingying@indiana.edu) “

Moving towards community involvement Planning process underway to establish sustainable, open, community supported infrastructure. New support from Andrew W. Mellon foundation to figure it all out. Close collaboration with NISO (Todd Carpenter) Logistics: Data aggregation Normalization Data-related services Data management Science: Metrics Analysis Prediction • =More than sum of parts: • Each component supports the other • Various business and funding models • Generate added value on all levels • Can fundamentally change scholarly communication Services Ranking Assessment Mapping

New, interesting initiatives • Microsoft/MSR: http://academic.research.microsoft.com/ • Altmetrics: http://altmetrics.org/ • Mendeley-based analytics • Publisher-driven initiatives: Elsevier’s SciVal, mapping of science • Google Scholar • Katy Borner at Indiana University: Science of Science Cyberinfrastructure

Johan Bollen, Herbert Van de Sompel, Aric Hagberg, Luis Bettencourt, Ryan Chute, Marko A. Rodriguez, Lyudmila Balakireva. Clickstream data yields high-resolution maps of science. PLoS One, March 2009 (In Press) Johan Bollen, Herbert Van de Sompel, Aric HagBerg, Ryan Chute. A principal component analysis of 39 scientiﬁc impact measures.arXiv.org/abs/0902.2183 Johan Bollen, Marko A. Rodriguez, and Herbert Van deSompel. Journal status. Scientometrics, 69(3), December 2006 (arxiv.org:cs.DL/0601030) Johan Bollen, Herbert Van de Sompel, and Marko A. Rodriguez. Towards usage-based impact metrics: first results from the MESUR project. In Proceedings of the Joint Conference on Digital Libraries, Pittsburgh, June 2008 Marko A. Rodriguez, Johan Bollen and Herbert Van de Sompel. A Practical Ontology for the Large-Scale Modeling of Scholarly Artifacts and their Usage, In Proceedings of the Joint Conference on Digital Libraries, Vancouver, June 2007 Johan Bollen and Herbert Van de Sompel. Usage Impact Factor: the effects of sample characteristics on usage-based impact metrics. (cs.DL/0610154) Johan Bollen and Herbert Vande Sompel. An architecture for the aggregation and analysis of scholarly usage data. In Joint Conference on Digital Libraries (JCDL2006), pages 298-307, June 2006. Johan Bollen and Herbert Van de Sompel. Mapping the structure of science through usage. Scientometrics, 69(2), 2006. Johan Bollen, Herbert Van de Sompel, Joan Smith, and Rick Luce. Toward alternative metrics of journal impact: a comparison of download and citation data. Information Processing and Management, 41(6):1419-1440, 2005. Some relevant publications.

ISI ingenta Publisher sites Link Resolver Full text DBs EhostEJS British Library A&I service Google Local aggregation of usage data: linking servers • Linking servers can record activities across multiple OpenURL-enabled information sources of a specific digital library environment • Linking server logs are representative of the activities of a particular user population • Global scholarly information space compliant with linking servers • Allows recording of clickstream data: other methods of log aggregation can not connect “same user, different system” streams

Global aggregation of usage data Log Repository 1 Link Resolver Usage logs • Aggregation of linking server logs leads to data set representative of large sample of scholarly community • Global really means different samples of scholarly community • Can be finetuned for local communities • Possibility of truly global coverage Log Repository 2 Aggregated Usage Data Usage logs Link Resolver Log DB Aggregated logs Log Repository 3 Usage logs Link Resolver

Log Repository 3 Log Repository 2 CO CO CO CO CO CO CO CO CO Link Resolver Link Resolver bX project:standards-based aggregation of usage data Log Repository 1 Link Resolver OpenURL ContextObjects Usage log aggregation via OAI-PMH Log Repository properties: • OAI-PMH metadata record: • linking server event log for specific document in specific session • expressed using OpenURL XML ContextObject Format • OAI-PMH identifier: UUID for event • OAI-PMH datestamp: datetime the event was added to the Log Repository Aggregated Usage Data Log DB Aggregated logs Log harvester

bX project: OpenURL ContextObject to represent usage data <?xml version=“1.0” encoding=“UTF-8”?> <ctx:context-object timestamp=“2005-06-01T10:22:33Z” … identifier=“urn:UUID:58f202ac-22cf-11d1-b12d-002035b29062” …> … <ctx:referent> <ctx:identifier>info:pmid/12572533</ctx:identifier> <ctx:metadata-by-val> <ctx:format>info:ofi/fmt:xml:xsd:journal</ctx:format> <ctx:metadata> <jou:journal xmlns:jou=“info:ofi/fmt:xml:xsd:journal”> … <jou:atitle>Toward alternative metrics of journal impact… <jou:jtitle>Information Processing and manage…/jou:jtitle> … </ctx:referent> … <ctx:requester> <ctx:identifier>urn:ip:63.236.2.100</ctx:identifier> </ctx:requester> … <ctx:service-type> … <full-text>yes</full-text> … </ctx:service-type> … Resolver… Referrer… …. </ctx:context-object> Event information: * event datetime * globally unique event ID Referent * identifier * metadata Requester * User or user proxy: IP, session, … ServiceType Resolver: * identifier of linking server

Implications for structural analysis of usage data • Sequence preservation allows: • Reconstruction of user behavior • Usage graphs! • Statistics do not allow this type of analysis BUT are useful for: • validating results • rankings

Attribution and Credit: beyond print & citations Johan Bollen Indiana University