Scholarly communication today

Scholarly articles haven't really changed much in 346 years • 4th Aug 1666 • 1st Jan 1888 • 19th March 2012

Scholarly communication – an analogy • Scholarly communication, at this mid-point in the digital revolution, is in an ill-defined transitional state (a ‘horseless carriage’ state) that lies somewhere between the world of print and paper and the world of the web and computers, with the former still exercising significantly more influence than the latter • We started here: • We’re now here (online): • Great – that’s a significant start

Scholarly communication – an analogy • . . . but this is really where we need to be!

The importance of citations

What is a citation? • The performative act of citing a published work that is relevant to the current work, typically made by including a reference in a reference list • Why are citations important? • The act of bibliographic citation is central to scholarly communication – bibliographic references are the links that knit together independent scholarship • Citations unify the whole world of scholarship into a giant citation network • Citation networks reveal the development of academic disciplines • Sir Isaac Newton: “If I have seen a little further, it is by standing on the shoulders of Giants”

How is the present situation imperfect? • The present scholarly citation system inadequately exposes the knowledge networks that exist within the scholarly literature, linking papers, authors, funders, research projects and datasets • Citation data are hidden behind subscription firewalls of commercial companies • Academics are not free to use their own citation data as they please • In this Open Access age, it is a scandal that reference lists from journal articles, the core elements of the academic data cycle, are not freely available for use by the scholars who created them • Citation data now need to be recognized as a part of the Commons – those works that are freely and legally available for sharing

Nomenclature and metadata

Current citation practice • Well-formed references in reference lists • . . . relate to clearly defined entities • But extreme ambiguity in terminology! “a reference” “a reference” “a reference” “a reference”

Recommended nomenclature for references and citations • The citing article contains a c4o:InTextReferencePointer • The c4o:InTextReferencePointer c4o:denotes a biro:BibliographicReference • The biro:BibliographicReference biro:references the cited article • The citing article cito:cites the cited article • This is the nomenclature used in our SPAR (Semantic Publishing and Referencing) Ontologies http://purl.org/spar/

Generic structured metadata required to record a citation • Entities: the citing paper and the cited paper • Type of each entity, e.g. journal article • Bibliographic metadata for each entity, e.g. title, publication date, unique identifier • Relationship: cito:cites (bibliographic citation) • Provenance: source of the citation information, e.g. CrossRef
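The metadata items named on this slide can be pictured as one small record. The following is a minimal sketch only: the class names, field names, titles and DOIs are hypothetical placeholders, not the format used by the Open Citations Corpus.

```python
from dataclasses import dataclass, asdict


@dataclass
class Entity:
    """One end of a citation (hypothetical field names)."""
    entity_type: str       # e.g. "journal article"
    title: str
    publication_date: str  # ISO 8601 date string
    identifier: str        # unique identifier, e.g. a DOI


@dataclass
class Citation:
    """A citation: two entities, a relationship and its provenance."""
    citing: Entity
    cited: Entity
    relationship: str = "cito:cites"  # bibliographic citation
    provenance: str = "CrossRef"      # source of the citation information


# Placeholder values for illustration only; these DOIs do not exist.
citation = Citation(
    citing=Entity("journal article", "Hypothetical citing paper",
                  "2012-03-19", "doi:10.9999/example.citing"),
    cited=Entity("journal article", "Hypothetical cited paper",
                 "2008-01-01", "doi:10.9999/example.cited"),
)

record = asdict(citation)  # nested plain dict, ready to serialise as JSON
print(record["citing"]["title"], record["relationship"],
      record["cited"]["title"])
```

Keeping the provenance on the citation itself, rather than on either entity, reflects the point that it is the source of the citation information that needs recording.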

The Open Citations Corpus

The original Open Citations Corpus • An open repository of bibliographic citation data created in 2011 • available at http://opencitations.net • Created with JISC funding of the Open Citations Project • project blog: http://opencitations.wordpress.com/ • Originally populated with ~6.4 million individual references from the reference lists of ~200,000 articles in the Open Access Subset of PubMed Central (as of January 2011) • These reference >3 million unique papers • ~20% of all PubMed papers published between 1950 and 2010, including all the highly cited papers in every biomedical field • Multiple citations of the same well-cited papers permitted us to perform error correction of the harvested citations (approx. 1% erroneous) • These citations are encoded as Linked Open Data using the SPAR ontologies, and are freely available under a CC0 waiver from http://opencitations.net/data/

Viewing citation networks at http://opencitations.net

The outward citation network of Reis et al. (2008)

Limitations of the original Open Citations Corpus • A snapshot in time of the citation data in PubMed Central as of January 2011 • becoming increasingly out of date • Contains references from open access articles only • Limited to the biomedical domain

Expanding the Open Citations Corpus

Expanding the Open Citations Corpus - Objectives • Redesign the OCC data model • Update the current ingest • Increase the domain coverage • Include reference lists from subscription-access journals • Harvest references on an ongoing basis, as articles are published • Improve the user interface and the user experience • Publish the citation data both in BibJSON and in RDF as Linked Open Data • Build added-value services over the citation data

Redesigning the Open Citations Corpus data model • Three record types: Entity Records, Personal Records and Citation Records • A clear separation is made between potentially erroneous citation information ‘as received’ in text strings from article reference lists • ReferenceTextRecords containing NameTextRecords (of authors, editors) • and authoritative bibliographic metadata derived from trustworthy sources such as CrossRef, PubMed and the web pages of published articles • BibliographicRecords and PersonalRecords (of authors, editors) • A distinction is also made between an UnmatchedCitationRecord • where no BibliographicRecord exists within the OCC for the cited entity • and a MatchedCitationRecord • where the cited entity has a BibliographicRecord within the OCC • A unique internal identifier is created for each OCC record • Provenance information details the source of each citation, the date it was acquired, its format, and the name of the curator responsible for its ingestion
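The record types on this slide can be sketched as plain data classes. This is an illustrative reading of the slide, not the OCC schema: the record names come from the slide, but every field name, the `occ:` identifier scheme and the example values are assumptions.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import List

_next_id = count(1)


def new_occ_id() -> str:
    """Return a fresh internal identifier (hypothetical 'occ:N' scheme)."""
    return f"occ:{next(_next_id)}"


@dataclass
class Provenance:
    source: str         # where the information came from
    date_acquired: str  # ISO 8601 date string
    data_format: str
    curator: str        # person responsible for the ingestion


@dataclass
class NameTextRecord:          # a name 'as received' in a reference string
    name_text: str
    record_id: str = field(default_factory=new_occ_id)


@dataclass
class ReferenceTextRecord:     # a reference 'as received', possibly erroneous
    reference_text: str
    names: List[NameTextRecord]
    provenance: Provenance
    record_id: str = field(default_factory=new_occ_id)


@dataclass
class PersonalRecord:          # authoritative record of an author or editor
    name: str
    record_id: str = field(default_factory=new_occ_id)


@dataclass
class BibliographicRecord:     # authoritative bibliographic metadata
    title: str
    year: int
    authors: List[PersonalRecord]
    provenance: Provenance
    record_id: str = field(default_factory=new_occ_id)


@dataclass
class UnmatchedCitationRecord:  # cited entity has no BibliographicRecord yet
    citing: BibliographicRecord
    cited_text: ReferenceTextRecord
    provenance: Provenance
    record_id: str = field(default_factory=new_occ_id)


@dataclass
class MatchedCitationRecord:    # cited entity has a BibliographicRecord
    citing: BibliographicRecord
    cited: BibliographicRecord
    provenance: Provenance
    record_id: str = field(default_factory=new_occ_id)


# Placeholder data for illustration only.
prov = Provenance("PubMed Central", "2013-09-01", "XML", "example curator")
citing = BibliographicRecord("Hypothetical citing paper", 2012,
                             [PersonalRecord("A. Author")], prov)
ref = ReferenceTextRecord("Author B (2008) Hypothetical cited paper.",
                          [NameTextRecord("Author B")], prov)
unmatched = UnmatchedCitationRecord(citing, ref, prov)
print(unmatched.record_id, "links", citing.record_id, "to", ref.record_id)
```

Keeping the text records and the authoritative records as separate types means that an error in a harvested string never overwrites trusted metadata.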

Reconfiguring the Open Citations Corpus • Underlying technical implementation being revised • Bibliographic information encoded in BibJSON • Data stored in BibServer, which handles BibJSON natively • Data from different sources brought into a common BibJSON format as soon as possible • Processing the whole ingest from either source takes over 24 hours • Work still to be done on the ingest pipeline, since the parsing of citation information from the reference list entries is not yet 100% accurate

Matching citation strings to bibliographic records • When a new reference has been extracted from a reference list • a ReferenceTextRecord is created for the citation target, and • an UnmatchedCitationRecord is created between the BibliographicRecord of the citing paper and the citation target’s ReferenceTextRecord • The ReferenceTextRecord is then compared with existing BibliographicRecords • If a match is found, a new MatchedCitationRecord is created within the OCC between the BibliographicRecords of the citing and cited entities, and • the pre-existing UnmatchedCitationRecord between the citing and cited entities is deprecated • Similarly, a new NameTextRecord is created for each author and editor named in the new ReferenceTextRecord, and the OCC is then searched for matches to existing PersonalRecords within the OCC
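The sequence of steps on this slide can be sketched as follows. The slide does not say how a ReferenceTextRecord is compared with existing BibliographicRecords, so the comparison used here (a normalised title appearing inside the reference string) is purely an assumption, as are the dictionary layouts, function names and example data.

```python
import re


def normalise(text: str) -> str:
    """Lower-case and collapse punctuation so strings compare loosely."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def ingest_reference(citing_id, reference_text,
                     bibliographic_records, citation_records):
    """Record a new reference, then try to match it to a known record."""
    # Step 1: keep the reference exactly as received.
    ref_record = {"kind": "ReferenceTextRecord", "text": reference_text}
    # Step 2: link the citing paper to the as-received reference text.
    unmatched = {"kind": "UnmatchedCitationRecord", "citing": citing_id,
                 "cited": ref_record, "deprecated": False}
    citation_records.append(unmatched)
    # Step 3: compare with existing BibliographicRecords (assumed criterion).
    ref_norm = normalise(reference_text)
    for record_id, record in bibliographic_records.items():
        title = normalise(record["title"])
        if title and title in ref_norm:
            # Step 4: create the matched record, deprecate the unmatched one.
            unmatched["deprecated"] = True
            matched = {"kind": "MatchedCitationRecord",
                       "citing": citing_id, "cited": record_id}
            citation_records.append(matched)
            return matched
    return unmatched


# Placeholder data for illustration only.
bibliographic_records = {
    "occ:1": {"title": "Hypothetical citing paper"},
    "occ:2": {"title": "An Example Study of Citation Networks"},
}
citation_records = []
hit = ingest_reference(
    "occ:1",
    "Smith J (2010) An example study of citation networks. J Examples 1: 1-10.",
    bibliographic_records, citation_records)
miss = ingest_reference(
    "occ:1", "Jones K (2011) A paper the corpus has not seen.",
    bibliographic_records, citation_records)
print(hit["kind"], miss["kind"])
```

The unmatched record is deprecated rather than deleted, so the original as-received link stays available for audit.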

Citation error correction • Errors in reference list entries vary • from the trivial – a non-English name with incorrect accents or an article title containing “beta” instead of the correct “β” • to the serious – two papers in the same reference list with the same DOI • Such errors can be detected by comparing a new ReferenceTextRecord with pre-existing BibliographicRecords, and a new NameTextRecord with pre-existing PersonalRecords • Where there are several OCC ReferenceTextRecords referencing the same multiply-cited paper for which an authoritative OCC BibliographicRecord does not yet exist, we use voting algorithms for reference disambiguation and error correction, enabling the creation of a reliable BibliographicRecord for that entity even when we can find no external authority to provide it • In future, we wish to offer an automated OCC reference correction service to third parties such as authors and journal editors, enabling them to spot and correct errors in the reference lists of submitted papers before publication
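The slide does not describe the voting algorithms themselves. As one possible illustration, the sketch below takes several harvested variants of the same reference and picks the most common value for each field. The per-field majority vote, the field names and the example data are all assumptions.

```python
from collections import Counter


def vote_on_fields(variants):
    """Build a consensus record by simple majority vote on each field."""
    consensus = {}
    field_names = {name for variant in variants for name in variant}
    for name in sorted(field_names):
        # Ignore variants where the field is missing or empty.
        values = [v[name] for v in variants if v.get(name)]
        if values:
            consensus[name] = Counter(values).most_common(1)[0][0]
    return consensus


# Three hypothetical harvested variants of one multiply-cited paper.
variants = [
    {"title": "Role of β-carotene in example cells", "year": "2005"},
    {"title": "Role of beta-carotene in example cells", "year": "2005"},
    {"title": "Role of β-carotene in example cells", "year": "2008"},
]
consensus = vote_on_fields(variants)
print(consensus)
```

A vote of this kind only helps when a paper is cited several times, which is why the slide ties the approach to multiply-cited papers.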

New relationship types in the Open Citations Corpus • Entity type relationships • The nature of the source entity and the target entity (e.g. journal article, book, dataset) are separately recorded in the OCC. We can thus infer the nature of each entity type relationship, for example: • Article-to-article bibliographic citation • Article-to-database data citation • Data_repository-to-article bibliographic citation • Relationships other than bibliographic citations • Additional relationship types between entities in the OCC may be encoded using CiTO, the Citation Typing Ontology, if that information is available: • Citation :EntityA cito:cites :EntityB . • Shared authorship :EntityA cito:sharesAuthorsWith :EntityB . • Common funding :EntityA cito:sharesFundingAgencyWith :EntityB . • Common institution :EntityA cito:sharesAuthorInstitutionWith :EntityB . • Related :EntityA dcterms:relation :EntityB .
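The relationship statements listed on this slide can be written out as Turtle. The sketch below builds the text with plain string formatting. The `cito` and `dcterms` namespaces are the published ones; the default `:` namespace, the function name and the relationship keys are hypothetical.

```python
PREFIXES = {
    "cito": "http://purl.org/spar/cito/",
    "dcterms": "http://purl.org/dc/terms/",
    "": "http://example.org/occ/",  # hypothetical namespace for OCC entities
}

# Relationship types from the slide, keyed by illustrative labels.
RELATIONSHIPS = {
    "citation": "cito:cites",
    "shared authorship": "cito:sharesAuthorsWith",
    "common funding": "cito:sharesFundingAgencyWith",
    "common institution": "cito:sharesAuthorInstitutionWith",
    "related": "dcterms:relation",
}


def to_turtle(statements):
    """Serialise (subject, relationship label, object) tuples as Turtle."""
    lines = [f"@prefix {prefix}: <{uri}> ." for prefix, uri in PREFIXES.items()]
    lines.append("")
    for subject, relationship, obj in statements:
        lines.append(f":{subject} {RELATIONSHIPS[relationship]} :{obj} .")
    return "\n".join(lines)


turtle = to_turtle([
    ("EntityA", "citation", "EntityB"),
    ("EntityA", "shared authorship", "EntityB"),
    ("EntityA", "related", "EntityB"),
])
print(turtle)
```

For real data an RDF library would be a safer choice than string formatting, since it handles escaping and validation.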

Expansion of the Open Citations Corpus coverage • Ingest from the Open Access Subset of PubMed Central is being updated from ~200,000 articles in Jan 2011 to the current ~658,000 articles in September 2013 • Domain coverage is being expanded to include the physical sciences and mathematics, by the ingest of the reference lists from all ~872,000 preprints in the arXiv preprint repository at Cornell University Library • This will bring the total number of references from ~6.4 million to ~40 million • We then intend to ingest all the references in CiteSeer and from Wikipedia, marking these with clear provenance information • To this we will add citations from data repositories such as Dryad, which contain literature references associated with the datasets they hold • and from DataCite, which issues DOIs for datasets, and harvests metadata that contain literature references

Citations from heritage literature – ‘The Future of the Past’ • Funding application just submitted to harvest references from the pre-digital biodiversity / biological taxonomy literature, where papers have lasting value • We will use the Biodiversity Heritage Library (http://www.biodiversitylibrary.org/) as a source of references • David King, a text mining colleague at the Open University, will use advanced text mining techniques to dig references out of ‘dirty’ OCR’d page images • We will then ingest these data into the Open Citations Corpus and make them freely available • This will be the only source of digital citation data from a major fraction of the world's heritage literature in the field of biodiversity / biological taxonomy, data that are simply not available in digital form anywhere else

Additional citations from PubMed Central • There are ~2.2 million articles in PubMed Central that are not part of the Open Access Subset, presently missing from the Open Citations Corpus • These contain citations not only to other papers, but also to datasets, typically in the form of database accession numbers, buried within the full text or footnotes • Recent text mining initiatives undertaken by Europe PubMed Central (EPMC) have extracted both the bibliographic citations and the data citations from all ~2.8 million PubMed Central articles, which are now freely available • We propose to ingest all these EPMC literature and data citations into the expanded and improved Open Citations Corpus • This will increase the number of PMC articles for which the OCC holds citation information by about 330% • In addition, it will further expand the nature of the citation data held to include the data citations contained within these PMC articles • However, these are just a fraction of the total scholarly citations, most of which are locked behind the pay walls of commercial providers
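The "about 330%" figure on this slide can be checked against the article counts quoted in the talk (~658,000 Open Access Subset articles as of September 2013, ~2.8 million PubMed Central articles in total). A quick sketch of the arithmetic, with hypothetical variable names:

```python
oa_subset_articles = 658_000  # Open Access Subset, September 2013 (approx.)
all_pmc_articles = 2_800_000  # all PubMed Central articles (approx.)

additional_articles = all_pmc_articles - oa_subset_articles
increase_percent = 100 * additional_articles / oa_subset_articles

print(f"{additional_articles:,} additional articles, "
      f"an increase of about {increase_percent:.0f}%")
```

With these rounded inputs the increase comes out at roughly 326%; using the ~2.2 million non-Open-Access figure directly gives roughly 334%, so both are consistent with "about 330%".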

Reference lists from subscription-access articles • All fully open access publishers already publish article reference lists openly • I am working to persuade other major scholarly publishers to do the same • i.e. to put article reference lists outside the subscription paywall, in the same way as abstracts and bibliographic metadata are freely available • Last January, I published an Open Letter to Publishers requesting this • Claire Redhead kindly distributed it to all OASPA members • The letter is available at http://imageweb.zoo.ox.ac.uk/pub/2013/letters/Letter_to_all_scholarly_journal_publishers_re_open_citations.pdf • A number of leading STM publishers have expressed their willingness to open the reference lists from subscription-access journal articles • Nature, Science, Taylor & Francis, Royal Society Publishing, Portland Press, MIT Press and Oxford University Press are among the first • another has expressed willingness verbally, but has yet to commit formally

http://opencitations.wordpress.com

Opening article reference lists via CrossRef • How can these be ingested into the Open Citations Corpus? • Most publishers already submit their reference lists to CrossRef as part of its CitedBy Linking Service • If you do not at present, you should use this free service! • With the publisher’s permission, CrossRef can enable reference lists to be ‘opened’ • on a publisher-by-publisher basis based on DOI prefixes • on a journal-by-journal basis • on an article-by-article basis for hybrid journals • References are then available via the CrossRef API for ingest into the OCC • However, because the default CrossRef CitedBy Linking Service agreement is not to publish reference lists, even Open Access publishers must specifically inform CrossRef that the reference lists of their journal articles should be open • Geoff Bilder has a new CrossRef Metadata Best Practice Document that I will circulate, explaining how to specify this choice in your article metadata

Summary - Benefits of the Open Citations Corpus • Created by scholars for scholars using scholarly data • No profit motive constraining free publication of the data • Will bring particular benefit to those who are NOT members of First World academic institutions whose libraries subscribe to commercial citation data from Thomson Reuters or Elsevier • Will provide integrated access to citation data from a variety of sources, both inside and outside traditional scholarly publishing, with provenance information • Data are semantically described using the SPAR bibliographic ontologies • Citations thus become part of the Web of Linked Open Data • Data available in a variety of formats including BibJSON, BibTeX and RDF for download by third parties for their own use or to build into useful services • indexing, search and browse (in prototype) • timeline visualizations (in prototype) • analysis of citation networks, co-authorship networks, etc. • trend identification, recommendation services, etc.

Sustainability

Sustainability • The development of the Open Citations Corpus has been enabled by short-term grant funding, but this does not provide a sustainable financial model • For the future, we seek one of the following long-term arrangements: • Adoption by a major institutional or national library • Adoption by a publishing organization such as CrossRef, with indirect support from publishers • Direct support by the scholarly publishing community • Social investment, i.e. the provision of capital to generate social as well as financial returns, to support open access to scholarly information • Income support by charging for added-value services over the open data • I would be grateful for your views on the value of the Open Citations Corpus and the manner in which its ongoing development might be supported

Acknowledgements and thanks • Alex Dutton, who developed the original Open Citations Corpus • Richard Jones, Martyn Whitwell and Mark MacGillivray of Cottage Labs, who have undertaken more recent development work • Silvio Peroni, my colleague in developing the suite of SPAR (Semantic Publishing and Referencing) Ontologies • The JISC, who have funded the development of the Open Citations Corpus
