Transition from Gutenberg to Berners-Lee: Importance of Metadata in Research Publications

From Gutenberg to Berners-Lee The need for metadata Ed Simons

Some personal data • Dr. Eduard (Ed) Jozef Simons • Senior Staff Radboud University Nijmegen: http://www.ru.nl • Board Member EuroCRIS: http://www.eurocris.org • IT-project manager: • - International Federation of Catholic Universities (IFCU): http://www.fiuc.org • - OPUS-College project Mozambique/Zambia: http://www.opus-college.net • - KE-interoperability project: https://infoshare.dtv.dk/twiki/bin/view/KeCrisOar

Intro: the nature of the presentation • This presentation concentrates on metadata for (research) publications, and as such covers only a part of the Academic Information Domain, though the most important part from the point of view of this conference. • The presentation will be “generalistic” in nature: it has more the character of a vision on current and future developments than a treatment of concrete problems and technologies regarding the use of publication metadata. • Along the way a new concept will be introduced which could possibly be of use or inspiration for future discussions on this matter.

Importance of metadata • Metadata allow us to describe and classify information in a systematic way, and as such they are indispensable for searching and finding academic information and outcomes of research (publications). • Metadata are often called “data about data”, which is an appealing, “catchy” phrase, but perhaps a better description would be “data about objects” in contrast to “data of objects”.

Is there really a need for (more) metadata? • As the title of this presentation suggests, there seems to be a need for (more, appropriate) metadata regarding publications, resulting from the shift from paper (printed) to electronic (on line) publication of research information.

Is there really a need for (more) metadata? • In contrast with the gentlemen in the previous sheet, I am of the opinion that our on line, networked research information space positively requires an extended set of metadata (compared to the ones commonly used in IR’s up to now) in order to be optimally used. • Why is this so, and what should these new or extended metadata then be? • To answer these questions, let’s first take a closer look at this allegedly important shift, mentioned in the presentation’s title.

We are in transit.... • Currently we are in a transition from the “Gutenberg Era” (started in the 1450's) – the past 5 centuries of paper-based publication of research information - to the “Berners-Lee Era” (1980’s) of on-line storage and supply of information. • Characteristic of the Gutenberg Era was (is) that research information in a given field is laid down in a collection of individual “stand-alone” research information objects (publications), not directly linked – or better, linkable - to one another. • With the coming into being of the Berners-Lee Era, say the Internet, the possibility arises to go beyond the individual publications and concentrate on networks of publications as a new focal point. We’ll see further on in this presentation that such networks, and more notably a specific type of network of publications, bring significant added value to a user looking for scientific information.

The mind of passengers in transit (still) lives in the two worlds…. • As with any transition period, this one already implements parts of the new era, but still bears characteristics of the previous one. • Applied to research information: publications – as in the IR's - are already put on line (according to the Berners-Lee Era), but with metadata which still reflect and stem from the Gutenberg Era. • A big challenge for the near future will be to complete the transition, in other words: implement the metadata which are fully suited to, and make optimal use of, the possibilities of the Internet.

A Tale of Two Cultures… • Before going into more detail on the new and extra metadata needed for the future, let’s first focus a bit on the two communities which deal with research information metadata, and their (different) cultures: the Library community on the one hand and the CRIS-community on the other. • Library Community: creators of repositories, using metadata stemming from the paper library culture, in principle: an electronic version of the old catalogue cards. • CRIS-Community: creators of administrative or management research information systems, from within a culture which focuses on the context of research and its publications (e.g. the project it results from, the unit that produced it, etc…). • Both communities were, up to very recently, isolated from each other, not communicating with one another and often not even knowing of each other’s existence. Luckily this has been changing in the last few years, and gradually the two cultures are coming together and discovering each other, as this workshop illustrates. • Important difference: The CRIS’s set of metadata is substantially more extended than that of the Library Community’s Repositories, which is fully understandable since the former not only deals with publication metadata as such, but also (has to) take the context of the publication into account.

From SIO to NETRINO • The “traditional” metadata as used by the Repositories are still mainly focused on the search for and detection of (a collection of) stand-alone publications, not linked to one another, in short: SIO’s (Stand-alone Information Objects). • In contrast to this, we need metadata which really get the best out of (the possibilities of the) Internet, meaning metadata which are able to detect (the previously mentioned) networks of publications, and especially networks of related publications, which hold a special value to the user and could be generically called: NETRINO: Network of Related Information Objects • The power of the Internet – in combination with the appropriate metadata - to detect this kind of network of publications can hardly be overestimated. • The added value lies in the term “related”. The fact that in a NETRINO the publications are, one way or another, directly related to one another (based on parameters we will discuss in a moment) substantially enhances the probability that a given publication in the collection of the NETRINO will be of relevance to the user, as compared to a harvested list of not directly related publications.

NETRINO: a new focal point in the Research Information Space. • When it comes to research publications, there are two types or sets of metadata on which networks of related information objects can be based: citation metadata on the one hand and context metadata on the other. Corresponding to this there are two types of publication NETRINO’s. • Citation NETRINO: based on the mutual (incoming and outgoing) citations of the publications involved. To be able to detect this kind of network, a prerequisite is that the citation data of the publications is available in a structured way (e.g. a database), or that a “citation harvesting” tool exists which can automatically extract the citation information from the publications. • Context NETRINO: based on parameters regarding the research context in which the publication came about, e.g.: co-researchers, co-authors, the research project the publication resulted from, etc... The metadata needed for the detection of these context networks is available in CRIS's.
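The two kinds of NETRINO described on this slide can be illustrated with a small sketch. It is not taken from any existing CRIS or repository software: the `Publication` record, the function names and the sample data are all assumptions made for illustration. The Citation NETRINO is taken here as everything reachable from a seed publication through incoming or outgoing citations within a given number of hops; the Context NETRINO as everything that shares a project or an author with the seed.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class Publication:
    """Hypothetical, minimal metadata record for one publication."""
    pub_id: str
    cites: Set[str] = field(default_factory=set)    # outgoing citations
    authors: Set[str] = field(default_factory=set)  # context metadata
    project: Optional[str] = None                   # context metadata


def citation_netrino(pubs: Dict[str, Publication], seed: str, max_hops: int = 1) -> Set[str]:
    """Publications linked to `seed` by incoming or outgoing citations."""
    # Treat each citation as an undirected link, so that both the cited
    # and the citing publications count as neighbours.
    neighbours = {pub_id: set() for pub_id in pubs}
    for pub in pubs.values():
        for cited in pub.cites:
            if cited in pubs:
                neighbours[pub.pub_id].add(cited)
                neighbours[cited].add(pub.pub_id)

    # Breadth-first walk, limited to `max_hops` citation steps.
    seen = {seed}
    queue = deque([(seed, 0)])
    while queue:
        current, hops = queue.popleft()
        if hops == max_hops:
            continue
        for other in neighbours[current]:
            if other not in seen:
                seen.add(other)
                queue.append((other, hops + 1))
    return seen


def context_netrino(pubs: Dict[str, Publication], seed: str) -> Set[str]:
    """Publications sharing a project or at least one author with `seed`."""
    origin = pubs[seed]
    related = {seed}
    for pub in pubs.values():
        same_project = origin.project is not None and pub.project == origin.project
        shared_author = bool(origin.authors & pub.authors)
        if same_project or shared_author:
            related.add(pub.pub_id)
    return related


# Invented sample data, for illustration only.
sample = [
    Publication("A", cites={"B"}, authors={"author-1"}, project="P1"),
    Publication("B", authors={"author-2"}),
    Publication("C", cites={"A"}, authors={"author-3"}, project="P2"),
    Publication("D", authors={"author-4"}, project="P1"),
    Publication("E", authors={"author-1", "author-5"}, project="P3"),
    Publication("F", cites={"C"}, authors={"author-6"}),
]
pubs = {pub.pub_id: pub for pub in sample}

citation_one_hop = citation_netrino(pubs, "A", max_hops=1)
citation_two_hops = citation_netrino(pubs, "A", max_hops=2)
context_related = context_netrino(pubs, "A")
```

In this sample, the citation network around publication A contains B (cited by A) and C (citing A), while the context network contains D (same project) and E (shared author). The two networks overlap only in the seed itself, which shows that the two metadata sets surface different related publications.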

Graphical representation of a Citation NETRINO

Graphical representation of a Context NETRINO

Citation and context compared

Where should the metadata be stored? • Given: • Its extensive metadata set, which already includes all the context metadata needed. • Its well thought-out architecture and metadata model. • Its granular, relational (database) structure. The metadata needed for the description and detection of NETRINO’s should be stored in a CERIF-CRIS, which should then be the driving entity for the Repositories. So, the model to develop in the future should be that of “CRIS-driven Repositories”.

METIS: firing at ISI • Within the framework of the Dutch research information system METIS, an experiment is currently being carried out to automatically and continuously get citation data from ISI concerning publications registered in METIS. • For this, requests are continuously (24/7) “fired” at a web service of ISI, by which citation data on the publications can automatically be added to the metadata already in METIS for the publication in question. For the moment only quantitative citation data (numbers of citations) are obtained this way, but in the future information on the citing publications themselves could also be harvested. • This is an example of registering citation data in a CRIS from which a Citation NETRINO could be created.
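The general pattern behind such an experiment can be sketched as follows. This is only an illustration of the idea, not the METIS implementation and not the ISI web service interface: the record layout, the `times_cited` field and the `fetch_citation_count` callable are assumptions, and the remote service is replaced by a local stub.

```python
from typing import Callable, Dict, Optional


def refresh_citation_counts(
    records: Dict[str, dict],
    fetch_citation_count: Callable[[str], Optional[int]],
) -> int:
    """Add or update a citation count on each registered publication.

    `fetch_citation_count` stands in for a request to an external citation
    service; it returns None when the service has no data for a publication.
    Returns the number of records that were changed.
    """
    updated = 0
    for pub_id, record in records.items():
        count = fetch_citation_count(pub_id)
        if count is None:
            continue  # publication unknown to the service: leave record as is
        if record.get("times_cited") != count:
            record["times_cited"] = count
            updated += 1
    return updated


# Stub replacing the external web service, with invented numbers.
def stub_service(pub_id: str) -> Optional[int]:
    return {"pub-1": 12, "pub-2": 0}.get(pub_id)


# Invented publication records, as they might be registered in a CRIS.
records = {
    "pub-1": {"title": "First example publication"},
    "pub-2": {"title": "Second example publication"},
    "pub-3": {"title": "Third example publication"},
}

first_run = refresh_citation_counts(records, stub_service)
second_run = refresh_citation_counts(records, stub_service)
```

Running the refresh repeatedly on a schedule gives the continuous behaviour described above: a run changes only the records whose citation count differs from what the service reports, so an immediate second run changes nothing.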

Work to do… In order to set up an efficient system for the detection of NETRINO’s, the following steps and activities are (in my view) necessary: • Define an optimal set of context metadata (what should be on the list and what not). • Make sure these metadata are stored in your CRIS, and create an automatic transfer of these metadata to the repositories (CRIS-driven repositories). • Create a metadata standard for interoperability which allows flexible granularity (cfr. the KE CRIS-OAR project). • Create unique identifiers for the main objects in the research information space (being the core CERIF entities): apart from DOI and DAI (see the NARCIS project in the NL), we probably also need DII’s and DPI’s (Digital Institution Id’s and Digital Project Id’s). • Make the automatic detection of citation metadata within publications possible (cfr. the previously mentioned experiments going on in METIS these days). • Work out / implement various controlled vocabularies (content classifications) for the information objects in the Academic Information Domain (AID). In other words: work out (or fill) the CERIF semantic layer.
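As a minimal sketch of what a “CRIS-driven repository” transfer with flexible granularity could look like: the record below carries the identifiers named in the list above, and the export function passes either a basic or a context-rich subset to the repository. The class, the field names, the two granularity levels and the placeholder values are all hypothetical; they do not represent CERIF or any existing interoperability standard.

```python
from dataclasses import asdict, dataclass
from typing import Optional, Tuple


@dataclass
class CrisPublication:
    """Hypothetical CRIS record holding identifiers for the core entities."""
    title: str
    doi: Optional[str] = None          # identifier of the publication
    author_dais: Tuple[str, ...] = ()  # Digital Author Identifiers
    dii: Optional[str] = None          # proposed Digital Institution Id
    dpi: Optional[str] = None          # proposed Digital Project Id


# Two assumed granularity levels for the transfer to a repository.
BASIC = ("title", "doi")
WITH_CONTEXT = BASIC + ("author_dais", "dii", "dpi")


def export_for_repository(pub: CrisPublication, fields: Tuple[str, ...] = WITH_CONTEXT) -> dict:
    """Select the requested fields, leaving out those that are empty."""
    data = asdict(pub)
    return {name: data[name] for name in fields if data[name]}


# Placeholder values only; these are not real identifiers.
example = CrisPublication(
    title="Example article",
    doi="doi-placeholder",
    author_dais=("dai-placeholder-1",),
    dpi="dpi-placeholder",
)

basic_record = export_for_repository(example, BASIC)
context_record = export_for_repository(example)
```

Keeping the full record in the CRIS and deriving the repository view from it means that the context identifiers are registered once and travel along whenever the receiving side can handle them.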

Summarizing • The transition from paper-based to on line publications offers the possibility to concentrate on networks of publications as a meaningful unit of information, instead of just the individual publications. • These networks, and especially one particular kind, the NETRINO (Network of Related Information Objects), bring important added value to the user, since the probability that a publication in the collection of a NETRINO is of significance to him/her is substantial. • One could distinguish between two types of networks of related information objects: Citation NETRINO and Context NETRINO. The metadata for the latter are commonly stored in a CRIS. • In order for NETRINO’s to work, the definition of a standardized metadata set for Repositories is necessary, as well as a mechanism (technology) to automatically extract citation metadata from publications.

Thank you for your attention! (Images taken from “Head First” IT-training books: “Head First Java” and “Servlets and JSP”)
