This workshop discusses the need for a Nereus XML schema, the interoperability issues with DSpace and Earth Observation (EO) data, and provides solutions for making DSpace EO interoperable. It also addresses the compatibility of DISpace with other repositories like RePEc, OAISTER, and NEEO.
Institutional Repositories Workshop DISpace 1.0 Benoit PAUWELS Université Libre de Bruxelles (ULB) Brussels
DISpace 1.0
• Why do we need a Nereus XML Schema?
• Why is DSpace not (sufficiently) interoperable with EO?
• How can I make my DSpace system EO interoperable within one month?
• Can we support « complex object » formats in DISpace?
• Is DISpace interoperable with RePEc, OAISTER, NEEO, …?
EO general infrastructure
[Diagram labels: Data Providers (DSpace, ARNO, Fedora, Eprints); OAI; EO Harvester; Service Provider; iPort; SRU; APA references; End User]
Quality of service -- level 6
Q.Service = f(Q.Metadata)
Data Provider:
  Authors: Capron, Henri; Cincera, Michele
  Title: Industry-university S&T transfer: Belgian evidence on CIS data
  BibCit: Brussels economic review, 46(3)
  Pub date: 2003
Service Provider:
  AU: Capron, Henri^ Cincera, Michele
  TI: Industry-university S&T transfer: Belgian evidence on CIS data
  BC: Brussels economic review, 46(3)
  PD: 2003
APA references -- QuaLevel 6:
  Capron, Henri, & Cincera, Michele (2003). Industry-university S&T transfer: Belgian evidence on CIS data. Brussels economic review, 46(3).
Quality of service -- level 7
Q.Service = f(Q.Metadata)
Data Provider:
  Author: Capron, Henri
  Author: Cincera, Michele, 124468
  Title: Industry-university S&T transfer: Belgian evidence on CIS data
  BibCit: Brussels economic review, 46(3)
  Pub date: 2003
Service Provider:
  AU: Capron, Henri
  AU: Cincera, Michele, 124468
  TI: Industry-university S&T transfer: Belgian evidence on CIS data
  BC: Brussels economic review, 46(3)
  PD: 2003
APA references -- QuaLevel 7:
  Capron, Henri, & Cincera, Michele (2003). Industry-university S&T transfer: Belgian evidence on CIS data. Brussels economic review, 46(3).
  Author details shown on the slide: Prof. Dr. Michele Cincera, Department of Applied Economics (DULBEA)
Quality of service -- level 8
Q.Service = f(Q.Metadata)
Data Provider:
  Author: Capron, Henri
  Author: Cincera, Michele, 124468
  Title: Industry-university S&T transfer: Belgian evidence on CIS data
  JTitle: Brussels economic review
  Volume: 46
  Issue: 3
  Pub date: 2003
Service Provider:
  AU: Capron, Henri
  AU: Cincera, Michele, 124468
  TI: Industry-university S&T transfer: Belgian evidence on CIS data
  JT: Brussels economic review
  VO: 46
  IS: 3
  PD: 2003
APA references -- QuaLevel 8:
  Capron, Henri, & Cincera, Michele (2003). Industry-university S&T transfer: Belgian evidence on CIS data. Brussels economic review, 46(3).
  Author details shown on the slide: Prof. Dr. Michele Cincera, Department of Applied Economics (DULBEA)
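With journal title, volume and issue delivered as separate fields, a service provider can assemble the APA reference mechanically instead of echoing a free-text citation. The following is a minimal sketch of that idea, not the EO implementation: the class name `ApaReference` and its method are hypothetical, and the output simply follows the reference style printed on the slides (full first names, no italics).

```java
import java.util.List;

/** Hypothetical helper: renders an APA-style journal reference from granular fields. */
class ApaReference {
    static String format(List<String> authors, String year, String title,
                         String journal, String volume, String issue) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < authors.size(); i++) {
            if (i > 0) {
                // The last author is joined with ", & ", earlier ones with ", ".
                sb.append(i == authors.size() - 1 ? ", & " : ", ");
            }
            sb.append(authors.get(i));
        }
        sb.append(" (").append(year).append("). ");
        sb.append(title).append(". ");
        sb.append(journal).append(", ").append(volume);
        // The issue is optional; it is only printed when the data provider supplies it.
        if (issue != null && !issue.isEmpty()) {
            sb.append('(').append(issue).append(')');
        }
        return sb.append('.').toString();
    }
}
```

Called with the two authors, 2003, the article title, "Brussels economic review", "46" and "3", the method reproduces the reference string shown on the slide. With only a free-text BibCit field the provider would have to guess where the volume ends and the issue starts.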
DP, exchange, SP formats
[Diagram labels: Data Providers (DSpace, ARNO, Fedora, Eprints); EO Harvester; EO; iPort; SRU; APA references; End User; further consumers of the same data: National Repository of Academic Output, RePEc]
DP, exchange, SP formats: Who will do the mapping?
[Same diagram, with the question raised for each consumer: National Repository of Academic Output, RePEc, EO]
DP, exchange, SP formats
[Diagram labels: Data Provider (DSpace); DP internal metadata-format; DP mapping; Exchange metadata-format; SP mapping; SP internal metadata-format; Service Provider (EO Harvester)]
Data Provider:
• choose appropriate internal metadata-format
  • high quality service … of several service providers
• map to exchange format … of several SP
• publish exchange metadata-format
Service Provider:
• choose appropriate internal metadata-format
  • quality of metadata should be retained
• map from exchange format
• build service(s) based on internal metadata-format
EO exchange metadata-format
• Let’s say we go for the international standard QDC
• QDC + at minimum one title and the (co-)authors
DP record: Author: Capron, Henri; Author: Cincera, Michele, 124468; Title: Industry-university S&T transfer: Belgian evidence on CIS data; JTitle: Brussels economic review; Volume: 46; Issue: 3; Pub date: 2003
<record>
</record>
EO exchange metadata-format
• QDC + at minimum one title and the (co-)authors
• EO: we want to produce publication lists per author
  • + unique identifier per author
DP record: Author: Capron, Henri; Author: Cincera, Michele, 124468; Title: Industry-university S&T transfer: Belgian evidence on CIS data; JTitle: Brussels economic review; Volume: 46; Issue: 3; Pub date: 2003
<record>
<dc:contributor>Capron, Henri</dc:contributor>
<dc:contributor>Cincera, Michele</dc:contributor>
<dc:title>Industry-university S&T transfer: Belgian evidence on CIS data</dc:title>
</record>
EO exchange metadata-format
• QDC + at minimum one title and the (co-)authors + unique identifier per author
• DP can deliver « author unique id » in various formats:
  • (124468)
  • |124468
  • [ uniqueid: 124468 ]
• Impose format: rewrite QDC XML Schema
DP record: Author: Capron, Henri; Author: Cincera, Michele, 124468; Title: Industry-university S&T transfer: Belgian evidence on CIS data; JTitle: Brussels economic review; Volume: 46; Issue: 3; Pub date: 2003
<record>
<dc:contributor>Capron, Henri</dc:contributor>
<dc:contributor>Cincera, Michele (124468)</dc:contributor>
<dc:title>Industry-university S&T transfer: Belgian evidence on CIS data</dc:title>
</record>
EO exchange metadata-format
• Nereus QDC XML Schema
• EO: we want nice APA style structured bibliographic citations
  • impose format: “Guidelines for Encoding Bibliographic Citation Information in Dublin Core Metadata” (MIMAS): OpenURL 1.0 ContextObject
DP record: Author: Capron, Henri; Author: Cincera, Michele, 124468; Title: Industry-university S&T transfer: Belgian evidence on CIS data; JTitle: Brussels economic review; Volume: 46; Issue: 3; Pub date: 2003
<record>
<dc:contributor>Capron, Henri</dc:contributor>
<nereus:author id="124468">Cincera, Michele</nereus:author>
<dc:title>Industry-university S&T transfer: Belgian evidence on CIS data</dc:title>
</record>
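Imposing one identifier format means somebody has to normalise the variants that data providers deliver today. The sketch below recognises the three variants named in the deck, "(124468)", "|124468" and "[ uniqueid: 124468 ]", and emits the nereus:author form. The class and method names are hypothetical, identifiers are assumed to be numeric, and XML escaping of the name is left out for brevity.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical normaliser for the author-identifier variants delivered by data providers. */
class AuthorIdNormalizer {
    // name, then one of: (123)  |123  [ uniqueid: 123 ]
    private static final Pattern WITH_ID = Pattern.compile(
        "^(.*?)\\s*(?:\\((\\d+)\\)|\\|(\\d+)|\\[\\s*uniqueid:\\s*(\\d+)\\s*\\])\\s*$");

    /** Returns {name, id}; id is null when no recognised identifier is present. */
    static String[] split(String contributor) {
        String c = contributor.trim();
        Matcher m = WITH_ID.matcher(c);
        if (!m.matches()) {
            return new String[] { c, null };
        }
        String id = m.group(2) != null ? m.group(2)
                  : m.group(3) != null ? m.group(3) : m.group(4);
        return new String[] { m.group(1).trim(), id };
    }

    /** Renders the contributor in the imposed form (no XML escaping in this sketch). */
    static String toNereusAuthor(String contributor) {
        String[] parts = split(contributor);
        if (parts[1] == null) {
            return "<dc:contributor>" + parts[0] + "</dc:contributor>";
        }
        return "<nereus:author id=\"" + parts[1] + "\">" + parts[0] + "</nereus:author>";
    }
}
```

Contributors without an identifier fall back to plain dc:contributor, which matches the mixed record on the slide where only one of the two authors carries an id.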
EO exchange metadata-format
• Nereus QDC XML Schema + OpenURL 1.0 ContextObject
DP record: Author: Capron, Henri; Author: Cincera, Michele, 124468; Title: Industry-university S&T transfer: Belgian evidence on CIS data; JTitle: Brussels economic review; Volume: 46; Issue: 3; Pub date: 2003
<record>
<dc:contributor>Capron, Henri</dc:contributor>
<nereus:author id="124468">Cincera, Michele</nereus:author>
<dc:title>Industry-university S&T transfer: Belgian evidence on CIS data</dc:title>
<dcterms:bibliographicCitation> info:ofi/fmt:kev:mtx:ctx… &rft_val_fmt:journal &rft.jtitle=Brussels economic review &rft.volume=46 &rft.issue=3</dcterms:bibliographicCitation>
</record>
• BUT: certain bib. cit. metadata doesn’t map to OpenURL ContextObject
  • Book chapter: authors/editors of book
  • Part of research report: sponsor, authors/editors of report, volume
  • …
EO exchange metadata-format
• In general: imposing QDC + OpenURL ContextObject without additional guidelines (profile) results in:
  • unstructured and incomplete metadata at the SP
  • low(er)-quality services
• EO: { Nereus QDC + OpenURL ContextObject + Nereus profile } is not good enough
  • certain metadata fields for certain document types don’t find their place
  • EO service could be more ‘user-friendly’ with the introduction of complex object metadata structures
NEEO
• Need for a common standard for
  • bibliographic metadata
    • QDC + EPrints Application Profile?
    • MODS?
  • complex object metadata structure
    • METS? MPEG21/DIDL?
DP metadata-format
• Which one to choose?
  • some variant of QDC
  • some flavour of MARC (MARC21, UNIMARC, …)
  • MODS
  • …
• Can be whatever, but
  • should be able to describe all types of academic output
  • granularity high enough and semantics well enough defined that mapping to the different exchange formats is possible without loss of quality
    • define metadata structure, irrespective of services!
    • object file related metadata!
    • follow standards!
• Unfortunately
  • many IR software packages do not support granular, semantically well-defined metadata formats
DISpace DP format
DSpace uses QDC as internal metadata format; this is not good enough for high(er)-quality services (like EO)
Callouts on the slide (ambiguous values): Cincera, Michele, 1960, 1965; Brussels economic review 46, 122
<dc:contributor>Capron, Henri</dc:contributor>
<dc:contributor>Cincera, Michele, 124468</dc:contributor>
<dc:title>Industry-university S&T transfer: Belgian evidence on CIS data</dc:title>
<dcterms:bibliographicCitation>Brussels economic review 46(3)</dcterms:bibliographicCitation>
<dcterms:issued>2003</dcterms:issued>
DISpace DP format
Some other examples:
• Bibliographic citation for book chapter
<dcterms:bibliographicCitation>The national innovation system of Belgium. Capron, H. ; Meeusen, Wim. Berlin Springer-Verlag 2000, 73-100. 790813087</dcterms:bibliographicCitation>
• Bibliographic citation for book chapter
<dcterms:bibliographicCitation>Croissance et convergence économique des régions théorie, faits et déterminants. Beine, M. ; Docquier, F. Bruxelles De Boeck Université 2000, 345-384. 2804133435</dcterms:bibliographicCitation>
• Bibliographic citation for a conference contribution
<dcterms:bibliographicCitation>Proceedings of the international seminar on exchange of technology and know-how, 13-15 October 1999. Prague 1999</dcterms:bibliographicCitation>
DISpace DP format
• add granularity: split up DSpace QDC fields into subfields
• define semantics of fields and subfields for each document type:
  • based on ISO-690 / Z44-005
  • irrespective of aimed-at services
  • see document « DISpace @ ULB: Input template fields, QDC fields and subfields -- Version 1.0 – March 2006 »
DISpace DP format
• Supported document types (15):
  • book
  • bookitem
  • article
  • proceedings
  • conference lecture
  • unpublished communication
  • unpublished theses and dissertations
  • unpublished research report
  • part of an unpublished research report
  • working paper
  • patent
  • interview – emission
  • web site
  • bibliography
  • course, sound, video, image, database, software, others
DISpace DP format
• Example: bookitem
<type>|atype-level1|btype-level2|ctype-level3</type>
<title>|amaintitle|bsubtitle</title>
<contributor>|aname|=DAI</contributor>
<dcterms:issued>|adate|tpubstatus</dcterms:issued>
  pubstatus = { ULB4PUB | ULB2BPUB | ULBPUB | ULBNPUB }
<dcterms:bibliographicCitation>|amaintitle|bsubtitle|hauthors|eedition|upublisher|cplace|vvolume|ppages|wcollection|icollectionnumber|sisbn</dcterms:bibliographicCitation>
DISpace DP format
Without subfields:
<dc:contributor>Capron, Henri</dc:contributor>
<dc:contributor>Cincera, Michele, 124468</dc:contributor>
<dc:title>Industry-university S&T transfer: Belgian evidence on CIS data</dc:title>
<dcterms:bibliographicCitation>Brussels economic review 46(3)</dcterms:bibliographicCitation>
<dcterms:issued>2003</dcterms:issued>
DISpace DP format
With subfields + definition of semantics:
<dc:contributor>|aCapron, Henri</dc:contributor>
<dc:contributor>|aCincera, Michele|=124468</dc:contributor>
<dc:title>|aIndustry-university S&T transfer|bBelgian evidence on CIS data</dc:title>
<dcterms:bibliographicCitation>|aBrussels economic review|v46|i3</dcterms:bibliographicCitation>
<dcterms:issued>|a2003|tULBPUB</dcterms:issued>
Further citation examples:
<dcterms:bibliographicCitation>|aBrussels economic review|v46|p122</dcterms:bibliographicCitation>
<dcterms:bibliographicCitation>|aCroissance et convergence économique des régions|bthéorie, faits et déterminants|hBeine, M|hDocquier, F.|cBruxelles|uDe Boeck Université|p345-384|s2804133435</dcterms:bibliographicCitation>
DISpace DP format
• generic solution:
  • all DISpace fields are subfielded
  • a subfield is denoted through the ‘|’ character followed by 1 character
  • every DISpace field has at least an ‘a’ subfield; this subfield doesn’t have to be explicitly entered in the DISpace field:
    • American journal of sociology|v12|i3|p123-345|d2004|s1234-5678
    • |aAmerican journal of sociology|v12|i3|p123-345|d2004|s1234-5678
  • every DISpace field can have a different list of valid subfields
  • all subfields can be repeated within a DISpace field
• DISpace internal record format can be extended:
  • new QDC fields can be defined (through new qualifiers)
  • additional subfields can be defined
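The subfield rules are compact enough to express as a small parser. The sketch below is an illustration of those rules only, not the DCValue.java code from DISpace; the class name `SubfieldParser` is hypothetical. It treats leading text without a marker as the implicit ‘a’ subfield and keeps repeated subfields in order. It assumes a literal ‘|’ never occurs inside a value, since the deck describes no escape mechanism.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical parser for the "QDC + subfields" notation: '|' + one code character + value. */
class SubfieldParser {
    static Map<Character, List<String>> parse(String field) {
        // Leading text without a marker belongs to the implicit 'a' subfield.
        String s = field.startsWith("|") ? field : "|a" + field;
        Map<Character, List<String>> out = new LinkedHashMap<>();
        int i = 0;
        while (i < s.length()) {
            // Position i always holds a '|'; the next character is the subfield code.
            if (i + 1 >= s.length()) {
                break; // dangling marker at the end of the field
            }
            char code = s.charAt(i + 1);
            int next = s.indexOf('|', i + 2);
            if (next < 0) {
                next = s.length();
            }
            // Repeated subfields accumulate in a list, in order of appearance.
            out.computeIfAbsent(code, k -> new ArrayList<>())
               .add(s.substring(i + 2, next).trim());
            i = next;
        }
        return out;
    }
}
```

Both spellings of the American journal of sociology example parse to the same map, and a citation with two ‘h’ subfields yields a two-element author list.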
DISpace Nereus crosswalk • We have: • DISpace « QDC+subfields » format • Nereus XML Schema + Nereus profile • Next step: • Map (crosswalk) between the two metadata formats
DISpace Nereus crosswalk
DISpace « QDC+subfields » record:
<dc:contributor>|aCapron,Henri</dc:contributor>
<dc:contributor>|aCincera,Michele|=124468</dc:contributor>
<dc:title>|aIndustry-university S&T transfer|bBelgian evidence on CIS data</dc:title>
<dcterms:bibliographicCitation>|aBrussels economic review|v46|i3</dcterms:bibliographicCitation>
<dcterms:issued>|a2003|tULBPUB</dcterms:issued>
Nereus exchange record:
<record>
<dc:contributor>Capron, Henri</dc:contributor>
<nereus:author id="124468">Cincera, Michele</nereus:author>
<dc:title>Industry-university S&T transfer: Belgian evidence on CIS data</dc:title>
<dcterms:bibliographicCitation> info:ofi/fmt:kev:mtx:ctx… &rft_val_fmt:journal &rft.jtitle=Brussels economic review &rft.volume=46 &rft.issue=3</dcterms:bibliographicCitation>
</record>
DISpace Nereus crosswalk
• oaicat.properties
Crosswalks.nereus_qdc=org.dspace.app.oai.NereusQdcCrosswalk
• NereusQdcCrosswalk.java
dispace_dc = item.getDC();
dispace_citation = dispace_dc.getCitation();
eo_bibcit.append("&rft.jtitle", dispace_citation.getSubfield('a'));
eo_bibcit.append("&rft.volume", dispace_citation.getSubfield('v'));
eo_bibcit.append("&rft.issue", dispace_citation.getSubfield('i'));
eo_bibcit.append("&rft.pages", dispace_citation.getSubfield('p'));
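The crosswalk fragment leans on DISpace classes that are not shown in the deck, so it cannot be run on its own. A self-contained approximation of the same mapping is sketched below. It assumes the citation has already been split into subfields, uses the hypothetical class name `CitationToKev`, and URL-encodes values, which the slide does not show. It builds only the four key/value pairs from the slide and does not attempt a complete ContextObject.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-alone version of the journal-citation part of the crosswalk. */
class CitationToKev {
    // Subfield code -> OpenURL key, mirroring the four append() calls on the slide.
    private static final Map<Character, String> KEYS = new LinkedHashMap<>();
    static {
        KEYS.put('a', "rft.jtitle");
        KEYS.put('v', "rft.volume");
        KEYS.put('i', "rft.issue");
        KEYS.put('p', "rft.pages");
    }

    static String toKev(Map<Character, String> subfields) {
        StringBuilder kev = new StringBuilder();
        for (Map.Entry<Character, String> e : KEYS.entrySet()) {
            String value = subfields.get(e.getKey());
            if (value == null || value.isEmpty()) {
                continue; // absent subfields produce no key at all
            }
            kev.append('&').append(e.getValue()).append('=')
               .append(URLEncoder.encode(value, StandardCharsets.UTF_8));
        }
        return kev.toString();
    }
}
```

Note that `URLEncoder` encodes spaces as `+`; whether the EO harvester expects `+` or `%20` is not stated in the deck. Keeping the mapping in a table makes it straightforward to add one table per document type, which is where the book-chapter and research-report cases would need their own keys.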
DISpace EO OAI set
• DSpace collections
  • all items in DSpace reside in at least one collection
  • items can reside in more than one collection
  • every DSpace collection == OAI set
• Problem:
  • EO items reside in different collections
  • all of these collections can contain non-EO items
• Solution:
  • create ‘virtual collection’ for EO items
  • copy appropriate items into EO collection
• appropriate items?
  • all items written by authors that participate in the EO project
  • <dc:contributor>|aCincera,Michele|=124468</dc:contributor>
  • obtain item through Lucene search on DAI of author
DISpace EO OAI set
[Diagram: numbered items in DISpace Collection 1, Collection 2 and Collection 3, marked EO or non-EO; an EO Collection; OAI; EO Harvester]
DISpace EO OAI set
• script: map-items
• Java program: ItemMapperManager.java
• configuration file: itemmapper.xml
<virtual-collection>
  <collection-handle>2013/2269</collection-handle>
  <collection-description>Economists Online</collection-description>
  <lucene-query>((author:124468))</lucene-query>
  <lucene-query>((author:341421))</lucene-query>
  <lucene-query>((author:562814))</lucene-query>
  <lucene-query>((author:410846))</lucene-query>
  <lucene-query>((author:649475))</lucene-query>
  <lucene-query>((author:1077459))</lucene-query>
</virtual-collection>
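The selection rule behind the virtual collection is simple: an item qualifies when at least one contributor carries the DAI of a participating author. The sketch below shows that rule without Lucene or DSpace, as an in-memory check; `EoItemSelector` and its methods are hypothetical names, and the real ItemMapperManager.java may work differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical in-memory version of the "is this an EO item?" decision. */
class EoItemSelector {
    /** Extracts the DAI from a contributor such as "|aCincera,Michele|=124468", or null. */
    static String dai(String contributor) {
        int at = contributor.indexOf("|=");
        if (at < 0) {
            return null;
        }
        int end = contributor.indexOf('|', at + 2);
        return contributor.substring(at + 2, end < 0 ? contributor.length() : end).trim();
    }

    /** True when at least one contributor carries a DAI from the EO author list. */
    static boolean belongsToEo(List<String> contributors, Set<String> eoDais) {
        for (String c : contributors) {
            String id = dai(c);
            if (id != null && eoDais.contains(id)) {
                return true;
            }
        }
        return false;
    }

    /** Builds one query string per author, in the form used in itemmapper.xml. */
    static List<String> luceneQueries(List<String> eoDais) {
        List<String> queries = new ArrayList<>();
        for (String d : eoDais) {
            queries.add("((author:" + d + "))");
        }
        return queries;
    }
}
```

Generating the query list from a maintained list of participating authors avoids hand-editing one `<lucene-query>` element per author.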
DISpace bulk ingest
• bulk upload and update of bibliographic metadata and object files in DISpace
  • researchers/departments already have their CV/publist in an MS-Excel or MS-Access database
  • bulk manipulation of bibliographic metadata
• load/update offline collection into a DISpace collection
• offline collection:
  • unique collection name
  • item: unique ‘offline’ item id
  • TAB delimited file (export from Excel or Access)
    • should respect specific structure
  • object files
    • file naming convention == f(offline item id)
    • directory on DISpace server
DISpace bulk ingest
Example of offline collection ‘mc’ with 3 items; 2 items with FT (full text)
Directory /mc/ contains: mc.txt, mc-0001.pdf, mc-0002.pdf
mc.txt (TAB-delimited; the columns of the first item are shown one per line):
mc-0001
Firms’ productivity growth and R&D spillovers|ban analysis of alternative technological proximity measures
Cincera, Michele|=649475
Article|bArticle dans une revue|cAvec comité de lecture
Economics of innovation and new technology|d2004|v14|i7|s10438599
en
ULB2BPUB
sciences économiques et de gestion
mc-0002 …
mc-0003 …
DISpace bulk ingest
• Perl script: dspace_upload.pl
• configuration file: config.xml
<collections>
  <collection>
    <name>mc</name>
    <id>2013/645</id>
  </collection>
</collections>
• Run script: perl dspace_upload.pl -n mc
  • adds/updates items mc-0001, mc-0002, mc-0003 in DISpace collection with handle 2013/645
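The two conventions that make bulk ingest work are the TAB-delimited line with the offline item id in the first column, and the object file name derived from that id. The sketch below illustrates both in Java rather than in the Perl of dspace_upload.pl. The class `OfflineItem` is hypothetical, the ".pdf" extension is inferred from the mc-0001.pdf example, and the meaning of the remaining columns is deliberately left uninterpreted because the deck does not spell out the required structure.

```java
/** Hypothetical representation of one line of an offline collection file. */
class OfflineItem {
    final String id;        // unique 'offline' item id, e.g. mc-0001
    final String[] fields;  // remaining columns, still in subfielded form

    OfflineItem(String id, String[] fields) {
        this.id = id;
        this.fields = fields;
    }

    /** Parses one TAB-delimited line; the first column is the offline item id. */
    static OfflineItem parseLine(String line) {
        String[] cols = line.split("\t", -1); // -1 keeps trailing empty columns
        if (cols[0].trim().isEmpty()) {
            throw new IllegalArgumentException("missing offline item id");
        }
        String[] rest = new String[cols.length - 1];
        System.arraycopy(cols, 1, rest, 0, rest.length);
        return new OfflineItem(cols[0].trim(), rest);
    }

    /** Object file name as a function of the offline item id. */
    String objectFileName() {
        return id + ".pdf";
    }

    /** Path below the collection directory, following the /mc/ layout of the example. */
    String objectFilePath(String collection) {
        return "/" + collection + "/" + objectFileName();
    }
}
```

Because the file name is a pure function of the id, an update run can match metadata lines to object files without any extra mapping table; an item without a matching file simply has no full text.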
EO DISpace starter kit
• Find an IT person with Java experience
• Define (sub-)communities + collections
• Define document types and data dictionary
  • datadict.xml
  • doctypes.xml
• Install DISpace Java software
  • Load and manipulate datadict and doctypes:
    • LoadDSpaceConfig
    • DataDict.java
    • DocTypes.java
  • Manipulate subfields:
    • DCValue.java
  • Lucene indexing:
    • Item.java
  • WebUI rendering:
    • ItemTag.java
    • ItemListTag.java
  • [ Bulk ingest – Offlinecollections ]
  • Itemmapper
  • Nereus OAI crosswalk
All DISpace source code and documentation is available on http://www.bib.ulb.ac.be/RDIB/DISpace/technical.html
DISpace bitstream features EO exchange metadata format = Nereus QDC
DISpace bitstream features
EO exchange metadata format = Nereus QDC
Problems (for the SP):
• <dc:identifier> : ambiguous way of pointing to object file(s)
• Service at the mercy of each data provider’s capacity to come up with a comprehensive jump-off page
• Redundancy of screens (same metadata)
• Non-localized links in data provider’s jump-off page
DISpace bitstream features EO exchange metadata format = MPEG21/DIDL
DISpace bitstream features
• MPEG21/DIDL
  • Digital Item Declaration Language
  • language for describing complex objects
    • bibliographical metadata
    • object files metadata:
      • Location
      • Format/Size
      • DRM
      • Description of content
DISpace bitstream features
EO exchange metadata format = Nereus QDC
<record>
  <nereus:author id="ulb.ac.be.649475">Cincera, Michele</nereus:author>
  <dc:title>Brain Drain and MNEs</dc:title>
  <dc:type xsi:type="NEREUSType">Part of book – chapter</dc:type>
  <dc:identifier>http://bib17.ulb.ac.be:8080/dspace/handle/2013/2801</dc:identifier>
</record>
DISpace bitstream features
EO exchange metadata format = MPEG21/DIDL
<record>
  <container> <!-- for bibliographic metadata -->
    <nereus:author id="ulb.ac.be.649475">Cincera, Michele</nereus:author>
    …
    <dc:identifier>http://bib17.ulb.ac.be:8080/dspace/handle/2013/2801</dc:identifier>
  </container>
  <container> <!-- for object file metadata -->
    <objfile>
      <location>http://bib17.ulb.ac.be:8080/dspace/bitstream/2013/2801/1/text.pdf</location>
      <description>Full document</description>
      <format>application/pdf</format>
    </objfile>
  </container>
</record>
DISpace bitstream features
• Solution in DISpace:
  • We need a possibility to define multiple characteristics of a bitstream (object file)
  • DSpace comes with one « description » metadata field per bitstream
  • DISpace:
    • add granularity to this field through the introduction of subfields + semantics
    • per document type
    • currently: 3 features
      • subfield a -- general description of content of bitstream
      • subfield s -- accessibility: { ULBINTERNET | ULBINTRANET | ULBINVISIBLE }
      • subfield v -- versioning: { ULBPREPRINT | ULBPOSTPRINT | ULBPUBPRINT }
    • example: |aChapter1|sULBINTRANET|vULBPUBPRINT
    • XML config file: bsfeatures.xml
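Since accessibility and versioning are closed vocabularies, a bitstream description can be checked before it is stored or exposed. The sketch below hard-codes the two value lists from the slide for illustration; in DISpace they would presumably come from bsfeatures.xml, whose structure is not shown in the deck. The class name `BitstreamFeatures` is hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical validator for the subfielded bitstream « description » field. */
class BitstreamFeatures {
    static final Set<String> ACCESS =
        new HashSet<>(Arrays.asList("ULBINTERNET", "ULBINTRANET", "ULBINVISIBLE"));
    static final Set<String> VERSION =
        new HashSet<>(Arrays.asList("ULBPREPRINT", "ULBPOSTPRINT", "ULBPUBPRINT"));

    /** Splits the description into features and rejects unknown 's' or 'v' values. */
    static Map<Character, String> parse(String description) {
        // As for other DISpace fields, text without a marker is the implicit 'a' subfield.
        String s = description.startsWith("|") ? description : "|a" + description;
        Map<Character, String> features = new LinkedHashMap<>();
        for (String part : s.split("\\|")) {
            if (part.isEmpty()) {
                continue; // the split yields an empty string before the first '|'
            }
            char code = part.charAt(0);
            String value = part.substring(1).trim();
            if (code == 's' && !ACCESS.contains(value)) {
                throw new IllegalArgumentException("unknown accessibility: " + value);
            }
            if (code == 'v' && !VERSION.contains(value)) {
                throw new IllegalArgumentException("unknown version: " + value);
            }
            features.put(code, value);
        }
        return features;
    }

    static boolean isValid(String description) {
        try {
            parse(description);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }
}
```

The parsed features are what an exchange format with per-file metadata would draw on, for instance to fill the description of an object file or to leave out files that are not meant to be visible.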
DISpace submission interfaces
Need to reflect granularity and semantics of metadata fields and subfields in the DISpace submission interfaces:
• Ease of submission process for the researcher self-archiver
• Guarantee quality of metadata
[Screenshot label: Journal title]
DISpace submission interfaces
• Submission interface is completely configurable through XML configuration file « doctemplates.xml »:
  • submission template per document type (15)
  • mandatory | optional
  • repeatable | non-repeatable
  • instructions / information texts
  • mapping with fields / subfields in database
  • define « validators / processors » per input element
  • types of input elements:
    • text / textarea / select / constant / helper
    • size of text(area) boxes / list of valid values / value of constant
DISpace submission interfaces
<dspace-doctemplate>
  <internal_doctype>article</internal_doctype>
  <input_areas>
    <input_area name="title">
      <status>M</status>
      <help_txt>Enter the main title of the item.</help_txt>
      <repeatable>
        <yn>N</yn>
        …
      </repeatable>
      <map element="title" qualifier=""/>
      <inputelems>
        <inputelem>
          <id>maintitle</id>
          <status>M</status>
          <map subfield="a" />
          <input_type>text</input_type>
          <hint_text><i>Main title</i></hint_text>
          <text_size>90</text_size>
          <text_max_size>90</text_max_size>
        </inputelem>
        <inputelem>
          <id>subtitle</id>
          <status>O</status>
          <map subfield="b" />
          <input_type>text</input_type>
          …
        </inputelem>
        …
DISpace submission interfaces
<input_area name="publicationstatus">
  <map element="date" qualifier="issued"/>
  <inputelems>
    <inputelem>
      <id>pubstatus</id>
      <map subfield="t" />
      <input_type>select</input_type>
      <select_vlist>-1,ULB4PUB,ULB2BPUB,ULBPUB</select_vlist>
      <select_dlist>,Submitted for publication,To be published, Published </select_dlist>
    </inputelem>
    …
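The select element relies on two parallel comma-separated lists: stored values in select_vlist and display labels in select_dlist. A configuration loader has to pair them by position and should fail loudly when the lengths differ. The sketch below shows that pairing; `SelectOptions` is a hypothetical name, and treating "-1" with an empty label as the "nothing selected" entry is an assumption based on the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper pairing select_vlist values with select_dlist labels. */
class SelectOptions {
    static Map<String, String> pair(String vlist, String dlist) {
        // limit -1 keeps empty entries, such as the empty label of the first option
        String[] values = vlist.split(",", -1);
        String[] labels = dlist.split(",", -1);
        if (values.length != labels.length) {
            throw new IllegalArgumentException(
                "select_vlist has " + values.length + " entries, select_dlist has " + labels.length);
        }
        // LinkedHashMap preserves the order in which options appear in the drop-down.
        Map<String, String> options = new LinkedHashMap<>();
        for (int i = 0; i < values.length; i++) {
            options.put(values[i].trim(), labels[i].trim());
        }
        return options;
    }
}
```

Applied to the publication-status lists, the value ULB2BPUB maps to the label "To be published", and the stored value ends up in subfield t of the issued date.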
DISpace submission interfaces
• « helper » input elements
<inputelem>
  <id>fullname</id>
  <status>M</status>
  <input_type>helper</input_type>
  <helper>
    <name>inputAuthors</name>
    <datas>
      <data><id>last</id> <map subfield="a" /> </data>
      <data><id>uniqueid</id> <map subfield="=" /> </data>
    </datas>
    …
DISpace submission interfaces
« helper » input elements
• dedicated JSP
• generates vocabulary-controlled, well-formed metadata in the submission interface
• experimental: AJAX technology (DWR Java open source library)
• currently:
  • author/DAI lookup
    • based on list of potential ULB authors: name + local unique identifier (DAI)
    • non-ULB co-authors are added to this list by submitter
  • journal/ISSN lookup
    • based on official ISSN database
    • plans:
      • extend with Sherpa/Romeo information
      • use Z39.50 access
• planned:
  • date helper
  • ontology helper