Subject Repositories European collaboration in the international context 28-29 January 2010. Workshop Technical infrastructure & interoperability Benoit Pauwels Université Libre de Bruxelles, Belgium. Workshop plan. Theme 1: The Economists Online network of data providers
Subject Repositories: European collaboration in the international context, 28-29 January 2010 • Workshop: Technical infrastructure & interoperability • Benoit Pauwels, Université Libre de Bruxelles, Belgium
Workshop plan • Theme 1: The Economists Online network of data providers • General infrastructure of the EO solution • DIDL/MODS: the EO metadata exchange format • RDF/XML Admin file: decentralized administration • Enrichment of metadata • Theme 2: Economists Online and RePEc • Pulling metadata from RePEc • Pushing metadata to RePEc • Contribute to LogEc • Use CitEc
Workshop plan • Theme (45’) • Introduction (BP, 20’) • 3 topics for brainstorming (breakout groups,10’) • Breakout groups reporting back (all, 15’)
The Economists Online network of data providers • Theme 1: The Economists Online network of data providers • General infrastructure of the EO solution • DIDL/MODS: the EO metadata exchange format • RDF/XML Admin file: decentralized administration • Enrichment of metadata
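[Architecture diagram: general infrastructure of the EO solution. Labels, in order: Metadata, Logs, Objects • OAI-PMH, HTTP • Meresco: Harvester, Crawler • Metadata • Lucene • SRU • RePEc (OAI-PMH) • RSS • EO portal (home-made, FOSS) • Exporter engine (home-made, FOSS) • Other portals]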
[Architecture diagram, annotated with exchange formats. Labels, in order: Metadata, Logs, Objects • OAI-PMH, HTTP • Metadata exchange format: DIDL / MODS, NEEO specs • Meresco: Harvester, Crawler • Metadata • Usage metadata exchange format: SWUP, OFI, Comm Profile • Lucene • SRU • RePEc (OAI-PMH) • RSS • EO portal (home-made, FOSS) • Exporter engine (home-made, FOSS) • Other portals]
Metadata exchange format • XML container structure that can hold semantically distinct metadata • descriptive metadata • object files (by-ref) • splash page • enriched metadata • JEL • full text (by-ref) • datasets (by-ref) • [ references ] • RePEc handle and metadata (by-ref) • DIDL • Based on existing container structure defined by SurfShare • “info:eu-repo” vocabularies (objectfileaccessRights, version, ...)
Metadata exchange format • Granular descriptive metadata • MODS (3.2) • Based on existing metadata structure defined by SurfShare • “info:eu-repo” vocabularies (publication type, • Unambiguous identification of authors • DAI – Digital Author Identifier • National or institution-unique persistent identifier • Solutions not specific to the NEEO project; continuous aim of standardization at a level that surpasses the project
EO data model • A publication is described as a complex (compound) object • persistent identifier • Aggregation of 3 types of components • descriptiveMetadata (MODS) • objectFiles • humanStartPage • Extensible • additional items can be stored within the complex object • MODS • contains the Digital Author Identifier (DAI) of the EO author
DIDL[1]
  Item[1]
    Descriptor/Identifier (persistent identifier)
    Descriptor/modified
    Item[1..∞] (of type descriptiveMetadata)
      Descriptor/type (« descriptiveMetadata »)
      Descriptor/Identifier (persistent identifier)
      Descriptor/modified
      Component/Resource -- representation by value (XML)
    Item[0..∞] (of type objectFile)
      Descriptor/type (« objectFile »)
      Descriptor/Identifier (persistent identifier)
      Descriptor/modified
      Component/Resource -- representation by ref. (URL)
    Item[0..1] (of type humanStartPage)
      Descriptor/type (« humanStartPage »)
      Component/Resource -- representation by ref. (URL)
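The cardinalities in the data model above can be written down as a small check. This is only a sketch: the class names `Item` and `Publication` and the method `problems` are hypothetical and not part of the NEEO specifications, and the real exchange format is DIDL XML rather than Python objects.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Item:
    """One component of the compound object (hypothetical class name)."""
    item_type: str                     # "descriptiveMetadata", "objectFile", "humanStartPage", ...
    identifier: Optional[str] = None   # persistent identifier, when the item carries one
    modified: Optional[str] = None     # last-modified date
    by_value: Optional[str] = None     # inline XML, e.g. a MODS record
    by_ref: Optional[str] = None       # URL of the resource


@dataclass
class Publication:
    """The top-level Item[1] of a DIDL document."""
    identifier: str
    modified: str
    items: List[Item] = field(default_factory=list)

    def problems(self) -> List[str]:
        """List violations of the cardinalities shown on the slide."""
        def count(item_type: str) -> int:
            return sum(1 for i in self.items if i.item_type == item_type)

        found = []
        if count("descriptiveMetadata") < 1:
            found.append("at least one descriptiveMetadata item is required")
        if count("humanStartPage") > 1:
            found.append("at most one humanStartPage item is allowed")
        for i in self.items:
            if i.item_type == "descriptiveMetadata" and i.by_value is None:
                found.append("descriptiveMetadata must be represented by value")
            if i.item_type in ("objectFile", "humanStartPage") and i.by_ref is None:
                found.append(f"{i.item_type} must be represented by reference")
        return found


pub = Publication(
    identifier="urn:example:publication-1",   # made-up identifier
    modified="2010-01-28",
    items=[
        Item("descriptiveMetadata", by_value="<mods/>"),
        Item("objectFile", by_ref="http://repository.example.org/file.pdf"),
    ],
)
print(pub.problems())
```

Unknown item types are accepted without complaint, which mirrors the "extensible" bullet: additional items can sit inside the same compound object.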
Metadata exchange format • Implementations in NEEO • DIDL application profile • MODS application profile • Vocabularies in DIDL and MODS • Technical guidelines for project partners • Solutions: home-made or with external support • ARNO: home-made • DSpace: home-made, AtMire • EPrints: home-made, ECS-University of Southampton • Fedora: METS/MODS -> DIDL/MODS • DigiTool: METS/MARC -> DIDL/MODS
Decentralized registry service • RDF/XML file • FOAF + NEEO-specific vocabulary • maintained by each data provider on a local web server • information about the institution: name, description, ... • OAI baseURL + OAI sets to harvest • EO authors: photograph, full name, affiliation, DAI • Fetched over HTTP GET and validated by the EO Gateway at regular intervals • Automated harvesting process • Made visible through the portal • New partner • Create admin file • Ask for registration at economistsonline@uvt.nl, declaring the location of the admin file and validating it • If valid, you’re in
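A minimal sketch of the kind of check the gateway might run on an admin file. The FOAF and RDF namespaces are the standard ones, but the `neeo:` namespace URI, the element names `neeo:oaiBaseURL`, `neeo:oaiSet` and `neeo:dai`, and the sample values are all placeholders invented for this illustration; the actual NEEO vocabulary is not reproduced on the slide.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "neeo": "http://example.org/neeo#",   # hypothetical namespace URI
}

# Made-up admin file, standing in for what a partner would publish.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:foaf="http://xmlns.com/foaf/0.1/"
         xmlns:neeo="http://example.org/neeo#">
  <foaf:Organization>
    <foaf:name>Example University</foaf:name>
    <neeo:oaiBaseURL>http://repository.example.org/oai</neeo:oaiBaseURL>
    <neeo:oaiSet>economics</neeo:oaiSet>
    <foaf:member>
      <foaf:Person>
        <foaf:name>Jane Doe</foaf:name>
        <neeo:dai>example-dai-0001</neeo:dai>
      </foaf:Person>
    </foaf:member>
  </foaf:Organization>
</rdf:RDF>"""


def validate_admin(xml_text):
    """Return a list of problems; an empty list means the file passed."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return [f"not well-formed XML: {exc}"]
    org = root.find("foaf:Organization", NS)
    if org is None:
        return ["no foaf:Organization element"]
    errors = []
    if not org.findtext("foaf:name", "", NS).strip():
        errors.append("institution name is missing")
    base_url = org.findtext("neeo:oaiBaseURL", "", NS).strip()
    if not base_url.startswith(("http://", "https://")):
        errors.append("OAI baseURL is missing or not an HTTP URL")
    for person in org.findall("foaf:member/foaf:Person", NS):
        name = person.findtext("foaf:name", "", NS).strip()
        if not person.findtext("neeo:dai", "", NS).strip():
            errors.append(f"author {name or '?'} has no DAI")
    return errors


print(validate_admin(SAMPLE))
```

In a deployment the text would come from an HTTP GET on the partner's registered location; the sketch takes a string so that it runs without network access.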
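[Architecture diagram, extended with the enrichment service. Labels, in order: Metadata, Logs, Objects • OAI-PMH, HTTP • Meresco: Harvester, Crawler • Enrichment service (OAI-PMH, SRU) • Metadata • Lucene • SRU • RePEc (OAI-PMH) • RSS/Atom • EO portal (home-made, FOSS) • Exporter engine (home-made, FOSS) • Other portals]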
Metadata enrichment • “Automated” enrichment: JEL, full text • ES (enrichment service) gets records to be enriched from EO, over SRU • Based on date of request for enrichment of certain type and version • Based on flag set in EO record • ES creates enrichment record(s) • ES makes enrichment records available to EO, over OAI-PMH • EO harvests enrichment records from ES and integrates them into the original record • EO reuses enrichment information in its services: index & present • “Manual” enrichment: datasets • Partner enters permalink of publication on DVN platform • EO harvests DDI from DVN over OAI-PMH, and stores by-ref information
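The two protocol legs of the automated loop can be sketched as request URLs. The parameter names (`operation`, `version`, `query`, `maximumRecords` for SRU; `verb`, `metadataPrefix`, `from`, `set` for OAI-PMH) come from the respective protocol specifications. The base URLs, the CQL index `enrich.flag` and the metadata prefix `enrichment` are assumptions made up for this example, not values used by EO.

```python
from urllib.parse import urlencode


def sru_search_url(base_url, cql_query, max_records=50):
    """Build an SRU searchRetrieve request (ES asking EO for records to enrich)."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.1",
        "query": cql_query,
        "maximumRecords": str(max_records),
    }
    return base_url + "?" + urlencode(params)


def oai_list_records_url(base_url, metadata_prefix, from_date=None, set_spec=None):
    """Build an OAI-PMH ListRecords request (EO harvesting enrichment records from ES)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date   # selective harvesting by datestamp
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


# Hypothetical endpoints and query, for illustration only.
print(sru_search_url("http://eo.example.org/sru", 'enrich.flag = "jel"'))
print(oai_list_records_url("http://es.example.org/oai", "enrichment", from_date="2010-01-01"))
```

A full harvester would also follow the `resumptionToken` returned by OAI-PMH for large result sets; that part is left out here.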
[Diagram: Enriched publication, with an IR / ES side and an EO side; labelled “LinkedData / SemanticWeb / ORE ready”. Resources shown on the IR / ES side: PDF, HTML, TXT, Dataset (DDI), Review. Structure shown on the EO side: DIDL[1] • Item[1] • Descriptor/Identifier (persistent identifier) • Descriptor/modified • Item[1..∞] (of type descriptiveMetadata) • Item[0..∞] (of type objectFile) • Item[0..1] (of type humanStartPage) • Item[0..∞] (of type text) • Item[0..∞] (of type enrichedMetadata) • Item[0..∞] (of type dataset) • Item[0..∞] (of type review), with its own Descriptor/Identifier (persistent identifier), Descriptor/modified, Item[1..∞] (of type descriptiveMetadata) and Item[0..∞] (of type objectFile)]
Theme 1: The Economists Online network of data providers • BO Group 1: DIDL/MODS • Scalable? Implementation by 100s of partners • Local experiences from existing partners: implementation issues you want to share? • Can this become a standard for exchange of metadata of IR contained publications? Where does this stand next to (flavours of) DC, SWAP,...? • BO Group 2: XML Admin file • Scalable? Implementation by 100s of partners • Local experiences from existing partners: implementation issues you want to share? • DAI? • BO Group 3: Enrichment model • Extensibility: vocabulary for semantics of components • Manual enrichment: need for enriched submission form, making it easy for people to make enriched publications • Automated (JEL, full text): sustainable?
Workshop plan • Theme 2: Economists Online and RePEc • Pulling metadata from RePEc • Pushing metadata to RePEc • Contribute to LogEc • Use CitEc
RePEc model • RePEc archives contain RePEc series, which contain Working papers, Articles, Books, Book chapters, Software • Manually maintained by research centres, journal publishers, university departments all over the world • +/- 900 archives, more than 4000 series • ReDIF metadata format • Network accessible over FTP or HTTP • Aggregation by RePEc services: • EconPapers • IDEAS • Central PMH-accessible aggregated archive of AMF-formatted metadata
RePEc model • Template-type: ReDIF-Paper 1.0 • Author-Name: Capron, Henri • Author-Email: hcapron@ulb.ac.be • Author-Name: Meeusen, Wim • Author-Email: wim.meeusen@ua.ac.be • Author-Name: Dumont, Michel • Author-Person: pdu51 • Author-Name: Cincera, Michele • Author-Person: pci5 • Title: National innovation systems: pilot study of the Belgian innovation system • Creation-Date: 1998 • Publication-Status: Published as a report for the Belgian Federal Office for Scientific, Technical and Cultural Affairs (OSTC) • File-URL: http://bib17.ulb.ac.be:8080/dspace/bitstream/2013/941/1/mc-0048.pdf • File-Format: application/pdf • Handle: RePEc:dul:ecoulb:2013-941
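A ReDIF template like the one above is a sequence of `Field: value` lines, which makes it easy to read programmatically. The sketch below is a simplified reader, not a complete ReDIF parser: it assumes one field per line and that `Author-*` fields belong to the most recent `Author-Name`, as the sample suggests. The function name `parse_redif` is made up for this example.

```python
def parse_redif(text):
    """Parse a single ReDIF template into a dict (simplified sketch)."""
    record = {"authors": []}
    for line in text.splitlines():
        key, sep, value = line.partition(":")   # split on the first colon only
        key, value = key.strip(), value.strip()
        if not sep or not key:
            continue                            # skip blank or malformed lines
        if key == "Author-Name":
            record["authors"].append({"Author-Name": value})
        elif key.startswith("Author-") and record["authors"]:
            record["authors"][-1][key] = value  # attach to the preceding author
        else:
            record.setdefault(key, []).append(value)
    return record


sample = """Template-type: ReDIF-Paper 1.0
Author-Name: Capron, Henri
Author-Email: hcapron@ulb.ac.be
Author-Name: Dumont, Michel
Author-Person: pdu51
Title: National innovation systems: pilot study of the Belgian innovation system
Creation-Date: 1998
File-URL: http://bib17.ulb.ac.be:8080/dspace/bitstream/2013/941/1/mc-0048.pdf
File-Format: application/pdf
Handle: RePEc:dul:ecoulb:2013-941"""

record = parse_redif(sample)
print(record["Handle"], [a["Author-Name"] for a in record["authors"]])
```

Splitting on the first colon only matters here: both the title and the handle contain further colons that belong to the value.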
RePEc model compared to IR model • Very similar • BUT • RePEc model: • Harvests only from “official” publisher repositories • Therefore: a work exists once in RePEc, and that record is guaranteed to be the one and only “official” manifestation of the work • IR model: • holds publications for which the institution is typically not the publisher • 1 work: 1 official manifestation + multiple author manifestations • one work can exist in: • one or more repositories • as different publication types • with different descriptive metadata • with different object files attached • with different object file metadata • Pushing and pulling metadata records from RePEc and IRs into one system is bound to raise problems
Pull metadata from RePEc • EO harvests AMF-formatted metadata records from http://oai.repec.openlib.org/ • Overlap! • The same records are harvested from both the IR and RePEc • Solution: • XML Admin file contains the directive <not-from-repec-series> • It specifies which RePEc series need not be harvested from RePEc, because they are already delivered through the IR • BUT: • The IR contains articles produced by its authors • These articles are also contained in a journal's RePEc series • Overlap in EO cannot be avoided
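The effect of `<not-from-repec-series>` can be illustrated as a filter on harvested handles. This sketch assumes the directive lists series in the form `RePEc:archive:series` and that a record's series is the first three colon-separated parts of its handle; the slide does not specify the directive's value syntax, and the function names are invented here.

```python
def series_of(handle):
    """Return the series part of a RePEc handle, e.g. 'RePEc:dul:ecoulb'."""
    parts = handle.split(":")
    if len(parts) < 4 or parts[0] != "RePEc":
        raise ValueError(f"not a RePEc paper handle: {handle!r}")
    return ":".join(parts[:3])


def keep_from_repec(handles, excluded_series):
    """Drop handles whose series a partner already delivers through its IR."""
    excluded = set(excluded_series)
    return [h for h in handles if series_of(h) not in excluded]


harvested = [
    "RePEc:dul:ecoulb:2013-941",      # working paper series, also in the IR
    "RePEc:aah:aarhec:1987-21",       # series of another archive
]
print(keep_from_repec(harvested, ["RePEc:dul:ecoulb"]))
```

As the slide notes, this only removes overlap at series level: an article held in an IR and also present in a journal's RePEc series still arrives twice, because the journal series is not one the IR can exclude.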
Push metadata to RePEc • EO sets up “RePEc:ner” archive, containing ReDIF-X formatted records • ReDIF-X • All records are delivered as “ReDIF-Paper”, but with extra fields denoting the “real” publication status and version of text • Overlap !! • Most institutions already maintain RePEc series: these records must not be pushed by EO • XML Admin file controls which series to feed in this “ner” archive • <feed-repec> • boolean: to feed or not to feed • <feed-repec-series> • If not given: all records with fulltext that are not working papers are mapped to one series for that institution • RePEc series OAI setspec of DIDL/MODS record • BUT • IR inherent problem of multiple copies/versions is pushed to RePEc
Push metadata to RePEc: ReDIF-X
Template-type: ReDIF-Paper 1.0
Title: Block investments and the race for corporate control in Belgium
Author-Name: Chapelle, Ariane
Language: en
Note: info:eu-repo/semantics/published
X-PublishedAs-Type: article
X-PublishedAs-Article-Year: 2004
X-PublishedAs-Article-Journal: Corporate Ownership & Control
X-PublishedAs-Article-Volume: 2
X-PublishedAs-Article-Issue: 1
Order-URL: http://dipot.ulb.ac.be:8080/dspace/handle/2013/9943
File-URL: http://dipot.ulb.ac.be:8080/dspace/bitstream/2013/9943/1/ac-0007.pdf
File-Format: application/pdf
File-Version: authorVersion
Handle: RePEc:ulb:ecoulb:2013/9943
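A sketch of how an exporter could serialize a record into this ReDIF-X shape. The field names are taken from the example above; the input dictionary layout and the function name `to_redif_x` are hypothetical, and the real EO exporter engine may work quite differently.

```python
def to_redif_x(rec):
    """Render a record dict as ReDIF-X text: always ReDIF-Paper, plus X- fields."""
    pairs = [("Template-type", "ReDIF-Paper 1.0"), ("Title", rec["title"])]
    pairs += [("Author-Name", name) for name in rec["authors"]]
    pairs.append(("Note", rec["status"]))            # "real" publication status
    for key, value in rec["published_as"].items():   # "real" publication type and details
        pairs.append((f"X-PublishedAs-{key}", value))
    pairs += [
        ("File-URL", rec["file_url"]),
        ("File-Format", rec["file_format"]),
        ("File-Version", rec["file_version"]),       # version of the text
        ("Handle", rec["handle"]),
    ]
    return "\n".join(f"{key}: {value}" for key, value in pairs)


example = {
    "title": "Block investments and the race for corporate control in Belgium",
    "authors": ["Chapelle, Ariane"],
    "status": "info:eu-repo/semantics/published",
    "published_as": {"Type": "article", "Article-Year": "2004"},
    "file_url": "http://dipot.ulb.ac.be:8080/dspace/bitstream/2013/9943/1/ac-0007.pdf",
    "file_format": "application/pdf",
    "file_version": "authorVersion",
    "handle": "RePEc:ulb:ecoulb:2013/9943",
}
print(to_redif_x(example))
```

Keeping the template type fixed and moving the actual publication type into `X-PublishedAs-*` fields is the point of the format: every record is delivered as a paper, while the extra fields say what it really is.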
LogEc • Aim: track abstract views and download clicks of publications presented through RePEc services (EconPapers, IDEAS, ... Economists Online) • NOT: tracking of usage at the level of the archives • Downloads of publications contained in RePEc archives that a user reaches through Google do not show up in LogEc • How: • EO logs abstract views and download clicks of object files • On a monthly basis, EO transforms these log entries into the requested LogEc format, using “rstat.pl” • 2009-10 EconomistsOnline RePEc:aah:aarhec:1987-21 a: 65.55.207.69 66.235.124.10 d: 66.235.124.10 • RePEc handle of publication is necessary • EO partners delivering content to RePEc directly (and that EO therefore doesn’t harvest from RePEc but from the IR) must include the RePEc handle in the DIDL/MODS record
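The monthly transformation can be pictured as grouping click events by RePEc handle. This is not the `rstat.pl` script, and the line layout below (a header with month and service, then per handle an `a:` line for abstract views and a `d:` line for downloads) is a guess based on the flattened sample above, not a documented LogEc format.

```python
from collections import defaultdict


def to_logec(month, service, events):
    """Group (handle, kind, ip) events per handle; kind is 'a' or 'd'."""
    per_handle = defaultdict(lambda: {"a": [], "d": []})
    for handle, kind, ip in events:
        if kind not in ("a", "d"):
            raise ValueError(f"unknown event kind: {kind!r}")
        per_handle[handle][kind].append(ip)
    lines = [f"{month} {service}"]
    for handle in sorted(per_handle):
        lines.append(handle)
        for kind in ("a", "d"):
            ips = per_handle[handle][kind]
            if ips:                       # omit empty a:/d: lines
                lines.append(f"{kind}: " + " ".join(ips))
    return "\n".join(lines)


events = [
    ("RePEc:aah:aarhec:1987-21", "a", "65.55.207.69"),
    ("RePEc:aah:aarhec:1987-21", "a", "66.235.124.10"),
    ("RePEc:aah:aarhec:1987-21", "d", "66.235.124.10"),
]
print(to_logec("2009-10", "EconomistsOnline", events))
```

The sketch keys everything on the RePEc handle, which is why records harvested from an IR need to carry that handle in their DIDL/MODS record: without it, their clicks cannot be attributed.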
[Diagram: LogEc, linking the EO record to RePEc. The EO DIDL record holds the RePEc handle and the RePEc (AMF) metadata by reference.]
DIDL[1]
  Item[1]
    Descriptor/Identifier (persistent identifier)
    Descriptor/modified
    Item[1..∞] (of type descriptiveMetadata)
    Item[0..∞] (of type objectFile)
    Item[0..1] (of type humanStartPage)
    Item[0..∞] (of type descriptiveMetadata): RePEc (AMF metadata), byRef
      RePEc handle
      Descriptor/modified
CitEc • Aim: citation analysis for RePEc publications • How: • Analyze text: extract and parse the list of references from publications • Each reference is checked for availability in RePEc • Cites: • references to other RePEc publications • Textual references • CitedBy • Co-citations • EO publications (from our IRs) are pushed to RePEc and are therefore pulled through the CitEc processing • EO has access to the resulting CitEc data and will present it through the EO portal (not yet available, planned for Feb 2010) • RePEc handle of publication is necessary • EO partners delivering content to RePEc directly (and that EO therefore doesn’t harvest from RePEc but from the IR) must include the RePEc handle in the DIDL/MODS record
Theme 2: Economists Online and RePEc • BO Group 1: Push/pull to/from RePEc • ReDIF-X data structure • Duplicates; different versions of identical publication • BO Group 2: Publishing models • Advantages/disadvantages of RePEc publishing model as opposed to IR publishing model • Push the two models together? Do we need to foresee specific services in the gateway or portal to make these two live together in peace? • BO Group 3: Future RePEc/EO services • What services should EO and RePEc jointly be looking at in the future in the interest of the economics researcher?