NEEO project, EC Final review meeting
Gateway and portal
23 March 2010
Benoit Pauwels, Université Libre de Bruxelles, Belgium
Plan
• Overview of technical infrastructure
• EO as a network of data providers – descriptive metadata
• EO as a network of data providers – usage statistics
• Added value services
  • Publication lists
  • Enriched metadata
  • Full-text searching
  • Multilinguality
• Collaboration with RePEc
• EO gateway and portal
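[Diagram: overview of technical infrastructure. Labels shown: data provider outputs (Metadata, Logs, Objects); protocols and formats (OAI-PMH, HTTP, DIDL/MODS, SWUP, SRU, RSS/Atom); gateway components (Meresco, Harvester, Crawler, Enrichment service, Lucene); consumers (RePEc, EO portal, Exporter engine, other portals); some components are marked "Homemade - FOSS".]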
Descriptive metadata exchange format
• DIDL – XML container structure that can hold semantically distinct metadata
  • Descriptive metadata, object files (by reference), splash page, enriched metadata
  • Based on existing container structure defined by SurfShare
• MODS (3.2) – granular descriptive metadata
  • Based on existing metadata structure defined by SurfShare
• DAI – unambiguous identification of authors
  • National or institution-unique persistent identifier
• Continuous aim of standardization at a level that surpasses the NEEO project
  • NEEO adaptations fed back to SurfShare
DIDL[1]
  Item[1]
    Descriptor/Identifier (persistent identifier)
    Descriptor/modified
    Item[1..∞] (of type descriptiveMetadata)
      Descriptor/type (« descriptiveMetadata »)
      Descriptor/Identifier (persistent identifier)
      Descriptor/modified
      Component/Resource -- representation by value (XML)
    Item[0..∞] (of type objectFile)
      Descriptor/type (« objectFile »)
      Descriptor/Identifier (persistent identifier)
      Descriptor/modified
      Component/Resource -- representation by ref. (URL)
    Item[0..1] (of type humanStartPage)
      Descriptor/type (« humanStartPage »)
      Component/Resource -- representation by ref. (URL)
• EO descriptive metadata model
  • Publication is described as a complex (compound) object
    • persistent identifier
  • Aggregation of 3 types of components
    • descriptiveMetadata (MODS)
    • objectFiles
    • humanStartPage
  • Extensible
    • additional items can be stored within the complex object
  • MODS contains DAI of EO author
  • Semantic Web - Linked Data – OAI-ORE ready
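The nesting above can be sketched in a few lines of Python. This is only an illustration of the compound-object idea, not the NEEO application profile: namespaces are omitted, element names are simplified from the outline, and the identifiers, URLs and the `build_publication` / `item_types` helpers are made up for the example.

```python
import xml.etree.ElementTree as ET


def add_descriptor(item, name, value):
    """Attach <Descriptor><name>value</name></Descriptor> to an Item."""
    descriptor = ET.SubElement(item, "Descriptor")
    ET.SubElement(descriptor, name).text = value


def build_publication(pid, modified, mods_xml, file_urls, start_page=None):
    """Build a simplified, namespace-free DIDL-like compound object."""
    didl = ET.Element("DIDL")
    top = ET.SubElement(didl, "Item")
    add_descriptor(top, "Identifier", pid)
    add_descriptor(top, "modified", modified)

    # descriptiveMetadata: the MODS record is embedded by value (XML).
    meta = ET.SubElement(top, "Item")
    add_descriptor(meta, "type", "descriptiveMetadata")
    add_descriptor(meta, "Identifier", pid + "#mods")  # hypothetical id scheme
    add_descriptor(meta, "modified", modified)
    resource = ET.SubElement(ET.SubElement(meta, "Component"), "Resource")
    resource.append(ET.fromstring(mods_xml))

    # objectFile: zero or more files, each referenced by URL.
    for number, url in enumerate(file_urls, 1):
        obj = ET.SubElement(top, "Item")
        add_descriptor(obj, "type", "objectFile")
        add_descriptor(obj, "Identifier", f"{pid}#file{number}")
        add_descriptor(obj, "modified", modified)
        ET.SubElement(ET.SubElement(obj, "Component"), "Resource", ref=url)

    # humanStartPage: at most one splash page, referenced by URL.
    if start_page:
        page = ET.SubElement(top, "Item")
        add_descriptor(page, "type", "humanStartPage")
        ET.SubElement(ET.SubElement(page, "Component"), "Resource", ref=start_page)
    return didl


def item_types(didl):
    """List the Descriptor/type values in document order."""
    return [element.text for element in didl.iter("type")]


example = build_publication(
    "pub-1",
    "2010-03-23",
    "<mods><titleInfo><title>Example</title></titleInfo></mods>",
    ["http://repository.example.org/pub-1.pdf"],
    "http://repository.example.org/pub-1",
)
print(item_types(example))
```

Because every component is just another nested Item, extra item types (for example enriched metadata) can be added without changing the outer structure.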
Descriptive metadata exchange format
• Central EO gateway
  • DIDL and MODS application profiles
  • Vocabularies in DIDL and MODS
  • Technical guidelines for project partners
  • All documentation is available in open access
• Partner solutions: home-made or with external support
  • ARNO: home-made
  • DSpace: home-made, AtMire
  • EPrints: home-made, ECS-University of Southampton
  • Fedora: METS/MODS -> DIDL/MODS
  • DigiTool: METS/MARC -> DIDL/MODS
• All original partners + 2 new partners
Decentralized registry service
• Aim: sustainable solution for a big network with many partners
• Decentralized Admin file
  • Format: XML-RDF | FOAF + NEEO-specific vocabulary
  • File sits on the local web server of the project partner
  • Content:
    - information on the institution: name, description, ...
    - OAI baseURL + OAI sets to harvest
    - EO authors: DAI, photograph, full name, affiliation
• EO gateway fetches the file over HTTP and validates it at regular intervals
• Used for:
    - information in EO portal screens
    - publication lists (match on DAI)
    - automated harvesting process
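A rough sketch of how a gateway might read and validate such an Admin file. The RDF and FOAF namespaces are the standard ones; the `neeo:` namespace URI, the `oaiBaseURL`, `oaiSet` and `dai` property names, and all sample values are placeholders invented here, since the slides do not spell out the NEEO-specific vocabulary.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "neeo": "http://example.org/neeo#",  # placeholder, not the real vocabulary
}

SAMPLE_ADMIN_FILE = """
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:foaf="http://xmlns.com/foaf/0.1/"
         xmlns:neeo="http://example.org/neeo#">
  <foaf:Organization>
    <foaf:name>Example University</foaf:name>
    <neeo:oaiBaseURL>http://repository.example.org/oai</neeo:oaiBaseURL>
    <neeo:oaiSet>economics</neeo:oaiSet>
  </foaf:Organization>
  <foaf:Person>
    <foaf:name>Jane Doe</foaf:name>
    <neeo:dai>example-dai-001</neeo:dai>
  </foaf:Person>
</rdf:RDF>
"""


def parse_admin_file(xml_text):
    """Extract institution, harvesting and author data; raise if incomplete."""
    root = ET.fromstring(xml_text)
    org = root.find("foaf:Organization", NS)
    if org is None:
        raise ValueError("Admin file has no institution description")
    base_url = org.findtext("neeo:oaiBaseURL", namespaces=NS)
    if not base_url:
        raise ValueError("Admin file has no OAI baseURL")

    # Map each DAI to the author's full name; persons without a DAI are skipped.
    authors = {}
    for person in root.findall("foaf:Person", NS):
        dai = person.findtext("neeo:dai", namespaces=NS)
        if dai:
            authors[dai] = person.findtext("foaf:name", namespaces=NS)

    return {
        "institution": org.findtext("foaf:name", namespaces=NS),
        "oai_base_url": base_url,
        "oai_sets": [s.text for s in org.findall("neeo:oaiSet", NS)],
        "authors": authors,
    }


print(parse_admin_file(SAMPLE_ADMIN_FILE))
```

In a running gateway the XML text would come from a periodic HTTP GET of each partner's file; the resulting dictionary is the kind of structure that could drive harvesting and DAI matching.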
Usage statistics – EO use case
• EO use case: present download rates through EO portal per publication, scholar, institution
• Normalization of exchange format and communication protocol
  • OAI-PMH exchange of SWUP OpenURL ContextObjects (Scholarly Works Usage Community Profile)
• Special considerations:
  • Encryption of IP address of requester (MD5)
  • Filtering out robot requests (list of 50 regular expressions)
  • Filtering out double clicks
• Similar initiatives come together at Knowledge Exchange workshop, Berlin, 29-30 March 2010
  • JISC (Usage Statistics Review project), PIRUS2, SurfSure, COUNTER, MESUR, OA-Statistik, Economists Online
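The three special considerations can be illustrated with a small filter. This is a sketch under assumptions, not the partners' implementation: the three robot patterns stand in for the list of 50 regular expressions, the 10-second double-click window is an assumed value, and the `(timestamp, ip, user_agent, document_id)` event layout is invented. Note that MD5 is strictly a one-way hash rather than encryption.

```python
import hashlib
import re
from datetime import datetime, timedelta

# Stand-ins for the real list of robot regular expressions.
ROBOT_PATTERNS = [re.compile(p, re.IGNORECASE) for p in (r"bot\b", r"crawler", r"spider")]
DOUBLE_CLICK_WINDOW = timedelta(seconds=10)  # assumed threshold


def anonymise_ip(ip):
    """Replace the requester's IP address with its MD5 digest."""
    return hashlib.md5(ip.encode("utf-8")).hexdigest()


def is_robot(user_agent):
    return any(pattern.search(user_agent) for pattern in ROBOT_PATTERNS)


def clean_events(events):
    """Filter events given as (timestamp, ip, user_agent, document_id), sorted by time."""
    last_seen = {}
    kept = []
    for timestamp, ip, user_agent, document_id in events:
        if is_robot(user_agent):
            continue
        key = (anonymise_ip(ip), document_id)
        previous = last_seen.get(key)
        # The last-seen time is refreshed even for dropped clicks, so a burst
        # of rapid clicks counts as a single download.
        last_seen[key] = timestamp
        if previous is not None and timestamp - previous <= DOUBLE_CLICK_WINDOW:
            continue
        kept.append((timestamp, key[0], document_id))
    return kept


start = datetime(2010, 3, 23, 12, 0, 0)
sample = [
    (start, "192.0.2.1", "Mozilla/5.0", "doc-1"),
    (start + timedelta(seconds=3), "192.0.2.1", "Mozilla/5.0", "doc-1"),   # double click
    (start + timedelta(seconds=5), "192.0.2.9", "Googlebot/2.1", "doc-1"),  # robot
    (start + timedelta(minutes=5), "192.0.2.1", "Mozilla/5.0", "doc-1"),
]
print(len(clean_events(sample)))
```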
Usage statistics – implementation status
• Central EO Gateway – DoDoCo (Document Download Counter)
  • PMH harvesting of SWUP ContextObjects into SQL database
  • Enrich with information on item, scholar, institution
  • Web service level (item, scholar, institution) + date range
  • Technical guidelines for project partners (available in open access)
• Partners
  • Implementation:
    - for all major IR platforms
    - solution for Combined Log Format web logs
  • Registration through Admin file
  • 7 original + 1 new partner
• Not enough data available
• Not visible through EO portal yet, although DoDoCo software is ready
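To make the "level + date range" query concrete, here is a minimal sketch using an in-memory SQLite database. The table layout, column names and function names are assumptions for illustration only; the slides do not describe the actual DoDoCo schema or web service interface.

```python
import sqlite3

# Aggregation levels offered by the service, mapped to (assumed) column names.
LEVELS = {"item": "item_id", "scholar": "scholar_dai", "institution": "institution"}


def open_store():
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE downloads ("
        "item_id TEXT, scholar_dai TEXT, institution TEXT, day TEXT)"
    )
    return db


def record_download(db, item_id, scholar_dai, institution, day):
    """Store one cleaned download event, already enriched with scholar and institution."""
    db.execute(
        "INSERT INTO downloads VALUES (?, ?, ?, ?)",
        (item_id, scholar_dai, institution, day),
    )


def count_downloads(db, level, key, first_day, last_day):
    """Count downloads for one item, scholar or institution in a date range (inclusive)."""
    column = LEVELS[level]  # only whitelisted column names reach the SQL text
    row = db.execute(
        f"SELECT COUNT(*) FROM downloads WHERE {column} = ? AND day BETWEEN ? AND ?",
        (key, first_day, last_day),
    ).fetchone()
    return row[0]


store = open_store()
record_download(store, "pub-1", "example-dai-001", "Example University", "2010-01-15")
record_download(store, "pub-1", "example-dai-001", "Example University", "2010-02-02")
record_download(store, "pub-2", "example-dai-001", "Example University", "2010-02-20")
print(count_downloads(store, "scholar", "example-dai-001", "2010-02-01", "2010-02-28"))
```

Days are stored as ISO 8601 strings, so plain string comparison in `BETWEEN` matches chronological order.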
Added value services
• Publication lists
  • Per DAI of authors who are registered in Admin file
  • Publications are extracted from the EO gateway over SRU and formatted as:
    • APA+ in HTML
      • with links to full text in EO partner repository
      • with links to publisher sites (through OpenURL resolution)
    • APA in PDF
    • APA in RTF
    • RIS
    • BibTeX
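A sketch of the two steps behind a publication list: building an SRU searchRetrieve request for one author, then formatting a record as RIS or BibTeX. The SRU parameter names are the standard ones; the gateway URL, the CQL index name `dai`, the record dictionary and the fixed `JOUR`/`@article` types are assumptions made for the example.

```python
from urllib.parse import urlencode


def sru_url(base_url, dai, maximum_records=50):
    """Build an SRU 1.1 searchRetrieve URL selecting publications by author DAI."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.1",
        "query": f'dai = "{dai}"',  # hypothetical CQL index name
        "maximumRecords": maximum_records,
    }
    return base_url + "?" + urlencode(params)


def to_ris(publication):
    """Format one publication as a RIS record (tag, two spaces, hyphen, space)."""
    lines = ["TY  - JOUR"]
    lines += [f"AU  - {author}" for author in publication["authors"]]
    lines.append(f"TI  - {publication['title']}")
    lines.append(f"PY  - {publication['year']}")
    lines.append("ER  - ")
    return "\n".join(lines)


def to_bibtex(key, publication):
    """Format one publication as a BibTeX @article entry."""
    authors = " and ".join(publication["authors"])
    return (
        f"@article{{{key},\n"
        f"  author = {{{authors}}},\n"
        f"  title = {{{publication['title']}}},\n"
        f"  year = {{{publication['year']}}}\n"
        f"}}"
    )


publication = {"authors": ["Doe, Jane", "Roe, Richard"], "title": "An example paper", "year": 2009}
print(sru_url("http://gateway.example.org/sru", "example-dai-001"))
print(to_ris(publication))
print(to_bibtex("doe2009", publication))
```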
Added value services
• Enriched descriptive metadata
  • JEL classification
    • Enrichment service (ES) gets records to be enriched from EO, over SRU
    • ES creates enrichment record(s), using text mining technology
    • ES makes enrichment record(s) available to EO, over OAI-PMH
    • EO harvests enrichment records from ES and integrates them into the original record
    • EO reuses enrichment information in its services: index & present
  • Bibliographic references
    • Through collaboration with RePEc/CitEc
    • Visible through EO portal
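The integration step (harvested enrichment records merged into the original records) might look roughly like this. The record layout, the `target` key that links an enrichment to its publication, and the function name are assumptions; transport over SRU and OAI-PMH is left out.

```python
def integrate_enrichments(records, enrichments):
    """Merge JEL codes from enrichment records into copies of the original records.

    records: dict mapping a record identifier to a record dictionary.
    enrichments: list of {"target": identifier, "jel": [codes]}.
    """
    # Copy each record so the originals are left untouched.
    merged = {
        identifier: dict(record, jel=list(record.get("jel", [])))
        for identifier, record in records.items()
    }
    for enrichment in enrichments:
        target = merged.get(enrichment["target"])
        if target is None:
            continue  # enrichment refers to a record the gateway does not hold
        for code in enrichment["jel"]:
            if code not in target["jel"]:
                target["jel"].append(code)
    return merged


originals = {
    "pub-1": {"title": "An example paper", "jel": ["E31"]},
    "pub-2": {"title": "Another example paper"},
}
harvested = [
    {"target": "pub-1", "jel": ["E31", "C23"]},
    {"target": "pub-2", "jel": ["C23"]},
    {"target": "pub-9", "jel": ["D12"]},
]
print(integrate_enrichments(originals, harvested))
```

Keeping the enrichment separate from the partner-supplied metadata until this merge means the original record can always be re-harvested without losing or duplicating the added codes.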
Added value services
• Full-text search service
  • Process
    • Full-text indexer component in Meresco fetches relevant records from EO Gateway over SRU
    • Follows links to PDF object files
    • Text is extracted from PDF, and added to record through SRU Update
    • EO can now index & present
  • Prototype exists
  • Not yet fully deployed in EO portal
Added value services
• Multilinguality (EN, FR, GE, ES)
  • Complete EO portal interface
  • JEL classification
• MLIA functionality in EO portal
  • Student thesis – Prof. Bouillon (Univ. of Geneva, multilingual information processing department)
    • (Uncustomized) Systran and Google Translate show equivalent results
  • Contacts with CACAO (also through Europeana)
    • Comes as a complete portal solution, not as an add-in for existing portals like EO
• Considerations:
  • Lingua franca in economics = EN
  • NEEO is not a research project in linguistics; aim: reuse best existing technology
  • Use "Google Translate" for translation of queries
Collaboration with RePEc
• Harvesting metadata from RePEc into EO
  • AMF to DIDL/MODS mapping
• Push metadata from EO to RePEc
  • "RePEc:ner" archive, with separate series for each EO institution
  • According to agreed-upon reviewed ReDIF format
  • Admin file directives in order to limit overlap
• Contribute to LogEc
• Reuse CitEc data in EO portal
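For the push direction, a record could be serialized into a ReDIF paper template along these lines. The field names follow the general ReDIF paper template and the archive code `ner` comes from the slide; the series code, record layout and function name are invented, and the format actually agreed with RePEc may include other fields.

```python
def to_redif(publication, series):
    """Serialize one publication as a ReDIF-Paper template in the RePEc:ner archive."""
    lines = ["Template-Type: ReDIF-Paper 1.0"]
    for author in publication["authors"]:
        lines.append(f"Author-Name: {author}")
    lines.append(f"Title: {publication['title']}")
    lines.append(f"Creation-Date: {publication['year']}")
    if publication.get("url"):
        lines.append(f"File-URL: {publication['url']}")
        lines.append("File-Format: application/pdf")
    # Handle layout: RePEc:<archive>:<series>:<number>; one series per institution.
    lines.append(f"Handle: RePEc:ner:{series}:{publication['id']}")
    return "\n".join(lines)


paper = {
    "id": "pub-1",
    "authors": ["Jane Doe"],
    "title": "An example paper",
    "year": 2009,
    "url": "http://repository.example.org/pub-1.pdf",
}
print(to_redif(paper, "exampl"))  # "exampl" is a made-up series code
```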
EO gateway and portal
• Gateway – metadata store and search engine
  • Choice between Summa, SOLR/Lucene, Meresco
  • Open source solution, based on Lucene search engine
  • Support available from software developers (CQ2 company)
  • Has proven its qualities in the past (DARENet)
• Portal
  • First version: home-made
  • Final version:
    • design outsourced to a private company
    • HTML, CSS, JavaScript, all images