Distributed Open Archives • Dr. Heinrich Stamerjohanns • Institute for Science Networking at the University of Oldenburg
Goals of PhysDoc • dissemination of articles (objects) • low-barrier interoperability framework
Approaches (I): unstructured data • self-archiving (on web servers) • crawl and harvest articles with a harvester • make them searchable through a unified search interface • try to extract metadata and/or extend unstructured data with metadata • approaches taken by Harvest, mnoGoSearch
PhysDoc together with Harvest • [Diagram; components shown: documents, Gatherer, Summarizer, filesystem with SOIF records, search interface, WWW client]
Approaches (II): structured data in a homogeneous environment • store data in relational databases • replicate data with a proprietary protocol • can be either synchronous or asynchronous • or use a distributed database • but: • same definitions everywhere • same data layout everywhere
Approaches (III): structured data in a heterogeneous environment • collect data from different databases through web search interfaces • meta-search engines • a successful implementation exists: MetaPhys • but: • relies heavily on the layout of the presented data • a lot of adjusting needs to be done
Approaches (IV): structured data in a heterogeneous environment • should: • be in a machine-readable format, not one meant for humans • use strict formats which can be validated • support various content models (metadata formats) • use existing technologies • be easy to implement • be easy to adopt
Low-barrier framework • Transport protocol: HTTP • Data format: XML • Metadata format • interoperability: at least Dublin Core • extensibility: communities can use the metadata format which fits their needs
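To make the "XML plus at least Dublin Core" point concrete, here is a minimal sketch of serializing one record in the oai_dc container format. The record values and the helper name `build_oai_dc` are illustrative assumptions, not taken from PhysDoc; the two namespace URIs are the standard oai_dc and Dublin Core element namespaces.

```python
import xml.etree.ElementTree as ET

# Standard namespaces for the oai_dc container and Dublin Core elements.
OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC_NS = "http://purl.org/dc/elements/1.1/"


def build_oai_dc(fields):
    """Serialize {element name: [values]} as an oai_dc XML string.

    Hypothetical helper for illustration only.
    """
    ET.register_namespace("oai_dc", OAI_DC_NS)
    ET.register_namespace("dc", DC_NS)
    root = ET.Element(f"{{{OAI_DC_NS}}}dc")
    for name, values in fields.items():
        for value in values:
            # Dublin Core elements are repeatable, so each value gets its own element.
            child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
            child.text = value
    return ET.tostring(root, encoding="unicode")


# Made-up sample record.
record_xml = build_oai_dc({
    "title": ["An example article on open archives"],
    "creator": ["Doe, Jane"],
    "date": ["2002-01-01"],
})
print(record_xml)
```

A community-specific format would live next to this one under a different metadata prefix; the Dublin Core version stays as the common baseline.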
Open Archives Initiative • OAI defines such a protocol: OAI-PMH • is not intended to replace more complete interoperability protocols such as Z39.50 • distinguishes between two classes • Data providers expose metadata about their content • Service providers harvest metadata from data providers using OAI-PMH and offer value-added services such as the possibility to search through the collected data
PhysDoc as OAI-Data-Provider • we have implemented PMH v2.0 • phpoai2, written in PHP • open source (GNU license) • supports various SQL databases through PEAR (PHP Extension and Application Repository) • supports on-the-fly XML output compression, which greatly reduces bandwidth needs • easily configurable and adaptable to different metadata standards
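A service provider talks to a data provider with plain HTTP requests that carry a `verb` argument. The following sketch builds such request URLs; the base URL is a placeholder and `oai_request_url` is a hypothetical helper, while the six verb names and the `metadataPrefix` and `from` arguments come from the OAI-PMH 2.0 specification.

```python
from urllib.parse import urlencode

# The six request verbs defined by OAI-PMH 2.0.
VERBS = {
    "Identify", "ListMetadataFormats", "ListSets",
    "ListIdentifiers", "ListRecords", "GetRecord",
}

# Placeholder base URL, not a real repository.
BASE_URL = "http://archive.example.org/oai"


def oai_request_url(base_url, verb, params=None):
    """Return the GET URL for one OAI-PMH request (hypothetical helper)."""
    if verb not in VERBS:
        raise ValueError(f"unknown OAI-PMH verb: {verb}")
    query = {"verb": verb}
    # Arguments are passed as a dict because "from" is a Python keyword.
    query.update(params or {})
    return base_url + "?" + urlencode(query)


# Ask for all Dublin Core records changed since a given date.
url = oai_request_url(
    BASE_URL, "ListRecords",
    {"metadataPrefix": "oai_dc", "from": "2002-01-01"},
)
print(url)
```

The service provider would fetch this URL, parse the XML response, and store the harvested metadata for its own search service.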
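Why output compression pays off for metadata responses can be illustrated with a short sketch. It uses Python's `gzip` module rather than the PHP mechanism phpoai2 relies on, and the sample XML is made up; the point is only that repetitive markup compresses well.

```python
import gzip


def compress_response(xml_text):
    """Gzip-compress an XML response body (illustrative helper)."""
    return gzip.compress(xml_text.encode("utf-8"))


# Made-up response: many records sharing the same element names.
sample = "<records>" + "".join(
    f"<record><identifier>oai:example.org:{i}</identifier>"
    f"<title>Sample title</title></record>"
    for i in range(200)
) + "</records>"

compressed = compress_response(sample)
ratio = len(compressed) / len(sample.encode("utf-8"))
print(f"{len(sample)} bytes uncompressed, {len(compressed)} bytes compressed")
```

In a real deployment the server would compress only when the client announces support through the HTTP `Accept-Encoding` header.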
PhysDoc as OAI Data-Provider • [Diagram; components shown: documents, Gatherer, Summarizer, filesystem with SOIF records, Mapper, quality function, Normalizer, metadata container as SQL database, Mapper, OAI-Gateway; labels: DC, MARC, XML, on-the-fly, offline]
PhysDoc together with ??? • use of a metadata container yields many advantages • consistency check of data • quality assurance • static HTML export • any desired export metadata format besides DC is possible • is prepared for exchange protocols other than OAI
OAD: PhysDoc as Service-Provider • PhysDoc will offer services to the physics community through Open Archives • articles are collected through OAI from various OAI Data-Providers • other publishers are and will be incorporated through proprietary interfaces • these interfaces do not depend on the layout of the offered data
OAD: PhysDoc as Service-Provider • [Diagram; components shown: OAI Data-Provider, OAI Data-Provider, Scheduler, XML Parser, Mapper, Normalizer, metadata container as SQL DB, WWW search interface]
OAD: PhysDoc as Service-Provider • uses the expat library to parse XML • currently supports only PMH-1.1 • cannot be easily adapted by other sites • support for PMH-2.0 is in progress
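Event-driven parsing with expat can be sketched as follows, using Python's binding to the same library. The sample response is made up and follows the PMH 2.0 layout, not the 1.1 layout the service currently supports; `extract_identifiers` is a hypothetical helper, not part of the PhysDoc code.

```python
import xml.parsers.expat


def extract_identifiers(xml_text):
    """Collect the text of every <identifier> element in a response."""
    identifiers = []
    state = {"inside": False, "buffer": []}

    def start(name, attrs):
        if name == "identifier":
            state["inside"] = True
            state["buffer"] = []

    def end(name):
        if name == "identifier" and state["inside"]:
            identifiers.append("".join(state["buffer"]).strip())
            state["inside"] = False

    def chars(data):
        # expat may deliver one text node in several chunks, so buffer them.
        if state["inside"]:
            state["buffer"].append(data)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    parser.Parse(xml_text, True)
    return identifiers


# Made-up ListIdentifiers response.
sample_response = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header><identifier>oai:example.org:1</identifier><datestamp>2002-01-01</datestamp></header>
    <header><identifier>oai:example.org:2</identifier><datestamp>2002-01-02</datestamp></header>
  </ListIdentifiers>
</OAI-PMH>"""

ids = extract_identifiers(sample_response)
print(ids)
```

Because the parser works on a stream of events, it does not need to hold a complete response in memory, which suits large harvests.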
Technical Details • local development • implementation also written in PHP • scheduler is based on the database • expat library is used as XML parser for OAI and proprietary interfaces • database is again MySQL • with “tricks” • full text extensions • cannot be easily adapted by other sites • support for PMH-2.0 is in progress
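One way a database-backed scheduler could work is sketched below. This is an assumption about the general idea, not the PhysDoc implementation: it uses SQLite in place of MySQL, and the table layout, column names, and URLs are invented for the example.

```python
import sqlite3


def due_providers(conn, now):
    """Return base URLs of providers whose harvest interval has elapsed."""
    rows = conn.execute(
        "SELECT base_url FROM providers "
        "WHERE last_harvest + interval_s <= ? ORDER BY base_url",
        (now,),
    )
    return [row[0] for row in rows]


# In-memory database with a hypothetical schema; times are in seconds.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE providers ("
    "base_url TEXT PRIMARY KEY, last_harvest INTEGER, interval_s INTEGER)"
)
conn.executemany(
    "INSERT INTO providers VALUES (?, ?, ?)",
    [
        ("http://a.example.org/oai", 0, 3600),     # harvested long ago
        ("http://b.example.org/oai", 3000, 3600),  # harvested recently
    ],
)

due = due_providers(conn, now=4000)
print(due)
```

Keeping the schedule in the same database as the metadata means that adding a data provider is a single row insert, with no separate configuration to maintain.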
Technical Details • implementation tested successfully against the local data-provider • added another data-provider within five minutes • normalization is again necessary (might raise further technical, textual and legal problems) • [but problems remain • vagueness in protocol definition • 503 flow control… • bad choice, because it depends on layout]
Thank you • OAI at the Institute for Science Networking, Oldenburg: http://physnet.uni-oldenburg.de/oai/ • stamer@uni-oldenburg.de
OAD: Distributed Open Archives • joint project by Virginia Tech and the University of Oldenburg • Aims: • set up a prototype service based on Open Archives that focuses on physics • design and implement prototypes that run the OAI protocol for metadata harvesting (PMH) • enable the establishment and scalable interoperation of hundreds of Open Archives
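The "503 flow control" item presumably refers to the way a data provider can answer with HTTP status 503 and a `Retry-After` header to slow a harvester down. A minimal sketch of a harvester honoring that signal follows; the `fetch` callable, the retry limit, and the assumption that `Retry-After` holds a number of seconds (it may also be an HTTP date) are all illustrative.

```python
import time


def fetch_with_retry(fetch, url, max_attempts=3, sleep=time.sleep):
    """Call fetch(url) until it succeeds or attempts run out.

    fetch is a hypothetical callable returning (status, headers, body).
    """
    for _ in range(max_attempts):
        status, headers, body = fetch(url)
        if status == 200:
            return body
        if status == 503:
            # Assumes Retry-After is given in seconds; default to 1 second.
            sleep(int(headers.get("Retry-After", "1")))
            continue
        raise RuntimeError(f"unexpected HTTP status {status}")
    raise RuntimeError("gave up after repeated 503 responses")


# Simulated provider: one 503 with a 2-second delay, then success.
responses = iter([
    (503, {"Retry-After": "2"}, ""),
    (200, {}, "<OAI-PMH/>"),
])
waits = []  # records requested delays in place of real sleeping
body = fetch_with_retry(
    lambda url: next(responses),
    "http://archive.example.org/oai?verb=Identify",
    sleep=waits.append,
)
print(body, waits)
```

Injecting `fetch` and `sleep` keeps the sketch testable without network access; a real harvester would plug in an HTTP client and the default `time.sleep`.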