350 likes | 443 Views
Lecture 14 Interoperability for information discovery. CS 502 Computing Methods for Digital Libraries Cornell University – Computer Science Herbert Van de Sompel herbertv@cs.cornell.edu. Acknowledgements: Carl Lagoze. Why interoperability?.
E N D
Lecture 14 Interoperability for information discovery CS 502 Computing Methods for Digital LibrariesCornell University – Computer ScienceHerbert Van de Sompelherbertv@cs.cornell.edu Acknowledgements: Carl Lagoze
Why interoperability? The distributed information environment creates enormous challenges regarding the provision of coherent services. Addressing these challenges requires some form of interoperability.
OPAC FTXT FTXT e-print A&I A&I Information resources: distributed distributed herbert van de sompel
OPAC FTXT FTXT e-print A&I A&I Information resources: different authorities, technologies range of authorities, technologies herbert van de sompel
OPAC FTXT FTXT e-print A&I A&I Information resources: challenges re coherent services ¡¡challenges re integrated access !! herbert van de sompel
Interoperability helps! • UNICODE • XML • XML Schema, XML Namespaces • HTTP • Metadata formats (MARC, Dublin Core, …) • This lecture: interoperability for resource discovery (searching) • Next week: interoperability for reference linking
Interoperability: technical & organizational • reaching interoperability is an organizational challenge (not only technical) • the challenge is to create incentives for independent digital libraries to adopt specifications
Functionality versus cost of acceptance This picture does not show the eventual benefit of a protocol in the creation of coherent services Cost of acceptance Z39.50 SDLIP Metadata Harvesting OAI Functionality
A&I image FTXT OPAC e-print federated searching
Z39.50 http://www.loc.gov/z3950/agency/
Aims of z39.50 • Permits one computer (z3950 client) to search and retrieve information on another computer serving a database (z3950 target) • Important both technically and for its wide use in library systems, A&I databases • Most development has concentrated on bibliographic data • Most implementations emphasize searches that use a bibliographic set of attributes to search databases of MARC records
Technical history of z39.50 • Developed for X.25 networks (connection orientation), conversion to run over TCP fitted later • Original concept in days when repeating a search was expensive computation (about 1980) • NISO standard
z39.50 - principles • Abstract view of database searching. • Server stores a set of databases with searchable indexes • Interactions are based on a session • The client opens a connection with the server, carries out a sequence of interactions and then closes the connection. • During the course of the session, both the server and the client remember the state of their interaction.
z39.50 - state • The server carries out the search and builds a results set • Server saves the results set. • Subsequent message from the client can reference the result set. • Thus the client can modify a large set by increasingly precise requests, or can request a presentation of any record in the set, without searching entire database.
z39.50 - services • init -- client connects to the server and exchanges initial information, e.g., preferred message size • explain -- client inquires of the server what databases are available for searching, the fields that are available, the syntax and formats supported, etc. • search -- client presents a query to a database; choices of syntax for specifying searches • • Boolean queries widely implemented manipulation of results sets -- e.g., sort or delete present -- requests the server to send specified records from the results set to the client in a specified format scan – browse indexes
z39.50 – sample query • In the database named Books find all records for which the access pointtitle contains the value evangeline and the access pointauthor contains the value longfellow. • Z39.50 defines a rich variety of search access points (use attributes) that can be extended by implementers http://lcweb.loc.gov/z3950/agency/defns/bib1.html
z39.50 – issues • No general acceptance (for instance not supported by web browsers => httpd/z3950 gateways) • Merging results for distributed search • Semantic interoperability ~ z39.50 profiles
Simple Digital Library Interoperability Protocol http://www-diglib.stanford.edu/~testbed/doc2/SDLIP/
SDLIP • Compromise between a full-scale, all encompassing approach such as Z39.50 and the “anything goes” approach typical for ad-hoc search interface design on web • Developed jointly by Stanford, Berkeley, and UC Santa Barbara • Heavily influenced by DASL from IETF
SDLIP - interfaces • Search Interface – defines simple query language, protocol can then include other languages • Result Interface – parking meter metaphor (server decides on length of session) • Source Metadata Interface – allows to query a library proxy about its capabilities (subcollections, attributes that can be searched, …)
Open Archives Initiative Metadata Harvesting Protocol http://www.openarchives.org
OAMH protocol • Low-barrier framework for repository interoperability • Minimal burden for parties providing databases • Plug-in concept to allow community and service specialization (placeholders – about, descriptor ; parallel metadata formats; …)
A&I image OPAC e-print harvester FTXT OAMH - metadata harvesting metadata FTXT
A&I image FTXT e-print OPAC Author Title Abstract Identifer OAMH – federated services metadata
Requests repos i tory harves ter Replies OAMH – protocol service provider data provider 6
Reply • XML Schema • Self contained OAMH – core concepts • low-barrier interoperability • data-provider & service-provider model • metadata harvesting model OAMH protocol HTTP based • shared metadata format and parallel, community-specific metadata formats Dublin Core • authentication etc. : on purpose outside of the protocol
repos i tory harves ter OAMH – protocol tools service provider data provider Datestamp Identifier Set Records
repos i tory harves ter OAMH – protocol tools service provider data provider • Supporting protocol requests: • Identify • ListMetadataFormats • ListSets • Harvesting protocol requests: • ListRecords • ListIdentifiers • GetRecord
repos i tory harves ter OAMH – a supporting protocol request service provider data provider ListMetadataFormats • ListMetadataFormats / Time / Request • REPEAT • Format prefix • Format XML schema • /REPEAT
repos i tory harves ter OAMH – a harvesting protocol request service provider data provider * from=a * until=b * set=klm ListRecords * metadataPrefix=dc • ListRecords / Time / Request • REPEAT • Identifier • Datestamp • Metadata • /REPEAT
OAMH – applications? • federated services [S&R, SDI, alerting, linking, ...] • database synchronization • harvesting the deep Web • ...
OAMH – issues • sync between data provider and service provider (cf. Web page harvesters) • everyone harvesting from everyone? (cf. brokers) • CAN attractive cross-community services be built? • gaining critical mass
Some thoughts • There is (and will never be) one right solution (technical vs. cost vs. complexity vs. ??) (cf metadata) • Distributed technical solutions have organizational ramifications • A revival of the simplicity approach: specifications that require little effort by the digital libraries involved, but that can eventually lead to appealing services (when combined with brute force computing and intelligent tools)