Models, Architectures, and Technologies of Digital Libraries (2). Session 4 LIS 60639 Implementation of Digital Libraries Dr. Yin Zhang. 1. Important protocols for digital libraries Rhyno (2004): Ch. 2 Important protocols for digital libraries and OSS options for using them.
1. Important protocols for digital libraries Rhyno (2004): Ch. 2 Important protocols for digital libraries and OSS options for using them
What is a protocol and why? • Digital libraries usually are called on to communicate with many different external systems. • These duties can range from delivering Web-based interfaces for remote users to exposing content to third-party applications. • Certain interactions are so common or have so many requirements that a protocol has been established for standardizing and streamlining the process. • A protocol is a set of ground rules for how systems carry out specific activities. • Protocols often define which format and syntax systems use for exchanging information and what one system must indicate to another before any data is made available.
Core protocols for DL projects (1) • The Hypertext Transfer Protocol (HTTP) powers the Web and is the protocol that most Web users interact with when using a Web browser. • HTTP's ability to be plugged into many different types of technologies is shown in Figure 2.3. • Most Web users are unaware of how many hoops the content delivered to their browsers has been through. With the use of a gateway, HTTP also can be the basis for interacting with many other types of protocols. • A gateway takes the results of one protocol and translates them to fit the requirements of a different protocol or application; for example, taking the results of an HTML form and using the values to formulate a query to a remote database. • CGI (Common Gateway Interface), for instance, is a specification introduced in 1994 to allow HTML content to be created dynamically. • The ubiquitous nature of HTTP is a testimony to both its simplicity and extensibility. A more complex protocol would be harder to map to other applications. As a result, HTTP became firmly entrenched in the toolkits of application developers at an early stage of the Web's development and remains there today. http://www.w3.org/Protocols/
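The gateway idea above (form values in, database query out) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the form fields `title` and `author` and the table `records` are hypothetical names, not taken from the reading, and a real gateway would also execute the query and render the results as HTML.

```python
from urllib.parse import parse_qs


def form_to_query(query_string):
    """Translate URL-encoded HTML form values into a parameterized SQL query."""
    form = parse_qs(query_string)
    # parse_qs returns a list of values per field; take the first, or "" if absent.
    title = form.get("title", [""])[0]
    author = form.get("author", [""])[0]
    # Hypothetical table and column names; placeholders keep user input out of the SQL text.
    sql = "SELECT id, title FROM records WHERE title LIKE ? AND author LIKE ?"
    params = (f"%{title}%", f"%{author}%")
    return sql, params


sql, params = form_to_query("title=digital+libraries&author=Rhyno")
print(sql)
print(params)
```

The statement and its parameters stay separate, so the same pair could be handed to any database driver that accepts `?` placeholders.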
HTTP Software Examples • Web server software guide: http://webdesign.about.com/cs/webservers/bb/abwebservers.htm • Free web server software http://en.wikipedia.org/wiki/Category:Free_web_server_software • Apache: • Apache exists to provide a robust and commercial-grade reference implementation of the HTTP protocol • Apache dominates the Web server world
Core protocols for DL projects (2) • OAI-PMH - Open Archives Initiative (OAI) Protocol for Metadata Harvesting (PMH) • It has been called the "HTTP of digital libraries" even though the protocol actually uses HTTP as a transport mechanism between digital collections. • OAI-PMH is several years younger than HTTP, with origins in a 1999 meeting in Santa Fe, New Mexico, to address a series of problems that were occurring in the e-print server world. • As disciplinary e-print servers became more common, it was difficult to support searching across multiple repositories. • Repositories needed greater capabilities to automatically identify and copy papers that had been deposited in other repositories. • The solution was the definition of an interface to permit an e-print server to expose metadata for the papers it held. This would allow the metadata to be picked up by programs on the Web called harvesters. • Harvesting programs travel around a network gathering, or harvesting, content by copying it to a central site. More in Reading 4.
Core protocols for DL projects (3) • Z39.50 has roots that stretch back to the early 1970s and the Linked Systems Project for searching bibliographic databases and transferring records among the major library institutions (e.g., Library of Congress, OCLC). • Z39.50 is a protocol that allows a client machine (called an origin) to search a server machine (called a target). • Despite its close association with the library community, Z39.50 is a relatively generic protocol with a rich set of functions for search and retrieval, including the ability to sort result sets and registries of objects such as attribute sets that specify search points. • These search points can be mapped onto the indexes and search capabilities of the underlying server. • Perhaps the best-known attribute set is Bib-1, originally designed for bibliographic resources but now commonly used for a wide range of applications.
Z39.50 (cont. 1) • Bib-1 Attribute Set: • http://www.loc.gov/z3950/agency/defns/bib1.html • http://www.loc.gov/z3950/agency/bib1.html • Bib-1 comprises six types of groupings of attributes, or attribute types, that define a deep level of precision in putting together queries: • Use attributes (type = 1) define the access point: 1 Personal-name, 2 Corporate-name, 3 Conference-name, 4 Title, 5 Title-series, 6 Title-uniform, 7 ISBN, 8 ISSN, … • Relation attributes (type = 2) define the relation of the search term to the values in the database: 1 Less than, 2 Less than or equal, 3 Equal, 4 Greater or equal, 5 Greater than, 6 Not equal, … • Position attributes (type = 3) specify the location of the search term within the field or subfield in which it appears: 1 First in field, 2 First in subfield, 3 Any position in field • Structure attributes (type = 4) specify the type of search term: 1 Phrase, 2 Word, 3 Key, 4 Year, 5 Date (normalized), 6 Word list, 100 Date (un-normalized), … • Truncation attributes (type = 5) specify whether one or more characters may be omitted in matching the search term in the target system at the position specified by the Truncation attribute: 1 Right truncation, 2 Left truncation, 3 Left and right truncation, 100 Do not truncate, … • Completeness attributes (type = 6) specify that the contents of the search term represent a complete or incomplete subfield or a complete field: 1 Incomplete subfield, 2 Complete subfield, 3 Complete field
Z39.50 (cont. 2) • Z39.50-compliant systems can use these attributes, which correspond to numbers in the standard, to deconstruct queries. • For a search query: FIND TITLE PROGRAM* OR SUBJECT UNIX • Use attributes (type = 1) • 1 Personal-name, 2 Corporate-name, 3 Conference-name, 4 Title, 5 Title-series, 6 Title-uniform, … 21 Subject heading • Relation attributes (type = 2) • 1 Less than, 2 Less than or equal, 3 Equal, 4 Greater or equal, 5 Greater than, 6 Not equal, … • Position attributes (type = 3) • 1 First in field, 2 First in subfield, 3 Any position in field • Structure attributes (type = 4) • 1 Phrase, 2 Word, 3 Key, 4 Year, 5 Date (normalized), 6 Word list, 100 Date (un-normalized), … • Truncation attributes (type = 5) • 1 Right truncation, 2 Left truncation, 3 Left and right truncation, 100 Do not truncate, … • Completeness attributes (type = 6) • 1 Incomplete subfield, 2 Complete subfield, 3 Complete field
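One way to see the deconstruction is to write the sample query out with its Bib-1 numbers. The sketch below assumes the Prefix Query Format (PQF) notation used by some Z39.50 toolkits such as YAZ; that notation is not mentioned in the reading and is used here only as a compact way to print attribute type/value pairs. The helper names are hypothetical.

```python
# Bib-1 Use attributes (type 1) needed for the sample query.
USE = {"TITLE": 4, "SUBJECT": 21}
# Bib-1 Truncation attribute (type 5) value for right truncation.
RIGHT_TRUNCATION = 1


def term_to_pqf(field, term):
    """Render one search term as PQF-style attribute/value pairs plus the term."""
    parts = [f"@attr 1={USE[field]}"]
    if term.endswith("*"):
        # A trailing * in the user query means right truncation.
        parts.append(f"@attr 5={RIGHT_TRUNCATION}")
        term = term[:-1]
    parts.append(term)
    return " ".join(parts)


def or_query(left, right):
    """Combine two operands with a Boolean OR in prefix notation."""
    return f"@or {left} {right}"


# FIND TITLE PROGRAM* OR SUBJECT UNIX
pqf = or_query(term_to_pqf("TITLE", "PROGRAM*"), term_to_pqf("SUBJECT", "UNIX"))
print(pqf)
```

Only the Use and Truncation types are set explicitly here; a target would apply its own defaults for the other four types.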
Z39.50 (cont. 3) • Z39.50 is more complex than either HTTP or OAI and is an important protocol for digital libraries because it is designed to meet the very real complexities of information retrieval. • It also can be used as a tool to build distributed search services, also known as federated search systems: • The client in a federated system sends a search to all of the servers comprising the federation. • It can then gather the results and attempt to eliminate duplicates or perform value-added services such as clustering the results under topics, unlike the harvesting approach used with OAI that takes entire sets of records (see Figure 2.6).
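The federated pattern described above (send the search to every server, gather, de-duplicate) can be sketched as follows. The servers here are in-memory stand-ins rather than real Z39.50 targets, and matching duplicates on a normalized title is an assumption made for brevity; real systems usually need more careful matching.

```python
def make_server(records):
    """Build a toy search function over an in-memory list of records."""
    def search(query):
        return [r for r in records if query.lower() in r["title"].lower()]
    return search


def federated_search(servers, query):
    """Send the query to every server, then merge results and drop duplicates."""
    merged, seen = [], set()
    for name, search in servers.items():
        for record in search(query):
            key = record["title"].strip().lower()  # naive duplicate key
            if key not in seen:
                seen.add(key)
                merged.append({**record, "source": name})
    return merged


servers = {
    "library_a": make_server([{"title": "UNIX Programming"}, {"title": "Digital Libraries"}]),
    "library_b": make_server([{"title": "unix programming"}, {"title": "UNIX Internals"}]),
}
results = federated_search(servers, "unix")
for record in results:
    print(record["source"], "-", record["title"])
```

Unlike harvesting, nothing is copied ahead of time: every call to `federated_search` queries the servers live.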
Z39.50 Software • Z39.50 is an abstract layer on top of an existing system, so it isn't surprising that most Z39.50 tools are architected to work on top of other applications. • Suggested by Library of Congress: http://www.loc.gov/z3950/agency/resources/software.html • Free Software • Commercial Software • Suggested in this chapter: a few open source applications (see Table 2.5).
Other protocols for DL projects (4) • Some protocols are supported widely outside of the digital library community. • SOAP: Simple Object Access Protocol (SOAP) • It combines XML with HTTP for accessing services, objects, and servers. • It is a lynchpin of a suite of technologies called Web Services that leverages the Web for delivering application functions in a well-defined manner. • SOAP allows a great deal of information to be passed to an application, and it leverages XML for laying out the data that goes between DL applications. • RSS: RDF Site Summary (RSS) • It is an XML-based format that allows simultaneous publication, or syndication, of lists of hyperlinks, along with other information or metadata, that help viewers decide whether they want to follow a link. • Shibboleth: http://shibboleth.internet2.edu/ • Shibboleth is an authentication and authorization project under the auspices of Internet2, a consortium of universities working in partnership with industry vendors and government agencies to develop and deploy advanced network applications and technologies. • The Shibboleth System is a standards-based, open source software package for web single sign-on across or within organizational boundaries. It allows sites to make informed authorization decisions for individual access of protected online resources in a privacy-preserving manner.
Discussion and Reflection • Summary: • Protocols make network systems work together and are the basis of many formal communications. • Digital libraries depend on protocols, particularly HTTP, OAI-PMH, and Z39.50, to provide services. Think of • HTTP as the highway between digital libraries, with • OAI as a friendly but comprehensive census taker that periodically turns up on the highway for updates on changes in the collection, and • Z39.50 as a sometimes more demanding visitor asking for less predictable and more specific information on the collection. • SOAP, RSS, and Shibboleth promise to enhance further and expand the boundaries of digital library services. • Issues raised in this reading • How such issues are addressed in your DL case
2. Interoperability: Standards and protocols Witten & Bainbridge (2003): 8.5-8.7 in Ch. 8 Interoperability: Standards and protocols
Interoperability • Interoperability is the name of the game for libraries. An important part of traditional library culture is the ability to locate copies of information in other libraries and receive them on loan (interlibrary loan). Libraries work together to provide a truly universal international information service. The degree of cooperation is enormous and laudable. • For digital libraries to communicate with one another, standards are needed for representing documents, metadata, and queries. • The components are in place. What we need are protocols that put them all together to achieve effective and widespread communication. • Different protocols have sprung from the two different cultures upon which digital libraries are founded. Two principal ones: • the Z39.50 protocol, developed by the library community and maintained by the Library of Congress, and • the Open Archives Initiative (OAI) protocol, developed by members of various communities concerned with electronic documents.
Supporting the Z39.50 protocol • A particular Z39.50 system need not implement all parts of the protocol. The protocol is so complex that full implementation is a daunting undertaking and may in any case be inappropriate for a particular digital library site. • For this reason the standard specifies a minimal implementation, which comprises the • Initialization Facility, • Search Facility, • Present Service (part of the Retrieval Facility), and • Type 1 Queries (part of the registry). • Using this baseline implementation, a typical client-server exchange works as follows: • First the client uses the Initialization Facility to establish contact with the server and negotiate values for certain resource limits. • This puts the client in a position to transmit a Type 1 query using the Search Facility. • The number of matching documents is returned, and the client then interacts with the Present Service to access the contents of desired documents. • Greenstone DL software supports Z39.50.
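The three-step exchange above can be mimicked with a toy, in-memory target. This is not a Z39.50 implementation and uses no network or real library: the class `ToyTarget`, its method names, and the 65536-byte cap are all hypothetical, chosen only to show the order of Init, Search, and Present and the fact that Search returns a count while Present returns records.

```python
# Hypothetical mapping from a Bib-1 Use attribute number to a record field.
USE_FIELDS = {4: "title"}


class ToyTarget:
    """In-memory stand-in for a Z39.50 target (server)."""

    def __init__(self, records):
        self._records = records
        self._initialized = False
        self._result_set = []

    def init(self, preferred_message_size):
        """Establish the session and negotiate a resource limit."""
        self._initialized = True
        return min(preferred_message_size, 65536)  # the target caps the request

    def search(self, use_attribute, term):
        """Run the query, keep the result set on the server, return the hit count."""
        if not self._initialized:
            raise RuntimeError("Init must precede Search")
        field = USE_FIELDS[use_attribute]
        self._result_set = [
            r for r in self._records if term.lower() in r.get(field, "").lower()
        ]
        return len(self._result_set)

    def present(self, start, count):
        """Return records from the stored result set; positions are 1-based."""
        return self._result_set[start - 1 : start - 1 + count]


target = ToyTarget([{"title": "UNIX Programming"}, {"title": "Digital Libraries"}])
message_size = target.init(preferred_message_size=1_000_000)
hits = target.search(use_attribute=4, term="unix")
first = target.present(start=1, count=1)
print(message_size, hits, first)
```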
Supporting the Open Archives Initiative (OAI) • For a given digital library site to become an OAI data provider, software needs to be written that can respond to CGI requests and access the database system that stores the documents. • Many programming languages have library support for implementing CGI scripts (Perl, Python, Java, and C++, among others), although the database itself will probably dictate the most suitable choice. • Greenstone can support the construction of a digital library collection based on OAI exported data by the following two steps: • obtaining the raw material from a data provider and configuring a suitable collection • augmenting the collection configuration file with a built-in OAI plugin • With the issuing of the appropriate import.pl and buildcol.pl commands, the end result of these two stages is a searchable, browsable Greenstone collection based on the exported content. • Further configuration of indexes and classifiers is possible depending on the metadata available.
Research protocols – (1) Dienst • Dienst and SDLIP are two long-standing digital library protocols from the research community that are designed to promote interoperability. • The trouble with interoperability, though, is that the purpose is defeated if several groups promote different interoperability schemes. • Dienst, at Cornell University, is one of the longest-running digital library projects in the research community: its origins stretch back to 1992. It has three facets: • a conceptual architecture for distributed digital libraries, • an open protocol for service communication, and • a software system that implements the protocol.
Research protocols – (1) Dienst (cont.) • The protocol supports • search and retrieval of documents, • browsing documents, • adding new documents, and • registering users. Each of these is an independent • There are six categories of DL collection services: • repository services store digital documents and associated metadata; • index services accept queries and return lists of document identifiers; • query mediator services dispatch queries to the relevant index servers; • info services return information about the state of a server; • collection services provide information on how a set of services interact; • registry services store user information.
(2) Simple digital library interoperability protocol (SDLIP) Interoperation among distributed objects has been a central plank of Stanford University's digital library project, the Infobus. Many Infobus objects are in fact proxies to established information sources and services. The original Digital Library Interoperation Protocol (DLIP) has since been superseded by the Simple Digital Library Interoperability Protocol (SDLIP), designed in collaboration with other U.S. research projects. SDLIP places emphasis on a design that is scalable, permitting the development of digital library applications that run on handheld devices (such as Palm Pilots) as well as workstation- and mainframe-based systems. There are four parts (called interfaces) to the protocol: searching, accessing results, metadata, and delivery.
Translating between protocols The Stanford research group provides a Java-based software development kit to support SDLIP. The translator runs as a server in its own right. For example, the translator server implements the intersection of the Greenstone protocol and SDLIP's search and source metadata interfaces.
Discussion and Reflection • Summary: • Four digital library protocols: Z39.50, Open Archives Initiative (OAI), Dienst, and SDLIP • all support browsing and document retrieval, and all but OAI support searching • Text searching is relatively well understood: all support ranked and Boolean queries, with a rich array of options: fielded search, stemming, case matching, and so forth. • Issues raised in this reading • How such issues are addressed in your DL case
3. General purpose technologies useful for digital repositories Reese & Banerjee (2008): Ch. 4 General purpose technologies useful for digital repositories
The Changing Face of Metadata • The foundation of any digital repository is the underlying metadata structures that provide meaning to the information objects that it stores. • Libraries have traditionally treated the creation and maintenance of bibliographic metadata as one of the core values of the profession. • For libraries to truly integrate their digital content, their bibliographic infrastructure must change dramatically. This change must include both the metadata creation and delivery methods of bibliographic content. • The days of a homogenous bibliographic standard for all content are coming to an end as more specialized descriptive formats are needed to describe the various types of materials being produced today and into the future. • This chapter will focus on the technologies that make up today's current digital repository systems • XML (eXtensible Markup Language), and • SOAP (Simple Object Access Protocol)
XML in Libraries • The library community has been one of the early implementers of XML-based descriptive schemas. • Issues of document delivery, indexing, and display have pushed the library community to consider XML-based markup languages as a method of preserving digital and bibliographical information • Today, libraries make use of XML nearly every day. We can find XML in the ILS systems, in image management tools, and in many other facets of the library.
XML in digital repositories • The ability to provide XML-formatted data from one's digital repository is a valuable access method. • When making decisions regarding a digital repository, one must look at how well the digital repository supports XML and XML-related technologies. • One should ask the following questions: • Does the digital repository support XML-structured bibliographic and administrative metadata? Does the digital repository support structural XML-based metadata schemas like METS (Metadata Encoding and Transmission Standard)? • Can the metadata be harvested or extracted? And can the data be extracted in XML? • Does the digital repository support SOAP or other XML query syntaxes? • Can my digital repository support multiple metadata formats?
Why Use XML-based Metadata? • XML is human readable • One of the primary benefits associated with XML is that the generated metadata is human readable. • This characteristic of XML (1) makes data more transparent, (2) makes the data less susceptible to data corruption, and (3) reduces the likelihood of data lockup. • XML offers a quicker cataloging strategy • In many cases, XML-based metadata schemas will lower many of the barriers organizations currently face when creating bibliographic metadata. • XML can represent multi-formatted and embedded documents • One of XML's strengths is its ability to represent hierarchical data structures and relationships. • An XML record could be generated that contains information on a single document available in multiple physical formats with the unique features of each item captured within the XML data structure.
Why Use XML-based Metadata? (continued) • XML metadata becomes “smarter” • In an XML document, metadata fields can have attributes and properties that can be acted upon. • Data can be manipulated and reordered without having to rework the source XML document. • The ability to illustrate relationships and interlinks between documents - the ability to store content or links to content within the metadata • XML is not just a library standard • While the LIS community has created XML-based schemas like MODS, METS, and Dublin Core, the fact that these schemas are in XML allows libraries to look outside the traditional library vendors to a broader development community.
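Two of the points above, one record describing a document held in several formats and reordering data without touching the source, can be illustrated with Python's standard `xml.etree.ElementTree`. The element and attribute names (`record`, `format`, `type`, `sizeKB`) are made up for this sketch and do not come from MODS, METS, or Dublin Core.

```python
import xml.etree.ElementTree as ET

# Build one record for a single document available in three formats.
record = ET.Element("record")
ET.SubElement(record, "title").text = "Annual Report 2003"
for fmt, size_kb in [("pdf", 2048), ("html", 512), ("tiff", 90210)]:
    item = ET.SubElement(record, "format", {"type": fmt, "sizeKB": str(size_kb)})
    item.text = f"report.{fmt}"  # link to the content itself

source_xml = ET.tostring(record, encoding="unicode")

# Work on a parsed copy: sort the formats by size for display.
view = ET.fromstring(source_xml)
by_size = sorted(view.findall("format"), key=lambda e: int(e.get("sizeKB")))
smallest_first = [e.get("type") for e in by_size]

print(source_xml)
print(smallest_first)
```

The attributes are what make the fields actionable: the sort reads `sizeKB` directly, and `source_xml` is left exactly as it was generated.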
Web Services and SOAP • SOAP: the Simple Object Access Protocol • SOAP is a standard method for exposing APIs for Web-based applications. • As a digital repository's content and traffic grow, users of the repository may want to access the repository's content outside the traditional user interface. • A digital repository that lacks Web services support greatly reduces the amount of integration that an organization can accomplish with its content. • Technologies like SOAP hold the keys to opening a digital repository beyond the "walls" of the application platform, allowing other services like search engines or users to search, harvest, or integrate data from one digital repository into their own context or workflow.
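To make the "XML over HTTP" idea concrete, here is a minimal sketch of what a SOAP request body might look like when built in Python. The envelope namespace is the standard SOAP 1.1 one; the service namespace `http://example.org/repository`, the operation `SearchRepository`, and its two parameters are hypothetical and stand in for whatever a given repository's service description actually defines.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SVC_NS = "http://example.org/repository"               # hypothetical service namespace


def build_search_request(keywords, max_results=10):
    """Wrap a hypothetical SearchRepository call in a SOAP 1.1 envelope."""
    ET.register_namespace("soap", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    operation = ET.SubElement(body, f"{{{SVC_NS}}}SearchRepository")
    ET.SubElement(operation, f"{{{SVC_NS}}}keywords").text = keywords
    ET.SubElement(operation, f"{{{SVC_NS}}}maxResults").text = str(max_results)
    return ET.tostring(envelope, encoding="unicode")


request_xml = build_search_request("metadata", max_results=5)
print(request_xml)
```

A client would send this string as the body of an HTTP POST to the service endpoint; that step is left out so the sketch runs without a network.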
Discussion and Reflection • Issues raised in this reading • How such issues are addressed in your DL case
4. Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) • http://www.oaforum.org/tutorial/english/intro.htm • Rhyno (2004): Ch. 2 Important protocols for digital libraries and OSS options for using them
As one of the core protocols for DL projects • OAI-PMH - Open Archives Initiative (OAI) Protocol for Metadata Harvesting (PMH) • It has been called the "HTTP of digital libraries" even though the protocol actually uses HTTP as a transport mechanism between digital collections. • OAI-PMH originated in a 1999 meeting in Santa Fe, New Mexico, to address a series of problems that were occurring in the e-print server world. • As disciplinary e-print servers became more common, it was difficult to support searching across multiple repositories. • Repositories needed greater capabilities to automatically identify and copy papers that had been deposited in other repositories. • The solution was the definition of an interface to permit an e-print server to expose metadata for the papers it held. This would allow the metadata to be picked up by programs on the Web called harvesters. • Harvesting programs travel around a network gathering, or harvesting, content by copying it to a central site.
OAI-PMH (continued 1) • OAI-PMH divides the world into data providers and service providers. • Registered OAI-PMH data providers: http://www.openarchives.org/Register/BrowseSites • Data providers who support the OAI-PMH may choose to list their repository in the OAI registry, which serves to: • Provide a publicly accessible list of OAI conformant repositories, making it easy for service providers to discover repositories from which metadata can be harvested. Repositories may also wish to expose a friends container as part of their Identify response as a parallel means for guiding service providers towards repositories from which metadata can be harvested. • Provide a mechanism for data providers to ensure their conformance with the OAI-PMH specification. • Provide a means for the OAI to monitor use of the protocol and plan future activities and strategies.
OAI-PMH (continued 2) • Registered OAI-PMH service providers: http://www.openarchives.org/Register/BrowseSites • As of Feb 9, 2009, there are 959 OAI conforming repositories. • The concept is that service providers add value to the data they harvest by defining search engines and other applications. • Although other metadata schemes can be specified, OAI-PMH mandates that Dublin Core be available. • OAI is purposely designed to be "low barrier" to developers. Relatively simple criteria are used for harvesting: date stamps, which identify when resources have last been modified, and sets, which group together records based on criteria defined by the data provider.
Main Technical Ideas of OAI-PMH (1) • The main ideas of OAI • world-wide consolidation of scholarly archives • free access to the archives (at least: metadata) • consistent interfaces for archives and service providers • low barrier protocol / effortless implementation (e.g., because based on HTTP, XML, DC) • Basic functioning of OAI-PMH • Data Providers (open archives, repositories) provide free access to metadata, and may, but do not necessarily, offer free access to full texts or other resources. OAI-PMH provides an easy to implement, low barrier solution for Data Providers. • Service Providers use the OAI interfaces of the Data Providers to harvest and store metadata. Note that this means that • there are no live search requests to the Data Providers; rather, services are based on the harvested data via OAI-PMH. • Service Providers may select certain subsets from Data Providers (e.g., by set hierarchy or date stamp). • Service Providers offer (value-added) services on the basis of the metadata harvested, and they may enrich the harvested metadata in order to do so.
Main Technical Ideas of OAI-PMH (2) • OAI-PMH: overview and structure model • OAI-PMH supports six request types (known as "verbs"), e.g., http://archive.org?verb=ListRecords&from=2002-11-01. • Responses are encoded in XML syntax. OAI-PMH supports any metadata format encoded in XML. Dublin Core is the minimal format specified for basic interoperability.
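The six verbs defined by OAI-PMH are Identify, ListMetadataFormats, ListSets, ListIdentifiers, ListRecords, and GetRecord. As a small sketch of how a request URL is put together, the helper below uses only the Python standard library; the base URL `http://repository.example.org/oai` and the function name are placeholders. In OAI-PMH 2.0 a ListRecords request also carries a `metadataPrefix` argument, so the example includes one.

```python
from urllib.parse import urlencode

VERBS = (
    "Identify",
    "ListMetadataFormats",
    "ListSets",
    "ListIdentifiers",
    "ListRecords",
    "GetRecord",
)


def build_request(base_url, verb, **arguments):
    """Build an OAI-PMH GET URL; pass 'from' as from_ because it is a Python keyword."""
    if verb not in VERBS:
        raise ValueError(f"unknown OAI-PMH verb: {verb}")
    params = {"verb": verb}
    for name, value in arguments.items():
        params[name.rstrip("_")] = value
    return f"{base_url}?{urlencode(params)}"


url = build_request(
    "http://repository.example.org/oai",  # placeholder base URL
    "ListRecords",
    metadataPrefix="oai_dc",
    from_="2002-11-01",
)
print(url)
```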
Data Provider: prerequisites These are the things you must, should, or may have in place in order to implement OAI-PMH as a Data Provider: • metadata on resources ("items") These should be stored in a database (such as an SQL database). A file system may be necessary. It is necessary to have a unique identifier for each item. • Web server, accessible via the Internet, e.g. Apache, IIS • programming interface / API • e.g. Perl, PHP, Java-Servlet • web server extension • access to database (or filesystem) • not needed: session management • archive identifier / base URL • unique identifier for each item • metadata format (one or more; at least: unqualified Dublin Core) • datestamps for metadata (created / last modified) • logical set hierarchy (may have) This is most usefully defined by agreement within communities, especially subject communities • flow control by implementation of resumption token (optional, but 'larger' repositories should have it)
Data Provider: components and architecture Components: • Argument Parser validates OAI requests. • Error Generator creates XML responses with encoded error messages. • Database Query / Local Metadata Extraction retrieves metadata from the repository, according to the required metadata format. • XML Generator / Response Creation creates XML responses with encoded metadata information. • Flow Control realises incomplete list sequences for 'larger' repositories. It uses a resumption token as the control mechanism. This diagram illustrates an example architecture for a Data Provider
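The first two components, the Argument Parser and the Error Generator, can be sketched together. The verb/argument table and the error codes `badVerb` and `badArgument` follow OAI-PMH 2.0; the function names are hypothetical, and the error document is deliberately simplified (a conforming response also carries the OAI-PMH namespace plus `responseDate` and `request` elements).

```python
import xml.etree.ElementTree as ET

# verb -> (required arguments, optional arguments)
ARGUMENTS = {
    "Identify": (set(), set()),
    "ListMetadataFormats": (set(), {"identifier"}),
    "ListSets": (set(), set()),
    "ListIdentifiers": ({"metadataPrefix"}, {"from", "until", "set"}),
    "ListRecords": ({"metadataPrefix"}, {"from", "until", "set"}),
    "GetRecord": ({"identifier", "metadataPrefix"}, set()),
}
# Verbs whose list responses may be continued with a resumption token.
RESUMABLE = {"ListSets", "ListIdentifiers", "ListRecords"}


def check_request(params):
    """Argument Parser: return None if the request is valid, else an error code."""
    verb = params.get("verb")
    if verb not in ARGUMENTS:
        return "badVerb"
    given = set(params) - {"verb"}
    if "resumptionToken" in given:
        # resumptionToken is exclusive: no other argument may accompany it.
        ok = verb in RESUMABLE and given == {"resumptionToken"}
        return None if ok else "badArgument"
    required, optional = ARGUMENTS[verb]
    if not required <= given or not given <= required | optional:
        return "badArgument"
    return None


def error_response(code, message):
    """Error Generator: simplified XML error document (namespace etc. omitted)."""
    root = ET.Element("OAI-PMH")
    error = ET.SubElement(root, "error", {"code": code})
    error.text = message
    return ET.tostring(root, encoding="unicode")


bad = {"verb": "ListRecords"}  # metadataPrefix is missing
code = check_request(bad)
print(error_response(code, "metadataPrefix is required"))
```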
Service Provider: prerequisites • There are three technical infrastructure prerequisites for implementing an OAI-PMH Service Provider that will harvest metadata from Data Providers via OAI-PMH: • an Internet-connected server • a database system (relational or XML) • a programming environment. (The programming environment must be one that can issue HTTP requests to web servers, can issue database requests, and includes an XML parser.)
Service Provider: components and architecture • Archive management involves the selection of repositories to be harvested. Entries to your list of repositories to be harvested may be made manually or you can automatically add or remove archives using the official registry. • Request Component creates HTTP requests and sends them to OAI repositories (Data Provider). It demands metadata using the allowed verbs of the OAI-PMH. It may do selective harvesting using the set parameter. • Scheduler realises timed and regular retrieval of the associated archives. The simplest case would be manual initiation of the jobs, but this can be automated, e.g., as a cron job. • Flow Control is implemented via resumption token, partitioning of the result list into incomplete sections with a new request to retrieve more results. An HTTP error 503 (service not available) allows analysis of the response to extract a “retry-after” period. • Update Mechanism realises the consolidation of metadata which have been harvested earlier (merge old and new data). The easiest case would be to delete all ‘old’ metadata from each repository before harvesting it again. A reasonable alternative is to do an incremental update (from parameter) – insert new metadata and overwrite changed / deleted metadata (assignment using the unique identifiers). • XML Parser analyses the responses received from the repositories, with validation using the XML schema, and transforms the metadata encoded in XML into the internal data structure. • Normaliser transforms data in different metadata formats into a homogenous structure. It harmonises representation of, for example, date, author, language code. It may map between or translate different languages. • Database receives the output of the normaliser mapping the XML structure of the metadata into a relational database that will handle multiple values of elements. An alternative is to use an XML database. 
• Duplication Checker merges identical records from different data providers. One possibility for implementing this is by the unique identifier for each item (for example, by URN). However, this solution is often not easily practicable and is not risk or error free. • Service Module provides the actual service to the 'public'. The basis for a service provided is the harvested and stored records of the associated archives. That is, it uses only the local database for requests etc., and thus it does not make calls on the Data Providers during operation.
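Several of the components above (Request Component, Flow Control, XML Parser, Update Mechanism) come together in the harvesting loop. The sketch below is one possible shape for it, not a reference harvester: `fetch` is a hypothetical callable that takes the request parameters and returns the response XML as a string, so the loop can be tried offline with the fake data provider included here. In real use `fetch` would issue the HTTP request.

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"  # OAI-PMH 2.0 response namespace


def harvest(fetch, metadata_prefix="oai_dc", from_date=None):
    """Collect records keyed by identifier, following resumption tokens."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date  # incremental update
    store = {}
    while True:
        root = ET.fromstring(fetch(params))
        for record in root.iter(f"{OAI}record"):
            header = record.find(f"{OAI}header")
            identifier = header.findtext(f"{OAI}identifier")
            if header.get("status") == "deleted":
                store.pop(identifier, None)   # drop records deleted at the source
            else:
                store[identifier] = record    # insert new or overwrite changed
        token = root.find(f".//{OAI}resumptionToken")
        if token is None or not (token.text or "").strip():
            return store                      # absent or empty token: list complete
        # A resumption token is sent on its own, with only the verb.
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}


# Fake two-page data provider used only to exercise the loop.
PAGE = ('<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">'
        "<ListRecords>{records}{token}</ListRecords></OAI-PMH>")
REC = "<record><header><identifier>{id}</identifier></header></record>"


def fake_fetch(params):
    if "resumptionToken" in params:
        return PAGE.format(records=REC.format(id="oai:x:2"), token="<resumptionToken/>")
    return PAGE.format(records=REC.format(id="oai:x:1"),
                       token="<resumptionToken>t1</resumptionToken>")


harvested = harvest(fake_fetch)
print(sorted(harvested))
```

Keying the store on the unique identifier gives the overwrite behaviour described for the Update Mechanism; merging duplicates across different data providers, as the Duplication Checker does, is a separate and harder problem that this sketch does not attempt.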
Basics of XML schemas for OAI-PMH • OAI-PMH uses XML Schemas to define record formats. • OAI-PMH allows for any metadata format, so long as it is encoded in XML with an XML Schema. • You can exchange any metadata you like using OAI-PMH as long as you can encode it as XML and define an XML Schema for it. • OAI-PMH mandates the oai_dc schema as a minimum standard for interoperability. • All repositories must support oai_dc for a minimum level of interoperability. • If oai_dc does not have enough elements, you can extend it. • If oai_dc is not precise enough, a qualified Dublin Core schema can be used. • If oai_dc is not the right schema for your community or purpose, then use something else as well.
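For orientation, this is roughly what an oai_dc metadata block looks like and how a service provider might read it. The two namespace URIs are the ones defined for oai_dc and the Dublin Core element set; the sample values and the helper `dc_fields` are invented for illustration. Every Dublin Core element is optional and repeatable, which is why the helper collects values into lists.

```python
import xml.etree.ElementTree as ET

DC = "{http://purl.org/dc/elements/1.1/}"  # Dublin Core element set namespace

SAMPLE = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                       xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Sample e-print</dc:title>
  <dc:creator>Doe, Jane</dc:creator>
  <dc:creator>Roe, Richard</dc:creator>
  <dc:date>2002-11-01</dc:date>
</oai_dc:dc>"""


def dc_fields(xml_text):
    """Return a dict mapping each Dublin Core element name to its list of values."""
    root = ET.fromstring(xml_text)
    fields = {}
    for element in root:
        if element.tag.startswith(DC):
            name = element.tag[len(DC):]
            fields.setdefault(name, []).append(element.text)
    return fields


fields = dc_fields(SAMPLE)
print(fields)
```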
OAI Software and Tools There are many OAI tools available. The following table contains links to tools implemented by members of the Open Archives Initiative community: http://www.openarchives.org/pmh/tools/tools.php
OAI Software and Tools (cont) • The tools you choose will depend on such considerations as the type of repository or service you are implementing and the technical skills available to you in-house: • if you are setting up an e-print archive you may want to consider using the EPrints software package, • DSpace provides a digital asset management framework that includes preservation considerations, and • the advantage offered by PHP OAI Data Provider is support for on-the-fly output compression aiming at a significant reduction in data transfer load. • In addition, about thirty OAI-related tools are described in the OA-Forum Final Report on Technical Issues (download from http://www.oaforum.org/documents/). This report also includes a detailed comparison of GNU EPrints and DSpace.
Discussion and Reflection • Issues raised in this reading • How such issues are addressed in your DL case