DL: Lesson 5 Classification Schemas • Luca Dini, dini@celi.it
Web: Crawling [Diagram: “central” index]
Metadata harvesting [Diagram: metadata records with Author, Title, Abstract, Identifier]
Metasearching [Diagram: Metasearch Engine]
What is metasearching? • Given many document sources and a query, a metasearcher: • Finds the good sources for the query • Evaluates the query at these sources • Merges the results from these sources [Diagram: a Metasearcher in front of an existing Web application, unindexed documents, and a legacy database / WAIS / etc.]
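The three steps above can be sketched in a few lines of Python. This is only an illustrative toy, not an implementation from the lecture: the `Source` type, the vocabulary summaries used for source selection, and the "keep the best score" merge rule are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical model: a source maps a query string to (doc_id, score) pairs.
Source = Callable[[str], List[Tuple[str, float]]]


def select_sources(query: str, summaries: Dict[str, Set[str]], k: int = 2) -> List[str]:
    """Step 1: keep the k sources whose vocabulary summary overlaps the query most."""
    terms = set(query.lower().split())
    ranked = sorted(summaries, key=lambda name: (-len(terms & summaries[name]), name))
    return [name for name in ranked[:k] if terms & summaries[name]]


def metasearch(query: str, sources: Dict[str, Source],
               summaries: Dict[str, Set[str]], k: int = 2) -> List[Tuple[str, float]]:
    """Steps 2 and 3: evaluate the query at the chosen sources and merge the hits."""
    merged: Dict[str, float] = {}
    for name in select_sources(query, summaries, k):
        for doc_id, score in sources[name](query):
            # Assumed merge rule: a document keeps its best score across sources.
            merged[doc_id] = max(score, merged.get(doc_id, 0.0))
    return sorted(merged.items(), key=lambda item: (-item[1], item[0]))
```

Taking the maximum score only makes sense if the sources score on comparable scales, which is rarely true in practice.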
Main Issues • How to query different types of sources? • How to combine results and rankings from multiple data sources? [Diagram: the Metasearcher sends each source a query in its own form: http://…/getTitle?title=‘biomedical’&… | SELECT title FROM articles . . . | grep ‘biomedical’ *.txt]
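One way to approach the second question is to ignore the sources' raw scores and combine only rank positions. The sketch below uses reciprocal rank fusion, a technique chosen here for illustration rather than one named in the slides. The source names and the constant `k = 60` (a conventional default) are assumptions.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: Dict[str, List[str]], k: int = 60) -> List[str]:
    """Merge ranked lists of document ids from several sources into one ranking.

    Each document receives 1 / (k + rank) from every list it appears in,
    so no source's internal scoring scale needs to be known.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranked_ids in rankings.values():
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical rankings returned by an HTTP source, an SQL source and a grep run.
fused = reciprocal_rank_fusion({
    "http": ["a", "b", "c"],
    "sql": ["b", "a"],
    "grep": ["b", "d"],
})
```

Rank-based fusion sidesteps incomparable scores, at the price of discarding whatever information those scores carried.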
Other Issues • How to choose among multiple data sources? • How to get metadata about multiple data sources? [Diagram: the Metasearcher asks each source for its metadata. Best: http://….?getMetaData | Worst: “Hi. What do you have?” | cat *.txt | SELECT SCHEMA …….]
Cost/Functionality [Plot: cost of acceptance against function; labels: Z39.50, SDLIP/STARTS, Metadata Harvesting, Google]
Z39.50 • http://www.loc.gov/z3950/agency/
Goals • Permits one computer, the client, to search and retrieve information on another, the database server • Important both technically and for its wide use in library systems • Most development has concentrated on bibliographic data • Most implementations emphasize searches that use a bibliographic set of attributes to search databases of MARC records
Principles • Abstract view of database searching. • Server stores a set of databases with searchable indexes • Interactions are based on a session • The client opens a connection with the server, carries out a sequence of interactions and then closes the connection. • During the course of the session, both the server and the client remember the state of their interaction.
The results (Z39.50) • The server carries out the search and builds a result set • The server saves the result set • Subsequent messages from the client can reference the result set • Thus the client can refine a large set with increasingly precise requests, or request presentation of any record in the set, without searching the entire database again
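The stateful behaviour described above can be modelled with a toy in-memory session. This is not the Z39.50 wire protocol or any real client library. The class name, method signatures and result-set names are hypothetical, and only the idea of named result sets kept on the server is taken from the slide.

```python
class ToySession:
    """Toy model of a stateful search session with server-side result sets."""

    def __init__(self, records):
        self._records = records      # the "database": a list of dicts
        self._result_sets = {}       # state the server remembers between requests

    def search(self, name, predicate, within=None):
        """Run a search, save the matching record positions under `name`,
        and return the hit count. `within` refines an earlier result set
        instead of scanning the whole database."""
        if within is None:
            candidates = range(len(self._records))
        else:
            candidates = self._result_sets[within]
        hits = [i for i in candidates if predicate(self._records[i])]
        self._result_sets[name] = hits
        return len(hits)

    def present(self, name, start, count):
        """Return `count` records from a saved set, from 1-based position `start`."""
        positions = self._result_sets[name][start - 1:start - 1 + count]
        return [self._records[i] for i in positions]
```

Note that `search` returns only a count: records travel to the client when `present` is called, which mirrors the separation of searching and retrieval described on the slide.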
Services • init -- client connects to the server and exchanges initial information, e.g., preferred message size • explain -- client inquires of the server what databases are available for searching, the fields that are available, the syntax and formats supported, and other options • search -- client presents a query to a database • choice of syntaxes for specifying searches • only Boolean queries are widely implemented • one or more records may be returned to the client
Services (continued) • manipulation of result sets -- e.g., sort or delete • present -- requests the server to send specified records from the result set to the client in a specified format • options for controlling content and formats, and for managing large records or large result sets
Example • In the database named "Books", find all records for which the access point title contains the value "evangeline" and the access point author contains the value "longfellow". • Z39.50 defines a rich variety of search access points that can be extended by implementers
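As a rough illustration, the query above can be written in Prefix Query Format (PQF), a textual notation used by some Z39.50 toolkits. PQF is not part of the slides, and the helper names below are hypothetical. The attribute numbers are the Bib-1 Use attributes for title (4) and author (1003). The database name ("Books") is chosen separately from the query and is not shown.

```python
# Bib-1 "Use" attribute numbers for the two access points in the example.
BIB1_USE = {"title": 4, "author": 1003}


def pqf_term(access_point: str, value: str) -> str:
    """One search term restricted to an access point, e.g. @attr 1=4 "evangeline"."""
    return f'@attr 1={BIB1_USE[access_point]} "{value}"'


def pqf_and(*clauses: str) -> str:
    """Combine clauses with Boolean AND. PQF uses prefix notation (@and A B),
    so more than two clauses are folded pairwise from the left."""
    query = clauses[0]
    for clause in clauses[1:]:
        query = f"@and {query} {clause}"
    return query


query = pqf_and(pqf_term("title", "evangeline"), pqf_term("author", "longfellow"))
```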
Problems • Very difficult to implement • There are freely available implementations, but they are complex • Outdated assumptions • Searching is expensive computationally • Bandwidth is limited (ASN.1 compression) • Originally designed for bibliographic record retrieval, and not full documents or other objects • “Overspecified” • (Almost) Nobody Implements Explain! • Assumes questionable user model (stateful)
Simple Digital Library Interoperability Protocol • http://www-diglib.stanford.edu/~testbed/doc2/SDLIP/
SDLIP • Compromise between a full-scale, all-encompassing search middleware design such as Z39.50 and the “anything goes” approach typical of ad hoc search interface design on the web • Support for stateful and stateless operation by the server • Support for thin clients, such as handheld devices • Developed jointly by Stanford, Berkeley, and UC Santa Barbara
Interfaces • Search Interface – defines a simple query language; the protocol can then include other languages • Result Interface – a parking meter metaphor supports varying notions of result sets • Source Metadata Interface – provides an extension mechanism through discovery of server capabilities
Result access interface • This interface allows client applications to access the set of result documents, wherever that set is maintained • Four services: • getSessionInfo • getDocs • extendStateTimeout • cancelRequest
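The parking meter idea can be pictured as server-side result state that expires unless the client pays for more time. The toy class below borrows the four service names from the slide, but its method signatures, arguments and return values are assumptions made for illustration. It is not the SDLIP specification or any SDLIP library.

```python
import time


class ToyResultStore:
    """Toy result store whose per-session state expires like a parking meter."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock          # injectable so the example can be tested
        self._sessions = {}          # session id -> (docs, expiry time)

    def open(self, session_id, docs, timeout):
        """Hypothetical helper: store a finished search's documents."""
        self._sessions[session_id] = (list(docs), self._clock() + timeout)

    def _live(self, session_id):
        docs, expires = self._sessions.get(session_id, (None, 0.0))
        if docs is None or self._clock() >= expires:
            self._sessions.pop(session_id, None)   # meter ran out: drop the state
            raise KeyError(f"no live state for session {session_id!r}")
        return docs, expires

    def get_session_info(self, session_id):
        docs, expires = self._live(session_id)
        return {"num_docs": len(docs), "seconds_left": expires - self._clock()}

    def get_docs(self, session_id, start, count):
        docs, _ = self._live(session_id)
        return docs[start:start + count]

    def extend_state_timeout(self, session_id, extra_seconds):
        docs, expires = self._live(session_id)
        self._sessions[session_id] = (docs, expires + extra_seconds)

    def cancel_request(self, session_id):
        self._sessions.pop(session_id, None)
```

A server that wants to stay stateless could set the timeout to zero and return all documents with the search response, which is one way to read the "varying notions of result sets" remark.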
Source metadata interface • Provides information about the service and server itself, such as • Collections served • Collection metadata/content information • Searchable properties • Three operations • getInterface • getSubcollectionInfo • getPropertyInfo
Cost/Functionality [Plot: cost against functionality; labels: Z39.50, SGML, Metadata Harvesting, Dublin Core, HTTP, Google]
Metadata [Diagram: metadata record fields: Author, Title, Abstract, Identifier]
History • Increasing interest in alternative scholarly publishing solutions – e.g., LANL arXiv • Increasing impact through federation • UPS Mtg., Santa Fe, October 1999 • Representatives of various ePrint, library, and publishing communities • Goal: definition of an interoperability framework among ePrint providers • Result: Santa Fe Convention, interoperability through metadata harvesting
Umbrella model • Metadata Harvesting …that can be exploited by different communities [Diagram: Reference Libraries, Museums, Publishers, E-Print Archives]
Key technical features • Deploy-now technology – 80/20 rule • Two-party model – providers (data providers) and consumers (service providers) • Simple HTTP encoding • XML schema for some degree of protocol conformance • Extensibility • Multiple item-level metadata formats • Collection-level metadata
Roles • Service Providers: Discovery, Current Awareness, Preservation • Data Providers [Diagram: metadata harvesting links data providers to service providers]
Key Features • definitions & concepts: repository, record, identifier, datestamp, set • protocol features: HTTP encoding, metadata prefix & schema, flow control • protocol requests: supporting requests, harvesting requests
Record • header: protocol support • metadata: format-specific metadata • about: community-specific record data
<record>
  <header>
    <identifier>oai:eg:001</identifier>
    <datestamp>1999-01-01</datestamp>
  </header>
  <metadata>
    <dc xmlns="http://purl.org/dc">
      <title>My Example</title>
    </dc>
  </metadata>
  <about>
    <ea xmlns="http://www.arXiv.org/ea">
      <usage>No restrictions</usage>
    </ea>
  </about>
</record>
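A harvester has to pull the three parts of such a record apart. The sketch below parses the slide's example record with Python's standard `xml.etree.ElementTree`. The namespace URIs are the ones from the slide and would differ in a real repository's responses, and the function name and returned dictionary layout are assumptions.

```python
import xml.etree.ElementTree as ET

# The example record from the slide, written as well-formed XML.
RECORD = """<record>
  <header>
    <identifier>oai:eg:001</identifier>
    <datestamp>1999-01-01</datestamp>
  </header>
  <metadata>
    <dc xmlns="http://purl.org/dc">
      <title>My Example</title>
    </dc>
  </metadata>
  <about>
    <ea xmlns="http://www.arXiv.org/ea">
      <usage>No restrictions</usage>
    </ea>
  </about>
</record>"""

# Prefixes are local to this script; only the URIs must match the document.
NS = {"dc": "http://purl.org/dc", "ea": "http://www.arXiv.org/ea"}


def parse_record(xml_text: str) -> dict:
    """Extract the header fields plus one metadata and one 'about' value."""
    root = ET.fromstring(xml_text)
    return {
        "identifier": root.findtext("header/identifier"),
        "datestamp": root.findtext("header/datestamp"),
        "title": root.findtext("metadata/dc:dc/dc:title", namespaces=NS),
        "usage": root.findtext("about/ea:ea/ea:usage", namespaces=NS),
    }
```

The header fields are found without a namespace because the slide's `<record>` element declares none. The `metadata` and `about` payloads each carry their own default namespace.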
Identifiers • Locally unique key for extracting a record from a repository • oai-identifier = oai:archive-identifier:record-identifier • oai: registered URI scheme • archive-identifier: registered within OAI • record-identifier: unique ID within the archive (syntax is archive-specific) • example = oai:ncstrl:ncstrl.cornellcs/TR94-1418
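The three-part syntax above is simple enough to split by hand. The following is a minimal sketch, assuming only the `oai:archive-identifier:record-identifier` grammar from the slide. The type and function names are hypothetical, and no check is made that the archive identifier is actually registered.

```python
from typing import NamedTuple


class OaiIdentifier(NamedTuple):
    archive: str   # archive identifier, registered within OAI
    record: str    # record identifier, syntax is archive-specific


def parse_oai_identifier(identifier: str) -> OaiIdentifier:
    """Split 'oai:archive:record' at the first two colons only, because the
    archive-specific record part may itself contain colons or slashes."""
    scheme, _, rest = identifier.partition(":")
    archive, _, record = rest.partition(":")
    if scheme != "oai" or not archive or not record:
        raise ValueError(f"not an oai-identifier: {identifier!r}")
    return OaiIdentifier(archive, record)


example = parse_oai_identifier("oai:ncstrl:ncstrl.cornellcs/TR94-1418")
```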
Identify [Diagram: the harvester (service provider) sends the request to the repository (data provider)] • Repository name • Base-URL • Admin e-mail • OAI protocol version • Description Container
ListMetadataFormats • REPEAT • Format prefix • Format XML schema • /REPEAT
ListSets • REPEAT • Set Specification • Set Name • /REPEAT
ListRecords * from=a * until=b * set=klm * metadataPrefix=oai_dc • REPEAT • Identifier • Datestamp • Metadata • About Container • /REPEAT
ListIdentifiers * from=a * until=b * set=klm • REPEAT • Identifier • Datestamp • /REPEAT
GetRecord * identifier=oai:mlib:123a * metadataPrefix=oai_dc • Identifier • Datestamp • Metadata • About
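Because the protocol uses a simple HTTP encoding, each of the six requests above can be written as a URL with a `verb` parameter plus the arguments shown on the slides. The builder below is a sketch under that assumption. The base URL is a placeholder, the function name is hypothetical, and the table of allowed arguments lists only what appears on these slides, so it may not match a given protocol version exactly.

```python
from urllib.parse import urlencode

# Arguments each request carries on the slides above (not a full specification).
ALLOWED_ARGS = {
    "Identify": set(),
    "ListMetadataFormats": set(),
    "ListSets": set(),
    "ListRecords": {"from", "until", "set", "metadataPrefix"},
    "ListIdentifiers": {"from", "until", "set"},
    "GetRecord": {"identifier", "metadataPrefix"},
}


def oai_request_url(base_url: str, verb: str, args: dict = None) -> str:
    """Build a GET URL for one request, rejecting verbs or arguments not listed above.

    Arguments are passed as a dict because `from` is a reserved word in Python.
    """
    args = dict(args or {})
    if verb not in ALLOWED_ARGS:
        raise ValueError(f"unknown verb: {verb}")
    unexpected = set(args) - ALLOWED_ARGS[verb]
    if unexpected:
        raise ValueError(f"{verb} does not take {sorted(unexpected)}")
    # The verb goes first; the remaining arguments are sorted for a stable URL.
    query = urlencode([("verb", verb)] + sorted(args.items()))
    return f"{base_url}?{query}"


BASE = "http://example.org/oai"  # placeholder base URL, not a real repository
get_record = oai_request_url(
    BASE, "GetRecord", {"identifier": "oai:mlib:123a", "metadataPrefix": "oai_dc"}
)
```

Selective harvesting is then a `ListRecords` or `ListIdentifiers` URL with `from`, `until` and `set` filled in. The colons in the identifier are percent-encoded by `urlencode`.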
http://www.google.com.tw/webmasters/sitemaps/docs/en/other.html#oai • http://www.nla.gov.au/digicoll/oai/getRecord.html • oai_dc • http://www.nla.gov.au/digicoll/oai/ • http://www.nla.gov.au/digicoll/oai/listMetadataFormats.html (oai:nla.gov.au:nla.pic-an22111591)