
DL: Lesson 5, Classification Schemas. Luca Dini, dini@celi.it

Presentation Transcript


  1. DL: Lesson 5, Classification Schemas. Luca Dini, dini@celi.it

  2. Web: Crawling “central” index ?

  3. Metadata harvesting metadata

  4. Author Title Abstract Identifier ? metadata

  5. Metasearching Metasearch Engine ?

  6. What is metasearching? • Given many document sources and a query, a metasearcher: • Finds the good sources for the query • Evaluates the query at these sources • Merges the results from these sources [Diagram: Metasearcher connected to: Existing Web Application; Unindexed Documents; Legacy Database / WAIS / etc.]
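
The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not a real metasearch engine: `SourceInfo`, its `topics` summary, and `metasearch` are hypothetical names, each source is modelled as a plain function returning (document id, score) pairs, and keeping the best score per document is only one naive way to merge.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set, Tuple

Hit = Tuple[str, float]  # (document id, score); both are assumptions of this sketch


@dataclass
class SourceInfo:
    name: str
    topics: Set[str]                    # hypothetical summary of what the source covers
    search: Callable[[str], List[Hit]]  # evaluates a query at this source


def select_sources(query: str, sources: List[SourceInfo]) -> List[SourceInfo]:
    """Step 1: keep the sources whose topic summary shares a term with the query."""
    terms = set(query.lower().split())
    return [s for s in sources if terms & s.topics]


def metasearch(query: str, sources: List[SourceInfo], limit: int = 10) -> List[Hit]:
    """Steps 2 and 3: evaluate the query at each selected source, then merge."""
    best: Dict[str, float] = {}
    for source in select_sources(query, sources):
        for doc_id, score in source.search(query):
            # Naive merge: keep the highest score seen for each document.
            best[doc_id] = max(score, best.get(doc_id, float("-inf")))
    # Highest score first; ties are broken by document id for a stable order.
    ranked = sorted(best.items(), key=lambda hit: (-hit[1], hit[0]))
    return ranked[:limit]
```

Taking the maximum score assumes that scores from different sources are comparable, which is rarely true in practice.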

  7. Main Issues • How to query different types of sources? • How to combine results and rankings from multiple data sources? [Diagram: the Metasearcher sends the same query in three forms: http://…/getTitle?title='biomedical'&… ; SELECT title FROM articles . . . ; grep 'biomedical' *.txt]
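
On the second issue: a web service, a SQL database, and grep do not produce scores on a common scale, so one option is to merge using rank positions only. The sketch below uses reciprocal rank fusion, a general technique that is not taken from the slides; the function name and the constant `k = 60` are choices made for this illustration.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Merge ranked lists of document ids using rank positions only.

    Each document earns 1 / (k + rank) from every list it appears in,
    so raw scores from different sources never need to be compared.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document id.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
```

A document that ranks reasonably well in several sources can overtake one that is first in a single source, which is usually the behaviour wanted from a merged list.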

  8. Other Issues • How to choose among multiple data sources? • How to get metadata about multiple data sources? [Diagram: how the Metasearcher learns about a source. Best: http://….?getMetaData; Worst: “Hi. What do you have?”; cat *.txt; SELECT SCHEMA …….]

  9. Cost/Functionality [Chart: cost of acceptance against function, placing Z39.50, SDLIP/STARTS, Metadata Harvesting, and google]

  10. Z39.50 • http://www.loc.gov/z3950/agency/

  11. Goals • Permits one computer, the client, to search and retrieve information on another, the database server • Important both technically and for its wide use in library systems • Most development has concentrated on bibliographic data • Most implementations emphasize searches that use a bibliographic set of attributes to search databases of MARC records

  12. Principles • Abstract view of database searching. • Server stores a set of databases with searchable indexes • Interactions are based on a session • The client opens a connection with the server, carries out a sequence of interactions and then closes the connection. • During the course of the session, both the server and the client remember the state of their interaction.

  13. The results • Z39.50 • The server carries out the search and builds a results set • The server saves the results set • Subsequent messages from the client can reference the results set • Thus the client can narrow a large set with increasingly precise requests, or can request a presentation of any record in the set, without searching the entire database again.
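
The saved-results-set idea can be illustrated with a toy in-memory session. This is a sketch of the concept only, not the Z39.50 wire protocol: `SearchSession`, its method signatures, and the substring matching are all assumptions made for illustration.

```python
from typing import Dict, List, Optional


class SearchSession:
    """Toy stand-in for a stateful search session that saves named results sets."""

    def __init__(self, records: List[Dict[str, str]]):
        self._records = records
        self._result_sets: Dict[str, List[int]] = {}  # set name -> record positions

    def search(self, set_name: str, field: str, value: str,
               within: Optional[str] = None) -> int:
        """Run a search, save the hits under set_name, and return the hit count.

        If `within` names an earlier results set, only that set is scanned,
        so the client narrows a large set without a new full search.
        """
        if within is None:
            candidates = range(len(self._records))
        else:
            candidates = self._result_sets[within]
        needle = value.lower()
        self._result_sets[set_name] = [
            i for i in candidates
            if needle in self._records[i].get(field, "").lower()
        ]
        return len(self._result_sets[set_name])

    def present(self, set_name: str, start: int, count: int) -> List[Dict[str, str]]:
        """Return `count` records from a saved set, from 1-based position `start`."""
        positions = self._result_sets[set_name][start - 1:start - 1 + count]
        return [self._records[i] for i in positions]
```

The server-side dictionary of results sets is exactly the state that both parties must remember for the length of the session.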

  14. Services • init -- client connects to the server and exchanges initial information, e.g., preferred message size • explain -- client asks the server which databases are available for searching, which fields are available, which syntaxes and formats are supported, and other options • search -- client presents a query to a database • a choice of syntaxes for specifying searches • only Boolean queries are widely implemented • one or more records may be returned to the client

  15. Services • manipulation of results sets -- e.g., sort or delete • present -- requests the server to send specified records from the results set to the client in a specified format • options for controlling content and formats, and for managing large records or large results sets

  16. Example • In the database named "Books", find all records for which the access point title contains the value "evangeline" and the access point author contains the value "longfellow". • Z39.50 defines a rich variety of search access points that can be extended by implementers
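
As a rough illustration, the example query can be written in Prefix Query Format (PQF), the textual notation used by the YAZ toolkit. PQF is not mentioned in the slides and is used here only as a convenient way to show the structure. The sketch assumes the Bib-1 attribute set, where Use attribute 4 is Title and 1003 is Author; the helper names are hypothetical. The database name "Books" is not part of the query itself, since it is passed separately in the search request.

```python
# Bib-1 "Use" attribute values for the two access points in the example.
BIB1_USE = {"title": 4, "author": 1003}


def term(access_point: str, value: str) -> str:
    """Render one access-point/value pair as a PQF term."""
    if '"' in value or "\\" in value:
        # Quoting rules are left out of this sketch, so such values are rejected.
        raise ValueError("value must not contain quotes or backslashes")
    return '@attr 1={} "{}"'.format(BIB1_USE[access_point], value)


def all_of(*clauses: str) -> str:
    """Combine clauses with the prefix operator @and (binary, so it is nested)."""
    if not clauses:
        raise ValueError("at least one clause is required")
    query = clauses[0]
    for clause in clauses[1:]:
        query = "@and {} {}".format(query, clause)
    return query


QUERY = all_of(term("title", "evangeline"), term("author", "longfellow"))
# QUERY == '@and @attr 1=4 "evangeline" @attr 1=1003 "longfellow"'
```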

  17. Problems • Very difficult to implement • There are freely available implementations, but they are complex • Outdated assumptions: searching is computationally expensive; bandwidth is limited (ASN.1 compression) • Originally designed for bibliographic record retrieval, not for full documents or other objects • “Overspecified” • (Almost) nobody implements Explain! • Assumes a questionable user model (stateful)

  18. Simple Digital Library Interoperability Protocol • http://www-diglib.stanford.edu/~testbed/doc2/SDLIP/

  19. SDLIP • Compromise between a full-scale, all-encompassing search middleware design such as Z39.50 and the “anything goes” approach typical of ad-hoc search interface design on the web • Support for stateful and stateless operation by the server • Support for thin clients, such as handheld devices • Developed jointly by Stanford, Berkeley, and UC Santa Barbara

  20. SDLIP – Search Middleware

  21. Interfaces

  22. Interfaces • Search Interface – defines a simple query language; the protocol can then include other languages • Result Interface – parking meter metaphor supports varying notions of results sets • Source Metadata Interface – provides an extension mechanism through discovery of server capabilities

  23. Result access interface • This interface allows client applications to access the set of result documents, wherever that set is maintained • Four services: • getSessionInfo • getDocs • extendStateTimeout • cancelRequest
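
One way to read the parking-meter metaphor for results sets is that the server keeps a set only while paid-up time remains, and the client can add time. The sketch below borrows the four service names from the slide but is otherwise hypothetical: the signatures, the `park` helper, and the injected clock are assumptions for illustration, not the SDLIP API.

```python
import time
from typing import Callable, Dict, List


class ResultStore:
    """Toy server-side store whose results sets expire like a parking meter."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._clock = clock                   # injectable so expiry can be tested
        self._sessions: Dict[str, dict] = {}

    def park(self, session_id: str, docs: List[str], seconds: float) -> None:
        """Hypothetical helper: store a search's results with an initial timeout."""
        self._sessions[session_id] = {
            "docs": list(docs),
            "expires": self._clock() + seconds,
        }

    def _live(self, session_id: str) -> dict:
        session = self._sessions.get(session_id)
        if session is None or session["expires"] <= self._clock():
            self._sessions.pop(session_id, None)  # expired state is discarded
            raise KeyError("no live results set for " + session_id)
        return session

    def get_session_info(self, session_id: str) -> dict:
        session = self._live(session_id)
        return {
            "doc_count": len(session["docs"]),
            "seconds_left": session["expires"] - self._clock(),
        }

    def get_docs(self, session_id: str, start: int, count: int) -> List[str]:
        """Return up to `count` documents from 0-based position `start`."""
        return self._live(session_id)["docs"][start:start + count]

    def extend_state_timeout(self, session_id: str, seconds: float) -> None:
        """Feed the meter: push the expiry time further out."""
        self._live(session_id)["expires"] += seconds

    def cancel_request(self, session_id: str) -> None:
        self._sessions.pop(session_id, None)
```

A server that never wants to hold state can simply grant a timeout of zero, which is how one design can cover both stateful and stateless operation.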

  24. Source metadata interface • Provides information about the service and server itself, such as • Collections served • Collection metadata/content information • Searchable properties • Three operations • getInterface • getSubcollectionInfo • getPropertyInfo

  25. OAI

  26. [Chart: cost against functionality, with the labels Z39.50, SGML, Metadata Harvesting, Dublin Core, HTTP, Google]

  27. Metadata: Author, Title, Abstract, Identifier

  28. History • Increasing interest in alternative scholarly publishing solutions – e.g., LANL arXiv • Increasing impact through federation • UPS meeting, Santa Fe, October 1999 • Representatives of various ePrint, library, and publishing communities • Goal: definition of an interoperability framework among ePrint providers • Result: Santa Fe Convention, interoperability through metadata harvesting

  29. Umbrella model • Metadata Harvesting … that can be exploited by different communities [Diagram labels: Reference Libraries, Museums, Publishers, E-Print Archives]

  30. Key Technical Features • Deploy-now technology – 80/20 rule • Two-party model – providers (data providers) and consumers (service providers) • Simple HTTP encoding • XML schema for some degree of protocol conformance • Extensibility • Multiple item-level metadata • Collection-level metadata

  31. Roles [Diagram: metadata harvesting connects Data Providers with Service Providers (Discovery, Current Awareness, Preservation)]

  32. Key Features • Definitions & concepts: repository, record, identifier, datestamp, set • Protocol features: HTTP encoding, metadata prefix & schema, flow control • Protocol requests: supporting requests, harvesting requests

  33. Record [Diagram labels: protocol support; format-specific metadata; community-specific record data] <record> <header> <identifier>oai:eg:001</identifier> <datestamp>1999-01-01</datestamp> </header> <metadata> <dc xmlns="http://purl.org/dc"> <title>My Example</title> </dc> </metadata> <about> <ea xmlns="http://www.arXiv.org/ea"> <usage>No restrictions</usage> </ea> </about> </record>
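
To make the three parts of a record concrete, the sketch below parses the slide's example with Python's standard `xml.etree.ElementTree`. The XML is the slide's example with plain quotes and the `<ea>` start tag closed; the function name `summarize` is hypothetical. The outer elements carry no namespace here because the slide's example has none, and a real repository's responses may differ.

```python
import xml.etree.ElementTree as ET

RECORD = """
<record>
  <header>
    <identifier>oai:eg:001</identifier>
    <datestamp>1999-01-01</datestamp>
  </header>
  <metadata>
    <dc xmlns="http://purl.org/dc">
      <title>My Example</title>
    </dc>
  </metadata>
  <about>
    <ea xmlns="http://www.arXiv.org/ea">
      <usage>No restrictions</usage>
    </ea>
  </about>
</record>
"""

# Prefixes are local to this script; only the namespace URIs come from the example.
NS = {"dc": "http://purl.org/dc", "ea": "http://www.arXiv.org/ea"}


def summarize(record_xml: str) -> dict:
    """Pull one value out of each part of the record: header, metadata, about."""
    root = ET.fromstring(record_xml)
    return {
        "identifier": root.findtext("header/identifier"),
        "datestamp": root.findtext("header/datestamp"),
        "title": root.findtext("metadata/dc:dc/dc:title", namespaces=NS),
        "usage": root.findtext("about/ea:ea/ea:usage", namespaces=NS),
    }
```

The header fields are readable without knowing any metadata format, which is what lets a harvester handle records whose `<metadata>` payload it does not understand.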

  34. Identifiers • Locally unique key for extracting a record from a repository • oai-identifier = oai:archive-identifier:record-identifier • oai: registered URI scheme • archive-identifier: registered within OAI • record-identifier: unique ID within the archive (syntax is archive-specific) • Example: oai:ncstrl:ncstrl.cornellcs/TR94-1418
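
A small sketch of how a harvester might split such an identifier into its parts. The names `OaiIdentifier` and `parse_oai_identifier` are hypothetical; the only rule relied on is the three-part `oai:archive-identifier:record-identifier` layout from the slide, with the record part left uninterpreted because its syntax is archive-specific.

```python
from typing import NamedTuple


class OaiIdentifier(NamedTuple):
    archive: str  # archive identifier, registered within OAI
    record: str   # unique within the archive; syntax is archive-specific


def parse_oai_identifier(text: str) -> OaiIdentifier:
    """Split 'oai:archive-identifier:record-identifier' into its parts."""
    # Split on the first two colons only, so the record part may contain colons.
    parts = text.split(":", 2)
    if len(parts) != 3 or parts[0] != "oai" or not parts[1] or not parts[2]:
        raise ValueError("not an oai identifier: {!r}".format(text))
    return OaiIdentifier(archive=parts[1], record=parts[2])
```

For the slide's example, `parse_oai_identifier("oai:ncstrl:ncstrl.cornellcs/TR94-1418")` yields the archive `ncstrl` and the record `ncstrl.cornellcs/TR94-1418`.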

  35. Identify [Diagram: harvester (service provider) and repository (data provider)] • Repository name • Base-URL • Admin e-mail • OAI protocol version • Description Container

  36. ListMetadataFormats [Diagram: harvester (service provider) and repository (data provider)] • REPEAT • Format prefix • Format XML schema • /REPEAT

  37. ListSets [Diagram: harvester (service provider) and repository (data provider)] • REPEAT • Set Specification • Set Name • /REPEAT

  38. ListRecords [Diagram: harvester (service provider) and repository (data provider)] • Arguments: from=a, until=b, set=klm, metadataPrefix=oai_dc • REPEAT • Identifier • Datestamp • Metadata • About Container • /REPEAT
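
Because the protocol uses a simple HTTP encoding, a ListRecords request is just a URL with the verb and its arguments as query parameters. The sketch below builds one with the standard library. The function name, the base URL `http://example.org/oai`, and the dates are hypothetical, and treating `metadataPrefix` as required while the other arguments are optional is an assumption that should be checked against the protocol version in use.

```python
from typing import Optional
from urllib.parse import urlencode


def list_records_url(base_url: str, metadata_prefix: str,
                     from_date: Optional[str] = None,
                     until: Optional[str] = None,
                     set_spec: Optional[str] = None) -> str:
    """Build a ListRecords request URL; arguments left as None are omitted."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    optional = {"from": from_date, "until": until, "set": set_spec}
    params.update({name: value for name, value in optional.items()
                   if value is not None})
    return base_url + "?" + urlencode(params)


# Hypothetical repository and date range, for illustration only.
URL = list_records_url("http://example.org/oai", "oai_dc",
                       from_date="1999-01-01", until="1999-12-31", set_spec="klm")
```

The `from` and `until` arguments are what make incremental harvesting possible: a harvester records the time of its last run and asks only for records with a later datestamp.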

  39. ListIdentifiers [Diagram: harvester (service provider) and repository (data provider)] • Arguments: from=a, until=b, set=klm • REPEAT • Identifier • Datestamp • /REPEAT

  40. GetRecord [Diagram: harvester (service provider) and repository (data provider)] • Arguments: identifier=oai:mlib:123a, metadataPrefix=oai_dc • Identifier • Datestamp • Metadata • About

  41. http://www.google.com.tw/webmasters/sitemaps/docs/en/other.html#oai • http://www.nla.gov.au/digicoll/oai/getRecord.html • oai_dc • http://www.nla.gov.au/digicoll/oai/ • http://www.nla.gov.au/digicoll/oai/listMetadataFormats.html (oai:nla.gov.au:nla.pic-an22111591)
