Increasing Interoperability on Searching Library Collections. Sarantos Kapidakis & Michalis Sfakakis. Laboratory on Digital Libraries and Electronic Publishing Archive and Library Sciences Department Ionian University, Corfu, Greece sarantos@ionio.gr. University of Cyprus
Increasing Interoperability on Searching Library Collections Sarantos Kapidakis & Michalis Sfakakis Laboratory on Digital Libraries and Electronic Publishing Archive and Library Sciences Department Ionian University, Corfu, Greece sarantos@ionio.gr University of Cyprus February 16, 2011
The General Problem
• There are many sources of information around.
• The goal: how can we get the right results from all of them during a search?
• The challenge: interoperability of the heterogeneous, independent sources.
• The obvious solution: to use standards and/or a common approach.
  • To create good practice guides, and use common approaches to semantics and to the configurable or optional parts of the standards.
Application on Library Catalogues
• There are many applications of searching: Web, databases, library catalogues, repositories, …
• Libraries describe different objects: books, journals, CDs, videos, pictures, paintings, …
• Libraries have worked on the interoperability of their catalogues for many decades.
• They use standards and common approaches:
  • like MARC21, UNIMARC, UKMARC, …
  • like Z39.50 with the Bib-1 profile for metadata
Meta-searching the Library Community
• Variety in contents and query systems: library catalogues, bibliographic and full-text databases, repositories, typical search engines, etc.
• A huge number of the available information sources conform to the Z39.50 protocol.
• The Z39.50 protocol is a typical case of a query interface with abstract Access Points.
Library Searching Model
• On some fields only: Access Points
• Using OPAC, in local system
  • [MARC example]
• Using Web Gateway, mostly through Z39.50
  • [Z39.50 - MARC example]
Z39.50
• Z39.50 is not a standard for description or exchange of information.
• It is a standard for dissemination and includes:
  • Negotiation of capabilities
  • Agreement in data profile (e.g. BIB-1)
  • Communication protocol
  • Query types and capabilities
  • Format of results
Metadata Profiles in Z39.50
ANSI/NISO Z39.50-1995, Appendix 3 (ATR: Attribute Sets), pages 81-83, defines the following:
Attribute set    Identifier
Bib-1            Z39.50-attributeSet 1
Exp-1            Z39.50-attributeSet 2
Ext-1            Z39.50-attributeSet 3
CCL-1            Z39.50-attributeSet 4
GILS             Z39.50-attributeSet 5
STAS             Z39.50-attributeSet 6
Bib-1: Z39.50-attributeSet 1
Has attributes in the following categories:
• Use Attributes (e.g. Personal name)
• Relation Attributes (e.g. less than)
• Position Attributes (e.g. first in field)
• Structure Attributes (e.g. phrase)
• Truncation Attributes (e.g. Right Truncation)
• Completeness Attributes (e.g. incomplete subfield)
BIB-1 Use Attributes
Use attribute           Value
Personal name           1
Corporate name          2
Conference name         3
Title                   4
Title series            5
Title uniform           6
ISBN                    7
ISSN                    8
Thematic-number         1030
Material-type           1031
Doc-id                  1032
Host-item               1033
Content-type            1034
Anywhere                1035
Author-Title-Subject    1036
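The Use attribute values listed above are what a client places in a query. As a minimal sketch, assuming the client writes its queries in the Prefix Query Format (PQF) notation of the YAZ toolkit (a notation the slides do not show), the list can be held in a lookup table. The names `BIB1_USE` and `pqf` are hypothetical.

```python
# Bib-1 Use attribute values copied from the slide above.
BIB1_USE = {
    "Personal name": 1,
    "Corporate name": 2,
    "Conference name": 3,
    "Title": 4,
    "Title series": 5,
    "Title uniform": 6,
    "ISBN": 7,
    "ISSN": 8,
    "Thematic-number": 1030,
    "Material-type": 1031,
    "Doc-id": 1032,
    "Host-item": 1033,
    "Content-type": 1034,
    "Anywhere": 1035,
    "Author-Title-Subject": 1036,
}


def pqf(access_point: str, term: str) -> str:
    """Render a one-term PQF query; attribute type 1 is the Use attribute."""
    if '"' in term:
        # Assumption: this sketch does not attempt to escape quotes.
        raise ValueError("term must not contain a double quote")
    return f'@attr 1={BIB1_USE[access_point]} "{term}"'


print(pqf("Title", "digital libraries"))
```

An Access Point that is missing from the table raises `KeyError`, which keeps an unknown name from silently turning into a malformed query.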
Z39.50 Search Model & Primitives
• Model
  • Abstract record-based view
  • No direct access to the underlying data and query methods
• Query mechanism
  • Predefined abstract Access Points combined with specific attributes (Attribute Sets)
  • Query languages (query types)
• General conformance requirements
  • Attribute Set Bib-1 and query Type-1 recognized (not necessarily implemented)
Z39.50 Bib-1 Access Points Semantics
• The semantics of the Access Points are defined in the “Attribute Set BIB-1 (Z39.50-1995): Semantics” document, which:
  • represents consensus among the members of the Z39.50 Implementors Group (ZIG)
  • is maintained as an official document of the Z39.50 Maintenance Agency
  • defines the semantics of the Access Points using the tag values of representative MARC bibliographic format fields
Example 1
• Query
  • The proceedings of the IEEE conferences, and only these
  • No IEEE technical reports, nor records that have the IEEE conferences as subject, etc.
• Z39.50 sources
  • Copac Academic & National Library Catalogue (UK)
  • Library of Congress (US)
  • University of Crete Library (GR)
• Best Z39.50 Bib-1 Access Point: Author-name-conference-1006 = {111, 411, 711, 811}
  • Rarely offered for use by the search environments
Consequences of the Unsupported Access Point
• Query failures
  • The Z39.50 source fails the query and returns a diagnostic message (e.g. MELVYL, COPAC)
• Inconsistent answers
  • The Z39.50 source arbitrarily substitutes the unsupported Access Point with a supported one (e.g. Library of Congress)
• Unknown answer derivation
  • The user is not informed about the substitution of the unsupported Access Point
How Often Unsupported Access Points Occur
• Statistical figures from IndexData for the “Ten most commonly supported Access Points” are based on 2,869 worldwide Z39.50 sources, of which 1,821 support the search service.
• They indicate that:
  • No single Access Point is universally supported by the sources
  • The most commonly supported Access Points are:
    • Title, supported by 1,667 (91.54%) sources
    • Subject, supported by 1,634 (89.73%) sources
    • Author, supported by 1,629 (89.45%) sources
Unsupported Access Points in Greece
• In a similar study that we made of 24 academic Z39.50 sources in Greece:
  • Only one Access Point is supported by all sources: Author (use attribute 1003)
  • Subject Heading (use attribute 21) and Title (use attribute 4) are each supported by 23 sources
• The situation in Greece seems better than the average one.
• The order of the supported Access Points is different.
Common Approaches
• To permit queries only with the Access Points that are common to all sources
  • Restricts the search capabilities of the sources
• To ignore the sources that do not support the Access Point
  • Restricts the available sources
• To leave the source to substitute the unsupported Access Point with a supported one
  • Results in inconsistent, unpredictable answers
Searching from the Environment “HEAL Link Search”
• Restriction on the Access Points to only the common ones
The Challenge
• To substitute the unsupported Access Point with other supported Access Points, so that (preferably) identical or (otherwise) similar semantics are obeyed
• A different substitution may have to be done for each source
Related Work
• Information Integration Architectures deal with the problem of query rewriting
  • Based on mapping rules between the global schema and the local schemas of the underlying sources
  • No exploitation of the local schema semantics
• More room for optimization in our specific case
Access Points Subset Relationship
• An Access Point is considered a subset of another one if the set of data fields used to create the first is a subset of the set of data fields used to create the second.
• An example:
  • Author-name = {100, 110, 111, 400, 410, 411, 700, 710, 711, 800, 810, 811}
  • Author-name-personal = {100, 400, 700, 800}
  • The Access Point Author-name-personal is considered a subset of Author-name
[Figure: diagram of the MARC fields of Author-name, with the Author-name-personal fields shown as a subset]
Access Points Semantic Graph Specification & RDF schema
• We represent the relationships between the Access Points with a directed graph G
  • Vertices represent Access Points
  • Arcs represent subset relationships
  • <i, j> is an arc of the graph if and only if Access Point i is a subset of Access Point j
• The Access Points Author-name and Author-name-personal are represented by two vertices of the graph, and their subset relationship by the arc <Author-name-personal, Author-name>
• For the RDFS description:
  • rdfs:Class maps to Access Points (vertices)
  • rdfs:subClassOf maps to Access Point subset relationships
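The graph definition can be illustrated with the field sets quoted on the slides. This is a minimal sketch, not the RDFS implementation the slides refer to: the names `FIELDS` and `subset_arcs` are hypothetical, and the sketch assumes a proper subset test so that an Access Point never gets an arc to itself.

```python
# MARC field sets quoted on the slides for three Access Points.
FIELDS = {
    "Author-name": {100, 110, 111, 400, 410, 411, 700, 710, 711, 800, 810, 811},
    "Author-name-personal": {100, 400, 700, 800},
    "Author-name-conference": {111, 411, 711, 811},
}


def subset_arcs(fields):
    """Return every arc (i, j) where the fields of i are a proper subset of j's."""
    return {
        (i, j)
        for i in fields
        for j in fields
        if i != j and fields[i] < fields[j]  # '<' is the proper-subset test
    }


arcs = subset_arcs(FIELDS)
for child, parent in sorted(arcs):
    print(f"<{child}, {parent}>")
```

Comparing every pair yields the transitive closure of the subset relation. A graph that stores only direct arcs would need a further reduction step, which this sketch leaves out.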
A Sample of the RDFS Graph
[Figure: RDFS graph sample. Metaschema level: rdfs:Class, rdfs:AccessPoint; edge labels: rdfs:subClassOf, rdf:type. Schema level: bib1:Any_1016, bib1:Name_1002, bib1:Author-name_1003, bib1:Name-personal_1, bib1:Name-corporate_2, bib1:Name-conference_3, bib1:Author-name-personal_1004, bib1:Author-name-corporate_1005, bib1:Author-name-conference_1006. Leaf MARC fields: mrc:f-100, mrc:f-110, mrc:f-111, mrc:f-400, mrc:f-410, mrc:f-411, mrc:f-600, mrc:f-610, mrc:f-611, mrc:f-700, mrc:f-710, mrc:f-711, mrc:f-800, mrc:f-810, mrc:f-811]
A Representative Sample of the RDFS Graph of the Access Points
The RDFS Graph Including the Supported Access Points from the Library of Congress
The RDFS Graph Including the Supported Access Points from the University of Crete
Access Point Substitution
• Two substitution policies: Broad and Narrow
• Produce the minimal set (it depends on the substitution policy)
  • Eliminate every selected Access Point that is an ancestor (Broad) or a descendant (Narrow) of another selected one
  • This happens when more than one ancestor/descendant path contains a supported Access Point, and the Access Point selected from one path also belongs to another path, at a higher/lower position than the Access Point selected from that path
• Finally, either the Boolean AND or the Boolean OR combination of the supported Access Points substitutes the unsupported Access Point
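The Broad policy can be sketched as a walk up the subset graph followed by the minimal-set step. Everything here is an assumption made for illustration: the arcs in `PARENTS` only loosely follow the sample graph, the supported sets are invented rather than taken from any real source configuration, and the function names are hypothetical.

```python
# Assumed direct subset arcs (child -> parents), for illustration only.
PARENTS = {
    "Author-name-conference": {"Author-name", "Name-conference"},
    "Author-name": {"Name"},
    "Name-conference": {"Name"},
    "Name": {"Any"},
}


def ancestors(ap, parents):
    """Collect every Access Point reachable by following arcs upward."""
    seen, stack = set(), list(parents.get(ap, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return seen


def broad_substitution(requested, supported, parents):
    """Supported ancestors of `requested`, minus ancestors of other selected ones."""
    selected = ancestors(requested, parents) & supported
    return {
        ap
        for ap in selected
        if not any(ap in ancestors(other, parents) for other in selected if other != ap)
    }


# Hypothetical source that supports four of the ancestors.
supported = {"Author-name", "Name-conference", "Name", "Any"}
print(sorted(broad_substitution("Author-name-conference", supported, PARENTS)))
```

The sketch stops at the minimal set. The choice between an AND and an OR combination of the surviving Access Points is left to the caller, since the slide allows either.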
Broad Access Point Substitution: Library of Congress Z39.50 Source
Broad Access Point Substitution: University of Crete Z39.50 Source
Comparing the semantics of the results – I
[Figure: diagram of MARC field groups: {600, 610, 611}; {100, 110, 400, 410, 700, 710, 800, 810}; {111, 411, 711, 811}]
Comparing the semantics of the results – II
• The substitution for the Library of Congress produces results equivalent to those of the requested Access Point
  • The answer has the same precision as the answer from COPAC, which supports the Access Point
• University of Crete
  • We receive an answer with similar semantics (less precision)
  • The answer excludes records that have the IEEE conferences as subject
  • But it still contains other types of IEEE publications (e.g. standards)
Example 2
• Query
  • All metadata records containing the term "Malinowski" as Author, as Subject, or in the Title
• Z39.50 source
  • Library and Archives Canada
• Best Z39.50 Bib-1 Access Point: Author-Title-Subject-1036
  • Rarely offered for use by the search environments
Narrow Access Point Substitution: Library & Archives Canada Z39.50 Source
Selected Access Points
• Title
• Author-name-corporate
• Author-name
• Author-name-conference
• Author-name-personal
• Subject
The Minimal Set
• Title
• Author-name
• Subject
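The reduction from the selected Access Points to the minimal set can be reproduced in a few lines. This is a sketch under stated assumptions: only the arcs among the selected Access Points are modelled, and `PARENTS`, `ancestors` and `minimal_set_narrow` are hypothetical names.

```python
# Direct subset arcs among the selected Access Points (child -> parents).
PARENTS = {
    "Author-name-personal": {"Author-name"},
    "Author-name-corporate": {"Author-name"},
    "Author-name-conference": {"Author-name"},
}


def ancestors(ap, parents):
    """Collect every Access Point reachable by following arcs upward."""
    seen, stack = set(), list(parents.get(ap, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return seen


def minimal_set_narrow(selected, parents):
    """Drop each selected Access Point that is a descendant of another selected one."""
    return {ap for ap in selected if not (ancestors(ap, parents) & selected)}


selected = {
    "Title",
    "Author-name-corporate",
    "Author-name",
    "Author-name-conference",
    "Author-name-personal",
    "Subject",
}
print(sorted(minimal_set_narrow(selected, PARENTS)))
```

The three Author-name-* Access Points are dropped because Author-name, which already covers their fields, is itself selected.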
Access Points Semantic Similarity
• The semantics of an Access Point are determined by the parts of the record used to generate it (i.e. its leaf subclasses)
• An Access Point has semantics equivalent to those of another Access Point, or of the union or intersection of a set of Access Points, if either:
  • the sets of their underlying constituent Access Points are equal, or
  • the unions or the intersections of the sets of their underlying constituent Access Points produce equal sets
• The semantic similarity of an Access Point to others is expressed mainly by its leaf subclasses
• Finally, the similarity among the semantics of the Access Points influences the result sets of the queries that use them
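The equivalence test can be sketched on sets of leaf MARC fields. The Author-name and Author-name-conference sets are the ones quoted on the slides. The Name-conference set is an assumption added for illustration, and the function names are hypothetical.

```python
# Leaf MARC fields per Access Point; Name-conference is an assumed set.
FIELDS = {
    "Author-name": {100, 110, 111, 400, 410, 411, 700, 710, 711, 800, 810, 811},
    "Author-name-conference": {111, 411, 711, 811},
    "Name-conference": {111, 411, 611, 711, 811},
}


def combined_fields(access_points, fields, op):
    """Union ('OR') or intersection ('AND') of the leaf fields of several Access Points."""
    sets = [fields[ap] for ap in access_points]
    if op == "AND":
        return set.intersection(*sets)
    if op == "OR":
        return set.union(*sets)
    raise ValueError(f"unknown operator: {op}")


def equivalent(requested, access_points, fields, op):
    """True when the combination covers exactly the leaf fields of `requested`."""
    return fields[requested] == combined_fields(access_points, fields, op)


print(equivalent("Author-name-conference", ["Author-name", "Name-conference"], FIELDS, "AND"))
print(equivalent("Author-name-conference", ["Author-name"], FIELDS, "OR"))
```

Under these assumed sets, the intersection of Author-name and Name-conference is exactly the requested field set, while Author-name alone is broader.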
Substitution Policies Effects
• Broad substitution increases the number of corresponding leaf (MARC) fields
  • Decreases the precision
  • Does not affect the recall
• Narrow substitution decreases the number of corresponding leaf (MARC) fields
  • Decreases the recall
  • Does not affect the precision
Similarity Evaluation Measures
• Characteristic extract: leaf subclasses (lsc)
  $$lsc(ap, O) = \{\, ap_i \mid ap_i \in C \,\wedge\, ap_i \leq^{+} ap \,\wedge\, \neg\exists\, x \in C : x \leq ap_i \,\}$$
• Taxonomic Precision (tp)
  $$tp(ap_s, ap_r, O) = \frac{|lsc(ap_s) \cap lsc(ap_r)|}{|lsc(ap_s)|}$$
  • Represents the proportion of the fields used in the requested Access Point $ap_r$ (relevant fields) out of the fields used in the Access Point selected for the substitution $ap_s$ (searched fields)
• Taxonomic Recall (tr)
  $$tr(ap_s, ap_r, O) = \frac{|lsc(ap_s) \cap lsc(ap_r)|}{|lsc(ap_r)|}$$
  • Represents the proportion of the fields used in the Access Point selected for the substitution $ap_s$ out of the fields used in the requested Access Point $ap_r$
Similarity Evaluation Measures (simplified forms)
• Broad Substitution: $lsc(ap_r) \subseteq lsc(ap_s)$
  $$tp(ap_s, ap_r, O) = \frac{|lsc(ap_r)|}{|lsc(ap_s)|}$$
  $$tp(ap_{s_i}, ap_r, O) = \frac{|lsc(ap_r)|}{|lsc(ap_{s_i})|}$$
• Narrow Substitution: $lsc(ap_s) \subseteq lsc(ap_r)$
  $$tr(ap_s, ap_r, O) = \frac{|lsc(ap_s)|}{|lsc(ap_r)|}$$
  $$tr(ap_{s_i}, ap_r, O) = \frac{|lsc(ap_{s_i})|}{|lsc(ap_r)|}$$
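The two measures are easy to check on the field sets quoted earlier in the slides. The sketch below is an illustration only: it passes the leaf field sets directly instead of deriving them from the ontology O, and the function names are hypothetical.

```python
from fractions import Fraction

# Leaf MARC fields quoted on the slides.
AUTHOR_NAME = {100, 110, 111, 400, 410, 411, 700, 710, 711, 800, 810, 811}
AUTHOR_NAME_PERSONAL = {100, 400, 700, 800}
AUTHOR_NAME_CONFERENCE = {111, 411, 711, 811}


def taxonomic_precision(lsc_s, lsc_r):
    """Share of the searched fields (lsc_s) that belong to the requested ones (lsc_r)."""
    return Fraction(len(lsc_s & lsc_r), len(lsc_s))


def taxonomic_recall(lsc_s, lsc_r):
    """Share of the requested fields (lsc_r) covered by the searched ones (lsc_s)."""
    return Fraction(len(lsc_s & lsc_r), len(lsc_r))


# Broad: Author-name stands in for the requested Author-name-conference.
print(taxonomic_precision(AUTHOR_NAME, AUTHOR_NAME_CONFERENCE))  # 1/3
print(taxonomic_recall(AUTHOR_NAME, AUTHOR_NAME_CONFERENCE))     # 1

# Narrow: Author-name-personal stands in for the requested Author-name.
print(taxonomic_precision(AUTHOR_NAME_PERSONAL, AUTHOR_NAME))    # 1
print(taxonomic_recall(AUTHOR_NAME_PERSONAL, AUTHOR_NAME))       # 1/3
```

The Broad case loses precision and keeps recall, and the Narrow case does the opposite, which matches the effects stated for the two policies.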
z39.50 Semantic Access Point Network: System Architecture & Substitution Process
[Figure: architecture diagram. Inputs: the Z39.50 query and the Bib-1 source configurations of Source 1 … Source n. The Access Point Substitution Module consults the Bib-1 RDFS description through RQL / RSSDB (ICS-FORTH RDFSuite) and produces a query for each source (query for source 1 … query for source n). The Z39.50 module (PHP/YAZ) issues the corresponding z-request to each source (z-request source 1 … z-request source n).]
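The per-source flow of this architecture can be sketched as a small planning step that runs before any request is sent. This is not the zSAPN code: the source names, their supported sets, the `toy_substitute` lookup and the function `plan_requests` are all hypothetical, and the real system derives substitutions from the RDFS description rather than from a fixed table.

```python
def plan_requests(requested, sources, substitute):
    """Decide, per source, whether to send the query as is, rewrite it, or give up."""
    plan = {}
    for name, supported in sources.items():
        if requested in supported:
            plan[name] = ("exact", [requested])
            continue
        replacement = substitute(requested, supported)
        if replacement:
            plan[name] = ("substituted", sorted(replacement))
        else:
            plan[name] = ("unsupported", [])
    return plan


# Hypothetical source configurations: source name -> supported Access Points.
SOURCES = {
    "source-1": {"Author-name-conference", "Author-name", "Title"},
    "source-2": {"Author-name", "Title"},
    "source-3": {"Title"},
}


def toy_substitute(requested, supported):
    """Stand-in for the substitution module: a fixed table of broader Access Points."""
    candidates = {"Author-name-conference": {"Author-name"}}
    return candidates.get(requested, set()) & supported


plan = plan_requests("Author-name-conference", SOURCES, toy_substitute)
for source, (status, access_points) in plan.items():
    print(source, status, access_points)
```

Keeping the substitution function as a parameter lets the same loop serve either policy, and it makes the outcome for each source explicit instead of leaving the source to substitute silently.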
z39.50 Semantic Access Point Network: A System for Semantic-Based Access Point Substitution
• Attacks the problem of the unsupported Access Points in the context of Z39.50 and for the Bib-1 attribute set
• Substitutes the unsupported Access Point with the union or the intersection of other, supported ones
• The substitution exploits the semantics of the Access Points from an RDFS description
• Broadens or narrows the semantics of the unsupported Access Point according to the user preferences
• zSAPN is available as a free service at: http://dlib.ionio.gr/zSAPN
Query Results Comparison
• Query: Author-name-conference_1006 = IEEE
• Narrow substitution is not feasible
Z39.50 source          No Substitution                     Broad Substitution
COPAC                  2799                                2799
Library of Congress    8312                                1790 (equivalent semantics)
University of Crete    Error: Unsupported Use attribute    349 (similar semantics, less precision)
Conclusions - I
• Semantics-based substitutions can substantially reduce the effects of unsupported Access Points when meta-searching metadata repositories behind query interfaces
• zSAPN, currently in the Z39.50 context, improves search consistency and eliminates query failures by exploiting the semantic information of the Access Points from an RDFS description
• zSAPN substitutes the unsupported Access Point with a set of others whose proper combination either broadens or narrows the semantics of the unsupported Access Point, and evaluates the effect of the modification on the precision or the recall, respectively, of the original query
Conclusions - II
• The proposed substitution policies enable any mediator to decide how to modify, if necessary, the semantics of an unsupported query before initiating the search requests
• A source using the methodology underlying zSAPN could expand its functionality instead of making arbitrary or general substitutions
• The RDFS description of the Bib-1 Access Points could be a basis for deploying the primitive search semantics of the library community to the Semantic Web
• zSAPN is a free service of the Laboratory on Digital Libraries and Electronic Publishing of the Archive and Library Sciences Department of the Ionian University: http://dlib.ionio.gr/zSAPN