Union Catalog Architecture Tsach Moshkovits, Development Team Leader Olybris, Ex Libris Seminar 2005 Kos, April 2005
Overview • The Union Catalog is a sophisticated mechanism that supports the integration of disparate libraries into a single environment. • By environment, we mean a unified User view, rather than a single database or a merged index.
Overview • The following will be discussed in this session: • Union catalog structure • Union catalog vs. Unified catalog • Equivalency • Merge
A Unified Catalog • A Union catalog is usually implemented as a unified catalog, in which all equivalent records are merged into one new record. • In this scenario, the original records are not saved, and the index is built on the merged version of the records. • The merged record must therefore include information about its different parts to allow navigation from the record to remote resources.
Unified Catalog Drawbacks • Match and Merge is performed at load time, record by record. This is a slow process when additional resources are added. • A new resource may not be available until the slow load process is completely finished. • Updating a record is complex, since it may require more than just updating its merged record. This is true because the equivalence relation is not necessarily transitive.
Unified Catalog Drawbacks • Merging becomes even more problematic if the merge algorithm does not preserve all data from every source record. In such a case, any match and merge process must re-access all remote resources to retrieve all original records. • It is also impossible to update the unified catalog with a standard Cataloging GUI.
Unified Catalog Structure – Virtual Approach • [Diagram: Contributors feed the ALEPH Union Catalog, shown in three levels. A: Import (Load / Catalog; New/Update/Delete) into Original Records and Indices. B: Create Equivalence, producing the Equivalence Table (Z120). C: Merge, “Just in Time.”]
Union Structure – Level A • Records are stored as distinct entities in the database. • Records can be loaded from an external resource or cataloged with the ALEPH Cataloging module. • Records from an external resource can hold an identifier to the external resource to allow simple updating or navigation to an external resource. • Indices are created using the standard ALEPH indexing scheme.
Union Structure – Level B • An Equivalence table is created by mapping each record to its equivalent records. • The equivalence relation is not necessarily transitive. • This table can be recreated any time, leaving the records intact.
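The following is a minimal sketch of such a table, not the actual ALEPH implementation. The record layout and the `shares_identifier` rule are hypothetical and chosen only to show how a pairwise relation can fail to be transitive.

```python
from itertools import combinations


def build_equivalence_table(records, is_equivalent):
    """Map each record id to the sorted ids of its equivalents (itself included)."""
    table = {rec_id: {rec_id} for rec_id in records}
    for a, b in combinations(records, 2):
        if is_equivalent(records[a], records[b]):
            table[a].add(b)
            table[b].add(a)
    return {rec_id: sorted(ids) for rec_id, ids in table.items()}


def shares_identifier(rec_a, rec_b):
    """Hypothetical rule: equivalent when at least one identifier is shared."""
    return bool(rec_a["ids"] & rec_b["ids"])


records = {
    1: {"ids": {"isbn:111"}},
    2: {"ids": {"isbn:111", "lccn:222"}},
    3: {"ids": {"lccn:222"}},
}
table = build_equivalence_table(records, shares_identifier)
# Record 2 is linked to both 1 and 3, yet 1 and 3 are not linked to each other.
```

Because the table is derived entirely from the stored records, it can be dropped and rebuilt by calling the function again without touching the records themselves.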
Union Structure – Level C • Result sets will be de-duplicated to contain only one record per group of equivalents. • Browse lists will de-duplicate their counters to count only one record per group of equivalents. • User View uses on-the-fly Merge to present a single record that is built from a group of equivalents. • The Merge algorithm can vary from user to user.
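De-duplication of a result set can be sketched as below. This is an illustrative approach under the assumption that the equivalence table is available as a plain mapping; the function name and the "first hit wins" policy are not taken from ALEPH.

```python
def deduplicate(result_set, equivalence_table):
    """Keep the first record seen from each group of equivalents."""
    kept = []
    seen = set()
    for rec_id in result_set:
        if rec_id in seen:
            continue
        kept.append(rec_id)
        # Mark the record and all of its equivalents as already represented.
        seen.update(equivalence_table.get(rec_id, [rec_id]))
    return kept


equivalence_table = {1: [1, 2], 2: [1, 2], 3: [3], 4: [4, 5], 5: [4, 5]}
hits = [2, 1, 3, 5, 4]
unique_hits = deduplicate(hits, equivalence_table)
browse_count = len(unique_hits)  # a browse-list counter would use this figure
```

The same count can serve a browse list: five raw hits collapse to three entries, one per group.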
Virtual Approach Advantages • It is simple to update a record by unlinking it from the Equivalence table and marking it as “New.” This action breaks all existing connections in the group. • A new record is simply inserted as equivalent only to itself. • In all cases, the data of each record stays intact in the database.
Virtual Approach Advantages • A separate job runs on all equivalency tables marked as “New.” The job ensures that records in a group are evaluated for their real equivalency. • It takes no longer to load external resources here than it does to load and index in ALEPH.
Virtual Approach Advantages • The worst-case effect of an update, insert, or delete is that, from the time a record is updated until its equivalency entries are (re)created, the records in its group appear as non-equivalent. • There is 100% uptime.
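The update cycle described on these slides might look roughly like the sketch below. The `EquivalenceStore` class is a toy stand-in, not the Z120 layout, and it assumes one reading of "breaks all existing connections in the group": every member of the group is reset to itself and marked "New."

```python
class EquivalenceStore:
    """Toy equivalence table with a queue of entries marked as New."""

    def __init__(self):
        self.table = {}   # record id -> set of equivalent ids (never empty)
        self.new = set()  # ids waiting for the equivalence job

    def insert(self, rec_id):
        # A new record is equivalent only to itself.
        self.table[rec_id] = {rec_id}
        self.new.add(rec_id)

    def update(self, rec_id):
        # Assumed behavior: reset the whole group and mark each member New.
        for member in list(self.table[rec_id]):
            self.table[member] = {member}
            self.new.add(member)

    def run_equivalence_job(self, is_equivalent):
        # Re-evaluate every entry marked New against all other records.
        while self.new:
            rec_id = self.new.pop()
            for other in self.table:
                if other != rec_id and is_equivalent(rec_id, other):
                    self.table[rec_id].add(other)
                    self.table[other].add(rec_id)


titles = {1: "dune", 2: "dune", 3: "emma"}  # hypothetical match key


def same_title(a, b):
    return titles[a] == titles[b]


store = EquivalenceStore()
for rec_id in titles:
    store.insert(rec_id)
store.run_equivalence_job(same_title)
before = {k: set(v) for k, v in store.table.items()}

titles[2] = "emma"   # record 2 is edited
store.update(2)
during = {k: set(v) for k, v in store.table.items()}

store.run_equivalence_job(same_title)
after = {k: set(v) for k, v in store.table.items()}
```

The `during` snapshot shows the window in which former equivalents appear as separate records; the record data itself is never modified by either step.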
Virtual Approach Advantages • The same uptime considerations apply if the match algorithm is to be changed. • Changing the merge algorithm has absolutely no effect, since it is executed “just in time.”
Equivalency Table Creation • An equivalency table is created for each record in the database, and initially points only to the record itself. • Pool selection: • The equivalency search is limited to a certain number of candidates. • This is usually done on a direct index, such as ISBN, ISSN, or LCCN, and is therefore relatively fast. • If the number of candidates exceeds a certain limit, the record itself will be considered as the only candidate.
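Pool selection can be illustrated with a small sketch. The index layout, the `keys` field, and the `max_candidates` value are assumptions made for the example; in ALEPH the candidate program is table-defined.

```python
def select_pool(record, direct_index, max_candidates=50):
    """Collect candidate ids that share a direct-index key with the record."""
    candidates = {record["id"]}
    for key in record["keys"]:
        candidates.update(direct_index.get(key, ()))
    # Too many candidates: fall back to the record as its own only candidate.
    if len(candidates) > max_candidates:
        return {record["id"]}
    return candidates


# Hypothetical direct index: identifier -> ids of records carrying it.
direct_index = {
    "isbn:0441013597": [10, 11],
    "lccn:65022547": [11, 12],
    "issn:0000-0000": list(range(100, 200)),
}

book = {"id": 10, "keys": ["isbn:0441013597", "lccn:65022547"]}
serial = {"id": 150, "keys": ["issn:0000-0000"]}

book_pool = select_pool(book, direct_index)
serial_pool = select_pool(serial, direct_index)
```

The cap keeps a very common key from forcing an expensive comparison against hundreds of records.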
Equivalency Table Creation • Final match: • The equivalent records from the pool are found. • Matching and conflicting fields are searched. • Matching adds a positive weight, while conflicts add a negative weight. • The total weight is checked against a threshold.
Equivalency Table Creation • When both stages are complete, each record has a Z120 record, holding the numbers of all equivalent records. • Z120 is never empty. It holds the record’s own number if no equivalencies are found. • Both the pool selection program and the match program are table-defined, not hard-coded.
Merge • When a user wants to view a record, a merge is done on all the records in its equivalency table, combining them into a single display. • No merged record actually exists in the database. This is a virtual display created on request.
Merge • A merged record display is built by taking the “basic” fields from the preferred record and adding other fields from each of its equivalent records. • The preferred record is selected by assigning weights to all the equivalent records based on table-defined criteria, and the top weight wins. • The merge program is also table-defined.
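A just-in-time merge could be sketched as follows. Records are modeled as lists of `(tag, value)` pairs, and the set of "basic" tags is a hypothetical choice; the real merge program is table-defined and may behave differently.

```python
def merge_for_display(preferred, equivalents, basic_tags):
    """Build a display record: everything from the preferred record,
    plus non-basic fields from the other equivalent records."""
    merged = list(preferred)
    for record in equivalents:
        for tag, value in record:
            if tag in basic_tags:
                continue  # basic fields come only from the preferred record
            if (tag, value) not in merged:
                merged.append((tag, value))
    return merged


BASIC_TAGS = {"100", "245"}  # assumed: author and title are "basic"

preferred = [("100", "Herbert, Frank"), ("245", "Dune"), ("650", "Science fiction")]
other = [("100", "Herbert, F."), ("245", "Dune."), ("650", "Science fiction"),
         ("856", "http://example.org/dune")]

display = merge_for_display(preferred, [other], BASIC_TAGS)
```

Nothing is written back: the merged list exists only for the duration of the request, which is why changing the merge rules requires no reload.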
Implementation • The union_global_param table defines the programs (algorithms) used for different Union Catalog tasks.
! 1   2          3                    4
!!!!!-!-!!!!!!!!!!!!!!!!!!!!-!!!!!!!!!!!!!!!!!!!!
USM90 B candidate_prog       union_candidate_cdl
USM90 B match_prog           union_match_cdl
USM90 B preferred_prog       union_preferred_cdl
USM90 B merge_prog           union_merge_aleph
USM90 B normalize_prog       union_normalize_cdl
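As an illustration of how such a table might be consumed, the sketch below reads the rows into a lookup keyed by library and task. The parser and the name `flag` for the second column are assumptions; the slide does not say what that column means or how ALEPH itself reads the table.

```python
SAMPLE_TABLE = """\
!!!!!-!-!!!!!!!!!!!!!!!!!!!!-!!!!!!!!!!!!!!!!!!!!
USM90 B candidate_prog       union_candidate_cdl
USM90 B match_prog           union_match_cdl
USM90 B preferred_prog       union_preferred_cdl
USM90 B merge_prog           union_merge_aleph
USM90 B normalize_prog       union_normalize_cdl
"""


def parse_global_params(text):
    """Return {(library, task): program}, skipping '!' header lines."""
    programs = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("!"):
            continue
        library, flag, task, program = line.split()
        programs[(library, task)] = program
    return programs


programs = parse_global_params(SAMPLE_TABLE)
merge_program = programs[("USM90", "merge_prog")]
```

Swapping an algorithm then amounts to editing one row, which is the practical meaning of "table-defined, not hard-coded."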
Preferred Table – An Example
!!!!!-!!!!!!-!!!!!!!!!!-!!!!!!!!!!!!!!!!!!!!!!!!!-!!!
LDR   F05-01 EQUAL      d                         -10
LDR   F17-01 NOT-EQUAL  1,2,3,4,5,7,8,u,z         001
100##        PRESENT                              001
110##        PRESENT                              001
111##        PRESENT                              001
130##        PRESENT                              001
• The table defines a value for each field. All values are added according to the specifications in the middle columns. • The record with the highest value is selected as the preferred record.
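The scoring in this example table can be sketched as below. The code assumes that `F05-01` and `F17-01` refer to one character at leader positions 05 and 17; that reading, the rule encoding, and the sample records are illustrative assumptions rather than documented ALEPH behavior.

```python
# (leader position, operator, values, points), transcribed from the example table
LEADER_RULES = [
    (5, "EQUAL", set("d"), -10),
    (17, "NOT-EQUAL", set("1234578uz"), 1),
]
# (tag, points) for the PRESENT rules
PRESENT_RULES = [("100", 1), ("110", 1), ("111", 1), ("130", 1)]


def preferred_weight(record):
    """Sum the points of every rule whose condition holds for the record."""
    weight = 0
    for pos, op, values, points in LEADER_RULES:
        hit = record["LDR"][pos] in values
        if (op == "EQUAL") == hit:
            weight += points
    for tag, points in PRESENT_RULES:
        if tag in record["fields"]:
            weight += points
    return weight


def make_leader(status, level):
    """Build a 24-character leader with the given positions 05 and 17."""
    return "00000" + status + "am a22" + "00000" + level + "a 4500"


group = {
    "full": {"LDR": make_leader("n", " "), "fields": {"100", "245"}},
    "deleted": {"LDR": make_leader("d", " "), "fields": {"100", "245"}},
    "minimal": {"LDR": make_leader("n", "7"), "fields": {"110", "245"}},
}
weights = {name: preferred_weight(rec) for name, rec in group.items()}
preferred = max(weights, key=weights.get)
```

With these rules a record flagged as deleted is pushed far below the others, so it is chosen as the preferred record only if nothing better exists in the group.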
Match Table – An Example
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!-!-!!!>
date exact match                               + 200
date within 2                                  - 025
date mismatch                                  - 250
short title match                              + 450
full title match                               + 600
full title occur within                        + 350
full title mismatch                            - 600
full title keywords                            + 450
full title keywords order                      + 050
260b exact match                               + 100
260b occur within                              + 100
260b mismatch                                  - 025
The cumulative sum is compared against a defined threshold.
Match Table – An Example • Different fields are compared to determine whether two records match. • For each field, if a match is found, the plus value is added to the total match weight. Otherwise, the minus value is subtracted from the total match weight. • The threshold in the first line defines the weight above which two records are considered a match.
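The weighting can be worked through with the values from the example table. In the sketch below the comparison outcomes are supplied by hand, and `THRESHOLD = 800` is a hypothetical figure, since the slides do not show the threshold line.

```python
# Signed weights transcribed from the example match table.
MATCH_WEIGHTS = {
    "date exact match": 200,
    "date within 2": -25,
    "date mismatch": -250,
    "short title match": 450,
    "full title match": 600,
    "full title occur within": 350,
    "full title mismatch": -600,
    "full title keywords": 450,
    "full title keywords order": 50,
    "260b exact match": 100,
    "260b occur within": 100,
    "260b mismatch": -25,
}

THRESHOLD = 800  # hypothetical; the real value comes from the table's first line


def total_weight(outcomes):
    """Add up the signed weight of each observed comparison outcome."""
    return sum(MATCH_WEIGHTS[outcome] for outcome in outcomes)


def is_match(outcomes, threshold=THRESHOLD):
    """Two records match when the total weight is above the threshold."""
    return total_weight(outcomes) > threshold


same_edition = ["date exact match", "full title match", "260b exact match"]
other_edition = ["date mismatch", "full title match", "260b mismatch"]

same_total = total_weight(same_edition)    # 200 + 600 + 100
other_total = total_weight(other_edition)  # -250 + 600 - 25
```

Under the assumed threshold the first pair is accepted and the second is rejected, even though both share a full title: the date and publisher conflicts pull the second total down.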
Workflow Illustration • [Diagram: Resources / Contributors → queue of new/updated records → single BIB record → BIB’s pool of candidates → BIB’s pool of matched records (= equivalence table)]
Two Types of Union Catalogs • “Union Catalog” - On top of Bibliographic + Holdings database • “Union View” - On top of ALEPH 500 administrative database
Bibliographic and Holdings Database • [Diagram labels: Source 1, Source 2, Source 3; Normalize records; Union Catalog; Jump]
Bibliographic and Holdings Database • When records are loaded from various resources, fixes are done to normalize their structure and data. • Checks could be performed prior to the load so that incompatible records are rejected.
Bibliographic and Holdings Database • [Screenshot labels: Jump to original; View in union holdings]
ALEPH 500 Database • [Diagram labels: Union Catalog - User View; BIB 1, BIB 2, BIB 3; ADM 1, ADM 2, ADM 3; Librarian View]
ALEPH 500 Database • Records are managed in standard ALEPH 500 in a single BIB and ADM library, but separately per sub-library or administrative unit. • The Staff User view does not change from an administrative GUI perspective. • A user (patron) has a unified view on the PAC.