De-duplication of Bibliographic Records

De-duplication of Bibliographic Records. Tsach Moshkovitz, Development Team Leader. Olybris, Ex Libris Seminar 2005 Kos, April 2005. Overview. De-Duplication is a required procedure whenever new records are introduced to the database.

Presentation Transcript


  1. De-duplication of Bibliographic Records Tsach Moshkovitz, Development Team Leader Olybris, Ex Libris Seminar 2005 Kos, April 2005

  2. Overview • De-Duplication is a required procedure whenever new records are introduced to the database. • De-duplication streamlines the process of loading old or new records as much as possible.

  3. Overview (cont) • The main process steps are: • matching • finding merge direction • merging

  4. Overview (cont) • Matching involves searching for similar database records that have specific given parameters.

  5. Overview (cont) • Merging direction involves identifying a preferred record (which the process sometimes implies). A non-preferred record is merged into a preferred record.

  6. Overview (cont) • Merging involves blending a new record with a similar record in the database, according to a control table or a configuration table.
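
The three steps above (matching, finding the merge direction, merging) can be sketched as a small pipeline. This is a minimal illustration and not ALEPH code: the record model (a dict of tag to values) and the names `deduplicate`, `same_isbn`, `has_more_tags`, and `fill_gaps` are all assumptions made for the example.

```python
def deduplicate(incoming, database, is_match, incoming_preferred, merge):
    """Return the merged record, or None when there is no single match."""
    # Step 1: matching. Collect database records similar to the incoming one.
    matches = [record for record in database if is_match(incoming, record)]
    if len(matches) != 1:
        return None  # zero or several matches need separate handling
    existing = matches[0]
    # Step 2: merge direction. Decide which record is the preferred one.
    if incoming_preferred(incoming, existing):
        preferred, other = incoming, existing
    else:
        preferred, other = existing, incoming
    # Step 3: merging. The non-preferred record is merged into the preferred one.
    return merge(preferred, other)


# Hypothetical rules, chosen only to make the sketch runnable.
def same_isbn(a, b):
    return bool(set(a.get("020", [])) & set(b.get("020", [])))


def has_more_tags(a, b):
    return len(a) > len(b)


def fill_gaps(preferred, other):
    # Keep every preferred field; add tags that only the other record has.
    return {**other, **preferred}


database = [{"020": ["0131103628"], "245": ["The C programming language"]}]
incoming = {
    "020": ["0131103628"],
    "245": ["C programming language"],
    "100": ["Kernighan, Brian W."],
}
merged = deduplicate(incoming, database, same_isbn, has_more_tags, fill_gaps)
```

In a real setup, each of the three callables would be selected by a configuration table rather than passed in by hand.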

  7. De-duplication vs. Union Catalog • The Union Catalog is a sophisticated mechanism for supporting integration of disparate libraries into a single environment. • The Union Catalog is based on Match and Merge algorithms. • ALEPH’s Union Catalog is not a unified Catalog, in the sense that an actual Merge does not take place in ALEPH’s Union catalog database.

  8. De-duplication vs. Union Catalog (cont) Union Catalog Match • An Equivalence table is created to map each record to a set of equivalent records. • It is totally acceptable to find more than one match in the Union Catalog. • The set of equivalent records also contains the preferred record.

  9. De-duplication vs. Union Catalog (cont) Union Catalog Merge • User View uses an on-the-fly merge to construct a virtual single record that is built from a group of equivalents. • Merge product is not saved in the database.
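
The Union Catalog behaviour described above (an equivalence table plus an on-the-fly merge that is never saved) might be modelled as follows. The data, the record IDs, and the function `virtual_record` are hypothetical; the rule "preferred record first, then fill in missing tags from the other equivalents" is an assumption for illustration only.

```python
# Hypothetical equivalence table: each record id maps to its set of equivalents.
# The set includes the record itself and the preferred record.
equivalents = {
    "A1": {"A1", "B7", "C3"},
    "B7": {"A1", "B7", "C3"},
    "C3": {"A1", "B7", "C3"},
}

# Hypothetical stored records (tag -> value); these are never modified.
records = {
    "A1": {"245": "Moby Dick", "100": "Melville, Herman"},
    "B7": {"245": "Moby Dick", "650": "Whaling"},
    "C3": {"245": "Moby-Dick", "260": "New York"},
}


def virtual_record(record_id, preferred_id):
    """Build a display-only record from the whole equivalence set."""
    merged = dict(records[preferred_id])  # copy, so the stored record is untouched
    for other_id in sorted(equivalents[record_id] - {preferred_id}):
        for tag, value in records[other_id].items():
            merged.setdefault(tag, value)  # add only tags the view still lacks
    return merged


view = virtual_record("B7", "A1")
```

Because `view` is a fresh dict, nothing is written back to `records`, which mirrors the point that the merge product is not saved in the database.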

  10. De-duplication vs. Union Catalog (cont) • ALEPH De-duplication: Match is “rigid”: it searches for similar records. Merge combines similar records in the database; all substantial fields should remain. • Union: Match is “loose”: it searches for equivalent records. Merge is a virtual merge for display; fields can be added to or omitted from the merged display.

  11. De-duplication vs. Union Catalog (cont) • ALEPH De-duplication: Preferred is determined only between two records (incoming and matched). • Union: Preferred is determined over the entire equivalent set.

  12. Sources for New Records • The main sources for new records are: • Cataloging GUI – A user catalogs a new record in the database. • Load Servers (such as OCLC server) – The server receives records from a search client (e.g., CatMe). • Batch Loading of resource files – Records are loaded in the database from an input file.

  13. Sources for New Records (cont) • The different sources for new records allow different levels of user intervention during the process: • Cataloging GUI – Maximum user control during the process. • Loading Servers – Limited user control during the process. • Loading Resource Files – No user control during the process.

  14. Sources for New Records (cont) • In general, the more a staff user is involved, the less rigid the required match. Q. How can the staff user be involved in batch loading a record? A. A given input file can be split into three files (zero-match records, single-match records and multi-match records) before the actual batch load.

  15. Sources for New Records (cont) • In the Cataloging GUI, it is common for a new record not to contain enough information for a rigid match (e.g., fast cataloging).

  16. Match and Merge Match & Merge Setup • Versions 15.2 and later feature a unified mechanism for Match and Merge. • The mechanism is similar to the check/fix doc mechanism, and defines the programs to be executed for different procedure codes (contexts). • The functionality also exists in Versions 14.x, but is configured separately for cataloging, load servers, and batch jobs.

  17. Match and Merge (cont) Match & Merge Setup • Just as in other Configuration tables, the Match/Preferred/Merge Conf tables feature the following three columns: • Section code (the context, e.g., OCLC) • Program(s) to execute • Program-specific parameters

  18. Match • Example, tab_match:
      !!!!!-!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!-!!!!!!!!!!!!!
      OCLC  match_doc_uid                  I-ISBN
      OCLC  match_doc_acc                  tab_match_acc
      CAT   match_doc_uid                  I-ISBN
      CAT   match_doc_acc                  tab_match_acc
      YBP   match_doc_uid                  I-ISBN
      YBP   match_doc_acc                  tab_match_acc
      YBP   match_doc_gen                  TYPE=IND...
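
To show how a three-column table of this kind (section code, program, parameters) can drive program selection per context, here is a small parsing sketch. It is not ALEPH's loader: it splits on whitespace instead of fixed column positions, the function name `parse_table` is hypothetical, and the truncated `match_doc_gen` row is left out because its parameters are elided in the slide.

```python
from collections import defaultdict

# Rows copied from the example above; lines starting with "!" are rulers/comments.
TAB_MATCH = """\
!!!!!-!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!-!!!!!!!!!!!!!
OCLC  match_doc_uid                  I-ISBN
OCLC  match_doc_acc                  tab_match_acc
CAT   match_doc_uid                  I-ISBN
CAT   match_doc_acc                  tab_match_acc
YBP   match_doc_uid                  I-ISBN
YBP   match_doc_acc                  tab_match_acc
"""


def parse_table(text):
    """Map each section code to its ordered list of (program, parameters)."""
    programs = defaultdict(list)
    for line in text.splitlines():
        if not line.strip() or line.startswith("!"):
            continue  # skip blank lines and ruler/comment lines
        section, program, params = line.split(None, 2)
        programs[section].append((program, params.strip()))
    return dict(programs)


programs = parse_table(TAB_MATCH)
oclc_programs = programs["OCLC"]
```

Looking up `programs["OCLC"]` then yields the programs to run, in table order, for the OCLC context.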

  19. Match (cont) • The procedure name is defined elsewhere, depending on the module. For example, the OCLC server match procedure is defined in tab_oclc col. 9.
      ! 1 2 3 4 5 6 7 8 9
      !!!!-!!!-!!!!!-!!!!!-!!!!!-!-!-!!!!!!!!!!-!!!!!
      7545 BIB USM01 1 N Y OCLC OCLC
      7545 AUT USM10 Y OCLC OCLC

  20. Match Algorithms • The more common Match algorithms (programs) are: • match_doc_uid • match_doc_acc • match_doc_gen • match_doc_script

  21. Match Algorithms (cont) • The match_doc_uid program is based on the direct index (Z11). • The parameters column in the match table should contain either the index name (prefix I) or the tag code (prefix T).

  22. Match Algorithms (cont) • Important notes for match_doc_uid: • Even if a direct index is used, the program might return several matches. • The program matches the normalized (filing) text, rather than the original text. (Normalization is different for different indexes.)
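
The two notes above (matching on normalized text, and the possibility of several matches from a direct index) can be illustrated with a toy index. Everything here is an assumption for the sketch: the `normalize` rule (drop non-alphanumerics, uppercase) is not ALEPH's filing routine, and `index_record` and `match_uid` are hypothetical names.

```python
from collections import defaultdict


def normalize(text):
    """Hypothetical filing normalization: keep letters and digits, uppercase."""
    return "".join(ch for ch in text if ch.isalnum()).upper()


# Toy direct index: normalized key -> document numbers that carry it.
direct_index = defaultdict(list)


def index_record(doc_number, isbn):
    direct_index[normalize(isbn)].append(doc_number)


def match_uid(isbn):
    """Return every document whose normalized key equals the incoming one."""
    return list(direct_index.get(normalize(isbn), []))


index_record("000000101", "0-13-110362-8")
index_record("000000202", "013110362 8")  # differently typed, same filing text

matches = match_uid("0131103628")
no_matches = match_uid("0-00-000000-0")
```

Both stored records normalize to the same key, so a single incoming ISBN returns two matches even though the lookup goes through a direct index.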

  23. Match Algorithms (cont) • The match_doc_acc program is based on a headings (ACC) index. • The third column is a table name that lists the record tags that should be checked against the headings index. • Example (tab_match_acc):
      ! 1
      !!!!!
      245##
      240##

  24. Match Algorithms (cont) • Important notes for match_doc_acc: • If relevant tags (listed in the given table) are associated with several headings (e.g., TTL, NTL), then all headings of that type will be checked. • If an incoming record introduces a new heading, subsequent incoming records may not be able to use that heading immediately.

  25. Match Algorithms (cont) • The match_doc_gen program is based on the headings (ACC) index, the direct index (Z11) and the direct system number. • The third column tells the program which index type to use (heading, direct, or sys) and which tag code or index code to extract from the incoming record.

  26. Match Algorithms (cont) • Example:
      !!!!!-!!!!!!!!!!!!!!-!!!!!!!!!!!!!!!!!!>
      ISSN  match_doc_gen  TYPE=IND,TAG=022,SUBFIELD=a,CODE=ISSN
      • Specific parameters instruct the program to use the Z11/ISSN index and to match the tag 022 $$a of the incoming record.
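
As an illustration of how a parameter string such as the one in this example could be interpreted, the sketch below splits it into key/value pairs and pulls the corresponding subfield out of an incoming record. The record model (tag to subfield dict) and the names `parse_params` and `lookup_key` are assumptions; how ALEPH itself parses this column is not shown in the slides.

```python
def parse_params(params):
    """Turn 'TYPE=IND,TAG=022,...' into a dict of option name -> value."""
    return dict(item.split("=", 1) for item in params.split(","))


def lookup_key(record, params):
    """Return (index type, index code, value taken from the incoming record)."""
    options = parse_params(params)
    field = record.get(options["TAG"], {})
    value = field.get(options["SUBFIELD"])
    return options["TYPE"], options["CODE"], value


# Hypothetical incoming record: tag -> {subfield code: value}.
incoming = {"022": {"a": "0028-0836"}, "245": {"a": "Nature"}}

options = parse_params("TYPE=IND,TAG=022,SUBFIELD=a,CODE=ISSN")
key = lookup_key(incoming, "TYPE=IND,TAG=022,SUBFIELD=a,CODE=ISSN")
```

The resulting tuple says: search the direct (IND) index named ISSN for the value found in 022 $$a.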

  27. Match Algorithms (cont) • The match_doc_gen program is also used when the ALEPH system number is already catalogued in the incoming record. In this case, the index type is SYS.

  28. Match Algorithms (cont) • The match_doc_script program is based on a command script that allows a user-defined logical flow for matching. • Example:
      !1 2                    3    4          5
      !!-!!!!!!!!!!!!!!!!!!!!-!!!!-!!!!!!!!!!-!!!!!!!!!!!!!!!>
      00                      1    goto 02
      00                      0    stop
      01 match_doc_gen        50-  skip       TYPE=ACC,TAG=245..
      01                      10+  goto 03
      02 match_doc_gen        5+   goto 03    TYPE=ACC,TAG=001
      02                      0+   skip
      03 match_doc_gen        5+   skip       TYPE=ACC,TAG=001
      03                      1    stop

  29. Finding Preferred Record • The tab_preferred table, located in $data_tab, defines a preferred_doc program to be executed, per context. • Example:
      !!!!!!!!!!-!!!!!!!!!!!!!!!!!!!!!!-!!!!!!!!!>
      OCLC       preferred_doc_aleph1   weights_table1
      RLIN       preferred_doc_aleph1   weights_table2
      • In this example, preferred_doc_aleph1 must decide which of the two given documents is preferred, according to the weights table.

  30. Finding Preferred Record (cont) • A weights table might look like this:
      ! 1   2      3          4                         5
      !!!!!-!!!!!!-!!!!!!!!!!-!!!!!!!!!!!!!!!!!!!!!!!!!-!!!
      LDR   F05-01 EQUAL      d                         -10
      LDR   F17-01 NOT-EQUAL  1,2,3,4,5,7,8,u,z         010
      LDR   F17-01 EQUAL      1                         009
      110##        PRESENT                              001
      505##        PRESENT                              050
      • The document with the higher accumulative weight is the preferred one.
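
A weights table like this one can be read as a list of scoring rules, and the scoring can be sketched as below. Several points are assumptions rather than documented ALEPH behaviour: `F05-01` is taken to mean "fixed position 05, length 01" (0-based, as in the MARC leader), a comma-separated value is treated as a list of alternatives, and the names `RULES`, `make_leader`, and `score` are hypothetical.

```python
# (tag, position, condition, values, weight), transcribed from the table above.
RULES = [
    ("LDR", "F05-01", "EQUAL", "d", -10),
    ("LDR", "F17-01", "NOT-EQUAL", "1,2,3,4,5,7,8,u,z", 10),
    ("LDR", "F17-01", "EQUAL", "1", 9),
    ("110", "", "PRESENT", "", 1),
    ("505", "", "PRESENT", "", 50),
]


def make_leader(status, level):
    """Build a 24-character leader with the given position 05 and 17 values."""
    return "00000" + status + "am a2200000" + level + "a 4500"


def score(record):
    """Sum the weights of every rule the record satisfies."""
    total = 0
    for tag, position, condition, values, weight in RULES:
        field = record.get(tag)
        if condition == "PRESENT":
            hit = field is not None
        elif field is None:
            hit = False
        else:
            start, length = (int(part) for part in position[1:].split("-"))
            actual = field[start:start + length]
            allowed = values.split(",")
            hit = (actual in allowed) if condition == "EQUAL" else (actual not in allowed)
        if hit:
            total += weight
    return total


full = {"LDR": make_leader("c", " "), "110": "Ex Libris", "505": "Contents note"}
minimal = {"LDR": make_leader("c", "1"), "110": "Ex Libris"}
deleted = {"LDR": make_leader("d", " ")}

preferred = max([full, minimal, deleted], key=score)
```

Under these assumptions the full-level record with a contents note scores highest and would be the preferred one.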

  31. Merge • The tab_merge located in $data_tab defines a merge program to be executed, per procedure. • Example:
      !!!!!!!!!!-!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!-!!!!!!>
      OVERLAY-01 merge_doc_overlay              02
      OVERLAY-02 merge_doc_overlay              02
      OVERLAY-03 merge_doc_overlay              03
      OVERLAY-04 merge_doc_overlay              04
      OCLC       merge_doc_overlay              01

  32. Merge (cont) • The procedure name is defined elsewhere, depending on the module. For example, the OCLC server merge procedure is defined in tab_oclc col. 8.
      ! 1 2 3 4 5 6 7 8 9
      !!!!-!!!-!!!!!-!!!!!-!!!!!-!-!-!!!!!!!!!!-!!!!!
      7545 BIB USM01 1 N Y OCLC OCLC1
      7545 AUT USM10 Y OCLC OCLC2

  33. Merge Algorithms • The more common merge programs are: • merge_doc_overlay • merge_doc_replace • merge_doc_adv_overlay

  34. Merge Algorithms (cont) • The merge_doc_replace program replaces the contents of an original record with the contents of a new record, while retaining the CAT fields from both records.
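
The replace behaviour described above can be sketched in a few lines. The record model (a list of `(tag, value)` pairs), the function name `merge_replace`, and the ordering of the retained CAT fields (original first, then new) are assumptions made for this example.

```python
def merge_replace(original, new):
    """Take all content from the new record, keeping CAT fields of both."""
    new_content = [field for field in new if field[0] != "CAT"]
    original_cat = [field for field in original if field[0] == "CAT"]
    new_cat = [field for field in new if field[0] == "CAT"]
    return new_content + original_cat + new_cat


original = [("245", "Old title"), ("500", "Old note"), ("CAT", "batch 2001")]
new = [("245", "New title"), ("300", "x p."), ("CAT", "oclc 2005")]

replaced = merge_replace(original, new)
```

Only the CAT field survives from the original record; its title and note are replaced by the content of the new record.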

  35. Merge Algorithms (cont) • The merge_doc_overlay program overlays the record according to the specifications defined in tab_merge_overlay. • The tab_doc_overlay was named tab_doc_merge in Version 14.x. • The table may define multiple merge sets by using col. 1. Column 4 of the tab_merge table contains the merge set, which is performed when the routine (e.g., OVERLAY-01) is selected.

  36. Merge Algorithms – tab_doc_overlay Example:
      !1 2 3 4
      !!-!-!-!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
      01 1 Y #####
      01 1 N 051##
      01 1 N 245##
      01 2 Y 245##
      01 2 Y 650##
      01 2 N 001
      01 2 Y 008
      01 2 Y LDR
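
One possible reading of an overlay table like this is sketched below. The interpretation is an assumption, not documented behaviour: column 2 is taken as the record (1 = original database record, 2 = incoming record), column 3 as keep (Y) or drop (N), `#` as a single-character wildcard, later rows as overriding earlier ones, and fields with no matching row as dropped. The names `RULES`, `tag_matches`, `keep`, and `overlay` are hypothetical.

```python
# (record number, action, tag pattern) for merge set 01, as in the example above.
RULES = [
    ("1", "Y", "#####"), ("1", "N", "051##"), ("1", "N", "245##"),
    ("2", "Y", "245##"), ("2", "Y", "650##"), ("2", "N", "001"),
    ("2", "Y", "008"), ("2", "Y", "LDR"),
]


def tag_matches(pattern, tag):
    """Compare tag (plus indicators) to a pattern; '#' matches any character."""
    if len(tag) > len(pattern):
        return False
    padded = tag.ljust(len(pattern))
    return all(p in ("#", t) for p, t in zip(pattern, padded))


def keep(record_no, tag):
    """Apply the rules for one record in order; the last matching rule wins."""
    decision = False
    for rule_record, action, pattern in RULES:
        if rule_record == record_no and tag_matches(pattern, tag):
            decision = action == "Y"
    return decision


def overlay(original, incoming):
    """Combine the fields kept from the original and from the incoming record."""
    kept = [field for field in original if keep("1", field[0])]
    kept += [field for field in incoming if keep("2", field[0])]
    return kept


original = [("001", "000000101"), ("051  ", "QA76"), ("24510", "Old title"), ("650 0", "Whales")]
incoming = [("001", "ocm12345"), ("24510", "New title"), ("650 0", "Whaling"), ("300  ", "x p.")]

merged = overlay(original, incoming)
```

With these assumptions the merged record keeps the original system number and subject, takes the title and subject from the incoming record, and drops the incoming 001. Note that this simple sketch does not de-duplicate fields kept from both records.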

  37. Merge Algorithms – tab_doc_overlay tab_merge

  38. Merge Algorithms • The merge_doc_adv_overlay works essentially like merge_doc_overlay, with the following two differences: • Before merge is finished, tab_preferred is called with a context parameter hard-coded to AD-OVERLAY. • The tab_merge_adv_overlay has slightly better functionality, and the conditions may be sensitive both to the tag’s existence and its value.

  39. Cataloging • The Cataloging GUI uses check_doc and fix_doc to implement match and merge, respectively: • When uploading a record, check_doc (CATALOG-INSERT) is executed. The program check_doc_match is used as an entry to tab_match. • When using copy/paste of a whole record, fix_doc (MERGE) is executed. The program fix_doc_merge is used as an entry to tab_merge.

  40. Cataloging – Match

  41. Cataloging – Merge Original Record (1)

  42. Cataloging – Merge (cont) Copied Record (2)

  43. Cataloging – Merge (cont) Merged Record

  44. OCLC Server (diagram; labels: ALEPH Database, OCLC, Match & Merge, Search, OCLC Client (CatMe), OCLC Server)

  45. OCLC Server (cont) • Search in OCLC. • Send a record from the OCLC client to the OCLC server. • Check if there is a similar record in the ALEPH database (tab_match). • If no matching record is found, the record is added to the ALEPH database. • If a single matching record is found, both records are merged (tab_merge), and the merged record is added to the ALEPH database. • If multiple matches are found, an error is reported to the OCLC client.
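
The decision flow above (no match: add; single match: merge; multiple matches: report an error) can be sketched as a single function. The record model, the in-memory `database` list, the status strings, and the names `handle_incoming`, `find_by_isbn`, and `overwrite_fields` are all assumptions; the sketch replaces the matched record in place, which is one way to read "the merged record is added to the ALEPH database".

```python
def handle_incoming(incoming, database, find_matches, merge):
    """Add, merge, or reject an incoming record based on the match count."""
    matches = find_matches(incoming, database)
    if not matches:
        database.append(incoming)
        return "added"
    if len(matches) == 1:
        position = matches[0]
        database[position] = merge(database[position], incoming)
        return "merged"
    return "error: multiple matches"


# Hypothetical match and merge rules, standing in for tab_match and tab_merge.
def find_by_isbn(incoming, database):
    return [i for i, record in enumerate(database) if record["isbn"] == incoming["isbn"]]


def overwrite_fields(existing, incoming):
    return {**existing, **incoming}


database = [
    {"isbn": "111", "title": "A"},
    {"isbn": "222", "title": "B"},
    {"isbn": "222", "title": "B (copy)"},
]

status_new = handle_incoming({"isbn": "333", "title": "C"}, database, find_by_isbn, overwrite_fields)
status_single = handle_incoming({"isbn": "111", "title": "A, 2nd ed."}, database, find_by_isbn, overwrite_fields)
status_multi = handle_incoming({"isbn": "222", "title": "B?"}, database, find_by_isbn, overwrite_fields)
```

The returned status is what a server would report back to the client; in the multiple-match case the database is left unchanged.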

  46. OCLC Server – tab_oclc
      ! 1 2 3 4 5 6 7 8 9
      !!!!-!!!-!!!!!-!!!!!-!!!!!-!-!-!!!!!!!!!!-!!!!!!!!!!
      7545 BIB USM01 1 N Y OCLC OCLC
      7545 AUT USM10 Y OCLC OCLC
      tab_match
      !!!!!!!!!!-!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!-!!!!!!!!!!!!
      OCLC       match_doc_uid                  I-ISBN
      tab_preferred
      !!!!!!!!!!-!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!-!!!!!!!!!!>
      OCLC       preferred_doc_aleph1           weights_tab1
      tab_merge
      !!!!!!!!!!-!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!-!!!!!!!
      OCLC       merge_doc_overlay              01

  47. Resource File • Input file – ALEPH sequential format • Run p_manage_36. This function splits an input file of documents into three output files, according to user defined matching criteria. • Run p_manage_38. This function runs a merge routine on the second output of p_manage_36, and the records in the database.

  48. Resource File (cont) • The first output of p_manage_36 is loaded with p_manage_18, using NEW. • The output of p_manage_38 is loaded with p_manage_18, using REPLACE.

  49. Resource File – p_manage_36 • Output File 1: • Contains records that do not match any record in the database. • The records in this file are given a new sequential number (starting from 000000001).

  50. Resource File – p_manage_36 (cont) • Output File 2: • Contains records for which a unique single match was found in the database. • The records in this file are given new system numbers that match the system number of the matched record.
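
The splitting behaviour described for the first two output files can be sketched as follows. This is an illustration only and not the p_manage_36 implementation: the record model, the `index` lookup, the `sys` key, and the function names are assumptions. The third list holds records with several matches, following the three-way split described earlier in the presentation.

```python
def split_by_matches(records, find_matches):
    """Split input records into no-match, single-match and multi-match lists."""
    no_match, single_match, multi_match = [], [], []
    for record in records:
        matches = find_matches(record)
        if not matches:
            # New sequential number, starting from 000000001.
            number = f"{len(no_match) + 1:09d}"
            no_match.append({**record, "sys": number})
        elif len(matches) == 1:
            # Take over the system number of the matched database record.
            single_match.append({**record, "sys": matches[0]})
        else:
            multi_match.append(record)
    return no_match, single_match, multi_match


# Hypothetical index: match key -> system numbers already in the database.
index = {"111": ["000000101"], "222": ["000000202", "000000303"]}


def find_by_isbn(record):
    return index.get(record["isbn"], [])


records = [{"isbn": "111"}, {"isbn": "999"}, {"isbn": "222"}, {"isbn": "888"}]
new_records, matched_records, ambiguous_records = split_by_matches(records, find_by_isbn)
```

The first list would then be loaded as new records, the second would go through the merge step, and the third is left for a staff user to review.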