
From Data to Discovery

From Data to Discovery: Building Automated Cataloguing Tools with Perl. Huw Jones, Cambridge University Library. Cambridge: small city, big University = lots of libraries! Lots of libraries = lots of books. University Library: 3.85 M; other libraries: 2.5 M; 8 databases.


Presentation Transcript


  1. From Data to Discovery: Building Automated Cataloguing Tools with Perl. Huw Jones, Cambridge University Library.

  2. Cambridge: Small city, big University = lots of libraries!

  3. Lots of libraries = lots of books

  4. Bibliographic records: University Library: 3.85 M. Other libraries: 2.5 M. 8 databases.

  5. Data problems: Quality. Duplication.

  6. Quality (fullness): Of 2.5 M records in our databases, 1 M are short records.

  7. Quality – coding

  8. Duplication

  9. Effects: Difficulty in resource discovery. Patchy retrieval. Lack of authority control. Difficulty with standard deduplication. Burden on staff time. Ties us to multiple database model.

  10. Aims: Better records. Fewer records.

  11. Existing solutions? Manual recataloguing. Commercial solutions. Universal catalogue. Discovery layer. These either don't solve the core problem, or are expensive and/or time consuming.

  12. Our solution: Automated Cataloguing Tools! Short record enrichment. Automated MARC correction. Deduplication. Order is important: full, well coded records are easier to deduplicate.

  13. General principles: Retrieve some records from a Voyager database. Examine and/or manipulate them. If necessary, make changes in the database. N.B. Watch indexes and table space!

  14. General tools: Perl (holds everything together). Perl DBI (connects to databases). SQL (retrieves records from database). MARC::Record modules from CPAN (to examine/manipulate records). Pbulkimport/Batchcat (to make changes to the database).
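The "retrieve some records" step can be sketched as follows. This is a minimal illustration in Python with an in-memory SQLite database standing in for Voyager, not the Perl DBI code used in the talk. The table and column names (`bib_data`, `bib_id`, `seqnum`, `record_segment`) and the idea that a MARC record is stored in ordered segments are assumptions made for the example.

```python
import sqlite3

def fetch_marc(conn, bib_id):
    """Reassemble one record from its stored segments, or return None if absent."""
    rows = conn.execute(
        "SELECT record_segment FROM bib_data WHERE bib_id = ? ORDER BY seqnum",
        (bib_id,),
    ).fetchall()
    # Segments are concatenated in seqnum order; no rows means no such bib id.
    return "".join(segment for (segment,) in rows) or None

# Stand-in database with one record split across two segments (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bib_data (bib_id INTEGER, seqnum INTEGER, record_segment TEXT)"
)
conn.executemany(
    "INSERT INTO bib_data VALUES (?, ?, ?)",
    [(662002, 2, "2200157 a 4500"), (662002, 1, "00453nam  ")],
)

print(fetch_marc(conn, 662002))
```

Binding the bib id as a query parameter rather than interpolating it into the SQL string is the same habit Perl DBI placeholders encourage.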

  15. Batchcat vs Pbulkimport: Batchcat is installed on PC with Voyager, is more versatile, and can't be used on server. Pbulkimport has limited functionality, needs Bibliographic Detection Profile and Bulk Import Rule (SYSADMIN), and can be used on server.

  16. Books: Learning Perl / Randal L. Schwartz and Tom Phoenix. 3rd ed. (Sebastopol, Calif. : O’Reilly, 2001). ISBN: 0596001320. Programming the Perl DBI / Alligator Descartes and Tim Bunce. (Sebastopol, Calif. : O’Reilly, 2000). ISBN: 1565926994.

  17. Enriching short records: How to get from this …

  18. to this

  19. Basic mechanism: Take short record. Find a matching full record. Overlay short record with full record. Need a source of full records: in Cambridge, the University Library has a large database of full, authority controlled records.

  20. File of SHORT RECORD bib ids. Connects to LOCAL database and checks that each is a valid bib id. Connects to EXTERNAL source; finds best FULL RECORD match and scores it. Retrieves SHORT RECORD info from local database. Compares match score to overlay threshold; if OK, retrieves MARC record for FULL RECORD. Corrects FULL MARC record; removes inappropriate fields; inserts fields to be retained from SHORT RECORD. In local database, overlays SHORT RECORD with FULL RECORD.
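The threshold and overlay steps on this slide might look roughly like the sketch below. It is written in Python for illustration rather than the Perl of the actual service. The threshold value, the tags treated as "retained" or "inappropriate", and the `Record` type are all hypothetical, since the talk does not specify them.

```python
from dataclasses import dataclass

# Hypothetical settings: the talk gives neither the threshold nor the tag lists.
OVERLAY_THRESHOLD = 0.8
RETAINED_TAGS = {"852"}   # assumed local fields to keep from the short record
REMOVED_TAGS = {"906"}    # assumed fields to drop from the incoming full record

@dataclass
class Record:
    bib_id: str
    fields: list  # list of (tag, value) tuples

def build_overlay(short, full, match_score, threshold=OVERLAY_THRESHOLD):
    """Return the record to write over `short`, or None if the match is too weak."""
    if match_score < threshold:
        return None
    # Start from the full record, minus fields that should not be imported.
    kept = [(tag, value) for tag, value in full.fields if tag not in REMOVED_TAGS]
    # Carry over the fields the local library wants to retain.
    kept += [(tag, value) for tag, value in short.fields if tag in RETAINED_TAGS]
    kept.sort(key=lambda f: f[0])  # stable sort keeps repeated tags in order
    # The overlay keeps the local bib id so holdings stay attached.
    return Record(short.bib_id, kept)
```

Returning `None` below the threshold leaves the short record untouched, which matches the "If OK" condition on the slide.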

  21. Output

  22. Interface

  23. Results: Service has been running for 1 year (much of which was testing). 18 libraries subscribed to use service. 90,000 short records upgraded.

  24. MARC checking and correction: Bibliographic standard (agreed minimum standard for cataloguing). Every week, libraries receive an automatically generated file of MARC coding errors for correction. Based on MARC::Lint module with many alterations.

  25. Output

  26. Mechanism: Connects to database using Perl DBI. Retrieves MARC records created/edited in last week. Runs them through MARC check. Prints errors to file. Emails file to library. Over 100,000 errors pointed out so far!
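A minimal sketch of the weekly check, in Python for illustration. The two rules shown are simplified stand-ins invented for this example and are not the actual MARC::Lint rule set or the Cambridge alterations to it; the record structure (`bib_id`, `modified`, `fields`) is likewise an assumption.

```python
import re
from datetime import datetime, timedelta

def check_record(fields):
    """Return a list of error messages for one record (tag -> field data)."""
    errors = []
    if "245" not in fields:
        errors.append("245: No title field")
    elif not fields["245"].endswith("."):
        errors.append("245: Must end with a full stop")
    # Flags e.g. "291p" or "23cm" where a space should separate the digits.
    if "300" in fields and re.search(r"\d(p|cm)\b", fields["300"]):
        errors.append("300: Space needed between digits and p/cm")
    return errors

def weekly_report(records, now):
    """Collect errors for records modified in the 7 days before `now`."""
    cutoff = now - timedelta(days=7)
    lines = []
    for rec in records:
        if rec["modified"] < cutoff:
            continue  # not created/edited in the last week
        for err in check_record(rec["fields"]):
            lines.append(f"Bib id {rec['bib_id']}: {err}")
    return "\n".join(lines)
```

The returned text is what would be written to a file and emailed to the library in the mechanism described above.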

  27. MARC Correction: How to get from this …
      =LDR 00472nam\\2200157\a\4500
      =001 662002
      =005 20071205064734.0
      =008 071129s1985\\\\nyua\\\\\\\\\\001\0\eng\d
      =020 \\$a9780961751111
      =100 1\$aBroecker, W.S.,$d1931-
      =245 10$aHow to build a habitable planet ;$cBy Wallace S. Broecker.
      =260 \\$aNew York ;$bEldigio Press,$cc1985
      =300 \\$a291p $bill $c23cm
      =504 \\$aIncludes index.
      =650 \0$aAstronomy.
      =650 \0$aAstrophysics.

  28. to this!
      =LDR 00453nam\\2200157\a\4500
      =001 662002
      =005 20071205064734.0
      =008 071129s1985\\\\nyua\\\\\\\\\\001\0\eng\d
      =020 \\$a9780961751111
      =100 1\$aBroecker, W. S.,$d1931-
      =245 10$aHow to build a habitable planet /$cby Wallace S. Broecker.
      =260 \\$aNew York :$bEldigio Press,$cc1985.
      =300 \\$a291 p. :$bill. ;$c23 cm.
      =504 \\$aIncludes index.
      =650 \0$aAstronomy.
      =650 \0$aAstrophysics.

  29. MARC Correction: Version of module which, where there is no ambiguity, corrects errors. Built into short record upgrade program. Also offered as a retrospective service to clean up legacy records. Possibility of building it into weekly check.

  30. Mechanism: Connects to database using Perl DBI. Retrieves full MARC record. Runs against correction module. Replaces corrected record in database.

  31. Output:
      Bib id: 662002
      How to build a habitable planet ; By Wallace S. Broecker.
      100: UPDATE: Spaces inserted between initials in subfield _a
      245: UPDATE: By uncapitalised at start of subfield c
      245: UPDATE: Space forward slash inserted before subfield _c
      260: UPDATE: Full stop inserted at end of field
      260: UPDATE: Space colon inserted before subfield _b
      300: UPDATE: Full stop inserted after the p in pagination
      300: UPDATE: Full stop inserted at end of field
      300: UPDATE: Illustration abbreviation has been corrected
      300: UPDATE: Space colon inserted before subfield _b
      300: UPDATE: Space inserted between digits and cm
      300: UPDATE: Space inserted between digits and p in pagination
      300: UPDATE: Space semi-colon inserted before subfield c
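The 300 field corrections listed above can be reproduced with a few unambiguous string rules. The sketch below is a Python illustration, not the modified MARC::Lint code from the talk; the function names are hypothetical and the rules cover only the cases in this example record.

```python
import re

def parse_subfields(data):
    """Split '$a291p $bill $c23cm' into [('a', '291p'), ('b', 'ill'), ('c', '23cm')]."""
    # Existing spacing and ISBD punctuation around each value is stripped
    # so it can be re-inserted consistently below.
    return [(chunk[0], chunk[1:].strip(" :;")) for chunk in data.split("$")[1:] if chunk]

def correct_300(data):
    """Apply spacing, abbreviation and punctuation fixes to 300 subfield data."""
    fixed = []
    for code, value in parse_subfields(data):
        if code == "a":    # "291p" -> "291 p."
            value = re.sub(r"(\d)\s*p\b\.?", r"\1 p.", value)
        elif code == "b":  # "ill" -> "ill."
            value = re.sub(r"\bill\b\.?", "ill.", value)
        elif code == "c":  # "23cm" -> "23 cm"
            value = re.sub(r"(\d)\s*cm\b\.?", r"\1 cm", value)
        fixed.append((code, value))
    before = {"b": " :", "c": " ;"}  # punctuation preceding each subfield
    out = ""
    for code, value in fixed:
        if out:
            out += before.get(code, "")
        out += "$" + code + value
    if not out.endswith("."):
        out += "."  # full stop at end of field
    return out

print(correct_300("$a291p $bill $c23cm"))  # $a291 p. :$bill. ;$c23 cm.
```

Running the function on already corrected data returns it unchanged, which matters if the same records pass through the correction module more than once.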

  32. Results: In testing. 70,000 records processed. Corrected over 200,000 MARC coding errors. May run ALL our existing records through at some stage.

  33. Deduplication (in progress!): Three stages. Identification of groups of duplicates. Identification/construction of ‘best’ record. Deletion of other records, with relinking of holdings/items/Purchase Orders to ‘best’ record.

  34. Identification of duplicates: Connect to a database with Perl DBI. Use SQL to retrieve records. For each record, retrieve all available data from tables. Use matching algorithm to identify groups of duplicates.
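The talk does not describe the matching algorithm itself. As one possible illustration, the Python sketch below groups records on a normalised title plus year; both the choice of key and the record structure are assumptions, and a production matcher would likely compare more data points.

```python
import re
import unicodedata
from collections import defaultdict

def match_key(record):
    """Build a comparison key: lower-cased, unaccented, punctuation-free title plus year."""
    decomposed = unicodedata.normalize("NFKD", record["title"])
    plain = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    title = re.sub(r"[^a-z0-9]+", " ", plain.lower()).strip()
    return (title, record.get("year", ""))

def group_duplicates(records):
    """Return lists of bib ids that share a match key (groups of two or more only)."""
    groups = defaultdict(list)
    for rec in records:
        groups[match_key(rec)].append(rec["bib_id"])
    return [ids for ids in groups.values() if len(ids) > 1]
```

Normalising before comparing is also why well coded records are easier to deduplicate: the less cleanup the key needs, the fewer false mismatches.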

  35. And you’ll end up with something like this:

  36. Identification of best record: For each group of duplicates, MARC records are retrieved and passed to a scoring algorithm. Record with highest score forms basis of ‘best’ record. Retains set fields (e.g. subject headings) from ‘other’ records. Corrects any MARC coding errors.
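A minimal Python sketch of the "best record" step. The scoring weights and the set of tags treated as subject headings are hypothetical; the talk does not say how records are scored or exactly which fields are retained.

```python
# Assumed subject heading tags to carry over from the non-best records.
SUBJECT_TAGS = {"600", "610", "650", "651"}

def score(record):
    """Hypothetical score: one point per field, two extra per subject heading."""
    return len(record) + 2 * sum(1 for tag, _ in record if tag in SUBJECT_TAGS)

def build_best(group):
    """Pick the highest-scoring record and merge in subject headings from the rest."""
    best = max(group, key=score)
    merged = list(best)
    for other in group:
        if other is best:
            continue
        for field in other:
            # Retain subject headings the best record does not already have.
            if field[0] in SUBJECT_TAGS and field not in merged:
                merged.append(field)
    merged.sort(key=lambda f: f[0])  # stable sort keeps repeated tags in order
    return merged
```

Each record here is a list of `(tag, value)` tuples; the merged result would then go through the MARC correction step before replacing the records in the group.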

  37. But … No relinking functionality, even in BatchCat. No viable workaround for libraries using Acquisitions, or without losing circulation history.

  38. In conclusion … Tools for librarians, not replacements! Do the stuff programs do well, allowing humans to concentrate on what humans do well. Won’t do all the work, just makes a solution to major data problems feasible.

  39. Questions?
