From Data to Discovery: Building Automated Cataloguing Tools with Perl
Huw Jones, Cambridge University Library
Cambridge
• Small city, big University = lots of libraries!
• Lots of libraries = lots of books
Bibliographic records
• University Library: 3.85 M
• Other libraries: 2.5 M
• 8 databases
Data problems
• Quality
• Duplication
Quality - fullness
• Of the 2.5 M records in our databases, 1 M are short records
Effects
• Difficulty in resource discovery
• Patchy retrieval
• Lack of authority control
• Difficulty with standard deduplication
• Burden on staff time
• Ties us to multiple database model
Aims
• Better records
• Fewer records
Existing Solutions?
• Manual recataloguing
• Commercial solutions
• Universal catalogue
• Discovery layer
• These either don’t solve the core problem, or are expensive and/or time-consuming
Our solution
• Automated Cataloguing Tools!
• Short record enrichment
• Automated MARC correction
• Deduplication
• Order is important – full, well-coded records are easier to deduplicate
General principles
• Retrieve some records from a Voyager database
• Examine and/or manipulate them
• If necessary, make changes in the database
• N.B. Watch indexes and table space!
General tools
• Perl – holds everything together
• Perl DBI – connects to databases
• SQL – retrieves records from database
• MARC::Record modules (from CPAN) – to examine/manipulate records
• Pbulkimport/Batchcat – to make changes to the database
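The retrieve-then-examine pattern above can be sketched as follows. This is a Python stand-in for the Perl DBI approach the talk describes, not the author's code: `sqlite3` takes the place of the Voyager database connection, and the table `bib_records` with its columns `bib_id` and `raw_marc` is hypothetical.

```python
import sqlite3

def fetch_records(conn, bib_ids):
    """Return {bib_id: raw_marc} for those requested ids that exist."""
    if not bib_ids:
        return {}
    # One placeholder per id keeps the query parameterised.
    placeholders = ",".join("?" for _ in bib_ids)
    sql = ("SELECT bib_id, raw_marc FROM bib_records "
           f"WHERE bib_id IN ({placeholders})")
    return dict(conn.execute(sql, list(bib_ids)).fetchall())

def demo():
    """Build a throwaway in-memory database and query it."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical table layout, for illustration only.
    conn.execute(
        "CREATE TABLE bib_records (bib_id INTEGER PRIMARY KEY, raw_marc TEXT)")
    conn.executemany(
        "INSERT INTO bib_records VALUES (?, ?)",
        [(662002, "=245 10$aHow to build a habitable planet"),
         (662003, "=245 10$aAnother title")])
    found = fetch_records(conn, [662002, 999999])
    conn.close()
    return found
```

An id that is not in the table is simply absent from the result, which is one way to perform the "is this a valid bib id" check before any further processing.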
Batchcat vs Pbulkimport
• Batchcat – installed on PC with Voyager; more versatile; can’t be used on server
• Pbulkimport – limited functionality; needs Bibliographic Detection Profile and Bulk Import Rule (SYSADMIN); can be used on server
Books
• Learning Perl / Randal L. Schwartz and Tom Phoenix. 3rd ed. (Sebastopol, Calif. : O’Reilly, 2001). ISBN: 0596001320
• Programming the Perl DBI / Alligator Descartes and Tim Bunce. (Sebastopol, Calif. : O’Reilly, 2000). ISBN: 1565926994
Enriching short records
• How to get from this …
Basic mechanism
• Take short record
• Find a matching full record
• Overlay short record with full record
• Need a source of full records
• In Cambridge: the University Library has a large database of full, authority-controlled records
• File of SHORT RECORD bib ids
• Connects to LOCAL database and checks whether each is a valid bib id
• Connects to EXTERNAL source. Finds best FULL RECORD match and scores it
• Retrieves SHORT RECORD info from local database
• Compares match score to overlay threshold. If OK, retrieves MARC record for FULL RECORD
• Corrects FULL MARC record. Removes inappropriate fields. Inserts fields to be retained from SHORT RECORD
• In local database, overlays SHORT RECORD with FULL RECORD
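A minimal sketch of the score, threshold, and overlay steps, written in Python for illustration rather than the Perl used in production. The scoring scheme, the threshold value `OVERLAY_THRESHOLD`, the retained tag list `RETAIN_TAGS`, and the dictionary record layout are all assumptions; the talk does not specify them.

```python
OVERLAY_THRESHOLD = 0.6   # hypothetical cut-off
RETAIN_TAGS = ("852",)    # hypothetical: local fields kept from the short record
MATCH_KEYS = ("isbn", "title", "year")

def match_score(short, full):
    """Fraction of match keys on which the two records agree (illustrative)."""
    hits = sum(1 for k in MATCH_KEYS
               if short.get(k) and short.get(k) == full.get(k))
    return hits / len(MATCH_KEYS)

def best_match(short, candidates):
    """Return (score, record) for the best candidate, or (0.0, None)."""
    scored = [(match_score(short, c), c) for c in candidates]
    return max(scored, key=lambda pair: pair[0], default=(0.0, None))

def overlay(short, full):
    """Full record's fields, minus its local fields, plus the short record's."""
    kept = [f for f in short["fields"] if f[0] in RETAIN_TAGS]
    body = [f for f in full["fields"] if f[0] not in RETAIN_TAGS]
    merged = dict(full)
    merged["bib_id"] = short["bib_id"]   # overlay keeps the local bib id
    merged["fields"] = body + kept
    return merged

def enrich(short, candidates):
    """Overlay only when the best match clears the threshold."""
    score, full = best_match(short, candidates)
    if full is None or score < OVERLAY_THRESHOLD:
        return short                     # leave the short record untouched
    return overlay(short, full)
```

Returning the short record unchanged when no candidate clears the threshold keeps the process conservative: a weak match never overwrites local data.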
Results
• Service has been running for 1 year (much of which was testing)
• 18 libraries subscribed to use service
• 90,000 short records upgraded
MARC checking and correction
• Bibliographic standard – agreed minimum standard for cataloguing
• Every week, libraries receive an automatically generated file of MARC coding errors for correction
• Based on MARC::Lint module with many alterations
Mechanism
• Connects to database using Perl DBI
• Retrieves MARC records created or edited in the last week
• Runs them through MARC check
• Prints errors to file
• Emails file to library
• Over 100,000 errors pointed out so far!
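The check-and-report step might look something like this. It is a Python sketch with three illustrative punctuation rules, not the modified MARC::Lint rule set the talk refers to; fields are assumed to be `(tag, text)` pairs with indicators already stripped, and the function names are hypothetical.

```python
def check_record(fields):
    """Return a list of error strings for one record (illustrative rules only)."""
    errors = []
    tags = [tag for tag, _ in fields]
    if tags.count("245") != 1:
        errors.append("245: title field must occur exactly once")
    for tag, text in fields:
        if tag == "245" and "$c" in text and " /$c" not in text:
            errors.append("245: subfield _c should follow space forward slash")
        if tag == "260" and "$b" in text and " :$b" not in text:
            errors.append("260: subfield _b should follow space colon")
        if tag == "300" and not text.endswith("."):
            errors.append("300: field should end with a full stop")
    return errors

def weekly_report(records):
    """records: {bib_id: fields}. One report line per error found."""
    lines = []
    for bib_id, fields in records.items():
        for error in check_record(fields):
            lines.append(f"Bib id {bib_id}: {error}")
    return "\n".join(lines)
```

The report text is what would be written to file and emailed to the owning library; the emailing itself is left out of the sketch.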
MARC Correction
How to get from this …
=LDR 00472nam\\2200157\a\4500
=001 662002
=005 20071205064734.0
=008 071129s1985\\\\nyua\\\\\\\\\\001\0\eng\d
=020 \\$a9780961751111
=100 1\$aBroecker, W.S.,$d1931-
=245 10$aHow to build a habitable planet ;$cBy Wallace S. Broecker.
=260 \\$aNew York ;$bEldigio Press,$cc1985
=300 \\$a291p $bill $c23cm
=504 \\$aIncludes index.
=650 \0$aAstronomy.
=650 \0$aAstrophysics.
to this!
=LDR 00453nam 2200157 a 4500
=001 662002
=005 20071205064734.0
=008 071129s1985\\\\nyua\\\\\\\\\\001\0\eng\d
=020 \\$a9780961751111
=100 1\$aBroecker, W. S.,$d1931-
=245 10$aHow to build a habitable planet /$cby Wallace S. Broecker.
=260 \\$aNew York :$bEldigio Press,$cc1985.
=300 \\$a291 p. :$bill. ;$c23 cm.
=504 \\$aIncludes index.
=650 \0$aAstronomy.
=650 \0$aAstrophysics.
MARC Correction
• A version of the checking module which corrects errors where there is no ambiguity
• Built into short record upgrade program
• Also offered as a retrospective service to clean up legacy records
• Possibility of building it into weekly check
Mechanism
• Connects to database using Perl DBI
• Retrieves full MARC record
• Runs against correction module
• Replaces corrected record in database
Output
Bib id: 662002
How to build a habitable planet ; By Wallace S. Broecker.
100: UPDATE: Spaces inserted between initials in subfield _a
245: UPDATE: By uncapitalised at start of subfield c
245: UPDATE: Space forward slash inserted before subfield _c
260: UPDATE: Full stop inserted at end of field
260: UPDATE: Space colon inserted before subfield _b
300: UPDATE: Full stop inserted after the p in pagination
300: UPDATE: Full stop inserted at end of field
300: UPDATE: Illustration abbreviation has been corrected
300: UPDATE: Space colon inserted before subfield _b
300: UPDATE: Space inserted between digits and cm
300: UPDATE: Space inserted between digits and p in pagination
300: UPDATE: Space semi-colon inserted before subfield c
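Corrections of this kind can be expressed as an ordered table of regular-expression rules, each paired with the message to log. The sketch below is Python rather than Perl, and its rules are assumptions reverse-engineered from the before and after records shown earlier; they are not the module's actual rules and make no attempt to detect ambiguous cases. Field text is assumed to have its indicators stripped.

```python
import re

# (tag, pattern, replacement, message) - applied in order, per field.
RULES = [
    ("100", r"(?<=[A-Z]\.)(?=[A-Z]\.)", " ", "Spaces inserted between initials"),
    ("245", r"\s*[;:,/]?\s*\$c", " /$c", "Space forward slash inserted before subfield _c"),
    ("245", r"\$cBy\b", "$cby", "By uncapitalised at start of subfield _c"),
    ("260", r"\s*[;:,]?\s*\$b", " :$b", "Space colon inserted before subfield _b"),
    ("260", r"(?<!\.)$", ".", "Full stop inserted at end of field"),
    ("300", r"(\d+)\s*p\.?(?=\s|\$|$)", r"\1 p.", "Pagination spacing and full stop corrected"),
    ("300", r"\s*[;:,]?\s*\$b", " :$b", "Space colon inserted before subfield _b"),
    ("300", r"\$bill\.?(?=\s|\$|$)", "$bill.", "Illustration abbreviation has been corrected"),
    ("300", r"\s*[;:,]?\s*\$c", " ;$c", "Space semi-colon inserted before subfield _c"),
    ("300", r"(\d+)\s*cm", r"\1 cm", "Space inserted between digits and cm"),
    ("300", r"(?<!\.)$", ".", "Full stop inserted at end of field"),
]

def correct_field(tag, text):
    """Apply every rule for this tag; return (corrected_text, messages)."""
    messages = []
    for rule_tag, pattern, replacement, message in RULES:
        if rule_tag != tag:
            continue
        fixed = re.sub(pattern, replacement, text)
        if fixed != text:                # log only rules that changed something
            messages.append(f"{tag}: UPDATE: {message}")
            text = fixed
    return text, messages
```

Because a message is logged only when a rule changes the text, running the rules over an already correct field produces no output, so the process can be repeated safely on the same record.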
Results
• In testing
• 70,000 records processed
• Corrected over 200,000 MARC coding errors
• May run ALL our existing records through at some stage
Deduplication – in progress!
Three stages:
• Identification of groups of duplicates
• Identification/construction of ‘best’ record
• Deletion of other records – relinking of holdings/items/Purchase Orders to ‘best’ record
Identification of duplicates
• Connect to a database with Perl DBI
• Use SQL to retrieve records
• For each record, retrieve all available data from tables
• Use matching algorithm to identify groups of duplicates
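One simple way to form duplicate groups is to reduce each record to a normalised match key and group records that share it. The talk does not describe its matching algorithm, so the key below (normalised title plus year) and the record layout are assumptions, sketched in Python.

```python
import re
from collections import defaultdict

def match_key(record):
    """Lower-case the title, drop punctuation, collapse spaces; add the year."""
    title = re.sub(r"[^a-z0-9 ]", "", record["title"].lower())
    title = " ".join(title.split())
    return (title, record.get("year", ""))

def duplicate_groups(records):
    """Return lists of bib ids that share a match key (groups of 2 or more)."""
    groups = defaultdict(list)
    for record in records:
        groups[match_key(record)].append(record["bib_id"])
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]
```

This is also where the earlier point about ordering shows up: records that have already been enriched and corrected yield more consistent keys, so fewer true duplicates are missed.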
Identification of best record
• For each group of duplicates, the MARC records are retrieved
• Passed to scoring algorithm
• Record with highest score forms basis of ‘best’ record
• Retains set fields (e.g. subject headings) from ‘other’ records
• Corrects any MARC coding errors
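The scoring and merging steps could be sketched as below, in Python. The scoring weights and the list of retained tags (`RETAINED_TAGS`) are invented for illustration, since the talk gives no detail of the real scoring algorithm; records are assumed to carry a list of `(tag, text)` fields.

```python
RETAINED_TAGS = ("650",)   # hypothetical: subject headings kept from other records

def score(record):
    """Illustrative fullness score: field count plus bonuses for key fields."""
    tags = [tag for tag, _ in record["fields"]]
    return len(tags) + 2 * ("100" in tags) + 2 * ("650" in tags)

def build_best(group):
    """Take the highest-scoring record and add retained fields from the rest."""
    best = max(group, key=score)
    fields = list(best["fields"])
    for other in group:
        if other is best:
            continue
        for field in other["fields"]:
            if field[0] in RETAINED_TAGS and field not in fields:
                fields.append(field)     # skip exact duplicates
    return {"bib_id": best["bib_id"], "fields": fields}
```

The merged record would then be passed through the correction step before replacing the group in the database.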
But …
• No relinking functionality, even in BatchCat
• No viable workaround for libraries using Acquisitions, or one that avoids losing circulation history
In conclusion …
• Tools for librarians, not replacements!
• Do the things programs do well, allowing humans to concentrate on what humans do well
• Won’t do all the work, but make a solution to major data problems feasible