Automated translation of LCSH
Sirsi Unicorn API Summit 2004, Halifax Public Library, October 17-18, 2004
Benoit PAUWELS, Université Libre de Bruxelles (ULB), Brussels
Automated translation of LCSH
• Why?
• Overview of the solution
• Technically speaking …
  • Unicorn configuration
  • Creation of an LCSH/RVM dictionary
  • AUTHTRAN report
  • Bulk translation of LCSH in the catalog
• Future developments
Why?
• 1986: start of retrospective conversion of catalog cards through the RETROCON project with OCLC
  • introduction of LCSH
• 1986, political statement:
  • all bibliographic descriptions should be searchable through both the English and the French versions of the corresponding LCSH subject entries
• Possible solutions:
  • Duplication of English and French subject entries in the catalog record
  • Cross-referencing from French to English terms
Why?
• Cross-referencing is not good enough!
  • Impossible to navigate on French subject headings
  • Too complicated for the end user (who needs to go through cross-references for every search)
  • Example
• 1986, adopted solution:
  • each bibliographic description contains all LCSH in English and French
  • Université de Laval systematically translates LCSH into French (RVM, Répertoire de vedettes-matière)
  • ULB decides to adopt RVM for its French translations
Overview of the solution
• Example
  • iLink@ULB
  • Corresponding catalog and authority records in Workflows
Overview of the solution
• Manually
  • Most of our catalog records are derived from Z39.50 sources (containing English LCSH)
  • Original cataloging: LCSH manually added to the catalog record
  • For every catalog record
    • For every LCSH
      • Validate LCSH
        • Browse/select from authority index
        • Create new authority record with RVM translation
      • Copy/paste RVM translation from authority record into catalog record
      • Validate RVM translation
        • Create new authority record
  • Workload very high: 15 min per catalog record
  • Very frustrating:
    • the same LCSH subfields need to be translated over and over again
    • the RVM heading gets entered 3 times (catalog record; 2 x authority record)
Overview of the solution
• Automated
  • manually translate the English LCSH heading ONCE (in the LCSH authority record)
  • automatically generate the French RVM authority record
  • automatically generate the French RVM subject entries in the catalog record
Unicorn configuration
• Authority formats
  • Topical
    • ENGFRE
    • FREENG
  • Geographical
    • GEO-ENGFRE
    • GEO-FREENG
• Authority indexes
  • English LCSH authority records are posted to ENGFRE
  • French RVM authority records are posted to FREENG
• Authority index variations in catalog formats (MARC, SERIAL, …)
  • English LCSH entries: ind2=0 (650-0; 651-0)
    • Validated against the ENGFRE authority index
  • French RVM entries: ind2=6 (650-6; 651-6)
    • Validated against the FREENG authority index
Unicorn configuration (diagram)
• Authority formats
    ENGFRE:      150--  750-6  650
    FREENG:      150--         650
    GEO-ENGFRE:  151--  751-6  651
    GEO-FREENG:  151--         651
• Authority indexes
    ENGFRE  FREENG
• Catalog formats
    650-0  650-6
    651-0  651-6
Unicorn configuration Authority formats
Unicorn configuration Catalog formats
Creation of an LCSH/RVM dictionary
• Since 1986, LCSH and RVM have been added to catalog records « in the same order »
• Thanks to this work we can now build an LCSH/RVM dictionary
    245-- Title
    650-0 LCSH-1
    650-0 LCSH-2
    651-0 LCSH-3
    650-6 RVM-1
    650-6 RVM-2
    651-6 RVM-3
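The « same order » convention can be sketched in Python as a positional pairing. This is a minimal illustration, not the code used at ULB: the `(tag, ind2, heading)` tuple layout and the function name `pair_subjects` are assumptions made for the example.

```python
from typing import List, Tuple

# (tag, second indicator, heading): an assumed in-memory layout,
# not the actual Unicorn dump format.
Field = Tuple[str, str, str]

def pair_subjects(fields: List[Field]) -> List[Tuple[str, str, str]]:
    """Pair the n-th LCSH entry (ind2=0) with the n-th RVM entry (ind2=6)."""
    english = [(t, h) for t, ind2, h in fields if t in ("650", "651") and ind2 == "0"]
    french = [(t, h) for t, ind2, h in fields if t in ("650", "651") and ind2 == "6"]
    # If the counts differ, positional pairing is unreliable: skip the record.
    if len(english) != len(french):
        return []
    pairs = []
    for (etag, eng), (ftag, fre) in zip(english, french):
        if etag == ftag:  # a 650 pairs with a 650, a 651 with a 651
            pairs.append((eng, fre, etag))
    return pairs

record = [
    ("245", "-", "Title"),
    ("650", "0", "LCSH-1"),
    ("650", "0", "LCSH-2"),
    ("651", "0", "LCSH-3"),
    ("650", "6", "RVM-1"),
    ("650", "6", "RVM-2"),
    ("651", "6", "RVM-3"),
]
pairs = pair_subjects(record)
```

Skipping records with unequal counts is a design choice of this sketch; the slides do not say how such records were handled.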
Creation of an LCSH/RVM dictionary
• Step 1: empty the subject authority database
    selauthority -f'ENGFRE,GEO-ENGFRE,FREENG,GEO-FREENG' | remauthority
Creation of an LCSH/RVM dictionary
• Step 2: recreate authority records
  • Dump subjects from all catalog records
      selcatalog -f'MARC,SERIAL,MAP,…' | catalogdump -of -ka888 | filtermarc -i'650,651,888' -od -Ds
  • Create English/French subject pairs
      LCSH-1||RVM-1||650
      LCSH-2||RVM-2||650
      LCSH-3||RVM-3||651
  • Popular subject pairs
    • Some English terms were translated into different French terms during the 15 years of manual input; we discard the wrong translations by counting how often each translation occurs and retaining only the most frequent one
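The « popular pair » selection amounts to a frequency count per English heading. A minimal Python sketch, assuming the `english||french||tag` line layout shown above; the function name and the competing translation in the sample data are made up for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

def popular_pairs(lines: Iterable[str]) -> Dict[Tuple[str, str], str]:
    """Map (english heading, tag) to its most frequent French translation."""
    counts: Dict[Tuple[str, str], Counter] = {}
    for line in lines:
        english, french, tag = line.split("||")
        counts.setdefault((english, tag), Counter())[french] += 1
    # Keep only the translation with the highest count for each heading.
    return {key: c.most_common(1)[0][0] for key, c in counts.items()}

# Made-up sample: two catalog records agree, one disagrees.
sample = [
    "Aids (disease)||Sida||650",
    "Aids (disease)||Sida||650",
    "Aids (disease)||Syndrome immunodéficitaire||650",
]
best = popular_pairs(sample)
```

Ties are not handled in any meaningful way here; a real run would probably flag them for manual review.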
Creation of an LCSH/RVM dictionary
• Step 2: recreate authority records
  • Create « extended » subject pairs
    • derived from the original subject pairs by iteratively chopping off the last subfield of both the English and the French part of the pair; an extended pair is retained only if its English part exists in the SUBJECT browse index
    • Original: Aids (disease)|xprevention||Sida|xprévention
    • Extended: Aids (disease)||Sida
  • Popular extended subject pairs
    • Merge subject pairs and extended subject pairs, and retain the popular pairs
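The chopping step can be sketched as follows. This is an illustration under assumptions: `known_english` is a plain Python set standing in for the SUBJECT browse index lookup, and both function names are hypothetical.

```python
from typing import List, Set, Tuple

def chop_last_subfield(heading: str) -> str:
    """Drop the last '|<code>...' subfield; return '' when none is left to drop."""
    head, sep, _ = heading.rpartition("|")
    return head if sep else ""

def extended_pairs(english: str, french: str, known_english: Set[str]) -> List[Tuple[str, str]]:
    """Shorten both sides one subfield at a time, keeping pairs whose English part is known."""
    result = []
    eng, fre = chop_last_subfield(english), chop_last_subfield(french)
    while eng and fre:
        if eng in known_english:  # stand-in for the SUBJECT browse index check
            result.append((eng, fre))
        eng, fre = chop_last_subfield(eng), chop_last_subfield(fre)
    return result

found = extended_pairs(
    "Aids (disease)|xprevention",
    "Sida|xprévention",
    known_english={"Aids (disease)"},
)
```

The sketch stops as soon as either side runs out of subfields, which assumes that English and French headings have parallel subfield structures.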
Creation of an LCSH/RVM dictionary
• Step 2: recreate authority records
  • Create ENGFRE, FREENG, GEO-ENGFRE and GEO-FREENG flat authority records
    • Subject pair:
        Aids (disease)|xPrevention||Sida|xPrévention||650
    • Flat records:
        *** DOCUMENT BOUNDARY ***
        FORMAT=ENGFRE
        .150. Aids (disease)|xPrevention
        .750. Sida|xPrévention
        *** DOCUMENT BOUNDARY ***
        FORMAT=FREENG
        .150. Sida|xPrévention
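Turning one subject pair into the two flat records shown on this slide could look like the sketch below. It mirrors the simplified layout of the slide rather than the full flat-file syntax; the 151/751 tags for the geographical formats are inferred from the configuration slide, and the function name is hypothetical.

```python
def flat_authority_records(english: str, french: str, tag: str) -> str:
    """Build the English and French flat authority records for one subject pair."""
    geo = tag == "651"                       # 651 pairs go to the GEO- formats
    prefix = "GEO-" if geo else ""
    heading_tag = "151" if geo else "150"    # assumption for GEO: 151
    link_tag = "751" if geo else "750"       # assumption for GEO: 751
    lines = [
        "*** DOCUMENT BOUNDARY ***",
        f"FORMAT={prefix}ENGFRE",
        f".{heading_tag}. {english}",
        f".{link_tag}. {french}",
        "*** DOCUMENT BOUNDARY ***",
        f"FORMAT={prefix}FREENG",
        f".{heading_tag}. {french}",
    ]
    return "\n".join(lines) + "\n"

pair = "Aids (disease)|xPrevention||Sida|xPrévention||650"
english, french, tag = pair.split("||")
text = flat_authority_records(english, french, tag)
```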
Creation of an LCSH/RVM dictionary
• Step 3: populate authority database
    cat subjects.authinput | authload -s subjects.authload.errors -fa -mc -q'TODAY'
    selcatalog | authcheck -m
    rebuildtext report
    rebldthesauri report
    correcthesauri report (several times)
• Some figures
    # ENGFRE: 123016
    # FREENG: 123030
    # GEO-ENGFRE: 23069
    # GEO-FREENG: 23047
AUTHTRAN report
• Create RVM entries in touched catalog records.
  • A catalog record can be touched through:
    • an edit operation on the catalog record
    • a creation, modification or deletion of an authority record
• Create a FREENG/GEO-FREENG authority record for every touched ENGFRE/GEO-ENGFRE authority record.
  • An authority record is touched through:
    • an edit operation on the authority record
AUTHTRAN report
• Build authkeysfile from the ‘gpn authedit’ directory
    cat authkeysfile | authdump -ki | filtermarc -iALL -od -Ds > delimfile
• Create flat authority records from the records in delimfile
• Load new authority records
    cat newauthrecsfile | authload -mc -q'TODAY'
  • authload checks for already existing authority records
• touchkeys new authority records (for reindexing through adutext)
AUTHTRAN report
1. Catalog records that have been edited (through Workflows, for example)
  • Catalog keys are found in the ‘gpn textedit’ and ‘gpn browsedit’ directories
2. Catalog records can be touched through the creation, modification or deletion of an authority record
  2.1. Creation of an authority record
    • Find catalog records that contain the new LeadTerm (LT)
        echo 'authkey' | autheditor -e | seltext
  2.2. Modification of an authority record: change of LT
    • Find catalog records that contain the old LT
        echo 'authkey' | autheditor -c | seltext
    • Find catalog records that contain the new LT
        echo 'authkey' | autheditor -e | seltext
AUTHTRAN report
  2.3. Modification of an authority record: other change
    • Find catalog records that are authorized against this authority record
        echo '^Aauthkey' | seltext
    • Find catalog records that contain the LT and are NOT authorized against this authority record
      • Construct heading from LT
      • Look up the heading key for this heading
          echo 'constructed heading' | selheading -iT -oKTn -b'SUBJECT'
      • Look for catalog records with this heading key
          headinginfo = '^G003heading'
          echo headinginfo | seltext
  2.4. Deletion of an authority record
    • The authority record could have been modified before deletion; we therefore need to consider all cases under 2.2 and 2.3.
AUTHTRAN report
3. Look for lost catalog keys
  • explained later
4. Merge and deduplicate all these catalog keys
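The merge in step 4 is a plain union followed by a numeric sort. A minimal Python sketch, assuming each source yields one numeric catalog key per line; the function name and all sample keys are made up.

```python
from typing import Iterable, List

def merge_catalog_keys(*sources: Iterable[str]) -> List[int]:
    """Merge several lists of catalog keys, drop duplicates, sort numerically."""
    merged = set()
    for source in sources:
        for line in source:
            line = line.strip()
            if line:  # ignore blank lines
                merged.add(int(line))
    return sorted(merged)

edited = ["1042", "17", "305"]         # made-up keys of edited catalog records
from_authorities = ["305", "9", "17"]  # made-up keys found via authority changes
recovered = ["2001"]                   # made-up « lost » keys
touched = merge_catalog_keys(edited, from_authorities, recovered)
```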
AUTHTRAN report
• Only validated LCSH get a chance of being translated, so first run an authority check on the catalog records
    cat touched_catkeys | authcheck -m
• Dump all touched catalog records
    cat touched_catkeys | sort -n |\
      catalogdump -of -ka888 -z -J |\
      filtermarc -iALL -od -Ds > dumpfile
• Look up French RVM translations in the corresponding authority records
    perl authtran_4.pl dumpfile > translations
• Recreate catalog records
  • original record without FREENG/GEO-FREENG subjects + new translations
    perl authtran_5.pl dumpfile translations > newdump
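The record recreation step (strip the existing French subjects, then add fresh translations) can be illustrated in Python. This is not the content of authtran_5.pl, which the slides do not show; the `(tag, ind2, heading)` layout, the dictionary keyed by `(tag, heading)`, and the function name are assumptions made for the example.

```python
from typing import Dict, List, Tuple

Field = Tuple[str, str, str]  # (tag, second indicator, heading), assumed layout
SUBJECT_TAGS = ("650", "651")

def retranslate(fields: List[Field], dictionary: Dict[Tuple[str, str], str]) -> List[Field]:
    """Return the record without its old ind2=6 subjects, plus new translations."""
    # Keep everything except the existing French (ind2=6) subject entries.
    kept = [f for f in fields if not (f[0] in SUBJECT_TAGS and f[1] == "6")]
    added = []
    for tag, ind2, heading in kept:
        # Only English entries with a known translation produce a French entry.
        if tag in SUBJECT_TAGS and ind2 == "0" and (tag, heading) in dictionary:
            added.append((tag, "6", dictionary[(tag, heading)]))
    return kept + added

record = [
    ("245", "-", "Title"),
    ("650", "0", "Aids (disease)"),
    ("650", "0", "Unvalidated term"),   # made-up heading with no translation
    ("650", "6", "Old translation"),    # made-up stale French entry
]
dictionary = {("650", "Aids (disease)"): "Sida"}
new_record = retranslate(record, dictionary)
```

Headings without a dictionary entry simply produce no French entry, which matches the idea that only validated LCSH get translated.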
AUTHTRAN report
• Split up the file of new catalog records according to format policy (-a is a mandatory option on catalogload)
• Reload each of these format files
    cat fileforformatX |\
      catalogload -aX -if -bc -umy -j -r -mu -e/dev/null -4authtran_junktag -3> loadedcatkeys
• Only reload if necessary
  • if there are any modifications in the catalog record
• Don’t reload if there are too many new records (‘gpn custom’/authtran)
  • see authtranrbld report
  • explained later
AUTHTRAN report
• Make sure the RVM entries in the reloaded catalog records are validated against the FREENG authority index
    cat loadedcatkeys | authcheck -m
• Only touchkeys up to a limit (set in ‘gpn custom’/bulktext)
  • explained later
AUTHTRAN report
• Executed daily before adutext
• Finished report: listing
AUTHTRAN report
• Recover « lost » catalog keys
  • Execution of the adutext report removes catalog keys from the ‘gpn textedit’ and ‘gpn browsedit’ directories. Executing the adutext report before authtran would therefore cause catalog keys to get lost.
  • cadutext report
    • customized version of adutext
    • saves treated catalog keys to special directories
  • authtran knows how to recover these lost keys.
AUTHTRAN report
• Reloading « many » catalog records
  • Problem
    • could result in a long execution time of the authtran report, and hence jeopardize the execution of the (many) other daily reports, including the critical daily backup procedure of the Unicorn filesystems.
  • if the number of catalog records to be reloaded exceeds a threshold (set in ‘gpn custom’/authtran), NO catalog records get reloaded
    • the file of catalog records gets saved to a separate directory
    • mail is sent to the « authtran administrator »
    • the file can be manually fed to the authtranrbld script
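The all-or-nothing threshold described above can be sketched as a small decision function. The function name, the returned dictionary and the sample numbers are assumptions for illustration; in the report the threshold is read from ‘gpn custom’/authtran.

```python
from typing import Dict, List

def plan_reload(changed_keys: List[int], threshold: int) -> Dict[str, object]:
    """Reload everything, or nothing at all when the threshold is exceeded."""
    if len(changed_keys) > threshold:
        # Too many records: reload none, save them all, and alert the administrator.
        return {"reload": [], "deferred": list(changed_keys), "notify_admin": True}
    return {"reload": list(changed_keys), "deferred": [], "notify_admin": False}

small = plan_reload([11, 12, 13], threshold=5)
large = plan_reload(list(range(10)), threshold=5)
```

Deferring the whole batch rather than part of it keeps the nightly run time predictable; the deferred file is then handled by authtranrbld outside the nightly window.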
AUTHTRAN report
• Reindexing « many » catalog records
  • Problem: running touchkeys on too many catalog keys
    • could fill up the Unicorn filesystem, and hence jeopardize the correct functioning of Unicorn
    • could result in a long execution time of the adutext report, and hence jeopardize the execution of the (many) other daily reports, including the critical daily backup procedure of the Unicorn filesystems.
  • Only a limited number of catalog records get reindexed (limit set in ‘gpn custom’/bulktext)
  • Catalog keys of the other catalog records get saved to a separate directory
  • cadutext will automatically (through a call to the bulktext script) pick up the ‘limited’ number of catalog keys and reindex the corresponding catalog records
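The limited reindexing works like a backlog that is drained a fixed number of keys per run. A minimal Python sketch of that idea; the function name, the limit of 3 and the keys are made up, and the real limit is set in ‘gpn custom’/bulktext.

```python
from typing import List, Tuple

def take_batch(backlog: List[int], limit: int) -> Tuple[List[int], List[int]]:
    """Return (keys to reindex in this run, keys saved for later runs)."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    return backlog[:limit], backlog[limit:]

backlog = list(range(1, 8))  # seven made-up catalog keys
runs = []
while backlog:               # each iteration stands for one daily run
    batch, backlog = take_batch(backlog, limit=3)
    runs.append(batch)
```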
Bulk translation of LCSH in the catalog
• authtranrbldfre script
  • Purpose: generate FREENG/GEO-FREENG authority records for a given set of ENGFRE/GEO-ENGFRE authority records
  • Syntax: authtranrbldfre authkeysfile reportfile [Y|N]
• authtranrbld script
  • Purpose: generate RVM translations in a given set of catalog records
  • Syntax: authtranrbld catkeysfile reportfile [Y|N]
  • feed ALL catalog records to the script; the complete catalog gets translated
  • number of catalog records reloaded: 200923
Future developments
• Automatically generate the translations of LCSH
  • based on RVM translation tables for uniterms (subfields a, x, y, z and v), to be bought from Université de Laval
  • need to build a local dictionary of ‘uniterms’ (not all uniterms are translated in RVM)
• 100% automated translation seems impossible
  • |aA|zZ => |aA’
    • |aFrench-Canadian literature|zQuebec (Province)
    • |aLittérature canadienne française|zQuébec (Province)
    • |aLittérature québécoise
  • |aA => {|aA’,|aA’’,…}
    • |aRight and left (Political science)
    • |aGauche (Science politique)
    • |aDroite (Science politique)
    • |aExtrême droite
    • |aNouvelle droite
    • |aExtrême gauche
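The two obstacles above (a multi-subfield heading that maps to a single French term, and an English term with several French equivalents) can be illustrated with a small Python sketch of uniterm translation. It is a sketch under assumptions, not a planned implementation: the two lookup tables, their contents and the function names are hypothetical. A result with exactly one candidate could be applied automatically; anything else would need a cataloger.

```python
from itertools import product
from typing import Dict, List, Tuple

Subfield = Tuple[str, str]  # (subfield code, value)

def parse_heading(heading: str) -> List[Subfield]:
    """'|aA|zZ' -> [('a', 'A'), ('z', 'Z')]"""
    return [(part[0], part[1:]) for part in heading.split("|") if part]

def candidate_translations(
    heading: str,
    whole_headings: Dict[str, str],
    uniterms: Dict[Subfield, List[str]],
) -> List[str]:
    """List every French heading the tables allow for an English heading."""
    if heading in whole_headings:      # whole-heading entry wins (case |aA|zZ => |aA')
        return [whole_headings[heading]]
    options = []
    for code, value in parse_heading(heading):
        translations = uniterms.get((code, value))
        if not translations:
            return []                  # untranslated uniterm: no automatic result
        options.append(["|" + code + t for t in translations])
    if not options:
        return []
    return ["".join(combo) for combo in product(*options)]

uniterms = {
    ("a", "French-Canadian literature"): ["Littérature canadienne française"],
    ("z", "Quebec (Province)"): ["Québec (Province)"],
    ("a", "Right and left (Political science)"): [
        "Gauche (Science politique)",
        "Droite (Science politique)",
    ],
}
whole = {"|aFrench-Canadian literature|zQuebec (Province)": "|aLittérature québécoise"}

literal = candidate_translations("|aFrench-Canadian literature|zQuebec (Province)", {}, uniterms)
preferred = candidate_translations("|aFrench-Canadian literature|zQuebec (Province)", whole, uniterms)
ambiguous = candidate_translations("|aRight and left (Political science)", whole, uniterms)
```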
Automated translation of LCSH
• Available soon in Randy’s API repository