FRBR: Algorithms and Applications T. Hickey J. Toves D. Vizine-Goetz Online Computer Library Center CLA November 2004
Outline • Algorithms • FRBR work matching • Handling author-title variants • Hardware • Beowulf cluster • Applications • Bookmarklets • FictionFinder • Future directions
Working with Group 1 Entities WEMI: Work Expression Manifestation Item • Strict expression-level determination is hard • We primarily divide by language • Manifestation is easier • We use the WorldCat master record
Work Identification • Algorithm goals: • Efficient • Understandable • Controllable by catalogers • Uses existing WorldCat records
The Algorithm • A key is generated for each record • Extract author, title • Look up in LC name authority file • Added entry information as needed • Form a key from bibliographic record • Author, title, added entry information • These can be sorted, compared
More Detail • Extract author names • Look up in authority file • Currently only personal names • Subfields $abcdq • Extract title • Always use uniform titles if present • Look up author/short title (~$a) • Look up author/long title (~$abfgnp) • Prefer alternative title for non-English • Create key from author/title • Always do NACO normalization (has limitations) • Add information for uncontrolled title-main-entry
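The key-building step can be sketched roughly as follows. This is a minimal illustration, not the production algorithm: `normalize` is a hypothetical, much-simplified stand-in for NACO normalization (it only strips diacritics, lowercases, and replaces punctuation with spaces), and the authority-file lookup and uniform-title preference are left out entirely.

```python
import re
import unicodedata
from typing import Optional


def normalize(text: str) -> str:
    """Simplified stand-in for NACO normalization (assumption, not the real rules)."""
    # Decompose accented characters and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Lowercase, turn punctuation (except '&') into spaces, collapse whitespace.
    cleaned = re.sub(r"[^a-z0-9&]+", " ", stripped.lower())
    return " ".join(cleaned.split())


def make_key(author: Optional[str], title: str) -> str:
    """Build an 'author/title' key; title-main-entry records get an empty author part."""
    author_part = normalize(author) if author else ""
    return f"{author_part}/{normalize(title)}"


if __name__ == "__main__":
    print(make_key("Dickens, Charles", "Hard Times"))
    print(make_key(None, "McGraw-Hill Encyclopedia of Science & Technology"))
```

Because the keys are plain strings, records that normalize to the same key can be grouped simply by sorting and comparing.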
Authority Files Rule! • Authors • Author/titles • Bring together variations • Allow override in difficult cases • Both splitting and joining groups • Especially important with xISBN matching • Especially important with non-English metadata
Limitations of the Authority File • What’s missing: • Many uniform titles • Many author variants • Many title variants • Language of heading • Partial solution • Create auxiliary files of mechanically generated matches
Results of FRBR Matching on WorldCat • 88% of works are ‘singletons’ (a single manifestation) • 30% of manifestations are in the other 12% of works • Average size of multiple matches: 3.1 manifestations/work • 43.1 million works in 54 million manifestations • 54% of holdings are on a FRBR work with more than one manifestation • WorldCat manifestations average about 20 holdings • FRBR helps where help is most needed
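A quick arithmetic check shows how these figures fit together, reading the 12% as the share of works that have more than one manifestation. The variable names are illustrative; the inputs are the rounded numbers from the slide, so the results are approximate.

```python
works = 43.1e6           # total FRBR works
manifestations = 54e6    # total WorldCat manifestations

multi_works = 0.12 * works                      # works with >1 manifestation
multi_manifestations = 0.30 * manifestations    # manifestations in those works
singleton_works = works - multi_works           # one manifestation each

# Average manifestations per multi-manifestation work (slide reports 3.1).
avg_per_multi_work = multi_manifestations / multi_works

# Singletons plus multi-work manifestations should roughly equal the total.
reconstructed_total = singleton_works + multi_manifestations

if __name__ == "__main__":
    print(round(avg_per_multi_work, 2))
    print(round(reconstructed_total / 1e6, 1))
```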
More FRBR Results • 310,000 works have more than 5 manifestations • 1.7 million have more than 2 manifestations • Largest: 30,000+ for the Bible • 1,537 Shakespeare’s Macbeth • 1,026 Dickens’s Christmas Carol
Our Beowulf Cluster • 24 Nodes • Each with 2x2.6 GHz processors • 4 GBytes memory (96 GBytes total) • One ‘head’ node, 23 ‘compute’ nodes • 46x40 GBytes disk (~2 Terabytes total) • Gigabit switch
What we are using it for • All our bibliographic processing • FRBR • Extractions • Searching • Matching
Starting point • FRBR key generation • 25 hours on a 3.00GHz workstation with 2GB of RAM • Generate two key files • sort by key, uniq by key, sort by occurrence • sort by key, post processing on keys, uniq by key, sort by occurrence • Merge key files
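The "sort by key, uniq by key, sort by occurrence" step amounts to counting identical keys and ordering the groups by size. A minimal single-machine sketch, assuming the keys fit in memory (the function name is hypothetical, and the original presumably worked on sorted files rather than in-memory lists):

```python
from collections import Counter


def collapse_keys(keys):
    """Count identical keys, then order groups largest first (ties by key)."""
    counts = Counter(keys)
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    for key, n in collapse_keys(["b", "a", "b", "c", "a", "b"]):
        print(n, key)
```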
FRBR on the Cluster • 44 minutes on the cluster • 69 key builders & 23 sort buckets with hyperthreading ON • Generate 23 radix-sorted, post-processed key files • Collapse and sort by occurrence in parallel • Also outputs additional files used by other jobs
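The idea behind the 23 sort buckets can be sketched as below: every key is routed to exactly one bucket, so each bucket can be collapsed independently (one per compute node) and the results combined. Note the swap: the slide describes radix-sorted buckets, while this sketch routes keys with a CRC32 hash for brevity. The function names are hypothetical, and the buckets are processed sequentially here rather than in parallel.

```python
import zlib
from collections import Counter

N_BUCKETS = 23  # one bucket per compute node, as on the slide


def bucket_of(key, n=N_BUCKETS):
    """Deterministically route a key to a bucket; identical keys always collide."""
    return zlib.crc32(key.encode("utf-8")) % n


def partition(keys, n=N_BUCKETS):
    buckets = [[] for _ in range(n)]
    for key in keys:
        buckets[bucket_of(key, n)].append(key)
    return buckets


def collapse_partitioned(keys, n=N_BUCKETS):
    """Collapse each bucket on its own, then merge and sort by occurrence."""
    total = Counter()
    for bucket in partition(keys, n):
        # Each bucket is independent work that could run on a separate node.
        total.update(Counter(bucket))
    return sorted(total.items(), key=lambda kv: (-kv[1], kv[0]))
```

Since no key ever appears in two buckets, the per-bucket counts never need reconciling; merging is a plain concatenation followed by the final sort.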
Application: Preservation • Identify ‘final copy’ items • Do it at the work level • Single-singles • Single manifestations with single holding • Found 18 million in WorldCat
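Finding "single-singles" reduces to grouping manifestations by work and keeping the works that have one manifestation with one holding. A minimal sketch, assuming each input record is a hypothetical `(work_key, holdings_count)` pair with one pair per manifestation:

```python
def single_singles(records):
    """Return work keys that have exactly one manifestation held by exactly one library."""
    holdings_by_work = {}
    for work_key, holdings in records:
        holdings_by_work.setdefault(work_key, []).append(holdings)
    return sorted(
        key
        for key, holdings in holdings_by_work.items()
        if len(holdings) == 1 and holdings[0] == 1
    )


if __name__ == "__main__":
    sample = [("w1", 1), ("w2", 1), ("w2", 4), ("w3", 7)]
    print(single_singles(sample))
```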
Application: xISBN • A simple Web service • Given an ISBN: • Identify the workset it is in • Return all other ISBNs in that workset • Results should be symmetrical! • Same group retrieved for each ISBN in group • ISBNs sorted by number of library holdings
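The lookup behind such a service can be sketched as two dictionaries: ISBN to workset, and workset to its member ISBNs. This is an illustration only, not the service's implementation; `WorksetIndex`, `normalize_isbn`, and the holdings counts in the usage example are hypothetical, and each ISBN is assumed to belong to exactly one workset.

```python
def normalize_isbn(isbn):
    """Strip hyphens and spaces so '0-19-281664-0' matches '0192816640'."""
    return isbn.replace("-", "").replace(" ", "").upper()


class WorksetIndex:
    def __init__(self, records):
        # records: iterable of (isbn, work_key, holdings_count)
        self.work_of = {}
        self.members = {}
        for isbn, work_key, holdings in records:
            isbn = normalize_isbn(isbn)
            self.work_of[isbn] = work_key
            self.members.setdefault(work_key, {})[isbn] = holdings

    def xisbn(self, isbn):
        """Return the whole workset for an ISBN, most widely held first."""
        work_key = self.work_of.get(normalize_isbn(isbn))
        if work_key is None:
            return []
        members = self.members[work_key]
        return sorted(members, key=lambda i: (-members[i], i))


if __name__ == "__main__":
    index = WorksetIndex([
        ("0192816640", "w1", 50),   # holdings counts are made up
        ("0820312037", "w1", 20),
        ("0140430210", "w1", 35),
    ])
    print(index.xisbn("0-19-281664-0"))
```

Returning the full group, including the queried ISBN, is what makes the result symmetrical: any member of the workset retrieves the same list.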
xISBN Example
http://labs.oclc.org/xisbn/0-19-281664-0 returns:
<?xml version="1.0" encoding="UTF-8" ?>
<idlist>
  <isbn>0192816640</isbn>
  <isbn>0820312037</isbn>
  <isbn>0820315370</isbn>
  <isbn>0393015920</isbn>
  <isbn>0393952274</isbn>
  <isbn>0393952835</isbn>
  <isbn>0140430210</isbn>
  <isbn>0192811320</isbn>
  <isbn>0192835947</isbn>
  <isbn>0460872885</isbn>
  <isbn>1853262706</isbn>
  <isbn>0874131219</isbn>
</idlist>
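A client could read this response with a standard XML parser. The sketch below parses an abbreviated copy of the response above (first three ISBNs only) rather than calling the service; `parse_idlist` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the response shown on the slide.
SAMPLE = """<?xml version="1.0" encoding="UTF-8" ?>
<idlist>
  <isbn>0192816640</isbn>
  <isbn>0820312037</isbn>
  <isbn>0820315370</isbn>
</idlist>"""


def parse_idlist(xml_text):
    """Return the ISBNs of an <idlist> document, in document order."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    return [el.text.strip() for el in root.iter("isbn") if el.text]


if __name__ == "__main__":
    print(parse_idlist(SAMPLE))
```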
Matching on ISBNs • ISBNs provide additional information beyond author/title • Allows relaxation of matching • Introduces possible errors • Offers the possibility of substantial improvement of work matching
Merging Worksets Using ISBN Matches • Pair ISBNs with FRBR keys (Starts with 10 million ISBNs) • Throw out ISBNs in single worksets • Throw out ISBNs in > 5 worksets (We now have 561,000 ISBNs left) • Are the titles similar enough? • Throw out large groups • Try to be very conservative • Authority file always overrides other matching
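The filtering steps above can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: keys are assumed to have the form `author/title`, the similarity test uses `difflib.SequenceMatcher` with an arbitrary 0.8 threshold, and the large-group and authority-file override steps are omitted. The function name and sample ISBNs are hypothetical.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def candidate_merges(pairs, max_worksets=5, min_similarity=0.8):
    """pairs: iterable of (isbn, frbr_key). Return key pairs that may be one work."""
    keys_by_isbn = defaultdict(set)
    for isbn, key in pairs:
        keys_by_isbn[isbn].add(key)

    merges = set()
    for keys in keys_by_isbn.values():
        # An ISBN in one workset tells us nothing; one in too many is suspect.
        if len(keys) < 2 or len(keys) > max_worksets:
            continue
        ordered = sorted(keys)
        for i, first in enumerate(ordered):
            for second in ordered[i + 1:]:
                title_a = first.partition("/")[2]
                title_b = second.partition("/")[2]
                # Conservative check: only pair keys whose titles look alike.
                if SequenceMatcher(None, title_a, title_b).ratio() >= min_similarity:
                    merges.add((first, second))
    return sorted(merges)


if __name__ == "__main__":
    sample = [
        ("111", "dickens, charles/tale of two cities"),
        ("111", "dickens, charles/a tale of two cities"),
        ("222", "dickens, charles/hard times"),
        ("222", "dickens, charles/bleak house"),
    ]
    print(candidate_merges(sample))
```

The title check is what rejects cases such as a collected edition whose single ISBN is attached to several distinct works by the same author.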
Matches from ISBN Matching • 74,000 author variants • ~200,000 title variants • These all create additional cross reference records • Automatically folded into FRBR matching • Kept separate from NACO file • Only used in research at this time
Examples of Possible Matches • /mcgraw hill encyclopedia of science & technology • /mcgraw hill encyclopedia of science & technology\1\aar aor • /mcgraw hill encyclopedia of science & technology\2\apa boo • /mcgraw hill encyclopedia of science & technology\3\bor cle • /mcgraw hill encyclopedia of science & technology\4\cli cyt • … • dickens, charles\1812 1870/tale of two cities • dickens, charles\1812 1870/hard times • dickens, charles\1812 1870/sketches by boz • dickens, charles\1812 1870/martin chuzzlewit • dickens, charles\1812 1870/bleak house • dickens, charles\1812 1870/little dorrit • dickens, charles\1812 1870/oliver twist • …
FictionFinder • Indexes fiction from WorldCat • Uses FRBR workset algorithm • Focused on fiction • Searching and browsing by • Genre • Fictitious Characters • Imaginary Places • Literary Forms • Links to • Google • Open WorldCat • Diane Vizine-Goetz’s project
Additional Matches • Match variant titles: • When the wind blows • When the wind blows: a novel • FictionFinder identified 10,000 similar variations • novela, novella, roman, … • Created auxiliary authority records • Now automatically used when FRBR algorithm is run
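One way to detect this kind of variant is to check whether a title's subtitle is only a generic form term. The sketch below is an assumption about how such pairs might be found, not the FictionFinder method; the term list covers just the examples on the slide plus "a novel" and "novel", and the function name is hypothetical.

```python
# Generic form terms; only the slide's examples plus obvious English forms.
GENERIC_SUBTITLES = ("a novel", "novel", "novela", "novella", "roman")


def strip_generic_subtitle(title):
    """Drop a subtitle that is only a form term; leave other titles unchanged."""
    head, sep, tail = title.partition(":")
    if sep and tail.strip().lower() in GENERIC_SUBTITLES:
        return head.strip()
    return title.strip()


if __name__ == "__main__":
    print(strip_generic_subtitle("When the wind blows: a novel"))
    print(strip_generic_subtitle("When the wind blows"))
```

Each title that changes under this function yields a (variant, base) pair that could be recorded as an auxiliary cross-reference.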
Future • Continued development of FictionFinder • Extending algorithm to serials? • FirstSearch displays • Additional matching criteria • Local authority files? • Integration of auxiliary files for production? • Exploring FRBRizing some European catalogs • Looking at extending beyond Roman characters
Links • IFLA FRBR - Final Report • http://www.ifla.org/VII/s13/frbr/frbr.htm • Article in DLib • http://www.dlib.org/dlib/september02/hickey/09hickey.html • OCLC Research Activities with FRBR • http://www.oclc.org/research/projects/frbr/ • FictionFinder • http://fictionfinder.oclc.org/ • Top 1000 • http://www.oclc.org/research/top1000/