Matching names in parallel

T. Hickey • Access 2006 • October 2006

Virtual International Authority File • Link national authority records • Build on their authority work • Move towards universal bibliographic control • Allow national or regional variations in authorized forms to co-exist • Support needs for variations in preferred language, script, and spelling • 10 million WorldCat records in non-English metadata

Joint VIAF Project

Matching Variations In the LCNAF and PND authority files: • Same name, same person • Same name, different people • Different names, same person • Missing person in one file

Same Name, Different People • Two different people, one name: Adams, Mike • PND: a golfer • LCNAF: author of a Beatles collector's guide

Different Names, Same Person • One person, two names • LCNAF: Morel, Pierre • PND: Morellus, Petrus

Enhancing the Authorities • [Diagram labels: Bibliographic Record, Derived Authority, Authority Record, Enhanced Authority]

Strong Matching Attributes • A work (title) in common • Common control numbers (ISBN, ISSN, or LCCN) • Exact birth and death year • Joint authors • Name as subject

Weaker Attributes • Only one of birth/death date(s) (allows some variation) • Subject area of works (two levels) • Format (books, films, musical scores, etc.) • Language • Publisher • Partial title match • Date of publication • Country • Role (author, illustrator, composer, etc.)
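One way to picture how strong and weaker attributes might combine is a weighted score over the attributes two candidate records share. The following is a minimal sketch only: the attribute names, the weights, and the `score_pair` function are assumptions for illustration, since the slides do not give the actual scoring used in VIAF.

```python
# Hypothetical attribute labels mirroring the two slides above.
STRONG = {"title", "control_number", "birth_and_death_year",
          "joint_author", "name_as_subject"}
WEAK = {"one_date", "subject_area", "format", "language", "publisher",
        "partial_title", "publication_date", "country", "role"}

# Assumed weights; the real values are not stated in the presentation.
STRONG_WEIGHT = 5
WEAK_WEIGHT = 1


def score_pair(shared_attributes):
    """Score a candidate pair from the set of attributes both records share."""
    score = 0
    for attribute in shared_attributes:
        if attribute in STRONG:
            score += STRONG_WEIGHT
        elif attribute in WEAK:
            score += WEAK_WEIGHT
        # Unknown attributes are ignored rather than guessed at.
    return score
```

With these assumed weights, a pair sharing a title and a language scores higher than a pair sharing only weak attributes, which is the qualitative behaviour the two slides describe.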

Computing it • Standard approach: generate keys and data; load information into a database; index it; extract fields needed • Map/Reduce approach: split the database up; run parallel jobs; bring information together via map/reduce; assemble information in stages

Map/Reduce • Two stages • Map: read in source file (e.g. MARC-21); write out key + data • Reduce: read in array of data for each unique key; write out key + data
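The two stages can be sketched in a few lines of single-process Python. This is an illustrative toy, not the implementation described in the talk: `run_map_reduce`, `map_names`, `reduce_names`, and the dictionary records standing in for parsed MARC-21 records are all assumptions.

```python
from collections import defaultdict


def run_map_reduce(records, map_fn, reduce_fn):
    """Run map over every record, group by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, data in map_fn(record):      # map: write out key + data
            groups[key].append(data)
    output = []
    for key in sorted(groups):                # reduce: one call per unique key
        output.extend(reduce_fn(key, groups[key]))
    return output


def map_names(record):
    """Emit (name, record id) for each name found in a record."""
    for name in record["names"]:
        yield name, record["id"]


def reduce_names(key, values):
    """Collapse the array of record ids seen for one name."""
    yield key, sorted(set(values))


records = [
    {"id": "r1", "names": ["Adams, Mike"]},
    {"id": "r2", "names": ["Adams, Mike", "Morel, Pierre"]},
]
result = run_map_reduce(records, map_names, reduce_names)
```

In a parallel setting the grouping step is what the framework distributes: each map job works on one split of the input, and all data for a given key is routed to the same reduce job.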

Overview of MapReduce Source: Dean & Ghemawat (Google)

Our Implementation • Written in Python • Uses ssh and XML-RPC for control and communication • Map/Reduce seems to add ~10% overhead • Ran an earlier implementation on a 48-CPU cluster • Current VIAF cluster is a 12-CPU cluster on 4 nodes • Running Linux and 64-bit Python

VIAF Matching Code • 17 modules • 1,100 lines of code • Plus: 600 lines of configuration; 2,755 lines of tables embedded in code

VIAF Data Flow • [Diagram] • Sources: PND Catalog, PND Authority, LC Catalog, LC Authority, each followed by Extract Data • Process labels: build compare data; build name:id map; map authorities; build buckets; eliminate forename, date conflicts from buckets; get changed Ids; identify compare data; select compare data; compare • Data labels: name: id; id: tag, data; authority id: bib id; surname: forename, date; potential pairs; changed authority ids; pair id: [bib/auth] id; pair id: compare data; pair id: scores
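The bucket steps in the data flow (build buckets keyed by surname, eliminate forename and date conflicts, identify potential pairs) can be sketched as follows. This is a rough illustration under stated assumptions: the `Entry` fields, the conflict rule (different forename initial, or two known birth years that differ), and the restriction to cross-file pairs are guesses at the general idea, not the VIAF rules.

```python
from collections import defaultdict, namedtuple
from itertools import combinations

# Hypothetical record shape; field names are assumptions.
Entry = namedtuple("Entry", "record_id source surname forename birth_year")


def build_buckets(entries):
    """Group entries into buckets keyed by lower-cased surname."""
    buckets = defaultdict(list)
    for entry in entries:
        buckets[entry.surname.lower()].append(entry)
    return buckets


def conflicts(a, b):
    """Assumed rule: forename initials differ, or both birth years known and unequal."""
    if a.forename[:1].lower() != b.forename[:1].lower():
        return True
    if a.birth_year and b.birth_year and a.birth_year != b.birth_year:
        return True
    return False


def potential_pairs(buckets):
    """Yield cross-file id pairs from each bucket that survive the conflict check."""
    for surname in sorted(buckets):
        for a, b in combinations(buckets[surname], 2):
            if a.source != b.source and not conflicts(a, b):
                yield (a.record_id, b.record_id)
```

Bucketing keeps the number of comparisons manageable: only entries sharing a surname key are ever paired, and obvious conflicts are discarded before the more expensive comparison of bibliographic data.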

WorldCat Identities • Bring together all of WorldCat’s information about people • Name(s) • Works by and about • Subjects • Dates • Fiction/non-fiction • Roles • Co-authors • Add links • Wikipedia • Authority files

Sample Identity

Statistics • Nearly 19 million different ‘identities’ in WorldCat • 80 million (nominally) controlled headings • The WorldCat Identity code is ~800 lines of Python in 4 modules (plus XSLT, CSS, etc.)

Identities Data Flow • [Diagram labels: WorldCat, FRBR, Audience, Cover Art, Authorities, Wikipedia, Stage 1, Stage 2, Stage 3, Stage 4, NameInfo, Citation, Citations, Identities]

Identities Stage 1: Extract Data From WorldCat • Input: WorldCat (MARC-21) • Map output: NameKey <nameInfo>; WorkID <citation> • Reduce output: WorkID <best citation>; NameKey <cumulative nameInfo>
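A reduce step for this stage has to handle two kinds of keys: work keys, where one citation is chosen, and name keys, where information accumulates. The sketch below is hypothetical throughout: the `("work", id)` / `("name", key)` key convention, the use of a `holdings` count to pick the "best" citation, and the fields inside `nameInfo` are all assumptions, as the slides do not define them.

```python
def reduce_stage1(key, values):
    """Reduce one key's values: pick a best citation or merge name info."""
    kind, _identifier = key  # assumed convention: ("work", id) or ("name", key)
    if kind == "work":
        # Assumption: "best" means the citation with the most holdings.
        return key, max(values, key=lambda citation: citation["holdings"])
    # Name keys: accumulate counts and roles across all values.
    merged = {"works": 0, "roles": set()}
    for info in values:
        merged["works"] += info["works"]
        merged["roles"] |= info["roles"]
    return key, merged
```

Emitting both key types from a single pass means WorldCat only has to be read once in this stage.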

Identities Stage 2: Extract Data From Authorities • Input: NACO Authorities file (MARC-21) • Map output: NameKey <authorityInfo>; XTos; XFroms • Reduce output: NameKey <authorityInfo, symmetric xrefs>
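One plausible reading of the XTos and XFroms outputs is that each cross-reference is emitted twice, once under the referring name and once under the target, so that the reduce step sees both directions. The sketch below follows that reading; it is an assumption, and the record fields (`name_key`, `heading`, `see_also`) and function names are hypothetical.

```python
from collections import defaultdict


def map_authority(record):
    """Emit authority info plus a cross-reference in each direction."""
    key = record["name_key"]
    yield key, ("info", record["heading"])
    for target in record["see_also"]:
        yield key, ("xto", target)      # reference from this name to target
        yield target, ("xfrom", key)    # the same reference, seen from target


def reduce_authority(key, values):
    """Merge authority info with cross-references from both directions."""
    heading = None
    xrefs = set()
    for tag, data in values:
        if tag == "info":
            heading = data
        else:
            xrefs.add(data)
    return key, {"heading": heading, "xrefs": sorted(xrefs)}


def build_authority_index(records):
    """Single-process stand-in for the map, group, and reduce steps."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_authority(record):
            groups[key].append(value)
    return dict(reduce_authority(key, values) for key, values in groups.items())
```

Under this reading, a record that is only ever the target of a reference still ends up with a link back to the referring name, which is what makes the cross-references symmetric.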

Identities Stage 3: Connect Citations with Names • Input (Stage 1 output): WorkID <by/about citations>; NameKey <nameInfo> • Map output: NameKey <nameInfo>; NameKey <topCitations>

Identities Stage 4: Create Identities • Input: authority info from stage 2; merged name info from stage 3; merged citations from stage 3 • Map output: pass through • Reduce output: Pnkey <Identity Record>

Schedules • Identities • Up this year? • VIAF • Reload, rematch this year • Public service up early 2007

Conclusions • Our merged files (e.g. WorldCat) are really quite large • More processing power opens up new ways of manipulating and looking at our data • Parallel processing is the only way to obtain the cycles needed • Map-Reduce is an attractive way to do parallel processing • Forces decomposition • Scales well • Opens up new possibilities

Thank you • T. Hickey • VIAF.org • http://errol.oclc.org/laf/n82-54463.html
