The ‘XML’ Project: Integrated Access to Scientific Resources Miriam Blake – Presenter Beth Goldsmith Mariella Di Giacomo Los Alamos National Laboratory Research Library

Rationale for Project
• 60+ million citations with multiple access points
• Duplicate records / citations
• No links between bibliographies and the records we store
• Need ‘smart objects’ with pointers (to full text, etc.)
• Wanted an updated interface with new features

Existing Databases at LANL
Citation (A&I) databases
• ISI
  • SciSearch, 1945-present: ~30M + 4k weekly
  • Social SciSearch, 1973-present: ~15M + 1k
  • Arts & Humanities, 1975-present: ~5M + .5k
  • ISI Proceedings, 1990-present: ~3M + .5k
  • All ISI dbs have associated citation records
• INSPEC: ~8M +
• BIOSIS: ~15M
• Engineering Index (Compendex)
• Other (DOE, LAUP/tech repts., GeoRef, OPAC, etc.)

Project Team
• 6 developers (librarians and programmers)
  • Miriam Blake, Doug Chafe, Mariella Di Giacomo, Frances Knudson, Beth Goldsmith, Mark Martinez, Ming Yu, Jeff Scott (hardware)
• Research Library staff
  • Librarians / metadata experts
  • Interface team – 2 staff doing JSP, HTML, and graphics for this project part time

Project Workflow
[Diagram: multiple vendor record formats (Vendor A, Vendor B, Vendor C) pass through conversion into a single XML record format, which is indexed into Verity indexes and MySQL indexes; the application uses these for search, browse, and display.]

Hardware
• Fault-tolerant architecture to provide reliability, flexibility, and speed
• Sun Solaris 2.8 platform
• Security environment
  • Data stored and accessed inside a firewall
  • Data accessed and application runs outside the firewall
• Required a data sharing file system (for Solaris)
  • LSC file system called QFS
  • Multiple readers, one writer per filesystem

[System architecture diagram. Components shown: a load balancer in front of several application servers, each paired with Verity brokers; Verity servers holding Verity collections, XML records, and author browse databases; MySQL slave servers; a user authentication database (MySQL, Linux); the firewall; a development environment with its own Verity broker/servers and MySQL slave server; and the SAN (Storage Area Network).]

Software components
• Verity search engine
• MySQL to handle author browse and user functions
• Interface
  • XSLT to transform XML for query result displays
  • Java servlets
  • JSP
  • Apache / Tomcat to handle Java/JSP presentation

Verity Search Engine
• Commercial product – used by many large companies
• Used in our older apps – users are familiar with its search capabilities
• Strength in full-text searching
• Required Solaris (now runs on Linux)
• Verity K2 – parallel multi-tiered architecture
  • Brokered approach
  • Searches are distributed to multiple servers to concurrently search multiple Verity collections
• LANL collections are broken up by year
  • Recs colls – bibliographic metadata
  • Cites colls – citations within articles
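
The brokered approach described above can be pictured as a fan-out and merge. The following is only a minimal sketch of that idea in Python, not Verity K2's API: `search_one`, the collection names, and the `score` field are all hypothetical stand-ins.

```python
from concurrent.futures import ThreadPoolExecutor

def brokered_search(query, collections, search_one, max_workers=8):
    """Send one query to every collection concurrently and merge the hits.

    search_one(collection, query) is a hypothetical callable that returns
    a list of dicts, each with at least a "score" entry.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(pool.map(lambda name: search_one(name, query), collections))
    hits = [hit for part in partials for hit in part]
    # Highest-scoring hits first, regardless of which collection they came from.
    return sorted(hits, key=lambda hit: hit["score"], reverse=True)

# Made-up per-year collections, used only to exercise the function.
def fake_search(collection, query):
    data = {
        "recs_2002": [{"id": "a", "score": 0.4}],
        "recs_2003": [{"id": "b", "score": 0.9}],
    }
    return data.get(collection, [])

results = brokered_search("medicine", ["recs_2002", "recs_2003"], fake_search)
print([hit["id"] for hit in results])
```

Splitting collections by year means each worker searches a smaller index, and the broker only has to merge already-ranked partial results.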
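
[Diagram: an ISI vendor record (bib record + citations) goes through ISI conversion and yields two XML records: a “recs” record with bib data, indexed into the Verity recs collection for searching, and a “cites” record with citation data items, indexed into the Verity cites collection for searching.]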

Record Structure
• Record keys <fullKey> – structure:
  • Combination of ISSN, author name, volume, issue, start page, and title letters
    /recs/sici00/0018-8190/46/2/173_SCIANCE-LSTROST
  • Not all elements are always present
• ISI records are split into 2 XML records with the same fullKey – one for bibliographic data and one for citations (bibliography)
• Bibliographic and cited records are indexed into separate collections for searching
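
To make the key layout concrete, here is a minimal sketch that assembles a string with the same shape as the sample key. The slides do not say how the title letters are chosen or what the `sici00`-style prefix encodes, so `title_letters` uses an invented rule (first letter of each word) and the prefix is simply passed in; both are assumptions, as is the function name `make_full_key`.

```python
import re

def title_letters(title, limit=8):
    """Hypothetical rule: first letter of each title word, upper-cased."""
    words = re.findall(r"[A-Za-z0-9]+", title)
    return "".join(word[0].upper() for word in words)[:limit]

def make_full_key(prefix, issn, volume, issue, start_page, author, title):
    """Build /recs/<prefix>/<issn>/<vol>/<issue>/<page>_<AUTHOR>-<LETTERS>.

    Elements that are missing (None) are left empty rather than guessed,
    since not all elements are always present.
    """
    surname = re.sub(r"[^A-Za-z]", "", (author or "").split(",")[0]).upper()
    parts = (issn, volume, issue, start_page)
    path = "/".join("" if part is None else str(part) for part in parts)
    return f"/recs/{prefix}/{path}_{surname}-{title_letters(title or '')}"

key = make_full_key("sici10", "1040-2446", 67, 1, 42,
                    "Vu, NV", "An example title for illustration")
print(key)
```

Because the bibliographic record and its cites record share one key, a single lookup on this string is enough to move between the two.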

Conversion to XML - recs
• Verity XML
  • Specific fields needed to handle vendor indexing requirements
• One XML record containing matching articles from multiple vendors
• Consistent XML tags across databases as much as possible
[Figure: example Verity XML bib record]

Kludges for Verity
• Sort fields
    <sorttitle>SPIRITUALITY IN MEDICINE A COMPARISON OF MEDICAL STUDENTS ATTITUDES AND CLINICAL PERFORMANCE </sorttitle>
    <sortauthor>MUSICK DW CHEEVER TR QUINLIVAN S NORA LM </sortauthor>
    <sortsource>ACADEMIC PSYCHIATRY 2003000027000002000000000067</sortsource>
    <sortdate>20030000000183296100001</sortdate>
• Display fields
    <resauthor>(Art)Musick, DW; Cheever, TR; Quinlivan, S; Nora, LM </resauthor>
    <ressource>(Art)Source: ACADEMIC PSYCHIATRY; SUM 2003; v.27, no.2, p.67-73 </ressource>
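
The sort fields above look like normalized text plus zero-padded numbers, which lets a plain string sort behave like a numeric one. The sketch below shows that technique in Python; the normalization rule and the field widths in `WIDTHS` are guesses read off the sample `<sortsource>` value, not a documented format.

```python
import re

def sort_text(value):
    """Upper-case the text, turn punctuation into spaces, collapse whitespace."""
    cleaned = re.sub(r"[^A-Za-z0-9 ]+", " ", value)
    return " ".join(cleaned.upper().split())

# Assumed widths for (year, volume, issue, start page); guessed, not documented.
WIDTHS = (4, 6, 6, 12)

def sort_source(journal, year, volume, issue, page):
    """Journal name followed by zero-padded numbers, so string order equals numeric order."""
    numbers = (year, volume, issue, page)
    padded = "".join(str(n).zfill(w) for n, w in zip(numbers, WIDTHS))
    return f"{sort_text(journal)} {padded}"

keys = [sort_source("Academic Psychiatry", 2003, v, 2, 67) for v in (27, 9, 100)]
print(sorted(keys))  # volumes come out in the order 9, 27, 100
```

Without the padding, a string sort would place volume 100 before volume 27 and volume 9 last.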

Kludges for Verity
• Zones
    <znumber>
      <issn db="Soc">1042-9670</issn>
      <controlNum db="Soc">000183296100001</controlNum>
    </znumber>
• Data enhanced tags
    <zjournal>
      <journalAbbrJ2 db="Soc">ACAD PSYCHIATR</journalAbbrJ2>
      <journalAbbrJ9 db="Soc">ACAD PSYCHIATRY</journalAbbrJ9>
      <journalAbbr db="Soc">Acad. Psych.</journalAbbr>
      <journalAbbrJ1 db="Soc">ACAD PSYCHI</journalAbbrJ1>
      <journal db="Soc">ACADEMIC PSYCHIATRY; SUM 2003; v.27, no.2, p.67-73</journal>
    </zjournal>

Unified record display
• Preference order for fields to display when multiple databases are present in the same record
  • Some fields should be deduplicated (e.g. title)
  • Some fields should display all data from all databases (e.g. subject, keywords)
• Becomes critical when multiple vendor records are displayed together
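
One way to read the display rules above is as a merge driven by a database preference list. The sketch below is an illustration under stated assumptions: only the `Soc` code appears in the slides, so the other codes, the ordering in `PREFERENCE`, the choice of single-value fields, and the sample data are all invented.

```python
# "Soc" appears in the slides; the other codes and this ordering are hypothetical.
PREFERENCE = ["Sci", "Soc", "Insp"]
SINGLE_VALUE_FIELDS = {"title"}  # show one value, taken from the most-preferred database

def unify(records):
    """records maps db code -> {field: [values]}; returns {field: [values]} for display."""
    def rank(db):
        return PREFERENCE.index(db) if db in PREFERENCE else len(PREFERENCE)

    merged = {}
    for db in sorted(records, key=rank):
        for field, values in records[db].items():
            if field in SINGLE_VALUE_FIELDS:
                # Keep only the first non-empty value from the best-ranked database.
                if values and field not in merged:
                    merged[field] = [values[0]]
            else:
                # Show everything from every database, skipping exact repeats.
                bucket = merged.setdefault(field, [])
                for value in values:
                    if value not in bucket:
                        bucket.append(value)
    return merged

# Made-up record present in two databases.
sample = {
    "Insp": {"title": ["Spirituality in medicine (INSPEC form)"],
             "subject": ["ethics", "education"]},
    "Soc": {"title": ["Spirituality in medicine"],
            "subject": ["education", "religion"]},
}
print(unify(sample))
```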

ISI Conversion - Cites
• Cites – citation data (bibliographies) in each bibliographic record
• Searchable separately from the articles which cite them
• 500+ million individual citations (~170M are unique)
• Can be searched by cited author, source, year, volume, or a combination thereof
• One cites XML record can have multiple <refItem> elements, one for each citation
• After conversion to XML, fullKeys are created for each <refItem> where possible

ISI Conversion - cites
• 26M records with bibliographies
• 500M individual citations
[Diagram: a record (Title: xxx) with its bibliography (Citation 1, Citation 2, Citation 3, Citation x…) is converted into <refItem> elements. Example cited item:
  Source (title, year, vol, issue, start page): Information and software technology 1996, v.38, #9, p.601
  3 authors: Ling, TW (1st author); Goh, CH; Lee, ML
  FullKey for this item: /recs/sici09/0950-5849/38/9/601_LING-ECFDFPDD]

Cites Fuzzy Matching
• Every <refItem> is processed to try to link it to the recs article it matches, using fullKey
• Uses “fuzzy matching” rules developed internally
• An internal db of ISSNs matches brief source data (PHYS REV B or P REV B)
• ISSN + cited author name, cited volume, and cited page create a fullKey that can match the key of a bib record, creating a link
• ~60% of bib records match a cite

XML Cited reference example
    <refItem type="ref">
      <fullKey>/recs/sici10/1040-2446/67/1/42_VU-6YCCPBAUSPASIUS</fullKey>
      <starKey>/recs/sici10/1040-2446/67/*/42_VU*</starKey>
      <citAu src="cit">VU, NV</citAu>
      <citAu src="bib">VU, NV</citAu>
      <citAu src="bib">BARROWS, HS</citAu>
      <citAu src="bib">TRAVIS, T</citAu>
      <citSo src="cit">ACAD MED</citSo>
      <citSo src="bib">ACADEMIC MEDICINE</citSo>
      <citSo src="bib">ACAD MED</citSo>
      <citYear src="cit">1992</citYear>
      <citVol src="cit">67</citVol>
      <citIssue src="bib">1</citIssue>
      <citPage src="cit">42</citPage>
      <citEndPg src="bib">50</citEndPg>
      <citIssn src="bib">1040-2446</citIssn>
    </refItem>
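
The `<starKey>` in this example suggests that a citation lacking an issue number and title is matched with wildcards. Here is a minimal Python sketch of that idea using shell-style patterns; the internal matching rules are not described in the slides, so the lookup table, the function names, and the linear scan are assumptions made for illustration (the ISSN and abbreviations come from the example itself).

```python
from fnmatch import fnmatchcase

# Tiny stand-in for the internal ISSN database of source abbreviations.
ISSN_BY_SOURCE = {
    "ACAD MED": "1040-2446",
    "ACADEMIC MEDICINE": "1040-2446",
}

def star_key(prefix, source, author, volume, page):
    """Build a wildcard key from the sparse data a citation carries.

    Returns None when the cited source cannot be resolved to an ISSN.
    """
    issn = ISSN_BY_SOURCE.get(" ".join(source.upper().split()))
    if issn is None:
        return None
    surname = author.split(",")[0].strip().upper()
    # The issue and the title letters are unknown, so both become wildcards.
    return f"/recs/{prefix}/{issn}/{volume}/*/{page}_{surname}*"

def find_matches(pattern, bib_keys):
    """Return every bib fullKey the wildcard pattern matches (linear scan)."""
    return [key for key in bib_keys if fnmatchcase(key, pattern)]

bib_keys = [
    "/recs/sici10/1040-2446/67/1/42_VU-6YCCPBAUSPASIUS",
    "/recs/sici10/1040-2446/67/1/142_VU-XXXX",  # made-up near miss
]
pattern = star_key("sici10", "Acad Med", "VU, NV", 67, 42)
print(find_matches(pattern, bib_keys))
```

A scan like this would not scale to hundreds of millions of citations; it only shows how the wildcard key relates to the full key.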

Matching citations and bib records
[Figure: a sample bibliography in which individual citations are annotated with their match result:
  Match on key /recs/sici01/0163-5808/29/3/76_LEE-CASXSL
  No record match found
  Match on key /recs/sici09/0950-5849/38/9/601_LING-ECFDFPDD]

Cited browse
• “citeinfo” database with over ½ billion individual citations (one of the largest MySQL databases around!)
• Individual <refItem>s include fullKeys (which come from the cite XML) for linking
• FullKeys are deduplicated
• Each cited author name is pulled from the <refItem>s, normalized, and added to the browse list tables
• Browse tables contain ~195 million names
• After deduplication, only ~12M unique names
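
Normalizing names before deduplication is what collapses ~195 million entries to ~12 million. The normalization rules used for the browse tables are not given, so the Python sketch below uses an invented rule (surname plus initials, upper-cased) purely to show the mechanism.

```python
import re
from collections import Counter

def normalize_name(raw):
    """Hypothetical rule: SURNAME + initials, upper-cased, punctuation dropped."""
    surname, _, rest = raw.replace(",", " ").strip().partition(" ")
    initials = re.sub(r"[^A-Za-z]", "", rest)
    return f"{surname} {initials}".upper().strip()

def browse_list(cited_authors):
    """Count how often each normalized name occurs; blank names are skipped."""
    return Counter(normalize_name(name) for name in cited_authors if name.strip())

names = ["VU, NV", "Vu, N.V.", "BARROWS, HS"]
print(browse_list(names))
```

This simple rule would mishandle multi-word surnames such as "Di Giacomo", which is one reason production rules tend to be more elaborate.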

Cited browse
• 12M unique names
• Browse cited papers
• Browse general search
[Screenshot annotations: total cite count; number of times each item is cited; links to record via fullKey]

Times cited
• <fullKey> is used to create real-time times-cited counts
• Counts are displayed in the bibliographic record as well as in cited browse
• The times-cited count is also pulled out and indexed into Verity to allow sorting of results by “times cited”
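
Counting by fullKey can be sketched in a few lines. The real counts are described as real-time and presumably come from the citeinfo database; the in-memory Python version below, with its invented keys and `(cited_full_key, citing_year)` input shape, only illustrates the grouping.

```python
from collections import Counter, defaultdict

def times_cited(citations):
    """citations: iterable of (cited_full_key, citing_year) pairs.

    Returns (totals, by_year): total citations per key, and per-key counts by year.
    """
    totals = Counter()
    by_year = defaultdict(Counter)
    for key, year in citations:
        totals[key] += 1
        by_year[key][year] += 1
    return totals, by_year

# Made-up citation pairs.
pairs = [("k1", 2002), ("k1", 2003), ("k1", 2003), ("k2", 2001)]
totals, by_year = times_cited(pairs)

# Keys ordered by "times cited", most cited first.
print([key for key, _ in totals.most_common()])
```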

Cited linkages
[Diagram: the full record for Citation 1 shows "Times cited: 96" and a breakdown of the number of times cited by year (Total 96; 2003: 12; 2002: 23; 2001: 24; 2000: 50; 1999: 70; 1998: 17). A list of records citing Citation 1 that were published in 2000 (Title A, Title B, Title C, …) links to each citing record's full record (Title: A, Published: 2000), whose bibliography (Citation 1, Citation 2, Citation 3, Citation x…) includes Citation 1.]

Cited browse
• Connections to the citeinfo MySQL database use connection pooling
• 100 connections, refreshed after every 10 queries (can be increased on the fly)
• Table structure optimizations reduced browse time to under 1 second on average
• Highly cited works (cited more than 10,000 times) are slow
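
One reading of the pooling bullets is that each pooled connection is closed and reopened once it has served a fixed number of queries. The Python sketch below illustrates that reading only; the class name is invented, and `sqlite3` stands in for MySQL so the example runs with the standard library alone.

```python
import queue
import sqlite3

class RecyclingPool:
    """Fixed-size pool that replaces a connection after max_queries uses."""

    def __init__(self, connect, size=100, max_queries=10):
        self._connect = connect
        self._max_queries = max_queries
        self._idle = queue.Queue()
        self.refreshes = 0
        for _ in range(size):
            self._idle.put((connect(), 0))

    def execute(self, sql, params=()):
        conn, used = self._idle.get()  # blocks while every connection is busy
        try:
            rows = conn.execute(sql, params).fetchall()
            used += 1
            if used >= self._max_queries:
                # Refresh: drop the worn connection and open a new one.
                conn.close()
                conn, used = self._connect(), 0
                self.refreshes += 1
            return rows
        finally:
            self._idle.put((conn, used))

pool = RecyclingPool(lambda: sqlite3.connect(":memory:"), size=2, max_queries=3)
for _ in range(6):
    rows = pool.execute("SELECT 1")
print(rows, pool.refreshes)
```

Exposing `size` and `max_queries` as parameters mirrors the note that the limits can be changed without redeploying, although a live resize is not implemented here.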

Adding MySQL to the mix
• Fast performance and an open source relational db
• On the Sun platform, can address up to 32GB of memory for query caching
• Used to provide browse capability for article authors / cited authors
• Also provides a live, disk-based backup to the XML bibliographic data
• Separate MySQL databases are used for user authentication and preferences and for current alerts services

Application - Requirements
• 250,000+ searches per month
• 3300 users have weekly alerts set up
• 115 run saved searches “on demand”
• Access requests from all over the world
  • National Inst. of Materials Physics – Bucharest, Romania
  • Univ. Program in Ecology – Duke University
  • Dept. of Biochemistry and Molecular Biology – U of Western Australia
  • National Center for Atmospheric Research – Boulder, CO

Application - Requirements
• Interface enhancements
  • Keep “successful” options from legacy interfaces
  • Add features based on user feedback
• Search screen options – features based on appropriate dbs
• Alerts and saved searches
• User preferences
• Marking and output
• SFX

Performance
• Many variables – attempts to improve each component
  • XML layout on the filesystems
  • Memory use
  • Network infrastructure
• Application issues
  • MySQL engine, Verity engine, JVM, Java compiler, XSL, and JSP
  • Application code itself

Lessons Learned
• As deadlines approach, design suffers
• Standards evolve more slowly than software
• As projects become bigger, teams need to formalize work patterns
• Project management tools are critical – Ant, CVS, Bugzilla

Next steps
• INSPEC will be added to ISI in October 2003
• Some interface rework to handle
  • INSPEC “only” users – no cited features
  • New / expanded list of indexes
  • Searches over INSPEC db only (not ISI)
• BIOSIS by the end of 2003
• Merging user databases across the product suite
• Expanding into a “component architecture”
• Increase use of standards and open source (MARCXML, OAI, etc.)