Persistent Archive for the NSDL
Reagan W. Moore, Charlie Cowart
University of California, San Diego
San Diego Supercomputer Center
(moore, charliec)@sdsc.edu
http://www.npaci.edu/DICE/
Persistent Archive Team
• Reagan Moore
• Sheau Yen Chen
• Charles Cowart
• George Kremenek
• Erdem Kulrul
• Richard Marciano
• Arcot Rajasekar
• Michael Wan
Status
• Architecture design
• Choice of web crawler
• Demonstration
• Proof of concepts
Architecture
• Built on existing tools
• Retrieve metadata
  • OAI metadata harvester
• Retrieve digital entities
  • Web crawler
• Organize and archive digital entities
  • Data grid
• Provide access
  • OAI and HTTP interfaces
OAI Interfaces
• OAI service provider interface
  • Used Tom Kalt’s (U Mass) OAI harvester classes
    • Initiate connection
    • Retrieve metadata as XML
    • Parse XML into objects
• OAI data provider interface
  • Custom CGI interface to SRB/MCAT written in C
  • Parses OAI2 requests and generates SRB client calls
  • Transforms from SRB objects to XML
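The harvester steps listed above (initiate connection, retrieve metadata as XML, parse XML into objects) can be sketched in Python. This is only an illustration under stated assumptions, not the Java harvester classes or the C CGI interface the slide refers to: `HarvestedRecord`, `list_records_url`, and `parse_list_records` are hypothetical names, the base URL is a placeholder, and the network fetch is left out so the parsing step can be run on a sample response. The `ListRecords` verb, the `metadataPrefix` and `from` arguments, and the XML namespaces are taken from OAI-PMH 2.0.

```python
from dataclasses import dataclass
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespaces defined by OAI-PMH 2.0 and unqualified Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

@dataclass
class HarvestedRecord:  # hypothetical container, not from the slides
    identifier: str
    datestamp: str
    title: str
    urls: list

def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None):
    """Build a ListRecords request URL; from_date enables selective harvesting."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date
    return base_url + "?" + urlencode(params)

def parse_list_records(xml_text):
    """Parse a ListRecords response into HarvestedRecord objects."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.findall("./oai:ListRecords/oai:record", NS):
        header = rec.find("oai:header", NS)
        dc = rec.find("oai:metadata/oai_dc:dc", NS)
        if header is None or dc is None:
            continue  # e.g. deleted records, which carry no metadata
        # dc:identifier values that look like URLs become crawl candidates.
        urls = [e.text for e in dc.findall("dc:identifier", NS)
                if e.text and e.text.startswith("http")]
        records.append(HarvestedRecord(
            identifier=header.findtext("oai:identifier", default="", namespaces=NS),
            datestamp=header.findtext("oai:datestamp", default="", namespaces=NS),
            title=dc.findtext("dc:title", default="", namespaces=NS),
            urls=urls,
        ))
    return records

# Sample response (made-up record) used in place of a live repository.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:example.org:item-1</identifier>
        <datestamp>2003-01-15</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Sample item</dc:title>
          <dc:identifier>http://example.org/item-1.html</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    print(list_records_url("http://example.org/oai"))
    for record in parse_list_records(SAMPLE):
        print(record)
```

The URLs collected in `urls` are the kind of list a crawler could then be pointed at.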
Web Crawler
• HTML crawler choice
  • WGET (GNU)
  • WebBase (Stanford)
  • HTML/XML translator (SDSC)
• Capabilities
  • Parallelized for performance
  • Recursively crawl web site
  • Build link graph structure
  • Translation of links to logical name space
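Two of the capabilities above, crawling a site to a limited depth and building a link graph, can be illustrated with a small Python sketch. It is not the code of WGET, WebBase, or the SDSC translator. Assumptions: `crawl` and `LinkExtractor` are hypothetical names, the caller supplies a `fetch(url)` function returning HTML text, the traversal is an iterative breadth-first walk rather than literal recursion, and parallel fetching is omitted.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkExtractor(HTMLParser):
    """Collect href/src attribute values from a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)

def crawl(start_url, fetch, max_depth=2):
    """Return a link graph {page URL: [absolute target URLs]}.

    Pages up to max_depth levels below start_url are fetched; only links
    that stay on the starting host are followed, but external links are
    still recorded as edges in the graph.
    """
    site = urlparse(start_url).netloc
    graph = {}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if url in graph:
            continue  # already visited
        parser = LinkExtractor()
        parser.feed(fetch(url))
        targets = [urljoin(url, link) for link in parser.links]
        graph[url] = targets
        if depth < max_depth:
            for target in targets:
                if urlparse(target).netloc == site and target not in graph:
                    queue.append((target, depth + 1))
    return graph

if __name__ == "__main__":
    # In-memory stand-in for a web site, so the sketch runs offline.
    pages = {
        "http://example.org/": '<a href="a.html">A</a> <a href="http://other.org/x">X</a>',
        "http://example.org/a.html": '<a href="b.html">B</a>',
        "http://example.org/b.html": '<a href="c.html">C</a>',
    }
    for page, links in crawl("http://example.org/", lambda u: pages.get(u, "")).items():
        print(page, "->", links)
```

Passing `fetch` as a parameter keeps the traversal logic testable without network access; a real fetcher would also need to handle redirects and errors.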
Data Grid
• Organize retrieved digital entities
  • Snapshot based (time)
  • Support for compound documents
  • Conversion of all internal URL links to SRB URL links, and associated SRB logical name space for digital entities
• Manage storage of digital entities
  • Store on disk / archive at SDSC, could be replicated to any other site
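The link conversion described above can be sketched as a mapping from original URLs to names in a snapshot-organized collection, followed by a rewrite of internal links. The slides do not give the SRB URL syntax or the collection layout, so the `/nsdl-archive/<snapshot>/<host>/<path>` scheme below is purely hypothetical, as are the function names; query strings are ignored, and the regular expression only handles double-quoted `href` and `src` attributes.

```python
import re
from urllib.parse import urljoin, urlparse

def logical_name(url, snapshot, root="/nsdl-archive"):
    """Map an original URL to a hypothetical logical path for one snapshot."""
    parts = urlparse(url)
    path = parts.path or "/"
    if path.endswith("/"):
        path += "index.html"  # give directory-style URLs a file name
    return f"{root}/{snapshot}/{parts.netloc}{path}"

# Matches double-quoted href/src attributes only (a simplification).
LINK_ATTR = re.compile(r'(href|src)="([^"]*)"')

def rewrite_links(html, page_url, snapshot, archived):
    """Rewrite links to archived entities; leave all other links untouched.

    archived is the set of absolute URLs stored in this snapshot.
    """
    def replace(match):
        attr, link = match.groups()
        absolute = urljoin(page_url, link)
        if absolute in archived:
            return f'{attr}="{logical_name(absolute, snapshot)}"'
        return match.group(0)  # external link: preserved as-is
    return LINK_ATTR.sub(replace, html)

if __name__ == "__main__":
    page = '<a href="page2.html">next</a> <a href="http://other.org/">ext</a>'
    print(rewrite_links(
        page,
        "http://example.org/docs/index.html",
        "2003-01-15",
        {"http://example.org/docs/page2.html"},
    ))
```

Keeping the snapshot date in the logical path means that the same original URL can appear in several snapshots without collisions.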
Implementation
• URL list generation from “harvesting of NSDL repository”
• Crawl and retrieve digital entities into a “buffer area”
• Archive into snapshot organized collections
• Flags / time stamps for changed data for OAI based retrieval
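The last step above, flagging changed data with time stamps so that OAI clients can retrieve only what is new, might look roughly like the following. The slides do not say how changes are detected; comparing SHA-256 checksums between crawls is an assumption made for this sketch, and `update_catalog`, `changed_since`, and the in-memory `catalog` dictionary are hypothetical stand-ins for whatever the metadata catalog actually records. Dates are ISO 8601 strings, which compare correctly as plain text.

```python
import hashlib

def checksum(content: bytes) -> str:
    """Content fingerprint used to detect changes between crawls."""
    return hashlib.sha256(content).hexdigest()

def update_catalog(catalog, crawled, snapshot_date):
    """Record new or changed entities and return their URLs.

    catalog: {url: {"checksum": str, "datestamp": str}}, updated in place
    crawled: {url: bytes} retrieved into the buffer area for this snapshot
    """
    changed = []
    for url, content in crawled.items():
        digest = checksum(content)
        entry = catalog.get(url)
        if entry is None or entry["checksum"] != digest:
            # New or modified entity: stamp it with this snapshot's date.
            catalog[url] = {"checksum": digest, "datestamp": snapshot_date}
            changed.append(url)
    return changed

def changed_since(catalog, from_date):
    """URLs whose datestamp is on or after from_date, as an OAI 'from' filter would select."""
    return sorted(url for url, entry in catalog.items()
                  if entry["datestamp"] >= from_date)

if __name__ == "__main__":
    catalog = {}
    update_catalog(catalog, {"u1": b"first", "u2": b"second"}, "2003-01-01")
    update_catalog(catalog, {"u1": b"first", "u2": b"second, revised"}, "2003-02-01")
    print(changed_since(catalog, "2003-01-15"))
```

Unchanged entities keep their earlier datestamp, so a harvester asking for records from a given date only sees entities that were added or modified in later snapshots.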
Demonstration
• Register digital entity by original URL
  • Store DC metadata
• Crawl based on text file of desired URLs
  • Tested on LoC American Memory collection
  • Currently crawl two levels
  • Manages CGI redirection
• Organize compound documents
  • Add SRB links for redirection
  • Preserve external web links
• Display results using INQ interface to SRB
SDSC Storage Resource Broker & Meta-data Catalog
[Figure: layered architecture diagram; labels listed below]
• Application
• Access APIs: C, C++, Libraries; Unix Shell; Linux I/O; OAI; DLL / Python; Java, NT Browsers; GridFTP
• Common APIs
• Consistency Management / Authorization-Authentication
• Prime Server
• Logical Name Space; Latency Management; Data Transport; Metadata Transport
• Catalog Abstraction; Storage Abstraction
• Archives: HPSS, ADSM, UniTree, DMF
• File Systems: Unix, NT, Mac OSX
• Databases: DB2, Oracle, SQLServer
• Databases: DB2, Oracle, Sybase
• Servers: HRM
General Information
• http://www.npaci.edu/DICE