This article discusses the technology choices and implementation strategies for the JSTOR Online Archive, a digital library of scholarly materials. It covers aspects such as storage, searchability, delivery, and server infrastructure.
Technology Choices for the JSTOR Online Archive Presented by Chang Feng Department of Computer Engineering and Computer Science, University of Missouri-Columbia, Columbia, MO 65211
Reference • Technology Choices for the JSTOR Online Archive, S. W. Thomas, K. Alexander, and K. Guthrie, Computer (February 1999), 60-65.
JSTOR Overview • Goals: To increase access to older scholarly materials by converting them to digital media and providing a full-text search capability. • Benefits: Preserving the original documents and conserving library shelf space. • Development phases: • Phase I (scheduled for completion by the end of 1999): a minimum of 100 journal titles, primarily in the humanities and social sciences. • As of December 1998: 67 journal titles, totaling 450,000 articles and 2.7 million pages.
Implementing JSTOR • Principles • Let the mission guide technical choices. • Put the user first. • Issues to be addressed when building the digital library • Formats (e.g., image vs. formatted text) • Storage, display and distribution technologies (e.g., CD-ROM vs. Internet)
Implementing JSTOR • Mission: A reliable and faithful electronic archive • Choice of technology: Scanned-in image at 600 dpi for each page. • Mission: Searchable • Choice of technology: Use OCR software to create text files that would let the user search journals’ full text. • Mission: Reduce long-term library costs • Choice of technology: Database storage centralized, with distribution over the Internet.
Delivering JSTOR Pages • Pages are delivered in GIF format: ~30 Kbytes/page. • The system converts each page to screen resolution as needed. • The system caches converted pages for 3-4 days. • Pages are delivered one at a time, with the next page pre-loaded. • Printing an entire article (at 600 or 150 dpi resolution): • JPrint as a separate application (faster) • Adobe Acrobat files • PostScript files
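The caching and pre-loading behaviour on this slide can be sketched roughly as follows. This is a minimal illustration, not JSTOR's actual code: the names ConvertedPageCache and deliver_page, the convert callback, and the 3-day default lifetime (the low end of the 3-4 days quoted above) are all assumptions made for the example.

```python
import time
from typing import Callable, Dict, List, Tuple


class ConvertedPageCache:
    """Keeps converted page images for a fixed time-to-live (hypothetical helper)."""

    def __init__(
        self,
        convert: Callable[[str], bytes],
        ttl_seconds: float = 3 * 24 * 3600,  # assumed: 3 days
        clock: Callable[[], float] = time.time,
    ) -> None:
        self._convert = convert  # stands in for the image-to-GIF conversion
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, bytes]] = {}

    def get(self, page_id: str) -> bytes:
        """Return the cached image, converting again if missing or expired."""
        now = self._clock()
        entry = self._entries.get(page_id)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]
        image = self._convert(page_id)
        self._entries[page_id] = (now, image)
        return image


def deliver_page(cache: ConvertedPageCache, pages: List[str], index: int) -> bytes:
    """Return the requested page and warm the cache for the following one."""
    image = cache.get(pages[index])
    if index + 1 < len(pages):
        cache.get(pages[index + 1])  # pre-load the next page of the article
    return image
```

Passing the clock in as a parameter keeps the expiry rule easy to exercise without waiting for real time to pass.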
Searching JSTOR • Graphical search interface. • The full text is stored in one file per page. • Each article also has a citation file. • Text files have embedded tags that specify which parts of the text belong to which article. • Separate index for each journal title. • Articles are indexed using Full-Text Lexicographer (U. of Michigan), which: • Allows dynamic updating (no index downtime). • Allows periodic index optimization with no downtime.
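To make the per-journal, per-page indexing idea concrete, here is a toy inverted index. It is only a sketch under stated assumptions: it does not reproduce the Full-Text Lexicographer software or its API, and the JournalIndex class, its methods, and the simple word-splitting rule are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set, Tuple

PageRef = Tuple[str, int]  # (article id, page number)


def _words(text: str) -> List[str]:
    """Lower-case the text and split it into alphanumeric words (assumed rule)."""
    return re.findall(r"[a-z0-9]+", text.lower())


class JournalIndex:
    """Toy inverted index for one journal title; pages can be added at any time."""

    def __init__(self) -> None:
        self._postings: Dict[str, Set[PageRef]] = defaultdict(set)

    def add_page(self, article_id: str, page_number: int, text: str) -> None:
        """Index one page's text file under the article it belongs to."""
        for word in _words(text):
            self._postings[word].add((article_id, page_number))

    def search(self, query: str) -> Set[PageRef]:
        """Return the pages that contain every word of the query."""
        words = _words(query)
        if not words:
            return set()
        result = set(self._postings.get(words[0], set()))
        for word in words[1:]:
            result &= self._postings.get(word, set())
        return result
```

Keeping one JournalIndex per journal title, for example in a dictionary keyed by title, mirrors the separate-index-per-journal choice on the slide; new pages become searchable as soon as add_page returns.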
Browser Interoperability • Major issue: Backward compatibility. • Support the HTML 3.2 standard. • The JSTOR interface uses frames, but can adjust itself automatically to an unframed interface. • Use new technology to enhance functionality, but not to provide basic functionality. • Plug-ins are not encouraged.
JSTOR Server Infrastructure • Storage: • Online: 600 dpi TIFF page images compressed with Cartesian Perceptual Compression (1:4, CPI Inc.). • Offline: multiple copies of the original TIFF images for archival purposes. • Performance: • Replacing CGI programs with FastCGI or Java servlets. • Server mirroring
Issues of Server Mirroring • Mirror server load balancing: Currently using a round-robin method. • Mirror server synchronization: Currently, new releases (> 1 GB/month) are shipped overnight on magnetic tape to mirror sites. • User state synchronization: Currently, • The current server regenerates the data if possible, or • The current server requests the information from the server that originally created it and caches that copy for future use.
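The round-robin selection and the user-state fallback described above might look like the sketch below. Everything named here (round_robin, MirrorStateStore, the regenerate and fetch_from_origin callbacks) is hypothetical and only illustrates the two rules on the slide.

```python
from itertools import cycle
from typing import Callable, Dict, Iterator, List, Optional


def round_robin(mirrors: List[str]) -> Iterator[str]:
    """Yield mirror names in turn, starting over after the last one."""
    if not mirrors:
        raise ValueError("at least one mirror is required")
    return cycle(mirrors)


class MirrorStateStore:
    """User state lookup on one mirror: regenerate locally, else ask the origin."""

    def __init__(
        self,
        regenerate: Callable[[str], Optional[str]],  # returns None if not possible
        fetch_from_origin: Callable[[str], str],     # asks the server that created it
    ) -> None:
        self._regenerate = regenerate
        self._fetch = fetch_from_origin
        self._cache: Dict[str, str] = {}

    def get(self, key: str) -> str:
        """Apply the two rules in order: local regeneration, then a cached origin copy."""
        value = self._regenerate(key)
        if value is not None:
            return value
        if key not in self._cache:
            self._cache[key] = self._fetch(key)  # keep the copy for future use
        return self._cache[key]
```

Calling next() on the iterator returned by round_robin(["mirror-a", "mirror-b"]) alternates between the two placeholder names, which is all that plain round-robin balancing does; it takes no account of each mirror's current load.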
Authentication • Cross-organization access management • JSTOR currently relies on participating institutions to supply authenticated IP addresses. • Under evaluation: • Digital certificates issued by the participating institutions. • Password-based access control.
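IP-address-based access control of the kind described above can be sketched with Python's standard ipaddress module. The institution names and address ranges below are placeholders (the ranges are reserved documentation blocks), and the function names are assumptions for illustration, not part of JSTOR's system.

```python
import ipaddress
from typing import Dict, List, Optional, Union

Network = Union[ipaddress.IPv4Network, ipaddress.IPv6Network]


def build_access_table(ranges_by_institution: Dict[str, List[str]]) -> Dict[str, List[Network]]:
    """Parse the address ranges that each institution has supplied."""
    return {
        name: [ipaddress.ip_network(r) for r in ranges]
        for name, ranges in ranges_by_institution.items()
    }


def institution_for(client_ip: str, table: Dict[str, List[Network]]) -> Optional[str]:
    """Return the institution whose ranges contain the client, or None."""
    address = ipaddress.ip_address(client_ip)
    for name, networks in table.items():
        if any(address in network for network in networks):
            return name
    return None


# Placeholder data: hypothetical institutions with documentation-only ranges.
EXAMPLE_TABLE = build_access_table({
    "Example University": ["192.0.2.0/24"],
    "Sample College": ["198.51.100.0/25"],
})
```

A request from an address outside every supplied range gets None, which is where the options under evaluation (certificates or passwords) would have to take over.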
Conclusions • The choice of technology is based on the mission of the project and user feedback. • Technology choices must remain flexible.