CS 502 Computing Methods for Digital Libraries • Cornell University – Computer Science • Herbert Van de Sompel • herbertv@cs.cornell.edu • Lecture 5: A research perspective on Digital Libraries
URLs to some of these DLs • ADS: http://adswww.harvard.edu/ • NCSTRL: http://www.ncstrl.org • UCSTRI: http://www.cs.indiana.edu:800/cstr/cover.html • arXiv: http://arXiv.org • LTRS: http://techreports.larc.nasa.gov/ltrs/ • NTRS: http://techreports.larc.nasa.gov/cgi-bin/NTRS
DL Architectural Review Assumptions made in this perspective • things start with TCP/IP connectivity • distribute full content (reports, software, etc.) • not only metadata
DL Architecture History: approach 1 • Build a special client and server (generally using Motif/X11, Tcl/Tk, etc.), and use TCP/IP as the transport protocol only • pros: rich functionality • cons: high development cost, client distribution problem • observation: many of these projects spent more time building the interfaces, protocols, searching, etc. than populating their DL!
DL Architecture History: approach 2 • Use standard protocols built upon TCP/IP: SMTP, FTP, Gopher, WAIS, HTTP, etc. • con: less functionality (restricted by protocol) • pros: lower development cost, uses commonly available clients • observation: this approach is now the most common • The DLs listed on slide 2 fit into this category
Early TCP/IP DLs • a very old one: IETF: http://www.ietf.org/ • Internet RFCs • Very first TCP/IP DL?
Early TCP/IP DLs • Netlib • http://www.netlib.org/ • begun in 1985, distributing mathematical software via e-mail (SMTP) • other access methods and protocols added (ftp, X11 client, http)
Los Alamos arXiv • Physics pre-print server • http://xxx.lanl.gov/ == http://arXiv.org • begun in 1991 as an e-mail service to exchange TeX source of pre-prints in high energy physics • ftp, http access added shortly afterwards • Now THE communication channel in Physics • Paul Ginsparg
Characteristics of early TCP/IP, non-HTTP DLs • Useful • could get the “thing” that you were looking for • Constrained by transport protocol • SMTP, FTP, etc. interface inherently “clunky” • Higher level services such as searching, sophisticated browsing, etc. difficult to implement • Small scale • would the same systems work well if the holdings went from 100’s or 1000’s to millions?
Characteristics of early TCP/IP, HTTP DLs • Initial HTTP implementations / conversions provided only incremental steps in DL improvement • a “nice” ftp interface, maybe with better searching and browsing • but the nature of the DLs changed little • LTRS is an example of an HTTP DL that is really: FTP+Searching(WAIS)+Browsing • http://techreports.larc.nasa.gov/ltrs/ • Also check out the user interface of http://arXiv.org
Early TCP/IP, HTTP DLs • But http is a very general transport protocol, and it is possible to build even higher level protocols on top of it • Combine this with the expressive HTTP client (web browser), and there is a lot of potential • Dienst • (http://www.ncstrl.org/Dienst/htdocs/Info/protocol4.html) • builds an actual DL protocol on top of HTTP • 1994 -- the first to do so? • Open Archives Initiative: metadata harvesting protocol on top of HTTP
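As a rough illustration of layering a DL protocol on top of HTTP, a harvesting request can be written as an ordinary HTTP GET whose query string carries the protocol verb. The sketch below assumes the verb and parameter names of the later OAI-PMH 2.0 specification (`ListRecords`, `metadataPrefix`, `from`); the base URL and the function name `harvest_url` are hypothetical, and nothing is sent over the network.

```python
from urllib.parse import urlencode

# Hypothetical repository endpoint, used only for illustration.
BASE_URL = "http://repository.example.org/oai"


def harvest_url(base_url, metadata_prefix="oai_dc", from_date=None):
    """Build a ListRecords request URL (parameter names per OAI-PMH 2.0)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date is not None:
        # Selective harvesting: only records changed since this date.
        params["from"] = from_date
    return base_url + "?" + urlencode(params)


url = harvest_url(BASE_URL, from_date="2001-01-01")
print(url)
```

The point of the sketch is that the DL-specific semantics live entirely in the request parameters, so any HTTP client can act as a harvester.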
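Sophistication increases, tracks meet [Figure: sophistication (vertical axis) plotted against time (horizontal axis) for a library automation track and a research track. Stages on the research track, in order of increasing sophistication: e-mail; ftp / gopher; http (LTRS, e-print, Netlib, etc.); http Dienst.]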
A Framework for Distributed Digital Object Services Kahn/Wilensky Framework [Kahn 1995] • 1995 • A high level document • Almost a definition of key concepts, terminologies, … for next generation DLs • Foundation for a research discipline? • Not detailed enough to be a real architecture. • Architecture is independent of the type of data stored in the DL
KWF: key terms • digital object (do) • A do is a data structure that contains • Digital data; data is typed (cf MIME) • Persistent Key Metadata; especially handle • Other metadata (for instance Terms and Conditions) • handle • a handle is a unique, persistent name for a do • repository • The place where do’s live • Has unique global name • Repository Access Protocol (RAP) • To deposit/access do’s in repositories
KWF: flow • An originator makes a digital object, which consists of data and key-metadata (the handle) • The handle comes from a handle generator • The do can go in a repository, at which point it becomes a stored do • The repository keeps a properties record per do (key metadata: handle; other metadata: terms and conditions) and a transaction record per do • The repository registers the do’s handle with a handle server, at which point the do becomes a registered do • A client accesses/deposits the do in repositories by means of the Repository Access Protocol • What the client receives as a result of an access to a do is a dissemination.
Digital objects • do = data + key-metadata • data is typed; core types include: • bit-sequence / set-of-bit-sequences • digital-object / set-of-digital-objects • handle / set-of-handles • other types can be defined, and registered with a global type registry • definition and registration left undefined • ~ similar to MIME • key-metadata includes handle • possibly other metadata (left undefined in KWF)
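The "do = data + key-metadata" structure can be sketched as a small typed record. This is a minimal sketch, not part of the KWF itself: the class and field names are hypothetical, the handle syntax is made up, and because the framework leaves type registration undefined, the sketch simply accepts only the core types listed above.

```python
from dataclasses import dataclass, field
from typing import Dict

# Core data types named on the slide. Treating this as a closed set is an
# assumption: the KWF allows further types via a global type registry.
CORE_TYPES = {
    "bit-sequence", "set-of-bit-sequences",
    "digital-object", "set-of-digital-objects",
    "handle", "set-of-handles",
}


@dataclass
class DigitalObject:
    handle: str        # key-metadata: unique, persistent name
    data_type: str     # type of the data, loosely analogous to a MIME type
    data: object       # the typed digital data itself
    other_metadata: Dict[str, str] = field(default_factory=dict)

    def __post_init__(self):
        if self.data_type not in CORE_TYPES:
            raise ValueError(f"unregistered data type: {self.data_type}")


report = DigitalObject(
    handle="example/tr-001",   # hypothetical handle syntax
    data_type="bit-sequence",
    data=b"bytes of a technical report",
)
print(report.handle, report.data_type)
```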
Digital objects • Composite do’s: • a do with data of type digital-object • non-composite do’s are elemental do’s • composite do’s can, for instance, be used to collect similar works together • a composite do then contains a do for each work of Shakespeare...
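The Shakespeare example can be sketched as a composite do whose data is a set of elemental do's. The class, the function `elemental_handles`, and the handles are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class DigitalObject:
    handle: str
    data_type: str
    data: Union[bytes, "DigitalObject", List["DigitalObject"]]

    def is_composite(self) -> bool:
        # Composite: the data is itself one or more digital objects.
        return self.data_type in ("digital-object", "set-of-digital-objects")


def elemental_handles(do: DigitalObject) -> List[str]:
    """Collect the handles of all elemental do's reachable from `do`."""
    if not do.is_composite():
        return [do.handle]
    members = do.data if isinstance(do.data, list) else [do.data]
    handles: List[str] = []
    for member in members:
        handles.extend(elemental_handles(member))
    return handles


hamlet = DigitalObject("example/hamlet", "bit-sequence", b"text of Hamlet")
macbeth = DigitalObject("example/macbeth", "bit-sequence", b"text of Macbeth")
works = DigitalObject(
    "example/shakespeare", "set-of-digital-objects", [hamlet, macbeth]
)
print(elemental_handles(works))
```

Because a composite do has its own handle, the collection can be named and accessed independently of the works it contains.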
Changing digital objects • Mutable do’s can be changed once placed in a repository • key-metadata cannot be changed • the do’s handle never changes! • Immutable do’s cannot be changed once placed in a repository • however, they can be deleted
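The mutability rules can be sketched as checks that a repository applies before changing a stored do. This is an illustrative sketch only: the class and method names are hypothetical, and the choice of `PermissionError` is an assumption, not something the KWF prescribes.

```python
from dataclasses import dataclass


@dataclass
class StoredObject:
    handle: str      # key-metadata; never changes after deposit
    data: bytes
    mutable: bool


class Repository:
    def __init__(self):
        self._objects = {}

    def deposit(self, obj: StoredObject) -> None:
        self._objects[obj.handle] = obj

    def replace_data(self, handle: str, new_data: bytes) -> None:
        obj = self._objects[handle]
        if not obj.mutable:
            raise PermissionError(f"{handle} is immutable")
        obj.data = new_data  # only the data changes; the handle stays fixed

    def delete(self, handle: str) -> None:
        # Deletion is allowed for both mutable and immutable do's.
        del self._objects[handle]

    def __contains__(self, handle: str) -> bool:
        return handle in self._objects


repo = Repository()
repo.deposit(StoredObject("example/draft", b"v1", mutable=True))
repo.deposit(StoredObject("example/final", b"v1", mutable=False))
repo.replace_data("example/draft", b"v2")   # permitted
repo.delete("example/final")                # permitted even though immutable
```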
Handles • Guest lecture by Professor Arms 02/19
Repositories • A network accessible storage system in which digital objects may be stored for possible subsequent access or retrieval • A stored do is a do that resides in a repository • A registered do is a do that the repository has registered with a handle server • storing and registering can be the same or different processes
Repositories • A repository keeps a properties record for each do • contains key-metadata and any other metadata the repository chooses to keep • A do may have a transaction record associated with it in a repository
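The distinction between stored and registered do's, together with the two per-do records, can be sketched as follows. Everything here is a hypothetical illustration: the KWF does not specify record layouts, so the dictionary fields, class names, and repository name are assumptions.

```python
from datetime import datetime, timezone


class HandleServer:
    """Maps a handle to the global name of the repository holding the do."""

    def __init__(self):
        self._locations = {}

    def register(self, handle, repository_name):
        self._locations[handle] = repository_name

    def resolve(self, handle):
        return self._locations.get(handle)


class Repository:
    def __init__(self, name, handle_server):
        self.name = name                  # unique global name
        self.handle_server = handle_server
        self.data = {}                    # handle -> stored data
        self.properties = {}              # handle -> properties record
        self.transactions = {}            # handle -> transaction record

    def _log(self, handle, event):
        stamp = datetime.now(timezone.utc).isoformat()
        self.transactions.setdefault(handle, []).append((stamp, event))

    def store(self, handle, data, other_metadata=None):
        # After this call the do is a stored do.
        self.data[handle] = data
        self.properties[handle] = {
            "handle": handle,
            "other_metadata": dict(other_metadata or {}),
        }
        self._log(handle, "stored")

    def register(self, handle):
        # After this call the do is also a registered do.
        self.handle_server.register(handle, self.name)
        self._log(handle, "registered")


server = HandleServer()
repo = Repository("repo.example.org", server)
repo.store("example/tr-001", b"report", {"terms": "public"})
print(server.resolve("example/tr-001"))   # stored but not yet registered
repo.register("example/tr-001")
print(server.resolve("example/tr-001"))
```

Keeping `store` and `register` as separate calls mirrors the slide's remark that storing and registering can be different processes; a repository could equally call both in one step.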
Repository Access Protocol • “Protocol” may be misleading; it is really just the concept for a protocol • RAP is designed to be simple; higher level services should come from other protocols • KWF defines 3 basic operation classes: • ACCESS_DO [metadata; key-metadata, digital object] • A dissemination of a do is the result of a request to access a do • DEPOSIT_DO [metadata; key-metadata, digital object] • ACCESS_REF • this is a means to tell the world about other ways (protocols) to access do’s in the repository.
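Since RAP is only the concept for a protocol, the three operation classes can at best be sketched as an abstract interface. The method signatures, the `Dissemination` record, and the example access strings below are all hypothetical; in particular, treating ACCESS_REF as a repository-level call is an assumption made for brevity.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Dissemination:
    handle: str                 # key-metadata of the do that was accessed
    data: bytes
    metadata: Dict[str, str]


class RapRepository:
    def __init__(self, alternative_access: List[str]):
        self._store: Dict[str, Dissemination] = {}
        self._alternative_access = list(alternative_access)

    def deposit_do(self, handle: str, data: bytes,
                   metadata: Dict[str, str]) -> None:
        """DEPOSIT_DO: place a do in the repository."""
        self._store[handle] = Dissemination(handle, data, dict(metadata))

    def access_do(self, handle: str) -> Dissemination:
        """ACCESS_DO: the result of the request is a dissemination."""
        stored = self._store[handle]
        # Return a copy so the caller cannot alter the stored do.
        return Dissemination(stored.handle, stored.data, dict(stored.metadata))

    def access_ref(self) -> List[str]:
        """ACCESS_REF: advertise other protocols for reaching the do's."""
        return list(self._alternative_access)


repo = RapRepository(alternative_access=["http", "ftp"])
repo.deposit_do("example/tr-001", b"report", {"title": "A report"})
result = repo.access_do("example/tr-001")
print(result.handle, repo.access_ref())
```

Keeping the interface this small reflects the design goal stated above: searching, browsing, and similar services would be layered on top by other protocols.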
Terms and Conditions • TC are attached to: • each do • each dissemination • each repository • TC are a precondition for any operation on the above • Repositories responsible for enforcing TC
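Terms and Conditions [Figure: diagram relating repository, digital object, and dissemination. A repository (with its own terms and conditions) holds N digital objects; each digital object and each dissemination has its own terms and conditions and its own data. Figure 1 from 95 TR-1593]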
Digital Objects: Terms and Conditions • Set by originator and/or repository • Can be arbitrarily complex, but generally consist of: • permissions: read, write, etc. • authentication - person, group, etc. • payment • 3rd party intervention (possibly in support of the above)
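How a repository might enforce terms and conditions as a precondition can be sketched with a simple check covering permissions, authentication, and payment. This is an assumption-heavy sketch: the field names are hypothetical, third-party intervention is omitted, and requiring both the repository's and the do's TC to permit an operation is one plausible reading, not something the KWF mandates.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class TermsAndConditions:
    allowed_operations: Set[str] = field(default_factory=set)  # "read", "write"
    allowed_users: Set[str] = field(default_factory=set)       # authenticated ids
    fee_required: bool = False

    def permits(self, operation: str, user: str, has_paid: bool = False) -> bool:
        if operation not in self.allowed_operations:
            return False
        if user not in self.allowed_users:
            return False
        if self.fee_required and not has_paid:
            return False
        return True


def allowed(operation, user, has_paid, repository_tc, object_tc):
    """Assumed rule: the TC at both levels must permit the operation."""
    return (repository_tc.permits(operation, user, has_paid)
            and object_tc.permits(operation, user, has_paid))


repository_tc = TermsAndConditions({"read", "write"}, {"alice", "bob"})
object_tc = TermsAndConditions({"read"}, {"alice"}, fee_required=True)

print(allowed("read", "alice", True, repository_tc, object_tc))
print(allowed("read", "alice", False, repository_tc, object_tc))
print(allowed("write", "alice", True, repository_tc, object_tc))
```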
Readings • Kahn, R. & Wilensky, R. 1995. A Framework for Distributed Digital Object Services. http://WWW.CNRI.Reston.VA.US/home/cstr/arch/k-w.html • Arms, W.Y. 1995. Key Concepts in the Architecture of the Digital Library. In: D-Lib Magazine. http://www.dlib.org/dlib/July95/07arms.html • VanHeyningen, M. 1994. The Unified Computer Science Technical Report Index: Lessons in indexing diverse resources. http://www.cs.indiana.edu/ucstri/paper/paper.html