University Library Experience: CDL Case Study • 30 June 2005 • John Kunze, California Digital Library
California Digital Library • A university library with no books, students, or faculty • Central services for 10 campus libraries • Content hosting: electronic texts, web-based material, datasets, finding aids • Linked: California museums & archives • Plus a Digital Preservation Program
What’s digital preservation? • Safeguarding electronic information • Viability (intact bit streams) • Renderability (by machines) • Understandability (by humans) • There’s no preservation if we don’t know what it’s called • CDL core need for persistent identifiers
What’s a persistent identifier? • An identifier that is valid for long enough • “Valid” and “long enough” are service- and user-dependent • What’s an identifier? It’s an association between a string and a thing. It follows that: • An id is not a string of data (good) • An id is a matter of opinion, not fact; there will be at least one other provider, in series if not in parallel, or your objects die with you (inconvenient) • Same thing, two strings; or same string, two things • Often: same string, different metadata • Often: same string, parallel things diverging over time due to different preservation practices (e.g., migrations)
Accepting some disorder • Long-term preservation won’t happen unless objects can change residence and diverge • Campus snapshot to CDL; subsequent snapshots • Publisher to dim CDL archive; later CDL to SS? • Better if object lives in several places at once • Eventually, Producer loses control of copies • Multiple opinions and practices will flourish • Static, id-based persistence claims soon become irrelevant • “urn:…”, “hdl:…”, etc. reflect hopes of people long gone • Not pretty, but the alternative (loss) is worse
Agreeing to disagree • What we say, but shouldn’t (not loudly): • Don’t re-assign a persistent id to something else • Or don’t replace a persistent object with another • What we do: • Knowingly replace our persistent objects (typos, drafts, format conversions, home page redesign) • Honestly provide a real kind of persistence, but with very different replacement policies • There won’t be one way of doing this within CDL, let alone outside it
Diverse persistence practice • How dissimilar must two objects be before they get different ids? • CDL’s home-grown Digital Preservation Repository (open source) is self-service: • Lets the Submitter decide • Makes preservation a joint responsibility • Requirement: need to be able to tell users what flavor of permanence is in effect
CDL Persistent Ids Must… • Identify, whether or not the object is at hand • It may not be convenient, helpful, or permitted for you to inspect the object itself -- metadata needed • Convey different flavors of permanence • Lead to access (if authorized) • Not strictly an “identification” problem, but it is the “404 not found” that we need to fix • Be valid for some longish period • Be carried on, in, or with the object
How to choose an id scheme • All CDL requirements are purely about service • Candidate schemes: URL, PURL, URN, ARK, Handle, DOI, MD5, GUID, ISxx, … • CDL gets no direct service help from any scheme; no scheme or syntax confers persistence of any kind • We then ask: which schemes are lowest cost and lowest risk?
Myths to fight against • Harmful Fallacy 1. A URL is a location, and is therefore inherently unstable. (ridiculous) • Harmful Fallacy 2. Explicit server/resolver names make URLs inherently unstable. • So “loc.gov” is less stable than “handle.net” and the implicit global resolvers that it depends on? • Harmful Fallacy 3. HTTP-based resolvers will not scale for persistent access. (Google) • Harmful Fallacy 4. URLs are the problem. • “Cool URIs don’t change” -- Tim Berners-Lee
Impersistence - big factors • Bankruptcy - no successor found • Loss of funding - no successor found • Loss of political support • War, social upheaval, natural disaster • Scheme impact: zero
Impersistence - lesser factors • Deliberately or accidentally, objects are: • Removed • Replaced • Moved without setting up a redirect • Everyone has an indirection mechanism, though most don’t use it • Scheme impact: zero
Impersistence - small factors • Your org likes persistent ids in principle, but: • It doesn’t know that vanilla web servers trivially support 500,000 redirect directives • It lacks the expertise or staff to maintain a web server, a two-column database table, and a nightly report writer that generates the server config file • Scheme impact: zero
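The "two-column table plus nightly report writer" setup above can be sketched in a few lines. This is a minimal illustration, not CDL's actual tooling: the table name `redirects`, its column names, the function names, and the example ids and URLs are all hypothetical, and an in-memory SQLite database stands in for whatever database an organization already runs. The output uses the `Redirect` directive syntax of Apache's mod_alias.

```python
import sqlite3


def build_table(rows):
    """Create the two-column table: identifier path -> current location.

    An in-memory SQLite database is used here purely for illustration.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE redirects ("
        "id_path TEXT PRIMARY KEY, "   # the persistent id as a URL path
        "target_url TEXT NOT NULL)"    # where the object lives today
    )
    conn.executemany("INSERT INTO redirects VALUES (?, ?)", rows)
    return conn


def render_redirect_config(conn, status=301):
    """Write one mod_alias-style Redirect line per table row."""
    lines = []
    query = "SELECT id_path, target_url FROM redirects ORDER BY id_path"
    for id_path, target_url in conn.execute(query):
        # Whitespace would split the directive into the wrong arguments.
        if any(ch.isspace() for ch in id_path + target_url):
            raise ValueError(f"whitespace in row for {id_path!r}")
        lines.append(f"Redirect {status} {id_path} {target_url}\n")
    return "".join(lines)
```

A nightly job would write the returned text to a file that the server config includes, then reload the server. When an object moves, only its row in the table changes; the id itself stays the same.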
Scheme costs and risks • Every modern service must support indefinitely, and eventually find or be given replacements for, at least: • Web server, web browser, and DNS • In addition, URN, Handle, and DOI resolution need a global proxy or a plugin for every access • ARK could use a plugin, but doesn’t need it • Handle and DOI also require • You to maintain an extra local server • The community to maintain a set of global servers • For the CDL • Handle and DOI come with highest risk • ARK comes with lowest risk
Persistence - indirect factors • CDL’s persistence requirements call for an id scheme (not service) connecting users to: • metadata • whether and what kind of persistence • sub-object and variant inferences • core ids on proxy failure (gracefully) • Scheme impact: ARK provides these • A scheme is not a service (DOI is not CrossRef) • When choosing a scheme, we wanted to remain independent of extra external service providers
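Two of the points above, reaching metadata and persistence statements from the id and recovering the core id when a proxy fails, can be sketched as follows. This is a hedged illustration: the function names and hostnames are hypothetical, and the "?" and "??" suffixes follow the inflection convention described in the ARK specification. What a given server actually returns for them is not covered in these slides.

```python
def core_id(ark_or_url):
    """Return the host-independent part of an ARK, starting at 'ark:'.

    The hostname in front of 'ark:' is treated as disposable, so the
    core id survives when that host (proxy) goes away.
    """
    pos = ark_or_url.find("ark:")
    if pos < 0:
        raise ValueError("no 'ark:' label found")
    # Drop any trailing '?' or '??' inflection.
    return ark_or_url[pos:].split("?", 1)[0]


def inflections(resolver_base, ark_or_url):
    """Build access, metadata, and persistence-statement URLs for one ARK.

    resolver_base is whichever host currently serves the object
    (hypothetical here).
    """
    url = resolver_base.rstrip("/") + "/" + core_id(ark_or_url)
    return {
        "object": url,              # plain access
        "metadata": url + "?",      # brief metadata (ARK '?' inflection)
        "persistence": url + "??",  # commitment statement ('??' inflection)
    }
```

For example, an ARK first published under `http://old.example.org/` can be re-pointed at `https://new.example.net` by calling `inflections("https://new.example.net", "http://old.example.org/ark:/99999/fk4test")`; the part after the hostname is unchanged.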
Our Stuff vs Their Stuff • Persistence can be split into • the Our Stuff Problem • the Their Stuff Problem • It makes no sense for CDL to assign persistent ids to Their Stuff • Their Stuff can be hugely important to our users, but we don’t control it and cannot vouch for it • Where we can afford it, we track it with PURLs • CDL does assign persistent ids to Our Stuff
Distribution of Id Assignment • Objects are ingested in flows from other libraries under submission agreements • Each object has an ARK after ingest • Either it has one already • Or we give it one upon entry • Campuses can mint their own ARKs or rely on our minting service • Their own campus ARK namespace is theirs to divide up as they wish
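The ingest rule above (keep an existing ARK, otherwise mint one) can be sketched with a toy minter. This is an assumption-laden illustration, not CDL's minting service: the `Minter` class, the per-campus `prefix`, the vowel-free alphabet, the counter-based names, and the example NAAN `99999` are all hypothetical choices made to keep the example deterministic.

```python
# Digits plus consonants; leaving out vowels avoids accidental words.
# (Illustrative alphabet, an assumption of this sketch.)
ALPHABET = "0123456789bcdfghjkmnpqrstvwxz"


def encode(n, width=4):
    """Encode a non-negative integer in the alphabet, left-padded to width."""
    base = len(ALPHABET)
    digits = []
    while n:
        n, r = divmod(n, base)
        digits.append(ALPHABET[r])
    text = "".join(reversed(digits)) or ALPHABET[0]
    return text.rjust(width, ALPHABET[0])


class Minter:
    """Hand out opaque names within one namespace (NAAN plus a prefix)."""

    def __init__(self, naan, prefix):
        self.naan = naan
        self.prefix = prefix  # hypothetical per-campus sub-namespace
        self.counter = 0

    def mint(self):
        name = self.prefix + encode(self.counter)
        self.counter += 1
        return f"ark:/{self.naan}/{name}"


def ingest(obj, minter):
    """Ensure an object has an ARK after ingest: keep its own, or mint one."""
    if not obj.get("ark"):
        obj["ark"] = minter.mint()
    return obj
```

A campus that mints its own ids would run its own `Minter` over its own namespace and submit objects with the `ark` field already set, in which case `ingest` leaves it untouched.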
Opaque ids with semantic extensions • CDL dilemma: • Opaque ids are needed for names that age and travel well • Semantically laden ids are helpful in providing many id services • Hybrid: • Opaque ids are used to name abstract preservation objects • Semantic and sometimes transient extensions address components inside of objects (the set of components evolves over time anyway)
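The hybrid above can be illustrated by splitting an id into its opaque object name and its semantic extension. This is a minimal sketch under one stated assumption: extensions are appended after a "/" following the object name. The function name, the returned field names, and the example id are hypothetical.

```python
def split_ark(ark):
    """Split an ARK into the opaque object id and any semantic extension.

    Assumes the form ark:/NAAN/Name[/extension...], with the extension
    being everything after the first '/' that follows the name.
    """
    label = "ark:/"
    if not ark.startswith(label):
        raise ValueError("expected an id starting with 'ark:/'")
    naan, _, rest = ark[len(label):].partition("/")
    name, _, extension = rest.partition("/")
    if not naan or not name:
        raise ValueError("missing name assigning authority number or name")
    return {
        "object": f"ark:/{naan}/{name}",  # opaque, long-lived part
        "extension": extension or None,   # semantic, possibly transient part
    }
```

With this split, `ark:/99999/fk4x7z/chapter3/page2.tiff` names the object `ark:/99999/fk4x7z` and the component `chapter3/page2.tiff`. If the components are later reorganized, the opaque object id still identifies the preserved object.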