An Introduction to Repositories
Thornton Staples, Director of Community Strategy and Alliances, Director of the Fedora Project
Creating a digital library is not a process of moving the traditional library online. Increasingly, it’s more about the care and feeding of the web!
Creating digital surrogates of paper collections is only the beginning
• Surrogate collections are an important step!
• Collecting born-digital materials is rapidly coming upon us
• Simple institutional repository approaches are good but only scratch the surface
• Complex scholarly and scientific projects are the biggest challenge
Repositories are designed to be flexible and adaptable
• Relational databases are too rigid
• Need to be able to add new content types and media easily
• Need to be able to handle arbitrary complexity in relatively simple ways
• Above all, it all needs to be durable over a very long time!
[Diagram: layered architecture]
Applications: Scholars Workbench; Institutional Repository; Preservation and Archiving; Data Curation Solutions
The Repository (content abstraction)
Storage: RAID arrays; cloud storage; tape libraries
Repositories are the foundation for many applications
• A set of abstractions that can be used to represent different kinds of data
• Manages the actual content beneath the surface
• Negotiates the connection between access and storage
• Designed to make data “durable” over the long term
Access is the core purpose of a repository
• Searching is important but it is not the only thing
• Finding is the point of searching!
• The point of finding is very often to use the resource that you have found, for analysis or reuse
• New digital resources that reuse found objects depend on continuing access for validity
Any unit of content may have more than one context
• Within one collection
  • An architectural image may relate to more than one building
• Across collections
  • Special collections images may be art objects
• Across repositories
  • Born-digital publications will almost always cross institutional boundaries
Authenticity and fidelity
• What is an authoritative digital surrogate of a real object?
• When is a copy of an original surrogate exact?
• A born-digital object has no original to compare against
• Digital “fingerprints” must be captured and managed as metadata
• When formats change, objects will not have all the same technical characteristics…
Making complex digital information “durable” is a very hard problem
• Durability implies that digital content is directly in use and sustained long-term
• A history of the changes to the encoding and state of content must be reliably provided
• A meaningful context for any unit of content may be one of many and must be sustained
• Replication appears to be our best friend and the cloud looks like an answer
Management is the core function of a repository
• Repositories are designed to keep everything as stable as possible while providing flexible access
• Managing things such that when they aren’t changing they are reliably the same
• Accounting for migration for technical reasons
• Disaster preparedness (lots of copies!)
• Must respect legal and policy issues
Repository abstractions provide a durability framework for managing content
• Content is “unitized” as information objects that combine data, metadata, policies, relationships and the history of the object
• Complex digital resources are formally defined graphs of related objects
• The public view of the content is presented as virtual data components
A data object is one unit of content
[Diagram: structure of a data object]
• Persistent ID
• Reserved datastreams: DC, RELS-EXT, AUDIT, POLICY
• Custom datastreams (any type, any number): 1, 2, … n
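
The object structure on this slide can be sketched as a small data model. This is only an illustration: the class names, the `add` method, and the `demo:1` identifier are assumptions made for the example and are not the repository's actual API. Only the four reserved datastream names come from the slide.

```python
from dataclasses import dataclass, field
from typing import Dict

# Reserved datastream names as listed on the slide.
RESERVED_IDS = ("DC", "RELS-EXT", "AUDIT", "POLICY")


@dataclass
class Datastream:
    """One component of a data object (hypothetical model)."""
    ds_id: str
    mime_type: str
    content: bytes


@dataclass
class DataObject:
    """One unit of content: a persistent ID plus its datastreams."""
    pid: str
    reserved: Dict[str, Datastream] = field(default_factory=dict)
    custom: Dict[str, Datastream] = field(default_factory=dict)

    def add(self, ds: Datastream) -> None:
        # Reserved names go in the reserved slot; anything else is custom.
        target = self.reserved if ds.ds_id in RESERVED_IDS else self.custom
        target[ds.ds_id] = ds


obj = DataObject(pid="demo:1")  # made-up identifier
obj.add(Datastream("DC", "text/xml", b"<dc/>"))
obj.add(Datastream("IMAGE", "image/jpeg", b"\xff\xd8"))
print(sorted(obj.reserved), sorted(obj.custom))
```

Keeping reserved and custom datastreams in separate maps mirrors the split shown in the diagram: a fixed set of well-known components, plus any number of content-specific ones.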
Files are stored on disk and managed directly
• Versioning is necessary
• Checksums for each file provide assurance that the file has not changed
• Can be managed by the repository or as remote files
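
The checksum and versioning bullets can be illustrated with a minimal sketch. The slide does not name a digest algorithm, so SHA-256 is an assumption here, and `VersionedFile`, `Version`, and `sha256_hex` are hypothetical names introduced for the example.

```python
import hashlib
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Version:
    content: bytes
    checksum: str  # hex SHA-256 digest recorded when the version was stored


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


class VersionedFile:
    """Keeps every stored version together with its recorded checksum."""

    def __init__(self) -> None:
        self.versions: List[Version] = []

    def store(self, content: bytes) -> Version:
        # A new version is appended; earlier versions are never overwritten.
        version = Version(content, sha256_hex(content))
        self.versions.append(version)
        return version

    def verify(self, index: int = -1) -> bool:
        # Recompute the digest and compare it with the recorded one.
        version = self.versions[index]
        return sha256_hex(version.content) == version.checksum


f = VersionedFile()
f.store(b"first draft")
f.store(b"second draft")
ok_before = f.verify(0)

# Simulate silent corruption: the bytes change but the recorded checksum does not.
original = f.versions[0]
f.versions[0] = Version(b"first drafT", original.checksum)
ok_after = f.verify(0)
print(ok_before, ok_after)
```

The recorded checksum is the "fingerprint": as long as recomputing it gives the same value, the stored bytes are the ones that were deposited.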
Virtual datastreams provide the access abstraction
• Can be simply retrieving a stored component
• Views of the content can be derived on demand, for different formats and resolutions
• Other data productions can be derived on demand; i.e. tiles from a JPEG2000 file
• By providing an abstract view of the content you break the dependence on the stored files
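
One way to picture a virtual datastream is a lookup from a requested view to a function that either returns the stored bytes or derives a new representation on demand. The sketch below assumes a tiny CSV datastream and a JSON view; the `stored` dictionary, the view names, and `get_view` are all made up for illustration, and the sample data is invented.

```python
import json
from typing import Callable, Dict

# Stored component, keyed by a hypothetical datastream ID.
stored: Dict[str, bytes] = {"DATA": b"name,year\nBuilding A,1900\n"}


def as_stored(raw: bytes) -> bytes:
    # Simplest case: the view is just the stored component.
    return raw


def csv_to_json(raw: bytes) -> bytes:
    # Derived view: convert the stored CSV into JSON when requested.
    lines = raw.decode("utf-8").strip().splitlines()
    header = lines[0].split(",")
    rows = [dict(zip(header, line.split(","))) for line in lines[1:]]
    return json.dumps(rows).encode("utf-8")


VIEWS: Dict[str, Callable[[bytes], bytes]] = {
    "csv": as_stored,
    "json": csv_to_json,
}


def get_view(ds_id: str, view: str) -> bytes:
    """Return the requested view; callers never touch the stored file."""
    return VIEWS[view](stored[ds_id])


print(get_view("DATA", "json").decode("utf-8"))
```

Because callers ask for a view by name, the stored format could later be migrated and only the derivation functions would need to change.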
[Diagram: Content Access | Content Management]
Descriptive metadata is about the content of the resource
• Indexed for searching
• Also used for rendering user experiences
• Some standards in use:
  • Dublin Core - general
  • MODS - bibliographic
  • VRA Core - cultural heritage
  • FGDC - GIS datasets
  • DDI - social science datasets
Administrative metadata is more about the encoding and use
• Metadata about the object generally, like checksums
• Technical metadata about the specifics of the encoding of each format
• Event metadata, about what happens to an object over its lifetime; audit trails
• Policy metadata, like access restrictions and credit lines
Relationships Among Objects
• Describes adjacency relationships among objects, among units of content
• Can be done by explicitly listing IDs in XML, using METS for example
• or using RDF: PID – typeOfRelationship – relatedObjectPID
• Can be used to assemble complex resources and aggregations of objects
• Explicit and implicit aggregations
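
The PID – typeOfRelationship – relatedObjectPID pattern can be sketched with plain tuples, without any RDF library. All identifiers and relationship names below are illustrative assumptions. The example shows one image with more than one context and a collection assembled from its members.

```python
from typing import List, Tuple

# (PID, typeOfRelationship, relatedObjectPID)
Triple = Tuple[str, str, str]

triples: List[Triple] = [
    ("demo:image1", "depicts", "demo:buildingA"),
    ("demo:image1", "depicts", "demo:buildingB"),
    ("demo:image1", "isMemberOf", "demo:collection1"),
    ("demo:image2", "isMemberOf", "demo:collection1"),
]


def related(pid: str, relationship: str) -> List[str]:
    """Objects that `pid` points to through `relationship`."""
    return [o for s, p, o in triples if s == pid and p == relationship]


def members(collection_pid: str) -> List[str]:
    """Assemble an aggregation: every object that declares membership."""
    return [s for s, p, o in triples
            if p == "isMemberOf" and o == collection_pid]


print(related("demo:image1", "depicts"))
print(members("demo:collection1"))
```

Here the collection never lists its members; the aggregation is assembled by querying the relationships that the member objects assert.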
Establishing and Enforcing Policies
• Policies must be established for the entire life-cycle of the information
• Ownership and workflow policies
• Access and use policies
• Policies associated with sustaining (or not!)
• Policies must be expressed for end users
• Policies must also be expressed for machine access
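
A policy expressed for machine access could, in its simplest form, be a table of which roles may perform which actions, denied by default. The sketch below is an assumption-laden toy: the `Policy` class, the role names, and the action names are invented for illustration and do not represent any particular policy language.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class Policy:
    # Maps an action to the set of roles allowed to perform it.
    allowed: Dict[str, Set[str]] = field(default_factory=dict)

    def permits(self, role: str, action: str) -> bool:
        # Default deny: anything not listed explicitly is refused.
        return role in self.allowed.get(action, set())


policy = Policy(allowed={
    "read": {"public", "staff"},
    "modify": {"staff"},
})

print(policy.permits("public", "read"))
print(policy.permits("public", "modify"))
```

The same table could also be rendered as a human-readable statement for end users, which is the other half of the requirement on the slide.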
Indexing
• In a repository there is no “catalog”; the repository is the catalog
• Many indexes can be created for many reasons
• Either metadata or full content, or both
• Ontology-based indexes are rapidly becoming more feasible
• Keeping indexes updated is the trick
Indexing as a harvesting service
[Diagram labels: Ingest; Fedora Repository Service; repository publishes events; Simple JMS; OAI; services listen and consume events or other messages; GSearch; Blacklight]
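
The idea that the repository publishes events and indexing services listen for them can be sketched in-process. This is a simplified stand-in, not JMS or any real messaging API: `Repository`, `TitleIndex`, the event dictionary shape, and the sample titles are assumptions made for the example.

```python
from typing import Callable, Dict, List, Set

Event = Dict[str, str]


class Repository:
    """Stores objects and publishes an event for every ingest."""

    def __init__(self) -> None:
        self.objects: Dict[str, str] = {}  # pid -> title
        self.listeners: List[Callable[[Event], None]] = []

    def subscribe(self, listener: Callable[[Event], None]) -> None:
        self.listeners.append(listener)

    def ingest(self, pid: str, title: str) -> None:
        self.objects[pid] = title
        event = {"type": "ingest", "pid": pid}
        for listener in self.listeners:
            listener(event)


class TitleIndex:
    """An index kept up to date by consuming repository events."""

    def __init__(self, repo: Repository) -> None:
        self.repo = repo
        self.by_word: Dict[str, Set[str]] = {}

    def on_event(self, event: Event) -> None:
        if event["type"] != "ingest":
            return
        # Harvest the object named in the event and index its title words.
        title = self.repo.objects[event["pid"]]
        for word in title.lower().split():
            self.by_word.setdefault(word, set()).add(event["pid"])


repo = Repository()
index = TitleIndex(repo)
repo.subscribe(index.on_event)
repo.ingest("demo:1", "Campus drawings")
repo.ingest("demo:2", "Campus photographs")
print(sorted(index.by_word["campus"]))
```

Any number of indexes can subscribe in the same way, which is one approach to the "keeping indexes updated" problem: the repository does not need to know which indexes exist.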