An Introduction to Repositories

Thornton Staples, Director of Community Strategy and Alliances, Director of the Fedora Project. Creating a digital library is not a process of moving the traditional library online. Increasingly, it’s more about the care and feeding of the web!

Presentation Transcript


  1. An Introduction to Repositories • Thornton Staples • Director of Community Strategy and Alliances • Director of the Fedora Project

  2. Creating a digital library is not a process of moving the traditional library online. Increasingly, it’s more about the care and feeding of the web!

  3. Creating digital surrogates of paper collections is only the beginning • Surrogate collections are an important step! • Collecting born-digital materials is rapidly coming upon us • Simple institutional repository approaches are good but only scratch the surface • Complex scholarly and scientific projects are the biggest challenge

  4. Repositories are designed to be flexible and adaptable • Relational databases are too rigid • Need to be able to add new content types and media easily • Need to be able to handle arbitrary complexity in relatively simple ways • Above all, it all needs to be durable over a very long time!

  5. (Diagram) Scholars Workbench • Institutional Repository • Preservation and Archiving • Data Curation Solutions / The Repository (Content abstraction) / RAID Arrays • Cloud Storage • Tape Libraries

  6. Repositories are the foundation for many applications • A set of abstractions that can be used to represent different kinds of data • Manages the actual content beneath the surface • Negotiates the connection between access and storage • Designed to make data “durable” over the long term

  7. Access is the core purpose of a repository • Searching is important but it is not the only thing • Finding is the point of searching! • The point of finding is very often to use the resource that you have found, for analysis or reuse • New digital resources that reuse found objects depend on continuing access for validity

  8. Any unit of content may have more than one context • Within one collection • An architectural image may relate to more than one building • Across collections • Special collections images may be art objects • Across repositories • Born-digital publications will almost always cross institutional boundaries

  9. Authenticity and fidelity • What is an authoritative digital surrogate of a real object? • When is a copy of an original surrogate exact? • A born-digital object has nothing to compare against • Digital “fingerprints” must be captured and managed as metadata • When formats change, objects will not have all the same technical characteristics…

  10. Making complex digital information “durable” is a very hard problem • Durability implies that digital content is directly in use and sustained long-term • A history of the changes to the encoding and state of content must be reliably provided • A meaningful context for any unit of content may be one of many and must be sustained • Replication appears to be our best friend and the cloud looks like an answer

  11. Management is the core function of a repository • Repositories are designed to keep everything as stable as possible while providing flexible access • Managing things such that when they aren’t changing they are reliably the same • Accounting for migration for technical reasons • Disaster preparedness (lots of copies!) • Must respect legal and policy issues

  12. Repository abstractions provide a durability framework for managing. • Content is “unitized” as information objects that combine data, metadata, policies, relationships and the history of the object. • Complex digital resources are formally defined graphs of related objects. • The public view of the content is presented as virtual data components.

  13. A data object is one unit of content • Persistent ID • Reserved Datastreams: DC, RELS-EXT, AUDIT, POLICY • Custom Datastreams (any type, any number): 1, 2, … n
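
The object layout on this slide can be sketched as a small data structure. This is only an illustration, not Fedora's actual API: the class names `DataObject` and `Datastream`, the `demo:1` identifier, and the `IMAGE` datastream are all hypothetical. Only the reserved datastream IDs come from the slide.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Reserved datastream IDs listed on the slide.
RESERVED_IDS = ("DC", "RELS-EXT", "AUDIT", "POLICY")


@dataclass
class Datastream:
    """One named component of an object (hypothetical illustration)."""
    dsid: str
    mime_type: str
    content: bytes = b""


@dataclass
class DataObject:
    """One unit of content: a persistent ID plus its datastreams."""
    pid: str
    datastreams: Dict[str, Datastream] = field(default_factory=dict)

    def add(self, ds: Datastream) -> None:
        self.datastreams[ds.dsid] = ds

    def reserved(self) -> List[Datastream]:
        return [d for d in self.datastreams.values() if d.dsid in RESERVED_IDS]

    def custom(self) -> List[Datastream]:
        return [d for d in self.datastreams.values() if d.dsid not in RESERVED_IDS]


# Assumed example object: two reserved datastreams and one custom one.
obj = DataObject(pid="demo:1")
obj.add(Datastream("DC", "text/xml", b"<dc/>"))
obj.add(Datastream("RELS-EXT", "application/rdf+xml"))
obj.add(Datastream("IMAGE", "image/tiff"))
```

Keeping reserved and custom datastreams in one mapping, distinguished only by ID, mirrors the slide's point that any type and any number of custom components can sit alongside the fixed ones.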

  14. Files are stored on disk and managed directly • Versioning is necessary • Checksums for each file provide assurance that the file has not changed • Can be managed by the repository or as remote files
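
A minimal sketch of the checksum idea, assuming SHA-256 as the algorithm (the slide does not name one). The function names and the sample file are hypothetical; a real repository would store the recorded checksum as metadata alongside the file.

```python
import hashlib
import tempfile
from pathlib import Path


def file_checksum(path, algorithm="sha256", chunk_size=65536):
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unchanged(path, recorded):
    """Compare the file's current checksum with the one recorded earlier."""
    return file_checksum(path) == recorded


# Demonstration with a throwaway file (assumed content).
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "page.tif"
    sample.write_bytes(b"original bytes")
    recorded = file_checksum(sample)       # captured at ingest time
    same_before = is_unchanged(sample, recorded)
    sample.write_bytes(b"altered bytes")   # simulate corruption or an edit
    same_after = is_unchanged(sample, recorded)
```

After the rewrite the check fails, which is the signal a repository would use to restore the file from another copy or record a new version.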

  15. Virtual datastreams provide the access abstraction • Can be simply retrieving a stored component • Views of the content can be derived on demand, for different formats and resolutions • Other data productions can be derived on demand, e.g. tiles from a JPEG2000 file • By providing an abstract view of the content you break the dependence on the stored files
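
The access abstraction can be sketched as a lookup from a view name to a function that derives that view from the stored component. Everything here is an assumption made for illustration: the "image" is a small grid of integers, and the view names `full` and `thumbnail` are hypothetical.

```python
from typing import Callable, Dict, List

Image = List[List[int]]  # stand-in for stored pixel data


def full(image: Image) -> Image:
    """Simply retrieve the stored component."""
    return image


def thumbnail(image: Image) -> Image:
    """Derive a lower-resolution view by keeping every other row and column."""
    return [row[::2] for row in image[::2]]


# Callers ask for a view by name and never touch the stored file directly.
VIEWS: Dict[str, Callable[[Image], Image]] = {
    "full": full,
    "thumbnail": thumbnail,
}


def get_view(stored: Image, name: str) -> Image:
    return VIEWS[name](stored)


stored = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
small = get_view(stored, "thumbnail")
```

Because clients depend only on the view name, the stored format could be migrated later without changing how the content is requested.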

  16. Content Access Content Management

  17. Descriptive metadata is about the content of the resource • Indexed for searching • Also used for rendering user experiences • Some standards in use: • Dublin Core - general • MODS - bibliographic • VRA Core - cultural heritage • FGDC - GIS datasets • DDI - social science datasets

  18. Administrative metadata is more about the encoding and use • Metadata about the object generally, like checksums • Technical metadata about the specifics of the encoding of each format • Event metadata, about what happens to an object over its lifetime; audit trails • Policy metadata, like access restrictions and credit lines

  19. Relationships Among Objects • Describes adjacency relationships among objects, among units of content • Can be done by explicitly listing IDs in XML, using METS for example • or using RDF: PID – typeOfRelationship – relatedObjectPID • Can be used to assemble complex resources and aggregations of objects • Explicit and implicit aggregations
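
The PID – typeOfRelationship – relatedObjectPID pattern can be sketched with plain tuples, without any RDF library. The PIDs and relationship names below are illustrative assumptions, and `related` and `assemble` are hypothetical helpers rather than part of any repository API.

```python
# Each triple is (PID, typeOfRelationship, relatedObjectPID).
triples = [
    ("demo:page1", "isPartOf", "demo:book1"),
    ("demo:page2", "isPartOf", "demo:book1"),
    ("demo:book1", "isMemberOfCollection", "demo:texts"),
]


def related(triples, predicate, obj):
    """Return the PIDs that point at `obj` through `predicate`."""
    return [s for s, p, o in triples if p == predicate and o == obj]


def assemble(triples, root):
    """Collect every object that points at `root`, directly or indirectly."""
    found, frontier = [], [root]
    while frontier:
        current = frontier.pop(0)
        for subject, _predicate, obj in triples:
            if obj == current and subject not in found:
                found.append(subject)
                frontier.append(subject)
    return found


pages = related(triples, "isPartOf", "demo:book1")
collection = assemble(triples, "demo:texts")
```

Here the book is an explicit aggregation of its pages, while the collection's full membership is implicit: it is discovered by following relationships rather than by listing IDs.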

  20. Text Collections

  21. Establishing and Enforcing Policies • Policies must be established for the entire life-cycle of the information • Ownership and workflow policies • Access and use policies • Policies associated with sustaining (or not!) • Policies must be expressed for end users • Policies must also be expressed for machine access

  22. Indexing • In a repository there is no “catalog”; the repository is the catalog • Many indexes can be created for many reasons • Either metadata or full content, or both • Ontology-based indexes are rapidly becoming more feasible • Keeping indexes updated is the trick

  23. Indexing as a harvesting service • GSearch • Blacklight • Simple • JMS • OAI • services listen and consume events or other messages • Fedora Repository Service • Ingest • repository publishes events • More…
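
The publish-and-listen pattern on this slide can be sketched in-process. This is an assumption-heavy illustration: a plain list of callbacks stands in for the messaging layer, and the `Repository` and `SearchIndex` classes, the event dictionary, and the sample titles are all hypothetical rather than taken from Fedora or GSearch.

```python
class Repository:
    """Stores objects and publishes an event on every ingest."""

    def __init__(self):
        self.objects = {}
        self.listeners = []

    def subscribe(self, listener):
        self.listeners.append(listener)

    def ingest(self, pid, metadata):
        self.objects[pid] = metadata
        self._publish({"event": "ingest", "pid": pid})

    def _publish(self, event):
        for listener in self.listeners:
            listener(event)


class SearchIndex:
    """An indexing service that listens for events and updates itself."""

    def __init__(self, repository):
        self.repository = repository
        self.terms = {}

    def on_event(self, event):
        if event["event"] != "ingest":
            return
        title = self.repository.objects[event["pid"]]["title"]
        for word in title.lower().split():
            self.terms.setdefault(word, set()).add(event["pid"])

    def search(self, word):
        return sorted(self.terms.get(word.lower(), set()))


repo = Repository()
index = SearchIndex(repo)
repo.subscribe(index.on_event)
repo.ingest("demo:1", {"title": "Rotunda Drawings"})
repo.ingest("demo:2", {"title": "Rotunda Photographs"})
```

The repository knows nothing about the index beyond the callback, so further indexes could subscribe to the same events without any change to ingest.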
