US GPO AIP Independence Test CS 496A – Senior Design Team members: Antonio Castillo, Johnny Ng, Aram Weintraub, Tin-Shuk Wong Faculty advisor: Dr. Russ Abbott GPO contact: Kate Zwaard
Overview • Background • OAIS • FDsys • Project Objectives • AIP • METS, MODS, and PREMIS • Solution Strategy • XML parsing • A note on deliverables • Repositories • Testing • Conclusion
OAIS: Open Archival Information System • “An OAIS is an archive consisting of an organization of people and systems that has accepted the responsibility to preserve information and make it available for a Designated Community” • Developed by the Consultative Committee for Space Data Systems (CCSDS) and published as ISO 14721:2003
FDsys: Federal Digital System • FDsys – An OAIS maintained by the U.S. Government Printing Office to provide public access to information submitted by Congress and Federal agencies.
OAIS Primary Functions • Ingest – Turns SIPs into AIPs • Archival Storage – Stores and retrieves AIPs • Data Management – Populates, maintains, and provides access to the various kinds of information • Administration – Controls day-to-day operations • Preservation Planning – Keeps the archive accessible over time • Access – Provides functions for accessing the archive
Information Package: a critical component of OAIS • The information package is a conceptual linking of content information with its preservation description and packaging information. • Three kinds of information packages • SIP – Submission Information Package • AIP – Archival Information Package • DIP – Dissemination Information Package
AIP • Archival Information Package • Defines how a digital object and its associated metadata are packaged using XML-based files. • METS (binding file) • MODS • PREMIS
Project Objective: Prove AIP Independence • An AIP is independent if, in the event of catastrophic and irretrievable loss or damage of the content management system, a knowledgeable user can still make sense of the data.
Project Objectives • This project simulates FDsys breaking down due to some catastrophic attack or error. • We are attempting to categorize and reconstruct a set of sample data from FDsys outside the context of the actual CMS. • The only references we have available, other than the actual files in the archive, are publicly defined standards. • It is our hope that this project will help GPO improve the robustness of its archive.
AIP: METS • Schema • XML file format • Seven major sections
AIP: METS Schema • 7 Major Sections • 1) METS Header • 2) Descriptive Metadata • 3) Administrative Metadata • 4) File Section • 5) Structural Map • 6) Structural Links • 7) Behavior
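One way to see which of these sections a given METS file actually contains is to list the child elements of its root. The following is a minimal sketch, assuming the file uses the standard METS namespace and the schema's element names (metsHdr, dmdSec, amdSec, fileSec, structMap, and so on). The class name, method name, and sample document are hypothetical and are not taken from FDsys.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class MetsSections {
    // Namespace URI of the METS schema.
    static final String METS_NS = "http://www.loc.gov/METS/";

    /** Returns the local names of the METS elements directly under the root, in document order. */
    static List<String> topLevelSections(String metsXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed so getLocalName() and getNamespaceURI() work
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(metsXml)));

            List<String> names = new ArrayList<>();
            NodeList children = doc.getDocumentElement().getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                // Skip text nodes, comments, and elements from other namespaces.
                if (child.getNodeType() == Node.ELEMENT_NODE
                        && METS_NS.equals(child.getNamespaceURI())) {
                    names.add(child.getLocalName());
                }
            }
            return names;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse METS document", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical, heavily trimmed METS document used only for illustration.
        String sample = "<mets:mets xmlns:mets=\"http://www.loc.gov/METS/\">"
                + "<mets:metsHdr/><mets:dmdSec ID=\"DMD1\"/><mets:amdSec/>"
                + "<mets:fileSec/><mets:structMap/></mets:mets>";
        System.out.println(topLevelSections(sample));
    }
}
```

A real AIP would be read from disk rather than from a string. The string input just keeps the sketch self-contained.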
AIP: MODS • Descriptive metadata • Extension to METS • Top-level elements • Mandatory • Recommended • Optional
AIP: PREMIS • Preservation metadata • Extension to METS • PREMIS Data Model • Intellectual Entity • Object Entity • Event Entity • Agent Entity • Rights Entity*
Solution Strategy • The data submitted to us are AIPs, not SIPs. Repository software cannot ingest AIPs, only SIPs. We must write scripts that parse the AIPs and construct SIPs from their arbitrary file structure, then ingest those SIPs with repository software to create new AIPs.
XML Parsing • As described above, all metadata is in the form of XML files. Hence, using code to read XML files is integral to the project. • We plan to use the Java programming language for our scripting needs. • Java API for XML Processing (JAXP): the standard Java library for handling XML • It offers several ways to represent XML, including in-memory DOM trees and streaming SAX events
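As an illustration of one JAXP representation, the sketch below uses the streaming SAX parser to collect the file locations recorded in a METS file section. It assumes the METS files reference content through FLocat elements carrying an xlink:href attribute, as the METS schema allows. Whether the FDsys AIPs do so is an assumption, and the class name, method name, and sample paths are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class MetsFileLocations {
    static final String METS_NS = "http://www.loc.gov/METS/";
    static final String XLINK_NS = "http://www.w3.org/1999/xlink";

    /** Returns every xlink:href found on a METS FLocat element, in document order. */
    static List<String> fileLocations(String metsXml) {
        List<String> hrefs = new ArrayList<>();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // report namespace URIs and local names
            SAXParser parser = factory.newSAXParser();
            parser.parse(new InputSource(new StringReader(metsXml)), new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes attrs) {
                    // SAX calls this once per opening tag; no tree is built in memory.
                    if (METS_NS.equals(uri) && "FLocat".equals(localName)) {
                        String href = attrs.getValue(XLINK_NS, "href");
                        if (href != null) {
                            hrefs.add(href);
                        }
                    }
                }
            });
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse METS document", e);
        }
        return hrefs;
    }

    public static void main(String[] args) {
        // Hypothetical file section; the paths are made up for the example.
        String sample = "<mets:mets xmlns:mets=\"http://www.loc.gov/METS/\""
                + " xmlns:xlink=\"http://www.w3.org/1999/xlink\">"
                + "<mets:fileSec><mets:fileGrp>"
                + "<mets:file ID=\"F1\"><mets:FLocat LOCTYPE=\"URL\" xlink:href=\"content/a.pdf\"/></mets:file>"
                + "<mets:file ID=\"F2\"><mets:FLocat LOCTYPE=\"URL\" xlink:href=\"content/b.xml\"/></mets:file>"
                + "</mets:fileGrp></mets:fileSec></mets:mets>";
        System.out.println(fileLocations(sample));
    }
}
```

SAX suits large files because nothing is held in memory beyond the collected values. A DOM tree is more convenient when the script needs to move back and forth between sections of the same document.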
A Note on Deliverables • This is not a typical computer science design project because our aim is not to design software. Instead, we will be conducting scripted tests on real data and forming conclusions based on the results. • Deliverables will most likely include: • a written report of our findings and recommendations • a reorganized version of the input data
Testing • After parsing and organizing the data, it will be important to perform checks to ensure that the reconstruction is accurate. • We may send a preliminary report to GPO for verification. • The exact testing procedure is still undefined, as we haven’t had a chance to investigate the data in depth yet. • Our goals should be clearer once we understand exactly what type of data we are dealing with.
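Since the testing procedure is still undefined, the following is only a sketch of one check that might apply: comparing a file's computed digest with a checksum recorded in the metadata. METS file elements can carry CHECKSUM and CHECKSUMTYPE attributes, and PREMIS can record fixity information, but whether the sample data includes either is an assumption. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class ChecksumCheck {
    /** Computes the digest of the content and returns it as lowercase hex. */
    static String hexDigest(byte[] content, String algorithm) {
        try {
            MessageDigest digest = MessageDigest.getInstance(algorithm);
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest(content)) {
                hex.append(String.format("%02x", b)); // two hex digits per byte
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalArgumentException("Unsupported checksum type: " + algorithm, e);
        }
    }

    /** True if the content's digest equals the recorded value, ignoring case and surrounding whitespace. */
    static boolean matches(byte[] content, String algorithm, String recordedHex) {
        return hexDigest(content, algorithm).equalsIgnoreCase(recordedHex.trim());
    }

    public static void main(String[] args) {
        // Stand-in for a reconstructed file's bytes and the checksum read from its metadata.
        byte[] content = "sample content".getBytes(StandardCharsets.UTF_8);
        String recorded = hexDigest(content, "SHA-256");

        System.out.println(matches(content, "SHA-256", recorded));
        System.out.println(matches("altered content".getBytes(StandardCharsets.UTF_8),
                "SHA-256", recorded));
    }
}
```

In practice the algorithm name would come from the metadata (for example MD5 or SHA-1) and the bytes from the reconstructed file on disk.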
Repositories • Third-party repository software will ingest the SIPs we create. • DSpace, Fedora Commons (DuraSpace) • Based on simple technologies • Java • MySQL • Apache Tomcat (Java servlet container)
Conclusion • Our thanks to Kate, Dr. Abbott, and Dr. Pamula for their support.