US GPO AIP Independence Test CS 496A – Senior Design Team members: Antonio Castillo, Johnny Ng, Aram Weintraub, Tin-Shuk Wong Faculty advisor: Dr. Russ Abbott GPO contact: Kate Zwaard
Overview • Background • OAIS • FDsys • AIP • METS, MODS, and PREMIS • Project Objectives • Solution Strategy • XML parsing • A note on deliverables • Repositories • Testing • Conclusion
OAIS: Open Archival Information System • “An OAIS is an archive consisting of an organization of people and systems that has accepted the responsibility to preserve information and make it available for a Designated Community” • Developed by the Consultative Committee for Space Data Systems (CCSDS) and published as ISO 14721:2003
FDsys: Federal Digital System • FDsys – An OAIS maintained by the U.S. Government Printing Office to provide public access to information submitted by Congress and Federal agencies.
OAIS Primary Functions • Ingest – Turn SIPs into AIPs • Archival Storage – Storage and retrieval of AIPs • Data Management – Populating, maintaining, and accessing the varieties of information • Administration – Controls day-to-day operations • Preservation Planning – Maintaining archive accessibility • Access – Functions for access to the archive
Information Package – critical component of OAIS • The information package is a conceptual linking of content information with its preservation description and packaging information. • Three kinds of information packages (submitted for ingest, held in archival storage, and delivered to users) • SIP – Submission Information Package • AIP – Archival Information Package • DIP – Dissemination Information Package
AIP • Archival Information Package • What is an AIP? • METS • MODS • PREMIS
Project Objectives • Prove AIP independence • Improve GPO's file system
AIP: METS • Understanding METS • Schema • File format • Seven major sections
AIP: METS Schema • Seven major sections • METS Header • Descriptive Metadata • Administrative Metadata • File Section • Structural Map • Structural Links • Behavior
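As a rough illustration of how these sections appear in a file, the sketch below reports which METS sections a document contains. It assumes the standard METS namespace `http://www.loc.gov/METS/` and the schema's element names (`metsHdr`, `dmdSec`, `amdSec`, `fileSec`, `structMap`, plus `structLink` and `behaviorSec` for the two remaining sections). The class name `MetsSections` and the tiny sample document are hypothetical, and the sample is well-formed but not schema-valid.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class MetsSections {
    // Namespace URI of the METS schema.
    static final String METS_NS = "http://www.loc.gov/METS/";

    // Element names of the major METS sections, in schema order.
    static final String[] SECTION_ELEMENTS = {
        "metsHdr", "dmdSec", "amdSec", "fileSec",
        "structMap", "structLink", "behaviorSec"
    };

    // Returns the names of the sections that occur at least once in the document.
    static List<String> presentSections(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for namespace-based lookups
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            List<String> present = new ArrayList<>();
            for (String name : SECTION_ELEMENTS) {
                if (doc.getElementsByTagNameNS(METS_NS, name).getLength() > 0) {
                    present.add(name);
                }
            }
            return present;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse METS document", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical, heavily trimmed sample; a real FDsys METS file is far larger.
        String sample = "<mets xmlns=\"http://www.loc.gov/METS/\">"
            + "<metsHdr/><dmdSec ID=\"dmd1\"/><fileSec/><structMap><div/></structMap>"
            + "</mets>";
        System.out.println(presentSections(sample));
        // prints [metsHdr, dmdSec, fileSec, structMap]
    }
}
```

A check like this only confirms which sections are present; it does not validate the file against the METS schema.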
AIP: MODS • Descriptive metadata • Extension to METS • Top-level elements • Mandatory • Recommended • Optional
AIP: PREMIS • Preservation metadata • Extension to METS • PREMIS Data Model • Intellectual Entity • Object Entity • Event Entity • Agent Entity • Rights Entity*
Solution Strategy • The data we have received are AIPs, not SIPs, and repository software can only ingest SIPs. We must therefore write scripts that parse the AIPs and construct SIPs from their arbitrary file structure. We will then ingest those SIPs into repository software, which will create new AIPs for the same information.
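A first step toward building a SIP from an arbitrary file structure is simply taking inventory of what an AIP directory contains. The sketch below is one possible way to do that with the standard `java.nio.file` API. The class name `AipInventory` and the idea of using a sorted list of relative paths as a starting manifest are our own assumptions, not part of any GPO or repository specification.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class AipInventory {
    // Returns the paths of all regular files under root, relative to root,
    // using '/' as the separator and sorted alphabetically.
    static List<String> listFiles(Path root) {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                .filter(Files::isRegularFile)
                .map(p -> root.relativize(p).toString().replace('\\', '/'))
                .sorted()
                .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Usage: java AipInventory <path-to-AIP-directory>
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        listFiles(root).forEach(System.out::println);
    }
}
```

The resulting list could then be compared against the files referenced in the METS file section before any SIP is assembled.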
XML Parsing • We plan to use the Java programming language for our scripting needs. • The Java API for XML Processing (JAXP) is the standard Java library for parsing XML. • It provides several ways to work with XML, including an in-memory tree (DOM) and event-based and streaming parsers (SAX and StAX). • After being rendered human-readable, the AIP files will need to be converted into a new SIP schema of our own design, which would only describe information that still appears relevant.
XML Parsing Example • This is a portion of a sample FDsys MODS file that summarizes a bill in Congress: • <extension><collectionCode>BILLS</collectionCode><searchTitle>To increase Federal Pell Grants for the children of fallen public safety officers, and for other purposes.;Officer Daniel Faulkner Children of Fallen Heroes Scholarship Act of 2010;S. 3880 (IS)</searchTitle><category>Bills and Statutes</category><waisDatabaseName>111_cong_bills</waisDatabaseName><branch>legislative</branch><dateIngested>2010-10-06</dateIngested></extension>
XML Parsing Example • We might expect this type of output once properly parsed: • <extension> Collection code: “BILLS” Search title: “To increase Federal Pell Grants for the children of fallen public safety officers, and for other purposes.;Officer Daniel Faulkner Children of Fallen Heroes Scholarship Act of 2010;S. 3880 (IS)” Category: “Bills and Statutes” WAIS database name: “111_cong_bills” Branch: “legislative” Date ingested: “2010-10-06” </extension>
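The parsing step in this example can be sketched with JAXP's DOM parser. This is a minimal illustration, not our final script: the class name `ExtensionParser` is hypothetical, the sample omits the long `searchTitle` element for brevity, and it assumes the fragment has no XML namespaces, which may not hold for complete FDsys MODS files.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ExtensionParser {
    // Parses an XML fragment and returns each child element of the root
    // as an element-name -> text pair, in document order.
    static Map<String, String> parseChildren(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            Map<String, String> fields = new LinkedHashMap<>();
            NodeList children = doc.getDocumentElement().getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                // Skip whitespace and other non-element nodes.
                if (child.getNodeType() == Node.ELEMENT_NODE) {
                    fields.put(child.getNodeName(), child.getTextContent().trim());
                }
            }
            return fields;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<extension>"
            + "<collectionCode>BILLS</collectionCode>"
            + "<category>Bills and Statutes</category>"
            + "<waisDatabaseName>111_cong_bills</waisDatabaseName>"
            + "<branch>legislative</branch>"
            + "<dateIngested>2010-10-06</dateIngested>"
            + "</extension>";
        // Prints one line per element, e.g. collectionCode: "BILLS"
        parseChildren(xml).forEach(
            (name, value) -> System.out.println(name + ": \"" + value + "\""));
    }
}
```

Mapping raw element names such as `waisDatabaseName` to readable labels such as "WAIS database name" would need a separate lookup table, which is left out here.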
A Note on Deliverables • Because our aim is not to design software, this is not a typical computer science design project. Instead, we are conducting coded experimental tests on real data and forming conclusions based on the results. • Deliverables will most likely include: • a written report of our findings and recommendations • a reorganized version of the input data
Testing • After parsing and organizing the data, it will be important to perform checks to ensure that the reconstruction is accurate. • We may send a preliminary report to GPO for verification. • The exact testing procedure is still undefined, as we haven’t had a chance to investigate the data in depth yet. • Our goals should be clearer once we understand exactly what type of data we are dealing with.
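Although the testing procedure is not yet defined, one check we could plausibly run is a fixity comparison: confirming that a content file has the same checksum before and after reconstruction. The sketch below uses SHA-256 from the standard `java.security.MessageDigest` class. The class name `FixityCheck` and the choice of SHA-256 are assumptions for illustration; the checksum algorithm actually recorded in the FDsys metadata may differ.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class FixityCheck {
    // Returns the SHA-256 digest of the data as a lowercase hex string.
    static String sha256Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(data)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    // True if both files have identical SHA-256 digests.
    static boolean sameContent(Path original, Path reconstructed) {
        try {
            return sha256Hex(Files.readAllBytes(original))
                .equals(sha256Hex(Files.readAllBytes(reconstructed)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        if (args.length == 2) {
            // Usage: java FixityCheck <original-file> <reconstructed-file>
            System.out.println(sameContent(Paths.get(args[0]), Paths.get(args[1])));
        } else {
            System.out.println(sha256Hex("abc".getBytes(StandardCharsets.UTF_8)));
        }
    }
}
```

Reading whole files into memory keeps the sketch short; very large files would call for a streaming digest instead.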
Repositories • Third-party repository software will ingest the SIPs we create. • DSpace, Fedora Commons (DuraSpace) • Based on a few simple technologies: • Java • MySQL • Apache Tomcat (Java servlet container)
Conclusion • Our thanks to Kate, Dr. Abbott, and Dr. Pamula for their support.