
Repository Development Center Office of Strategic Initiatives

Releasing Open Source at the Library of Congress. Repository Development Center, Office of Strategic Initiatives. Leslie Johnston, 2009 LITA Forum.


Presentation Transcript


  1. Releasing Open Source at the Library of Congress Repository Development Center Office of Strategic Initiatives Leslie Johnston 2009 LITA Forum

  2. STARTING DOWN A PATH TOWARDS BETTER CONTROL • What are our most basic needs? What is the first step? • How do we know what we have, where it is, and who it belongs to? • How do we get files – new and legacy – from where they are to where they need to be?

  3. IDENTIFYING THE TRANSFER PROBLEM SPACE • As part of its first phase repository development, the Library of Congress is working on solutions for a category of activities that we refer to as “Transfer.” At a high level, we define transfer as including the following human- and machine-performed tasks: • Adding digital content to the collections, whether from an external partner or created at LC; • Moving digital content between storage systems (external and internal); • Review of digital files for fixity, quality and/or authoritativeness; and • Inventorying and recording transfer life cycle events for digital files.

  4. RECENT TRANSFER EXPERIENCE During 2008 the Library of Congress received: • 30 TB from NDIIPP preservation partners, 20 TB in Web Capture crawls to preserve identified web sites, 30 TB from National Digital Newspaper Program (NDNP) partners, and 1 TB from World Digital Library partners. • From 20 MB to over 2 TB in a single transfer retrieved over the network. • Dozens of hard drives with licensed, partner and vendor supplied content. • All forms of content, some to be dark archived for preservation, some limited to Library use, and some to be made publicly available. • There is also newly internally digitized content that has to be managed.

  5. DEVELOP A STANDARD AND TOOLS TO OPTIMIZE TRANSFERS • BagIt: A Packaging Specification for File Transfers • A packaging specification for file transfers. Supports minimally self-identifying and self-describing packages with support for error detection and transfer optimization. • http://www.digitalpreservation.gov/library/resources/tools/docs/bagitspec.pdf • Motivating use cases: • Transfer of content internally and between preservation partners. • Long-term storage of content. • Needs: • Minimally self-identifying and self-describing packages. • Support for error detection and transfer optimization. • Characteristics: • Low overhead • Content-type agnostic • Supported by off-the-shelf, easily supported tools.

  6. WHAT’S IN A BAG? • /data directory with contents • Package description: bag-info.txt • Manifest of contents with checksums
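
The three parts named on this slide can be sketched in a few lines of Python. This is an illustrative sketch only, not the Library's tooling: the function name `make_bag`, the choice of MD5, and the version string written to `bagit.txt` are assumptions made for the example.

```python
import hashlib
import shutil
from pathlib import Path


def make_bag(source_dir: Path, bag_dir: Path, info: dict[str, str]) -> None:
    """Copy source_dir into bag_dir/data and write the accompanying tag files."""
    data_dir = bag_dir / "data"
    shutil.copytree(source_dir, data_dir)  # also creates bag_dir if needed

    # Bag declaration; the version string is an assumption for illustration.
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 0.96\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )

    # Package description: one "Label: value" pair per line.
    (bag_dir / "bag-info.txt").write_text(
        "".join(f"{label}: {value}\n" for label, value in info.items()),
        encoding="utf-8",
    )

    # Manifest: one "<checksum>  <path relative to the bag>" line per payload file.
    lines = []
    for path in sorted(p for p in data_dir.rglob("*") if p.is_file()):
        digest = hashlib.md5(path.read_bytes()).hexdigest()
        lines.append(f"{digest}  {path.relative_to(bag_dir).as_posix()}\n")
    (bag_dir / "manifest-md5.txt").write_text("".join(lines), encoding="utf-8")
```

Reading each file fully into memory keeps the sketch short; a tool handling multi-terabyte transfers would hash in chunks instead.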

  7. TRANSFER TOOL DEVELOPMENT To promote the use of BagIt in the Library and outside, tools were required to make the specification easy to use. • Parallel Retriever script • Efficient package transfer • Validation script • Validates Bags against the BagIt specification • VerifyIt script • Verifies that files are uncorrupted • BagIt Java Library (BIL) • Used for application and command line tool development • Bagger Desktop application • Graphical desktop tool to create/update/validate Bags • LocDrop Web application • Supports partner registration of transfers, whether shipping a hard drive or sending over the network. • Inventory System • Record lifecycle events for packages of Bags and files. • Workflow Tools
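
The idea behind verifying that files are uncorrupted can be illustrated with a short Python sketch that recomputes each checksum listed in a manifest and reports mismatches. It is not the VerifyIt script itself; the function name `verify_manifest` and the default algorithm are assumptions for the example.

```python
import hashlib
from pathlib import Path


def verify_manifest(bag_dir: Path, algorithm: str = "md5") -> list[str]:
    """Return the manifest paths that are missing or whose checksum differs."""
    problems = []
    manifest = bag_dir / f"manifest-{algorithm}.txt"
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # ignore blank lines
        # Each line is "<checksum> <whitespace> <path relative to the bag>".
        expected, rel_path = line.split(maxsplit=1)
        target = bag_dir / rel_path
        if not target.is_file():
            problems.append(rel_path)
            continue
        actual = hashlib.new(algorithm, target.read_bytes()).hexdigest()
        if actual != expected.lower():
            problems.append(rel_path)
    return problems
```

An empty list means every file named in the manifest is present and matches its recorded checksum.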

  8. TRANSFER TOOL DEVELOPMENT: BAGGER Bagger Graphical Bag Authoring Tool • Allows users to create generic Bags or Bags that meet specified project profiles. • Provides project-specific templates that enforce project Bag descriptive metadata requirements. • Built on top of the BagIt Java Library. • Presents a range of options for compressed serialization and complete versus “holey” bags. • Java Web Start version automatically checks for the most recent version to keep the tool updated. • Standalone version is bundled with all necessary software and runs without requiring installation privileges. • Runs on a PC or Mac.

  9. USING BAGGER • Add files to the /data directory • Create and select a profile • Enter bag-info metadata

  10. USING BAGGER Completed bag with generated manifest

  11. LOCDROP TOOL DEVELOPMENT LocDrop is designed to support notification for transfers of content into the Library of Congress both from outside the Library and within the Library itself. The application currently lets you register network and physical media transfers (hard drives, CDs, DVDs, etc.) that the Library will retrieve. In later versions we expect to add the ability to launch network transfers directly. LocDrop will simplify the processes to track content we expect to receive. Over time, we expect to connect this application to related services that will continually improve how we manage the transfer and receipt of materials from all sources.

  12. USING LOCDROP Register the information needed to track data shipments to and from the Library

  13. USING LOCDROP Register the information needed for the Library to retrieve network transfers

  14. INVENTORY TOOL DEVELOPMENT Record Package Events Examples of Package Events include “Package Received Events,” which are recorded when a project receives a package; and “Package Accepted Events,” which are recorded when a project accepts curatorial responsibility for a package. Record File Events Examples of File Events include “File Copy Events,” which are recorded when a package is copied from one File Location to another; and “Quality Review Events,” which are recorded when quality review is performed. For legacy collections the Inventory Tool can be pointed at existing file systems and directories to package, checksum, and record life cycle events to bring the files under initial control. The Inventory Tool is implemented on top of our BIL Java Library.
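
One way to picture the life cycle records described on this slide is as an append-only list of events keyed by package. The Python sketch below is hypothetical: the class names `Event` and `Inventory`, the field names, and the event kind strings are invented for illustration and do not reflect the Inventory Tool's actual data model.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Event:
    kind: str        # e.g. "package_received", "file_copy", "quality_review"
    package_id: str
    detail: str = ""
    when: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


@dataclass
class Inventory:
    events: list[Event] = field(default_factory=list)

    def record(self, kind: str, package_id: str, detail: str = "") -> Event:
        """Append a new event; existing events are never modified."""
        event = Event(kind, package_id, detail)
        self.events.append(event)
        return event

    def history(self, package_id: str) -> list[Event]:
        """Return all events for one package, in the order they were recorded."""
        return [e for e in self.events if e.package_id == package_id]
```

For example, recording `package_received`, then `file_copy`, then `package_accepted` for the same package yields a three-entry history that can be audited later.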

  15. USING THE INVENTORY TOOL Running an Inventory operation

  16. USING THE INVENTORY TOOL Searching the Inventory, plus auditing, file count, space usage, and project-specific Inventory reports

  17. WORKFLOW DEVELOPMENT The Transfer components and Inventory Tool are tied together through multiple project-based Workflow systems. Through case study development with stakeholders we identify the data flow and tasks to be performed. Workflow tasks formalized through the system include transfer, validation by a format validation application, manual quality review inspection, and file copying to archival storage and production storage. A workflow UI allows users to initiate, monitor and administer processes; and notify the workflow engine of the outcome of manual tasks, including task completion.
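
The task sequence named on this slide, and the notion of notifying the engine about a task's outcome, can be sketched as a small state holder. This is a hypothetical Python illustration: the `Workflow` class, the `TASKS` names, and the status strings are assumptions, not the Library's workflow engine.

```python
from dataclasses import dataclass, field
from typing import Optional

# Task order taken from the slide; the identifiers themselves are invented.
TASKS = [
    "transfer",
    "format_validation",
    "manual_quality_review",
    "copy_to_archival_storage",
    "copy_to_production_storage",
]


@dataclass
class Workflow:
    package_id: str
    tasks: list[str] = field(default_factory=lambda: list(TASKS))
    position: int = 0
    failed: bool = False

    @property
    def current_task(self) -> Optional[str]:
        """The task awaiting an outcome, or None once the workflow has ended."""
        if self.failed or self.position >= len(self.tasks):
            return None
        return self.tasks[self.position]

    @property
    def status(self) -> str:
        if self.failed:
            return "failed"
        return "completed" if self.current_task is None else "in progress"

    def notify(self, task: str, succeeded: bool) -> None:
        """Report the outcome of the current task (manual or automated)."""
        if task != self.current_task:
            raise ValueError(f"{task!r} is not the current task")
        if succeeded:
            self.position += 1
        else:
            self.failed = True
```

A reviewer finishing a manual inspection would call `notify("manual_quality_review", succeeded=True)`, after which the workflow advances to the copy steps.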

  18. RUNNING A WORKFLOW Starting, searching, and monitoring workflows

  19. RUNNING A WORKFLOW Updating an in-progress workflow

  20. INITIATING THE OPEN SOURCE RELEASE • It was decided that the three utility scripts – the key tools needed for the movement and validation of Bagged content – should be the first candidates for open source release. • The scripts were submitted to the Office of General Counsel at the Library for review. This review included close scrutiny by the attorneys in the office for everything from purpose (automating a process) to originality (determining that no code came from any other licensed sources) to authorship (Library staff versus Library contractors). • Due to some contractual obligations with a contracting company which prohibited straightforward public domain release, the three scripts were released on SourceForge in December 2008 under a BSD license. http://sourceforge.net/projects/loc-xferutils/

  21. CONTINUING THE OPEN SOURCE RELEASE • The next vital release had to be BIL (the BagIt Library), a Java library developed to support Bag services. • A barrier to uptake of the BagIt specification was the lack of a way to automate the Bagging process and to support the development of tools. BIL supports key functionality such as creating, manipulating, validating, and verifying Bags, as well as the uploading of Bags using the SWORD deposit protocol. • The review of BIL for open source release by the Office of General Counsel was a more complex affair. There was a single author who was a Library staff member, but there were thirteen bundled dependencies, each with its own license to be reviewed. • BIL was released into the public domain with the understanding that those licenses restricted any bundling of BIL and its dependencies into new tools by others, but in no way restricted the release. • BIL was released as both compiled and source code in June 2009.

  22. MANAGING THE RELEASE • At the time of both releases the Library made a conscious decision to just release the code, and not take advantage of the SourceForge functionality that supports the committing of code back into the project. • These were three relatively simple scripts and it seemed to make the most sense to release them and let others work with them or use them to model their own development. • No one was available at the time who could devote the effort needed to manage a full-blown open source project. • The scripts can be updated by anyone in the community for their use. The Library has committed to releasing its updates to BIL. Updates to the source code are expected and welcome through the Digital Curation group.

  23. UPCOMING RELEASES • The Bagger application is nearing the completion of its development and partner testing. Bagger is meant to provide a graphical desktop tool for the Bagging of content, ideally requiring no client-side IT support or infrastructure. • It is implemented as a Java Web Start application for use across platforms as well as a standalone version with its own bundled, stripped-down JRE, and supports the aggregation of files into Bag packages, including the creation of checksum manifests and Bag information files. It is developed on top of BIL. • The Bagger review includes the proposed release of three variants – the Java Web Start version, and standalone versions for the PC and Mac – as well as the source code. • The review encompasses a number of bundled dependencies, including the redistribution license for Java.

  24. BUILDING A COMMUNITY • The BagIt specification was posted on the Library of Congress and California Digital Library sites and as an Internet “Request for Comment” (RFC). • The BagIt specification will also be released on SourceForge to promote wider dissemination, discussion, and community building. • BagIt and the tools have been promoted to partners from three different initiatives, blogged, tweeted, shared on Facebook, presented at conferences, described in the Library’s Digital Preservation Newsletter, described in email sent to listservs, discussed in a Google group, and written up in journal articles. • The team launched a Digital Curation Google group in part to support the activities of this increasingly participatory community and encourage open, public discussion. http://groups.google.com/group/digital-curation • The most effective strategy for building a community was use of the specification by the NDIIPP partner institutions. NDIIPP strongly encouraged partners to “bag” their content for their preservation transfers to the Library.

  25. BUILDING A COMMUNITY • The Library moved into new modes of promotion and community building, including development of an introductory video featuring Brian Vargas, one of the authors of the specification. http://www.digitalpreservation.gov/videos/bagit0609.html

  26. SUCCESSES FOR THE RELEASE • How is the success of this initiative measured? • There have been close to 300 downloads from the SourceForge site. • The Google group has over 120 participants. • A significant percentage of the 130 NDIIPP partners have utilized the BagIt specification in their preservation transfers to the Library. • The Library recently became aware of the open source Ruby BagIt, a Ruby Gem released in early 2009 to support use of the specification. • http://rubyforge.org/projects/bagit/

  27. OUTCOMES FOR THE LIBRARY • The Library's first Open Source software release. • http://sourceforge.net/projects/loc-xferutils/ • BagIt is in use with multiple NDIIPP partners, in the eDeposit pilot project, and for the packaging and transport of file packages internally. • Gradual development of graphical workflow tools for all active projects • The transfer of partner content has informed the Library’s own preservation efforts, building our understanding about what we need to know about files and what events in their life cycle we need to record and track. • The Inventory Tool will support the Library's initial efforts in a file-level preservation audit. • Put all tools and services into full production during 2009

  28. Questions? Leslie Johnston lesliej@loc.gov
