Interoperation and Infrastructure for Digital Archiving: the LuKII Project, by Michael Seadle & Peter Schirmbacher (Humboldt-Universität zu Berlin) and Reinhard Altenhöner & Tobias Steinke (Deutsche Nationalbibliothek). Berlin School of Library and Information Science.
Introduction
In June 2007, a DFG-sponsored workshop on digital archiving took place in Berlin. Interoperability between LOCKSS (Lots of Copies Keep Stuff Safe) and KOPAL (Co-operative Development of a Long-Term Digital Information Archive) was one of the most discussed ideas to emerge from that workshop.
Michael Seadle, Berlin School of Library & Information Science, Humboldt-Universität zu Berlin
Scholarly infrastructure
Today's scholarly infrastructure depends heavily on digital materials. In some fields, particularly in the natural sciences, digital publication is taken for granted. More publishers are launching new journals only in digital formats, and open-access publications are almost exclusively digital.
Repositories
Repositories offer ways to collect and give access to digital information. They lack the infrastructure either to check integrity with a statistically significant likelihood of finding and repairing problems, or to address usability problems through regular migration.
Open Access
Germany has played a leading part internationally in the open access movement. As a result, its institutional repositories contain a wealth of research works.
Cost Effectiveness
Cost-effectiveness is key because long-term digital archiving is expensive. Universities and their libraries have grown accustomed to paying the costs of retaining paper works, including their housing, handling and repair after heavy use. Those costs will not go away any time soon, which means that the cost of digital preservation comes in addition to, not instead of, existing costs.
LuKII goals
The first goal of this project is to establish interoperability between KOPAL (from Germany) and LOCKSS (from the US) in order to marry German goals for migration and usability with cost-effective bitstream preservation. The second goal is to test the prototype interoperable system by harvesting a wide variety of data from German OPUS and eDoc institutional repositories.
LOCKSS
LOCKSS (Lots of Copies Keep Stuff Safe), from Stanford University, is arguably the earliest digital preservation and dissemination system. It is known in particular for its robustness in maintaining the integrity of the digital object. LOCKSS has faced genuine attack scenarios, shifted platforms, and tested format migration network-wide.
Bitstream integrity
Bitstream integrity is broadly seen in the US as the sine qua non of long-term digital archiving. If the file is damaged, usability/readability and authenticity cease to be meaningful. LOCKSS is neutral toward usability/readability solutions and can function with more than one.
Archival Storage
The Archival Storage in LOCKSS uses seven separate nodes to check routinely on the integrity of an archived bitstream and to take action to replace a damaged copy. The updated version is copied to other LOCKSS boxes in the network, but the older version is also retained in case of future need.
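The integrity check described on this slide can be illustrated with a deliberately simplified sketch. It assumes a plain majority vote over SHA-256 digests; the real LOCKSS polling protocol is considerably more elaborate, and the function and node names here (`audit_and_repair`, `box1` ... `box7`) are hypothetical.

```python
import hashlib
from collections import Counter


def sha256(data: bytes) -> str:
    """Hex digest used to compare copies without trusting any single node."""
    return hashlib.sha256(data).hexdigest()


def audit_and_repair(copies: dict[str, bytes]) -> list[str]:
    """Repair every copy that disagrees with the majority digest.

    `copies` maps a (hypothetical) node name to the bytes that node holds.
    Returns the names of the nodes whose copy was replaced.
    """
    digests = {node: sha256(content) for node, content in copies.items()}
    winner, votes = Counter(digests.values()).most_common(1)[0]
    if votes <= len(copies) // 2:
        # Without a strict majority there is no trustworthy copy to repair from.
        raise ValueError("no majority agreement; manual inspection needed")
    good_node = next(node for node, d in digests.items() if d == winner)
    repaired = []
    for node, digest in digests.items():
        if digest != winner:
            copies[node] = copies[good_node]  # overwrite the damaged copy
            repaired.append(node)
    return repaired


# Seven nodes, one of which holds a silently damaged copy.
nodes = {f"box{i}": b"article body" for i in range(1, 8)}
nodes["box3"] = b"article b0dy"
print(audit_and_repair(nodes))  # prints ['box3']
```

With seven copies, up to three can be damaged at once and the majority still identifies the good bitstream, which is one reason a network of several independent boxes is more robust than a single backup.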
Context
Context plays an important role in LOCKSS. The URL of the original work is stored with the digital object. This not only allows the system to recognize and refer back to the original version of a digital document in order to check routinely for changes without requiring human intervention, but also lets the system know if the original for some reason ceases to be available online.
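A minimal sketch of this idea, assuming the archive keeps the source URL next to the content and re-fetches the original from time to time. The names `ArchivedObject` and `check_original` are hypothetical, and the fetch function is passed in so that the example runs without network access.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ArchivedObject:
    source_url: str  # URL of the original work, stored with the object
    content: bytes   # the archived bitstream


def check_original(obj: ArchivedObject,
                   fetch: Callable[[str], Optional[bytes]]) -> str:
    """Compare the archived copy with what the source URL serves now.

    `fetch` returns the current bytes, or None if the original is gone.
    """
    current = fetch(obj.source_url)
    if current is None:
        return "unavailable"
    if hashlib.sha256(current).digest() != hashlib.sha256(obj.content).digest():
        return "changed"
    return "unchanged"


# A dictionary stands in for the live web in this example.
live_site = {"http://repo.example.org/doc/1": b"version 2"}
doc1 = ArchivedObject("http://repo.example.org/doc/1", b"version 1")
doc2 = ArchivedObject("http://repo.example.org/doc/2", b"only copy")
print(check_original(doc1, live_site.get))  # prints changed
print(check_original(doc2, live_site.get))  # prints unavailable
```

Because the check needs only the stored URL and the stored bytes, it can run on a schedule without human intervention, which is the point the slide makes.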
Ingest
The current LOCKSS ingest process (its SIP or Submission Information Package in OAIS terms) uses a crawler that efficiently harvests all documents in a standard tree-structure website when it has permission from a “manifest” on the server being harvested. The manifest serves as a guarantee to publishers that the LOCKSS crawler only takes materials that they have explicitly authorized.
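The permission step can be sketched as follows. This is an illustration, not the LOCKSS crawler: the exact wording of the permission statement is an assumption, the helper names (`has_permission`, `crawlable`) are hypothetical, and "authorized" is reduced here to "on the same host as the manifest".

```python
from urllib.parse import urlparse

# Assumed wording of the statement a publisher places on the manifest page.
PERMISSION_STATEMENT = (
    "LOCKSS system has permission to collect, preserve, and serve "
    "this Archival Unit"
)


def has_permission(manifest_text: str) -> bool:
    """True if the manifest page contains the permission statement."""
    normalized = " ".join(manifest_text.split()).lower()
    return PERMISSION_STATEMENT.lower() in normalized


def crawlable(urls: list[str], manifest_url: str, manifest_text: str) -> list[str]:
    """Return only the URLs the crawler may harvest.

    Nothing is harvested without the statement, and nothing outside
    the host that published the manifest.
    """
    if not has_permission(manifest_text):
        return []
    host = urlparse(manifest_url).netloc
    return [u for u in urls if urlparse(u).netloc == host]


manifest = "<p>LOCKSS system has permission to collect,\n preserve, and serve this Archival Unit</p>"
candidates = [
    "http://repo.example.org/vol1/a.pdf",
    "http://other.example.com/b.pdf",
]
print(crawlable(candidates, "http://repo.example.org/manifest.html", manifest))
# prints ['http://repo.example.org/vol1/a.pdf']
```

Refusing to crawl by default, and harvesting only after the statement is found, is what lets the manifest act as a guarantee to the publisher.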
Cost-effectiveness
Cost-effectiveness has been an integral feature of LOCKSS design from the outset. LOCKSS keeps costs down by running on cheap and simple equipment. Because it is open source, libraries and other preservation-oriented institutions worldwide can use it without paying for permission. LOCKSS is used by 197 libraries and institutions in 19 countries.
LOCKSS Alliance
LOCKSS Alliance membership is not required for the use of an open source package like LOCKSS, though it is strongly encouraged as a way of sharing development and support costs.
Community
LOCKSS looks to a community of developers at member institutions of the LOCKSS Alliance to help keep it up to date. This community-based co-development on the Linux model is particularly cost-effective. Cost is obviously a factor for a commercial firm with profits to make.
KOPAL Background
The goal of the KOPAL project (2004–2007), funded by the Federal Ministry for Education and Research (Bundesministerium für Bildung und Forschung), was the cooperative development of a long-term digital information archive. The archival system is based on DIAS by IBM, which was originally developed for the Koninklijke Bibliotheek of the Netherlands (KB).
KOPAL
The German National Library and the Staats- und Universitätsbibliothek Göttingen (SUB Göttingen) use KOPAL, whose DIAS (Digital Archive Information System) core was developed by IBM for the National Library of the Netherlands. Additional open source software has enhanced the ingest procedures and has provided tools to enable preservation planning activities like systematic migration workflows.
KOPAL users
The DIAS system for the KOPAL solution is currently used by two clients, DNB and SUB Göttingen. Their data are stored and accessible independently of each other. The system is located at Göttingen, which is responsible for guaranteeing bitstream preservation.
Universal Object Format
The KOPAL system addresses the problem of obsolete file formats and rendering environments by supporting file format migration throughout its architecture. Every archival package is in an openly defined format called Universal Object Format, which describes a structure to record metadata for preservation together with the content files.
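The idea of recording preservation metadata together with the content files can be sketched as below. This is not the Universal Object Format schema: it assumes a METS-style wrapper (METS appears later in the project timeline and in the sources), and the function name `build_package`, the chosen attributes and the file names are illustrative only.

```python
import hashlib
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)


def build_package(label: str, files: dict[str, bytes]) -> ET.Element:
    """Describe content files, with checksums and sizes, in one XML wrapper."""
    root = ET.Element(f"{{{METS_NS}}}mets", {"LABEL": label})
    file_sec = ET.SubElement(root, f"{{{METS_NS}}}fileSec")
    group = ET.SubElement(file_sec, f"{{{METS_NS}}}fileGrp")
    for index, (name, data) in enumerate(sorted(files.items()), start=1):
        entry = ET.SubElement(group, f"{{{METS_NS}}}file", {
            "ID": f"FILE{index:03d}",
            "SIZE": str(len(data)),
            "CHECKSUMTYPE": "SHA-256",
            "CHECKSUM": hashlib.sha256(data).hexdigest(),
        })
        # Point from the metadata record to the content file in the package.
        ET.SubElement(entry, f"{{{METS_NS}}}FLocat", {
            "LOCTYPE": "URL",
            f"{{{XLINK_NS}}}href": name,
        })
    return root


package = build_package("Example thesis", {"thesis.pdf": b"%PDF-1.4 ..."})
print(ET.tostring(package, encoding="unicode"))
```

Keeping the technical metadata in the same package as the files means that a later migration workflow can decide what to convert, and verify what it read, without consulting an external database.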
koLibRI
The koLibRI Java software library was developed by the German National Library and SUB Göttingen within the KOPAL project to support the integration of DIAS in the local IT infrastructure of the clients. Its tasks are:
• Encapsulate the communication with DIAS
• Create archival objects conforming to the Universal Object Format
• Automatically generate technical metadata with the tool JHOVE
• Manage the ingest and the access to DIAS
• Manage the workflow to migrate file formats in archival objects based on given parameters and migration tools
KOPAL advantages
KOPAL gains several advantages in working with LOCKSS:
• LOCKSS's strength in preserving bitstream integrity
• LOCKSS's effective dissemination package
• The shared support and development structure of the LOCKSS Alliance
LOCKSS advantages
KOPAL's state-of-the-art presentation environment offers a solution for digital objects that are no longer usable. Since KOPAL's systematic migration-flow guarantees the long-term usability and accessibility of digital objects, it complements the functions of LOCKSS well.
1st Objective
The goal of this project is to make open access repositories in Germany, both discipline-specific and institutional, more robust over time. The first objective involves establishing a LOCKSS network in Germany and providing the technical support to maintain it without constant reference to the LOCKSS teams in Stanford or Edinburgh.
2nd Objective
Interoperability with KOPAL is the second objective. David Rosenthal (Stanford/LOCKSS) in private correspondence suggested the following three types of interoperability:
• Transfer interoperability
• Dissemination interoperability
• Audit interoperability
3rd Objective
The third objective is to test the interoperability prototype (the “LuKII prototype”) by harvesting digital contents from a selection of German institutional repositories from the OA-Netzwerk-Projekt.
3rd Objective
Among the key development issues for this third objective are:
• ingest automation
• cost-effective metadata creation
• format migration testing
An absolutely essential feature of long-term digital archiving systems is to free them as much as possible from the need for costly human intervention.
Current status
The project has the following rough timeline:
• March/April: hiring staff
• May: development of the LOCKSS network in Germany
• June: training for Berlin technical staff at Stanford
• July/August: programming for METS and query support at Stanford; programming for the SFTP crawler and for parsing and extracting METS metadata at Berlin
• September: koLibRI generation of data for testing LOCKSS modifications at DNB; implementation in the test LOCKSS network (Berlin/Stanford)
• October: first repository data load; start of iterative tool development
Conclusion
Scholarly research on long-term digital archiving is just beginning. Today's system designs may no longer be the ideal in 50 or 100 years. The more systems can cooperate and interoperate, the greater the chances that investments in archiving systems can be carried into the future.
Sources
• Deutsche Initiative für Netzwerkinformationen (2009), “Open Access-Netzwerk Projekt”. Available (Dec 2009): http://www.dini.de/projekte/oa-netzwerk/
• Library of Congress (2009), “Metadata Encoding and Transmission Standard”. Available (Dec 2009): http://www.loc.gov/standards/mets/
• Library of Congress, National Digital Information Infrastructure Preservation Program (2009), “WARC, Web ARChive file format”. Available (Dec 2009): http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml
• LOCKSS (2009), “Libraries”. Available (Dec 2009): http://www.lockss.org/lockss/Libraries
• LOCKSS (2009), “Publications”. Available (Dec 2009): http://www.lockss.org/lockss/Publications
• Country (Ranking Web of Repositories).
• Seadle, Michael & Elke Greifeneder (2008), “In archiving we trust: Results from a workshop at Humboldt University in Berlin.” First Monday 13(1).
• Directory of open access journals. Available at: http://www.doaj.org/doaj?func=findJournals [Accessed January 23, 2009].