This pilot project aims to preserve, curate, and enable access to digital data and associated electronic journals content in the field of astronomy. It focuses on collaboration among libraries, publishers, and the Virtual Observatory to address the data preservation problem.
Digital Data Preservation in Astronomy: A Collaboration Among Libraries, Publishers, and the Virtual Observatory
A pilot project aimed at preserving, curating, and enabling access to digital data and associated electronic journal content.
Robert Hanisch, Space Telescope Science Institute
Sayeed Choudhury, Tim DiLauro, Alex Szalay, and Ethan Vishniac, The Johns Hopkins University
Julie Steffen, University of Chicago Press
Teresa Ehling, Cornell University
Robert Milkey, American Astronomical Society
Ray Plante, National Center for Supercomputing Applications
Outline for Presentation
• The Virtual Observatory
• Data in Astronomy
• The data preservation problem
• A scenario
• Past experience and research
• Approach
• A prototype project
DCC 2006, Glasgow
The Virtual Observatory
• The Virtual Observatory enables new science by greatly enhancing access to data and computing resources. The VO makes it easy to locate, retrieve, and analyze data from archives and catalogs worldwide.
• The VO is about data discovery, access, and integration.
• The VO is NOT a huge centralized data repository.
• The VO provides standard protocols for obtaining data from distributed collections.
• The VO is national (US NVO) and international (IVOA).
Without VO: n services, n interfaces
[Diagram: the astronomer queries archives 1 to 3, services 1 to 3, and surveys 1 to 3, each through its own interface.]
With VO: n services, 1 interface
[Diagram: the astronomer queries the VO, which connects to archives 1 to 3, services 1 to 3, and surveys 1 to 3.]
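The "n services, 1 interface" idea can be sketched in a few lines of Python. This is only an illustration, not the VO's actual protocols: the archive and survey functions, their row layouts, the `Record` type, and `search_all` are all hypothetical, and the positional match is a simplified box test rather than a true cone search on the sphere.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Record:
    """Common result shape returned through the single interface."""
    source: str
    name: str
    ra_deg: float
    dec_deg: float


def _in_box(ra, dec, ra0, dec0, half_width):
    # Simplified box test in degrees; a real service would do a proper
    # spherical search around (ra0, dec0).
    return abs(ra - ra0) <= half_width and abs(dec - dec0) <= half_width


def query_archive_1(ra0, dec0, half_width) -> List[Record]:
    # Hypothetical archive whose native rows are dicts with upper-case keys.
    rows = [{"OBJ": "A1-001", "RA": 10.0, "DEC": 20.0},
            {"OBJ": "A1-002", "RA": 150.0, "DEC": -5.0}]
    return [Record("archive 1", r["OBJ"], r["RA"], r["DEC"])
            for r in rows if _in_box(r["RA"], r["DEC"], ra0, dec0, half_width)]


def query_survey_1(ra0, dec0, half_width) -> List[Record]:
    # Hypothetical survey whose native rows are (name, ra, dec) tuples.
    rows = [("S1-042", 10.2, 19.9)]
    return [Record("survey 1", name, ra, dec)
            for name, ra, dec in rows if _in_box(ra, dec, ra0, dec0, half_width)]


# One adapter per collection; each hides its native format behind Record.
ADAPTERS: Dict[str, Callable[[float, float, float], List[Record]]] = {
    "archive 1": query_archive_1,
    "survey 1": query_survey_1,
}


def search_all(ra0: float, dec0: float, half_width: float) -> List[Record]:
    """Single entry point: fan the same query out to every collection."""
    results: List[Record] = []
    for adapter in ADAPTERS.values():
        results.extend(adapter(ra0, dec0, half_width))
    return results
```

Under these assumptions, adding a new collection means registering one more adapter, while the astronomer's call to `search_all` stays the same.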
Why is Astronomy Data Special?
• It has no commercial value
  • No privacy concerns
  • Can freely share results with others
  • Great for experimenting with algorithms
• It is real and well documented
  • High-dimensional (with confidence intervals)
  • Spatial
  • Temporal
• Diverse and distributed
  • Many different instruments from many different places and many different times
• The questions are interesting
• There is a lot of it (soon petabytes)
[Image panels: ROSAT ~keV, DSS Optical, 2MASS 2 μm, IRAS 25 μm, IRAS 100 μm, GB 6 cm, NVSS 20 cm, WENSS 92 cm]
Data Flow (Levels of Data)
• Pixel data collected by telescope
• Sent to Fermilab for processing
• Beowulf Cluster produces catalog
• Loaded in a SQL database
The data preservation problem
• Research communities publish peer-reviewed journal papers that describe highly processed data.
• Long-term preservation and curation systems for digital journal content, including the digital data presented only graphically, are not currently in place.
• The research cannot be verified and the results cannot be easily compared to other data in order to broaden impact.
• Public funds invested in scientific research do not have maximum return on investment. Essential legacy datasets may be lost.
Storyboard
Storyboard
[Screenshot options: Save as FITS; Copy to my VOSpace; Display in Aladin]
Astronomy Digital Image Library
ADIL query
ADIL is great, but…
• Data capture and curation are separate from manuscript processing
• Data access is not integrated into the journals
• Data management is centralized
Repository-related Research
• Digital Library framework comprises service-oriented architecture with repositories as foundation, especially for digital preservation
• Archive Ingest and Handling Test (AIHT) through Library of Congress NDIIPP
• A Technology Analysis of Repositories and Service Integration (funded by Mellon Foundation)
• Project STORE (Source to Output Repositories)
Approach
• Integrate digital data management into the publication process (data capture, review, metadata tagging and validation, storage).
• Exploit emerging information technology standards for managing distributed data collections, including digital journals.
• Provide multiple access methods to digital data to maximize visibility and re-use.
• Exploit information management and curation experience in the university libraries and build on long-term institutional commitments to preservation.
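As a minimal sketch of what "metadata tagging and validation" at submission time might look like, the Python below checks a dataset's metadata record before it is accepted for storage. The field names, types, and rules are assumptions made for illustration; the project's actual metadata definition is not given in these slides.

```python
# Hypothetical required fields and their expected types (illustrative only).
REQUIRED_FIELDS = {
    "title": str,
    "authors": list,
    "instrument": str,
    "ra_deg": float,
    "dec_deg": float,
    "data_format": str,
}


def validate_metadata(meta: dict) -> list:
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in meta:
            errors.append(f"missing field: {field}")
        elif not isinstance(meta[field], expected):
            errors.append(f"{field}: expected {expected.__name__}")

    # Range checks apply only when the coordinate has the right type.
    ra = meta.get("ra_deg")
    if isinstance(ra, float) and not 0.0 <= ra < 360.0:
        errors.append("ra_deg: must be in [0, 360)")
    dec = meta.get("dec_deg")
    if isinstance(dec, float) and not -90.0 <= dec <= 90.0:
        errors.append("dec_deg: must be in [-90, 90]")
    return errors


good = {
    "title": "Example radio map",
    "authors": ["A. Author"],
    "instrument": "Example Telescope",
    "ra_deg": 83.6,
    "dec_deg": 22.0,
    "data_format": "FITS",
}
bad = {"title": "No coordinates", "authors": "A. Author", "dec_deg": 120.0}
```

Here `validate_metadata(good)` returns an empty list, while `validate_metadata(bad)` reports the wrongly typed author list, the missing fields, and the out-of-range declination, which is the kind of feedback an editorial workflow could return to authors before publication.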
Components
• Publication & Editorial Process
  • Data capture
  • Metadata capture & validation
  • Links
  • Identifiers
• Library
  • Curation
  • Preservation
• Data Storage Appliance (shown three times in the diagram)
  • Metadata database
  • Digital data objects
  • Ancillary information
• Replication services
• VOSpace
• Data Access
  • VO portals
  • Journal portals
  • Other after-market distributors
• Registry
• Logging
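The role of the replicated storage appliances can be illustrated with a small in-memory sketch: each appliance holds data objects plus a metadata record, objects are copied between appliances, and a checksum audit detects a damaged copy. The `StorageAppliance` class, `replicate`, and `audit` are hypothetical names, and using SHA-256 for fixity checking is an assumption; the slides do not specify how replication is implemented.

```python
import hashlib


class StorageAppliance:
    """Toy appliance: a metadata database plus digital data objects."""

    def __init__(self, name: str):
        self.name = name
        self.objects = {}    # identifier -> bytes
        self.metadata = {}   # identifier -> dict (includes a checksum)

    def put(self, identifier: str, data: bytes, metadata: dict) -> None:
        # Record the checksum at ingest so later audits can detect damage.
        record = dict(metadata)
        record["sha256"] = hashlib.sha256(data).hexdigest()
        self.objects[identifier] = data
        self.metadata[identifier] = record

    def verify(self, identifier: str) -> bool:
        data = self.objects.get(identifier)
        if data is None:
            return False
        return hashlib.sha256(data).hexdigest() == self.metadata[identifier]["sha256"]


def replicate(identifier: str, source: StorageAppliance, targets: list) -> None:
    """Copy one object and its descriptive metadata to every target."""
    data = source.objects[identifier]
    meta = {k: v for k, v in source.metadata[identifier].items() if k != "sha256"}
    for target in targets:
        target.put(identifier, data, meta)


def audit(identifier: str, appliances: list) -> dict:
    """Report, per appliance, whether its copy still matches its checksum."""
    return {a.name: a.verify(identifier) for a in appliances}


a, b, c = StorageAppliance("A"), StorageAppliance("B"), StorageAppliance("C")
a.put("dataset-1", b"example pixel data", {"title": "Example image"})
replicate("dataset-1", a, [b, c])
healthy = audit("dataset-1", [a, b, c])

b.objects["dataset-1"] = b"corrupted"   # simulate silent damage on one copy
damaged = audit("dataset-1", [a, b, c])
```

After the simulated corruption, the audit flags appliance B while A and C still verify, so a good copy remains available for repair.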
A prototype project
• Implement end-to-end prototype using astronomy scholarly publications as a test-bed.
• Understand operational costs and develop long-term business plan for preservation of peer-reviewed journal content and associated supporting data.
• Develop associated policies affecting data accessibility (e.g., move toward making digital data availability a requirement for publication).
• Utilize commodity open-source technologies and partner with Virtual Observatory to maximize return on investment, flexibility, and adaptability.
• Long-term: evaluate impact on citations and productivity resulting from having ready access to digital data.
A prototype project
• Tasks
  • metadata definition*
  • content management tool evaluation/selection (Fedora)*
  • physical storage and replication*
  • publication process revisions and testing
  • policy development
  • business model development
*Shared technology development/deployment with National Virtual Observatory
Current collaborators
• The Johns Hopkins University-Sheridan Libraries, Edinburgh University Library, University of Washington Library and Cornell University Library (information management and curation)
• The National Virtual Observatory project (representatives from JHU, Space Telescope Science Institute, and the National Center for Supercomputing Applications)
• American Astronomical Society (journals, editors)
• The University of Chicago Press (publisher for the AAS journals)
Status
• Support from
  • UK JISC (Joint Information Systems Committee) and CURL (Consortium of Research Libraries in the British Isles)
  • US Institute of Museum and Library Services
• Support committed from
  • Microsoft
  • SPARC (Scholarly Publishing and Academic Resources Coalition)
  • TeraGrid
  • NVO
• Development has started…