Accessing the Utility of Current Format Registry Efforts for Geospatial Resources. Nancy J. Hoebelheinrich, Stanford University Libraries (co-author and presenter), with Natalie K. Munn, Content Innovations LLC (co-author). Tuesday, May 5, 2009, IS&T’s Archiving 2009.
Background • NGDA project sponsored by the Library of Congress’ NDIIPP • Previously, an investigation into the metadata (MD) needed for long-lived geospatial resources (issued in 2008) • Assumption that there is a place for format registries in a preservation strategy
Key question for study • Would current format registry (FR) efforts work for geospatial resources? • Often complex, compound digital resources • Often proprietary • Specialized user domain • What’s necessary for a general preservation repository environment in contrast to a geospatial repository?
No need to re-invent! • Reviewed the current format registry development efforts (at the time) • The National Archives (UK): PRONOM Technical Registry • Library of Congress’ Sustainability Factors • Global Digital Format Registry (GDFR), a collaborative project by Harvard University, NARA and OCLC, funded by the Andrew W. Mellon Foundation • NGDA draft wiki-based FR
Methodology of study • Compare data models of and output from current FR efforts for: • File format characteristics • Relationships among formats • Structures for documenting versions
Methodology of study • Use real examples of geospatial data formats that were intended for ingest into NGDA repositories • Research & locate public documentation of each data format to be ingested • Note where explicit metadata existed within the resources themselves, e.g., in file headers • Examine 4 commonly used GIS conversion tools/utilities (SDRI, GDAL, Manifold, & SAFE) to determine how widely each data format is used and the level of support for import/export and/or direct read/write
Methodology of study • Create format definitions for data formats (9 so far) based on publicly available specifications, white papers, reference materials, and input from expert GIS users • Evaluate & report upon the utility of the FRs’ data structures for geospatial resources & identify issues
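As an illustration of "explicit metadata within the resources themselves", the sketch below reads the fixed 100-byte header of a shapefile's .shp main file, whose layout is given in the published ESRI shapefile specification. The function name `read_shp_header` and the returned dictionary keys are hypothetical choices made for this example, not part of the study.

```python
import struct

# Shape type codes for the most common 2-D geometries in the shapefile spec.
SHAPE_TYPES = {0: "Null Shape", 1: "Point", 3: "PolyLine", 5: "Polygon", 8: "MultiPoint"}

def read_shp_header(data: bytes) -> dict:
    """Parse the 100-byte header at the start of a .shp main file."""
    if len(data) < 100:
        raise ValueError("a .shp header is 100 bytes long")
    # The file code and file length are big-endian; the rest is little-endian.
    (file_code,) = struct.unpack(">i", data[0:4])
    (length_words,) = struct.unpack(">i", data[24:28])   # length in 16-bit words
    version, shape_type = struct.unpack("<ii", data[28:36])
    xmin, ymin, xmax, ymax = struct.unpack("<4d", data[36:68])
    if file_code != 9994:
        raise ValueError("file code is not 9994; not a .shp main file")
    return {
        "version": version,
        "length_bytes": length_words * 2,
        "shape_type": SHAPE_TYPES.get(shape_type, f"code {shape_type}"),
        "bbox": (xmin, ymin, xmax, ymax),
    }
```

Note what the header does not carry: there is no projection information, which is why the separate .prj file matters so much for long-term use.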
Formats reviewed in study (23) • Raster images (based on pixels), e.g., • TIFF, GeoTIFF • BIL (Band Interleaved by Line) • ADRG (ARC Digitized Raster Graphics) • ESRI Grid • Vector images (using points, lines, curves, shapes), e.g., • Shapefile • ArcInfo Coverages • Grid formats (representing elevation data for regularly spaced ground positions), e.g., • Digital Elevation Model (DEM) • Spatial Data Transfer Standard (SDTS)
Formats reviewed in study, continued • Proprietary formats (e.g., ESRI) • Openly available formats (e.g., TIFF family) • Data formats used by international & US national data sources
Results from the study to date: • “Report to National Geospatial Digital Archive Regarding Geospatial Data”, latest draft of 4 May 2009 • “NGDA Format Registry Research Bibliography and Resources” (Appendix A) • “NGDA Registry Survey” spreadsheet of 23 formats and how / whether current FR efforts describe them (Appendix B) • Summary of Registry Field Map across 4 FR efforts (Appendix C) • “Sample Geospatial Format Registry Definitions” using the Pronom XML schema (Appendix D) • “GDFR and Pronom Format Registry Definitions’ comparison” (Appendix E preliminary)
Elements compared across FR efforts in study • NGDA Field Research Survey.pdf • 49 tags compared across 4 FRs • See also (links to docs in References at end) • GDFR Data Model • Pronom Information model • Relevant articles by Alex Ball & Adrian Brown
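A cross-registry field comparison of this kind can be sketched as a small lookup table. Everything below is an assumption made for illustration: the structure, the function names, and the five concept labels (borrowed from the "key elements" slide). Only the two field names that the slides themselves mention (`relatedFormat` for Pronom, `compositionFacet` for GDFR) are filled in; an empty entry means "not filled in for this sketch", not "absent from that registry".

```python
REGISTRIES = ("PRONOM", "GDFR", "LC Sustainability Factors", "NGDA wiki")

# concept -> {registry: field name}. Deliberately sparse; see the note above.
FIELD_MAP = {
    "format family": {},
    "relationship among formats": {"PRONOM": "relatedFormat"},
    "container information": {"GDFR": "compositionFacet"},
    "version information": {},
    "associated software": {},
}

def coverage(field_map, registries=REGISTRIES):
    """Count, per registry, how many concepts have a mapped field."""
    return {r: sum(1 for fields in field_map.values() if r in fields)
            for r in registries}

def unmapped(field_map, registries=REGISTRIES):
    """List, per concept, the registries with no mapped field yet."""
    return {concept: [r for r in registries if r not in fields]
            for concept, fields in field_map.items()}
```

Filling the table in from the actual survey spreadsheet would turn `unmapped` into a quick list of the gaps to investigate.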
Key (& sometimes problematic) elements for geospatial resources • File format category: • Format family (parent/child; supertype/subtype) • Relationship among formats • Container information • Version information • Associated software category
Geospatial example: Shapefile (vector images) • Files included may vary greatly: • As defined by the spec: • .shp (main file describing shapes) • .shx (index file) • .dbf (database file containing feature attributes) • As found in the wild, may also include: • .prj (text file describing projection info – very important!) • .xml, .sbx, .sbn (often-useful files generated by ArcGIS tools, providing descriptive metadata and the binary spatial indexes used by the tools) • .atx (ArcGIS files created to provide an index to attributes)
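The component list above suggests a simple ingest-time check: group the files that share a base name and report which spec-defined parts are missing and whether a .prj file came along. This is a minimal sketch under stated assumptions; the function name `group_shapefiles` and the report keys are hypothetical, and the handling of metadata files named like `roads.shp.xml` reflects common ArcGIS naming rather than anything stated in the slides.

```python
from collections import defaultdict
from pathlib import PurePath

REQUIRED = {".shp", ".shx", ".dbf"}                  # defined by the spec
OPTIONAL = {".prj", ".xml", ".sbx", ".sbn", ".atx"}  # found in the wild

def group_shapefiles(filenames):
    """Group file names into shapefile bundles keyed by shared base name."""
    bundles = defaultdict(set)
    for name in filenames:
        path = PurePath(name)
        suffix = path.suffix.lower()
        if suffix not in REQUIRED | OPTIONAL:
            continue  # not a known shapefile component
        stem = path.stem
        # Metadata is often named "<base>.shp.xml"; fold it into "<base>".
        if suffix == ".xml" and stem.lower().endswith(".shp"):
            stem = stem[:-4]
        bundles[str(path.parent / stem)].add(suffix)
    return {
        base: {
            "present": sorted(found),
            "missing_required": sorted(REQUIRED - found),
            "has_projection": ".prj" in found,
        }
        for base, found in bundles.items()
    }
```

A bundle with an empty `missing_required` list but `has_projection` set to false is exactly the spec-compliant yet hard-to-reuse case the slide flags.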
Two geospatial examples: shapefiles (vector images) • Is the description of format family / relationship among shapefile-related formats adequate? • Are the variations in shapefiles supertype/subtype, parent/child, or second cousins once removed?
Two geospatial examples: shapefiles (vector images) • Does Pronom’s relatedFormat adequately explain? How to explain the evolution of the format and the inclusion of projection (.prj) files? • [Also true of the TIFF family: • GeoTIFF • High Resolution Orthoimagery (HRO)] • Container info: GDFR’s compositionFacet doesn’t include a file directory as a container (it does include container as “bundle” & as “wrapper”). How to describe? The same is true for DEMs and SDTS
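One way to make these questions concrete is to model formats as nodes with typed relationships and an explicit container kind. The sketch below is hypothetical throughout: the class names, the two relation types, and especially the `DIRECTORY` container kind are assumptions introduced here. The slide says only that GDFR's compositionFacet offers "bundle" and "wrapper" and lacks a directory option.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional

class Relation(Enum):
    SUBTYPE_OF = "subtype of"
    COMPONENT_OF = "component of"

class Container(Enum):
    BUNDLE = "bundle"        # named on the slide as a GDFR option
    WRAPPER = "wrapper"      # named on the slide as a GDFR option
    DIRECTORY = "directory"  # hypothetical addition for directory-based formats

@dataclass
class FormatGraph:
    nodes: dict = field(default_factory=dict)   # name -> Optional[Container]
    edges: list = field(default_factory=list)   # (source, relation, target)

    def add(self, name: str, container: Optional[Container] = None) -> None:
        self.nodes[name] = container

    def relate(self, source: str, relation: Relation, target: str) -> None:
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both formats must be added before relating them")
        self.edges.append((source, relation, target))

    def sources(self, target: str, relation: Relation) -> list:
        """Formats that stand in `relation` to `target`, in insertion order."""
        return [s for s, r, t in self.edges if r is relation and t == target]

def example_graph() -> FormatGraph:
    g = FormatGraph()
    g.add("TIFF")
    g.add("GeoTIFF")
    g.relate("GeoTIFF", Relation.SUBTYPE_OF, "TIFF")
    g.add("Shapefile", Container.DIRECTORY)
    for part in (".shp", ".shx", ".dbf", ".prj"):
        g.add(part)
        g.relate(part, Relation.COMPONENT_OF, "Shapefile")
    return g
```

Whether GeoTIFF is really a "subtype" of TIFF, or whether .prj is a "component" on the same footing as .shp, is precisely the open question; the point of typed edges is that the registry has to pick an answer.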
Two geospatial examples: shapefiles (vector images) • Version info: If variations in the format develop over time, but appear only as different agencies or producers evolve their practice (e.g., inclusion of .prj files), are these versions? • When / why would it matter?
Two geospatial examples: shapefiles (vector images) • Arguably, software apps capable of rendering shapefiles are NOT ubiquitous enough (in a general-purpose preservation repository); is there an obligation to describe them? • Very unclear how to link software to format in GDFR, & such links don’t seem to show up in Pronom in Search mode
Issues / Suggestions raised by creation of format definitions • FRs should either keep copies of specs and white papers, or be sure to provide a resolvable link to a preservation copy • Encourage creation of ontologies for describing relationships among data format types
Issues / Suggestions • Include examples of formats as part of the format definition (assuming no rights restrictions on same) • Provide or encourage creation of more extensive definitions and guidelines
Issues / suggestions • Invite software vendors to the party? How to persuade / include? (e.g., ESRI) • How practicable is it to create the format definitions?
Issues / Suggestions • Some argue that all this work would not be helpful anyway; how to test that question?
Next Steps for NGDA efforts • Complete as many format definitions as possible within time left in grant • Attempt to devise rational description of format relationships among geospatial formats • Approach ESRI and other software or service vendors for assistance in creating format defs or making specs available • Contribute format defs created to LC’s Sustainability Factors site, and either PRONOM or UDFR as possible.
References • National Geospatial Digital Archive (NGDA) project, sponsored by the Library of Congress’ NDIIPP • National Digital Information Infrastructure and Preservation Program (NDIIPP) • Nancy Hoebelheinrich et al., “An Investigation into Metadata for Long-Lived Geospatial Data Formats”, July 2008 • The National Archives (TNA) PRONOM Technical Registry • Library of Congress Sustainability Factors for Digital Resources • Global Digital Format Registry • “Report to National Geospatial Digital Archive Regarding Geospatial Data”, latest draft of 4 May 2009 • “Pronom 4 Information Model”, by Adrian Brown for The National Archives (UK), Version 1, 4 January 2005 • “GDFR Data Model Specification”, version 5_0_14 • “Briefing Paper: File Formats and XML Schema Registries”, issued 31 May 2006, by Alex Ball • “White Paper: Representation Information Registries”, issued 29 January 2008 for the PLANETS project by Adrian Brown, The National Archives (UK) • Unified Digital Format Registry
Contacts and Questions? • Nancy Hoebelheinrich, nhoebel@stanford.edu • Natalie Munn, nkmunn@contentinnovations.com