Accessing the Utility of Current Format Registry Efforts for Geospatial Resources

Accessing the Utility of Current Format Registry Efforts for Geospatial Resources. Nancy J. Hoebelheinrich, Stanford University Libraries (co-author and presenter), with Natalie K. Munn, Content Innovations LLC (co-author). Tuesday, May 5, 2009, IS&T’s Archiving 2009.

Presentation Transcript


  1. Accessing the Utility of Current Format Registry Efforts for Geospatial Resources Nancy J. Hoebelheinrich Stanford University Libraries (co-author and presenter) With Natalie K. Munn, Content Innovations LLC (co-author) Tuesday, May 5, 2009 IS & T’s Archiving 2009

  2. Background • NGDA project sponsored by Library of Congress’ NDIIPP • Previously, an investigation into the MD needed for long-lived geospatial resources (issued in 2008) • Assumption that there is a place for format registries in preservation strategy

  3. Key Question for study • Would current FR efforts work for geospatial resources? • Often complex, compound digital resources • Often proprietary • Specialized user domain • What’s necessary for a general preservation repository environment in contrast to a geospatial repository?

  4. No need to re-invent! • Reviewed the format registry development efforts current at the time • The National Archives of the UK: PRONOM Technical Registry • Library of Congress’ Sustainability Factors • Global Digital Format Registry, a collaborative project by Harvard University, NARA and OCLC, funded by the Andrew W. Mellon Foundation • NGDA draft wiki-based FR

  5. Methodology of study • Compare data models of and output from current FR efforts for: • File format characteristics • Relationships among formats • Structures for documenting versions

  6. Methodology of study • Use real examples of geospatial data formats that were intended for ingest into NGDA repositories • Research & locate public documentation of each data format to be ingested • Note where explicit metadata existed within the resources themselves, e.g., in file headers • Examine 4 commonly used GIS conversion tools/utilities (SDRI, GDAL, Manifold, & SAFE) to determine how widely each format is used and the level of support for import/export and/or direct read/write of the data formats
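As an illustration of "explicit metadata within the resources themselves", the following is a minimal sketch of reading the fixed 100-byte header of a shapefile's main (.shp) file. The byte offsets and the file code 9994 follow the publicly available ESRI Shapefile Technical Description; the function name, the returned dictionary keys, and the abbreviated shape-type table are assumptions made for this example and are not part of the study.

```python
import struct

# Abbreviated table of shape type codes (assumption: only a few common
# codes are listed here; anything else is reported as "other").
SHAPE_TYPES = {0: "Null Shape", 1: "Point", 3: "PolyLine", 5: "Polygon", 8: "MultiPoint"}


def read_shp_header(data: bytes) -> dict:
    """Parse the 100-byte main file header of a .shp file (hypothetical helper)."""
    if len(data) < 100:
        raise ValueError("a .shp main file header is 100 bytes long")
    # Bytes 0-3: file code, big-endian; bytes 24-27: file length in 16-bit words, big-endian.
    (file_code,) = struct.unpack(">i", data[0:4])
    (length_words,) = struct.unpack(">i", data[24:28])
    # Bytes 28-35: version and shape type, both little-endian.
    version, shape_type = struct.unpack("<ii", data[28:36])
    # Bytes 36-67: bounding box Xmin, Ymin, Xmax, Ymax as little-endian doubles.
    xmin, ymin, xmax, ymax = struct.unpack("<4d", data[36:68])
    if file_code != 9994:
        raise ValueError("file code is not 9994; not a shapefile main file")
    return {
        "version": version,
        "length_bytes": length_words * 2,
        "shape_type": SHAPE_TYPES.get(shape_type, "other"),
        "bbox": (xmin, ymin, xmax, ymax),
    }


# Synthetic header built in memory so the sketch runs without a real file.
example_header = (
    struct.pack(">i", 9994) + b"\x00" * 20 + struct.pack(">i", 50)
    + struct.pack("<ii", 1000, 5)
    + struct.pack("<8d", -1.0, -2.0, 3.0, 4.0, 0.0, 0.0, 0.0, 0.0)
)
info = read_shp_header(example_header)
```

Note that this header carries the version, geometry type and bounding box but no coordinate reference system, which is one reason the separate projection file matters for long-term interpretation.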

  7. Methodology of study • Create format definitions for (9 so far) data formats based on publicly available specifications, White Papers, reference materials, and input from expert GIS users • Evaluate & report upon utility of FR’s data structures for geospatial resources & identify issues

  8. Formats reviewed in study (23) • Raster images (based on pixels), e.g., • TIFF, GeoTIFF • BIL (Band Interleaved by Line) • ADRG (Arc Digitized Raster Graphic) • ESRI Grid • Vector images (using points, lines, curves, shapes), e.g., • Shapefile • ArcInfo Coverages • Grid formats (representing elevation data for regularly spaced ground positions), e.g., • Digital Elevation Model (DEM) • Spatial Data Transfer Standard (SDTS)

  9. Formats reviewed in study, continued • Proprietary formats (e.g., ESRI) • Openly available formats (e.g., TIFF family) • Data formats used by international & US national data sources

  10. Results from the study to date: • “Report to National Geospatial Digital Archive Regarding Geospatial Data”, latest draft of 4 May 2009 • “NGDA Format Registry Research Bibliography and Resources” (Appendix A) • “NGDA Registry Survey” spreadsheet of 23 formats and how / whether current FR efforts describe them (Appendix B) • Summary of Registry Field Map across 4 FR efforts (Appendix C) • “Sample Geospatial Format Registry Definitions” using the Pronom XML schema (Appendix D) • “GDFR and Pronom Format Registry Definitions’ comparison” (Appendix E preliminary)

  11. Elements compared across FR efforts in study • NGDA Field Research Survey.pdf • 49 tags compared across 4 FRs • See also (links to docs in References at end) • GDFR Data Model • Pronom Information model • Relevant articles by Alex Ball & Adrian Brown

  12. Key (& sometimes problematic) elements for geospatial resources • File format category: • Format family (parent/child; supertype/subtype) • Relationship among formats • Container information • Version information • Associated software category
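The elements listed above can be pictured as fields of a format definition record. The sketch below is purely illustrative: the class names, field names and relation types are hypothetical and are not the PRONOM or GDFR schemas. Treating GeoTIFF as a subtype of TIFF, and a shapefile as a directory-style container, are modeling choices assumed for the example.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional, Tuple


class Relation(Enum):
    """Hypothetical relationship types between formats."""
    SUBTYPE_OF = "subtype of"
    HAS_COMPONENT = "has component"
    LATER_VERSION_OF = "later version of"


class Container(Enum):
    """Hypothetical container kinds; DIRECTORY is the hard-to-express case."""
    NONE = "single file"
    BUNDLE = "bundle"
    WRAPPER = "wrapper"
    DIRECTORY = "file directory"


@dataclass
class FormatDefinition:
    name: str
    family: str
    version: Optional[str] = None
    container: Container = Container.NONE
    relations: List[Tuple[Relation, str]] = field(default_factory=list)
    software: List[str] = field(default_factory=list)


def subtypes_of(registry: Dict[str, FormatDefinition], parent: str) -> List[str]:
    """Return the names of formats that declare themselves a subtype of `parent`."""
    return sorted(
        f.name for f in registry.values() if (Relation.SUBTYPE_OF, parent) in f.relations
    )


# Illustrative entries only; not taken from any registry.
definitions = [
    FormatDefinition(name="TIFF", family="TIFF"),
    FormatDefinition(name="GeoTIFF", family="TIFF",
                     relations=[(Relation.SUBTYPE_OF, "TIFF")]),
    FormatDefinition(name="Shapefile", family="ESRI vector",
                     container=Container.DIRECTORY,
                     relations=[(Relation.HAS_COMPONENT, ".shp"),
                                (Relation.HAS_COMPONENT, ".shx"),
                                (Relation.HAS_COMPONENT, ".dbf")],
                     software=["ArcGIS", "GDAL"]),
]
registry = {d.name: d for d in definitions}
tiff_subtypes = subtypes_of(registry, "TIFF")
```

Even in this toy model, the relation vocabulary has to be fixed in advance, which is where the choice between parent/child and supertype/subtype descriptions becomes a practical question.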

  13. Geospatial ex.: Shapefile (vector images) • Files included may vary greatly: • As defined by the spec: • .shp (main file describing shapes) • .shx (index file) • .dbf (database file containing feature attributes) • As found in the wild, also may include • .prj (text files describing projection info – very important!) • .xml, .sbx, .sbn (often useful files generated by ArcGIS tools which provide output descriptive MD, binary spatial indexes used by tool) • .atx (ArcGIS files created to provide index to attributes)
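The variation described above can be checked mechanically. The following is a minimal sketch, assuming a flat list of file names, that groups shapefile components by base name and reports which spec-defined parts are missing and which "in the wild" extras are present. The function name and report keys are hypothetical, and the handling of metadata named like `roads.shp.xml` is an assumption about how such sidecar files are commonly named.

```python
from collections import defaultdict
from pathlib import PurePath

SPEC_PARTS = {".shp", ".shx", ".dbf"}                   # defined by the spec
WILD_PARTS = {".prj", ".xml", ".sbx", ".sbn", ".atx"}   # often found alongside


def group_shapefiles(filenames):
    """Group component files by base name and summarise each shapefile (hypothetical helper)."""
    groups = defaultdict(set)
    for name in filenames:
        path = PurePath(name)
        suffix = path.suffix.lower()
        if suffix not in SPEC_PARTS | WILD_PARTS:
            continue  # unrelated file, e.g. a readme
        stem = path.stem
        # Assumption: metadata may be named "roads.shp.xml"; fold it into "roads".
        if suffix == ".xml" and stem.lower().endswith(".shp"):
            stem = stem[:-4]
        groups[stem].add(suffix)

    report = {}
    for stem, parts in groups.items():
        if ".shp" not in parts:
            continue  # no main file, so not treated as a shapefile here
        report[stem] = {
            "missing_required": sorted(SPEC_PARTS - parts),
            "extras": sorted(parts & WILD_PARTS),
            "has_projection": ".prj" in parts,
        }
    return report


files = ["roads.shp", "roads.shx", "roads.dbf", "roads.prj", "roads.shp.xml",
         "lakes.shp", "lakes.dbf", "readme.txt"]
report = group_shapefiles(files)
```

In this example `roads` is complete and carries projection information, while `lakes` lacks its index file and has no `.prj`, the kind of difference a single registry entry for "Shapefile" would need to accommodate.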

  14. Two geospatial examples: shapefiles (vector images) • Is the description of format family / relationship among shapefile-related formats adequate? • Are the variations in shapefiles supertype/subtype, parent/child, or second cousin once removed?

  15. Two geospatial examples: shapefiles (vector images) • Does Pronom’s relatedFormat adequately explain? How to explain evolution of format, inclusion of projection (.prj) files? • [Also true of TIFF family: • GeoTIFF • High Resolution Orthoimagery (HRO)] • Container info: GDFR’s compositionFacet doesn’t include file directory as container (does include container as “bundle” & as “wrapper”). How to describe? Same is true for DEMs and SDTS

  16. Two geospatial examples: shapefiles (vector images) • Version info: If variations in the format develop over time, but appear only as different agencies or producers evolve (e.g., inclusion of .prj files), are these versions? • ?? When / why would it matter??

  17. Two geospatial examples: shapefiles (vector images) • Arguably, software apps capable of rendering shapefiles are NOT ubiquitous enough (in a general-purpose preservation repository); is there an obligation to describe them? • Very unclear how to link software to format in GDFR, & the apps don’t seem to show up in Pronom in Search mode

  18. Issues / Suggestions raised by creation of format definitions • FR should either keep copies of specs, white papers, or be sure to provide resolvable link to a preservation copy • Encourage creation of ontologies for describing relationships among data format types

  19. Issues / Suggestions • Include examples of formats as part of the format definition (assuming no rights restrictions on same) • Provide or encourage creation of more extensive definitions and guidelines

  20. Issues / Suggestions • Invite software vendors to the party? How to persuade / include? (e.g., ESRI) • How practicable is it to create the format definitions?

  21. Issues / Suggestions • Some argue that all this work would not be helpful anyway; how to test that question?

  22. Next Steps for NGDA efforts • Complete as many format definitions as possible within time left in grant • Attempt to devise rational description of format relationships among geospatial formats • Approach ESRI and other software or service vendors for assistance in creating format defs or making specs available • Contribute format defs created to LC’s Sustainability Factors site, and either PRONOM or UDFR as possible.

  23. References • National Geospatial Digital Archive (NGDA) project sponsored by Library of Congress’ NDIIPP • National Digital Information Infrastructure Preservation Program (NDIIPP) • Nancy Hoebelheinrich, et al., “An Investigation into Metadata for Long-Lived Geospatial Data Formats”, July 2008 • The National Archives (TNA) PRONOM Technical Registry • Library of Congress Sustainability Factors for Digital Resources • Global Digital Format Registry • “Report to National Geospatial Digital Archive Regarding Geospatial Data”, latest draft of 4 May 2009 • “Pronom 4 Information Model”, by Adrian Brown for The National Archives (UK), Version 1, 4 January 2005 • “GDFR Data Model Specification”, version 5_0_14 • “Briefing Paper: File Formats and XML Schema Registries”, issued 31 May 2006, by Alex Ball • “White Paper: Representation Information Registries”, issued 29 January 2008 for PLANETS project by Adrian Brown, The National Archives (UK) • Unified Digital Format Registry

  24. Contacts and Questions? • Nancy Hoebelheinrich, nhoebel@stanford.edu • Natalie Munn, nkmunn@contentinnovations.com