State and Local Agency Digital Geospatial Data Preservation. The North Carolina Experience. Earth Sciences Information Partners (ESIP) Workshop July 8, 2009. Steve Morris NCSU Libraries. NC Geospatial Data Archiving Project (NCGDAP).
NC Geospatial Data Archiving Project (NCGDAP) • One of eight initial collection building projects in the Library of Congress NDIIPP (National Digital Information Infrastructure and Preservation Program) • Lead organizations: North Carolina State University Libraries and North Carolina Center for Geographic Information & Analysis (NCCGIA) • Focus: • State and local government geospatial data in NC • Repository development as catalyst for discussion • Goal: Engage spatial data infrastructure in data archiving • Initial 3 year project extended to Dec. 2009
NCGDAP Data Types – Raster • Digital orthophotography • Satellite imagery Static data
NCGDAP Data Types – Vector Data • Point, line, and polygon • Attached attribute data Often updated
Imagery = Durable • Static • Simple structure • Mostly open formats Vector data = Volatile • Frequent update • Complex structure • Mostly proprietary (commercial) formats [Figure: 2005 Wake County orthophoto of downtown Raleigh, NC, near the State Capitol] Note: Percentages based on the actual number of respondents to each question
NCGDAP Data Types – Spatial Databases • Vector and raster data • Relationships • Behaviors • Annotation • Data Models
Geospatial Data: Compelling Issues • Dynamic content • Constantly updated information • Data versioning • Digital object complexity • Spatially-enabled databases • Complicated, multi-component formats • Proprietary formats
Ingest Challenges: General • Data consists of multi-file, multi-format objects • Ancillary data files can be shared by datasets • Some format conversions involve one-to-many relationships • Compressed archive files are common and behave unpredictably • And all the usual challenges: format validation, validity checking, threat scanning,…
Here’s One! Files • Multi-file dataset • Georeferencing • Metadata file • Symbolization file • Additional documentation • License • Disclaimer • More Metadata • FGDC • Acquisition metadata • Transfer metadata • Ingest metadata • Archive rights • Archive processes • Collection metadata • Series metadata
Ingest Challenges: Metadata • Metadata is encoded in a variety of ways • The FGDC content standard for metadata lacked an encoding standard (it arrived pre-XML); this is addressed in the ISO 19115/19139 North American Profile implementation • XML (varied schemas), TXT, HTML • Metadata is missing • Only about 25% of local agencies use FGDC • Metadata is wrong • Metadata is commonly asynchronous with the data • Inconsistent use of dataset naming, etc. • e.g., “Streets” vs. “Wake County Streets”
NCGDAP Metadata Summary • Existing geospatial metadata often needs: • Remediation – to fix errors or omissions • Normalization – to adhere to a standard structure • Synchronization – so that the data at hand matches the metadata • If no metadata then: • Can build minimal metadata using templates and auto-extraction • Lose key information such as data quality, lineage, data dictionaries • Automating metadata for repository ingest • Raster data is easy – large sets of consistently structured files • Vector data is hard – each dataset is a different story • Many additional administrative and technical metadata elements not accommodated by FGDC
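The "templates and auto-extraction" idea above can be sketched as follows. This is a minimal illustration, not the project's actual tooling: the `TEMPLATE` defaults and the `build_minimal_record` helper are hypothetical, and the element names (`idinfo`, `citation`, `citeinfo`, `origin`, `title`, `descript`, `abstract`) are the commonly used FGDC short names, used here only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical template: defaults used when nothing better is known.
TEMPLATE = {
    "origin": "Unknown",
    "abstract": "No abstract supplied; record generated at ingest.",
}

def build_minimal_record(file_name, extracted, template=TEMPLATE):
    """Return a minimal FGDC-style XML record as a string."""
    values = dict(template)
    values.update(extracted)  # auto-extracted values override template defaults
    # Fall back to the file name (without extension) when no title was extracted.
    title = values.get("title") or file_name.rsplit(".", 1)[0]

    root = ET.Element("metadata")
    idinfo = ET.SubElement(root, "idinfo")
    citeinfo = ET.SubElement(ET.SubElement(idinfo, "citation"), "citeinfo")
    ET.SubElement(citeinfo, "origin").text = values["origin"]
    ET.SubElement(citeinfo, "title").text = title
    descript = ET.SubElement(idinfo, "descript")
    ET.SubElement(descript, "abstract").text = values["abstract"]
    return ET.tostring(root, encoding="unicode")

# Sample values only; not taken from a real dataset.
record = build_minimal_record("WakeCountyStreets.shp", {"origin": "Wake County"})
```

A record built this way carries identification only; data quality, lineage, and data dictionaries cannot be recovered by extraction and stay empty.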
Extended Curation: Feedback and Outreach • Ingest processes: Data Receipt, Format Processing, Metadata Processing • Feedback and outreach to: Content Producers, Industry, Standards Organizations
Spatial Data Infrastructure and Archiving • Metadata standards and outreach: metadata quality, best practices • Inventories: reduce “contact fatigue”, shareable information store • Content exchange networks: leverage more compelling business reasons to put data in motion; automate process, add technical & administrative metadata • Framework data communities: snapshot frequency, schemas, format strategies
Content Packaging Issues • Geospatial datasets are typically complex, multi-file objects • Data are often accompanied by ancillary data, which must be associated with the data item • Rights information and licenses must be associated with the item • Various implementations in different domains (METS, IMS-CP, XFDU, etc.) • Simpler .zip-based packages also used (MEF, KMZ, etc.)
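As a rough sketch of the simpler .zip-based approach, the code below bundles a dataset's files with a manifest that records checksums and a rights statement, so ancillary files and license information travel with the data item. The `manifest.json` layout and the `build_package` helper are assumptions made for illustration; they do not follow METS, MEF, or any other named packaging standard.

```python
import hashlib
import io
import json
import zipfile

def build_package(files, rights):
    """Bundle files (name -> bytes) and a rights statement into zip bytes."""
    manifest = {
        "rights": rights,
        # One SHA-256 checksum per component file, for later fixity checks.
        "files": {name: hashlib.sha256(data).hexdigest()
                  for name, data in files.items()},
    }
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
        for name, data in files.items():
            package.writestr(name, data)
        package.writestr("manifest.json", json.dumps(manifest, indent=2))
    return buffer.getvalue()

# Placeholder contents; a real package would hold the actual dataset files.
sample_files = {
    "streets.shp": b"geometry bytes",
    "streets.dbf": b"attribute bytes",
    "license.txt": b"license text",
}
package_bytes = build_package(sample_files, "See license.txt")
```

Keeping the checksums inside the package lets a receiving archive verify the components without contacting the producer.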
Spatial Database Approaches • Manage database forward over time • Extract data layers to preservable form • Set aside archival snapshot of database
GeoMAPP: Geospatial Multistate Archival and Preservation Partnership • Partners (NC, KY, UT, Library of Congress, NCSU): • State geospatial organizations • State Archives • State-to-state and geo-to-Archives collaboration • Organizational and technical diversity across states • Archives as part of spatial data infrastructure • Selection and appraisal processes • Retention schedule development • Data transfer to archives • Development of enhanced business cases
NCGDAP Learning Outcomes • Preservation of GIS projects is needed to support re-creation of past work • Preservation of data representations is needed to document decision-making processes • Validation, remediation, and conversion of data and metadata are expensive: push for improvements upstream • Some repositories handle “items”: can result in “atomization” of data • For vendors, frame data preservation as a “customer problem”; must build the business case
Thank You! Steve Morris Head, Digital Library Initiatives North Carolina State University Libraries steven_morris@ncsu.edu North Carolina Geospatial Data Archiving Project http://www.lib.ncsu.edu/ncgdap GeoMAPP http://www.geomapp.net
Draft of Utah’s GIS to Archives Data Flow
• All metadata is completed to FGDC standards
• AGRC creates geoPDF files of individual datasets, plus ZIP files of the native format.
• One ZIP file would contain all the pieces belonging to one shapefile or, alternatively, the file would contain a geodatabase.
• Geodatabases would not be just one big database with everything in it (multiple series and years).
• Instead, the native files would be composed of a single downloadable file per series per year.
Data flow:
• AGRC exports data from SGID and splits out datasets by series. (Metadata occasionally incomplete)
• Local governments supply GIS datasets on CD/DVD to AGRC. (Metadata often missing)
• AGRC copies these files to Archives’ FTP server.
• Metadata harvested to populate Archive’s Finding Aids
Example FTP Site Structure: ftp.archives-agrc.utah.gov/Archives
• Biota (Dublin Core Metadata)
• Boundaries (Dublin Core Metadata)
  • MunicipalityRecords-Series-26846 (Dublin Core Metadata)
    • 2000
      • MunicipalBoundaries.zip (FGDC Metadata)
      • MunicipalBoundaries.pdf (FGDC Metadata)
    • 2001
    • 2002
    • 2003
  • CountyBoundaries-Series-26845 (Dublin Core Metadata)
    • 2003
    • 2004
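The "one file per series per year" layout in the example FTP structure can be expressed as a small path-building helper. This is only a sketch of the naming pattern visible in the draft; the `archive_paths` function and its parameter names are hypothetical and not part of AGRC's or the Archives' tooling.

```python
from pathlib import PurePosixPath

def archive_paths(category, series_name, series_number, year, dataset):
    """Return the ZIP and geoPDF paths for one dataset in one series and year."""
    # Folder pattern from the draft: <category>/<name>-Series-<number>/<year>/
    folder = (PurePosixPath(category)
              / f"{series_name}-Series-{series_number}"
              / str(year))
    return [str(folder / f"{dataset}.zip"), str(folder / f"{dataset}.pdf")]

paths = archive_paths("Boundaries", "MunicipalityRecords", 26846,
                      2000, "MunicipalBoundaries")
```

Deriving paths from the series number and year keeps each yearly snapshot separate, which matches the draft's goal of avoiding one large database that mixes series and years.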
Kentucky Metadata Workflow into DSpace and iRODS Environment • Diagram labels: UNC, other, KDLA • Database with Dublin Core descriptive and administrative metadata • Metadata & content entered by agencies using template and modified by Archivist • DSpace: single item & batch ingest into DSpace by Archivist • Database with administrative & preservation metadata • Content files • iRODS: batch metadata extraction using iRODS rules • Preservation metadata from iRODS rules • Distributed storage layer
Source Metadata Translation • Hub-and-spoke model a la Echo DEPository • repository agnostic • modular conversion hub • facilitate repository software migration & inter-archive exchange
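A hub-and-spoke translator can be sketched in a few lines: each metadata format supplies one converter into a neutral hub record and one converter out of it, so adding a format means writing two functions rather than one per pair of formats. The field names and the `translate` helper below are illustrative assumptions, not the Echo DEPository design or any real schema mapping.

```python
# Spokes: each format converts to and from a neutral hub record (a plain dict).
def fgdc_to_hub(record):
    return {"title": record["title"], "creator": record["origin"]}

def hub_to_fgdc(hub):
    return {"title": hub["title"], "origin": hub["creator"]}

def dc_to_hub(record):
    return {"title": record["dc:title"], "creator": record["dc:creator"]}

def hub_to_dc(hub):
    return {"dc:title": hub["title"], "dc:creator": hub["creator"]}

TO_HUB = {"fgdc": fgdc_to_hub, "dc": dc_to_hub}
FROM_HUB = {"fgdc": hub_to_fgdc, "dc": hub_to_dc}

def translate(record, source, target):
    """Convert a record between formats by passing through the hub."""
    hub_record = TO_HUB[source](record)
    return FROM_HUB[target](hub_record)

# Sample record with simplified, hypothetical field names.
fgdc_record = {"title": "Wake County Streets", "origin": "Wake County"}
dc_record = translate(fgdc_record, "fgdc", "dc")
```

Because no converter knows about any repository, the same hub can serve both repository software migration and inter-archive exchange.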
GeoMAPP: Geospatial Multistate Archival and Preservation Partnership • Lead organizations: North Carolina Center for Geographic Information & Analysis (NCCGIA), State Archives of NC, with Library of Congress • Partners: • State geospatial organizations of Kentucky and Utah • State Archives of Kentucky and Utah • NCSU Libraries in catalytic/advisory role • State-to-state and geo-to-Archives collaboration • 2 year project: Nov. 2007-Dec. 2009 • Archives as part of Spatial Data Infrastructure
GeoMAPP: Project Components • Introduce GIS organizations and State Archives to each other • Archival selection and appraisal processes • Retention schedule development • Data transfer to archives • Development of enhanced business case
NC Geospatial Data Archiving Project (NCGDAP) • Repository Goal • Capture at-risk data • Explore technical and organizational challenges • Project End Goal • Data Producers: Improved temporal data management practices • Archives: More efficient means of acquiring and preserving data; Progress towards best practices • Temporal data management vs. long-term preservation
Geospatial Data Preservation Challenges • Data capture • Backups are common, but not long-term archives • Producer focus on current data • Shift to web services-based access • Inadequate or non-existent metadata • Consistent NC survey statistics: Only 40% of data producers create and maintain metadata • Existing metadata often needs to be normalized, synchronized with the data, and remediated • Loss of memory about the data is also a problem
Ongoing Challenges • When to automate and when not to • Learn first from human intervention • Minimizing risk of error related to human intervention • Accepting that ingest packages used will evolve over time (implications for archive?) • Handling post-ingest migrations
Challenge: Preservation Metadata Results from a 2006 survey of all 100 NC counties and 25 largest NC municipalities
Some Key Metadata Decisions • Capture “transfer set” metadata • Normalize, synchronize, and remediate existing metadata, and retain original metadata record • Treat contact information as archival • Update metadata with format conversions • Use ESRI Profile of FGDC • added technical and administrative elements • Has an XML schema • ArcCatalog tool support • Use simple rights encoding scheme • Record metadata in a workflow management database
Digital Preservation in State Government - Wilmington SIP Item Creation: Workflow • Submission Information Package grouping • Ontology logic based on defined multi-file complex format components and directory structure • Repository-agnostic item grouping
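One way to picture item grouping by format components and directory structure is the sketch below, which collects the component files of a shapefile into a single item keyed by directory and base name. The extension list and the `group_items` helper are assumptions for illustration; the project's actual grouping logic is not described in the slides.

```python
from collections import defaultdict
from pathlib import PurePosixPath

# Common shapefile component extensions (an illustrative, partial list).
SHAPEFILE_EXTS = {".shp", ".shx", ".dbf", ".prj", ".sbn", ".sbx", ".cpg"}

def group_items(paths):
    """Group file paths into items keyed by (directory, base name)."""
    groups = defaultdict(list)
    for raw in paths:
        path = PurePosixPath(raw)
        name = path.name
        if name.lower().endswith(".shp.xml"):
            # Sidecar metadata such as "streets.shp.xml" belongs with "streets.shp".
            stem = name[:-len(".shp.xml")]
        elif path.suffix.lower() in SHAPEFILE_EXTS:
            stem = path.stem
        else:
            stem = name  # unrecognized files become single-file items
        groups[(str(path.parent), stem)].append(raw)
    return {key: sorted(files) for key, files in groups.items()}

items = group_items([
    "wake/streets.shp", "wake/streets.shx", "wake/streets.dbf",
    "wake/streets.shp.xml", "wake/readme.txt",
])
```

Because the result is just a mapping from item keys to file lists, it stays independent of any particular repository's item model.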
Metadata Overview • Federal Geographic Data Committee (FGDC) Content Standard for Digital Geospatial Metadata (CSDGM) • Version one (1994) mandated for use by federal agencies • Descriptive metadata, plus some administrative and technical • Extensive use at state level, spotty use at local level • Problem: content standard without an encoding spec • FGDC profiles: ESRI, NBII, Remote Sensing, etc. • ISO Standards • ISO 19115: Geographic Information – Metadata (2003) • ISO 19139: Geographic Information – Metadata – XML schema implementation (2007) • North American Profile of ISO to replace FGDC CSDGM