Preservation of Digital Geospatial Data: Challenges and Opportunities
Steve Morris, Head of Digital Library Initiatives, North Carolina State University Libraries
NARA Meeting, Dec. 14, 2005
Outline
• Digital Geospatial Data: Types
• Risks to Digital Geospatial Data
• Overview of NC Geospatial Data Archiving Project
• Preservation Challenges and Possible Solutions
Note: Percentages based on the actual number of respondents to each question
Geospatial data types: Vector data
Geospatial data types: Satellite imagery
Geospatial data types: Aerial imagery
Geospatial data types: Tabular data (w/vector)
Time series – vector data: Parcel Boundary Changes 2001-2004, North Raleigh, NC
Time series – Ortho imagery: Vicinity of Raleigh-Durham International Airport 1993-2002
Today’s geospatial data as tomorrow’s cultural heritage
Risks to Digital Geospatial Data
.shp .mif .gml .e00 .dwg .dgn .bsb .bil .sid
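To make the format diversity above concrete, here is a minimal Python sketch that tallies the files in a collection by extension. The format labels are the usual associations for these extensions rather than an authoritative registry, and the function name `inventory_formats` and the sample file names are hypothetical.

```python
from collections import Counter
from pathlib import PurePath

# Usual associations for the extensions listed on the slide.
# These labels are descriptive only, not an authoritative format registry.
FORMAT_LABELS = {
    ".shp": "ESRI Shapefile",
    ".mif": "MapInfo Interchange Format",
    ".gml": "Geography Markup Language",
    ".e00": "ArcInfo interchange (export) file",
    ".dwg": "AutoCAD drawing",
    ".dgn": "MicroStation design file",
    ".bsb": "BSB raster nautical chart",
    ".bil": "Band-interleaved-by-line raster",
    ".sid": "MrSID compressed raster",
}


def inventory_formats(filenames):
    """Count files per known format; unknown extensions are grouped together."""
    counts = Counter()
    for name in filenames:
        ext = PurePath(name).suffix.lower()  # case-insensitive match
        counts[FORMAT_LABELS.get(ext, "unrecognized")] += 1
    return counts


# Hypothetical file names, for illustration only.
sample = ["parcels.shp", "roads.SHP", "ortho_1993.sid", "zoning.e00", "notes.txt"]
format_counts = inventory_formats(sample)
for label, n in sorted(format_counts.items()):
    print(f"{label}: {n}")
```

A tally like this only identifies what is present; judging which formats are at risk still depends on the sustainability factors discussed later in the deck.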
Risks to Digital Geospatial Data
• Producer focus on current data
  • Time-versioned content generally not archived
• Future support of data formats in question
  • Vast range of data formats in use, often complex
• Shift to “streaming data” for access
  • Archives have been a by-product of providing access
• Preservation metadata requirements
  • Descriptive, administrative, technical, DRM
• Geodatabases
  • Complex functionality
NC Geospatial Data Archiving Project
• Partnership between university library (NCSU) and state agency (NCCGIA)
• Focus on state and local geospatial content in North Carolina (state demonstration)
• Tied to NC OneMap initiative, which provides for seamless access to data, metadata, and inventory information
• Objective: engage existing state/federal geospatial data infrastructures in preservation
Targeted Content
• Resource Types
  • GIS “vector” (point/line/polygon) data
  • Digital orthophotography
  • Digital maps
  • Tabular data (e.g. assessment data)
• Content Producers
  • Mostly state, local, regional agencies
  • Some university, not-for-profit, commercial
  • Selected local federal projects
Local Government GIS: Archival Issues
• Data resources are highly distributed and subject to frequent update
• More detailed, current, accurate than federal/state data resources
• North Carolina local agency GIS environment
  • 100 counties, 95 with GIS
  • 85 counties with high resolution orthophotography
  • Growing number of municipal systems
• Value: $162 million plus investment (est. in 2003)
Work Plan in a Nutshell
• Work from existing data inventories
• NC OneMap Data Sharing Agreements as the “blanket”, individual agreements as the “quilt”
• Partnership: work with existing geospatial data infrastructures (state and federal)
• Technical approach
  • METS with FGDC, PREMIS?, GeoDRM?
  • DSpace now; re-ingest to different environment
  • Web services consumption for archival development
NCGDAP Philosophy of Engagement
• Provide feedback to producer organizations; inform state geospatial infrastructure
• Take the data in the manner in which it can be obtained
• “Wrangle” and archive data
Note the ‘Project’ in ‘North Carolina Geospatial Data Archiving Project’: the process, the learning experience, and the engagement with geospatial data infrastructures are more important than the archive.
Big Challenges
• Format migration paths
• Management of data versions over time
• Preservation metadata
• Harnessing geospatial web services
• Preserving cartographic representation
• Keeping content repository-agnostic
• Preserving geodatabases
• More …
Vector Data Format Issues
• Vector data much more complicated than image data
• ‘Archiving’ vs. ‘Permanent access’
  • An ‘open’ pile of XML might make an archive, but if using it requires a team of programmers to do digital archaeology, then it does not provide permanent access
  • Piles of XML need to be widely understood piles
  • GML: need widely accepted application schemas (like OSMM?)
• The Geodatabase conundrum
  • Export feature classes, and lose topology, annotation, relationships, etc.
  • … or use the Geodatabase as the primary archival platform (some are now thinking this way)
GIS Software Used: NC Local Agencies
Source: NC OneMap Data Inventory 2004
Vector Data Format Options
• Option A: use an open format and accept a really unfortunate transformation and limited vendor support for the output object
• Option B: use a closed format but retain the original content and count on short- and medium-term vendor support
• Option C: do both to buy time and look for an open, ASCII-based solution (watch GML activity)
No sweet spot, just an evolving and changing mix of flawed options that are used in combination.
Geography Markup Language Issues
• GML still more useful as a transfer format than an archival format; support limited even for transfer
• “Permanent access” requirements:
  • profiles and application schemas widely understood and supported; avoid requiring “digital archaeology”
  • role of GML Simple Features Profile?
• Assessing formats for preservation: sustainability factors, quality & functionality factors
  • Apply same approach to GML profiles and application schemas?
Geography Markup Language Issues
• Plans for environmental scan of existing GML profiles and application schemas
  • schema name (e.g. OSMM, top10NL, ESRI GML, LandGML)
  • responsible agency; schema has official government status?
  • GML version; known unsupported GML components
  • schema history; known interoperation with other schemas
  • vendor support; translator support; stability over time
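One way to keep such a scan consistent is to record each schema against a fixed set of fields. The sketch below is only an assumption about how that could look: the class name `GMLSchemaScanRecord` and its field names are hypothetical, derived from the attributes listed above, and every value other than the schema names is deliberately left unfilled.

```python
from dataclasses import dataclass, field, fields
from typing import List, Optional


@dataclass
class GMLSchemaScanRecord:
    """One row of a hypothetical environmental scan of GML schemas."""
    schema_name: str
    responsible_agency: Optional[str] = None
    official_government_status: Optional[bool] = None
    gml_version: Optional[str] = None
    unsupported_gml_components: List[str] = field(default_factory=list)
    schema_history: Optional[str] = None
    interoperating_schemas: List[str] = field(default_factory=list)
    vendor_support: Optional[str] = None
    translator_support: Optional[str] = None
    stability_over_time: Optional[str] = None

    def unanswered(self) -> List[str]:
        """Names of fields the scan has not filled in yet."""
        return [
            f.name for f in fields(self)
            if getattr(self, f.name) is None or getattr(self, f.name) == []
        ]


# Start with the schema names mentioned on the slide; everything else
# is left blank until it has actually been researched.
scan = [GMLSchemaScanRecord(name)
        for name in ("OSMM", "top10NL", "ESRI GML", "LandGML")]
print(scan[0].schema_name, "still needs:", scan[0].unanswered())
```

Leaving unknown values empty, rather than guessing, keeps the scan honest about what has and has not been verified for each schema.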
Managing Time-versioned Content
• Many local agency data layers continuously updated
  • E.g., some county cadastral data updated daily; older versions not generally available
• Individual versioned datasets will wander off from the archive
  • How do users “get current metadata/DRM/object” from a versioned dataset found “in the wild”?
  • How do we certify concurrency and agreement between the metadata and the data?
Managing Time-versioned Content
• Can we manage the relationship loosely using a persistent identifier link to a parent object?
[Diagram labels: Persistent ID Resolver; Parent Object Manager; five version objects]
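The loose coupling asked about here can be sketched as a lookup from a persistent identifier to a parent object that knows its versions. This is a minimal in-memory illustration, not the project's design: the class names, the identifier `example:county-parcels`, and the version dates are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ParentObject:
    """Manages the dated versions of one dataset and its current metadata."""
    persistent_id: str
    current_metadata: str
    versions: List[str] = field(default_factory=list)  # ISO dates, YYYY-MM-DD

    def latest_version(self) -> Optional[str]:
        # ISO-formatted dates sort correctly as plain strings.
        return max(self.versions) if self.versions else None


class PersistentIDResolver:
    """Maps a persistent identifier carried by any version to its parent."""

    def __init__(self) -> None:
        self._parents: Dict[str, ParentObject] = {}

    def register(self, parent: ParentObject) -> None:
        self._parents[parent.persistent_id] = parent

    def resolve(self, persistent_id: str) -> Optional[ParentObject]:
        return self._parents.get(persistent_id)


# Hypothetical identifier and version dates, for illustration only.
resolver = PersistentIDResolver()
resolver.register(ParentObject(
    persistent_id="example:county-parcels",
    current_metadata="(current FGDC record would go here)",
    versions=["2001-06-30", "2002-06-30", "2004-06-30"],
))

# A version found "in the wild" only needs to carry the persistent ID.
parent = resolver.resolve("example:county-parcels")
print(parent.latest_version())
```

The point of the indirection is that a stray copy of any version needs to carry only the identifier; current metadata and rights information stay with the parent object and can change without touching the copies.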
Preservation Metadata Issues
• FGDC Metadata
  • Many flavors; incoming metadata needs processing
  • Cross-walk elements to PREMIS, MODS?
• Metadata wrapper/Content packaging
  • METS (Metadata Encoding and Transmission Standard) vs. other industry solutions
  • Need a geospatial industry solution for the ‘METS-like problem’
  • GeoDRM a likely trigger: wrapper to enforce licensing (MPEG-21 references in OGIS Web Services 3)
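As a rough picture of what "METS with FGDC" packaging involves, the following Python sketch builds a skeletal METS document that wraps an FGDC record and points at one data file. It uses only the standard library and is not validated against the METS schema; the FGDC content is a stand-in containing just a title, and the title and file location passed in are hypothetical.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)


def m(tag: str) -> str:
    """Qualify a tag name with the METS namespace."""
    return f"{{{METS_NS}}}{tag}"


def build_mets(fgdc_title: str, data_href: str) -> ET.Element:
    mets = ET.Element(m("mets"))

    # Descriptive metadata section wrapping the FGDC record.
    dmd = ET.SubElement(mets, m("dmdSec"), ID="dmd1")
    wrap = ET.SubElement(dmd, m("mdWrap"), MDTYPE="FGDC")
    xml_data = ET.SubElement(wrap, m("xmlData"))
    # Stand-in for a full FGDC record: only the citation title is included.
    metadata = ET.SubElement(xml_data, "metadata")
    idinfo = ET.SubElement(metadata, "idinfo")
    citeinfo = ET.SubElement(ET.SubElement(idinfo, "citation"), "citeinfo")
    ET.SubElement(citeinfo, "title").text = fgdc_title

    # File section pointing at the archived data file.
    file_grp = ET.SubElement(ET.SubElement(mets, m("fileSec")), m("fileGrp"))
    file_el = ET.SubElement(file_grp, m("file"), ID="file1")
    ET.SubElement(file_el, m("FLocat"),
                  {"LOCTYPE": "URL", f"{{{XLINK_NS}}}href": data_href})

    # Structural map tying the file to its descriptive metadata.
    div = ET.SubElement(ET.SubElement(mets, m("structMap")), m("div"),
                        DMDID="dmd1")
    ET.SubElement(div, m("fptr"), FILEID="file1")
    return mets


# Hypothetical title and file location.
doc = build_mets("Example county parcels, 2004",
                 "file:///archive/parcels_2004.zip")
print(ET.tostring(doc, encoding="unicode"))
```

PREMIS or rights metadata would presumably sit in additional administrative metadata sections, which this sketch leaves out.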
Metadata Availability
Harnessing Geospatial Web Services
Geospatial Web Service Types
• Image services
  • Deliver image resulting from query against underlying data
  • Limited opportunity for analysis
• Feature services
  • Stream actual feature data, greater opportunity for data analysis
• Other
  • Geocoding services
  • Routing
  • etc.
Geospatial Web Services Rights Issues
Example: Desktop GIS-accessible ArcIMS
• 39 of 100 NC counties have desktop GIS-accessible ArcIMS services
• It is difficult to know how many of these counties actually expect users to either:
  • A) access data through desktop GIS for viewing only, or
  • B) extract and download data
Harnessing Geospatial Web Services
• Automated content identification
  • ‘capabilities files,’ registries, catalog services
• WMS (Web Map Service) for batch extraction of image atlases
  • last ditch capture option
  • preserve cartographic representation
  • retain records of decision-making process
  • … feature services (WFS) later.
• Rights issues in the web services space are ambiguous
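Batch extraction of an image atlas from a WMS server can be sketched as splitting an extent into tiles and issuing one GetMap request per tile. The Python below only builds the request URLs (WMS 1.1.1 parameter names) and makes no network calls. The server URL, layer name, and extent are hypothetical placeholders, and whether a given service permits this kind of harvesting is exactly the rights question raised above.

```python
from urllib.parse import urlencode


def tile_bboxes(minx, miny, maxx, maxy, cols, rows):
    """Yield (minx, miny, maxx, maxy) tiles, row by row from the bottom-left."""
    dx = (maxx - minx) / cols
    dy = (maxy - miny) / rows
    for r in range(rows):
        for c in range(cols):
            yield (minx + c * dx, miny + r * dy,
                   minx + (c + 1) * dx, miny + (r + 1) * dy)


def getmap_url(base_url, layer, bbox, width=1024, height=1024,
               srs="EPSG:4326", image_format="image/png"):
    """Build a WMS 1.1.1 GetMap request URL for one tile."""
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.1.1",
        "REQUEST": "GetMap",
        "LAYERS": layer,
        "STYLES": "",  # empty value requests the server's default style
        "SRS": srs,
        "BBOX": ",".join(f"{v:.6f}" for v in bbox),
        "WIDTH": width,
        "HEIGHT": height,
        "FORMAT": image_format,
    }
    return f"{base_url}?{urlencode(params)}"


# Hypothetical server, layer, and extent, for illustration only.
urls = [
    getmap_url("https://gis.example.org/wms", "parcels", bbox)
    for bbox in tile_bboxes(-79.0, 35.5, -78.0, 36.0, cols=2, rows=2)
]
for url in urls:
    print(url)
```

In practice the layer names and supported coordinate systems would come from the service's capabilities file rather than being hard-coded.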
“Web mash-ups” and the New Mainstream Geospatial Web Services
Preserving Cartographic Representation
• The true counterpart of the old map is not the GIS dataset, but rather the cartographic representation that builds on that data:
  • Intellectual choices about symbolization, layer combinations
  • Data models, analysis, annotations
• Cartographic representation typically encoded in proprietary files (.avl, .lyr, .apr, .mxd) that do not lend themselves well to migration
• Symbologies have meaning to particular communities at particular points in time; preserving information about symbol sets and their meaning is a different problem
Preserving Cartographic Representation
• Image-based approaches
  • Generate images using Map Book or similar tools
  • Harvest existing atlas images
  • Capture atlases from WMS servers
  • Export ‘layouts’ or ‘maps’ to image
• Vector-based approaches
  • Store explicitly in the data format (e.g. Feature Class Representation in ArcGIS 9.2)
  • Archive and upward-migrate existing files (.avl, .apr, .lyr, .mxd, etc.)
  • SVG, VML or other XML approaches
• Other?
Repository Architecture Issues
• Interest in how geospatial content interacts with widely available digital repository software
  • Focus on salient, domain-specific issues
• Challenge: remain repository agnostic
  • Avoid “imprinting” on repository software environment
  • Preservation package should not be the same as the ingest object of the first environment
  • Tension between exploiting repository software features vs. becoming software dependent
Preserving Geodatabases
• Spatial databases in general vs. ESRI Geodatabase “format”
  • Not just data layers and attributes, but also topology, annotation, relationships, behaviors
• ESRI Geodatabase archival issues
  • XML Export, Geodatabase History, File Geodatabase, Geodatabase Replication
• Some looking to Geodatabase as archival platform (in addition to feature class export)
Geodatabase Availability
• Local agencies, especially municipalities, are increasingly turning to the ESRI Geodatabase format to manage geospatial data.
• According to the 2003 Local Government GIS Data Inventory, 10.0% of all county framework data and 32.7% of all municipal framework data were managed in that format.
Evolving Geodatabase Handling Approaches
Efficient Content Replication
• Content replication also needed for:
  • Disaster preparedness
  • State and federal data improvement projects
  • Aggregation by regional geospatial web service providers
• WFS, e.g.: efficiency in complete content transfer?
• Rsync-like function, plus: rights management, inventory processes, metadata management, informed by data update cycles
• Archiving delta files vs. complete replication: need to avoid requiring “digital archaeology” in the future
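The "rsync-like function" can be illustrated with a much simpler stand-in: compare checksum manifests from two snapshots and transfer only what changed. This sketch works at whole-file level with SHA-256, whereas rsync itself compares blocks within files; the file names and contents are hypothetical and held in memory for the example.

```python
import hashlib


def checksum(data: bytes) -> str:
    """SHA-256 hex digest of a file's content."""
    return hashlib.sha256(data).hexdigest()


def build_manifest(files):
    """Map each file name to the checksum of its content."""
    return {name: checksum(content) for name, content in files.items()}


def diff_manifests(old, new):
    """Report which files were added, removed, or changed between snapshots."""
    old_names, new_names = set(old), set(new)
    return {
        "added": sorted(new_names - old_names),
        "removed": sorted(old_names - new_names),
        "changed": sorted(n for n in old_names & new_names if old[n] != new[n]),
    }


# Hypothetical snapshots of a producer's holdings, as in-memory bytes.
snapshot_2003 = {
    "parcels.shp": b"parcel geometry v1",
    "roads.shp": b"road geometry v1",
    "zoning.shp": b"zoning geometry v1",
}
snapshot_2004 = {
    "parcels.shp": b"parcel geometry v2",   # updated
    "roads.shp": b"road geometry v1",       # unchanged
    "streams.shp": b"stream geometry v1",   # new layer
}

delta = diff_manifests(build_manifest(snapshot_2003),
                       build_manifest(snapshot_2004))
print(delta)
```

Replicating whole changed files, rather than storing binary deltas, leaves each archived version usable on its own, which is one way to avoid the future "digital archaeology" the slide warns about, at the cost of more storage.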