Investigating Metadata for Long-Lived Geospatial Resources: An Exploration
By Nancy J. Hoebelheinrich
Metadata Coordinator, Digital Library Systems & Services
Stanford Digital Repository
NDIIPP Annual Partners Mtg, Arlington VA, 9 July 2008

To Be Discussed
• The Study
• Question asked
• Methodology used
• MD standards’ strengths / weaknesses
• Conclusions & Recommendations
• Future work needed

The Study
• NGDA Project:
  • Pertinent project objective:
    • Collect and archive major segments of at-risk digital geospatial data and images
  • Partners / Backgrounds / Areas of experience
    • UCSB: Alexandria Digital Library / Presentation
    • Stanford Libraries: Stanford Digital Repository / Preservation
• Differences in experiences gave rise to study question

Study Question
• What metadata is needed for long-lived geospatial data formats?
• Grounded in previous studies
  • Hunolt paper for USGCRP Office
  • Digital Preservation Coalition (UK)
  • NSF
  • OAIS Reference model
  • Duerr, Parsons articles
  • OCLC / RLG Preservation studies

Methodology used
• Evaluate four fairly typical geospatial data formats
  • Shapefiles, DOQQs, DRGs, Landsat 7 satellite images (preliminary)
• Compare / contrast 3 different approaches to documenting
  • FGDC Content Standard
  • CIESIN Geospatial Electronic Records (GER)
  • PREMIS

Categories of information about the resources
• Environment (computer platforms)
• Semantic Underpinnings
• Domain specific terminology
• Provenance
• Data trustworthiness
• Data quality
• Appropriate use

Environment (computing platform)
• Definition: characteristics of the hw / sw configuration that allow a resource to function properly
• Function could be defined as:
  • Rendering
  • Viewing
  • Using
• May need to be repeated

Environment, cont.
• All 3 systems have means for documenting these characteristics
• Both PREMIS and GER provide more granularity & parsability, e.g., creatingApplication, sw, hw name, versions; dependencies, environment type, etc.
• FGDC uses: “technical prerequisites” & “native data set”

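To make the granularity point concrete, here is a minimal Python sketch of an environment record with separately parsable fields. The field names loosely echo PREMIS semantic units (environmentPurpose, swName, swVersion, hwName, dependency) as best understood here, not a verified schema, and every software and hardware value is a hypothetical placeholder. Because a resource may need one environment for rendering and another for using, the record is kept repeatable.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Software:
    swName: str
    swVersion: str
    swType: str


@dataclass
class Hardware:
    hwName: str
    hwType: str


@dataclass
class Environment:
    # What the environment is for, e.g. "render", "view", "use".
    environmentPurpose: List[str]
    software: List[Software] = field(default_factory=list)
    hardware: List[Hardware] = field(default_factory=list)
    dependency: List[str] = field(default_factory=list)


def environments_for(purpose: str, environments: List[Environment]) -> List[Environment]:
    """Return every recorded environment that supports the given purpose."""
    return [env for env in environments if purpose in env.environmentPurpose]


# Hypothetical values: none of these names or versions come from the study.
shapefile_environments = [
    Environment(
        environmentPurpose=["render", "view"],
        software=[Software("ExampleViewer", "1.0", "renderer")],
        hardware=[Hardware("generic x86 workstation", "processor")],
    ),
    Environment(
        environmentPurpose=["use"],
        software=[Software("ExampleGIS", "9.2", "application")],
        dependency=["projection definition file"],
    ),
]

for env in environments_for("use", shapefile_environments):
    print([(sw.swName, sw.swVersion) for sw in env.software])
```

A free-text field such as FGDC's “technical prerequisites” can carry the same information, but a query like `environments_for("use", ...)` is only possible when the parts are stored separately.
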
Semantic Underpinnings
• Detailed concepts:
  • Meaning or essence of data
  • Significance of data, i.e., why preserve it?
  • Purpose or function served by data
  • Intended community
• FGDC & GER have fairly extensive set, particularly GER
• PREMIS is not “descriptive” or domain specific, so not covered

Domain specific terminology
• For geospatial, particularly valuable:
  • Keywords associated with data themes
  • Spatial coverage
  • Time period
  • Stratum coverage / place names
• GER & FGDC cover
• PREMIS is not descriptive MD, so not covered

Provenance
• Detailed concepts:
  • Info about the events, parameters & source data associated w/ construction of data set prior to ingestion
  • Source of data
  • Changes made to data inside the preservation archive
• FGDC, GER, PREMIS all ok for 1st 2
• FGDC NOT for last

Provenance, cont.
• Greater level of granularity / parsability in PREMIS using Object, Event & Agent entities
• See Example for Rumsey Historical Map Image Collection about descriptive MD transformation

Use of PREMIS Event Data Elements
• Example Event 1: Transform of descriptive MD from MS Access db => XML => MODS
• Why this event?
  • In case of questions from outside data provider
  • Retain singular scripts & transform mechanisms

PREMIS Event Excerpt (v1.1)

Use of PREMIS Event Data Elements
• Example Event 2: Merge c:\temp\states1; c:\temp\states2; c:\temp\USA (includes process = “merge” and data sources)
• Advantage: can describe events once in repository, unlike FGDC, but can include if prior to ingestion?
• Why this event?
  • Important to describe processes during different phases of lifecycle, even prior to ingestion
  • Not being able to do so is problematic for geospatial resources
  • Is best practice issue for this domain

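As an illustration only, the two example events could be recorded along the following lines. This is a sketch under stated assumptions, not the excerpt shown in the talk: the element names follow PREMIS Event semantic units as best understood here, the XML is left un-namespaced, and the identifiers, agent names, and dates are hypothetical placeholders. The slide does not say which of the three merge paths are inputs and which is the output, so all three are simply linked as objects.

```python
import xml.etree.ElementTree as ET


def build_event(event_id, event_type, date_time, detail, object_ids, agent_id):
    """Build an un-namespaced, PREMIS-style <event> element."""
    event = ET.Element("event")

    identifier = ET.SubElement(event, "eventIdentifier")
    ET.SubElement(identifier, "eventIdentifierType").text = "local"
    ET.SubElement(identifier, "eventIdentifierValue").text = event_id

    ET.SubElement(event, "eventType").text = event_type
    ET.SubElement(event, "eventDateTime").text = date_time
    ET.SubElement(event, "eventDetail").text = detail

    # The agent (person, script, or software) responsible for the event.
    agent = ET.SubElement(event, "linkingAgentIdentifier")
    ET.SubElement(agent, "linkingAgentIdentifierType").text = "local"
    ET.SubElement(agent, "linkingAgentIdentifierValue").text = agent_id

    # One link per object the event touched.
    for object_id in object_ids:
        obj = ET.SubElement(event, "linkingObjectIdentifier")
        ET.SubElement(obj, "linkingObjectIdentifierType").text = "local"
        ET.SubElement(obj, "linkingObjectIdentifierValue").text = object_id

    return event


# Event 1: metadata transformation (identifiers, agent, and date are placeholders).
event1 = build_event(
    event_id="event-1",
    event_type="transformation",
    date_time="2008-01-01T00:00:00",
    detail="Transform of descriptive MD from MS Access db => XML => MODS",
    object_ids=["hypothetical-descriptive-md-record"],
    agent_id="hypothetical-transform-script",
)

# Event 2: the merge, possibly performed before ingestion.
event2 = build_event(
    event_id="event-2",
    event_type="merge",
    date_time="2008-01-01T00:00:00",
    detail='process = "merge"',
    object_ids=[r"c:\temp\states1", r"c:\temp\states2", r"c:\temp\USA"],
    agent_id="hypothetical-data-creator",
)

print(ET.tostring(event2, encoding="unicode"))
```

Because each event is its own record linked to objects and agents, a pre-ingest step such as the merge can sit beside in-archive events in one history.
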
Data trustworthiness
• Detailed concepts:
  • Who are parties responsible for creation, development, storage, maintenance of data set
  • Where is data located
  • How is data available
  • What important factors about the data should be preserved

Data trustworthiness, cont.
• Parties
  • FGDC, only “originator”; GER & PREMIS more granular & parsable
• Location of data / Factors to preserve
  • GER & FGDC seem more specific & less inclusive (only POV of “distributor”) for last 2

Data Quality
• Detailed concepts:
  • General condition statement
  • Accuracy of the data
  • Fidelity of relationships within the data set
  • Accuracy of measurements of the data
• FGDC: has tags, but they are very specific
• GER: not much coverage here

Appropriate use
• Detailed concepts:
  • Legal use and liability statements
    • FGDC & PREMIS have, GER NOT
  • Technical characteristics that impact use
    • FGDC NOT, GER & PREMIS have means of linking to format registry info
• More about format registries, later

PREMIS & “Significant properties”
• Way within PREMIS to document:
  • Data trustworthiness: data creator / provider reliable = “authentic”
  • Data quality: describing completeness, logical consistency, attribute accuracy
  • Data Provenance: processes & sources for dataset = “understandable & reliable”
  • Appropriate use: understanding of the specific needs of the “designated community”?
  • Other important factors to preserve
• More work needs to be done in this area

Strengths & weaknesses: FGDC
• Strengths
  • Rich in detail
  • Specificity for the geospatial domain
  • Ubiquity
• Weaknesses
  • Very complex & laborious to complete
  • Poor means for describing relationships among file components of a digital resource
  • No way to describe digital resource once within preservation archive

Strengths & weaknesses: GER
• Strengths
  • Focus on archiving
  • Comprehensive
• Weaknesses
  • Little known as yet
  • No data dictionary, so unclear how to apply tags (cardinality, repeatability, etc.)
  • Relational DB format
  • Unclear if and/or how to describe digital resource once within preservation archive

Strengths, weaknesses: PREMIS
• Strengths
  • Applicable at many levels of digital resource: abstract & physical
  • Capability for describing relationships among file components of digital resource
  • Capability for describing digital resource during its entire lifecycle within the preservation archive
  • Generic & focused upon preservation
• Weaknesses
  • Not specific enough for geospatial
  • Does not include critical semantics or “descriptive” information important for using digital resource
  • Fairly young specification; unclear how to document “significant properties” for digital resources

Recommendations
• Use of content standard (e.g., FGDC, or ISO when it replaces FGDC)
  • Best used for semantics, domain specific terminology
• PREMIS
  • Best used for management of resources over time using
    • PREMIS Object entity
    • PREMIS Event entity
    • PREMIS Agent entity
• Useful to package resources & metadata together to facilitate tracking of aggregation of resource(s), MD & resource structure & file inventory, e.g., METS

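The packaging recommendation can be sketched as a skeletal METS wrapper. This is an assumption-laden outline rather than a validated profile: it uses standard METS element names, puts the content-standard record in a dmdSec and PREMIS records in an amdSec (one common convention, not something the study prescribes), leaves the embedded metadata empty, and uses hypothetical shapefile component names.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("mets", METS_NS)
ET.register_namespace("xlink", XLINK_NS)


def m(tag):
    """Qualify a tag name with the METS namespace."""
    return f"{{{METS_NS}}}{tag}"


def add_wrapped_md(parent, section_tag, section_id, md_type):
    """Add a metadata section holding an empty mdWrap/xmlData placeholder."""
    section = ET.SubElement(parent, m(section_tag), ID=section_id)
    wrap = ET.SubElement(section, m("mdWrap"), MDTYPE=md_type)
    ET.SubElement(wrap, m("xmlData"))  # the actual record would be embedded here
    return section


def build_package(file_paths):
    mets = ET.Element(m("mets"))

    # Descriptive, domain-specific metadata (semantics, terminology).
    add_wrapped_md(mets, "dmdSec", "DMD1", "FGDC")

    # Preservation metadata: PREMIS Object, then PREMIS Events and Agents.
    amd = ET.SubElement(mets, m("amdSec"), ID="AMD1")
    add_wrapped_md(amd, "techMD", "TECH1", "PREMIS")
    add_wrapped_md(amd, "digiprovMD", "PROV1", "PREMIS")

    # File inventory and structure tying the components together.
    file_grp = ET.SubElement(ET.SubElement(mets, m("fileSec")), m("fileGrp"), USE="original")
    struct_map = ET.SubElement(mets, m("structMap"))
    div = ET.SubElement(struct_map, m("div"), TYPE="dataset", DMDID="DMD1", ADMID="AMD1")

    for number, path in enumerate(file_paths, start=1):
        file_id = f"FILE{number}"
        file_el = ET.SubElement(file_grp, m("file"), ID=file_id)
        ET.SubElement(file_el, m("FLocat"), {"LOCTYPE": "URL", f"{{{XLINK_NS}}}href": path})
        ET.SubElement(div, m("fptr"), FILEID=file_id)

    return mets


# Hypothetical component files of a single shapefile.
package = build_package(["states.shp", "states.shx", "states.dbf"])
print(ET.tostring(package, encoding="unicode"))
```

The structMap is what addresses the weakness noted for FGDC: it records that the three files are components of one resource and links that resource to both kinds of metadata.
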
Issues & Challenges
• What if domain specific MD is not available?
  • If not, how can one get important info from data creators?
• How to determine what is truly necessary for use of data sets?
• Establishment of geospatial format registries
• Getting buy-in from geospatial domains for use of vocabularies, etc. (see Global Spatial Data Infrastructure: http://www.gsdi.org/Default.asp)
• More research needs to be done on “significant properties” like that done by JISC – DPC studies, e.g., SPs of vector images

Future directions for NGDA Project
• Further investigation of other geospatial formats including more vector based data such as:
  • layers of the National Atlas
  • National Map (sections of California)
  • Landsat 7 ETM imagery
  • Derived data sets from Stanford faculty

Future directions, cont.
• Format Registry investigation: what should be included in a format registry for geospatial
• Contact with key vendors, e.g., ESRI, Safe Software, etc.
• Monitoring what others are doing with e-science & social science data sets, e.g.,
  • NCSU, Johns Hopkins
  • National Archives of Australia (NAA)
  • JISC and DPC in the UK
  • NDIIPP US Multi-state project
  • Those using new DDI v 3.0 schema

References, contact info
• JISC – DPC studies on significant properties: http://www.dpconline.org/graphics/events/080407workshop.html
  • See Duce and Nielsen papers
• Full paper available at: http://www.ngda.org/research.php
• National Geospatial Digital Archive: http://www.ngda.org/index.php
• Examples of METS with PREMIS on METS public wiki:
  • http://www.socialtext.net/mim-2006/index.cgi?profile_playground

Questions? / comments?
Nancy J. Hoebelheinrich, nhoebel@stanford.edu
John Banning, jwbanning@gmail.com