Steve Morris, North Carolina State University Libraries. Ingest Workflow Issues: Metadata. North Carolina Geospatial Data Archiving Project
How the Data is Received • Data is delivered as is – no control over organization of received data • Contributing organizations • County and municipal agencies • State agencies • Regional councils of government • Data transfer modes • CD/DVD, External Drive • FTP or Web Download
Ingest Challenges: General • Data consists of multi-file, multi-format objects • Ancillary data files can be shared by datasets • Some formats require conversion now • Some format conversions involve one-to-many relationships • Compressed archive files are common and behave unpredictably • And all the usual challenges: format validation, validity checking, threat scanning, …
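As a rough illustration of the multi-file object problem, the sketch below groups delivered files into candidate dataset objects by shared base name, the way shapefile components tend to travel together. The function name, the sample paths, and the grouping rule are assumptions made for illustration, not the project's actual ingest code.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def group_by_stem(paths):
    """Group file paths into candidate multi-file objects.

    Assumption: files in the same directory that share a base name
    (e.g. streets.shp, streets.dbf) belong to one dataset object.
    """
    groups = defaultdict(list)
    for raw in paths:
        p = PurePosixPath(raw)
        # Key on directory plus the lower-cased name up to the first dot,
        # so "streets.shp" and "streets.shp.xml" land in the same group.
        key = str(p.parent / p.name.split(".", 1)[0].lower())
        groups[key].append(raw)
    return {key: sorted(members) for key, members in groups.items()}


# Hypothetical delivery listing.
delivered = [
    "county/streets.shp",
    "county/streets.shx",
    "county/streets.dbf",
    "county/streets.shp.xml",
    "county/parcels.shp",
]
objects = group_by_stem(delivered)
```

A rule this simple would not catch ancillary files shared between datasets, which is one reason such grouping probably needs human review.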
Ingest Challenges: Metadata • Metadata is encoded in a variety of ways • The FGDC content standard for metadata predates XML and lacked an encoding standard; the FGDC implementation of ISO 19115/19139 will soon address this • XML (varied schemas), TXT, HTML • Metadata is missing • Only about 25% of local agencies use FGDC • Metadata is wrong • Metadata is commonly out of sync with the data • Inconsistent use of dataset naming, etc.
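To make the encoding variety concrete, here is a minimal sketch of sorting incoming metadata records into XML, HTML, or plain text before any normalization. The function name and the returned labels are assumptions; a real workflow would likely need more robust detection.

```python
import xml.etree.ElementTree as ET


def sniff_metadata_encoding(text):
    """Return a coarse label for how a metadata record is encoded."""
    stripped = text.lstrip()
    head = stripped[:200].lower()
    if head.startswith("<!doctype html") or "<html" in head:
        return "html"
    if stripped.startswith("<"):
        try:
            root = ET.fromstring(stripped)
        except ET.ParseError:
            return "unknown"
        # Report the root element, since XML schemas vary between agencies.
        return f"xml:{root.tag}"
    return "txt"


xml_kind = sniff_metadata_encoding("<metadata><idinfo/></metadata>")
html_kind = sniff_metadata_encoding("<html><body>Streets</body></html>")
txt_kind = sniff_metadata_encoding("Identification_Information:\n  Title: Streets")
```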
Some Key Decisions • Capture “transfer set” metadata • Normalize, synchronize, and remediate existing metadata, and retain original metadata record • Treat contact information as archival • Update metadata with format conversions • Use ESRI Profile of FGDC • Adds technical and administrative elements • Has an XML schema • ArcCatalog tool support • Use simple rights encoding scheme • Record metadata in a workflow management database
What is Transfer Set Metadata? • Administrative and technical metadata associated with a transfer device or download • Propagates to individual data objects
[Figure: PHP application interface for transfer set metadata capture]
If No Metadata, What Then? • Autoextract a subset of technical and descriptive metadata through ArcCatalog • Apply an agency-specific metadata template (many elements are static within the context of the agency) • Acquire information from the NC OneMap Inventory • Data Source • Contact Info • Datum, Coordinate System • Acquire information from agency web site • Avoid direct inquiries to local agencies (“contact fatigue”)
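The agency template idea can be sketched as a simple merge in which extracted values win and the template only fills gaps. The element names and values below are hypothetical placeholders, not the project's template or the ESRI profile's element names.

```python
def apply_template(template, extracted):
    """Fill gaps in extracted metadata from an agency template.

    Values that were actually extracted win; the template only supplies
    elements that are missing or empty.
    """
    merged = dict(template)
    for key, value in extracted.items():
        if value not in (None, ""):
            merged[key] = value
    return merged


# Hypothetical template: elements that are static for one agency.
agency_template = {
    "origin": "Example County GIS",
    "contact_org": "Example County GIS",
    "datum": "NAD83",
}
# Hypothetical result of automated extraction for one dataset.
extracted = {"title": "Streets", "format": "Shapefile", "datum": ""}

record = apply_template(agency_template, extracted)
```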
What Gets Remediated and Why? • Key technical elements that are wrong • Datum, coordinate system, format, … • Title • Qualify to the agency (e.g. “Streets” becomes “Henderson County Streets”) • Keywords • Add ISO keywords • NCSU GIS Lookup terms added later if needed for access • These are basic requirements for access and use
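Title qualification is simple enough to sketch. The function below is an illustrative assumption about how the rule might be scripted, using the slide's own example.

```python
def qualify_title(title, agency):
    """Prefix a dataset title with its agency unless already present."""
    title = title.strip()
    if agency.lower() in title.lower():
        return title
    return f"{agency} {title}"


qualified = qualify_title("Streets", "Henderson County")
unchanged = qualify_title("Henderson County Streets", "Henderson County")
```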
Metadata Tools • ArcCatalog • Automated metadata extraction • ArcGIS Toolbar • Metadata synchronization, normalization, templating • cns and mp • Raw text handling • Python classes • Ingest workflow
Source Metadata Translation • Hub-and-spoke model a la Echo DEPository • repository agnostic • modular conversion hub • facilitate repository software migration & inter-archive exchange
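A hub-and-spoke translator can be sketched as a registry in which each format supplies one converter into a neutral hub record and one out of it, so adding a format means writing two converters rather than one per format pair. The class, the format names, and the toy field mappings below are all hypothetical and are not taken from the Echo DEPository tools.

```python
class ConversionHub:
    """Registry of converters to and from one neutral hub representation."""

    def __init__(self):
        self._to_hub = {}
        self._from_hub = {}

    def register(self, fmt, to_hub, from_hub):
        """Add a spoke: one converter into the hub and one out of it."""
        self._to_hub[fmt] = to_hub
        self._from_hub[fmt] = from_hub

    def translate(self, record, source, target):
        """Convert source -> hub -> target."""
        hub_record = self._to_hub[source](record)
        return self._from_hub[target](hub_record)


hub = ConversionHub()
# Two made-up formats that name the title element differently.
hub.register(
    "format_a",
    to_hub=lambda r: {"title": r["Title"]},
    from_hub=lambda h: {"Title": h["title"]},
)
hub.register(
    "format_b",
    to_hub=lambda r: {"title": r["name"]},
    from_hub=lambda h: {"name": h["title"]},
)
converted = hub.translate({"Title": "Streets"}, "format_a", "format_b")
```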
What is the Rights Encoding? • Purpose: Define a basic set of codes to hold dataset rights information in a script-actionable form, and assign related text for use in constructing brief rights statements. Rights codes propagate to individual data objects • Structure: Codes are assigned on a fixed string position basis. Rights assigned to particular user types are grouped after a flag character for that user group. • Initial User Groups: • NCSU Faculty/Staff/Students (Code “N”) • General Public (Code “P”) • Library of Congress (Code “L”) • Initial Rights Types: • Use • Redistribute • Commercial Use
Sample Rights Record: M01N110P110L110 • Interpretation: This dataset was acquired in a mediated transaction directly from the data producer (acquired on media or via arranged download). There is no data agreement but there is a data disclaimer. NCSU, General Public, and LC can all use and redistribute the data but commercial use is not allowed.
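The sample record can be decoded with a short script. The layout used here is inferred from the sample and its interpretation, and is an assumption rather than the project's specification: one acquisition letter, one digit each for data agreement and data disclaimer, then a block per user group made of the group flag followed by three 0/1 digits for use, redistribute, and commercial use.

```python
import re

# User-group flags and rights order as listed on the slides.
USER_GROUPS = {
    "N": "NCSU Faculty/Staff/Students",
    "P": "General Public",
    "L": "Library of Congress",
}
RIGHTS_TYPES = ("use", "redistribute", "commercial_use")

# Assumed layout: 1 acquisition letter, 2 header digits, then one or more
# blocks of <group flag><3 digits>.
RECORD_RE = re.compile(r"([A-Z])([01])([01])((?:[A-Z][01]{3})+)")
BLOCK_RE = re.compile(r"([A-Z])([01]{3})")


def decode_rights(code):
    """Decode a rights record into a nested dict of booleans."""
    match = RECORD_RE.fullmatch(code)
    if match is None:
        raise ValueError(f"unrecognized rights record: {code!r}")
    acquisition, agreement, disclaimer, blocks = match.groups()
    rights = {}
    for flag, digits in BLOCK_RE.findall(blocks):
        if flag not in USER_GROUPS:
            raise ValueError(f"unknown user group flag: {flag!r}")
        rights[USER_GROUPS[flag]] = {
            name: digit == "1" for name, digit in zip(RIGHTS_TYPES, digits)
        }
    return {
        "acquisition": acquisition,
        "data_agreement": agreement == "1",
        "data_disclaimer": disclaimer == "1",
        "rights": rights,
    }


decoded = decode_rights("M01N110P110L110")
```

Under these assumptions, `decoded` reports no data agreement, a data disclaimer, and use plus redistribution (but not commercial use) for all three user groups, matching the interpretation on the slide.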
Deferred Activities • Implementing METS and PREMIS • Developing a serial object metadata scheme
Ongoing Challenges • When to automate and when not to • Learn first from human intervention • Minimizing risk of error related to human intervention • Accepting that ingest packages used will evolve over time (implications for archive?) • Handling post-ingest migrations
Engagement Opportunities • NCGDAP partner NCCGIA runs the NC OneMap Metadata Outreach Program • Provide feedback to spatial data infrastructure about metadata inconsistencies, lack of adherence to best practices • Partner with industry and standards organizations on addressing metadata issues such as poor standards support for versioned data (e.g., through OGC Data Preservation Working Group)