Automating the Extraction of Metadata From Archaeological Data Using iRods Rules

David Walling, Maria Esteva • 6th International Digital Curation Conference

Outline • Challenges of digital archaeological data at ICA • Collaboration with TACC • File and directory naming convention • Metadata mapping • iRODS and Jython metadata extractor script • Client interfaces • Conclusions

Archaeological Data • Site is destroyed

Data Creation and Management at ICA • Increasing use of digital photography, GIS, databases, 3D models, digital text, spreadsheets, etc. • Archaeological Recording Kit (ARK) – doesn’t support all data types, raw vs. lower quality images • Multiple data pipelines • Raw data stored in multiple and scattered storage devices • Bottlenecks that complicate the research process

ICA’s Needs • Inventory and organize the raw data • Ensure there is enough metadata for raw data so we know what it is • Preserve each raw data object’s relationship to ARK entries and other data objects (context) • Ensure raw data remains usable in the long term (site destroyed) • Extract metadata from ARK so it can be both preserved and shared

TACC • Mission: to enable discoveries that advance science and society through the application of advanced computing technologies • Provide HPC, visualization, and data management resources and consulting, training, R&D • Traditionally physical sciences and engineering • Traditionally HPC and visualization • Humanities, Art, Herbarium

TACC/ICA Collaboration • ICA as data managers, TACC providing resources and consulting • Inventory of ICA data collection • Document archaeological research workflow and identify points where data is both created and moved to storage • Development of file and directory naming convention • Development of metadata extractor script to generate metadata on iRODS ingestion

Directory Hierarchy • Folder name = subject category • Familiar concept • Flexible, developed by users

File Naming Convention • Example: sfi_CH01PA_8_a2_m.JPG • sfi_CH01PA_8: the object code as it exists in ARK; connects the file to its metadata in ARK • a2: research stage (before/after/during conservation, lifting, microscope, studio) • m: ‘m’ = master/raw, ‘p’ = publication/re-purposing
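A minimal Python sketch of how a file name following this convention could be split into its parts. The regular expression is inferred from the single example on the slide (a stage code of one letter plus optional digits, then 'm' or 'p'); the names `NAME_RE`, `ParsedName`, and `parse_name`, and the field name `version`, are hypothetical and not taken from the authors' script.

```python
import re
from typing import NamedTuple

# Assumed layout, based only on the slide's example:
#   <object code>_<stage>_<m|p>.<extension>
NAME_RE = re.compile(
    r"^(?P<object_code>.+)_(?P<stage>[A-Za-z]\d*)_(?P<version>[mp])\.(?P<ext>[^.]+)$"
)

# Meaning of the final code, as given on the slide.
VERSIONS = {"m": "master/raw", "p": "publication/re-purposing"}


class ParsedName(NamedTuple):
    object_code: str  # object code as it exists in ARK
    stage: str        # research stage code, e.g. "a2"
    version: str      # expanded meaning of 'm' or 'p'
    extension: str


def parse_name(filename):
    """Split a file name into object code, stage, version, and extension."""
    match = NAME_RE.match(filename)
    if match is None:
        raise ValueError("file name does not follow the convention: %r" % filename)
    return ParsedName(
        object_code=match["object_code"],
        stage=match["stage"],
        version=VERSIONS[match["version"]],
        extension=match["ext"],
    )


print(parse_name("sfi_CH01PA_8_a2_m.JPG"))
```

Names that do not fit the assumed pattern raise `ValueError`, which gives an ingest step a clear point at which to reject or flag a file.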

Metadata Schemas • Descriptive metadata -> Dublin Core Qualified (DC-Q) • Technical metadata -> Preservation Metadata: Implementation Strategies (PREMIS) • Encapsulate both in Metadata Encoding and Transmission Standard (METS)
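The encapsulation described above can be sketched in Python with the standard library: descriptive terms go into a METS `dmdSec`, technical values into a `techMD` inside an `amdSec`. This is an illustration only. The function name `build_mets` and the section IDs are hypothetical, the PREMIS namespace assumes PREMIS version 2, and the PREMIS fragment omits required elements, so the result is not schema-valid.

```python
import xml.etree.ElementTree as ET

METS = "http://www.loc.gov/METS/"
DCTERMS = "http://purl.org/dc/terms/"
PREMIS = "info:lc/xmlschema/premis-v2"  # assumption: PREMIS version 2 namespace

for prefix, uri in (("mets", METS), ("dcterms", DCTERMS), ("premis", PREMIS)):
    ET.register_namespace(prefix, uri)


def q(namespace, tag):
    """Return a namespace-qualified tag in ElementTree's {uri}tag form."""
    return "{%s}%s" % (namespace, tag)


def build_mets(descriptive, technical):
    """Wrap DC terms and a few PREMIS values in one METS document.

    descriptive: list of (dcterms element name, value) pairs
    technical:   dict with 'size' and 'format_name' (hypothetical keys)
    """
    root = ET.Element(q(METS, "mets"))

    # Descriptive metadata section holding the Dublin Core terms.
    dmd = ET.SubElement(root, q(METS, "dmdSec"), ID="DMD1")
    wrap = ET.SubElement(dmd, q(METS, "mdWrap"), MDTYPE="DC")
    data = ET.SubElement(wrap, q(METS, "xmlData"))
    for term, value in descriptive:
        ET.SubElement(data, q(DCTERMS, term)).text = value

    # Administrative section holding the PREMIS technical metadata.
    amd = ET.SubElement(root, q(METS, "amdSec"), ID="AMD1")
    tech = ET.SubElement(amd, q(METS, "techMD"), ID="TECH1")
    wrap = ET.SubElement(tech, q(METS, "mdWrap"), MDTYPE="PREMIS")
    data = ET.SubElement(wrap, q(METS, "xmlData"))
    obj = ET.SubElement(data, q(PREMIS, "object"))
    chars = ET.SubElement(obj, q(PREMIS, "objectCharacteristics"))
    ET.SubElement(chars, q(PREMIS, "size")).text = technical["size"]
    fmt = ET.SubElement(chars, q(PREMIS, "format"))
    designation = ET.SubElement(fmt, q(PREMIS, "formatDesignation"))
    ET.SubElement(designation, q(PREMIS, "formatName")).text = technical["format_name"]
    return root


mets = build_mets([("isPartOf", "sfi_CH01PA_8")], {"size": "2048", "format_name": "JPEG"})
print(ET.tostring(mets, encoding="unicode"))
```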

Mapping to DC-Q • Directory names -> dcterms:subject • Object code in file name (sfi_CH01PA_8) -> dcterms:isPartOf • Stage of research -> dcterms:isPartOf • Context code from ARK -> dcterms:isPartOf • Description in ARK -> dcterms:description
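The mapping above could be sketched in Python roughly as follows. The function name `map_to_dcq`, the example directory names, and the way ARK values are passed in are all assumptions made for illustration; how the authors' script actually queries ARK is not shown on the slides.

```python
from pathlib import PurePosixPath


def map_to_dcq(path, ark_context=None, ark_description=None):
    """Return (DC-Q term, value) pairs derived from a file path and ARK values.

    A list of pairs is used rather than a dict because dcterms:subject and
    dcterms:isPartOf can each occur more than once.
    """
    p = PurePosixPath(path)

    # Assumed file name layout: <object code>_<stage>_<m|p>
    parts = p.stem.rsplit("_", 2)
    if len(parts) != 3:
        raise ValueError("file name does not follow the convention: %r" % p.name)
    object_code, stage, _version = parts

    pairs = []
    # Directory names -> dcterms:subject (a leading "/" is not a subject)
    for folder in p.parent.parts:
        if folder != "/":
            pairs.append(("dcterms:subject", folder))
    # Object code and stage of research -> dcterms:isPartOf
    pairs.append(("dcterms:isPartOf", object_code))
    pairs.append(("dcterms:isPartOf", stage))
    # Values that would come from ARK, if available
    if ark_context is not None:
        pairs.append(("dcterms:isPartOf", ark_context))
    if ark_description is not None:
        pairs.append(("dcterms:description", ark_description))
    return pairs


# "ceramics/amphorae" is a made-up directory hierarchy.
for term, value in map_to_dcq("ceramics/amphorae/sfi_CH01PA_8_a2_m.JPG"):
    print(term, "=", value)
```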

Mapping to PREMIS • File Information Tool Set (FITS) aggregates several open-source preservation metadata generators and outputs to XML. • Jhove, Exiftool, NLNZ Metadata Extractor, DROID, FFIdent, File Utility • Straightforward mapping to PREMIS schema
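To illustrate why the mapping is straightforward, here is a small Python sketch that reads a few values from FITS-style XML and files them under PREMIS semantic unit names. The sample document is hand-written and heavily simplified (real FITS output carries many more elements and attributes), and the function name `fits_to_premis` is hypothetical.

```python
import xml.etree.ElementTree as ET

FITS_NS = {"fits": "http://hul.harvard.edu/ois/xml/ns/fits/fits_output"}

# Hand-written, simplified sample modeled on FITS output; values are made up.
SAMPLE_FITS = """<fits xmlns="http://hul.harvard.edu/ois/xml/ns/fits/fits_output">
  <identification>
    <identity format="JPEG File Interchange Format" mimetype="image/jpeg"/>
  </identification>
  <fileinfo>
    <size>2048</size>
    <md5checksum>d41d8cd98f00b204e9800998ecf8427e</md5checksum>
  </fileinfo>
</fits>"""


def fits_to_premis(fits_xml):
    """Map a handful of FITS values to PREMIS semantic unit names."""
    root = ET.fromstring(fits_xml)
    identity = root.find("fits:identification/fits:identity", FITS_NS)
    size = root.findtext("fits:fileinfo/fits:size", default=None, namespaces=FITS_NS)
    md5 = root.findtext("fits:fileinfo/fits:md5checksum", default=None, namespaces=FITS_NS)
    return {
        "formatName": identity.get("format") if identity is not None else None,
        "size": size,
        "messageDigestAlgorithm": "MD5" if md5 is not None else None,
        "messageDigest": md5,
    }


print(fits_to_premis(SAMPLE_FITS))
```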

iRODS and Corral • Integrated Rule Oriented Data System (iRODS) provides a virtual file system and rule engine for long-term data preservation. • Hosted on Corral, TACC’s 1.2+ PB multi-purpose storage application resource. • This basic infrastructure provides a flexible backbone for a wide variety of data collection projects.
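iRODS stores metadata on a data object as attribute-value pairs, which the `imeta` icommand can add from the command line. The Python sketch below only builds the command lines and does not run them. It is not the rule-triggered Jython extractor described in this talk; the function name `imeta_commands` and the zone path are hypothetical.

```python
def imeta_commands(data_object, pairs):
    """Build one `imeta add -d` command per (attribute, value) pair.

    Each command is returned as an argument list, suitable for passing to
    subprocess.run() on a machine where the iRODS icommands are installed.
    """
    return [["imeta", "add", "-d", data_object, attr, value] for attr, value in pairs]


# Hypothetical iRODS path and metadata values, for illustration only.
commands = imeta_commands(
    "/tempZone/home/ica/ceramics/sfi_CH01PA_8_a2_m.JPG",
    [("dcterms:subject", "ceramics"), ("dcterms:isPartOf", "sfi_CH01PA_8")],
)
for command in commands:
    print(" ".join(command))
```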

Task Workflow

Client Interfaces

Conclusion • Organizational method allowing for the automatic extraction of metadata as data is ingested into long-term storage soon after creation • Template for running external tools with iRODS • Template for extracting PREMIS data from files • Technical details and code/configuration available in paper
