Software Tools for Automated Metadata Creation, Metadata-mediated Data Processing and Quality Control Analysis – real time processing solutions for real-time data
Wade Sheldon, GCE-LTER, University of Georgia
Ecoinformatics Challenges
• Ecologists use heterogeneous data from many sources for synthesis (often processed using multiple tools and technologies):
  • manually collected data (spreadsheets, text files)
  • instrument data loggers (text files, telemetry)
  • WWW/network data stores (text/HTML/XML files, streams)
• Data volumes are increasing exponentially
• Expectations and requirements for metadata quality and quantity are increasing
• Expectations for data accessibility are increasing
Problems
• Data processing, QA/QC, and metadata creation are often the limiting factors in information management (IM), not data acquisition
• The disparity between capacity and expectations forces trade-offs:
  • rapid posting of provisional data (no QA/QC, minimal metadata)
  • slow posting of finalized data (months to years)
Potential Solutions
• Improve “scope for automation” at all levels
• Develop a dynamic, flexible QA/QC process to improve efficiency
• Adopt a more unified approach, so that data processing, QA/QC and metadata creation occur simultaneously
Typical Data Processing Scenario
Processing stages: Raw/Unprocessed → Digitized/Acquired → Validated → Standardized → Quality-Controlled → Finalized → Customized/Modified
[Diagram: metadata appears only at the Finalized stage]
Ideal Data Processing Scenario
Processing stages: Raw/Unprocessed → Digitized/Acquired → Validated → Standardized → Quality-Controlled → Finalized → Customized/Modified
[Diagram: metadata accompanies every stage from Digitized/Acquired onward]
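The ideal scenario, in which metadata is updated at every processing stage rather than written once at the end, can be illustrated with a minimal Python sketch. This is not the GCE Data Toolbox (which is written in MATLAB): the `apply_stage` helper, the dictionary layout, and the two example transforms are all hypothetical names chosen for illustration.

```python
from datetime import datetime, timezone


def apply_stage(dataset, stage_name, transform):
    """Run one processing stage and record it in the dataset's metadata.

    `dataset` is a hypothetical dict with "values" and "metadata" keys;
    this layout is an assumption, not the GCE Data Structure.
    """
    values = transform(dataset["values"])
    entry = {
        "stage": stage_name,
        "time": datetime.now(timezone.utc).isoformat(),
        "records_in": len(dataset["values"]),
        "records_out": len(values),
    }
    # Build a new dataset so the output of earlier stages is left unchanged
    history = dataset["metadata"]["history"] + [entry]
    return {"values": values, "metadata": {**dataset["metadata"], "history": history}}


def drop_missing(values):
    """Example validation step: discard missing readings."""
    return [v for v in values if v is not None]


def to_celsius(values):
    """Example standardization step: Fahrenheit to Celsius."""
    return [(v - 32.0) * 5.0 / 9.0 for v in values]


raw = {"values": [68.0, None, 212.0], "metadata": {"title": "demo", "history": []}}
validated = apply_stage(raw, "Validated", drop_missing)
standardized = apply_stage(validated, "Standardized", to_celsius)

for entry in standardized["metadata"]["history"]:
    print(entry["stage"], entry["records_in"], "->", entry["records_out"])
```

Because each stage returns its data together with the updated history, a data set posted at any intermediate stage already carries a description of how it was produced.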
Approach at GCE
• Developed a universal tabular data storage format (GCE Data Structure) and modular software (GCE Data Toolbox) for data processing
• Used MATLAB®:
  • Local expertise, large scientific user base
  • Cross-platform (Win32, Solaris, *nix, Mac OS X)
  • Rapid development environment
  • Supports multiple interfaces
  • Good interoperability with other technologies (Java, Perl, SQL)
GCE Data Structures
[Figure: GCE Data Structure Specification (v1.1)]
GCE Data Toolbox
• Toolbox functions support:
  • Importing data from all common formats (ASCII, ML, SQL)
  • Performing dynamic, rule-based QA/QC flagging (with support for inter-column dependencies) plus interactive manual flagging
  • Dynamically generating metadata using a combination of “templating”, automatic, and manual entry approaches
  • Exporting data and metadata in multiple ASCII/ML formats
  • Data transformation, including unit conversions, geographic coordinate re-projection, date/time conversions
  • Statistical analysis, sub-setting, super-setting, data visualization on plots, maps
• Metadata queried for all operations (mediation)
• All operations and data changes transparently logged and synchronized with metadata
• Metadata from multiple structures “meshed” after merge/join to retain information during synthesis
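Rule-based QA/QC flagging with inter-column dependencies, as listed above, can be sketched roughly as follows. This is a Python illustration only, not the toolbox's MATLAB implementation; the flag codes (`I`, `Q`, `D`), the column names, and the thresholds are assumptions invented for the example.

```python
def flag_rows(rows, rules):
    """Return one flag string per row.

    Each rule is a (code, test) pair; a row collects the code of every
    rule whose test returns True, in rule order.
    """
    flags = []
    for row in rows:
        codes = [code for code, test in rules if test(row)]
        flags.append("".join(codes))
    return flags


# Hypothetical rules; codes and limits are illustrative assumptions
rules = [
    # Single-column range check: value outside plausible limits
    ("I", lambda r: r["salinity"] < 0 or r["salinity"] > 40),
    # Single-column check: unusually warm water
    ("Q", lambda r: r["temperature"] > 35),
    # Inter-column dependency: zero salinity at near-zero depth
    # suggests the sensor was out of the water
    ("D", lambda r: r["depth"] < 0.1 and r["salinity"] == 0),
]

rows = [
    {"depth": 1.2, "temperature": 22.5, "salinity": 30.1},
    {"depth": 0.05, "temperature": 36.0, "salinity": 0.0},
    {"depth": 1.1, "temperature": 23.0, "salinity": 45.0},
]

print(flag_rows(rows, rules))
```

Keeping the rules as data rather than hard-coded checks means they can be stored alongside the column metadata and re-evaluated whenever values change, which is one way to make flagging "dynamic".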
Interfaces
• Developed multiple interfaces for the software:
  • Command line (supports unattended batch-mode and interactive processing)
  • Desktop GUI application (requires no MATLAB expertise, uses standard dialogs/controls)
  • Web application with HTML forms, query string input
Current Applications
• Processing and QA/QC of all GCE monitoring data
• Data packaging for WWW distribution (linked to the metadata RDBMS)
• WWW application for data set customization
• Automatic near-real-time data harvesting, processing, and WWW posting:
  • USGS data (2 GCE stations)
  • CSI climate station
  • YSI hydrographic data logger
• USGS Data Harvester for HydroDB (31 stations/7 LTER sites)
Software Development
• What resources were available
  • None (completely de novo project)
• Need for the tool
  • Efficiently process monitoring data from sensor networks and stations with near-zero manpower
• Time to develop
  • Software is a core component of the GCE-IS, with development spread over 2.5 years (effort is hard to quantify, but likely 3-4 months)
• Scalability
  • Performs well with data sets <100k records (10-20k commonly used), but memory and speed may become limiting >100k
  • Some extensibility features incorporated (import filters, templates, metadata styles, unit conversions)
• Portability
  • Requires MATLAB 5.3-6.5 (commercial), but both code and binary data files are fully compatible with any Java-supported platform
Availability
• Description, screenshots, and fully functional software available on the WWW: http://gce-lter.marsci.uga.edu/lter/research/tools/data_toolbox.htm
• Requires MATLAB 5.3+ (6.0+ recommended) on any supported platform (Win32, Solaris, *nix, Mac OS X)
• “Public” version is compiled but includes command-line help and some user extensibility
• Source code requests considered on a case-by-case basis
Future Development Plans
• EML 2.0 support
• Fully automated metadata-mediated data set integration:
  • Automatic unit conversions
  • Scaling (e.g. time frequency)
• More WWW interface development
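As a rough illustration of what metadata-mediated integration with automatic unit conversion could look like, the Python sketch below reads each column's units from its metadata and harmonizes them before combining values. It is a sketch under stated assumptions, not the planned toolbox feature: the `CONVERSIONS` table, the column dictionary layout, and the `convert_column` and `integrate` functions are all hypothetical.

```python
# Hypothetical conversion table keyed by (source units, target units)
CONVERSIONS = {
    ("degF", "degC"): lambda v: (v - 32.0) * 5.0 / 9.0,
    ("ft", "m"): lambda v: v * 0.3048,
}


def convert_column(column, target_units):
    """Convert a column to target_units, guided by its own units metadata."""
    source_units = column["units"]
    if source_units == target_units:
        return column
    try:
        convert = CONVERSIONS[(source_units, target_units)]
    except KeyError:
        raise ValueError(
            f"no conversion from {source_units} to {target_units}"
        ) from None
    return {
        "name": column["name"],
        "units": target_units,
        "values": [convert(v) for v in column["values"]],
    }


def integrate(columns, target_units):
    """Concatenate matching columns after harmonizing their units."""
    merged = []
    for column in columns:
        merged.extend(convert_column(column, target_units)["values"])
    return {"name": columns[0]["name"], "units": target_units, "values": merged}


site_a = {"name": "air_temp", "units": "degC", "values": [20.0]}
site_b = {"name": "air_temp", "units": "degF", "values": [212.0, 32.0]}

combined = integrate([site_a, site_b], "degC")
print(combined["units"], combined["values"])
```

The conversion is driven entirely by the units recorded in the metadata, so no per-source code is needed; a column with unknown units raises an error instead of being merged silently.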