Managing Data Quality in a Terabyte-scale Sensor Archive Bryce Cutt, Ramon Lawrence University of British Columbia Okanagan Kelowna, British Columbia, Canada ramon-lawrence@ubc.ca http://people.ok.ubc.ca/rlawrenc/
Data-Driven Scientific Discovery • Modern scientific discovery uses vast quantities of data generated from instruments, sensors, and experimental systems. • The quality and impact of the research are highly dependent on the quality of the data collection and analysis. • Challenges: • The amount of data required for research is exploding. • The types and sources of data are increasing, and data generated by different experiments or devices must be integrated. • These two factors require scientists to be concerned with how their experimental data is collected, archived, analyzed, and integrated to lead to research contributions.
Fundamental Sensor Data Archive Issue • Sensors produce vast amounts of data that are valuable for historical as well as real-time applications and analysis. • Due to the number of sensors and the volume of data collected, manual curation and data validation are difficult. • By their nature, sensors are prone to failures, inaccuracies, and periods of intermittent or substandard performance. • Despite these device limitations, the historical data record should be as clean and accurate as possible.
Key Question (and Answer) • Question: How can we achieve high quality historical archives of sensor data? • Answer: In addition to operational monitoring of the data archive system, the data stream should be analyzed using metadata properties to detect errors. • Operational monitoring – Are the system components and workflow functioning properly? • Metadata validation – Does the data stream conform to known ranges? Can data cleansing and correction be performed?
NEXRAD Archive System Overview • We will briefly overview: • The data collected by the NEXRAD system and its scientific value. • The current state of NEXRAD data archiving and its use in scientific discovery, including its data quality limitations. • An extension of the system that uses metadata properties to validate and clean archived data. Our goal is to provide the science community with ready access to the vast archives and real-time information collected by the national network of NEXRAD radars. [This requires hiding the numerous data management issues.]
NEXRAD System and Generated Data • There are over 150 NEXt generation RADars (NEXRAD) that collect real-time precipitation data across the United States. • The system has been operational for over 10 years, and the amount of collected data is continually expanding. • A radar emits a coherent train of microwave pulses and processes reflected pulses. • Each processed pulse corresponds to a bin. There are multiple bins in a ray (beam). Rotating the radar 360º is a sweep. After a sweep, the radar elevation angle is increased and another sweep is performed. All sweeps together form a volume.
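The bin/ray/sweep/volume hierarchy above can be sketched in code. This is an illustrative model only; the class and field names are assumptions, not the actual NEXRAD data format.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical model of the hierarchy described above:
# bin -> ray (beam) -> sweep (one 360-degree rotation) -> volume.

@dataclass
class Ray:
    azimuth_deg: float
    bins: List[float] = field(default_factory=list)  # one reading per bin

@dataclass
class Sweep:
    elevation_deg: float
    rays: List[Ray] = field(default_factory=list)

@dataclass
class Volume:
    radar_id: str
    sweeps: List[Sweep] = field(default_factory=list)

# Toy volume: 2 elevation sweeps, 4 rays each, 3 bins per ray.
vol = Volume("KDMX", [
    Sweep(elev, [Ray(az, [0.0, 0.0, 0.0]) for az in (0, 90, 180, 270)])
    for elev in (0.5, 1.5)
])
total_bins = sum(len(r.bins) for s in vol.sweeps for r in s.rays)
```

A real volume contains far more rays and bins, but the nesting is the same: increasing the elevation angle adds another `Sweep` to the `Volume`.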
Usefulness of NEXRAD Data • Although the NEXRAD system was designed for severe weather forecasting, data collected has been used in many areas including: • flood prediction • bird and insect migration • rainfall estimation • The value of this data has been noted by an NRC report which labeled it a “critical resource.” • Enhancing Access to NEXRAD Data—A Critical National Resource. National Academy Press, Washington D.C. ISBN 0-309-06636-0, 1999.
Archiving NEXRAD Data • Researchers have two options for acquiring NEXRAD data: • 1) Retrieve RAW data from the National Climatic Data Center (NCDC) tape archive. • 2) Capture real-time data distributed by University Corporation for Atmospheric Research (UCAR) using their Unidata Internet Data Distribution (IDD) system. • Acquiring, archiving, and analyzing the data requires significant computational and operational knowledge which makes it impractical for many researchers.
NEXRAD Archive System • The NEXRAD archive system is an NSF-funded project that aims to simplify the analysis of NEXRAD data for researchers. • The NEXRAD archive: • Collects and archives RAW data from the real-time stream. • Analyzes and indexes data for retrieval by metadata properties. • Performs data cleansing such as removing ground clutter. • Allows researchers access to historical and real-time data in RAW form. • Provides an analysis workflow system that will generate derived products (such as rainfall maps) using the RAW data, known algorithms, and researcher parameters. • The NEXRAD archive is hosted at the University of Iowa and the development is done in conjunction with NCDC, Unidata, and Princeton University.
NEXRAD Archive Architecture • Files are added from real-time stream or from other sources. • Metadata extractor produces XML description of each data file used for indexing. • Clients can access archive directly using C library and their own program. • All data files are web accessible. • Metadata directory can be queried using web services interface. • Most clients use pre-constructed web workflow system and do not access RAW data. • Data and metadata will be replicated at a supercomputing center and eventually at NCDC.
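The per-file XML description produced by the metadata extractor might look something like the sketch below. The element names and field choices here are illustrative assumptions, not the archive's actual schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical sketch of a per-file XML metadata record, as produced by
# a metadata extractor for indexing. Element names are assumptions.

def describe_scan(radar_id: str, timestamp: str, max_reflectivity_dbz: float) -> str:
    root = ET.Element("scan")
    ET.SubElement(root, "radar").text = radar_id
    ET.SubElement(root, "timestamp").text = timestamp
    ET.SubElement(root, "maxReflectivity").text = str(max_reflectivity_dbz)
    return ET.tostring(root, encoding="unicode")

xml_doc = describe_scan("KDMX", "2002-06-04T13:05:00Z", 57.5)
```

Records like this, loaded into a metadata directory, are what make the web-services queries over storm properties possible without touching the RAW files.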
User/Client’s View • Example query: “Find all the 2002 storms over the Ralston Creek watershed with mean areal precipitation greater than X mm, and with a spatial extent of more than Z km2, with a duration of less than N hours. I want the data in GeoTIFF.” • Workflow: the user/client queries the metadata archive through a web services interface, gets back URIs for the matching data, then retrieves the data over HTTP from the distributed data archive (NCDC, Iowa, etc.).
NEXRAD System Current Statistics • The NEXRAD archive system: • Has been running for over 2 years • Collects data for 30 of the 150 radars • Has indexed over 8 million radar scans • Has RAW data that is over 8 TB in compressed form • Processes real-time data stream of 10-20 GB/day • Supports a sophisticated workflow system that produces derived data products (e.g. rainfall maps) for users on demand • Has an operational monitoring system (is the archive workflow pipeline functioning properly?) but only simple data validation checks • Question: What is the quality of the data being archived?
Archive Monitoring System • We developed a new archive monitoring system that: • Explicitly tracks all archive workflow events in logs that are stored and queried using a database • Detects data corruption using metadata properties as well as pipeline failures • Produces reports on a web interface to simplify the task of administration of the archive • The monitoring system was developed and operated separately from the main archive to compare performance and to prevent issues with the operational system.
Archive System with Monitor • Basic archive workflow components unchanged except for logging: • Converter – translate RAW form to compressed RLE • Metadata Extractor – analyze data properties/check for inconsistencies • Loader – load metadata into database and files onto web servers • Monitoring system: • Loads XML log records from each archive component into DB. • Provides metadata ranges for checking data validity. • Tracks files through pipeline (lineage) and handles corrupt files. • Has separate log database that is accessed using web front-end. • Can restart any workflow software.
Validating Sensor Data using Metadata • Operational ranges of data produced by sensors are commonly known. • For example, the timestamp of sensor readings for the radars should be close to the current time. Reflectivity readings are within known ranges given weather conditions. • The monitoring system provides these operational ranges to the metadata extractor component, which verifies that data is within accepted ranges. • Data outside the ranges causes files to be dropped if not recoverable, fixed if possible (e.g., date corrections), or otherwise flagged as warnings of potential corruption. • The goal is to get as much data through the pipeline as possible, while making sure compromised data is flagged.
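The drop/fix/flag decision described above can be sketched as a small range check. The field names, range values, and clock-skew threshold below are assumptions for illustration, not the archive's actual configuration.

```python
from datetime import datetime, timedelta, timezone

# Illustrative metadata range validation. Ranges and field names are
# assumptions, not the operational values used by the archive.
RANGES = {"reflectivity_dbz": (-32.0, 95.0)}
MAX_CLOCK_SKEW = timedelta(hours=1)

def validate(record: dict, now: datetime) -> str:
    """Classify a record as 'ok', 'fixable' (e.g. a date error), or 'drop'."""
    lo, hi = RANGES["reflectivity_dbz"]
    if not (lo <= record["reflectivity_dbz"] <= hi):
        return "drop"      # unrecoverable: reading outside physical range
    if abs(now - record["timestamp"]) > MAX_CLOCK_SKEW:
        return "fixable"   # timestamp far from current time: correctable
    return "ok"

now = datetime(2002, 6, 4, 13, 0, tzinfo=timezone.utc)
good = {"reflectivity_dbz": 40.0, "timestamp": now}
stale = {"reflectivity_dbz": 40.0, "timestamp": now - timedelta(days=365)}
bad = {"reflectivity_dbz": 250.0, "timestamp": now}
```

This ordering reflects the stated goal: out-of-range readings cannot be trusted at all, while a bad timestamp on otherwise-valid data is worth repairing so the file still reaches the archive.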
Monitoring System Implementation • The monitoring system required that each workflow component be changed to use XML log records instead of separate files. • Each XML log record is loaded into a Postgres database by a log processor. The log processor and logging is separate from the archive system to ensure that logging does not slow the archiving. • As a RAW file proceeds through the pipeline, log events for it are recorded at each stage. Files that do not make it through the pipeline are not “lost” to the archive as before. • Administrator has a web front-end to control archive processes, monitor events on per file or per process level, and track operational characteristics across the entire workflow.
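Lineage tracking over per-stage log events can be sketched as follows. The three-stage pipeline and the event dictionary format are assumptions based on the components named earlier (converter, metadata extractor, loader); the real system logs XML records into Postgres.

```python
from collections import defaultdict

# Minimal sketch of file lineage tracking across pipeline stages.
# Stage names follow the archive components described earlier; the
# event format is hypothetical.
STAGES = ["converter", "extractor", "loader"]

def find_lost_files(events: list) -> list:
    """Return files whose log trail stops before the final stage."""
    seen = defaultdict(set)
    for ev in events:
        seen[ev["file"]].add(ev["stage"])
    final = STAGES[-1]
    return sorted(f for f, stages in seen.items() if final not in stages)

events = [
    {"file": "scan1.raw", "stage": "converter"},
    {"file": "scan1.raw", "stage": "extractor"},
    {"file": "scan1.raw", "stage": "loader"},
    {"file": "scan2.raw", "stage": "converter"},  # never reached the loader
]
lost = find_lost_files(events)
```

With events queryable like this, a file that silently dies mid-pipeline shows up in a report instead of being "lost" to the archive.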
Operational Results • A duplicate NEXRAD archive system that included the monitoring system processed two radars in parallel with the live system for six months. • Key results: • Data errors occur in about 5% of input files. The expectation was less than 1% given the sophistication of the sensors. • One radar had data corruption for a two week period that went unnoticed in the live system, as the radar was indicating good operational status. • Most errors were not fixable, but there was a significant number of correctable date errors. • Administrator time was reduced dramatically compared to manual log investigation. • The cost of logging a sensor stream is high. Storing log records in a database is a bottleneck and must be separated from the archive. (Database loading is also a bottleneck in the archive itself.)
Future Work and Conclusions • Archiving sensor data is going to be an increasing challenge. • Ensuring high quality archives requires more than operational monitoring and should also include data validation using metadata properties. • Live archive system is being updated to use monitoring system. • The bottleneck in archiving and monitoring sensor data is the database system. Monitor should be separated from archive. • Loading the metadata into the archive takes an order of magnitude longer than generating it. • Metadata is growing beyond the capabilities of a single database. It will be replicated and distributed for performance and political reasons. • Logging to the database provides easy access to information, but you must be aware of performance issues.
Project Participants • The University of Iowa (Lead) • W.F. Krajewski (PI) • A.A. Bradley, A. Kruger, R. Lawrence • Princeton University • J.A. Smith (PI) • M. Steiner, M.L. Baeck • National Climatic Data Center • S.A. Delgreco (PI) • S. Ansari • UCAR/Unidata Program Center • M. K. Ramamurthy (PI) • W.J. Weber Research supported by NSF ITR Grant ATM 0427422: “A Comprehensive Framework for Use of NEXRAD Data in Hydrometeorology and Hydrology”.
Managing Data Quality in a Terabyte-scale Sensor Archive Bryce Cutt, Ramon Lawrence University of British Columbia Okanagan Kelowna, British Columbia, Canada ramon-lawrence@ubc.ca http://people.ok.ubc.ca/rlawrenc/ Thank You!