DTC Archive: data repositories in the fight against diffuse pollution. Mark Hedges, Richard Gartner: King’s College London. Mike Haft, Hardy Schwamm: Freshwater Biological Association. Open Repositories 2012, Edinburgh, Scotland/UK, 10th July 2012.
A message from our sponsors • Collaboration between the Freshwater Biological Association and King’s College London (Centre for e-Research) • Funded by DEFRA (Department for Environment, Food and Rural Affairs), a UK government ministry • Runs from Jan. 2011 to Dec. 2014
Diffuse Pollution – what is it? • Pollution processes that: • Individually, have minimal effect • Cumulatively, have significant impact • Some examples: • Run-off of water/rain (e.g. from roads, commercial properties) • Farm fertilisers and waste • Seepage from developed landscapes
Water Framework Directive • What is an EU Directive? • An EU Directive is a European Union legal instruction or secondary European legislation which is binding on all Member States but which must be implemented through national legislation within a prescribed time-scale. • Water Framework Directive concerns water quality • Freshwater (rivers, lakes, groundwater) adversely affected by diffuse pollution • Failure to comply means problems!
DTC Project • DTC = Demonstration Test Catchment • Investigate measures for reducing impact of diffuse water pollution on ecosystems • Evaluate the extent to which on-farm mitigation measures can reduce impact of water pollution on river ecology • cost-effectively • maintaining food production capacity
Defra Demonstration Test Catchments (DTCs) • 3 catchment areas in England selected for tests
How does the DTC project work? • The procedure is (roughly speaking): • Monitor various environmental markers • Try out mitigation measures • Analyse changes in baseline trends of markers in response to these measures • All this produces a great variety of data • The DTCs create data; the DTC Archive project has to make it usable and useful!
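One way to read "analyse changes in baseline trends" is to compare the trend of a marker before and after a mitigation measure is introduced. The sketch below is only an illustration of that idea, not the project's actual analysis: the function names and the nitrate readings are invented, and a least-squares slope stands in for whatever trend measure the DTCs really use.

```python
from statistics import mean


def slope(values):
    """Least-squares slope of values against their index (0, 1, 2, ...)."""
    xs = range(len(values))
    x_bar, y_bar = mean(xs), mean(values)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, values))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den


def trend_change(baseline, after):
    """Difference between the post-mitigation trend and the baseline trend."""
    return slope(after) - slope(baseline)


# Hypothetical monthly nitrate readings (mg/l), invented for illustration only.
baseline = [5.0, 5.2, 5.4, 5.6]   # marker rising before mitigation
after = [5.6, 5.5, 5.4, 5.3]      # marker falling after mitigation

change = trend_change(baseline, after)
print(f"Trend change: {change:+.2f} mg/l per month")
```

A negative change here would suggest the marker is rising more slowly (or falling) after the measure; a real analysis would also need to account for seasonality and weather.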
Equipment for data capture • Bank-side water-quality monitoring station • Drilling a borehole for monitoring groundwater [Images thanks to Wensum DTC]
Bank-side water-quality monitoring station [Image from Wensum DTC] • Labelled components (LHS and RHS views): mains power, nitrate probe, ammonium analyser, ISCO automatic water sampler, pump, flow cell, YSI multi-parameter sonde, Meteor telemetry unit, Total P and Total reactive P analyser
Purpose of the archive • Curating data generated and captured by DTC projects • DTCs create data; we have to make it useful! • Data archive, but also querying, browsing, visualising, analysing, other interactions • Integrated views across diverse data • Must meet the needs of different users – researchers, but also land managers, civil servants, planners, ...
The Data • Mostly numerical in some form: spreadsheets, databases, CSV files • Sensor data (automated, telemetry) • Manual samples/analyses • Species/ecological data • Geo-data • Also less highly structured information: • Time series images, video • Stakeholder surveys • Unstructured documents
Example: water quality data • 61,752 data points per year for all stations
Challenges of data • Not primarily an issue of scale • Datasets diverse in terms of structure • Different degrees of structuring: • Highly structured (e.g. sensor outputs) • Highly unstructured (e.g. surveys, interviews) • Different types of structure (tables of data, geospatial) • Some small, hand-crafted data sets. • Idiosyncratic metadata, description, vocabularies • Varying provenance and reliability
INSPIRE • Another EU directive • An Infrastructure for Spatial Information in the European Community • Create a European Spatial Data Infrastructure for improved sharing of spatial information • Includes standards for describing, representing, disseminating geo-spatial data, e.g. • Gemini2 for catalogue metadata • GML (Geography Markup Language) • Builds on ISO standards (ISO 19100 series)
Generic Data Model • ISO 19156: Observations and Measurements
Multiple Data Representations Generic data model implemented in several ways for different purposes: • Archival representation • based on library/archive standards • Data representation for data integration • “Atomic” representation as triples • Various derived representations • Generated for input to specific tools/analysis
Model for Integration • RDF triples • Atomic statements forming network of node/relations • Discrete datasets mapped into common format [Diagram: Subject –predicate→ Object, with subjects, predicates and objects identified by URIs; example: Species –memberOf→ Genus; Species –hasCommonName→ literal value "Water flea"]
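To make the triple model concrete, here is a minimal sketch of the Species/Genus example as plain Python data. It does not use a real RDF library, and the `http://example.org/dtc/` namespace, the `URI` and `Literal` classes and the `objects_of` helper are all assumptions made for illustration; the archive's actual vocabularies and identifiers are not given in the slides.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass(frozen=True)
class URI:
    """A node or predicate identified by a URI."""
    value: str


@dataclass(frozen=True)
class Literal:
    """A plain literal value, e.g. a common name."""
    value: str


Triple = Tuple[URI, URI, Union[URI, Literal]]

BASE = "http://example.org/dtc/"  # hypothetical namespace, not the project's

species = URI(BASE + "species/1")
genus = URI(BASE + "genus/1")
member_of = URI(BASE + "memberOf")
has_common_name = URI(BASE + "hasCommonName")

# Two atomic statements about the same subject form a small network.
graph: List[Triple] = [
    (species, member_of, genus),
    (species, has_common_name, Literal("Water flea")),
]


def objects_of(graph, subject, predicate):
    """Return every object linked to `subject` by `predicate`."""
    return [o for s, p, o in graph if s == subject and p == predicate]


common_names = objects_of(graph, species, has_common_name)
print(common_names)
```

Because every dataset is reduced to the same subject/predicate/object shape, statements from different sources can be merged into one list and queried together.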
Example dataset • English Lake District rainfall dataset – from FISH.Link project [Diagram entities: Dataset; Site (Tarn Name); Location (GridReference, Easting, Northing, Latitude, Longitude); Actor; CollectionMethod; two raw ObservationSets (About: Rainfall, Type: Raw, Unit: Inch); two derived ObservationSets (About: Rainfall, Type: Derived, Unit: mm, DependsOn: OS1, OS2, Duration: 1Day); each ObservationSet holds Observations with StartDate, EndDate, Value]
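The raw and derived observation sets in this example can be sketched as simple record types. This is an assumed, simplified reading of the diagram: the class and field names are illustrative rather than taken from the FISH.Link schema, the sample value is invented, and the derived set here depends on a single raw set, whereas the diagram shows derived sets depending on two (OS1, OS2).

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List

MM_PER_INCH = 25.4  # exact conversion factor


@dataclass
class Observation:
    start_date: date
    end_date: date
    value: float


@dataclass
class ObservationSet:
    set_id: str
    about: str            # e.g. "Rainfall"
    kind: str             # "Raw" or "Derived"
    unit: str             # e.g. "Inch" or "mm"
    depends_on: List[str] = field(default_factory=list)
    observations: List[Observation] = field(default_factory=list)


def derive_mm(raw: ObservationSet, set_id: str) -> ObservationSet:
    """Build a derived set in millimetres from a raw set in inches."""
    if raw.unit != "Inch":
        raise ValueError(f"expected unit 'Inch', got {raw.unit!r}")
    converted = [
        Observation(o.start_date, o.end_date, round(o.value * MM_PER_INCH, 2))
        for o in raw.observations
    ]
    return ObservationSet(set_id, raw.about, "Derived", "mm",
                          depends_on=[raw.set_id], observations=converted)


# Invented sample reading, for illustration only.
os1 = ObservationSet("OS1", "Rainfall", "Raw", "Inch", observations=[
    Observation(date(1990, 1, 1), date(1990, 1, 2), 0.5),
])
os3 = derive_mm(os1, "OS3")
print(os3.unit, os3.depends_on, os3.observations[0].value)
```

Recording `depends_on` keeps the provenance of derived values explicit, so a user can trace a millimetre figure back to the raw reading it came from.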
Dataset capture and mapping • Columns, concepts, entities mapped to formal vocabularies • Mappings defined in archive objects • Automated (e.g. sensor output files) • Computer-assisted (e.g. some spreadsheets) • Manual, by domain experts (e.g. marking up values in texts) [Figure: Spreadsheet transformation workflow – from FISH.Link project]
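The automated case can be pictured as a declared mapping from column names to vocabulary terms, applied row by row to produce triples. The sketch below is an assumption about how such a mapping might look, not the project's transformation workflow: the vocabulary URIs, column names, `rows_to_triples` function and sample row are all hypothetical.

```python
import csv
import io

VOCAB = "http://example.org/vocab/"  # hypothetical vocabulary namespace

# Hypothetical mapping from spreadsheet columns to vocabulary terms.
COLUMN_MAPPING = {
    "site": VOCAB + "siteName",
    "date": VOCAB + "startDate",
    "rain_in": VOCAB + "rainfallInches",
}


def rows_to_triples(csv_text, mapping, base):
    """Turn each CSV row into (subject, predicate, value) triples."""
    triples = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for i, row in enumerate(reader, start=1):
        subject = f"{base}observation/{i}"
        for column, value in row.items():
            predicate = mapping.get(column)
            if predicate is None:
                continue  # unmapped columns are left for assisted or manual mapping
            triples.append((subject, predicate, value))
    return triples


# Invented sample data, for illustration only.
SAMPLE = "site,date,rain_in,notes\nTarn A,1990-01-01,0.5,checked\n"
triples = rows_to_triples(SAMPLE, COLUMN_MAPPING, "http://example.org/dtc/")
for t in triples:
    print(t)
```

Keeping the mapping as data rather than code means a domain expert can review or extend it without touching the transformation logic.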
Architectural Overview [Diagram components: Browsing, Visualisation, Search, Analysis; Mappings; RDF triples; Mappings; Archive Objects; Source datasets]
Current Status and Next Steps • Archive project started Jan. 2011, runs until the end of 2014. • Datasets are already being generated in large quantities. • Prototype functionality • Modelling and ingestion of data (incremental) • Next steps: • Extend types of dataset covered. • User interactions (queries, visualisation etc.)
Thank you • mark.hedges@kcl.ac.uk • MHaft@fba.org.uk • http://dtcarchive.org/