
Data, Metadata, and Ontology in Ecology



Presentation Transcript


  1. Data, Metadata, and Ontology in Ecology Matthew B. Jones National Center for Ecological Analysis and Synthesis (NCEAS) University of California Santa Barbara and many major collaborators: Mark Schildhauer, Josh Madin, Jing Tao, Chad Berkley, Dan Higgins, Peter McCartney, Chris Jones, Shawn Bowers, Bertram Ludaescher, and others April 24, 2007

  2. Scaling-up Synthesis • More than 400 projects at NCEAS have produced over 1000 publications that synthesize and re-use existing data • massive investment in compiling, integrating, and analyzing data • Building a custom database for each project is not logistically feasible • Instead, need loosely-coupled systems that accommodate heterogeneity

  3. Dilemma: no unified model • No single database suffices • Data warehouses use federated schemas • any data that does not fit is not captured • original data transformed to fit federation • this is a form of data integration for one purpose • Numerous data warehouses exist • not extensible for all data • VegBank, ClimDB, GenBank, PDB, etc.

  4. Data Collections • Metadata-based data collections • Loosely-coupled metadata and data collections • No constraints on data schemas • Data discovery based on metadata • Dynamic data loading and query based on metadata descriptions

  5. What is EML? • Ecological Metadata Language • modular • extensible • comprehensive • <EML> diagram sections: Identity and Discovery Information; Coverage: Space, Time, Taxa; Methods; Physical Data Format; Logical Data Model; Access and Distribution; A …

  6. EML: Selected relationships [Timeline figure, 1990 to 2009. Labeled items: Michener ’97 paper, NBII BDP, FGDC created, ISO 19115, CSDGM 1.0, EML 1.0.0, EML 1.3.0, ESA FLED Report, EML 1.4.x, EML 2.0.0, EML 2.0.1, OBOE, XML 1.0, Dublin Core]

  7. A simple EML example [Tree diagram: eml (packageId: sbclter.316.18, system: knb) contains dataset; dataset contains title: Kelp Forest Community Dynamics: Benthic Fish, plus creator and contact; creator and contact each hold an individualName with a surName (Evans and Reed)]
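As a rough illustration of the structure on this slide, the sketch below builds a tiny EML-like document and reads it back with Python's standard library. The element nesting follows the slide. Which surname belongs to the creator and which to the contact is an assumption made only for the example, as is the EML 2.0.1 namespace URI on the root element.

```python
import xml.etree.ElementTree as ET

# Minimal EML-like document. Assigning Evans to creator and Reed to contact
# is an assumption for illustration; the slide does not make the pairing clear.
EML_DOC = """<eml:eml xmlns:eml="eml://ecoinformatics.org/eml-2.0.1"
         packageId="sbclter.316.18" system="knb">
  <dataset>
    <title>Kelp Forest Community Dynamics: Benthic Fish</title>
    <creator><individualName><surName>Evans</surName></individualName></creator>
    <contact><individualName><surName>Reed</surName></individualName></contact>
  </dataset>
</eml:eml>"""


def summarize(eml_text):
    """Pull the identity and discovery fields shown on the slide."""
    root = ET.fromstring(eml_text)
    dataset = root.find("dataset")  # child elements carry no namespace here
    return {
        "packageId": root.get("packageId"),
        "system": root.get("system"),
        "title": dataset.findtext("title"),
        "creators": [e.text for e in dataset.findall("creator/individualName/surName")],
        "contacts": [e.text for e in dataset.findall("contact/individualName/surName")],
    }


if __name__ == "__main__":
    print(summarize(EML_DOC))
```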

  8. Data Discovery Geographic, Temporal, and Taxonomic coverage
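One way to picture discovery by coverage is as an overlap test on three axes. The sketch below is a minimal, hypothetical model: the `Coverage` class, the package identifiers, the coordinates, and the taxa are all invented for illustration and do not come from the KNB. Bounding boxes that cross the antimeridian are ignored.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Coverage:
    west: float
    east: float
    south: float
    north: float
    begin: date
    end: date
    taxa: frozenset


def overlaps(a_lo, a_hi, b_lo, b_hi):
    """True when the closed intervals [a_lo, a_hi] and [b_lo, b_hi] intersect."""
    return a_lo <= b_hi and b_lo <= a_hi


def matches(cov, query):
    """A package matches when space and time overlap and any queried taxon is covered."""
    return (
        overlaps(cov.west, cov.east, query.west, query.east)
        and overlaps(cov.south, cov.north, query.south, query.north)
        and overlaps(cov.begin, cov.end, query.begin, query.end)
        and (not query.taxa or bool(cov.taxa & query.taxa))
    )


def discover(catalog, query):
    return sorted(pid for pid, cov in catalog.items() if matches(cov, query))


# Hypothetical catalog entries.
CATALOG = {
    "demo.1": Coverage(-120.5, -119.5, 34.0, 34.6, date(2000, 1, 1),
                       date(2006, 12, 31), frozenset({"Actinopterygii"})),
    "demo.2": Coverage(-110.0, -105.0, 30.0, 36.0, date(1990, 1, 1),
                       date(1995, 12, 31), frozenset({"Quercus"})),
}

QUERY = Coverage(-121.0, -119.0, 33.0, 35.0, date(2001, 1, 1),
                 date(2001, 12, 31), frozenset({"Actinopterygii"}))

if __name__ == "__main__":
    print(discover(CATALOG, QUERY))
```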

  9. Logical Model: Attribute structure • Describes data tables and their variables/attributes • a typical data table with 10 attributes • some metadata are likely apparent, others ambiguous • missing value code is present • definitions need to be explicit, as well as data typing • Figure callouts: Species Codes, Value bounds, Date Format, Code definitions

     YEAR  MONTH  DATE        SITE  TRANSECT  SECTION  SP_CODE  SIZE  OBS_CODE  NOTES
     2001  8      2001-08-22  ABUR  1         0-20     CLIN     5     06        .
     2001  8      2001-08-22  ABUR  1         21-40    OPIC     11    06        .
     2001  8      2001-08-22  ABUR  1         21-40    OPIC     10    06        .
     2001  8      2001-08-22  ABUR  1         21-40    OPIC     14    06        .
     2001  8      2001-08-22  ABUR  1         21-40    OPIC     7     06        .
     2001  8      2001-08-22  ABUR  1         21-40    OPIC     19    06        .
     2001  8      2001-08-22  ABUR  1         21-40    COTT     5     06        .
     2001  8      2001-08-22  ABUR  2         0-20     CLIN     5     06        .
     2001  8      2001-08-22  ABUR  2         21-40    NF       0     06        .
     2001  8      2001-08-27  AHND  1         0-20     NF       0     03        .
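To make the point about explicit definitions concrete, here is a minimal sketch of attribute-level metadata driving the parsing of one row from the table on this slide. The `ATTRIBUTES` structure is a hypothetical stand-in for EML attribute descriptions, not EML itself. The types, bounds, and the choice of "." as the missing value code for NOTES are assumptions inferred from the sample rows.

```python
from datetime import datetime

# Hypothetical attribute descriptions; types and bounds are assumptions.
ATTRIBUTES = [
    {"name": "YEAR", "type": "int"},
    {"name": "MONTH", "type": "int", "min": 1, "max": 12},
    {"name": "DATE", "type": "date", "format": "%Y-%m-%d"},
    {"name": "SITE", "type": "str"},
    {"name": "TRANSECT", "type": "int"},
    {"name": "SECTION", "type": "str"},
    {"name": "SP_CODE", "type": "code", "codes": {"CLIN", "OPIC", "COTT", "NF"}},
    {"name": "SIZE", "type": "int", "min": 0},
    {"name": "OBS_CODE", "type": "str"},  # kept as text so "06" keeps its leading zero
    {"name": "NOTES", "type": "str", "missing": "."},
]


def parse_row(line):
    """Convert one whitespace-delimited row into typed values using the metadata."""
    fields = line.split()  # assumes no field contains spaces
    if len(fields) != len(ATTRIBUTES):
        raise ValueError(f"expected {len(ATTRIBUTES)} fields, got {len(fields)}")
    record = {}
    for attr, raw in zip(ATTRIBUTES, fields):
        if raw == attr.get("missing"):
            record[attr["name"]] = None  # missing value code maps to None
            continue
        kind = attr["type"]
        if kind == "int":
            value = int(raw)
        elif kind == "date":
            value = datetime.strptime(raw, attr["format"]).date()
        elif kind == "code":
            if raw not in attr["codes"]:
                raise ValueError(f"unknown code {raw!r} for {attr['name']}")
            value = raw
        else:
            value = raw
        if "min" in attr and value < attr["min"]:
            raise ValueError(f"{attr['name']} below lower bound: {value}")
        if "max" in attr and value > attr["max"]:
            raise ValueError(f"{attr['name']} above upper bound: {value}")
        record[attr["name"]] = value
    return record


if __name__ == "__main__":
    print(parse_row("2001 8 2001-08-22 ABUR 1 0-20 CLIN 5 06 ."))
```

Without the metadata, a reader could not tell that "." means "no note" or that OBS_CODE is a code rather than a count, which is the ambiguity the slide points at.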

  10. EML Measurement Scale

      Scale     Domain   Description                                    Example
      Nominal   Textual  Categories                                     Male, Female
      Ordinal   Textual  Ordered Categories                             Low, Medium, High
      Interval  Numeric  Equidistant on number scale                    3 Celsius
      Ratio     Numeric  Equidistant on number scale, meaningful ratio  5 meter
      Datetime  Dates    Points on calendar timescale                   6-Oct-2004
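A small sketch of why the measurement scale matters to software: once the scale is recorded, a tool can decide which operations make sense for a column. The `SCALES` table and `can_compute` helper are hypothetical. The operation flags follow the usual levels-of-measurement convention rather than anything stated on the slide.

```python
# Hypothetical lookup: which comparisons are meaningful at each scale.
SCALES = {
    "nominal":  {"domain": "textual", "ordered": False, "differences": False, "ratios": False},
    "ordinal":  {"domain": "textual", "ordered": True,  "differences": False, "ratios": False},
    "interval": {"domain": "numeric", "ordered": True,  "differences": True,  "ratios": False},
    "ratio":    {"domain": "numeric", "ordered": True,  "differences": True,  "ratios": True},
    "datetime": {"domain": "dates",   "ordered": True,  "differences": True,  "ratios": False},
}


def can_compute(scale, operation):
    """Return True if `operation` ('ordered', 'differences', 'ratios') is meaningful."""
    return SCALES[scale][operation]


if __name__ == "__main__":
    # 10 meters is twice 5 meters, but 6 Celsius is not "twice as warm" as 3 Celsius.
    print(can_compute("ratio", "ratios"), can_compute("interval", "ratios"))
```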

  11. Logical Model: unit Dictionary • Consistent assignment of measurement units • Quantitative definitions in terms of SI units • ‘unitType’ expresses dimensionality • time, length, mass, energy are all ‘unitType’s • second, meter, gram, pound, joule are all ‘unit’s • Diagram: UnitType Mass, Unit gram, with kilogram defined as gram x1000
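The idea of a unit dictionary can be sketched as a table that gives each unit a unitType and a multiplier to a reference SI unit. The `UNITS` table and `convert` function below are hypothetical and much simpler than the EML unit dictionary. The multipliers are standard definitions, with kilogram taken as the SI reference for mass.

```python
# unit name -> (unitType, multiplier to the SI reference unit for that type)
UNITS = {
    "kilogram": ("mass", 1.0),
    "gram": ("mass", 0.001),
    "pound": ("mass", 0.45359237),
    "meter": ("length", 1.0),
    "second": ("time", 1.0),
    "joule": ("energy", 1.0),
}


def convert(value, from_unit, to_unit):
    """Convert between two units, refusing to mix dimensions (unitTypes)."""
    from_type, from_factor = UNITS[from_unit]
    to_type, to_factor = UNITS[to_unit]
    if from_type != to_type:
        raise ValueError(f"cannot convert {from_type} to {to_type}")
    return value * from_factor / to_factor


if __name__ == "__main__":
    print(convert(2500, "gram", "kilogram"))
```

Because every unit is defined quantitatively against SI, two data sets recorded in grams and pounds can be brought to a common unit without human interpretation.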

  12. Collating metadata • Most scientists know all of this information about their data • EML simply provides a standardized format for recording the information • Enables data exchange across organizations and software systems

  13. Knowledge Network for Biocomplexity (KNB) Building a community data network • Simplified data sharing • Immediate change tracking • Redundant backup • Data maintained by individuals • Access controlled by individuals • Network diagram nodes: AND, KNB 1, LTER, KNB II, GCE, ... (26), PISCO, ESA, NCEAS, OBFS

  14. EML-described data in the KNB [Figure: Data Packages in the KNB; cumulative count (axis 0 to 12000) by year, 2002 to 2006]

  15. Kepler: dynamic data loading • Kepler supports dynamic data loading: • Data sources are discovered via metadata queries • EML metadata allows arbitrary schemas to be loaded into an embedded database • Data queries can be performed before data flows downstream • Workflow figure: data source from EcoGrid (metadata-driven ingestion) feeding an R processing script:

      res <- lm(BARO ~ T_AIR)
      res
      plot(T_AIR, BARO)
      abline(res)
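The metadata-driven loading step can be sketched roughly as follows: a table schema is generated from attribute descriptions, the rows are loaded, and a query runs before anything is passed downstream. This is not Kepler code. SQLite stands in for the embedded database, the `load_table` helper and attribute format are hypothetical, and the sample values are invented. Only the column names T_AIR and BARO come from the slide's R script.

```python
import sqlite3

SQL_TYPES = {"int": "INTEGER", "float": "REAL", "str": "TEXT"}


def load_table(conn, table_name, attributes, rows):
    """Create a table whose schema comes entirely from the attribute metadata."""
    columns = ", ".join(f'"{a["name"]}" {SQL_TYPES[a["type"]]}' for a in attributes)
    conn.execute(f'CREATE TABLE "{table_name}" ({columns})')
    placeholders = ", ".join("?" for _ in attributes)
    conn.executemany(f'INSERT INTO "{table_name}" VALUES ({placeholders})', rows)


# Hypothetical metadata and data for illustration.
ATTRS = [
    {"name": "SITE", "type": "str"},
    {"name": "T_AIR", "type": "float"},
    {"name": "BARO", "type": "float"},
]
ROWS = [("A", 12.5, 1013.2), ("A", 14.0, 1011.8), ("B", 9.5, 1016.0)]


def warm_records(threshold):
    """Load the data, then query it before handing rows to a downstream step."""
    conn = sqlite3.connect(":memory:")
    try:
        load_table(conn, "weather", ATTRS, ROWS)
        cursor = conn.execute(
            "SELECT T_AIR, BARO FROM weather WHERE T_AIR > ? ORDER BY T_AIR",
            (threshold,),
        )
        return cursor.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    print(warm_records(10.0))
```

The loader never hard-codes a schema, which is what lets arbitrary data sets be ingested as long as their metadata describes the columns.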

  16. Importance of semantics • So far we’ve dealt only with the logical data model • any semantics in EML are expressed in natural language • The computer doesn’t really understand: • what is being measured • how measurements relate to one another • how semantics map to logical structure • Analysis depends on understanding the semantic contextual relationships among data measurements • e.g., density measured within subplot

  17. Goal: semantically describe the structure of scientific observation and measurement as found in a data set Observations are made about particular entities. Observations can provide context for other observations. Entities represent real-world objects or concepts that can be measured. Every measurement has a characteristic, which defines the property of the entity being measured. Observation ontology (OBOE) Provide extension points for loading specialized domain ontologies slide from J. Madin
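The relationships named on this slide can be sketched as a few plain data classes. This is an illustrative model only, not the OBOE ontology itself. Class and field names are chosen to mirror the slide's vocabulary, and limiting each observation to a single context is a simplifying assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Entity:
    name: str  # a real-world object or concept, e.g. "Organism"


@dataclass
class Characteristic:
    name: str  # the property being measured, e.g. "Height"


@dataclass
class Measurement:
    characteristic: Characteristic
    value: object


@dataclass
class Observation:
    entity: Entity
    measurements: List[Measurement] = field(default_factory=list)
    context: Optional["Observation"] = None  # assumption: at most one context


def values_of(observation, characteristic_name):
    """Collect the values measured for one characteristic of the observed entity."""
    return [m.value for m in observation.measurements
            if m.characteristic.name == characteristic_name]


# Hypothetical instance data: a tree observed within a plot within a site.
site = Observation(Entity("Site"), [Measurement(Characteristic("LocationName"), "Hendricks")])
plot = Observation(Entity("Plot"), [Measurement(Characteristic("Label"), 1)], context=site)
tree = Observation(Entity("Organism"), [Measurement(Characteristic("Height"), 12.2)], context=plot)

if __name__ == "__main__":
    print(values_of(tree, "Height"), tree.context.context.entity.name)
```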

  18. Semantic annotation • Relational data lacks critical semantic information • no way for computer to determine that “Ht.” represents a “height” measurement • no way for computer to determine if Plot is nested within Site or vice-versa • no way for computer to determine if the Temp applies to Site or Plot or Species Observation Ontology Mapping between data and the ontology via semantic annotation Data set slide from J. Madin

  19. [Figure: a data table with Entity and Characteristic annotations]

      Date   Site       Plot  Species  Height
      10/12  Hendricks  1     AHYA     12.2
      10/12  Hendricks  1     AHYA     11.0
      10/12  Hendricks  1     AHYA     9.7
      …      …          …     …        …

      Annotation labels in the figure (Entity and Characteristic rows, three hasContext links), in extracted order: Time, Date, Space, LocationName, Space, Label, Area, TaxonomicName, Organism, Height, h
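Semantic annotation can be pictured as a lookup from column names to ontology terms. In the sketch below, the pairing of each column with an (Entity, Characteristic) pair is one reading of the figure and should be treated as an assumption. The second data set with an "Ht." column is hypothetical, echoing the example from the semantic annotation slide.

```python
# Column name -> (Entity, Characteristic). Pairings are an assumed reading of the figure.
ANNOTATIONS_A = {
    "Date": ("Time", "Date"),
    "Site": ("Space", "LocationName"),
    "Plot": ("Space", "Label"),
    "Species": ("Organism", "TaxonomicName"),
    "Height": ("Organism", "Height"),
}

# Hypothetical second data set that abbreviates its column names.
ANNOTATIONS_B = {
    "Sp.": ("Organism", "TaxonomicName"),
    "Ht.": ("Organism", "Height"),
}


def columns_measuring(annotations, entity, characteristic):
    """Find the columns annotated with the given entity and characteristic."""
    return [column for column, pair in annotations.items()
            if pair == (entity, characteristic)]


if __name__ == "__main__":
    # Both data sets expose a height column even though the names differ.
    print(columns_measuring(ANNOTATIONS_A, "Organism", "Height"))
    print(columns_measuring(ANNOTATIONS_B, "Organism", "Height"))
```

With annotations in place, a tool can match "Height" and "Ht." by meaning rather than by column name.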

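  20. [Figure: a data table with Entity and Characteristic annotations]

      Tree  Plot  Species  Count
      A     1     AHYA     3
      A     2     AHYA     2
      A     3     AHYA     8
      …     …     …        …

      Annotation labels in the figure (two hasContext links), in extracted order: Entity: Organism, Space, Organism; Characteristic: Label, Replicate, Area, TaxonomicName, Abundance; diagram labels: B, C, A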

  21. Extension points Observation ontology slide from J. Madin

  22. Observation A high-level assertion that a thing was observed

  23. Entity All things (concrete and conceptual) that are observable

  24. Entity extension An extension point for domain-specific terms

  25. Context Asserts a “containment” relationship between entities

  26. Context Context is transitive
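Transitivity means that if A has context B and B has context C, then A also has context C. A minimal sketch of computing that closure over directly asserted context links follows. The entity names and the direction of nesting (a tree within a plot within a site) are assumptions for illustration.

```python
def transitive_context(direct):
    """Expand directly asserted context links into their transitive closure.

    `direct` maps each entity to the set of entities that directly contain it.
    """
    closure = {entity: set(contexts) for entity, contexts in direct.items()}
    changed = True
    while changed:
        changed = False
        for contexts in closure.values():
            inherited = set()
            for ctx in contexts:
                inherited |= closure.get(ctx, set())
            if not inherited <= contexts:
                contexts |= inherited  # add contexts of contexts
                changed = True
    return closure


# Hypothetical direct assertions.
DIRECT = {"Tree": {"Plot"}, "Plot": {"Site"}, "Site": set()}

if __name__ == "__main__":
    print(transitive_context(DIRECT))
```

Here the link from Tree to Site is never stated directly. It is inferred, which is what lets a site-level measurement such as temperature be applied to observations of individual trees.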

  27. Measurement Observations are composed of measurements, which relate measurable characteristics to the entity being observed

  28. Characteristic

  29. Summary • EML captures critical metadata • OBOE adds critical semantic descriptions • Data discovery and integration tools can be built that leverage metadata and ontologies • Metadata and ontologies permit: • Loosely-coupled systems • Schema independence in data systems • Semantic data integration • Capturing the data as collected, rather than a derived product

  30. Vegetation Schema Questions • Vegetation schema • Exchange standard or federation? • Can we accommodate all data that is collected in vegetation plots? • or just a transformed subset • XML? RDF? OWL? other? • Should a vegetation schema link to other evolving community standards? • EML? • OBOE?

  31. Questions? • http://www.nceas.ucsb.edu/ecoinformatics/ • http://knb.ecoinformatics.org/ • http://seek.ecoinformatics.org/ • http://kepler-project.org/

  32. Acknowledgements • Knowledge Representation Working Group • Mark Schildhauer, Matt Jones (NCEAS) • Shawn Bowers, Bertram Ludaescher, Dave Thau (UCD) • Deana Pennington (UNM) • Serguei Krivov, Ferdinando Villa (UVM) • Corinna Gries, Peter McCartney (ASU) • Rich Williams (Microsoft)

  33. Acknowledgments • This material is based upon work supported by: • The National Science Foundation under Grant Numbers 9980154, 9904777, 0131178, 9905838, 0129792, and 0225676. • Collaborators: NCEAS (UC Santa Barbara), University of New Mexico (Long Term Ecological Research Network Office), San Diego Supercomputer Center, University of Kansas (Center for Biodiversity Research), University of Vermont, University of North Carolina, Napier University, Arizona State University, UC Davis • The National Center for Ecological Analysis and Synthesis, a Center funded by NSF (Grant Number 0072909), the University of California, and the UC Santa Barbara campus. • The Andrew W. Mellon Foundation. • Kepler contributors: SEEK, Ptolemy II, SDM/SciDAC, GEON, RoadNet, EOL, Resurgence
