Ecological Informatics: Challenges and Benefits. Presentation to ESA Visions Committee, March 31, 2003. Mark Schildhauer, Ph.D., Director of Computing, NCEAS. http://knb.ecoinformatics.org http://seek.ecoinformatics.org
Ecological Informatics: Challenges and Benefits • Presentation to ESA Visions Committee, March 31, 2003 • Mark Schildhauer, Ph.D., Director of Computing, NCEAS • http://knb.ecoinformatics.org • http://seek.ecoinformatics.org
Research Team and Collaborators • PISCO • LTER Network • San Diego Supercomputer Center • Arizona State University • University of Kansas • University of North Carolina • OBFS Network • UC NRS • Sandy Andelman • Chad Berkley • Matthew Brooke • John Harris • Dan Higgins • Matt Jones • Jim Reichman • Mark Schildhauer • Jing Tao
What is Ecoinformatics? • Data Acquisition • Integration • Storage, archiving • Distributed Access • Results
Ecoinformatics • The Goal: to develop technology tools and services to enable more efficient acquisition, integration, and analysis of ecological data • Specific Challenges • An Approach to Technology Solutions (KNB) • Future Directions • a Science Environment for Ecological Knowledge, SEEK
Status of Ecological Data • Highly dispersed • Different individuals, organizations, and locations • Extreme heterogeneity • in Form, Content, and Meaning • Lack of Documentation (metadata) • Lack of metadata overall • Many standards in use, many custom types • Implementations are not modular
Data are Highly Dispersed… • Data are distributed among: • Independent researcher holdings • Research station collections • LTER Network (24 sites) • Org. of Biological Field Stations (160+ sites) • Univ. Cal Natural Reserve System (36 sites) • Agency databases • Museum databases
Data are physically dispersed… Visitors to NCEAS Field Stations in North America
Data are very heterogeneous… • Population survey • Experimental • Taxonomic survey • Behavioral • Meteorological • Oceanographic • Hydrology • … • Syntax (format) • Schema (organization) • Semantics (meaning/methods)
Thematic heterogeneity due to Vast Scope of Ecology • Biosphere • Abiotic • Biomes • Communities • Organisms • Genes
Classifying Data Heterogeneity • Syntax (format) • Schema (organization) • Semantics (knowledge/meaning/methods)
Data Lacking in Documentation • Majority of ecological data undocumented • Lack information on syntax, structure and semantics of data • Impossible to understand data without contacting the original researchers; even then memories can fail, individuals retire or expire • Documentation conventions vary widely • Requires large time investment to understand each data set
Summary of Technical Challenges • Because of: • Data dispersion • Data heterogeneity • Lack of documentation • Integration and synthesis are limited to a manual process • Difficult to scale integration efforts up to large numbers of data sets
Solutions • Standardized measurements • Changes needed in culture, training • Technology development: metadata, data servers, desktop tools
Ecoinformatics Research Objectives • Enhance access to ecological and environmental data • Promote data sharing & re-use • Enable national data discovery • Provide access to research stations’ data resources • Maintain local autonomy for data management • Synthesis and Analysis • Promote cross-cutting analysis • Taxonomic, Spatial, Temporal, Conceptual integration of data • Data preservation • Long term data description • Provide archiving capabilities
Functional breakdown for Analysis • Data discovery • Data access • Data storage/archive • Data interpretation • Quality assessment • Data Conversion & Integration • Analysis & Modeling • Visualization
KNB Development Projects (Knowledge Network for Biocomplexity) • Ecological Metadata Language (EML) • Prospective standard for ecological metadata • Metacat • A freely available database for storing metadata • Morpho • A freely available tool for creating metadata
KNB Overview • [Diagram: clients (Morpho, Web Browser) exchange Metadata (EML) and Data with servers (Metacat)]
KNB Development Projects • Ecological Metadata Language (EML) • Metacat • Morpho
Why the big buzz about Metadata? • Metadata are the basis for the next generation of the Web: • “The Semantic Web is a web of data, in some ways like a global database… The driver for the Semantic Web is …metadata” -- Tim Berners-Lee, father of the Web • Digital Library Community: “Era of Metadata 1998-200?” -- Carol Mandel, Digital Librarian
Central Role of Metadata • What are metadata? • Data documentation • Ownership, attribution, structure, contents, methods, quality, etc. • Critical for addressing data heterogeneity issues • Critical for developing extensible systems • Critical for long-term data preservation • Allows advanced services to be built
Data – just numbers
072998  29.5  17.0
073098  29.7  6.1
073198  29.1  0
Data + Metadata = numbers + context
         Date    Temp (C)  Precip. (mm)
Obs. #1  072998  29.5      17.0
Obs. #2  073098  29.7      6.1
Obs. #3  073198  29.1      0
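The point of this slide can be sketched in a few lines of Python: the same three rows only become interpretable once a metadata record names each column, its unit, and its format. The `METADATA` structure and its field names are assumptions made for illustration (they are not EML), as is the reading of the first column as an MMDDYY date.

```python
from datetime import datetime

# The raw rows from the slide: numbers with no context.
RAW_ROWS = [
    "072998 29.5 17.0",
    "073098 29.7 6.1",
    "073198 29.1 0",
]

# Hypothetical metadata record (field names are illustrative, not EML).
# Assumption: the first column is a date written as MMDDYY.
METADATA = [
    {"name": "date", "type": "date", "format": "%m%d%y"},
    {"name": "temp", "type": "float", "unit": "C"},
    {"name": "precip", "type": "float", "unit": "mm"},
]


def interpret(row, metadata):
    """Turn one whitespace-delimited row into a typed, named record."""
    values = row.split()
    if len(values) != len(metadata):
        raise ValueError(f"expected {len(metadata)} columns, got {len(values)}")
    record = {}
    for raw, attribute in zip(values, metadata):
        if attribute["type"] == "date":
            record[attribute["name"]] = datetime.strptime(raw, attribute["format"]).date()
        else:
            record[attribute["name"]] = float(raw)
    return record


observations = [interpret(row, METADATA) for row in RAW_ROWS]
```

Without `METADATA`, nothing in the rows says that `072998` is a date or that `17.0` is millimetres of precipitation; that knowledge otherwise lives only with the original researcher.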
Rules of Thumb (Michener 2000) • the more comprehensive the metadata, the greater the longevity (and value) of the data • structured metadata can greatly facilitate data discovery, encourage “best metadata practices” and support data and metadata use by others • metadata implementation takes time!!! • start implementing metadata for new data collection efforts and then prioritize “legacy” and ongoing data sets that are of greatest benefit to the broadest user community
EML 2.0: a formal ecological metadata specification • eml-resource -- Basic resource info • eml-dataset -- Data set info • eml-literature -- Citation info • eml-software -- Software info • eml-party -- People and Organizations • eml-entity -- Data entity (table) info • eml-attribute -- Attribute (variable) info • eml-constraint -- Integrity constraints • eml-physical -- Physical format info • eml-access -- Access control • eml-distribution -- Distribution info • eml-project -- Research project info • eml-coverage -- Geographic, temporal and taxonomic coverage • eml-protocol -- Methods and QA/QC
KNB Development Projects • Ecological Metadata Language (EML) • Metacat • Morpho
Metacat – metadata storage • Metadata storage, search, presentation • Schema independent – supports arbitrary XML types • Multiple metadata standards • Ecological Metadata Language • NBII Biological Data Profile • Data storage + preservation • Replication • Flexible access control system • National distributed directory service • Strong version control • Configurable web interface (XSLT)
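To make "schema independent" and "strong version control" concrete, here is a toy in-memory store in Python. It is only a sketch of the two ideas and is not Metacat's API: the class name, method names, document identifiers, and XML element names are all hypothetical.

```python
import xml.etree.ElementTree as ET


class ToyMetadataStore:
    """Toy store: accepts any well-formed XML and keeps every revision."""

    def __init__(self):
        self._revisions = {}  # docid -> list of XML strings, oldest first

    def put(self, docid, xml_text):
        """Store a new revision and return its revision number (1-based)."""
        ET.fromstring(xml_text)  # reject malformed XML; no schema is imposed
        self._revisions.setdefault(docid, []).append(xml_text)
        return len(self._revisions[docid])

    def get(self, docid, rev=None):
        """Return the requested revision, or the latest one if rev is None."""
        revisions = self._revisions[docid]
        return revisions[-1] if rev is None else revisions[rev - 1]

    def search(self, path, text):
        """Return docids whose latest revision has `text` in an element at `path`."""
        hits = []
        for docid in sorted(self._revisions):
            root = ET.fromstring(self.get(docid))
            if any(text.lower() in (el.text or "").lower() for el in root.findall(path)):
                hits.append(docid)
        return hits


store = ToyMetadataStore()
first_rev = store.put("demo.1", "<dataset><title>Barnacle density</title></dataset>")
second_rev = store.put("demo.1", "<dataset><title>Barnacle population density</title></dataset>")
store.put("demo.2", "<record><title>Kelp cover</title><keyword>kelp</keyword></record>")
barnacle_docs = store.search(".//title", "barnacle")
```

The two documents use different root elements and structures, yet both are stored and searched the same way, and the earlier revision of `demo.1` remains retrievable.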
Metacat network • [Diagram: a network of Metacat servers (labels include NRS, OBFS, NCEAS, LTER, SDSC) and sites (SEV, AND, CAP). Key: Metacat Catalog, Morpho clients, Web clients, Site metadata system, XML output filter]
KNB Development Projects • Ecological Metadata Language (EML) • Metacat • Morpho
Morpho Features • Guided Metadata creation • Wizards & editor • Automatically extract metadata during data import • Search all metadata – structured + free text • Contribute to KNB • Windows, Mac, Linux • Multiple metadata standards • EML • NBII Biological Data Profile • Extensible • Standalone (non-networked) mode
Objectives of the KNB & SEEK • National network for ecological data • Data discovery • Data access • Data interpretation • Enable advanced services • Quality management • Data integration through advanced queries • Visualization and analysis
Solutions • KNB • Ecological Metadata Language (EML) • Metacat -- flexible metadata database • Morpho -- data management for ecologists • SEEK (partners include NCEAS, KU, SDSC, LTER Network Office, CAP, Napier Univ., UVM, UNC) • Unified Portal to Ecological Data (ECOGRID) • Quality Assurance engine • Semantic Query Processor • Data integration and Analytical Pipelines
SEEK (Science Environment for Ecological Knowledge) – addressing semantic integration • Ontologies • EcoGrid: One-stop access to ecological and environmental data • Semantic Mediation: Data integration using logic-based reasoning • Analysis and Modeling Pipelines: Analysis workflows using semantic mediation
Quality Assessment • Integrity constraint checking • Data type checking • Metadata completeness • Data entry errors • Outlier detection • Check assertions about data • e.g., trees don’t shrink • e.g., sea urchins do
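Three of the checks listed above can be sketched in Python as follows. The function names, the sample column names, and the choice of a median-based outlier rule are assumptions made for illustration; the slides do not say how a quality assurance engine would implement them. The growth assertion is taxon-specific, as the slide notes: it holds for trees but would be wrong for sea urchins.

```python
from statistics import median


def check_types(rows, schema):
    """Return (row_index, column) pairs whose value cannot be cast to the declared type."""
    problems = []
    for index, row in enumerate(rows):
        for column, caster in schema.items():
            try:
                caster(row[column])
            except (KeyError, TypeError, ValueError):
                problems.append((index, column))
    return problems


def find_outliers(values, threshold=3.5):
    """Flag values whose modified z-score (based on the median absolute deviation) is large."""
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        return []  # no spread to measure against
    return [v for v in values if 0.6745 * abs(v - center) / mad > threshold]


def shrinking_trees(measurements):
    """Check the assertion 'trees don't shrink'.

    measurements: iterable of (tree_id, year, diameter_cm).
    Returns the set of tree ids whose diameter decreases between surveys.
    """
    last_diameter = {}
    flagged = set()
    for tree_id, _year, diameter in sorted(measurements, key=lambda m: (m[0], m[1])):
        if tree_id in last_diameter and diameter < last_diameter[tree_id]:
            flagged.add(tree_id)
        last_diameter[tree_id] = diameter
    return flagged


type_problems = check_types(
    [{"temp": "29.5", "precip": "17.0"}, {"temp": "n/a", "precip": "6.1"}],
    {"temp": float, "precip": float},
)
outliers = find_outliers([29.5, 29.7, 29.1, 29.4, 92.5])
suspect_trees = shrinking_trees(
    [("T1", 2001, 30.2), ("T1", 2002, 31.0), ("T2", 2001, 18.4), ("T2", 2002, 17.9)]
)
```

Here `92.5` is flagged as a likely data entry error (a transposed `29.5`), and tree `T2` is flagged because its recorded diameter decreased.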
Semantic metadata • Describes the relationship between measurements and ecologically relevant concepts • Drawn from a controlled vocabulary • Ontology for ecological measurements
Representing ontologies • OWL –Web Ontology Language • CKML – Conceptual Knowledge Markup Language • RDF – Resource Description Framework
Semantic Data Discovery • Knowledge of SQL or database languages is a barrier to data access and re-use • SELECT dsname FROM dslist WHERE meas_type LIKE 'pop_den' AND location = 'GBNPP' AND common_name = 'barnacles'; • Semantic Queries: allow scientists to express data queries in familiar scientific terms • What data sets contain population density estimates for barnacles in Glacier Bay National Park and Preserve? • Functionality enabled through semantic metadata
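A minimal sketch of the contrast on this slide, using Python's built-in `sqlite3` module: the scientist supplies familiar terms, and a small controlled vocabulary translates them into the SQL shown above. The table and column names come from the slide; the vocabulary dictionaries, the function name, and the sample data set names are hypothetical.

```python
import sqlite3

# Hypothetical controlled vocabulary mapping scientific terms to database codes.
MEASUREMENTS = {"population density": "pop_den"}
LOCATIONS = {"Glacier Bay National Park and Preserve": "GBNPP"}

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dslist (dsname TEXT, meas_type TEXT, location TEXT, common_name TEXT)"
)
# Made-up catalog entries, for illustration only.
conn.executemany(
    "INSERT INTO dslist VALUES (?, ?, ?, ?)",
    [
        ("intertidal_survey_a", "pop_den", "GBNPP", "barnacles"),
        ("intertidal_survey_b", "pop_den", "GBNPP", "mussels"),
        ("kelp_cover", "pct_cover", "GBNPP", "barnacles"),
    ],
)


def find_data_sets(connection, measurement, taxon, place):
    """Answer 'What data sets contain <measurement> for <taxon> in <place>?'"""
    sql = (
        "SELECT dsname FROM dslist "
        "WHERE meas_type LIKE ? AND location = ? AND common_name = ? "
        "ORDER BY dsname"
    )
    params = (MEASUREMENTS[measurement], LOCATIONS[place], taxon)
    return [row[0] for row in connection.execute(sql, params)]


matches = find_data_sets(
    conn,
    "population density",
    "barnacles",
    "Glacier Bay National Park and Preserve",
)
```

A real semantic query processor would draw these mappings from an ontology rather than from hard-coded dictionaries; the sketch only shows where semantic metadata sits between the question and the query.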
Data Integration • Data + Semantic Metadata + Researcher Decisions → Integrated Data Set
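The three ingredients on this slide can be illustrated with a small Python sketch. Two data sets record the same concept under different column names and units; semantic metadata identifies the concept and unit, and the researcher decides which concept to integrate and in which unit. All names, structures, and values here are hypothetical.

```python
# Unit converters keyed by (from_unit, to_unit).
CONVERTERS = {
    ("C", "C"): lambda v: v,
    ("F", "C"): lambda v: (v - 32.0) * 5.0 / 9.0,
}

# Data plus semantic metadata: each column is tied to a concept and a unit.
DATASETS = {
    "site_a": {
        "metadata": {"Temp": {"concept": "air_temperature", "unit": "C"}},
        "rows": [{"Temp": 29.5}, {"Temp": 29.7}],
    },
    "site_b": {
        "metadata": {"t_air": {"concept": "air_temperature", "unit": "F"}},
        "rows": [{"t_air": 86.0}],
    },
}

# Researcher decisions: what to integrate, and in which unit.
DECISIONS = {"concept": "air_temperature", "unit": "C"}


def integrate(datasets, decisions):
    """Merge every column matching the chosen concept into one table in the chosen unit."""
    integrated = []
    for source in sorted(datasets):
        dataset = datasets[source]
        for column, meaning in dataset["metadata"].items():
            if meaning["concept"] != decisions["concept"]:
                continue
            # Raises KeyError if no conversion is known, rather than guessing.
            convert = CONVERTERS[(meaning["unit"], decisions["unit"])]
            for row in dataset["rows"]:
                integrated.append({"source": source, "value": convert(row[column])})
    return integrated


integrated = integrate(DATASETS, DECISIONS)
```

Without the metadata there would be no way to know that `Temp` and `t_air` describe the same thing, or that one of them needs converting before the values can be compared.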
Re-using data from the KNB • Goal – support visualization & analysis • Scalability • Efficiently process more data from investigators • Broader spatial extent, longer temporal extent, robust taxonomic extent • Analytical Pipelines (Monarch prototype) • Flexible tool for exploratory analysis of data • Directly process data in the network • Utilize powerful analytical environments (SAS, Matlab, R, …) • Analysis audit trail • Reproduce analyses • Communicate about analyses • Automate new analyses based on earlier ones
Analysis Pipelines • [Diagram: a chain of Analysis Steps, each with a Description and Code, Inputs, and Outputs, connected by Runtime Data Binding]
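The pipeline structure in the diagram (steps with a description, code, inputs, and outputs, bound to data at run time) can be sketched in Python as below. This is an illustrative sketch, not the Monarch prototype's design; the class, function, and variable names are assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class AnalysisStep:
    description: str            # human-readable account of the step
    code: Callable[..., Dict]   # returns a dict keyed by output name
    inputs: List[str]           # names resolved against available data at run time
    outputs: List[str]          # names this step adds to the available data


def run_pipeline(steps, bindings):
    """Run steps in order, binding inputs by name, and record an audit trail."""
    data = dict(bindings)
    audit_trail = []
    for step in steps:
        arguments = {name: data[name] for name in step.inputs}  # runtime data binding
        result = step.code(**arguments)
        missing = [name for name in step.outputs if name not in result]
        if missing:
            raise KeyError(f"step '{step.description}' did not produce {missing}")
        for name in step.outputs:
            data[name] = result[name]
        audit_trail.append(
            {"step": step.description, "inputs": list(step.inputs), "outputs": list(step.outputs)}
        )
    return data, audit_trail


steps = [
    AnalysisStep(
        description="Compute mean temperature",
        code=lambda temps: {"mean_temp": sum(temps) / len(temps)},
        inputs=["temps"],
        outputs=["mean_temp"],
    ),
    AnalysisStep(
        description="Compute anomalies from the mean",
        code=lambda temps, mean_temp: {"anomalies": [t - mean_temp for t in temps]},
        inputs=["temps", "mean_temp"],
        outputs=["anomalies"],
    ),
]

results, audit_trail = run_pipeline(steps, {"temps": [29.5, 29.7, 29.1]})
```

Because each step declares its inputs and outputs by name, the same steps can be rerun against a different data set by changing only the initial bindings, and the audit trail records what was run and in what order.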
Data Acquisition (Jalama prototype) • Application to assist in data collection • Capture relevant metadata (e.g., EML) during initial data collection • Encourage good informatics practice via automating design of field data forms • Integration with Metadata and Data storage frameworks (e.g., Metacat)
Ecoinformatics Solutions! • Integration: MORPHO • Data Acquisition: JALAMA • Storage, archiving: ECOGRID • Distributed Access: METACAT • Analysis & Viz: MONARCH
Fin http://knb.ecoinformatics.org