Data Integration, Analysis, and Synthesis. Matthew B. Jones National Center for Ecological Analysis and Synthesis University of California Santa Barbara Scalable Information Networks for the Environment. http://knb.ecoinformatics.org
Data Integration, Analysis, and Synthesis Matthew B. Jones National Center for Ecological Analysis and Synthesis University of California Santa Barbara Scalable Information Networks for the Environment http://knb.ecoinformatics.org Funding: National Science Foundation (DEB99-80154, DBI99-04777)
NCEAS’ Mission • Integrate existing data for broad ecological synthesis • Use synthesis to inform policy and management
Synthesis at NCEAS • Research • Management • Policy • 200+ synthesis projects • 1900+ participating scientists
Research projects • Hunsaker – Quantification of Uncertainty in Spatial Data for Ecological Applications • Ives & Frost – Intrinsic and Extrinsic Variability in Community Dynamics • Osenberg – Meta-Analysis, Interaction Strength and Effect Size; Application of Biological Models to the Synthesis of Experimental Data • Murdoch – Complex Population Dynamics
Management projects • Andelman – Designing and Assessing the Viability of Nature Reserve Systems at Regional Scales: Integration of Optimization, Heuristic and Dynamic Models • Boersma & Kareiva – Prospectus For An Analysis of Recovery Plans and Delisting • Kareiva – Habitat Conservation Planning for Endangered Species • Lubchenco, Palumbi, & Gaines – Developing the Theory of Marine Reserves
Policy projects • Costanza & Farber -- The Value of the World's Ecosystem Services and Natural Capital: Toward a Dynamic, Integrated Approach • http://www.nceas.ucsb.edu/
Synthesis projects • Use existing data... • Distributed sources • Varying protocols • Varying formats • Obtained via personal collaboration
Functional breakdown • Functional breakdown for synthesis • Data discovery • Data access • Data storage • Data interpretation • Quality assessment • Data Conversion & Integration • Analysis & Modeling • Visualization
Presentation Outline • Integration, Analysis, and Synthesis: • Challenges
Data Heterogeneity • Economic • Social (urban ecology) • Paleoecological • Historical • Land use • Demographics • Population survey • Experimental • Taxonomic survey • Behavioral • Meteorological • Oceanographic • Hydrology • …
Types of Heterogeneity • Intentional vs. Arbitrary Heterogeneity • Syntax (format) • CSV, Fixed ASCII, proprietary binary • Schema (organization) • Non-normalized models • Semantics (meaning/methods) • Protocol semantics (e.g., scale) • Parameter semantics (e.g., bodysize (g)) • Conceptual framework (e.g., experimental treatments) • Taxonomy + nomenclature
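To make the syntax level concrete, here is a minimal sketch of the same two observations stored as CSV and as fixed-width ASCII, then read into identical records. The sample values, the column names, and the `FIXED_LAYOUT` offsets are assumptions made up for illustration; in practice the layout would come from the data set's own documentation.

```python
import csv
import io

# Hypothetical sample data: the same two observations in two syntaxes.
CSV_TEXT = "site,species,bodysize_g\nA1,Peromyscus,21.5\nB2,Microtus,34.0\n"
FIXED_TEXT = "A1  Peromyscus  21.5\nB2  Microtus    34.0\n"

# Assumed fixed-width layout: field name -> (start, end) character offsets.
FIXED_LAYOUT = {"site": (0, 4), "species": (4, 16), "bodysize_g": (16, 20)}


def read_csv(text):
    """Parse delimited text into records with a numeric body size."""
    return [
        {
            "site": row["site"],
            "species": row["species"],
            "bodysize_g": float(row["bodysize_g"]),
        }
        for row in csv.DictReader(io.StringIO(text))
    ]


def read_fixed(text, layout):
    """Parse fixed-width text into the same record shape as read_csv."""
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        rec = {name: line[start:end].strip() for name, (start, end) in layout.items()}
        rec["bodysize_g"] = float(rec["bodysize_g"])
        records.append(rec)
    return records


csv_records = read_csv(CSV_TEXT)
fixed_records = read_fixed(FIXED_TEXT, FIXED_LAYOUT)
print(csv_records == fixed_records)
```

Once both files are in one record shape, only the schema and semantic differences remain to be resolved.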
Data Dispersion • Data are distributed among: • Independent researcher holdings • Research station collections • LTER Network (24 sites) • Org. of Biological Field Stations (168 sites) • Univ. of California Natural Reserve System (36 sites) • MARINE (62 sites) • PISCO • Agency databases • Museum databases • Access via personal networking • Not scalable
Lack of Metadata • Majority of ecological data are undocumented • Missing information on syntax, schema, and semantics of data • Impossible to understand data without contacting the original researchers • Documentation conventions vary widely • Requires a large time investment to understand each data set
Scaling Data Integration • Because of: • Data heterogeneity • Data dispersion • Lack of documentation • Integration and synthesis are limited to a manual process • Thus, difficult to scale integration efforts up to large numbers of data sets
Data Integration • [Figure with elements labeled A, B, C]
Presentation Outline • Integration, Analysis, and Synthesis: • Challenges • Current work • Knowledge Network for Biocomplexity • Partnership for Biodiversity Informatics
Knowledge Network for Biocomplexity (KNB) • National network for biocomplexity data • Data discovery • Data access • Data interpretation • Enable advanced services • Data integration • Analysis framework • Hypothesis modeling • Visualization
Central Role of Metadata • What metadata? • Ownership, attribution, structure, contents, methods, quality, etc. • Critical for addressing data heterogeneity issues • Critical for developing extensible systems • Critical for long-term data preservation • Allows advanced services to be built
KNB Components • Ecological Metadata Language (EML) • Morpho -- data management for ecologists • Cross platform Java application • Metacat -- flexible metadata & data system • Analysis and Modeling engine • Data integration engine • Semantic Query Processor • Hypothesis Modeling Engine
Ecological Metadata Language • XML syntax for representing metadata • Extensible – can add new metadata • Modular – can subset metadata for specific applications
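A small sketch of what "modular" can mean for XML metadata: an application pulls out only the parts it needs. The document below is a hypothetical, simplified example; its element names are illustrative and do not follow the actual EML schema. Parsing uses Python's standard `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified metadata document. Element names are illustrative
# only and are NOT the real EML schema.
METADATA_XML = """
<metadata>
  <resource><title>Tree growth survey</title></resource>
  <party><name>J. Doe</name></party>
  <entity name="trees">
    <attribute name="tree_id" type="string"/>
    <attribute name="dbh" type="float" unit="cm"/>
  </entity>
</metadata>
"""


def subset(root, module_names):
    """Return a new document holding only the requested top-level modules."""
    out = ET.Element(root.tag)
    for child in root:
        if child.tag in module_names:
            out.append(child)
    return out


root = ET.fromstring(METADATA_XML)

# A discovery application may only need resource and party information.
discovery_view = subset(root, {"resource", "party"})

# An analysis application may only need the attribute (variable) descriptions.
attributes = {a.get("name"): dict(a.attrib) for a in root.iter("attribute")}

print([child.tag for child in discovery_view])
print(attributes["dbh"])
```

Because each consumer selects elements by name, adding a new top-level module to the document does not disturb either view, which is one way to read "extensible".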
EML 2.0beta3 modules • eml-resource -- Basic resource info • eml-dataset -- Data set info • eml-literature -- Citation info • eml-software -- Software info • eml-party -- People and Organizations • eml-entity -- Data entity (table) info • eml-attribute -- Attribute (variable) info • eml-constraint -- Integrity constraints • eml-physical -- Physical format info • eml-access -- Access control • eml-distribution -- Distribution info • eml-project -- Research project info • eml-coverage -- Geographic, temporal and taxonomic coverage • eml-protocol -- Methods and QA/QC
Metacat metadata system • [Diagram of Metacat nodes; labels: SEV, NRS, OBFS, AND, NCEAS, CAP, LTER, SDSC. Key: Metacat Catalog, Morpho clients, Web clients, Site metadata system, XML wrapper]
[Figure panels: OBFS Network • UC Natural Reserve System • LTER Network]
Functional breakdown • Functional breakdown for synthesis • Data discovery • Data access • Data storage • Data interpretation • Quality assessment • Data Conversion & Integration • Analysis & Modeling • Visualization
Quality Assessment system • [Diagram: Data + Semantic Metadata + Researcher Decisions → Quality Assessment Report]
Quality Assessment • Integrity constraint checking • Data type checking • Metadata completeness • Data entry errors • Outlier detection • Check assertions about data • e.g., trees don’t shrink • e.g., sea urchins do
Data Integration • [Diagram: Data + Semantic Metadata + Researcher Decisions → Integrated Data Set]
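A minimal sketch of metadata-driven integration: two data sets with different column names and units are merged by consulting per-data-set metadata. The metadata layout, the concept names, and the choice of grams as the target unit are assumptions for illustration, not the KNB data integration engine.

```python
# Hypothetical per-data-set metadata: local column name -> (shared concept, unit).
META_A = {"wt": ("body_mass", "g"), "sp": ("taxon", None)}
META_B = {"mass_kg": ("body_mass", "kg"), "species": ("taxon", None)}

# Hypothetical data sets that use different names and units.
DATA_A = [{"wt": 21.5, "sp": "Peromyscus maniculatus"}]
DATA_B = [{"mass_kg": 0.034, "species": "Microtus californicus"}]

# Conversion factors to the assumed target unit (grams).
TO_GRAMS = {"g": 1.0, "kg": 1000.0}


def harmonize(rows, meta, source):
    """Rename columns to shared concepts and convert masses to grams."""
    out = []
    for row in rows:
        rec = {"source": source}  # keep provenance for each record
        for column, value in row.items():
            concept, unit = meta[column]
            if unit is not None:
                value = value * TO_GRAMS[unit]
            rec[concept] = value
        out.append(rec)
    return out


integrated = harmonize(DATA_A, META_A, "A") + harmonize(DATA_B, META_B, "B")
for rec in integrated:
    print(rec)
```

The transformation itself is mechanical; the researcher's decisions are encoded in the metadata tables and in the choice of target unit.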
Data Integration • [Figure with elements labeled A, B, C]
Semantic metadata • Describes the relationship between measurements and ecologically relevant concepts • Drawn from a controlled vocabulary • Ontology for ecological measurements
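One way to picture a controlled vocabulary is as a small hierarchy of terms, where two measurements can be compared at the most specific concept they share. The terms and the `BROADER` table below are hypothetical and only stand in for a real ecological measurement ontology.

```python
# Hypothetical controlled vocabulary: each term points to its broader term.
BROADER = {
    "wet_mass": "body_mass",
    "dry_mass": "body_mass",
    "body_mass": "body_size",
    "body_length": "body_size",
    "body_size": None,
}


def ancestors(term):
    """Return the term followed by its broader terms, most specific first."""
    chain = []
    while term is not None:
        chain.append(term)
        term = BROADER[term]
    return chain


def common_concept(term_a, term_b):
    """Return the most specific concept shared by both terms, or None."""
    b_chain = set(ancestors(term_b))
    for term in ancestors(term_a):
        if term in b_chain:
            return term
    return None


print(common_concept("wet_mass", "dry_mass"))
print(common_concept("wet_mass", "body_length"))
```

Two columns annotated `wet_mass` and `dry_mass` meet at `body_mass`, so they may be convertible; `wet_mass` and `body_length` meet only at `body_size`, which signals that combining them needs a researcher's decision.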
What drives synthesis • Science questions • Hypotheses • Analyses + Models • Integrated Data • Original Data
Conclusions • Barriers to integration can be addressed using structured metadata • Can accomplish a lot with ‘just’ mechanical transformations • Domain ontologies + semantic mediation are paths to scaling integration • Analysis drives all other phases of integration