Mark Schildhauer 1 & Tim McPhillips 2

Formalized data acquisition platforms to support data collection in the laboratory, field, and across sites
Mark Schildhauer (1) & Tim McPhillips (2)
(1) National Center for Ecological Analysis and Synthesis, U. Cal Santa Barbara
(2) Genomics Center, U. Cal Davis

Cyberinfrastructure for Holistic Biology
• Increasing need for collaboration and synthesis to solve vital, complex questions in biology, from gene to ecosystem
• Cyberinfrastructure supports synthesis by:
  • Providing data access infrastructure
  • Dealing with the integration of heterogeneous data
  • Basing analysis and modeling on robust, shared code
  • Standardizing and exchanging protocols, methods
• Wide variety of informatics research projects
  • KNB, SEEK, MaNIS, CIPRes, VegBank, Kepler, GEON, …

Data Discovery and Integration Challenges
• Ecological data are highly heterogeneous:
  • Variable syntax (csv, xls), structures (tables, rasters, hierarchical), and semantics (terms, methods)
  • Highly dispersed: many autonomous holdings (hard drives, floppies), few repositories
  • Derived from many disciplines: genomic, cellular, physiology, morphology, populations, communities, ecosystems, biodiversity (specimen collections, range)
• Need for abiotic data too: hydrology, geospatial, climatology
• Human factors: demographic, economic, land-use data
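
The syntactic and semantic heterogeneity described above can be illustrated with a minimal sketch. It assumes two hypothetical sources ("site_a", "site_b") that record the same observations with different delimiters and column names. The mapping table and the shared terms ("taxon", "abundance", "plot_id") are invented for illustration and are not taken from any project named in these slides.

```python
import csv
import io

# Hypothetical mapping from each source's local column names to shared terms.
COLUMN_MAP = {
    "site_a": {"Species": "taxon", "Count": "abundance", "Plot": "plot_id"},
    "site_b": {"sp_name": "taxon", "n_indiv": "abundance", "quadrat": "plot_id"},
}


def harmonize(source, text, delimiter=","):
    """Read delimited text and rename its columns to the shared vocabulary."""
    mapping = COLUMN_MAP[source]
    rows = []
    for record in csv.DictReader(io.StringIO(text), delimiter=delimiter):
        # Keep only the mapped columns, under their shared names.
        row = {mapping[k]: v.strip() for k, v in record.items() if k in mapping}
        row["abundance"] = int(row["abundance"])
        row["source"] = source  # remember where each row came from
        rows.append(row)
    return rows


# Two made-up holdings: one comma-separated, one tab-separated.
SITE_A = "Plot,Species,Count\nP1,Quercus agrifolia,4\n"
SITE_B = "quadrat\tsp_name\tn_indiv\nQ7\tQuercus agrifolia\t2\n"

combined = harmonize("site_a", SITE_A) + harmonize("site_b", SITE_B, delimiter="\t")
total = sum(r["abundance"] for r in combined)
```

Renaming columns handles only the easy part of the problem. Differences in methods or units would still need to be documented and reconciled separately.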

Collaboration and data sharing
• Personal data management problems are vastly compounded in collaboration
• Need for better:
  • Data organization: standardized formats, structures
  • Data documentation: standardized descriptions of data (metadata); loosely coupled but compatible
  • Data analysis: documented and executable
  • Data & analysis preservation: archived, discoverable, retrievable, and interpretable (archives for both)
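
As a rough illustration of "data documentation" that stays loosely coupled to the data, the sketch below keeps a metadata record separate from the table it describes and checks that every column is documented. The class names, fields, and example values are hypothetical. This is not Ecological Metadata Language or any other standard schema.

```python
from dataclasses import dataclass, field


@dataclass
class Attribute:
    """Description of one column in a data table."""
    name: str
    definition: str
    unit: str = ""


@dataclass
class DatasetMetadata:
    """A simplified, hypothetical metadata record for one dataset."""
    title: str
    creator: str
    methods: str
    attributes: list = field(default_factory=list)

    def undocumented(self, header):
        """Return the column names in `header` that the metadata does not describe."""
        described = {a.name for a in self.attributes}
        return [col for col in header if col not in described]


meta = DatasetMetadata(
    title="Hypothetical oak seedling survey",
    creator="Example Researcher",
    methods="Seedlings counted in 1 m^2 quadrats.",
    attributes=[
        Attribute("plot_id", "Identifier of the sampled quadrat"),
        Attribute("abundance", "Number of seedlings counted", unit="count"),
    ],
)

# A data file whose header contains one column the metadata never mentions.
missing = meta.undocumented(["plot_id", "abundance", "soil_temp"])
```

A check like this could run before a dataset is shared, so that collaborators do not receive columns they cannot interpret.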

Technological solutions
• Confederated data sharing framework (standardized protocols, rich metadata, controlled vocabularies/ontologies, compatible querying mechanisms, distributed management and ownership)
• Analytical software that is scripted, verifiable, re-usable (e.g., R, Matlab, SAS, C); and orchestrated with scientific workflows (e.g., Kepler to allow heterogeneous execution environments)
• Free, open-source, multi-platform software for data management & analysis (whenever possible)
• Virtualized “Central collaborative workspace” for organizing communications about data, analyses, protocols, findings, etc.
• Broad compatibility with other frameworks for efficient resource discovery and interpretation across projects
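
The idea of analysis that is "scripted, verifiable, re-usable" can be sketched as a list of named steps run in order, with a digest of each intermediate result logged so that a re-run can be compared against an earlier one. This is a toy illustration only. The function names are hypothetical and it does not use or imitate Kepler's API.

```python
import hashlib
import json


def run_workflow(steps, data):
    """Apply each (name, function) step in order and log a digest of each output."""
    log = []
    for name, func in steps:
        data = func(data)
        # Serialize the intermediate result and record a short fingerprint of it.
        payload = json.dumps(data, sort_keys=True).encode()
        log.append({"step": name, "sha256": hashlib.sha256(payload).hexdigest()[:12]})
    return data, log


def drop_missing(values):
    """Remove missing observations (represented here as None)."""
    return [v for v in values if v is not None]


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


steps = [("drop_missing", drop_missing), ("mean", mean)]
result, log = run_workflow(steps, [2.0, None, 4.0])

# Re-running the same steps on the same input reproduces the same log.
result_again, log_again = run_workflow(steps, [2.0, None, 4.0])
```

Because the steps are plain functions, the same list can be re-used on another dataset, and matching logs give a simple check that two runs produced the same intermediate results.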

Ecoinformatics Products (NCEAS, UCD, SDSC, KU, LTER and other collaborators)
• Ecological Metadata Language
  • structure, semantics, and context of scientific data
• Morpho
  • desktop metadata and data management software
• Metacat
  • distributed data server
  • KNB, UCNRS, OBFS, NCEAS, PISCO, ESA
• VegBank
  • plot, species and community vegetation data
• EcoGrid
  • interfacing distinct data systems and networks
• Kepler
  • analysis and modeling using scientific workflows

Other Cyberinfrastructure Efforts
• SONet: A Community-Driven Scientific Observations Network to achieve Semantic Interoperability of Environmental and Ecological Data (ontologies/controlled vocabularies; annotations and applications). OCI INTEROP
• VDC: Creation of an International Virtual Data Center for the Biodiversity, Ecological and Environmental Sciences (confederating disparate data resources for cross-scale, cross-disciplinary science). OCI INTEROP
• SEMTOOLS: Semantic enhancements for ecological data management (linking metadata with ontologies and reasoning engines; enhancing personal data management). DEB DBI
• NSF has OCI programs, such as DataNet (digital data preservation and access) and others; iPlant should collaborate and leverage where relevant!

Other Considerations
• Existing and emerging standards:
  • metadata: FGDC, GenBank, OGC, etc.
  • ontologies: OBO, GO, others
• Technologies: NSF NMI, W3C
• iPlant should not reinvent technology wheels
• Partner/collaborate with other CI efforts
• Assure broad compatibility with other CI efforts in biological science

Data acquisition for iPlant
• GCW should identify the main CI needs:
  • Data contributions/acquisitions (how to collect and organize contributions from “autonomous” researchers)
  • Data discovery/sharing (availability? precision/recall/grain of search)
  • Data storage and delivery (massive archival needs? transfer over the wire)
  • Data mining and pattern detection in massive data?
  • Computational bottlenecks? (need for focused tuning/parallelization of algorithms)
  • IP/access and attribution concerns?
• Recommend proceeding with the “Use Case/Scenario” method to define specific biological GC; then create CI with sensitivity to generalization.
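
The "precision/recall" question for data discovery has a standard definition, sketched below for a single search. Precision is the fraction of returned datasets that are relevant; recall is the fraction of relevant datasets that were returned. The dataset identifiers are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one search result set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant datasets that the search returned
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical search: four datasets returned, three datasets actually relevant.
returned = ["ds1", "ds2", "ds3", "ds4"]
wanted = ["ds2", "ds4", "ds9"]
p, r = precision_recall(returned, wanted)
```

In this made-up case two of the four returned datasets are relevant and two of the three relevant datasets are found, so precision is 2/4 and recall is 2/3.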

Acknowledgements
• This material is based upon work supported by:
  • The National Science Foundation under Grant Numbers 9980154, 9904777, 0131178, 9905838, 0129792, and 0225676.
  • The National Center for Ecological Analysis and Synthesis, a Center funded by NSF (Grant Number 0072909), the University of California, and the UC Santa Barbara campus.
  • The Andrew W. Mellon Foundation.
• Collaborators: NCEAS (UC Santa Barbara), University of New Mexico (Long Term Ecological Research Network Office), San Diego Supercomputer Center, University of Kansas (Center for Biodiversity Research), University of Vermont, University of North Carolina, Napier University, Arizona State University, UC Davis
• Kepler contributors: SEEK, Ptolemy II, SDM/SciDAC, GEON, RoadNet, EOL, Resurgence
