Cyberinfrastructure Overview Core Cyberinfrastructure Team Matthew B. Jones National Center for Ecological Analysis and Synthesis (NCEAS) University of California, Santa Barbara DataONE Kick-off Meeting October 20-22, 2009
Cyberinfrastructure Objectives
• Support synthesis in earth observation sciences
• Support full lifecycle of scientific process
  • Data acquisition and management
  • Data preservation
  • Data discovery and access
  • Data integration
  • Data analysis and visualization
  • Process management and preservation
• Evolve to accommodate technology change
Design goals
• Distributed management at Member Nodes
• Replication and caching for preservation and performance
• Software must provide benefits for scientists today
• Evolution of software and standards
• Support and adapt existing community software efforts
• Emphasize Free and Open Source Software
What data are in scope?
• Biological
  • e.g., Gene, Organism, Population, Species, Community, Biome, Ecosystem
• Environmental
  • e.g., Atmospheric, Chemical, Ecological, Hydrological, Oceanographic, Physical
• Social
  • e.g., Land use, human population
• Economic
  • e.g., trade, ecosystem services, resource extraction
Who are the providers and consumers?
• Providers
  • Academic and Agency Scientists
  • Research networks
  • Environmental observatories
  • Citizen groups
  • Students
• Consumers
  • Academic and Agency Scientists
  • Research networks
  • Environmental observatories
  • Citizen groups
  • Students
Same people, different roles driving needs
Metadata and data integration
• Every community has
  • multiple metadata schemas
    • Biological Data Profile, Darwin Core, Dublin Core, Ecological Metadata Language, OpenGIS schemas
  • multiple data formats
    • ASCII, NetCDF, HDF, GeoTIFF, ...
• Some communities have general and domain-specific ontologies
• Addressing this heterogeneity is critical
• Integrated analysis of datasets requires
  • Syntax mapping
  • Semantics mapping
  • Sophisticated integration tools that do not exist
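As a rough illustration of the "syntax mapping" step only, the sketch below renames fields from two schemas into one shared vocabulary. The schema names (`schema_a`, `schema_b`) and all field names are hypothetical placeholders, not the normative elements of Darwin Core, Dublin Core, or EML. Semantic mapping (deciding whether two fields really mean the same thing) is not addressed here.

```python
# Hypothetical crosswalk tables: source field name -> shared field name.
# These names are illustrative only and do not come from any real schema.
CROSSWALKS = {
    "schema_a": {"title": "title", "creator": "author", "coverage": "location"},
    "schema_b": {"datasetTitle": "title", "owner": "author", "site": "location"},
}


def to_common(record: dict, schema: str) -> dict:
    """Rename a record's fields into the shared vocabulary.

    Fields with no entry in the crosswalk are dropped.
    """
    mapping = CROSSWALKS[schema]
    return {mapping[key]: value for key, value in record.items() if key in mapping}


# Two descriptions of the same dataset, written against different schemas.
a = to_common({"title": "Kelp survey", "creator": "Lee", "rights": "CC0"}, "schema_a")
b = to_common({"datasetTitle": "Kelp survey", "owner": "Lee"}, "schema_b")

# After mapping, both records share field names and can be compared directly.
print(a == b)
```

A table-driven crosswalk like this keeps each schema's mapping in data rather than code, so supporting another schema means adding one more table.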
Integrating with existing infrastructure KNB, ESDIS, and Waters Networks
Overview of Components
• Member Nodes
  • Earth observing institutions, projects, and networks
  • Provide resources for their own data and replicated data
  • Focused on serving their constituencies
• Coordinating Nodes
  • Provide network-wide services to Member Nodes
  • Geographically replicated services
• Investigator Toolkit
  • Tools for researchers to access DataNetONE
  • General Purpose and discipline-specific tools
  • Adapt existing tools where possible
Node Design
• Member Nodes
  • Geographically Distributed Nodes
  • Authoritative repository for many datasets
  • Diversity tolerant (less tightly coordinated)
  • Freedom to try new tools, methods, and leapfrog forward
  • Partial replication
• Coordinating Nodes
  • Completely replicated
  • Complete metadata catalogue
  • Data Subset (initially a large fraction)
  • Tightly coordinated, stable service platform
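The division of labor between the two node types can be sketched as a small data model. This is a minimal sketch under stated assumptions, not the DataONE design: the class names, method names, node names, and identifier are all hypothetical. It shows a Member Node holding only the objects it was given (partial replication) while a Coordinating Node keeps metadata and replica locations for every registered object.

```python
from dataclasses import dataclass, field


@dataclass
class MemberNode:
    """Holds its own data plus whatever replicas it has been asked to keep."""
    name: str
    objects: dict = field(default_factory=dict)  # identifier -> data bytes

    def store(self, pid: str, data: bytes) -> None:
        self.objects[pid] = data


@dataclass
class CoordinatingNode:
    """Keeps the complete metadata catalogue and tracks replica locations."""
    catalog: dict = field(default_factory=dict)    # identifier -> metadata
    locations: dict = field(default_factory=dict)  # identifier -> set of node names

    def register(self, pid: str, metadata: dict, node: MemberNode, data: bytes) -> None:
        # The originating Member Node stores the data; the catalogue records it.
        node.store(pid, data)
        self.catalog[pid] = metadata
        self.locations.setdefault(pid, set()).add(node.name)

    def replicate(self, pid: str, source: MemberNode, target: MemberNode) -> None:
        # Copy one object to another Member Node and record the new location.
        target.store(pid, source.objects[pid])
        self.locations[pid].add(target.name)


cn = CoordinatingNode()
node_a, node_b = MemberNode("node-a"), MemberNode("node-b")
cn.register("example-id-1", {"title": "Kelp survey"}, node_a, b"raw data")
cn.replicate("example-id-1", node_a, node_b)
print(sorted(cn.locations["example-id-1"]))
```

Keeping the location index at the Coordinating Node lets Member Nodes stay loosely coordinated: each one only needs to store and return objects.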
DataONE Service Interface
• Federated Identity and Authorization Services
• Object Management Services
• Discovery and Usage Services
• Preservation Services
• Network Services
Service Interface for Interoperability
• Create common access methods for different clients
• Create a mechanism to map heterogeneous services
• Provide an interface between nodes and service requests
• Simplicity of construction
• Lightweight
• Ease of implementation
• Implementations are opaque to service consumers
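One way to picture "implementations are opaque to service consumers" is a single abstract interface with different backends behind it. The sketch below is an assumption-laden illustration: `ObjectService`, `FileBackedNode`, `DatabaseBackedNode`, and `fetch` are hypothetical names, not part of any DataONE API.

```python
from abc import ABC, abstractmethod


class ObjectService(ABC):
    """Hypothetical common access method that every node exposes."""

    @abstractmethod
    def get(self, pid: str) -> bytes:
        """Return the object with this identifier, or raise KeyError."""


class FileBackedNode(ObjectService):
    """Backend that keeps objects in a mapping (standing in for files)."""

    def __init__(self, files: dict):
        self._files = files

    def get(self, pid: str) -> bytes:
        return self._files[pid]


class DatabaseBackedNode(ObjectService):
    """Backend that keeps objects as (identifier, payload) rows."""

    def __init__(self, rows: list):
        self._rows = rows

    def get(self, pid: str) -> bytes:
        for row_pid, payload in self._rows:
            if row_pid == pid:
                return payload
        raise KeyError(pid)


def fetch(nodes: list, pid: str) -> bytes:
    """Client code: asks each node in turn and never inspects its backend."""
    for node in nodes:
        try:
            return node.get(pid)
        except KeyError:
            continue
    raise KeyError(pid)


nodes = [FileBackedNode({"a": b"1"}), DatabaseBackedNode([("b", b"2")])]
print(fetch(nodes, "b"))
```

Because `fetch` depends only on `get`, a node can change its storage without any change to client code.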
What is the Investigator Toolkit?
• Suite of software tools for researchers
• Emphasize Free and Open Source, but support commercial tools
• General analysis frameworks (e.g., R, MATLAB)
• Domain-specific tools (e.g., GARP, Phylocom)
• Organized using scientific workflows
• Supports the scientific lifecycle
  • Data management and preservation
  • Data query and access
  • Data analysis and visualization
  • Process management and preservation
• Communication via the Service Interface
Toolkit Functions
• Supports the scientific lifecycle
  • Data management and preservation
  • Data query and access
  • Data analysis and visualization
  • Process management and preservation
• Portal software
Who will build the Toolkit?
• Many open source efforts already exist
  • Data management: MATT, uDig, Specify
  • Analysis and modeling: R, Octave
  • Workflow systems: Kepler, Taverna, Triana, Pegasus
  • Grid systems: Condor, Globus, BOINC
  • Data and workflow portals: VegBank, myExperiment
• Commercial tools are important too
  • MATLAB, SAS, ArcGIS
• DataONE: help communities build their own tools
  • Integrate, interoperate, stabilize
  • Create libraries for the DataONE Service Interface
Data Management and Preservation
• Data management functions
  • Data creation, input, editing, versioning
  • Metadata creation, editing, annotation
  • Local data storage, indexing, searching
• Example applications
  • Morpho metadata editor
  • Mercury metadata editor
  • MATT metadata editor
  • ESRI ArcCatalog
  • Metacat Data Server -- lab group data management
Data Analysis and Visualization
• Need community-standard analysis frameworks
  • R, Octave, GRASS
  • S-PLUS, MATLAB, ArcGIS
• Thousands of domain-specific analytical tools exist
  • GARP: Genetic Algorithm for Rule-set Production
  • BLAST search
  • ClustalW
  • Phylocom
  • Mesquite
Workflow system capabilities
• Workflow systems:
  • Enable communication
  • Support preservation of scientific processes
  • Enable component re-use
  • Allow integration across many software frameworks
• Example workflow engines
  • Kepler, Taverna, Pegasus, Triana
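Two of the capabilities listed above, component re-use and preservation of the process, can be illustrated with a toy pipeline runner. This is a minimal sketch, not how Kepler, Taverna, Pegasus, or Triana work internally; `run_workflow` and the two step functions are hypothetical names.

```python
def run_workflow(steps, data):
    """Apply each step in order and record which steps ran.

    The returned log is a minimal record of the process that could be
    stored alongside the result.
    """
    log = []
    for step in steps:
        data = step(data)
        log.append(step.__name__)
    return data, log


# Reusable components: each is an ordinary function usable in any workflow.
def drop_missing(values):
    return [v for v in values if v is not None]


def mean(values):
    return sum(values) / len(values)


result, log = run_workflow([drop_missing, mean], [2.0, None, 4.0])
print(result, log)
```

The same `drop_missing` component could be placed ahead of a different analysis step without modification, and the log shows exactly which sequence produced a given result.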
Community tools have been successful
• Investigator Toolkit will build upon these successes
• Adapt tools to work together with Service Interface
• Support Free and Open Source Software
• The set of supported tools will grow over time
DataONE discovery portals
• Data discovery portal at Coordinating Nodes
• Workflow discovery portal at Coordinating Nodes
• Other portals as needed
Outstanding issues
• Data Discovery, Access, and Availability
• Federated Identity, Authentication, and Access Control
• Metadata and data standards
• Evolution of specifications
• Data Integration and Interoperability
• Data and Metadata preservation, longevity, and migration
• Versioning and identifiers
• Scalability
NIH (Not Invented Here) Syndrome
• Lots of:
  • metadata catalogs and specifications
  • data standards
  • service definitions
  • architectures and protocols
• Many communities of practice
  • GEOSS, KNB, CUAHSI, NBII, GBIF, TDWG, Ameriflux, EOS, OGC, W3C, LTER, NEON, OOI
  • and on and on and on...
• DataONE cannot just be Community n+1
• Easy to get entrained in the details
• Have to save people work
• Have to engage groups early and earnestly
I am here. Where are you?
[Diagram: "ME" placed among NCEAS, GEOSS, LTER, SONet, GBIF, KNB, OGC, Kepler, TDWG, W3C, EOS, and DataONE]