http://knb.ecoinformatics.org http://seek.ecoinformatics.org. Science Environment for Ecological Knowledge: EcoGrid Matthew B. Jones National Center for Ecological Analysis and Synthesis University of California Santa Barbara.
Science Environment for Ecological Knowledge Research Objectives • Access to ecological, environmental, and biodiversity data • Enable data sharing & re-use • Enhance data discovery at global scales • Scalable analysis and synthesis • Taxonomic, Spatial, Temporal, Conceptual integration of data • Address data heterogeneity issues • Enable communication and collaboration for analysis • Enable re-use of analytical components • Collaborators • NCEAS, UNM, SDSC, U Kansas • Vermont, Napier, ASU, UNC
SEEK Components Science Environment for Ecological Knowledge • Kepler • Modeling scientific workflows • EcoGrid • Making diverse environmental data systems interoperate • Semantic Mediation System • “Smart” data discovery and integration • Knowledge Representation WG • Taxon WG • BEAM WG • Education, Outreach, Training
Scientific Workflows • Model the way scientists work with their data now • Mentally coordinate export and import of data among software systems • Workflows emphasize data flow • Output generation includes creating appropriate metadata • The analysis workflow itself becomes metadata • The workflow describes the data lineage as it has been transformed • Derived data sets can be stored in EcoGrid with provenance [Figure: workflow diagram with callouts "Query EcoGrid to find data" and "Archive output to EcoGrid with workflow metadata"]
Kepler: scientific workflows • Collaborative effort of SEEK, SciDAC/SDM, GEON, Ptolemy Project
SEEK EcoGrid • Goal: allow diverse environmental data systems to interoperate • Hides complexity of underlying systems using lightweight interfaces • We have standardized data via EML, need standard APIs • Integrate diverse data networks from ecology, biodiversity, and environmental sciences • Data systems • Any system can implement these interfaces • Prototyping using: • Metacat, SRB, DiGIR, Xanthoria, etc. • Supports multiple metadata standards • EML, Darwin Core as foci
EcoGrid client interactions • Modes of interaction • Client-server • Fully distributed • Peer-to-peer • EcoGrid Registry • Node discovery • Service discovery • Aggregation services • Centralized access • Reliability • Data preservation
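The registry's node and service discovery roles can be illustrated with a small in-memory sketch. This is only an assumption about how such a registry might behave; the class name, the service names ("query", "write"), and the example.org URLs are hypothetical and do not come from the EcoGrid design.

```python
from collections import defaultdict


class Registry:
    """Hypothetical in-memory registry: nodes announce the services they implement."""

    def __init__(self):
        self._nodes = {}                     # node URL -> set of service names
        self._by_service = defaultdict(set)  # service name -> set of node URLs

    def register(self, node_url, services):
        # Node discovery: remember that the node exists and what it offers.
        self._nodes[node_url] = set(services)
        for service in services:
            self._by_service[service].add(node_url)

    def nodes(self):
        return sorted(self._nodes)

    def find(self, service):
        # Service discovery: which registered nodes implement this service?
        return sorted(self._by_service.get(service, ()))


registry = Registry()
registry.register("http://metacat.example.org", ["query", "write"])
registry.register("http://digir.example.org", ["query"])
query_nodes = registry.find("query")
write_nodes = registry.find("write")
```

An aggregation service could use the same lookup to fan a query out to every node returned by `find("query")`.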
EcoGrid Query Interfaces • Provides a mechanism for search and retrieval of metadata and federated data • Supports third-party interaction with search results: forwarding of result set identifiers to another service instance for retrieval • Different levels of compliance • Low barrier for participation • Bulk of data will be accessible through Type I
Query Interfaces Implemented • Initial prototype to support query and retrieval from: • Storage Resource Broker (SRB) • Metacat • Distributed Generic Information Retrieval (DiGIR) • Xanthoria • Encourage additional experimentation with and feedback based on other system implementations
EcoGrid Query Level I • Basic, entry-level exposure of data and metadata for EcoGrid and SEEK • Response contains data; intended for direct communication rather than third-party indirection
ResultsetType query(SessionID, QueryType)
byte[] get(SessionID, objectID)
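As a rough illustration of the two Level I operations, the sketch below models a node that answers `query` with matching records and `get` with raw bytes. Everything beyond the two method names is an assumption: the class, the record layout, and the use of a Python callable in place of a `QueryType` document are hypothetical simplifications.

```python
from dataclasses import dataclass, field


@dataclass
class InMemoryLevelOneNode:
    """Hypothetical Level I node backed by in-memory dictionaries."""

    objects: dict = field(default_factory=dict)  # objectID -> raw bytes
    records: list = field(default_factory=list)  # flattened metadata records

    def query(self, session_id, predicate):
        # The real interface takes a QueryType document; a plain callable
        # stands in for it here (assumption). The response carries the data itself.
        return [record for record in self.records if predicate(record)]

    def get(self, session_id, object_id):
        # Returns the stored object as bytes, mirroring byte[] get(...).
        return self.objects[object_id]


node = InMemoryLevelOneNode(
    objects={"mvz1": b"raw object bytes"},
    records=[{"identifier": "mvz1", "ScientificName": "PEROMYSCUS LEUCOPUS"}],
)
hits = node.query("session-1", lambda r: r["ScientificName"].startswith("PEROMYSCUS"))
data = node.get("session-1", hits[0]["identifier"])
```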
Query Conditions • Language-independent representation of a query structure • Transformed into the appropriate native language of the data store
Example:
<AND>
  <condition operator="LIKE" concept="ScientificName">peromyscus%</condition>
  <condition operator="NOT EQUALS" concept="DecimalLatitude">NULL</condition>
</AND>
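One way to picture the transformation into a native query language is a recursive walk over the condition tree that emits a parameterized SQL WHERE clause. This is a minimal sketch under stated assumptions: the operator-to-SQL mapping, the treatment of the literal `NULL`, and the idea that concepts map directly to column names are all illustrative choices, not the EcoGrid implementation.

```python
import xml.etree.ElementTree as ET

# Assumed mapping; only operators seen in the example (plus EQUALS) are covered.
SQL_OPERATORS = {"LIKE": "LIKE", "EQUALS": "=", "NOT EQUALS": "<>"}


def to_sql(node):
    """Return (where_clause, parameters) for an <AND>/<OR>/<condition> tree."""
    tag = node.tag.upper()
    if tag in ("AND", "OR"):
        parts = [to_sql(child) for child in node]
        clause = f" {tag} ".join(sql for sql, _ in parts)
        params = [value for _, values in parts for value in values]
        return f"({clause})", params
    if tag == "CONDITION":
        operator = node.get("operator")
        concept = node.get("concept")
        if not concept.isidentifier():
            raise ValueError(f"unsafe concept name: {concept!r}")
        value = (node.text or "").strip()
        if value == "NULL" and operator in ("EQUALS", "NOT EQUALS"):
            # SQL compares against NULL with IS / IS NOT rather than = / <>.
            return f"{concept} IS {'NOT ' if operator == 'NOT EQUALS' else ''}NULL", []
        return f"{concept} {SQL_OPERATORS[operator]} ?", [value]
    raise ValueError(f"unexpected element: {node.tag}")


QUERY = """<AND>
  <condition operator="LIKE" concept="ScientificName">peromyscus%</condition>
  <condition operator="NOT EQUALS" concept="DecimalLatitude">NULL</condition>
</AND>"""

where_clause, parameters = to_sql(ET.fromstring(QUERY))
```

Keeping values as bound parameters rather than splicing them into the SQL string avoids quoting problems when the same condition tree is sent to different back ends.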
Specifying the Resultset • Specify the list of concepts (fields) to be returned in the resultset • Simple paths used to identify elements or document subtrees • Effectively flattens the structure of the records, but allows generic representation
Example:
<returnfield>/ScientificName</returnfield>
<returnfield>/Longitude</returnfield>
<returnfield>/Latitude</returnfield>
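The flattening effect of return fields can be sketched as follows: each simple path is resolved against a record, and the result is a flat mapping from path to text. The nested record layout and its values below are made up for illustration; only the `/Name` path style comes from the slide.

```python
import xml.etree.ElementTree as ET


def flatten(record_xml, return_fields):
    """Resolve each simple path against the record and return a flat dict."""
    root = ET.fromstring(record_xml)
    row = {}
    for path in return_fields:
        # "/ScientificName" is taken relative to the record's root element.
        element = root.find(path.lstrip("/"))
        row[path] = element.text if element is not None else None
    return row


# Hypothetical record with nested structure (illustrative values only).
RECORD = """<record>
  <ScientificName>Peromyscus leucopus</ScientificName>
  <Location>
    <Longitude>-119.8</Longitude>
    <Latitude>34.4</Latitude>
  </Location>
</record>"""

row = flatten(RECORD, ["/ScientificName", "/Location/Longitude", "/Elevation"])
```

A path that matches nothing yields `None` here; a real service might instead omit the field or report an error.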
Full Query Example
<egq:query queryId="query-digir.1.1" system="http://knb.ecoinformatics.org"
    xmlns:egq="ecogrid://ecoinformatics.org/ecogrid-query-1.0.0beta1"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="ecogrid://ecoinformatics.org/ecogrid-query-1.0.0beta1 ../../src/xsd/query.xsd">
  <namespace prefix="darwin">http://digir.net/schema/conceptual/darwin/2003/1.0</namespace>
  <returnfield>/ScientificName</returnfield>
  <returnfield>/Longitude</returnfield>
  <returnfield>/Latitude</returnfield>
  <title>Peromyscus genus query</title>
  <condition operator="LIKE" concept="Genus">Peromyscus</condition>
</egq:query>
Query Result Set Structure
<rs:resultset resultsetId="foo.1.1" system="urn:not://sure/what/to/put/here"
    xmlns:rs="ecogrid://ecoinformatics.org/ecogrid-resultset-1.0.0beta1"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="ecogrid://ecoinformatics.org/ecogrid-resultset-1.0.0beta1 ../../src/xsd/resultset.xsd">
  <resultsetMetadata>
    <sendTime>2003-05-02T16:45:50-09:00</sendTime>
    <startRecord>1</startRecord>
    <endRecord>2</endRecord>
    <recordCount>2</recordCount>
    <namespace>http://digir.net/schema/conceptual/darwin/2003/1.0</namespace>
    <system id="1">http://speciesanalyst.net/digir/DiGIR.php?resource=MammalsDwC2</system>
  </resultsetMetadata>
  <record number="1" system="1" identifier="mvz1">
    <returnField name="ScientificName">PEROMYSCUS LEUCOPUS NOVEBORACENSIS</returnField>
    <returnField name="Longitude">100</returnField>
    <returnField name="Latitude">200</returnField>
  </record>
  …
</rs:resultset>
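A client might read such a resultset roughly as sketched below, resolving each record's `system` attribute against the `<system>` entries in the metadata. The XML is a trimmed copy of the example above (one record, schema-location attributes dropped, record count adjusted to match); the function name and the output layout are assumptions.

```python
import xml.etree.ElementTree as ET

RESULTSET = """<rs:resultset resultsetId="foo.1.1"
    xmlns:rs="ecogrid://ecoinformatics.org/ecogrid-resultset-1.0.0beta1">
  <resultsetMetadata>
    <recordCount>1</recordCount>
    <system id="1">http://speciesanalyst.net/digir/DiGIR.php?resource=MammalsDwC2</system>
  </resultsetMetadata>
  <record number="1" system="1" identifier="mvz1">
    <returnField name="ScientificName">PEROMYSCUS LEUCOPUS NOVEBORACENSIS</returnField>
    <returnField name="Longitude">100</returnField>
    <returnField name="Latitude">200</returnField>
  </record>
</rs:resultset>"""


def parse_resultset(xml_text):
    """Return one dict per <record>, with the source system URL resolved."""
    root = ET.fromstring(xml_text)
    # Only the root element is namespace-qualified; its children are not.
    systems = {s.get("id"): s.text for s in root.iterfind("resultsetMetadata/system")}
    rows = []
    for record in root.iterfind("record"):
        row = {f.get("name"): f.text for f in record.iterfind("returnField")}
        row["identifier"] = record.get("identifier")
        row["system"] = systems.get(record.get("system"))
        rows.append(row)
    return rows


rows = parse_resultset(RESULTSET)
```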
EcoGrid Query Level II • More detailed handling of results • Uses RSIDs to identify resultsets: handles that can be passed to a third party
RSID search(SessionID, query)
Resultset retrieve(SessionID, RSID, start, numrecs)
query decodeResultsetIdentifier(SessionID, RSID)
statusinfo getResultStatus(SessionID)
int transfer(SessionID, sourceURL, destURL, ObjectID)
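The handle-based pattern can be sketched with an in-memory node: `search` stores the matches and returns only an RSID, and any party holding that RSID can page through the results with `retrieve`. The class, the use of a callable as the query, the 1-based `start`, and the absence of session checks are all simplifying assumptions; `getResultStatus` and `transfer` are left out.

```python
import uuid


class InMemoryLevelTwoNode:
    """Hypothetical Level II node that hands out resultset identifiers (RSIDs)."""

    def __init__(self, records):
        self._records = list(records)
        self._resultsets = {}  # RSID -> (query, matching records)

    def search(self, session_id, query):
        # Only a handle is returned; the records stay on the node.
        rsid = uuid.uuid4().hex
        self._resultsets[rsid] = (query, [r for r in self._records if query(r)])
        return rsid

    def retrieve(self, session_id, rsid, start, numrecs):
        # start is assumed to be 1-based, as in <startRecord>1</startRecord>.
        _, rows = self._resultsets[rsid]
        return rows[start - 1:start - 1 + numrecs]

    def decode_resultset_identifier(self, session_id, rsid):
        # Recover the query that produced this resultset.
        return self._resultsets[rsid][0]


def is_peromyscus(name):
    return name.startswith("Peromyscus")


node = InMemoryLevelTwoNode(
    ["Peromyscus leucopus", "Mus musculus", "Peromyscus maniculatus", "Peromyscus boylii"]
)
rsid = node.search("session-a", is_peromyscus)
# A third party with its own session retrieves the results using only the RSID.
page_one = node.retrieve("session-b", rsid, 1, 2)
page_two = node.retrieve("session-b", rsid, 3, 2)
```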
EcoGrid Write • Used to push data back to sources (e.g. publishing EML documents) • Depends on the availability of an authentication and access control system
put(sessionID, objectID, object, type)
delete(sessionID, objectID)
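The dependency on access control can be illustrated with a sketch in which every write is checked against a set of authorized sessions. The class, the allow-list of session IDs, and the object IDs are hypothetical stand-ins; a real deployment would rely on a proper authentication service.

```python
class WritableNode:
    """Hypothetical node where put/delete are gated by a session allow-list."""

    def __init__(self, writer_sessions):
        self._writers = set(writer_sessions)  # stand-in for a real auth system
        self._store = {}                      # objectID -> (type, object)

    def _require_writer(self, session_id):
        if session_id not in self._writers:
            raise PermissionError(f"session {session_id!r} may not write")

    def put(self, session_id, object_id, obj, type_):
        self._require_writer(session_id)
        self._store[object_id] = (type_, obj)

    def delete(self, session_id, object_id):
        self._require_writer(session_id)
        self._store.pop(object_id)  # raises KeyError for unknown objects

    def has(self, object_id):
        return object_id in self._store


node = WritableNode(writer_sessions={"session-editor"})
node.put("session-editor", "doc.1.1", b"<eml/>", "EML")

try:
    node.put("session-anonymous", "doc.2.1", b"<eml/>", "EML")
    rejected = False
except PermissionError:
    rejected = True
```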
Data Instance Query • New requirement to support direct query and retrieval with arbitrary data sets • Generally no common schemas between different instances • Could either • Push data instance to service that can query object (e.g. the SRB) • Implement interface at the data instance location • Simple JDBC / SQL interface?
dbSchema getDataSchema(sessionID, objectID)
dbResultset search(sessionID, objectID, SQL)
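The "simple SQL interface" option could look something like the sketch below, which uses Python's built-in `sqlite3` module in place of JDBC. Each data instance is loaded into its own in-memory database so that a client can first ask for the schema and then run SQL against it. The class, the `load` helper, the table, and the rows are all made up for illustration.

```python
import sqlite3


class DataInstanceService:
    """Hypothetical service exposing each data instance through SQL."""

    def __init__(self):
        self._instances = {}  # objectID -> sqlite3 connection holding that data set

    def load(self, object_id, table, columns, rows):
        # Helper for this sketch only: table and column names are trusted input.
        conn = sqlite3.connect(":memory:")
        conn.execute(f"CREATE TABLE {table} ({', '.join(columns)})")
        placeholders = ", ".join("?" for _ in columns)
        conn.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)
        self._instances[object_id] = conn

    def get_data_schema(self, session_id, object_id):
        # No common schema is assumed, so the client must discover it per instance.
        conn = self._instances[object_id]
        tables = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
        schema = {}
        for (table,) in tables.fetchall():
            info = conn.execute(f"PRAGMA table_info({table})").fetchall()
            schema[table] = [column[1] for column in info]  # index 1 is the column name
        return schema

    def search(self, session_id, object_id, sql):
        return self._instances[object_id].execute(sql).fetchall()


service = DataInstanceService()
service.load(
    "plots.1.1",
    "plots",
    ["species TEXT", "abundance INTEGER"],
    [("Peromyscus leucopus", 12), ("Peromyscus maniculatus", 3)],
)
schema = service.get_data_schema("session-1", "plots.1.1")
common = service.search(
    "session-1", "plots.1.1", "SELECT species FROM plots WHERE abundance > 5"
)
```

Accepting arbitrary SQL from clients is a design risk; a real service would likely restrict queries to read-only statements.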
Building the EcoGrid [Figure: diagram of EcoGrid nodes. Site labels: LUQ, AND, HBR, VCR, NTL. Networks: LTER Network (24), Natural History Collections (>> 100), Organization of Biological Field Stations (180), UC Natural Reserve System (36), Partnership for Interdisciplinary Studies of Coastal Oceans (4), Multi-agency Rocky Intertidal Network (60). Legend: Metacat node, SRB node, VegBank node, DiGIR node, Xanthoria node, Legacy system]
Acknowledgements This material is based upon work supported by: The National Science Foundation under Grant Numbers 9980154, 9904777, 0131178, 9905838, 0129792, and 0225676. The National Center for Ecological Analysis and Synthesis, a Center funded by NSF (Grant Number 0072909), the University of California, and the UC Santa Barbara campus. The Andrew W. Mellon Foundation. PBI Collaborators: NCEAS, University of New Mexico (Long Term Ecological Research Network Office), San Diego Supercomputer Center, University of Kansas (Center for Biodiversity Research)