Science Environment for Ecological Knowledge: Ecogrid Interfaces
Dave Vieglais
The Natural History Museum and Biodiversity Research Center, University of Kansas
Science Environment for Ecological Knowledge
Research Objectives
• Access to ecological and environmental data
• Enable data sharing & re-use
• Enhance data discovery at global scales
• Scalable analysis and synthesis
• Taxonomic, spatial, temporal, and conceptual integration of data
• Enable communication and collaboration for analysis
• Address data heterogeneity issues
• Enable re-use of analytical components
Informatics Challenges for SEEK
• Data is heterogeneous
  • Syntax
  • Schema
  • Semantics
• From many disciplines
  • Biodiversity surveys, hydrology, atmospheric chemistry, spatial data, behavioral experiments, …
  • Data on economics, demographics, legal issues, …
• Data is distributed
SEEK Components
• EcoGrid
  • Ecological, biodiversity and environmental data
  • Computational access
• Analysis and Modeling System
  • Modeling scientific workflows
• Semantic Mediation System
  • “Smart” data discovery
  • Knowledge-based data integration
  • Knowledge-based analysis integration
• Knowledge Representation
  • Ontologies for describing ecology
Building the EcoGrid
[Map figure. Site labels: OBFS, NRS, LUQ, SEV, AND, NCEAS, HBR, VCR, NTL, PISCO 1, PISCO 2, NET, KU, SDSC]
Networks:
• LTER Network (24)
• Organization of Biological Field Stations (180)
• UC Natural Reserve System (36)
• Partnership for Interdisciplinary Studies of Coastal Oceans (4)
• Multi-agency Rocky Intertidal Network (60)
Legend: Metacat node, SRB node, DiGIR node, Site node
SEEK EcoGrid
• Integrate diverse data networks from ecology, biodiversity, and environmental sciences
  • Metacat, DiGIR, SRB, Xanthoria, ...
• EML is the core for data documentation
• Access to computational resources via the Grid (OGSA)
Ecological Metadata Language (EML)
• Metadata: a means to manage ecological data
  • There is no universal data model for ecology
  • Accommodate heterogeneity and dispersion
• EML
  • Discovery information
    • Creator, Title, Abstract, Keyword, etc.
  • Coverage
    • Geographic, temporal, and taxonomic extent
  • Logical and physical data structure
  • Data semantics via unit definitions and typing
  • Protocols and methods
DiGIR Overview
[Diagram: DiGIR Client, 1..n DiGIR Providers, 1..n Data Resources]
• DiGIR = Distributed Generic Information Retrieval
• A DiGIR client may communicate with any number of data providers
• A DiGIR data provider may expose any number of resources (databases)
• A DiGIR resource is a collection of objects described by a single federation schema
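The client / provider / resource cardinalities above can be sketched as a small data model. This is only an illustration: the class names, the `resources_for_schema` helper, and the `*.example` endpoints are hypothetical and are not part of DiGIR itself.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    """A collection of objects described by a single federation schema."""
    name: str
    federation_schema: str  # namespace URI of the schema the resource follows


@dataclass
class Provider:
    """A data provider exposing any number of resources (databases)."""
    endpoint: str
    resources: list[Resource] = field(default_factory=list)


@dataclass
class Client:
    """A client that may communicate with any number of providers."""
    providers: list[Provider] = field(default_factory=list)

    def resources_for_schema(self, schema: str) -> list[tuple[str, str]]:
        """Return (endpoint, resource name) pairs that follow the given schema."""
        return [
            (provider.endpoint, resource.name)
            for provider in self.providers
            for resource in provider.resources
            if resource.federation_schema == schema
        ]


DARWIN = "http://digir.net/schema/conceptual/darwin/2003/1.0"

# Hypothetical providers; only the schema URI comes from the slides.
client = Client([
    Provider("http://provider-a.example/digir", [Resource("MammalsDwC2", DARWIN)]),
    Provider("http://provider-b.example/digir", [
        Resource("Birds", DARWIN),
        Resource("Plots", "urn:example:other-schema"),
    ]),
])
matches = client.resources_for_schema(DARWIN)
```

Grouping resources by federation schema is what lets one query be sent to every matching resource, regardless of which provider hosts it.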
EcoGrid Interfaces
• Registry: resolves references to objects
  • Interface definitions
  • Data structures
  • Service instances
• Session: authentication
  • Details on session information
  • Coarse granularity of resource restriction
• Query: search and retrieve metadata and data
  • Different levels of “conformance”
  • Low bar for participation in SEEK
• Taxon: system to reduce ambiguity in scientific names
  • Commonly used to address synonymy
• SMS: mechanism for relating and resolving data and metadata concepts
EcoGrid Query Interfaces
• Provides a mechanism for search and retrieval of metadata and federated data
• Supports third-party interaction with search results: result set identifiers can be forwarded to another service instance for retrieval
• Different levels of compliance
  • Low barrier for participation
  • Bulk of data will be accessible through Level I
Query Interfaces Implemented
• Initial requirement to support query and retrieval from:
  • SRB
  • Metacat
  • DiGIR
  • Xanthoria
• Federated data sets that subscribe to a small set of federation schemas
EcoGrid Query Level I
• Basic, entry-level exposure of data and metadata for EcoGrid and SEEK
• Response contains data; intended for direct communication rather than third-party indirection
Operations:
  ResultsetType query(SessionID, QueryType)
  byte[] get(SessionID, objectID)
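The two Level I operations can be sketched with a toy in-memory service. This is not the EcoGrid implementation: `LevelOneService` and `Record` are hypothetical names, the query is reduced to a single concept/value equality test instead of a full `QueryType`, and the session ID is accepted but not checked.

```python
from dataclasses import dataclass


@dataclass
class Record:
    identifier: str
    fields: dict[str, str]


class LevelOneService:
    """Toy in-memory stand-in for a Level I query service (hypothetical)."""

    def __init__(self, records: list[Record], objects: dict[str, bytes]):
        self._records = records
        self._objects = objects

    def query(self, session_id: str, concept: str, value: str) -> list[Record]:
        """Return the matching records directly in the response (no handle)."""
        return [r for r in self._records if r.fields.get(concept) == value]

    def get(self, session_id: str, object_id: str) -> bytes:
        """Return the raw bytes of one object, mirroring byte[] get(...)."""
        return self._objects[object_id]


service = LevelOneService(
    records=[
        Record("mvz1", {"Genus": "Peromyscus", "Latitude": "33"}),
        Record("mvz2", {"Genus": "Mus", "Latitude": "40"}),
    ],
    objects={"mvz1": b"<record id='mvz1'/>"},
)
hits = service.query("session-1", "Genus", "Peromyscus")
payload = service.get("session-1", "mvz1")
```

Because the data comes back in the response itself, a caller needs nothing beyond these two calls, which is what keeps the bar for participation low.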
Query Example
  <egq:query queryId="query-digir.1.1"
      system="http://knb.ecoinformatics.org"
      xmlns:egq="ecogrid://ecoinformatics.org/ecogrid-query-1.0.0beta1"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="ecogrid://ecoinformatics.org/ecogrid-query-1.0.0beta1 ../../src/xsd/query.xsd">
    <namespace prefix="darwin">http://digir.net/schema/conceptual/darwin/2003/1.0</namespace>
    <returnfield>/ScientificName</returnfield>
    <returnfield>/Longitude</returnfield>
    <returnfield>/Latitude</returnfield>
    <title>Peromyscus genus query</title>
    <condition operator="LIKE" concept="Genus">Peromyscus</condition>
  </egq:query>
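A query document like the one above could be assembled programmatically. The sketch below uses Python's standard `xml.etree.ElementTree`; the `build_query` helper and its parameters are hypothetical, and the `xsi:schemaLocation` attribute is left out for brevity.

```python
import xml.etree.ElementTree as ET

EGQ_NS = "ecogrid://ecoinformatics.org/ecogrid-query-1.0.0beta1"
DARWIN_NS = "http://digir.net/schema/conceptual/darwin/2003/1.0"

# Ask ElementTree to serialize the query namespace with the "egq" prefix.
ET.register_namespace("egq", EGQ_NS)


def build_query(query_id, system, title, return_fields, concept, operator, value):
    """Build an <egq:query> element with one condition (hypothetical helper)."""
    root = ET.Element(f"{{{EGQ_NS}}}query", {"queryId": query_id, "system": system})
    namespace = ET.SubElement(root, "namespace", {"prefix": "darwin"})
    namespace.text = DARWIN_NS
    for path in return_fields:
        ET.SubElement(root, "returnfield").text = path
    ET.SubElement(root, "title").text = title
    condition = ET.SubElement(
        root, "condition", {"operator": operator, "concept": concept}
    )
    condition.text = value
    return root


query = build_query(
    query_id="query-digir.1.1",
    system="http://knb.ecoinformatics.org",
    title="Peromyscus genus query",
    return_fields=["/ScientificName", "/Longitude", "/Latitude"],
    concept="Genus",
    operator="LIKE",
    value="Peromyscus",
)
xml_text = ET.tostring(query, encoding="unicode")
```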
Query Structure
• Language-independent representation of a query structure
• Transformed into the appropriate native language of the data store
Example:
  <AND>
    <condition operator="LIKE" concept="ScientificName">peromyscus man%</condition>
    <condition operator="NOT EQUALS" concept="DecimalLatitude">NULL</condition>
  </AND>
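As an illustration of transforming the language-independent structure into a native query language, the sketch below turns the example into a parameterized SQL WHERE fragment. SQL is only one possible target, and the operator table, the `to_sql` function, and the treatment of `NULL` as SQL `IS [NOT] NULL` are assumptions rather than documented EcoGrid behavior.

```python
import xml.etree.ElementTree as ET

# Hypothetical operator mapping; the slide itself shows only LIKE and NOT EQUALS.
SQL_OPERATORS = {"LIKE": "LIKE", "EQUALS": "=", "NOT EQUALS": "<>"}


def to_sql(node):
    """Return (where_fragment, parameters) for one query-structure element."""
    if node.tag in ("AND", "OR"):
        parts, params = [], []
        for child in node:
            fragment, child_params = to_sql(child)
            parts.append(fragment)
            params.extend(child_params)
        return "(" + f" {node.tag} ".join(parts) + ")", params
    if node.tag == "condition":
        concept = node.get("concept", "")
        operator = node.get("operator", "")
        # Concepts become column names, so accept plain identifiers only.
        if not concept.isidentifier():
            raise ValueError(f"invalid concept name: {concept!r}")
        if operator not in SQL_OPERATORS:
            raise ValueError(f"unsupported operator: {operator!r}")
        value = (node.text or "").strip()
        if value == "NULL":
            if operator == "NOT EQUALS":
                return f"{concept} IS NOT NULL", []
            if operator == "EQUALS":
                return f"{concept} IS NULL", []
            raise ValueError("NULL is only supported with EQUALS / NOT EQUALS")
        # Values are passed as bound parameters, never spliced into the SQL.
        return f"{concept} {SQL_OPERATORS[operator]} ?", [value]
    raise ValueError(f"unsupported element: {node.tag}")


example = ET.fromstring(
    """<AND>
  <condition operator="LIKE" concept="ScientificName">peromyscus man%</condition>
  <condition operator="NOT EQUALS" concept="DecimalLatitude">NULL</condition>
</AND>"""
)
where, params = to_sql(example)
```

The same tree walk could emit a DiGIR filter or a Metacat path query instead; only the leaf and join formatting would change.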
Specifying the Resultset
• Specify the list of concepts (fields) to be returned in the resultset
• Simple paths used to identify elements or document subtrees
• Effectively flattens the structure of the records, but allows generic representation
Example:
  <returnfield>/ScientificName</returnfield>
  <returnfield>/Longitude</returnfield>
  <returnfield>/Latitude</returnfield>
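The flattening effect of return-field paths can be sketched as follows. The nested record layout (including the `Locality` grouping) and the `flatten` helper are hypothetical, chosen only to show how simple paths select values out of a structured record.

```python
def flatten(record: dict, return_fields: list[str]) -> dict:
    """Pick the value at each simple path and return a flat path -> value map."""
    flat = {}
    for path in return_fields:
        node = record
        for step in path.strip("/").split("/"):
            # Descend one level; anything that is not a dict has no children.
            node = node.get(step) if isinstance(node, dict) else None
            if node is None:
                break
        flat[path] = node
    return flat


# Hypothetical nested source record.
record = {
    "ScientificName": "Peromyscus leucopus",
    "Locality": {"Longitude": "121", "Latitude": "33"},
}
flat = flatten(record, ["/ScientificName", "/Locality/Longitude", "/Missing"])
```

Every result has the same flat shape whatever the source structure, at the cost of losing the nesting.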
Query Result Set Structure
  <rs:resultset resultsetId="foo.1.1"
      system="urn:not://sure/what/to/put/here"
      xmlns:rs="ecogrid://ecoinformatics.org/ecogrid-resultset-1.0.0beta1"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="ecogrid://ecoinformatics.org/ecogrid-resultset-1.0.0beta1 ../../src/xsd/resultset.xsd">
    <resultsetMetadata>
      <sendTime>2003-05-02T16:45:50-09:00</sendTime>
      <startRecord>1</startRecord>
      <endRecord>2</endRecord>
      <recordCount>2</recordCount>
    </resultsetMetadata>
    <record number="1"
        system="http://speciesanalyst.net/digir/DiGIR.php?resource=MammalsDwC2"
        identifier="mvz1"
        namespace="http://digir.net/schema/conceptual/darwin/2003/1.0"
        lastModifiedDate="2003-03-03T10:42:13"
        creationDate="2003-03-03T10:42:13">
      <darwin:ScientificName>PEROMYSCUS LEUCOPUS NOVEBORACENSIS</darwin:ScientificName>
      <darwin:Longitude>121</darwin:Longitude>
      <darwin:Latitude>33</darwin:Latitude>
    </record>
  </rs:resultset>
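A client could read such a result set with a standard XML parser. The sketch below works on a trimmed-down sample with a single record; it also adds an `xmlns:darwin` declaration, which the slide's excerpt does not show but which a namespace-aware parser needs. `parse_resultset` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

RS_NS = "ecogrid://ecoinformatics.org/ecogrid-resultset-1.0.0beta1"
DARWIN_NS = "http://digir.net/schema/conceptual/darwin/2003/1.0"

# Trimmed sample; the xmlns:darwin declaration is an assumption (see above).
SAMPLE = f"""<rs:resultset resultsetId="foo.1.1"
    xmlns:rs="{RS_NS}" xmlns:darwin="{DARWIN_NS}">
  <resultsetMetadata>
    <startRecord>1</startRecord>
    <endRecord>1</endRecord>
    <recordCount>1</recordCount>
  </resultsetMetadata>
  <record number="1" identifier="mvz1" namespace="{DARWIN_NS}">
    <darwin:ScientificName>PEROMYSCUS LEUCOPUS NOVEBORACENSIS </darwin:ScientificName>
    <darwin:Longitude>121</darwin:Longitude>
    <darwin:Latitude>33</darwin:Latitude>
  </record>
</rs:resultset>"""


def parse_resultset(xml_text):
    """Return (record_count, [record dicts]) from a resultset document."""
    root = ET.fromstring(xml_text)
    count = int(root.findtext("resultsetMetadata/recordCount"))
    records = []
    for rec in root.findall("record"):
        # ElementTree reports tags as "{namespace}Name"; keep only the name.
        fields = {
            child.tag.split("}")[-1]: (child.text or "").strip() for child in rec
        }
        records.append({"identifier": rec.get("identifier"), **fields})
    return count, records


count, records = parse_resultset(SAMPLE)
```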
EcoGrid Query Level II
• More detailed handling of results
• Uses RSIDs to identify resultsets: handles that can be passed to a third party
Operations:
  RSID search(SessionID, query)
  Resultset retrieve(SessionID, RSID, start, numrecs)
  query decodeResultsetIdentifier(SessionID, RSID)
  statusinfo getResultStatus(SessionID)
  int transfer(SessionID, sourceURL, destURL, ObjectID)
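The handle-based pattern of Level II can be sketched with a toy in-memory service: `search` returns only an RSID, and any party holding that RSID can page through the results later. The class, the substring matching, the RSID format, and the 1-based `start` are all assumptions made for illustration; `getResultStatus` and `transfer` are not modeled.

```python
import itertools


class LevelTwoService:
    """Toy in-memory stand-in for a Level II query service (hypothetical)."""

    def __init__(self, records):
        self._records = records
        self._resultsets = {}  # RSID -> (original query, matching records)
        self._counter = itertools.count(1)

    def search(self, session_id, query):
        """Run the query and return only a resultset identifier (RSID)."""
        rsid = f"rs.{next(self._counter)}"
        matches = [r for r in self._records if query.lower() in r.lower()]
        self._resultsets[rsid] = (query, matches)
        return rsid

    def retrieve(self, session_id, rsid, start, numrecs):
        """Return numrecs records beginning at the 1-based position start."""
        _, matches = self._resultsets[rsid]
        return matches[start - 1:start - 1 + numrecs]

    def decode_resultset_identifier(self, session_id, rsid):
        """Return the query that produced the given resultset."""
        return self._resultsets[rsid][0]


service = LevelTwoService(
    ["Peromyscus leucopus", "Peromyscus maniculatus", "Mus musculus"]
)
rsid = service.search("session-a", "peromyscus")

# A different session (for example a third-party service) uses the handle.
page = service.retrieve("session-b", rsid, start=2, numrecs=1)
original_query = service.decode_resultset_identifier("session-b", rsid)
```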
EcoGrid Write
• Used to push data back to sources (e.g. publishing EML documents)
• Depends on the availability of an authentication system
Operations:
  put(sessionID, objectID, object, type)
  delete(sessionID, objectID)
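The dependence of the write operations on authentication can be sketched like this. `WriteService`, its set of valid sessions, and the `contains` helper are hypothetical; a real deployment would delegate the session check to the Session interface.

```python
class WriteService:
    """Toy in-memory write service that rejects unknown sessions (hypothetical)."""

    def __init__(self, valid_sessions):
        self._valid_sessions = set(valid_sessions)
        self._store = {}  # objectID -> (type, object bytes)

    def _require_session(self, session_id):
        if session_id not in self._valid_sessions:
            raise PermissionError("unknown or expired session")

    def put(self, session_id, object_id, obj: bytes, obj_type: str):
        """Store or replace an object on behalf of an authenticated session."""
        self._require_session(session_id)
        self._store[object_id] = (obj_type, obj)

    def delete(self, session_id, object_id):
        """Remove an object; deleting a missing object is a no-op here."""
        self._require_session(session_id)
        self._store.pop(object_id, None)

    def contains(self, object_id):
        return object_id in self._store


service = WriteService(valid_sessions={"session-1"})
service.put("session-1", "doc.1.1", b"<eml/>", "eml")

try:
    service.delete("not-a-session", "doc.1.1")
    rejected = False
except PermissionError:
    rejected = True
```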
Data Instance Query?
• New requirement to support direct query and retrieval with arbitrary data sets
• Generally no common schemas between different instances
• Could either
  • Push data instance to a service that can query the object (e.g. the SRB)
  • Implement interface at the data instance location
• Simple JDBC / SQL interface?
Operations:
  dbSchema getDataSchema(sessionID, objectID)
  dbResultset search(sessionID, objectID, SQL)
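The "simple SQL interface" idea can be sketched with Python's built-in `sqlite3` module standing in for JDBC. `DataInstanceService`, `register`, and the `specimens` table are hypothetical; the session ID is not checked, and a real service would need to restrict the SQL it accepts to read-only statements.

```python
import sqlite3


class DataInstanceService:
    """Toy service exposing schema discovery and SQL search per data instance."""

    def __init__(self):
        self._instances = {}  # objectID -> sqlite3 connection

    def register(self, object_id, connection):
        self._instances[object_id] = connection

    def get_data_schema(self, session_id, object_id):
        """Return {table name: [column names]} for one data instance."""
        conn = self._instances[object_id]
        tables = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        # Column 1 of each PRAGMA table_info row is the column name.
        return {
            table: [col[1] for col in conn.execute(f'PRAGMA table_info("{table}")')]
            for table in tables
        }

    def search(self, session_id, object_id, sql, params=()):
        """Run caller-supplied SQL against the instance and return all rows."""
        return self._instances[object_id].execute(sql, params).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE specimens (name TEXT, latitude REAL)")
conn.executemany(
    "INSERT INTO specimens VALUES (?, ?)",
    [("Peromyscus leucopus", 33.0), ("Mus musculus", None)],
)

service = DataInstanceService()
service.register("dataset.1", conn)
schema = service.get_data_schema("session-1", "dataset.1")
rows = service.search(
    "session-1",
    "dataset.1",
    "SELECT name FROM specimens WHERE latitude IS NOT NULL",
)
```

Since instances share no common schema, a caller would typically read the schema first and only then compose the SQL.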
Convergence with Globus?
• EcoGrid originally intended to use Globus since it provided much of the infrastructure
• Globus is not a viable infrastructure layer due to installation and reliability concerns
• Should SEEK implement Globus infrastructure to support project requirements?
• Likely to duplicate minimal service definitions and re-implement
Acknowledgements
This material is based upon work supported by:
The National Science Foundation under Grant Numbers 9980154, 9904777, 0131178, 9905838, 0129792, and 0225676.
The National Center for Ecological Analysis and Synthesis, a Center funded by NSF (Grant Number 0072909), the University of California, and the UC Santa Barbara campus.
The Andrew W. Mellon Foundation.
PBI Collaborators: NCEAS, University of New Mexico (Long Term Ecological Research Network Office), San Diego Supercomputer Center, University of Kansas (Center for Biodiversity Research)