Distributed Software Systems: Cyberinfrastructure and Geoinformatics • Chaitan Baru, San Diego Supercomputer Center
Integrated Cyberinfrastructure System (Source: Dr. Deborah Crawford, Chair, NSF CI Working Committee) • Applications • Geosciences • Environmental Sciences • Neurosciences • High Energy Physics … • Domain-specific Cybertools (software) • Shared Cybertools (software) • Development Tools & Libraries • Middleware Services • Hardware • Distributed Resources (computation, storage, communication, etc.) • Education and Training • Discovery & Innovation
Community Cyberinfrastructure Projects (Adapted from: Prof. Mark Ellisman, UC San Diego) • Science Domains • Ecological Observatories (NEON) • High Energy Physics (GriPhyN) • Ocean Observing (ORION) • Biomedical Informatics (BIRN) • Geosciences (GEON) • Earthquake Engineering (NEES) • Your Specific Tools & User Apps • Shared Tools • Friendly Work-Facilitating Portals: Authentication, Authorization, Auditing, Workflows, Visualization, Analysis • Development Tools & Libraries • Middleware Services • Hardware: Distributed Computing, Instruments and Data Resources
Data, Tools, & Computation • Data • Field observations • Laboratory analyses • Sensor-based data (land, airborne, satellite) • Tools • QA/QC, simple transformations and analyses • Complex models • Computation • Community codes • Access to high-performance computing • Data Intensive Computing
Variety of Geoinformatics Efforts • Data collection • Digital data collection in the field • When does it become “cyberinfrastructure”? • Database curation • E.g. EarthChem, Paleobiology, MorphoBank, Paleo Pollen, etc. • When does it become “tools” and “community codes”? • Software Development • Tools: gravity and magnetics, paleogeography, geochemistry, seismic data products, … • Community codes: SCEC-CME, CIG, …
Variety of Geoinformatics Efforts • High Performance Computing • LiDAR data management • Seismic analyses • Petascale initiative • Data Integration • E.g. CUAHSI HIS • Also, a pressing need in projects like EarthScope
Cyberinfrastructure: The Common Platform Across Distributed Projects • Cyberinfrastructure: to provide access to all of these “resources” and support “interoperability” among them • Data Collection • Data Management and Curation • Tool Development • Modeling and Integration
Example: USArray Data Flow • Deploy field sensor arrays • Across US • Collect data from sensor arrays and perform QA/QC • One of the sites is SIO, San Diego • Archive data for community access • IRIS, Seattle • EarthScope/USArray: Single project, multiple participants.
Example: LiDAR Workflow (Courtesy: Chris Crosby, ASU; D. Harding, NASA) • Workflow stages: Survey → Point Cloud (x, y, z, …) → Interpolate / Grid → Analyze / “Do Science” • Single goal: Multiple projects, multiple participants, e.g. NCALM, GEON, ASU, NASA, USGS, …
GEON Cyberinfrastructure • Funded by NSF IT Research program • Multi-institution collaboration between IT and Earth Science researchers • GEON Cyberinfrastructure provides: • Authenticated access to data and Web services • Registration of data sets, tools, and services with metadata • Search for data, tools, and services, using ontologies • Scientific workflow environment and access to HPC • Data and map integration capability • Scientific data visualization and GIS mapping
Key Informatics Areas • Portals • Authenticated, role-based access to cyber resources: data, tools, models, model outputs, collaboration spaces, … • Data Integration • Search, discovery and integration of data from heterogeneous information sources (“mediation” and “semantic integration”) • Use of workflow systems, and access to HPC • Ability to “program” at a higher level of abstraction • Sharing of models, along with “provenance” information • Gateways to HPC environments • Management of Geospatial Information • Using GIS capabilities, map services, geospatial data integration • Visualization of 3D, 4D geospatial data and information
Distributed System Definition • A Distributed System is • one in which the hardware and software components in networked computers communicate and coordinate their activities only by passing messages, e.g. the Internet • A Distributed Database System is • one in which data is stored at several sites, each managed by a database system (DBMS) that can run independently
Distributed System Models • Client – Server: Client A, Client B, Client C send an invocation over the network to Server 1, which returns a response • Peer to Peer: Process 1, Process 2, Process 3 communicate with each other over the network
Remote Service Invocation • TCP/IP • Basic Internet protocol for computer communications • Platform for building a number of other open or proprietary, “higher-level” communications protocols • Communication at a higher level of abstraction • HTTP • Open protocol based on TCP/IP for the Web • Fixed set of “verbs” (actions) used to transfer HTML documents • CORBA, Java RMI • Protocols based on an object model
SDSC Storage Resource Broker: “Virtualizing” storage (http://www.sdsc.edu/srb) • Client interfaces: C, C++, Linux I/O • Unix Shell • Java, NT Browsers • Prolog Predicate • Web • SRB and MCAT (metadata catalog): Resource, Mthd, User • User Defined • Application Meta-data • Dublin Core • Metadata Extraction • Remote Proxies • DataCutter • Storage systems: Databases (DB2, Oracle, Sybase) • Archives (HPSS, ADSM, UniTree, DMF) • File Systems (Unix, NT, Mac OSX)
SRB Client/Server Model • Data are requested using an SRB ID and a “file abstraction” (open, close, read, write) • SRB Client connects over the network to an SRB Server • SRB Server talks to SRB Server B using the SRB peer-to-peer protocol • SRB servers act as HPSS Client and Oracle Client to reach the HPSS server and Oracle Server over the network
OpenDAP • Client/Server model: OpenDAP Clients connect over the network to OpenDAP Servers
OpenDAP Servers and Clients (From: Peter Cornillon & Jim Gallagher, http://www.opendap.org/support/stennis_tutorial.html) • Data served: Flat Binary, CODAR, General, netCDF, HDF4, Matlab, DSP, Tables, SQL, FITS, CDF, CEDAR • Servers: ESML, netCDF, Matlab, JGOFS, FITS, FreeForm, CODAR, HDF4, DSP, JDBC, CDF, CEDAR • Clients: IDL Client, Matlab Client, netCDF Java, netCDF C, Ferret, GrADS, IDV, VisAD, ncBrowse, Access, Matlab, IDL, Excel
OpenDAP Data Request • Data are requested with a URL. • http://www.cdc.noaa.gov/cgi-bin/nph-nc/datasets/Reynolds_sst • URL parts, in order: Protocol, Machine name, OPeNDAP server, Directory, File name • User can impose a constraint on the data to be acquired from a data set by appending a constraint expression to the end of the URL • Constraint: ?sst[10:10][0:90][0:180]
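A minimal sketch of how a client might assemble the constrained request shown above. The helper names `build_constraint` and `constrained_url` are hypothetical, and the dataset URL is used exactly as it appears on the slide; only string construction is shown, no request is sent.

```python
def build_constraint(variable, ranges):
    """Build an array-subset constraint such as 'sst[10:10][0:90][0:180]'.

    `ranges` is a list of (start, stop) index pairs, one per array dimension.
    """
    return variable + "".join(f"[{start}:{stop}]" for start, stop in ranges)


def constrained_url(dataset_url, constraint):
    """Append the constraint expression to the end of the dataset URL."""
    return f"{dataset_url}?{constraint}"


# Dataset URL copied from the slide; variable name and index ranges likewise.
dataset = "http://www.cdc.noaa.gov/cgi-bin/nph-nc/datasets/Reynolds_sst"
constraint = build_constraint("sst", [(10, 10), (0, 90), (0, 180)])
url = constrained_url(dataset, constraint)
print(url)
```

The server only returns the requested slice of the `sst` array, so the client does not have to download the whole data set.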
Remote Service Invocation with Web Services • A Web Service is a simple protocol for invoking remote services on the Web. It is: • A network “endpoint”, i.e. server, that implements one or more “ports”. • Each port is defined by the message types that it accepts and the messages it returns. • Specified by a “Web Services Description Language” (WSDL) XML document. • Given the WSDL for a web service you know all you need to interact with it. • Web Service Standards also exist for security, policy, reliability, addressing, notification, choreography and workflow. • It is the basis for MS .NET, IBM WebSphere, SUN, Oracle, BEA, HP, … • It is the basis for the new Grid standards like WSRF and OGSA.
Web Site vs Web Service (From: “Building Grid Applications and Portals, An Approach Based on Components, Web Services and Workflow Tools,” Gannon et al, Euro-Par 2004) • Web Site: Designed to pass HTTP GET/POST/PUT requests between a browser and a web server. Google has a web site. • Web Service: Designed for services to talk to other services by exchanging XML messages. Google also provides a web service, so Google may be used in distributed apps. • Diagram: Client’s Browser talks to Web Server; Web Services talk to other Web Services
Grid Services (From: “Building Grid Applications and Portals, An Approach Based on Components, Web Services and Workflow Tools,” Gannon et al, Euro-Par 2004) • Grid: A distributed, heterogeneous set of resources • Integrated by a pervasive layer of services • Goal: allow users to view it as a single system • More than the Internet (which forms part of the resource layer) • Builds on the Web by building on web services • Layers, top to bottom: Open Grid Service Architecture Layer (Data Management Service Registries and Name binding Security Policy Logging Accounting Service Administration & Monitoring Reservations And Scheduling Grid Orchestration Event Service) • Web Services Resource Framework – Web Services Notification • Physical Resource Layer
Access Interfaces and Levels of Access • Web service, native application program interface, ODBC/JDBC, filesystem • SOAP server stack (WSDL and SOAP): Application can also be “wrapped” as a Web Service • Web Server “stack” (URLs and http) • Application Program: SRB, OpenDAP, etc… • DBMS: Expose ODBC/JDBC interface (and full SQL) • filesystem: Mount remote filesystems
Authentication • Client – Server models: User at Client A connects over the network to Server 1, Server 2, Server 3 • Client-side authentication? • Server-side authentication?
Common Authentication • Client: Obtain Credentials from the Certificate Authority • Client: Invoke with Credentials on Server 1, Server 2, Server 3 • Servers: Verify Credentials with the Certificate Authority
Grid Account Management Architecture (GAMA): Single sign-on in GEON (also used in a number of other projects) • Karan Bhatia, Kurt Mueller, Choonhan Youn, Sandeep Chandra • GAMA server: Servlet container, OGSA Grid services wrapper, CACL, Myproxy, CAS, … • Portal server 1: GridSphere, gridportlets, gama, DB, Servlet container, Java keystore; create user, import user, retrieve credential • Portal server 2: Servlet container, Java keystore; retrieve credential • Stand-alone applications
Systems Issues • Load Balancing, Failover, Replication • Client connects to Server 1, Server 2, Server 3 • Multiple servers for load balancing, failover • Data replication
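A minimal sketch of client-side failover across several equivalent servers, assuming each server is reachable through a plain callable. The function `call_with_failover` and the stand-in servers are hypothetical, not part of any system named on the slides.

```python
from typing import Callable, Sequence


def call_with_failover(servers: Sequence[Callable[[str], str]],
                       request: str, start: int = 0) -> str:
    """Try each server in turn, beginning at index `start`.

    Varying `start` between calls spreads load (round robin); moving on
    after a ConnectionError gives failover.
    """
    last_error = None
    for offset in range(len(servers)):
        server = servers[(start + offset) % len(servers)]
        try:
            return server(request)
        except ConnectionError as exc:
            last_error = exc  # remember the failure and try the next server
    raise ConnectionError("all servers failed") from last_error


# Hypothetical stand-ins for Server 1, Server 2, Server 3.
def server1(request: str) -> str:
    raise ConnectionError("Server 1 is down")


def server2(request: str) -> str:
    return f"Server 2 handled {request}"


def server3(request: str) -> str:
    return f"Server 3 handled {request}"


reply = call_with_failover([server1, server2, server3], "get-data")
reply_rr = call_with_failover([server1, server2, server3], "get-data", start=2)
print(reply)
print(reply_rr)
```

This only works if the servers hold the same data, which is why replication appears alongside load balancing and failover.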
Distributed Data Access • What is the issue? • Ability to access data stored in multiple, different databases using a single request, e.g. • Get geologic information from multiple geologic databases • Get employee information from all branches • Ability to update data stored in multiple databases, e.g. • Transfer salary amount from University to my bank account • Transfer funds from Visa account to vendor’s account
Distributed data access • Client accesses Database 1, Database 2, Database 3 • Sources may be data repositories or metadata catalogs • Homogeneous: mySQL, mySQL, mySQL • Heterogeneous: mySQL, Oracle, DB2 • Or: mySQL, Excel, ASCII flat file • How about creating a “cached” local copy?
Data Warehousing • 1. Load data from sources (Data Source 1, Data Source 2, Data Source 3) to warehouse, using ETL: • Extract • Transform • Load • 2. Query processing interaction only between client and Data Warehouse (common schema) • But, warehouse data could be “stale”, i.e. out of sync with source data…
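A minimal sketch of the extract, transform, load steps, and of how the warehouse goes stale. The two source layouts, the field names, and the common schema (`sample_id`, `rock_type`) are all assumptions made for illustration.

```python
def extract(source):
    """Extract: take a snapshot copy of the source records."""
    return list(source)


def transform_a(record):
    """Transform a Source A dict into the common warehouse schema."""
    return {"sample_id": record["SampleID"],
            "rock_type": record["RockType"].strip().title()}


def transform_b(record):
    """Transform a Source B tuple into the common warehouse schema."""
    sample_id, rock = record
    return {"sample_id": sample_id, "rock_type": rock.strip().title()}


def load(warehouse, rows):
    """Load: append the transformed rows to the warehouse."""
    warehouse.extend(rows)


# Hypothetical sources with different local layouts.
source_a = [{"SampleID": "A-1", "RockType": " granite "}]
source_b = [("B-7", "BASALT")]

warehouse = []
load(warehouse, [transform_a(r) for r in extract(source_a)])
load(warehouse, [transform_b(r) for r in extract(source_b)])

# A source changes after the load; the warehouse does not see it until
# the next ETL run, so client queries now run against stale data.
source_a.append({"SampleID": "A-2", "RockType": "gneiss"})
warehouse_is_stale = len(warehouse) < len(source_a) + len(source_b)
print(warehouse, warehouse_is_stale)
```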
Data integration via middleware • Client sends requests to Data integration Middleware (aka Mediator), which sits in front of Database 1, Database 2, Database 3 • 1. Each client request goes to sources, via middleware • 2. Result collected by middleware and returned to client
Warehousing vs Mediation • Warehousing: Use ETL to “massage” local data to fit into a common global, warehouse schema • Mediation: Modify user query to match schemas exported by each source • But, which schema does the user query? • The Integrated View Schema • Sources “export” a view (the export schema) • Federated databases • Local sources belong to different “administrative domains”, i.e. different owners. • Local autonomy
The Canonical Mediator / Wrapper Architecture • Client Application sends query Q1 to the Mediator (Integrated view in mediator data model, e.g. relational, XML) • Mediator may hold cached data • Mediator sends subqueries Q11, Q12, Q13, Q14 to one Wrapper per source (Data source 1, Data source 2, Data source 3, Data source 4) • Each Wrapper offers an export view in mediator data model over the local view (local schema) in local data model, e.g. Q14 becomes q14 at Data source 4 • Wrapper processes could execute at sources, at mediator, or elsewhere
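A minimal sketch of the mediator / wrapper pattern, assuming in-memory sources and a predicate function in place of a real query language. The `Wrapper` and `Mediator` classes, the source layouts, and the export view fields (`SampleID`, `Rock_Type`) are hypothetical.

```python
class Wrapper:
    """Presents one source's local rows as rows of the export view."""

    def __init__(self, local_rows, to_export_view):
        self.local_rows = local_rows
        self.to_export_view = to_export_view  # local schema -> export view

    def query(self, predicate):
        exported = (self.to_export_view(row) for row in self.local_rows)
        return [row for row in exported if predicate(row)]


class Mediator:
    """Sends each client query to every wrapper and collects the results."""

    def __init__(self, wrappers):
        self.wrappers = list(wrappers)

    def query(self, predicate):
        results = []
        for wrapper in self.wrappers:
            results.extend(wrapper.query(predicate))
        return results


# Two hypothetical sources with different local schemas.
source1 = [{"id": "S1-1", "rock": "Granite"}, {"id": "S1-2", "rock": "Basalt"}]
source2 = [("S2-1", "Granite"), ("S2-2", "Shale")]

mediator = Mediator([
    Wrapper(source1, lambda r: {"SampleID": r["id"], "Rock_Type": r["rock"]}),
    Wrapper(source2, lambda r: {"SampleID": r[0], "Rock_Type": r[1]}),
])

granite_ids = [row["SampleID"]
               for row in mediator.query(lambda row: row["Rock_Type"] == "Granite")]
print(granite_ids)
```

The client only sees the integrated view; the differences between local schemas stay inside the wrappers.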
Example: A Relational Mediator • Client Application • Mediator (Relational data model) • One Wrapper per source • Sources: Shape file; Relational DBMS, e.g. PostGIS
Example: A Shape-file Based Mediator • Client Application • Mediator (Shape file-based data model) • One Wrapper per source • Sources: Shape file; Relational DBMS, e.g. PostGIS
Example: An XML Mediator • User / Applications • Mediator (XML-based data model, e.g. GML) • One Wrapper per source • Sources: Shape file; XML file, e.g. ArcXML; Relational DBMS, e.g. PostGIS
User Authentication and Access Control • 1. User authenticates to system • How about using GAMA for authentication? • 2. User connects to mediator through Client Application (passes credentials to mediator) • 3. Mediator connects to sources through Wrappers • Using original user credentials • Or, mapped credentials (role-based access) • 4. Need to define users or roles in sources (Data source 1, Data source 2)
Different types of heterogeneity in data integration • Platform heterogeneity: different OS platforms • DBMS heterogeneity: different database systems, e.g. SQLServer, mySQL, DB2 • Data type heterogeneity • Schema heterogeneity • Heterogeneity in units, accuracy, resolution • Semantic heterogeneity
Schema Integration • A long standing Computer Science problem • Simple case • Source 1 Table: (Sample ID varchar, Rock type varchar, Age int, …) • Source 2 Table: (Sample ID varchar, Rock type varchar, Age varchar, …) • Mediator View: (SampleID varchar, Rock_Type varchar, Age int) • In Source2 Table, map Age to int • Wrapper: convert between int and varchar for Age
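A minimal sketch of the Age conversion a Source 2 wrapper would perform in both directions, assuming rows arrive as tuples. The function names and sample values are hypothetical.

```python
def source2_row_to_view(row):
    """Convert a Source 2 row (Age stored as varchar) to the mediator view (Age as int)."""
    sample_id, rock_type, age_text = row
    return (sample_id, rock_type, int(age_text.strip()))


def view_age_to_source2(age):
    """Convert a mediator-side int Age into the varchar form Source 2 stores."""
    return str(age)


view_row = source2_row_to_view(("S2-001", "Basalt", " 150 "))
source_value = view_age_to_source2(150)
print(view_row, source_value)
```

A real wrapper would also need a policy for Age strings that are not valid integers; `int()` raises `ValueError` on them here.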
Another integration scenario • Source 1 Table: (Sample ID varchar, Rock type varchar, Eon varchar, Era varchar, Period varchar), e.g. Phanerozoic, Mesozoic, Jurassic • Source 2 Table: (Sample ID varchar, Rock type varchar, Age varchar), e.g. “Phanerozoic/mesozoic;jur” • Mediator View: (SampleID varchar, Rock_Type varchar, Age varchar, Era varchar, Period varchar) • In Source 2 Table, parse Age to obtain sub-components of the field
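A minimal sketch of parsing the Source 2 Age string into its sub-components, assuming the field always has the form eon/era;period as in the slide's example. The abbreviation table and the function `parse_age` are hypothetical; a real source would need its own list of encodings.

```python
import re

# Hypothetical abbreviations a source might use for period names.
PERIOD_ABBREVIATIONS = {"tri": "Triassic", "jur": "Jurassic", "cre": "Cretaceous"}


def parse_age(age):
    """Split an Age string such as 'Phanerozoic/mesozoic;jur' into Eon, Era, Period."""
    eon, era, period = (part.strip() for part in re.split(r"[/;]", age))
    period = PERIOD_ABBREVIATIONS.get(period.lower(), period.capitalize())
    return {"Eon": eon.capitalize(), "Era": era.capitalize(), "Period": period}


components = parse_age("Phanerozoic/mesozoic;jur")
print(components)
```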
A more advanced integration scenario • Source 1 Table: (Sample ID varchar, Rock type varchar, Eon varchar, Era varchar, Period varchar), e.g. Phanerozoic, Mesozoic, Jurassic • Source 2 Table: (Sample ID varchar, Rock type varchar, Age int), e.g. 150 • Mediator View: (SampleID varchar, Rock_Type varchar, Eon varchar, Era varchar, Period varchar) • Same as Source1 table schema • Query: Get rock types for all rocks from the Jurassic period
Doing the integration • Source 2 Table: (Sample ID varchar, Rock type varchar, Age int) • Geologic_Time Table: (Eon varchar, Era varchar, Period varchar, Min int, Max int) • Query sent to mediator: SELECT DISTINCT(Rock_Type) FROM Mediator_View WHERE Period='Jurassic' • Query to Source 1: SELECT DISTINCT(Rock_Type) FROM Source1_Table WHERE Period='Jurassic' • For Source2, need to map Period='Jurassic' to Age values
Query “fragment” sent to Source 2 • SELECT DISTINCT (S2.Rock_Type) FROM Source2_Table S2, Geologic_Time_Table GT WHERE GT.Period = 'Jurassic' AND (S2.Age >= GT.Min) AND (S2.Age <= GT.Max) • Where is the Geologic_Time table stored?
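A minimal sketch that runs the Source 2 query fragment against an in-memory SQLite database. It assumes the Geologic_Time table sits next to the Source 2 table, which is only one possible answer to the question on the slide. The sample rows are made up, and the period boundaries are approximate ages in millions of years.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Source2_Table (Sample_ID TEXT, Rock_Type TEXT, Age INTEGER);
CREATE TABLE Geologic_Time_Table (Eon TEXT, Era TEXT, Period TEXT,
                                  Min INTEGER, Max INTEGER);

-- Made-up samples; Age in millions of years.
INSERT INTO Source2_Table VALUES
  ('S2-1', 'Granite', 150), ('S2-2', 'Basalt', 100),
  ('S2-3', 'Sandstone', 160), ('S2-4', 'Granite', 190);

-- Approximate period boundaries in millions of years.
INSERT INTO Geologic_Time_Table VALUES
  ('Phanerozoic', 'Mesozoic', 'Triassic', 201, 252),
  ('Phanerozoic', 'Mesozoic', 'Jurassic', 145, 201),
  ('Phanerozoic', 'Mesozoic', 'Cretaceous', 66, 145);
""")

# The query fragment from the slide, with the period passed as a parameter.
fragment = """
SELECT DISTINCT (S2.Rock_Type)
FROM Source2_Table S2, Geologic_Time_Table GT
WHERE GT.Period = ?
  AND (S2.Age >= GT.Min) AND (S2.Age <= GT.Max)
"""
jurassic_rocks = sorted(row[0] for row in conn.execute(fragment, ("Jurassic",)))
print(jurassic_rocks)
```

If the Geologic_Time table lived at the mediator instead, the mediator would first look up Min and Max and then send Source 2 a plain range predicate on Age.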
Data Integration Carts™ • Integrating data sets without explicitly creating views • An example request: Plot all gravity data points that fall within the spatial extent of rocks of a given type, in the Rocky Mountain testbed region • Use GEONsearch to find all gravity and geologic data using bounding box for “Rocky Mountain testbed region” • Need gazetteer / spatial ontology to determine Rocky Mountain region • Need to know classification of datasets (as gravity and geology) • Intersect extent of gravity and geologic datasets (from metadata) with extent of Rocky Mountain region • Plot gravity point data that fall within polygons of rocks of given type
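A minimal sketch of the two spatial steps in the example request: intersecting dataset extents with the region's bounding box, then keeping gravity points that fall inside a rock polygon. All coordinates, values, and function names are hypothetical, and the point-in-polygon test is a standard ray-casting check on planar (longitude, latitude) pairs, not GEON's implementation.

```python
def bboxes_intersect(a, b):
    """True if two (min_x, min_y, max_x, max_y) extents overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def point_in_polygon(x, y, polygon):
    """Ray casting: count polygon edges crossed by a ray going right from (x, y)."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge spans the ray's y value
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Hypothetical extents as (min_lon, min_lat, max_lon, max_lat).
region = (-112.0, 36.0, -103.0, 44.0)           # stand-in for the testbed region
gravity_extent = (-111.0, 37.0, -99.0, 43.0)    # from gravity dataset metadata
geology_extent = (-110.0, 38.0, -105.0, 42.0)   # from geologic dataset metadata
usable = (bboxes_intersect(gravity_extent, region)
          and bboxes_intersect(geology_extent, region))

# Hypothetical polygon for rocks of the given type, and gravity points
# as (lon, lat, value).
granite_polygon = [(-110.0, 38.0), (-105.0, 38.0), (-105.0, 42.0), (-110.0, 42.0)]
gravity_points = [(-107.0, 40.0, -210.5), (-100.0, 40.0, -180.0)]

to_plot = [p for p in gravity_points
           if usable and point_in_polygon(p[0], p[1], granite_polygon)]
print(to_plot)
```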
Ad hoc integration • Search Metadata Catalog with GEONsearch: “Geologic and gravity data in Rocky Mountains” • Data Integration Cart™ • Map Query • Plot map
Data Registration • Item Registration (Schema registration) • Item Detail Registration • Gravity dataset, with Metadata: Point; Latitude, Longitude • Geologic dataset, with Metadata: Polygon; (X, Y); Lat, Long, RockType • Spatial Ontology: Location • Rock Classification Ontology: Igneous; Granite; Quartz monzonite
Another complex query • Query: Get rock types for all rocks from the Mesozoic era • Easy to do for Source 1: Era = “Mesozoic” • For Source 2: • Need to find numeric age range for Mesozoic • Find age range across all subclasses of Mesozoic (Cretaceous, Jurassic, Triassic) • Select all Source 2 Table records whose Age falls within the Mesozoic age range
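A minimal sketch of the Source 2 steps for the Mesozoic query, again using an in-memory SQLite database. Table and column names follow the earlier slides; the sample rows are made up, and the period boundaries are approximate ages in millions of years.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Source2_Table (Sample_ID TEXT, Rock_Type TEXT, Age INTEGER);
CREATE TABLE Geologic_Time_Table (Eon TEXT, Era TEXT, Period TEXT,
                                  Min INTEGER, Max INTEGER);

-- Made-up samples; Age in millions of years.
INSERT INTO Source2_Table VALUES
  ('S2-1', 'Granite', 150), ('S2-2', 'Basalt', 100),
  ('S2-3', 'Shale', 230), ('S2-4', 'Limestone', 300);

-- Approximate period boundaries in millions of years.
INSERT INTO Geologic_Time_Table VALUES
  ('Phanerozoic', 'Mesozoic', 'Triassic', 201, 252),
  ('Phanerozoic', 'Mesozoic', 'Jurassic', 145, 201),
  ('Phanerozoic', 'Mesozoic', 'Cretaceous', 66, 145),
  ('Phanerozoic', 'Paleozoic', 'Permian', 252, 299);
""")

# Steps 1 and 2: age range of the era, taken across all of its periods.
era_min, era_max = conn.execute(
    "SELECT MIN(GT.Min), MAX(GT.Max) FROM Geologic_Time_Table GT WHERE GT.Era = ?",
    ("Mesozoic",),
).fetchone()

# Step 3: Source 2 records whose Age falls inside that range.
mesozoic_rocks = sorted(row[0] for row in conn.execute(
    "SELECT DISTINCT S2.Rock_Type FROM Source2_Table S2 "
    "WHERE S2.Age >= ? AND S2.Age <= ?",
    (era_min, era_max),
))
print((era_min, era_max), mesozoic_rocks)
```

Deriving the era range from its periods is the kind of knowledge an ontology of geologic time would supply, so it does not have to be hard-coded in each query.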