Architecting Scientific Data Services Using XML, SOAP, & the Grid Brian Wilson, Tom Yunck, Gerald Manipon, and Dominic Mazzoni Jet Propulsion Laboratory
A Conceptual Grid [Figure: a Scientist reaches Networks, Data, Storage, Algorithms, and CPU through a Portal; caption: “On-Demand” Virtualization]
Vision • Open Your Databases! • Support Query on and Retrieval of ALL Metadata • Could support remote SQL queries on your entire database. • Distributed/Local Metadata - not centralized as in ECHO • Support XML formats with schemas & OWL/RDF semantics • Examples: Amazon & Google SOAP Web services • Open Your Architectures – Modularity, Reuse • Think in XML!! (design in XML first, not Java objects) • Expose algorithms & application modules via SOAP • Massive distributed computing via XML message exchange • Reuse modules from OpenDAP/DODS and WMS/WCS • Web Services Choreography / Grid Dataflow • Leverage WS-Coordination, -Context, -Security, etc. • Leverage WS-Resource Framework (WSRF, Globus v4) • Semantic Web Reasoning on top of WS substrate Carbon Cycle
Outline • Web Services • Using XML, SOAP, & WSDL • Thinking in XML: REST versus SOAP • Architecting Data Query/Access Services using SOAP • SOAP as a Universal Interoperability Glue • Classifying Data Services • Dimensions / Categories • Interop between SOAP, OpenDAP, & OGC WMS/WCS • Semantic Web Support • Data Discovery to Inventory Level -- Enabling SOAP services • WS Standards • WS-Coordination, -Notification, -Security, -ReliableMessaging • Convergence of Web & Grid Standards • WS-Resource Framework (WSRF) • Globus Toolkit v4
Three Generations of the Web • 1st: Static HTML pages with Pictures! • Hyperlinks: click and jump! • Easy authoring of text with graphics (HTML layout). • Killer App: Having your own home page is hip. • Cons: Too static, One-way communication. • 2nd: Dynamic HTML with streaming audio & video • Browser as an all-purpose, ubiquitous user interface. • Fancy clients using embedded Java applets. • Killer App: Fill out your time card on-line. • Cons: Applets clunky, ease of authoring disappears, information is still HTML (semi-structured). • 3rd: SOAP-based Web Computing & Semantic Web • Exchange structured data in XML format (no fragile HTML); semantics (“meaning”) kept with data. • Programmatic interfaces rather than just GUI for a human. • Killer Apps: Grid Computing, automated data processing.
What is Simple Object Access Protocol (SOAP)? • Distributed Computing by Exchange of XML Messages • Lightweight, Loosely-Coupled API • Programming language independent (unlike Java RMI) • Transport protocol independent • Multiple Transport Protocols Possible • HTTP or HTTPS (HTTP using SSL encryption) POST • Email (SMTP), Instant Messaging (MQSeries, Jabber) • Store & Forward Reliable Messaging Services (Java JMS) • Even Peer-to-Peer (P2P) protocols • SOAP Toolkits for all languages • Apache Axis for Java, modules for python/perl, Visual C# .Net. • Web Service Description Language (WSDL) • Generate call to a service automatically from WSDL document. • Data types and formats expressed in XML schema. • Publish services in UDDI catalogs for automated discovery.
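A minimal sketch of "distributed computing by exchange of XML messages", using only the Python standard library rather than a SOAP toolkit. It builds a SOAP 1.1 envelope for one call. The helper name `build_envelope` and the argument names `sql` and `db` are assumptions made for illustration; a real service defines its own names in its WSDL.

```python
import xml.etree.ElementTree as ET

# SOAP 1.1 envelope namespace.
SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"


def build_envelope(operation, namespace, args):
    """Return a SOAP 1.1 envelope (bytes) that calls `operation` with string arguments."""
    ET.register_namespace("soapenv", SOAP_ENV)
    envelope = ET.Element("{%s}Envelope" % SOAP_ENV)
    body = ET.SubElement(envelope, "{%s}Body" % SOAP_ENV)
    # The operation element lives in the service's own namespace.
    call = ET.SubElement(body, "{%s}%s" % (namespace, operation))
    for name, value in args.items():
        ET.SubElement(call, name).text = str(value)
    return ET.tostring(envelope, encoding="utf-8")


# Hypothetical argument names; the operation and namespace follow the slides.
message = build_envelope(
    "executeQuery",
    "urn:Genesis_SQL",
    {"sql": "select * from L2data", "db": "jplsacc"},
)
```

Over HTTP, such a message would typically be sent as the body of a POST with a `text/xml` content type and a `SOAPAction` header. Because the interface is only the message, any language that can produce this XML can call the service.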
Third Generation of the Web • SOAP-based Web Computing & Semantic Web • Exchange structured data in XML format (not HTML) • Semantics or “meaning” kept with the data • Emphasize programmatic interfaces • Grid Services – WSRF and Globus Toolkit v4 • Leverage many WS-* standards (Coordination, etc.) • Simple Object Access Protocol (SOAP) • Distributed Computing by Exchange of XML Messages • Lightweight, Loosely-Coupled API • Programming language independent • Multiple Transport Protocols Possible (HTTP, P2P) • Web Services Description Language (WSDL) • Publish Services in catalogs for automated discovery
Why use SOAP to create scientific services? • New paradigm of loosely-coupled distributed computing • SOAP messaging is exploding in the business world. • Used in Grid Computing: WSRF and Globus Toolkit v4 • Asynchronous workflow (return results when available) • Leverage XML standards (WSDL, UDDI, WS-*) • Service API is XML messages, not code. • Data types and formats expressed in XML schema. • Exchanging large scientific data sets • Small objects in XML format • Even medium-size objects (2D grid slices) can be in XML. • Large objects as references (ftp or http URL) to HDF-EOS or netCDF binary containers (files). • Security & authentication less important in science • Use HTTPS, Basic (user/password) Authentication, or WS-Security & WS-Authentication
SOAP Software Toolkits • Java (free) • Install Java 1.4, Tomcat 4.1, and Apache Axis 1.0 • Optionally install WSDL4J, UDDI4J, & WSIF from IBM • Configure Tomcat & Axis: classpaths, etc. • Perl (free) • Install perl 5.6 & SOAP::Lite module (contains UDDI::Lite) • Python (free) • Install python 2.2 and SOAP.py • Optionally install UDDI4py from IBM • DotNet (not free, but cheap) • Install Visual Studio .Net with C# or VB compilers • Send a payment to MS • C++ (some free libraries) • But more work required
Toolkits Do the Work • Transparent Serialization • Java JDBC ResultSet in the server -> XML format -> Java ResultSet Object in the client • Or ResultSet -> C# DataTable Object in a C# client • Or ResultSet -> Perl 2D array or hash in a perl client • Custom Serialization Possible • Write a custom serializer for any object • Automatic Deployment • Axis: Place Java code in the deployment directory and the service is magically exposed. • Can customize deployment using WSDD • C# .Net: Simply annotate the function with the [WebMethod] attribute • Perl & Python: A few lines of code. • Security & Authentication: Use capabilities of Apache
Interoperable SOAP Web Services
Python Client Example

#!/usr/local/bin/python
# soap_sql.py (Python 2): read a database name and an SQL query from stdin,
# then call the executeQuery SOAP service and print the XML result.
import sys
import string
from SOAPpy import SOAP    #import SOAP

db = sys.stdin.readline().strip()    # first input line: database name
lines = sys.stdin.readlines()        # remaining lines: text of the SQL query
sql = string.join(lines, '\n')
server = SOAP.SOAPProxy(
    "https://genesis:passwd@gilaan.usc.edu:80/axis/GenesisSQL.jws",
    namespace = "urn:Genesis_SQL", encoding = None)
print server.executeQuery(SOAP.stringType(sql), SOAP.stringType(db))

Input:
jplsacc
select * from L2data where datasetid == '20010801_1350sac_g43_1p0' and datatype == 'gpsProfile'
REST versus SOAP • REpresentational State Transfer (REST) • Roy Fielding’s Ph.D. Thesis • Everything is an object or document with a URI. • Problems: • Hidden state: cookies, sessions, CGI db, etc. • What about virtual datasets? • Need to generate too many URI’s. • Service Oriented Architecture (SOA) • Everything is a service. • What about stateful services? WS-Resource • Same Old Tradeoff • Like object-oriented programming versus functional programming (need both). • It’s both an object and a service.
Amazon E-Commerce Services • Item Search – SOAP Service input • <SubscriptionId>[Your SOAP key here]</SubscriptionId> <Request> <Keywords>free lunch hoagie</Keywords> <SearchIndex>Books</SearchIndex> </Request> • WSDL File • http://webservices.amazon.com/AWSECommerceService/AWSECommerceService.wsdl • Returns XML Document • Contains TotalResults count and a list of Items • Each item contains many attributes (title, authors, ASIN, etc.)
Amazon E-Commerce Services • Also support REST paradigm for request • http://webservices.amazon.com/onca/xml?Service=AWSECommerceService&Operation=ItemSearch&SubscriptionId=[Your SOAP Key here]&Keywords=free%20lunch%20hoagie&SearchIndex=Books&ResponseGroup=Medium • Response is same XML Document • With results tailored by ResponseGroup(s) • Why are they opening their database up? • Allow 3rd parties to build apps on top of db • Build partnerships • Promulgate their software architecture/standard • Goal: Amazon lookup part of every business web app. • Makes sense even for commercial entities • Certainly best for scientific data archives
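The REST form of the request is just a URL with encoded query parameters. The sketch below assembles the URL shown on the slide with the Python standard library. The function name `item_search_url` and the placeholder key `YOUR-KEY` are assumptions; the parameter names and endpoint are copied from the slide and are not checked against any current Amazon interface.

```python
from urllib.parse import urlencode, quote

# Endpoint as given on the slide.
BASE_URL = "http://webservices.amazon.com/onca/xml"


def item_search_url(subscription_id, keywords,
                    search_index="Books", response_group="Medium"):
    """Build the one-line REST URL for an ItemSearch request."""
    params = [
        ("Service", "AWSECommerceService"),
        ("Operation", "ItemSearch"),
        ("SubscriptionId", subscription_id),
        ("Keywords", keywords),
        ("SearchIndex", search_index),
        ("ResponseGroup", response_group),
    ]
    # quote_via=quote encodes spaces as %20 (the default would use '+').
    return BASE_URL + "?" + urlencode(params, quote_via=quote)


url = item_search_url("YOUR-KEY", "free lunch hoagie")
```

The same six values would travel as child elements of the request in the SOAP form; the REST form flattens them into a single line, which is convenient for simple calls and awkward for deeply structured ones.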
Example: General SQL Query Service • Using Informix Object-Relational Database • Contains all Level 2 GPS occultation profile retrievals • Multiple parameters (temperature, water vapor pressure, refractivity, & pressure) as a function of altitude • GeoSpatial Datablade provides 4D R-tree index on time, latitude, longitude, & altitude. • Support “nearest” semantics (multiple hits sorted by distance) • GetNearest service & BatchGetNearest (batch of requests) • Implemented using Informix ORDBMS & Geodetic DataBlade • SQL queries using 4D R-tree index on time, lat, lon, height • executeQuery SOAP Service • Two input arguments • Database to search • Text of arbitrary SQL Query • Single output • SQL ResultSet in XML format
Output of the executeQuery Service

<?xml version="1.0" encoding="utf-8" ?>
<?xml-stylesheet type="text/xsl" href="http://gilaan.usc.edu:80/axis/xsldir/xsl271059363124909.xsl"?>
<ResultSet>
  <Result>
    <datasetid>20010801_1350sac_g43_1p0</datasetid>
    <datatype>gpsProfile</datatype>
    <height>0.0840427222</height>
    <lat>6.6839553</lat>
    <lon>-38.501096</lon>
    <refractivity>359.099346</refractivity>
    <temperature>308.641494</temperature>
    <pressure>1009.2816</pressure>
    <wvpressure>23.3353547</wvpressure>
  </Result>
  <Result>
    <datasetid>20010801_1350sac_g43_1p0</datasetid>
    <datatype>gpsProfile</datatype>
    <height>0.404808732</height>
    <lat>6.71812814</lat>
    <lon>-38.513238</lon>
    <refractivity>346.437323</refractivity>
    <temperature>302.964981</temperature>
    <pressure>973.213908</pressure>
    <wvpressure>21.5868849</wvpressure>
  </Result>
  . . .
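A client that does not rely on toolkit serialization can turn this ResultSet document into native rows in a few lines. The sketch below uses a shortened copy of the sample above; the function name `parse_result_set` and the rule "convert to float when possible, otherwise keep the string" are assumptions for illustration, not part of the service.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample output, with only a few fields per row.
SAMPLE = """<ResultSet>
  <Result>
    <datasetid>20010801_1350sac_g43_1p0</datasetid>
    <datatype>gpsProfile</datatype>
    <height>0.0840427222</height>
    <temperature>308.641494</temperature>
  </Result>
  <Result>
    <datasetid>20010801_1350sac_g43_1p0</datasetid>
    <datatype>gpsProfile</datatype>
    <height>0.404808732</height>
    <temperature>302.964981</temperature>
  </Result>
</ResultSet>"""


def parse_result_set(xml_text):
    """Return one dict per <Result>, keyed by column name."""
    rows = []
    for result in ET.fromstring(xml_text).findall("Result"):
        row = {}
        for field in result:
            try:
                row[field.tag] = float(field.text)
            except (TypeError, ValueError):
                # Non-numeric or empty column: keep the raw text.
                row[field.tag] = field.text
        rows.append(row)
    return rows


rows = parse_result_set(SAMPLE)
```

Because the columns are named tags, a client can pick out `temperature` or `height` without knowing the column order of the SQL query.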
executeQuery SOAP Message

Accept: text/xml
Accept: multipart/*
Content-Length: 663
Content-Type: text/xml; charset=utf-8
SOAPAction: "urn:Genesis_SQL#executeQuery"

<?xml version="1.0" encoding="UTF-8"?>
<SOAP-ENV:Envelope
    xmlns:SOAP-ENC="http://schemas.xmlsoap.org/soap/encoding/"
    xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/"
    xmlns:xsd="http://www.w3.org/1999/XMLSchema"
    xmlns:xsi="http://www.w3.org/1999/XMLSchema-instance"
    SOAP-ENV:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/">
  <SOAP-ENV:Body>
    <namesp1:executeQuery xmlns:namesp1="urn:Genesis_SQL">
      <c-gensym3 xsi:type="xsd:string">select * from L2data where datasetid == '20010801_1350sac_g43_1p0' and datatype == 'gpsProfile' </c-gensym3>
      <c-gensym5 xsi:type="xsd:string">jplsacc</c-gensym5>
    </namesp1:executeQuery>
  </SOAP-ENV:Body>
</SOAP-ENV:Envelope>
executeQueryResponse SOAP Message

<?xml version="1.0" encoding="UTF-8"?>
<soapenv:Envelope
    xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/"
    xmlns:xsd="http://www.w3.org/1999/XMLSchema"
    xmlns:xsi="http://www.w3.org/1999/XMLSchema-instance">
  <soapenv:Body>
    <ns1:executeQueryResponse
        soapenv:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/"
        xmlns:ns1="urn:Genesis_SQL">
      <ns1:executeQueryReturn xsi:type="ns2:string"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xmlns:ns2="http://www.w3.org/2001/XMLSchema">
        [Returned XML doc here, encoded]
      </ns1:executeQueryReturn>
    </ns1:executeQueryResponse>
  </soapenv:Body>
</soapenv:Envelope>
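Since the result document comes back as an encoded string inside the return element, a hand-written client has to unwrap it in two steps: locate the return element, then parse its text as XML. The sketch below assumes the payload is entity-escaped (`&lt;` and so on), which is one plausible reading of "encoded"; the sample response is a trimmed, hypothetical instance of the message above.

```python
import xml.etree.ElementTree as ET

NS = {
    "soapenv": "http://schemas.xmlsoap.org/soap/envelope/",
    "ns1": "urn:Genesis_SQL",
}

# Trimmed, hypothetical response with an entity-escaped ResultSet payload.
RESPONSE = """<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/">
  <soapenv:Body>
    <ns1:executeQueryResponse xmlns:ns1="urn:Genesis_SQL">
      <ns1:executeQueryReturn>
        &lt;ResultSet&gt;&lt;Result&gt;&lt;lat&gt;6.6839553&lt;/lat&gt;&lt;/Result&gt;&lt;/ResultSet&gt;
      </ns1:executeQueryReturn>
    </ns1:executeQueryResponse>
  </soapenv:Body>
</soapenv:Envelope>"""


def extract_result_document(response_text):
    """Return the embedded result document as a parsed element."""
    envelope = ET.fromstring(response_text)
    ret = envelope.find(
        "soapenv:Body/ns1:executeQueryResponse/ns1:executeQueryReturn", NS)
    if ret is None or ret.text is None:
        raise ValueError("no executeQueryReturn payload in response")
    # The outer parse already un-escaped the text, so it can be parsed again.
    return ET.fromstring(ret.text.strip())


result_set = extract_result_document(RESPONSE)
```

A SOAP toolkit performs this unwrapping automatically; the sketch only shows that nothing beyond an XML parser is required.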
[Figures: two plots of Productivity / Ease of Use versus Time. History of Distributed Computing: Specialized, DCE, CORBA, DCOM, XML-RPC, SOAP. History of Programming Languages: Fortran, C, Turbo Pascal, Object Pascal, C++, Java RMI, Java, Perl, Python.]
Sad History of Function Call Interfaces • Fortran – partially typed • Positional arguments only, by address, partially typed • C/Pascal – fully typed • Added varargs for unknown number of arguments • But used pointers to return multiple results • C++ • Added function/operator overloading • Java • Pointers hidden (yes!), but lost operator overloading • Perl, Python, IDL – func(arg1, arg2, *array, **hash) • Added implicit array to catch variable number of args • Added implicit hash to catch named args • XML – input & output are hierarchical XML documents • Flexible interface since can ignore unneeded tags • Hierarchy & dynamic typing imply rich interface • Use XQuery/XPath to extract/update needed tags
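The last two steps of this history can be sketched side by side in Python. The first function uses the implicit array and hash (`*ranges`, `**options`); the second takes a hierarchical XML document and reads only the tags it needs, so extra tags do not break the call. All function, tag, and value names here are hypothetical.

```python
import xml.etree.ElementTree as ET


def subset(granule_id, variable, *ranges, **options):
    """Extra positional args land in `ranges`; named args land in `options`."""
    return {"granule": granule_id, "variable": variable,
            "ranges": ranges, "options": options}


call = subset("g1", "temperature", (0, 10), (20, 30), format="netCDF")


def subset_from_xml(request_xml):
    """Read only the tags this service needs; unknown tags are ignored."""
    root = ET.fromstring(request_xml)
    return {
        "granule": root.findtext("GranuleId"),
        "variable": root.findtext("Variable"),
        # Optional nested tag with a default, found with a simple path.
        "format": root.findtext("Options/Format", default="netCDF"),
    }


# <Priority> is not known to the service and is simply skipped.
request = ("<Subset><GranuleId>g1</GranuleId>"
           "<Variable>temperature</Variable>"
           "<Priority>low</Priority></Subset>")
parsed = subset_from_xml(request)
```

The XML version lets a caller add new tags later without changing the signature of anything already deployed, which is the flexibility the final bullet points describe.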
Thinking in XML • Design the XML metadata and data representations first. Add an ontology. • Then create objects in your favorite programming language. • Then create objects in more languages for broader use. • 4DRectilinearGrid datatype

<4DRectilinearGrid>
  <Latitudes> <Series>-90, 90, 2
  <Longitudes> <URL>http://gsfc.nasa.gov/cgi-bin/dods/AIRS.hdf?Longitude
  <Altitudes>
  <Times>
  <Variable>
    <Name> <Unit> <MissingValueMarker>
  </Variable>
  <Grid>
    <URL>http://gsfc.nasa.gov/cgi-bin/dods/AIRSgranule.hdf?TAirStd
  </Grid>
</4DRectilinearGrid>
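A sketch of the second step, "then create objects in your favorite programming language", for a cut-down grid description. Several things here are assumptions: the element is renamed `Grid4D` because XML element names cannot begin with a digit, `Series` is read as "start, stop, step" with an inclusive stop, and the unit `K` is a placeholder value.

```python
from dataclasses import dataclass
from typing import List
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment modeled on the slide's outline.
GRID_XML = """<Grid4D>
  <Latitudes><Series>-90, 90, 2</Series></Latitudes>
  <Variable><Name>TAirStd</Name><Unit>K</Unit></Variable>
</Grid4D>"""


@dataclass
class Grid4D:
    latitudes: List[float]
    variable: str
    unit: str


def expand_series(text):
    """Expand 'start, stop, step' into an inclusive list of values."""
    start, stop, step = (float(part) for part in text.split(","))
    count = int(round((stop - start) / step)) + 1
    return [start + i * step for i in range(count)]


def grid_from_xml(xml_text):
    """Build the in-memory object from the XML representation."""
    root = ET.fromstring(xml_text)
    return Grid4D(
        latitudes=expand_series(root.findtext("Latitudes/Series")),
        variable=root.findtext("Variable/Name"),
        unit=root.findtext("Variable/Unit"),
    )


grid = grid_from_xml(GRID_XML)
```

The XML stays the authoritative definition; a Java or C# binding would be written against the same document rather than translated from this Python class.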
New Paradigm for Scientific Computing • Use two or more channels • XML Messaging on one • Binary data protocol (like OpenDAP) on the other • Two channels can use different transports • XML Messaging over HTTP or dark P2P • OpenDAP over LambdaRail or scalable P2P caching network • XML Messaging Channel for: • Control, Service Chaining, Workflow, Metadata Exchange • Configuration of Binary channel • Binary Data Channel(s) • Out-of-band from XML point of view • Bulk data transfer, replication, or caching • Could be multi-protocol (OpenDAP, GridFtp, LambdaGrid)
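One way to picture the two channels: the XML message carries small values inline and carries only a URL for anything large, leaving the bulk transfer to the binary channel. The sketch below is an illustration under assumed names; the element names, the `INLINE_LIMIT` threshold, and the `DataResponse` wrapper are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical threshold: maximum number of values sent inline as XML.
INLINE_LIMIT = 1000


def describe_variable(name, values=None, url=None):
    """Return a <Variable> element with inline values or a URL reference."""
    var = ET.Element("Variable", {"name": name})
    if values is not None and len(values) <= INLINE_LIMIT:
        ET.SubElement(var, "Values").text = " ".join(str(v) for v in values)
    elif url is not None:
        # Large object: only the reference travels on the XML channel.
        ET.SubElement(var, "URL").text = url
    else:
        raise ValueError("large variable needs a URL for the binary channel")
    return var


def binary_urls(message):
    """List the references a client must fetch out of band."""
    return [element.text for element in message.iter("URL")]


message = ET.Element("DataResponse")
message.append(describe_variable("lat", values=[6.68, 6.72]))
message.append(describe_variable(
    "TAirStd",
    url="http://gsfc.nasa.gov/cgi-bin/dods/AIRSgranule.hdf?TAirStd"))
```

The client reads the control message, then hands each entry of `binary_urls(message)` to whatever bulk protocol the reference names.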
Data Query/Access Services • Services Chain goes beyond simple data access • Time & Geolocation Query yielding data inventory • Query on Quality Flags • General Metadata Query catalog satisfying conditions • Data Access - by object or granule ID and variable name • Data Slicing - as in OpenDAP • Data Subsetting – by lat, lon, alt, & time ranges • Parameter Subsetting – select only desired physical variables • Data Reformatting – choose output format as in WMS/WCS • On-Demand Grid Computations – grid diff in GrADS/DODS • Variable Bundles – external metadata from scientist • Return Composite Data Objects • 4DRectilinearGrid, 4DCurvilinearGrid (swath) • XML & Native Binary representations • User Composites – add custom object to data model • Semantic Support • Use generic variable names • Reason about variables to be bundled
Classifying Data Services - Dimensions • Services Provided • Is it discovery, query, access, and/or subsetting? • Degree of Metadata Support • Richness of the Data Model • Request Interface / Protocol • Response Interface / Protocol • Representation of Large Binary Data Objects • Transport Layers (http, GridFtp, p2p, LambdaRail) • Permanent Names & Version Support • Service Chaining • Semantic Web Support • Interoperability with other services • Maturity, Support, Open or Proprietary, Cost
Classifying Data Services (I) • Services Provided • Is it discovery, query, access, and/or subsetting? • Taxonomy on earlier slide • Degree of Metadata Support • Time & Geolocation Query? • General Metadata Query? • Retrieve selected metadata fields? • Bind external metadata into the dataset? • Richness of the Data Model • Supports point, swath, grid, and vector data types? • Is the data model exposed to the user? • Interoperability? (i.e., between OpenDAP, OGC, HDF-EOS) • Composite objects, like 4D geo-referenced grid? • Add user composites? (as in Java object serialization) • Compute grid averages or differences on demand? (Ferret)
Classifying Data Services (II) • Request Interface / Protocol • REST (one-line HTTP URL’s) versus SOAP • OpenDAP & WMS/WCS use REST • REST tends to be more ad hoc than structured XML/SOAP • Response Interface / Protocol • OpenDAP uses custom binary wire protocol • OGC WMS/WCS uses REST • REST uses HTTP with MIME types & multi-part • SOAP uses XML documents with possible attachments • Representation for Large Binary Data Objects • OpenDAP: custom binary wire protocol • WMS/WCS: MIME multi-part embedding • SOAP: embedded, attachments, or URL references • Binary files referenced by ftp, http, or DODS URL’s • Serialized Binary Objects (as in C++, Java, or C# .Net)
Classifying Data Services (III) • Transport Layers (http, GridFtp, p2p, LambdaRail) • REST uses http for both request & response • SOAP can use http, smtp, Instant Messaging, Secure Messaging, or p2p • Degree of Modularity & Layering • Is the software separated into well-defined protocol layers? • Can functions/modules be reused separately? • Example: OpenDAP contains a rich data model, data slicing syntax, and a binary wire protocol. Reuse each? • Clean design or somewhat ad hoc? • Permanent Names & Version Support • Globally unique ID’s for data objects / granules • Permanent names for service endpoints • Versions for data objects & services • Is it a service or an object? (REST vs. SOAP) • It’s both. And they both need to be permanent.
Classifying Data Services (IV) • Service Chaining • Can modules be chained together? • In REST, here one reaches limits of one-line URL’s • In SOAP, have rich world of WS Choreography or Grid Workflow solutions • Use semantic web to discover and chain services • Semantic Web Support • Exposes scientific domain info. for data and service discovery? • Generic dataset and variable names? • Are the data categorized in an ontology? • Is the service categorized in an ontology? • Can the “meaning” of the service and the meaning of each of its inputs and outputs be inferred from an ontology? • Is the service interoperable with others through semantic mediation?
Time & Geolocation Query Service • Inputs • StartTime, EndTime, Lat/Lon Box, DatasetName, VariableName, MetadataGroup(s) • Output • List of items • Each item has unique ID, start & end time, lat/lon box • More (optional) metadata groups • Datasets • Implemented for GENESIS GPS database, and for AIRS, MODIS, & MISR granules in DataPool.
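The core of such a query is an overlap test between the request and each item's time span and lat/lon box. The sketch below runs that test over an in-memory list; it is an illustration of the matching logic only, not the GENESIS implementation, and the `Item` fields and sample IDs are hypothetical. It ignores boxes that wrap across the date line.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Item:
    item_id: str
    start: datetime
    end: datetime
    south: float
    north: float
    west: float
    east: float


def overlaps(item, start, end, south, north, west, east):
    """True if the item's time span and lat/lon box intersect the request."""
    in_time = item.start <= end and item.end >= start
    in_lat = item.south <= north and item.north >= south
    in_lon = item.west <= east and item.east >= west  # no date-line wrap
    return in_time and in_lat and in_lon


def time_geo_query(items, start, end, south, north, west, east):
    """Return the IDs of all items that intersect the request."""
    return [item.item_id for item in items
            if overlaps(item, start, end, south, north, west, east)]


# Hypothetical inventory of two granules.
inventory = [
    Item("granule-a", datetime(2001, 8, 1, 13), datetime(2001, 8, 1, 14),
         0.0, 10.0, -40.0, -30.0),
    Item("granule-b", datetime(2001, 8, 2, 13), datetime(2001, 8, 2, 14),
         50.0, 60.0, 100.0, 110.0),
]
hits = time_geo_query(inventory, datetime(2001, 8, 1), datetime(2001, 8, 1, 23),
                      5.0, 8.0, -39.0, -38.0)
```

A production service would push the same predicate into a spatial index rather than scanning a list.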
SciFlo Data Query/Access (SOAP) Services • GeoLocationQuery(startTime, endTime, lat, lon, timeTolerance, distanceTolerance, variable, metadataGroups) • Returns granule ID’s and geolocation info. for AIRS L2 swaths that intersect lat/lon point or are near enough. • GeoRegionQuery(startTime, endTime, lat/lon region, . . .) • Returns granule ID’s and geolocation info. for AIRS L2 swaths that intersect the lat/lon region or are near enough. • QueryByMetadata(variable, ListOfConstraints, groups) • Returns granule ID’s and selected metadata for AIRS granules that satisfy the metadata constraint expression: (min1 <= field1 <= max1 and/or min2 <= field2 <= max2 . . .). • GetDataById(IdList, UrlOptions) • Given list of unique ID’s, returns list of ftp, http, or DODS URL’s pointing to the granules (files). Uses cache and redirection server. • -- Other semantic interfaces possible
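The constraint expression accepted by QueryByMetadata, a list of `min <= field <= max` ranges joined by "and" or "or", can be sketched as a small evaluator over metadata records. This is an illustration of the expression's meaning, not the SciFlo code; the function names, the `mode` argument, and the field names in the sample data are hypothetical.

```python
def satisfies(metadata, constraints, mode="and"):
    """Evaluate a list of (field, minimum, maximum) range constraints."""
    checks = [
        field in metadata and low <= metadata[field] <= high
        for field, low, high in constraints
    ]
    return all(checks) if mode == "and" else any(checks)


def query_by_metadata(granules, constraints, mode="and"):
    """Return the IDs of granules whose metadata satisfies the expression."""
    return [granule_id for granule_id, metadata in granules.items()
            if satisfies(metadata, constraints, mode)]


# Hypothetical granule metadata keyed by granule ID.
granules = {
    "g1": {"cloudFraction": 0.1, "solarZenith": 30.0},
    "g2": {"cloudFraction": 0.8, "solarZenith": 35.0},
    "g3": {"cloudFraction": 0.9, "solarZenith": 80.0},
}
constraints = [("cloudFraction", 0.0, 0.2), ("solarZenith", 0.0, 60.0)]

clear_and_sunlit = query_by_metadata(granules, constraints, mode="and")
clear_or_sunlit = query_by_metadata(granules, constraints, mode="or")
```

The returned IDs are what a client would then pass to GetDataById to obtain URLs for the granule files.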
SciFlo Data Query/Access (SOAP) Services • GetMetadata(type, groups) • Returns metadata describing scientific domain, dataset info, list of metadata fields, generic name translation table, location of XML schema documents, location of related ontologies (OWL/RDF), etc. • GetHelp(type, groups) • A SOAP service that itself returns help documentation describing the available SOAP services. • --These Services Combined Can Support: • Data Catalog, General Query, Data Access, Generic Names, • Data Discovery by Domain and Dataset Keywords • Type checking via XML schemas • Hooks to semantic web
Data Discovery to Inventory Level • SOAP Services just described provide: • Domain description Metadata • Dataset Info. and Keywords • Generic VariableName Lookup Table • Query by Metadata or just Time & GeoRegion • Layer more semantics on top of these services • Dataset Discovery by keyword search • Bind in additional metadata provided by scientist • Tie ontologies into keywords and genericVariableNames • Semantic Inference: AIRS and MODIS both provide atmosphericTemperature and cloudProperties. Hmmm. Compare them. • XML/SOAP provides substrate; possibilities only limited by ontology development.
Evolving Grid Computing Standards [Figure labeled GT4, from Globus Toolkit “Ecosystem” presentation at GGF11 by Lee Liming]
Evolving Grid Computing Standards • History of Scientific Computing as a Utility • The Grid began as an effort to tightly couple multiple super- or cluster computers together (e.g., Globus Toolkit v1 & v2). • Needed job scheduling, submission, monitoring, steering, etc. • SETI@home success • OGSI: Open Grid Services Infrastructure • Globus v3.2 is an open-source implementation using Java/C. • A service is Grid-enabled by inheriting from a Java class. • Standard is complex and growing. • Challenge: Ease of installation & use. • WS-Resource Framework (WSRF) • Capabilities treated as storage or computing resources exposed on the web. • Globus Toolkit v4 coming
Grid Applications [Figure: layered diagram of a Grid application. Components shown: Simulation Tool, Web Browser, Data Viewer Tool, Chat Tool, Web Portal, Registration Service, Telepresence Monitor, Data Catalog, Credential Repository, Certificate Authority, two Compute Servers, two Cameras, three Database services; boxes labeled A through E.] • Users work with client applications • Application services organize VOs & enable access to other services • Collective services aggregate &/or virtualize resources • Resources implement standard access & management interfaces
Grid Apps Using Globus Middleware [Figure: layered diagram of a Grid application built on Globus. Components shown: Simulation Tool, Web Browser, Data Viewer Tool, Telepresence Monitor, CHEF, CHEF Chat Teamlet, Globus Index Service, Globus MCS/RLS, MyProxy, Certificate Authority, Globus GRAM with two Compute Servers, Globus DAI with three Database services, two Cameras.] • Users work with client applications • Application services organize VOs & enable access to other services • Collective services aggregate &/or virtualize resources • Resources implement standard access & management interfaces
Components in Globus Toolkit 3.2 [Figure: component chart. Categories: Security, Data Management, Resource Management, Information Services, WS Core. Components: WU GridFTP, MDS2, GSI, RFT (OGSI), WS-Index (OGSI), WS-Security, RLS, CAS (OGSI), OGSI-DAI, SimpleCA, XIO, JAVA WS Core (OGSI), Pre-WS GRAM, WS GRAM (OGSI), OGSI C Bindings, OGSI Python Bindings (contributed), pyGlobus (contributed).]
Planned Components in GT 4.0 [Figure: component chart. Categories: Security, Data Management, Resource Management, Information Services, WS Core. Components: New GridFTP, JAVA WS Core (WSRF), Pre-WS GRAM, MDS2, GSI, RFT (WSRF), C WS Core (WSRF), WS-GRAM (WSRF), WS-Index (WSRF), WS-Security, CSF (contribution), RLS, CAS (WSRF), OGSI-DAI, SimpleCA, XIO, pyGlobus (contributed), Authz Framework.]
Compare/Contrast Template • Services Provided • Implemented Standards / Protocols • Inputs, Outputs, & Parameters • Metadata • Maturity • Number of Users, Data Volume • Support • Level, Type, Availability • Open / Proprietary • Open? Proprietary? Published? Costs? • Activities • Interoperability? • Future Plans • Services to be Implemented
Services Provided by WS • Distributed Computing via XML Message Exchange • Service Oriented Architecture (SOA) • Many transport protocols supported (http, smtp, messaging, p2p) • Security, Authentication, Authorization • Reliable Messaging • Notification • Transactions • Stateful Resources – creation, lifetime (WSRF) • WS Choreography • Grid Workflow
Implemented Standards / Protocols • XML schema • SOAP, WSDL, UDDI • WS-Resource Framework • WS-ResourceProperties, -ResourceLifetime, -BaseFaults • WS-Addressing (endpoints with state pointers) • WS-Notification (publish & subscribe) • WS-Security, -Policy, -SecurityPolicy • WS-Transaction, -ReliableMessaging • WS-Coordination, -Context (choreography) • BPEL4WS (IBM workflow engine) • SciFlo Engine coming
Inputs, Outputs, & Parameters • Too numerous to list. • See SOAP and WS-* specifications.