
Lessons on Process and Standards in other science communities

Lessons on Process and Standards in other science communities. IMAG Model Sharing Strategies Workshop NIH April 10 2007 Geoffrey Fox Computer Science, Informatics, Physics Pervasive Technology Laboratories Indiana University Bloomington IN 47401


Presentation Transcript


  1. Lessons on Process and Standards in other science communities IMAG Model Sharing Strategies Workshop NIH April 10 2007 Geoffrey Fox Computer Science, Informatics, Physics Pervasive Technology Laboratories Indiana University Bloomington IN 47401 http://grids.ucs.indiana.edu/ptliupages/presentations/ gcf@indiana.edu http://www.infomall.org

  2. What is a Model Electronically? • This should have a label – a URI • It should have a collection of data or metadata defining it • It might have some way of building composite models by joining multiple smaller models together • Need to be able to define connections • Maybe there are also “mechanisms” to manipulate model or evolve it in time • A computer program defines the data as values and the mechanisms as subroutines/methods • Programs can be Fortran, Python, C#, Prolog • Declarative or Imperative; Scripted or Compiled • However in spite of software engineering, computer programs are very hard to share and re-use
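
The ingredients listed on slide 2 (a URI label, defining metadata, composition of smaller models, and explicit connections) can be sketched as a small data structure. This is only an illustrative sketch: the `Model` class, the `compose` helper and the `urn:example:` URIs are hypothetical names, not part of any IMAG standard.

```python
from dataclasses import dataclass, field


@dataclass
class Model:
    """A model as a URI label plus the data or metadata that defines it."""
    uri: str
    metadata: dict = field(default_factory=dict)
    parts: list = field(default_factory=list)        # smaller models joined together
    connections: list = field(default_factory=list)  # (source_uri, target_uri) pairs

    def connect(self, source, target):
        """Record a connection between two sub-models of this composite."""
        known = {part.uri for part in self.parts}
        if source.uri not in known or target.uri not in known:
            raise ValueError("both ends of a connection must be parts of this model")
        self.connections.append((source.uri, target.uri))


def compose(uri, *parts):
    """Build a composite model from smaller models (hypothetical helper)."""
    return Model(uri=uri, parts=list(parts))


# Hypothetical example models, used only to exercise the structure.
heart = Model("urn:example:model/heart", {"species": "human"})
lung = Model("urn:example:model/lung", {"species": "human"})
cardio = compose("urn:example:model/cardiopulmonary", heart, lung)
cardio.connect(heart, lung)
```

The "mechanisms" that evolve a model in time are deliberately left out here; in a real program they would be the methods or subroutines that the slide mentions.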

  3. What are Questions? • What are the models we are trying to define? • What is Process to decide on needed standards and their Syntax • Are we mainly concerned about data defining the model and/or the programs that build the model • Where are overlaps between IMAG requirements and other computer science or science fields • Is the barrier to sharing models “science” (i.e. it is not clear what the common interfaces are) or “systematization” (we agree on interface points but don’t have a common syntax)

  4. Some Examples • There are many examples of relevant efforts to encourage sharing of models • DMSO (Defense Modeling and Simulation Office) produced HLA (High Level Architecture) as a (pre-CORBA/Web Service) way of defining military models as discrete event simulations • Good but out of date • The Open Geospatial Consortium (OGC) http://www.opengeospatial.org/ is a consortium of 339 organizations setting excellent standards for Geographical Information Systems • We could develop a BIS (Biological Information System)? • The International Virtual Observatory Alliance (IVOA) http://www.ivoa.net/ is 16 organizations (each of which is a collection like EVO, the European Virtual Observatory) defining sharing standards for astronomy data

  5. Virtual Observatory Astronomy Grid: Integrate Experiments [Figure panels: Radio, Far-Infrared, Visible, Dust Map, Visible + X-ray, Galaxy Density Map]

  6. OGC Standards I

  7. OGC Standards II

  8. WMS uses WFS that uses data sources. Defines Earthquake Fault: <gml:featureMember> <fault> <name> Northridge2 </name> <segment> Northridge2 </segment> <author> Wald D. J.</author> <gml:lineStringProperty> <gml:LineString srsName="null"> <gml:coordinates> -118.72,34.243 -118.591,34.176 </gml:coordinates> </gml:LineString> </gml:lineStringProperty> </fault> </gml:featureMember>
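
To show how a client might consume the fault feature on slide 8, here is a minimal sketch that parses it with Python's standard library. The slide omits the namespace declaration, so the `xmlns:gml` attribute below is an added assumption (it uses the usual GML namespace URI), and `parse_fault` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

GML_NS = "http://www.opengis.net/gml"

# The feature from slide 8, with an xmlns:gml declaration added (assumption)
# so that a namespace-aware parser accepts it.
FAULT_GML = """
<gml:featureMember xmlns:gml="http://www.opengis.net/gml">
  <fault>
    <name> Northridge2 </name>
    <segment> Northridge2 </segment>
    <author> Wald D. J.</author>
    <gml:lineStringProperty>
      <gml:LineString srsName="null">
        <gml:coordinates> -118.72,34.243 -118.591,34.176 </gml:coordinates>
      </gml:LineString>
    </gml:lineStringProperty>
  </fault>
</gml:featureMember>
"""


def parse_fault(xml_text):
    """Extract the name, author and (longitude, latitude) points of a fault."""
    root = ET.fromstring(xml_text.strip())
    fault = root.find("fault")
    coords_text = fault.find(f".//{{{GML_NS}}}coordinates").text
    # Each whitespace-separated token is one "x,y" pair.
    points = [tuple(float(v) for v in pair.split(",")) for pair in coords_text.split()]
    return {
        "name": fault.findtext("name").strip(),
        "author": fault.findtext("author").strip(),
        "points": points,
    }


fault = parse_fault(FAULT_GML)
```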

  9. OGC Standards • Typify a common competition – there is a similar effort by a Technical Committee of the International Standards Organization (ISO/TC211) • Are very complex – the GML specification itself is over 600 pages • Underlie the success of GIS, enabled first through ESRI (ArcInfo) and Minnesota Map Server and now through Google Maps • Are built in XML (as they should be) but for efficiency one • Transmits through binary XML • Stores in SQL databases not in XML databases • Define some things (catalog) which are unnecessary as they are provided by a broader community • Observations and Measurements work for any time series and so are also broader but there is no competition!

  10. OGC Standards Structure • Have a language GML that defines the field – this would be CellML and SBML in the case of Biology and CML for ChemInformatics • Have a user interface (the Map) captured as a Web Map Service • Have a “pixel data” service WCS the Web Coverage Service • Have a “vector” (feature, property) data service WFS the Web Feature Service • Note any Earth Science simulation or data analysis can be thought of as accepting WFS compatible data and producing WFS or WCS compatible output
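
As a rough illustration of the service split on slide 10, the sketch below builds key-value request URLs for a Web Map Service (GetMap) and a Web Feature Service (GetFeature). The parameter names follow the OGC key-value request conventions for WMS 1.1.1 and WFS 1.0.0; the host name, layer name and feature type are hypothetical placeholders.

```python
from urllib.parse import urlencode


def wms_getmap_url(base_url, layers, bbox, width, height):
    """Build a WMS 1.1.1 GetMap request for a rendered map image."""
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.1.1",
        "REQUEST": "GetMap",
        "LAYERS": ",".join(layers),
        "STYLES": "",
        "SRS": "EPSG:4326",
        "BBOX": ",".join(str(v) for v in bbox),  # minx,miny,maxx,maxy
        "WIDTH": width,
        "HEIGHT": height,
        "FORMAT": "image/png",
    }
    return base_url + "?" + urlencode(params)


def wfs_getfeature_url(base_url, type_name, bbox):
    """Build a WFS 1.0.0 GetFeature request for vector (feature) data."""
    params = {
        "SERVICE": "WFS",
        "VERSION": "1.0.0",
        "REQUEST": "GetFeature",
        "TYPENAME": type_name,
        "BBOX": ",".join(str(v) for v in bbox),
    }
    return base_url + "?" + urlencode(params)


# Hypothetical server and names, for illustration only.
BBOX = (-119.0, 34.0, -118.0, 35.0)
map_url = wms_getmap_url("http://example.org/wms", ["faults"], BBOX, 256, 256)
feature_url = wfs_getfeature_url("http://example.org/wfs", "fault", BBOX)
```

The map request returns pixels for display while the feature request returns GML, which is the division of labour between WMS and WFS that the slide describes.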

  11. Grid Workflow Datamining in Earth Science • Work with Scripps Institute • Grid services controlled by workflow process real time data from ~70 GPS Sensors in Southern California [Figure labels: NASA GPS; Earthquake; Streaming Data Support; Archival; Transformations; Data Checking; Hidden Markov Datamining (JPL); Real Time Display (GIS)]

  12. Data Federation • The IVOA activities are aimed largely at supporting interoperable data repositories that can feed into the image processing filtering needed to extract signals • There is not so much simulation • ChemInformatics has most data in NIH’s PubChem but will need to federate additional repositories such as those produced by individual Chemistry groups and the raw data from NIH screening centers • Every county (total 92) in Indiana has its own GIS and something equivalent to a WFS holding information not yet known to Google! (e.g. our house pinpoint address and assessment) • Need to federate all these to support state agencies • So federation of distributed resources is a major issue and WFS uses “capabilities” to support this
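
The role of "capabilities" in federation can be sketched with in-memory stand-ins for county feature servers: each server advertises which feature types it holds, and the federating layer only queries servers that advertise the requested type. The class and function names, and the feature types, are hypothetical; a real WFS would answer GetCapabilities and GetFeature requests over HTTP.

```python
class FeatureServer:
    """In-memory stand-in for one county's feature server (hypothetical)."""

    def __init__(self, name, features_by_type):
        self.name = name
        self._features = features_by_type

    def get_capabilities(self):
        """Advertise the feature types this server can supply."""
        return sorted(self._features)

    def get_feature(self, type_name):
        return list(self._features.get(type_name, []))


def federated_get_feature(servers, type_name):
    """Query only the servers whose capabilities include the requested type."""
    results = []
    for server in servers:
        if type_name in server.get_capabilities():
            results.extend((server.name, f) for f in server.get_feature(type_name))
    return results


# Hypothetical holdings for three counties.
servers = [
    FeatureServer("Marion", {"parcel": ["M-1", "M-2"], "road": ["I-65"]}),
    FeatureServer("Hamilton", {"parcel": ["H-1"]}),
    FeatureServer("Cass", {"road": ["US-24"]}),
]
parcels = federated_get_feature(servers, "parcel")
```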

  13. Indiana County Map Grid • GIS Grid of “Indiana Map” and ~10 Indiana counties with accessible Map (Feature) Servers from different vendors • Grids federate different data repositories (cf Astronomy VO federating different observatory collections)

  14. Must provide adapters for each Map Server type. Browser client fetches image tiles for the bounding box using Google Map API. Tile Server requests map tiles at all zoom levels with all layers. These are converted to uniform projection, indexed, and stored. Overlapping images are combined. The cache server fulfills Google map calls with cached tiles that fill the requested bounding box. [Figure labels: Browser + Google Map API; Google Maps Server; Cache Server; Tile Server; three Adapters; Marion County Map Server (ESRI ArcIMS); Hamilton County Map Server (AutoDesk); Cass County Map Server (OGC Web Map Server)]
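
The adapter-plus-cache arrangement on slide 14 might look roughly like the sketch below. All class names are hypothetical, and the adapter returns fake bytes instead of calling a vendor map server. Note one deliberate simplification: the slide's Tile Server fetches tiles for all zoom levels ahead of time, whereas this sketch caches lazily on first request.

```python
class MapServerAdapter:
    """Common interface hiding each vendor's map server protocol (hypothetical)."""

    def fetch_tile(self, bbox, zoom):
        raise NotImplementedError


class FakeCountyAdapter(MapServerAdapter):
    """Stand-in adapter that fabricates tile bytes and counts backend calls."""

    def __init__(self, name):
        self.name = name
        self.calls = 0

    def fetch_tile(self, bbox, zoom):
        self.calls += 1
        return f"{self.name}:{zoom}:{bbox}".encode()


class TileCache:
    """Serve tiles from memory, calling an adapter only on a cache miss."""

    def __init__(self, adapters):
        self.adapters = adapters  # county name -> adapter
        self._tiles = {}

    def get_tile(self, county, bbox, zoom):
        key = (county, bbox, zoom)
        if key not in self._tiles:
            self._tiles[key] = self.adapters[county].fetch_tile(bbox, zoom)
        return self._tiles[key]


marion = FakeCountyAdapter("Marion")
cache = TileCache({"Marion": marion})
bbox = (-86.3, 39.6, -86.0, 39.9)
first = cache.get_tile("Marion", bbox, 10)
second = cache.get_tile("Marion", bbox, 10)  # served from the cache
```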

  15. Searched on Transit/Transportation

  16. Service or Web service Approach • One uses GML, CML etc. to define the data in a system and one uses services to capture “methods” or “programs” • In eScience, important services fall in three classes • Simulations • Data access, storage, federation, discovery • Filters for data mining and manipulation • Services use something like WSDL (Web Service Definition Language) to define interoperable interfaces (see OPAL talk!) • WSDL establishes a “contract” independent of implementation between two services or a service and a client • Services should be loosely coupled which normally means they are coarse grain • Services will be composed (linked together) by mashups (typically scripts) or workflow (often XML – BPEL) • Software Engineering and Interoperability/Standards are closely related

  17. Philosophy of Web Service Grids • Much of Distributed Computing was built by natural extensions of computing models developed for sequential machines • This leads to the distributed object (DO) model represented by Java and CORBA • RPC (Remote Procedure Call) or RMI (Remote Method Invocation) for Java • Key people think this is not a good idea as it scales badly and ties distributed entities together too tightly • Distributed Objects Replaced by Services • Note CORBA was considered too complicated in both organization and proposed infrastructure • and Java was considered as “tightly coupled to Sun” • So there were other reasons to discard • Thus replace distributed objects by services connected by “one-way” messages and not by request-response messages
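
The contrast between request-response calls and services joined by "one-way" messages can be illustrated with two queues and a worker thread. This is a single-process analogy under stated assumptions, not a Web Service implementation: `run_service` and the message texts are hypothetical, and `None` is used as an end-of-stream marker.

```python
import queue
import threading


def run_service(inbox, outbox, transform):
    """Consume messages from inbox and emit results to outbox; never reply to the sender."""
    while True:
        message = inbox.get()
        if message is None:      # end-of-stream marker (assumption of this sketch)
            outbox.put(None)
            break
        outbox.put(transform(message))


a_to_b = queue.Queue()
b_to_c = queue.Queue()
worker = threading.Thread(target=run_service, args=(a_to_b, b_to_c, str.upper))
worker.start()

# The sender posts messages and moves on; it does not block waiting for a response.
for text in ["fault detected", "gps update"]:
    a_to_b.put(text)
a_to_b.put(None)
worker.join()

# A downstream consumer drains whatever the service emitted.
results = []
while True:
    item = b_to_c.get()
    if item is None:
        break
    results.append(item)
```

Because the sender knows only the queue and the message format, the service behind it can be replaced without touching the sender, which is the loose coupling the slide argues for.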

  18. Web services • Web Services build loosely-coupled, distributed applications (wrapping existing codes and databases) based on the SOA (service oriented architecture) principles. • Web Services interact by exchanging messages in SOAP format • The contracts for the message exchanges that implement those interactions are described via WSDL interfaces.
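
A SOAP message is an XML envelope wrapping an application-specific body. The sketch below builds one with the standard library, using the SOAP 1.1 envelope namespace; the application namespace `urn:example:faults`, the `GetFault` operation and its `name` argument are hypothetical and would normally come from the service's WSDL.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
APP_NS = "urn:example:faults"                          # hypothetical application namespace


def build_soap_request(operation, arguments):
    """Wrap an operation call and its arguments in a SOAP envelope."""
    ET.register_namespace("soap", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    op = ET.SubElement(body, f"{{{APP_NS}}}{operation}")
    for name, value in arguments.items():
        ET.SubElement(op, f"{{{APP_NS}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="unicode")


message = build_soap_request("GetFault", {"name": "Northridge2"})
```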

  19. A typical Web Service • In principle, services can be in any language (Fortran .. Java .. Perl .. Python) and the interfaces can be method calls, Java RMI Messages, CGI Web invocations, or totally compiled away (inlining) • The simplest implementations involve XML messages (SOAP) and programs written in net friendly languages like Java and Python [Figure labels: Portal Service; Security; Catalog; Payment Credit Card; Warehouse; Shipping control; Web Services; WSDL interfaces]

  20. CICC Web Service Infrastructure [Figure labels: OSCAR Document Analysis; InChI Generation/Search; Computational Chemistry (Gamess, Jaguar etc.); Varuna.net Quantum Chemistry; Grid Services (Service Registry, Job Submission and Management; Local Clusters, IU Big Red, TeraGrid, Open Science Grid); Portal Services (RSS Feeds, User Profiles, Collaboration as in Sakai)]

  21. Where Does The Functionality Come From? • University of Michigan: PkCell • Cambridge University: InChI generation / search, OSCAR • DigitalChemistry: BCI fingerprints, DivKMeans • gNova Consulting • NIH: PubChem, PubMed • CDK: Cheminformatics • European Chemicals Bureau: ToxTree toxicity predictions • OpenEye: Docking • R Foundation: R package • Indiana University: VOTables, NCI DTP predictions, Database services

  22. Service Modeling Language (SML) • Submitted to W3C by industry giants 21 March 2007 • A model in SML is realized as a set of interrelated XML documents. The XML documents contain information about the parts of an IT service, as well as the constraints that each part must satisfy for the IT service to function properly. Constraints are captured in two ways: • Schemas – these are constraints on the structure and content of the documents in a model. SML uses a profile of XML Schema 1.0 as the schema language. SML also defines a set of extensions to XML Schema to support inter-document references. • Rules – are Boolean expressions that constrain the structure and content of documents in a model. SML uses a profile of Schematron (goes between documents) and XPath 1.0 for rules.
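
The idea of a model as interrelated XML documents constrained by Boolean rules can be illustrated without real SML tooling. The sketch below is not SML, XML Schema or Schematron syntax; the element names (`service`, `uses`, `database`), the attributes and the `check_model` helper are all hypothetical, and the rules are plain Python predicates standing in for Schematron/XPath rules.

```python
import xml.etree.ElementTree as ET

# Two documents of a toy model: a service that refers to a database by name.
service_doc = ET.fromstring('<service name="portal"><uses database="chemdb"/></service>')
database_doc = ET.fromstring('<database name="chemdb" maxConnections="10"/>')


def check_model(service, databases):
    """Return the names of failed rules; an empty list means the model is valid."""
    known = {db.get("name") for db in databases}
    rules = {
        # Inter-document rule: every reference must point at a known document.
        "every database reference resolves": lambda: all(
            use.get("database") in known for use in service.findall("uses")
        ),
        # Content rule: each database must accept at least one connection.
        "every database accepts connections": lambda: all(
            int(db.get("maxConnections", "0")) > 0 for db in databases
        ),
    }
    return [name for name, rule in rules.items() if not rule()]


failures = check_model(service_doc, [database_doc])
```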

  23. Models in SML • Models focus on capturing all invariant aspects of a service/system that must be maintained for the service/system to be functional. • Models are units of communication and collaboration between designers, implementers, operators, and users; and can easily be shared, tracked, and revision controlled. This is important because complex services are often built and maintained by a variety of people playing different roles. • Models drive modularity, re-use, and standardization. Most real-world complex services and systems are composed of sufficiently complex parts.  Re-use and standardization of services/systems and their parts is a key factor in reducing overall production and operation cost and in increasing reliability. • Models represent a powerful mechanism for validating changes before applying the changes to a service/system. Also, when changes happen in a running service/system, they can be validated against the intended state described in the model. The actual service/system and its model together enable a self-healing service/system – the ultimate objective. Models of a service/system must necessarily stay decoupled from the live service/system to create the control loop • Models enable increased automation of management tasks. Automation facilities exposed by the majority of IT services/systems today could be driven by software – not people – for reliable initial realization of a service/system as well as for ongoing lifecycle management.

  24. Structured v Unstructured Metadata • The schemas that are defined by GML etc. are structured definitions • The traditional semantic web approach is largely based on structured metadata (OWL) that one can analyze precisely • UML was for example used by OGC in developing standards • In the “real world”, unstructured annotation has been very successful as seen in Connotea, del.icio.us and CiteULike

  25. How to set standards • If one is Google, you can just define the standard and not bother to discuss it! • Google Maps does not support OGC standards • The growth in distributed computing has spurred a great deal of standards work as we need the different parts of a system built by different people to interoperate • Often meet every few weeks to build a standard in 12 months • OASIS defines a process and doesn’t define an architecture • W3C is most prestigious • OGF (Open Grid Forum) has an eScience section that is currently led by me • Or do it outside any standards body as in fact most domain specific standards are done • Note IVOA has meetings from time to time at OGF to coordinate their astronomy standards with general Grid standards

  26. The Grid and Web Service Institutional Hierarchy • 4: Application or Community of Interest (CoI) Specific Services such as “Map Services”, “Run BLAST” or “Simulate a Missile” (XBML, XTCE, VOTABLE, CML, CellML) • 3: Generally Useful Services and Features (OGSA and other GGF, W3C) such as “Collaborate”, “Access a Database” or “Submit a Job” (OGSA GS-* and some WS-*, GGF/W3C/…., XGSP (Collab)) • 2: System Services and Features (WS-* from OASIS/W3C/Industry), Handlers like WS-RM, Security, UDDI Registry • 1: Container and Run Time (Hosting) Environment (Apache Axis, .NET etc.) • Must set standards to get interoperability

  27. The Ten areas covered by the 60 core WS-* Specifications

  28. Activities in Global Grid Forum Working Groups

  29. Two-level Programming I • The Web Service (Grid) paradigm implicitly assumes a two-level Programming Model • We make a Service (same as a “distributed object” or “computer program” running on a remote computer) using conventional technologies • C++ Java or Fortran Monte Carlo module • Data streaming from a sensor or Satellite • Specialized (JDBC) database access • Such services accept and produce data from users, files and databases • The Grid is built by coordinating such services assuming we have solved the problem of programming the service [Figure labels: Service; Data]

  30. Two-level Programming II • The Grid is discussing the composition of distributed services with the runtime interfaces to Grid as opposed to UNIX pipes/data streams • Familiar from use of UNIX Shell, PERL or Python scripts to produce real applications from core programs • Such interpretative environments are the single processor analog of Grid Programming • Some projects like GrADS from Rice University are looking at integration between service and composition levels but dominant effort looks at each level separately [Figure labels: Service1; Service2; Service3; Service4]
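
The second programming level, composing services much as a shell script pipes core programs together, can be sketched on a single processor with ordinary functions. The `compose` helper and the three toy "services" are hypothetical, and the readings are made-up integers (displacements in millimetres); in a Grid each step would be a remote service rather than a local function.

```python
from functools import reduce


def compose(*services):
    """Chain services so the output of each becomes the input of the next."""
    def workflow(data):
        return reduce(lambda value, service: service(value), services, data)
    return workflow


# Level 1: each "service" is written with conventional programming.
def check_data(readings):
    """Drop missing readings."""
    return [r for r in readings if r is not None]


def to_magnitude(readings):
    """Keep only the size of each displacement."""
    return [abs(r) for r in readings]


def detect_events(readings, threshold=5):
    """Report displacements larger than the threshold."""
    return [r for r in readings if r > threshold]


# Level 2: the workflow only wires services together.
gps_workflow = compose(check_data, to_magnitude, detect_events)
events = gps_workflow([1, None, 12, -20])
```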

  31. Grid Workflow Data Assimilation in Earth Science • Grid services triggered by abnormal events and controlled by workflow process real time data from radar and high resolution simulations for tornado forecasts Typical graphical interface to service composition

  32. 3 Layer Programming Model • Application (level 1 Programming): MPI, Fortran, C++ etc. • Application Semantics (Metadata, Ontology), Level 2 “Programming”: Semantic Web • Workflow (level 3) Programming: BPEL • Basic Web Service Infrastructure: Web Service 1, WS 2, WS 3, WS 4 • Workflow can be built on top of NaradaBrokering as messaging layer

  33. [Figure: Sensor Services (SS), Filter Services (FS), Other Services (OS) and MetaData (MD) exchange SOAP Messages and connect to a Database, a Portal, other Services and other Grids, transforming Raw Data → Data → Information → Knowledge → Wisdom → Decisions]

  34. Information Management/Processing • SOAP messages transport information expressed in a semantically rich fashion between sources and services that enhance and transform information so that the complete system provides • Semantic Web technologies like RDF and OWL help us have rich expressivity • Data → Information → Knowledge transformation • We build application specific information management/transformation systems (ASIS) for each application domain • One special domain is the system itself where the metadata associated with services, sessions, Grids, messages, streams and workflow is itself managed and supported by an SIIS

  35. Generalizing a GIS • Geographical Information Systems GIS have been hugely successful in all fields that study the earth and related worlds • They define Geography Syntax (GML) and ways to store, access, query, manipulate and display geographical features • In SOA, GIS corresponds to a domain specific XML language and a suite of services for different functions above • However such a universal information model has not been developed in other areas even though there are many fields in which it appears possible • BIS Biological Information System • MIS Military Information System • IRIS Information Retrieval Information System • PAIS Physics Analysis Information System • SIIS Service Infrastructure Information System

  36. ASIS Application Specific Information System I • a) Discovery capabilities that are best done using WS-* standards • b) Domain specific metadata and data including search/store/access interface (cf WFS). Let's call the generalization ASFS (Application Specific Feature Service) • Language to express domain specific features (cf GML). Let's call this ASL (Application Specific Language) • Tools to manipulate information expressed in the language and key data of the application (cf coordinate transformations). Let's call this ASTT (Application Specific Tools and Transformations) • ASL must support Data sources such as sensors (cf OGC metadata and data sensor standards) and repositories. Sensors need (common across applications) support of streams of data • Queries need to support archived (find all relevant data in past) and streaming (find all data in future with given properties) • Note all AS Services behave like Sensors and all sensors are wrapped as services • Any domain will have “raw data” (binary) and that which has been filtered to ASL. Let's call the raw form ASBD (Application Specific Binary Data)
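
The two query styles on slide 36, archived (data already stored) and streaming (data that arrives later), can be sketched as one small in-memory service. `FeatureService` and its methods are hypothetical names for illustration, not an ASFS or WFS interface, and the features are made-up dictionaries.

```python
class FeatureService:
    """Toy feature store supporting archived and streaming queries (hypothetical)."""

    def __init__(self):
        self._archive = []
        self._subscriptions = []

    def query_archive(self, predicate):
        """Archived query: find all matching features already stored."""
        return [f for f in self._archive if predicate(f)]

    def subscribe(self, predicate, callback):
        """Streaming query: call back for every future feature that matches."""
        self._subscriptions.append((predicate, callback))

    def publish(self, feature):
        """A sensor (or any service acting like one) delivers a new feature."""
        self._archive.append(feature)
        for predicate, callback in self._subscriptions:
            if predicate(feature):
                callback(feature)


def is_large(feature):
    return feature["magnitude"] >= 6.0


service = FeatureService()
service.publish({"id": "a", "magnitude": 6.7})   # arrives before anyone subscribes

seen = []
service.subscribe(is_large, seen.append)
service.publish({"id": "b", "magnitude": 4.1})
service.publish({"id": "c", "magnitude": 7.2})

past_and_present = service.query_archive(is_large)
```

The same predicate serves both query styles, which is one way to read the slide's remark that all AS Services behave like sensors.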

  37. ASIS Application Specific Information System II • Let's call this ASVS (Application Specific Visualization Services) generalizing WMS for GIS • The ASVS should both visualize information and provide a way of navigating (cf GetFeatureInfo) the database (the ASFS) • The ASVS can itself be federated and presents an ASFS output interface • d) There should be an application service interface for ASIS from which all ASIS services inherit • e) There will be other user services interfacing to ASIS • All user and system services will input and output data in ASL using filters to cope with ASBD [Figure labels: AS Repository; AS “Sensor”; AS Tool (generic); AS Service (user defined); ASVS Display; Filter, Transformation, Reasoning, Data-mining, Analysis; Messages using ASL]

  38. Mashups v Workflow? • Mashup Tools are reviewed at http://blogs.zdnet.com/Hinchcliffe/?p=63 • Workflow Tools are reviewed by Gannon and Fox http://grids.ucs.indiana.edu/ptliupages/publications/Workflow-overview.pdf • Both include scripting in PHP, Python, sh etc. as both implement distributed programming at level of services • Mashups use all types of service interfaces and do not have the potential robustness (security) of Grid service approach • Typically “pure” HTTP (REST)

  39. Web 2.0 APIs • http://www.programmableweb.com/apis currently (March 3 2007) 388 Web 2.0 APIs with GoogleMaps the most used in Mashups • This site acts as a “UDDI” or “OGC Catalog” for Web 2.0

  40. The List of Web 2.0 API’s • Each site has API and its features • Divided into broad categories • Only a few used a lot (34 API’s used in more than 10 mashups) • RSS feed of new APIs

  41. Growing number of commercial Mashup Tools • 3 more Mashups each day • For a total of 1609 on March 3 2007 • Note ClearForest runs Semantic Web Services Mashup competitions (not workflow competitions) • Some Mashup types: aggregators, search aggregators, visualizers, mobile, maps, games

  42. APIs/Mashups per Protocol Distribution [Chart: Number of APIs and Number of Mashups per protocol (REST; SOAP; XML-RPC; REST, XML-RPC; REST, XML-RPC, SOAP; REST, SOAP; JS; Other). APIs shown: google maps, del.icio.us, virtual earth, 411sync, yahoo! search, yahoo! geocoding, technorati, netvibes, yahoo! images, trynt, amazon ECS, yahoo! local, live.com, google search, flickr, ebay, youtube, amazon S3]
