Grids for Chemical Informatics. Chemistry, IU Bloomington Oct. 21 2005 Geoffrey Fox Computer Science, Informatics, Physics Pervasive Technology Laboratories Indiana University Bloomington IN 47401 gcf@indiana.edu http://www.infomall.org. Why are Grids Important.
Grids for Chemical Informatics Chemistry, IU Bloomington Oct. 21 2005 Geoffrey Fox Computer Science, Informatics, Physics Pervasive Technology Laboratories Indiana University Bloomington IN 47401 gcf@indiana.edu http://www.infomall.org
Why are Grids Important • Grids are important for Chemistry because they support key functionalities that grow in importance as we are deluged with data from instruments and simulations • Grids provide information access, storage and management • Grids manage multiple simulations with different defining parameters • Grids allow complex workflows with data flowing between filters • Grids define models for portals • Grids are built on top of commodity web service technology with broad industry support – the next generation information technology • Grids are used in multiple NIH and other life science/chemistry projects across the world (BIRN, caBIG, myGrid, Comb-e-Chem )
Internet Scale Distributed Services • Grids use Internet technology and are distinguished by managing or organizing sets of network connected resources • Classic Web allows independent one-to-one access to individual resources • Grids integrate together and manage multiple Internet-connected resources: People, Sensors, computers, data systems • Organization can be explicit as in • TeraGrid which federates many supercomputers; • Deep Web Technologies IR Grid which federates multiple data resources; • CrisisGrid which federates first responders, commanders, sensors, GIS, (Tsunami) simulations, science/public data • Organization can be implicit as in Internet resources such as curated databases and simulation resources that “harmonize a community”
Different Visions of the Grid • Grid just refers to the technologies • Or Grids represent the full system/Applications • DoD’s vision of Network Centric Computing can be considered a Grid (linking sensors, warfighters, commanders, backend resources) and they are building the GiG (Global Information Grid) • Utility Computing or X-on-demand (X=data, computer ..) is a major computer industry interest in Grids and is a key part of enterprise or campus Grids • e-Science or Cyberinfrastructure are virtual organization Grids supporting global distributed science (note that sensors, instruments and people are all distributed) • Skype (Kazaa) VOIP system is a Peer-to-peer Grid (and VRVS/GlobalMMCS-like Internet A/V conferencing systems are Collaboration Grids) • Commercial 3G Cell-phones and DoD ad-hoc network initiative are forming mobile Grids
Types of Computing Grids • Running “Pleasingly Parallel Jobs” as in United Devices, Entropia (Desktop Grid) “cycle stealing systems” • Can be managed (“inside” the enterprise as in Condor) or more informal (as in SETI@Home) • Computing-on-demand in Industry where jobs spawned are perhaps very large (SAP, Oracle …) • Support distributed file systems as in Legion (Avaki), Globus with (web-enhanced) UNIX programming paradigm • Particle Physics will run some 30,000 simultaneous jobs • Distributed Simulation HLA/RTI style Grids • Linking Supercomputers as in TeraGrid • Pipelined applications linking data/instruments, compute, visualization • Seamless Access where Grid portals allow one to choose one of multiple resources with a common interface • Parallel Computing typically NOT suited for a Grid (latency)
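The independent ("pleasingly parallel") jobs described above can be sketched on a single machine as a parameter sweep. This is only an illustration: `run_simulation` is a hypothetical stand-in for a real simulation code, the temperatures are made-up parameters, and a local thread pool plays the role that Condor or a desktop Grid would play across many machines.

```python
from concurrent.futures import ThreadPoolExecutor


def run_simulation(temperature: float) -> float:
    """Hypothetical stand-in for one independent simulation run."""
    # Each job depends only on its own parameter, so no job waits on another.
    return sum(1.0 / (k + temperature) for k in range(1, 101))


def parameter_sweep(temperatures, max_workers=4):
    """Run one job per parameter value and collect results by parameter."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in the same order as the inputs.
        results = list(pool.map(run_simulation, temperatures))
    return dict(zip(temperatures, results))


sweep = parameter_sweep([250.0, 300.0, 350.0])
for temperature, value in sweep.items():
    print(f"T={temperature}: {value:.6f}")
```

Because the jobs never communicate, latency between machines does not matter, which is why this style suits a Grid while tightly coupled parallel computing does not.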
Old Style Metacomputing Grid • Original: Spread a single large problem over multiple supercomputers • Now-1: Control multiple smallish jobs, each on independent computers • Now-2: Choose which of a few supercomputers to use (figure labels: Large Scale Parallel Computers, Large Disks, Analysis and Visualization)
Towards an International Compute Grid Infrastructure (figure: network map; labels: UK NGS, Leeds, Manchester, Oxford, RAL, UCL, UKLight, Netherlight (Amsterdam), Starlight (Chicago), US TeraGrid, SDSC, NCSA, PSC, SC05, local laptops in Seattle and UK; legend: Computation, Steering clients, Service Registry, Network PoP) • All sites connected by production network (not all shown)
Information/Knowledge Grids • Distributed data sources, 10’s to 1000’s of them (instruments, file systems, curated databases …) • Data Deluge: 1 (now) to 100’s petabytes/year (2012) • Moore’s law for Sensors • Possible filters assigned dynamically (on-demand) • Run image processing algorithm on telescope image • Run Gene sequencing algorithm on compiled data • Needs decision support front end with “what-if” simulations • Metadata (provenance) critical to annotate data • Integrate across experiments as in multi-wavelength astronomy • Data Deluge comes from pixels/year available
Data Deluged Science • Now particle physics will get 100 petabytes from CERN using around 30,000 CPU’s simultaneously 24X7 • Exponential growth in data and compare to: • The Bible = 5 Megabytes • Annual refereed papers = 1 Terabyte • Library of Congress = 20 Terabytes • Internet Archive (1996 – 2002) = 100 Terabytes • Weather, climate, solid earth (EarthScope) • Bioinformatics curated databases (Biocomplexity only 1000’s of data points at present) • Virtual Observatory and SkyServer in Astronomy • Environmental Sensor nets • In the past, HPCC community worried about data in the form of parallel I/O or MPI-IO, but we didn’t consider it as an enabler of new science and new ways of computing • Data assimilation was not central to HPCC • DoE ASCI set up because didn’t want test data!
Virtual Observatory Astronomy Grid: Integrate Experiments (figure panels: Radio, Far-Infrared, Visible, Dust Map, Visible + X-ray, Galaxy Density Map)
International Virtual Observatory Alliance • Reached international agreements on Astronomical Data Query Language, VOTable 1.1, UCD 1+, Resource Metadata Schema • Image Access Protocol, Spectral Access Protocol and Spectral Data Model, Space-Time Coordinates definitions and schema • Interoperable registries by Jan 2005 (NVO, AstroGrid, AVO, JVO) using OAI publishing and harvesting • So each Community of Interest builds data AND service standards that build on GS-* and WS-*
myGrid Project • Imminent ‘deluge’ of data • Highly heterogeneous • Highly complex and inter-related • Convergence of data and literature archives
The Williams Workflows • A: Identification of overlapping sequence • B: Characterisation of nucleotide sequence • C: Characterisation of protein sequence
Web services • Web Services build loosely-coupled, distributed applications (wrapping existing codes and databases) based on SOA (service oriented architecture) principles. • Web Services interact by exchanging messages in SOAP format • The contracts for the message exchanges that implement those interactions are described via WSDL interfaces.
A typical Web Service • In principle, services can be in any language (Fortran .. Java .. Perl .. Python) and the interfaces can be method calls, Java RMI Messages, CGI Web invocations, or totally compiled away (inlining) • The simplest implementations involve XML messages (SOAP) and programs written in net friendly languages like Java and Python (figure labels: Portal Service, Security, Catalog, Payment (Credit Card), Warehouse (Shipping control), joined by Web Services with WSDL interfaces)
Two-level Programming I • The Web Service (Grid) paradigm implicitly assumes a two-level Programming Model • We make a Service (same as a “distributed object” or “computer program” running on a remote computer) using conventional technologies • C++, Java or Fortran Monte Carlo module • Data streaming from a sensor or Satellite • Specialized (JDBC) database access • Such services accept and produce data from users, files and databases • The Grid is built by coordinating such services, assuming we have solved the problem of programming the service (figure labels: Service, Data)
Two-level Programming II • The Grid community is discussing the composition of distributed services, with runtime interfaces to the Grid taking the place of UNIX pipes/data streams • Familiar from use of UNIX Shell, PERL or Python scripts to produce real applications from core programs • Such interpretative environments are the single processor analog of Grid Programming • Some projects like GrADS from Rice University are looking at integration between service and composition levels but the dominant effort looks at each level separately (figure labels: Service1, Service2, Service3, Service4)
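The two levels can be illustrated with a small single-process sketch, in the spirit of the shell-script analogy above. Everything here is hypothetical: the three "services" are ordinary Python functions standing in for remote services, and `compose` stands in for a workflow engine. Level one is writing each service; level two is wiring them together without looking inside them.

```python
from functools import reduce
from typing import Callable, Dict, List

Message = Dict[str, object]
Service = Callable[[Message], Message]


# Level 1: individual services, written with conventional technology.
def simulate(msg: Message) -> Message:
    """Hypothetical simulation service: produces a list of values."""
    steps = int(msg["steps"])
    return {"values": [i * i for i in range(steps)]}


def keep_even(msg: Message) -> Message:
    """Hypothetical filter service: keeps only even values."""
    return {"values": [v for v in msg["values"] if v % 2 == 0]}


def summarize(msg: Message) -> Message:
    """Hypothetical analysis service: reduces values to a summary."""
    values = msg["values"]
    return {"count": len(values), "total": sum(values)}


# Level 2: composition, which only sees messages flowing between services.
def compose(services: List[Service]) -> Service:
    def workflow(msg: Message) -> Message:
        return reduce(lambda current, service: service(current), services, msg)
    return workflow


pipeline = compose([simulate, keep_even, summarize])
result = pipeline({"steps": 6})
print(result)
```

The composition code never changes when a service is reimplemented, which is the separation of levels the slide describes.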
Grid of Grids: Research Grid and Education Grid (figure labels: Field Trip Data, Streaming Data, Sensors, Sensor Grid, Database, Database Grid, Repositories, Federated Databases, GIS Grid, Discovery Services, Compute Grid, Computer Farm, Research Simulations, Data Filter Services, Customization Services, Analysis and Visualization Portal, SERVOGrid (Research), Education Grid (Education), From Research to Education)
SERVOGrid Requirements • Seamless Access to Data repositories and large scale computers • Integration of multiple data sources including sensors, databases, file systems with analysis system • Including filtered OGSA-DAI (Grid database access) • Rich meta-data generation and access with SERVOGrid specific Schema extending openGIS (Geography as a Web service) standards and using Semantic Grid • Portals with component model for user interfaces and web control of all capabilities • Collaboration to support world-wide work • Basic Grid tools: workflow and notification • NOT metacomputing
Earthquake Grid • DoD NCOW Grid • … (figure: layered Grid architecture) • CoI Specific Grids/Services: Earthquake Data & Simulation Service, ServoIS, C2 (JBI CEE etc.), NCOW-IS Services • Grids: Compute Grid, Information Grid, Sensor Grid, GIS Grid • Numbered services: 1: Management, 2: Security, 3: Messaging, 4: Discovery, 5: Mediation, 6: Collaboration Grid, 7: Portals, 8: Data Access/Storage, 9: Application Services, 10: Policy (ECS), 11: Metadata • Core Low Level Grid Services over the Physical Network • n: Service refers to core services identified by DoD • CoI: Community of Interest • GIS: Geographical Information System
BioInformatics Grid • Chemical Informatics Grid • … (figure: layered Grid architecture) • Domain Specific Grids/Services: Sequencing Tools, Biocomplexity Simulations, BIS, HTS Tools, Quantum Calculations, CIS • Grids: Compute Grid, Information Grid, Instrument Grid, MIS Grid • Numbered services: 1: Management, 2: Security, 3: Messaging, 4: Discovery, 5: Workflow, 6: Collaboration Grid, 7: Portals, 8: Data Access/Storage, 9: Application Services, 10: Policy, 11: Metadata • Core Low Level Grid Services over the Physical Network • M(B,C)IS: Molecular (Bio, Chem) Information System
GIS Grid with WMS, WFS, data sources and GML • <gml:featureMember> <fault> <name> Northridge2 </name> <segment> Northridge2 </segment> <author> Wald D. J.</author> <gml:lineStringProperty> <gml:LineString srsName="null"> <gml:coordinates> -118.72,34.243 -118.591,34.176 </gml:coordinates> </gml:LineString> </gml:lineStringProperty> </fault> </gml:featureMember> • GML becomes CML, CellML, SBML
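As a rough sketch of how a client might consume such a feature, the fragment above can be parsed with Python's standard library. Two assumptions are made here that the slide does not state: the `gml` prefix is bound to the namespace URI `http://www.opengis.net/gml`, and the `fault` elements carry no namespace. The function name `parse_fault` is illustrative only.

```python
import xml.etree.ElementTree as ET

# Assumed namespace binding; the slide shows the prefix but not its declaration.
GML_NS = {"gml": "http://www.opengis.net/gml"}

FEATURE = """<gml:featureMember xmlns:gml="http://www.opengis.net/gml">
  <fault>
    <name> Northridge2 </name>
    <segment> Northridge2 </segment>
    <author> Wald D. J.</author>
    <gml:lineStringProperty>
      <gml:LineString srsName="null">
        <gml:coordinates> -118.72,34.243 -118.591,34.176 </gml:coordinates>
      </gml:LineString>
    </gml:lineStringProperty>
  </fault>
</gml:featureMember>"""


def parse_fault(xml_text: str) -> dict:
    """Extract the fault name, author and line-string points from one feature."""
    root = ET.fromstring(xml_text)
    fault = root.find("fault")
    coords = fault.find(".//gml:coordinates", GML_NS).text
    # Each whitespace-separated token is a "longitude,latitude" pair.
    points = [tuple(float(x) for x in pair.split(",")) for pair in coords.split()]
    return {
        "name": fault.findtext("name").strip(),
        "author": fault.findtext("author").strip(),
        "points": points,
    }


fault = parse_fault(FEATURE)
print(fault["name"], fault["points"])
```

The same pattern would apply to CML, CellML or SBML documents, with different element names and namespaces.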
Electric Power and Natural Gas data from LANL Interdependent Critical Infrastructure Simulations (map viewer controls: Zoom-in, Zoom-out, FeatureInfo mode, Measure distance mode, Clear Distance, Drag and Drop mode, Refresh to initial map)
Integrating Archived Web Feature Services and Google Maps Google maps can be integrated with Web Feature Service Archives to filter and browse seismic records.
What is Happening? • Grid ideas are being developed in (at least) four communities • Web Service – W3C, OASIS, (DMTF) • Grid Forum (High Performance Computing, e-Science) • Enterprise Grid Alliance (Commercial “Grid Forum” with a near term focus) • Service Standards are being debated • Grid Operational Infrastructure is being deployed • Grid Architecture and core software being developed • Apache has several important projects as do academia; large and small companies • Particular System Services are being developed “centrally” – OGSA or GS-* framework for this in GGF; WS-* for OASIS/W3C/Microsoft-IBM • Lots of fields are setting domain specific standards and building domain specific services • USA started but now Europe is probably in the lead and Asia will soon catch USA if momentum (roughly zero for USA) continues
The Grid and Web Service Institutional Hierarchy • 4: Application or Community of Interest Specific Services such as “Run BLAST” or “Look at Houses for sale” • 3: Generally Useful Services and Features such as “Access a Database” or “Submit a Job” or “Manage Cluster” or “Support a Portal” or “Collaborative Visualization” • 2: System Services and Features: Handlers like WS-RM, Security, Programming Models like BPEL or Registries like UDDI • 1: Container and Run Time (Hosting) Environment • Standards sources shown beside the layers: OGSA GS-* and some WS-* (GGF/W3C/…); WS-* from OASIS/W3C/Industry; Apache Axis, .NET etc. • Must set standards to get interoperability
Location of software for Grid Projects in Community Grids Laboratory • http://www.naradabrokering.org provides Web service (and JMS) compliant distributed publish-subscribe messaging (software overlay network) • http://www.globlmmcs.org is a service oriented (Grid) collaboration environment (audio-video conferencing) • http://www.crisisgrid.org is an OGC (Open Geospatial Consortium) Geographical Information System (GIS) compliant GIS and Sensor Grid (with POLIS center) • http://www.opengrids.org has WS-Context, Extended UDDI etc. • The work is still in progress but NaradaBrokering is quite mature • All software is open source and freely available
Project Goals • Establish Requirements from stakeholders • Research • Pharmaceutical Industry • Government • Consider educational implications • e-Science v Bio/Chem/Molecular Informatics • Consider other national and international projects to ensure we either lead or use best practice • Design a Grid architecture and staged implementation • Start pilot projects led by Chemistry/Chemical Informatics • Evaluate and iterate • Design and implement ?(Chem, Life Science, Science, Molecular) Informatics educational program that will attract students • Write winning center grant in 2006-7
Web Services Introduction • What are “Web Services”? • A distributed invocation system built on Grid computing • Independent of platform and programming language • Built on existing Web standards • A service oriented architecture with • Interfaces based on Internet protocols • Messages in XML (except for binary data attachments)
Web Services Introduction • A web-based architecture providing for interoperability among resources • Centralized service registry • Solves problems associated with finding, using, and combining online resources • Employ standard Internet protocols for: • Communication with resources • Automated discovery using centralized registries • Communicate with devices, people, and each other with the protocols and computer languages
Service Oriented Architecture (SOA) • Goal is to achieve loose coupling among interacting software agents • Define service: a unit of work done by a service provider to achieve desired end results for a service consumer • Both provider and consumer are roles played by software agents on behalf of their owners.
How does SOA work? • Two architectural constraints are employed • Small set of simple and ubiquitous interfaces to all participating software agents • Descriptive messages constrained by an extensible schema delivered through the interfaces
Web Services Architectures • Individual services are registered globally • Broken down into individual services with inputs and outputs specified • Services are published • Services are requested • Open registry, publishing, and requesting
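The publish/request cycle above can be sketched with a toy in-memory registry. This is not UDDI or any real registry API: `ServiceDescription`, `ServiceRegistry`, the service names and the `example.org` endpoints are all hypothetical, chosen only to show services being registered with declared inputs and outputs and then looked up by what they produce.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class ServiceDescription:
    """What a provider publishes: a name, where to reach it, and its I/O."""
    name: str
    endpoint: str
    inputs: Tuple[str, ...]
    outputs: Tuple[str, ...]


class ServiceRegistry:
    """Toy open registry: anyone may publish, anyone may request."""

    def __init__(self) -> None:
        self._services: Dict[str, ServiceDescription] = {}

    def publish(self, description: ServiceDescription) -> None:
        self._services[description.name] = description

    def request(self, needed_output: str) -> List[ServiceDescription]:
        """Return every published service that produces the needed output."""
        return [s for s in self._services.values() if needed_output in s.outputs]


registry = ServiceRegistry()
registry.publish(ServiceDescription(
    "blast", "http://example.org/blast", ("sequence",), ("alignment",)))
registry.publish(ServiceDescription(
    "weight", "http://example.org/weight", ("smiles",), ("molecular_weight",)))

matches = registry.request("alignment")
print([s.endpoint for s in matches])
```

A client would then bind to the returned endpoint; in a real deployment the description would be a WSDL document rather than a Python object.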
Service-Oriented Architecture • From Curcin et al. DDT, 2005, 10(12),867
Web Services for Science • Invisible Services, Semantic Web, and Grid • Easy-to-use tools for any scientist • High throughput, resource intensive computing done for low cost/resources • Shared community • Collaborations between labs and fields • Shared data • Shared tools
e-Science and the Grid 1 • e-Science: Major UK Program • global collaboration in key areas of science and the next generation of infrastructure that will enable it • reflects growing importance of international laboratories, satellites and sensors and their integrated analysis by distributed teams • total investment of some £200M over the five-year period from 2001 to 2006 • CyberInfrastructure: the analogous US initiative • Grid Technology: supports e-Science & Cyberinfrastructure
Basic Architectures: Servlets/CGI and Web Services (figure labels: Browser, GUI Client, Web Server, HTTP GET/POST, WSDL, SOAP, JDBC, DB or MPI Appl.)
Importance of Web Services • Building a true science community • Enabling interoperability between tools and the integration of data • Less time coding, more time for science • Change the way scientists work by achieving new levels of integration
When To Use Web Services? • Applications do not have severe restrictions on reliability and speed. • Two or more organizations need to cooperate. • One needs to write an application that uses another’s service. • Services can be upgraded independently of clients. • Services can be easily expressed with simple request/response semantics and simple state.
Web Services Benefits • Web services provide a clean separation between a capability and its user interface. • Increase in productivity • Increase in flexibility • Rapid return on investment • Integration across multiple applications
Web Services Advantages • Output in human- and computer-readable formats • I/O formats based on standard Internet protocols • Resources accessible server to server allow automated I/O • Integration based on specific services: you select services or data needed without downloading the entire data set
Web Services Advantages • Description protocols provide details of service provided and interface components • Semantic Web standards increase efficiency • Use a central registry and standardized description of services • Quality and status of the information is dynamically available
Web Services Drawbacks • Based on new technologies • Time and commitment required to learn • Standards still in a state of rapid flux • Issues with quality of data, (and for chemistry, quantity of open data), security, and privacy
Components of Web Services • Protocols • SOAP • WSDL • UDDI • XML as a basis for the protocols • Ontologies • OWL: Web Ontology Language • Semantic Web
Components of the Semantic Web for Chemistry • XML – eXtensible Markup Language • RDF – Resource Description Framework • RSS – Rich Site Summary • Dublin Core – allows metadata-based newsfeeds • OWL – for ontologies • BPEL4WS – for workflow and web services • Murray-Rust et al. Org. Biomol. Chem. 2004, 2, 3192-3203.
SOAP: Simple Object Access Protocol • Flexible protocol to communicate information between server and server or client and server using XML • Supports Remote Procedure Calls • Allows layers (security, authentication, transactions) over the basic SOAP elements
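To make the message format concrete, here is a minimal sketch that builds and reads back an RPC-style SOAP request using only Python's standard library. The envelope namespace is the SOAP 1.1 one; the application namespace `urn:example:chem`, the operation `getMolecularWeight` and its `smiles` parameter are hypothetical and not taken from any real service.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
APP_NS = "urn:example:chem"  # hypothetical application namespace

ET.register_namespace("soap", SOAP_ENV)


def build_request(operation: str, params: dict) -> str:
    """Wrap a remote procedure call in a SOAP envelope."""
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    # The Header is where layers such as security or transactions add entries.
    ET.SubElement(envelope, f"{{{SOAP_ENV}}}Header")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    call = ET.SubElement(body, f"{{{APP_NS}}}{operation}")
    for name, value in params.items():
        ET.SubElement(call, f"{{{APP_NS}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="unicode")


def read_request(xml_text: str):
    """Recover the operation name and parameters from a SOAP envelope."""
    envelope = ET.fromstring(xml_text)
    call = envelope.find(f"{{{SOAP_ENV}}}Body")[0]
    operation = call.tag.split("}", 1)[1]
    params = {child.tag.split("}", 1)[1]: child.text for child in call}
    return operation, params


message = build_request("getMolecularWeight", {"smiles": "CCO"})
print(message)
print(read_request(message))
```

In practice the envelope would be sent in the body of an HTTP POST to the service endpoint, and a toolkit would generate this code from the service's WSDL.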
WSDL: Web Services Description Language • Describes a service’s interface to clients • Services register themselves with Web Services • WSDL describes how to contact and interact with services • I/O, operations and messages to aid interaction with client
WSDL Overview • An XML-based Interface Definition Language. • You can define the APIs for all of your services in WSDL. • WSDL docs are broken into five major parts: • Data definitions (in XML) for custom types • Abstract message definitions (request, response) • Organization of messages into “ports” and “operations” (classes and methods). • Protocol bindings (to SOAP, for example) • Service point locations (URLs) • Some interesting features • A single WSDL document can describe several versions of an interface. • A single WSDL doc can describe several related services.
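The five parts listed above can be seen in a deliberately abbreviated WSDL 1.1 document, read here with Python's standard library. The service itself is invented for illustration: the `urn:example:chem` namespace, the `getMolecularWeight` operation and the `http://example.org/chem` endpoint are assumptions, and the binding omits the per-operation details a complete WSDL file would carry.

```python
import xml.etree.ElementTree as ET

NS = {
    "wsdl": "http://schemas.xmlsoap.org/wsdl/",       # WSDL 1.1
    "soap": "http://schemas.xmlsoap.org/wsdl/soap/",  # WSDL 1.1 SOAP binding
}

# Abbreviated, hypothetical WSDL showing the five major parts in order.
WSDL_DOC = """<definitions name="Chem" targetNamespace="urn:example:chem"
    xmlns="http://schemas.xmlsoap.org/wsdl/"
    xmlns:soap="http://schemas.xmlsoap.org/wsdl/soap/"
    xmlns:tns="urn:example:chem"
    xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <types/>
  <message name="WeightRequest"><part name="smiles" type="xsd:string"/></message>
  <message name="WeightResponse"><part name="weight" type="xsd:double"/></message>
  <portType name="ChemPort">
    <operation name="getMolecularWeight">
      <input message="tns:WeightRequest"/>
      <output message="tns:WeightResponse"/>
    </operation>
  </portType>
  <binding name="ChemSoapBinding" type="tns:ChemPort">
    <soap:binding style="rpc" transport="http://schemas.xmlsoap.org/soap/http"/>
  </binding>
  <service name="ChemService">
    <port name="ChemPort" binding="tns:ChemSoapBinding">
      <soap:address location="http://example.org/chem"/>
    </port>
  </service>
</definitions>"""


def summarize_wsdl(xml_text: str) -> dict:
    """List what each of the five WSDL parts declares."""
    root = ET.fromstring(xml_text)
    return {
        "types": len(root.findall("wsdl:types", NS)),
        "messages": [m.get("name") for m in root.findall("wsdl:message", NS)],
        "operations": [o.get("name")
                       for o in root.findall("wsdl:portType/wsdl:operation", NS)],
        "bindings": [b.get("name") for b in root.findall("wsdl:binding", NS)],
        "endpoints": [a.get("location")
                      for a in root.findall("wsdl:service/wsdl:port/soap:address", NS)],
    }


summary = summarize_wsdl(WSDL_DOC)
for part, value in summary.items():
    print(part, value)
```

Reading the document this way mirrors what a client toolkit does before generating stubs: it finds the operations, their messages, and the URL to call.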
UDDI: Universal Description, Discovery, and Integration • Provides ways for clients and services to interact with other services • Uses XML • Defines the means of access, e.g., • URL • E-Mail • Defines services hosted by an entity • Business-oriented tags • Uses SOAP for communicating