Building Science Gateways. Marlon Pierce Community Grids Laboratory Indiana University. Tutorial Overview. There’s More. Slides and Demo Site. Tutorial slides are available from http://www.collab-ogce.org/ogce/index.php/Tutorials
Building Science Gateways Marlon Pierce Community Grids Laboratory Indiana University
Slides and Demo Site • Tutorial slides are available from http://www.collab-ogce.org/ogce/index.php/Tutorials • We run a permanent demo portal at https://community.ucs.indiana.edu:8443/gridsphere/ • Also aliased as https://ogceportal.iu.teragrid.org:8443/gridsphere • Portal accounts train01-train30 have been created for the workshop. Password is the same as the account name. • Also train31-train49 from TG08 workshop. • We also have TeraGrid training accounts with names train01-train30 that can be used to retrieve TG proxy credentials. • These should be active all week. • You can also log into the TeraGrid User Portal with this account and the secret password.
Concept #1: Web Portal • Web container that aggregates content from multiple sources into a single display. • “Start Pages” • Typically consume RSS/Atom news feeds. • More powerful versions these days support Flickr, calendars, games, etc. • Gadgets, widgets • Examples: iGoogle, Netvibes, My Yahoo!
[Figure labels: Gadget; RSS Feeds]
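The aggregation idea behind Concept #1 can be sketched in a few lines. This is only an illustration under stated assumptions: the two inline RSS documents, their titles, and the example.org links are made up, and a real start page would fetch feeds over HTTP rather than read strings.

```python
import xml.etree.ElementTree as ET

# Two tiny inline RSS 2.0 documents stand in for remote feeds (hypothetical content).
FEED_A = """<rss version="2.0"><channel><title>Grid News</title>
<item><title>New cluster online</title><link>http://example.org/a1</link></item>
<item><title>Maintenance window</title><link>http://example.org/a2</link></item>
</channel></rss>"""

FEED_B = """<rss version="2.0"><channel><title>Portal Updates</title>
<item><title>Portlet release</title><link>http://example.org/b1</link></item>
</channel></rss>"""


def parse_feed(xml_text):
    """Return (channel_title, [(item_title, link), ...]) for one RSS 2.0 document."""
    channel = ET.fromstring(xml_text).find("channel")
    items = [(i.findtext("title"), i.findtext("link")) for i in channel.findall("item")]
    return channel.findtext("title"), items


def aggregate(feeds):
    """Combine several feeds into one page model, one 'gadget' per feed."""
    page = []
    for xml_text in feeds:
        source, items = parse_feed(xml_text)
        page.append({"gadget": source, "entries": items})
    return page


page = aggregate([FEED_A, FEED_B])
for gadget in page:
    print(gadget["gadget"], len(gadget["entries"]))
```

Each dictionary in `page` corresponds to one gadget on the start page; the container's only job is to lay them out side by side.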
Concept #2: Grid Computing • Grid computing software is designed to integrate large supercomputing facilities. • TeraGrid, Open Science Grid, EGEE, etc. • This is done via network services • Software providers in the US include Globus and Condor • Key Service Components (and example services) • Authentication and authorization framework (MyProxy) • Remote process access and control (GRAM, Condor) • Remote file, I/O access (GridFTP, SRB, RFT) • Additional Services • Information services, replica management, database federation, storage management, schedulers, etc. • Example Grid Software Stacks: CTSS and VDT • For TeraGrid and Open Science Grid, respectively • Being pushed by Cloud Computing (Amazon, Google, Microsoft, others)
Science Portals and Gateways • Science Gateways adapt Web portal technology to build user interfaces to the Grid. • Science portals resemble standard portals, but must also • Support access to computing and storage resources. • Allow users remote, direct access to these resources. • You often want to run applications and access data that you own directly. • Provide access to science applications and data sets. • And we must provide value added services as well as user interfaces.
Example Science Gateways • Many listed here: • http://www.teragrid.org/programs/sci_gateways/ • Cover many different scientific fields: • Atmospheric science, geophysics, computational chemistry, bioinformatics, etc • See also GCE08 workshop at SC08 and earlier proceedings • http://www.collab-ogce.org/gce08/index.php/Main_Page • GCE05-07 also linked.
TeraGrid Science Gateways Program Slides courtesy of Nancy Wilkins-Diehr TeraGrid Area Director for Science Gateways wilkinsn@sdsc.edu
Today, there are approximately 29 gateways using the TeraGrid
Does a gateway have to use TeraGrid to be a gateway? • No, but the TeraGrid does fund the development and support of these gateways • Using high end resources is more work and is not recommended unless it serves a demonstrated need • Gateways are an excellent way to extend the impact of high-end resources • Are they all funded by TeraGrid? • Can TeraGrid claim success for all gateways? • No, we don’t make the gateways you use, we make the gateways you use better • TeraGrid does fund a small number of developers to provide advanced support. • More later.
Why are gateways worth the effort? • Increasing range of expertise needed to tackle the most challenging scientific problems • How many details do you want each individual scientist to need to know? • PBS, RSL, Condor • Coupling multi-scale codes • Assembling data from multiple sources • Collaboration frameworks

Example Condor-G submit file:

# Full path to executable
executable=/users/wilkinsn/tutorial/bin/mcell
# Working directory, where Condor-G will write
# its output and error files on the local machine.
initialdir=/users/wilkinsn/tutorial/exercise_3
# To set the working directory of the remote job, we
# specify it in this globus RSL, which will be appended
# to the RSL that Condor-G generates
globusrsl=(directory='/users/wilkinsn/tutorial/exercise_3')
# Arguments to pass to executable.
arguments=nmj_recon.main.mdl
# Condor-G can stage the executable
transfer_executable=false
# Specify the globus resource to execute the job
globusscheduler=tg-login1.sdsc.teragrid.org/jobmanager-pbs
# Condor has multiple universes, but Condor-G always uses globus
universe=globus
# Files to receive stdout and stderr.
output=condor.out
error=condor.err
# Specify the number of copies of the job to submit to the condor queue.
queue 1

Example PBS script:

#!/bin/sh
#PBS -q dque
#PBS -l nodes=1:ppn=2
#PBS -l walltime=00:02:00
#PBS -o pbs.out
#PBS -e pbs.err
#PBS -V
cd /users/wilkinsn/tutorial/exercise_3
../bin/mcell nmj_recon.main.mdl

Example Globus RSL:

+( &(resourceManagerContact="tg-login1.sdsc.teragrid.org/jobmanager-pbs")
   (executable="/users/birnbaum/tutorial/bin/mcell")
   (arguments=nmj_recon.main.mdl)
   (count=128)
   (hostCount=10)
   (maxtime=2)
   (directory="/users/birnbaum/tutorial/exercise_3")
   (stdout="/users/birnbaum/tutorial/exercise_3/globus.out")
   (stderr="/users/birnbaum/tutorial/exercise_3/globus.err")
)
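One way to picture what a gateway hides: the scientist fills in a single job description, and the gateway renders it into whichever format the backend expects. The sketch below is a simplified illustration, not the code of any real gateway. The `Job` class and the three `to_*` functions are hypothetical names, and each renderer emits only a subset of the fields shown on the slide.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """Scheduler-neutral job description a gateway form might collect (hypothetical)."""
    executable: str
    arguments: str
    directory: str
    nodes: int = 1
    procs_per_node: int = 2
    walltime_minutes: int = 2


def to_pbs(job: Job) -> str:
    """Render the job as a minimal PBS batch script."""
    hours, minutes = divmod(job.walltime_minutes, 60)
    return "\n".join([
        "#!/bin/sh",
        f"#PBS -l nodes={job.nodes}:ppn={job.procs_per_node}",
        f"#PBS -l walltime={hours:02d}:{minutes:02d}:00",
        f"cd {job.directory}",
        f"{job.executable} {job.arguments}",
    ])


def to_rsl(job: Job) -> str:
    """Render the job as a simplified single-request Globus RSL string."""
    return (
        f'&(executable="{job.executable}")'
        f"(arguments={job.arguments})"
        f'(directory="{job.directory}")'
        f"(maxtime={job.walltime_minutes})"
    )


def to_condor_g(job: Job, scheduler: str) -> str:
    """Render the job as a minimal Condor-G submit file for the given resource."""
    return "\n".join([
        f"executable={job.executable}",
        f"arguments={job.arguments}",
        "transfer_executable=false",
        f"globusscheduler={scheduler}",
        "universe=globus",
        f"globusrsl=(directory='{job.directory}')",
        "queue 1",
    ])


job = Job(
    executable="/users/wilkinsn/tutorial/bin/mcell",
    arguments="nmj_recon.main.mdl",
    directory="/users/wilkinsn/tutorial/exercise_3",
)
print(to_pbs(job))
print(to_rsl(job))
print(to_condor_g(job, "tg-login1.sdsc.teragrid.org/jobmanager-pbs"))
```

The user supplies three strings; PBS directives, RSL syntax, and Condor-G keywords stay inside the gateway.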
Not just ease of use: What can scientists do that they couldn’t do previously? • LEAD - access to radar data • NVO – access to sky surveys • OOI – access to sensor data • PolarGrid – access to polar ice sheet data • SIDGrid – analysis tools • GridChem – developing multiscale coupling • How would this have been done before gateways?
Gateways Greatly Expand Access • Almost anyone can investigate scientific questions using high end resources • Not just those in the research groups of those who request allocations • Gateways allow anyone with a web browser to explore • Opportunities can be uncovered via Google • Nancy’s 11-year-old son discovered nanoHUB.org himself while his class was studying buckyballs • Fosters new ideas, cross-disciplinary approaches • Encourages students to experiment • But used in production too • Significant number of papers resulting from gateways including GridChem, nanoHUB • Scientists can focus on challenging science problems rather than challenging infrastructure problems
TeraGrid Pathways Activities • Program funding to involve MSI communities • 2 Gateway components • Adapt gateways for educational use by underrepresented communities • GEON – SDSC, Navajo Tech • Teach participants from underrepresented communities how to build gateways • PolarGrid – IU, ECSU
Navajo Technical College and gateways • Incorporating the use of gateways in their curricula • GEON, GISolve areas of initial interest
PolarGrid • Cyberinfrastructure Center for Polar Science (CICPS) • Experts in polar science, remote sensing and cyberinfrastructure • Indiana, ECSU, CReSIS • Satellite observations show disintegration of ice shelves in West Antarctica and speed-up of several glaciers in southern Greenland • Most existing ice sheet models, including those used by IPCC cannot explain the rapid changes http://www.polargrid.org/polargrid/images/4/42/C0050-polargrid-big.m4v Source: Geoffrey Fox
Source: Geoffrey Fox • Components of PolarGrid • Expedition grid consisting of ruggedized laptops in a field grid linked to a low power multi-core base camp cluster • Prototype and two production expedition grids feed into a 17 teraflop "lower 48" system at Indiana University and Elizabeth City State University (ECSU), split between research, education and training. • Gives ECSU a top-ranked 5 teraflop MSI high performance computing system • Access to expensive data • High-end resources for analysis • MSI student involvement
Recent Gateways using TeraGrid Significantly • SCEC • SIDGrid • CIG
SCEC using gateway to produce hazard map • PSHA hazard map for California using newly released Earthquake Rupture Forecast (UCERF2.0) calculated using SCEC Science Gateway • Warm colors indicate regions with a high probability of experiencing strong ground motion in the next 50 years. • High resolution map, significant CPU use
Social Informatics Data Grid • Heavy use of “multimodal” data. • Subject might be viewing a video, while a researcher collects heart rate and eye movement data. • Events must be synchronized for analysis, large datasets result • Extensive analysis capabilities are not something that each researcher should have to create for themselves. http://www.ci.uchicago.edu/research/files/sidgrid.mov
Social scientists have traditionally worked in isolated labs without the capability to share data or insights with others. • SIDGrid enables a number of capabilities. • Data that is expensive to collect can now be shared with others, increasing the potential for scientific impact. • Geographically distant researchers can collaborate on the analysis of the same data set. • Complex analysis tools and workflows are now available for all to use, rather than having each lab duplicate efforts. • All researchers now have access to the highest quality computational resources • SIDGrid uses TeraGrid resources for computationally-intensive tasks such as media transcoding, pitch analysis of audio tracks, and fMRI image analysis • SIDGrid is unique among social science data archive projects • Focused on streaming data which change over time • Provides the ability to investigate multiple datasets, collected at different time scales, simultaneously • Active users of the SIDGrid system include a human neuroscience group and linguistic research groups from the University of Chicago and the University of Nottingham, UK
CIG • 40 institutional members • 9 foreign affiliates • Researchers request synthetic seismograms for any given earthquake • Allows scientists to understand the ground motion associated with any given earthquake • Requested and received advanced support from TeraGrid
Talks at E-Science • See the PSE Workshop: http://escience2008.iu.edu/workshops/innovative/index.shtml • Friday, 10:00 am-4:30 pm • Nancy Wilkins-Diehr will have more to say about some of these gateways. • See also Rich Wolski’s keynote on cloud computing. Next generation gateways will (need to) support cloud computing and virtual machine-based backends. • Purdue’s nanoHUB and HUBzero software have done this for some time.
Getting Started Building a Gateway Should you? And how can you get help?
When might a gateway be appropriate? • Researchers using defined sets of tools in different ways • Same executables, different input • GridChem, CHARMM • Creating multi-scale or complex workflows • Datasets • Common data formats • National Virtual Observatory • Earth System Grid • Some groups have invested significant efforts here • caBIG, extensive discussions to develop common terminology and formats • BIRN, extensive data sharing agreements • Difficult to access data/advanced workflows • Sensor/radar input • LEAD, GEON
Advanced support for OCI resources, including gateway integration • Same peer review process used to request resources • 30,000 CPUs • + 6 months of Nancy • Reviews based on appropriate use of resources, science is not reviewed if already funded • Petascale • Multisite workflows • Gateways • Domain expertise • Or someone really talented
Support is Very Targeted • Start with well-defined objectives • Focus on efficient or novel use of OCI resources • Access to minimum 0.25 FTE for months to a year • Enough investment to really understand and help solve complex problems • Must have commitment from PIs • Want to make sure work is incorporated into production codes and gateways • Good candidates for targeted support include: • Large, high impact projects • Ability to influence new communities • Lessons learned move into training and documentation
My 2002 “octopus” SOA diagram, from the archives. [Diagram: a Browser Interface connects over HTTP(S) to Portlets + Client Stubs, which call WSDL-described services over SOAP/HTTP: a DB Service (JDBC to DBs), Job Submission/Monitoring and File Services, and a Visualization Service, running on the operating and queuing systems of Host 1, Host 2, and Host 3.]
Terminology • Portlet: this is a standard Java component that generates HTML and can also act as a client to a remote service. • Lives in a portal container. • I will also use this term generically. • Web Service: a remotely invocable function on the Internet. • SOAP: the XML message envelope for carrying commands over HTTP. • WSDL: describes the service’s API in XML. • REST: A variation of this approach. • Lots more info: http://grids.ucs.indiana.edu/ptliupages/presentations/I590WebService.ppt
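To make the SOAP and REST terms concrete, here is a minimal sketch that builds the same request both ways. The `getJobStatus` operation, the `jobId` parameter, and the example.org namespace and URL are assumptions for illustration; only the SOAP 1.1 envelope namespace is standard.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SERVICE_NS = "http://example.org/jobs"                 # hypothetical service namespace


def soap_request(operation, params):
    """Wrap an operation call in a SOAP envelope: Envelope > Body > operation > params."""
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}{operation}")
    for name, value in params.items():
        ET.SubElement(call, name).text = str(value)
    return ET.tostring(envelope, encoding="unicode")


def rest_request(base_url, resource, params):
    """Express the same call as a resource URL with query parameters."""
    return f"{base_url}/{resource}?{urlencode(params)}"


soap_xml = soap_request("getJobStatus", {"jobId": 42})
rest_url = rest_request("http://example.org/api", "jobs/status", {"jobId": 42})
print(soap_xml)
print(rest_url)
```

The SOAP version would be POSTed as an XML document to a single service endpoint, while the REST version puts the operation and its arguments in the URL of an ordinary HTTP GET.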
But Why? • Three-tiered Service Oriented Architecture is the network equivalent of the famous Model-View-Controller design pattern. • View: the user interface components. • Controller: Web service middleware • Model: the backend resources. • Independence of tiers gives flexibility • Services can be reused with alternative user interfaces • Workflow composers like Taverna, Xbaya, Kepler • User interfaces can work with different service implementations. • Drawback: reliability and robustness are issues.
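The tier independence claimed above can be sketched as follows. Everything here is hypothetical: `JobService`, the two implementations, and the two views are stand-ins, and the "services" return strings instead of making network calls.

```python
from typing import Protocol


class JobService(Protocol):
    """Middle-tier contract (the 'controller'); hypothetical, for illustration only."""

    def submit(self, command: str) -> str: ...


class LocalService:
    """One implementation: pretends to run the command on a local backend."""

    def submit(self, command: str) -> str:
        return f"local:{command}"


class RemoteService:
    """Another implementation: pretends to forward the command to a named host."""

    def __init__(self, host: str) -> None:
        self.host = host

    def submit(self, command: str) -> str:
        return f"{self.host}:{command}"


def html_view(service: JobService, command: str) -> str:
    """A portlet-style view that renders the result as HTML."""
    return f"<p>Submitted as {service.submit(command)}</p>"


def text_view(service: JobService, command: str) -> str:
    """An alternative view (for example a command-line client) over the same service."""
    return f"job id = {service.submit(command)}"


print(html_view(LocalService(), "mcell"))
print(text_view(RemoteService("host1"), "mcell"))
```

Either view works with either service because both depend only on the `submit` contract, which is the role WSDL or a REST interface plays between real tiers.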
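Two Approaches to the Middle Tier [Diagram: Fat Client: the Portal Component embeds the Grid Client and speaks the Grid Protocol (SOAP) directly to the Grid Service in front of the Backend Resource. Thin Client: the Portal Component speaks HTTP + SOAP to a Web Service, which hosts the Grid Client and speaks the Grid Protocol (SOAP) to the Grid Service in front of the Backend Resource.]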
Managing Scientific Workflows A Preview for Suresh’s Talks and Demos
Scientific Workflows • Portal interfaces encode scientific use cases. • If you have a rich set of services, it is a lot of work to make portlets for all possible use cases. • And power users will always want something more. • Example: our CICC project has dozens of chemical informatics Web services. • http://www.chembiogrid.org.wiki • Workflow composers can simplify this. • Allow users to encode and execute their own use cases.
Web Services and Workflows • Perform a similarity search on the NIH DTP Human Tumor data. • Filter the results based on Pharmacokinetic properties (FILTER) • Convert to 3D (OMEGA) • Docking into a pre-defined protein (FRED) • Visualize (JMOL). Taverna workflow connects remote services.
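The chaining that a workflow composer performs can be sketched as a list of steps, each consuming the previous step's output. The functions below are placeholders named after the steps listed above; their bodies are invented stand-ins, not the behavior of the real FILTER, OMEGA, or FRED services, and a real composer such as Taverna would call remote services instead.

```python
from functools import reduce

# Each function stands in for one remote Web service call (placeholder logic only).


def similarity_search(query):
    """Pretend search: return four made-up hit identifiers for the query."""
    return [f"{query}-hit{i}" for i in range(1, 5)]


def filter_hits(hits):
    """Pretend filter: keep every other hit."""
    return hits[::2]


def to_3d(hits):
    """Pretend 3D conversion: tag each hit as a 3D structure."""
    return [h + ":3d" for h in hits]


def dock(structures):
    """Pretend docking: pair each structure with a dummy score."""
    return [(s, len(s)) for s in structures]


def run_workflow(steps, initial):
    """Feed the output of each step into the next, in order."""
    return reduce(lambda data, step: step(data), steps, initial)


result = run_workflow([similarity_search, filter_hits, to_3d, dock], "aspirin")
print(result)
```

Because the workflow is just data (an ordered list of steps), a user can reorder, drop, or add steps without anyone writing a new portlet.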
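Updating the Octopus [Diagram: a Browser Interface connects over HTTP(S) to Social Gadgets + AJAX, which call services over RSS and JSON/HTTP. The service interfaces are labeled REST, with one still labeled WSDL: a DB Service (JDBC to DBs), Job Submission/Monitoring and File Services, and a Visualization Service, running on the operating and queuing systems of Host 1, Host 2, and Host 3.]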
Microformats, KML, and GeoRSS feeds used to deliver SAR data to multiple clients.
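As a rough illustration of the GeoRSS part, the sketch below builds one RSS item carrying a GeoRSS-Simple point. The title, link, and coordinates are made up; only the GeoRSS namespace and the "latitude longitude" point format come from the GeoRSS-Simple convention.

```python
import xml.etree.ElementTree as ET

GEORSS_NS = "http://www.georss.org/georss"
ET.register_namespace("georss", GEORSS_NS)  # serialize with the usual "georss:" prefix


def georss_item(title, link, lat, lon):
    """Build an RSS <item> with a GeoRSS-Simple point ("lat lon", space separated)."""
    item = ET.Element("item")
    ET.SubElement(item, "title").text = title
    ET.SubElement(item, "link").text = link
    ET.SubElement(item, f"{{{GEORSS_NS}}}point").text = f"{lat} {lon}"
    return item


item = georss_item("SAR scene (hypothetical)", "http://example.org/sar/1", 34.05, -118.25)
xml_text = ET.tostring(item, encoding="unicode")
print(xml_text)
```

Any client that understands RSS can list the item, and map-aware clients can additionally place it using the point element.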
More Information • Contact me: mpierce@cs.indiana.edu • See what I’m up to: http://communitygrids.blogspot.com/ • OGCE software: http://collab-ogce.org/ • Lots of people worked on all of these.
Tremendous Opportunities Using the Largest Shared Resources - Challenges too! • What’s different when the resource doesn’t belong just to me? • Resource discovery • Accounting • Security • Proposal-based requests for resources (peer-reviewed access) • Code scaling and performance numbers • Justification of resources • Gateway citations • Tremendous benefits at the high end, but even more work for the developers • Potential impact on science is huge • Small number of developers can impact thousands of scientists • But need a way to train and fund those developers and provide them with appropriate tools
Gateways can further investments in other projects • Increase access • To instruments • Increase capabilities • To analyze data • Improve workforce development • For underserved populations • Increase outreach • Increase public awareness • Public sees value in investments in large facilities