Using the TeraGrid
Daniel S. Katz (d.katz@ieee.org)
Director of Science, TeraGrid GIG
Senior Fellow, Computation Institute, University of Chicago & Argonne National Laboratory
Affiliate Faculty, Center for Computation & Technology, LSU
Adjunct Associate Professor, Electrical and Computer Engineering, LSU
What is the TeraGrid?
• World's largest distributed cyberinfrastructure for open scientific research, supported by the US NSF
• Integrates high-performance computers (>2 PF HPC & >27,000 HTC CPUs), data resources (>2 PB disk, >60 PB tape, data collections), visualization, experimental facilities (VMs, GPUs, FPGAs), and a network across 11 Resource Provider sites
• Allocated to US researchers and their collaborators through a national peer-review process
• DEEP: provide powerful computational resources to enable research that can't otherwise be accomplished
• WIDE: grow the community of computational science and make the resources easily accessible
• OPEN: connect with new resources and institutions
• Integration: single {portal, sign-on, help desk, allocations process, advanced user support, EOT, campus champions}
http://www.teragrid.org/
11 Resource Providers, One Facility
(map of Resource Provider (RP) sites, software integration partners, and network hubs: UW, UC/ANL, PSC, NCAR, PU, NCSA, Caltech, UNC/RENCI, IU, ORNL, USC/ISI, NICS, SDSC, LONI, TACC; Grid Infrastructure Group at UChicago)
Governance
• 11 Resource Providers (RPs) funded under agreements with NSF
  • Different start and end dates, different goals, different agreements with NSF, different funding models
  • The only constant is change...
• 1 coordinating body: the Grid Infrastructure Group (GIG)
  • University of Chicago/Argonne National Laboratory
  • Subcontracts to all RPs and four other universities
  • 7-8 Area Directors
  • Working groups with members from many RPs
  • Will be replaced in mid-2011 by the XD awardee
• TeraGrid Forum with Chair
TeraGrid resources today include:
• Tightly coupled distributed-memory systems, with 2 systems in the top 10 at top500.org
  • Kraken (NICS): Cray XT5, 99,072 cores, 1.03 Pflops
  • Ranger (TACC): Sun Constellation, 62,976 cores, 579 Tflops, 123 TB RAM
• Shared-memory systems
  • Cobalt (NCSA): Altix, 8 Tflops, 3 TB shared memory
  • Pople (PSC): Altix, 5 Tflops, 1.5 TB shared memory
• Clusters with InfiniBand
  • Abe (NCSA): 90 Tflops
  • Lonestar (TACC): 61 Tflops
  • QueenBee (LONI): 51 Tflops
• Condor pool (loosely coupled)
  • Purdue: up to 22,000 CPUs
• Gateway hosting
  • Quarry (IU): virtual machine support
• Visualization resources
  • TeraDRE (Purdue): 48 nodes with NVIDIA GPUs
  • Spur (TACC): 32 NVIDIA GPUs
• Storage resources
  • Wide-area filesystems (Lustre, GPFS), archival storage, data replication service
• Advanced user support
  • Computational scientists available to help (1-12 work-months)
But change is constant - new systems:
• Data analysis and visualization systems
  • Longhorn (TACC): CPU and GPU
  • Nautilus (NICS): 4 TB shared memory
• Data-intensive computing
  • Dash (SDSC): nodes with flash memory -> Gordon
• FutureGrid: experimental computing grid and cloud test-bed to tackle research challenges in computer science
• Keeneland: experimental high-performance computing system with NVIDIA Tesla accelerators
• New general production systems: Trestles (SDSC), new Lonestar (TACC), Athena (NICS), Blacklight (PSC)
• And some old systems are being retired...
• And adding InCommon, so users can log in with campus credentials
Who can use TeraGrid resources?
• Using TeraGrid is based on allocations
  • An allocation is a set of available resources and a quantity of each
  • Including advanced user support (in FTE-months)
• Requests for allocations are peer-reviewed
• The allocation PI must be from a US institution (faculty, staff, postdocs, or students who are NSF fellows)
• The allocation PI determines who can use the allocation (accounts)
• TeraGrid allocations are free to US researchers and their collaborators
  • TeraGrid is funded by the National Science Foundation
Allocations Definitions
• PI: Principal Investigator
• POPS: Partnerships Online Proposal System
• TRAC: TeraGrid Resource Allocations Committee
• SU: Service Unit = 1 core-hour (see the worked example below)
• Allocations can be for a PI, a small collaboration, a community consortium, or a community of users not yet known
Allocation Request Types (3 types of TeraGrid projects):
• Startup: development/testing/porting/benchmarking
• Education: classroom, training
• Research: program (usually funded)
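To make the SU unit concrete, here is a minimal Python sketch of the arithmetic, assuming SU = 1 core-hour as defined above; the job sizes are hypothetical, not taken from the slides:

# SU = 1 core-hour, so a job's charge is cores x wallclock hours.
def su_charge(cores, wall_hours):
    return cores * wall_hours

print(su_charge(256, 12))              # a 256-core, 12-hour job charges 3072 SUs
print(200_000 // su_charge(256, 12))   # a 200k-SU award covers 65 such runs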
Allocations Process
• One allocation per PI (generally)
• 1-year duration, or multi-year
  • Unused SUs are forfeited at the end of an award period
  • An annual report is required for multi-year awards
• Add users to a grant via the TeraGrid User Portal
• Research requests are reviewed quarterly (advance submission -> review -> award), with awards covering a 1-year allocation period
• For Startup/Education requests there is no fixed cycle; review/allocation should take ~2 weeks
Allocations Details
• Proposal outline
  • Research objectives
  • Computational methodology (applications/codes)
  • Application efficiencies
  • Computational research plan
  • Justification for SUs (and TB) requested
  • Additional considerations
• Review criteria
  • If the science is not peer-reviewed, does it make sense?
  • Are the computational methods and applications appropriate to the science objectives?
  • Do you have a plan to use those methods and applications in a sensible fashion?
  • Are your codes/applications efficient on the resources selected?
• For Startup/Education requests, just an abstract (~1 paragraph) is required
Who Uses TeraGrid (usage charts for 2009 and 2008)
How TeraGrid Is Used (chart, 2006 data)
How One Uses TeraGrid
(architecture diagram: users reach compute, visualization, and data services at the RPs (RP 1, RP 2, RP 3, ...) through POPS (for now), the User Portal, Science Gateways, or the command line, all tied together by TeraGrid infrastructure: accounting, network, authorization, ...)
User Portal: http://portal.teragrid.org/
Science Gateways
• A natural extension of the Internet & Web 2.0
  • Mosaic (the original browser) is 18 years old
  • The implications of the WWW for science are still evolving
• The idea resonates with scientists
  • Researchers can imagine scientific capabilities provided through a familiar interface
  • Mostly web portals, or web or client-server programs
• Designed by communities; provide interfaces understood by those communities
  • Also provide access to greater capabilities (back end), without users needing to understand the details of those capabilities
  • Scientists know they can undertake more complex analyses, and that's all they want to focus on
• TeraGrid provides tools to help developers
  • Seamless access doesn't come for free; it hinges on very capable developers
Nancy Wilkins-Diehr
A vt100 terminal in the 1980s, and a login window on Ranger today
Nancy Wilkins-Diehr
Why are gateways worth the effort?
• An increasing range of expertise is needed to tackle the most challenging scientific problems
• How many details do you want each individual scientist to need to know?
  • PBS, RSL, Condor
  • Coupling multi-scale codes
  • Assembling data from multiple sources
  • Collaboration frameworks
The same MCell job, three ways (see the sketch after these examples):

A Condor-G submit file:
# Full path to executable
executable=/users/wilkinsn/tutorial/bin/mcell
# Working directory, where Condor-G will write
# its output and error files on the local machine.
initialdir=/users/wilkinsn/tutorial/exercise_3
# To set the working directory of the remote job, we
# specify it in this globus RSL, which will be appended
# to the RSL that Condor-G generates
globusrsl=(directory='/users/wilkinsn/tutorial/exercise_3')
# Arguments to pass to executable.
arguments=nmj_recon.main.mdl
# Condor-G can stage the executable
transfer_executable=false
# Specify the globus resource to execute the job
globusscheduler=tg-login1.sdsc.teragrid.org/jobmanager-pbs
# Condor has multiple universes, but Condor-G always uses globus
universe=globus
# Files to receive stdout and stderr.
output=condor.out
error=condor.err
# Specify the number of copies of the job to submit to the condor queue.
queue 1

A PBS batch script:
#!/bin/sh
#PBS -q dque
#PBS -l nodes=1:ppn=2
#PBS -l walltime=00:02:00
#PBS -o pbs.out
#PBS -e pbs.err
#PBS -V
cd /users/wilkinsn/tutorial/exercise_3
../bin/mcell nmj_recon.main.mdl

A Globus RSL description:
+( &(resourceManagerContact="tg-login1.sdsc.teragrid.org/jobmanager-pbs")
   (executable="/users/birnbaum/tutorial/bin/mcell")
   (arguments=nmj_recon.main.mdl)
   (count=128)
   (hostCount=10)
   (maxtime=2)
   (directory="/users/birnbaum/tutorial/exercise_3")
   (stdout="/users/birnbaum/tutorial/exercise_3/globus.out")
   (stderr="/users/birnbaum/tutorial/exercise_3/globus.err")
)
Nancy Wilkins-Diehr
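To make the contrast concrete, here is a short Python sketch of what a gateway might do behind the scenes: turn a few science-level form fields into the PBS script shown above. This is a hypothetical illustration, not actual TeraGrid gateway code, and make_pbs_script is an invented name:

# Hypothetical: a gateway maps a simple web form onto a batch script,
# so the scientist never has to learn queues, RSL, or schedulers.
def make_pbs_script(workdir, executable, input_file,
                    nodes=1, ppn=2, walltime="00:02:00", queue="dque"):
    return f"""#!/bin/sh
#PBS -q {queue}
#PBS -l nodes={nodes}:ppn={ppn}
#PBS -l walltime={walltime}
#PBS -o pbs.out
#PBS -e pbs.err
#PBS -V
cd {workdir}
{executable} {input_file}
"""

# The portal form supplies only the science-level choices:
print(make_pbs_script("/users/wilkinsn/tutorial/exercise_3",
                      "../bin/mcell", "nmj_recon.main.mdl"))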
Gateways democratize access to high-end resources
• Almost anyone can investigate scientific questions using high-end resources
  • Not just those in the research groups of those who request allocations
• Gateways allow anyone with a web browser to explore
  • Opportunities can be uncovered via Google
  • Foster new ideas and cross-disciplinary approaches
  • Encourage students to experiment
• But they are used in production too
  • A significant number of papers have resulted from gateways, including GridChem and nanoHUB
• Scientists can focus on challenging science problems rather than challenging infrastructure problems
Nancy Wilkins-Diehr
Today, there are approximately 35 gateways using the TeraGrid.
This just in: 35% of TeraGrid users charging jobs (June-Sept 2010) were gateway users!
Nancy Wilkins-Diehr
3 steps to connect a gateway to TeraGrid
• Request an allocation
  • Only a 1-paragraph abstract is required for up to 200k CPU-hours
• Register your gateway
  • Visibility on the public TeraGrid page
• Request a community account
  • Run jobs for others via your portal
Staff support is available! www.teragrid.org/gateways
Nancy Wilkins-Diehr
TeraGrid -> XD Future
• Some current RP agreements end in March 2011
  • Systems are always changing, now and in the future
• TeraGrid XD (eXtreme Digital) starts in ~April 2011
  • An era of potential interoperation with OSG and others
  • New types of science applications?
• The current TG GIG continues through ~July 2011
  • Allows four months of overlap in coordination
  • Probable overlap between GIG and XD members
• Also in the future, Blue Waters (Track 1) enters production in 2011
  • Not part of TG/XD; separate NSF allocations process (PRAC)
Grid Enabled Neurosurgical Imaging Using Simulation (GENIUS)
• Model large-scale patient-specific cerebral blood flow on clinically relevant time scales
• Provide simulation support within the operating theatre for neuroradiologists
• Provide new information to surgeons for patient management and therapy:
  • Diagnosis and risk assessment
  • Predictive simulation in therapy
  • Patient-specific information to help plan embolisation of arterio-venous malformations, coiling of aneurysms, etc.
Clinical workflow:
• Book computing resources in advance, or use preemption
• Shift imaging data around quickly over high-bandwidth, low-latency dedicated links
• Interactive simulations and real-time visualization for immediate feedback
Peter Coveney, University College London
OLSGW Gadgets
• OLSGW integrates bioinformatics applications
  • BLAST, InterProScan, CLUSTALW, MUSCLE, PSIPRED, ACCPRO, VSL2
• A 454 pyrosequencing service is under development
• Four OLSGW gadgets have been published in the iGoogle gadget directory; search for "TeraGrid Life Science"
Wenjun Wu, Thomas Uram, Michael Papka, ANL
TG App: SCEC-PSHA
• Part of SCEC (Tom Jordan, USC)
• Using the large-scale simulation data, estimate probabilistic seismic hazard (PSHA) curves for sites in southern California (the probability that ground motion will exceed some threshold over a given time period)
  • Used by hospitals, power plants, schools, etc. as part of their risk assessment
• For each location, need a CyberShake run followed by roughly 840,000 parallel short jobs (420,000 rupture forecasts, 420,000 extractions of peak ground motion)
  • Parallelize across locations, not individual workflows (see the sketch below)
• Completed 40 locations to date, targeting 200 in 2009 and 2000 in 2010
• Managing these requires effective grid workflow tools for job submission, data management, and error recovery, using Pegasus (ISI) and DAGMan (Wisconsin)
Phil Maechling, USC
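The fan-out is easy to picture in a few lines of Python; this is an illustrative sketch only (the real workflows are built with Pegasus and DAGMan, and the site codes below are invented):

# Each rupture variation needs a forecast job and an extraction job.
def site_jobs(site):
    for rupture in range(420_000):
        yield ("forecast", site, rupture)
    for rupture in range(420_000):
        yield ("extract", site, rupture)

# Parallelism comes from running many sites' workflows at once,
# not from parallelizing inside a single site's workflow.
sites = ["site_A", "site_B", "site_C"]
print(sum(1 for s in sites for _ in site_jobs(s)))   # 2,520,000 short jobs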
Multiscale Simulation of the Arterial Tree
• Need to combine multi-scale models: 1D (arteries), 3D Navier-Stokes (organs, arterial junctions, etc.), dissipative particle dynamics (capillaries, venules, arterioles, blood cells, etc.), molecular dynamics (blood cells, platelets, molecular adhesion, etc.)
• Arterioles/venules are ~50 microns across; platelet diameter is 2-4 µm
• Normal platelet concentration in blood is 300,000/mm³
• Platelet functions: activation, adhesion to injured walls and to other platelets
NIH/NSF-IMAG project: George Em Karniadakis, Brown
Expressed Sequence Tag (EST) Pipeline
• ESTs are a collection of random cDNA sequences, sequenced from a cDNA library or sequencing devices
  • Typical inputs are O(million) sequences
  • Newer 454 devices produce higher volumes and are relatively easy to obtain and operate
  • Stored in FASTA format
• ESTs are clustered and assembled to form contigs (see the sketch below)
• Contigs are then used to identify potential unknown genes, by BLASTing against a known-protein database
• Goal: use TeraGrid for backend computing, with existing software and a gateway frontend
• Initial results: a run that took 5 days on a local cluster was done in 2 days; more optimization is underway
A. Kulshrestha, S. L. Pallickara, K. N. Muthuram, C. Kong, Q. Dong, M. Pierce, H. Tang, IU
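A minimal Python sketch of the three pipeline stages may help; the stage bodies are deliberately toy placeholders (the production pipeline calls real clustering, assembly, and BLAST codes):

def cluster(ests):
    # group ESTs sharing an 8-base prefix (placeholder for real clustering)
    groups = {}
    for seq in ests:
        groups.setdefault(seq[:8], []).append(seq)
    return groups.values()

def assemble(group):
    # merge a cluster into one contig (placeholder: keep the longest read)
    return max(group, key=len)

def identify(contig):
    # placeholder for a BLAST search against a known-protein database
    return "candidate gene hits for " + contig

ests = ["ATGCCGTAAGGT", "ATGCCGTAAG", "GGCATTACGTTA"]   # toy FASTA records
for contig in [assemble(g) for g in cluster(ests)]:
    print(identify(contig))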
Multiscale Computer Simulation of the Immature HIV-1 Virion
• An iterative modeling approach combining experimental imaging (cryo-electron tomography), coarse-grained (CG) simulation, and atomic-level molecular dynamics (MD)
• (cycle diagram: experimental structures -> CG model development -> CG simulation -> CG model refinement -> atomic-level simulation, with key CG interactions refined by new CG interactions from MD)
Wright, Schooler, Ding, Kieffer, Fillmore, Sundquist, Jenson, EMBO, 26, 2218, 2007
G. A. Voth, U. of Chicago
App: GridChem
• Different licensed applications with different queues
• Will be scheduled for workflows
Joohyun Kim, LSU
CIPRES Portal: A New Science Gateway for Systematics
• Systematics: the study of the diversification of life and the relationships among living things through time
• CIPRES: a flexible web application that can be sustained by the community at minimal cost, even beyond the funding period of the project
• Tools include parallel versions of MrBayes, RAxML, GARLI
• User requirements include:
  • Access to most or all native command-line options
  • Adding new tools quickly
  • Personal user space for storing results
  • Using TeraGrid resources to quickly provide results
• Cited in at least 35 publications, including Nature, PNAS, Cell
  • Examples: new family tree for Arthropoda, genome sequence of a transitional eukaryote, co-evolution of beetles and flowering plants
• Used routinely in at least 5 undergraduate classes
• Usage: 77% US (incl. 17 EPSCoR states), 23% from 33 other countries
Mark Miller, SDSC
Patient-Specific HIV Drug Therapy
• HIV-1 protease is a common target for HIV drug therapy
  • An enzyme of HIV responsible for protein maturation
  • A target for anti-retroviral inhibitors
  • An example of structure-assisted drug design
  • 9 FDA-approved inhibitors of HIV-1 protease
• So what's the problem?
  • Emergence of drug-resistant mutations in the protease renders drugs ineffective
  • Drug-resistant mutants have emerged for all FDA-approved inhibitors
  • Too many mutations to be interpreted by a clinician
• Solution: build a Binding Affinity Calculator (BAC)
  • Provide tools that allow simulations to be used in a clinical context, including a lightweight client
  • The user only needs to specify the enzyme, the mutations relative to wildtype, and the drug; other options can be specified but begin as defaults (see the sketch below)
  • Requires a large number of simulations to be constructed and run automatically (across distributed HPC resources), to investigate generalisation
  • Automation is critical for clinical use; a turn-around time of around a week is required
  • Trade-off between accuracy and time-to-solution
• Initial results: ensemble MD calculations for lopinavir vs wildtype & five mutants appear promising; excellent relative ranking in binding free energies
Peter Coveney, University College London
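A hypothetical Python sketch of the BAC interface idea follows; build_ensemble and every name in it are invented for illustration, and the mutation list is only an example:

def build_ensemble(enzyme, drug, mutations, replicas=5):
    # expand a clinician's minimal input into an ensemble of MD job specs
    jobs = []
    for variant in ["wildtype"] + mutations:
        for r in range(replicas):
            jobs.append({
                "enzyme": enzyme, "drug": drug,
                "variant": variant, "replica": r,
                "forcefield": "default",   # other options default unless set
            })
    return jobs

# wildtype plus five example mutants, five replicas each = 30 MD runs
jobs = build_ensemble("HIV-1 protease", "lopinavir",
                      ["V82A", "I84V", "L90M", "M46I", "G48V"])
print(len(jobs))   # 30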
Linked Environments for Atmospheric Discovery (LEAD)
• Providing the tools needed to make accurate predictions of tornados and hurricanes:
  • Meteorological data
  • Forecast models
  • Analysis and visualization tools
  • Data exploration and grid workflow
Dennis Gannon & Beth Plale, Indiana
Scripting Protein Structure Prediction
A Swift script sweeps over proteins, temperatures, and temperature increments:

int nSim = 1000;
int maxRounds = 3;
Protein pSet[ ] <ext; exec="Protein.map">;
float startTemp[ ] = [ 100.0, 200.0 ];
float delT[ ] = [ 1.0, 1.5, 2.0, 5.0, 10.0 ];
foreach p, pn in pSet {
  foreach t in startTemp {
    foreach d in delT {
      ItFix(p, nSim, maxRounds, t, d);
    }
  }
}

ItFix() {
  foreach sim in [1:nSim] {
    (structure[sim], log[sim]) = predict(p, t, d);
  }
  result = analyze(structure);
}

Each ItFix() call makes 1000 predict() calls followed by an analyze().
10 proteins x 1000 simulations x 3 rounds x 2 temps x 5 delta-Ts = 300K application runs
T. Sosnick, K. Freed, G. Hocky, J. DeBartolo, A. Adhikari, J. Xu, M. Wilde, U. Chicago
Community Climate System Model (CCSM)
• Makes a world-leading, fully coupled climate model easier to use and available to a wide audience
• Compose, configure, and submit CCSM simulations to the TeraGrid
• Used in Purdue's POL 520/EAS 591: Models in Climate Change Science and Policy
  • Semester-long projects, 100-year CCSM simulations, generating policy recommendations based on scientific, economic, and political models of climate change impacts
Nancy Wilkins-Diehr
We want you to succeed!
If you have any questions, contact help@teragrid.org, or contact your campus champion (Temple: Axel Kohlmeyer, akohlmey@temple.edu), or contact me: dsk@ci.uchicago.edu