Introduction to Grid & Cluster Computing
Sriram Krishnan, Ph.D., sriram@sdsc.edu
Motivation: NBCR Example • [Diagram: a set of biomedical applications (QMView, GAMESS, Gtomo2, TxBR, APBS, Continuity, Autodock) made available through cyber-infrastructure; users reach them via rich clients (PMV, ADT, Vision, Workflow, APBSCommand, Continuity), web portals (Telescience Portal), and web services, which sit on middleware layered over the underlying resources.]
Cluster Resources • “A computer cluster is a group of tightly coupled computers that work together closely so that in many respects they can be viewed as though they are a single computer.” [Wikipedia] • Typically built from commodity off-the-shelf hardware (processors, networking, etc.) • Differs from traditional “supercomputers” • Clusters now account for more than 70% of deployed Top500 machines • Useful for: high availability, load balancing, scalability, visualization, and high performance
Grid Computing • “Coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organizations.” [Foster, Kesselman, Tuecke] • Coordinated - multiple resources working in concert, e.g. disk & CPU, or instruments & databases • Resources - compute cycles, databases, files, application services, instruments • Problem solving - focus on solving scientific problems • Dynamic - environments that change in unpredictable ways • Virtual Organization - resources spanning multiple organizations and administrative, security, and technical domains
Grids are not the same as Clusters! • Foster’s 3-point checklist • Resources not subject to centralized control • Use of standard, open, general-purpose protocols and interfaces • Delivery of non-trivial qualities of service • Grids are typically made up of multiple clusters
Popular Misconception • Misconception: Grids are all about CPU cycles • CPU cycles are just one aspect; others include: • Data: for publishing and accessing large collections of data, e.g. the Geosciences Network (GEON) Grid • Collaboration: for sharing access to instruments (e.g. the TeleScience Grid) and collaboration tools (e.g. Global MMCS at IU)
SETI@Home • Uses thousands of Internet-connected PCs to help in the search for extraterrestrial intelligence • When a computer is idle, the software downloads a ~0.5 MB chunk of data for analysis • Results of the analysis are sent back to the SETI team and combined with those of thousands of other participants • Largest distributed computation project in existence • Total CPU time: 2,433,979.781 years • Users: 5,436,301 • (statistics from 2006; a conceptual client loop is sketched below)
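To make the download-analyze-report cycle concrete, here is a minimal conceptual sketch of a volunteer-computing client loop. It is not the actual SETI@home/BOINC protocol: the server URL, endpoints, idle check, and analysis kernel are all hypothetical stand-ins.

```python
# Conceptual sketch of a volunteer-computing client loop (NOT the real
# SETI@home/BOINC protocol; the server URL and payload format are hypothetical).
import time
import urllib.request

SERVER = "https://example.org/workunits"   # hypothetical work-unit server


def host_is_idle() -> bool:
    """Stand-in for a real idle-detection check (screensaver, CPU load, ...)."""
    return True


def analyze(chunk: bytes) -> bytes:
    """Stand-in for the signal-analysis kernel run on each ~0.5 MB work unit."""
    return str(sum(chunk) % 256).encode()


def main() -> None:
    while True:
        if not host_is_idle():
            time.sleep(60)
            continue
        # 1. Download a work unit while the machine is idle.
        with urllib.request.urlopen(f"{SERVER}/next") as resp:
            chunk = resp.read()
        # 2. Analyze it locally.
        result = analyze(chunk)
        # 3. Upload the result so the server can combine it with other participants'.
        req = urllib.request.Request(f"{SERVER}/results", data=result, method="POST")
        urllib.request.urlopen(req)


if __name__ == "__main__":
    main()
```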
NCMIR TeleScience Grid * Slide courtesy TeleScience folks
NBCR Grid • [Architecture diagram: client tools (Gemstone, PMV/Vision, Kepler) sit above application services with state management and security services (GAMA); Globus instances front each back-end resource: a PBS cluster, a Condor pool, and an SGE cluster.]
Day 1 - Using Grids and Clusters: Job Submission • Scenario 1 - Clusters: • Upload data to the remote cluster using scp • Log on to that cluster using ssh • Submit the job from the command line to a scheduler such as Condor or the Sun Grid Engine (SGE) • Scenario 2 - Grids: • Upload data to the Grid resource using GridFTP • Submit the job via Globus command-line tools (e.g. globusrun) to remote resources • Globus services communicate with the resource-specific schedulers • (Scenario 1 is sketched in the code below)
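As a concrete illustration of Scenario 1, the following Python sketch stages data to a cluster with scp and submits a job to SGE over ssh. The hostname, directory, and job script name are hypothetical placeholders; on a Condor pool the submission step would use condor_submit instead of qsub.

```python
# Minimal sketch of Scenario 1 (cluster job submission), assuming an SGE
# cluster reachable over SSH. Hostname, paths, and job script are hypothetical.
import subprocess

CLUSTER = "login.cluster.example.edu"      # hypothetical login node
REMOTE_DIR = "/home/user/jobs/run01"       # hypothetical working directory


def run(cmd: list[str]) -> str:
    """Run a local command and return its stdout, raising on failure."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout


# 1. Upload the input data and the job script to the cluster with scp.
run(["scp", "input.dat", "job.sh", f"{CLUSTER}:{REMOTE_DIR}/"])

# 2. "Log on" non-interactively with ssh and submit the job to SGE via qsub.
out = run(["ssh", CLUSTER, f"cd {REMOTE_DIR} && qsub job.sh"])
print(out)   # e.g. "Your job 12345 (job.sh) has been submitted"

# 3. Check the queue to follow the job's progress.
print(run(["ssh", CLUSTER, "qstat"]))
```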
Day 2 - Managing Cluster Environments • Clusters are great price/performance computational engines • Can be hard to manage without experience • Failure rate increases with cluster size • Not cost-effective if maintenance is more expensive than the cluster itself • System administrators can cost more than the cluster (a 1 Tflops cluster < $100,000)
Day 2 - Rocks (Open Source Clustering Distribution) • Technology transfer of commodity clustering to application scientists • Making clusters easy • Scientists can build their own supercomputers • The Rocks distribution is a set of CDs containing: • Red Hat Enterprise Linux • Clustering software (PBS, SGE, Ganglia, Globus) • Highly programmatic software configuration management • http://www.rocksclusters.org
Day 3 - Advanced Usage Scenarios: Workflows • Scientific workflows emerged as an answer to the need to combine multiple Cyberinfrastructure components into automated process networks • Combination of • Data integration, analysis, and visualization steps • Automated “scientific process” • Promotes scientific discovery • (a minimal pipeline is sketched below)
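The sketch below illustrates the idea of a workflow as an automated process network: three steps (data integration, analysis, visualization) chained so that each step's output feeds the next. It is a generic Python illustration, not Kepler's actor/MoML interface; all function names are invented for the example.

```python
# Toy "process network": integration -> analysis -> visualization,
# chained automatically. Illustrative only; not a real workflow engine.
from typing import Any, Callable

Step = Callable[[Any], Any]


def integrate(sources: list[str]) -> list[float]:
    """Pull numbers out of several 'data sources' (here, literal strings)."""
    return [float(x) for src in sources for x in src.split(",")]


def analyze(values: list[float]) -> dict[str, float]:
    """Compute simple summary statistics over the integrated data."""
    return {"n": len(values), "mean": sum(values) / len(values)}


def visualize(summary: dict[str, float]) -> str:
    """Render a tiny text 'plot' of the summary."""
    return f"mean={summary['mean']:.2f} " + "*" * int(summary["mean"])


def run_workflow(data: Any, steps: list[Step]) -> Any:
    """Execute the steps in order, feeding each output to the next step."""
    for step in steps:
        data = step(data)
    return data


if __name__ == "__main__":
    print(run_workflow(["1,2,3", "4,5"], [integrate, analyze, visualize]))
```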
Day 3 - The Big Picture: Scientific Workflows • From “napkin drawings” (conceptual workflows) … to executable workflows • Example: John Blondin, NC State, Astrophysics Terascale Supernova Initiative (SciDAC, DOE) • Source: Mladen Vouk (NCSU)
Day 3 - Advanced Usage Scenarios: MetaScheduling • Local schedulers are responsible for load balancing and resource sharing within each local administrative domain • Meta-schedulers are responsible for querying, negotiating access to, and managing resources that exist within different administrative domains in Grid systems • (see the sketch below)
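A toy sketch of the meta-scheduling idea follows: query load information from clusters in different domains, pick one by a simple policy, and hand the job to that cluster's local scheduler. The cluster names, load numbers, and policy are made up for illustration; a real meta-scheduler such as CSF4 negotiates through Grid services rather than reading a local list.

```python
# Minimal sketch of meta-scheduling: pick the least-loaded cluster across
# administrative domains, then delegate to its local scheduler. Illustrative only.
from dataclasses import dataclass


@dataclass
class Cluster:
    name: str
    scheduler: str      # local scheduler type: "SGE", "PBS", "Condor", ...
    queued_jobs: int    # stand-in for the load information a meta-scheduler queries


def choose_cluster(clusters: list[Cluster]) -> Cluster:
    """Trivial policy: send the job wherever the queue is shortest."""
    return min(clusters, key=lambda c: c.queued_jobs)


def submit(job: str, clusters: list[Cluster]) -> str:
    target = choose_cluster(clusters)
    # In a real system this step would go through Globus/WS-GRAM to the
    # chosen cluster's local scheduler; here we just report the decision.
    return f"submit '{job}' to {target.name} via {target.scheduler}"


if __name__ == "__main__":
    pool = [Cluster("sge.domainA", "SGE", 12),
            Cluster("pbs.domainB", "PBS", 3),
            Cluster("condor.domainC", "Condor", 40)]
    print(submit("autodock_run", pool))
```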
Day 3 - MetaSchedulers: CSF4 • What is the CSF Meta-Scheduler? • Community Scheduler Framework • CSF4 is a group of Grid services hosted inside the Globus Toolkit (GT4) • CSF4 is fully WSRF-compliant • Open-source project, available at http://sourceforge.net/projects/gcsf • CSF4 is developed by a team at Jilin University, China
Day 3 - CSF4 Architecture • [Architecture diagram: CSF4 services (Resource Manager Factory Service, Resource Manager Reservation Service, Resource Manager Job Service, Queuing Service) operate in the Grid environment, drawing meta-information from WS-MDS and dispatching work through Gram services; jobs flow either through a GT2 environment (GateKeeper with LSF, PBS, SGE, Condor, and Fork GRAM adapters, plus gabd) or through WS-GRAM, down to the local schedulers (LSF, PBS, SGE, Condor) on the local machines.]
Day 4 - Accessing TeraScale Resources • I need more resources! What are my options? • TeraGrid: “With 20 petabytes of storage, and more than 280 teraflops of computing power, TeraGrid combines the processing power of supercomputers across the continent” • PRAGMA: “To establish sustained collaborations and advance the use of grid technologies in applications among a community of investigators working with leading institutions around the Pacific Rim”
Day 4 - TeraGrid (Extensible Terascale Facility) • TeraGrid is a “top-down”, planned Grid • Members: IU, ORNL, NCSA, PSC, Purdue, SDSC, TACC, ANL, NCAR • 280 Tflops of computing capability • 30 PB of distributed storage • High-performance networking between partner sites • Linux-based software environment, uniform administration • Focus is a national, production Grid
PRAGMA Grid Member Institutions • 31 institutions in 15 countries/regions (+ 7 in preparation): China (JLU, CNIC, GUCAS, LZU), Japan (AIST, OsakaU, UTsukuba, TITech), USA (NCSA, BU, UUtah, SDSC), Switzerland (UZurich), Korea (KISTI), Puerto Rico (UPRM), Taiwan (ASGC, NCHC), India (UoHyd), Mexico (CICESE, UNAM), Hong Kong (CUHK), Thailand (NECTEC, ThaiGrid), Vietnam (HCMUT, IOIT-HCM), Costa Rica (ITCR), Australia (APAC, QUT, MU), Malaysia (MIMOS, USM), Singapore (BII, IHPC, NGO, NTU), Chile (UCN, UChile), New Zealand (BESTGrid)
Track 1: Agenda (9AM-12PM at PFBH 161) • Tues, July 31: Basic Cluster and Grid Computing Environment • Wed, Aug 1: Rocks Clusters and Application Deployment • Thurs, Aug 2: Workflow Management and MetaScheduling • Fri, Aug 3: Accessing National and International TeraScale Resources