Campus, State, and Regional Grid Issues Warren Smith Texas Advanced Computing Center University of Texas at Austin
Outline • Overview of several grids • UTGrid - a University of Texas campus grid • TIGRE - a state of Texas grid • SURAgrid - a southeast US regional grid • Summary of issues
UTGrid Overview • Create a campus cyberinfrastructure • Supporting research and education • Diverse resources, but an easy-to-use environment • Worked in several main areas • Serial computing • Parallel computing • User interfaces • Close partnership with IBM • Partly funded by IBM • 2 IBM staff located at TACC
Serial Computing • There are many science domains that have “embarrassingly parallel” computations • DNA sequence analysis, protein docking, molecular modeling, CGI, engineering simulation • Can older clusters and desktop systems be used for these?
Parallel Rendering of Single Frames • Was: 2 h 17 min for a single frame • Now: about 15 min, with the rendering split into parallel tasks of 5-8 min each
Roundup • Aggregates unused cycles on desktop computers • Managed by United Devices Grid MP software • Support for Windows, Linux, Mac, AIX, & Solaris clients • Linux servers located at TACC • Supports hosted applications • Pre-configured by an expert for use by many • Resources contributed by several UT organizations • Nearly 2000 desktops available • Production resource • Supported by TACC
Rodeo • Condor Pool of dedicated and non-dedicated resources • Dedicated resources • Condor Central Manager (collector and negotiator) • One of our older clusters - Tejas • Non-dedicated resources • Linux, Windows, and Mac resources are managed by Condor • Usage policy is configured by resource owner • When there is no other activity • When load (utilization) is low • Give preference to certain group or users • TACC pool configured to flock to and from CS and ICES pools • 500 systems available • Production resource • Supported by TACC
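As a rough illustration of the owner-configured usage policies and flocking described above, a desktop's Condor configuration might contain something like the fragment below; the central manager hostnames and the preferred user are placeholders, not the actual Rodeo settings.

```
## Sketch of an owner-side Condor policy (illustrative values only)
# Start jobs only when the desktop has been idle and the load is low
START   = (KeyboardIdle > 15 * 60) && (LoadAvg < 0.3)
# Suspend jobs as soon as the owner is active again
SUSPEND = (KeyboardIdle < 60)
# Give preference to jobs from a particular local user
RANK    = (Owner == "alice")
# Flock to and from the CS and ICES pools (placeholder central manager names)
FLOCK_TO   = condor.cs.utexas.edu, condor.ices.utexas.edu
FLOCK_FROM = condor.cs.utexas.edu, condor.ices.utexas.edu
```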
Parallel Computing • Support executing parallel jobs on clusters around campus • Our approach to campus parallel computing was influenced by the location of resources • TACC has the largest systems at UT Austin • Lonestar: 1024-processor Linux cluster • Wrangler: 656-processor Linux cluster • Champion: 96-processor IBM Power5 system • Maverick: Sun system with 64 dual-core UltraSPARC IV processors and 512 GB of memory • Also a number of clusters in other locations on campus • Initially a hub-and-spoke model • TACC has the largest systems • Allow users of non-TACC systems to easily send their jobs to TACC
Parallel Grid Technologies • File management • Quickly move (perhaps large) files between systems • Evaluated Avaki Data Grid, IBM GPFS, GridNFS, GridFTP • Using GridFTP • Job execution • Using Globus GRAM • Resource brokering • Evaluated Condor-G, Platform CSF • Created GridShell & MyCluster • Researching performance prediction services
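For illustration, the file-management and job-execution pieces can be driven from the command line roughly as follows; the paths and the jobmanager name are placeholders, and the exact commands on UT Grid may differ.

```sh
# Obtain a grid proxy for GSI authentication
grid-proxy-init
# Move an input file to a remote system with GridFTP
globus-url-copy file:///home/user/input.dat \
    gsiftp://lonestar.tacc.utexas.edu/work/user/input.dat
# Run a 4-processor job through the Globus GRAM jobmanager (placeholder name)
globus-job-run lonestar.tacc.utexas.edu/jobmanager-lsf -np 4 ./a.out input.dat
```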
GridShell • Transparently execute commands across grid resources • Extensions to tcsh and bash • “on” - “a.out on 2 procs” • “in” - “a.out in 1000 instances” • Redirection • a.out > gsiftp://lonestar.tacc.utexas.edu/work/data • Overload programs • cp - copy between systems • Environment variables • _GRID_THROTTLE - number of active jobs • _GRID_TAG - job tag • _GRID_TASKID - if part of a parallel job
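Putting the examples above together, a short GridShell session might look roughly like this (tcsh syntax; the throttle value is arbitrary):

```sh
# Limit the number of jobs active at once
setenv _GRID_THROTTLE 50
# Run a.out as a 2-processor parallel job
a.out on 2 procs
# Run a 1000-instance parameter sweep
a.out in 1000 instances
# Redirect output to a remote file over GridFTP
a.out > gsiftp://lonestar.tacc.utexas.edu/work/data
```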
MyCluster • Diagram: (1) submit MyScheduler daemons to the schedulers of selected clusters, (2) the MyCluster virtual cluster is formed, (3) the user submits jobs to MyCluster, (4) the MyScheduler daemons run the jobs • GridShell extensions require job management • Approach is to form a virtual cluster • Select systems to incorporate • User-specified maximum for each system • Number of nodes up to that maximum selected based on load • Submit “MyScheduler” daemons to selected systems • Iterate over this process to maintain a virtual cluster • Submit user jobs to this virtual cluster (see the sketch below) • “MyScheduler” schedules jobs to nodes • “MyScheduler” can be: • Condor • Sun Grid Engine (SGE) • Others possible • Deployed on TeraGrid, in addition to UT Grid
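When Condor serves as the MyScheduler, user jobs go to the virtual cluster through an ordinary Condor submit description; a minimal sketch (file and executable names are placeholders) might be:

```
## sweep.sub - jobs destined for the MyCluster virtual cluster
universe   = vanilla
executable = a.out
arguments  = input.$(Process)
output     = out.$(Process)
error      = err.$(Process)
log        = sweep.log
queue 1000
```

Once the virtual cluster is up, the sweep would be submitted with condor_submit sweep.sub.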
User Interfaces • Command line access via Grid User Node (GUN) • Submit jobs to Rodeo and Roundup • Submit jobs to clusters • Use GridShell to form virtual clusters and run jobs • Manage files • Graphical interface via Grid User Portal (GUP) • Using a standard web browser • Submit jobs to Rodeo and Roundup • Submit jobs to clusters • Manage files
Grid User Node • Command line access to UT Grid • Developed a software stack • UT Grid users can install on their systems • Can use serial and parallel UT Grid resources • Provided grid user nodes with this software stack • UT Grid users can login and use UT Grid • Targeted toward experienced users
Grid User Portal • Web interface to UT Grid • Simple GUI interface to complex grid computing capabilities • Can present information more clearly than a CLI • Lowers the barrier to entry for novice users • Implemented as a set of configurable portlets
UTGrid Status • UT Grid received IBM funding for 2 years • Ended recently • Continuing efforts in a number of areas • Several new technologies developed that have been used in other projects • GridShell, MyCluster, GridPort • Rodeo and Roundup continue to be available to users
UTGrid Lessons Learned • Grid Technologies • Growing pains with some • Several versions of Globus • Platform Community Scheduling Framework • Surprising successes with others • United Devices Grid MP, Condor • Grids for serial computing are ready for users • Grids for parallel computing need further improvements before they are ready • Creating a campus grid harder than we expected • Technologies not as ready as advertised • More labor involved
Computing Across Texas • Grid computing is central to many high tech industries • Computational science and engineering • Microelectronics and advanced materials (nanotechnology) • Medicine, bioengineering and bioinformatics • Energy and environment • Defense and aerospace • Texas wants to encourage these industries • State of Texas is moving forward aggressively • Funded Lonestar Education And Research Network (LEARN) • $7.3M for optical fiber across state (33 schools) • Provides the infrastructure for a Grid • Funded Texas Internet Grid for Research and Education (TIGRE) • $2.5M for programmers / sysadmins / etc. (5 schools) • Provides the manpower to construct a Grid
TIGRE Mission • Integrate resources of Texas institutions to enhance research and educational capabilities • Foster academic, private, and government partnerships
TIGRE Approach • Construct a grid across 5 funded sites: • Rice University • Texas A&M University • Texas Tech University • University of Houston • University of Texas at Austin • Support a small set of applications important to Texas • Package the grid • Software • Documentation • Procedures • Organizational structure • Leverage everything that’s already out there! • To easily bring other LEARN members into TIGRE (providing resources & running apps)
TIGRE Organization • Steering Committee • 1 participant from each of the 5 partners • Decisions by consensus • Select application areas and applications • High-level guidance • Development group • Technical group constructing TIGRE • Several members from each site • Break work up into activities • Decisions by consensus • ~10 people now, still growing
TIGRE Resources • Amount of allocations undecided • Available from TIGRE institutions • Lonestar (UT Austin): • 1024 Xeons + Myrinet + GigE • Hrothgar (Texas Tech): • 256 Xeons + Infiniband + GigE • Cosmos (Texas A&M): • 128 Itaniums + Numalink • Rice Terascale Cluster: • 128 Itaniums + GigE • Atlantis (Houston): • 124 Itaniums + GigE + SCI • plus several smaller systems • Will incorporate other institutions as appropriate to applications
TIGRE Activities • Planning • Documents: Project management, requirements, architecture, initial design • Authentication and authorization • Setting up a CA (including policies) for TIGRE • Experimenting with a VOMS server • Assembling a software stack • Start small, add when needed • Pulling components from the Virtual Data Toolkit • Globus Toolkit 4.0, pre-Web Services and Web Services • GSI OpenSSH • UberFTP • Condor-G • MyProxy • VDT providing 64-bit versions for us • User portal • Creating a user portal for TIGRE
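As a sketch of how a TIGRE user might exercise this stack once it is deployed (the MyProxy host and the jobmanager type are placeholders, not actual TIGRE settings):

```sh
# Retrieve a short-lived proxy from a MyProxy server
myproxy-logon -s myproxy.tigre.example.org -l username
# Log in to a TIGRE resource interactively with GSI OpenSSH
gsissh lonestar.tacc.utexas.edu
# Submit a batch job via Condor-G to a pre-WS GRAM jobmanager
cat > tigre.sub <<'EOF'
universe      = grid
grid_resource = gt2 lonestar.tacc.utexas.edu/jobmanager-lsf
executable    = a.out
output        = job.out
log           = job.log
queue
EOF
condor_submit tigre.sub
```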
TIGRE Timeline (project start December 1, 2005) • Year 1: project plan, web site, Certificate Authority, testbed requirements, driving applications, web portal, software stack, distribution mechanism, client software package, demonstrate a TIGRE application • Year 2: user support system, global scheduler, software feature freeze, TIGRE service requirements, final software, final documentation, procedures and policies to join TIGRE, demonstrate at SC
TIGRE Applications • Developing in three areas • Biology / medicine • Rice, Houston collaborating with Baylor College of Medicine • UT Austin working with UT Southwestern Medical Center • Atmospheric science / air quality • Research interest and expertise at Texas A&M, Rice, Houston • WRF and WRF-CHEM under evaluation • Energy • Very preliminary discussions about seismic processing, reservoir modeling • Looking for industrial partners • Grid-ready applications • EMAN - 3D reconstruction from electron microscopy • ENDYNE - quantum chemistry • ALICE - high-energy physics
TIGRE Lessons Learned • Funding structure is important • Funding was given from the state directly to each university • No one person is responsible; five are • Communication and coordination challenges • Organizational structure is important • Related to funding structure • Must be able to hold people accountable • Even with good people, it can be difficult to form consensus • Availability of computational resources is important • TIGRE contains no funding for compute resources • Makes it harder to generate interest from users • Driving users are important • Identified some, need more
Southeastern Universities Research Association Grid (SURAgrid)
Southeastern Universities Research Association • An organization formed to manage the Thomas Jefferson National Accelerator Facility (Jefferson Lab) • Has extended its activities beyond this • Mission: Foster excellence in scientific research, strengthen capabilities, provide training opportunities • Region: 16 states & DC • Membership: 62 research universities
SURAgrid Goals SURAgrid: Organizations collaborating to bring grids to the level of seamless, shared infrastructure Goals: • To develop scalable infrastructure that leverages local institutional identity and authorization while managing access to shared resources • To promote the use of this infrastructure for the broad research and education community • To provide a forum for participants to gain additional experience with grid technology, and participate in collaborative project development
SURAgrid Participants • University of Alabama at Birmingham • University of Alabama in Huntsville • University of Arkansas • University of Florida • George Mason University • Georgia State University • Great Plains Network • University of Kentucky • University of Louisiana at Lafayette • Louisiana State University • University of Michigan • Mississippi Center for Supercomputing Research • University of North Carolina, Charlotte • North Carolina State University • Old Dominion University • University of South Carolina • University of Southern California • Southeastern Universities Research Association (SURA) • Texas A&M University • Texas Advanced Computing Center (TACC) • Texas Tech University • Tulane University • Vanderbilt University • University of Virginia
Current Activities • Grid-Building • Themes: heterogeneity, flexibility, interoperability, scalability • User Portal • Themes: capability, usability • Inter-institutional AuthN/AuthZ • Themes: maintain local autonomy; leverage enterprise infrastructure • Application Development • Themes: immediate benefit to applications; apps drive development • Education, Outreach, and Training • Cookbooks (how-to documents) • Workshops on science on grids and grid infrastructure • Tutorials on building and using grids • A small amount of funding provided by SURA for activities • The majority of effort on SURAgrid is volunteer
Building SURAgrid • Software selection • Globus pre-WS • GPIR information provider • Environment variables • Defined a basic set of environment variables that users can expect • Assistance adding resources • Installing & configuring software & environment • Assistance using resources • User support, modifying software environments • Almost totally volunteer • Access to smaller clusters • People’s time
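As an illustration of why an agreed-on set of environment variables helps, a job script can stay portable across heterogeneous sites by never hard-coding site paths; the variable names below are hypothetical, not the actual SURAgrid definitions.

```sh
#!/bin/sh
# Hypothetical variable names, for illustration only
cd "$SURAGRID_SCRATCH"                 # per-site scratch space
cp "$SURAGRID_HOME/input.dat" .        # user's home area on this resource
mpirun -np 16 "$SURAGRID_APPS/mycode"  # site-installed application directory
```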
SURAgrid Portal • Single sign-on to access all grid resources • Documentation tab has details on: • Adding resources to the grid • Setting up user ids and uploading proxy certificates
Resource Monitoring http://gridportal.sura.org/gridsphere/gridsphere?cid=resource-monitor
Proxy Management • Upload proxy certificates to MyProxy server • Portal provides support for selecting a proxy certificate to be used in a user session
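Concretely, uploading a credential for the portal to retrieve later might look like this from the user's own machine (the server name is a placeholder):

```sh
# Delegate a credential to the MyProxy server so the portal can fetch proxies
myproxy-init -s myproxy.sura.example.org -l username
# Later, a short-lived proxy can be retrieved (by the portal or the user) with
myproxy-logon -s myproxy.sura.example.org -l username
```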
File Management • List directories • Move files between grid resources • Upload/download files from the local machine
Job Management • Submit jobs for execution on remote grid resources • Check status of submitted jobs • Cancel and delete jobs
SURAgrid Authentication: Bridge CA • Diagram: campus grids, each with its own CA, linked through a bridge CA • Problem: • Many different Certificate Authorities (CAs) issuing credentials • Which ones should a grid trust? • An approach: Bridge CA • Trust any CA signed by the bridge CA • Can query the bridge CA to ask if it trusts a CA • A way to implement Policy Management Authorities (e.g. TAGPMA)
SURAgrid Authorization • The Globus grid-mapfile • Controls the basic (binary) authorization process • Sites add certificate Subject DNs from remote sites to their grid-mapfile based on email from SURAgrid sites • Grid-mapfile automation • Sites that use a recent version of Globus can use an LDAP callout that replaces the grid-mapfile • Directory holds and coordinates • Certificate Subject DN • Unix login name • Allocated Unix UID • Some Unix GIDs?
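For reference, a grid-mapfile entry is just a certificate subject DN mapped to a local login; the DN and account below are made up for illustration (the file conventionally lives at /etc/grid-security/grid-mapfile).

```
"/C=US/O=SURAgrid/OU=Example University/CN=Jane Researcher" jresearcher
```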
SURAgrid Applications • Multiple Genome Alignment (GSU, UAB, UVA) • Task Farming (LSU) • Muon Detector Grid (GSU) • BLAST (UAB) • ENDYNE (TTU) • SCOOP/ADCIRC (UNC, RENCI, MCNC, SCOOP partners, SURAgrid partners) • Seeking more
SCOOP & ADCIRC • SURA Coastal Ocean Observing and Prediction (SCOOP) • Develop data standards • Make observations available & integrate them • Modeling, analysis and delivery of real-time data • ADCIRC: forecasts storm surge • Preparing for hurricane season now • Uses data acquired from IOOS • Executes on SURAgrid • Resource selection (query MDS) • Build package (application & data) • Send package to resource (GridFTP) • Run ADCIRC in MPI mode (Globus RSL & qsub) • Retrieve results from resource (GridFTP) (see the sketch below)
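The execution steps listed above translate into command-line operations roughly like the following sketch; the host, paths, and RSL values are placeholders rather than the actual SCOOP scripts.

```sh
# Stage the package (application & data) to the selected resource
globus-url-copy file:///scoop/package.tar.gz \
    gsiftp://resource.example.edu/scratch/scoop/package.tar.gz
# Run ADCIRC as an MPI job through pre-WS GRAM, described in RSL
globusrun -r resource.example.edu/jobmanager-pbs \
    '&(executable=/scratch/scoop/run_adcirc.sh)(jobType=mpi)(count=64)'
# Retrieve the results
globus-url-copy gsiftp://resource.example.edu/scratch/scoop/results.tar.gz \
    file:///scoop/results.tar.gz
```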
SCOOP/ADCIRC… Left: ADCIRC max water level for 72 hr forecast starting 29 Aug 2005,driven by the "usual, always-available” ETA winds. Right: ADCIRC max water level over ALL of UFL ensemble wind fields for 72 hr forecast starting 29 Aug 2005, driven by “UFL always-available” ETA winds. Images credit: Brian O. Blanton, Dept of Marine Sciences, UNC Chapel Hill
SURAgrid Lessons • Volunteer efforts are different • Lots of good people involved • They are also involved in many other (funded) activities • Progress can be uneven • Must interest people (builders & users) • Leadership is important • SURA is providing a lot of energy to lead this effort • Direction and purpose not always clear • Computational resources • Needed to interest users • Driving applications • Needed to guide infrastructure development • Matching organization with funding boundaries • SURAgrid isn’t matched to a funding boundary • Must compete against US national grid efforts
Summary of Issues I • Technology • Grid technologies are in constant flux • Stability and reliability also varies widely • Integrating technologies challenging • People choose different technologies • Single provider solutions available in some areas • Serial computing grids • Choose technology path carefully… • But you’ll be wrong, anyway • Be prepared for change • Funding • Always need it • The way it is obtained affects many things • Participants • Project organization • Integration with end users • Think about what you are proposing and who you collaborate with
Summary of Issues II • Organization • Different challenges for hierarchical vs committee, funded or volunteer • Important that everyone involved have the same (or at least similar) goals • Work it out ahead of time… • Or you’ll spend the first few months of the project doing it • It will change, anyway • Users • User-driven infrastructure is very important! • Include them from the very beginning… • Even if you are doing a narrow technical project • Resources • Need resources to interest users • But, need users to get resources… • One approach: Multidisciplinary teams