WestGrid Overview Dr. Rob Simmonds Distributed Systems Architect
Talk Overview • The WestGrid project • The WestGrid HPTC resources • Grid services for HPTC and how they will be used in WestGrid
WestGrid Project • 8 institutions • More than 250 researchers • Technical and operational officers • HPTC: compute resources and storage • Visualization and collaboration
WestGrid People • PIs • Jonathan Borwein (SFU), Gren Patey (UBC), Jonathan Schaeffer (UofA), Brian Unger (UofC), Mike Vetterli (SFU/TRIUMF) • HPC planning committee • Rob Balantyne, Matthew Choptuik, Corrie Kost, Harold Esche, Paul Lu, Richard Marchand, Seamus O'Shea, Mark Thachuk, Ron Senda, Martin Siegert, Rob Simmonds, Mike Vetterli • Visualization planning committee • Lyn Bartram, Kelly Booth, Pierre Boulanger, Brian Corrie, Sara Diamond, Larry Katz, John MacDonald, Trever Woods • CAO • Ken Hewitt
WestGrid HPTC Resources • 140TB IBM storage server (Power4/AIX) • 1008 processor IBM cluster (IA-32/Linux) • 256 processor SGI Origin (MIPS/Irix) • 144 processor HP SC45 (Alpha/Tru64) All connected by Canada’s world-class networks
Grid Computing • “Grid” is a set of software services • Combines meta-computing, resource discovery and security • Designed to enable access to resources in different management domains • Grid services will enable WestGrid resources to be integrated into individual researchers’ computing environments
Grid Standardization • Global Grid Forum (GGF) is working to provide standards • Open Grid Services Architecture (OGSA) defines low level Grid services
Grid toolkits • Globus (Public domain – ANL/ISI) • Currently version 2.x used for production • Version 3 provides a reference implementation for OGSA • Legion (Commercial – Avaki) • Provides more support for data handling • Will support OGSA
Grid Security Infrastructure • Ability for trusted users to access remote resources without re-authentication • Ability for trusted jobs to access remote resources without re-authentication • Protection against stolen credentials • Avoid requirement for dedicated, highly available security server(s)
Certificate Authority Model • CA issues certificates to trusted users and services • Certificates used to authenticate with remote resources that trust issuing CA • Grid Canada CA will be trusted by WestGrid resources
GSI Proxy Certificates • User credentials delegated from user certificate to proxy certificate • Proxy certificate used for authentication • Proxy certificates have limited lifetime • can also be limited to only authenticate with certain services • Proxy certificate copied to remote resource when job is started
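The lifetime rule on this slide can be sketched as a toy model in Python. This is not GSI code: `Credential` and `delegate` are hypothetical names, real proxies are X.509 certificates created by tools such as grid-proxy-init, and the "/CN=proxy" suffix is an assumption about legacy-style proxy naming. The sketch only illustrates that a proxy is short lived and cannot outlive the credential it was delegated from.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Credential:
    """Toy stand-in for a certificate: a subject name and an expiry time."""
    subject: str
    not_after: datetime

    def is_valid(self, now: datetime) -> bool:
        return now < self.not_after


def delegate(issuer: Credential, now: datetime, lifetime: timedelta) -> Credential:
    """Create a proxy for `issuer`, clamping its expiry to the issuer's own."""
    if not issuer.is_valid(now):
        raise ValueError("cannot delegate from an expired credential")
    expiry = min(now + lifetime, issuer.not_after)
    # Assumed naming: the proxy subject extends the issuer's subject.
    return Credential(subject=issuer.subject + "/CN=proxy", not_after=expiry)


# Illustrative values only.
now = datetime(2003, 1, 1, 9, 0)
user = Credential("/O=Grid/CN=Example User", now + timedelta(days=365))
proxy = delegate(user, now, timedelta(hours=12))
```

A stolen proxy in this model is only useful until `proxy.not_after`, which is the point of keeping proxy lifetimes short.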
Globus Security Commands • Users can request a certificate using ‘grid-cert-request’ • This creates userkey.pem and usercert_request.pem in ~/.globus/ • Certificate request file sent to CA • usercert.pem is returned and placed in ~/.globus/ Aim to automate this process for WestGrid users
Globus Security – Cont. • Proxy certificate created using ‘grid-proxy-init’ • Proxy certificate examined using ‘grid-proxy-info’ • Proxy certificate destroyed using ‘grid-proxy-destroy’ Proxy certificates could be created during login process
Enabling Access to Resources • Holding certificate from trusted CA does not guarantee access to resources • Users given access to resource by being included in resource’s grid-mapfile • This allows owner of resource to choose which users are allowed to use the resource • The grid-mapfile maps Grid user to a local account
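The mapping described on this slide can be sketched in Python. The sketch assumes the common grid-mapfile layout of one quoted certificate subject followed by one local account name per line; the sample subject and account are made up, and `parse_grid_mapfile` and `local_account` are hypothetical helpers, not part of Globus.

```python
import shlex


def parse_grid_mapfile(text):
    """Return a dict mapping each certificate subject (DN) to a local account."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = shlex.split(line)  # keeps the quoted DN (which has spaces) whole
        if len(parts) != 2:
            raise ValueError(f"malformed grid-mapfile line: {line!r}")
        subject, account = parts
        mapping[subject] = account
    return mapping


def local_account(mapping, subject):
    """Return the local account for a subject, or None if access is not granted."""
    return mapping.get(subject)


# Made-up example entry.
SAMPLE = '''
# subject                                local account
"/O=Grid/OU=Example/CN=Jane Doe" jdoe
'''
mapping = parse_grid_mapfile(SAMPLE)
```

A user with a valid certificate but no entry gets `None` here, which mirrors the first bullet: trusting the CA is necessary but not sufficient.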
Globus Job Starting • Run job on remote resource using ‘globus-job-run <host> <program>’ • <host> must trust the CA that signed the user’s certificate, and the user must be listed in the grid-mapfile • Proxy certificate is copied to GASS cache on <host> to enable program to authenticate with other remote resources
Batch Job Starting • ‘globus-job-submit <host> <program>’ • This returns a url used to query job • ‘globus-job-status <url>’ • Find out if the job is waiting, running or finished • ‘globus-job-get-output <url>’ • Get output produced by job. This is stored in the GASS cache on the host where the job is running • ‘globus-job-clean <url>’ • Remove the GASS cache entry for the job in question
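The four commands above form a submit, poll, fetch, clean cycle, which can be sketched as a small Python wrapper. Several things here are assumptions rather than documented behaviour: that globus-job-status prints a single state word, that "DONE" and "FAILED" are the terminal states, and that globus-job-clean runs without asking for confirmation. `run_command` and `run_batch_job` are hypothetical helper names.

```python
import subprocess
import time

# Assumed terminal states printed by globus-job-status.
FINISHED_STATES = {"DONE", "FAILED"}


def run_command(argv):
    """Run a command and return its stripped stdout. Requires Globus installed."""
    result = subprocess.run(argv, capture_output=True, text=True, check=True)
    return result.stdout.strip()


def run_batch_job(host, program, run=run_command, poll_seconds=30, sleep=time.sleep):
    """Submit a job, wait for it to finish, return its output, then clean up."""
    url = run(["globus-job-submit", host, program])  # URL used to query the job
    try:
        while True:
            status = run(["globus-job-status", url])
            if status in FINISHED_STATES:
                break
            sleep(poll_seconds)
        if status == "FAILED":
            raise RuntimeError(f"job {url} failed")
        return run(["globus-job-get-output", url])
    finally:
        # Always remove the GASS cache entry, even if the job failed.
        run(["globus-job-clean", url])
```

The `run` and `sleep` parameters exist so the cycle can be exercised with stand-in functions on a machine without Globus; on a real resource one would call `run_batch_job(host, program)` after creating a proxy certificate.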
GridFTP • ‘globus-url-copy <original> <copy>’ • Copies file from one location to another • file:/<filename> - a file on a local file-system • gsiftp://<host>/<filename> - a file on GridFTP server <host> • Extensions to standard FTP include • Third party transfers • Parallel transfers
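The two URL forms on this slide determine what kind of transfer is being requested. The Python sketch below classifies a source and destination pair and builds the argument list for globus-url-copy without running it; `describe_transfer` and `copy_command` are hypothetical helpers, and the host names in the usage note are placeholders.

```python
from urllib.parse import urlparse


def describe_transfer(source, destination):
    """Classify a copy as 'local copy', 'upload', 'download' or 'third-party'."""
    kinds = []
    for url in (source, destination):
        parsed = urlparse(url)
        if parsed.scheme == "file":
            kinds.append("local")
        elif parsed.scheme == "gsiftp" and parsed.netloc:
            kinds.append("remote")
        else:
            raise ValueError(f"unsupported URL: {url!r}")
    if kinds == ["remote", "remote"]:
        return "third-party"  # data moves directly between two GridFTP servers
    if kinds == ["local", "local"]:
        return "local copy"
    return "upload" if kinds[0] == "local" else "download"


def copy_command(source, destination):
    """Return the argv for globus-url-copy after validating both URLs."""
    describe_transfer(source, destination)
    return ["globus-url-copy", source, destination]
```

For example, `describe_transfer("gsiftp://a.example/data/in.dat", "gsiftp://b.example/data/in.dat")` is classified as a third-party transfer, since neither end is the machine issuing the command.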
Credential Repository • NCSA’s MyProxy server provides an on-line credential repository • User stores proxy certificate in repository • This certificate can be long lived • User can later recover a short lived certificate from the repository
Credential Repository Uses • Used to authenticate with environment when user does not have access to their certificate • e.g., in a Web portal • Could be used to authenticate and get proxy certificate during login process eliminating need for Unix passwords
MyProxy Commands • myproxy-init -s <host> • Put a proxy certificate into the MyProxy server on <host> • Can specify host using environment variable • myproxy-info -s <host> • View information about user’s proxy certificate • myproxy-get-credential • Get a proxy certificate • myproxy-destroy • Remove proxy certificate from the MyProxy server
MyProxy Certificate Renewal • Allows automated proxy certificate renewal • Special proxy certificate enables trusted service to renew standard proxy certificate • e.g., trust a local scheduler to renew the certificate before starting a job • Should help to prevent users resorting to insecure means for automating proxy renewal
GSI Enabled SSH Tools • GSI enabled versions of OpenSSH tools will be used in WestGrid • gsi-ssh Authenticates through GSI and copies proxy certificates to remote host • gsi-scp Authenticates through GSI
Resource Discovery • Globus uses MDS for resource discovery • GRIS – provides information about individual hosts • GIIS – provides information about groups of hosts • In WestGrid each of the 4 major resources will run a GRIS • At least one GIIS will be provided to hold aggregate information • Probably use one per site
MDS • Publish information to LDAP servers • Information used by Grid services to locate needed resources • Publish information such as • Type(s) of job scheduler available • Parameters accepted by job scheduler • Number of processors • Amount of RAM, disk or tape • Software and license availability
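How a Grid service might use the published information to locate resources can be sketched in Python. This does not query LDAP or use the real MDS schema: the records are plain dictionaries, and every field name, host name and value below is made up for illustration.

```python
def find_resources(records, min_processors=1, scheduler=None, software=None):
    """Return hosts whose published information satisfies all given requirements."""
    matches = []
    for record in records:
        if record["processors"] < min_processors:
            continue
        if scheduler is not None and scheduler not in record["schedulers"]:
            continue
        if software is not None and software not in record["software"]:
            continue
        matches.append(record["host"])
    return matches


# Hypothetical published records; none of these values come from WestGrid.
RECORDS = [
    {"host": "cluster-a.example", "processors": 512,
     "schedulers": ["pbs"], "software": ["matlab"]},
    {"host": "smp-b.example", "processors": 128,
     "schedulers": ["pbs", "fork"], "software": ["gaussian"]},
    {"host": "cluster-c.example", "processors": 64,
     "schedulers": ["lsf"], "software": ["matlab", "gaussian"]},
]
```

A call such as `find_resources(RECORDS, min_processors=100, scheduler="pbs")` narrows the candidates to the two larger machines, which is the kind of answer a meta-scheduler would need before choosing where to submit.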
Meta-scheduling • A meta-scheduler is used to submit jobs to other job schedulers • WestGrid will employ meta-scheduling • Condor-G, Silver and Trellis are under consideration • Multiple meta-schedulers could be used • Hierarchical meta-scheduling can be employed
Condor-G • Can be used to submit jobs to specific machines • Can use ‘glideins’ to add resources to local condor pool • New version will include support for batch scheduler advertisements
Condor-G : Glidein Example Movie at http://www.cpsc.ucalgary.ca/~simmonds/EdmontonTalk1/condor_demo1.avi
Result: Solar System Viz Movie at http://www.cpsc.ucalgary.ca/~simmonds/EdmontonTalk1/solarsystem.avi
WestGrid Accounting • Use MDS to publish accounting information from each site to LDAP • WestGrid wide accounting calculated and also published in secure LDAP • Users will be able to gain access to information, filtered by a policy manager
Scheduling Priorities • Plan to use accounting information to provide fairness in scheduling priorities across WestGrid • Feed values calculated using global accounting information back into local batch schedulers
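The slide does not say how fairness would be calculated. One simple possibility, shown here purely as an assumption, is to compare each group's share of global usage with its target share and feed the difference to the local schedulers as a priority adjustment. The function name, the formula and the numbers are all hypothetical.

```python
def fair_share_adjustments(usage, shares):
    """Return target share minus actual share of usage for each group.

    Positive values mean the group has used less than its share and could be
    given higher priority; negative values mean it has used more.
    """
    total_usage = sum(usage.values())
    total_shares = sum(shares.values())
    adjustments = {}
    for group, share in shares.items():
        target = share / total_shares
        actual = usage.get(group, 0.0) / total_usage if total_usage else 0.0
        adjustments[group] = target - actual
    return adjustments


# Made-up accounting data: CPU-hours used across all sites, equal target shares.
usage = {"group-a": 300.0, "group-b": 100.0}
shares = {"group-a": 1.0, "group-b": 1.0}
adjustments = fair_share_adjustments(usage, shares)
ranking = sorted(adjustments, key=adjustments.get, reverse=True)
```

With these numbers group-b has consumed a quarter of the usage against a target of half, so it ranks first; a local batch scheduler could apply such a value as one term in its own priority formula.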
Data Storage • Grid enabled access to storage • Accessible from researcher’s desktop • Distributed file systems currently limited • Security and caching issues • Data repository systems provide much of the functionality required • SRB from SDSC • Giggle from ISI/ANL
Repository management • Large network available file stores • Annotation – meta-data tagging • Data representation optimization • Files, collections and containers • User level replication aided by catalogs
Wide Area Message Passing • MPICH-G2 enables running of message passing jobs in Grid environment • Attempts to use best MPI implementation at each site • Provides process mapping configuration to group tightly coupled processes
Web Portals • Enable access to Grid services via web browser • Start a secure session then authenticate this session with GSI using credential server • Web session now acts as you in Grid environment WestGrid mock up
Getting a WestGrid Account • Centralized Web based account requests • We get certificate or you use existing certificate • We set up accounts, install certificates and email you
WestGrid Grid Environment • Initial Grid services use • Globus, MyProxy, OpenSSH, SRB • Services include • Job starting, resource discovery, credential management and repository management • Working on having meta-scheduler(s) • Condor-G, …
Lots of work to do … • Distributed file systems • Improved replica management • Fine-grain security • Performance measurement and analysis • Credential based information discovery • Enhanced meta-scheduling • Workflow
Credits – TeleSim helpers • Mark Fox mfox@cpsc.ucalgary.ca (TeleSim programmer) • Web portals, demo • Andrey Mirchovski mirchov@cpsc.ucalgary.ca (TeleSim research student) • Security and chief Globus critic • Phil Rizk rizkp@cpsc.ucalgary.ca (Hons project student/TeleSim programmer) • MDS, accounting and Web services