Grids and Condor, Barcelona, 2006
Agenda • Extended user's tutorial • Advanced Uses of Condor: Java programs, DAGMan, Stork, MW, Grid Computing • Case studies, and a discussion of your application's needs
Resources • There are many resources (machines) in the world, and many are or can be made available! • Groups of machines may be labeled as grids • Welcome to the power of the grid!
Condor and Grids • Condor has always been a tool to harness grid computing • Condor’s mechanisms have evolved as technologies have evolved. Roughly categorized: • Flocking • Glidein • The grid universe
Flocking • A way for jobs to run within a different, separate Condor pool • Condor runs here, and Condor runs there
Connect Condor Pools with Flocking • Flocking is a Condor-specific technology • Flocking is enabled with configuration • Jobs flock from here to there when they cannot be run here due to lack of available machines
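The flocking rule can be pictured with a toy model. The sketch below is only an illustration under stated assumptions, not Condor's matchmaking code: the pool names, the `free_machines` counts, and the `choose_pool` helper are all hypothetical.

```python
def choose_pool(local_pool, flock_to, free_machines):
    """Return the pool where a job would run, or None if it must wait.

    The local pool is tried first. Remote pools are considered only when
    the local pool has no free machine, in the order they are listed.
    """
    for pool in [local_pool] + list(flock_to):
        if free_machines.get(pool, 0) > 0:
            return pool
    return None


# Hypothetical snapshot: nothing is free here, three machines are free there.
free = {"here": 0, "there": 3}
print(choose_pool("here", ["there"], free))  # prints: there
```

With at least one free machine in the local pool, the same call returns "here", so the job never leaves.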
Configuration • Configuration files contain lots of the administrative information used by Condor • Format is like that in submit description files: AttributeName = Value
Configuration here • For jobs to be able to flock from here to there • In the configuration file on the pool where jobs flock from:
    FLOCK_TO = <central manager machine name>
    FLOCK_COLLECTOR_HOSTS = $(FLOCK_TO)
    FLOCK_NEGOTIATOR_HOSTS = $(FLOCK_TO)
    HOSTALLOW_NEGOTIATOR_SCHEDD = $(COLLECTOR_HOST), $(FLOCK_NEGOTIATOR_HOSTS)
Configuration there • In the configuration file on the pool where jobs flock to:
    FLOCK_FROM = <submit machine name>, . . . , <submit machine name>
• To make security work:
    HOSTALLOW_WRITE_COLLECTOR = $(HOSTALLOW_WRITE), $(FLOCK_FROM)
    HOSTALLOW_WRITE_STARTD = $(HOSTALLOW_WRITE), $(FLOCK_FROM)
    HOSTALLOW_READ_COLLECTOR = $(HOSTALLOW_READ), $(FLOCK_FROM)
    HOSTALLOW_READ_STARTD = $(HOSTALLOW_READ), $(FLOCK_FROM)
Submit Description File • Enable file transfer:
    universe = vanilla
    executable = myjob.exe
    input = myjob.input
    output = myjob.output
    log = myjob.log
    should_transfer_files = YES
    when_to_transfer_output = ON_EXIT
    queue
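The $(NAME) references in these settings are macro substitutions. The sketch below shows the idea in Python; it is a simplified illustration, not Condor's configuration parser. The host names are made up, and treating an undefined name as an empty string is an assumption of this sketch.

```python
import re

MACRO = re.compile(r"\$\((\w+)\)")


def parse_config(text):
    """Parse 'Name = Value' lines into a dict, skipping blanks and # comments."""
    config = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.partition("=")
        config[name.strip()] = value.strip()
    return config


def expand(config, name, depth=0):
    """Replace $(OTHER) references recursively; unknown names become ''."""
    if depth > 20:
        raise ValueError("macro nesting too deep (circular reference?)")
    value = config.get(name, "")
    return MACRO.sub(lambda m: expand(config, m.group(1), depth + 1), value)


# Hypothetical host names, used only for illustration.
sample = """
COLLECTOR_HOST = cm.here.example.org
FLOCK_TO = cm.there.example.org
FLOCK_NEGOTIATOR_HOSTS = $(FLOCK_TO)
HOSTALLOW_NEGOTIATOR_SCHEDD = $(COLLECTOR_HOST), $(FLOCK_NEGOTIATOR_HOSTS)
"""
cfg = parse_config(sample)
print(expand(cfg, "HOSTALLOW_NEGOTIATOR_SCHEDD"))
# prints: cm.here.example.org, cm.there.example.org
```

Defining FLOCK_TO once and referring to it elsewhere means that changing the target pool is a one-line edit.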
The Glidein Concept • Assume: We need more machines, and we have permission to use a set of machines • Glidein temporarily adds a set of machines to the local pool
Glidein • In addition, Glidein solves the problem: “My job needs to run on that particular resource, and my job needs Condor.” • For example: a job that must run under the standard universe
Glidein • Condor sends and runs its own executables on the resource • The needed resource appears to temporarily join the local Condor pool!
Glidein • Run condor_glidein to add the remote resource to the local pool • The master and startd daemons become grid universe jobs using gt2 • (Diagram: the remote resource joins the local pool)
Making Glidein Work • Change the configuration to give access permission (HOSTALLOW_WRITE) to the remote resource • No changes to jobs' submit description files! • But, do enable file transfer in the submit description file:
    universe = vanilla
    executable = myjob.exe
    input = myjob.input
    output = myjob.output
    log = myjob.log
    should_transfer_files = YES
    when_to_transfer_output = ON_EXIT
    queue
Force Job to Glidein Resource • In the submit description file:
    universe = standard
    executable = ajob.exe
    input = ajob.input
    output = ajob.output
    log = ajob.log
    requirements = \
        ( machine == "example.mcs.anl.gov" ) \
        && Arch != "" && OpSys != ""
    queue
The Grid Universe Most useful when • We want to send a job off to a far away machine • We want to hand a job to another batch processing system on the local machine • We want to send a job off to a far away machine, in order to hand that job to another batch processing system on that machine
The Grid Universe • All handled in the submit description file • Supports several back end types: • Globus: GT2, GT3, GT4 • NorduGrid • UNICORE • Condor • PBS • LSF
Condor-G • Condor-G describes jobs that are handed off to a machine running Globus middleware • gt2: Globus Toolkit 1 or 2, or the pre-web services GRAM • gt3: Globus Toolkit 3 • gt4: Globus Toolkit 4 or WS GRAM
Submit Description File • For gt2:
    universe = grid
    input = job1.input
    output = job1.result
    log = job1.log
    grid_resource = gt2 example.wisc.edu/jobmanager
    queue
• The jobmanager is one of: jobmanager, jobmanager-condor, jobmanager-pbs, jobmanager-lsf, jobmanager-sge
Submit Description File • For gt3:
    universe = grid
    input = job2.input
    output = job2.result
    log = job2.log
    grid_resource = gt3 http://198.51.254.40:8080/osga/services/base/gram/XXXManagedJobFactoryService
    queue
• XXX is one of: Fork, Condor, PBS, LSF, SGE • 198.51.254.40:8080 is the IP address:port number
Submit Description File • For gt4:
    universe = grid
    input = job3.input
    output = job3.result
    log = job3.log
    grid_resource = gt4 https://198.51.254.40:8080/wsrf/service/ManagedJobFactoryService XXX
    queue
• XXX is one of: Fork, Condor, PBS, LSF, SGE • The address is given as IP address:port number or host name:port number
Nordugrid and the Submit Description File
    universe = grid
    input = job4.input
    output = job4.result
    log = job4.log
    grid_resource = nordugrid ngexample.com
    queue
Unicore and the Submit Description File • vsite is the name of the Unicore virtual resource
    universe = grid
    input = job5.input
    output = job5.result
    log = job5.log
    grid_resource = unicore usite.example.com vsite
    keystore_file = /frieda/certificates/keystore
    keystore_alias = "frieda"
    keystore_passphrase_file = /frieda/private/passphrase
    queue
PBS and the Submit Description File • Details of the PBS installation in $(GLITE_LOCATION)/etc/batch_gahp.config
    universe = grid
    input = job6.input
    output = job6.result
    log = job6.log
    grid_resource = pbs
    queue
LSF and the Submit Description File • Details of the LSF installation in $(GLITE_LOCATION)/etc/batch_gahp.config
    universe = grid
    input = job7.input
    output = job7.result
    log = job7.log
    grid_resource = lsf
    queue
Condor-C • Condor is running here, and Condor is running over there • For the case where we want to send a job off to a far away machine, in order to hand that job to another batch processing system on that machine
Condor-C and the Submit Description File
    universe = grid
    input = job8.input
    output = job8.result
    log = job8.log
    grid_resource = condor joe@remotemachine.example.com remotecentralmanager.example.com
    +remote_jobuniverse = 5
    +remote_requirements = True
    +remote_ShouldTransferFiles = "YES"
    +remote_WhenToTransferOutput = "ON_EXIT"
    queue
• joe@remotemachine.example.com is the schedd name • remotecentralmanager.example.com is the collector machine name • remote_jobuniverse = 5 is the vanilla universe
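Across the back end types, the grid universe submit description files on these slides share one skeleton and differ mainly in the grid_resource line. The Python sketch below makes that pattern explicit. The helper name `grid_submit_description` and its parameters are hypothetical, and like the slides it leaves out the executable line.

```python
def grid_submit_description(job, grid_resource, extra=()):
    """Render a grid universe submit description in the layout of the slides.

    job           -- base name used for the input, output, and log files
    grid_resource -- everything that follows 'grid_resource = '
    extra         -- additional lines, such as the +remote_... attributes
    """
    lines = [
        "universe = grid",
        f"input = {job}.input",
        f"output = {job}.result",
        f"log = {job}.log",
        f"grid_resource = {grid_resource}",
    ]
    lines.extend(extra)
    lines.append("queue")
    return "\n".join(lines)


# The NorduGrid and PBS examples differ only in their grid_resource value.
print(grid_submit_description("job4", "nordugrid ngexample.com"))
print()
print(grid_submit_description("job6", "pbs"))
```

Back ends that need more settings, such as Unicore's keystore entries or the Condor-C remote attributes, would pass them through `extra`.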
Credentials • Not just anybody can use any resource at any time. . . • Key concepts: • Authentication: verification of an identity • Authorization: permission to do something
Authentication • When Frieda says "I am Frieda", how do we tell that apart from Frieda saying "I am George Bush"?
Authentication • Bush can do whatever he pleases • If Frieda claims to be Bush, and the claim is accepted, then Frieda can do whatever she pleases • Authentication attempts to verify the identity of the entity that is communicating
Authorization • Who is allowed (permitted) to do what • Frieda may run gt4 jobs on the Open Science Grid machines • Fred may write to files in /usr/bin • the Unix user root may do anything! • Can be implemented with a list of those authorized
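An authorization list can be as small as a table mapping identities to permitted actions. The Python sketch below uses the examples from this slide. The action strings and the `is_authorized` helper are hypothetical, and the sketch assumes the identity has already been authenticated.

```python
# Hypothetical authorization list built from the examples on the slide.
AUTHORIZED = {
    "frieda": {"run gt4 jobs on the Open Science Grid"},
    "fred": {"write to /usr/bin"},
}


def is_authorized(user, action):
    """Return True if an (already authenticated) user may perform the action."""
    if user == "root":          # the Unix user root may do anything
        return True
    return action in AUTHORIZED.get(user, set())


print(is_authorized("fred", "write to /usr/bin"))    # prints: True
print(is_authorized("frieda", "write to /usr/bin"))  # prints: False
```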
Condor and Authentication Authentication within Condor comes in many forms. Here are three. • File system: Have the entity write a file. The OS attaches a name to the file owner. Condor checks that the entity’s claim is the same as the file owner. • GSI (Grid Security Infrastructure) • Kerberos
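The file system method can be imitated in a few lines of Python. This is a sketch of the idea only, not Condor's implementation: one process plays both the checker and the entity, the helper names are hypothetical, and a Unix-like system is assumed.

```python
import os
import tempfile


def fs_authenticate(claimed_uid, write_challenge):
    """Return True if the file written by the entity is owned by claimed_uid."""
    with tempfile.TemporaryDirectory() as directory:
        path = os.path.join(directory, "challenge")
        write_challenge(path)  # the entity backs its claim by creating the file
        # The operating system, not the entity, records who owns the file.
        return os.stat(path).st_uid == claimed_uid


def touch(path):
    """Create an empty file as the current user."""
    with open(path, "w"):
        pass


# An honest claim: the claimed identity matches the identity that wrote the file.
print(fs_authenticate(os.geteuid(), touch))
```

The check works because the entity cannot choose the owner recorded on the file; a false claim yields a mismatch.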
Authentication Idea • A centralized certificate authority (CA) verifies an entity's identity. • When satisfied, the CA issues a signed certificate (also called a credential) • (Diagram: the CA issues a certificate stating "I am Frieda")
Authentication • To authenticate, the entity presents the certificate • All is well if we trust the CA and the remote machine • (Diagram: the certificate stating "I am Frieda" is presented)
GSI Authentication • GSI uses X.509 certificates • Grid universe jobs submitted to back end types using Globus middleware (gt2, gt3, gt4), as well as nordugrid and unicore, use X.509 certificates • Condor can also use GSI
Revocation, Trust, and Proxies • The CA may revoke a credential • Frieda gives the signed credential to the remote machine. If the remote machine is malicious, it could impersonate Frieda. Therefore, a password protects the credential. • A proxy is a credential that includes the password, but is only valid for a specific (short) time period. • MyProxy software enables GSI proxy management
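The interplay of revocation and short-lived proxies can be sketched as a toy model in Python. This is not how X.509 or GSI is implemented: the `Credential` class, its fields, and the 12-hour lifetime are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Credential:
    subject: str
    serial: int          # serial of the CA-issued certificate this derives from
    not_after: datetime  # end of the validity period


def make_proxy(cred, now, lifetime=timedelta(hours=12)):
    """Derive a short-lived credential that never outlives its parent."""
    return Credential(cred.subject, cred.serial, min(cred.not_after, now + lifetime))


def is_usable(cred, now, revoked_serials):
    """A credential is usable if it is neither revoked nor expired."""
    return cred.serial not in revoked_serials and now <= cred.not_after


now = datetime(2006, 1, 1, 12, 0)
certificate = Credential("frieda", 7, datetime(2007, 1, 1))
proxy = make_proxy(certificate, now)

print(is_usable(proxy, now, set()))                      # prints: True
print(is_usable(proxy, now + timedelta(days=1), set()))  # prints: False
print(is_usable(proxy, now, {7}))                        # prints: False
```

A stolen proxy is useful only until it expires, which limits the damage a malicious remote machine can do.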