
Basic Grid Projects

Basic Grid Projects. Sathish Vadhiyar. Sources/Credits: Project web pages. Condor. Condor Motivation. Most of the cycles (70%) of workstation pools are underutilized High throughput computing – Large amounts of processing capacity over long periods of time



  1. Basic Grid Projects Sathish Vadhiyar Sources/Credits: Project web pages

  2. Condor

  3. Condor Motivation • Most of the cycles (70%) of workstation pools are underutilized • High throughput computing – large amounts of processing capacity delivered over long periods of time • In contrast to High Performance Computing • Supports systems with distributed ownership • Owners specify access policies

  4. Condor Features • Specialized workload management system • Provides a job queueing mechanism, scheduling policy, resource monitoring, and resource management • Can effectively harness wasted CPU power from otherwise idle desktop workstations • Can checkpoint and migrate a job to a different machine

  5. Condor Architecture – daemons / processes • Master • Startd • Represents a machine to the Condor pool • Implements the owner’s access control policies • Starts, stops, and suspends jobs • Runs on executing machines • Starter • Spawned by startd for a job • Coordinates with the job

  6. Condor Architecture - daemons • Schedd • Represents jobs to the Condor pool • Maintains persistent queues of users’ requests • Runs on submit machines • Shadow • Similar to starter functionality, but runs on the submit machine • Job specific • Manager • Collector • Collects machine and resource information from all other daemons • Answers queries • Negotiator • Retrieves information from the collector • Does matchmaking

  7. Job Submission Steps

  8. Idle capacity utilization • When owner returns, Condor checkpoints and migrates jobs

  9. classads • Language used in Condor • For describing jobs, workstations, and other resources • Mapping from attribute names to expressions • Used by condor central manager to decide on job scheduling

  10. Steps • Step 1 – entities describe their characteristics through classads, using constraints and ranks to state requirements and preferences. Properties of the other party are accessed through the field other • Step 2 – matchmaker matches different classads • Step 3 – matchmaker notifies matched entities • Step 4 – matched entities establish allocation
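The four steps above can be sketched in a few lines of Python. This is only an illustration of two-sided matching: ads are plain dictionaries, and the `requirements` and `rank` entries are Python callables that receive the ad itself and the `other` ad. Real ClassAds use their own expression language, and every attribute name and value below (`memory_needed`, `keyboard_idle`, `ws1`, and so on) is made up for the example.

```python
# Toy matchmaker (not Condor's implementation). Each ad is a dict whose
# "requirements" and "rank" entries are callables taking (my_ad, other_ad).

def matches(ad_a, ad_b):
    """Two ads match only if each side's requirements accept the other side."""
    return ad_a["requirements"](ad_a, ad_b) and ad_b["requirements"](ad_b, ad_a)


def matchmake(job, machines):
    """Return the compatible machine that the job ranks highest, or None."""
    candidates = [m for m in machines if matches(job, m)]
    if not candidates:
        return None
    return max(candidates, key=lambda m: job["rank"](job, m))


def owner_is_away(my, other):
    """Hypothetical owner policy: accept jobs only after 15 idle minutes."""
    return my["keyboard_idle"] > 900


# Step 1: entities advertise characteristics, constraints, and preferences.
job = {
    "owner": "alice",
    "memory_needed": 512,
    "requirements": lambda my, other: (
        other["arch"] == "INTEL" and other["memory"] >= my["memory_needed"]
    ),
    "rank": lambda my, other: other["mips"],  # prefer faster machines
}

machines = [
    {"name": "ws1", "arch": "INTEL", "memory": 1024, "mips": 100,
     "keyboard_idle": 1200, "requirements": owner_is_away},
    {"name": "ws2", "arch": "INTEL", "memory": 2048, "mips": 300,
     "keyboard_idle": 10, "requirements": owner_is_away},
    {"name": "ws3", "arch": "SPARC", "memory": 4096, "mips": 500,
     "keyboard_idle": 5000, "requirements": owner_is_away},
]

# Steps 2 and 3: the matchmaker pairs ads and reports the match.
best = matchmake(job, machines)
print(best["name"])
```

Here `ws2` is faster but its owner is at the keyboard, and `ws3` has the wrong architecture, so only `ws1` satisfies both sides. Step 4 (claiming) is left out: in the protocol described on these slides, the matched entities establish the allocation themselves.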

  11. ClassAds

  12. More on ClassAds • Resource owners and customers can dynamically define their own models – suitable for a distributed setting • Matching and claiming are 2 distinct operations • 5 components of the matchmaking framework • classAd specification • advertising protocol • matchmaking algorithm • matchmaking protocol • claiming protocol • Constraints, i.e. queries, may be expressed as attributes of a classAd • classAd definition – mapping from attribute names to expressions

  13. Examples

  14. Examples

  15. ClassAds Steps

  16. Checkpointing • Checkpointing is used to vacate a job from one workstation and resume it on another idle workstation • A checkpoint library is linked with the program’s code • Stores the UNIX process state, including text, stack, and data segments, files, pointers, etc. • Uses simple mechanisms including setjmp and longjmp • Also provides periodic checkpointing • Works only with homogeneous systems

  17. Checkpointing • For scheduling and fault tolerance • Checkpoint library uses signals • Checkpoint contains the process's data and stack segments, information about open files, pending signals, and CPU state. • Checkpoints either stored on local disk of submitting machine or on checkpoint servers • Also provides a user interface • Transparent checkpointing • Remote files obtained from shadow process agent during migration
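The idea of periodic checkpointing and resumption can be illustrated at application level. The sketch below is an analogy only: Condor's checkpoint library saves the whole process image transparently, whereas this example explicitly saves a small state dictionary with `pickle`. The function name `run`, the `stop_after` eviction switch, and the file name `job.ckpt` are all assumptions made for the illustration.

```python
import os
import pickle
import tempfile


def run(total_steps, ckpt_path, checkpoint_every=100, stop_after=None):
    """Sum 0..total_steps-1, saving state periodically.

    If a checkpoint file exists, the computation resumes from it.
    `stop_after` simulates the owner returning and the job being evicted.
    """
    if os.path.exists(ckpt_path):
        with open(ckpt_path, "rb") as f:
            state = pickle.load(f)          # resume from the last checkpoint
    else:
        state = {"step": 0, "acc": 0}       # fresh start

    executed = 0
    while state["step"] < total_steps:
        if stop_after is not None and executed >= stop_after:
            return None                     # evicted; work since last checkpoint is lost
        state["acc"] += state["step"]
        state["step"] += 1
        executed += 1
        if state["step"] % checkpoint_every == 0:
            with open(ckpt_path, "wb") as f:
                pickle.dump(state, f)       # periodic checkpoint
    return state["acc"]


ckpt = os.path.join(tempfile.mkdtemp(), "job.ckpt")
first = run(1000, ckpt, stop_after=250)     # evicted after 250 steps
result = run(1000, ckpt)                    # restarts from step 200, not step 0
print(first, result)
```

The second call redoes only the 50 steps performed after the last checkpoint, which is the trade-off behind periodic checkpointing: more frequent checkpoints cost more I/O but lose less work on eviction.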

  18. DAGMan • Meta scheduler for Condor • Manages dependencies between jobs at a higher level • Sits on top of Condor • Input of one program depends on the other

  19. Example input file for DAGMan
      # Filename: diamond.dag
      #
      Job A A.condor
      Job B B.condor
      Job C C.condor
      Job D D.condor
      Script PRE A top_pre.csh
      Script PRE B mid_pre.perl $JOB
      Script POST B mid_post.perl $JOB $RETURN
      Script PRE C mid_pre.perl $JOB
      Script POST C mid_post.perl $JOB $RETURN
      Script PRE D bot_pre.csh
      PARENT A CHILD B C
      PARENT B C CHILD D
      Retry C 3
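To see what the PARENT/CHILD lines of the diamond example imply, the dependency structure can be resolved into "waves" of jobs that are free to run together. The Python sketch below is not DAGMan itself: it reads only the `Job` and `PARENT ... CHILD ...` lines, ignores `Script` and `Retry`, and the helper names `parse_dag` and `submission_waves` are hypothetical.

```python
from collections import defaultdict

DAG_TEXT = """\
# Filename: diamond.dag
Job A A.condor
Job B B.condor
Job C C.condor
Job D D.condor
PARENT A CHILD B C
PARENT B C CHILD D
Retry C 3
"""


def parse_dag(text):
    """Return (job names, parent -> children map); other keywords are skipped."""
    jobs, children = [], defaultdict(list)
    for line in text.splitlines():
        words = line.split()
        if not words or words[0].startswith("#"):
            continue
        keyword = words[0].upper()
        if keyword == "JOB":
            jobs.append(words[1])
        elif keyword == "PARENT":
            split = [w.upper() for w in words].index("CHILD")
            for parent in words[1:split]:
                for child in words[split + 1:]:
                    children[parent].append(child)
    return jobs, children


def submission_waves(jobs, children):
    """Group jobs so that each wave depends only on earlier waves."""
    pending = {job: 0 for job in jobs}          # unfinished parents per job
    for kids in children.values():
        for child in kids:
            pending[child] += 1
    wave = [job for job in jobs if pending[job] == 0]
    waves = []
    while wave:
        waves.append(wave)
        next_wave = []
        for job in wave:
            for child in children.get(job, []):
                pending[child] -= 1
                if pending[child] == 0:
                    next_wave.append(child)
        wave = next_wave
    return waves


jobs, children = parse_dag(DAG_TEXT)
waves = submission_waves(jobs, children)
print(waves)
```

For the diamond, A runs first, B and C can run concurrently once A finishes, and D waits for both.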

  20. References / Sources / Credits • Condor manual • Condor web pages • Michael Litzkow, Todd Tannenbaum, Jim Basney, and Miron Livny, "Checkpoint and Migration of UNIX Processes in the Condor Distributed Processing System", University of Wisconsin-Madison Computer Sciences Technical Report #1346, April 1997. • James Frey, Todd Tannenbaum, Ian Foster, Miron Livny, and Steven Tuecke, "Condor-G: A Computation Management Agent for Multi-Institutional Grids", Proceedings of the Tenth IEEE Symposium on High Performance Distributed Computing (HPDC10) San Francisco, California, August 7-9, 2001. • Rajesh Raman, Miron Livny, and Marvin Solomon, "Matchmaking: Distributed Resource Management for High Throughput Computing", Proceedings of the Seventh IEEE International Symposium on High Performance Distributed Computing, July 28-31, 1998, Chicago, IL. • Michael Litzkow, Miron Livny, and Matt Mutka, "Condor - A Hunter of Idle Workstations", Proceedings of the 8th International Conference of Distributed Computing Systems, pages 104-111, June, 1988.

  21. Globus • Open source toolkit used for building Grids • Software for • Security (GSI) • Information infrastructure (MDS) • Resource management (GRAM, job manager, gatekeeper) • Data management (GridFTP, DataGrid) • Communication (Nexus) • Fault detection, and • Portability • Now moving to web services - OGSA

  22. Timeline • I-WAY experiment – 1994 • Formal beginning - 1996 • 1st version – 1997 • Version 1.0 – 1998 • 2.0 – 2002 • 3.0 – latest • Show GT2 history powerpoint

  23. GT4 Planned architecture

  24. Grid Security Infrastructure (GSI) • Supports security across organizations. Not centrally managed • Single sign-on – delegation of credentials • Digital signatures based on public key cryptography for verification of messages

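  25. Verification of messages / digital certificates • Sender: computes Hash(message), encrypts the hash with its private key, and sends the encrypted hash + message • Receiver: computes Hash1 = hash(message) and Hash2 = decrypt(encrypted hash) using the sender’s public key • If Hash1 = Hash2, the message is verified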
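The hash-then-encrypt check on this slide can be tried out with textbook RSA. The sketch below is deliberately a toy: it uses two small Mersenne primes, no padding, and a digest reduced modulo the key, so it must not be used for real security. The key values and the function names `sign` and `verify` are assumptions for illustration; GSI itself relies on X.509 and SSL libraries.

```python
import hashlib

# Toy RSA key built from two known Mersenne primes (far too small for real use).
P, Q = 2**31 - 1, 2**61 - 1
N = P * Q                                   # public modulus
E = 65537                                   # public exponent
D = pow(E, -1, (P - 1) * (Q - 1))           # private exponent (Python 3.8+)


def digest(message: bytes) -> int:
    """Hash the message and reduce it so it fits below the toy modulus."""
    return int.from_bytes(hashlib.sha256(message).digest(), "big") % N


def sign(message: bytes) -> int:
    """Sender: 'encrypt' the hash with the private key."""
    return pow(digest(message), D, N)


def verify(message: bytes, signature: int) -> bool:
    """Receiver: Hash1 = hash(message), Hash2 = decrypt(signature); compare."""
    hash1 = digest(message)
    hash2 = pow(signature, E, N)
    return hash1 == hash2


message = b"submit job 42"
signature = sign(message)
print(verify(message, signature))           # signature matches the message
print(verify(b"submit job 43", signature))  # altered message fails the check
```

Because only the holder of the private exponent can produce a value that decrypts to the right hash, a successful comparison shows both who sent the message and that it was not modified in transit.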

  26. GSI • Every resource identified by a certificate. • Certificate provided and signed by CA. • Certificate = resource identity + public key of resource + certificate authority + digital signature of CA • Uses SSL for mutual authentication • Parties trust CA’s – possess CA’s public keys

  27. Mutual Authentication (parties: A, B, and the CA) • A to B: I want to communicate. This is my certificate. • B: Did the CA sign the certificate, or has the certificate been tampered with? Verify the digital signature. • B: OK, the CA signed the certificate. Are you really A, or did you steal the certificate from A? Send a random message.
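One direction of this exchange can be sketched as follows. The slide stops at "Send a random message"; the sketch assumes the usual continuation, in which A transforms the random message with its private key and B checks the result with the public key from A's certificate. Everything here is a toy under stated assumptions: textbook RSA with small Mersenne primes, a certificate reduced to a dictionary, and made-up names such as `issue_certificate` and `authenticate`. For mutual authentication the same procedure would be repeated with the roles swapped.

```python
import hashlib
import secrets

E = 65537  # shared public exponent for all toy keys


def make_key(p, q):
    """Textbook RSA key from two primes: (modulus, private exponent)."""
    return p * q, pow(E, -1, (p - 1) * (q - 1))   # needs Python 3.8+


def digest(data: bytes, n: int) -> int:
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % n


CA_N, CA_D = make_key(2**89 - 1, 2**107 - 1)      # certificate authority
A_N, A_D = make_key(2**31 - 1, 2**61 - 1)         # party A


def issue_certificate(subject, public_n):
    """CA signs (identity + public key); the result stands in for a certificate."""
    body = ("%s|%d" % (subject, public_n)).encode()
    return {"subject": subject, "public_n": public_n,
            "ca_signature": pow(digest(body, CA_N), CA_D, CA_N)}


def certificate_is_valid(cert):
    """B trusts the CA, i.e. holds the CA's public key (CA_N, E)."""
    body = ("%s|%d" % (cert["subject"], cert["public_n"])).encode()
    return pow(cert["ca_signature"], E, CA_N) == digest(body, CA_N)


def authenticate(cert, respond):
    """B's side: check the CA signature, then challenge the peer."""
    if not certificate_is_valid(cert):
        return False                               # tampered or not CA-signed
    challenge = secrets.randbelow(cert["public_n"] - 2) + 2
    answer = respond(challenge)                    # peer uses its private key
    return pow(answer, E, cert["public_n"]) == challenge


cert_a = issue_certificate("A", A_N)


def real_a(challenge):
    return pow(challenge, A_D, A_N)                # A owns the private key


thief_n, thief_d = make_key(2**17 - 1, 2**19 - 1)


def thief(challenge):
    return pow(challenge % thief_n, thief_d, thief_n)   # has the cert, not the key


forged = dict(cert_a, subject="Mallory")           # edited after signing

print(authenticate(cert_a, real_a), authenticate(cert_a, thief),
      authenticate(forged, real_a))
```

The first check answers "did the CA sign this certificate?", and the challenge answers "are you really A?": a stolen certificate is useless without the matching private key.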

  28. Authentication with Proxy and delegation • The owner’s private key is stored in an encrypted file, which needs a passphrase • Proxy and delegation – more convenience and less security • Also for dynamic delegation and dynamic entities • Owner signs the proxy certificate • The proxy’s private key is stored in an unencrypted file, since proxies are for short durations • Chain of trust is established

  29. Mutual Authentication with Proxy B A’s proxy Proxy’s certificate. A’s certificate First validate proxy’s certificate and then owner’s certificate

  30. GSS API • GSI implemented on top of GSS-API • GSS API provides both transport and mechanism independence. • Provides functions for obtaining credentials, performing authentication, signing messages and encrypting messages • GSI – X.509 public key certification, public key infrastructure, SSL protocol, X.509 proxy certificates

  31. X.509 Proxy Certificates • To allow users to: • Create identities for new entities dynamically and in a lightweight manner • Delegate privileges to those entities dynamically • Perform single sign-on • Allows for the reuse of existing protocols • Proxy certificate • Subject name (identity) – scoped by the subject name of the issuer – subject name of the issuer + RDN (Relative Distinguished Name) + serial number • Public key – different from the subject’s public key • PCI – Proxy Certificate Information – policy method identifier + policy field

  32. Proxies

  33. Single sign-on and Proxies

  34. Delegation over Network

  35. Globus 3.2 - current • WS complies with OGSI 1.0 • New component – CAS (Community Authorization Service), XIO • Other components: WS RFT (Reliable File Transfer Service), Grid FTP, RLS (Replica Location Service), GRAM / WS MJS (Managed Job Service) / job manager

  36. GridFTP • GSI and Kerberos security on control and data channels with various levels of confidentiality and integrity • Multiple data channels for parallel transfers – using multiple TCP streams in parallel to improve aggregate bandwidth • Partial file transfers • Third-party (direct server-to-server) transfers by adding GSSAPI security to the existing third-party data transfers in the FTP standard – transfers between 2 servers mediated by a third-party client • GSSAPI operations authenticate the third party to the source and destination machines of the data transfer • Authenticated data channels • Reusable data channels • Command pipelining • Striped data transfers • Automatic negotiation of TCP buffer/window sizes • 2 libraries: • globus_ftp_control_library – implements the control channel API • globus_ftp_client_library – implements the GridFTP API • Plugin mechanisms for fault tolerance, performance monitoring, and extended data processing
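The parallel and partial transfer features both rest on addressing a file by byte ranges. The Python sketch below illustrates only that idea, with an in-memory byte string standing in for the file and threads standing in for data channels; it does not speak the GridFTP protocol, and `split_ranges` and `parallel_copy` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def split_ranges(size, streams):
    """Divide [0, size) into at most `streams` contiguous (offset, length) blocks."""
    base, extra = divmod(size, streams)
    ranges, offset = [], 0
    for i in range(streams):
        length = base + (1 if i < extra else 0)
        if length:
            ranges.append((offset, length))
        offset += length
    return ranges


def parallel_copy(source: bytes, streams=4):
    """Copy `source` block by block, one worker per simulated data channel."""
    dest = bytearray(len(source))

    def move(block):
        offset, length = block
        # Each worker writes a disjoint slice, like one stream of a parallel
        # transfer; fetching a single block alone would be a partial transfer.
        dest[offset:offset + length] = source[offset:offset + length]

    with ThreadPoolExecutor(max_workers=streams) as pool:
        list(pool.map(move, split_ranges(len(source), streams)))
    return bytes(dest)


payload = bytes(range(256)) * 40
copied = parallel_copy(payload, streams=4)
print(split_ranges(10, 4), copied == payload)
```

Because the blocks are disjoint and carry their own offsets, they can arrive in any order and over any number of streams, which is what lets multiple TCP streams raise aggregate bandwidth.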

  37. RFT (Reliable File Transfer) • Treat movement of multiple files as a single job • Accept transfer requests and reliably manage requests • OGSI compliant • To transfer data reliably between two GridFTP servers • Uses Grid Service Handles (GSH) • Acts as a proxy for the user, acts as client on user’s behalf for third-party transfers

  38. RFT • Client submits a SOAP description of the data transfer job • Maintains checkpoints in a database • Supports both “push” and “pull” mechanisms

  39. GRAM • GRAM simplifies the use of remote systems by providing a single standard interface for requesting and using remote system resources for the execution of "jobs". The most common use (and the best supported use) of GRAM is remote job submission and control. This is typically used to support distributed computing applications • For remote job submission and resource management

  40. GRAM • Provides interfaces to local job scheduling mechanisms • Provides mechanisms to map GSI identities to local user accounts • Processes the requests for resources for remote application execution, allocates the required resources, and manages the active jobs • Also returns updated information regarding the capabilities and availability of the computing resources to the Metacomputing Directory Service (MDS) • Provides an API for submitting and canceling a job request, as well as checking the status of a submitted job. The specifications are written by the user in the Resource Specification Language (RSL) and are processed by GRAM as part of the job request.

  41. GRAM • A Gatekeeper runs on the remote host • Creates jobmanager for the job • Gatekeeper: • mutually authenticates with the client, • maps the requestor to a local user, • starts a job manager on the local host as the local user, and • passes the allocation arguments to the newly created job manager. • Jobmanager: • Common component • Machine-specific component
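The "maps the requestor to a local user" step can be pictured as a lookup from certificate subject to account name. The sketch below assumes the mapping is kept in a text file of `"subject" account` lines, in the style of a Globus grid-mapfile; the entries, the function names, and the error handling are made up for the illustration and the real gatekeeper does considerably more.

```python
import shlex

# Hypothetical mapping entries: quoted certificate subject, then local account.
GRID_MAPFILE = """\
# "certificate subject" local_account
"/O=Grid/OU=Example/CN=Alice Smith" alice
"/O=Grid/OU=Example/CN=Bob Jones" bjones
"""


def load_gridmap(text):
    """Parse the mapping text into {subject: local account}."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        subject, account = shlex.split(line)   # shlex keeps the quoted subject whole
        mapping[subject] = account
    return mapping


def gatekeeper_accept(subject, gridmap):
    """Return the local account for an authenticated subject, or refuse."""
    account = gridmap.get(subject)
    if account is None:
        raise PermissionError("no local account mapped for " + subject)
    return account   # the job manager would then be started as this user


gridmap = load_gridmap(GRID_MAPFILE)
print(gatekeeper_accept("/O=Grid/OU=Example/CN=Alice Smith", gridmap))
```

The lookup happens only after mutual authentication has succeeded, so the subject name being mapped has already been verified against a CA-signed certificate.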

  42. GRAM RSL attributes • (directory=value) • (executable=value) • (arguments=value [value] [value] ...) • (jobType=single|multiple|mpi|condor) • (count=value) • (hostCount=value) • (two_phase=<int>) • (restart=<old JM contact>)

  43. DUROC RSL attributes • Label • resourceManagerContact • subjobCommsType • subjobStartType

  44. Example (executable = a.out) (directory = /home/nobody ) (arguments = arg1 "arg 2") (count = 1)

  45. DUROC • Dynamically-Updated Request Online Coallocator • coallocator is used to coordinate transactions with each of the RMs and bring up the distributed pieces of the job • +(&(resourceManagerContact=RM1)(count=3)(executable=myprog.sparc))(&(resourceManagerContact=RM2)(count=2)(executable=myprog.rs6000))
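The RSL strings on slides 44 and 45 are regular enough to generate from ordinary data structures. The following Python sketch builds a single request as a conjunction (`&`) of `(attribute=value)` relations and a DUROC multi-request (`+`) from a list of such requests. It is a minimal string builder under stated assumptions: values containing spaces are wrapped in double quotes, embedded quotes are not handled, and the helper names are hypothetical rather than part of any Globus API.

```python
def quote(value):
    """Quote a value that contains whitespace; other values are left bare."""
    text = str(value)
    return '"%s"' % text if any(ch.isspace() for ch in text) else text


def to_rsl(attrs):
    """Build '&(name=value)...' from a dict; list values become several values."""
    relations = []
    for name, value in attrs.items():
        values = value if isinstance(value, (list, tuple)) else [value]
        relations.append("(%s=%s)" % (name, " ".join(quote(v) for v in values)))
    return "&" + "".join(relations)


def to_multirequest(subjobs):
    """Build a '+(&...)(&...)' multi-request with one sub-request per resource manager."""
    return "+" + "".join("(%s)" % to_rsl(subjob) for subjob in subjobs)


single = to_rsl({
    "executable": "a.out",
    "directory": "/home/nobody",
    "arguments": ["arg1", "arg 2"],
    "count": 1,
})

multi = to_multirequest([
    {"resourceManagerContact": "RM1", "count": 3, "executable": "myprog.sparc"},
    {"resourceManagerContact": "RM2", "count": 2, "executable": "myprog.rs6000"},
])

print(single)
print(multi)
```

The multi-request output reproduces the DUROC example on slide 45 character for character, and the single request matches slide 44 apart from spacing and the leading `&`.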

  46. WS GRAM • A set of OGSI compliant services that provide remote job execution • (Master) Managed Job Factory Service (MJFS) • Managed Job Service (MJS) • File Stream Factory Service (FSFS) • File Stream Service (FSS) • Resource Specification Language (RSL-2) schema is used to communicate job requirements • Remote jobs run under the local user’s account • Client to service credential delegation is done user to user, *not* through a third party
