Grid Systems and scheduling

Grid systems
• Many!
• Classification (depends on the author):
  • Computational grid:
    • distributed supercomputing (parallel application execution on multiple machines)
    • high throughput (stream of jobs)
  • Data grid: provides a way to solve large-scale data management problems
  • Service grid: systems that provide services not provided by any single local machine
    • on demand: aggregate resources to enable new services
    • collaborative: connect users and applications via a virtual workspace
    • multimedia: infrastructure for real-time multimedia applications

Taxonomy of Applications
• Distributed supercomputing: consumes CPU cycles and memory
• High-throughput computing: uses otherwise unused processor cycles
• On-demand computing: meets short-term requirements for resources that cannot be cost-effectively or conveniently located locally
• Data-intensive computing
• Collaborative computing: enables and enhances human-to-human interactions (e.g., the CAVE5D system supports remote, collaborative exploration of large geophysical data sets and the models that generated them)

Alternative classification
• independent tasks
• loosely coupled tasks
• tightly coupled tasks

Application Management
• Description
• Partitioning
• Mapping
• Allocation
[Figure: application; partitioning, mapping, allocation, management; grid node A, grid node B]

Description
• Use a grid application description language
• Examples: Grid-ADL and GEL
• Loop constructs make it possible to apply compiler mechanisms for vectorization

Grid-ADL
[Figure: traditional systems (1, 2, 5, 6) compared with alternative systems (1 .. 2, 5, 6)]

Partitioning/Clustering
• Application represented as a graph
  • Nodes: jobs
  • Edges: precedence
• Graph partitioning techniques:
  • Minimize communication
  • Increase throughput or speedup
  • Need good heuristics
• Clustering

Graph Partitioning
• Optimally allocating the components of a distributed program over several machines
• Communication between machines is assumed to be the major factor in application performance
• NP-hard for the case of 3 or more terminals

Graph partitioning and cut set
• The partition of the program onto machines that minimizes interprocessor communication corresponds to the minimal cut set of the graph
• Finding a minimal cut set is an NP-hard problem (for 3 or more machines)
• Hence the use of heuristics
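
To make the cut-set idea concrete, here is a minimal Python sketch. The graph, the node names, and both helper functions are illustrative assumptions, not taken from the cited work. The cost of an assignment is the total weight of the edges whose endpoints land on different machines; the exhaustive search over all assignments is only feasible for tiny graphs, which is why heuristics are needed in practice.

```python
from itertools import product

def cut_cost(edges, assignment):
    """Sum the weights of edges whose endpoints sit on different machines."""
    return sum(w for (u, v), w in edges.items() if assignment[u] != assignment[v])

def brute_force_min_cut(edges, tasks, machines):
    """Try every task-to-machine assignment (exponential in the number of tasks)."""
    best_cost, best_assignment = float("inf"), None
    for choice in product(machines, repeat=len(tasks)):
        assignment = {m: m for m in machines}  # machine nodes are pinned to themselves
        assignment.update(zip(tasks, choice))
        cost = cut_cost(edges, assignment)
        if cost < best_cost:
            best_cost, best_assignment = cost, assignment
    return best_cost, best_assignment

# Hypothetical example: two tasks (a, b) and two machine nodes (M1, M2).
edges = {("a", "M1"): 5, ("a", "b"): 4, ("b", "M2"): 3, ("b", "M1"): 1}
cost, assignment = brute_force_min_cut(edges, ["a", "b"], ["M1", "M2"])
print(cost, assignment)
```

In this example the cheapest assignment places both tasks on M1 and cuts only the edge between b and M2.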

Basic concept: Collapse the graph
• Given G = {N, E, M}
  • N is the set of nodes
  • E is the set of edges
  • M is the set of machine nodes
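
Collapsing an edge can be sketched as merging its two end nodes and summing the weights of any edges that become parallel. The adjacency-dictionary representation and the function name `collapse` below are assumptions made for illustration.

```python
def collapse(adjacency, keep, absorb):
    """Merge node `absorb` into node `keep`, summing weights of parallel edges.

    adjacency maps node -> {neighbour: weight}; the graph is undirected,
    so every edge is stored in both directions.
    """
    for neighbour, weight in adjacency.pop(absorb).items():
        del adjacency[neighbour][absorb]
        if neighbour == keep:
            continue  # the collapsed edge itself disappears
        merged = adjacency[keep].get(neighbour, 0) + weight
        adjacency[keep][neighbour] = merged
        adjacency[neighbour][keep] = merged
    return adjacency

# Hypothetical graph: machine nodes M1, M2 and task nodes a, b.
graph = {
    "M1": {"a": 6, "b": 2},
    "M2": {"b": 3},
    "a": {"M1": 6, "b": 5},
    "b": {"M1": 2, "M2": 3, "a": 5},
}
collapse(graph, "M1", "a")
print(graph)
```

After collapsing a into M1, the edges M1-b (weight 2) and a-b (weight 5) merge into a single M1-b edge of weight 7.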

Heuristic: Dominant Edge
• Take a node n and its heaviest edge e
• Edges e1, e2, …, er with opposite end nodes not in M
• Edges e’1, e’2, …, e’k with opposite end nodes in M
• If w(e) ≥ Sum(w(ei)) + max(w(e’1), …, w(e’k))
• then the min-cut does not contain e
• so e can be collapsed
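
The dominant-edge test can be sketched in Python as follows. This is a minimal sketch under two assumptions that the slide does not state: the graph is stored as an adjacency dictionary, and the heaviest edge e is excluded from both the sum and the maximum. The function name is hypothetical.

```python
def dominant_edge(node, adjacency, machines):
    """Return the neighbour across the heaviest edge of `node` if that edge
    passes the dominant-edge test, otherwise None."""
    neighbours = adjacency[node]
    if not neighbours:
        return None
    heaviest = max(neighbours, key=neighbours.get)
    # Assumption: the remaining edges exclude the heaviest edge itself.
    others = {v: w for v, w in neighbours.items() if v != heaviest}
    non_machine_sum = sum(w for v, w in others.items() if v not in machines)
    machine_max = max((w for v, w in others.items() if v in machines), default=0)
    if neighbours[heaviest] >= non_machine_sum + machine_max:
        return heaviest  # the min-cut does not contain this edge
    return None

machines = {"M1", "M2"}
passes = dominant_edge("n", {"n": {"a": 10, "b": 3, "M1": 4, "M2": 2}}, machines)
fails = dominant_edge("n", {"n": {"a": 5, "b": 3, "M1": 4}}, machines)
print(passes, fails)
```

In the first call, 10 ≥ 3 + 4, so the edge to a can be collapsed; in the second, 5 < 3 + 4, so the test gives no guarantee.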

Another heuristic: Machine Cut
• Let the machine cut Mi be the set of all edges between a machine mi and the non-machine nodes N
• Let Wi be the sum of the weights of all edges in the machine cut Mi
• Sort the Wi so that W1 ≥ W2 ≥ …
• Any edge with a weight greater than W2 cannot be part of the min-cut
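
A minimal Python sketch of the machine-cut rule, assuming an adjacency-dictionary graph and at least two machines (the function name and the example graph are hypothetical):

```python
def machine_cut_candidates(adjacency, machines):
    """Return W2 and the set of edges heavier than W2 (safe to collapse)."""
    cut_weights = sorted(
        (sum(w for v, w in adjacency[m].items() if v not in machines) for m in machines),
        reverse=True,
    )
    w2 = cut_weights[1]  # second-largest machine-cut weight; needs two or more machines
    heavy = {
        frozenset((u, v))
        for u, neighbours in adjacency.items()
        for v, w in neighbours.items()
        if w > w2
    }
    return w2, heavy

graph = {
    "M1": {"a": 6, "b": 2},          # W for M1 = 8
    "M2": {"b": 3},                  # W for M2 = 3
    "a": {"M1": 6, "b": 5},
    "b": {"M1": 2, "M2": 3, "a": 5},
}
w2, heavy = machine_cut_candidates(graph, {"M1", "M2"})
print(w2, heavy)
```

Here W1 = 8 and W2 = 3, so the edges M1-a (6) and a-b (5) cannot be in the min-cut and can be collapsed.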

Yet another heuristic: Zeroing
• Assume that node n has edges to each of the m machines in M, with weights w1 ≤ w2 ≤ … ≤ wm
• Reducing the weight of each of the m edges from n to the machines in M by w1 does not change the assignment of nodes in the min-cut
• It reduces the cost of the minimum cut by (m-1)w1
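
The zeroing step can be sketched as below, again assuming an adjacency-dictionary graph; the function name `zero_node` is hypothetical. Whichever machine n ends up on, it must cut its edges to the other m-1 machines, so subtracting w1 from every machine edge lowers every candidate cut by the same amount, (m-1)w1.

```python
def zero_node(adjacency, node, machines):
    """Subtract the smallest machine-edge weight of `node` from all of its
    machine edges; return the resulting reduction in min-cut cost."""
    if any(m not in adjacency[node] for m in machines):
        return 0  # the rule only applies when node touches every machine
    w1 = min(adjacency[node][m] for m in machines)
    for m in machines:
        adjacency[node][m] -= w1
        adjacency[m][node] -= w1
    return (len(machines) - 1) * w1

machines = ["M1", "M2", "M3"]
graph = {
    "n": {"M1": 2, "M2": 5, "M3": 7},
    "M1": {"n": 2},
    "M2": {"n": 5},
    "M3": {"n": 7},
}
reduction = zero_node(graph, "n", machines)
print(reduction, graph["n"])
```

Before zeroing, the best choice (n on M3) costs 2 + 5 = 7; afterwards it costs 0 + 3 = 3, a reduction of (3-1)·2 = 4.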

Heuristics: Order of Application
• If the previous three techniques are applied repeatedly to a graph until none of them is applicable, the resulting reduced graph is independent of the order in which the techniques were applied

Output
• List of nodes collapsed into each of the machine nodes
• Weights of the edges connecting the machine nodes
• Source: Karin Hogstedt, Doug Kimelman, VT Rajan, Tova Roth, and Mark Wegman, "Graph Cutting Algorithms for Distributed Applications Partitioning", ACM SIGMETRICS, v. 28:4, 2001
• homepages.cae.wisc.edu/~ece556/fall2002/PROJECT/distributed_applications.ppt

Graph partitioning
• Hendrickson and Kolda, 2000: edge cuts
  • are not proportional to the total communication volume
  • try to (approximately) minimize the total volume but not the total number of messages
  • do not minimize the maximum volume and/or number of messages handled by any single processor
  • do not consider the distance between processors (for example, the number of switches a message passes through)
• The undirected graph model can only express symmetric data dependencies

Graph partitioning
• To avoid message contention and improve the overall throughput of message traffic, it is preferable to restrict communication to processors that are near each other
• However, edge cut is appropriate for applications whose graphs have locality and few neighbors

Resource Management (1988)
Source: P. K. V. Mangan, Ph.D. Thesis, 2006

Static scheduling: task precedence graph
DSC: Dominant Sequence Clustering
• Yang and Gerasoulis, 1994: two-step method for scheduling with communication (focus on the critical path):
  • schedule onto an unbounded number of completely connected processors (clusters of tasks);
  • if the number of clusters is larger than the number of available processors, merge clusters until their number matches the number of real processors, taking the network topology into account (merging step).

Kwok and Ahmad, 1999: multiprocessor scheduling taxonomy
"Static Scheduling Algorithms for Allocating Directed Task Graphs to Multiprocessors"

List Scheduling
• Make an ordered list of processes by assigning them priorities
• Repeat the following two steps until a valid schedule is obtained:
  • Select from the list the process with the highest priority.
  • Select a resource to accommodate this process.
• Priorities are determined statically before the scheduling process begins. The first step chooses the process with the highest priority; the second step selects the best possible resource.
• Some known list scheduling strategies:
  • Highest Level First (HLF) algorithm
  • Longest Path (LP) algorithm
  • Longest Processing Time
  • Critical Path Method
• List scheduling algorithms only produce good results for coarse-grained applications
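
The two-step loop can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not a reference implementation: priorities are static levels (longest path to an exit task, in the spirit of HLF), communication costs are ignored, processors are identical, and "best resource" means the one offering the earliest start. All names and the example graph are hypothetical.

```python
from functools import lru_cache

def list_schedule(durations, successors, num_processors):
    """Return {task: (processor, start, finish)} for a task precedence graph."""
    @lru_cache(maxsize=None)
    def level(task):
        # Static priority: longest path from this task to an exit task.
        return durations[task] + max((level(s) for s in successors.get(task, ())), default=0)

    predecessors = {t: set() for t in durations}
    for task, succs in successors.items():
        for s in succs:
            predecessors[s].add(task)

    finish, schedule = {}, {}
    processor_free = [0] * num_processors
    remaining = set(durations)
    while remaining:
        ready = [t for t in remaining if all(p in finish for p in predecessors[t])]
        # Step 1: pick the ready task with the highest priority (ties broken by name).
        task = sorted(ready, key=lambda t: (-level(t), t))[0]
        data_ready = max((finish[p] for p in predecessors[task]), default=0)
        # Step 2: pick the processor that lets the task start earliest.
        proc = min(range(num_processors), key=lambda p: max(processor_free[p], data_ready))
        start = max(processor_free[proc], data_ready)
        finish[task] = start + durations[task]
        processor_free[proc] = finish[task]
        schedule[task] = (proc, start, finish[task])
        remaining.remove(task)
    return schedule

durations = {"A": 2, "B": 3, "C": 1, "D": 2}
successors = {"A": ["B", "C"], "B": ["D"], "C": ["D"]}
schedule = list_schedule(durations, successors, num_processors=2)
print(schedule)
```

For this diamond-shaped graph, B and C run in parallel after A, and D starts once both have finished.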

Graph partitioning
• Kumar and Biswas, 2002: MiniMax
  • multilevel graph partitioning scheme
  • grid-aware
  • considers two weighted undirected graphs:
    • a workload graph (to model the problem domain)
    • a system graph (to model the heterogeneous system)

Resource Management
• The scheduling algorithm has four components:
  • transfer policy: when a node can take part in a task transfer;
  • selection policy: which task must be transferred;
  • location policy: which node to transfer to;
  • information policy: when to collect system state information.

Resource Management
• Location policy:
  • Sender-initiated
  • Receiver-initiated
  • Symmetrically initiated

Scheduling mechanisms for grids
• Berman, 1998 (extended by Kayser, 2006):
  • Job scheduler
  • Resource scheduler
  • Application scheduler
  • Meta-scheduler

Scheduling mechanisms for grids
• Legion
  • University of Virginia (Grimshaw, 1993)
  • Supercomputing 1997
  • Commercialized in 2003 by Avaki

Legion
• is an object-oriented infrastructure for grid environments, layered on top of existing software services (some describe it as a grid-aware operating system)
• uses the existing operating systems, resource management tools, and security mechanisms at host sites to implement higher-level system-wide services
• has a design based on a set of core objects

Legion
• Uses the concept of Context Spaces to implement the objects (processes, file names, etc.)
• ProxyMultiObject: container process used to represent files and contexts residing on one host

LegionFS
• ProxyMultiObject
• Lightweight and distributed

Legion
• Resource management is a negotiation between resources and the active objects that represent the distributed application
• Three steps to allocate resources for a task:
  • Decision: considers the task’s characteristics and requirements, the resource’s properties and policies, and the users’ preferences
  • Enactment: the class object receives an activation request; if the placement is acceptable, it starts the task
  • Monitoring: ensures that the task is operating correctly

Globus
• From version 1.0 in 1998 through the 2.0 release in 2002 and the later 3.0, the emphasis has been on providing a set of components that can be used either independently or together to develop applications
• The Globus Toolkit version 2 (GT2) design is closely related to the architecture proposed by Foster et al.
• The Globus Toolkit version 3 (GT3) design is based on grid services, which are quite similar to web services. GT3 implements the Open Grid Services Infrastructure (OGSI).
• GT4 is also based on grid services, but with some changes in the standard
• GT5 provides a multithreaded API implementation based on an asynchronous event model

Globus
• Toolkit with a set of components that implement basic services:
  • security
  • resource location
  • resource management
  • data management
  • resource reservation
  • communication

Core Globus Services
• Communication Infrastructure (Nexus)
• Information Services (MDS)
• Remote File and Executable Management (GASS, RIO, and GEM)
• Resource Management (GRAM)
• Security (GSS)
1/8/2020 MCC/MIERSI Grid Computing

Communications (Nexus)
• Communication library (ANL & Caltech)
• Asynchronous communications
• Multithreading
• Dynamic resource management

Communications (Nexus)
• 5 basic abstractions
  • Nodes
  • Contexts (address spaces)
  • Threads
  • Communication links (global pointers)
  • Remote service requests
• Startpoints and endpoints

Communications (Nexus)
• A Remote Service Request takes a global pointer (GP), a procedure name, and data
• It transfers the data to the context referenced by the GP
• It remotely invokes the specified procedure (with the data and the local portion of the GP as arguments)
Source: Foster et al., "Technologies for ubiquitous supercomputing…" (CCPE 1997)

Information Services (Metacomputing Directory Service - MDS)
• Required information
  • Configuration details about resources
    • Amount of memory
    • CPU speed
  • Performance information
    • Network latency
    • CPU load
  • Application-specific information
    • Memory requirements

Remote file and executable management
• Global Access to Secondary Storage (GASS)
  • basic access to remote files; supported operations include remote read, remote write, and append
• Remote I/O (RIO)
  • distributed implementation of MPI-IO, the parallel I/O API
• Globus Executable Management (GEM)
  • enables loading and executing a remote file through the GRAM resource manager

Resource management
• Resource Specification Language (RSL)
• Globus Resource Allocation Manager (GRAM)
  • provides a standardized interface to all of the various local resource management tools that a site might have in place
• DUROC (Dynamically-Updated Request Online Coallocator)
  • provides a co-allocation service
  • coordinates a single request that may span multiple GRAMs
[Figure: GRAM with the local resource managers LSF, EASY-LL, and NQE]

Authentication Model
• Authentication is done on a “user” basis
• A single authentication step allows access to all grid resources
• No communication of plaintext passwords
• Most sites will use conventional account mechanisms
• You must have an account on a resource to use that resource

Grid Security Infrastructure
• Each user has:
  • a Grid user id (called a Subject Name)
  • a private key (like a password)
  • a certificate signed by a Certificate Authority (CA)
• A “gridmap” file at each site specifies the grid-id to local-id mapping
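
As an illustration of the grid-id to local-id mapping, the sketch below parses a gridmap-style file in Python. It assumes the common layout of one quoted subject name followed by one local account name per line; the subject names, account names, and the function `parse_gridmap` are hypothetical, and real files may carry extra fields that this sketch ignores.

```python
import shlex

def parse_gridmap(text):
    """Map each quoted grid subject name to a local account name."""
    mapping = {}
    for line in text.splitlines():
        fields = shlex.split(line, comments=True)  # handles quotes and '#' comments
        if len(fields) == 2:
            subject, local_id = fields
            mapping[subject] = local_id
    return mapping

# Hypothetical entries, for illustration only.
example = '''
# grid subject name, then local account
"/O=Grid/OU=example.org/CN=Jane Doe" jdoe
"/O=Grid/OU=example.org/CN=John Smith" jsmith
'''
mapping = parse_gridmap(example)
print(mapping)
```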

Certificate-Based Authentication
• The user has a certificate, signed by a trusted “certificate authority” (CA)
• The certificate contains the user name and public key
• The Globus project operates a CA

“Logging” onto the Grid
• To run programs, authenticate to Globus:
    % grid-proxy-init
    Enter PEM pass phrase: ******
• Creates a temporary, short-lived credential for use by our computations
• The private key is not exposed past grid-proxy-init

Simple job submission
• globus-job-run provides a simple RSH-compatible interface
    % grid-proxy-init
    Enter PEM pass phrase: *****
    % globus-job-run host program [args]

Condor
• A specialized job and resource management system. It provides:
  • Job management mechanism
  • Scheduling
  • Priority scheme
  • Resource monitoring
  • Resource management

Condor Terminology
• The user submits a job to an agent.
• The agent is responsible for remembering jobs in persistent storage while finding resources willing to run them.
• Agents and resources advertise themselves to a matchmaker, which is responsible for introducing potentially compatible agents and resources.
• At the agent, a shadow is responsible for providing all the details necessary to execute a job.
• At the resource, a sandbox is responsible for creating a safe execution environment for the job and protecting the resource from any mischief.
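
The matchmaker's role can be sketched in Python as below. This is only an illustration of two-sided matching: the advertisement fields, the use of Python callables for requirements, and the function `matchmake` are assumptions and do not reproduce Condor's actual advertisement language or protocol.

```python
def matchmake(job_ads, resource_ads):
    """Pair each job ad with the first free resource ad for which both
    sides' requirements are satisfied."""
    matches = {}
    free = dict(resource_ads)
    for job_name, job in job_ads.items():
        chosen = next(
            (name for name, res in free.items()
             if job["requirements"](res) and res["requirements"](job)),
            None,
        )
        if chosen is not None:
            matches[job_name] = chosen
            del free[chosen]  # a matched resource is no longer advertised as free
    return matches

# Hypothetical advertisements.
job_ads = {
    "job1": {"owner": "alice",
             "requirements": lambda r: r["memory_mb"] >= 512 and r["os"] == "linux"},
    "job2": {"owner": "bob",
             "requirements": lambda r: r["memory_mb"] >= 4096},
}
resource_ads = {
    "nodeA": {"os": "linux", "memory_mb": 1024,
              "requirements": lambda j: j["owner"] != "bob"},
    "nodeB": {"os": "linux", "memory_mb": 8192,
              "requirements": lambda j: True},
}
matches = matchmake(job_ads, resource_ads)
print(matches)
```

The point of the sketch is that both parties state constraints: nodeA refuses jobs from one owner, while job2 needs more memory than nodeA offers.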

Condor-G: computation management agent for Grid Computing
• Merging of Globus and Condor technologies
• Globus
  • Protocols for secure inter-domain communications
  • Standardized access to remote batch systems
• Condor
  • Job submission and allocation
  • Error recovery
  • Creation of an execution environment

Globus: scheduling
• The Resource Specification Language (RSL) is used to communicate requirements.
• To take advantage of GRAM, a user still needs a system that can remember which jobs have been submitted, where they are, and what they are doing.
• To track large numbers of jobs, the user needs queuing, prioritization, logging, and accounting. These services are not found in GRAM alone, but are provided by systems such as Condor-G.

MyGrid and OurGrid (Cirne et al.)
• Mainly for bag-of-tasks (BoT) applications
• Use the dynamic algorithm Work Queue with Replication (WQR)
  • hosts that have finished their tasks are assigned to execute replicas of tasks that are still running
  • tasks are replicated until a predefined maximum number of replicas is reached (in MyGrid, the default is one)
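
The effect of replication can be sketched with a small discrete-event simulation in Python. This is a sketch under stated assumptions rather than the MyGrid implementation: task sizes and host speeds are fixed numbers, "maximum number of replicas" is read as extra copies beyond the original, the task with the fewest copies is replicated first, and the remaining copies are cancelled as soon as one copy finishes. The function name `wqr_makespan` is hypothetical.

```python
import heapq

def wqr_makespan(task_sizes, host_speeds, max_replicas=1):
    """Simulate Work Queue with Replication; return the finish time of the last task."""
    pending = list(range(len(task_sizes)))  # tasks not yet started, served in order
    running = {}                            # task -> set of hosts running a copy
    cancelled = set()                       # (host, task) copies killed early
    events = []                             # min-heap of (finish_time, host, task)

    def dispatch(host, time):
        if pending:
            task = pending.pop(0)
        else:
            # Queue is empty: replicate a running task that still has replica budget.
            candidates = [t for t, hosts in running.items() if len(hosts) - 1 < max_replicas]
            if not candidates:
                return  # host stays idle
            task = min(candidates, key=lambda t: (len(running[t]), t))
        running.setdefault(task, set()).add(host)
        heapq.heappush(events, (time + task_sizes[task] / host_speeds[host], host, task))

    for host in range(len(host_speeds)):
        dispatch(host, 0.0)

    makespan = 0.0
    while events:
        time, host, task = heapq.heappop(events)
        if (host, task) in cancelled:
            continue  # this copy was killed when another copy finished first
        makespan = time
        hosts = running.pop(task)
        cancelled.update((other, task) for other in hosts if other != host)
        for freed in sorted(hosts):
            dispatch(freed, time)
    return makespan

sizes = [10, 10, 10]
speeds = [2.0, 2.0, 0.25]  # the third host is much slower
with_replication = wqr_makespan(sizes, speeds, max_replicas=1)
without_replication = wqr_makespan(sizes, speeds, max_replicas=0)
print(with_replication, without_replication)
```

In this example the task stuck on the slow host is replicated on a fast host once the queue is empty, so the makespan drops from 40 to 10 time units at the price of some wasted cycles.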
