Chapter 2: Program and Network Properties
Conditions of parallelism • Data and resource dependences: • The ability to execute several program segments in parallel requires each segment to be independent of the other segments. • Dependence graphs are used to describe these relationships. • The nodes of the graph correspond to the program statements (instructions), and the directed edges with different labels show the relations among the statements. • The analysis of the dependence graph shows where opportunities for parallelization exist.
Data dependence • Flow dependence: a statement S2 is flow-dependent on statement S1 if an execution path exists from S1 to S2 and if at least one output of S1 feeds in as input to S2. • Antidependence: statement S2 is antidependent on statement S1 if S2 follows S1 in program order and if the output of S2 overlaps the input to S1. • Output dependence: two statements are output-dependent if they produce (write) the same output variable. • I/O dependence: read and write are I/O statements. I/O dependence occurs not because the same variable is involved but because the same file is referenced by both I/O statements.
Example • S1 Load R1, A • S2 Add R2, R1 • S3 Move R1, R3 • S4 Store B, R1 • Here S2 is flow-dependent on S1 (S1 writes R1, which S2 reads); S3 is antidependent on S2 (S3 overwrites the R1 that S2 reads) and output-dependent on S1 (both write R1); S4 is flow-dependent on S3 (S4 stores into B the R1 produced by S3). • S1 Read (4), A(I) (read array A from tape unit 4) • S2 Rewind (4) • S3 Write (4), B(I) (write array B onto tape unit 4) • S4 Rewind (4) • S1 and S3 are I/O-dependent because both reference the same file on tape unit 4.
Control dependence: this refers to situations where the order of execution of statements cannot be determined before run time, e.g., conditional (if) statements. Control dependence often prohibits parallelism. • for (i = 1; i <= n; i++) • { • if (x[i - 1] == 0) • x[i] = 0; • else • x[i] = 1; • } • Here the successive iterations are control-dependent on each other: whether x[i] is set to 0 or 1 depends on the value of x[i - 1] produced by the previous iteration.
Resource dependence: deals with conflicts in using shared resources. • When the conflicting resource is the ALU, it is called ALU dependence. • When the conflicting resource is a memory (storage) location, it is called storage dependence. • For example, when floating-point units or registers are shared we have ALU dependence; when memory is shared we have storage dependence.
Bernstein’s Conditions • In 1966, Bernstein formulated a set of conditions under which two processes can execute in parallel. • Input set: all input variables needed to execute the process. • Output set: all output variables generated after the execution of the process.
To show the operation of Bernstein’s conditions, consider the following instructions of a sequential program: • I1 : x = (a + b) / (a * b) • I2 : y = (b + c) * d • I3 : z = x^2 + (a * e) • Now, the read sets and write sets of I1, I2 and I3 are as follows: • R1 = {a, b}  W1 = {x} • R2 = {b, c, d}  W2 = {y} • R3 = {x, a, e}  W3 = {z}
Now let us check whether I1 and I2 can execute in parallel: • R1∩W2=φ • R2∩W1=φ • W1∩W2=φ • Since all three intersections are empty, I1 and I2 are independent of each other. • Similarly, for I1 || I3: • R1∩W3=φ • R3∩W1≠φ • W1∩W3=φ
Hence I1 and I3 are not independent of each other. • For I2 || I3: • R2∩W3=φ • R3∩W2=φ • W2∩W3=φ • Hence, I2 and I3 are independent of each other. • Thus, I1 and I2 as well as I2 and I3 are parallelizable, but I1 and I3 are not.
In general, consider two processes P1 and P2 with input sets I1, I2 and output sets O1, O2. • These two processes can execute in parallel if they satisfy the following conditions: • I1 ∩ O2 = φ (anti-independent) • I2 ∩ O1 = φ (flow-independent) • O1 ∩ O2 = φ (output-independent)
The input set is also called the read set or the domain of the process. • The output set is also called the write set or the range of the process. • Consider the following five statements: • P1: C = D * E • P2: M = G + C • P3: A = B + C • P4: C = L + M • P5: F = G / E
Only 5 pairs can execute in parallel: P1-P5, P2-P3, P2-P5, P3-P5 and P4-P5 (a small code sketch of this check follows below). • The parallelism relation || is commutative: Pi || Pj implies Pj || Pi. • But it is not transitive: Pi || Pj and Pj || Pk do not necessarily guarantee Pi || Pk.
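As a minimal sketch of this check in C (the encoding of each variable as one bit of an unsigned mask and the helper name bernstein_parallel are illustrative, not from the source), the three conditions can be tested pairwise on the read and write sets of P1-P5; the program prints exactly the five parallel pairs listed above.

    #include <stdio.h>

    /* Hypothetical encoding: each program variable is one bit in an unsigned mask. */
    enum { A = 1 << 0, B = 1 << 1, C = 1 << 2, D = 1 << 3, E = 1 << 4,
           F = 1 << 5, G = 1 << 6, L = 1 << 7, M = 1 << 8 };

    /* Bernstein's conditions: Ri ∩ Wj = φ, Rj ∩ Wi = φ, Wi ∩ Wj = φ. */
    static int bernstein_parallel(unsigned ri, unsigned wi,
                                  unsigned rj, unsigned wj)
    {
        return (ri & wj) == 0 && (rj & wi) == 0 && (wi & wj) == 0;
    }

    int main(void)
    {
        /* Read and write sets of P1..P5 from the example above. */
        unsigned r[5] = { D | E, G | C, B | C, L | M, G | E };
        unsigned w[5] = { C,     M,     A,     C,     F     };

        for (int i = 0; i < 5; i++)
            for (int j = i + 1; j < 5; j++)
                if (bernstein_parallel(r[i], w[i], r[j], w[j]))
                    printf("P%d || P%d\n", i + 1, j + 1);
        return 0;
    }

Running it prints P1 || P5, P2 || P3, P2 || P5, P3 || P5 and P4 || P5, matching the five pairs stated above.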
Hardware and software parallelism • The implementation of parallelism needs special hardware and software support. • A mismatch problem often exists between them.
Hardware parallelism • This refers to the type of parallelism defined by the machine architecture and hardware multiplicity. • Hardware parallelism is often a function of cost and performance tradeoffs. • It displays the resource utilization patterns of simultaneously executable operations.
One way to characterize the parallelism in a processor is by the number of instruction issues per machine cycle. If a processor issues k instructions per machine cycle, it is called a k-issue processor. • A conventional processor takes one or more machine cycles to issue a single instruction. Such processors are called one-issue machines. • For example, the Intel i960CA is a three-issue processor: one arithmetic, one memory-access and one branch instruction can be issued per cycle.
Software parallelism • This type of parallelism is defined by the control and data dependence of programs.
Mismatch example • Software parallelism: the example program has eight instructions (four load and four arithmetic instructions) that can complete in three cycles, so the software parallelism is 8/3 ≈ 2.67 instructions per cycle. • Hardware parallelism: a two-issue processor can execute only one load (memory access) and one arithmetic operation simultaneously, so it needs seven cycles, giving 8/7 ≈ 1.14 instructions per cycle. • Using a dual-processor system, the same eight instructions take six cycles, i.e. 8/6 ≈ 1.33 instructions per cycle.
Of the many types of software parallelism, the two most important are control parallelism and data parallelism. • Control parallelism allows two or more operations to be performed in parallel, e.g., pipelined operations. • Data parallelism: almost the same operation is performed over many data elements by many processors in parallel.
To solve the mismatch problem • To solve the problem of hardware and software mismatch, one approach is to develop compilation support. • The other is hardware redesign for more efficient exploitation by an intelligent compiler. • Ideally, the compiler and the hardware should be designed jointly; interaction between the two can lead to a better solution to the mismatch problem. • Hardware and software design tradeoffs also exist in terms of cost, complexity, expandability, compatibility and performance.
Program partitioning and scheduling • Grain size or granularity is a measure of the amount of computation involved in a software process. The simplest measure is to count the number of instructions in a grain (program segment). • Grain size determines the basic program segment chosen for parallel processing. • Grain sizes are commonly described as fine, medium or coarse, depending on the processing level involved.
Latency: a time measure of the communication overhead incurred between machine subsystems. • For example, memory latency is the time required by a processor to access memory. • Synchronization latency is the time required for two processors to synchronize with each other.
Levels of parallelism • Instruction level: a typical grain contains fewer than 20 instructions, called a fine grain. It is easy to detect parallelism at this level. • Loop level: here the grain size is fewer than 500 instructions. • Procedure level: this corresponds to medium grain and contains fewer than 2000 instructions. Detection of parallelism is more difficult at this level than at the fine-grain levels.
Subprogram level: the number of instructions ranges into the thousands; these form coarse grains. • Job level: this corresponds to the parallel execution of essentially independent jobs (programs). The grain size can be as high as tens of thousands of instructions in a single program.
Kruatrachue algorithm • Each node of the program graph is represented by a pair (n, s): the node name n and the grain size s. • Each edge is labelled (v, d): the output variable v carried along the edge and the communication delay d.
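As a minimal sketch only (the struct and field names are illustrative, not taken from the original algorithm), the (n, s) nodes and (v, d) edge labels of such a program graph could be represented in C as:

    /* Node of the program graph used for grain packing/scheduling:
       (n, s) = node name and grain size. */
    struct grain_node {
        const char *name;           /* n: node name   */
        int         size;           /* s: grain size  */
    };

    /* Edge label (v, d) = output variable carried along the edge and the
       communication delay incurred if the two end nodes are scheduled
       on different processors. */
    struct grain_edge {
        struct grain_node *from, *to;
        const char        *var;     /* v: output variable     */
        int                delay;   /* d: communication delay */
    };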
Program flow mechanisms • Conventional computers are based on a control flow mechanism by which the order of program execution is explicitly stated in the user program. • Data flow computers are based on a data-driven mechanism which allows the execution of any instruction to be driven by data availability. • Reduction computers are based on a demand-driven mechanism which initiates an operation based on the demand for its results by other computations.
Control flow computers • Conventional von Neumann computers use a program counter to sequence the execution of instructions in a program, and use shared memory to hold program instructions and data. • In data flow computers, an instruction is executed as soon as its operands are available; the data goes directly to the instruction. Computational results (data tokens) are passed directly between instructions.
The data generated by an instruction is duplicated into many copies and forwarded directly to all needy instructions. Data tokens, once consumed by an instruction, are no longer available for reuse. • This model requires no shared memory and no program counter; it only needs special mechanisms to detect data availability and to match data tokens with needy instructions.
A data flow architecture • Arvind and his associates at MIT developed a tagged-token architecture for building data flow computers. • The global architecture consists of n processing elements (PEs) interconnected by an n × n routing network. • Within each PE, the machine provides a token-matching mechanism which dispatches only those instructions whose input data (tokens) are already available.
Instructions are stored in the program memory. • Each datum is tagged with the address of the instruction to which it belongs. • Tagged tokens enter the PE through a local path. • It is the machine’s job to match up tokens carrying the same tag and deliver them to needy instructions. • Each instruction represents a synchronization operation.
Another synchronization mechanism, called the I-structure, is also provided within each PE. The I-structure is a tagged memory unit intended for overlapped usage of a data structure by both the producer and consumer processes. Each word of the I-structure uses a 2-bit tag indicating whether the word is empty, is full, or has pending read requests.
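A hypothetical C representation of one I-structure word, just to make the 2-bit presence tag concrete (the type and field names are illustrative, not from the MIT machine):

    /* One word of an I-structure memory: a 2-bit presence tag plus data. */
    enum presence { EMPTY, FULL, PENDING_READ };

    struct istructure_word {
        enum presence tag;      /* EMPTY, FULL, or PENDING_READ */
        unsigned      data;     /* valid only when tag == FULL  */
        /* A real design would also keep a queue of deferred read requests,
           to be satisfied when the producer eventually writes the word.   */
    };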
Demand-driven mechanisms • The computation is triggered by the demand for an operation’s result. • E.g., a = ((b + 1) * c) - (d / e). • A data-driven computation chooses a bottom-up approach, starting from the innermost operations. • Such computations are also called eager evaluation, because the operations are carried out immediately after all their operands become available.
A demand-driven computation takes a top-down approach: the evaluation starts only when the value of a is demanded, and each operation in turn demands the values of its operands. • This is also called lazy evaluation, because operations are executed only when their results are required by another instruction (see the sketch below).
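A toy C sketch of this top-down order, under the assumption that the expression is held as a simple tree (all names are illustrative): the evaluator starts at the root, i.e. at the demand for a, and recursively demands the values of the operands. Real reduction machines implement this idea with graph reduction rather than plain recursion.

    #include <stdio.h>

    /* Expression tree node: either a constant leaf or an operator with two children. */
    struct expr {
        char op;                    /* 0 for a leaf, otherwise '+', '-', '*', '/' */
        double value;               /* used only when op == 0 */
        struct expr *left, *right;
    };

    /* Demand-driven (lazy) evaluation: a node is evaluated only when its
       result is demanded by its parent, starting from the root. */
    static double demand(const struct expr *e)
    {
        if (e->op == 0) return e->value;
        double l = demand(e->left);          /* demand the operands top-down */
        double r = demand(e->right);
        switch (e->op) {
            case '+': return l + r;
            case '-': return l - r;
            case '*': return l * r;
            default : return l / r;          /* '/' */
        }
    }

    int main(void)
    {
        /* a = ((b + 1) * c) - (d / e) with b = 2, c = 3, d = 8, e = 4 */
        struct expr b = {0, 2}, one = {0, 1}, c = {0, 3}, d = {0, 8}, e = {0, 4};
        struct expr sum  = {'+', 0, &b, &one};
        struct expr prod = {'*', 0, &sum, &c};
        struct expr quot = {'/', 0, &d, &e};
        struct expr a    = {'-', 0, &prod, &quot};
        printf("a = %g\n", demand(&a));      /* (2+1)*3 - 8/4 = 7 */
        return 0;
    }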
System interconnect networks • Static and dynamic networks are used for interconnecting computer subsystems or for constructing multiprocessors and multicomputers. • Static networks are formed of point-to-point direct connections which do not change during program execution. • Dynamic networks are implemented with switched channels which are dynamically configured to match the communication demand.
Network properties and routing • Node degree: the number of edges incident on a node. The in-degree counts incoming edges and the out-degree counts outgoing edges; their sum is the node degree. The node degree reflects the number of I/O ports required per node, and thus the cost of the node. • Diameter D: the maximum of the shortest paths between any two nodes, where path length is measured by the number of links traversed. • Network size: the total number of nodes.
Bisection width b: the minimum number of edges along a cut that divides the network into two equal halves. In a communication network each edge corresponds to a channel with w wires (bits), so the wire bisection width is B = b*w; B reflects the wiring density of the network (see the hypercube sketch below).
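As an illustration of these definitions, a binary n-cube (hypercube) with N = 2^n nodes has node degree n, diameter n and bisection width b = 2^(n-1), so its wire bisection is B = b*w for channel width w. The short C sketch below (the channel width w = 32 is an assumed value) tabulates these properties:

    #include <stdio.h>

    /* Properties of a binary n-cube (hypercube) with N = 2^n nodes. */
    int main(void)
    {
        int w = 32;                          /* assumed channel width in bits */
        for (int n = 1; n <= 6; n++) {
            unsigned N = 1u << n;
            unsigned b = 1u << (n - 1);      /* min edges cut to halve the cube */
            printf("n=%d  N=%u  degree=%d  diameter=%d  b=%u  B=%u\n",
                   n, N, n, n, b, b * (unsigned)w);
        }
        return 0;
    }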
Data routing functions • A data routing network is used for inter-PE data exchange. The routing network can be static, such as the hypercube routing network used in the TMC CM-2, or dynamic, such as the multistage network used in the IBM GF11. • Commonly used data routing functions among the PEs include shifting, rotation, permutation (one-to-one), broadcast, multicast, personalized communication, shuffle, etc. These routing functions can be implemented on ring, mesh, hypercube and other topologies.
Permutations: for n objects there are n! permutations by which the n objects can be reordered; the set of all permutations forms a permutation group. • E.g., permutation π = (a, b, c)(d, e). • In circular fashion: a→b, b→c, c→a and d→e, e→d. • The cycle (a, b, c) has a period of 3 and (d, e) has a period of 2, so the period of the whole permutation is lcm(3, 2) = 2 * 3 = 6. • Permutations can be implemented with a crossbar switch or a multistage network.
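A small C sketch (helper names are illustrative) that computes a permutation's period as the least common multiple of its cycle lengths; applied to π = (a b c)(d e), encoded as an index mapping, it prints 6:

    #include <stdio.h>

    static unsigned gcd(unsigned a, unsigned b)
    {
        while (b) { unsigned t = a % b; a = b; b = t; }
        return a;
    }

    static unsigned lcm(unsigned a, unsigned b) { return a / gcd(a, b) * b; }

    /* Period of a permutation given as an array: perm[i] is where element i maps. */
    static unsigned period(const int *perm, int n)
    {
        int seen[64] = {0};              /* assumes n <= 64 in this sketch */
        unsigned result = 1;
        for (int i = 0; i < n; i++) {
            if (seen[i]) continue;
            unsigned len = 0;
            for (int j = i; !seen[j]; j = perm[j]) { seen[j] = 1; len++; }
            result = lcm(result, len);   /* period = lcm of cycle lengths */
        }
        return result;
    }

    int main(void)
    {
        /* pi = (a b c)(d e) with a..e encoded as 0..4: a->b, b->c, c->a, d->e, e->d */
        int pi[5] = { 1, 2, 0, 4, 3 };
        printf("period = %u\n", period(pi, 5));   /* prints 6 */
        return 0;
    }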
Perfect shuffle and exchange • The perfect shuffle is obtained by shifting the node address 1 bit to the left and wrapping the most significant bit around to the least significant position; the exchange complements the least significant bit. • Hypercube routing functions: for a three-dimensional binary cube, three routing functions C0, C1 and C2 are defined, where Ci connects nodes whose addresses differ only in bit i (i.e. Ci complements the i-th address bit), as sketched below.
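A minimal sketch of these routing functions as bit manipulations on a k-bit node address (the function names are illustrative): the perfect shuffle is a cyclic left shift of the address bits, the exchange complements the least significant bit, and the hypercube routing function Ci complements bit i.

    #include <stdio.h>

    /* Perfect shuffle on a k-bit address: cyclic left shift by one bit. */
    static unsigned shuffle(unsigned x, int k)
    {
        unsigned mask = (1u << k) - 1;
        return ((x << 1) | (x >> (k - 1))) & mask;
    }

    /* Exchange: complement the least significant address bit. */
    static unsigned exchange(unsigned x) { return x ^ 1u; }

    /* Hypercube routing function Ci: complement bit i of the address. */
    static unsigned cube(unsigned x, int i) { return x ^ (1u << i); }

    int main(void)
    {
        int k = 3;                                  /* 3-bit addresses: 8 nodes */
        for (unsigned x = 0; x < (1u << k); x++)
            printf("%u -> shuffle %u, exchange %u, C0 %u, C1 %u, C2 %u\n",
                   x, shuffle(x, k), exchange(x), cube(x, 0), cube(x, 1), cube(x, 2));
        return 0;
    }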