What happens when an instruction executes: Von Neumann Execution Model

The important factor: the PC.

Fetch
• send the PC to memory
• transfer the instruction from memory to the CPU
• increment the PC

Decode & read the ALU input sources

Execute
• an ALU operation
• a memory operation
• a branch target calculation

Store the result in a register
• from the ALU or from memory

(Variations: register/immediate operands; superscalar & pipelined implementations change how many cycles each step takes.)

CSE P548 - Dataflow Machines
What does this imply? Von Neumann Execution Model

Linear fetch & execution, driven by the PC: operands are "pulled" to the instruction.

A program is a linear series of addressable instructions
• the next instruction to be executed is pointed to by the PC
• send the PC to memory to fetch it
• the next instruction to execute depends on what happened during the execution of the current instruction

Operands reside in a centralized, global store (the GPRs); the hardware can speculate within integer & FP execution.
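The PC-driven cycle above can be sketched as a tiny interpreter. The 3-instruction ISA here (add, load, jump) is hypothetical, just to show the PC "pulling" execution forward one instruction at a time:

```python
# A minimal sketch of the von Neumann execution model, assuming a
# hypothetical 3-instruction ISA. A single PC walks a linear program.

def run(program, memory, regs):
    pc = 0
    while pc < len(program):
        op, *args = program[pc]          # Fetch: send the PC to memory
        pc += 1                          # increment the PC
        if op == "add":                  # Execute: an ALU operation
            dst, a, b = args
            regs[dst] = regs[a] + regs[b]
        elif op == "load":               # a memory operation
            dst, addr = args
            regs[dst] = memory[addr]
        elif op == "jump":               # next instruction depends on this one
            (pc,) = args

regs = {"r0": 0, "r1": 0}
memory = {100: 7}
program = [("load", "r0", 100), ("add", "r1", "r0", "r0")]
run(program, memory, regs)
print(regs["r1"])  # 14
```

Everything is serialized through the one PC; parallelism has to be rediscovered later by the hardware.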
What happens when an instruction executes: Dataflow Execution Model

Instructions are already in the processor.

Operands arrive from producer instructions (& any initial inputs) via a network.

Check to see if all of an instruction's operands are there, in a matching table (the token store).

Execute
• an ALU operation
• a memory operation
• a branch target calculation

Send the result
• to the consumer instructions or to memory

This is out-of-order execution to the max: not just the instructions in the issue queue, but all instructions, can execute out of order.
What does this imply? Dataflow Execution Model

Execution is driven by the availability of input operands
• operands are consumed
• output is generated
• no PC

Result operands are passed directly to consumer instructions: they are "pushed" (if the consumer instruction is not resident, load it)
• no register file

Not just out-of-order execution: parallel execution, hindered only by producer/consumer data dependences. That means gobs more parallelism, at a fine grain.

The models are different, so what? Can you intuit the implications of these models for instruction execution? If you had unlimited functional units, which model would have better performance? And what is the hardware trend these days, more or fewer FUs?
Dataflow Computers

Performance was the original motivation, back in the 60’s.

Motivation:
• exploit instruction-level parallelism on a massive scale
• more fully utilize all processing elements

Believed this was possible with a 3-pronged approach:
• expose instruction-level parallelism by using a functional-style programming language
• no side effects (with respect to generating new values); the only restrictions are producer-consumer dependences
• schedule code for execution on the hardware greedily: spread it out & execute as soon as operands have been computed
• hardware support for data-driven execution (operand checking)

A side-effect-free language can more easily express parallelism, e.g., all the multiplies & adds in a matrix multiply can run in parallel. The programming language facilitates parallel execution, not just the hardware. (WaveScalar, by contrast, was driven by more than performance.)
Dataflow Execution

Data dependences are the only thing that can inhibit dataflow execution (in theory); this is how they are expressed.

All computation is data-driven
• the binary is represented as a directed graph
• nodes are operations
• values travel on arcs

[graph fragment: a "+" node takes a & b on its input arcs, computes a+b & sends it to destination1 & destination2; this is a WaveScalar instruction]

All compilers compute this dataflow graph; a conventional compiler then generates a linear list of instructions from it, & the binary is driven by the instruction pointer. In dataflow, the operands will find you.
Dataflow Execution

To repeat, dataflow execution in the context of the directed graph:

Data-dependent operations are connected, producer to consumer. Code & initial values are loaded into memory.

Execute according to the dataflow firing rule
• when operands of an instruction have arrived on all of its input arcs, the instruction may execute
• the values on the input arcs are removed
• the computed value is placed on the output arc

Why remove the input values? To allow new values to arrive for the next time the instruction is executed.

Sound familiar? The difference is that all of execution, not just the computation, works this way: any instruction, not just those in the issue queue, can participate.
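The firing rule can be sketched in a few lines. The Node class & graph encoding below are hypothetical, not any real machine's format; the point is that execution order falls out of token arrival, with no PC anywhere:

```python
# A minimal sketch of the dataflow firing rule, under an assumed
# graph representation: each Node is one instruction with input-arc
# slots and a list of (consumer, port) destination arcs.
from operator import add, mul

class Node:
    """One dataflow instruction: an opcode plus destination arcs."""
    def __init__(self, op, n_inputs):
        self.op = op
        self.slots = [None] * n_inputs   # one slot per input arc
        self.dests = []                  # (consumer_node, input_port) pairs

def deliver(node, port, value, ready):
    node.slots[port] = value
    if all(v is not None for v in node.slots):
        ready.append(node)               # firing rule: all operands present

def run(initial_tokens, result_node):
    ready = []
    for node, port, value in initial_tokens:
        deliver(node, port, value, ready)
    while ready:
        node = ready.pop()               # fire in any order: no PC
        value = node.op(*node.slots)
        node.slots = [None] * len(node.slots)   # consume the input tokens
        if node is result_node:
            return value
        for consumer, port in node.dests:
            deliver(consumer, port, value, ready)

# Graph for (i * i) + j with i = 3, j = 4.
times = Node(mul, 2)
plus = Node(add, 2)
times.dests.append((plus, 0))
print(run([(times, 0, 3), (times, 1, 3), (plus, 1, 4)], plus))  # 13
```

Note that the `+` node holds its j operand until the `*` result arrives; only the complete operand set lets it fire.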
Dataflow Example

[dataflow graph: inputs i, A & j feed *, *, +, + nodes, then a Load, a +, & a Store, producing b]

A[j + i*i] = i;
b = A[i*j];

i, A & j are initialized values present at the beginning; b is a computed value, passed to some consumer instruction that will use it.

CSE P548 - Dataflow Machines
That was executing producer-consumer instructions for straightline code; now, control flow.

Dataflow Execution: Control

Convert control dependence to data dependence with value-steering instructions
• steer (ρ): takes a value & a predicate, & routes the value onto either the T path or the F path; execute one path, after the condition variable is known
• merge (φ): takes a predicate, a T-path value & an F-path value, & passes the selected value on; execute both paths & pass the right value at the end

(The von Neumann analogue of merge is predicated execution.)
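The two value-steering instructions can be sketched as plain functions (a toy model; real ρ/φ are graph nodes carrying tokens, not calls):

```python
# A sketch of the two control instructions of a dataflow ISA.

def steer(pred, value):
    """rho: route the value onto exactly one of the two output arcs."""
    return (value, None) if pred else (None, value)   # (T arc, F arc)

def merge(pred, t_value, f_value):
    """phi: both paths were executed; the predicate selects which value passes."""
    return t_value if pred else f_value

# x = a + 1 if a > 0 else a - 1, written merge (phi) style:
a = 5
x = merge(a > 0, a + 1, a - 1)
print(x)  # 6
```

With steer, only the chosen path ever receives a token; with merge, both paths compute & the untaken result is discarded, which trades extra work for not waiting on the predicate.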
WaveScalar Control

ρ (steer) & φ (merge): draw the execution path for each.

With ρ, no instructions execute on the untaken (e.g., T) arc. Use ρ if hardware resources are constrained, or if there is no need to hide instruction latency because there is enough ILP from other instructions.
Dataflow Computer ISA (the ISA in general)

Instructions
• an operation
• destination instructions

Data packets, called tokens
• a value
• a tag to identify the operand instance & match it with its fellow operands in the same dynamic instruction instance; the tag's contents are architecture-dependent:
• instruction number
• iteration number (to distinguish different dynamic instances of the same instruction)
• activation/context number (for functions, especially recursive ones)
• thread number

A dataflow computer executes a program by receiving, matching & sending out tokens; a complete match lets the instruction execute.
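Token matching can be sketched directly from that description. The tag fields & table layout below are a hypothetical simplification of what real machines stored:

```python
# A sketch of tag-based token matching in a dynamic dataflow machine.
# The matching table holds tokens waiting for partners with the same tag.
from collections import namedtuple

Tag = namedtuple("Tag", "instr iteration activation thread")
Token = namedtuple("Token", "tag port value")

matching_table = {}   # tag -> {input port: value}

def receive(token, n_operands):
    """Store the token; return the operand list once the instance is complete."""
    waiting = matching_table.setdefault(token.tag, {})
    waiting[token.port] = token.value
    if len(waiting) == n_operands:
        del matching_table[token.tag]              # matched tokens are consumed
        return [waiting[p] for p in sorted(waiting)]
    return None                                    # still waiting for partners

# Two operands of iteration 0 match; a stray operand from iteration 1 does not.
t0 = Tag(instr=7, iteration=0, activation=0, thread=0)
t1 = Tag(instr=7, iteration=1, activation=0, thread=0)
assert receive(Token(t0, 0, 10), 2) is None
assert receive(Token(t1, 0, 99), 2) is None    # different iteration: no match
assert receive(Token(t0, 1, 32), 2) == [10, 32]
```

The half-matched entry for iteration 1 stays in the table, which previews the token-store scalability problem discussed later.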
Types of Dataflow Computers

Static
• one copy of each instruction
• no simultaneously active iterations, no recursion
• but a simple, natural implementation to begin with
Types of Dataflow Computers

Dynamic (increases parallelism)
• multiple copies of each instruction
• better performance
• a gating technique to prevent instruction explosion: k-bounding
• an extra instruction with k tokens on its input arc passes a token to the 1st instruction of the loop body
• the 1st instruction of the loop body consumes a token (it needs this one extra operand to execute)
• the last instruction in the loop body produces another token at the end of its iteration
• this limits the number of active iterations to k

An iteration needs a token to enter the loop & creates a token when it exits. With only k tokens to begin with, there are never more than k executing iterations; each completing iteration replenishes a token so that another iteration can take its place. At any time there are up to k executing iterations, plus any unused tokens at the loop entrance.
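The token-counting argument can be checked with a small simulation (a sequential toy model, assuming iterations start whenever a token is free & finish one at a time):

```python
# A sketch of k-bounding: a pool of k loop-entry tokens caps how many
# iterations of a loop can be simultaneously active.

def k_bounded_run(iterations, k):
    tokens = k                    # k tokens sit on the gate's input arc
    active = max_active = started = finished = 0
    while finished < iterations:
        # the 1st instruction of the loop body consumes a token to start
        while started < iterations and tokens > 0:
            tokens -= 1
            started += 1
            active += 1
            max_active = max(max_active, active)
        # the last instruction of an iteration produces a token back
        active -= 1
        tokens += 1
        finished += 1
    return max_active

print(k_bounded_run(iterations=10, k=3))  # 3
```

However eagerly iterations try to start, the pool never lets more than k be in flight; with fewer iterations than tokens, the bound is simply the iteration count.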
Prototypical Early Dataflow Computer

(The 60's: Dennis's static machine; later, the Manchester machine & Sigma-1.)

Processing elements, a token store & an instruction store, tied together by an interconnection network: instruction packets & data packets flow between them, with a distribution network carrying results back. (The token store could instead be distributed among the PEs.)

The original implementations were centralized, at a performance cost
• a large token store (long access time)
• long wires
• arbitration, both for the PEs & for the storing of results
Problems with Dataflow Computers

(My view: 2 related problems.)

Language compatibility
• dataflow cannot guarantee a correct ordering of memory operations: with no PC, unless there is an explicit data dependence the machine cannot tell the order (go back to the example to see what that means)
• so dataflow computer programmers could not use mainstream programming languages, such as C
• special languages were developed in which order didn't matter (Id, SISAL): no memory side effects

Scalability: the large token store
• a side-effect-free programming language has no mutable data structures
• each update creates a new data structure
• 1000 tokens for 1000 data items, even if they all hold the same value
• aggravated by the state of processor technology at the time
• delays in processing (only so many functional units, arbitration delays, etc.) meant delays in operand arrival, so the store filled with half-matched token entries
• associative search was impossible; the store was accessed with a slower hash function
Example to Illustrate the Memory Ordering Problem

[dataflow graph: inputs i, A & j feed *, *, +, + nodes, then a Load, a +, & a Store, producing b]

A[j + i*i] = i;
b = A[i*j];

The load-store ordering issue: calculating the store's address (j + i*i) takes longer than calculating the load's (i*j), so by operand availability alone the load fires first, even if the two addresses alias.
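The race can be made concrete with values where the two addresses do alias (i = 2, j = 4 gives j + i*i = i*j = 8; the dict-as-memory here is just an illustration):

```python
# A sketch of the dataflow memory ordering problem. The store's address
# needs two operations (*, +); the load's needs one (*). By pure operand
# availability the load fires first and reads a stale value.

A = {8: -1}                      # stale value at address 8
i, j = 2, 4                      # j + i*i == i*j == 8: the accesses alias

# Dataflow order: the load's address is ready first, so the load fires first.
b_dataflow = A[i * j]            # reads the stale -1
A[j + i * i] = i                 # the store lands too late

# Program order: the store comes first, as the source code demands.
A = {8: -1}
A[j + i * i] = i
b_program = A[i * j]             # reads the freshly stored i

print(b_dataflow, b_program)     # -1 2
```

Since the machine has no PC and no dependence arc connects the store to the load, nothing forces the correct order: exactly the language-compatibility problem above.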
Partial Solutions

(A response to the large token store.) The solutions led away from pure dataflow execution.

Data representation in memory
• I-structures
• write once; read many times
• early reads are deferred until the write arrives
• better token store usage if there are lots of readers (& therefore lots of tokens) & long delays between a def & its uses
• M-structures
• multiple reads & writes, but they must alternate
• reusable structures that can hold multiple values over time
• better token store usage if code updates the same location & there are long delays between def & use

You can think of both as registers with full/empty bits.
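An I-structure cell can be sketched as a write-once slot with a deferred-read list (a toy model; real machines kept the deferred list in the heap alongside the element):

```python
# A sketch of one I-structure cell: write-once storage where early reads
# are deferred until the write arrives instead of failing.

class IStructureCell:
    def __init__(self):
        self.full, self.value = False, None
        self.deferred = []                 # readers that arrived early

    def read(self, consumer):
        if self.full:
            consumer(self.value)           # value already present: read now
        else:
            self.deferred.append(consumer) # defer the read until the write

    def write(self, value):
        assert not self.full, "I-structures are write-once"
        self.full, self.value = True, value
        for consumer in self.deferred:     # wake every deferred reader
            consumer(value)
        self.deferred = []

cell = IStructureCell()
seen = []
cell.read(seen.append)    # early read: deferred
cell.write(42)            # the write satisfies the deferred read
cell.read(seen.append)    # late read: serviced immediately
print(seen)               # [42, 42]
```

Many readers share one stored value instead of one token each, which is exactly the token-store saving claimed above; the full/empty flag is the register analogy on the slide.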
Partial Solutions

(A response to accessing a large token store.)

Results stored in registers, not in memory or tokens: local (register) storage for back-to-back instructions, for code with little parallelism (EM4 & Epsilon).

Frames of sequential instruction execution (Monsoon, from MIT's tagged-token machine line; *T expanded on the idea)
• create "frames", each of which stores the data for one iteration or one thread
• no need to search the entire token store (just an offset into the frame)
• like having dataflow execution among coarse-grain threads rather than among individual instructions

Physically partition the token store & place each partition with a PE.