WaveScalar: the Executive Summary
A dataflow machine:
• good at exploiting ILP
• dataflow parallelism + traditional coarser-grain parallelism
• cheap thread management
• low operand latency because of a hierarchical organization
• memory ordering enforced through wave-ordered memory
• no special languages: ordinary imperative languages work
Why WaveScalar? Begin with the motivation, then the salient capabilities. Shrinking feature sizes present a problem, not a solution: in 2010 a signal travels, in one cycle, roughly 1/4 the distance it did in 2000.
Additional motivation:
• increasing disparity between computation (fast transistors) & communication (long wires)
• increasing circuit complexity: longer design times, more bugs & longer verification time
• decreasing fabrication reliability
As wire size decreases, resistance R increases proportionally; wire proximity determines capacitance C, and C determines cross-talk; wire delay = R * C.
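A first-order sketch of why this happens, assuming a simple distributed-RC wire model (the symbols below are illustrative, not from the slides): resistance grows as the wire cross-section shrinks while capacitance per unit length stays roughly constant, so the delay of a fixed-length (global) wire grows even as transistors get faster.

```latex
% per-unit-length wire model (illustrative symbols: rho = resistivity,
% L = wire length, W/H = wire width/height, S = spacing to neighbors)
R_{wire} = \rho \frac{L}{W H}, \qquad
C_{wire} \approx \varepsilon \frac{L H}{S} + C_{fringe}, \qquad
t_{wire} \approx \tfrac{1}{2} R_{wire} C_{wire}
        \propto \frac{\rho \varepsilon L^{2}}{W S}
% scaling W, H, S by 1/k with L fixed makes t_wire grow roughly as k^2,
% while gate delay shrinks: fewer millimeters are reachable per cycle.
```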
Monolithic von Neumann Processors
A phenomenal success today. But in 2016?
• Performance: centralized processing & control; long wires, e.g., operand broadcast networks
• Complexity: 40-75% of "design" time is design verification
• Defect tolerance: one flaw turns the die into a paperweight
(2004 die photo of a Pentium 4.)
WaveScalar's Microarchitecture
Good performance via a distributed microarchitecture:
• hundreds of PEs
• dataflow execution, so no centralized control
• short point-to-point (producer-to-consumer) operand communication
• organized hierarchically for fast communication between neighboring PEs
• scalable: no long wires, even as the WaveScalar processor grows
Low design complexity through simple, identical PEs: design one & stamp out thousands.
Defect tolerance: route around a bad PE.
Processing Element (the bottom level of the hierarchy; watch how operand latency grows as the hierarchy grows)
• simple & small (~0.5M transistors)
• 5-stage pipeline: receive input operands, match tags, schedule instructions, execute, send outputs
• holds 64 (decoded) instructions
• 128-entry token store; the matching table ("tracking board") is a cache indexed by hashing on instruction number & tag
• 4-entry output buffer
PEs in a Pod
• share an operand bypass network
• back-to-back producer-consumer execution across PEs
• relieves congestion on the intra-domain bus
Domain
• a shared bus carries operand traffic among the PEs in a domain
• an operand transfer over the domain bus takes 5 cycles
Cluster
(Figure: the next level of the hierarchy, grouping several domains.)
WaveScalar Processor
• long-distance communication: dynamic routing over a grid-based inter-cluster network
• 2-cycle hop per cluster, so latency depends on distance: 9 cycles + the cluster distance
Operand traffic distribution:
• single thread: 61% within a pod, 32% within a domain, 5% within a cluster, 1% across the chip
• multiple threads: 40% within a pod, 12% within a domain, 46% within a cluster, 2% across the chip
Whole Chip (8K: 8 PEs/domain, 4 clusters)
• can hold 32K instructions
• normal memory hierarchy: 4 MB, 16-way set-associative L2; 200-cycle memory latency
• traditional directory-based cache coherence
• ~400 mm² in 90 nm technology
• 1 GHz clock, ~85 watts
(Revisit the goals: short wires, defect tolerance, simple design.)
WaveScalar Instruction Placement
Operand latency is low only if producers & consumers are placed together (e.g., in the same pod); on the other hand, exploiting ILP means executing instructions in parallel on different PEs. Sometimes these goals conflict: operand latency vs. parallelism (resource conflicts).
Place instructions in PEs to maximize data locality & instruction-level parallelism. The placement algorithm is based on a performance model that captures the important performance factors:
• carve the dataflow graph into segments:
  - of a particular depth, to make chains of dependent instructions that will be placed in the same pod (short operand latency)
  - of a particular width, to make multiple independent chains that will be placed in different but nearby pods (minimizing resource conflicts & increasing parallelism)
• snake the segments across the PEs in the chip on demand
• K-loop bounding to prevent instruction "explosion" (loop control runs ahead, creating tokens that won't be used for a while)
A sketch of the carving step appears below.
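The sketch below illustrates the depth/width carving idea under assumed parameters SEG_DEPTH and SEG_WIDTH; the real placer is driven by a performance model, and all names here are illustrative rather than WaveScalar's actual algorithm.

```c
#include <stddef.h>

#define SEG_DEPTH 8   /* assumed max chain length placed in one pod    */
#define SEG_WIDTH 4   /* assumed independent chains spread over pods   */

typedef struct inst {
    struct inst *consumers[2];   /* dataflow successors                */
    size_t       n_consumers;
    int          visited;
    int          pod;            /* pod chosen by the placer           */
} inst_t;

/* Follow one dependence chain for up to SEG_DEPTH instructions and put
 * the whole chain in the same pod, keeping operand latency short.     */
static void place_chain(inst_t *head, int pod)
{
    inst_t *cur = head;
    for (int depth = 0; cur && !cur->visited && depth < SEG_DEPTH; depth++) {
        cur->visited = 1;
        cur->pod = pod;
        cur = cur->n_consumers ? cur->consumers[0] : NULL;
    }
}

/* Place up to SEG_WIDTH independent chains in neighboring pods so they
 * can execute in parallel without competing for one pod's resources.  */
void place_segment(inst_t *roots[], size_t n_roots, int first_pod)
{
    for (size_t i = 0; i < n_roots && i < SEG_WIDTH; i++)
        place_chain(roots[i], first_pod + (int)i);
}
```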
Example to Illustrate the Memory Ordering Problem
A[j + i*i] = i;
b = A[i*j];
(Dataflow graph figure: i, A & j feed multiplies & adds that compute the two addresses, then the Load, the Store & the result b.)
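A small, hedged illustration of why the graph above needs explicit memory ordering: nothing in the dataflow dependences says whether the Load and Store touch the same element, so the hardware must preserve their program order. The concrete values are made up for this example.

```c
#include <stdio.h>

int main(void)
{
    int A[16] = {0};
    int i = 2, j = 2, b;

    A[j + i*i] = i;   /* store to A[6]                                  */
    b = A[i*j];       /* load from A[4]: no aliasing with these values  */

    /* With i = 2, j = 4 instead, the store writes A[8] and the load
     * reads A[8]: the load must see the stored value, so the store has
     * to complete first, even though neither operation depends on the
     * other through any dataflow edge.                                 */
    printf("b = %d\n", b);
    return 0;
}
```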
Wave-ordered Memory
• the compiler annotates each memory operation with <predecessor, sequence #, successor>, with sequence numbers assigned in breadth-first order; a '?' marks a predecessor or successor that cannot be known statically (e.g., across a branch)
• memory requests can be sent in any order
• hardware reconstructs the correct order
(For the previous example, the Loads & Stores carry annotations such as <2,3,4>, <3,4,?>, <4,5,6>, <5,6,8>, <4,7,8> & <?,8,9>.)
Wave-ordering Example
This is how the hardware reconstructs the memory order. Annotated memory requests arrive at the store buffer in any order; the buffer uses each operation's <predecessor, sequence #, successor> annotation to detect gaps in the chain. If, say, the store after the branch arrives first, the '?' in its annotation prevents it from being sent to memory until the operations that fill the gap have arrived.
• ripples allow loads to execute in parallel, or in any order
• a memory nop keeps the chain intact when a path contains no memory operation
• wave-ordered memory is part of the architecture: use it or turn it off
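The sketch below shows the kind of gap-detection check a store buffer could perform with the <predecessor, sequence #, successor> annotations described above. It is illustrative logic under simplifying assumptions (sequence numbers index an array; ripples and memory nops are ignored), not the actual WaveScalar store-buffer design.

```c
#include <stdbool.h>

#define UNKNOWN  -1          /* the '?' annotation                      */
#define MAX_OPS  64

typedef struct {
    int  pred, seq, succ;    /* compiler-assigned ordering annotation   */
    bool arrived;            /* request has reached the store buffer    */
    bool issued;             /* already sent on to the memory system    */
} mem_op_t;

/* An operation may go to memory once every earlier operation in the
 * wave's memory chain has issued. The chain is followed through the
 * succ links; an UNKNOWN successor (a branch) is a gap that is filled
 * only when an arrived operation names the current one as predecessor. */
static bool chain_complete_up_to(mem_op_t ops[], int first_seq, int seq)
{
    int expect = first_seq;
    while (expect < seq) {
        mem_op_t *cur = &ops[expect];
        if (!cur->arrived || !cur->issued)
            return false;                      /* earlier op still missing */
        if (cur->succ != UNKNOWN) {
            expect = cur->succ;
            continue;
        }
        int next = UNKNOWN;                    /* which branch path ran?   */
        for (int s = expect + 1; s < MAX_OPS; s++)
            if (ops[s].arrived && ops[s].pred == expect) { next = s; break; }
        if (next == UNKNOWN)
            return false;                      /* gap not yet resolved     */
        expect = next;
    }
    return true;
}
```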
Wave-ordered Memory, across waves
That was a wave: a sequence of instructions with no backedges; wave-ordered memory orders the operations within a wave. (In the figure, the green bars are wave-management instructions.)
• waves are loop-free sections of the dataflow graph
• each dynamic wave has a wave number
• the wave number is incremented between waves
• memory is ordered by wave number, then by sequence number within the wave, so loop iterations & recursive functions execute their memory operations in the right order
Precise interrupts on WaveScalar: what does that mean on a dataflow machine? Instructions before the interrupted instruction have executed, instructions data-dependent on the interrupted instruction have not fired, and memory order is preserved.
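As a compact restatement of this two-level ordering (a sketch with illustrative types, for a single thread's memory stream):

```c
#include <stdint.h>

/* Total order on one thread's memory operations: compare wave numbers
 * first, then the compiler-assigned sequence number within the wave.
 * Returns <0, 0, or >0, like strcmp.                                  */
static int mem_order_cmp(uint32_t wave_a, int seq_a,
                         uint32_t wave_b, int seq_b)
{
    if (wave_a != wave_b)
        return wave_a < wave_b ? -1 : 1;
    return (seq_a > seq_b) - (seq_a < seq_b);
}
```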
WaveScalar Tag-matching
A WaveScalar tag carries a thread identifier & a wave number; a token is a tag plus a value, written <ThreadID:Wave#:InstructionID>.value. Example: an add whose inputs are <2:5>.3 and <2:5>.6 fires when both tokens with matching tags have arrived, producing <2:5>.9.
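A minimal sketch of the matching rule: an instruction fires only when all of its inputs carry identical tags, and the result reuses that tag. The token-store search and the field names below are illustrative, not the PE's real matching-table design (described earlier as a hash-indexed cache).

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t thread_id;   /* tag: thread identifier      */
    uint32_t wave_num;    /* tag: wave number            */
    uint32_t inst_id;     /* destination instruction     */
    int64_t  value;
    bool     valid;
} token_t;

static bool tags_match(const token_t *a, const token_t *b)
{
    return a->thread_id == b->thread_id &&
           a->wave_num  == b->wave_num  &&
           a->inst_id   == b->inst_id;
}

/* When a token arrives, look for its partner in the token store. If it
 * is there, fire (here, an add: <2:5>.3 + <2:5>.6 -> <2:5>.9 in the
 * slide's example); otherwise park the arriving token and wait.        */
static bool try_fire_add(token_t store[], int n, token_t arriving,
                         token_t *result)
{
    for (int i = 0; i < n; i++) {
        if (store[i].valid && tags_match(&store[i], &arriving)) {
            *result = arriving;                 /* result keeps the tag */
            result->value = store[i].value + arriving.value;
            store[i].valid = false;             /* operand consumed     */
            return true;
        }
    }
    for (int i = 0; i < n; i++) {
        if (!store[i].valid) {                  /* first free slot      */
            store[i] = arriving;
            store[i].valid = true;
            break;
        }
    }
    return false;
}
```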
Single-thread Performance
SPECfp, SPECint & Mediabench, run through a binary translator; performance is average AIPC. Raw performance is comparable: iteration parallelism vs. the overhead of calling short functions.
Single-thread Performance per Area
WaveScalar can use chip area more effectively because of its simpler & more parallel design. Encouraging, but not the point: 1 cluster (32 PEs) achieves 0.1 to 1.4 AIPC.
Multithreading the WaveCache
Architectural support for WaveScalar threads (sketched below):
• instructions to start & stop memory orderings, i.e., threads
• memory-free synchronization to allow exclusive access to data: the thread-communicate instruction passes a lock directly to the next thread on the same processor, rather than relying on cache coherence across different processors
• a fence instruction to force all previous memory operations to fully execute (so other threads see the results of this one's memory operations), used before another thread gets the lock
Combine these to build threads of multiple granularities:
• coarse-grain threads: 25-168X over a single thread; 2-16X over a CMP, 5-11X over SMT
• fine-grain, dataflow-style threads: 18-242X over a single thread
• combining the two in the same application: speedups of 1.6X or 7.9X alone rise to 9X
(More detail on thread creation/termination & performance follows.)
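The sketch below shows how the three mechanisms above might compose into a fine-grain worker. The intrinsic names (ws_memory_sequence_start/stop, ws_thread_coordinate, ws_memory_fence) are hypothetical stand-ins for the instructions the slide names, not WaveScalar's actual mnemonics.

```c
/* Hypothetical intrinsics standing in for the architectural support
 * named above; real WaveScalar instruction names & semantics differ.   */
void ws_memory_sequence_start(int thread_id); /* begin a memory ordering  */
void ws_memory_sequence_stop(int thread_id);  /* end it (thread finishes) */
void ws_thread_coordinate(int to_thread);     /* memory-free lock handoff */
void ws_memory_fence(void);                   /* prior mem ops complete   */

extern int shared_counter;

/* One fine-grain worker: update shared data, make the update visible,
 * then hand exclusive access directly to the next thread.              */
void worker(int my_id, int next_id)
{
    ws_memory_sequence_start(my_id);  /* this thread gets its own
                                         wave-ordered memory stream      */
    shared_counter += my_id;          /* exclusive access assumed held   */
    ws_memory_fence();                /* other threads must see the write */
    ws_thread_coordinate(next_id);    /* pass the "lock" to the next thread */
    ws_memory_sequence_stop(my_id);
}
```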
Creating & Terminating a Thread
(Figure: thread t creates thread s using the data/thread/wave conversion instructions.) Starting the new thread's memory ordering sets up hardware in the store buffer for wave-ordered memory; this must complete before the data-to-thread/wave conversion.
Thread Creation Overhead
Vary the size of the loop body (which affects the granularity of parallelism) & the dependence distance (which affects how many threads can execute in parallel), and compare to serial execution. The contour lines mark levels of speedup; above a line, at least that speedup is achieved. Break-even points: a 24-instruction loop body with 2 parallel copies, or a 3-instruction body with 10 copies.
Performance of Coarse-grain Parallelism
On an 8x8 configuration (IPC shown): near-linear speedup up to 32 threads, roughly 10X faster than CMPs. Ocean & raytrace drop off at 128 threads: 18% more instruction misses, 23% more matching-table misses & 2X more data-cache misses.
CMP Comparison
(Figure.)
Performance of Fine-grain Parallelism (benchmark: lcs)
Relies on:
• cheap synchronization
• loading data once & passing it between threads (not load/compute/store)
• wave-ordered memory being part of the architecture, so it can be turned on & off
Building the WaveCache
RTL-level implementation (Verilog & Synopsys CAD tools):
• some didn't believe it could be built in a normal-sized chip
• some didn't believe it could achieve a decent cycle time & load-use latencies
Different WaveCaches for different applications:
• 1 cluster: low-cost, low-power, single-thread or embedded; 42 mm² in a 90 nm process, 2.2 AIPC on Splash-2
• 16 clusters: multiple threads, higher performance; 378 mm², 15.8 AIPC
Board-level FPGA implementation: OS & real-application simulations.
(Cycle-time notes: 20 FO4 at 1 GHz, vs. the P4; fully custom SRAM cells.)