Multiscalar Processors Presented by Matthew Misler Gurindar S. Sohi, Scott E. Breach, T. N. Vijaykumar University of Wisconsin-Madison ISCA ‘95
Scalar Processors Instruction Queue Execution Unit addu $20, $20, 16 ld $23, SYMVAL-16($20) move $17, $21 beq $17, $0, SKIPINNER ld $8, LELE($17)
SuperScalar Processors Instruction Queue Execution Unit addu $20, $20, 16 ld $23, SYMVAL-16($20) move $17, $21 beq $17, $0, SKIPINNER ld $8, LELE($17)
Fetch-Execute • The paradigm has been around for about 60 years • Superscalar processors execute instructions out of order • Re-ordering is sometimes done in hardware • Sometimes in software • Sometimes both • The result is a partial ordering of instructions
Control Flow Graphs • Segments are split on control dependencies (conditional branches)
Sequential “Walk” • Walk through the CFG with enough parallelism • Use speculative execution and branch prediction to raise the level of parallelism • Sequential semantics must be preserved • Can still execute out of order, but in-order commit
Multiscalars and Tasks • CFG broken down into tasks • Multiscalars step through at the task level • No inspection of instructions within a task • Each Task is assigned to one ‘processing unit’ • Multiple tasks can execute in parallel
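The task-level stepping described above can be sketched in a few lines. This is a simplified model, not the paper's mechanism: `Task` and `assign_tasks` are illustrative names, and tasks are handed out round-robin in program order (real hardware assigns a task to whichever unit frees up next on the ring).

```python
# Hypothetical sketch: a CFG already partitioned into tasks, assigned in
# program order to processing units. The sequencer never inspects the
# basic blocks inside a task.

from dataclasses import dataclass

@dataclass
class Task:
    name: str        # task label
    blocks: list     # basic blocks inside the task (opaque to the sequencer)

def assign_tasks(tasks, n_units):
    """Assign each task, in program order, to the next unit on the ring."""
    schedule = {u: [] for u in range(n_units)}
    for i, task in enumerate(tasks):
        schedule[i % n_units].append(task.name)
    return schedule

tasks = [Task("T0", ["A", "B"]), Task("T1", ["C"]),
         Task("T2", ["D", "E"]), Task("T3", ["B", "C"])]
print(assign_tasks(tasks, 3))  # unit 0 gets T0 and T3
```

The point of the sketch is that the unit of distribution is the task, so several tasks (and the instructions inside them) are in flight at once.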
Multiscalar Microarchitecture • Sequencer • Queue of processing units • Unidirectional ring • Each has an instruction cache, processing element, register file • Interconnect • Data Bank • Each has: address resolution buffer, data cache
Outline • Multiscalar Microarchitecture • Tasks • Multiscalars in-depth • Distribution of cycles • Comparison to other paradigms • Performance • Conclusion
Tasks • Sequencer distributes a task to a Processing unit • Unit fetches and executes the task until completion • The Instruction Window is bounded by • The first instruction in the earliest executing task • The last instruction in the latest executing task • So? Instruction windows can be huge
Tasks Example [Figure: CFG with basic blocks A–E; the three processing units execute the task sequences ABCBBCD, ABBCD, and ABCBCDE]
Tasks • Hold true to sequential semantics inside each block • Enforce sequential order on tasks overall • The circular queue takes care of this part • In the previous example: • The head of the queue does ABCBBCD • The middle unit does ABBCD • The tail of the queue does ABCBCDE
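The circular queue's ordering guarantee can be sketched as follows. This is a deliberately minimal model (a plain deque of task names stands in for the ring of processing units): tasks enter at the tail, and only the head, the oldest task, is allowed to commit.

```python
from collections import deque

# Sketch of the unidirectional ring: new tasks are dispatched at the
# tail, and only the head of the queue may retire, so tasks commit in
# sequential program order even though they execute in parallel.

class TaskRing:
    def __init__(self, n_units):
        self.n_units = n_units
        self.queue = deque()          # oldest task at the left (head)

    def dispatch(self, task):
        if len(self.queue) == self.n_units:
            raise RuntimeError("no free processing unit")
        self.queue.append(task)       # enters at the tail of the ring

    def commit_head(self):
        return self.queue.popleft()   # only the oldest task retires

ring = TaskRing(3)
for t in ["ABCBBCD", "ABBCD", "ABCBCDE"]:
    ring.dispatch(t)
print(ring.commit_head())  # "ABCBBCD" retires first
```

Out-of-order completion inside the ring is fine; the head-only commit is what preserves sequential semantics.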
Tasks • Registers • Create mask • May produce values for a future task • Forward values down the ring • Accum mask • Union of the create masks of active tasks • Memory • If it’s a known producer-consumer, then synchronize on loads and stores
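The create/accum mask idea maps naturally onto bitmasks. In this sketch (illustrative only; bit i stands for register $i), a task's create mask marks the registers it may produce, and the accum mask is the union of the create masks of the active predecessor tasks: a later task must wait for any register set in the accum mask.

```python
# Register communication sketch: create masks record which registers a
# task may write; the accum mask (union of active predecessors' create
# masks) tells a later task which registers it must wait to receive over
# the ring rather than read locally.

def create_mask(regs_written):
    m = 0
    for r in regs_written:
        m |= 1 << r
    return m

def accum_mask(create_masks):
    m = 0
    for cm in create_masks:
        m |= cm                      # union of predecessors' create masks
    return m

t0 = create_mask([17, 20])           # an earlier task writes $17, $20
t1 = create_mask([8])                # the next task writes $8
acc = accum_mask([t0, t1])
print(bool(acc & (1 << 17)))         # later task must wait for $17: True
print(bool(acc & (1 << 9)))          # $9 is safe to read locally: False
```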
Tasks • Memory (cont’d) • Unknown P-C relationship • Conservative approach: wait • Aggressive approach: speculate • Conservative approach means sequential operation • Aggressive approach requires dynamic checking, squashing and recovery
Outline • Multiscalar basics • Tasks • Multiscalars in-depth • Distribution of cycles • Comparison to other paradigms • Performance • Conclusion
Multiscalar Programs • Code for the tasks • Small changes to existing ISA • add specification of tasks • no major overhaul • Structure of the CFG and tasks • Communications between tasks
Control Flow Graph Structure • Successors • Task descriptor • Producing and consuming values • Forward register information on last update • Compiler can mark instructions: operate and forward • Stopping conditions • Special condition, evaluate conditions, complete • All of these can be viewed as tag bits
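A task descriptor along these lines might be modeled as below. The field names are assumptions for illustration, not the paper's encoding; the idea is that the descriptor carries the first-instruction address, the create mask, the possible successors, and per-instruction tag bits for "forward" and "stop".

```python
# Hypothetical task descriptor: all field names are illustrative.
# forward_bits marks instructions whose results are forwarded down the
# ring on their last update; stop_bits marks instructions that can end
# the task.

from dataclasses import dataclass, field

@dataclass
class TaskDescriptor:
    first_pc: int                                   # address of the first instruction
    create_mask: int                                # registers this task may produce
    successors: list                                # possible next-task entry points
    forward_bits: set = field(default_factory=set)  # PCs whose result is forwarded
    stop_bits: set = field(default_factory=set)     # PCs that may end the task

td = TaskDescriptor(first_pc=0x400100, create_mask=0b1_0000_0000,
                    successors=[0x400180, 0x400200],
                    forward_bits={0x400110}, stop_bits={0x400170})
print(len(td.successors))  # 2 possible successor tasks
```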
Multiscalar Hardware • Walks through the CFG • Assigns tasks to processing units • Executes tasks in a ‘sequential’ order • The sequencer fetches the task descriptors • Using the address of the first instruction • Specifying the create masks • Constructing the accum mask • Using the task descriptor, predicts the successor
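The sequencer's walk can be sketched as a loop over tasks and predicted successors. This is a toy model: real hardware uses a dynamic predictor, while the sketch simply picks the first listed successor; the CFG and names are invented for illustration.

```python
# Sequencer sketch: walk the CFG one task at a time, choosing which
# successor to dispatch next. A trivial "always take successor 0"
# predictor stands in for dynamic prediction.

cfg = {                              # task -> possible successor tasks
    "T0": ["T1", "T2"],
    "T1": ["T3"],
    "T2": ["T3"],
    "T3": [],
}

def sequence(start, steps):
    walk = [start]
    task = start
    for _ in range(steps):
        succs = cfg[task]
        if not succs:
            break
        task = succs[0]              # predicted successor (static predictor)
        walk.append(task)
    return walk

print(sequence("T0", 5))  # ['T0', 'T1', 'T3']
```

A mispredicted successor is what forces the squash of the wrongly dispatched task and everything after it.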
Multiscalar Hardware • Databanks • Updates to the cache are not speculative • Uses an Address Resolution Buffer (ARB) • Detects violations of memory dependences • Initiates corrective action • If it runs out of space, squash tasks • Not the head of the queue; it doesn’t use the ARB • Can stall rather than squash
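The ARB's violation check can be sketched as follows. This is a simplification of the real buffer (names and structure are assumptions; a lower task id means earlier in program order): speculative loads are recorded per address, and a store from an earlier task to an address that a later task has already loaded means that later task used a stale value and must be squashed.

```python
# Address Resolution Buffer sketch: record which tasks speculatively
# loaded each address; a store from an earlier task to the same address
# flags every later task that loaded it as a memory-order violation.

class ARB:
    def __init__(self):
        self.loads = {}              # address -> set of task ids that loaded it

    def load(self, task_id, addr):
        self.loads.setdefault(addr, set()).add(task_id)

    def store(self, task_id, addr):
        # any later task that already loaded this address used a stale value
        return sorted(t for t in self.loads.get(addr, ()) if t > task_id)

arb = ARB()
arb.load(task_id=2, addr=0x1000)            # later task speculatively loads
squash = arb.store(task_id=0, addr=0x1000)  # earlier task stores same address
print(squash)  # [2] -> task 2 (and its successors) must be squashed
```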
Multiscalar Hardware • Remember the earlier architectural picture?
Multiscalar Hardware • It’s not the only possible architecture • Possible design with shared functional units • Possible design with ARB and data cache on the same side as the processing units • Scaling the interconnect is non-trivial • Glossed over
Outline • Multiscalar Basics • Tasks • Multiscalars In-Depth • Distribution of Cycles • Comparison to Other Paradigms • Performance • Conclusion
Distribution of Cycles • Wasted cycles: • Non-useful computation • Squashed • No computation • Waiting • Remains idle • No assigned task
Distribution of Cycles • Non-useful computation cycles • Determine useless computation early • Validate the prediction early • Check whether the next task is predicted correctly • E.g., test for loop exit at the start of the loop • Tasks violating sequentiality are squashed • To avoid this, try to synchronize memory communication as with register communication • Could delay the load for a number of cycles • Can use signal-wait synchronization
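The signal-wait idea for a known producer-consumer memory dependence can be sketched as below. A simple flag set stands in for the hardware mechanism (illustrative only): the consumer holds its load until the producer signals that the store has completed, trading a few waiting cycles for not being squashed.

```python
# Sketch of signal-wait synchronization on a known memory dependence:
# the producer signals when its store completes; the consumer checks the
# signal before issuing its load, avoiding a speculative load that would
# later be squashed.

class SignalWait:
    def __init__(self):
        self.signaled = set()        # addresses whose store has completed

    def signal(self, addr):          # producer: store done, release waiters
        self.signaled.add(addr)

    def can_load(self, addr):        # consumer: safe to issue the load?
        return addr in self.signaled

sync = SignalWait()
print(sync.can_load(0x2000))         # False: the load must wait
sync.signal(0x2000)                  # producer's store completes
print(sync.can_load(0x2000))         # True: the load proceeds, no squash
```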
Distribution of Cycles • Contrast with no assigned task • No computation cycles • Dependencies within the same task • Dependencies between tasks (earlier/later) • Load Balancing
Outline • Multiscalar Basics • Tasks • Multiscalars In-Depth • Distribution of Cycles • Comparison to Other Paradigms • Performance • Conclusion
Comparison to Other Paradigms • Branch prediction • The sequencer only needs to predict branches across tasks • Wide instruction window • A superscalar must check every buffered instruction for issue readiness; a multiscalar inspects relatively few instructions per unit
Comparison to Other Paradigms • Issue logic • Superscalar processors need n² issue logic • Multiscalar logic is distributed • Each processing unit issues instructions independently • Loads and stores • A superscalar normally uses sequence numbers to manage its load/store buffers • In a multiscalar, units issue loads and stores independently
Comparison to Other Paradigms • Superscalar processors must discover the CFG as they decode branches • A multiscalar only requires the compiler to split the code into tasks • Multiprocessors require all dependences to be known or conservatively provided for • A multiprocessor can execute code in parallel only if the compiler can prove it independent
Outline • Multiscalar Basics • Tasks • Multiscalars In-Depth • Distribution of Cycles • Comparison to Other Paradigms • Performance • Conclusion
Performance • Simulated machine • 5-stage pipeline • [Table: functional unit latencies]
Performance • Memory • Non-blocking loads and stores • 10-cycle latency for the first 4 words • 1 cycle for each additional 4 words • Instruction cache: 1 cycle for 4 words • 10+3 cycles for a miss • Data cache: 1 word per cycle • 10+3 cycles plus bus contention for a miss • 1024-entry cache of task descriptors
Performance [Chart: dynamic instruction count overhead per benchmark] +12.2% on average
Performance – Summary • Most of the benchmarks achieve speedup • E.g., an average of 1.924 on a 1-way in-order 4-unit multiscalar • Worst case: 0.86 speedup (a slowdown) • Gcc and Xlisp suffer many squashes from misprediction and memory-order violations • This leads to almost sequential execution • Keep in mind the 12.2% increase in instruction count
Outline • Multiscalar Basics • Tasks • Multiscalars In-Depth • Distribution of Cycles • Comparison to Other Paradigms • Performance • Conclusion
Conclusion • Divide the CFG into tasks • Assign tasks to processing units • Walk the CFG in task-size steps • Shows performance gains