This paper discusses the motivation behind multiscalar processors, their architecture, software and hardware support, and the distribution of cycles. It also presents results and a conclusion on how multiscalar processors can overcome the limitations of superscalar architecture.
Multiscalar Processors • Gurindar S. Sohi, Scott E. Breach, T.N. Vijaykumar • University of Wisconsin-Madison
Outline • Motivation • Multiscalar paradigm • Multiscalar architecture • Software and hardware support • Distribution of cycles • Results • Conclusion
Motivation • Current architectural techniques are reaching their limits • The amount of ILP that can be extracted by a superscalar processor is limited • Kunle Olukotun (Stanford University)
Limits of ILP • The parallelism that can be extracted from a single program is very limited – 4 or 5 in integer programs • "Limits of Instruction-Level Parallelism" – David W. Wall (1990)
Limitations of superscalar • Branch prediction accuracy limits ILP • Every 5th instruction is a branch • Executing an instruction across 5 branches yields a useful result only about 60% of the time with 90% branch prediction accuracy • Some branches are difficult to predict – increasing the window size doesn't always mean executing useful instructions
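A quick sanity check of that 60% figure: if an instruction lies beyond 5 predicted branches and each prediction is right 90% of the time, it is on the correct path with probability 0.9^5 ≈ 0.59. A minimal sketch of the arithmetic (the accuracy and branch frequency are the slide's stated assumptions):

```c
#include <math.h>
#include <stdio.h>

int main(void) {
    double accuracy = 0.90;  /* assumed per-branch prediction accuracy */
    int    branches = 5;     /* branches crossed to reach the instruction */

    /* Probability that all five predictions were correct, i.e. the
       instruction fetched beyond them is actually useful work. */
    double p_useful = pow(accuracy, branches);
    printf("P(useful) = %.2f\n", p_useful);  /* prints 0.59 */
    return 0;
}
```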
Limitations of superscalar.. contd • Large window size • Issuing more instructions per cycle needs a large window of instructions • Each cycle the whole window must be searched to find instructions to issue • This increases the pipeline length • Issue complexity • To issue an instruction, dependence checks have to be performed against the other issuing instructions • To issue n instructions, the complexity of issue grows as n²
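To see where the n² growth comes from, consider a naive issue stage that cross-checks every pair of candidate instructions for register dependences before letting them issue in the same cycle. A minimal sketch, with a hypothetical instruction encoding that is not from the paper:

```c
#include <stdbool.h>

typedef struct {
    int dest;    /* destination register, -1 if none */
    int src1;
    int src2;
} Inst;

/* Returns true if the n candidate instructions can issue together.
   Every later instruction is checked against every earlier one, so
   the number of comparisons is n*(n-1)/2, i.e. it grows as n². */
bool can_issue_together(const Inst *w, int n) {
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < i; j++) {
            /* RAW hazard: instruction i reads what instruction j writes */
            if (w[j].dest != -1 &&
                (w[i].src1 == w[j].dest || w[i].src2 == w[j].dest))
                return false;
            /* WAW hazard: both write the same register */
            if (w[j].dest != -1 && w[i].dest == w[j].dest)
                return false;
        }
    }
    return true;
}
```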
Limitations of superscalar.. contd • Load and store queue limitations • Loads and stores cannot be reordered before their addresses are known • One load or store waiting for its address can block the entire processor
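A concrete instance of that ordering problem, with purely illustrative names: whether the load below may bypass the store depends on whether `i` and `j` alias, which the hardware cannot know until both addresses have been computed.

```c
void update(int *a, int i, int j, int x, int *y) {
    a[i] = x;      /* store: address unknown until i is computed       */
    *y   = a[j];   /* load: may or may not read the value just stored; */
                   /* until both i and j are known, the load cannot    */
                   /* safely be issued ahead of the store.             */
}
```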
Superscalar limitation example • Consider the following hypothetical loop: Iter 1: inst 1, inst 2, …, inst n; Iter 2: inst 1, inst 2, … • If the window size is less than n, a superscalar processor considers only one iteration at a time • Possible improvement: execute Iter 1 and Iter 2 side by side, each running its own inst 1 … inst n
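A minimal sketch of the kind of loop the slide has in mind (the loop body is hypothetical): successive iterations are independent, so two processing elements could each run one iteration, but a single instruction window shorter than the loop body only ever sees one iteration at a time.

```c
/* Each iteration reads a[i] and writes b[i]; iterations do not depend
   on one another, so iteration i+1 could start before iteration i
   finishes if the hardware could look that far ahead. */
void scale(const int *a, int *b, int n, int k) {
    for (int i = 0; i < n; i++) {
        b[i] = a[i] * k;   /* "inst 1 … inst n" of a single iteration */
    }
}
```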
Multiscalar paradigm • Divide the program (CFG) into multiple tasks (not necessarily parallel) • Execute the tasks on different processing elements residing on the same die – communication cost is low • Sequential semantics is preserved by hardware and software mechanisms • Tasks are typically re-executed if there are any violations
Crossing the limits of superscalar • Branch prediction • Each thread executes independently • Each thread is still limited by branch prediction – but the number of useful instructions available is much larger than in a superscalar • Window size • Each processing element has its own window • The total size of the windows on a die can be very large, while each individual window stays of moderate size
Crossing the limits of superscalar.. contd • Issue Complexity • Each processing element issues only a few instructions – this simplifies the logic • Loads and Stores • Loads and stores can execute without waiting for the previous thread's loads or stores
Multiscalar architecture • A possible microarchitecture
Multiscalar execution • The sequencer walks over the CFG • According to the hints inserted in the code, it assigns tasks to PEs • PEs execute the tasks in parallel • Maintaining sequential semantics • Register dependencies • Memory dependencies • Tasks are assigned in ring order and are committed in ring order
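A toy model of that ring assignment (the PE structure and function are hypothetical, not the paper's hardware): the sequencer hands the next task to the next processing element around the ring, stalling when the ring is full; commits later proceed in the same order.

```c
#define NUM_PES 4

typedef struct {
    int task_id;   /* task currently assigned to this PE */
    int busy;      /* 1 while the task is executing or awaiting commit */
} PE;

/* Assign the next task (identified via compiler-inserted hints) to the
   next processing element in ring order.  Returns the PE index used,
   or -1 if the ring is full and the sequencer must stall. */
int assign_task(PE pes[NUM_PES], int *next_pe, int task_id) {
    PE *pe = &pes[*next_pe];
    if (pe->busy)
        return -1;
    pe->task_id = task_id;
    pe->busy    = 1;
    int assigned = *next_pe;
    *next_pe = (*next_pe + 1) % NUM_PES;   /* advance around the ring */
    return assigned;
}
```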
Register Dependencies • Register dependencies can be identified easily by the compiler • Dependencies are always synchronized • Registers that a task may write are maintained in a create mask • Reservations are created in the successor tasks using the accum mask • If a reservation exists (the value has not arrived), the instruction reading the register waits
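One way to picture the create mask and accum mask is as per-register bitmasks: a task reserves every register that some still-active predecessor may write, and an instruction reading a reserved register waits until the value is forwarded. A minimal bitmask sketch under that interpretation (names are illustrative):

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t RegMask;   /* one bit per architectural register */

/* The accum mask of a task is the union of the create masks of all
   predecessor tasks that have not yet forwarded/committed. */
RegMask build_accum_mask(const RegMask *create_masks, int n_predecessors) {
    RegMask accum = 0;
    for (int i = 0; i < n_predecessors; i++)
        accum |= create_masks[i];
    return accum;
}

/* An instruction reading register r must wait while its reservation
   bit is still set (the forwarded value has not arrived yet). */
bool must_wait(RegMask reservations, int r) {
    return (reservations >> r) & 1u;
}

/* When a predecessor forwards register r, the reservation is cleared. */
void value_arrived(RegMask *reservations, int r) {
    *reservations &= ~(1u << r);
}
```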
Memory dependencies • Cannot be found statically • Multiscalar uses an aggressive approach – always speculate • Loads don't wait for stores in the predecessor tasks • Hardware checks for violations, and a task is re-executed if it violates any memory dependence
Task commit • Speculative tasks are not allowed to modify memory • Store values are buffered in hardware • When a processing element becomes the head, it retires its values into memory • To maintain sequential semantics, tasks retire in order – hence the ring arrangement of processing elements
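A self-contained sketch of that in-order retirement (the structures and field names are hypothetical): only the head processing element is non-speculative, so only it drains its buffered stores into memory, after which headship advances around the ring.

```c
#define NUM_PES 4
#define SB_SIZE 16

typedef struct { unsigned addr; int value; } StoreEntry;

typedef struct {
    int        busy;           /* a task is assigned to this PE     */
    int        done;           /* the task has reached its stop bit */
    StoreEntry sb[SB_SIZE];    /* buffered speculative stores       */
    int        sb_count;
} PECommit;

/* Only the head PE may retire: its buffered stores are written to
   memory in program order, the PE is freed, and the next PE in the
   ring becomes the head. */
void commit_head(PECommit pes[NUM_PES], int *head, int memory[]) {
    PECommit *pe = &pes[*head];
    if (!pe->busy || !pe->done)
        return;                          /* head not finished; others wait */
    for (int i = 0; i < pe->sb_count; i++)
        memory[pe->sb[i].addr] = pe->sb[i].value;
    pe->sb_count = 0;
    pe->busy = pe->done = 0;
    *head = (*head + 1) % NUM_PES;       /* pass headship along the ring */
}
```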
Compiler support • Structure of the CFG • The sequencer needs information about tasks • The compiler or an assembly code analyzer marks the structure of the CFG – task boundaries • The sequencer walks through this information
Compiler support .. contd • Communication information • Gives the create mask as part of the task header • Sets the forward and stop bits • A register value is forwarded if the forward bit is set • The task is done when it sees a stop bit • Also needs to provide release information
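Putting those compiler-provided pieces together, a task header can be imagined as a small descriptor holding the create mask and the possible successor tasks, with forward, stop, and release information attached to individual instructions. A hypothetical layout (field names and widths are illustrative, not the paper's encoding):

```c
#include <stdint.h>

/* Per-task header emitted by the compiler or assembly code analyzer;
   the sequencer reads these to walk the CFG and assign tasks. */
typedef struct {
    uint32_t create_mask;      /* registers this task may write      */
    uint32_t num_targets;      /* number of possible successor tasks */
    uint32_t target_addr[4];   /* addresses of those successor tasks */
} TaskHeader;

/* Per-instruction annotations inside the task body. */
typedef struct {
    uint32_t raw_encoding;     /* the original instruction              */
    uint8_t  forward : 1;      /* last write to a register: forward it  */
    uint8_t  stop    : 1;      /* the task ends after this instruction  */
    uint8_t  release : 1;      /* a create-mask register will not be    */
                               /* written on this path: release it      */
} AnnotatedInst;
```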
Hardware support • Need to buffer speculative values • Need to detect memory dependence violations • If a speculative thread loads a value, its address is recorded in the ARB • If a thread stores to some location, the ARB is checked to see whether a later thread has already loaded from the same location • The speculative store values are also buffered
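A minimal sketch of that violation check (simplified to a single direct-mapped table with one entry per address, which is an assumption rather than the paper's banked ARB design): loads record their addresses together with the loading task's sequence number; a store from an earlier task then checks whether any later task has already loaded the same address.

```c
#include <stdbool.h>

#define ARB_ENTRIES 64

typedef struct {
    unsigned addr;
    int      load_task;   /* earliest task observed loading this address */
    bool     valid;
} ArbEntry;

/* Speculative load: record the address and the loading task. */
void arb_record_load(ArbEntry arb[ARB_ENTRIES], unsigned addr, int task) {
    unsigned idx = addr % ARB_ENTRIES;
    if (!arb[idx].valid || task < arb[idx].load_task) {
        arb[idx].addr      = addr;
        arb[idx].load_task = task;
        arb[idx].valid     = true;
    }
}

/* Store by `task`: if a logically later task has already loaded from
   this address, that task used a stale value and must be squashed and
   re-executed.  Returns the violating task, or -1 if there is none. */
int arb_check_store(const ArbEntry arb[ARB_ENTRIES], unsigned addr, int task) {
    unsigned idx = addr % ARB_ENTRIES;
    if (arb[idx].valid && arb[idx].addr == addr && arb[idx].load_task > task)
        return arb[idx].load_task;
    return -1;
}
```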
Cycle distribution • Best scenario – every processing element does useful work all the time – this never happens • Possible wastage • Non-useful computation • The task is squashed later due to an incorrect value or incorrect prediction • No computation • Waiting for some dependency to be resolved • Waiting to commit results • Remaining idle • No task assigned
Non-useful computation • Synchronization of memory values • Squashes usually occur on global or static data values • It is easy to predict this dependence • Explicit synchronization can be inserted to eliminate squashes due to these dependencies • Early validation of prediction • For example, loop exit testing can be done at the beginning of the iteration
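The early-validation idea can be illustrated on a simple loop, sketched below with a hypothetical body: if the exit test depends only on values available at the top of the iteration, hoisting it there lets the "another iteration follows" prediction be confirmed or squashed before the speculative body has done much work.

```c
/* Before: the exit test sits at the bottom of the iteration, so a
   wrongly predicted "one more iteration" task does a full body of
   useless work before the misprediction is discovered. */
int sum_until_zero(const int *a) {
    int i = 0, sum = 0;
    do {
        sum += a[i];           /* body of the speculative iteration */
        i++;
    } while (a[i] != 0);       /* prediction validated only here */
    return sum;
}

/* After: the exit test is performed at the top of the iteration, so
   the loop-continues prediction is validated before the body runs.
   (Equivalent to the version above whenever the loop is entered with
   a[0] != 0.) */
int sum_until_zero_early(const int *a) {
    int i = 0, sum = 0;
    while (a[i] != 0) {        /* prediction validated up front */
        sum += a[i];
        i++;
    }
    return sum;
}
```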
No computation • Intra-task dependences • These can be eliminated through a variety of hardware and software techniques • Inter-task dependences • Possible scope for scheduling to reduce the wait time • Load balancing • Tasks retire in-order • Some tasks finish fast and wait for a long time to become the head task
Differences with other paradigms • Major improvement over superscalar • VLIW – limited by the limits of static optimizations • Multiprocessor • Very similar • Communication cost is much lower • Leads to fine-grained thread parallelism
Methodology • A simulator that executes MIPS code • 5-stage pipeline • The sequencer has a 1024-entry direct-mapped cache of task descriptors
Results • Compress – long critical path • Eqntott and cmppt – have parallel loops with good coverage • Espresso – one loop has a load balancing issue • Sc – also has load imbalance • Tomcatv – good parallel loops • Cmp and wc – intra-task dependences
Conclusion • The multiscalar paradigm has very good potential • It tackles the major limits of superscalar • There is lots of scope for compiler and hardware optimizations • The paper gives a good introduction to the paradigm and also discusses the major optimization opportunities