This paper discusses the motivation behind multiscalar processors, their architecture, software and hardware support, and the distribution of cycles. It also presents results and a conclusion on how multiscalar processors can overcome the limitations of superscalar architecture.
Multiscalar Processors • Gurindar S. Sohi, Scott E. Breach, T.N. Vijaykumar • University of Wisconsin-Madison
Outline • Motivation • Multiscalar paradigm • Multiscalar architecture • Software and hardware support • Distribution of cycles • Results • Conclusion
Motivation • Current architectural techniques are reaching their limits • The amount of ILP that can be extracted by a superscalar processor is limited • Kunle Olukotun (Stanford University)
Limits of ILP • The parallelism that can be extracted from a single program is very limited – 4 or 5 in integer programs • "Limits of Instruction-Level Parallelism" – David W. Wall (1990)
Limitations of superscalar • Branch prediction accuracy limits ILP • Every 5th instruction is a branch • Executing an instruction across 5 branches yields a useful result only about 60% of the time with 90% branch prediction accuracy • Some branches are difficult to predict – increasing the window size doesn't always mean executing useful instructions
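A quick sanity check of that 60% figure: if an instruction lies beyond 5 predicted branches and each prediction is right 90% of the time, it is on the correct path with probability 0.9^5 ≈ 0.59. A minimal sketch of the arithmetic (the accuracy and branch frequency are the slide's stated assumptions):

```c
#include <math.h>
#include <stdio.h>

int main(void) {
    double accuracy = 0.90;  /* assumed per-branch prediction accuracy */
    int    branches = 5;     /* branches crossed to reach the instruction */

    /* Probability that all five predictions were correct, i.e. the
       instruction fetched beyond them is actually useful work. */
    double p_useful = pow(accuracy, branches);
    printf("P(useful) = %.2f\n", p_useful);  /* prints 0.59 */
    return 0;
}
```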
Limitations of superscalar.. contd • Large window size • Issuing more instructions per cycle needs a large window of instructions • Each cycle the whole window must be searched to find instructions to issue • This increases the pipeline length • Issue complexity • To issue an instruction, dependence checks have to be performed against the other issuing instructions • To issue n instructions, the complexity of issue grows as n²
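To see where the n² growth comes from, consider a naive issue stage that cross-checks every pair of candidate instructions for register dependences before letting them issue in the same cycle. A minimal sketch, with a hypothetical instruction encoding that is not from the paper:

```c
#include <stdbool.h>

typedef struct {
    int dest;    /* destination register, -1 if none */
    int src1;
    int src2;
} Inst;

/* Returns true if the n candidate instructions can issue together.
   Every later instruction is checked against every earlier one, so
   the number of comparisons is n*(n-1)/2, i.e. it grows as n². */
bool can_issue_together(const Inst *w, int n) {
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < i; j++) {
            /* RAW hazard: instruction i reads what instruction j writes */
            if (w[j].dest != -1 &&
                (w[i].src1 == w[j].dest || w[i].src2 == w[j].dest))
                return false;
            /* WAW hazard: both write the same register */
            if (w[j].dest != -1 && w[i].dest == w[j].dest)
                return false;
        }
    }
    return true;
}
```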
Limitations of superscalar.. contd • Load and store queue limitations • Loads and stores cannot be reordered before their addresses are known • One load or store waiting for its address can block the entire processor
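A concrete instance of that ordering problem, with purely illustrative names: whether the load below may bypass the store depends on whether `i` and `j` alias, which the hardware cannot know until both addresses have been computed.

```c
void update(int *a, int i, int j, int x, int *y) {
    a[i] = x;      /* store: address unknown until i is computed       */
    *y   = a[j];   /* load: may or may not read the value just stored; */
                   /* until both i and j are known, the load cannot    */
                   /* safely be issued ahead of the store.             */
}
```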
Superscalar limitation example • Consider the following hypothetical loop: Iter 1: inst 1, inst 2, …, inst n; Iter 2: inst 1, inst 2, … • If the window size is less than n, a superscalar processor considers only one iteration at a time • Possible improvement: execute Iter 1 and Iter 2 side by side, each running its own inst 1 … inst n
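A minimal sketch of the kind of loop the slide has in mind (the loop body is hypothetical): successive iterations are independent, so two processing elements could each run one iteration, but a single instruction window shorter than the loop body only ever sees one iteration at a time.

```c
/* Each iteration reads a[i] and writes b[i]; iterations do not depend
   on one another, so iteration i+1 could start before iteration i
   finishes if the hardware could look that far ahead. */
void scale(const int *a, int *b, int n, int k) {
    for (int i = 0; i < n; i++) {
        b[i] = a[i] * k;   /* "inst 1 … inst n" of a single iteration */
    }
}
```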
Multiscalar paradigm • Divide the program (CFG) into multiple tasks (not necessarily parallel) • Execute the tasks on different processing elements residing on the same die – communication cost is low • Sequential semantics is preserved by hardware and software mechanisms • Tasks are typically re-executed if there are any violations
Crossing the limits of superscalar • Branch prediction • Each thread executes independently • Each thread is still limited by branch prediction – but the number of useful instructions available is much larger than in a superscalar • Window size • Each processing element has its own window • The total size of the windows on a die can be very large, while each individual window stays of moderate size
Crossing the limits of superscalar.. contd • Issue Complexity • Each processing element issues only a few instructions – this simplifies the logic • Loads and Stores • Loads and stores can execute without waiting for the previous thread's loads or stores
Multiscalar architecture • A possible microarchitecture
Multiscalar execution • The sequencer walks over the CFG • According to the hints inserted in the code, it assigns tasks to PEs • PEs execute the tasks in parallel • Maintaining sequential semantics • Register dependencies • Memory dependencies • Tasks are assigned in ring order and are committed in ring order
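A toy model of that ring assignment (the PE structure and function are hypothetical, not the paper's hardware): the sequencer hands the next task to the next processing element around the ring, stalling when the ring is full; commits later proceed in the same order.

```c
#define NUM_PES 4

typedef struct {
    int task_id;   /* task currently assigned to this PE */
    int busy;      /* 1 while the task is executing or awaiting commit */
} PE;

/* Assign the next task (identified via compiler-inserted hints) to the
   next processing element in ring order.  Returns the PE index used,
   or -1 if the ring is full and the sequencer must stall. */
int assign_task(PE pes[NUM_PES], int *next_pe, int task_id) {
    PE *pe = &pes[*next_pe];
    if (pe->busy)
        return -1;
    pe->task_id = task_id;
    pe->busy    = 1;
    int assigned = *next_pe;
    *next_pe = (*next_pe + 1) % NUM_PES;   /* advance around the ring */
    return assigned;
}
```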
Register Dependencies • Register dependencies can be identified easily by the compiler • Dependencies are always synchronized • Registers that a task may write are maintained in a create mask • Reservations are created in the successor tasks using the accum mask • If a reservation exists (the value has not arrived), the instruction reading the register waits
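One way to picture the create mask and accum mask is as per-register bitmasks: a task reserves every register that some still-active predecessor may write, and an instruction reading a reserved register waits until the value is forwarded. A minimal bitmask sketch under that interpretation (names are illustrative):

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t RegMask;   /* one bit per architectural register */

/* The accum mask of a task is the union of the create masks of all
   predecessor tasks that have not yet forwarded/committed. */
RegMask build_accum_mask(const RegMask *create_masks, int n_predecessors) {
    RegMask accum = 0;
    for (int i = 0; i < n_predecessors; i++)
        accum |= create_masks[i];
    return accum;
}

/* An instruction reading register r must wait while its reservation
   bit is still set (the forwarded value has not arrived yet). */
bool must_wait(RegMask reservations, int r) {
    return (reservations >> r) & 1u;
}

/* When a predecessor forwards register r, the reservation is cleared. */
void value_arrived(RegMask *reservations, int r) {
    *reservations &= ~(1u << r);
}
```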
Memory dependencies • Cannot be found statically • Multiscalar uses an aggressive approach – always speculate • Loads don't wait for stores in the predecessor tasks • Hardware checks for violations, and a task is re-executed if it violates any memory dependence
Task commit • Speculative tasks are not allowed to modify memory • Store values are buffered in hardware • When a processing element becomes the head, it retires its values into memory • To maintain sequential semantics, tasks retire in order – hence the ring arrangement of processing elements
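A self-contained sketch of that in-order retirement (the structures and field names are hypothetical): only the head processing element is non-speculative, so only it drains its buffered stores into memory, after which headship advances around the ring.

```c
#define NUM_PES 4
#define SB_SIZE 16

typedef struct { unsigned addr; int value; } StoreEntry;

typedef struct {
    int        busy;           /* a task is assigned to this PE     */
    int        done;           /* the task has reached its stop bit */
    StoreEntry sb[SB_SIZE];    /* buffered speculative stores       */
    int        sb_count;
} PECommit;

/* Only the head PE may retire: its buffered stores are written to
   memory in program order, the PE is freed, and the next PE in the
   ring becomes the head. */
void commit_head(PECommit pes[NUM_PES], int *head, int memory[]) {
    PECommit *pe = &pes[*head];
    if (!pe->busy || !pe->done)
        return;                          /* head not finished; others wait */
    for (int i = 0; i < pe->sb_count; i++)
        memory[pe->sb[i].addr] = pe->sb[i].value;
    pe->sb_count = 0;
    pe->busy = pe->done = 0;
    *head = (*head + 1) % NUM_PES;       /* pass headship along the ring */
}
```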
Compiler support • Structure of the CFG • The sequencer needs information about tasks • The compiler or an assembly code analyzer marks the structure of the CFG – task boundaries • The sequencer walks through this information
Compiler support .. contd • Communication information • Gives the create mask as part of the task header • Sets the forward and stop bits • A register value is forwarded if the forward bit is set • The task is done when it sees a stop bit • Also needs to provide release information
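Putting those compiler-provided pieces together, a task header can be imagined as a small descriptor holding the create mask and the possible successor tasks, with forward, stop, and release information attached to individual instructions. A hypothetical layout (field names and widths are illustrative, not the paper's encoding):

```c
#include <stdint.h>

/* Per-task header emitted by the compiler or assembly code analyzer;
   the sequencer reads these to walk the CFG and assign tasks. */
typedef struct {
    uint32_t create_mask;      /* registers this task may write      */
    uint32_t num_targets;      /* number of possible successor tasks */
    uint32_t target_addr[4];   /* addresses of those successor tasks */
} TaskHeader;

/* Per-instruction annotations inside the task body. */
typedef struct {
    uint32_t raw_encoding;     /* the original instruction              */
    uint8_t  forward : 1;      /* last write to a register: forward it  */
    uint8_t  stop    : 1;      /* the task ends after this instruction  */
    uint8_t  release : 1;      /* a create-mask register will not be    */
                               /* written on this path: release it      */
} AnnotatedInst;
```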
Hardware support • Need to buffer speculative values • Need to detect memory dependence violations • If a speculative thread loads a value, its address is recorded in the ARB • If a thread stores to some location, the ARB is checked to see whether a later thread has already loaded from the same location • The speculative store values are also buffered
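A minimal sketch of that violation check (simplified to a single direct-mapped table with one entry per address, which is an assumption rather than the paper's banked ARB design): loads record their addresses together with the loading task's sequence number; a store from an earlier task then checks whether any later task has already loaded the same address.

```c
#include <stdbool.h>

#define ARB_ENTRIES 64

typedef struct {
    unsigned addr;
    int      load_task;   /* earliest task observed loading this address */
    bool     valid;
} ArbEntry;

/* Speculative load: record the address and the loading task. */
void arb_record_load(ArbEntry arb[ARB_ENTRIES], unsigned addr, int task) {
    unsigned idx = addr % ARB_ENTRIES;
    if (!arb[idx].valid || task < arb[idx].load_task) {
        arb[idx].addr      = addr;
        arb[idx].load_task = task;
        arb[idx].valid     = true;
    }
}

/* Store by `task`: if a logically later task has already loaded from
   this address, that task used a stale value and must be squashed and
   re-executed.  Returns the violating task, or -1 if there is none. */
int arb_check_store(const ArbEntry arb[ARB_ENTRIES], unsigned addr, int task) {
    unsigned idx = addr % ARB_ENTRIES;
    if (arb[idx].valid && arb[idx].addr == addr && arb[idx].load_task > task)
        return arb[idx].load_task;
    return -1;
}
```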
Cycle distribution • Best scenario – every processing element does useful work all the time – this never happens • Possible wastage • Non-useful computation • The task is squashed later due to an incorrect value or incorrect prediction • No computation • Waiting for some dependency to be resolved • Waiting to commit results • Remaining idle • No task assigned
Non-useful computation • Synchronization of memory values • Squashes usually occur on global or static data values • It is easy to predict this dependence • Explicit synchronization can be inserted to eliminate squashes due to these dependencies • Early validation of prediction • For example, loop exit testing can be done at the beginning of the iteration
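The early-validation idea can be illustrated on a simple loop, sketched below with a hypothetical body: if the exit test depends only on values available at the top of the iteration, hoisting it there lets the "another iteration follows" prediction be confirmed or squashed before the speculative body has done much work.

```c
/* Before: the exit test sits at the bottom of the iteration, so a
   wrongly predicted "one more iteration" task does a full body of
   useless work before the misprediction is discovered. */
int sum_until_zero(const int *a) {
    int i = 0, sum = 0;
    do {
        sum += a[i];           /* body of the speculative iteration */
        i++;
    } while (a[i] != 0);       /* prediction validated only here */
    return sum;
}

/* After: the exit test is performed at the top of the iteration, so
   the loop-continues prediction is validated before the body runs.
   (Equivalent to the version above whenever the loop is entered with
   a[0] != 0.) */
int sum_until_zero_early(const int *a) {
    int i = 0, sum = 0;
    while (a[i] != 0) {        /* prediction validated up front */
        sum += a[i];
        i++;
    }
    return sum;
}
```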
No computation • Intra-task dependences • These can be eliminated through a variety of hardware and software techniques • Inter-task dependences • Possible scope for scheduling to reduce the wait time • Load balancing • Tasks retire in-order • Some tasks finish fast and wait for a long time to become the head task
Differences with other paradigms • Major improvement over superscalar • VLIW – limited by the limits of static optimizations • Multiprocessor • Very similar • Communication cost is much lower • Leads to fine-grained thread parallelism
Methodology • A simulator that executes MIPS code • 5-stage pipeline • The sequencer has a 1024-entry direct-mapped cache of task descriptors
Results • Compress – long critical path • Eqntott and cmppt – have parallel loops with good coverage • Espresso – one loop has a load balancing issue • Sc – also has load imbalance • Tomcatv – good parallel loops • Cmp and wc – intra-task dependences
Conclusion • The multiscalar paradigm has very good potential • It tackles the major limits of superscalar • There is lots of scope for compiler and hardware optimizations • The paper gives a good introduction to the paradigm and also discusses the major optimization opportunities