Multiscalar processors

This presentation covers the motivation behind multiscalar processors, their architecture, the software and hardware support they require, and the distribution of execution cycles. It also presents results and concludes with how multiscalar processors can overcome the limitations of superscalar architectures.

Presentation Transcript


  1. Multiscalar processors Gurindar S. Sohi Scott E. Breach T.N. Vijaykumar University of Wisconsin-Madison

  2. Outline • Motivation • Multiscalar paradigm • Multiscalar architecture • Software and hardware support • Distribution of cycles • Results • Conclusion

  3. Motivation • Current architectural techniques are reaching their limits • The amount of ILP that a superscalar processor can extract is limited • Kunle Olukotun (Stanford University)

  4. Limits of ILP • The parallelism that can be extracted from a single program is very limited – about 4 or 5 instructions per cycle in integer programs • Limits of instruction-level parallelism – David W. Wall (1990)

  5. Limitations of superscalar • Branch prediction accuracy limits ILP • Roughly every 5th instruction is a branch • Executing an instruction beyond 5 unresolved branches yields a useful result only about 60% of the time with 90% branch prediction accuracy (0.9^5 ≈ 0.59) • Some branches are difficult to predict – increasing the window size doesn't always mean executing useful instructions

  6. Limitations of superscalar.. contd • Large window size • Issuing more instructions per cycle needs a large window of instructions • Each cycle, the whole window must be searched to find instructions to issue • Increases the pipeline length • Issue complexity • To issue an instruction, dependence checks have to be performed against the other issuing instructions • To issue n instructions, the complexity of the issue logic grows as n²
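
To make the n² point concrete, here is a minimal C sketch of selecting an issue group from a window; it is not the paper's issue hardware, and the instruction encoding and hazard rules are simplified assumptions. Every candidate is compared against every earlier candidate, so the checker logic grows quadratically with the issue width.

    /* Minimal sketch (assumed encoding, not real issue hardware):
     * why issuing n instructions per cycle costs O(n^2) dependence checks. */
    #include <stdbool.h>
    #include <stdio.h>

    typedef struct { int dest, src1, src2; } Inst;

    /* Returns how many of the first n window entries can issue together. */
    static int select_issue_group(const Inst *w, int n) {
        int issued = 0;
        for (int i = 0; i < n; i++) {
            bool dep = false;
            /* Each candidate is compared against every earlier candidate:
             * up to n*(n-1)/2 comparisons, i.e. O(n^2) checker logic. */
            for (int j = 0; j < i; j++) {
                if (w[i].src1 == w[j].dest || w[i].src2 == w[j].dest ||  /* RAW */
                    w[i].dest == w[j].dest) {                            /* WAW */
                    dep = true;
                    break;
                }
            }
            if (dep) break;   /* stop the group at the first dependent instruction */
            issued++;
        }
        return issued;
    }

    int main(void) {
        Inst window[] = { {1, 2, 3}, {4, 1, 5}, {6, 7, 8} };  /* 2nd reads r1 */
        printf("can issue %d of 3\n", select_issue_group(window, 3));  /* 1 */
        return 0;
    }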

  7. Limitations of superscalar.. contd • Load and store queue limitations • Loads and stores cannot be reordered before knowing their addresses • One load or store waiting for its address can block the entire processor

  8. Superscalar limitation example • Consider the following hypothetical loop: Iter 1: inst 1, inst 2, …, inst n; Iter 2: inst 1, inst 2, … • If the window size is less than n, a superscalar processor considers only one iteration at a time • Possible improvement: execute Iter 1 and Iter 2 side by side – inst 1 of each, inst 2 of each, …, inst n of each (see the sketch below)
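
As a concrete illustration of the hypothetical loop, here is an assumed C loop (not one of the paper's benchmarks). If the body is longer than the instruction window, a superscalar processor sees only one iteration at a time; a multiscalar machine could hand iteration i and iteration i+1 to different processing elements because the iterations do not depend on each other.

    /* Assumed example loop: each iteration's body plays the role of
     * "inst 1 .. inst n" on the slide, with no cross-iteration dependence. */
    #include <stdio.h>

    #define N 1024

    int main(void) {
        static int a[N], b[N], c[N];            /* zero-initialized inputs */
        for (int i = 0; i < N; i++) {
            /* body: if this exceeds the window size, a superscalar never
             * overlaps two iterations; one multiscalar PE per iteration can. */
            c[i] = a[i] * 3 + b[i];
        }
        printf("c[0] = %d\n", c[0]);
        return 0;
    }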

  9. Multiscalar paradigm • Divide the program (CFG) into multiple tasks (not necessarily parallel) • Execute the tasks on different processing elements residing on the same die – communication cost is low • Sequential semantics is preserved by hardware and software mechanisms • Tasks are re-executed if there are any violations

  10. Crossing the limits of superscalar • Branch prediction • Each thread executes independently • Each thread is still limited by branch prediction – but the number of useful instructions available is much larger than in a superscalar • Window size • Each processing element has its own window • The total size of the windows on a die can be very large, while each individual window stays moderate

  11. Crossing the limits of superscalar.. contd • Issue complexity • Each processing element issues only a few instructions – this simplifies the issue logic • Loads and stores • Loads and stores can be executed without waiting for the previous thread's loads or stores

  12. Multiscalar architecture • A possible microarchitecture

  13. Multiscalar execution • The sequencer walks over the CFG • According to the hints inserted in the code, it assigns tasks to PEs • PEs execute the tasks in parallel • Maintaining sequential semantics • Register dependencies • Memory dependencies • Tasks are assigned in the ring order and are committed in the ring order
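
A toy C sketch of the ring organization follows; the PE count, single-cycle task execution, and assignment policy are assumptions for illustration, not the paper's implementation. The sequencer hands tasks out in ring order, and only the head PE is allowed to commit, so commits also happen in ring order.

    #include <stdbool.h>
    #include <stdio.h>

    #define NUM_PE 4

    typedef struct { int task_id; bool busy; bool done; } PE;

    int main(void) {
        PE ring[NUM_PE] = {0};
        int head = 0, tail = 0, next_task = 0, committed = 0;

        while (committed < 8) {
            /* Sequencer: assign the next predicted task to the tail PE if free. */
            if (!ring[tail].busy) {
                ring[tail] = (PE){ .task_id = next_task++, .busy = true };
                tail = (tail + 1) % NUM_PE;
            }
            /* Execution: pretend every busy PE finishes its task this cycle. */
            for (int i = 0; i < NUM_PE; i++)
                if (ring[i].busy) ring[i].done = true;
            /* Commit: only the head PE retires, preserving sequential order. */
            if (ring[head].busy && ring[head].done) {
                printf("commit task %d from PE %d\n", ring[head].task_id, head);
                ring[head].busy = false;
                ring[head].done = false;
                head = (head + 1) % NUM_PE;
                committed++;
            }
        }
        return 0;
    }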

  14. Register dependencies • Register dependencies can be easily identified by the compiler • Dependencies are always synchronized • Registers that a task may write are maintained in a create mask • Reservations are created in the successor tasks using the accum mask • If the reservation exists (the value has not arrived), the instruction reading the register waits
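
The following minimal C sketch illustrates the create/accum mask idea; the bit layout and the names reg_ready and pending are illustrative assumptions, not the multiscalar interface. A read stalls while any earlier, still-active task holds a reservation on that register; once the value is forwarded, the reservation is cleared and the read may proceed.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define REG(r) (1u << (r))

    typedef struct {
        uint32_t create_mask;   /* registers this task may write          */
        uint32_t pending;       /* subset not yet produced and forwarded  */
    } Task;

    /* Can a later task read register r, given the earlier active tasks? */
    static bool reg_ready(const Task *earlier, int n_earlier, int r) {
        uint32_t accum = 0;
        for (int i = 0; i < n_earlier; i++)
            accum |= earlier[i].pending;      /* build the accum mask     */
        return (accum & REG(r)) == 0;         /* reservation still held?  */
    }

    int main(void) {
        Task t0 = { .create_mask = REG(3) | REG(5), .pending = REG(3) | REG(5) };
        printf("r3 ready? %d\n", reg_ready(&t0, 1, 3));  /* 0: must wait   */
        t0.pending &= ~REG(3);                           /* t0 forwards r3 */
        printf("r3 ready? %d\n", reg_ready(&t0, 1, 3));  /* 1: may proceed */
        return 0;
    }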

  15. Memory dependencies • Cannot be determined statically • Multiscalar uses an aggressive approach – always speculate • Loads don't wait for stores in the predecessor tasks • Hardware checks for violations, and a task is re-executed if it violates any memory dependence
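
An assumed C code pattern (not taken from the paper) where the dependence cannot be resolved at compile time: whether a later iteration's load hits the same element as an earlier iteration's store depends on run-time index values. Multiscalar would let the later task's load run speculatively and squash that task only if the addresses actually collide.

    /* Stores and loads go through run-time indices, so the compiler cannot
     * prove or disprove a cross-task memory dependence. */
    void scatter_gather(int *a, const int *idx1, const int *idx2,
                        const int *v, int *out, int n) {
        for (int i = 0; i < n; i++) {
            a[idx1[i]] = v[i];       /* store through one run-time index     */
            out[i] = a[idx2[i]];     /* load through another run-time index  */
        }
    }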

  16. Task commit • Speculative tasks are not allowed to modify memory • Store values are buffered in hardware • When the processing element becomes the head, it retires its values into memory • To maintain sequential semantics, tasks retire in order – hence the ring arrangement of processing elements
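
A minimal C sketch of speculative store buffering, with an assumed buffer size and layout: a non-head task writes only into its private buffer, and the buffer is drained to memory when the task becomes the head and commits.

    #include <stdio.h>

    #define BUF_SZ 16

    typedef struct { unsigned addr; int value; } StoreEntry;
    typedef struct { StoreEntry buf[BUF_SZ]; int n; } StoreBuffer;

    static int memory[64];                      /* stand-in for main memory */

    static void spec_store(StoreBuffer *sb, unsigned addr, int value) {
        sb->buf[sb->n++] = (StoreEntry){ addr, value };  /* buffer, don't write */
    }

    static void commit_task(StoreBuffer *sb) {
        for (int i = 0; i < sb->n; i++)         /* drain in program order */
            memory[sb->buf[i].addr] = sb->buf[i].value;
        sb->n = 0;
    }

    int main(void) {
        StoreBuffer sb = { .n = 0 };
        spec_store(&sb, 7, 42);                   /* speculative: memory untouched */
        printf("before commit: %d\n", memory[7]); /* 0  */
        commit_task(&sb);                         /* task is now the head */
        printf("after commit:  %d\n", memory[7]); /* 42 */
        return 0;
    }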

  17. Compiler support • Structure of the CFG • The sequencer needs information about tasks • The compiler or an assembly-code analyzer marks the structure of the CFG – task boundaries • The sequencer walks through this information

  18. Compiler support .. contd • Communication information • Gives the create mask as part of the task header • Sets the forward and stop bits • A register value is forwarded if its forward bit is set • A task is done when it sees a stop bit • Also needs to provide register release information
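
A hypothetical C sketch of what a compiler-produced task header and per-instruction hints might carry; the field names and encoding here are assumptions for illustration, not the actual multiscalar format (real implementations pack the hints into the binary rather than into side structures).

    #include <stdint.h>
    #include <stdio.h>

    typedef struct {
        uint32_t create_mask;     /* registers the task may write              */
        uint32_t successors[4];   /* possible start addresses of the next task */
        uint8_t  num_successors;
    } TaskDescriptor;

    typedef struct {
        uint8_t forward;  /* 1: forward this register value to later tasks      */
        uint8_t stop;     /* 1: this instruction ends the task                   */
        uint8_t release;  /* 1: register in the create mask will not be written  */
    } InstHints;

    int main(void) {
        TaskDescriptor t = { .create_mask = (1u << 3) | (1u << 5),
                             .successors = { 0x1000, 0x1080 },
                             .num_successors = 2 };
        printf("create mask 0x%x, %u successors\n",
               (unsigned)t.create_mask, (unsigned)t.num_successors);
        return 0;
    }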

  19. Hardware support • Need to buffer speculative values • Need to detect memory dependence violations • If a speculative task loads a value, its address is recorded in the ARB (Address Resolution Buffer) • If a task stores to some location, the ARB is checked to see whether a later task has already loaded from that location • The speculative values are also buffered
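
A minimal C sketch of ARB-style violation detection, with assumed entry formats and sizes: speculative loads record their address and task number, and a store from an earlier task probes those records; a hit from a later task means that task loaded a stale value and must be squashed.

    #include <stdbool.h>
    #include <stdio.h>

    #define ARB_ENTRIES 32

    typedef struct { unsigned addr; int task; bool valid; } LoadRecord;
    static LoadRecord arb[ARB_ENTRIES];

    static void record_spec_load(unsigned addr, int task) {
        for (int i = 0; i < ARB_ENTRIES; i++)
            if (!arb[i].valid) { arb[i] = (LoadRecord){ addr, task, true }; return; }
    }

    /* Returns a later task to squash, or -1 if the store causes no violation. */
    static int check_store(unsigned addr, int storing_task) {
        for (int i = 0; i < ARB_ENTRIES; i++)
            if (arb[i].valid && arb[i].addr == addr && arb[i].task > storing_task)
                return arb[i].task;   /* a later task read this location too early */
        return -1;
    }

    int main(void) {
        record_spec_load(0x100, /* task */ 2);          /* task 2 loads speculatively */
        int victim = check_store(0x100, /* task */ 1);  /* task 1 stores afterwards   */
        printf("squash task %d\n", victim);             /* prints: squash task 2      */
        return 0;
    }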

  20. Cycle distribution • Best scenario – all processing elements always do useful work – this never happens • Possible wastage • Non-useful computation • The task is squashed later due to an incorrect value or an incorrect prediction • No computation • Waits for some dependency to be resolved • Waits to commit its result • Remains idle • No task assigned

  21. Non-useful computation • Synchronization of memory values • Squashes usually occur on global or static data values • Such dependences are easy to predict • Explicit synchronization can be inserted to eliminate squashes due to these dependences • Early validation of predictions • For example, loop-exit testing can be done at the beginning of the iteration (see the sketch below)
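
An illustrative C example of early validation (an assumed example, not from the paper's benchmarks): hoisting the loop-exit test to the top of the iteration validates the "another iteration follows" prediction right away, so a wrongly spawned final task is squashed before it does much work.

    #include <stdio.h>

    /* Before: the exit condition is known only at the bottom of the iteration. */
    static int sum_until_zero_v1(const int *a) {
        int s = 0, i = 0, x;
        do {
            x = a[i++];          /* the rest of the task's work sits here ...   */
            s += x;
        } while (x != 0);        /* ... and the exit is decided only at the end */
        return s;
    }

    /* After: test first, so the next task learns immediately whether it
     * should exist at all. */
    static int sum_until_zero_v2(const int *a) {
        int s = 0, i = 0;
        while (a[i] != 0)        /* early validation of the loop-exit prediction */
            s += a[i++];
        return s + a[i];         /* a[i] is the terminating 0, matching v1 */
    }

    int main(void) {
        int a[] = { 3, 4, 5, 0 };
        printf("%d %d\n", sum_until_zero_v1(a), sum_until_zero_v2(a));  /* 12 12 */
        return 0;
    }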

  22. No computation • Intra-task dependences • These can be eliminated through a variety of hardware and software techniques • Inter-task dependences • Possible scope for scheduling to reduce the wait time • Load balancing • Tasks retire in-order • Some tasks finish fast and wait for a long time to become the head task

  23. Differences with other paradigms • Major improvement over superscalar • VLIW – limited by what static optimization can extract • Multiprocessor • Very similar • Communication cost is much lower • Enables fine-grained thread-level parallelism

  24. Methodology • A simulator that executes MIPS code • 5-stage pipeline • The sequencer has a 1024-entry direct-mapped cache of task descriptors

  25. Results

  26. Results • Compress – long critical path • Eqntott and cmppt – parallel loops with good coverage • Espresso – one loop has a load-balancing issue • Sc – also has load imbalance • Tomcatv – good parallel loops • Cmp and wc – intra-task dependences

  27. Conclusion • The multiscalar paradigm has very good potential • It tackles the major limits of superscalar processors • There is plenty of scope for compiler and hardware optimizations • The paper gives a good introduction to the paradigm and also discusses the major optimization opportunities

  28. Discussion

  29. BREAK!
