On-chip Parallelism

On-chip Parallelism. Alvin R. Lebeck, CPS 221, Week 13, Lecture 2. Administrivia: today, simultaneous multithreading and MP on a chip; project presentations (10-15 minutes); midterm II, Wed April 29, in class; project write-up due Friday May 1 at noon, approximately 8 pages.


Presentation Transcript


  1. On-chip Parallelism Alvin R. Lebeck CPS 221 Week 13, Lecture 2

  2. Administrivia
     • Today: simultaneous multithreading, MP on a chip
     • Project presentations (10-15 minutes)
     • Midterm II, Wed April 29, in class
     • Project write-up due Friday May 1, noon
       • approximately 8 pages

  3. Review: Software Coherence Protocols
     Requires
     • Access Control
     • Messaging System
       • small control messages
       • large bulk transfer
     • Programmable Processor
     • Support for Protocol operations
     Questions
     • Kernel-based vs. User-Level?
     • Integration of processor with other requirements?

  4. Review: Typhoon
     • Fully Integrated (processor, access control, NI)
     [Figure labels: P (processors), $ (caches), Mem, RTLB, NI]

  5. Software Fine-Grain Access Control
     • Low cost, can run on network of workstations
     • Flexibility of software protocol processing
     • Like SW Dirty Bits, but more general
     • For each load/store, check access bits
       • if access fault, invoke fault handler
     • Lookup Options
       • table lookup (Blizzard-S)
       • magic cookie (Shasta, Blizzard-COW)
     • Instrumentation Options
       • compiler
       • executable editing

  6. Blizzard-S
     • Supports Tempest Interface
     • Executable Editing (EEL)
     • Fast Table Lookup
       • mask, shift, add
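The "mask, shift, add" table lookup can be illustrated with a small sketch. This is not Blizzard-S code: the block size, address mask, state encoding, and the `AccessTable` class are all assumptions made for illustration. The idea shown is that each load or store masks the address, shifts it down to a block number, adds a table base, and tests the access state stored there.

```python
# Illustrative sketch only; every constant and name here is hypothetical.
BLOCK_SHIFT = 5          # assumed 32-byte coherence blocks
ADDR_MASK = 0xFFFF       # assumed 64 KB toy address space
TABLE_BASE = 0           # offset of the table inside the backing array
INVALID, READ_ONLY, READ_WRITE = 0, 1, 2   # assumed state encoding


class AccessTable:
    def __init__(self):
        n_blocks = (ADDR_MASK + 1) >> BLOCK_SHIFT
        self.table = bytearray(TABLE_BASE + n_blocks)  # all blocks start INVALID
        self.faults = []                               # log of (addr, is_store)

    def _slot(self, addr):
        # mask, shift, add: the three steps named on the slide
        return TABLE_BASE + ((addr & ADDR_MASK) >> BLOCK_SHIFT)

    def set_state(self, addr, state):
        self.table[self._slot(addr)] = state

    def check(self, addr, is_store):
        """Return True if the access is allowed; otherwise record a fault."""
        state = self.table[self._slot(addr)]
        needed = READ_WRITE if is_store else READ_ONLY
        if state < needed:
            # A real system would invoke the protocol's fault handler here.
            self.faults.append((addr, is_store))
            return False
        return True
```

In an instrumented program, a call like `check` would be inlined before every load and store, which is why the lookup is kept to a few arithmetic operations and one table read.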

  7. Shasta
     • Executable Editing (variant of ATOM)
     • Magic Cookie
         ld r1, r2[300]
         if r1 == magic_cookie
             do_out_of_line_check(x);
         add r3, r1, r4
     • Incorporates several optimizations
       • code scheduling
       • batching checks (refs to same cache lines)
     • 3% overhead on uniprocessor code
     • Multiple coherence granularity
     • Supports Release Consistency
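A minimal sketch of the magic-cookie check follows. It is not Shasta's implementation: the cookie value, line size, and the `Memory` class are hypothetical. It assumes invalid lines are overwritten with the cookie, so the common-case load needs only one compare against the loaded value, and the slower out-of-line check runs only when the value happens to match.

```python
# Illustrative sketch only; constants and class names are hypothetical.
MAGIC_COOKIE = 0xDEADBEEF   # assumed flag value written into invalid lines
LINE_WORDS = 8              # assumed words per coherence line


class Memory:
    def __init__(self, n_words):
        self.words = [0] * n_words
        self.invalid_lines = set()
        self.slow_checks = 0        # how often the out-of-line path ran

    def invalidate(self, line):
        """Mark a line invalid and fill it with the cookie."""
        self.invalid_lines.add(line)
        start = line * LINE_WORDS
        for i in range(start, start + LINE_WORDS):
            self.words[i] = MAGIC_COOKIE

    def load(self, index):
        value = self.words[index]
        if value == MAGIC_COOKIE:                   # cheap inline compare
            value = self._out_of_line_check(index, value)
        return value

    def _out_of_line_check(self, index, value):
        self.slow_checks += 1
        if index // LINE_WORDS in self.invalid_lines:
            # A real system would run the coherence protocol to fetch the line.
            raise LookupError(f"access fault on word {index}")
        return value    # valid data that merely equals the cookie
```

The out-of-line check is still needed because a valid word can legitimately hold the cookie value; the inline compare only filters out the common case.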

  8. Future Directions
     • Simultaneous Multithreading
     • Single-Chip MP
     • MultiScalar Processors (Wednesday)

  9. Multithreaded Processors
     • Exploit thread-level parallelism to improve performance
     • Multiple Program Counters
     • Thread
       • independent programs (multiprogramming)
       • threads from same program

  10. Denelcor HEP
      • General purpose scientific computer
      • Organized as MP
        • up to 16 processors
        • each processor multithreaded
        • up to 128 memory modules
        • up to 4 I/O cache modules
      • Three-input switches and chaotic routing

  11. HEP Processor Organization
      • Multiple contexts (threads)
        • each has own Program Status Word (PSW)
      • PSWs circulate in control loop
        • control and data loops pipelined 8 deep
        • PSW in control loop can circulate no faster than data in data loop
        • PSW at queue head fetches and starts execution of next instruction
      • Clock period: 100 ns
        • 8 PSWs in control loop => 10 MIPS
        • Each thread gets 1/8 of the processor
        • Maximum performance per thread => 1.25 MIPS
      (And they tried to sell it as a supercomputer)
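The HEP throughput figures follow from simple arithmetic, sketched below. The function name `hep_throughput` is hypothetical, and the sketch assumes one instruction issues per clock whenever the 8-deep loop is full of PSWs.

```python
def hep_throughput(clock_period_ns, pipeline_depth):
    """Return (aggregate MIPS, per-thread MIPS) for a full control loop.

    Assumes one instruction issues every clock, and that a thread cannot
    issue again until its PSW has gone once around the loop.
    """
    # 1e9 ns per second / period = instructions per second; / 1e6 = MIPS
    total_mips = 1e3 / clock_period_ns
    per_thread_mips = total_mips / pipeline_depth
    return total_mips, per_thread_mips


total, per_thread = hep_throughput(clock_period_ns=100, pipeline_depth=8)
print(total, per_thread)   # 10.0 1.25
```

A single thread therefore sees only one eighth of the machine, however many other threads are available to fill the loop.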

  12. Simultaneous Multithreading
      • Goal: use hardware resources more efficiently
        • especially for superscalar processors
      • Assume 4-issue superscalar
      [Figure labels: Thread, Instruction, Horizontal Waste, Vertical Waste]

  13. Operation of Simultaneous Multithreading
      • Standard multithreading can reduce vertical waste
      • Issue from multiple threads in same clock cycle
      • Eliminate both horizontal and vertical waste
      [Figure: thread instructions per issue slot, Standard Multithreading vs. Simultaneous Multithreading]
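The difference between the two kinds of waste can be made concrete with a toy model. Everything here is an assumption for illustration: the per-cycle ready counts are invented, instructions that do not issue are not carried over to later cycles, and "standard multithreading" is modeled as issuing from the single thread with the most ready instructions each cycle. Vertical waste counts slots in cycles where nothing issues; horizontal waste counts unused slots in cycles where something does.

```python
WIDTH = 4   # 4-issue superscalar, as on the slide


def waste(issued_per_cycle, width=WIDTH):
    """Return (horizontal, vertical) wasted issue slots."""
    vertical = sum(width for n in issued_per_cycle if n == 0)
    horizontal = sum(width - n for n in issued_per_cycle if n > 0)
    return horizontal, vertical


def single_thread(ready, width=WIDTH):
    # Only thread 0 runs; the other threads' instructions are ignored.
    return [min(n, width) for n in ready[0]]


def standard_mt(ready, width=WIDTH):
    # One thread per cycle: pick whichever has the most ready instructions.
    return [min(max(cycle), width) for cycle in zip(*ready)]


def smt(ready, width=WIDTH):
    # Fill the issue slots from all threads in the same cycle.
    return [min(sum(cycle), width) for cycle in zip(*ready)]


# ready[t][c] = instructions thread t could issue in cycle c (made-up data)
ready = [
    [2, 0, 1, 0],
    [1, 3, 0, 2],
    [0, 2, 2, 3],
]

print(waste(single_thread(ready)))  # (5, 8)
print(waste(standard_mt(ready)))    # (6, 0)
print(waste(smt(ready)))            # (2, 0)
```

With this data, standard multithreading removes the empty cycles but still leaves slots unused within each cycle, while issuing from several threads at once recovers most of those slots as well.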

  14. Limitations of SuperScalar Architectures
      Instruction Fetch
      • branch prediction
      • alignment of packet of instructions
      Dynamic Instruction Issue
      • Need to identify ready instructions
      • Rename Table
        • No compares
        • Large number of ports (Operands x Width)
      • Reorder Buffer
        • n x Q x O x W 1-bit comparators (src and dest)
        • Quadratic increase in queue size with issue width
        • PA-8000: 20% of die area to issue queue (56-instruction window)

  15. SuperScalar Limitations (Continued)
      Instruction Execute
      • Register File
        • more rename registers
        • more access ports
        • complexity quadratic with issue width
      • Bypass logic
        • complexity quadratic with issue width
        • wire delays
      • Functional Units
        • replicate
        • add ports to data cache (complexity adds to access time)

  16. Why Single Chip MP?
      • Technology Push
        • Benefits of wide issue are limited
        • Decentralized microarchitecture: easier to build several simple fast processors than one complex processor
      • Application Pull
        • Applications exhibit parallelism at different grains
        • < 10 instructions per cycle (Integer codes)
        • > 40 instructions per cycle (FP loops)

  17. A 6-Way SuperScalar Processor
      [Figure: 21 mm x 21 mm die floorplan. Labels: I-Cache (32 KB); D-Cache (32 KB); L2 Cache (256 KB); External Interface; Instruction Fetch; TLB; Instruction Decode & Rename; Reorder Buffer, Instruction Queues, and Out-of-Order Logic; Integer Unit; Floating Point Unit; Clocking & Pads]

  18. A 4 x 2 Single Chip Multiprocessor
      [Figure: 21 mm x 21 mm die floorplan. Labels: Processor #1 to #4; Icache 1 to 4; Dcache 1 to 4; L2 Cache (256 KB); L2 Communication Crossbar; External Interface; Clocking & Pads]

  19. Performance Comparison

  20. Summary of Performance
      • 4 x 2 MP works well for coarse grain apps
      • How well would Message Passing Architecture do?
      • Can SUIF handle pointer intensive codes?
      • For “tough” codes 6-way does slightly better, but neither is > 60% better than 2-issue
