
The Stanford Hydra CMP


Presentation Transcript


  1. The Stanford Hydra CMP Lance Hammond Benedict A. Hubbert Michael Siu Manohar K. Prabhu Michael Chen Kunle Olukotun Presented by Jason Davis

  2. Introduction • Hydra is a CMP with 4 MIPS processors • Each CPU has its own L1 caches; a shared L2 cache holds the permanent state • Why a CMP? • Moore’s law is reaching its end • There is a finite amount of ILP to exploit • TLP (thread-level parallelism) vs. ILP in pipelined architectures • A CMP can exploit ILP within each core as well (TLP and ILP are orthogonal) • Wire delay favors small, simple cores • Design time: the CPU core doesn’t need to be redesigned, just replicated • Problems • Integration densities are only now giving reasons to consider new models • Difficult to convert uniprocessor code • Multiprogramming is hard

  3. Base Design • 4 MIPS cores (250 MHz) • Each core has: • An L1 data cache • An L1 instruction cache • All cores share a single L2 cache • Virtual buses (pipelined with repeaters) • Read bus (256 bits) • Acts as a general-purpose system bus for moving data between the CPUs, L2, and external memory • Wide enough to carry an entire cache line (an explicit gain of the CMP approach; a traditional multiprocessor would require too many pins) • Write bus (64 bits) • Carries writes directly from the 4 CPUs to L2 • Pipelined to allow single-cycle occupancy (not a bottleneck) • Uses simple invalidation for cache coherence (each write broadcast invalidates the line in all other L1s) • L2 cache • The point of communication between CPUs (10-20 cycles) • These buses are sufficient for 4-8 MIPS cores; more cores would need larger system buses
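
The write-through/invalidate behavior on the write bus is simple enough to caricature in a few lines of Python. The sketch below is a software analogy only, not the hardware design: the names (L1Cache, WriteBus) and the dictionary-based caches are assumptions made purely for illustration.

```python
class L1Cache:
    def __init__(self):
        self.lines = {}                       # tag -> data

    def invalidate(self, tag):
        self.lines.pop(tag, None)             # drop any stale copy of the line


class WriteBus:
    """Toy model of the 64-bit write bus: every write goes through to the
    shared L2 and broadcasts an invalidation to the other CPUs' L1 caches."""

    def __init__(self, l2, l1_caches):
        self.l2 = l2
        self.l1_caches = l1_caches

    def write(self, cpu_id, tag, data):
        self.l2[tag] = data                   # write-through to the shared L2
        for i, l1 in enumerate(self.l1_caches):
            if i != cpu_id:
                l1.invalidate(tag)            # broadcast invalidation to other L1s
        self.l1_caches[cpu_id].lines[tag] = data   # the writer keeps its own copy


l2 = {}
l1_caches = [L1Cache() for _ in range(4)]     # four MIPS cores
bus = WriteBus(l2, l1_caches)
bus.write(0, 0x40, "new value")               # CPU 0 writes; CPUs 1-3 lose the line
```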

  4. Base Design

  5. Parallel Software Performance

  6. Thread Speculation • Takes the sequence of instructions in a normal program and arbitrarily breaks it into a sequenced group of threads (a toy decomposition sketch follows the list of issues below) • Hardware must track all interthread dependencies to ensure the program behaves the same way • Code that follows a data violation on a true dependency must be re-executed • Advantages: • Does not require synchronization (unlike enforcing dependencies on conventional multiprocessor systems) • Dynamic (done at runtime), so the programmer only needs to think about it for maximum performance • Conventional parallelizing compilers miss a lot of TLP because synchronization points must be inserted wherever dependencies can happen, not just where they do happen • 5 issues to address:

  7. Thread Speculation 1. Forward data between parallel threads 2. Detect when reads occur too early (RAW hazards) 3. Safely discard speculative state after violations

  8. Thread Speculation 4. Retire speculative writes in the correct order (WAW hazards) 5. Provide memory renaming (WAR hazards)
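
As a rough software analogy for breaking a sequential region into a sequenced group of speculative threads (the toy sketch referenced on slide 6), the snippet below assigns loop iterations round-robin to the four CPUs while remembering the original sequence, which is what lets the hardware commit results, and restart violated threads, in program order. The function name and structure are invented for the illustration.

```python
def decompose(loop_iterations, num_cpus=4):
    """Map each iteration to (thread_id, cpu); thread_id preserves program order."""
    return [(tid, tid % num_cpus, work) for tid, work in enumerate(loop_iterations)]

for tid, cpu, work in decompose(["iter0", "iter1", "iter2", "iter3", "iter4"]):
    print(f"thread {tid} on CPU {cpu}: {work}")   # commits must follow tid order
```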

  9. Hydra Speculation Implementation • How Hydra handles the 5 issues: • Forward data between parallel threads: • When a thread writes to the bus, newer threads that need the data have their current cache lines for that data invalidated • On an L1 miss, the L2 is accessed, and data from the write buffers of the current or older threads replaces the data returned from L2, byte by byte • Detect when a read occurs too early: • Read bits in the primary cache mark possible violations; if a write from an earlier thread invalidates an address that was read too early, a violation is detected and the thread is restarted • Safely discard speculative state after a violation: • Permanent state is kept in L2; any L1 lines holding speculative data are invalidated, and the thread’s L2 buffer is discarded (the permanent state is not affected)
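
The forwarding and too-early-read (RAW) detection mechanisms can be sketched in software. Below, a per-thread read set stands in for the read bits in the primary cache and a per-thread write buffer stands in for the L2 speculation buffers; the class and method names (SpecThread, spec_read, spec_write) are invented for this illustration and do not correspond to the hardware.

```python
class SpecThread:
    """Toy model of one speculative thread's state."""

    def __init__(self, tid):
        self.tid = tid
        self.read_set = set()       # addresses this thread has read speculatively
        self.write_buffer = {}      # speculative writes, kept out of the real L2
        self.violated = False

    def spec_read(self, l2, threads, addr):
        # Forwarding: use the newest value written by this or an earlier thread;
        # otherwise fall back to the permanent state in L2.
        for t in sorted(threads, key=lambda t: t.tid, reverse=True):
            if t.tid <= self.tid and addr in t.write_buffer:
                value = t.write_buffer[addr]
                break
        else:
            value = l2.get(addr, 0)
        self.read_set.add(addr)     # remember the read so a late write can be caught
        return value

    def spec_write(self, threads, addr, value):
        self.write_buffer[addr] = value
        # Any *later* thread that already read addr read it too early (RAW
        # violation): it must be squashed and restarted.
        for t in threads:
            if t.tid > self.tid and addr in t.read_set:
                t.violated = True


l2 = {0x10: 5}
threads = [SpecThread(0), SpecThread(1)]
threads[1].spec_read(l2, threads, 0x10)     # thread 1 reads before thread 0 writes
threads[0].spec_write(threads, 0x10, 7)     # ...so thread 1 is now violated
print(threads[1].violated)                  # True
```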

  10. Hydra Speculation Implementation • Place speculative writes in memory in the correct order: • A separate speculative-data L2 buffer is kept for each thread • The buffers must be drained into L2 in the original thread sequence • The thread-sequencing system also sequences the buffer draining • Memory renaming: • Each CPU may only read data written by itself or by earlier threads • Writes from later threads don’t cause immediate invalidations (those writes should not be visible yet) • Ignored invalidations are recorded with a pre-invalidate bit • When a thread accesses L2, it must only see data from itself or from earlier threads’ L2 buffers • When the current thread completes, all pre-invalidated lines are checked against later threads for violations
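
Draining the per-thread buffers in the original thread order is what resolves WAW hazards. A minimal sketch of that commit step, with invented data and no pre-invalidate handling:

```python
l2 = {}                                      # permanent state lives only in L2

# (thread_id, violated?, speculative write buffer), all invented for the example
threads = [
    (0, False, {0x100: "a0"}),
    (1, True,  {0x104: "b1"}),               # violated: its buffer is discarded
    (2, False, {0x100: "a2", 0x108: "c2"}),
]

for tid, violated, buffer in sorted(threads, key=lambda t: t[0]):
    if violated:
        continue                             # discard; permanent state untouched
    l2.update(buffer)                        # drain into L2 in original sequence

print(l2)   # thread 2's write to 0x100 wins; thread 1's write never reaches L2
```

In the real design this is also where the pre-invalidate bits would belatedly apply the invalidations that were ignored while the thread was still speculative.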

  11. Hydra Speculation Implementation

  12. Hydra Speculation Implementation

  13. Speculation Performance

  14. Prototype • MIPS-based RC32364 core • SRAM macro cells • 8-Kbyte L1 data and instruction caches • 128-Kbyte L2 cache • Die is 90 mm^2 in a 0.25-micron process • A Verilog model exists; the design is moving to physical implementation through synthesis • Central arbitration for the buses will be the most difficult part: it is hard to pipeline, must accept many requests, and must reply with grant signals
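
To make the arbitration problem concrete, here is a minimal round-robin grant function in Python. The real arbiter is pipelined hardware juggling several buses and resources, so this only illustrates the per-cycle grant decision; the function name and interface are assumptions for the example.

```python
def round_robin_grant(requests, last_granted):
    """Grant one requesting CPU, starting the search just past the CPU that
    was granted last cycle so that no requester is starved."""
    n = len(requests)
    for offset in range(1, n + 1):
        cpu = (last_granted + offset) % n
        if requests[cpu]:
            return cpu                # drive this CPU's grant signal
    return None                       # no CPU is requesting the bus

# Example: CPUs 1 and 3 request; CPU 1 was granted last cycle, so CPU 3 wins.
print(round_robin_grant([False, True, False, True], last_granted=1))  # -> 3
```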

  15. Prototype

  16. Prototype

  17. Conclusion • The Hydra CMP is a high-performance, cost-effective alternative to large single-chip uniprocessors • In a similar die area it can achieve performance comparable to a uniprocessor on integer programs by using thread speculation • On multiprogrammed or highly parallel workloads it can do better than a single processor • Hardware thread speculation is not cost-intensive and can give large performance gains

  18. Questions
