
CS 7960-4 Lecture 20






Presentation Transcript


  1. CS 7960-4 Lecture 20 The Case for a Single-Chip Multiprocessor K. Olukotun, B.A. Nayfeh, L. Hammond, K. Wilson, K-Y. Chang Proceedings of ASPLOS-VII October 1996

  2. CMP vs. Wide-Issue Superscalar • What is the best use of on-chip real estate? • wide-issue processor (complex design/clock, diminishing ILP returns) • CMP (simple design, high TLP, lower ILP) • Contributions: • Takes area and latencies into account • Attempts fine-grain parallelization

  3. Scalability of Superscalars • Properties of large-window processors: • Requires good branch prediction and fetch • High rename complexity • High issue queue complexity (grows with issue width and window size) • High bypassing complexity • High port requirements in the register file and cache ⇒ Necessitates partitioned architectures
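The claim that issue-queue complexity grows with both window size and issue width can be made concrete with a back-of-the-envelope comparator count (a simplified model for illustration, not taken from the paper; the two-source-operands-per-entry figure is an assumption):

```python
# Simplified model: wakeup logic needs one tag comparator per source
# operand per issue-queue entry per result tag broadcast each cycle,
# so comparator count grows as entries * operands_per_entry * issue_width.
def wakeup_comparators(entries, issue_width, srcs_per_entry=2):
    return entries * srcs_per_entry * issue_width

# Quadrupling the window and tripling issue width multiplies the
# comparator count twelvefold in this model.
small = wakeup_comparators(entries=32, issue_width=2)
large = wakeup_comparators(entries=128, issue_width=6)
print(small, large, large // small)  # 128 1536 12
```

The multiplicative blow-up, rather than any single term, is what motivates the partitioned-architecture conclusion on this slide.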

  4. Application Requirements • Low-ILP programs (SPEC-Int) benefit little from wide-issue superscalar machines (a 1-wide R5000 is within 30% of a 4-wide R10000) • High-ILP programs (SPEC-FP) benefit from large windows – typically loop-level parallelism that might be easy to extract
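A toy performance model (an illustrative assumption, not the paper's methodology) in which sustained IPC is capped by both the machine's issue width and the program's inherent ILP reproduces the flavor of the R5000-vs-R10000 observation:

```python
# Toy model: sustained IPC is the smaller of what the machine can issue
# and what the program's dependences allow.
def ipc(issue_width, program_ilp):
    return min(issue_width, program_ilp)

# A low-ILP integer program (ILP ~ 1.3 here, an assumed value) rewards
# a 4-wide machine with only ~30% over a 1-wide one...
print(ipc(4, 1.3) / ipc(1, 1.3))   # 1.3

# ...while a high-ILP floating-point program uses the full width.
print(ipc(4, 6.0) / ipc(1, 6.0))   # 4.0
```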

  5. The CMP Argument • Build many small CPU cores • The small cores are enough to optimize low-ILP programs (high throughput with multiprogramming) • For high-ILP programs, the compiler parallelizes the application into multiple threads – since the cores are on a single die, the cost of communication is affordable • Low communication cost ⇒ even integer programs with moderate ILP could be parallelized
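The compiler-driven parallelization this slide describes can be sketched as a loop with no cross-iteration dependences split into per-core chunks (a minimal illustration; the function names are invented for this sketch, and on a real CMP each thread would run on its own core rather than under Python's interpreter lock):

```python
from concurrent.futures import ThreadPoolExecutor

# One chunk of the original loop body: independent iterations only.
def chunk_sum(data, lo, hi):
    return sum(x * x for x in data[lo:hi])

# Split the iteration space across n_cores worker threads, the way a
# parallelizing compiler would map loop chunks onto CMP cores.
def parallel_sum_squares(data, n_cores=4):
    step = (len(data) + n_cores - 1) // n_cores
    with ThreadPoolExecutor(max_workers=n_cores) as pool:
        futures = [pool.submit(chunk_sum, data, i, i + step)
                   for i in range(0, len(data), step)]
    return sum(f.result() for f in futures)

data = list(range(1000))
assert parallel_sum_squares(data) == sum(x * x for x in data)
```

The final reduction is the only communication between threads, which is the kind of traffic the slide argues becomes cheap when all cores share one die.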

  6. The CMP Approach • Wide-issue superscalar ⇒ the brute-force method that extracts parallelism by blindly increasing in-flight window size and using more hardware • CMP ⇒ extract parallelism by static analysis; minimum hardware complexity and maximum compiler smarts • CMP can exploit far-flung ILP, has low hardware cost • Far-flung ILP and SPEC-Int threads are hard to automatically extract ⇒ memory disambiguation, control flow
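A minimal sketch of why memory disambiguation blocks automatic extraction (the helper and its dependence rule are simplifying assumptions for illustration; negative offsets and anti-dependences are ignored): for a loop of the form a[i] = a[i - k] + b[i], iterations are independent only if the compiler can prove the dependence distance k does not fall inside the loop.

```python
# Simplified loop-carried dependence test for: a[i] = a[i - k] + b[i]
# A carried dependence exists if some iteration reads a value written
# by an earlier one, i.e. 0 < k < n_iters. If k is unknown at compile
# time (an aliased pointer, a runtime parameter), the compiler must
# conservatively assume the dependence exists and keep the loop serial.
def iterations_independent(k, n_iters):
    return not (0 < k < n_iters)

print(iterations_independent(0, 100))    # a[i] = a[i] + b[i]: parallel
print(iterations_independent(1, 100))    # a[i] = a[i-1] + b[i]: serial chain
print(iterations_independent(100, 100))  # distance exceeds trip count: parallel
```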

  7. Area Extrapolations

  8. Processor Parameters

  9. Applications

  10. 2-Wide → 6-Wide • No change in branch prediction accuracy ⇒ area penalty for 6-wide? • More speculation ⇒ more cache misses • IPC improvements of at least 30% for all programs

  11. CMP Statistics

  12. Results

  13. Clustered SMT vs. CMP – Single-Thread Performance [Figure: a four-core CMP (per-core fetch, processor, and DL1, joined by an interconnect for cache-coherence traffic) beside a clustered SMT (four fetch units feeding four clusters with per-cluster DL1s, joined by an interconnect for register traffic)]

  14. Clustered SMT vs. CMP – Multi-Program Performance [Figure: the same CMP and clustered SMT block diagrams, contrasting how the two organizations are occupied by a multi-programmed workload]

  15. Clustered SMT vs. CMP – Multi-Thread Performance [Figure: the same CMP and clustered SMT block diagrams, contrasting how a multi-threaded workload maps onto cores versus clusters]

  16. Clustered SMT vs. CMP – Multi-Thread Performance (cont.) [Figure: the same diagrams with all four clusters and DL1s engaged; register traffic crosses the SMT interconnect, cache-coherence traffic crosses the CMP interconnect]

  17. Conclusions • CMP reduces hardware/power overhead • Clustered SMT can yield better single-thread and multi-programmed performance (at high cost) • CMP can improve application performance if the compiler can extract thread-level parallelism • What is the most effective use of on-chip real estate? • Depends on the workload • Depends on compiler technology

  18. Next Class’ Paper • “The Potential for Using Thread-Level Data Speculation to Facilitate Automatic Parallelization”, J.G. Steffan and T.C. Mowry, Proceedings of HPCA-4, February 1998

