Current Trends in CMP/CMT Processors

Wei Hsu, 7/26/2006

Trends in Emerging Systems
• Industry and the community could previously rely on higher clock frequency and microarchitecture innovations to steadily improve computer performance each year:
  • superscalar issue
  • out-of-order issue
  • on-chip caching
  • deep pipelines supported by sophisticated branch predictors

Trends in Emerging Systems (cont.)
• Processor designers have found it increasingly difficult to manage:
  • power dissipation
  • chip temperature
  • current swings
  • design complexity
  • decreasing transistor reliability
• These are physics problems, not necessarily innovation problems.

Moore’s Law

Performance Increase of Workstations
Less than 1.5x every 18 months

The Power Challenge

As long as there is sufficient TLP

MultiCore becomes mainstream • Commercial examples:

Why MultiCore becomes mainstream
• TLP vs. ILP
  • Physical limitations have caused serious heat dissipation problems.
  • Memory latency continues to limit single-thread performance.
  • Designers now push TLP (Thread-Level Parallelism) rather than ILP or higher clock frequency, e.g. Sun UltraSPARC T1 trades single-thread performance for higher throughput to keep its server market.
  • Server workloads are broadly characterized by high TLP, low ILP, and large working sets.

Why MultiCore becomes mainstream
• A CMP with a shared cache can reduce the expensive coherence miss penalty.
• As L2/L3 caches become larger, coherence misses start to dominate performance for server workloads.
• SMP has been used successfully for years, so software is relatively mature for multicore chips.
• New applications tend to have high TLP, e.g. media apps, server apps, games, network processing, etc.
• One alternative to multicore is SoC.

Moore’s Law will continue to provide transistors
• Transistors will be used for more cores, caches, and new features.
• More cores increase TLP; caches address memory latency.

CMT (Chip Multi-Threading)
• CMT processors support many simultaneous hardware threads of execution:
  • SMT (Simultaneous Multi-Threading)
  • CMP (i.e. multi-core)
• CMT is about on-chip resource sharing:
  • SMT: threads share most resources.
  • CMP: threads share pins and the bus to memory, and may also share L2/L3 caches.

Single Instruction Issue Processors
[Figure: FU utilization over time]
• Reduced FU utilization due to memory latency, data dependencies, or branch misprediction.

Superscalar Processors
[Figure: FU utilization over time]
• Superscalar issue leads to higher performance, but lower FU utilization.

SMT (Simultaneous Multi-Threading) Processors
[Figure: FU utilization over time]
• Maximize FU utilization by issuing operations from two or more threads.
• Example: Pentium 4 Hyper-Threading

Vertical Multi-Threading
[Figure: thread execution over time; labels: "Stall cycles", "DCache miss occurs"]

Vertical Multi-Threading
[Figure: thread execution over time; label: "DCache miss occurs"]
• Switch to the second thread on a long-latency event (e.g. an L2 cache miss).
• Example: Montecito uses event-driven MT.

Horizontal MT
[Figure: thread execution over time]
• A thread switch occurs on every cycle.
• Example: Sun Niagara (T1) with 4 threads per core

MT in Niagara T1
[Figure: thread execution over time]
• A thread switch occurs on every cycle.
• The processor issues a single operation per cycle.

MT in Niagara T2
[Figure: thread execution over time]
• A thread switch occurs on every cycle.
• The processor issues two operations per cycle.
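The thread-switch policies on the preceding slides can be compared with a toy model. The sketch below is only an illustration under stated assumptions: a single-issue core, a fixed miss latency, no thread-switch penalty, and a made-up workload encoding ('c' for a compute op, 'm' for an op that misses). The function name `simulate` and its parameters are hypothetical and do not describe any real processor.

```python
def simulate(threads, policy, miss_latency=4):
    """Toy single-issue core (illustrative model, not a real pipeline).

    threads: list of strings, one per hardware thread; 'c' is a 1-cycle
             compute op, 'm' is an op that misses and blocks its thread
             for miss_latency further cycles.
    policy:  "none"        run threads one after another, stalling on misses
             "on_miss"     stay on a thread until it blocks, then switch
             "every_cycle" rotate to the next ready thread each cycle
    Returns (total_cycles, issue_slot_utilization).
    """
    if policy not in ("none", "on_miss", "every_cycle"):
        raise ValueError(policy)
    n = len(threads)
    total_ops = sum(len(t) for t in threads)
    if total_ops == 0:
        return 0, 0.0
    pc = [0] * n          # index of the next op of each thread
    ready_at = [0] * n    # first cycle in which each thread may issue again
    current = cycle = issued = 0
    while issued < total_ops:
        if policy == "none":
            # No multithreading: only the first unfinished thread may issue.
            order = [next(t for t in range(n) if pc[t] < len(threads[t]))]
        elif policy == "on_miss":
            # Prefer the current thread; fall through to others if it is blocked.
            order = [(current + i) % n for i in range(n)]
        else:
            # Round-robin: start the search at the thread after the current one.
            order = [(current + 1 + i) % n for i in range(n)]
        pick = next((t for t in order
                     if pc[t] < len(threads[t]) and ready_at[t] <= cycle), None)
        if pick is not None:
            op = threads[pick][pc[pick]]
            pc[pick] += 1
            issued += 1
            if op == "m":
                ready_at[pick] = cycle + 1 + miss_latency
            current = pick
        cycle += 1        # a cycle with no pick is an idle issue slot
    return cycle, issued / cycle


# Hypothetical workload: four threads, each starting with a miss.
workload = ["mccc"] * 4
for policy in ("none", "on_miss", "every_cycle"):
    cycles, utilization = simulate(workload, policy)
    print(policy, cycles, round(utilization, 2))
```

In this model both switching policies fill idle slots with work from other threads, which is the effect the slides describe; a real design also has to account for switch overhead and shared-resource contention, which the sketch ignores.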

CMT Evolution
• The Stanford Hydra CMP project started putting 4 MIPS processors on one chip in 1996.
• The DEC/Compaq Piranha project proposed including 8 Alpha cores and an L2 cache on a single chip in 2000.
• Sun’s MAJC chip, released in 1999, was a dual-core processor with a shared L1 cache.
• IBM Power4 is dual-core (2001); Power5 is dual-core, with each core 2-way SMT.
• Sun’s Gemini and Jaguar were dual-core processors (2003); Panther (2005) has a shared on-chip L2 cache; Niagara (T1, 2006) is a 32-way CMT with 8 cores and 4 threads per core.
• Intel Montecito (Itanium 2 follow-up) will have two cores and two threads per core.

CMT Design Trends
[Figure: Jaguar (2003), Panther (2005), Niagara (T1) (2006)]

Multi-Core Software Support
• Multi-core demands threaded software
• Importance of threading
  • Do nothing: the OS is ready, and background jobs can also benefit
  • Parallelize: unlock the potential (apps, libraries, compiler-generated threads)
• Key challenges
  • Scalability
  • Correctness
  • Ease of programming

Multi-Core Software Challenges
• Scalability
  • OpenMP (for an SMP/CMP node), MPI (for clusters), or a mix of both
• Correctness
  • Thread checkers, thread profilers, performance analyzers, and memory checkers simplify the creation and debugging of scalable, thread-safe code
• Ease of programming
  • New programming models (e.g. a C++ template-based runtime library that simplifies application writing with pre-built and tested algorithms and data structures)
  • The transactional memory concept
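As a rough illustration of what "parallelize with a pre-built runtime" can look like, the sketch below splits a reduction into chunks and hands them to a standard thread pool. It is an assumption-laden example, not the library the slide refers to: the helper names `chunked` and `parallel_sum_of_squares` are hypothetical, and in standard CPython the global interpreter lock generally keeps CPU-bound threads from running in parallel, so this shows the decomposition pattern rather than a speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(data, n_chunks):
    """Split data into at most n_chunks contiguous slices."""
    size = max(1, (len(data) + n_chunks - 1) // n_chunks)
    return [data[i:i + size] for i in range(0, len(data), size)]


def sum_of_squares(chunk):
    """Work done by one thread on its own slice (no shared state)."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(data, workers=4):
    """Map the per-chunk work onto a thread pool, then reduce the partials."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(sum_of_squares, chunked(data, workers))
        return sum(partials)


print(parallel_sum_of_squares(list(range(10)), workers=3))
```

Because each worker touches only its own slice and the partial results are combined after the fact, there is no shared mutable state to protect, which sidesteps the correctness issues listed on the slide.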

CMT Optimization Challenges
• Traditional optimization assumes that all the resources in a processor can be used by the code being optimized. On a CMT processor:
  • Prefetching may take bus bandwidth away from the other core (and the latency may be hidden anyway).
  • Code duplication/specialization may take away shared cache space.
  • Speculative execution may take resources away from a second thread.
  • Parallelization may reduce total throughput.
• Resource information is often determined only at runtime. New policies and mechanisms are needed to maximize total performance.
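One way to see why parallelization may reduce total throughput is a back-of-the-envelope comparison. The sketch below uses Amdahl's law, which is an assumption brought in for illustration and not something the slides state, and further assumes that enough independent jobs exist to keep every core busy and that cores do not contend for shared resources. The function names are hypothetical.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Speedup of one job on `cores` cores under Amdahl's law."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores)


def throughput(cores, parallel_fraction, parallelize):
    """Jobs completed per unit time, where one job takes one time unit on one core.

    parallelize=True:  jobs run one at a time, each spread over all cores.
    parallelize=False: each core runs its own independent job.
    """
    if parallelize:
        return amdahl_speedup(parallel_fraction, cores)
    return float(cores)


cores, fraction = 8, 0.9
print("parallelized jobs :", round(throughput(cores, fraction, True), 2))
print("independent jobs  :", round(throughput(cores, fraction, False), 2))
```

Under these assumptions the parallelized configuration finishes each job sooner but completes fewer jobs per unit time whenever the speedup is below the core count, so the choice depends on whether latency or throughput matters more.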

CMT Optimization Challenges
• I-cache optimization issues
  • In single-thread execution, I-cache misses often come from conflicts between procedures.
  • In multi-threaded execution, the conflicts may come from different threads.
• Thread scheduling issues
  • Should two threads be scheduled on two separate cores, or on the same core with SMT?
  • Schedule for performance or for power? (balanced vs. unbalanced scheduling)
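The inter-thread I-cache conflict can be made concrete with a small simulation. This is a minimal sketch assuming a direct-mapped cache with hypothetical sizes (4 sets, 16-byte lines) and two made-up instruction loops whose lines map to the same sets; real I-caches are usually set-associative and much larger.

```python
def count_misses(addresses, num_sets=4, line_size=16):
    """Count misses of a direct-mapped cache over a fetch address trace."""
    tags = [None] * num_sets
    misses = 0
    for addr in addresses:
        line = addr // line_size
        index = line % num_sets      # which set the line maps to
        tag = line // num_sets       # identifies the line within that set
        if tags[index] != tag:
            misses += 1
            tags[index] = tag        # evict whatever was there
    return misses


# Each loop touches 4 cache lines, 10 times; both loops map to the same sets.
loop_a = [0, 16, 32, 48] * 10
loop_b = [64, 80, 96, 112] * 10

# Two threads sharing the cache, fetching in alternate cycles.
interleaved = [addr for pair in zip(loop_a, loop_b) for addr in pair]

print("thread A alone:", count_misses(loop_a))
print("thread B alone:", count_misses(loop_b))
print("A and B interleaved:", count_misses(interleaved))
```

Run alone, each loop fits in the cache and only takes cold misses; interleaved, the two threads keep evicting each other's lines, even though neither thread has any conflict within its own code.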

Some emerging Issues
• New low-power, high-performance cores
  • Current cores reuse the design of the previous generation. This cannot last long, since supply power scaling is not sufficient to meet the requirement.
  • New designs are called for to obtain low-power, high-performance cores.
• Off-chip bandwidth
  • How to keep up with the need for off-chip bandwidth (doubling every generation)?
  • We cannot rely on adding pins (pin count increases about 10% per generation); the bandwidth per pin must increase.

Some emerging Issues (cont.)
• Homogeneous or heterogeneous cores
  • For workloads with sufficient TLP, multiple simple cores can deliver superior performance.
  • However, how to deliver robust performance for single-thread jobs? A complex core plus many simple cores?
• Shared hardware accelerators
  • network offload engines
  • cryptographic engines
  • XML parsing or processing?
  • FFT accelerators
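The off-chip bandwidth figures above imply a required growth rate for per-pin bandwidth. The short calculation below takes the slide's two numbers at face value (demand doubles per generation, pin count grows 10% per generation); the function name and the compounding over several generations are illustrative assumptions.

```python
def per_pin_bandwidth_growth(generations, demand_growth=2.0, pin_growth=1.10):
    """Factor by which bandwidth per pin must grow to meet demand.

    Total bandwidth = pins * bandwidth_per_pin, so per-pin bandwidth must
    grow by demand_growth / pin_growth in every generation.
    """
    return (demand_growth / pin_growth) ** generations


for g in (1, 2, 3):
    print(g, "generation(s):", round(per_pin_bandwidth_growth(g), 2), "x")
```

With these inputs, per-pin bandwidth has to grow by roughly 1.8x per generation, which is why adding pins alone cannot close the gap.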

New Research Opportunities with CMP/CMT
• Speculative Threads
  • Use thread-level control speculation and runtime data dependence checks to speed up single-program execution.
  • Recent studies have shown about 20% speedup potential from loop-level thread speculation on sequential code.
• Helper Threads
  • Use otherwise idle cores to run dynamic optimization threads, performance monitoring (or profiling) threads, or scout threads.

New Research Opportunities with CMP/CMT
• Monitoring Threads
  • Monitoring threads can run on other cores to enforce correct execution of the main thread.
  • The main thread turns itself into a speculative thread until the monitoring thread verifies that the execution meets the requirements. If verification fails, the speculative execution aborts.
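The runtime data dependence check behind loop-level thread speculation can be sketched in software. The model below is a simplification under several assumptions: it covers only data dependence checking (not control speculation), it simulates the speculative iterations sequentially instead of on real threads, and the names `execute`, `run_speculative`, and `window` are hypothetical. Iterations in a window run against a memory snapshot, commit in program order, and are squashed and re-executed if they read a location that an earlier iteration in the window wrote.

```python
def execute(body, i, mem):
    """Run one iteration against mem, recording its read set and buffered writes."""
    reads, writes = set(), {}

    def read(addr):
        if addr in writes:            # read own buffered write
            return writes[addr]
        reads.add(addr)
        return mem.get(addr, 0)

    def write(addr, value):
        writes[addr] = value          # buffered until commit

    body(i, read, write)
    return reads, writes


def run_speculative(memory, body, n_iters, window=4):
    """Execute the loop `window` iterations at a time; return the squash count."""
    squashes = 0
    for start in range(0, n_iters, window):
        iters = range(start, min(start + window, n_iters))
        snapshot = dict(memory)
        # Speculative phase: every iteration in the window sees the same snapshot.
        results = [execute(body, i, snapshot) for i in iters]
        # Commit phase, in program order, with a dependence check.
        written = set()
        for i, (reads, writes) in zip(iters, results):
            if reads & written:       # read a value an earlier iteration changed
                squashes += 1
                reads, writes = execute(body, i, memory)   # re-run with committed state
            memory.update(writes)
            written.update(writes)
    return squashes


def independent(i, read, write):      # b[i] = 2 * a[i], no loop-carried dependence
    write(("b", i), 2 * read(("a", i)))


def dependent(i, read, write):        # x[i + 1] = x[i] + 1, loop-carried dependence
    write(i + 1, read(i) + 1)


mem_a = {("a", i): i for i in range(8)}
print("independent loop squashes:", run_speculative(mem_a, independent, 8))
mem_x = {0: 0}
print("dependent loop squashes:", run_speculative(mem_x, dependent, 8))
```

The independent loop commits without squashes, while the loop-carried dependence forces most iterations to be re-executed, which is the kind of cost that limits the speedup available from speculation.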

New Research Opportunities (Transient Fault Detection/Tolerance)
• Software Redundant Multi-Threading
  • Uses software-controlled redundancy to detect and tolerate transient faults.
  • Optimizations are critical to minimize communication and synchronization.
  • Redundant threads run on multiple cores; this differs from SMT, where one error may corrupt both threads.
• Process-Level Redundancy
  • Checks only on system calls, to intercept faults that propagate to the output.

New Research Opportunities
• For software debugging
  • Run a different path on the other core to increase path coverage.
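The idea of software-controlled redundancy can be sketched as running the same computation in two threads and comparing the results before they are released. This is a minimal illustration, not the mechanism from any particular system: `run_redundant` and `FlakyAdder` are hypothetical names, the comparison happens only on the final return value (a stand-in for checking at output boundaries), the retry loop is one assumed way to tolerate a detected fault, and the "fault" is injected deliberately by flipping a bit on the first call.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_redundant(compute, args, max_retries=3):
    """Run compute(*args) in two threads; accept the result only if both agree."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        for _ in range(max_retries):
            leading = pool.submit(compute, *args)
            trailing = pool.submit(compute, *args)
            a, b = leading.result(), trailing.result()
            if a == b:                # outputs match: no fault detected
                return a
            # Mismatch: treat as a transient fault and re-execute both copies.
    raise RuntimeError("redundant results kept disagreeing")


class FlakyAdder:
    """Adds two integers; corrupts the result of its very first call."""

    def __init__(self):
        self._lock = threading.Lock()
        self.calls = 0

    def __call__(self, x, y):
        with self._lock:
            self.calls += 1
            inject_fault = self.calls == 1
        result = x + y
        return result ^ 1 if inject_fault else result   # simulated bit flip


adder = FlakyAdder()
print(run_redundant(adder, (2, 3)), "after", adder.calls, "calls")
```

Comparing only at the output keeps communication between the redundant copies low, at the cost of detecting a fault later than a scheme that compares intermediate values.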

Future of CMP/CMT
• Some companies already have 128/256-core CMPs on their roadmaps. The future is hard to predict: high-end servers may be addressed by large-scale CMPs, but the desktop and embedded markets may not be (perhaps small- or medium-scale CMPs would be sufficient).
• Today’s architectures are more likely to be driven by the software market than by hardware vendors. Itanium is one example: even with Intel and HP behind it, it has not been very successful. A successful product sells itself.
