A few issues on the design of future multicores

André Seznec, IRISA/INRIA

Single Chip Uniprocessor: the end of the road • (Very) wide-issue superscalar processors are not cost-effective: • More than quadratic complexity on many key components: • Register file • Bypass network • Issue logic • Limited performance return • Failure of EV8 = end of very wide-issue superscalar processors

Hardware thread parallelism • High-end single chip component: • Chip multiprocessors: • IBM Power 5, dual-core Intel Pentium 4, dual-core Athlon-64 • Many CMP SoCs for embedded markets • Cell • (Simultaneous) Multithreading: • Pentium 4, Power 5, • Multithreading

Thread parallelism • Expressed by the application developer: • Depends on the application itself • Depends on the programming language or paradigm • Depends on the programmer • Discovered by the compiler: • Automatic (static) parallelization • Exploited by the runtime: • Task scheduling • Dynamically discovered/exploited by hardware or software: • Speculative hardware/software threading

Direction of (single chip) architecture: betting on parallelism success • (Future) applications are intrinsically parallel: • As many simple cores as possible (SSC: Sea of Simple Cores) • (Future) applications are moderately parallel: • A few complex state-of-the-art superscalar cores (FCC: Few Complex Cores)

SSC: Sea of Simple Cores

FCC: Few Complex Cores [Figure: several 4-way out-of-order (O-O-O) superscalar cores and a shared L3 cache]

Common architectural design issues

Instruction Set Architecture • Single ISAs ? • Extension of “conventional” multiprocessors • Shared or distributed memory ? • Heterogeneous ISAs: • A la CELL ? (master processor + slave processors) x N • A la SoC ? Specialized coprocessors • Radically new architecture ? • Which one ?

Hardware accelerators ? • SIMD extensions: • Seem to be accepted; shift the burden onto application developers and compilers • Reconfigurable datapaths: • Popular when you have a well-defined, intrinsically parallel application • Vector extensions: • Might be the right move when targeting essentially scientific computing

On-chip memory/processors/memory bandwidth • The uniprocessor credo was: “Use the remaining silicon for caches” • New issue: an extra processor or more cache ? • Extra processing power = • Increased memory bandwidth demand • Increased power consumption, more temperature hot spots • Extra cache = decreased (external) memory demand
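The trade-off on this slide can be illustrated with a toy model. The sketch below is not from the talk: it assumes the square-root rule of thumb (miss rate roughly proportional to 1/sqrt(cache size)), ignores the extra contention that more cores put on a shared cache, and uses made-up constants. The names `miss_rate` and `offchip_demand` are hypothetical.

```python
# Toy model of the "extra processor or more cache" trade-off.
# ASSUMPTION (not from the slides): the miss rate follows the square-root
# rule of thumb, miss_rate ~ 1 / sqrt(cache size). Constants are illustrative.

def miss_rate(cache_mb, base_miss_rate=0.02, base_cache_mb=1.0):
    """Misses per memory reference for a shared cache of cache_mb megabytes."""
    return base_miss_rate * (base_cache_mb / cache_mb) ** 0.5


def offchip_demand(n_cores, cache_mb, refs_per_core=1.0):
    """Relative off-chip traffic: each core issues refs_per_core references
    per unit of time and every miss goes to external memory."""
    return n_cores * refs_per_core * miss_rate(cache_mb)


# Three hypothetical design points.
baseline = offchip_demand(n_cores=4, cache_mb=4.0)
more_cores = offchip_demand(n_cores=8, cache_mb=4.0)   # spend silicon on cores
more_cache = offchip_demand(n_cores=4, cache_mb=16.0)  # spend silicon on cache

print(f"baseline  : {baseline:.3f}")
print(f"more cores: {more_cores:.3f}")
print(f"more cache: {more_cache:.3f}")
```

Under these assumptions, doubling the core count doubles the external memory demand, while quadrupling the cache halves it. Real workloads will deviate from the square-root rule.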

Memory hierarchy organization ?

Flat: sharing a big L2/L3 cache? [Figure: twelve μP cores, each with a private cache ($), and one L3 cache]

Flat: communication issues? Through the big cache [Figure: twelve μP cores, each with a private cache ($), and one L3 cache]

Flat: communication issues? Grid-like ? [Figure: twelve μP cores, each with a private cache ($), and one L3 cache]

Hierarchical organization ? [Figure: eight μP cores, each with a private cache ($), four L2 caches and one L3 cache]

Hierarchical organization ? • Arbitration at all levels • Coherency at all levels • Interleaving at all levels • Bandwidth dimensioning

NoC structure • Very dependent on the memory hierarchy organization !! • + sharing coprocessors/hardware accelerators • + I/O buses/(processors ?) • + memory interface • + network interface

Example [Figure: six μP cores, each with a private cache ($), three L2 caches, one L3 cache, a memory interface and I/O]

Multithreading ? • An extra level of thread parallelism !! • Might be an interesting alternative to prefetching on massively parallel applications
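Why multithreading can stand in for prefetching is easiest to see with a simple latency-hiding model. The sketch below is an illustration, not the speaker's analysis: it assumes each thread alternates a fixed compute burst with a fixed-latency cache miss, that thread switches are free, and that misses from different threads overlap perfectly. The function names and cycle counts are hypothetical.

```python
import math

# Simple latency-hiding model for a multithreaded core.
# ASSUMPTIONS (not from the slides): every thread computes for
# compute_cycles, then stalls for miss_latency cycles; thread switches
# cost nothing; misses from different threads overlap perfectly.

def core_utilization(n_threads, compute_cycles, miss_latency):
    """Fraction of cycles in which the core does useful work."""
    busy = n_threads * compute_cycles
    period = compute_cycles + miss_latency
    return min(1.0, busy / period)


def threads_to_hide_latency(compute_cycles, miss_latency):
    """Smallest thread count that keeps the core busy all the time."""
    return math.ceil((compute_cycles + miss_latency) / compute_cycles)


# Hypothetical numbers: 20 cycles of work between misses, 180-cycle miss.
for n in (1, 2, 4, 8, 10):
    print(n, "threads ->", core_utilization(n, 20, 180))
print("threads needed:", threads_to_hide_latency(20, 180))
```

With enough independent threads the miss latency is covered by other threads' work, which is the same goal prefetching pursues within a single thread.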

Power and thermal issues • Voltage/frequency scaling to adapt to the workload ? • Adapting the workload to the available power ? • Adapting/dimensioning the architecture to the power budget • Activity migration for managing temperatures ?
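The appeal of voltage/frequency scaling on a multicore follows from the standard CMOS dynamic power relation, P = a·C·V²·f. The sketch below is a rough illustration only: it assumes the supply voltage can be scaled in proportion to frequency (so power falls with the cube of the scaling factor) and ignores leakage. The function names and the normalized budget are hypothetical.

```python
# Dynamic power and voltage/frequency scaling, in normalized units.
# ASSUMPTIONS (not from the slides): voltage scales linearly with
# frequency, leakage power is ignored, and the workload is perfectly
# parallel when several cores are used.

def dynamic_power(capacitance, voltage, frequency, activity=1.0):
    """Standard CMOS dynamic power: P = a * C * V^2 * f."""
    return activity * capacitance * voltage ** 2 * frequency


def scaled_power(base_power, scale):
    """Power of one core when V and f are both multiplied by scale."""
    return base_power * scale ** 3


def cores_within_budget(power_budget, base_core_power, scale):
    """How many scaled-down cores fit in the power budget."""
    return int(power_budget // scaled_power(base_core_power, scale))


# One core at full speed uses the whole (normalized) budget of 1.0.
budget = 1.0
for scale in (1.0, 0.5):
    n = cores_within_budget(budget, 1.0, scale)
    throughput = n * scale  # aggregate throughput on a parallel workload
    print(f"scale={scale}: {n} cores, relative throughput {throughput}")
```

Under these idealized assumptions, running more cores at a lower voltage and frequency yields more aggregate throughput for the same power, but only if the workload offers enough parallelism.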

General issues for software/compiler • Parallelism detection and partitioning: • find the correct granularity • Memory bandwidth mastering • Non-uniform memory latency • Optimizing sequential code portions

SSC design specificities

Basic core granularity • RISC cores • VLIW cores • In-order superscalar cores

Homogeneous vs. heterogeneous ISAs • Core specialization: • RISC + VLIW or DSP slaves ? • Master core + a set of special purpose cores ?

Sharing issue • Simple cores: • Lots of duplication and lots of unused resources at any time • Adjacent cores can share: • Caches • Functional units: FP, mult/div, multimedia • Hardware accelerators

An example of sharing [Figure: four μP cores with two IL1 caches, two instruction fetch units, two FP units, two DL1 caches, one hardware accelerator and an L2 cache]

Multithreading/prefetching • Multithreading: • Is the extra complexity worth it for simple cores ? • Prefetching: • Is it worth it ? • Sharing prefetch engines ?

Vision of an SSC (my own vision)

SSC: the basic brick [Figure: eight μP cores, four FP units, four I $, four D $ and one L2 cache]

[Figure: four basic bricks (32 μP cores, 16 FP units, 16 I $, 16 D $, four L2 caches), an L3 cache, and memory, network and system interfaces]

FCC design specificities

Only limited available thread parallelism ? • Focus on uniprocessor architecture: • Find the correct tradeoff between complexity and performance • Power and temperature issues • Vector extensions ? • Contiguous vectors (a la SSE) ? • Strided vectors in L2 caches (Tarantula-like)

Performance enablers • SMT for parallel workloads ? • Helper threads ? • Run ahead threads • Speculative multithreading hardware support

Intermediate design ? • SSCs: • Shine on massively parallel applications • Poor/limited performance on sequential sections • FCCs: • Moderate performance on parallel applications • Good performance on sequential sections

Mix of FCC and SSC • Amdahl’s law
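Amdahl's law makes the case for a mixed design concrete. The sketch below compares three hypothetical chips with the same area budget; it is an illustration under stated assumptions, not data from the talk. It assumes a complex core costs the area of 16 simple cores and runs sequential code 4 times faster (a square-root area/performance rule of thumb), and that the parallel part scales perfectly. All names and numbers are made up for the example.

```python
# Amdahl's law for SSC, FCC and a mixed chip, relative to one simple core.
# ASSUMPTIONS (not from the slides): area budget of 64 simple cores; one
# complex core costs 16 simple cores of area and is 4x faster on
# sequential code; the parallel part scales perfectly.

def amdahl_speedup(parallel_fraction, seq_perf, par_throughput):
    """Speedup over a single simple core.

    seq_perf: speed of the core running the sequential part.
    par_throughput: aggregate speed of all cores on the parallel part.
    """
    sequential_time = (1.0 - parallel_fraction) / seq_perf
    parallel_time = parallel_fraction / par_throughput
    return 1.0 / (sequential_time + parallel_time)


DESIGNS = {
    # name: (seq_perf, par_throughput)
    "SSC (64 simple)": (1.0, 64.0),
    "FCC (4 complex)": (4.0, 16.0),
    "Mix (1 complex + 48 simple)": (4.0, 52.0),
}

for fraction in (0.5, 0.9, 0.99, 1.0):
    print(f"parallel fraction = {fraction}")
    for name, (seq_perf, par_throughput) in DESIGNS.items():
        speedup = amdahl_speedup(fraction, seq_perf, par_throughput)
        print(f"  {name}: {speedup:.2f}")
```

With these numbers the SSC only wins when the code is almost entirely parallel, the FCC is held back on parallel sections, and the mix is competitive across the range because the complex core shortens the sequential sections.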

The basic brick [Figure: one ultimate out-of-order superscalar core with four μP cores, two FP units, two I $, two D $ and an L2 cache]

[Figure: four ultimate out-of-order (Ult. O-O-O) cores, eight I $, eight D $, four L2 caches, an L3 cache, and memory, network and system interfaces]

Conclusion • The uniprocessor era has come to an end • No clear trend to continue • Might be time for more architecture diversity
