A few issues on the design of future multicores André Seznec IRISA/INRIA
Single Chip Uniprocessor: the end of the road • (Very) wide issue superscalar processors are not cost effective: • More than quadratic complexity on many key components: • Register file • Bypass network • Issue logic • Limited performance return • Failure of EV8 = end of very wide issue superscalar processors
Hardware thread parallelism • High-end single chip component: • Chip multiprocessors: • IBM Power 5, dual-core Intel Pentium 4, dual-core Athlon-64 • Many CMP SoCs for embedded markets • Cell • (Simultaneous) Multithreading: • Pentium 4, Power 5, • Multithreading
Thread parallelism • Expressed by the application developer: • Depends on the application itself • Depends on the programming language or paradigm • Depends on the programmer • Discovered by the compiler: • Automatic (static) parallelization • Exploited by the runtime: • Task scheduling • Dynamically discovered/exploited by hardware or software: • Speculative hardware/software threading
Direction of (single chip) architecture: betting on parallelism success • (Future) applications are intrinsically parallel: • As many simple cores as possible • SSC: Sea of Simple Cores • (Future) applications are moderately parallel: • A few complex state-of-the-art superscalar cores • FCC: Few Complex Cores
FCC: Few Complex Cores [Figure: several 4-way out-of-order (O-O-O) superscalar cores sharing an L3 cache]
Instruction Set Architecture • Single ISAs ? • Extension of “conventional” multiprocessors • Shared or distributed memory ? • Heterogeneous ISAs: • À la Cell ? (master processor + slave processors) x N • À la SoC ? Specialized coprocessors • Radically new architecture ? • Which one ?
Hardware accelerators ? • SIMD extensions: • Seem to be accepted; they shift the burden to application developers and compilers • Reconfigurable datapaths: • Popular when you have a well-defined, intrinsically parallel application • Vector extensions: • Might be the right move when targeting essentially scientific computing
On-chip memory/processors/memory bandwidth • The uniprocessor credo was: “Use the remaining silicon for caches” • New issue: an extra processor or more cache ? • Extra processing power = • Increased memory bandwidth demand • Increased power consumption, more temperature hot spots • Extra cache = decreased (external) memory demand
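The trade-off on this slide can be illustrated with a rough back-of-the-envelope model. This is a minimal sketch, not taken from the talk: it assumes that every miss in the last on-chip cache transfers one full cache line from external memory. The function name `bandwidth_demand_gbs`, the 64-byte line size, the access rate, and the miss rates are all hypothetical values chosen for illustration.

```python
def bandwidth_demand_gbs(cores, accesses_per_sec_per_core, miss_rate, line_bytes=64):
    """External memory traffic in GB/s, assuming each last-level miss moves one line."""
    return cores * accesses_per_sec_per_core * miss_rate * line_bytes / 1e9

# Hypothetical baseline: 4 cores, 1e9 memory accesses per second each, 2% miss rate.
baseline = bandwidth_demand_gbs(cores=4, accesses_per_sec_per_core=1e9, miss_rate=0.02)

# Option A: spend the extra silicon on 4 more cores (miss rate assumed unchanged).
more_cores = bandwidth_demand_gbs(cores=8, accesses_per_sec_per_core=1e9, miss_rate=0.02)

# Option B: spend it on cache instead (miss rate assumed to drop to 1%).
more_cache = bandwidth_demand_gbs(cores=4, accesses_per_sec_per_core=1e9, miss_rate=0.01)

print(f"baseline:   {baseline:.2f} GB/s")
print(f"more cores: {more_cores:.2f} GB/s")
print(f"more cache: {more_cache:.2f} GB/s")
```

Under these assumed numbers, doubling the core count doubles the external bandwidth demand, while the larger cache halves it. Real miss rates depend on the workload and on how cores share the cache, so the model only shows the direction of the effect.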
Flat: sharing a big L2/L3 cache ? [Figure: twelve processors (μP), each with its own cache ($), all sharing one L3 cache]
Flat: communication issues ? Through the big cache [Figure: twelve processors (μP), each with its own cache ($), communicating through the shared L3 cache]
Flat: communication issues ? Grid-like ? [Figure: twelve processors (μP), each with its own cache ($), next to the L3 cache]
Hierarchical organization ? [Figure: eight processors (μP), each with its own cache ($), four L2 caches, and one L3 cache]
Hierarchical organization ? • Arbitration at all levels • Coherency at all levels • Interleaving at all levels • Bandwidth dimensioning
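As an illustration of the "interleaving" bullet, here is a minimal sketch of low-order interleaving, where consecutive cache lines map to consecutive banks of a shared cache. The talk does not specify any particular scheme; the function name `bank_of`, the four banks, and the 64-byte line size are assumptions made for this example.

```python
def bank_of(address, n_banks=4, line_bytes=64):
    """Low-order interleaving: consecutive cache lines go to consecutive banks."""
    line_index = address // line_bytes  # which cache line the byte address falls in
    return line_index % n_banks         # spread lines round-robin over the banks

# A sequential sweep over memory touches every bank in turn,
# so requests from one stream are spread over all banks.
sweep = [bank_of(addr) for addr in range(0, 8 * 64, 64)]
print(sweep)
```

With a hierarchy of caches, a mapping of this kind has to be chosen at each level, which is one reason the slide lists interleaving alongside arbitration and coherency.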
NoC structure • Very dependent on the memory hierarchy organization !! • + sharing coprocessors/hardware accelerators • + I/O buses/(processors ?) • + memory interface • + network interface
Example [Figure: six processors (μP), each with its own cache ($), three L2 caches, one L3 cache, a memory interface, and I/O]
Multithreading ? • An extra level of thread parallelism !! • Might be an interesting alternative to prefetching on massively parallel applications
Power and thermal issues • Voltage/frequency scaling to adapt to the workload ? • Adapting the workload to the available power ? • Adapting/dimensioning the architecture to the power budget • Activity migration for managing temperatures ?
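The voltage/frequency scaling bullet can be made concrete with the standard first-order model of dynamic CMOS power, P = a · C · V² · f. The sketch below is an idealized illustration, not a result from the talk: it ignores static leakage and assumes the supply voltage can be scaled linearly with frequency. The function names and the numeric values are hypothetical.

```python
def dynamic_power(c_eff, voltage, freq_hz, activity=1.0):
    """First-order dynamic power in watts: activity * C * V^2 * f."""
    return activity * c_eff * voltage ** 2 * freq_hz

def scaled_power_ratio(s):
    """Power relative to nominal when both V and f are scaled by s (idealized)."""
    return s ** 3  # V^2 contributes s^2, f contributes s

# Hypothetical core: 1 nF effective switched capacitance, 1.0 V, 2 GHz.
nominal = dynamic_power(c_eff=1e-9, voltage=1.0, freq_hz=2e9)

# Two cores at half voltage and half frequency offer the same aggregate
# clock rate as one nominal core, for a fraction of the dynamic power.
two_slow_cores = 2 * nominal * scaled_power_ratio(0.5)

print(f"one core at nominal V/f: {nominal:.2f} W")
print(f"two cores at half V/f:   {two_slow_cores:.2f} W")
```

The comparison only holds if the workload parallelizes well enough to use both slow cores, which ties the power question back to the amount of available thread parallelism.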
General issues for software/compiler • Parallelism detection and partitioning: • find the correct granularity • Memory bandwidth mastering • Non-uniform memory latency • Optimizing sequential code portions
Basic core granularity • RISC cores • VLIW cores • In-order superscalar cores
Homogeneous vs. heterogeneous ISAs • Core specialization: • RISC + VLIW or DSP slaves ? • Master core + a set of special purpose cores ?
Sharing issue • Simple cores: • Lots of duplication and many unused resources at any given time • Adjacent cores can share: • Caches • Functional units: FP, mult/div, multimedia • Hardware accelerators
An example of sharing [Figure: four cores (μP) with two instruction-fetch units, two IL1 caches, two FP units, two DL1 caches, one hardware accelerator, and one L2 cache]
Multithreading/prefetching • Multithreading: • Is the extra complexity worth it for simple cores ? • Prefetching: • Is it worth it ? • Sharing prefetch engines ?
Vision of an SSC (my own vision)
SSC: the basic brick [Figure: eight cores (μP) with four I-caches, four FP units, and four D-caches, all sharing one L2 cache]
[Figure: four basic bricks (32 cores, 16 I-caches, 16 FP units, 16 D-caches, four L2 caches) around a shared L3 cache, with a memory interface, a network interface, and a system interface]
Only limited available thread parallelism ? • Focus on uniprocessor architecture: • Find the correct tradeoff between complexity and performance • Power and temperature issues • Vector extensions ? • Contiguous vectors (à la SSE) ? • Strided vectors in L2 caches (Tarantula-like) ?
Performance enablers • SMT for parallel workloads ? • Helper threads ? • Run ahead threads • Speculative multithreading hardware support
Intermediate design ? • SSCs: • Shine on massively parallel applications • Poor/limited performance on sequential sections • FCCs: • Moderate performance on parallel applications • Good performance on sequential sections
Mix of FCC and SSC Amdahl’s law
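Amdahl's law makes the case for a mix: the sequential part of a program runs on a single core, so its speed is set by the fastest core, while the parallel part can use the summed throughput of all cores. The sketch below applies that reasoning to three chip configurations. All core counts and relative core speeds are hypothetical values chosen for illustration (a simple core is assumed to run at 0.4 times the speed of a complex core); none of them come from the talk.

```python
from dataclasses import dataclass

@dataclass
class Chip:
    name: str
    seq_perf: float        # speed of the fastest single core (complex core = 1.0)
    par_throughput: float  # summed speed of all cores on the chip

def speedup(chip, parallel_fraction):
    """Amdahl's law, relative to one complex core running the whole program."""
    seq_time = (1.0 - parallel_fraction) / chip.seq_perf
    par_time = parallel_fraction / chip.par_throughput
    return 1.0 / (seq_time + par_time)

SIMPLE = 0.4  # assumed relative speed of one simple core

FCC = Chip("FCC: 4 complex cores", seq_perf=1.0, par_throughput=4.0)
SSC = Chip("SSC: 16 simple cores", seq_perf=SIMPLE, par_throughput=16 * SIMPLE)
MIX = Chip("Mix: 1 complex + 12 simple", seq_perf=1.0, par_throughput=1.0 + 12 * SIMPLE)

for f in (0.5, 0.9, 0.99):
    row = ", ".join(f"{c.name} = {speedup(c, f):.2f}" for c in (FCC, SSC, MIX))
    print(f"parallel fraction {f}: {row}")
```

With these assumed numbers, the mix beats both pure designs at a parallel fraction of 0.9, and the SSC only pulls ahead when nearly all of the work is parallel. The crossover points shift with the assumed core speeds and area costs, so the figures indicate a trend rather than a prediction.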
The basic brick [Figure: one ultimate out-of-order superscalar core next to four simple cores (μP) with two I-caches, two FP units, and two D-caches, all sharing one L2 cache]
[Figure: four bricks, each with an ultimate out-of-order (Ult. O-O-O) core, two I-caches, two D-caches, and an L2 cache, around a shared L3 cache, with a memory interface, a network interface, and a system interface]
Conclusion • The era of the uniprocessor has come to an end • No clear trend for what comes next • Might be time for more architecture diversity