In memory of Stamatis

The Paradigm Shift to Multi-Cores: Opportunities and Challenges
Per Stenstrom
Department of Computer Science & Engineering
Chalmers University of Technology, Sweden
An Unwanted Paradigm Shift
[Chart: annual single-core performance growth dropped from 60% to 30%]
• Clock frequency couldn’t be pushed higher
• Traditional parallelism exploitation didn’t pay off
The Easy Way Out: Replicate
• Moore’s Law: 2X cores every 18 months
• Implication: about a hundred cores in five years
• BUT: software can only make use of one!
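The core-count projection above is simple doubling arithmetic; a quick sketch (the 10-core starting point is an assumption for illustration):

```python
def projected_cores(cores_now, years, doubling_months=18):
    """Project core count if cores double every `doubling_months` (Moore's-Law style)."""
    doublings = years * 12 / doubling_months
    return round(cores_now * 2 ** doublings)

# Five years at 2X every 18 months is ~3.3 doublings, i.e. a ~10X increase:
print(projected_cores(10, 5))  # 101 -- "about a hundred cores"
```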
Main Challenges
• Programmability
• Scalability
We want to seamlessly scale up application performance within the power envelope
Vision: Multiple Cores = One Processor
[Diagram: application SW (existing and new) on top of a system software infrastructure, running on a multi-core chip of processors (P) and memory (M)]
Requires a concerted action across layers: programming model, compiler, architecture
How can Architects Help?
• What is the best use of the many transistors?
[Diagram: multi-core chip with a cache hierarchy, on-chip cache management, and support for enhancing programmability]
”Inherent” Speculative Parallelism [Islam et al., ICPP 2007]
Representative of what is possible today
Scaling beyond eight cores will need manual effort
Three Hard Steps
• Decomposition: expose concurrency, but beware of thread management overhead
• Assignment: balance load and reduce communication
• Orchestration: schedule threads to reduce communication and synchronization costs
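The three steps can be made concrete with a small sketch (Python threads illustrate the structure only, not true parallel speedup; the chunking scheme and worker count are illustrative assumptions):

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_sum(data, n_workers=4):
    # Decomposition: split the work into independent chunks (expose concurrency).
    chunk = (len(data) + n_workers - 1) // n_workers
    chunks = [data[i:i + chunk] for i in range(0, len(data), chunk)]
    # Assignment: one chunk per worker (balance load, minimize communication).
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(sum, chunks))
    # Orchestration: a single final reduction is the only synchronization point.
    return sum(partials)

print(parallel_sum(list(range(1000))))  # 499500
```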
Transactional Memory
[Diagram: transactions T1 and T2 both load and store A; the conflicting transaction is squashed and re-executed]
Transactional memory provides a safety net for data races and hence simplifies coordination
• Research is warranted into high-productivity programming interfaces
• Transactional memory is a good starting point
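The squash/re-execute behavior can be mimicked in software with optimistic concurrency: read a version, compute speculatively, and commit only if no conflicting writer intervened (a minimal sketch, not a real TM system; `VersionedCell` is a hypothetical helper):

```python
import threading

class VersionedCell:
    """A cell with optimistic updates: conflicting commits are squashed and retried."""
    def __init__(self, value=0):
        self.value = value
        self.version = 0
        self._lock = threading.Lock()  # guards only the commit, not the computation

    def atomic_update(self, fn):
        while True:
            # Speculative read: snapshot value and version without locking.
            snapshot, seen = self.value, self.version
            new = fn(snapshot)            # compute outside any lock
            with self._lock:
                if self.version == seen:  # no conflicting writer: commit
                    self.value = new
                    self.version += 1
                    return new
            # Conflict detected: squash the speculative result and re-execute.

cell = VersionedCell(0)
threads = [threading.Thread(target=lambda: [cell.atomic_update(lambda v: v + 1)
                                            for _ in range(1000)])
           for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(cell.value)  # 4000: no increments lost despite the data race
```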
Transistors can Help Programmers
Recall the ”hard steps”:
• Decomposition: low-overhead spawning mechanisms
• Assignment: load balancing supported in HW
• Orchestration: communication balancing supported in HW
Opportunities abound
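What hardware task queues and spawning support would provide can be approximated today with a shared software work queue from which idle workers pull (a sketch of the dynamic load-balancing idea only; the task list and worker count are illustrative assumptions):

```python
import queue
import threading

def run_balanced(tasks, n_workers=4):
    """Dynamic load balancing: idle workers pull the next task from a shared queue."""
    work = queue.Queue()
    for t in tasks:
        work.put(t)
    results, lock = [], threading.Lock()

    def worker():
        while True:
            try:
                task = work.get_nowait()  # HW spawning support would make this near-free
            except queue.Empty:
                return                    # no work left: this worker retires
            r = task()
            with lock:
                results.append(r)

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads: t.start()
    for t in threads: t.join()
    return results

# Unevenly sized tasks still balance: each worker grabs more work as it finishes.
out = run_balanced([lambda i=i: i * i for i in range(8)])
print(sorted(out))  # [0, 1, 4, 9, 16, 25, 36, 49]
```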
Processor/Memory Gap
[Diagram: several multi-core chips, each with its own cache hierarchy, attached to memory]
The processor-memory speed gap: how do we bridge it?
Adaptive Shared Caches [Dybdahl & Stenstrom, HPCA 2007]
[Diagram: cores P1/P2 with private L1 caches over L2 banks, organized as shared, private, or adaptive hybrid]

              Shared   Private   Adaptive
Conflicts     ---      +++       +++
Speed         ---      +++       +++
Utilization   +++      ---       +++
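The adaptive idea can be caricatured in a few lines: each epoch, shift shared-cache ways toward the core suffering more misses (purely illustrative; the actual HPCA 2007 scheme uses more refined estimation, and this `rebalance` helper is hypothetical):

```python
def rebalance(ways_a, ways_b, misses_a, misses_b, min_ways=1):
    """Shift one cache way per epoch toward the core with the higher miss count."""
    if misses_a > misses_b and ways_b > min_ways:
        return ways_a + 1, ways_b - 1   # core A is starved: grow its partition
    if misses_b > misses_a and ways_a > min_ways:
        return ways_a - 1, ways_b + 1   # core B is starved: grow its partition
    return ways_a, ways_b               # balanced, or donor is at the floor

print(rebalance(4, 4, misses_a=120, misses_b=30))  # (5, 3)
```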
Scaling Up Off-chip Bandwidth
[Diagram: multiple multi-core chips, each behind a cache hierarchy, sharing off-chip links to memory]
The off-chip bandwidth bottleneck: BW does not scale with Moore’s law unless optics or other disruptive technologies change the rules
Memory/Cache Link Compression [Thuresson & Stenstrom, IEEE TC, to appear]
Our combined scheme yields a 3X reduction in bandwidth
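The paper's combined scheme is not reproduced here, but the flavor of value-locality link compression can be sketched: words that hit a small dictionary of frequent values cross the link as a short code, the rest travel uncompressed with a tag (the sizes, dictionary, and synthetic stream are illustrative assumptions):

```python
def compress_stream(words, dictionary):
    """Frequent-value encoding: dictionary hits cost 8 bits (tag + index),
    misses cost 40 bits (tag + full 32-bit word)."""
    bits = 0
    for w in words:
        bits += 8 if w in dictionary else 40
    return bits

# Synthetic stream dominated by two frequent values (e.g. 0 and all-ones):
stream = [0] * 60 + [0xFFFFFFFF] * 20 + list(range(1000, 1020))
plain = 32 * len(stream)                          # uncompressed: 32 bits per word
packed = compress_stream(stream, {0, 0xFFFFFFFF})
print(plain / packed)  # ~2.22X on this synthetic stream
```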
Summary
• Multi-cores promise scalable performance under a manageable power envelope, but are hard to program
• Providing scalable application performance in the future requires research at all levels:
  • Architecture (processor, cache, interconnect)
  • Compiler
  • Programming model
These topics are addressed in the FET SARC IP and in the HiPEAC network of excellence