Heterogeneous Multi-Core Processors • Jeremy Sugerman • GCafe, May 3, 2007
Context • Exploring the future relationship between the CPU and GPU • Joint work and thinking with Kayvon • Much kibitzing from Pat, Mike, Tim, Daniel • Vision and opinion, not experiments and results • More of a talk than a paper • The value is more conceptual than algorithmic • Wider GCafe audience appeal than our near-term, elbows-deep plans to dive into GPU guts
Outline • Introduction • CPU “Special Feature” Background • Compute-Maximizing Processors • Synthesis, with Extensions • Questions for the Audience…
Introduction • Multi-core is the status quo for forthcoming CPUs • A variety of emerging (for “general purpose” use) architectures try to offer a discontinuous performance boost over traditional CPUs • GPU, Cell SPEs, Niagara, Larrabee, … • CPU vendors have a history of co-opting special-purpose units for targeted performance wins: • FPU, SSE/Altivec, VT/SVM • CPUs should co-opt entire “compute” cores!
Introduction • Industry is already exploring hybrid models • Cell: 1 PowerPC and 8 SPEs • AMD Fusion: Slideware CPU + GPU • Intel Larrabee: Weirder, NDA encumbered • The programming model for communicating deserves to be architecturally defined. • Tighter integration than the current “host + accelerator” model eases porting and improves efficiency. • Work queues / buffers allow integrated coordination with decoupled execution (see the sketch below).
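A minimal sketch of what an architecturally defined work queue could look like from software's point of view. Every name here (WorkItem, WorkQueue, enqueue, dequeue_batch) is a hypothetical illustration, not an existing API:

    // Hypothetical sketch only: a queue through which CPU cores and a
    // compute core coordinate while executing independently (decoupled).
    #include <cstddef>
    #include <cstdint>
    #include <deque>
    #include <mutex>

    struct WorkItem {
        std::uint32_t kernel_id;  // which bound kernel handles this item
        void*         element;    // the data element's state
    };

    class WorkQueue {
        std::mutex           lock_;
        std::deque<WorkItem> items_;
    public:
        // Any core, CPU or compute, may insert work here.
        void enqueue(const WorkItem& w) {
            std::lock_guard<std::mutex> g(lock_);
            items_.push_back(w);
        }
        // Hand back a batch so the consumer can schedule coherent groups;
        // the contract is a batch, not strict FIFO order.
        std::size_t dequeue_batch(WorkItem* out, std::size_t max) {
            std::lock_guard<std::mutex> g(lock_);
            std::size_t n = 0;
            while (n < max && !items_.empty()) {
                out[n++] = items_.front();
                items_.pop_front();
            }
            return n;
        }
    };

In hardware this would presumably be architected state rather than a mutex-guarded deque; the point is only the shape of the interface.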
Outline • Introduction • CPU “Special Feature” Background • Compute-Maximizing Processors • Synthesis, with Extensions • Questions for the Audience…
CPU “Special Features” • CPUs are built for general-purpose flexibility… • … but have always stolen fixed-function units in the name of performance. • Old CPUs had schedulers and malloc burned in! • CISC instructions really were faster • Hardware-managed TLBs and caches • Arguably, all virtual memory support
CPU “Special Features” • More relevantly, dedicated hardware has been adopted for domain-specific workloads. • … when the domain was sufficiently large / lucrative / influential • … and the increase in performance over software implementation / emulation was BIG • … and the cost in “design budget” (transistors, power, area, etc.) was acceptable. • Examples: FPUs, SIMD and Non-Temporal accesses, CPU virtualization
Outline • Introduction • CPU “Special Feature” Background • Compute-Maximizing Processors • Synthesis, with Extensions • Questions for the Audience…
Compute-Maximizing Processors • “Important” common apps are FLOP-hungry • Video processing, Rendering • Physics / Game “Physics” • Even OS compositing managers! • HPC apps are FLOP-hungry too • Computational Bio, Finance, Simulations, … • All can soak up vastly more compute than current CPUs can deliver. • All can utilize thread or data parallelism. • Increased interest in custom / non-“general” processors
Compute-Maximizing Processors • Or “throughput oriented” • Packed with ALUs / FPUs • Application-specified parallelism replaces the focus on single-thread ILP • Available in many flavours: • SIMD • Highly threaded cores • High numbers of tiny cores • Stream processors • Real-life examples generally mix and match
Compute-Maximizing Processors • Offer an order-of-magnitude potential performance boost… if the workload sustains high processor utilization • Mapping / porting algorithms is a labour-intensive and complex effort. • This is intrinsic. Within any design budget, a BIG performance win comes at a cost… • If it didn’t, the CPU designers would steal it.
Compute-Maximizing Programming • Generally offered as off-board “accelerators” • Data “tossed over the wall” and back • Only portions of computations achieve a speedup if offloaded • Accelerators mono-task one kernel at a time • Applications are sliced into successive, statically defined phases separated by resorting, repacking, or converting entire datasets. • Limited to a single dataset-wide feed-forward pipeline; effectively back to batch processing (see the sketch below).
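To make the “tossed over the wall” pattern concrete, here is a minimal sketch of the status quo, assuming an invented Accelerator interface (upload / launch_kernel / download are illustrative stand-ins, not any vendor’s real API):

    #include <cstddef>
    #include <vector>

    // Hypothetical sketch of the "host + accelerator" model; the stub
    // bodies stand in for driver-level DMA and kernel dispatch.
    struct Accelerator {
        void upload(const void* src, std::size_t bytes)  { /* DMA to device */ }
        void launch_kernel(int id, std::size_t count)    { /* run ONE kernel */ }
        void download(void* dst, std::size_t bytes)      { /* DMA back to host */ }
    };

    // Each statically defined phase ships the whole dataset over the wall,
    // mono-tasks a single kernel, and hauls everything back before the next
    // phase can start: dataset-wide, feed-forward, effectively batch.
    void process(Accelerator& acc, std::vector<float>& data) {
        const std::size_t bytes = data.size() * sizeof(float);

        acc.upload(data.data(), bytes);            // Phase 1 in
        acc.launch_kernel(/*id=*/0, data.size());
        acc.download(data.data(), bytes);          // Phase 1 out

        // ... CPU resorts / repacks the entire dataset here ...

        acc.upload(data.data(), bytes);            // Phase 2 in
        acc.launch_kernel(/*id=*/1, data.size());
        acc.download(data.data(), bytes);          // Phase 2 out
    }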
Outline • Introduction • CPU “Special Feature” Background • Compute-Maximizing Processors • Synthesis, with Extensions • Questions for the Audience…
Synthesis • Add at least one compute-max core to CPUs • Workloads that use it get a BIG performance win • Programmers are struggling to get any performance from having more normal cores • Being on-chip, architected, and ubiquitous is huge for application use of compute-max • Compute core exposed as a programmable, independent, multithreaded execution engine • A lot like adding (only!) fragment shaders • Largely agnostic on hardware “flavour”
Extensions • Unified address space • Coherency is nice, but still valuable without it • Multiple kernels “bound” (loaded) at a time • All part of the same application, for now • “Work” delivered to compute cores through work queues • Dequeuing batches / schedules for coherence, not necessarily FIFO • Compute and CPU cores can insert on remote queues (see the sketch below)
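A sketch of how these extensions might look to software, under the same caveat as before: ComputeCore, bind, enqueue, and schedule_once are invented names, and a real design would architect this in hardware rather than a C++ class:

    #include <cstddef>
    #include <cstdint>
    #include <functional>
    #include <vector>

    // Hypothetical sketch: multiple kernels bound at once, one work queue
    // per kernel, and any core (CPU or compute, including a running kernel)
    // allowed to insert onto any queue.
    using Kernel = std::function<void(void* element)>;

    struct BoundKernel {
        Kernel             run;
        std::vector<void*> queue;   // pending elements for this kernel
    };

    class ComputeCore {
        std::vector<BoundKernel> bound_;  // all from one application, for now
    public:
        std::uint32_t bind(Kernel k) {
            bound_.push_back({std::move(k), {}});
            return static_cast<std::uint32_t>(bound_.size() - 1);
        }
        // Remote insertion: any core may enqueue onto any kernel's queue.
        void enqueue(std::uint32_t kernel_id, void* element) {
            bound_[kernel_id].queue.push_back(element);
        }
        // Dequeue a batch from the deepest queue rather than in FIFO order:
        // same-kernel elements are coherent, so batching by queue keeps
        // SIMD lanes / sibling threads usefully occupied.
        void schedule_once(std::size_t batch = 32) {
            BoundKernel* best = nullptr;
            for (auto& b : bound_)
                if (!b.queue.empty() && (!best || b.queue.size() > best->queue.size()))
                    best = &b;
            if (!best) return;
            for (std::size_t i = 0; i < batch && !best->queue.empty(); ++i) {
                void* e = best->queue.back();
                best->queue.pop_back();
                best->run(e);
            }
        }
    };

Batching by queue is what preserves coherence: every element in a queue wants the same kernel, so a dequeued batch maps naturally onto SIMD lanes or sibling threads.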
Extensions CLAIM: Queues break the “batch processing” straitjacket and still expose enough coherent parallelism to sustain compute-max utilization. • First part is easy: • Obvious per-data-element state machine • Dynamic insertion of new “work” • Instead of sitting idle as the live thread count in a “pass” drops, a core can pull in “work” from other “passes” (queues), as sketched below.
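A sketch of the per-data-element state machine with dynamic work insertion, using an invented three-stage, ray-tracing-flavoured pipeline purely for illustration:

    #include <cstdint>
    #include <deque>

    // Hypothetical sketch: each element carries its current state ("pass"),
    // and a kernel finishing one state enqueues the element for its next
    // state instead of waiting for a dataset-wide pass boundary.
    enum class State : std::uint8_t { Generate, Traverse, Shade, Done };

    struct Element { State state; float payload[4]; };

    // One queue per state; compare the ComputeCore sketch above.
    std::deque<Element*> queues[3];

    void step(Element* e) {
        switch (e->state) {
        case State::Generate:
            e->state = State::Traverse;   // new work inserted dynamically
            queues[1].push_back(e);
            break;
        case State::Traverse:
            e->state = State::Shade;
            queues[2].push_back(e);
            break;
        case State::Shade:
            e->state = State::Done;       // retires; no further work
            break;
        case State::Done:
            break;
        }
    }

    // A core whose current "pass" runs dry pulls from the other passes'
    // queues instead of idling, so the live thread count never collapses
    // at pass boundaries.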
Extensions CLAIM: Queues break the “batch processing” straitjacket and still expose enough coherent parallelism to sustain compute-max utilization. • Second part is more controversial: • “Lots” of data quantized into a “few” states should have plentiful, easy coherence. • If the workload as a whole has coherence • Pigeon-hole argument, basically: a million in-flight elements spread over a dozen states guarantee some state has tens of thousands of coherent elements waiting • Also mitigates SIMD performance constraints • Coherence can be built / specified dynamically
Outline • Introduction • CPU “Special Feature” Background • Compute-Maximizing Processors • Synthesis, with Extensions • Questions for the Audience…
Audience Participation • Do you believe my argument conceptually? • For the heterogeneous / hybrid CPU in general? • For queues and multiple kernels? • What persuades you that 3 x86 cores + compute is preferable to quad x86? • What app / class of apps, and how much of a win? 10x? 5x? • How skeptical are you that queues can match the performance of multi-pass / batching? • What would you find a compelling flexibility / expressiveness justification for adding queues? • Performance wins from regaining coherence in existing branching/looping shaders? • New algorithms if shaders and CPU threads can dynamically insert additional “work”?