Undersubscribed Threading on Clustered Cache Architectures Wim Heirman1,2 Trevor E. Carlson1 Kenzo Van Craeynest1 Ibrahim Hur2 Aamer Jaleel2 Lieven Eeckhout1 1 Ghent University 2 Intel Corporation HPCA 2014, Orlando, FL
Context
• Many-core processors with 10s-100s of cores (e.g., Intel Xeon Phi, Tilera, GPGPU)
• Running scalable, data-parallel workloads (SPEC OMP, NAS Parallel Benchmarks, …)
• Processor design at a fixed area/power budget: how much to spend on cores vs. caches, and which cache topology?
Overview
• Cache topology: why clustered caches? why undersubscription?
• Dynamic undersubscription: CRUST algorithms for automatic adaptation
• CRUST and future many-core design
Many-core Cache Architectures
[Diagram: three topologies for N cores: private (one cache per core, N caches), clustered (one cache shared by C cores, N/C caches), and shared NUCA (one cache shared by all N cores)]
Many-core Cache Architectures
[Diagram: moving from private through clustered to shared caches increases sharing but also increases hit latency]
Undersubscribing for Cache Capacity
[Diagram: a 4-core cluster at full subscription and at 3/4, 2/4, and 1/4 undersubscription]
• Less than C active cores/threads per cluster
• When the working set does not fit in cache
• Keeps all cache capacity accessible
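As a small worked illustration of the idea above (a hypothetical helper, not from the talk): with C = 4 cores sharing a 1 MB cluster cache, lowering the active thread count leaves the full cache accessible to fewer threads, so capacity per thread grows.

```python
# Cache capacity available per active thread in one cluster, for each
# undersubscription level. Constants match the baseline described later
# in the talk (1 MB shared L2 per 4-core cluster); the helper itself
# is illustrative.
CLUSTER_CACHE_KB = 1024   # 1 MB shared L2 per cluster
CORES_PER_CLUSTER = 4     # C

def capacity_per_thread_kb(active_threads):
    """All of the cluster's cache stays accessible; fewer threads share it."""
    return CLUSTER_CACHE_KB // active_threads

for t in range(CORES_PER_CLUSTER, 0, -1):
    print(f"{t}/4 subscription: {capacity_per_thread_kb(t)} KB per thread")
```

At full subscription each thread effectively gets 256 KB; at 1/4 undersubscription the single remaining thread can use the whole 1 MB.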
Many-core Cache Architectures
[Diagram: private caches permit no undersubscription (1:1 granularity), clustered caches permit it per cluster (1:C), shared caches only all-or-nothing (1:N); sharing and hit latency again increase from private to shared]
Performance & Energy: Working Set vs. Cache Size
• Baseline architecture: 128 cores, private L1, clustered L2 (1 MB shared per 4 cores)
[Plot: normalized performance (1/execution time) and energy efficiency (1/energy) for N-cg/A across subscription levels 1/4 to 4/4]
Performance & Energy: Working Set vs. Cache Size
[Plots: performance and energy efficiency across subscription levels 1/4 to 4/4 for N-cg/C, N-cg/A, and N-ft/C]
• Capacity bound: reduce thread count to optimize hit rate (here, 1/4 undersubscription gives 3.5x performance and 80% energy savings)
• Bandwidth bound: disable cores for better energy efficiency
• Compute bound: use all cores for highest performance
ClusteR-aware Undersubscribed Scheduling of Threads (CRUST)
• Dynamic undersubscription, integrated into the OpenMP runtime library
• Adapts to each #pragma omp parallel section individually
• Optimizes for performance first, saves energy when possible:
• Compute bound: full subscription
• Bandwidth bound: no* performance degradation (* <5% vs. full)
• Capacity bound: highest performance
• Two CRUST heuristics for on-line adaptation: descend and predict
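The three regimes above suggest a simple per-region decision rule. The sketch below is a hypothetical classifier, not the talk's actual mechanism; the metric names and thresholds are illustrative assumptions.

```python
# Hypothetical classifier for one OpenMP parallel region, based on the
# three regimes CRUST distinguishes. Thresholds are made up for
# illustration; a real runtime would calibrate them.
def classify_region(dram_bw_utilization, llc_miss_rate):
    """dram_bw_utilization and llc_miss_rate are fractions in [0, 1]."""
    if dram_bw_utilization > 0.9:   # memory bandwidth is saturated
        return "bandwidth-bound"
    if llc_miss_rate > 0.3:         # working set exceeds the cluster cache
        return "capacity-bound"
    return "compute-bound"
```

Compute-bound regions keep full subscription; the other two trigger undersubscription, for energy or for performance respectively.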
CRUST-descend
[Plot: performance measured at 4 (full), 3, 2, and 1 threads/cluster; 3 threads/cluster selected]
• Start with full subscription
• Reduce thread count while performance increases
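The descend heuristic can be sketched as a simple downhill search over threads per cluster. This is a minimal sketch of the two bullets above; `measure_performance` is a hypothetical stand-in for timing one instance of the parallel region at a given thread count.

```python
# Sketch of the CRUST-descend heuristic: start at full subscription and
# keep reducing threads per cluster while measured performance improves.
def crust_descend(cores_per_cluster, measure_performance):
    best_threads = cores_per_cluster          # full subscription first
    best_perf = measure_performance(best_threads)
    for threads in range(cores_per_cluster - 1, 0, -1):
        perf = measure_performance(threads)
        if perf <= best_perf:
            break                             # performance stopped improving
        best_threads, best_perf = threads, perf
    return best_threads
```

For a capacity-bound region whose performance peaks at 2 threads/cluster, the search tries 4, 3, 2, then stops after 1 fails to improve and returns 2.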
CRUST-predict
• Reduce the number of steps required by being smarter:
• Start with heterogeneous undersubscription
• Measure the LLC miss rate for each threads/cluster option
• Predict the performance of each option using a PIE-like model
• Select the best predicted option
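The predict heuristic replaces the step-by-step search with one measurement pass plus an analytical model. The sketch below uses a crude CPI-based throughput model as a stand-in for the PIE-like model named above; all constants (miss penalty, base CPI, accesses per instruction) are illustrative assumptions.

```python
# Sketch of the CRUST-predict idea: given one measured LLC miss rate per
# threads/cluster option (from a heterogeneous trial period), predict each
# option's performance and pick the best. The model is a stand-in, not
# the talk's actual PIE-like model.
def predict_performance(threads, miss_rate, miss_penalty_cycles=200,
                        base_cpi=1.0, accesses_per_instr=0.3):
    cpi = base_cpi + accesses_per_instr * miss_rate * miss_penalty_cycles
    return threads / cpi      # aggregate throughput of the active threads

def crust_predict(miss_rate_per_option):
    """miss_rate_per_option: {threads_per_cluster: measured LLC miss rate}"""
    return max(miss_rate_per_option,
               key=lambda t: predict_performance(t, miss_rate_per_option[t]))
```

With miss rates {4: 0.5, 2: 0.02, 1: 0.01}, the model predicts that halving the thread count recovers far more through hit rate than it loses in parallelism, so 2 threads/cluster is selected in a single step.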
Methodology
• Generic many-core architecture: 128 cores, 2-issue OOO @ 1 GHz
• 2x 32 KB private L1 I+D, L1-D stride prefetcher
• 1 MB shared L2 per 4 cores, 2-D mesh NoC
• 64 GB/s total DRAM bandwidth
• Sniper simulator, McPAT for power
• SPEC OMP and NAS Parallel Benchmarks (reduced iteration counts from ref; class A inputs)
Results: Oracle (static)
[Plot: per-benchmark results grouped into compute-bound, capacity-bound, and bandwidth-bound categories]
Results: Linear Bandwidth Models
[Plot: same benchmark grouping as above]
• Linear bandwidth models (e.g., BAT) save energy but do not exploit capacity effects on clustered caches
Results: CRUST
[Plot: same benchmark grouping as above]
• CRUST saves energy when bandwidth-bound and exploits capacity effects on clustered caches
Undersubscription vs. Future Designs
• Finite chip area, spent on cores or caches: increasing max. compute vs. keeping cores fed with data
• Undersubscription can adapt workload behavior to the architecture. Does this allow us to build a higher-performance design?
• Sweep the core vs. cache area ratio for a 14 nm design:
• Fixed 600 mm² area; core = 1.5 mm²; L2 cache = 3 mm²/MB
• Clustered L2 shared by 4 cores, latency ~ log2(size)
• 1 GB @ 512 GB/s on-package memory, 64 GB/s off-package
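The area sweep above is simple arithmetic; the sketch below works through one design point under the stated budget. The 50/50 split used in the example is an arbitrary illustration, not a design point from the talk.

```python
# Worked example of the fixed-area sweep: 600 mm^2 split between cores
# (1.5 mm^2 each) and clustered L2 (3 mm^2 per MB, one cache per
# 4-core cluster). Only the constants come from the talk; the split
# fraction is illustrative.
TOTAL_AREA_MM2 = 600.0
CORE_AREA_MM2 = 1.5
CACHE_AREA_MM2_PER_MB = 3.0
CORES_PER_CLUSTER = 4

def design_point(core_area_fraction):
    """Return (cores, total L2 in MB, MB per 4-core cluster)."""
    cores = int(TOTAL_AREA_MM2 * core_area_fraction / CORE_AREA_MM2)
    cache_mb = TOTAL_AREA_MM2 * (1 - core_area_fraction) / CACHE_AREA_MM2_PER_MB
    mb_per_cluster = cache_mb / (cores / CORES_PER_CLUSTER)
    return cores, cache_mb, mb_per_cluster
```

Splitting the die evenly, for instance, yields 200 cores and 100 MB of L2, i.e. 2 MB per 4-core cluster; shifting area toward cores trades cluster cache capacity for peak compute.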
Undersubscription for Future Designs
[Plot: N-ft/C performance across the design points]
• Compute bound: linear relation between active cores and performance
• Capacity bound: reduce thread count until the combined working set fits the available cache
Undersubscription for Future Designs
[Bar chart: relative performance of design points A-F under full vs. dynamic subscription; labeled peaks of 1.24 (full) and 1.54 (dynamic)]
• Build one design with the best average performance
• Full subscription: conservative option C has the highest average performance
• Dynamic undersubscription: prefer more cores, for higher max. performance on compute-bound benchmarks; use undersubscription to accommodate capacity-bound workloads
• E vs. C: 40% more cores, 15% higher performance
Conclusions
• Use clustered caches for future many-core designs: balance hit rate and hit latency; exploit sharing to avoid duplication; allow for undersubscription (use all cache, not all cores)
• CRUST for dynamic undersubscription: adapt thread count per OpenMP parallel section; performance and energy improvements of up to 50%
• Take the undersubscription usage model into account when designing future many-core processors: a CRUST-aware design gives 40% more cores and 15% higher performance