Optimizing for Intel multi-/many-core architectures N. Satish, Throughput Computing Lab, Intel Labs
Outline of the talk • Architectural trends • Optimizations for multi-/many-core platforms • Challenges in performance scaling moving forward
Increasing parallelism • Core scaling • Nehalem (4 cores) -> Westmere (6 cores) -> … -> Intel Knights Ferry (32 cores) -> … • Data-level parallelism (SIMD) scaling • MMX (64-bit) -> SSE (128-bit) -> AVX (256-bit) -> LRBNI (512-bit) • Thread scaling per core • Core 2 (1 thread/core) -> Nehalem (2 threads/core) -> … -> Intel Knights Ferry (4 threads/core) • Cache scaling (growing more slowly) • Memory latency is not likely to drop • Need to make better use of caches, SMT, and ILP
Intel® MIC Architecture: An Intel Co-Processor Architecture [Block diagram: many vector IA cores, each paired with a coherent cache, connected by an interprocessor network to memory and I/O interfaces and fixed-function logic] • Many cores and many, many more threads • Standard IA programming and memory model Source: Kirk Skaugen, ISC 2010 keynote
Knights Ferry • Software development platform • 32 cores, 1.2 GHz • 128 threads at 4 threads / core • 8MB shared coherent cache • 1-2GB GDDR5 • Bundled with Intel HPC tools Software development platform for Intel® MIC architecture Source: Kirk Skaugen, ISC 2010 keynote
The Knights Family • Knights Ferry: software development platform • Knights Corner: 1st Intel® MIC product, 22nm process, >50 Intel Architecture cores • Future Knights products Source: Kirk Skaugen, ISC 2010 keynote
Outline of the talk • Architectural trends • Optimizations for multi-/many-core platforms • Challenges in performance scaling moving forward
Extent of possible gains • Tree search [SIGMOD 2010]: performance difference on Core i7: 8X over baseline, 5X over previous best reported results • In Lee et al. [ISCA 2010], we showed that performance of CPUs could be improved by an average of 8X for a range of throughput-intensive kernels
General optimization flow • Scale the problem down to fit in a single core's cache and use one core (to optimize for compute) • Vectorization • Avoid core stalls due to lack of ILP • Optimize for memory latency and per-core bandwidth • Block for the TLB and for multiple levels of cache (see the blocking sketch below) • Software pipelining, iteration lookahead • Avoid cache/TLB conflicts • Finally, check core scalability (by weak scaling) • Dynamic load balancing, avoiding synchronization • If the architecture does not have scalable bandwidth, bandwidth bottlenecks may still appear
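As a concrete illustration of the blocking step above, here is a minimal sketch in C: a cache-blocked matrix transpose. The function name, BLOCK size, and row-major layout are assumptions for illustration, not code from the talk.

    #include <stddef.h>

    /* Minimal sketch of the blocking step: a cache-blocked transpose.
     * BLOCK is a tuning parameter (assumed here) chosen so one
     * BLOCK x BLOCK tile of src and dst stays resident in cache,
     * and its pages in the TLB. */
    #define BLOCK 64

    void transpose_blocked(float *dst, const float *src, size_t n)
    {
        for (size_t ii = 0; ii < n; ii += BLOCK)
            for (size_t jj = 0; jj < n; jj += BLOCK)
                /* process one tile at a time so it stays cache-resident */
                for (size_t i = ii; i < ii + BLOCK && i < n; i++)
                    for (size_t j = jj; j < jj + BLOCK && j < n; j++)
                        dst[j * n + i] = src[i * n + j];
    }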
Tree Search [SIGMOD 2010] • Each query traverses a path in a binary tree until it hits a leaf node, then checks whether the leaf node's value matches the query • Assume first that the whole tree fits in cache (say 16 levels = 64K entries) • Parallelization is over queries and is trivial (see the baseline sketch below) • Naïve SIMD, with each lane handling one query, is heavily latency bound
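A minimal scalar baseline for this kernel might look as follows, assuming the tree is stored as an implicit array (node i has children 2i+1 and 2i+2); the names and layout are illustrative, not the SIGMOD 2010 code. The OpenMP pragma shows the trivial parallelization over queries.

    #include <stddef.h>

    /* Hypothetical scalar baseline: each query walks an implicit binary
     * tree stored in an array (children of node i are 2i+1 and 2i+2). */
    void search_queries(const int *tree, int levels, const int *queries,
                        int *hit, size_t nqueries)
    {
        #pragma omp parallel for   /* parallelization over queries is trivial */
        for (size_t q = 0; q < nqueries; q++) {
            size_t node = 0;
            for (int l = 1; l < levels; l++)              /* one level per step */
                node = 2 * node + (queries[q] > tree[node] ? 2 : 1);
            hit[q] = (tree[node] == queries[q]);          /* check leaf value */
        }
    }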
SIMD Blocking • Rearrange the binary tree and block for SIMD: no gathers/scatters needed • Tradeoff: SIMD speedup scales as the log of the SIMD width, giving 1.2X scaling on Core i7 • This improves to 3.1X on KNF (out of a peak of 4X) • The bottleneck is then back-to-back dependencies: software pipeline • Now compute bound: 2.5X better than previously reported CPU results • MIC architecture performance is about 2.3X better than Core i7
Optimizing for memory • For a larger tree, every level of the tree (beyond the first few levels) is a cache miss, and also a TLB miss at larger depths • TLB misses are expensive – 200-300 cycles • Cache misses can be 10-100 cycles in latency, depending on level • Heavily latency bound
Tree Search [SIGMOD 2010] • Page blocking minimizes TLB misses (one 2MB page can hold a sub-tree of 20 levels), leaving only 2 TLB misses for the whole search: 1.7X speedup on Core i7, 3X speedup on KNF (a huge-page allocation sketch follows) • Cache misses are also minimized • The first few levels are kept warm in cache • No cache-line blocking is needed on KNF: the cache-line length equals the SIMD width
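One way to obtain the 2MB pages that page blocking relies on is sketched below; this is a Linux-specific assumption on our part, not the paper's allocation code.

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <stddef.h>

    /* Sketch: back the tree with 2MB huge pages so that a ~20-level
     * sub-tree laid out contiguously shares a single TLB entry. */
    void *alloc_tree_2mb_pages(size_t bytes)
    {
        void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        return (p == MAP_FAILED) ? NULL : p;
    }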
Thread-level parallelism • Process multiple queries at once • No issues for tree search: we obtain bandwidth-bound performance for large trees and compute-bound performance for small trees
Tree search is but one example • Optimizations for compute and memory are widely applicable • Compute: vectorization, unrolling, software pipelining (better tool support) • Memory: TLB/cache blocking, prefetching • Optimizations for core parallelism • Create many statically partitioned tasks, or develop a locality-aware dynamic load balancer • Enforce only actual dependencies instead of performing global barriers (especially on many-core architectures) • E.g., stencils only require neighbor communication, so neighbor dependencies alone can be enforced, limiting cache-to-cache traffic (see the sketch below)
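A minimal sketch of the neighbor-only synchronization idea for a 1D stencil, using C11 atomics and assuming double-buffered tile data; all names are illustrative, not from the talk.

    #include <stdatomic.h>

    #define NTILES 64
    static atomic_int done_step[NTILES];   /* last time step each tile finished */

    static void wait_for(int tile, int step)
    {
        if (tile < 0 || tile >= NTILES) return;   /* boundary tile: no neighbor */
        while (atomic_load_explicit(&done_step[tile],
                                    memory_order_acquire) < step)
            ;                                     /* spin-wait on the neighbor */
    }

    void tile_worker(int tile, int nsteps)
    {
        for (int step = 1; step <= nsteps; step++) {
            wait_for(tile - 1, step - 1);   /* wait only on the two neighbors, */
            wait_for(tile + 1, step - 1);   /* never on a global barrier       */
            /* ... double-buffered stencil update of this tile for `step` ... */
            atomic_store_explicit(&done_step[tile], step, memory_order_release);
        }
    }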
Outline of the talk • Architectural trends • Optimizations for multi-/many-core platforms • Challenges in performance scaling moving forward
Will things continue scaling well? • Challenge 1: SIMD efficiency • Certain algorithms are not SIMD friendly • Issue 1: code can be irregular, with branches • Issue 2: code may require gathers/scatters to/from distinct locations • To support these efficiently, the LRB Native Instruction Set has 3 features: (1) mask support, (2) gather/scatter instructions, and (3) pack/unpack instructions • Most additions are 512-bit vector instructions with masks that predicate writes into vector registers • Gather loads values from non-contiguous memory locations (not necessarily cheap if they miss cache) • Pack is a restricted (cheaper) gather in which elements within a cache line that share the same mask value are collected
pack • Input: v0; Output: [rbx]

    for (i = 0; i < N; i++) {
        if (A[i]) {
            /* one kind of work */
        } else {
            /* another kind of work */
        }
    }

• Assume N is large • Compute the mask using vcmp • Use pack to collect the items with mask 0 • Then collect the elements with mask 1 • Run SIMD-friendly code on the collected elements
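The LRBNI intrinsics themselves are not reproduced in the talk; as a hedged modern analogue (AVX-512 descends from LRBNI), the compare-to-mask plus compress-store pattern below implements the same mask-then-pack idea. The intrinsics are real AVX-512 calls, but the partitioning scheme and names are our illustration, not the talk's code.

    #include <immintrin.h>
    #include <stddef.h>

    /* AVX-512 analogue of vcmp + pack: partition A[0..n) into taken[]
     * (non-zero elements) and fallthru[] (zero elements) so each side
     * can then be processed with branch-free SIMD code. */
    void pack_by_predicate(const int *A, size_t n, int *taken, int *fallthru)
    {
        size_t t = 0, f = 0, i = 0;
        for (; i + 16 <= n; i += 16) {
            __m512i v = _mm512_loadu_si512(A + i);
            __mmask16 m = _mm512_test_epi32_mask(v, v);   /* the vcmp step */
            _mm512_mask_compressstoreu_epi32(taken + t, m, v);        /* pack 1s */
            _mm512_mask_compressstoreu_epi32(fallthru + f,
                                             (__mmask16)~m, v);       /* pack 0s */
            t += (size_t)_mm_popcnt_u32(m);
            f += 16 - (size_t)_mm_popcnt_u32(m);
        }
        for (; i < n; i++)                                /* scalar tail */
            *(A[i] ? &taken[t++] : &fallthru[f++]) = A[i];
    }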
Example: ClearPath [Guy et al, SCA 09] • Each person ("agent") finds its nearest neighbors and a velocity that avoids collisions with them (computational geometry; involves branchy code) • SIMD utilization within a single agent is limited (1.25-1.5X on SSE), while inter-agent SIMD requires gather/scatter and has divergent branches • We obtain ~6.4X SIMD scaling on MIC using the hardware support
Other challenges • Challenge 2: BW scaling • Fundamentally, memory bandwidth is not keeping pace with compute • It used to be about 1 byte/flop • Today: Westmere: 0.21 bytes/flop, AMD Magny Cours: 0.20 bytes/flop, NVIDIA GTX 480: 0.13 bytes/flop (a worked example follows) • Future GPUs [Bill Dally, SC 09]: in 2017, 1 GPU node = 2 TB/s, 40 TFlops => 0.05 bytes/flop • Occasional disruptive changes improve BW • There is no magic solution: we really need to use local storage well • Change algorithms if required: merge sort vs. radix sort [SIGMOD 2010]
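As a rough worked example of where such ratios come from (the peak numbers are our assumptions, not from the talk): a 6-core Westmere at 3.33 GHz executing 8 single-precision flops per cycle per core peaks near 6 × 3.33 × 8 ≈ 160 GFlop/s, while three DDR3-1333 channels supply about 3 × 1.333 GT/s × 8 bytes ≈ 32 GB/s, giving 32 / 160 ≈ 0.2 bytes/flop, consistent with the figure above.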
Other challenges • Challenge 3: cache capacity is not going to keep increasing at the rate of compute • Can't increase latency and power too much • It will most likely stay in the tens of MB range, never reaching GBs • New levels of the memory hierarchy will come in between • eDRAM is one example: it offers bandwidth and capacity in between caches and GDDR • Need to capture working sets at multiple levels • Need better autotuners (see the sketch below)
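As a sketch of what such an autotuner might look like (the kernel name, candidate sizes, and timing method are all assumptions, not the talk's tooling):

    #include <stddef.h>
    #include <stdio.h>
    #include <time.h>

    extern void blocked_kernel(int block);    /* hypothetical tunable kernel */

    /* Empirically pick the best block size by timing each candidate. */
    int autotune_block(void)
    {
        const int candidates[] = { 16, 32, 64, 128, 256 };
        int best = candidates[0];
        double best_t = 1e30;
        for (size_t i = 0; i < sizeof candidates / sizeof *candidates; i++) {
            struct timespec a, b;
            clock_gettime(CLOCK_MONOTONIC, &a);
            blocked_kernel(candidates[i]);     /* run once per candidate size */
            clock_gettime(CLOCK_MONOTONIC, &b);
            double t = (b.tv_sec - a.tv_sec) + 1e-9 * (b.tv_nsec - a.tv_nsec);
            if (t < best_t) { best_t = t; best = candidates[i]; }
        }
        printf("best block size: %d (%.4f s)\n", best, best_t);
        return best;
    }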
KNF performance results • HPC kernels • SGEMM (1 TFlop) [SC 2009 demo] • LU (>0.5 TFlop) [ISC 2010 demo] • Medical imaging • Volume Rendering [TVCG 2009], gather/scatter heavy: 5-8X faster than 4-core Nehalem • Compressed Sensing (MRI reconstruction) [EMBC 2010], clinically viable: 12 seconds [Images: current clinical vs. compressed-sensing reconstruction]