Feeding the Multicore Beast: It’s All About the Data!

Feeding the Multicore Beast: It’s All About the Data! Michael Perrone, IBM Master Inventor, Mgr, Cell Solutions Dept.

Outline • History: Data challenge • Motivation for multicore • Implications for programmers • How Cell addresses these implications • Examples: 2D/3D FFT (medical imaging, petroleum, general HPC…); Green’s functions (seismic imaging, petroleum); string matching (network processing: DPI & intrusion detection); neural networks (finance) mpp@us.ibm.com

Chapter 1: The Beast is Hungry!

The Hungry Beast [Diagram: Data (“food”) flows through a Data Pipe to the Processor (“beast”)] • Pipe too small = starved beast • Pipe big enough = well-fed beast • Pipe too big = wasted resources • If flops grow faster than pipe capacity, the beast gets hungrier!
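The pipe analogy can be put in numbers with a simple bandwidth-limited model. The sketch below is only an illustration: the function names and every number in it are hypothetical, and it assumes the work done per byte moved stays fixed while peak flops grow.

```python
def attainable_gflops(peak_gflops, pipe_gb_per_s, flops_per_byte):
    """The beast eats no faster than the pipe delivers food."""
    return min(peak_gflops, pipe_gb_per_s * flops_per_byte)


def fed_fraction(peak_gflops, pipe_gb_per_s, flops_per_byte):
    """Fraction of peak the processor can sustain (1.0 = well fed)."""
    return attainable_gflops(peak_gflops, pipe_gb_per_s, flops_per_byte) / peak_gflops


# Hypothetical values: a fixed 10 GB/s pipe and 2 flops of work per byte.
PIPE_GB_PER_S = 10.0
FLOPS_PER_BYTE = 2.0

for peak in (10.0, 20.0, 40.0, 80.0):
    # Peak flops double each step while the pipe stays the same size.
    print(peak, fed_fraction(peak, PIPE_GB_PER_S, FLOPS_PER_BYTE))
```

Once peak flops pass what the pipe can supply, each further doubling halves the fed fraction: the beast gets hungrier.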

Move the food closer • Example: Intel Tulsa • Xeon MP 7100 series • 65 nm, 349 mm², 2 cores • 3.4 GHz @ 150 W • ~54.4 SP GFlops • http://www.intel.com/products/processor/xeon/index.htm • Large cache on chip • ~50% of area • Keeps data close for efficient access • If the data is local, the beast is happy! • True for many algorithms

What happens if the beast is still hungry? • If the data set doesn’t fit in cache • Cache misses • Memory latency exposed • Performance degraded • Several important application classes don’t fit • Graph searching algorithms • Network security • Natural language processing • Bioinformatics • Many HPC workloads [Diagram labels: Cache, Data]

Make the food bowl larger • Cache size steadily increasing • Implications • Chip real estate reserved for cache • Less space on chip for compute • More power required for fewer FLOPS • But… • Important application working sets are growing faster than caches • Multicore even more demanding on cache than uni-core [Diagram labels: Cache, Data]

Chapter 2: The Beast Has Babies

Power Density – The fundamental problem

What’s causing the problem? [Figure: power density (W/cm²) vs. gate length (microns), with a 65 nm gate stack image] • Gate dielectric approaching a fundamental limit (a few atomic layers) • Power, signal jitter, etc.

Diminishing Returns on Frequency In a power-constrained environment, chip clock speed yields diminishing returns. The industry has moved to lower-frequency multicore architectures. [Figure: Frequency-Driven Design Points]

Power vs Performance Trade-Offs [Figure: power vs. performance chart; data labels 0.85, 1.7, 1.45, 1, 1.3] We need to adapt our algorithms to get performance out of multicore

Implications of Multicore • There are more mouths to feed • Data movement will take center stage • Complexity of cores will stop increasing … and has started to decrease in some cases • Complexity increases will center around communication • Assumption • Achieving a significant % of peak performance is important

Chapter 3: The Proper Care and Feeding of Hungry Beasts

Cell/B.E. Processor: 200 GFLOPS (SP) @ ~70 W

Feeding the Cell Processor • 8 SPEs, each with: LS, MFC, SXU • PPE: OS functions, disk IO, network IO [Block diagram labels: 8× SPE (SPU, SXU, LS, MFC); PPE (PPU, L1, PXU, L2); MIC; BIC; EIB (up to 96B/cycle); Dual XDR™; FlexIO™; link widths 16B/cycle, 16B/cycle (2x), 32B/cycle; 64-bit Power Architecture with VMX]

Cell Approach: Feed the beast more efficiently • Explicitly “orchestrate” the data flow between main memory and each SPE’s local store • Use SPE’s DMA engine to gather & scatter data between main memory and local store • Enables detailed programmer control of data flow • Get/Put data when & where you want it • Hides latency: Simultaneous reads, writes & computes • Avoids restrictive HW cache management, which is unlikely to determine the optimal data flow and is potentially very inefficient • Allows more efficient use of the existing bandwidth • BOTTOM LINE: It’s all about the data!
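The idea of hiding latency by overlapping reads with computes can be sketched with double buffering. This is a minimal illustration and not Cell SDK code: a background Python thread stands in for the SPE's DMA engine, a list stands in for main memory, and the names `fetch` and `process_double_buffered` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch(memory, start, size):
    """Stand-in for a DMA get: copy one chunk of 'main memory' into a local buffer."""
    return memory[start:start + size]


def process_double_buffered(memory, chunk_size, compute):
    """Overlap the fetch of chunk i+1 with the compute on chunk i."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as dma:
        # Prime the first buffer before the loop starts.
        pending = dma.submit(fetch, memory, 0, chunk_size)
        for start in range(0, len(memory), chunk_size):
            current = pending.result()  # wait for the buffer in flight
            next_start = start + chunk_size
            if next_start < len(memory):
                # Start filling the other buffer while we compute on this one.
                pending = dma.submit(fetch, memory, next_start, chunk_size)
            results.append(compute(current))
    return results


partial_sums = process_double_buffered(list(range(10)), 4, sum)
print(partial_sums)
```

The programmer, not a hardware cache, decides which chunk is fetched next and when, which is the control the slide describes.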

Cell Comparison: ~4x the FLOPS @ ~½ the power. Both 65 nm technology (to scale)
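As a rough check of these ratios, the figures quoted earlier in the deck can be plugged in directly. The slide does not name the chip Cell is compared against; the sketch below assumes it is the 65 nm Intel Tulsa from the earlier slide, so treat the result as illustrative only.

```python
# Figures quoted on earlier slides (approximate).
cell_gflops, cell_watts = 200.0, 70.0      # Cell/B.E.: 200 GFLOPS (SP) @ ~70 W
tulsa_gflops, tulsa_watts = 54.4, 150.0    # Tulsa: ~54.4 SP GFlops @ 150 W (assumed comparison chip)

flops_ratio = cell_gflops / tulsa_gflops
power_ratio = cell_watts / tulsa_watts
efficiency_ratio = (cell_gflops / cell_watts) / (tulsa_gflops / tulsa_watts)

print(f"{flops_ratio:.1f}x the flops at {power_ratio:.2f}x the power, "
      f"{efficiency_ratio:.1f}x the GFLOPS per watt")
```

Under that assumption the ratios come out near the slide's "~4x" and "~½", and the two combine to roughly 8x the flops per watt.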

Memory Managing Processor vs. Traditional General Purpose Processor [Figure labels: Cell BE; AMD, IBM, Intel]

Examples of Feeding Cell • 2D and 3D FFTs • Seismic Imaging • String Matching • Neural Networks (function approximation)

Feeding FFTs to Cell • SIMDized data • DMAs double buffered • Pass 1: For each buffer • DMA Get buffer • Do four 1D FFTs in SIMD • Transpose tiles • DMA Put buffer • Pass 2: For each buffer • DMA Get buffer • Do four 1D FFTs in SIMD • Transpose tiles • DMA Put buffer [Figure labels: Input Image, Buffer, Tile, Transposed Tile, Transposed Buffer, Transposed Image]
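The two-pass structure above (row FFTs, then a transpose, done twice) can be sketched in plain Python. This is an assumption-laden illustration rather than the Cell implementation: it leaves out SIMD, tiling, and DMA buffering, uses a simple recursive radix-2 FFT, and the function names are hypothetical.

```python
import cmath


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def transpose(matrix):
    """Swap rows and columns."""
    return [list(col) for col in zip(*matrix)]


def fft_pass(image):
    """One pass: 1D FFT of every row, then transpose the result."""
    return transpose([fft(row) for row in image])


def fft2d(image):
    """Two identical passes yield the 2D FFT in the original orientation."""
    return fft_pass(fft_pass(image))
```

Because each pass ends with a transpose, both passes read rows with unit stride, and the second transpose restores the original layout.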

3D FFTs • Long stride thrashes cache • Cell DMA allows prefetch [Figure labels: N, Stride 1, Stride N², Single Element, Data envelope]
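To see where the strides come from, consider an N × N × N array stored with the last index varying fastest. The sketch below only computes element offsets; the storage order, axis names, and function names are assumptions made for illustration.

```python
def strides(n):
    """Element strides for an n x n x n array with the last index fastest."""
    return {"z": 1, "y": n, "x": n * n}


def pencil_offsets(n, axis, base=0):
    """Offsets of the n elements visited by one 1D FFT along the given axis."""
    step = strides(n)[axis]
    return [base + i * step for i in range(n)]


print(pencil_offsets(4, "z"))  # unit stride: consecutive elements
print(pencil_offsets(4, "x"))  # stride N*N: every access lands far from the last
```

A cache sees the long-stride case as a stream of unrelated lines, whereas a programmer who knows these offsets in advance can request them before they are needed.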

Feeding Seismic Imaging to Cell • New Green’s function G at each (x,y) • Radial symmetry of G reduces BW requirements [Figure labels: Data, Green’s Function, (X,Y)]

Feeding Seismic Imaging to Cell [Figure labels: Data; SPE 0 to SPE 7]

Feeding Seismic Imaging to Cell • For each X • Load next column of data • Load next column of indices • For each Y • Load Green’s functions • SIMDize Green’s functions • Compute convolution at (X,Y) • Cycle buffers [Figure labels: 2R+1, (X,Y), R, H, 1, Data buffer, Green’s Index buffer, 2]
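The loop above can be sketched as a convolution whose kernel changes at every output point. The details here are assumptions, not the author's code: each Green's function is taken to be a 1D profile indexed by rounded radius (one way to exploit radial symmetry), the kernel footprint is a disc of radius R, and all names are hypothetical. Buffering and SIMD are left out.

```python
import math


def radial_table(R):
    """Map each (dx, dy) offset within radius R to a radius bin, computed once."""
    table = []
    for dx in range(-R, R + 1):
        for dy in range(-R, R + 1):
            d = math.hypot(dx, dy)
            if d <= R:
                table.append((dx, dy, int(round(d))))
    return table


def image_point(data, profile, table, x, y):
    """Convolve the data around (x, y) with one radially symmetric Green's function."""
    total = 0.0
    for dx, dy, r in table:
        xi, yi = x + dx, y + dy
        if 0 <= xi < len(data) and 0 <= yi < len(data[0]):
            total += data[xi][yi] * profile[r]
    return total


def image(data, index, profiles, R):
    """index[x][y] selects which Green's function profile applies at (x, y)."""
    table = radial_table(R)
    return [[image_point(data, profiles[index[x][y]], table, x, y)
             for y in range(len(data[0]))]
            for x in range(len(data))]
```

With this layout each Green's function costs R+1 values to load instead of (2R+1)² for a full 2D kernel, which is one reading of how radial symmetry reduces the bandwidth requirement.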

Feeding String Matching to Cell • Sample word list: “the”, “that”, “math” • Find (lots of) substrings in (long) string • Build graph of words & represent as DFA • Problem: Graph doesn’t fit in LS
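One common way to build such a graph is the Aho-Corasick automaton. The slide does not say which construction was used, so the sketch below is an assumption, shown on the sample word list; it also ignores the local-store size problem entirely.

```python
from collections import deque


def build_automaton(words):
    """Build a trie with failure links (Aho-Corasick) for the word list."""
    goto, fail, out = [{}], [0], [[]]
    for word in words:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(word)
    # Breadth-first walk: a node's failure link points to its longest proper
    # suffix that is also a path from the root.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, child in goto[state].items():
            queue.append(child)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            out[child] = out[child] + out[fail[child]]
    return goto, fail, out


def find_all(text, automaton):
    """Return (start_index, word) for every occurrence, scanning the text once."""
    goto, fail, out = automaton
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in out[state]:
            hits.append((i - len(word) + 1, word))
    return hits


automaton = build_automaton(["the", "that", "math"])
print(find_all("mathe that", automaton))
```

Every input character triggers a state lookup at an address that depends on the data, which is why the size of the graph relative to the local store matters so much.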

Feeding String Matching to Cell

Hiding Main Memory Latency

Software Multithreading

Feeding Neural Networks to Cell • Neural net function F(X) • RBF, MLP, KNN, etc. • If too big for LS, BW bound [Figure labels: input X with D input dimensions; D×N matrix of parameters; N basis functions (dot product + nonlinearity); output F]

Convert BW Bound to Compute Bound • Split function over multiple SPEs • Avoids unnecessary memory traffic • Reduce compute time per SPE • Minimal merge overhead [Figure label: Merge]
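The split-and-merge idea can be sketched as follows. The specifics are assumptions for illustration: tanh stands in for the nonlinearity, F(X) is taken to be the plain sum of the N basis outputs, the parameters are stored as N rows of length D, and the "workers" are slices processed in a loop, not real SPEs.

```python
import math


def basis_outputs(params, x):
    """Each basis function: dot product with the input, then a nonlinearity."""
    return [math.tanh(sum(p * xi for p, xi in zip(row, x))) for row in params]


def net(params, x):
    """Unsplit reference: F(X) as the sum of all N basis outputs."""
    return sum(basis_outputs(params, x))


def net_split(params, x, workers=8):
    """Give each worker a contiguous slice of basis functions, then merge."""
    n = len(params)
    step = max(1, -(-n // workers))  # ceiling division, at least one row per slice
    partials = [sum(basis_outputs(params[i:i + step], x))
                for i in range(0, n, step)]
    return sum(partials)  # the merge step: one addition per worker
```

If each slice fits in a worker's local store, the parameters can stay resident and only X and one partial result per worker need to move, so the work per worker shrinks while the merge stays small.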

Moral of the Story: It’s All About the Data! • The data problem is growing: multicore • Intelligent software prefetching • Use DMA engines • Don’t rely on HW prefetching • Efficient data management • Multibuffering: Hide the latency! • BW utilization: Make every byte count! • SIMDization: Make every vector count! • Problem/data partitioning: Make every core work! • Software multithreading: Keep every core busy!

Backup

Abstract Technological obstacles have prevented the microprocessor industry from achieving increased performance through increased chip clock speeds. In reaction to these restrictions, the industry has chosen the multicore path. Multicore processors promise tremendous GFLOPS performance but raise the challenge of how one programs them. In this talk, I will discuss the motivation for multicore, the implications for programmers, and how the Cell/B.E. processor’s design addresses these challenges. As an example, I will review one or two applications that highlight the strengths of Cell.
