380 likes | 491 Views
Feeding the Multicore Beast: It’s All About the Data!. Michael Perrone IBM Master Inventor Mgr, Cell Solutions Dept. Outline. History: Data challenge Motivation for multicore Implications for programmers How Cell addresses these implications Examples 2D/3D FFT
E N D
Feeding theMulticore Beast:It’s All About the Data! Michael Perrone IBM Master Inventor Mgr, Cell Solutions Dept.
Outline • History: Data challenge • Motivation for multicore • Implications for programmers • How Cell addresses these implications • Examples • 2D/3D FFT • Medical Imaging, Petroleum, general HPC… • Green’s Functions • Seismic Imaging (Petroleum) • String Matching • Network Processing: DPI & Intrusion Detections • Neural Networks • Finance mpp@us.ibm.com
Chapter 1:The Beast is Hungry! mpp@us.ibm.com
The Hungry Beast Data (“food”) Processor (“beast”) Data Pipe • Pipe too small = starved beast • Pipe big enough = well-fed beast • Pipe too big = wasted resources mpp@us.ibm.com
The Hungry Beast Data (“food”) Processor (“beast”) Data Pipe • Pipe too small = starved beast • Pipe big enough = well-fed beast • Pipe too big = wasted resources • If flops grow faster than pipe capacity… … the beast gets hungrier! mpp@us.ibm.com
Move the food closer • Example: Intel Tulsa • Xeon MP 7100 series • 65nm, 349mm2, 2 Cores • 3.4 GHz @ 150W • ~54.4 SP GFlops • http://www.intel.com/products/processor/xeon/index.htm • Large cache on chip • ~50% of area • Keeps data close for efficient access • If the data is local,the beast is happy! • True for many algorithms mpp@us.ibm.com
What happens if the beast is still hungry? Cache • If the data set doesn’t fit in cache • Cache misses • Memory latency exposed • Performance degraded • Several important application classes don’t fit • Graph searching algorithms • Network security • Natural language processing • Bioinformatics • Many HPC workloads Data mpp@us.ibm.com
Make the food bowl larger Cache • Cache size steadily increasing • Implications • Chip real estate reserved for cache • Less space on chip for computes • More power required for fewer FLOPS Data mpp@us.ibm.com
Make the food bowl larger Cache • Cache size steadily increasing • Implications • Chip real estate reserved for cache • Less space on chip for computes • More power required for fewer FLOPS • But… • Important application working sets are growing faster • Multicore even more demanding on cache than uni-core Data mpp@us.ibm.com
Chapter 2:The Beast Has Babies mpp@us.ibm.com
Power Density – The fundamental problem mpp@us.ibm.com
What’s causing the problem? 65 nM 1000 100 10 Gate Stack Power Density (W/cm2) 1 0.1 0.01 0.001 1 0.1 0.01 Gate dielectricapproaching a fundamental limit (a few atomic layers) Gate Length (microns) Power, signal jitter, etc... mpp@us.ibm.com
Diminishing Returns on Frequency In a power-constrained environment, chip clock speed yields diminishing returns. The industry has moved to lower frequency multicore architectures. Frequency-DrivenDesignPoints mpp@us.ibm.com
.85 1.7 Power vs Performance Trade Offs We need to adapt our algorithms to get performance out of multicore 1.45 1 1.3 mpp@us.ibm.com
Implications of Multicore • There are more mouths to feed • Data movement will take center stage • Complexity of cores will stop increasing … and has started to decrease in some cases • Complexity increases will center around communication • Assumption • Achieving a significant % or peak performance is important mpp@us.ibm.com
Chapter 3:The Proper Care and Feeding of Hungry Beasts mpp@us.ibm.com
Cell/B.E. Processor: 200GFLOPS (SP) @ ~70W mpp@us.ibm.com
SPU SPU SPU SPU SPU SPU SPU SPU SXU SXU SXU SXU SXU SXU SXU SXU LS LS LS LS LS LS LS LS MFC MFC MFC MFC MFC MFC MFC MFC PPU L1 PXU 16B/cycle Feeding the Cell Processor SPE • 8 SPEs each with • LS • MFC • SXU • PPE • OS functions • Disk IO • Network IO 16B/cycle EIB (up to 96B/cycle) 16B/cycle 16B/cycle 16B/cycle (2x) PPE MIC BIC L2 32B/cycle Dual XDRTM FlexIOTM 64-bit Power Architecture with VMX mpp@us.ibm.com
Cell Approach: Feed the beast more efficiently • Explicitly “orchestrate” the data flow between main memory and each SPE’s local store • Use SPE’s DMA engine to gather & scatter data between memory main memory and local store • Enables detailed programmer control of data flow • Get/Put data when & where you want it • Hides latency: Simultaneous reads, writes & computes • Avoids restrictive HW cache management • Unlikely to determine optimal data flow • Potentially very inefficient • Allows more efficient use of the existing bandwidth mpp@us.ibm.com
Cell Approach: Feed the beast more efficiently • Explicitly “orchestrate” the data flow between main memory and each SPE’s local store • Use SPE’s DMA engine to gather & scatter data between memory main memory and local store • Enables detailed programmer control of data flow • Get/Put data when & where you want it • Hides latency: Simultaneous reads, writes & computes • Avoids restrictive HW cache management • Unlikely to determine optimal data flow • Potentially very inefficient • Allows more efficient use of the existing bandwidth • BOTTOM LINE:It’s all about the data! mpp@us.ibm.com
Cell Comparison: ~4x the FLOPS @ ~½ the power Both 65nm technology (to scale) mpp@us.ibm.com
Memory Managing Processor vs. Traditional General Purpose Processor Cell BE AMD IBM Intel mpp@us.ibm.com
Examples of Feeding Cell • 2D and 3D FFTs • Seismic Imaging • String Matching • Neural Networks (function approximation) mpp@us.ibm.com
Feeding FFTs to Cell • SIMDized data • DMAs double buffered • Pass 1: For each buffer • DMA Get buffer • Do four 1D FFTs in SIMD • Transpose tiles • DMA Put buffer • Pass 2: For each buffer • DMA Get buffer • Do four 1D FFTs in SIMD • Transpose tiles • DMA Put buffer Tile Buffer Input Image Transposed Image Transposed Tile Transposed Buffer mpp@us.ibm.com
3D FFTs • Long stride trashes cache • Cell DMA allows prefetch Stride N2 Single Element Data envelope N Stride 1 mpp@us.ibm.com
Feeding Seismic Imaging to Cell Data Green’s Function • New G at each (x,y) • Radial symmetry of G reduces BW requirements (X,Y) mpp@us.ibm.com
Feeding Seismic Imaging to Cell Data SPE 0 SPE 1 SPE 2 SPE 3 SPE 4 SPE 5 SPE 6 SPE 7 mpp@us.ibm.com
Feeding Seismic Imaging to Cell Data SPE 0 SPE 1 SPE 2 SPE 3 SPE 4 SPE 5 SPE 6 SPE 7 mpp@us.ibm.com
Feeding Seismic Imaging to Cell 2R+1 • For each X • Load next column of data • Load next column of indices • For each Y • Load Green’s functions • SIMDize Green’s functions • Compute convolution at (X,Y) • Cycle buffers (X,Y) R H 1 Data buffer Green’s Index buffer 2 mpp@us.ibm.com
Feeding String Matching to Cell Sample Word List: “the”“that”“math” • Find (lots of) substrings in (long) string • Build graph of words & represent as DFA • Problem: Graph doesn’t fit in LS mpp@us.ibm.com
Feeding String Matching to Cell mpp@us.ibm.com
Hiding Main Memory Latency mpp@us.ibm.com
Software Multithreading mpp@us.ibm.com
Feeding Neural Networks to Cell F Output • Neural net function F(X) • RBF, MLP, KNN, etc. • If too big for LS, BW Bound N Basis functions: dot product + nonlinearity DxN Matrix of parameters D Input dimensions X mpp@us.ibm.com
Convert BW Bound to Compute Bound Merge • Split function over multiple SPEs • Avoids unnecessary memory traffic • Reduce compute time per SPE • Minimal merge overhead mpp@us.ibm.com
Moral of the Story:It’s All About the Data! • The data problem is growing: multicore • Intelligent software prefetching • Use DMA engines • Don’t rely on HW prefetching • Efficient data management • Multibuffering: Hide the latency! • BW utilization: Make every byte count! • SIMDization: Make every vector count! • Problem/data partitioning: Make every core work! • Software multithreading: Keep every core busy! mpp@us.ibm.com
Backup mpp@us.ibm.com
Abstract Technological obstacles have prevented the microprocessor industry from achieving increased performance through increased chip clock speeds. In a reaction to these restrictions, the industry has chosen the multicore processors path. Multicore processors promise tremendous GFLOPS performance but raise the challenge of how one programs them. In this talk, I will discuss the motivation for multicore, the implications to programmers and how the Cell/B.E. processors design addresses these challenges. As an example, I will review one or two applications that highlight the strengths of Cell. mpp@us.ibm.com