SIMD Processor Extensions • Houffaneh Osman • halio029@uottawa.ca
SIMD (1) • Single Instruction, Multiple Data • One of the classes in Flynn's taxonomy of computer architectures • Multiple processing elements • Different data streams • Same instruction executed on each stream
SIMD (2) • Able to operate on multiple data items at the same time • Goal: minimize computation time for • Vectors • Matrices • Better speedup than sequential execution
SIMD Architecture • Two types of processors • True SIMD • Pipelined SIMD • Divides an instruction into smaller functions • Executes the smaller functions in parallel on different data
True SIMD - Distributed Memory • Single control unit • M processing elements act as arithmetic units • N data elements (N may be larger than M) • Processing elements receive instructions from the control unit • If a processing element needs information from another processing element • It sends a request to the control unit, which manages the memory exchange
True SIMD - Shared Memory • Single control unit • M processing elements act as arithmetic units • N data elements (N may be larger than M) • Processing elements receive instructions from the control unit • Processing elements can share their memory without going through the control unit
True SIMD • [Figure: True SIMD, distributed memory] • [Figure: True SIMD, shared memory]
Programming the SIMD architecture (1) • Platform used: IBM Cell BE • The Cell Broadband Engine (CBE) • Single-chip multiprocessor • with 9 processors • All processors share the same main storage • Processors are specialized into 2 types • PowerPC Processor Element (PPE) • Synergistic Processor Element (SPE)
Programming the SIMD architecture (2) • VMX: Vector Multimedia eXtension to the PowerPC architecture • Uses data parallelism for faster performance • SIMD in VMX and SPE (Reference: IBM Cell Programming) • 128-bit-wide datapath • 128-bit-wide registers • 4-wide fullwords, 8-wide halfwords, 16-wide bytes • SPE also supports 2-wide doublewords • Vector programming
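The lane counts on this slide follow from dividing the 128-bit register width by the element size. A minimal sketch in portable C, assuming a hypothetical `vreg128` union and `lanes_for` helper that are illustration only and not part of VMX or the SPE instruction set:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { VREG_BYTES = 16 };  /* 128 bits = 16 bytes */

/* Hypothetical model of one 128-bit vector register viewed at each width. */
typedef union {
    uint32_t word[4];   /* 4-wide fullwords */
    uint16_t half[8];   /* 8-wide halfwords */
    uint8_t  byte[16];  /* 16-wide bytes */
    double   dword[2];  /* 2-wide doublewords (SPE only, per the slide) */
} vreg128;

/* Number of elements of the given size that fit in one register. */
size_t lanes_for(size_t elem_bytes) {
    assert(elem_bytes > 0 && (size_t)VREG_BYTES % elem_bytes == 0);
    return (size_t)VREG_BYTES / elem_bytes;
}
```

For example, `lanes_for(sizeof(uint16_t))` gives the 8 halfword lanes listed above.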
Programming the SIMD architecture (3) • The 4 elements of VA and VB are added pairwise and the sums are placed in VC • VC = vec_add(VA,VB)
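What `VC = vec_add(VA,VB)` computes can be sketched as a scalar model in plain C. The `vec4f` type and `model_vec_add` function below are hypothetical names for illustration; on real VMX hardware the intrinsic performs all four additions in a single instruction rather than in a loop.

```c
#include <stddef.h>

/* Hypothetical stand-in for a 128-bit vector holding four 32-bit floats. */
typedef struct { float lane[4]; } vec4f;

/* Scalar model of VC = vec_add(VA, VB):
   lane i of the result is va.lane[i] + vb.lane[i]. */
vec4f model_vec_add(vec4f va, vec4f vb) {
    vec4f vc;
    for (size_t i = 0; i < 4; ++i) {
        vc.lane[i] = va.lane[i] + vb.lane[i];
    }
    return vc;
}
```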
Programming the SIMD architecture (4) • SIMD unprocessable patterns • Cases where the instruction differs for each processing element • SIMD processable patterns • Cases where the instruction is the same for each processing element
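The two patterns can be contrasted in plain C. This is a sketch with hypothetical function names: `add_all` applies one operation to every element and so maps onto a single SIMD instruction per group of lanes, while `mixed_ops` needs a different operation per element and cannot be issued as one SIMD instruction as written.

```c
#include <stddef.h>
#include <stdint.h>

/* Processable pattern: the same operation (add) for every element. */
void add_all(const int32_t *a, const int32_t *b, int32_t *out, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}

/* Unprocessable pattern: each of the 4 elements needs a different operation. */
void mixed_ops(const int32_t *a, const int32_t *b, int32_t *out) {
    out[0] = a[0] + b[0];
    out[1] = a[1] - b[1];
    out[2] = a[2] * b[2];
    out[3] = a[3] & b[3];
}
```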
Programming the SIMD architecture (5) • Register view of the add instruction from slide (3) • VC = vec_add(VA,VB)
Programming the SIMD architecture (6) • Permute (shuffle) operation • Selects elements from two source vectors • A third vector serves as the control vector • VT = vec_perm(VA,VB,VC)
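A byte-level scalar model of `VT = vec_perm(VA,VB,VC)` may make the role of the control vector clearer. This sketch assumes the VMX behaviour, where the low 5 bits of each control byte index into the 32 bytes formed by VA followed by VB; the `vec16b` type and `model_vec_perm` name are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a 128-bit vector viewed as 16 bytes. */
typedef struct { uint8_t byte[16]; } vec16b;

/* Scalar model of VT = vec_perm(VA, VB, VC).
   Assumption: index 0..15 selects from VA, 16..31 selects from VB. */
vec16b model_vec_perm(vec16b va, vec16b vb, vec16b vc) {
    vec16b vt;
    for (size_t i = 0; i < 16; ++i) {
        uint8_t sel = vc.byte[i] & 0x1F;  /* keep the low 5 bits */
        vt.byte[i] = (sel < 16) ? va.byte[sel] : vb.byte[sel - 16];
    }
    return vt;
}
```

A control vector that alternates indices below 16 and indices of 16 or above interleaves bytes from the two sources.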
Intel SSE (1) • SSE: Streaming SIMD Extensions • Instruction set extension to the x86 architecture • 128-bit registers • Introduced in 1999 with the Pentium III • Latest named version: SSE5 (proposed by AMD and later revised) • Future extension from Intel • AVX: Advanced Vector Extensions • 256-bit instructions
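A minimal sketch of a 128-bit SSE addition in C, using the `_mm_loadu_ps`, `_mm_add_ps` and `_mm_storeu_ps` intrinsics from `<xmmintrin.h>`. The function name `add_f32` is hypothetical, and the intrinsic path is guarded by `__SSE__` so the same code falls back to a scalar loop on targets without SSE.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Adds two float arrays; processes 4 floats per step when SSE is available. */
void add_f32(const float *a, const float *b, float *out, size_t n) {
    assert(a != NULL && b != NULL && out != NULL);
    size_t i = 0;
#if defined(__SSE__)
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);             /* unaligned load of 4 floats */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));  /* 4 additions at once */
    }
#endif
    for (; i < n; ++i) {                             /* scalar tail or fallback */
        out[i] = a[i] + b[i];
    }
}
```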
Intel SSE (2) • Image Processing • Digital Signal Processing • Encoding • Streaming load
Intel SSE (3) • Streaming load instruction • Enables faster reads • Improves performance of applications that use both the GPU and the CPU • SIMD improves encoding speed • The required arithmetic is performed on each pixel • Pixels in a video -> high degree of data parallelism
Matrix Multiplication (NI) • [Figure: matrix multiplication, no data parallelism] • [Figure: matrix multiplication, data parallelism employed]
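The contrast between the two forms can be sketched in C for a fixed 4x4 case. Both functions below are hypothetical illustrations, not the National Instruments code: `matmul_scalar` computes one output element at a time, while `matmul_row_lanes` reorders the loops so the innermost loop updates a whole row of four independent accumulators, which is the shape a 4-wide SIMD unit can execute as one multiply and one add per step.

```c
#include <stddef.h>

enum { MAT_N = 4 };  /* matrices are MAT_N x MAT_N, stored row-major */

/* No data parallelism: each c[i][j] is a separate dot product. */
void matmul_scalar(const float *a, const float *b, float *c) {
    for (size_t i = 0; i < MAT_N; ++i) {
        for (size_t j = 0; j < MAT_N; ++j) {
            float sum = 0.0f;
            for (size_t k = 0; k < MAT_N; ++k) {
                sum += a[i * MAT_N + k] * b[k * MAT_N + j];
            }
            c[i * MAT_N + j] = sum;
        }
    }
}

/* Data-parallel shape: one element of A is applied to a full row of B. */
void matmul_row_lanes(const float *a, const float *b, float *c) {
    for (size_t i = 0; i < MAT_N; ++i) {
        float acc[MAT_N] = {0.0f};
        for (size_t k = 0; k < MAT_N; ++k) {
            float aik = a[i * MAT_N + k];
            for (size_t j = 0; j < MAT_N; ++j) {  /* 4 independent lanes */
                acc[j] += aik * b[k * MAT_N + j];
            }
        }
        for (size_t j = 0; j < MAT_N; ++j) {
            c[i * MAT_N + j] = acc[j];
        }
    }
}
```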
Implementation of SIMD • Native vs traditional programming • Auto-vectorization • Detects low-level operations • Converts sequential code to process from 2 up to 16 elements in one operation • Auto-parallelization • Turns sequential code into multithreaded code
Auto-Parallelization • Intel C++ Compiler • Serial sections of the input program -> multithreaded code • The compiler also limits the overhead of creating threads • Intel® Architecture Code Analyzer • PGI CDK Cluster Development Kit • AMD Opteron • Intel Core 2
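Whether a serial loop can be turned into multithreaded code depends on whether its iterations are independent. A minimal sketch with hypothetical function names: the first loop has independent iterations and is a candidate for automatic threading, while the second carries a dependence from one iteration to the next and cannot be split across threads as written.

```c
#include <stddef.h>

/* Independent iterations: out[i] depends only on in[i]. */
void scale_all(const double *in, double *out, size_t n, double k) {
    for (size_t i = 0; i < n; ++i) {
        out[i] = k * in[i];
    }
}

/* Loop-carried dependence: iteration i needs the running total from i - 1. */
void prefix_sum(const double *in, double *out, size_t n) {
    double running = 0.0;
    for (size_t i = 0; i < n; ++i) {
        running += in[i];
        out[i] = running;
    }
}
```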
Auto-Vectorization • GNU Compiler for C and C++ • Nested Loops conditions • Multidimensional arrays • PGI CDK Cluster Development Kit • SSE vectorization
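A loop of the kind an auto-vectorizer typically handles can be sketched as follows. The `saxpy` function is a hypothetical example, not taken from the slides; the `restrict` qualifiers tell the compiler the arrays do not overlap, which removes one common obstacle to vectorization. With GCC, vectorization is enabled by `-ftree-vectorize` (included in `-O3`), and `-fopt-info-vec` reports which loops were vectorized.

```c
#include <stddef.h>

/* y[i] = alpha * x[i] + y[i]; simple counted loop with no cross-iteration
   dependence, so several elements can be processed per instruction. */
void saxpy(size_t n, float alpha, const float *restrict x, float *restrict y) {
    for (size_t i = 0; i < n; ++i) {
        y[i] = alpha * x[i] + y[i];
    }
}
```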
Intel Array Building Blocks (1) • Developed to utilize • Multi-core processors • Graphics processing units • Takes advantage of both the SIMD units and the multiple cores • Portions of C/C++ code that exhibit parallelism can be used in conjunction with ArBB
Intel Array Building Blocks (2) • Data objects are isolated from the rest of the code • Intel notes that this imposes a restriction • The restriction eliminates locks and data races • Threading by itself • Does not provide access to per-core vector parallelism • The ArBB API provides software-level programming models for developers
References (1) • Intel Press, “Multi-Core Programming: Increasing Performance through Software Multi-threading,” pp. 2-6, 11-13, Apr. 2006. • Intel Corp., “Intel C++ Compiler 8.1 for Linux,” Internet: ftp://download.intel.com/support/performancetools/c/linux/sb/clin81_relnotes.pdf, 2004, pp. 1-9 [Oct. 24, 2010]. • Linux Kernel Organization, “Cell Programming Primer: Basics of SIMD Programming,” Documents of PS3 Linux Distributor's Starter Kit. Internet: http://www.kernel.org/pub/linux/kernel/people/geoff/cell/ps3-linux-docs/ps3-linux-docs-08.06.09/CellProgrammingTutorial/BasicsOfSIMDProgramming.html, 2006, 2007, 2008 [Oct. 24, 2010]. • C. Chen, R. Raghavan, J. Dale, E. Iwata, “Cell Broadband Engine Architecture and its first implementation.” Internet: http://www.ibm.com/developerworks/power/library/pa-cellperf/, Oct. 2005 [Oct. 24, 2010]. • H. Chang, C. Cho, S. Wonyong, “Performance Evaluation of an SIMD Architecture with a Multi-bank Vector Memory Unit,” IEEE Workshop on Signal Processing Systems Design and Implementation (SIPS '06), Oct. 2006, pp. 1520-6130.
References (2) • GCC GNU Project, “Auto-vectorization in GCC.” Internet: http://gcc.gnu.org/projects/tree-ssa/vectorization.html, Aug. 2010 [Oct. 24, 2010]. • Intel Software Network, “Performance Tools for Software Developers - Auto parallelization and /Qpar-threshold.” Internet: http://software.intel.com/en-us/articles/performance-tools-for-software-developers-auto-parallelization-and-qpar-threshold/, Jul. 2009 [Oct. 24, 2010]. • National Instruments, “Programming Strategies for Multicore Processing: Data Parallelism.” Internet: http://zone.ni.com/devzone/cda/tut/p/id/6421, Nov. 2008 [Oct. 24, 2010]. • A. Lanterman, “Multicore and GPU Programming for Video Games: Developing Code for Cell - SIMD.” Internet: http://users.ece.gatech.edu/~lanterma/mpg09/, Fall 2010 [Oct. 24, 2010]. • R. Michael Hord, “Parallel Supercomputing in SIMD Architectures,” Boca Raton, FL: CRC Press, 1990.
References (3) • IBM Corp. and Sony Computer Entertainment, “Software Development Kit for Multicore Acceleration Version 3.0: Data Parallelism.” Internet: http://users.ece.gatech.edu/~lanterma/mpg09/CBE_Programming_Tutorial_v3.0.pdf, Nov. 2008 [Oct. 24, 2010]. • IBM Corp. and Sony Computer Entertainment (2006, 2007), “Software Development Kit for Multicore Acceleration (Version 3).” [On-line]. Internet: http://users.ece.gatech.edu/~lanterma/mpg09/CBE_Programming_Tutorial_v3.0.pdf [Oct. 24, 2010]. • J. Demmel, “A closer look at parallel architectures: Lecture 9.” Internet: http://www.eecs.berkeley.edu/~demmel/cs267/lecture09/lecture09.html, Feb. 1996 [Oct. 24, 2010]. • S. Morse, “Practical Parallel Computing,” Boston: AP Professional, 1994. • C. Leopold, “Parallel and Distributed Computing: A Survey of Models, Paradigms and Approaches,” New York: Wiley, 2001.
References (4) • L. Dong-hwan, S. Wonyong, “Importance of SIMD computation reconsidered,” International Parallel and Distributed Processing Symposium, Proceedings, Apr. 2003, pp. 8. • W.C. Meilander, J.W. Baker, M. Jin, “Performance Evaluation of an SIMD Architecture with a Multi-bank Vector Memory Unit,” IEEE Workshop on Signal Processing Systems Design and Implementation (SIPS '06), Oct. 2006, pp. 1520-6130. • http://www.gamasutra.com/view/feature/4248/designing_fast_crossplatform_simd_.php • http://domino.watson.ibm.com/comm/research.nsf/pages/r.arch.simd.html • Intel Array Building Blocks: http://software.intel.com/en-us/articles/intel-array-building-blocks/ • http://www.wolfire.com/