
SIMD Processor Extensions



  1. SIMD Processor Extensions Houffaneh Osman halio029@uottawa.ca

  2. SIMD (1) • Single Instruction, Multiple Data • Part of Flynn's taxonomy of computer classification • Multiple processors • Different data streams • The same instruction executed by all of them

  3. SIMD (2) • Able to operate on multiple data items at the same time • Computation completes in the minimum possible time • Vectors • Matrices • Better speedup than sequential execution

  4. SIMD Architecture • Two types of processors • True SIMD • Pipelined SIMD • Divides an instruction into smaller functions • Executes the smaller functions in parallel on different data

  5. True SIMD - Distributed Memory • Single control unit • M processing elements act as arithmetic units • N data elements (possibly more than M) • Processing elements receive instructions from the control unit • If a processing element needs information from another processing element, it sends a request to the control unit, which manages the memory exchange

  6. True SIMD - Shared Memory • Single control unit • M processing elements act as arithmetic units • N data elements (possibly more than M) • Processing elements receive instructions from the control unit • Processing elements can share their memory without going through the control unit

  7. True SIMD (diagrams) • True SIMD: Distributed Memory • True SIMD: Shared Memory

  8. Programming the SIMD architecture (1) • Cell used: IBM Cell BE • The Cell Broadband Engine (CBE) • Single-chip multiprocessor with 9 processors • All processors share the same main storage • The processors serve two functions • PowerPC Processor Element (PPE) • Synergistic Processor Element (SPE)

  9. Programming the SIMD architecture (2) • VMX: Vector Multimedia eXtension to the PowerPC architecture • Exploits data parallelism for faster performance • SIMD in VMX and SPE (Reference: IBM Cell Programming) • 128-bit-wide datapath • 128-bit-wide registers • 4-wide fullwords, 8-wide halfwords, 16-wide bytes • SPE adds support for 2-wide doublewords • Vector programming
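
A minimal sketch of how those register widths appear in VMX C code, assuming GCC's AltiVec support (compiled with -maltivec); the variable names are illustrative:

    #include <altivec.h>          /* VMX/AltiVec vector types; each type is 128 bits wide */

    vector float        vf;       /*  4 x 32-bit fullwords (floats)  */
    vector signed short vh;       /*  8 x 16-bit halfwords           */
    vector signed char  vb;       /* 16 x  8-bit bytes               */
    /* 2 x 64-bit doublewords are supported on the SPE, not in VMX.  */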

  10. Programming the SIMD architecture (3) • Each of the 4 elements in VA and VB is added pairwise and the sums are placed in VC • VC = vec_add(VA,VB)
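
A minimal, self-contained sketch of that add, again assuming GCC with -maltivec (the function name is illustrative):

    #include <altivec.h>

    /* One vec_add adds all four float elements of VA and VB at once. */
    vector float add_four(vector float VA, vector float VB)
    {
        return vec_add(VA, VB);   /* VC[i] = VA[i] + VB[i] for i = 0..3 */
    }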

  11. Programming the SIMD architecture (4) • SIMD-unprocessable patterns • Cases where the instruction differs for each processing element • SIMD-processable patterns • Cases where the instruction is the same for each processing element

  12. Programming the SIMD architecture (5) • Register view of the add instruction in previous slide • VC = vec_add(VA,VB)

  13. Programming the SIMD architecture (6) • Permute method, or shuffling • Operates between two vectors • A third vector acts as the control vector • VT = vec_perm(VA,VB,VC)
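
A sketch of vec_perm under the same assumptions (GCC, -maltivec): each byte of the control vector VC selects one byte from the 32-byte concatenation of VA (bytes 0-15) and VB (bytes 16-31). The pattern below, which interleaves the first two words of VA and VB, is only an example:

    #include <altivec.h>

    vector float shuffle(vector float VA, vector float VB)
    {
        vector unsigned char VC = {  0,  1,  2,  3,    /* word 0 of VA */
                                    16, 17, 18, 19,    /* word 0 of VB */
                                     4,  5,  6,  7,    /* word 1 of VA */
                                    20, 21, 22, 23 };  /* word 1 of VB */
        return vec_perm(VA, VB, VC);   /* VT = selected bytes of VA:VB */
    }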

  14. Intel SSE (1) • SSE: Streaming SIMD Extensions • Instruction set extension to the x86 architecture • Works on 128-bit registers • Introduced in 1999 with the Pentium III • Latest version: SSE5 (before revision) • Future extension from Intel • AVX: Advanced Vector Extensions • 256-bit instructions
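
A minimal sketch of the same add-four-floats idea with SSE intrinsics, assuming a compiler that provides xmmintrin.h (the function name is illustrative):

    #include <xmmintrin.h>        /* SSE: 128-bit packed single-precision */

    /* Adds four pairs of floats in one instruction, the SSE counterpart
       of the VMX vec_add shown earlier. */
    __m128 add_four_sse(__m128 a, __m128 b)
    {
        return _mm_add_ps(a, b);
    }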

  15. Intel SSE (2) • Image Processing • Digital Signal Processing • Encoding • Streaming load

  16. Intel SSE (3) • Streaming load instruction • Enables faster reads • Improves performance of applications that use both the GPU and the CPU • SIMD improves encoding speed • The required arithmetic is performed per pixel • The many pixels in a video frame provide a high degree of parallelism
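
A sketch of the streaming-load idea using the SSE4.1 intrinsic _mm_stream_load_si128, assuming smmintrin.h is available and the source pointer is 16-byte aligned (for example, a buffer written by the GPU in write-combining memory); the function name is illustrative:

    #include <smmintrin.h>        /* SSE4.1: streaming (non-temporal) load */

    /* Reads 16 bytes without polluting the cache; src must be 16-byte aligned. */
    __m128i load_pixels(void *src)
    {
        return _mm_stream_load_si128((__m128i *)src);
    }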

  17. Data Parallelism

  18. Matrix Multiplication (NI) (figures) • Matrix multiplication without data parallelism • Matrix multiplication with data parallelism
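
As a concrete illustration of the data-parallel version, here is a sketch of a matrix multiply whose inner loop processes four floats per SSE instruction; it assumes row-major n-by-n matrices with n a multiple of 4 and C zeroed beforehand (names and layout are illustrative):

    #include <xmmintrin.h>

    void matmul_simd(const float *A, const float *B, float *C, int n)
    {
        for (int i = 0; i < n; i++)
            for (int k = 0; k < n; k++) {
                __m128 a = _mm_set1_ps(A[i*n + k]);        /* broadcast A[i][k] */
                for (int j = 0; j < n; j += 4) {           /* 4 columns at a time */
                    __m128 b = _mm_loadu_ps(&B[k*n + j]);
                    __m128 c = _mm_loadu_ps(&C[i*n + j]);
                    c = _mm_add_ps(c, _mm_mul_ps(a, b));   /* C[i][j..j+3] += A[i][k] * B[k][j..j+3] */
                    _mm_storeu_ps(&C[i*n + j], c);
                }
            }
    }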

  19. Image Processing

  20. Processing

  21. Implementation of SIMD • Native vs. traditional programming • Auto-vectorization • Detects low-level operations in sequential code • Converts them to process 2 to 16 elements in one operation • Auto-parallelization • Turns sequential code into multi-threaded code

  22. Auto-Parallelization • Intel C++ Compiler • Converts serial sections of the input program into multithreaded code • The compiler also tries to keep the overhead of creating threads low • Intel® Architecture Code Analyzer • PGI CDK Cluster Development Kit • AMD Opteron • Intel Core 2
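
A small sketch of the kind of loop such a compiler can auto-parallelize, assuming the Intel compiler's -parallel (Linux) or /Qparallel (Windows) switch, with /Qpar-threshold controlling how profitable a loop must look before it is split across threads (the flags are those discussed in the /Qpar-threshold article cited in the references); the function is illustrative:

    /* Independent iterations: no value from iteration i is needed by iteration j,
       so the compiler may distribute the loop across threads automatically. */
    void scale(float *dst, const float *src, float k, int n)
    {
        for (int i = 0; i < n; i++)
            dst[i] = k * src[i];
    }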

  23. Auto-Vectorization • GNU Compiler for C and C++ • Handles conditions in nested loops • Handles multidimensional arrays • PGI CDK Cluster Development Kit • SSE vectorization
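
A minimal loop GCC's auto-vectorizer can handle, assuming vectorization is enabled (it is on by default at -O3, or explicitly with -ftree-vectorize; the exact reporting flags vary by GCC version):

    /* With vectorization enabled, GCC can rewrite this to add four floats
       per SSE instruction instead of one at a time. */
    void vadd(float *c, const float *a, const float *b, int n)
    {
        for (int i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }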

  24. Intel Array Building Blocks (1) • Developed to utilize • Multi-core processors • Graphics processing units • Takes advantage of SIMD units and multiple cores • Portions of C/C++ code that contain parallelism can be used in conjunction with ArBB

  25. Intel Array Building Blocks (2) • Isolates data objects from the rest of the code • Intel mentions that this imposes restrictions • The restrictions eliminate locks and data races • Threading by itself does not provide access to per-core vector parallelism • The ArBB API provides a software-level programming model for developers

  26. References (1) • Intel Press, "Multi-Core Programming: Increasing Performance through Software Multi-threading," pp. 2-6, 11-13, Apr. 2006. • Intel Corp., "Intel C++ Compiler 8.1 for Linux," Internet: ftp://download.intel.com/support/performancetools/c/linux/sb/clin81_relnotes.pdf, 2004, pp. 1-9 [Oct. 24, 2010]. • Linux Kernel Organization, "Cell Programming Primer: Basics of SIMD Programming," Documents of PS3 Linux Distributor's Starter Kit, Internet: http://www.kernel.org/pub/linux/kernel/people/geoff/cell/ps3-linux-docs/ps3-linux-docs-08.06.09/CellProgrammingTutorial/BasicsOfSIMDProgramming.html, 2006-2008 [Oct. 24, 2010]. • C. Chen, R. Raghavan, J. Dale, E. Iwata, "Cell Broadband Engine Architecture and its first implementation," Internet: http://www.ibm.com/developerworks/power/library/pa-cellperf/, Oct. 2005 [Oct. 24, 2010]. • H. Chang, C. Cho, S. Wonyong, "Performance Evaluation of an SIMD Architecture with a Multi-bank Vector Memory Unit," Signal Processing Systems Design and Implementation (SIPS '06), IEEE Workshop on, Oct. 2006, pp. 1520-6130.

  27. References (2) • GCC GNU Project, "Auto-vectorization in GCC," Internet: http://gcc.gnu.org/projects/tree-ssa/vectorization.html, Aug. 2010 [Oct. 24, 2010]. • Intel Software Network, "Performance Tools for Software Developers - Auto-parallelization and /Qpar-threshold," Internet: http://software.intel.com/en-us/articles/performance-tools-for-software-developers-auto-parallelization-and-qpar-threshold/, Jul. 2009 [Oct. 24, 2010]. • National Instruments, "Programming Strategies for Multicore Processing: Data Parallelism," Internet: http://zone.ni.com/devzone/cda/tut/p/id/6421, Nov. 2008 [Oct. 24, 2010]. • A. Lanterman, "Multicore and GPU Programming for Video Games: Developing Code for Cell - SIMD," Internet: http://users.ece.gatech.edu/~lanterma/mpg09/, Fall 2010 [Oct. 24, 2010]. • R. Michael Hord, "Parallel Supercomputing in SIMD Architectures," Boca Raton, FL: CRC Press, c1990.

  28. References (3) • IBM Corp. and Sony Computer Entertainment, "Software Development Kit for Multicore Acceleration Version 3.0: Data Parallelism," Internet: http://users.ece.gatech.edu/~lanterma/mpg09/CBE_Programming_Tutorial_v3.0.pdf, Nov. 2008 [Oct. 24, 2010]. • IBM Corp. and Sony Computer Entertainment (2006, 2007), "Software Development Kit for Multicore Acceleration (Version 3)," [On-line], Internet: http://users.ece.gatech.edu/~lanterma/mpg09/CBE_Programming_Tutorial_v3.0.pdf [Oct. 24, 2010]. • J. Demmel, "A closer look at parallel architectures: Lecture 9," Internet: http://www.eecs.berkeley.edu/~demmel/cs267/lecture09/lecture09.html, Feb. 1996 [Oct. 24, 2010]. • S. Morse, "Practical Parallel Computing," Boston: AP Professional, c1994. • C. Leopold, "Parallel and Distributed Computing: A Survey of Models, Paradigms and Approaches," New York: Wiley, 2001.

  29. References (4) • L. Dong-hwan, S. Wonyong, "Importance of SIMD computation reconsidered," Parallel and Distributed Processing Symposium, 2003, Proceedings International, Apr. 2003, p. 8. • W.C. Meilander, J.W. Baker, M. Jin, "Performance Evaluation of an SIMD Architecture with a Multi-bank Vector Memory Unit," Signal Processing Systems Design and Implementation (SIPS '06), IEEE Workshop on, Oct. 2006, pp. 1520-6130. • http://www.gamasutra.com/view/feature/4248/designing_fast_crossplatform_simd_.php • http://domino.watson.ibm.com/comm/research.nsf/pages/r.arch.simd.html • Intel Array Building Blocks: http://software.intel.com/en-us/articles/intel-array-building-blocks/ • http://www.wolfire.com/
