
Revisiting Parallelism



Presentation Transcript


  1. Revisiting Parallelism

  2. Where Are We Headed?
  [Figure: processor evolution timeline. Era of pipelined architecture (8086, 286); era of instruction-level parallelism (386, 486, superscalar, speculative/OOO, special-purpose HW); era of thread & processor-level parallelism (multi-threaded, then multi-threaded multi-core). Source: Shekhar Borkar, Intel Corp.]

  3. Beyond ILP
  [Figure: speedup of the parallelizable fraction of a program on 1, 2, 3, and 4 CPUs]
  • Performance is limited by the serial fraction (Amdahl's law; see below)
  • Coarse-grain parallelism in the post-ILP era
  • Thread, process and data parallelism
  • Learn from the lessons of the parallel processing community
  • Revisit the classifications and architectural techniques
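The serial-fraction limit is Amdahl's law; as a quick reference (a standard result, not spelled out on the slide), with parallelizable fraction p and N CPUs:

\[
\text{Speedup}(N) \;=\; \frac{1}{(1-p) + p/N},
\qquad
\lim_{N \to \infty} \text{Speedup}(N) \;=\; \frac{1}{1-p}.
\]

Even a program that is 90% parallelizable (p = 0.9) can never exceed a 10x speedup, which is why the figure's curves flatten as CPUs are added.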

  4. Flynn's Model*
  • Flynn's Classification
  • Single instruction stream, single data stream (SISD)
    • The conventional, word-sequential architecture, including pipelined computers
  • Single instruction stream, multiple data stream (SIMD)
    • The multiple-ALU-type architectures (e.g., array processors): Data Level Parallelism (DLP)
  • Multiple instruction stream, single data stream (MISD)
    • Not very common
  • Multiple instruction stream, multiple data stream (MIMD)
    • The traditional multiprocessor system: Thread Level Parallelism (TLP)
  *M.J. Flynn, "Very high speed computing systems," Proc. IEEE, vol. 54(12), pp. 1901-1909, 1966.

  5. ILP Challenges
  • As machine ILP capabilities increase, i.e., ILP width and depth, so do the challenges
  • OOO execution cores
    • Key data structure sizes increase: ROB, ILP window, etc.
    • Dependency tracking logic increases quadratically (see below)
  • VLIW/EPIC
    • Hardware interlocks, ports, recovery logic (speculation) increase quadratically
  • Circuit complexity increases with the number of in-flight instructions
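The quadratic growth follows from pairwise dependence checking: each of the n in-flight instructions must, in principle, be compared against every other in-flight instruction, so the comparator count scales as

\[
\binom{n}{2} \;=\; \frac{n(n-1)}{2} \;=\; O(n^2).
\]

Doubling the ILP window therefore roughly quadruples the dependency-tracking logic.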

  6. Example: Itanium 2
  • Note the percentage of the die devoted to control
  • And this is a statically scheduled processor!

  7. Data Parallel Alternatives
  • Single Instruction Stream, Multiple Data Stream cores
    • Co-processors exposed through the ISA
    • Co-processors exposed as a distinct processor
  • Vector processing
    • Over 5 decades of development

  8. The SIMD Model
  • Single instruction stream broadcast to all processors
  • Processors execute in lock step on local data
  • Efficient in use of silicon area: fewer resources devoted to control
  • Distributed memory model vs. shared memory model
  • Distributed memory
    • Each processor has local memory
    • Data routing network whose operation is under centralized control
    • Processor masking for data-dependent operations
  • Shared memory
    • Access to memory modules through an alignment network
  • Instruction classes: computation, routing, masking

  9. Two Issues
  • Conditional execution (see the sketch below)
  • Data alignment
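A minimal C sketch of how SIMD machines handle conditional execution with masks rather than branches: every lane computes both arms of the conditional, and a per-element mask selects the result. The function name and the fixed width of 8 lanes are illustrative assumptions, not from the slides.

    #include <stdint.h>

    #define LANES 8

    /* Compute z[i] = (x[i] > y[i]) ? x[i] - y[i] : y[i] - x[i]
       in lock step, the way a masked SIMD machine would. */
    void masked_absdiff(const int32_t *x, const int32_t *y, int32_t *z)
    {
        int32_t mask[LANES];

        /* Step 1: every lane evaluates the condition, producing a mask. */
        for (int i = 0; i < LANES; i++)
            mask[i] = -(x[i] > y[i]);      /* all ones if true, 0 if false */

        /* Step 2: both arms are computed; the mask selects per element.
           No lane ever branches, so all lanes stay in lock step. */
        for (int i = 0; i < LANES; i++) {
            int32_t then_val = x[i] - y[i];
            int32_t else_val = y[i] - x[i];
            z[i] = (then_val & mask[i]) | (else_val & ~mask[i]);
        }
    }

The cost is that masked-off lanes still do the work and throw it away, which is exactly the efficiency concern the slide raises.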

  10. Vector Cores

  11. Classes of Vector Processors
  [Figure: taxonomy of vector machines: register-to-register machines vs. memory-to-memory machines]
  • Memory-to-memory architectures have seen a resurgence on chip

  12. VMIPS
  • Load/store architecture
  • Multiported registers
  • Deeply pipelined functional units
  • Separate scalar registers

  13. Cray Family Architecture
  • Stream oriented
  • Recall data skewing and concurrent memory accesses!
  • The first load/store ISA design: Cray-1 (1976)

  14. Features of Vector Processors
  • Significantly less dependency-checking logic
    • On the order of the complexity of scalar comparisons, but with a significantly smaller number of them
  • Vector data sets
    • Hazard-free operation on deep pipelines
  • Conciseness of representation leads to a low instruction issue rate
  • Reduction in normal control hazards
    • Vector operations vs. a sequence of scalar operations
  • Concurrency in operation, memory access and address generation
    • Often statically known

  15. Some Examples

  16. Basic Performance Concepts
  • Consider the vector operation Z = A*X + Y
  • Execution time
    • t_ex = t_startup + n * t_cycle
  • Metrics (reconstructed below)
    • R_infinity
    • R_half
    • R_v
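A hedged reconstruction of the metrics, which the slide names but does not define; these follow Hockney's standard vector-performance notation, so take the exact forms as an assumption:

\[
R(n) \;=\; \frac{n}{t_{\text{startup}} + n\, t_{\text{cycle}}},
\qquad
R_{\infty} \;=\; \lim_{n \to \infty} R(n) \;=\; \frac{1}{t_{\text{cycle}}},
\qquad
n_{1/2} \;=\; \frac{t_{\text{startup}}}{t_{\text{cycle}}},
\]

where R(n) is the rate achieved at vector length n (R_v), R_infinity is the asymptotic rate as the startup cost is amortized away, and n_{1/2} is the vector length at which half of R_infinity is reached (R_half).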

  17. Optimizations for Vector Machines
  • Chaining
      MULT.V V1, V2, V3
      ADD.V  V4, V1, V5
    • Fine-grained forwarding of the elements of a vector
    • Needs additional ports on the vector register file
    • Effectively creates a deeper pipeline
  • Conditional operations and vector masks
  • Scatter/gather operations (see the sketch below)
  • Vector lanes
    • Each lane is coupled to a portion of the vector register file
    • Lanes are transparent to the code, like caches in the family-of-machines concept
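A small C sketch of what scatter/gather operations do, since they are only named above; the function names and types are illustrative, not from the slides:

    #include <stddef.h>

    /* Gather: load a dense vector register from scattered memory locations
       (an indexed load, LVI-style in VMIPS). */
    void gather(double *dst, const double *mem, const size_t *index, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            dst[i] = mem[index[i]];
    }

    /* Scatter: store a dense vector register to scattered memory locations
       (an indexed store, SVI-style in VMIPS). */
    void scatter(double *mem, const double *src, const size_t *index, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            mem[index[i]] = src[i];
    }

This is how vector machines handle sparse data: the index vector itself lives in a vector register, so a single instruction moves a whole set of irregularly placed elements.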

  18. The IBM Cell Processor

  19. Cell Overview
  [Figure: Cell block diagram showing eight SPUs, the PPU, the MIC and BIC interface controllers, and the MIB interconnect]
  • IBM/Toshiba/Sony joint project: 4-5 years, 400 designers
  • 234 million transistors, 4+ GHz
  • 256 Gflops (billions of floating-point operations per second)
  • 26 Gflops (double precision)
  • Area: 221 mm2
  • Technology: 90nm SOI

  20. Cell Overview (cont.)
  • One 64-bit PowerPC processor
    • 4+ GHz, dual issue, two threads
    • 512 kB of second-level cache
  • Eight Synergistic Processor Elements
    • Or "Streaming Processor Elements"
    • Co-processors with a dedicated 256 kB of memory (not cache)
  • EIB data ring for internal communication
    • Four 16-byte data rings, supporting multiple transfers
    • 96 B/cycle peak bandwidth
    • Over 100 outstanding requests
  • Dual Rambus XDR memory controllers (on chip)
    • 25.6 GB/s of memory bandwidth
  • 76.8 GB/s chip-to-chip bandwidth (to off-chip GPU)

  21. Cell Features
  • Security
    • SPE dynamically reconfigurable as a secure co-processor
  • Networking
    • SPEs might off-load networking overheads (TCP/IP)
  • Virtualization
    • Run multiple OSs at the same time
    • Linux is the primary development OS for Cell
  • Broadband
    • SPE is a RISC architecture with SIMD organization and a Local Store
    • 128+ concurrent transactions to memory per processor

  22. PPE Block Diagram
  • PPE handles operating system and control tasks
  • 64-bit Power Architecture(TM) with VMX
  • In-order, 2-way hardware multi-threading
  • Coherent load/store with 32 kB I & D L1 and 512 kB L2

  23. PPE Pipeline

  24. SPE Organization and Pipeline
  [Figures: IBM Cell SPE organization; IBM Cell SPE pipeline diagram]

  25. Cell Temperature Graph
  • Power and heat are key constraints
  • Cell is ~80 watts at 4+ GHz
  • Cell has 10 temperature sensors
  Source: IEEE ISSCC, 2005

  26. SPE
  • User-mode architecture
    • No translation/protection within the SPU
    • DMA uses the full Power Architecture protection/translation
  • Direct programmer control
    • DMA/DMA-list
    • Branch hint
  • VMX-like SIMD dataflow
    • Broad set of operations
    • Graphics SP-Float
    • IEEE DP-Float (BlueGene-like)
  • Unified register file
    • 128 entries x 128 bits
  • 256 kB Local Store
    • Combined I & D
    • 16 B/cycle load/store bandwidth
    • 128 B/cycle DMA bandwidth

  27. Cell I/O
  • XDR is new high-speed memory from Rambus
    • Dual XDR(TM) controller (25.6 GB/s @ 3.2 Gbps; arithmetic checked below)
  • Two configurable interfaces (76.8 GB/s @ 6.4 Gbps)
    • Flexible bandwidth between interfaces
    • Allows for multiple system configurations
  • Pros:
    • Fast: the dual controllers give 25.6 GB/s, while the contemporary AMD Opteron is only 6.4 GB/s
    • Small pin count: only a few chips are needed for high bandwidth
  • Cons:
    • Expensive ($ per bit)
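The quoted bandwidths are consistent with per-pin signaling rate times interface width; note that the pin counts here are inferred from the figures, not stated on the slide:

\[
3.2\ \text{Gb/s/pin} \times 64\ \text{pins} \;=\; 204.8\ \text{Gb/s} \;=\; 25.6\ \text{GB/s},
\qquad
6.4\ \text{Gb/s/pin} \times 96\ \text{pins} \;=\; 614.4\ \text{Gb/s} \;=\; 76.8\ \text{GB/s}.
\]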

  28. Multiple System Support
  • Game console systems
  • Workstations (CPBW)
  • HDTV
  • Home media servers
  • Supercomputers

  29. Programming Cell
  • 10 virtual processors
    • 2 threads on the PowerPC
    • 8 co-processor SPEs
  • Communicating with SPEs
    • 256 kB "local storage" is NOT a cache
    • Must explicitly move data in and out of the local store
    • Use the DMA engine (supports scatter/gather); see the sketch below
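A minimal SPE-side sketch of that explicit data movement, assuming the MFC intrinsics from the Cell SDK's spu_mfcio.h (mfc_get, mfc_write_tag_mask, mfc_read_tag_status_all); the buffer size and function name are illustrative:

    #include <spu_mfcio.h>

    #define CHUNK 4096                     /* bytes per DMA transfer (assumed) */

    /* Local-store buffer; 128-byte alignment for efficient DMA. */
    static char buf[CHUNK] __attribute__((aligned(128)));

    void fetch_chunk(unsigned long long ea)   /* ea: effective address in main memory */
    {
        const unsigned int tag = 0;        /* DMA tag group 0 */

        /* Start the DMA get: main memory -> local store. */
        mfc_get(buf, ea, CHUNK, tag, 0, 0);

        /* Block until all transfers in tag group 0 complete. */
        mfc_write_tag_mask(1 << tag);
        mfc_read_tag_status_all();

        /* buf now holds CHUNK bytes copied from main memory; compute on it. */
    }

The point of the idiom: nothing arrives in the local store unless the program asks for it, which is the opposite of the cache model the PPE uses.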

  30. Programming Cell
  [Figure: a spectrum of programming approaches, from highest productivity with fully automatic compiler technology to highest performance with help from programmers: shared memory, single-program abstraction; automatic parallelization; automatic SIMDization; automatic tuning for each ISA; SIMD alignment directives; explicit parallelization with local memories; explicit SIMD coding; multiple-ISA hand-tuned programs]

  31. Execution Model
  • SPE executables are embedded as read-only data in the PPE executable
  • Use the memory flow controller (MFC) for DMA operations
  • The "shopping list" view of memory accesses
  Source: IBM

  32. Programming Model

  PPE program:

    /* spe_runner.c
       A C program to be linked with spe_foo and run on the PPE. */
    extern spe_program_handle_t spe_foo;

    int main()
    {
        int rc, status = 0;
        speid_t spe_id;

        spe_id = spe_create_thread(0, &spe_foo, 0, NULL, -1, 0);
        rc = spe_wait(spe_id, &status, 0);   /* blocking call */
        return status;
    }

  SPE program:

    /* spe_foo.c
       A C program to be compiled into an executable called "spe_foo" */
    int main(unsigned long long speid, addr64 argp, addr64 envp)
    {
        int i;
        /* func_foo would be the real code */
        i = func_foo(argp);
        return i;
    }

  Source: IBM

  33. SPE Programming
  • Dual issue, with issue constraints
  • Predication and hints; no branch prediction hardware
  • Alignment instructions
  Source: IBM

  34. Programming Idioms: Pipeline

  35. Programming Idioms: Work Queue Model
  • Pull data off of a shared queue
  • Self-scheduled work queue

  36. SPMD & MPMD Accelerators
  • Executing the same (SPMD) or different (MPMD) programs

  37. Cell Processor Application Areas
  • Digital content creation (games and movies)
  • Game playing and game serving
  • Distribution of (dynamic, media-rich) content
  • Imaging and image processing
  • Image analysis (e.g., video surveillance)
  • Next-generation physics-based visualization
  • Video conferencing (3D)
  • Streaming applications (codecs etc.)
  • Physical simulation & science

  38. Some References and Links
  • http://researchweb.watson.ibm.com/journal/rd/494/kahle.html
  • http://en.wikipedia.org/wiki/Cell_(microprocessor)
  • http://www.research.ibm.com/cell/home.html
  • http://www.research.ibm.com/cellcompiler/slides/pact05.pdf
  • http://www.hpcaconf.org/hpca11/slides/Cell_Public_Hofstee.pdf
  • http://www.hpcaconf.org/hpca11/papers/25_hofstee-cellprocessor_final.pdf
  • http://www.research.ibm.com/cellcompiler/papers/pham-ISSCC05.pdf

  39. IRAM Cores

  40. Data Parallelism and the Processor-Memory Gap
  [Figure: processor performance grows ~60%/year ("Moore's Law") while DRAM performance grows ~7%/year; the processor-memory performance gap grows ~50%/year]
  • How can we close this gap?

  41. The Effects of the Processor-Memory Gap
  • Tolerate the gap with deeper cache hierarchies → increasing worst-case access time
  • System-level impact: Alpha 21164
    • I & D cache access: 2 clocks
    • L2 cache: 6 clocks
    • L3 cache: 8 clocks
    • Memory: 76 clocks
    • DRAM component access: 18 clocks
  • How much time is spent in the memory hierarchy? (see the AMAT note below)
    • SpecInt92: 22%
    • SpecFP92: 32%
    • Database: 77%
    • Sparse matrix: 73%
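One common way to roll per-level latencies into a single number is average memory access time (AMAT); this is standard textbook bookkeeping, not from the slide, and the miss rates m_i would have to come from the workload:

\[
\text{AMAT} \;=\; t_{L1} + m_{L1}\bigl(t_{L2} + m_{L2}\,(t_{L3} + m_{L3}\, t_{\text{mem}})\bigr).
\]

Workloads like databases and sparse matrices miss frequently at every level, which is how they end up spending roughly three quarters of their time in the memory hierarchy.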

  42. Where Do the Transistors Go?

  Processor          % Area (~cost)   % Transistors (~power)
  Alpha 21164        37%              77%
  StrongArm SA110    61%              94%
  Pentium Pro        64%              88%

  • Caches have no inherent value; they simply recover bandwidth?

  43. Impact of DRAM Capacity
  • Increasing capacity creates a quandary
    • The continual fourfold increase in density increases the minimum memory increment for a given width
  • How do we match the memory bus width?
    • Cost/bit issues for wider DRAM chips: die size, testing, package costs
  • Number of DRAM chips decreases → decrease in concurrency

  44. Merge Logic and DRAM!
  • Bring the processors to memory → tremendous on-chip bandwidth for predictable application reference patterns
  • Enough memory to hold complete programs and data → feasible
  • More applications are limited by memory speed → better memory latency for applications with irregular access patterns
  • Synchronous DRAMs to integrate with the higher-speed logic → compatible

  45. Potential: IRAM for Lower Latency
  • DRAM latency
    • Dominant delay = RC of the word lines
    • Keep wire lengths short & block sizes small?
  • 10-30 ns for 64b-256b IRAM "RAS/CAS"?

  46. Potential for IRAM Bandwidth
  • 1024 1-Mbit modules (1 Gb total), each 256b wide
  • 20% active @ 20 ns RAS/CAS = 320 GBytes/sec (checked below)
  • If a crossbar switch delivers 1/3 to 2/3 of the bandwidth of 20% of the modules → 100-200 GBytes/sec
  • FYI: AlphaServer 8400 = 1.2 GBytes/sec
    • 75 MHz, 256-bit memory bus, 4 banks
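The 320 GB/s figure can be checked directly: 20% of 1024 modules is about 205 active modules, each delivering 256 bits (32 bytes) every 20 ns:

\[
0.20 \times 1024 \times \frac{32\ \text{B}}{20\ \text{ns}}
\;\approx\; 204.8 \times 1.6\ \text{GB/s}
\;\approx\; 328\ \text{GB/s}
\;\approx\; 320\ \text{GB/s}.
\]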

  47. IRAM Applications
  • PDAs, cameras, gameboys, cell phones, pagers
  • Database systems?
  [Figure: 1996-2000 trend lines. Database demand doubles every 9 months ("Greg's Law"); microprocessor speed doubles every 18 months ("Moore's Law"); DRAM speed doubles every 120 months; the gaps between the lines define the database-processor and processor-memory performance gaps]

  48. Estimating IRAM Performance
  • Direct application produces modest performance improvements
    • Architectures were designed to overcome the memory bottleneck
    • Architectures were not designed to use tremendous memory bandwidth
  • Need to rethink the design!
    • Tailor the architecture to utilize the high bandwidth

  49. Emerging Embedded Applications and Characteristics
  • Fastest-growing application domain
    • Video processing, speech recognition, 3D graphics
    • Set-top boxes, game consoles, PDAs
  • Data parallel
    • Typically low temporal locality
  • Size, weight and power constraints
    • The highest-speed processor is not necessarily the best processor
    • What about the role of ILP processors here?
  • Real-time constraints
    • The right data at the right time

  50. SIMD/Vector Architectures
  • VIRAM: Vector IRAM
  • Logic is slow in a DRAM process
  • Put a vector unit in a DRAM and provide a port between a traditional processor and the vector IRAM, instead of a whole processor in DRAM
  Source: Berkeley Vector IRAM
