
Manycores – From hardware perspective to software

Manycores – From hardware perspective to software. Presenter: D96943001, Institute of Electronics, 陳泓輝. Why Moore’s Law is dying: he is not CEO anymore!! Walls => ILP, frequency, power, and memory walls. ILP – more cost, less return. ILP: instruction-level parallelism. OOO: out-of-order execution of micro-ops.



Presentation Transcript


  1. Manycores – From hardware perspective to software • Presenter: D96943001, Institute of Electronics, 陳泓輝

  2. Why Moore’s Law is dying • He is not CEO anymore!! • Walls => ILP, frequency, power, and memory walls

  3. ILP – more cost, less return • ILP: instruction-level parallelism • OOO: out-of-order execution of micro-ops

  4. Frequency wall • FO4 delay metric: the delay of an inverter that is driven by an inverter ¼ its size and drives an inverter 4x its size (fan-out of 4) • Freq ↑ => cycle counts of some operations ↑ => performance saturates!

  5. Memory wall • The external access penalty (the gap) keeps increasing • Solution => enlarge the cache • The cache decides both the performance and the price

  6. It’s cache that matters!

  7. The power wall • High power might imply • Thermal runaway of device behavior • Larger current => electromigration => reliability issues in the metal interconnect • Hitting the package's heat limit • Switching to high-cost packaging • Cooling => noise!! • Form factor

  8. The great wall…… Moore’s Law CMOS Multicore Manycore

  9. Historical - Intel 2007 Xeon • Dual on chip memory controller => fcpu > 2*fmem • Point-to-point interconnection => fabrics • Multiple communication activities (c.f. “bus” => one activity)

  10. Fabric working notation

  11. AMD – Opteron(Shanghai) • Much the same as Intel Xeon • Shared L3 cache among 2 cores

  12. Game consoles • Xbox 360 => triple core (homogeneous) • PS3 => Cell, 8+1 cores (heterogeneous) • PowerPC wins!

  13. State-of-the-art multicore DSP chips • TI TNETV3020 (homogeneous) • Freescale 8156 (heterogeneous)

  14. State-of-the-art multicore DSP chips • picoChip PC205 (heterogeneous) • Tilera TILE64 (homogeneous, mesh)

  15. State-of-the-art multicore x86 chips • Intel Single-chip Cloud Computer (1 GHz Pentium) • 24 “tiles” with two IA cores per tile • A 24-router mesh network with 256 GB/s bisection bandwidth • 4 integrated DDR3 memory controllers • Hardware support for message passing!!

  16. GPGPU - OpenCL Official LOGO

  17. Special case: multicore video processor • Characteristics of video applications in consumer electronics • High computational capability • Low hardware cost • Low power consumption • A general solution • Fixed-function logic design • Challenges • Multiple video decoding standards • Video decoding standards keep being updated • Ill-posed video processing algorithms • Product requirements are diverse and mutually exclusive

  18. mediaDSP technology • Nickname: accelerator • Broadcom: mediaDSP technology • Heterogeneous (programmable and fixed-function units) • A task-based programming model • A uniform approach for managing tasks executing on different types of programmable and fixed-function processing elements • A platform that is easily extended to support a range of applications • Easy to customize for special purposes • Success stories • SD MPEG video encoder including scaling and noise reduction • Frame-rate-conversion video processing for FHD@60Hz/120Hz video

  19. Classes of video processing • Highly parallelizable operations on fixed-point data, with no floating point • A processor with a SIMD data-path engine • Ad-hoc computation and decision making, operating on smaller sets of data produced by the parallelizable processes • A general-purpose processor such as a RISC • Data movement and formatting of multidimensional pixels • Bit-serial processing for entropy decoding and encoding => dedicated hardware does this job very efficiently

  20. Task-based programming model • Programmers’ duties are as follows: • Partition a sequential algorithm into a set of parallelizable tasks and then efficiently map them to the massively parallel architecture • A task has a definite initiation time • A task runs to completion with no interruption and no further synchronization with other tasks • Understand the hardware architecture and its limitations • Shared memory (instead of FIFO mode) • Buffer size must be enough for a data unit • Interconnect bandwidth must be enough • Computational power must be enough for real time
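
The run-to-completion rule on this slide can be sketched in C as follows. This is only a minimal illustration under stated assumptions: the names `task_t`, `task_queue_t`, `queue_push`, `queue_run_due`, and `add_one` are hypothetical and do not come from the mediaDSP tools, and time is modeled as a plain tick counter.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical task record; names are illustrative, not a vendor API. */
typedef void (*task_fn)(void *arg);

typedef struct {
    task_fn  fn;          /* work that runs to completion once started */
    void    *arg;         /* data unit, assumed to sit in shared memory */
    unsigned start_tick;  /* the task's definite initiation time */
} task_t;

#define QUEUE_CAP 16

typedef struct {
    task_t slots[QUEUE_CAP];
    size_t count;
} task_queue_t;

/* Append a task. Returns 0 on success, -1 if the queue is full. */
int queue_push(task_queue_t *q, task_fn fn, void *arg, unsigned start_tick)
{
    assert(q != NULL && fn != NULL);
    if (q->count == QUEUE_CAP)
        return -1;
    q->slots[q->count++] = (task_t){ fn, arg, start_tick };
    return 0;
}

/* Run every task whose start time has arrived, in insertion order.
 * Each task runs without preemption; tasks that are not yet due stay
 * queued. Returns the number of tasks that ran. */
size_t queue_run_due(task_queue_t *q, unsigned now)
{
    size_t ran = 0, kept = 0;
    assert(q != NULL);
    for (size_t i = 0; i < q->count; i++) {
        task_t t = q->slots[i];
        if (t.start_tick <= now) {
            t.fn(t.arg);              /* no interruption, no further sync */
            ran++;
        } else {
            q->slots[kept++] = t;     /* compact the remaining tasks */
        }
    }
    q->count = kept;
    return ran;
}

/* Example task: increments the integer it is given. */
void add_one(void *arg)
{
    ++*(int *)arg;
}
```

A caller would push tasks with their start ticks and call `queue_run_due` once per tick; a real task control unit would also have to check buffer and bandwidth availability, which this sketch leaves out.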

  21. (IP) Platform-based architecture • Task-oriented engine (TOE) • A programmable DSP or a fixed-function unit • Task control unit (TCU) • A RISC that maintains a queue of tasks and synchronizes with other TCUs/TOEs • Goal: maximize the utilization of the TOEs • Control engine • Shared memory • Communication fabric

  22. Memory architecture • All TOEs use software-managed DMA rather than caches for their local storage • 6D addressing (x,y,t,Y,U,V) and the chunking of blocks into smaller subblocks. • No {pre-fetching, early load scheduling, cache, speculative execution, multithreading …} • Memory hierarchy • L1 - Processor Instruction and Data Memory • L2 - On-chip Shared Memory • L3 - Off-chip

  23. Broadcom BCM35421 chip [1/2] • Performs motion-compensated frame-rate conversion • Doubles the frame rate from FHD@60fps to FHD@120fps (to conquer motion blur) • 24 fps => 60 fps (de-judder)
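
The timing side of frame-rate conversion can be sketched in C: each output frame falls between two source frames, and the fractional position tells the interpolator how far along the motion to place it. The slide does not describe the chip's actual algorithm, so the struct `frc_pos_t` and the function `frc_position` are hypothetical names, and only the temporal mapping is shown (no motion estimation).

```c
#include <assert.h>

/* Position of one output frame relative to the source sequence. */
typedef struct {
    unsigned prev_frame;  /* index of the source frame at or before it */
    unsigned phase_num;   /* fractional position = phase_num / phase_den */
    unsigned phase_den;
} frc_pos_t;

/* Map output frame `out_index` (at out_fps) onto the source timeline
 * (at in_fps) using exact integer arithmetic. */
frc_pos_t frc_position(unsigned out_index, unsigned in_fps, unsigned out_fps)
{
    frc_pos_t p;
    unsigned long long t;

    assert(in_fps > 0 && out_fps > 0);
    t = (unsigned long long)out_index * in_fps;  /* source time * out_fps */
    p.prev_frame = (unsigned)(t / out_fps);
    p.phase_num  = (unsigned)(t % out_fps);
    p.phase_den  = out_fps;
    return p;
}
```

For 24 fps => 60 fps the phases cycle through 0, 24/60, 48/60, 12/60, 36/60, so most output frames have to be synthesized between two source frames; for 60 fps => 120 fps every second output frame sits exactly halfway (60/120).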

  24. Broadcom BCM35421 chip [2/2] • 65nm CMOS process • mediaDSP runs at 400 MHz • 106 Million transistors • Two Teraops of peak integer performance
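
Taking the two figures on this slide at face value (2 Teraops peak, 400 MHz clock), a rough back-of-the-envelope check of the implied parallelism is:

```latex
\frac{2 \times 10^{12}\ \text{ops/s}}{400 \times 10^{6}\ \text{cycles/s}} = 5000\ \text{ops/cycle}
```

That is about 5000 integer operations per clock cycle summed over all processing elements. How those operations are counted (for example, operand width) is not stated on the slide.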

  25. Performance of DSPs for applications • A DSP becomes useful when it can perform a minimum of 100 instructions per sample period • 68% of DSPs shipped in 2008 went into mobile handsets and base stations • Several K cycles for processing an input sample
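
The per-sample budget behind the 100-instruction rule of thumb is just clock rate divided by sample rate. A minimal C sketch follows, assuming one instruction per cycle (a simplification); the function names `cycles_per_sample` and `dsp_is_useful` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Whole cycles available per input sample: core clock / sample rate. */
uint64_t cycles_per_sample(uint64_t clock_hz, uint64_t sample_rate_hz)
{
    assert(sample_rate_hz > 0);
    return clock_hz / sample_rate_hz;
}

/* Rule of thumb from the slide: at least 100 instructions per sample
 * period. Assumes one instruction per cycle, which is a simplification. */
int dsp_is_useful(uint64_t clock_hz, uint64_t sample_rate_hz)
{
    return cycles_per_sample(clock_hz, sample_rate_hz) >= 100;
}
```

As an illustrative case (the numbers are assumptions, not from the slide), a 400 MHz core processing 48 kHz audio has 8333 cycles per sample, which is in the "several K cycles" range; the same core fed at 10 Msamples/s has only 40 and falls below the threshold.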

  26. Multiple elements • Increase in performance: • multiple elements > higher performance single elements

  27. Go deeper – TI’s multicore • Multicore Programming Guide

  28. Mapping an application to multicore • Know the processing model options • Identify all the tasks • Partition tasks into many small ones • Be familiar with inter-task communication/data flow • Combination/aggregation • Mapping • Memory hierarchy => L1/L2/L3, private/shared memory, number and capability of external memory channels • DMA • Special-purpose hardware!! • FFT, Viterbi, Reed-Solomon, AES codec, entropy codec

  29. Parallel processing model • Master/slave model • Data flow model • Very successful in communication systems • Router • Base station

  30. Data movement • Shared memory • Dedicated memory • Transitional memory => ownership changes, content is not copied
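
Transitional memory can be illustrated with a short C sketch in which handing a buffer to another core changes only an owner field and never touches the payload. The names `shared_buf_t`, `buf_transfer`, and `buf_access` are hypothetical; on real devices the handover would also need a notification and, where caches are not coherent, a write-back/invalidate step that is omitted here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A buffer in memory visible to several cores, plus its current owner. */
typedef struct {
    uint8_t *data;
    size_t   len;
    int      owner_core;
} shared_buf_t;

/* Hand the buffer from one core to another. Only the owner field
 * changes; the payload is not copied.
 * Returns 0 on success, -1 if `from_core` is not the current owner. */
int buf_transfer(shared_buf_t *b, int from_core, int to_core)
{
    assert(b != NULL);
    if (b->owner_core != from_core)
        return -1;
    b->owner_core = to_core;
    return 0;
}

/* Return the payload pointer only to the core that owns the buffer. */
uint8_t *buf_access(shared_buf_t *b, int core)
{
    assert(b != NULL);
    return (b->owner_core == core) ? b->data : NULL;
}
```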

  31. Notification [1/4] • Direct signaling • Create an event in the other core’s local interrupt controller • The other core polls its local status • Or the local interrupt controller converts this event into a real interrupt

  32. Notification [2/4] • Indirect signaling • Not directly controlled by software

  33. Notification [3/4] • Atomic arbitration • Hardware semaphore/mutex • Semaphore => allows a limited number of simultaneous accesses => example: multi-port SRAM/external DDR memory • Mutex => allows one access only • Use a software semaphore instead if the resource is shared only between processes running on a single core • The overhead of a hardware semaphore is not small • It is only a facility for software use: hardware only guarantees the atomic operation; the locked content itself is not protected • Cost and performance considerations

  34. Notification [4/4] • Left diagram is mutex • Just like the software counterpart

  35. Data transfer engines • DMA => system DMA, local DMA belonging to a core • Ethernet • Up to 32 MAC addresses • RapidIO • Implemented with an ultra-fast serial IO physical layer • Possibly multiple serial IO links => uni/bi-directional • Example • USB 2.0 => 480 Mbit/sec; USB 3.0 => 5 Gbit/sec • Serial ATA • 1.0, Gen 1 => 1.5 Gbit/sec; 2.0, Gen 2 => 3 Gbit/sec • 3.0, Gen 3 => 6 Gbit/sec

  36. High speed serial link USB SATA

  37. Memory management • Devices do not support automated cache coherency among cores because of the power consumption involved and the latency overhead introduced • Switched central resource => fabric

  38. Highlights [1/3] • A portion of the cache can be configured as memory-mapped SRAM • Transparent cache => visible • Address aliasing => masking the MSByte • For core 0: 0x10800000 == 0x00800000 • For core 1: 0x11800000 == 0x00800000 • For core 2: 0x12800000 == 0x00800000 • Special register DNUM for dynamic pointer address update • Implicit: common ROM code can be written that still accesses the core’s private area • Explicit: each core has its own DNUM
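
The aliasing pattern in the three examples above can be written as a pair of C helpers. The layout is inferred only from those examples (global address = 0x10000000 plus the core number in bits 24 to 27, plus the local address); the function names and the limit of 16 cores are assumptions, and on a real device the core number would be read from the DNUM register rather than passed in.

```c
#include <assert.h>
#include <stdint.h>

/* Build the globally visible address of a core-local location.
 * `dnum` is the core number (the value the DNUM register would hold). */
uint32_t local_to_global(uint32_t local_addr, unsigned dnum)
{
    assert(dnum < 16);                         /* assumed 4-bit core field */
    assert((local_addr & 0xFF000000u) == 0);   /* local alias: MSByte is 0 */
    return 0x10000000u | ((uint32_t)dnum << 24) | local_addr;
}

/* Recover the core-local alias by masking the most significant byte. */
uint32_t global_to_local(uint32_t global_addr)
{
    return global_addr & 0x00FFFFFFu;
}
```

Code that uses only the local alias runs unchanged on every core, while another core or a DMA engine would need the global form to reach a specific core's memory.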

  39. Highlight [2/3] • The only coherency guaranteed by hardware • L1D <=> L2 (core-locally) • L1D <=> L2 <=> SL2 (if configured as memory-mapped SRAM) (core-locally) • Equal access to the external DDR2 SDRAM through the SCR • This may be the bottleneck for certain applications (figure: per-core L1P and L1D)

  40. Highlight [3/3] • If any portion of the L1s is configured as memory-mapped SRAM, there is a small paging engine built into the core (IDMA) that can be used to transfer linear blocks of memory between L1 and L2 in the background of CPU operation • Paging engine => MMU • IDMA may also be used to perform bulk peripheral configuration register access

  41. DSP code and data image • Image types • Single image • Multiple images • Multiple images with shared code and data • A complex linking scheme should be used • Device boot

  42. Debugging • Cuda => XXOO↑↑↓↓←→←→BA

  43. TI’s offer • Hardware emulation => ICE, JTAG • Basically non-intrusive • Software instrumentation • Patch the original code to provide the same capability => here, “Trace Logs” • Basically intrusive • Types of Trace Logs • API call log, statistics log, DMA transaction log, event log, customer data log

  44. More on logs • Information stored in memory is pulled back to the host through the hardware emulation path • Tools are provided to correlate all the logs • and display them in an organized manner • Log example:

  45. Go deeper – Freescale’s manycore • Embedded Multicore: An Introduction

  46. Why manycore? • Freescale MPC8641 • Single core => freq x 1.5 => power x 2 • Dual core => freq x 1.5 => power x 1.3 Bug in this Fig.

  47. Memory system types

  48. SMP + AMP + Sharing • Manycore lets multiple OSes run concurrently • Memory sharing => MMU • Interface/peripheral sharing => hypervisor • Virtualization is good for legacy support

  49. Review of single core

  50. Manycore example [1/2]
