Manycores – From hardware perspective to software

Presenter: D96943001, Graduate Institute of Electronics Engineering, 陳泓輝

Why Moore’s Law is dying • He is not CEO anymore!! • Walls => ILP, frequency, power, and memory walls

ILP – more cost, less return • ILP: instruction-level parallelism • OOO: out-of-order execution of micro-ops

Frequency wall • FO4 delay metric: delay of an inverter that is driven by an inverter ¼ its size and that drives an inverter 4x its size (fan-out of 4) • Freq ↑ => cycle counts of some operations ↑ => Saturated!

Memory wall • External access penalty is increasing (the gap) • Solution => enlarge the cache • Cache decides the performance and the price

It’s cache that matters!

The power wall • High power might imply • Thermal runaway of device behavior • Larger current => electromigration => reliability issues in the metal interconnect • Hitting the packaging heat limit • Change to high-cost packaging • Cooling => noise!! • Form factor

The great wall…… Moore’s Law CMOS Multicore Manycore

Historical - Intel 2007 Xeon • Dual on-chip memory controllers => fcpu > 2*fmem • Point-to-point interconnection => fabrics • Multiple concurrent communication activities (cf. “bus” => one activity)

Fabric working notation

AMD – Opteron(Shanghai) • Much the same as Intel Xeon • Shared L3 cache among 2 cores

Game consoles • Xbox 360 => triple core (homogeneous) • PS3 => Cell, 8+1 cores (heterogeneous) • PowerPC wins!

State-of-the-art multicore DSP chips • TI TNETV3020 (homogeneous) • Freescale 8156 (heterogeneous)

State-of-the-art multicore DSP chips • picoChip PC205 (heterogeneous) • Tilera TILE64 (homogeneous, mesh)

State-of-the-art multicore x86 chips • Intel Single-chip Cloud Computer (1 GHz Pentium cores) • 24 “tiles” with two IA cores per tile • A 24-router mesh network with 256 GB/s bisection bandwidth • 4 integrated DDR3 memory controllers • Hardware support for message passing!!

GPGPU - OpenCL [Figure: official OpenCL logo]

Special case: multicore video processor • Characteristics of video applications in consumer electronics • High computational capability • Low hardware cost • Low power consumption • A general solution • Fixed-function logic design • Challenges • Multiple video decoding standards • Video decoding standards keep being updated • Ill-posed video processing algorithms • Product requirements are diverse and mutually exclusive

mediaDSP technology (nickname: accelerator) • Broadcom: mediaDSP technology • Heterogeneous (programmable and fixed-function units) • A task-based programming model • A uniform approach for managing tasks executing on different types of programmable and fixed-function processing elements • A platform, easily extendable to support a range of applications • Easy to customize for special purposes • Success stories • SD MPEG video encoder including scaling and noise reduction • Frame-rate-conversion video processing for FHD@60Hz/120Hz videos

Classes of video processing • Highly parallelizable operations on fixed-point data, with no floating point • A processor with a SIMD data path engine • Ad-hoc computation and decision making, operating on the smaller sets of data produced by the parallelizable processes • A general processor such as a RISC • Data movement and formatting of multidimensional pixels • Bit-serial processing for entropy decoding and encoding => dedicated hardware does this job very efficiently

Task-based programming model • Programmers’ duties are as follows: • Partition a sequential algorithm into a set of parallelizable tasks and then efficiently map them to the massively parallel architecture • A task has a definite initiation time • A task runs until completion with no interruption and no further synchronization with other tasks • Understand the hardware architecture and its limitations • Shared memory (instead of FIFO mode) • Buffer size must be large enough for a data unit • Interconnect bandwidth must be sufficient • Computational power must be sufficient for real time
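The run-to-completion rule above can be sketched as a tiny scheduler. This is only an illustration of the programming model, not the mediaDSP implementation: the names `Task` and `run_tasks` and the example task graph are hypothetical.

```python
from collections import deque

class Task:
    """A unit of work that runs to completion once started (hypothetical model)."""
    def __init__(self, name, func, deps=()):
        self.name = name
        self.func = func          # called once, never interrupted
        self.deps = set(deps)     # names of tasks that must finish first

def run_tasks(tasks):
    """Start each task only after all of its dependencies are done.

    Returns the order in which the tasks were executed.
    """
    pending = deque(tasks)
    done, order = set(), []
    while pending:
        progressed = False
        for _ in range(len(pending)):
            task = pending.popleft()
            if task.deps <= done:       # definite initiation point
                task.func()             # runs to completion, no mid-task sync
                done.add(task.name)
                order.append(task.name)
                progressed = True
            else:
                pending.append(task)    # not ready yet, retry on the next pass
        if not progressed:
            raise ValueError("unsatisfiable or cyclic dependencies")
    return order

# Hypothetical partition of a sequential video pipeline into tasks.
log = []
pipeline = [
    Task("scale", lambda: log.append("scale"), deps=["decode"]),
    Task("decode", lambda: log.append("decode")),
    Task("denoise", lambda: log.append("denoise"), deps=["decode"]),
    Task("encode", lambda: log.append("encode"), deps=["scale", "denoise"]),
]
order = run_tasks(pipeline)
```

Because all synchronization happens at task boundaries, the scheduler never has to preempt or resume a task, which keeps the control logic simple.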

(IP) Platform-based architecture • Task-oriented engine (TOE) • A programmable DSP or a fixed-function unit • Task control unit (TCU) • A RISC that maintains a queue of tasks and synchronizes with other TCUs/TOEs • To maximize the utilization of TOEs • Control engine • Shared memory • Communication fabric

Memory architecture • All TOEs use software-managed DMA rather than caches for their local storage • 6D addressing (x,y,t,Y,U,V) and the chunking of blocks into smaller subblocks. • No {pre-fetching, early load scheduling, cache, speculative execution, multithreading …} • Memory hierarchy • L1 - Processor Instruction and Data Memory • L2 - On-chip Shared Memory • L3 - Off-chip
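With software-managed DMA, the program itself decides which subblock to move next. A minimal sketch of the chunking step, assuming a plain 2D frame and a hypothetical helper `subblocks` (the slides do not describe the actual 6D descriptor format):

```python
def subblocks(width, height, block_w, block_h):
    """Yield (x, y, w, h) descriptors that tile a width x height frame.

    Edge blocks are clipped so that the tiles cover the frame exactly once.
    """
    for y in range(0, height, block_h):
        for x in range(0, width, block_w):
            yield (x, y, min(block_w, width - x), min(block_h, height - y))

# One FHD luma plane split into 64x64 subblocks, one DMA transfer each.
tiles = list(subblocks(1920, 1080, 64, 64))
```

Each descriptor would be handed to a DMA engine in turn; the block size is chosen so that a subblock fits in the local (L1) memory of the processing element.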

Broadcom BCM35421 chip [1/2] • Does motion-compensated frame-rate conversion • Doubles the frame rate from FHD@60fps to FHD@120fps (to reduce motion blur) • 24 fps => 60 fps (de-judder)
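The two conversions above differ in where the new frames fall between the source frames. A small sketch of that timing arithmetic, with a hypothetical helper `interpolation_phases` (how the chip actually schedules its interpolation is not described in the slides):

```python
from fractions import Fraction

def interpolation_phases(src_fps, dst_fps, n_out):
    """For each output frame, return (source frame index, phase in [0, 1)).

    Phase 0 means the output coincides with a source frame; any other phase
    means a frame must be synthesized between `index` and `index + 1`.
    """
    step = Fraction(src_fps, dst_fps)   # source frames advanced per output frame
    result = []
    for k in range(n_out):
        pos = k * step
        index = pos.numerator // pos.denominator
        result.append((index, pos - index))
    return result

film_to_60 = interpolation_phases(24, 60, 5)    # phases 0, 2/5, 4/5, 1/5, 3/5
double_rate = interpolation_phases(60, 120, 4)  # phases 0, 1/2, 0, 1/2
```

For 24 fps => 60 fps, four out of every five output frames are interpolated, whereas simple frame repetition (3:2 pulldown) would show each source frame for an uneven number of output frames, which is the judder being removed.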

Broadcom BCM35421 chip [2/2] • 65nm CMOS process • mediaDSP runs at 400 MHz • 106 Million transistors • Two Teraops of peak integer performance

Performance of DSPs for applications • A DSP becomes useful when it can perform a minimum of 100 instructions per sample period • 68% of DSPs shipped in 2008 went into mobile handsets and base stations • Several thousand cycles are available for processing an input sample
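The per-sample budget is simple arithmetic: clock rate divided by sample rate. A sketch with illustrative numbers, assuming one instruction per cycle; the 400 MHz clock is the mediaDSP figure quoted earlier, and the 48 kHz sample rate is a hypothetical audio example:

```python
def cycles_per_sample(clock_hz, sample_rate_hz):
    """Whole clock cycles available between two consecutive input samples."""
    return clock_hz // sample_rate_hz

def min_clock_hz(sample_rate_hz, instructions_per_sample=100):
    """Lowest clock that still allows the given instruction count per sample,
    assuming one instruction per cycle (an idealization)."""
    return sample_rate_hz * instructions_per_sample

budget = cycles_per_sample(400_000_000, 48_000)   # several thousand cycles
threshold = min_clock_hz(48_000)                  # clock needed for 100 instr/sample
```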

Multiple elements • Increase in performance: • multiple elements > higher performance single elements

Go deeper – TI’s multicore • Multicore Programming Guide

Mapping an application to multicore • Know the processing model options • Identify all the tasks • Partition tasks into many small ones • Be familiar with inter-task communication/data flow • Combination/aggregation • Mapping • Memory hierarchy => L1/L2/L3, private/shared memory, number and capability of external memory channels • DMA • Special-purpose hardware!! • FFT, Viterbi, Reed-Solomon, AES codec, entropy codec

Parallel processing model • Master/slave model • Data flow model • Very successful in communication systems • Router • Base station
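The two models can be contrasted in a few lines. This is a sketch only: Python threads stand in for DSP cores, and the function names `master_slave` and `data_flow` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
import queue
import threading

def master_slave(items, work, n_slaves=4):
    """Master hands independent work items to a pool of slaves and collects results."""
    with ThreadPoolExecutor(max_workers=n_slaves) as pool:
        return list(pool.map(work, items))   # map preserves the input order

def data_flow(items, stages):
    """Each stage runs on its own worker; data streams through FIFO queues."""
    STOP = object()                          # sentinel that shuts the pipeline down
    queues = [queue.Queue() for _ in range(len(stages) + 1)]

    def run_stage(stage, q_in, q_out):
        while True:
            item = q_in.get()
            if item is STOP:
                q_out.put(STOP)              # propagate shutdown downstream
                return
            q_out.put(stage(item))

    workers = [threading.Thread(target=run_stage, args=(s, queues[i], queues[i + 1]))
               for i, s in enumerate(stages)]
    for w in workers:
        w.start()
    for item in items:
        queues[0].put(item)
    queues[0].put(STOP)

    results = []
    while True:
        out = queues[-1].get()
        if out is STOP:
            break
        results.append(out)
    for w in workers:
        w.join()
    return results

squares = master_slave(range(5), lambda x: x * x)
piped = data_flow([1, 2, 3], [lambda x: x + 1, lambda x: x * 10])
```

In the master/slave version, control is centralized and the workers are interchangeable. In the data flow version, each worker owns one processing step and there is no central scheduler, which suits streaming workloads such as packet or baseband processing.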

Data movement • Shared memory • Dedicated memory • Transitional memory => ownership changes, content is not copied
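The transitional-memory idea, where only the ownership record moves, can be sketched as follows. The class `TransitionalBuffer` is hypothetical, and core IDs are plain integers.

```python
class TransitionalBuffer:
    """Buffer whose ownership moves between cores; the bytes never move."""

    def __init__(self, data, owner):
        self._data = data
        self.owner = owner

    def transfer(self, current, new_owner):
        """Hand the buffer over; only the current owner may do this."""
        if current != self.owner:
            raise PermissionError(f"core {current} does not own this buffer")
        self.owner = new_owner           # only the ownership record changes

    def access(self, core):
        """Return the underlying storage to its owner (same object, no copy)."""
        if core != self.owner:
            raise PermissionError(f"core {core} does not own this buffer")
        return self._data

buf = TransitionalBuffer(bytearray(b"frame"), owner=0)
seen_by_core0 = buf.access(0)
buf.transfer(0, 1)
seen_by_core1 = buf.access(1)            # identical storage, now owned by core 1
```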

Notification [1/4] • Direct signaling • Create an event in the other core’s local interrupt controller • The other core polls its local status • Or the local interrupt controller converts this event into a real interrupt

Notification [2/4] • Indirect signaling • Not directly controlled by software

Notification [3/4] • Atomic arbitration • Hardware semaphore/mutex • Semaphore => allows limited multiple access => example: multi-port SRAM/external DDR memory • Mutex => allows one access only • Use a software semaphore instead if the resource is shared only between processes executing on one core • The overhead of a hardware semaphore is not small • It’s only a facility for software use: hardware only guarantees the atomic operation, the locked content itself is not protected • Cost and performance considerations
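The difference between the two primitives shows up as the peak number of concurrent accessors. A sketch using Python's software `threading.Lock` and `threading.Semaphore` as stand-ins for the hardware facilities; the helper `peak_concurrency` is hypothetical.

```python
import threading
import time

def peak_concurrency(guard, n_workers=8, hold_s=0.01):
    """Run n_workers threads through `guard`; return the most that were inside at once."""
    inside = 0
    peak = 0
    bookkeeping = threading.Lock()       # protects the two counters only

    def worker():
        nonlocal inside, peak
        with guard:                      # acquire the mutex or a semaphore slot
            with bookkeeping:
                inside += 1
                peak = max(peak, inside)
            time.sleep(hold_s)           # pretend to use the shared resource
            with bookkeeping:
                inside -= 1

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return peak

mutex_peak = peak_concurrency(threading.Lock())           # one access only
semaphore_peak = peak_concurrency(threading.Semaphore(2)) # e.g. a dual-port memory
```

As the slide notes, the guard is purely cooperative: code that skips the `with guard` step can still touch the resource, because nothing protects the locked content itself.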

Notification [4/4] • Left diagram is mutex • Just like the software counterpart

Data transfer engines • DMA => system DMA, local DMA belonging to a core • Ethernet • Up to 32 MAC addresses • RapidIO • Implemented with an ultra-fast serial IO physical layer • May have multiple serial IO links => uni/bi-directional • Examples • USB 2.0 => 480 Mbit/sec; USB 3.0 => 5 Gbit/sec • Serial ATA • 1.0, Gen 1 => 1.5 Gbit/sec; 2.0, Gen 2 => 3 Gbit/sec • 3.0, Gen 3 => 6 Gbit/sec
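The line rates quoted above are raw bit rates on the wire. SATA Gen 1 to Gen 3 and USB 3.0 at 5 Gbit/sec use 8b/10b line coding, so only 8 of every 10 transmitted bits carry data. A sketch of the conversion, ignoring protocol overhead (the helper name is hypothetical):

```python
def payload_mbytes_per_s(line_rate_gbit_s, data_bits=8, coded_bits=10):
    """Payload bandwidth in MB/s for a serial link with data_bits/coded_bits coding."""
    payload_bits_per_s = line_rate_gbit_s * 1e9 * data_bits / coded_bits
    return payload_bits_per_s / 8 / 1e6

sata_gen1 = payload_mbytes_per_s(1.5)   # about 150 MB/s
sata_gen2 = payload_mbytes_per_s(3.0)   # about 300 MB/s
sata_gen3 = payload_mbytes_per_s(6.0)   # about 600 MB/s
usb3 = payload_mbytes_per_s(5.0)        # about 500 MB/s before protocol overhead
```

USB 2.0 is left out on purpose, since it does not use 8b/10b coding.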

High speed serial link USB SATA

Memory management • Devices do not support automated cache coherency among cores because of the power consumption involved and the latency overhead introduced • Switched central resource => fabric

Highlights [1/3] • A portion of the cache can be configured as memory-mapped SRAM • Transparent cache => visible • Address aliasing => masking the most significant byte (implicit) • For core 0: 0x10800000 == 0x00800000 • For core 1: 0x11800000 == 0x00800000 • For core 2: 0x12800000 == 0x00800000 • Common ROM code can still access each core’s private area • Special register DNUM for dynamic pointer address update (explicit) • Each core has its own DNUM
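The aliasing rule can be written out as address arithmetic. The constants below are inferred only from the three examples on this slide, so treat them as assumptions; a real device's memory map may differ, and the function names are hypothetical.

```python
LOCAL_MASK = 0x00FFFFFF    # the aliased (core-local) view drops the most significant byte
GLOBAL_BASE = 0x10000000   # core 0's global window, taken from the examples above

def to_global(local_addr, dnum):
    """Explicit form: build the global address of core `dnum`'s private memory."""
    return GLOBAL_BASE + (dnum << 24) + (local_addr & LOCAL_MASK)

def to_local(global_addr):
    """Implicit form: the aliased address each core sees for its own memory."""
    return global_addr & LOCAL_MASK

core_addresses = [hex(to_global(0x00800000, dnum)) for dnum in range(3)]
```

Code that uses only the local alias is identical on every core, which is why a common ROM image works. Code that must reach another core's memory, or hand a pointer to a DMA engine, needs the global form built from DNUM.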

Highlights [2/3] • The only coherency guaranteed by hardware • L1D <=> L2 (core-local) • L1D <=> L2 <=> SL2 (if configured as memory-mapped SRAM) (core-local) • Equal access to the external DDR2 SDRAM through the SCR; this may be the bottleneck for certain applications

Highlights [3/3] • If any portion of the L1s is configured as memory-mapped SRAM, there is a small paging engine built into the core (IDMA) that can be used to transfer linear blocks of memory between L1 and L2 in the background of CPU operation • Paging engine => MMU • IDMA may also be used to perform bulk peripheral configuration register access

DSP code and data image • Image types • Single image • Multiple images • Multiple images with shared code and data • A complex linking scheme should be used • Device boot • Tools

Debugging • Cuda => XXOO↑↑↓↓←→←→BA

TI’s offer • Hardware emulation => ICE, JTAG • Basically not intrusive • Software instrumentation • Patching the original code to provide the same ability => this time, “Trace Logs” • Basically intrusive • Types of Trace Logs • API call log, statistics log, DMA transaction log, event log, customer data log
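Software instrumentation of the API-call-log kind can be sketched with a wrapper that records every call. This only illustrates the idea and is not TI's trace API: `TRACE_LOG`, `traced`, and `dma_copy` are hypothetical names.

```python
import functools
import time

TRACE_LOG = []   # hypothetical in-memory log buffer, later pulled back to the host

def traced(func):
    """Patch `func` so that each call appends an entry to TRACE_LOG."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        TRACE_LOG.append({
            "api": func.__name__,
            "args": args,
            "elapsed_s": time.perf_counter() - start,
        })
        return result
    return wrapper

@traced
def dma_copy(dst, src, n):
    """Stand-in for a DMA transfer: copy the first n bytes of src into dst."""
    dst[:n] = src[:n]
    return n

dst = bytearray(4)
copied = dma_copy(dst, b"abcd", 2)
```

The wrapper adds work to every call, which is exactly why this approach counts as intrusive: timing under instrumentation differs from timing without it.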

More on logs • Information stored in memory is pulled back to the host through the hardware emulation path • A tool is provided to correlate all the logs • They are displayed in an organized manner • Log example:

Go deeper – Freescale’s manycore • Embedded Multicore: An Introduction

Why manycore? • Freescale MPC8641 • Single core => freq x 1.5 => power x 2 • Dual core => freq x 1.5 => power x 1.3 • (Bug in this figure.)
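One way to read the single-core figure is through the textbook CMOS dynamic power model, P proportional to V² × f. That model is an assumption added here, not something stated on the slide, and it ignores leakage. Under it, a power increase of 2x for a frequency increase of 1.5x implies a supply voltage increase of roughly 15%:

```python
import math

def dynamic_power_ratio(freq_ratio, voltage_ratio):
    """Relative dynamic power under P ~ V^2 * f (textbook model, leakage ignored)."""
    return voltage_ratio ** 2 * freq_ratio

def implied_voltage_ratio(power_ratio, freq_ratio):
    """Voltage scaling that explains a given power and frequency ratio."""
    return math.sqrt(power_ratio / freq_ratio)

v_ratio = implied_voltage_ratio(2.0, 1.5)          # roughly 1.15
check = dynamic_power_ratio(1.5, v_ratio)          # recovers the 2x power figure
```

Adding a second core at the same voltage and frequency avoids that voltage penalty, which is the usual argument for scaling out rather than up.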

Memory system types

SMP + AMP + Sharing • Manycore enables multiple OSes to run concurrently • Memory sharing => MMU • Interface/peripheral sharing => hypervisor • Virtualization is good for legacy support

Review of single core

Manycore example [1/2]
