John Cavazos Dept of Computer & Information Sciences University of Delaware

www.cis.udel.edu/~cavazos/cisc879
Lecture 9: Cell Programming Tutorial

Lecture 9: Overview
• Cell Basics
• Programming Models
• Programming Details
• Example Code

Cell Architecture Recap
• Heterogeneous architecture
• 9 cores on chip
  • 1 PPE (general-purpose processor)
  • 8 SPEs (SIMD processors)
• PPE runs control-plane code
  • Code with lots of branches (e.g., OS)
• SPEs run data-plane code
  • Computational code with few branches

Program Structure
• Multiple programs in one
  • PPU and SPU programs cooperate
• PPE code
  • Regular Linux process (main thread)
  • Process can spawn SPE threads
• SPE code
  • SPE executables are packaged inside PPE executables

SPE Details
• Register file: large (128 entries), 128 bits wide, and unified
• All instructions are SIMD instructions
• Local Store (256 KB)
  • Loads/stores access the LS
  • Contains all instructions and data used by the SPU
• DMA transfers data between LS and main storage
  • High bandwidth (128 bytes per cycle)
• Non-deterministic features are eliminated
  • No out-of-order execution
  • No hardware-managed caches
  • No hardware branch prediction

SPE Register Layout
• The left-most word (bytes 0, 1, 2, and 3) of a register is called the preferred slot
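To make the byte numbering concrete, here is a minimal portable C sketch. It does not use the Cell SDK: `reg128_t` and `preferred_slot` are hypothetical names that model a 128-bit register as 16 bytes, with byte 0 as the left-most, and read the preferred slot as a big-endian word.

```c
#include <stdint.h>

/* Hypothetical model of one 128-bit SPE register: byte[0] is the left-most byte. */
typedef struct { uint8_t byte[16]; } reg128_t;

/* Return the preferred slot (bytes 0..3) as a big-endian 32-bit word. */
static uint32_t preferred_slot(const reg128_t *r)
{
    return ((uint32_t)r->byte[0] << 24) |
           ((uint32_t)r->byte[1] << 16) |
           ((uint32_t)r->byte[2] << 8)  |
            (uint32_t)r->byte[3];
}
```

The remaining twelve bytes of the register are ignored by this helper.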

SPE SIMD Example (add)
• Example is a 4-wide add
  • Each of the 4 elements in reg VA is added to the corresponding element in reg VB
  • The 4 results are placed in the corresponding slots in reg VC
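The 4-wide add can be modelled in plain C as a loop over four lanes. This is only a sketch of the semantics: `vec4_t` and `simd_add` are hypothetical names, and the loop stands in for what the hardware does in a single instruction.

```c
#include <stdint.h>

#define LANES 4   /* four 32-bit elements per 128-bit register */

/* Hypothetical stand-in for a 128-bit register holding four words. */
typedef struct { uint32_t lane[LANES]; } vec4_t;

/* VC = VA + VB, element by element; each lane is independent of the others. */
static vec4_t simd_add(vec4_t va, vec4_t vb)
{
    vec4_t vc;
    for (int i = 0; i < LANES; i++)
        vc.lane[i] = va.lane[i] + vb.lane[i];
    return vc;
}
```

On an SPU the same operation would normally be written with the `spu_add` intrinsic on vector types instead of a scalar loop.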

SPE SIMD Example (shuffle)
• Bytes are selected from regs VA and VB based on control vector VC
• Control vector entries are byte indices into VA and VB
• Operation is purely byte oriented
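A portable C model of the byte selection may help. It is a simplified sketch: `reg128_t` and `shuffle_bytes` are hypothetical names, each control byte is treated as a 5-bit index into the 32 bytes of VA followed by VB, and the special control-byte patterns that produce constant bytes on real hardware are deliberately left out.

```c
#include <stdint.h>

/* Hypothetical model of a 128-bit register as 16 bytes. */
typedef struct { uint8_t byte[16]; } reg128_t;

/* For each result byte, the control byte in VC picks one source byte:
   indices 0..15 select from VA, indices 16..31 select from VB. */
static reg128_t shuffle_bytes(reg128_t va, reg128_t vb, reg128_t vc)
{
    reg128_t out;
    for (int i = 0; i < 16; i++) {
        uint8_t sel = vc.byte[i] & 0x1F;   /* keep the low 5 bits only */
        out.byte[i] = (sel < 16) ? va.byte[sel] : vb.byte[sel - 16];
    }
    return out;
}
```

A control vector of 15, 14, ..., 0 would reverse the bytes of VA, which shows why the operation is useful for reordering and alignment work.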

SPE Model
• Code and data must fit into the 256 KB local store
• Explicit input/output of the SPE program
  • Program arguments and return code
  • DMA
  • Mailboxes
  • Signals
• PPE maps system memory for SPE DMA
[Figure: SPE program in the local store exchanging DMA transactions with system memory]

SPE Model (cont’d)
• Streaming model for large input/output data
[Figure: system memory holds int g_ip[512*1024] and int g_op[512*1024]; DMA brings a chunk into int ip[32] in the local store, the SPE program computes op = func(ip), and DMA writes int op[32] back to system memory]
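The streaming model can be sketched in portable C by letting `memcpy` stand in for the DMA get and put. This is an illustration, not SPE code: `stream_process` is a hypothetical name, `func` here is an arbitrary kernel that doubles each element, and the total length is assumed to be a multiple of the chunk size.

```c
#include <stddef.h>
#include <string.h>

#define CHUNK 32   /* elements per local buffer, as in ip[32] / op[32] */

/* Hypothetical kernel standing in for func(): doubles every element. */
static void func(const int *ip, int *op, size_t n)
{
    for (size_t i = 0; i < n; i++)
        op[i] = 2 * ip[i];
}

/* Stream `total` elements from g_ip to g_op through small local buffers.
   Assumes total is a multiple of CHUNK. */
static void stream_process(const int *g_ip, int *g_op, size_t total)
{
    int ip[CHUNK], op[CHUNK];
    for (size_t off = 0; off < total; off += CHUNK) {
        memcpy(ip, g_ip + off, sizeof ip);   /* stands in for a DMA get */
        func(ip, op, CHUNK);                 /* compute on local data only */
        memcpy(g_op + off, op, sizeof op);   /* stands in for a DMA put */
    }
}
```

The point of the pattern is that the kernel only ever touches the two small local buffers, however large the arrays in system memory are.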

Programming Models
• How the application and data are partitioned among the PPE and SPEs
• Partitioning involves considering
  • Program structure
  • Data structures
  • Data and code movement via DMA
• Several models:
  • Data-parallel
  • Task-parallel
  • Job queue

Job Queue
• Code and data packaged together
[Figure: job queue in system memory holding code/data packages n, n+1, n+2, …; the SPE kernel DMAs package n (Code n, Data n) into the local store]
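One way to picture "code and data packaged together" is a queue of records that each carry a function pointer and its input. The sketch below is portable C under that assumption; `job_t`, `run_job_queue`, `square`, and `negate` are hypothetical names, and the function pointer stands in for code that a real SPE kernel would fetch by DMA.

```c
#include <stddef.h>

/* Hypothetical "code" part of a job. */
typedef int (*job_code_t)(int);

/* One queue entry: the code to run, the data to run it on, and the result. */
typedef struct {
    job_code_t code;
    int        data;
    int        result;
} job_t;

static int square(int x) { return x * x; }
static int negate(int x) { return -x; }

/* Worker loop: take jobs in order and run each job's code on its own data.
   Returns the number of jobs completed. */
static size_t run_job_queue(job_t *queue, size_t n)
{
    size_t done = 0;
    for (size_t i = 0; i < n; i++) {
        queue[i].result = queue[i].code(queue[i].data);
        done++;
    }
    return done;
}
```

With several workers, each would pull the next unclaimed entry, which is what makes the model balance load naturally.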

Data Parallel
• SPE-initiated DMA
• Large array of data fed through SPEs
• Special case of Job Queue
[Figure: PPE and system memory holding input/output pairs I0/O0 through In/On; SPE0 through SPE7 each run Kernel()]
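Feeding a large array through several SPEs starts with deciding which elements each one handles. The following portable C sketch shows one possible split into contiguous slices; `slice_t` and `slice_for` are hypothetical names, and real code would likely also round slice boundaries to DMA-friendly alignments, which this sketch ignores.

```c
#include <stddef.h>

/* Half-open range [begin, end) of array indices. */
typedef struct { size_t begin, end; } slice_t;

/* Slice of an n-element array assigned to worker w (0-based) out of nworkers.
   Any remainder is spread one element at a time over the first workers. */
static slice_t slice_for(size_t n, size_t nworkers, size_t w)
{
    size_t base  = n / nworkers;
    size_t extra = n % nworkers;
    slice_t s;
    s.begin = w * base + (w < extra ? w : extra);
    s.end   = s.begin + base + (w < extra ? 1 : 0);
    return s;
}
```

For example, 100 elements over 8 workers gives the first four workers 13 elements each and the rest 12 each, with no gaps or overlap.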

Task Parallel
• LS-to-LS DMA
• Flexible pipelining of functions
• Load balancing is harder
[Figure: PPE and system memory holding I0/O0 through In/On; SPE0 through SPE7 each run Kernel(), connected to one another by DMA]
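The pipeline idea can be sketched in portable C as a chain of stage functions, where the output of one stage becomes the input of the next. This is a sequential model only: `stage_fn`, `run_pipeline`, and the three example stages are hypothetical, and on hardware each stage would run on its own SPE with LS-to-LS DMA between them.

```c
#include <stddef.h>

/* Hypothetical per-stage kernel. */
typedef int (*stage_fn)(int);

static int stage_scale(int x)  { return x * 3; }
static int stage_offset(int x) { return x + 1; }
static int stage_clamp(int x)  { return x > 100 ? 100 : x; }

/* Pass one item through every stage in order. */
static int run_pipeline(const stage_fn *stages, size_t nstages, int item)
{
    for (size_t s = 0; s < nstages; s++)
        item = stages[s](item);
    return item;
}
```

Load balancing is harder in this model because the slowest stage sets the throughput of the whole chain.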

Cell Terminology
• SPE Context
  • Holds information about a logical SPE
  • Used by the application
• SPE Gang Context
  • Group of threads with the same properties
• SPE Event
  • Events caused by (asynchronously) running SPE threads
  • Examples: SPE execution stopped, mailbox messages written/read, DMA operations completed

LibSPE Version 2.0
[Figure: conceptual diagram of LibSPE 2.0 objects. Boxes: Application Data, PPE Thread, PPE Thread Function, SPE Context, SPE Gang Context, SPE Program, SPE Stopinfo. Types: application_data_t, pthread_t, spe_context_ptr_t, spe_program_handle_t. Labels: arguments, environment, policy, priority, events, argp, envp]

Single SPE Thread
• A simple application uses a single PPE thread
• Basic scheme for a simple application using an SPE:
  1. Create an SPE context
  2. Load an executable object into the SPE context’s local store
  3. Run the SPE context (transfers control to the OS, requesting that the context be scheduled on a physical SPE in the system)
  4. Destroy the SPE context
Note: Step 3 is a synchronous call. The calling application thread blocks until the SPE stops and returns.

Single Thread (Hello World)

SPU:

    #include <stdio.h>

    int main()
    {
        printf("hello world\n");
        return 0;
    }

PPU:

    #include <stdlib.h>
    #include <libspe2.h>

    int main()
    {
        spe_context_ptr_t spe;
        unsigned int createflags = 0;
        unsigned int runflags = 0;
        unsigned int entry = SPE_DEFAULT_ENTRY;
        void * argp = NULL;
        void * envp = NULL;
        spe_program_handle_t * program;

        program = spe_image_open("hello");            // open SPE program
        spe = spe_context_create(createflags, NULL);  // step 1: create SPE context
        spe_program_load(spe, program);               // step 2: load program into context
        spe_context_run(spe, &entry, runflags, argp, envp, NULL);  // step 3: run (blocks)
        spe_image_close(program);
        spe_context_destroy(spe);                     // step 4: destroy SPE context
        return 0;
    }

Multiple SPE Threads
• May want to run multiple SPEs concurrently
• Create N PPE threads for N concurrent SPE contexts
  • Each PPE thread runs a single SPE context
• Basic scheme for a simple application running N SPE contexts:
  1. Create N SPE contexts
  2. Load the SPE executable into each SPE context’s local store
  3. Create N PPE threads
     - In each PPE thread, run one SPE context
     - Terminate the PPE thread
  4. Wait for all N PPE threads to terminate
  5. Destroy all N SPE contexts

Multi-threaded (Hello World)

    #include <stdlib.h>
    #include <pthread.h>
    #include <libspe2.h>

    #define N 4

    struct thread_args {
        struct spe_context * spe;
        void * argp;
        void * envp;
    };

    // pthread start routine: must take and return void *
    void * my_spe_thread(void * varg)
    {
        struct thread_args * arg = (struct thread_args *) varg;
        unsigned int runflags = 0;
        unsigned int entry = SPE_DEFAULT_ENTRY;

        // run SPE context
        spe_context_run(arg->spe, &entry, runflags, arg->argp, arg->envp, NULL);

        // done - now exit thread
        pthread_exit(NULL);
    }

    int main()
    {
        pthread_t pts[N];
        spe_context_ptr_t spe[N];
        struct thread_args t_args[N];
        int value[N];
        int i;
        spe_program_handle_t * program;

        // open SPE program
        program = spe_image_open("hello");

        for ( i=0; i<N; i++ ) {
            // create SPE context
            spe[i] = spe_context_create(0, NULL);
            // load SPE program
            spe_program_load(spe[i], program);
            // create pthread, passing a pointer to this thread's arguments
            t_args[i].spe = spe[i];
            t_args[i].argp = &value[i];
            t_args[i].envp = NULL;
            pthread_create(&pts[i], NULL, my_spe_thread, &t_args[i]);
        }

        // wait for all threads to finish
        for ( i=0; i<N; i++ ) {
            pthread_join(pts[i], NULL);
        }

        // close SPE program
        spe_image_close(program);

        // destroy SPE contexts
        for ( i=0; i<N; i++ ) {
            spe_context_destroy(spe[i]);
        }
        return 0;
    }

Communication Mechanisms
• DMA transfers
  • Move data and instructions between main storage and LS
• Mailboxes
  • Communication between SPE and PPE or other devices
  • Hold 32-bit messages
  • 2 mailboxes for sending (1 entry each)
  • 1 mailbox for receiving (4 entries)
• Signal notification
  • 32-bit registers

DMA Get/Put Commands
• Data is moved between an effective address and the local store
  • The effective address is typically in main memory, but can be in another SPE's LS

    mfc_put(lsaddr, ea, size, tag, tid, rid)
    mfc_get(lsaddr, ea, size, tag, tid, rid)

• lsaddr: address in the SPU local store (destination for get, source for put)
• ea: effective address, i.e., main memory address (64 bits)
• size: size of the transfer in bytes
• tag: tag-group ID identifying this transfer (32 tag groups, 0 to 31)
• tid: transfer-class ID
• rid: replacement-class ID

DMA Read into Local Store

    #include <spu_mfcio.h>

    static inline void dma_mem_to_ls(unsigned int mem_addr, volatile void *ls_addr, unsigned int size)
    {
        unsigned int tag = 0;
        unsigned int mask = 1;                         // bit 0 selects tag group 0

        mfc_get(ls_addr, mem_addr, size, tag, 0, 0);   // read contents of mem_addr into ls_addr
        mfc_write_tag_mask(mask);                      // set tag mask
        mfc_read_tag_status_all();                     // wait for all tagged DMA to complete
    }

DMA Write to Main Memory

    #include <spu_mfcio.h>

    static inline void dma_ls_to_mem(unsigned int mem_addr, volatile void *ls_addr, unsigned int size)
    {
        unsigned int tag = 0;
        unsigned int mask = 1;                         // bit 0 selects tag group 0

        mfc_put(ls_addr, mem_addr, size, tag, 0, 0);   // write contents of ls_addr to mem_addr
        mfc_write_tag_mask(mask);                      // set tag mask
        mfc_read_tag_status_all();                     // wait for all tagged DMA to complete
    }
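The relationship between a tag and its mask (tag 0 with mask 1 in the helpers above) can be modelled with a bit field. The sketch below is portable C, not MFC code: `tag_state_t`, `tag_mask`, `dma_issue`, `dma_complete`, and `tags_all_done` are hypothetical names, and tags are assumed to be below 32.

```c
#include <stdint.h>

/* Hypothetical model: bit t set means tag group t still has DMA in flight. */
typedef struct { uint32_t pending; } tag_state_t;

/* Mask bit that selects one tag group (tag must be below 32). */
static uint32_t tag_mask(unsigned tag) { return (uint32_t)1 << tag; }

static void dma_issue(tag_state_t *s, unsigned tag)    { s->pending |= tag_mask(tag); }
static void dma_complete(tag_state_t *s, unsigned tag) { s->pending &= ~tag_mask(tag); }

/* Nonzero when every tag group selected by mask has finished,
   which is the condition a "wait for all" status read blocks on. */
static int tags_all_done(const tag_state_t *s, uint32_t mask)
{
    return (s->pending & mask) == 0;
}
```

Waiting on several transfers at once is then a matter of OR-ing their masks together before the status read.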

Double Buffer Example
• Handling DMA latency is critical to overall performance
• Data prefetching is a key technique to hide DMA latency
[Figure: timeline of double buffering. While the SPE executes Func(input n) on input buffer 1 and output buffer 1, DMAs fill input buffer 2 with input n+1 and drain output n-1 from output buffer 2. Over time the SPE executes Func(input n-1), Func(input n), Func(input n+1) while the DMAs move output n-2 and input n, then output n-1 and input n+1, then output n and input n+2]

Double Buffer Example

    /* Example C code demonstrating double buffering using buffers B[0] and B[1].
       In this example, an array of data starting at the effective address ea is
       DMAed into the SPU's local store in 4 KB chunks and processed by the
       use_data subroutine. */
    #include <spu_intrinsics.h>
    #include <spu_mfcio.h>

    #define BUFFER_SIZE 4096

    volatile unsigned char B[2][BUFFER_SIZE] __attribute__ ((aligned(128)));

    /* use_data is not shown on the slide; this prototype is an assumption. */
    extern void use_data(volatile unsigned char *buf);

    void double_buffer_example(unsigned int ea, int buffers)
    {
        int next_idx, idx = 0;

        // Initiate first DMA transfer
        spu_mfcdma32(B[idx], ea, BUFFER_SIZE, idx, MFC_GET_CMD);
        ea += BUFFER_SIZE;

        while (--buffers) {
            next_idx = idx ^ 1;   // toggle buffer index
            // Start the next transfer, tagged with the buffer it fills
            spu_mfcdma32(B[next_idx], ea, BUFFER_SIZE, next_idx, MFC_GET_CMD);
            ea += BUFFER_SIZE;
            spu_writech(MFC_WrTagMask, 1 << idx);
            (void)spu_mfcstat(MFC_TAG_UPDATE_ALL);   // Wait for previous transfer done
            use_data(B[idx]);     // Use the previous data
            idx = next_idx;
        }

        spu_writech(MFC_WrTagMask, 1 << idx);
        (void)spu_mfcstat(MFC_TAG_UPDATE_ALL);       // Wait for last transfer done
        use_data(B[idx]);         // Use the last data
    }
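For readers without Cell hardware, the control flow of the double-buffer loop can be tried in portable C. This is only a sketch: `memcpy` stands in for the asynchronous DMA get, so no real overlap of transfer and compute happens, and `BUF_ELEMS`, `use_data`, and `double_buffer_sum` are hypothetical names chosen for the illustration.

```c
#include <stddef.h>
#include <string.h>

#define BUF_ELEMS 4   /* tiny buffers keep the example easy to follow */

/* Hypothetical consumer: sums one buffer. */
static long use_data(const int *buf)
{
    long sum = 0;
    for (size_t i = 0; i < BUF_ELEMS; i++)
        sum += buf[i];
    return sum;
}

/* Process nbuf consecutive chunks of src (nbuf must be at least 1),
   alternating between buffers B[0] and B[1]. */
static long double_buffer_sum(const int *src, size_t nbuf)
{
    int B[2][BUF_ELEMS];
    size_t idx = 0;
    long total = 0;

    memcpy(B[idx], src, sizeof B[idx]);                /* fetch the first chunk */
    src += BUF_ELEMS;
    while (--nbuf) {
        size_t next_idx = idx ^ 1;                     /* toggle buffer index */
        memcpy(B[next_idx], src, sizeof B[next_idx]);  /* start the next fetch */
        src += BUF_ELEMS;
        total += use_data(B[idx]);                     /* work on the current chunk */
        idx = next_idx;
    }
    total += use_data(B[idx]);                         /* process the last chunk */
    return total;
}
```

On an SPE the second copy would be issued as a DMA and would proceed while `use_data` runs, which is where the latency hiding comes from.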


Mailboxes
• Communicate messages up to 32 bits in length
  • E.g., buffer completion flags or program status
  • E.g., when an SPE places results in main storage via DMA
    • The SPE can wait until the DMA transfer completes, then write to its outbound mailbox to notify the PPE
• Short data transfers
  • Storage addresses, function parameters
• Can be used to communicate between SPEs, the PPE, or other devices
  • Privileged software needs to allow one SPE to access a mailbox register in another SPE
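A mailbox behaves like a very small FIFO of 32-bit messages. The portable C sketch below models a 4-entry mailbox; `mailbox_t`, `mbox_write`, and `mbox_read` are hypothetical names, not the SDK's mailbox calls. The model simply rejects a write to a full mailbox, whereas real hardware behaviour on a full or empty mailbox depends on which side is accessing it and is not modelled here.

```c
#include <stdint.h>

#define MBOX_DEPTH 4   /* depth of the receiving mailbox given in the slides */

/* Hypothetical FIFO model of one mailbox. */
typedef struct {
    uint32_t entry[MBOX_DEPTH];
    unsigned head;    /* index of the oldest message */
    unsigned count;   /* number of messages currently queued */
} mailbox_t;

/* Queue one 32-bit message. Returns 1 on success, 0 if the mailbox is full. */
static int mbox_write(mailbox_t *m, uint32_t msg)
{
    if (m->count == MBOX_DEPTH)
        return 0;
    m->entry[(m->head + m->count) % MBOX_DEPTH] = msg;
    m->count++;
    return 1;
}

/* Remove the oldest message. Returns 1 on success, 0 if the mailbox is empty. */
static int mbox_read(mailbox_t *m, uint32_t *msg)
{
    if (m->count == 0)
        return 0;
    *msg = m->entry[m->head];
    m->head = (m->head + 1) % MBOX_DEPTH;
    m->count--;
    return 1;
}
```

A completion flag or a buffer address fits in one entry, which matches the short-message uses listed above.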
