  1. John Cavazos Dept of Computer & Information Sciences University of Delaware www.cis.udel.edu/~cavazos/cisc879 Lecture 9 Cell Programming Tutorial

  2. Lecture 9: Overview • Cell Basics • Programming Models • Programming Details • Example Code

  3. Cell Architecture Recap • Heterogeneous architecture • 9 cores on chip • 1 PPE (general-purpose processor) • 8 SPEs (SIMD processors) • The PPE runs control-plane code • Code with lots of branches (e.g., an OS) • The SPEs run data-plane code • Computational code with few branches

  4. Program Structure • Multiple programs in one • PPU and SPU programs cooperate • PPE Code • Regular Linux process (main thread) • Process can spawn SPE threads • SPE Code • SPE executables are packaged inside PPE executables
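
  A hedged sketch of the "packaged inside" point above: with the usual embedding flow, the SPU program is compiled with spu-gcc, wrapped into a PPU object (e.g., with the ppu-embedspu tool), and linked into the PPE executable, which can then reference the embedded image through an extern handle instead of opening a separate file. The symbol name hello_spu below is a placeholder chosen at embed time, not something defined by the slides.

    /* PPU side -- assumes the SPU program was embedded under the
       (hypothetical) symbol name hello_spu. */
    #include <libspe2.h>

    extern spe_program_handle_t hello_spu;    /* provided by the embedded SPU image */

    int run_embedded_spu(void)
    {
        unsigned int entry = SPE_DEFAULT_ENTRY;
        spe_context_ptr_t spe = spe_context_create(0, NULL);

        spe_program_load(spe, &hello_spu);     /* no spe_image_open/close needed */
        spe_context_run(spe, &entry, 0, NULL, NULL, NULL);
        return spe_context_destroy(spe);
    }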

  5. SPE Details • Register file: Large (128 entries), 128-bit, and unified • All instructions are SIMD instructions • Local Store (256 KB) • Loads/Stores access LS • Contains all Instructions/Data used by SPU • DMA transfers data between LS and main storage • High bandwidth (128 bytes per cycle) • Eliminate non-deterministic features • Out-of-order execution • Hardware-managed caches • Hardware branch prediction

  6. SPE Register Layout The left-most word (bytes 0, 1, 2, and 3) of a register is called the preferred slot
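
  As a small illustrative sketch (not from the slides), the preferred slot is where scalar values live in SPU code; the spu_promote and spu_extract intrinsics move a scalar into and out of word element 0 of a vector register.

    #include <spu_intrinsics.h>

    /* Place a scalar into the preferred slot (word element 0). */
    vector float to_preferred_slot(float s)
    {
        return spu_promote(s, 0);
    }

    /* Read the scalar back out of the preferred slot. */
    float from_preferred_slot(vector float v)
    {
        return spu_extract(v, 0);
    }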

  7. SPE SIMD Example (add) • Example is a 4-wide add • each of the 4 elements in reg VA is added to the corresponding element in reg VB • the 4 results are placed in the appropriate slots in reg VC
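
  In SPU C code this 4-wide add is a single intrinsic; a minimal sketch (register/variable names are illustrative):

    #include <spu_intrinsics.h>

    vector float simd_add(vector float va, vector float vb)
    {
        /* Each of the 4 float elements of va is added to the corresponding
           element of vb; the 4 results land in the matching slots of vc. */
        vector float vc = spu_add(va, vb);
        return vc;
    }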

  8. SPE SIMD Example (shuffle) • Bytes selected from regs VA and VB based on control vector VC • Control vector entries are indices of VA and VB • Operation is purely byte oriented
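
  A minimal sketch of the shuffle as an SPU intrinsic (the control pattern below is just an example): control bytes 0x00-0x0F select bytes of VA and 0x10-0x1F select bytes of VB.

    #include <spu_intrinsics.h>

    vector unsigned char shuffle_example(vector unsigned char va, vector unsigned char vb)
    {
        /* This control vector interleaves the first 8 bytes of va with the
           first 8 bytes of vb (0x00-0x0F index va, 0x10-0x1F index vb). */
        vector unsigned char vc = (vector unsigned char) {
            0x00, 0x10, 0x01, 0x11, 0x02, 0x12, 0x03, 0x13,
            0x04, 0x14, 0x05, 0x15, 0x06, 0x16, 0x07, 0x17 };
        return spu_shuffle(va, vb, vc);
    }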

  9. SPE model • Code and data must fit into 256-KB local store • Explicit input/output of SPE program • Program arguments and return code • DMA • Mailboxes • Signals • [Diagram: the PPE maps system memory for SPE DMA; DMA transactions move data between the SPE program's local store and system memory]
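
  One concrete piece of this interface, as a hedged sketch: when the PPE calls spe_context_run(spe, &entry, flags, argp, envp, NULL), the SPU program's main receives the SPE id plus those two values as 64-bit quantities (typically effective addresses it can DMA from).

    /* SPU side -- sketch of how argp/envp from spe_context_run arrive. */
    #include <stdio.h>

    int main(unsigned long long speid,   /* id of this SPE context           */
             unsigned long long argp,    /* value passed as argp on the PPE  */
             unsigned long long envp)    /* value passed as envp on the PPE  */
    {
        printf("spe %llx: argp=%llx envp=%llx\n", speid, argp, envp);
        /* Typically argp is the effective address of a control block in
           system memory that the SPU would mfc_get into its local store. */
        return 0;
    }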

  10. SPE Model (cont’d) • Streaming model for large-size input/output data • [Diagram: large arrays int g_ip[512*1024] and int g_op[512*1024] in system memory are streamed by DMA through small local-store buffers int ip[32] and int op[32], with the SPE program computing op = func(ip)]
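
  A hedged sketch of the streaming pattern in this diagram, on the SPU side: small aligned local-store buffers are filled from the large input array by DMA, processed, and written back to the large output array. Here ea_ip/ea_op stand for the effective addresses of g_ip/g_op (assumed to be passed in by the PPE), func() is a placeholder computation, and the loop is single-buffered for simplicity; the double-buffering slides later overlap the transfers with computation.

    #include <spu_mfcio.h>

    #define CHUNK 32                     /* ints per DMA chunk, as in the diagram */

    static volatile int ip[CHUNK] __attribute__ ((aligned(128)));
    static volatile int op[CHUNK] __attribute__ ((aligned(128)));

    extern int func(int x);              /* placeholder per-element computation */

    void stream_chunks(unsigned long long ea_ip, unsigned long long ea_op, int nchunks)
    {
        unsigned int tag = 0;
        for (int n = 0; n < nchunks; n++) {
            /* Pull one chunk of the large input array into the local store. */
            mfc_get(ip, ea_ip + n * sizeof(ip), sizeof(ip), tag, 0, 0);
            mfc_write_tag_mask(1 << tag);
            mfc_read_tag_status_all();

            for (int i = 0; i < CHUNK; i++)
                op[i] = func(ip[i]);

            /* Push the results back out to the large output array. */
            mfc_put(op, ea_op + n * sizeof(op), sizeof(op), tag, 0, 0);
            mfc_write_tag_mask(1 << tag);
            mfc_read_tag_status_all();
        }
    }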

  11. Programming Models • How the application and data are partitioned among the PPE and SPEs • Partitioning involves considering • Program structure • Data structures • Data and code movement via DMA • Several models: • Data-parallel • Task-parallel • Job queue

  12. Job Queue • Code and data packaged together as jobs • [Diagram: a job queue in system memory holds code/data entries n, n+1, n+2, ...; the SPE kernel DMAs code n and data n into its local store]
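
  A hedged sketch of what the SPE-kernel side of this model can look like (the job_t layout and run_job() are illustrative, and the jobs are simply walked in order, omitting any shared-queue synchronization): the SPE DMAs the next job descriptor from the queue in system memory, then DMAs that job's data into its local store and processes it.

    #include <spu_mfcio.h>

    /* Illustrative job descriptor; the real layout is application-defined. */
    typedef struct {
        unsigned long long data_ea;      /* effective address of this job's data */
        unsigned int       data_size;    /* size of the data in bytes            */
        unsigned int       job_kind;     /* selects which routine to run         */
    } job_t;

    static volatile job_t job __attribute__ ((aligned(16)));
    static volatile unsigned char buf[4096] __attribute__ ((aligned(128)));

    extern void run_job(unsigned int kind, volatile unsigned char *data);  /* placeholder */

    void spe_kernel(unsigned long long queue_ea, int njobs)
    {
        unsigned int tag = 0;
        for (int n = 0; n < njobs; n++) {
            /* Fetch the next job descriptor from the queue in system memory. */
            mfc_get(&job, queue_ea + n * sizeof(job_t), sizeof(job_t), tag, 0, 0);
            mfc_write_tag_mask(1 << tag);
            mfc_read_tag_status_all();

            /* Fetch the job's data and process it in the local store. */
            mfc_get(buf, job.data_ea, job.data_size, tag, 0, 0);
            mfc_write_tag_mask(1 << tag);
            mfc_read_tag_status_all();

            run_job(job.job_kind, buf);
        }
    }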

  13. Data Parallel • SPE-initiated DMA • Large array of data fed through SPEs • Special case of Job Queue • [Diagram: input/output pairs I0/O0 ... In/On in system memory; the PPE dispatches work to SPE0 ... SPE7, each running Kernel()]

  14. Task Parallel • LS-to-LS DMA • Flexible pipelining of functions • Load balancing is harder • [Diagram: SPE0 ... SPE7 kernels form a pipeline; data flows from system memory I0/O0 ... In/On through the SPEs via LS-to-LS DMA under PPE control]
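
  A hedged sketch of one way the LS-to-LS hand-off can be set up on the PPE side (this particular scheme, and the use of spe_ls_area_get, are assumptions rather than something spelled out on the slide): the PPE obtains the effective address of the producer SPE's local store and passes it to the consumer SPE, which can then mfc_get directly from the other local store.

    /* PPE side -- pass one SPE's local-store address to another SPE via argp. */
    #include <libspe2.h>

    void run_pipeline_stage(spe_context_ptr_t producer, spe_context_ptr_t consumer,
                            spe_program_handle_t *consumer_prog)
    {
        unsigned int entry = SPE_DEFAULT_ENTRY;
        void *producer_ls = spe_ls_area_get(producer);  /* producer's LS mapped into EA space */

        spe_program_load(consumer, consumer_prog);
        /* The consumer's SPU main receives producer_ls as its argp and can use
           it as the ea argument of mfc_get to read the producer's local store. */
        spe_context_run(consumer, &entry, 0, producer_ls, NULL, NULL);
    }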

  15. Cell Terminology • SPE Context • Holds information about Logical SPE • Used by Application • SPE Gang Context • Group of threads with same properties • SPE Event • Events caused by (asynchronously) running SPE threads • Examples: SPE execution stopped, Mailbox messages written/read, DMA operations completed

  16. LibSPE Version 2.0 • [Conceptual diagram relating the main libspe2 objects: a PPE thread (pthread_t) and its PPE thread function, application data (application_data_t) passed as the argp/envp arguments and environment, an SPE context (spe_context_ptr_t) with policy, priority, and events, an SPE gang context, the SPE program (spe_program_handle_t), and the SPE stopinfo]

  17. Single SPE Thread • A simple application uses a single PPE thread • Basic scheme for a simple application using one SPE: 1. Create an SPE context 2. Load the executable object into the SPE context’s local store 3. Run the SPE context (transfers control to the OS, requesting that the context be scheduled onto a physical SPE in the system) 4. Destroy the SPE context • Note: Step 3 is a synchronous call; the calling application thread blocks until the SPE stops and returns.

  18. Single Thread (Hello World)

  SPU:

    #include <stdio.h>

    int main()
    {
        printf("hello world\n");
        return 0;
    }

  PPU:

    #include <stdlib.h>
    #include <libspe2.h>

    int main()
    {
        spe_context_ptr_t spe;
        unsigned int createflags = 0;
        unsigned int runflags = 0;
        unsigned int entry = SPE_DEFAULT_ENTRY;
        void *argp = NULL;
        void *envp = NULL;
        spe_program_handle_t *program;

        program = spe_image_open("hello");
        spe = spe_context_create(createflags, NULL);
        spe_program_load(spe, program);
        spe_context_run(spe, &entry, runflags, argp, envp, NULL);
        spe_image_close(program);
        spe_context_destroy(spe);
    }

  19. Multiple SPE Threads • May want to use multiple SPEs concurrently • Create N PPE threads for N concurrent SPE contexts • Each PPE thread runs a single SPE context • Basic scheme for a simple application running N SPE contexts: 1. Create N SPE contexts 2. Load the SPE executable into each SPE context’s local store 3. Create N PPE threads - In each PPE thread run one SPE context - Terminate the PPE thread 4. Wait for all N PPE threads to terminate 5. Destroy all N SPE contexts

  20. Multi-threaded (Hello World)

    #include <stdlib.h>
    #include <pthread.h>
    #include <libspe2.h>

    #define N 4

    struct thread_args {
        struct spe_context *spe;
        void *argp;
        void *envp;
    };

    void *my_spe_thread(void *ptr)
    {
        struct thread_args *arg = (struct thread_args *) ptr;
        unsigned int runflags = 0;
        unsigned int entry = SPE_DEFAULT_ENTRY;

        // run SPE context
        spe_context_run(arg->spe, &entry, runflags, arg->argp, arg->envp, NULL);
        // done - now exit thread
        pthread_exit(NULL);
    }

    int main()
    {
        pthread_t pts[N];
        spe_context_ptr_t spe[N];
        struct thread_args t_args[N];
        int value[N];
        int i;
        spe_program_handle_t *program;

        // open SPE program
        program = spe_image_open("hello");

        for (i = 0; i < N; i++) {
            // create SPE context
            spe[i] = spe_context_create(0, NULL);
            // load SPE program
            spe_program_load(spe[i], program);
            // create pthread
            t_args[i].spe = spe[i];
            t_args[i].argp = &value[i];
            t_args[i].envp = NULL;
            pthread_create(&pts[i], NULL, my_spe_thread, &t_args[i]);
        }

        // wait for all threads to finish
        for (i = 0; i < N; i++) {
            pthread_join(pts[i], NULL);
        }

        // close SPE program
        spe_image_close(program);

        // destroy SPE contexts
        for (i = 0; i < N; i++) {
            spe_context_destroy(spe[i]);
        }
        return 0;
    }

  27. Communication Mechanisms • DMA transfers • Move data and instructions between main storage and the LS • Mailboxes • Communication between an SPE and the PPE or other devices • Hold 32-bit messages • 2 mailboxes for sending (1 entry each) • 1 mailbox for receiving (4 entries) • Signal notification • 32-bit registers

  28. DMA Get/Put Commands • Data is moved between an effective address and the local store • The effective address is typically in main memory, but can be another SPE’s LS

    mfc_put(lsaddr, ea, size, tag, tid, rid)
    mfc_get(lsaddr, ea, size, tag, tid, rid)

  • lsaddr: source/target address in the SPU local store • ea: effective address, i.e., main-memory address (64 bits) • size: transfer size in bytes • tag: tag group identifying this transfer (32 tag groups available) • tid: transfer-class id • rid: replacement-class id

  29. DMA Read into Local Store

    inline void dma_mem_to_ls(unsigned int mem_addr, volatile void *ls_addr, unsigned int size)
    {
        unsigned int tag = 0;
        unsigned int mask = 1;

        // Read contents of mem_addr into ls_addr
        mfc_get(ls_addr, mem_addr, size, tag, 0, 0);
        // Set tag mask
        mfc_write_tag_mask(mask);
        // Wait for all tagged DMA completed
        mfc_read_tag_status_all();
    }

  30. DMA Write to Main Memory

    inline void dma_ls_to_mem(unsigned int mem_addr, volatile void *ls_addr, unsigned int size)
    {
        unsigned int tag = 0;
        unsigned int mask = 1;

        // Write contents of ls_addr into mem_addr
        mfc_put(ls_addr, mem_addr, size, tag, 0, 0);
        // Set tag mask
        mfc_write_tag_mask(mask);
        // Wait for all tagged DMA completed
        mfc_read_tag_status_all();
    }

  31. Double Buffer Example • Handling DMA latency is critical to overall performance • Data prefetching is a key technique to hide DMA latency • [Diagram: two input/output buffer pairs; while the SPE executes Func(input n), DMAs fill input buffer n+1 and drain output buffer n-1, overlapping computation with data transfer over time]

  32. Double Buffer Example

    /* Example C code demonstrating double buffering using buffers B[0] and B[1].
       In this example, an array of data starting at the effective address ea is
       DMAed into the SPU's local store in 4 KB chunks and processed by the
       use_data subroutine. */
    #include <spu_intrinsics.h>
    #include <spu_mfcio.h>

    #define BUFFER_SIZE 4096

    volatile unsigned char B[2][BUFFER_SIZE] __attribute__ ((aligned(128)));

    extern void use_data(volatile unsigned char *data);   /* defined elsewhere */

    void double_buffer_example(unsigned int ea, int buffers)
    {
        int next_idx, idx = 0;

        // Initiate first DMA transfer
        spu_mfcdma32(B[idx], ea, BUFFER_SIZE, idx, MFC_GET_CMD);
        ea += BUFFER_SIZE;

        while (--buffers) {
            next_idx = idx ^ 1;                    // toggle buffer index
            // Initiate next DMA transfer, tagged with the next buffer's index
            spu_mfcdma32(B[next_idx], ea, BUFFER_SIZE, next_idx, MFC_GET_CMD);
            ea += BUFFER_SIZE;

            spu_writech(MFC_WrTagMask, 1 << idx);  // Wait for previous transfer done
            (void)spu_mfcstat(MFC_TAG_UPDATE_ALL);
            use_data(B[idx]);                      // Use the previous data
            idx = next_idx;
        }

        spu_writech(MFC_WrTagMask, 1 << idx);      // Wait for last transfer done
        (void)spu_mfcstat(MFC_TAG_UPDATE_ALL);
        use_data(B[idx]);                          // Use the last data
    }

  38. Mailboxes • Communicate messages up to 32 bits in length • E.g., buffer-completion flags or program status • E.g., when the SPE places results in main storage via DMA, it can wait until the DMA transfer completes and then write to its outbound mailbox to notify the PPE • Short-data transfers • Storage addresses, function parameters • Can be used to communicate between SPEs, the PPE, or other devices • Privileged software needs to allow one SPE to access the mailbox register in another SPE
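
  A hedged sketch of the notification flow described above, shown as two separate programs in one listing (the "done" value 1 is just an illustrative message): the SPU waits for its result DMA to complete, writes a status word to its outbound mailbox, and the PPE polls that mailbox.

    /* --- SPU program: signal completion through the outbound mailbox --- */
    #include <spu_mfcio.h>

    void notify_ppe_done(unsigned int tag)
    {
        mfc_write_tag_mask(1 << tag);
        mfc_read_tag_status_all();       /* wait for the result DMA to finish */
        spu_write_out_mbox(1);           /* 32-bit "done" message for the PPE */
    }

    /* --- PPE program: poll the SPE's outbound mailbox for the status word --- */
    #include <libspe2.h>

    unsigned int wait_for_spe_done(spe_context_ptr_t spe)
    {
        unsigned int msg = 0;
        while (spe_out_mbox_status(spe) == 0)
            ;                            /* no mailbox entry available yet */
        spe_out_mbox_read(spe, &msg, 1); /* read one 32-bit mailbox entry  */
        return msg;
    }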
