CBE Tutorial: The Toy Version. The Basics
What is covered • Basic SPE thread creation • Aligned Data transfers • To and from the SPE • Unaligned Data transfers • From Main memory to SPE • Mailbox signals and uses • Communication using the mailbox channels
Outline • Introduction to CBE • Running Code • Example: Hello, World • Example: Aligned Data • Example: Unaligned data • Example: Mailbox communications • Appendix A: API Parts
Intro to CBE • The Cell Broadband Engine • Created by STI • Composed of nine computing elements
PPE • The brain of the system • Organizer • Runs Linux • PowerPC dual-issue architecture
SPE • A modified vector architecture • Limited memory: 256 KiB • All accesses are to and from this local memory • Main memory is accessed through DMA transfers
MFC • Each SPE has an MFC unit • Issues and receives DMA transfers to and from main memory • Gatekeeper of the bus
Bus • Four rings • Has QoS in a limited fashion
Memory Interface • Maintains coherency and consistency between all memory units (the MFC, main memory and PPE caches, but not across the local memory of the SPEs)
[Diagram labels: SPE, SPE, PPE, MFC, MFC, Flex IO, RAM, BEI, PPSS, Memory Interface]
Running Code: The Hardware Way
• SPU C code test.c: spu-gcc -o test test.c (produces the SPU executable test)
• Embed the SPU executable: ppu32-embedspu -m32 test test test-embed.o (produces the embedded object test-embed.o)
• Build the embedded library: ppu-ar -qcs spu-lib.a test-embed.o (produces spu-lib.a)
• PPU C code driver.c: gcc -c driver.c (produces the PPU object file driver.o)
• Link: gcc driver.o spu-lib.a -lspe -o driver (produces the PPU executable driver)
• Command line: ./driver
Example 1: Hello, World • To learn about the PPU thread management functions and thread structures • [Diagram: the PPE starts (exec) a program on each SPE with spe_create_thread and then waits for each one with spe_wait]
Functions • spe_program_handle_t • Type: Structure • Location: libspe.h [PPU side] • Usage: Keeps details about the SPE program. Note that its name should be the same as the one provided in ppu-embedspu command. • speid_t • Type: Type definition • Location: libspe.h [PPU side] • Usage: Keeps some thread specific information.
Functions API Parts • speid_t spe_create_thread (spe_gid_t gid, spe_program_handle_t *handle, void *argp, void *envp, unsigned long mask, int flags) • Type: Function – Thread management • Location: libspe.h [PPU Side] • Input: • gid Group id for this thread (usually zero) • handle A pointer to the SPU program information • argp A pointer to the arguments for the SPU thread • envp A pointer to an environment (global variables, shell settings, etc.). Usually zero • mask Affinity of the processor that should run the thread. Usually -1 (all SPUs equally probable) • flags Bitwise flags for thread-specific properties. Usually zero • Output: • speid_t Returns a unique identifier for the thread (plus some extra information) • Usage: Creates a thread that will run on an SPU. It loads the SPU image into the SPU, begins execution there, and returns to the PPU • POSIX Equivalent: pthread_create
Functions API Parts • int spe_wait(speid_t speid, int *status, int options) • Type: Function - Thread management • Location: libspe.h [PPU Side] • Input: • speid The identifier for the thread • status A pointer that receives the return status of the thread • options Options for the waiting behavior. Usually zero • Output: • int Zero represents a correct return value. Any other value represents an error. • Usage: Blocks the calling PPU thread until the SPE thread identified by speid has finished • POSIX Equivalent: pthread_join
Example 1: Hello, World
PPU C Source Code:
    #include <stdlib.h>
    #include <stdio.h>
    #include <libspe.h>

    extern spe_program_handle_t test;   /* SPU binary handle */
    #define THREADS 6

    int main(){
        speid_t speid[THREADS];         /* SPU identifiers */
        int status, i;
        /* Create the SPU threads with the binary */
        for(i = 0; i < THREADS; ++i){
            speid[i] = spe_create_thread(0, &test, 0, NULL, -1, 0);
            if(speid[i] == NULL) return 1;
        }
        /* Wait for the threads to finish */
        for(i = 0; i < THREADS; ++i)
            spe_wait(speid[i], &status, 0);
        return 0;
    }
SPU C Source Code:
    #include <stdio.h>
    int main (){
        printf("Hello, World \n");
        return 0;
    }
Output window: "Hello, World" printed six times, once per SPU thread
Example 2: Aligned Data • To learn about DMA transfers with aligned data • [Diagram: mfc_get moves data from main memory to local storage; mfc_put moves data from local storage to main memory]
Example 2: Aligned Data • Explicit Rule 1: The PPU and SPU addresses (the sending and the receiving location) MUST both be aligned to 16 bytes or be naturally aligned.
• Sender 0x10BF0, receiver 0x2230: both addresses are divisible by 16 (the last hexadecimal digit of each address is zero)
• Sender 0x10BF3, receiver 0x22F3: the last hexadecimal digits of both addresses are the same
• Sender 0x10BF3, receiver 0x2234: not aligned
Example 2: Aligned Data • Explicit Rule 2: Any address in main memory is assumed to be a 64-bit address (!). Thus main memory pointers are of the unsigned long long type • Explicit Rule 3: A single DMA transfer should be no larger than 16 KiB • Implicit Rule 1: If arrays of structures are used, the size of each structure should be a multiple of 16 bytes
The Code • Uses a simple dot product • [Diagram: mfc_get moves data from main memory to local storage; mfc_put moves data from local storage to main memory]
Functions API Parts • mfc_get(ls,ea,size,tag,tid,rid) • Type: Macro – DMA communication • Location: spu_mfcio.h [SPE Side] • Input: • ls The starting address in the SPE LS memory • ea The starting address in the main memory (64-bit) • size The number of bytes to transfer • tag DMA tag • tid Transfer class id (usually zero) • rid Replacement class id (usually zero) • Usage: Initiates a DMA transfer of size bytes from main memory to this SPE's local storage. • POSIX Equivalent: None
Functions API Parts • mfc_put(ls,ea,size,tag,tid,rid) • Type: Macro – DMA communication • Location: spu_mfcio.h [SPE Side] • Input: • ls The starting address in the SPE LS memory • ea The starting address in the main memory (64-bit) • size The number of bytes to transfer • tag DMA tag • tid Transfer class id (usually zero) • rid Replacement class id (usually zero) • Usage: Initiates a DMA transfer of size bytes from this SPE's local storage to main memory. • POSIX Equivalent: None
Functions API Parts • mfc_write_tag_mask(mask) • Type: Macro – DMA communication • Location: spu_mfcio.h [SPE Side] • Input: • mask A bit mask that selects the tag groups of the DMA transfers to wait for • Usage: Writes a mask with the tags of the current DMA transfers. Used in conjunction with mfc_read_tag_status_all to wait for a DMA transfer (or a group of them) to finish. • POSIX Equivalent: None
Functions API Parts • mfc_read_tag_status_all() • Type: Macro – DMA communication • Location: spu_mfcio.h [SPE Side] • Usage: Reads the tag status of all DMA transfers that were selected in the mfc_write_tag_mask call. It is a blocking operation that completes only when all selected DMA transfers have completed. • POSIX Equivalent: None
PPE Code
    /* Include files and number of threads */
    …
    #define CMPTS 16384                      /* number of elements */
    typedef unsigned long long ea_t;         /* 64-bit data type */

    /* Structure for parameter passing */
    typedef struct {
        ea_t incomming_array1[4];            /* addresses of the first incoming array */
        ea_t incomming_array2[4];            /* addresses of the second incoming array */
        ea_t outgoing_array;                 /* address of the result array */
        int size;                            /* size of the DMA chunk */
        int id;                              /* thread id */
    } params;

    /* Buffers and variables for DMA communication; the attribute makes the variables aligned */
    int vec1[CMPTS] __attribute__((aligned(16)));
    int vec2[CMPTS] __attribute__((aligned(16)));
    int vect[THREADS * 4] __attribute__((aligned(16)));
    params p[THREADS] __attribute__((aligned(16)));
PPE Code
    /* Utility functions */
    …
    /* The dot product function */
    int dotp(int vec1[], int vec2[], int size){
        int i, res = 0;
        for(i = 0; i < size; ++i)
            res += vec1[i] * vec2[i];
        return res;
    }
    …
    int main(){
        /* Variable declarations */
        …
        size = CMPTS / THREADS;              /* size of a thread chunk */
        /* Filling the vectors */
        …
        /* Filling the parameter structure: divide the vector arrays (incoming and
           outgoing), decide the size and assign the thread id */
        for(i = 0; i < THREADS; ++i){
            for(j = 0; j < 4; ++j){
                p[i].incomming_array1[j] = (ea_t)(&vec1[i*size + j*(size/4)]);
                p[i].incomming_array2[j] = (ea_t)(&vec2[i*size + j*(size/4)]);
            }
            p[i].outgoing_array = (ea_t)(&vect[i*4 + 0]);
            p[i].size = size / 4;
            p[i].id = i;
        }
PPE Code
        /* Thread creation and waiting */
        …
        speid[status] = spe_create_thread (0, &test, (ea_t)(&p[status]), NULL, -1, 0);
        …
        /* Reduce all the thread results to a single value */
        size = 0;
        for(i = 0; i < THREADS; ++i){
            for(j = 0; j < 4; ++j)
                size += vect[i*4 + j];
        }
        /* Calculate the correct result in the PPE and compare it with the one
           obtained from the threads */
        rem = dotp(vec1, vec2, CMPTS);
        if(rem == size)
            printf("Sucess :) with %d \n", rem);
        else
            printf("Failure :P (%d != %d) \n", rem, size);
        return 0;
    }
SPE Code
    #include <spu_mfcio.h>                   /* the DMA and communication header file */

    /* The mirror image of the PPE parameter structure */
    typedef struct {
        ea_t incomming_array1[4];
        ea_t incomming_array2[4];
        ea_t outgoing_array;
        int size;
        int id;
    } params ;

    /* Buffers and variables involved in the DMA transfers */
    params pp __attribute__((aligned(16)));
    int *arr1 __attribute__((aligned(16)));
    int *arr2 __attribute__((aligned(16)));
    int vecr[4] __attribute__((aligned(16)));

    /* The mirror image of the PPE dot product function */
    int dotp(int *vec1, int *vec2, int size){ … }
SPE Code
    /* The PPE parameter structure is passed as one of the parameters to the main function */
    int main (ea_t speid, ea_t argp, ea_t envp){
        …
        /* The mfc_get that loads the PPE parameter structure */
        mfc_get(&pp, argp, sizeof(params), 31, 0, 0);
        mfc_write_tag_mask(1u << 31);        /* unsigned shift: 1 << 31 overflows a signed int */
        mfc_read_tag_status_all();
        …
        /* memalign is used to create aligned blocks in the heap */
        arr1 = (int *)memalign(16, sizeof(int) *size);
        arr2 = (int *)memalign(16, sizeof(int) *size);
        for(i = 0; i < 4; ++i){
            /* The mfc_get that loads the first vector */
            mfc_get(arr1, pp.incomming_array1[i], sizeof(int) * size, 31, 0, 0);
            mfc_write_tag_mask(1u << 31);
            mfc_read_tag_status_all();
            …
            vecr[i] = dotp(arr1, arr2, size);
        }
        /* The mfc_put that stores the results into main memory */
        mfc_put(vecr, pp.outgoing_array, sizeof(int) * 4, 31, 0, 0);
        mfc_write_tag_mask(1u << 31);
        mfc_read_tag_status_all();
        …
        return 0;
    }
Example 3: Unaligned Data • To learn about DMA transfers with unaligned data (copying) • [Diagram: mfc_get moves data from main memory to local storage; mfc_put moves data from local storage to main memory]
PPE Code • It is the same as the aligned example
    …
    int vec1[CMPTS] __attribute__((aligned(32)));
    int vec2[CMPTS] __attribute__((aligned(32)));
    …
SPE Code • Make sure that the buffers are unaligned
    …
    int *arr1 __attribute__((aligned(8)));
    int *arr2 __attribute__((aligned(8)));
    …
    /* The slide passes 7 here, but memalign requires a power-of-two alignment */
    arr1 = (int *)memalign(8, sizeof(int) *size);
    arr2 = (int *)memalign(8, sizeof(int) *size);
    …
    /* Replace the mfc_get calls with these calls */
    xfer_data_unaligned(arr1, pp.incomming_array1[i], sizeof(int) * size);
    xfer_data_unaligned(arr2, pp.incomming_array2[i], sizeof(int) * size);
SPE Code
    unsigned char buff[128] __attribute__ ((aligned(16)));

    void xfer_data_unaligned(void *ls, GA ga, int size){
        int left;
        unsigned int mask = 0x8;
        GA tga;
        int rem;
        /* Get the first chunk of memory that is not aligned
           (by calculating its address and size) */
        tga = ga & ~0xFULL;
        rem = ga & 0xFULL;
        global_id = (global_id + 1) & 0x1F;
        /* (Over)load the 16 bytes that contain the leading bytes that are
           not aligned, and wait for the transfer to end */
        mfc_get(buff, tga, 16, global_id, 0, 0);
        mfc_write_tag_mask(1u << global_id);
        mfc_read_tag_status_all();
        /* Copy the temporary buffer to its final location
           (both the buffer and the location should be aligned) */
        memcpy(ls, (void *)((unsigned int)(buff)+(unsigned int)(rem)), 16 - rem);
        ls += 16 - rem;
        size -= 16 - rem;
        ga += (GA)(16 - rem);
        /* the function continues on the next slide */
SPE Code
        /* Load the rest of the data in chunks of 16 or 128 bytes, in series */
        while( size >= 16 ){
            left = (size >= 128) ? 128 : 16;
            mfc_get(buff, ga, left, global_id, 0, 0);
            mfc_write_tag_mask(1u << global_id);
            mfc_read_tag_status_all();
            /* Copy the temporary buffer to its final location */
            memcpy(ls, buff, left);
            ls += left;
            size -= left;
            ga += (GA)(left);
        }
        /* (Over)load the last 16 bytes, which contain the final data */
        if( size > 0 ){
            mfc_get(buff, ga, 16, global_id, 0, 0);
            mfc_write_tag_mask(1u << global_id);
            mfc_read_tag_status_all();
            memcpy(ls, buff, size);
        }
    }
Example 5: Mailbox Communications • Learn about mailboxes and their role in SPU communication • SPU side: spu_readch, spu_readchcnt, spu_writech • PPU side: spe_write_in_mbox, spe_stat_in_mbox, spe_read_out_mbox, spe_stat_out_mbox • [Diagram: one PPU exchanging mailbox messages with eight SPUs]
The Code • [Diagram: mfc_get moves data from main memory to local storage; results travel through the mailbox]
The Code • PPU: wait for a result, then send the signal to continue; repeat until the completion signal arrives • SPU: compute, send the result, wait for the signal to continue; after the last result, send the completion signal
Functions API Parts • int spe_write_in_mbox (speid_t speid, unsigned int data); • Type: Function – Mailbox communication • Location: libspe.h [PPU Side] • Input: • speid The identifier of the thread to which the mailbox message will be delivered • data The data that will be passed to the SPE • Output: • int Returns 0 on success and -1 on failure • Usage: Writes a 32-bit value into a mailbox belonging to a given SPE thread • POSIX Equivalent: pthread_cond_signal (?)
Functions API Parts • int spe_stat_in_mbox (speid_t speid); • Type: Function – Mailbox communication • Location: libspe.h [PPU Side] • Input: • speid The identifier of the thread whose mailbox will be checked • Output: • int Returns 0 if the mailbox is full; otherwise returns the number of entries that are available. • Usage: Checks the status of the mailbox for a given thread • POSIX Equivalent: None
Functions API Parts • unsigned int spe_read_out_mbox (speid_t speid); • Type: Function – Mailbox communication • Location: libspe.h [PPU Side] • Input: • speid The identifier of the thread from which the mailbox entry will be obtained. • Output: • unsigned int Returns the value from the outbound mailbox of the given SPE or -1 if no data is available. • Usage: Returns the element in the SPE outbound mailbox • POSIX Equivalent: None
Functions API Parts • int spe_stat_out_mbox (speid_t speid) • Type: Function – Mailbox communication • Location: libspe.h [PPU Side] • Input: • speid The identifier of the thread from which the status will be read • Output: • int Returns 0 if the mailbox is empty; otherwise returns the number of unread messages in the mailbox • Usage: Returns the status of the SPE outbound mailbox • POSIX Equivalent: None
Functions API Parts • spu_readch(imm) • Type: Macro – Mailbox communication • Location: spu_internals.h [SPE Side] • Input: • imm A channel identifier • Output: Returns the channel contents • Usage: Returns the contents of the channel identified by imm. • POSIX Equivalent: None
Functions API Parts • spu_readchcnt(imm) • Type: Macro – Mailbox communication • Location: spu_internals.h [SPE Side] • Input: • imm A channel identifier • Output: Returns the number of entries in this channel • Usage: Returns the number of entries in the channel identified by imm • POSIX Equivalent: None
Functions API Parts • spu_writech (imm, ra) • Type: Macro – Mailbox communication • Location:spu_internals.h [SPE Side] • Input: • imm A channel identifier • ra The value to be written to the channel • Usage: Write the value ra to the channel identified by imm. • POSIX Equivalent: None
PPE Code
    void check_for_results(speid_t spe[]) {
        unsigned char flags[THREADS];
        int i, val, total = 0;
        for(i = 0; i < THREADS; ++i){
            flags[i] = 1;
        }
        while (total != THREADS){
            for(i = 0; i < THREADS; ++i){
                if(flags[i] == 0) continue;
                /* Check if a mailbox has been written and then consume the value */
                if(spe_stat_out_mbox(spe[i]) != 0){
                    val = spe_read_out_mbox(spe[i]);
                    if(val != -1){
                        vect += val;
                        /* Write a signal to tell the SPE that its value has been received */
                        spe_write_in_mbox(spe[i], -1);
                    }
                    else{
                        flags[i] = 0;
                        total++;
                    }
                }
            }
        }
    }
PPE Code
    …
    for(status = 0; status < THREADS; ++status){
        speid[status] = spe_create_thread(0, &test, (ea_t)(&p[status]), NULL, -1, 0);
        if(speid[status] == NULL) return 1;
    }
    /* Blocking function that will check for mailbox messages */
    check_for_results(speid);
    for(i = 0; i < THREADS; ++i)
        spe_wait(speid[i], &status, 0);
    …
SPE Code
    /* Send the result and wait for acknowledgment */
    int send_result(int vec) {
        spu_writech(SPU_WrOutMbox, vec);
        while(!spu_readchcnt(SPU_RdInMbox));
        return spu_readch(SPU_RdInMbox);
    }

    /* Send the termination signal and return */
    void send_end() {
        int vec = -1;
        spu_writech(SPU_WrOutMbox, vec);
    }
SPE Code
    for(i = 0; i < 4; ++i){
        …
        vecr = dotp(arr1, arr2, size);
        send_result(vecr);       /* send the partial result to the PPE */
    }
    …
    send_end();                  /* send the termination signal to the PPE and terminate */
    return 0;
One More Thing!!!! The output of all programs is: Sucess :) with 6020
Appendix A • API Parts • Structures • Function to create and manage threads • PPU / SPU DMA Functions • PPU / SPU mailbox functions