Cell/B.E. Jiří Dokulil
Introduction • Cell Broadband Engine • developed by Sony, Toshiba and IBM • 64bit PowerPC • PowerPC Processor Element (PPE) • runs OS • SIMD • Synergistic Processor Element (SPE) • 8x • computations, no OS • big endian
Memory access • PPE • load & store • cache • SPE • DMA • up to 16 concurrent per SPE • no direct access to memory • no need for out-of-order processing, no speculation • local storage • no cache
PPE • PowerPC Processor Element • PPU (PowerPC Processor Unit) • PPSS (PowerPC Processor Storage Subsystem) • 64-bit, dual-thread PowerPC Architecture RISC core • 2x32KB L1 (instructions and data) • 512KB L2 (unified) • PowerPC instruction set • vector/SIMD extensions – different from SPE • 32x 128bit vector registers
SPE • Synergistic Processor Element • SPU (Synergistic Processor Unit) • MFC (Memory Flow Controller) • RISC, SIMD • Synergistic Processor Unit Instruction Set Architecture • support for DMA and interprocessor messaging • 256KB LS • 128x128bit register file • DMA access to main memory • segment and page tables of PPE • channels • in MFC • unidirectional message-passing interfaces • memory-mapped I/O (MMIO) registers and queues
EIB • Element Interconnect Bus • four 16-byte-wide data rings • transfer 128byte at a time (one PPE cache line) • internal bandwidth 96bytes per clock cycle • latency depends on the number of hops • bus is a ring • half frequency of SPU
DMA • MFCs support naturally aligned DMA transfer sizes of 1, 2, 4, or 8 bytes, and multiples of 16 bytes • maximum transfer size of 16 KB per transfer • DMA list commands can initiate up to 2048 transfers • peak transfer performance • if both the effective addresses and the LS addresses are 128-byte aligned • and the size of the transfer is an even multiple of 128 bytes • SMM (Synergistic Memory Management) unit • processes address translation • access-permission information • data supplied by the PPE operating system
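The size and alignment rules above can be written down as a small checker. This is a minimal sketch in plain C, not part of the Cell SDK: the function names are hypothetical, "even multiple of 128 bytes" is read here as "multiple of 128", and rejecting a zero-byte transfer is an assumption of the sketch.

```c
#include <stdbool.h>
#include <stdint.h>

#define DMA_MAX_TRANSFER 16384u /* 16 KB per single transfer (from the slide) */

/* True if one transfer of `size` bytes between LS address `ls` and
 * effective address `ea` follows the size and alignment rules listed above.
 * Hypothetical helper, not an SDK function. */
bool dma_transfer_ok(uint32_t ls, uint64_t ea, uint32_t size)
{
    if (size == 1 || size == 2 || size == 4 || size == 8) {
        /* Small transfers must be naturally aligned on both sides. */
        return (ls % size == 0) && (ea % size == 0);
    }
    /* Assumption of this sketch: a zero-byte transfer is treated as invalid. */
    if (size == 0 || size > DMA_MAX_TRANSFER || size % 16 != 0)
        return false;
    return (ls % 16 == 0) && (ea % 16 == 0);
}

/* True if the transfer also meets the peak-performance conditions:
 * both addresses 128-byte aligned and the size a multiple of 128 bytes. */
bool dma_transfer_is_peak(uint32_t ls, uint64_t ea, uint32_t size)
{
    return dma_transfer_ok(ls, ea, size)
        && ls % 128 == 0 && ea % 128 == 0 && size % 128 == 0;
}
```

For example, `dma_transfer_is_peak(0x500, 0x10000000ULL, 0x4000)` holds, while moving the LS address to 0x510 keeps the transfer valid but no longer peak-aligned.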
SIMD example
    // 16 iterations of a loop
    int rolled_sum(unsigned char bytes[16])
    {
        int i;
        int sum = 0;
        for (i = 0; i < 16; ++i) {
            sum += bytes[i];
        }
        return sum;
    }
SIMD example cont.
    // Vectorized for vector/SIMD multimedia extension
    int vectorized_sum(unsigned char bytes[16])
    {
        vector unsigned char vbytes;
        union {
            int i[4];
            vector signed int v;
        } sum;
        vector unsigned int zero = (vector unsigned int){0};

        // Perform a misaligned vector load of the 16 bytes.
        vbytes = vec_perm(vec_ld(0, bytes), vec_ld(16, bytes), vec_lvsl(0, bytes));

        // Sum the 16 bytes of the vector
        sum.v = vec_sums((vector signed int)vec_sum4s(vbytes, zero),
                         (vector signed int)zero);

        // Extract the sum and return the result.
        return (sum.i[3]);
    }
Communication • DMA • 2 command queues per SPE • one for commands by SPE • one for commands by PPE and other SPEs • commands have tags (32 different) – status query • one transfer or a list • mailboxes • for each SPE • communication with PPE • 2 outgoing (1 message) • 1 incoming (4 messages) • signals • 2 inbound channels
DMA • put, get • SPE or PPE initiated • tag • 5bit • ordering • out of order • barrier – maintains order (within tag group) • fence – after all previous (within tag group) • simple or lists • lists stored in LS (8bytes per item) -> SPE only • up to 2048 transfers, 16KB each -> 32MB • compare to 256KB LS size
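The list arithmetic above (2048 elements of up to 16 KB each, 8 bytes of LS per element) can be checked with a few lines of C. This is a sketch only; the helper names are hypothetical and the code does not build a real MFC list, it only counts elements and LS bytes.

```c
#include <stdint.h>

#define DMA_MAX_TRANSFER   16384u /* bytes moved per list element */
#define DMA_LIST_MAX_ITEMS 2048u  /* elements per DMA list command */
#define DMA_LIST_ITEM_SIZE 8u     /* bytes of LS used by each list element */

/* Number of list elements needed to move total_bytes.
 * Returns 0 if the request does not fit into a single list
 * (and also for an empty request). Hypothetical helper. */
uint32_t dma_list_items(uint64_t total_bytes)
{
    uint64_t items = (total_bytes + DMA_MAX_TRANSFER - 1) / DMA_MAX_TRANSFER;
    return items <= DMA_LIST_MAX_ITEMS ? (uint32_t)items : 0;
}

/* LS bytes occupied by the list itself (not by the data). */
uint32_t dma_list_ls_bytes(uint32_t items)
{
    return items * DMA_LIST_ITEM_SIZE;
}
```

A full list describes 32 MB of data and itself occupies 16 KB of the 256 KB LS, so the data addressed by one list command can be far larger than what LS holds at once.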
DMA – PPE raw access • MFC registers mapped to virtual address space
    void *ps = get_ps();   // get the problem state – must be mapped by privileged software
    unsigned int ls = 0x500;
    unsigned long long ea = 0x10000000;
    unsigned int size = 0x4000;
    unsigned int tag = 5;
    unsigned int classid = 0;
    unsigned int cmd = MFC_GET_CMD;
    unsigned int cmd_status;
    do {
        *((volatile unsigned int *)(ps + MFC_LSA)) = ls;
        *((volatile unsigned long long *)(ps + MFC_EAH)) = ea;
        *((volatile unsigned int *)(ps + MFC_Size)) = (size << 16) | tag;
        *((volatile unsigned int *)(ps + MFC_ClassID)) = (classid << 16) | cmd;
        /* Read MFC_CMDStatus to enqueue command and check enqueue success. */
        cmd_status = *((volatile unsigned int *)(ps + MFC_CMDStatus)) & 0x3;
    } while (cmd_status);   /* Attempt to enqueue until success */
• only enqueues the command
DMA – PPE raw access cont. • test for completion (poll tag group status)
    void *ps = get_ps();
    unsigned int tag_mask = 1u << 5;
    unsigned int tag_status;
    *((volatile unsigned int *)(ps + Prxy_QueryMask)) = tag_mask;
    __asm__("eieio");   /* force write to Prxy_QueryMask to complete */
    do {
        tag_status = *((volatile unsigned int *)(ps + Prxy_TagStatus));
    } while (!tag_status);
• more tag groups
    unsigned int tag_mask = (1u<<5)|(1u<<14)|(1u<<31);
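Building the query mask for several tag groups is plain bit arithmetic, so it can be sketched independently of the hardware. The helper below is hypothetical (not an SDK call); it uses an unsigned shift because shifting a signed 1 into bit 31 is not well defined in C.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Combine tag identifiers (0..31) into one 32-bit tag group mask.
 * Hypothetical helper for illustration. */
uint32_t tag_group_mask(const unsigned *tags, size_t count)
{
    uint32_t mask = 0;
    for (size_t i = 0; i < count; ++i) {
        assert(tags[i] < 32);           /* tags are 5-bit values */
        mask |= UINT32_C(1) << tags[i]; /* unsigned shift, safe for tag 31 */
    }
    return mask;
}
```

With the tags 5, 14 and 31 used above, the resulting mask is 0x80004020.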
DMA – SPE • no direct access to the virtual address space • only by DMA • direct access to own command channels • wrch assembly instruction
    extern void dma_transfer(volatile void *lsa,    // local storage address
                             unsigned int eah,      // high 32-bit effective address
                             unsigned int eal,      // low 32-bit effective address
                             unsigned int size,     // transfer size in bytes
                             unsigned int tag_id,   // tag identifier (0-31)
                             unsigned int cmd);     // DMA command
in assembler:
    wrch $MFC_LSA, $3
    wrch $MFC_EAH, $4
    wrch $MFC_EAL, $5
    wrch $MFC_Size, $6
    wrch $MFC_TagID, $7
    wrch $MFC_Cmd, $8
in C intrinsic:
    spu_mfcdma64(lsa, eah, eal, size, tag_id, cmd);
DMA – SPE cont. • poll for completion
    # Set tag group mask
    wrch $MFC_WrTagMask, $0
    # Set up for immediate tag status update.
    il $1, 0
    repeat:
    wrch $MFC_WrTagUpdate, $1
    rdch $1, $MFC_RdTagStat
    brz $1, repeat
OR
    #include <spu_intrinsics.h>
    #include <spu_mfcio.h>
    unsigned int tag_id = 0;
    unsigned int tag_mask = 1 << tag_id;
    spu_writech(MFC_WrTagMask, tag_mask);
    do {
    } while (!spu_mfcstat(MFC_TAG_UPDATE_IMMEDIATE));   /* poll for update */
DMA – SPE cont. • wait for completion (stall SPE)
    # Set tag group mask
    wrch $MFC_WrTagMask, $0
    # 0x1 for any tag, 0x2 for all tags.
    il $1, 0x1
    # Wait for conditional tag status update (stall the SPU).
    wrch $MFC_WrTagUpdate, $1
    rdch $1, $MFC_RdTagStat
OR
    #include <spu_intrinsics.h>
    #include <spu_mfcio.h>
    unsigned int tag_id = 0;
    unsigned int tag_mask = 1 << tag_id;
    spu_writech(MFC_WrTagMask, tag_mask);
    /* Wait for all ids in tag group to complete (stall the SPU) */
    spu_mfcstat(MFC_TAG_UPDATE_ALL);
DMA – SPE cont. • completion of DMA • source buffer can be reused • data may not have yet been written to the main storage • mailbox-ed notification can reach PPE before the data • SPE can do mfcsync • PPE can do lwsync • more efficient • SPE can notify via DMA • mfceieio must be used between DMAs for ordering
Mailboxes • 32bit messages • blocking for SPE (stalls SPE) • reading of empty inbound • writing of full outbound • SPE can poll the number of messages • non-blocking for PPE (and other devices) • reading returns zeros • writing overwrites last message
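The difference between the stalling SPE side and the non-blocking PPE side can be illustrated with a toy software model of one mailbox. This is only a sketch of the rules listed above, not the hardware or an SDK interface; all names are hypothetical, and the depth is 4 for the inbound mailbox and 1 for an outbound one, as stated earlier.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MBOX_MAX_DEPTH 4

/* Toy model of a single mailbox queue (hypothetical type). */
typedef struct {
    uint32_t slot[MBOX_MAX_DEPTH];
    unsigned depth; /* capacity: 4 for inbound, 1 for outbound */
    unsigned count; /* messages currently queued */
} mbox_model;

void mbox_init(mbox_model *m, unsigned depth)
{
    assert(depth >= 1 && depth <= MBOX_MAX_DEPTH);
    m->depth = depth;
    m->count = 0;
}

/* PPE-style write: never blocks. On a full mailbox the newest entry is
 * overwritten and false is returned to flag the lost message. */
bool mbox_ppe_write(mbox_model *m, uint32_t value)
{
    if (m->count == m->depth) {
        m->slot[m->count - 1] = value;
        return false;
    }
    m->slot[m->count++] = value;
    return true;
}

/* PPE-style read: never blocks. An empty mailbox yields zero. */
uint32_t mbox_ppe_read(mbox_model *m)
{
    if (m->count == 0)
        return 0;
    uint32_t value = m->slot[0];
    for (unsigned i = 1; i < m->count; ++i)
        m->slot[i - 1] = m->slot[i]; /* shift remaining messages forward */
    m->count--;
    return value;
}

/* Conditions under which an SPE access to this mailbox would stall. */
bool mbox_spe_read_would_stall(const mbox_model *m)  { return m->count == 0; }
bool mbox_spe_write_would_stall(const mbox_model *m) { return m->count == m->depth; }
```

Writing five values into a depth-4 model loses the fourth one, which is why PPE code has to check the free-slot count before writing.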
Mailboxes – SPE • send (stalling)
    wrch $SPU_WrOutMbox, $1
or
    spu_writech(SPU_WrOutMbox, mb_value);
• send (active waiting)
    repeat:
    rchcnt $2, $SPU_WrOutMbox
    brz $2, repeat
    wrch $SPU_WrOutMbox, $1
or
    do {
        /* Do other useful work while waiting. */
    } while (!spu_readchcnt(SPU_WrOutMbox));
    spu_writech(SPU_WrOutMbox, mb_value);
Mailboxes – SPE cont. • read (stalling)
    rdch $1, $SPU_RdInMbox
or
    mb_value = spu_readch(SPU_RdInMbox);
• read (active waiting)
    repeat:
    rchcnt $1, $SPU_RdInMbox
    brz $1, repeat
    rdch $2, $SPU_RdInMbox
or
    do {
        /* Do other useful work while waiting. */
    } while (!spu_readchcnt(SPU_RdInMbox));
    mb_value = spu_readch(SPU_RdInMbox);
Mailboxes – PPE • read SPE’s outbound mailbox
    void *ps = get_ps();
    unsigned int mb_status;
    unsigned int new;
    unsigned int mb_value;
    do {
        /* Low byte of SPU_Mbox_Stat: number of messages waiting in the outbound mailbox. */
        mb_status = *((volatile unsigned int *)(ps + SPU_Mbox_Stat));
        new = mb_status & 0x000000FF;
    } while ( new == 0 );
    mb_value = *((volatile unsigned int *)(ps + SPU_Out_Mbox));
Mailboxes – PPE cont. • writing to SPE’s inbound mailbox • problem of overrunning full mailbox
    // send four messages without overrunning the mailbox
    void *ps = get_ps();
    unsigned int j, k = 0;
    unsigned int mb_status;
    unsigned int slots;
    unsigned int mb_value[4] = {0x1, 0x2, 0x3, 0x4};
    do {
        /* Poll the Mailbox Status Register until the SPU_In_Mbox_Count field
           indicates there is at least one slot available in the
           SPU Read Inbound Mailbox. */
        do {
            mb_status = *((volatile unsigned int *)(ps + SPU_Mbox_Stat));
            slots = (mb_status & 0x0000FF00) >> 8;
        } while ( slots == 0 );
        for (j = 0; j < slots && k < 4; j++) {
            *((volatile unsigned int *)(ps + SPU_In_Mbox)) = mb_value[k++];
        }
    } while ( k < 4 );
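Both PPE examples pick fields out of the same status word with masks. A small sketch of that decoding, based only on the masks used in the examples (0x000000FF and 0x0000FF00); the function names are hypothetical:

```c
#include <stdint.h>

/* Messages waiting in the SPE's outbound mailbox:
 * low byte of the status word, as masked in the read example. */
unsigned mbox_stat_out_count(uint32_t mb_status)
{
    return mb_status & 0xFFu;
}

/* Free slots in the SPE's inbound mailbox:
 * second byte of the status word, as masked in the write example. */
unsigned mbox_stat_in_free(uint32_t mb_status)
{
    return (mb_status >> 8) & 0xFFu;
}
```

A status word of 0x00000401 would therefore mean one message waiting to be read and four free inbound slots.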
CELL SDK 3.1 • http://www.ibm.com/developerworks/power/cell/ • Cell BE Programming Handbook Including PowerXCell 8i • http://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/1741C509C5F64B3300257460006FD68D • SPE Runtime Management Library • http://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/1DFEF31B3211112587257242007883F3 • PPU & SPU C/C++ Language Extension Specification • http://www-01.ibm.com/chips/techlib/techlib.nsf/techdocs/30B3520C93F437AB87257060006FFE5E
libspe & libspe2 • low level APIs to access Cell from C/C++ • new threading model in libspe2 • use threading library of your choice and use libspe2 from there – no “SPE threads” • create e.g. pthread thread and launch SPE code from that – call returns after SPE finishes
Compilation • PPE object • g++ [-m64] -c -Ox • SPE object • spu-gcc -Ox • no -m64 • LS addresses are always 32bit • ppu-embedspu [-m64] <symbol> <object> <output> • link • g++ [-m64] <spe object> <ppe object> -lspe -lspe2
Referencing SPE code from PPE code • extern spe_program_handle_t <symbol>; • spe_program_load(spe_context,&<symbol>);
Launching SPE code (libspe2)
    struct thread_data {
        spe_context_ptr_t context;
        program_data* pd;
    };

    void *ppu_pthread_function(void *arg)
    {
        thread_data td = *(thread_data *) arg;
        spe_context_ptr_t context = td.context;
        unsigned int entry = SPE_DEFAULT_ENTRY;
        // blocks until the SPE program finishes
        spe_context_run(context, &entry, 0, td.pd, NULL, NULL);
        pthread_exit(NULL);
    }

    spe_context_ptr_t context;
    pthread_t pthread;
    thread_data td;
    context = spe_context_create(0, NULL);
    spe_program_load(context, &spe_prg);
    td.context = context;   // td.pd must also point to the program data before the thread starts
    pthread_create(&pthread, NULL, &ppu_pthread_function, &td);
    pthread_join(pthread, NULL);
    spe_context_destroy(context);
SPE code
    #include <spu_mfcio.h>

    int main(unsigned long long spe_id,
             unsigned long long program_data_ea,
             unsigned long long env)
    {
        program_data pd __attribute__((aligned(16)));
        int tag_id = 1;
        // fetch the program data from main storage into LS and wait for it
        mfc_get(&pd, program_data_ea, sizeof(pd), tag_id, 0, 0);
        mfc_write_tag_mask(1<<tag_id);
        mfc_read_tag_status_any();
        …
    }
Program data • structure shared by SPE and PPE code • unsigned long long for 64bit pointers • void* is 32bit on SPE and 32/64bit on PPE • be careful with the alignment • DMA cannot handle misaligned transfers • size padded to 16byte
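A structure that follows these rules might look like the sketch below. The field names are hypothetical (the real program_data layout is not shown in the slides); the points illustrated are the 64-bit integers used for addresses, the explicit padding to a multiple of 16 bytes, and the 16-byte alignment. The `aligned` attribute is the GCC syntax already used in the SPE code.

```c
#include <stddef.h>
#include <stdint.h>

/* Shared between PPE and SPE code. Field names are illustrative only. */
typedef struct {
    uint64_t input_ea;   /* effective address, carried as a 64-bit integer */
    uint64_t output_ea;  /* never a void*, whose size differs per side */
    uint32_t item_count;
    uint32_t pad[3];     /* explicit padding: 8 + 8 + 4 + 12 = 32 bytes */
} __attribute__((aligned(16))) program_data;

/* Compile-time check: the size must be a multiple of 16 bytes for DMA. */
typedef char program_data_size_check[(sizeof(program_data) % 16 == 0) ? 1 : -1];

/* Round a byte count up to the next multiple of 16 (hypothetical helper). */
size_t dma_padded_size(size_t n)
{
    return (n + 15u) & ~(size_t)15u;
}

/* PPE side: turn a pointer into a 64-bit effective address value. */
uint64_t ea_from_ptr(const void *p)
{
    return (uint64_t)(uintptr_t)p;
}
```

Because the layout contains only fixed-width integers, it is the same whether the PPE code is built as 32-bit or 64-bit.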
DMA – SPE side • (void) mfc_put(volatile void *ls, uint64_t ea, uint32_t size, uint32_t tag, uint32_t tid, uint32_t rid) • initiates a transfer from LS to main storage • tag is a number (e.g. 5) • mfc_putb, mfc_putf
DMA – SPE side cont. • (void) mfc_get(volatile void *ls, uint64_t ea, uint32_t size, uint32_t tag,uint32_t tid, uint32_t rid) • mfc_getb, mfc_getf
DMA status – SPE side • (void) mfc_write_tag_mask (uint32_t mask) • tag mask (e.g. 1<<5) • (uint32_t) mfc_read_tag_status_any(void) • blocks until any of the specified tag groups has no outstanding operations • (uint32_t) mfc_read_tag_status_all(void) • blocks until all of the specified tag groups have no outstanding operations
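The "any" versus "all" conditions can be modelled as pure bit operations. The sketch below is a software model, not the MFC itself: `outstanding` is a hypothetical bitmap with bit t set while tag group t still has commands in flight, and the returned status word (bits set for masked groups that are idle) is an assumption about how the result is encoded.

```c
#include <stdbool.h>
#include <stdint.h>

/* Masked tag groups that currently have no outstanding operations. */
uint32_t tag_status(uint32_t outstanding, uint32_t mask)
{
    return mask & ~outstanding;
}

/* Condition that ends an "any" wait: at least one masked group is idle. */
bool tag_wait_any_done(uint32_t outstanding, uint32_t mask)
{
    return tag_status(outstanding, mask) != 0;
}

/* Condition that ends an "all" wait: every masked group is idle. */
bool tag_wait_all_done(uint32_t outstanding, uint32_t mask)
{
    return tag_status(outstanding, mask) == mask;
}
```

With a mask covering tags 5 and 14 and only tag 14 still busy, the "any" condition is already satisfied while the "all" condition is not.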
Mailboxes – SPE side • (uint32_t) spu_read_in_mbox(void) • (uint32_t) spu_stat_in_mbox(void) • (void) spu_write_out_mbox(uint32_t data) • (uint32_t) spu_stat_out_mbox(void)
Mailboxes – PPE side • int spe_out_mbox_read (spe_context_ptr_t spe, unsigned int *mbox_data, int count) • int spe_out_mbox_status (spe_context_ptr_t spe) • int spe_in_mbox_write (spe_context_ptr_t spe, unsigned int *mbox_data, int count, unsigned int behavior) • SPE_MBOX_ALL_BLOCKING • blocks until all are sent • SPE_MBOX_ANY_BLOCKING • blocks until at least one message is sent • SPE_MBOX_ANY_NONBLOCKING • sends as many as possible without blocking • int spe_in_mbox_status (spe_context_ptr_t spe)
PPE direct access to SPE • void* spe_ls_area_get (spe_context_ptr_t spe) • less efficient than DMA • int spe_ls_size_get (spe_context_ptr_t spe) • void* spe_ps_area_get (spe_context_ptr_t spe, enum ps_area area) • enum ps_area • SPE_MFC_COMMAND_AREA • MFC registers • SPE_CONTROL_AREA • mailboxes • the get_ps function used in examples from the first part