Toolkits version 1.0
Special Course on Computer Architectures 2010
Contents
• Introduction of the toolkit used for the contest
• What is the "Longest Common Subsequence (LCS)"?
• How to use toolkit ver1.0
• Towards the fastest program
Challenge
• Compute the Longest Common Subsequence (LCS) of two given sequences of letters, A and B.
• Compute as many sequences as possible within a given time limit.
What is the LCS (1/4)
Longest Common Subsequence: LCS
• A subsequence is a sequence consisting of letters taken from the original sequence.
• Example: X = <A, B, A, B, C>
• <A, B>, <A, B, C>, <B, A>, <A, A>, <A, B, B, C>, etc. are subsequences of X.
• The letters need not be contiguous in X, but they must keep their relative order.
• A common subsequence of two sequences is a subsequence of both.
• Example: X = <A, B, C, A>, Y = <A, B, A>
• The longest common subsequence is <A, B, A>; its length is 3.
• (See) http://en.wikipedia.org/wiki/Longest_common_subsequence_problem
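The definition of a subsequence above can be checked mechanically. The following is a minimal sketch in C; the function name `is_subsequence` is an assumption made for illustration and is not part of the toolkit.

```c
#include <stdbool.h>
#include <stddef.h>

/* Returns true when `sub` (length m) is a subsequence of `seq` (length n):
 * its letters appear in `seq` in the same order, not necessarily adjacent.
 * Hypothetical helper, not a toolkit function. */
bool is_subsequence(const char *sub, size_t m, const char *seq, size_t n)
{
    size_t i = 0;                       /* next letter of `sub` to match */
    for (size_t j = 0; j < n && i < m; j++) {
        if (seq[j] == sub[i])
            i++;                        /* matched; move to the next letter */
    }
    return i == m;                      /* all letters of `sub` were found */
}
```

For X = <A, B, A, B, C>, calling `is_subsequence("ABBC", 4, "ABABC", 5)` should report true, while `<C, A>` is rejected because no A follows the C.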
How to get the LCS (2/4)
• How is it computed?
• Let X and Y be two sequences. The LCS of the first i letters of X and the first j letters of Y can be computed from the LCS of shorter prefixes.
• That is, LCS(i, j) is computed from the following:
• LCS(i-1, j)
• LCS(i, j-1)
• LCS(i-1, j-1)
How to get the LCS (3/4)
• When the last letters are the same:
• Xi = <..., A>
• Yj = <..., A>
• LCS(i, j) is LCS(i-1, j-1) + 1
• When the last letters are not the same:
• Xi = <..., A>
• Yj = <..., B>
• LCS(i, j) is the larger of LCS(i-1, j) and LCS(i, j-1)
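The two cases above can be written as one recurrence. The base case for an empty prefix (i = 0 or j = 0) is not stated on the slide; it is added here as the usual convention.

```latex
\mathrm{LCS}(i, j) =
\begin{cases}
0 & \text{if } i = 0 \text{ or } j = 0,\\
\mathrm{LCS}(i-1, j-1) + 1 & \text{if } X_i = Y_j,\\
\max\bigl(\mathrm{LCS}(i-1, j),\ \mathrm{LCS}(i, j-1)\bigr) & \text{if } X_i \neq Y_j.
\end{cases}
```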
How to get the LCS (4/4)
Dynamic Programming (DP)
• X = <A, B, C, A>, Y = <A, B, A>
• With the score table shown on the slide, the rule from the previous slide fills each cell with: max(Up, Left, Left-Up + ((Xi == Yj) ? 1 : 0))
• Starting from the top-left cell, all entries in the table can be filled sequentially.
• The bottom-right entry is the length of the LCS.
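The table-filling procedure above can be sketched as a short sequential C function. This is an illustration only: the name `lcs_length` and the flat table layout are assumptions, and the toolkit's own lcs.c may be organized differently.

```c
#include <stdlib.h>

/* Length of the LCS of x[0..n) and y[0..m), or -1 if allocation fails.
 * Illustrative sketch; not the toolkit's lcs.c. */
int lcs_length(const char *x, size_t n, const char *y, size_t m)
{
    size_t w = m + 1;                   /* table width, including column 0 */
    /* (n+1) x (m+1) table; row 0 and column 0 stay 0 (empty prefix). */
    int *t = calloc((n + 1) * w, sizeof *t);
    if (t == NULL)
        return -1;

    for (size_t i = 1; i <= n; i++) {
        for (size_t j = 1; j <= m; j++) {
            int up     = t[(i - 1) * w + j];
            int left   = t[i * w + (j - 1)];
            int leftup = t[(i - 1) * w + (j - 1)]
                         + ((x[i - 1] == y[j - 1]) ? 1 : 0);
            int best   = (up > left) ? up : left;
            t[i * w + j] = (leftup > best) ? leftup : best;
        }
    }

    int result = t[n * w + m];          /* bottom-right entry */
    free(t);
    return result;
}
```

With the slide's example, `lcs_length("ABCA", 4, "ABA", 3)` yields 3.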
Approach
• Implement ppe.c for the PPE and spe.c for the SPE to compute the LCS.
• The following programs must not be changed:
• PPE programs (main_ppe.c, define.h)
The step in toolkit ver1.0
• Example: Compute the code distance with multiple SPEs.
• Files other than main_ppe.c and define.h can be modified.
• Each SPE works on blocks of 128 letters.
toolkit ver1.0
• ppe.c: PPE source code
• spe.c: SPE source code
• main_ppe.c: modification forbidden
• spe.h
• define.h: modification forbidden
• Makefile
• getrndstr.c: generates a random sequence of letters
• lcs.c: the sequential LCS (for verifying results)
• ans.txt: the answers to the sample problems
• rep/: files for the sample problems
How to use toolkit ver1.0 (1/3)
• Specify two files as arguments; the program computes the LCS of the sequences in those files.
• In its initial state, the toolkit uses multiple SPEs.
• Limitation: the sequence length must be a multiple of 128.
• Example files for various data sizes are provided.
• Use getrndstr.c to generate random sequences of arbitrary size:
  $ gcc -O3 -o getrndstr getrndstr.c
  $ ./getrndstr 128 13 > file9999
• This generates file9999, containing a sequence of 128 letters, using random seed 13.
How to use toolkit ver1.0 (2/3)
• After extracting toolkit1.0.tgz, compile with make.
• To start an example program:
• make run{number} (number from 1 to 5)
• The output shows the problem number, the lengths of A and B, the execution time, and the length of the LCS.
How to use toolkit ver1.0 (3/3)
• Verify the results using lcs.c. (Note that the results of the examples executed by make run* are in ans.txt. Use lcs.c in other cases.)
Summary of limitations
• The sequence length is a multiple of 128 (char type).
• The two given sequences are called Sequence A and Sequence B.
• The code is based on libspe.
• The program can also run on the PPE.
• At most 7 SPEs can be used in parallel.
• The memory on the PPE can be used freely.
Hints
• Divide the sequences into sub-blocks so that the total work can be divided.
• Processing the sub-blocks in parallel on the SPEs can improve performance.
• To which part can parallel processing be applied?
Parallel Processing (1/3)
• Data dependency: to compute an element, its three neighbors Left, Up, and Left-Up must already be fixed.
Parallel Processing (2/3)
• If the blue part (in the figure) is fixed, the elements of the pink part can be computed in parallel.
• The same method can be applied to blocks instead of elements.
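One common way to exploit this dependency pattern is to walk the table along anti-diagonals: every cell with the same i + j depends only on cells from earlier diagonals. The sketch below assumes that the pink part in the figure corresponds to such a diagonal. It runs sequentially, the name `lcs_wavefront` is hypothetical, and the inner loop marks where the parallel work would go.

```c
#include <stdlib.h>

/* LCS length computed diagonal by diagonal (wavefront order).
 * All cells with the same d = i + j are independent of each other.
 * Returns -1 if allocation fails. Illustrative sketch only. */
int lcs_wavefront(const char *x, int n, const char *y, int m)
{
    size_t w = (size_t)m + 1;
    int *t = calloc(((size_t)n + 1) * w, sizeof *t);
    if (t == NULL)
        return -1;

    for (int d = 2; d <= n + m; d++) {
        int i_lo = (d - m > 1) ? d - m : 1;   /* keeps j = d - i <= m */
        int i_hi = (d - 1 < n) ? d - 1 : n;   /* keeps j = d - i >= 1 */
        /* Iterations of this loop do not depend on each other, so they
         * could be distributed over SIMD lanes or several SPEs. */
        for (int i = i_lo; i <= i_hi; i++) {
            int j = d - i;
            int up     = t[(size_t)(i - 1) * w + j];
            int left   = t[(size_t)i * w + (j - 1)];
            int leftup = t[(size_t)(i - 1) * w + (j - 1)]
                         + ((x[i - 1] == y[j - 1]) ? 1 : 0);
            int best   = (up > left) ? up : left;
            t[(size_t)i * w + j] = (leftup > best) ? leftup : best;
        }
    }

    int result = t[(size_t)n * w + m];
    free(t);
    return result;
}
```

The result is the same as with row-by-row filling; only the visiting order changes.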
Parallel Processing (3/3)
• In order to compute the pink block, the following are needed:
• the lower-right-most element of the upper-left block,
• the lowest row of the upper block, and
• the right-most column of the left block.
Input/Output of block computation
• Input:
• The lower-right-most element of the upper-left block.
• The lowest row of the upper block.
• The right-most column of the left block.
• Output:
• The lower-right-most element of the computed score table.
• The lowest row.
• The right-most column.
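The input/output contract above can be sketched as a single C function that keeps only one row of the block in memory. The name `lcs_block`, its parameter order, and the convention that A indexes rows and B indexes columns are assumptions made for this illustration; the toolkit's spe.c may use a different interface.

```c
#include <string.h>

/* Computes one bh x bw block of the LCS score table (bh, bw >= 1).
 *   xa        : the bh letters of A covered by this block's rows
 *   yb        : the bw letters of B covered by this block's columns
 *   corner    : lower-right-most element of the upper-left block
 *   top[bw]   : lowest row of the upper block
 *   left[bh]  : right-most column of the left block
 *   bottom[bw], right_col[bh] : outputs (must not alias `left`)
 * Returns the lower-right-most element of this block.
 * Hypothetical interface, not the toolkit's. */
int lcs_block(const char *xa, int bh, const char *yb, int bw,
              int corner, const int *top, const int *left,
              int *bottom, int *right_col)
{
    /* `bottom` doubles as the rolling row; it starts as the row above. */
    memmove(bottom, top, (size_t)bw * sizeof *bottom);

    for (int r = 0; r < bh; r++) {
        int leftup = (r == 0) ? corner : left[r - 1];
        int lft    = left[r];
        for (int c = 0; c < bw; c++) {
            int up   = bottom[c];             /* value from the previous row */
            int cand = leftup + ((xa[r] == yb[c]) ? 1 : 0);
            int best = (up > lft) ? up : lft;
            if (cand > best)
                best = cand;
            leftup = up;                      /* becomes Left-Up of next cell */
            lft    = best;                    /* becomes Left of next cell */
            bottom[c] = best;
        }
        right_col[r] = lft;                   /* last value of this row */
    }
    return bottom[bw - 1];
}
```

Blocks on the top or left edge of the table receive all-zero `top` or `left` arrays and a zero `corner`. Chaining two blocks, with the `right_col` of the first passed as `left` of the second, reproduces the result of the full table.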
Control of SPE subroutine
• Make a queue on the PPE to manage the jobs.
• Process on the PPE:
• When a block has been computed, add to the queue the block numbers that can now start computation.
• Candidates are the right and lower neighbors of the computed block.
• Read a block number from the queue and assign it to a free SPE.
• Continue until the last (lower-right) block is computed.
• (Figure: the PPE with a job queue (head, tail) and four SPEs; arrows labeled "add the job", "start the job", and "inform the end of job".)
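The queue control above can be simulated without any SPE code. The sketch below assumes a grid of rows x cols blocks numbered r * cols + c, and that a block becomes ready once both its upper and left neighbors are finished. The function name and the dependency counters are assumptions for illustration; a real ppe.c would hand each dequeued block to a free SPE instead of completing it immediately.

```c
#include <stdlib.h>

/* Writes into order[0..rows*cols) one valid dispatch order of the blocks.
 * Returns the number of blocks scheduled, or -1 if allocation fails.
 * Sequential simulation of the PPE-side queue; hypothetical sketch. */
int schedule_blocks(int rows, int cols, int *order)
{
    if (rows <= 0 || cols <= 0)
        return 0;

    int total = rows * cols;
    int *queue   = malloc((size_t)total * sizeof *queue);
    int *pending = malloc((size_t)total * sizeof *pending);
    if (queue == NULL || pending == NULL) {
        free(queue);
        free(pending);
        return -1;
    }

    /* pending[b] = number of unfinished neighbors (upper, left) of block b */
    for (int r = 0; r < rows; r++)
        for (int c = 0; c < cols; c++)
            pending[r * cols + c] = (r > 0) + (c > 0);

    int head = 0, tail = 0, count = 0;
    queue[tail++] = 0;                        /* upper-left block is ready */

    while (head < tail) {
        int b = queue[head++];                /* would be sent to a free SPE */
        order[count++] = b;                   /* treated as finished here */
        int r = b / cols, c = b % cols;
        /* Candidates: the right and lower neighbors of the finished block. */
        if (c + 1 < cols && --pending[b + 1] == 0)
            queue[tail++] = b + 1;
        if (r + 1 < rows && --pending[b + cols] == 0)
            queue[tail++] = b + cols;
    }

    free(queue);
    free(pending);
    return count;
}
```

Each block enters the queue exactly once, so a queue of rows * cols entries is sufficient and no wrap-around is needed.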
Subroutines for DMA (FYI)
• Functions for data transfer
• dmaget, dmaput: DMA read/write functions provided by the toolkit.
• dmaget((void *)spe_addr, ppe_addr, X);
• Reads X bytes of data starting at ppe_addr and stores them starting at spe_addr in the Local Store. dmaput transfers data in the opposite direction.
• (Figure: spe_addr in the SPE Local Store and ppe_addr in PPE main memory; the addresses are 128-byte aligned.)
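Since the figure calls for 128-byte aligned addresses, buffers on the PPE side need to be allocated accordingly. The following sketch uses only standard C11 (`aligned_alloc`) and does not call the toolkit's dmaget or dmaput. The helper names are hypothetical, and an older toolchain may need `posix_memalign` instead.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define DMA_ALIGN 128   /* alignment shown on the slide */

/* True when p is aligned to a 128-byte boundary. */
int is_dma_aligned(const void *p)
{
    return ((uintptr_t)p % DMA_ALIGN) == 0;
}

/* Rounds nbytes up to the next multiple of 128. */
size_t round_up_dma(size_t nbytes)
{
    return (nbytes + DMA_ALIGN - 1) / DMA_ALIGN * DMA_ALIGN;
}

/* Allocates a 128-byte aligned buffer of at least nbytes bytes.
 * aligned_alloc expects the size to be a multiple of the alignment,
 * so the size is rounded up first. Release with free(). */
void *alloc_dma_buffer(size_t nbytes)
{
    return aligned_alloc(DMA_ALIGN, round_up_dma(nbytes));
}
```

Because the sequence length is a multiple of 128 chars, a buffer that starts on an aligned address keeps every 128-letter block aligned as well.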
Towards the fastest program
• Improve spe.c, which fills the table.
• Improve ppe.c, which controls the blocks for computation.
• For parallel processing:
• Use SIMD instructions on the SPE.
• One operation can process multiple elements.
• In any case, compute as many elements as possible with each instruction.
• Loop unrolling, __builtin_expect, and double buffering are useful techniques to try.
• Good Luck!
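As a small starting point for such tuning, the sketch below keeps only one row of the table in memory, writes the inner loop without data-dependent `if` statements, and uses `__builtin_expect` (a GCC/Clang builtin) to mark the allocation failure path as unlikely. The function name and the `UNLIKELY` macro are assumptions for illustration, and any actual speedup would have to be measured on the target.

```c
#include <stdlib.h>

/* Tells GCC/Clang that the condition is expected to be false. */
#define UNLIKELY(cond) __builtin_expect(!!(cond), 0)

/* LCS length using a single rolling row of m + 1 entries.
 * Returns -1 if allocation fails. Illustrative sketch only. */
int lcs_length_rolling(const char *x, int n, const char *y, int m)
{
    int *row = calloc((size_t)m + 1, sizeof *row);
    if (UNLIKELY(row == NULL))
        return -1;

    for (int i = 0; i < n; i++) {
        const char xi = x[i];
        int leftup = 0;                 /* column 0 is always 0 */
        int left   = 0;
        for (int j = 1; j <= m; j++) {
            int up   = row[j];          /* value from the previous row */
            int cand = leftup + (xi == y[j - 1]);
            int best = (up > left) ? up : left;
            best     = (cand > best) ? cand : best;
            leftup   = up;
            left     = best;
            row[j]   = best;
        }
    }

    int result = row[m];
    free(row);
    return result;
}
```

The comparison and the two maximum operations map naturally onto compare and select style instructions, which is one reason this form is a convenient base for a SIMD version.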