Resequencing using C-Linda and Largest Common Subsequence
Nayeong Jeong & Kenn Jacoby
May 14, 2000
Parallel Computing II, Prof. Paul Tymann
Parts of the project:
• C procedure that positions two segments, moves them past each other, and checks the overlap for an exact match (Stage1Engine).
• C-Linda program that coordinates an exhaustive search of the pool of input data.
• Largest Common Subsequence (LCS) algorithm.
• C procedure that combines the LCS algorithm with the previous Stage1Engine, giving Stage2Engine.
Stage1Engine
[Diagram: smallseg moves past largeseg through positions 1 to 4. Labels: substr1, substr2, overlap.]
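The sliding comparison on this slide can be sketched in C roughly as follows. This is a minimal sketch, not the project's Stage1Engine: the function names `find_exact_overlap` and `merge_at_offset`, the `min_overlap` parameter, and the convention that the offset is the index in `large` where `small[0]` sits (negative when `small` hangs off the left end) are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: slide `small` past `large` from left to right.
 * Returns 1 and stores the first offset whose overlapping characters all
 * match and whose overlap is at least min_overlap; returns 0 otherwise. */
int find_exact_overlap(const char *large, const char *small,
                       int min_overlap, int *offset_out)
{
    assert(large != NULL && small != NULL && offset_out != NULL);
    int llen = (int)strlen(large);
    int slen = (int)strlen(small);

    for (int off = -(slen - 1); off < llen; off++) {
        int lstart = off < 0 ? 0 : off;                    /* first shared index in large */
        int lend = off + slen < llen ? off + slen : llen;  /* one past the last shared index */
        int overlap = lend - lstart;
        if (overlap < min_overlap)
            continue;
        /* Compare the overlapping piece of large with the matching piece of small. */
        if (strncmp(large + lstart, small + (lstart - off), (size_t)overlap) == 0) {
            *offset_out = off;
            return 1;
        }
    }
    return 0;
}

/* Hypothetical helper: write the union of the two aligned segments into out.
 * `off` is expected to come from find_exact_overlap, so the segments touch.
 * Returns 0 on success, -1 if out is too small. */
int merge_at_offset(const char *large, const char *small, int off,
                    char *out, size_t out_cap)
{
    int llen = (int)strlen(large);
    int slen = (int)strlen(small);
    int start = off < 0 ? off : 0;                      /* leftmost coordinate */
    int end = off + slen > llen ? off + slen : llen;    /* one past the rightmost */
    size_t total = (size_t)(end - start);

    if (total + 1 > out_cap)
        return -1;
    for (int pos = start; pos < end; pos++) {
        /* Inside the overlap both segments agree, so either source works. */
        if (pos >= 0 && pos < llen)
            out[pos - start] = large[pos];
        else
            out[pos - start] = small[pos - off];
    }
    out[total] = '\0';
    return 0;
}
```

With the first two DNA test segments from this deck, `cgatacgcactacgca` and `gcactacgcatttact`, and an assumed `min_overlap` of 4, the sketch finds the overlap at offset 6 and merges the pair into `cgatacgcactacgcatttact`. The caller supplies the output buffer, so the sketch needs no dynamic allocation.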
C-Linda Algorithm

Input file:
1 cgattgatgcgcgtgatg
2 agcgtgcgtagagtcgtg
3 aggctctctcgtgtatctcgtgtt
4 gatctctagctcgctagttgtgc
5 cgatattttcgttgatccgctagt
. . .
36 tagcatagctcgatcg

Gen 0 (workers W1, W2, W3), pairs compared:
1/2 1/3 1/4 1/5 1/6 1/7 . . 1/36
2/3 2/4 2/5 2/6 2/7 . . 2/36
3/4 3/5 3/6 3/7 . . 3/36
Results: 1 & 5 matched, 2 & 7 matched, 3 & 36 matched

Mark the tuples of segments that were absorbed by a successful hit as deleted, and condense to a new input array of strings. Check whether the number of segments has decreased; if not, you are done.

Gen 1 (workers W1, W2, W3), pairs compared:
2/3 2/4 2/6 2/8 . 2/35
1/2 1/3 1/4 1/6 1/8 . 1/35
3/4 3/6 3/8 . . 3/35
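The generation loop described on this slide might look something like the sketch below. It is plain sequential C rather than C-Linda, so the distribution of pairs to workers through tuple space is not shown. The names `resequence` and `try_combine`, the fixed limits `MAX_SEGS` and `MAX_LEN`, the `min_overlap` parameter, and the choice to stop at the first hit for each segment in a generation are all assumptions; `try_combine` is a simplified stand-in for Stage1Engine that only checks suffix-to-prefix overlaps.

```c
#include <assert.h>
#include <string.h>

#define MAX_SEGS 64     /* assumed fixed limits, chosen for the sketch */
#define MAX_LEN  2048

/* Hypothetical stand-in for Stage1Engine: if a suffix of a equals a prefix
 * of b with length >= min_overlap, append the rest of b to a and return 1. */
static int try_combine(char *a, const char *b, size_t min_overlap)
{
    size_t alen = strlen(a), blen = strlen(b);
    size_t max = alen < blen ? alen : blen;

    for (size_t k = max; k >= min_overlap && k > 0; k--) {
        if (memcmp(a + alen - k, b, k) == 0) {
            if (alen + blen - k >= MAX_LEN)
                return 0;               /* combined string would not fit */
            strcpy(a + alen, b + k);
            return 1;
        }
    }
    return 0;
}

/* Runs generations until one produces no hit; returns the segment count. */
int resequence(char segs[][MAX_LEN], int n, size_t min_overlap)
{
    assert(n >= 0 && n <= MAX_SEGS);
    for (;;) {
        int deleted[MAX_SEGS] = {0};
        int hits = 0;

        for (int i = 0; i < n; i++) {
            if (deleted[i])
                continue;
            for (int j = i + 1; j < n; j++) {
                if (deleted[j])
                    continue;
                int hit = try_combine(segs[i], segs[j], min_overlap);
                if (!hit) {
                    /* Try the other order: j on the left, i on the right. */
                    char tmp[MAX_LEN];
                    strcpy(tmp, segs[j]);
                    hit = try_combine(tmp, segs[i], min_overlap);
                    if (hit)
                        strcpy(segs[i], tmp);
                }
                if (hit) {
                    deleted[j] = 1;     /* j was absorbed into i */
                    hits++;
                    break;              /* assumed: one hit per segment per generation */
                }
            }
        }
        if (hits == 0)
            return n;                   /* segment count did not decrease: done */

        /* Condense the survivors into a new, shorter input array. */
        int k = 0;
        for (int i = 0; i < n; i++) {
            if (!deleted[i]) {
                if (k != i)
                    strcpy(segs[k], segs[i]);
                k++;
            }
        }
        n = k;
    }
}
```

In a C-Linda version, each i/j pair would presumably be handed to a worker through tuple space, and the condense step would run between generations once all workers have reported.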
LCS Algorithm

ccatcctgctgaacgatc
atcgtgctgatcgatcgg
Lcs_length = 14, Thresh = 16

catcctgctgaacgatcg
atcgtgctgatcgatcgg
Lcs_length = 15, Thresh = 16

atcctgctgaacgaacgg
atcgtgctgatcgatcgg
Lcs_length = 16, Thresh = 16
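For reference, the length computation behind this slide can be sketched with the standard dynamic-programming recurrence for a common subsequence. This is a generic textbook-style version and not necessarily the project's `lcs_length( )`: the signature, the `LCS_MAX` bound, and the two-row table are assumptions. Fixed-size arrays are used so that no dynamic allocation is needed.

```c
#include <assert.h>
#include <string.h>

#define LCS_MAX 256   /* assumed upper bound on the second string's length + 1 */

/* Returns the length of the longest subsequence common to x and y.
 * Only two rows of the DP table are kept: prev is row i-1, curr is row i. */
int lcs_length(const char *x, const char *y)
{
    size_t m = strlen(x), n = strlen(y);
    assert(n < LCS_MAX);

    int prev[LCS_MAX] = {0};
    int curr[LCS_MAX] = {0};

    for (size_t i = 1; i <= m; i++) {
        curr[0] = 0;
        for (size_t j = 1; j <= n; j++) {
            if (x[i - 1] == y[j - 1])
                curr[j] = prev[j - 1] + 1;          /* extend a common subsequence */
            else
                curr[j] = prev[j] > curr[j - 1]     /* drop a character from x or y */
                        ? prev[j] : curr[j - 1];
        }
        memcpy(prev, curr, (n + 1) * sizeof(int));
    }
    return prev[n];
}
```

A caller would compare the result against the threshold, for example accepting a window when `lcs_length(window_a, window_b) >= thresh`, as the Lcs_length and Thresh values on the slide suggest.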
General Flow
[Diagram: worker 0 and worker 1 each run char * Stage2Engine( ), which uses int lcs_length( ). Tuplespace holds the segments (agtcgatcgcataacg, cagactcgcatccagca, gccatactacgcaatcacacag, cgacactagctcacgactacaa, . . .) and the combined string.]
Testing with input...

DNA segments:
cgatacgcactacgca
gcactacgcatttact
cgatacgcattacgca
gcactatgcatttact

Text segments (Shakespeare’s Julius Caesar):
O!--pardon-me-thou-bleeding-piece
thou-bleading-piece-of-earth
piece-of-earth-that-I-am-meak-and
am-meek-and-gentle-with-these-butchers.
…
Limitations, Hurdles
• We did not use dynamic memory allocation because of an unexplained hang, so the size of the input we can process is limited to roughly 18 sub-sequences totaling ~1400 characters when fully combined.
• We had to hard-code the threshold to (overlay - 2) because Stage2Engine( ) does not use recursion as it should have. As a result, the maximum number of substitutions we allow within a comparison window is 2.
• The program puts no limit on the number of substitutions that can be grouped side by side.
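One way the last two limitations could be addressed is sketched below: the substitution limit becomes a parameter, and a second parameter caps how many substitutions may sit side by side. This is an illustration only. It compares two aligned windows position by position instead of using the LCS-based test from the project, and the names `count_substitutions`, `window_acceptable`, `max_subs`, and `max_run` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Counts position-wise mismatches between two aligned windows of length len
 * and stores the longest run of consecutive mismatches in *longest_run. */
int count_substitutions(const char *a, const char *b, size_t len,
                        int *longest_run)
{
    assert(a != NULL && b != NULL && longest_run != NULL);
    int total = 0, run = 0, best = 0;

    for (size_t i = 0; i < len; i++) {
        if (a[i] != b[i]) {
            total++;
            run++;
            if (run > best)
                best = run;
        } else {
            run = 0;    /* a matching character ends the current run */
        }
    }
    *longest_run = best;
    return total;
}

/* Accepts the window when at most max_subs characters differ and no more
 * than max_run of the differing characters are adjacent. */
int window_acceptable(const char *a, const char *b, size_t len,
                      int max_subs, int max_run)
{
    int run = 0;
    int subs = count_substitutions(a, b, len, &run);
    return subs <= max_subs && run <= max_run;
}
```

For example, the 19-character windows `thou-bleeding-piece` and `thou-bleading-piece` from the test input differ in one position, so they pass with `max_subs = 2` and `max_run = 1`, while a window with two adjacent substitutions passes only when `max_run` is at least 2.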
References
• Introduction to Algorithms, Cormen, Leiserson and Rivest, pp. 314-320, Largest Common Subsequence.
• Dr. Gary Skuse, RIT Bioinformatics, Biology Dept.