SSAHA2 = SSAHA + Cross_Match
SSAHA for matching seeds, cross_match for sequence alignment.

Sequence Representation

Sequence S: (s1 s2 … si … sm), i = 1, 2, …, m
K-tuple: (si si+1 … si+k-1)

Using two binary digits for each base, we may have the following representations: "A" = 00; "C" = 01; "G" = 10; "T" = 11.

For any of the m/k non-overlapping k-tuples in the sequence, an integer E may be used to represent the k-tuple in a unique way. E is built from binary digits bi, where bi = 0 or 1 depending on the value of the sequence base, and Emax is the maximum value of the possible E values.

SSAHA Index

Hash Table: A 2-tuple hashing table of S1, S2 and S3

S1 = (GTGACGTCACTCTGAGGATCCCCTGGGTGTGG)
S2 = (GTCAACTGCAACATGAGGAACATCGACAGGCCCAAGGTCTTCCT)
S3 = (GGATCCCCTGTCCTCTCTGTCACATA)

E    k-tuple  Ni  Indices and Offsets (array of index and offset data)
0    AA       1   (2, 19)
1    AC       3   (1, 9) (2, 5) (2, 11)
2    AG       2   (1, 15) (2, 35)
3    AT       2   (2, 13) (3, 3)
4    CA       7   (2, 3) (2, 9) (2, 21) (2, 27) (2, 33) (3, 21) (3, 23)
5    CC       4   (1, 21) (2, 31) (3, 5) (3, 7)
6    CG       1   (1, 5)
7    CT       6   (1, 23) (2, 39) (2, 43) (3, 13) (3, 15) (3, 17)
8    GA       4   (1, 3) (1, 17) (2, 15) (2, 25)
9    GC       0
10   GG       5   (1, 25) (1, 31) (2, 17) (2, 29) (3, 1)
11   GT       6   (1, 1) (1, 27) (1, 29) (2, 1) (2, 37) (3, 19)
12   TA       1   (3, 25)
13   TC       6   (1, 7) (1, 11) (1, 19) (2, 23) (2, 41) (3, 11)
14   TG       3   (1, 13) (2, 7) (3, 9)
15   TT       0

SSAHA Seeds

[Figure: SSAHA seeds between Query and Subject, with Match Start, Match End and Edge length marked for an Exact Match and a Near Exact Match.]

Query sequence: Sq = (TGCAACAT)

Query k-tuples  f(t)      F(t)      -(t-1)  Fs(t)
TG              (1, 13)   (1, 13)    0      (1, 5)
                (2, 7)    (2, 7)     0      (1, 13)
                (3, 9)    (3, 9)     0      (2, -2)
GC                                  -1
CA              (2, 3)    (2, 1)    -2      (2, 1)
                (2, 9)    (2, 7)    -2      (2, 1)
                (2, 21)   (2, 19)   -2      (2, 4)
                (2, 27)   (2, 25)   -2      (2, 7)
                (2, 33)   (2, 31)   -2      (2, 7)
                (3, 21)   (3, 19)   -2      (2, 7)
                (3, 23)   (3, 21)   -2      (2, 7)
AA              (2, 19)   (2, 16)   -3      (2, 16)
AC              (1, 9)    (1, 5)    -4      (2, 16)
                (2, 5)    (2, 1)    -4      (2, 19)
                (2, 11)   (2, 7)    -4      (2, 22)
CA              (2, 3)    (2, -2)   -5      (2, 25)
                (2, 9)    (2, 4)    -5      (2, 28)
                (2, 21)   (2, 16)   -5      (2, 31)
                (2, 27)   (2, 22)   -5      (3, -3)
                (2, 33)   (2, 28)   -5      (3, 9)
                (3, 21)   (3, 16)   -5      (3, 16)
                (3, 23)   (3, 18)   -5      (3, 18)
AT              (2, 13)   (2, 7)    -6      (3, 19)
                (3, 3)    (3, -3)   -6      (3, 21)

Here f(t) lists the hash-table hits for the t-th query k-tuple, F(t) is f(t) with each offset shifted by -(t-1), and Fs(t) is F(t) sorted by index and then by offset. The four consecutive (2, 7) entries in Fs(t) correspond to Sq matching S2 starting at offset 7.

Output Format

SSAHA2 Client

Labels used on the poster for the summary line: Query, Subject, Q_start, Q_end, S_start, S_end.

F Zfish37251-2938a06.p1c Z35723-a3166b08.q1c 37 210 415 588 174 98.28
Alignment score: 152

Query:  37 ATTGCCATTAAAATAATAATAAAAGGACATATTGATATTTTGGTCATGCTATCTATTCCT 96
           ATTGCCATTAAAATAATAAT AAAGGACATATTGATATTTTGG CATGCTATCTATTCCT
Sbjct: 415 ATTGCCATTAAAATAATAATGAAAGGACATATTGATATTTTGGCCATGCTATCTATTCCT 474

Query:  97 AATGTCATCTCTGAATACAAAGACAGCAAATGGCCTGTGAAATAAACCCTGCCTGTCCAA 156
           AATGTCATCTCTGAATACAAAGACAGCAAATGGCCTGTGAAATAAACCCTGCCTGTCCAA
Sbjct: 475 AATGTCATCTCTGAATACAAAGACAGCAAATGGCCTGTGAAATAAACCCTGCCTGTCCAA 534

Query: 157 TAAGACAATGATCAAACATTCACTATTTTTTATAATAATCTGTATATTCTATAA 210
           TAAGACAATGATCAAACATTCACTATTTT TATAATAATCTGTATATTCTATAA
Sbjct: 535 TAAGACAATGATCAAACATTCACTATTTTGTATAATAATCTGTATATTCTATAA 588

F Zfish37251-2938a06.p1c Z35723-a3166b08.q1c 240 379 77 222 140 80.00
Alignment score: 36

Query: 240 AAATAAAATAAAATCATTCACATTCAAACAATAATAAAATAACATGATATTTTGGTCATC 299
           AAATAAA TAAAAT  T C CATT AAACAATAATAAAAT ACATGATATTTTG TCATC
Sbjct:  77 AAATAAA-TAAAATGTTGC-CATTAAAACAATAATAAAATGACATGATATTTTGATCATC 134

Query: 300 -----TATCCCTA-T-T-ATCTCTGAAATCAAAGACAGAGAACACCCTATGAAACCAACC 351
                TAT CCTA T T ATCTCTGAAATCAAAGACAG  AACA CCT T AAAC AACC
Sbjct: 135 CTATGTATTCCTAATGTCATCTCTGAAATCAAAGACAGCAAACAGCCTGTAAAACAAACC 194

Query: 352 CTGCCTCTCCGATTAGACAATGATCAAA 379
           CTGCCT  C GATT  ACAATGATCAAA
Sbjct: 195 CTGCCTGCCTGATTTTACAATGATCAAA 222
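The hash table and the query search above can be reproduced with a short Python sketch. It is an illustration under stated assumptions, not the SSAHA2 implementation: the function names `encode`, `build_index` and `search` are hypothetical, and the bit order in `encode` (first base most significant) is inferred from the E column of the hash table rather than from a formula given on the poster. Offsets are 1-based, the subject sequences are cut into non-overlapping k-tuples, and the query is scanned with overlapping k-tuples.

```python
from collections import defaultdict
from itertools import groupby

# Two binary digits per base, as on the poster: A=00, C=01, G=10, T=11.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode(ktuple):
    """Pack a k-tuple into an integer E (assumed: first base most significant)."""
    e = 0
    for base in ktuple:
        e = (e << 2) | CODE[base]
    return e


def build_index(sequences, k):
    """Map E -> list of (sequence index, offset) over non-overlapping k-tuples."""
    table = defaultdict(list)
    for idx, seq in enumerate(sequences, start=1):
        for pos in range(0, len(seq) - k + 1, k):
            table[encode(seq[pos:pos + k])].append((idx, pos + 1))
    return table


def search(table, query, k):
    """Return the sorted shifted hits Fs and the runs of identical entries."""
    shifted = []
    for t in range(1, len(query) - k + 2):           # overlapping query k-tuples
        hits = table.get(encode(query[t - 1:t - 1 + k]), [])
        for idx, off in hits:                        # f(t)
            shifted.append((idx, off - (t - 1)))     # F(t): shift by -(t-1)
    shifted.sort()                                   # Fs(t)
    runs = [(key, len(list(grp))) for key, grp in groupby(shifted)]
    return shifted, runs


S1 = "GTGACGTCACTCTGAGGATCCCCTGGGTGTGG"
S2 = "GTCAACTGCAACATGAGGAACATCGACAGGCCCAAGGTCTTCCT"
S3 = "GGATCCCCTGTCCTCTCTGTCACATA"

index = build_index([S1, S2, S3], k=2)
fs, runs = search(index, "TGCAACAT", k=2)
best = max(runs, key=lambda r: r[1])
print(best)  # ((2, 7), 4): four seeds agree on sequence 2, start offset 7
```

The longest run in the sorted list is the seed that would be handed on for full alignment; a real implementation would also tolerate small offset differences within a run to catch near-exact matches, which this sketch does not attempt.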
The SSAHA Trace Server
Zemin Ning, Will Spooner, Mark Rae, Steven Leonard, Martin Widlake and Tony Cox
The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Cambridge, UK

INTRODUCTION
Various genome projects have led to the creation of many large biological databases. The total size of DNA sequence data, for example, is estimated to be approximately 200 GB, including WGS and clone reads, finished sequences, RefSeq, etc. Designing services that make all the data searchable in a fast, sensitive and flexible way poses significant challenges in both algorithm development and hardware architecture. In this poster, we outline a system with the potential to accomplish this challenging but extremely worthwhile task.

Data Structures and Distributions
Hash tables for all the CPU nodes are generated from a set of traces (FASTA only), ordered according to species or trace type, and stored in the RAM of the individual nodes. The Oracle Database stores all the traces. SSAHA2 finds matching seeds from the hash tables and calls the Database to pull out the sequences. Full sequence alignment is then performed.

SSAHA2 Client: (1) Communicates over TCP/IP with the SSAHA2 server; (2) Inputs the query data; (3) Outputs the alignment results.

SSAHA2 Servers: Run on a 16 or 32 GB 64-bit Linux machine. (1) Communicate with the SSAHA2 client; (2) Receive input data and carry out search and alignment; (3) Output the search results to the client.

Computer Node Selection and Data Filtration: Species_Code – Human, mouse, zebrafish, etc.; Trace_Type – Finished sequence, WGS reads, EST reads, etc.; Centre_Name – SC, WIBR, WUGSC, etc.

Hardware Requirement for the System: The server system will require 6 (16 GB) or 4 (32 GB) Linux boxes, each with 4 CPUs.

Search Speed: The aim is to provide a near real-time (under 10 seconds) search service for a clustered 200 GB database.
The solution can be extended by plugging in extra appliances.

SSAHA Memory
M = 4*Ns/k + 4*2^(2k)
where k = k-mer size and Ns = number of bases. In the hash table, we store only a single element, which combines sequence index and offset.

References
• Ning, Z., Cox, A.J. and Mullikin, J.C. 2001. SSAHA: A Fast Search Method for Large DNA Databases. Genome Research 11:1725-1729.
• * http://www.phrap.com/
* We would like to thank Professor Phil Green, University of Washington, who has kindly agreed for the Phrap/Cross_Match package to be used for sequence alignment in the SSAHA system.
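The SSAHA Memory estimate, M = 4*Ns/k + 4*2^(2k), is easy to explore numerically. The sketch below is only an illustration: the function name `ssaha_memory` is hypothetical, M is assumed to be in bytes (the poster does not state the unit), and the values Ns = 200e9 and k = 12 are assumptions chosen to resemble the 200 GB database mentioned in the text, not figures taken from the poster.

```python
def ssaha_memory(ns, k):
    """Evaluate M = 4*Ns/k + 4*2^(2k) for Ns bases and k-mer size k."""
    positions = 4 * ns / k          # one combined index/offset element per k-tuple
    pointers = 4 * 2 ** (2 * k)     # one entry per possible k-tuple (4^k of them)
    return positions + pointers


# Assumed inputs: roughly 200e9 bases, a few candidate k-mer sizes.
NS = 200e9
for k in (10, 12, 14):
    m = ssaha_memory(NS, k)
    print(f"k={k:2d}  M = {m / 1e9:.1f} GB (assuming M is in bytes)")
```

Under these assumptions the first term dominates for moderate k, so memory falls roughly in proportion to 1/k until the 2^(2k) term takes over; this is the trade-off to weigh against the total RAM of the Linux boxes listed under Hardware Requirement.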