SSAHA: Sequence Search and Alignment by Hashing Algorithm
Fei Lian, Iyanna Atwell, Owen Astrachan
Computer Science Department, Duke University, Durham, North Carolina
Introduction

SSAHA, or Sequence Search and Alignment by Hashing Algorithm, is used mainly for fast sequence assembly, SNP detection, and the ordering and orientation of contigs. It matters to computer science and bioinformatics because it lets scientists search a hash table built from a DNA database for likely matches to a query sequence. A hash table acts as a lookup table: a hash function maps each key to the place where its values are stored. SSAHA gets its speed by keeping the hashed database in memory and using a fast indexing method to locate the words of a query. It is most effective at finding matches between closely related DNA sequences. Thus, SSAHA has great potential in the bioinformatics field.

How SSAHA Works

• SSAHA takes the database sequences and stores them in a hash table.
• Each database sequence is split into k-tuples (words of k consecutive bases) and every k-tuple is hashed; when a query arrives, it is split into k-tuples and hashed the same way.
• The hashed k-tuples and their positions form the hash table (Table 1) for all sequences in the database.
• The positions looked up for the query's k-tuples are then arranged in a second table (Table 2), and matches are found by sorting these hits and looking for runs that line up.
• In Table 2, H lists the hits and M lists the sorted hits, which indicate possible matches to the query.
• SSAHA returns the sequence in which the query is found and the position where the match starts.
• In short, SSAHA builds two tables. The first records the positions of every k-tuple in the database. The second is built at search time: it collects the hits for the query and compiles them into matches.
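The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the SSAHA authors' implementation: the function names (encode, build_hash_table, sorted_hits, find_matches) and the min_hits parameter are hypothetical, the database is cut into non-overlapping k-tuples with 1-based positions so that the output agrees with Table 1, and the query is scanned at every offset t as in Table 2. The sketch groups hits by index and shift only and assumes each run is contiguous, which holds for the poster's example with sequences S1, S2, S3 and query Q = TGCAACAT.

```python
from collections import defaultdict

BASES = "ACGT"  # assumed two-bit code: A=0, C=1, G=2, T=3


def encode(word):
    """Return E(w): the k-tuple read as a base-4 number."""
    value = 0
    for base in word:
        value = value * 4 + BASES.index(base)  # raises ValueError on non-ACGT
    return value


def build_hash_table(sequences, k):
    """Map E(w) to a list of (sequence index, offset), both 1-based.

    Database sequences are cut into non-overlapping k-tuples.
    """
    table = defaultdict(list)
    for i, seq in enumerate(sequences, start=1):
        for j in range(0, len(seq) - k + 1, k):
            table[encode(seq[j:j + k])].append((i, j + 1))
    return table


def sorted_hits(table, query, k):
    """Return M: every hit (index, shift, offset), sorted in that order."""
    hits = []
    for t in range(len(query) - k + 1):  # overlapping query k-tuples
        for i, j in table.get(encode(query[t:t + k]), []):
            hits.append((i, j - t, j))
    return sorted(hits)


def find_matches(hits, k, min_hits):
    """Scan M for runs sharing index and shift.

    Returns (index, first offset, span) for each run of at least min_hits.
    """
    matches, run = [], []
    for hit in hits + [None]:  # the sentinel flushes the final run
        if run and (hit is None or hit[:2] != run[-1][:2]):
            if len(run) >= min_hits:
                start = run[0][2]
                matches.append((run[0][0], start, run[-1][2] - start + k))
            run = []
        if hit is not None:
            run.append(hit)
    return matches


S = [
    "GTGACGTCACTCTGAGGATCCCCTGGGTGTGG",
    "GTCAACTGCAACATGAGGAACATCGACAGGCCCAAGGTCTTCCT",
    "GGATCCCCTGTCCTCTCTGTCACATA",
]
Q = "TGCAACAT"

table = build_hash_table(S, k=2)
M = sorted_hits(table, Q, k=2)
print(table[encode("TG")])                 # [(1, 13), (2, 7), (3, 9)]
print(find_matches(M, k=2, min_hits=4))    # [(2, 7, 8)]
```

The last line reports sequence 2, starting offset 7, spanning 8 bases. Lowering min_hits to 2 also reports two shorter 4-base runs in S2, which is why a real search would apply some minimum match length.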
Table 2. List of Matches for the Query Sequence

t   w_t(Q)   Positions   H              M
0   TG       (1, 13)     (1, 13, 13)    (1, 5, 9)
             (2, 7)      (2, 7, 7)      (1, 13, 13)
             (3, 9)      (3, 9, 9)      (2, -2, 3)
1   GC                                  (2, 1, 3)
2   CA       (2, 3)      (2, 1, 3)      (2, 1, 5)
             (2, 9)      (2, 7, 9)      (2, 4, 9)
             (2, 21)     (2, 19, 21)    (2, 7, 7)
             (2, 27)     (2, 25, 27)    (2, 7, 9)
             (2, 33)     (2, 31, 33)    (2, 7, 11)
             (3, 21)     (3, 19, 21)    (2, 7, 13)
             (3, 23)     (3, 21, 23)    (2, 16, 19)
3   AA       (2, 19)     (2, 16, 19)    (2, 16, 21)
4   AC       (1, 9)      (1, 5, 9)      (2, 19, 21)
             (2, 5)      (2, 1, 5)      (2, 22, 27)
             (2, 11)     (2, 7, 11)     (2, 25, 27)
5   CA       (2, 3)      (2, -2, 3)     (2, 28, 33)
             (2, 9)      (2, 4, 9)      (2, 31, 33)
             (2, 21)     (2, 16, 21)    (3, -3, 3)
             (2, 27)     (2, 22, 27)    (3, 9, 9)
             (2, 33)     (2, 28, 33)    (3, 16, 21)
             (3, 21)     (3, 16, 21)    (3, 18, 23)
             (3, 23)     (3, 18, 23)    (3, 19, 21)
6   AT       (2, 13)     (2, 7, 13)     (3, 21, 23)
             (3, 3)      (3, -3, 3)

Related Articles

• William R. Pearson and David J. Lipman, Improved Tools for Biological Sequence Comparison, PNAS, April 15, 1988, vol. 85, no. 8, 2444-2448.
• S. F. Altschul, T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman, Gapped BLAST and PSI-BLAST: a new generation of protein database search programs, from: http://nar.oxfordjournals.org/cgi/content/abstract/25/17/3389?ijkey=ba0bbba2e301195849a11dc719e221c69af7f2ee&keytype2=tf_ipsecsha
• Arthur L. Delcher et al., Fast Algorithms for Large-Scale Genome Alignment and Comparison, Nucleic Acids Research, 2002, vol. 30, no. 11, 2478-2483.
• Stephen F. Altschul et al., Basic Local Alignment Search Tool, Journal of Molecular Biology, 1990, vol. 215, 403-410.
References

• http://www.sanger.ac.uk/Software/analysis/SSAHA/
• http://www.genome.org/cgi/content/full/11/10/1725
• http://www1.ncifcrf.gov/app/htdocs/appdb/index.php?info=ssaha
• www.duke.edu/~ia14

Contact Information

• Fei Lian, Fei.lian@duke.edu
• Iyanna Atwell, Iyanna.atwell@duke.edu

Table 1. A 2-tuple Hash Table for S1, S2, and S3

w    E(w)   Positions
AA   0      (2, 19)
AC   1      (1, 9) (2, 5) (2, 11)
AG   2      (1, 15) (2, 35)
AT   3      (2, 13) (3, 3)
CA   4      (2, 3) (2, 9) (2, 21) (2, 27) (2, 33) (3, 21) (3, 23)
CC   5      (1, 21) (2, 31) (3, 5) (3, 7)
CG   6      (1, 5)
CT   7      (1, 23) (2, 39) (2, 43) (3, 13) (3, 15) (3, 17)
GA   8      (1, 3) (1, 17) (2, 15) (2, 25)
GC   9
GG   10     (1, 25) (1, 31) (2, 17) (2, 29) (3, 1)
GT   11     (1, 1) (1, 27) (1, 29) (2, 1) (2, 37) (3, 19)
TA   12     (3, 25)
TC   13     (1, 7) (1, 11) (1, 19) (2, 23) (2, 41) (3, 11)
TG   14     (1, 13) (2, 7) (3, 9)
TT   15

SSAHA Main Points

• Table 1 is the hash table compiled from the database; it is searched to find occurrences of the query sequence.
• Each position (i, j) gives the database sequence i in which a 2-tuple occurs and the offset j at which it starts.
• H is the list of hits between the database and the query sequence. For the query 2-tuple at offset t, with positions (i_1, j_1), ..., (i_r, j_r) in the hash table, the hits are $H_1 = (i_1, j_1 - t, j_1),\ H_2 = (i_2, j_2 - t, j_2),\ \ldots,\ H_r = (i_r, j_r - t, j_r)$. The three values of each hit are referred to as the index, shift, and offset. M is the master list of all hits, sorted by index, then shift, then offset.
• The four hits in M that share index 2 and shift 7, namely (2, 7, 7), (2, 7, 9), (2, 7, 11), and (2, 7, 13), mark the exact match between the query sequence Q = TGCAACAT and the database of three sequences.
• From this information, the query is found in S2, beginning at position 7 and continuing for eight bases.
• This illustrates how SSAHA uses 2-tuples to store a hashed database and then searches it to return where the query is located.
• SSAHA is mainly used for single-nucleotide polymorphism detection and large-scale sequence assembly.
• The SSAHA algorithm works mainly by concatenating exact k-tuple matches between the query and the database.
• By hashing the database, search time becomes independent of database size.
• The time SSAHA takes to hash the database is never more than twice the time a BLAST (Basic Local Alignment Search Tool) search takes to process the same database.
• Because a hash algorithm is used, memory can be allocated in proportion to each list of positions L.
• SSAHA takes advantage of the many repeats in human DNA by assigning unrecognizable bases to a single k-tuple consisting entirely of A's (the default).

A 2-tuple table for three sequences:
S1 = GTGACGTCACTCTGAGGATCCCCTGGGTGTGG
S2 = GTCAACTGCAACATGAGGAACATCGACAGGCCCAAGGTCTTCCT
S3 = GGATCCCCTGTCCTCTCTGTCACATA

Acknowledgements

Acknowledgements go out to the creators of SSAHA, to classmates who helped with the editing of this poster, and to "ola," whose inspiration is unforgettable.