SSAHA - A Fast Search Method Meg Eckman, Kaitlyn McPartland, Nnenna Opara, Marni Siegel

Rat • Cow • Mosquito • Armadillo • Honeybee • Worm • Pufferfish • Chicken • Frog • Dog • Human • Mouse • Zebrafish • Chimpanzee • Macaque • Opossum • Elephant • Fly SSAHA - A Fast Search Method Meg Eckman, Kaitlyn McPartland, Nnenna Opara, Marni Siegel Genome Revolution Focus, Duke University Introduction Genomes Available for Comparison Using SSAHA Online at http://www.sanger.ac.uk/DataSearch/blast.shtml Applications DNA Comparison APT Problem Statement In trying to compare two different genomes of the same species, bioinformatic scientists use SSAHA to rapidly find the similarity between strands of DNA. One application of the SSAHA algorithm and tool is to find SNPs: single nucleotide polymorphisms. In this problem you'll return the location of the first SNP in two strands of the same length, or zero if the strands are the same and don't contain a SNP. The first base-pair in a strand has index 1 in this problem, not index zero. As in the real SSAHA code, any character other than 'g', 't', 'c', or 'a' should be treated as an 'a' for the purposes of matching nucleotides. Definition Class: DNACompare Method: snpFind Parameter: String dna1, String dna2 Returns: int Method signature: int snpFind (String dna, String n); (be sure your method is public) Class public class DNAComparison { public int snpDetector (String dna, String n) { // fill in code here } } Constraints dna1 and dna2 will have a maximum of fifty characters and the same character length. The characters of DNA will be ‘a’, ‘g’, ‘t’, ‘x’ or ‘c’. The character ‘x’ should be treated as an ‘a’. Examples 1. dna1= aaggttcc dna2= aagtttcc Returns: “4” because the first place the strands disagree is the g and t. 2. dna1 = aaaaaaaaaaaa dna2 = aaaaaaaaxaaa Returns: “0” because the program will treat the ‘x’ as an ‘a’. SSAHA stands for Sequence Search and Alignment by Hashing Algorithm, a method of quickly and easily searching the large amounts of data in DNA databases. Created in 2001, SSAHA offers an improved sequence alignment tool to BLAST improves BLAST as a faster tool for sequence alignment. SSAHA uses a hash table to store the locations of strings nucleotides in many different “bins”. Each nucleotide is coded for using a binary system; thus when trying to match a string of nucleotides from a different genome or within the same genome (as in genome shotgun technique), the program only has to look through bins until it finds the proper nucleotide sequence instead of searching through the entire genome. SSAHA compromises memory for speed, but this speed means that SSAHA can handle high throughput such as entire genomes. • SSAHA has been used to… • Refine repetitive sequence searches (Reneker, 2005) • Locate large amounts of genetic variation within the zebrafish species (Guryev, 2006) • Localizing large data sets (Harper, 2006) • Aligning DNA strands (Ning, 2004) Literature Cited Guryev, Koudijs, Berezikov, Johnson, Plasterk, van Eeden, Cuppen. “Genetic Variation on Zebrafish.” Genome Research, March 13, 2006. Reneker, jeff and Chi-Ren Shyu. “Refined Repetitive Sequence Searches Utilizing a Fast Hash Function and Cross Species Information Retrievals.” BMC Bioinformatics 2005, 6:111. Ning, Cox, Mullikin. “SSAHA: A Fast Search Method for Large DNA Databases.” Genome Research, 2001 11: 1725-1729. Harper, Huang, Stryke, Kawamoto, Ferrin, Bbaitt. “Comparison of Methods for Genomic Localization fo Gene Trap Sequences.” MC Genomics, v. 7, 2006. Ning, Spooner, Spargo, Leonard, Rae, Cox. “The SSAHA Trace Server.” Computational Systems Bioinformatics Conference, 2004. http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=1332490 http://www.sanger.ac.uk/DataSearch/blast.shtml Raphael, B. “Aspects and applications of symbol manipulation.” ACM/CSC-ER: Proceedings from the 1966 Conference. http://portal.acm.org/citation.cfm?id=800256.810683 Example of a Hash Table (Ning, 2001) Table 1. A 2-tuple Hash Table for S1, S2, and S3 w E(w) Positions AA 0 (2, 19) AC 1 (1, 9) (2, 5) (2, 11) AG 2 (1, 15) (2, 35) AT 3 (2, 13) (3, 3) CA 4 (2, 3) (2, 9) (2, 21) (2, 27) (2, 33) (3, 21) (3, 23) CC 5 (1, 21) (2, 31) (3, 5) (3, 7) CG 6 (1, 5) CT 7 (1, 23) (2, 39) (2, 43) (3, 13) (3, 15) (3, 17) GA 8 (1, 3) (1, 17) (2, 15) (2, 25) GC 9 GG 10 (1, 25) (1, 31) (2, 17) (2, 29) (3, 1) GT 11 (1, 1) (1, 27) (1, 29) (2, 1) (2, 37) (3, 19) TA 12 (3, 25) TC 13 (1, 7) (1, 11) (1, 19) (2, 23) (2, 41) (3, 11) TG 14 (1, 13) (2, 7) (3, 9) TT 15 7) (3 Acknowledgements Timeline We would like to recognize Professor Owen Astrachan for his brilliance, amazing YouTube video creations, and ever sarcastic quips in class. 2004 2000 2006 1989 2001 1966 For Further Information Genetic Variation in Zebrafish is published. SSAHA is published. Formation of a hash table is mentioned in a Stanford professor’s presentation. BLAST is published. John Sulsten and Craig Venter jointly announce the completion of a draft of the Human Genome. SSAHA2 is published. Visit or webpage at www.duke.edu/~kam57.

SSAHA - A Fast Search Method Meg Eckman, Kaitlyn McPartland, Nnenna Opara, Marni Siegel

SSAHA - A Fast Search Method Meg Eckman, Kaitlyn McPartland, Nnenna Opara, Marni Siegel

Presentation Transcript

Prune-and-Search Method

FAST Search & Transfer FAST ESP

Enterprise Search with FAST

Kevin Siegel

Poetry by Kaitlyn

Marjorie Siegel

BY Kaitlyn Dodge

Kaitlyn Crossan

Justin Siegel

I-Search Method

A Method for Fast Delay/Area Estimation

Haha ïƒ SSAHA

Fast search methods

Fast Image Search

Search Fast , Do Smart

Noah Kaitlyn Burkey

Jane Austen Alexis Eckman

FAST Search and Transfer

Jeff Siegel

A sensitive search for decay: the MEG experiment

Prune-and-Search Method

Television Lawyer, Victor Opara. Victor Nnamdi Opara

SSAHA - A Fast Search Method Meg Eckman, Kaitlyn McPartland, Nnenna Opara, Marni Siegel