Advanced Computational Biology Project Presentation • Team Members: Joshua Wu (11174269), Shuyu (Christine) Xu (11161640)
OVERVIEW • Project Description • Introduction • Motivation • Bioinformatics Application • Explicit vs Implicit • Problem Analysis • Implement Files • Experimental Results • Conclusion • Possible Future Work
Project Description • Explicit Suffix Trees: Suppose that we want to store explicitly all strings that are edge labels of a suffix tree. The main question of this project is how much space explicit suffix trees require compared to implicit suffix trees. • Implement a suffix tree algorithm and run it on substrings of real data.
Introduction • Any string of length m can be decomposed into m suffixes, and these suffixes can be stored in a suffix tree. • Setup time: O(m), where m is the length of the string • Search time: O(n), where n is the length of the pattern
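The suffix decomposition and the O(n) pattern search can be sketched as follows. This is only an illustration under stated assumptions: it builds an uncompressed suffix trie by naive insertion, which costs O(m^2) time and space, not the O(m) setup that a linear-time suffix tree construction achieves. The function names `suffixes`, `build_suffix_trie`, and `contains` are hypothetical and do not come from the project's files.

```python
def suffixes(text):
    """Return all m suffixes of a string of length m, longest first."""
    return [text[i:] for i in range(len(text))]


def build_suffix_trie(text):
    """Naive suffix trie as nested dicts (O(m^2), for illustration only)."""
    root = {}
    for suffix in suffixes(text):
        node = root
        for ch in suffix:
            # Follow the existing branch, or create it if it is missing.
            node = node.setdefault(ch, {})
    return root


def contains(trie, pattern):
    """Check whether pattern occurs in the text: one step per pattern character."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


trie = build_suffix_trie("ABCABC$")
print(suffixes("ABC$"))        # ['ABC$', 'BC$', 'C$', '$']
print(contains(trie, "CAB"))   # True
print(contains(trie, "CB"))    # False
```

The search loop touches each pattern character once, which is where the O(n) search time comes from, independent of the length of the indexed string.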
Motivation • "Suffix trees are widely used in the computer field... Recent improvements in the method have cut the memory requirement to 17 bytes per letter, which brings the method to the verge of practicality [for bioinformatics applications]" -- Nat Goodman (Genome Technology).
Bioinformatics Application • multiple genome alignment (Michael Hohl et al., 2002) • selection of signature oligonucleotides for DNA arrays (Kaderali and Schliep, 2002) • identification of sequence repeats (Kurtz and Schleiermacher, 1999)
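Explicit vs Implicit • String: ABC$ (positions 1 2 3 4) • Explicit edge labels (the substrings themselves): ABC$, BC$, C$, $ • Implicit edge labels (start,end index pairs into the string): 1,4 for ABC$; 2,4 for BC$; 3,4 for C$; 4,4 for $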
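A minimal sketch of how the two labelings relate, assuming 1-based inclusive (start, end) pairs as on the slide. The helper name `explicit_label` is hypothetical.

```python
def explicit_label(text, start, end):
    """Convert a 1-based inclusive (start, end) pair into the substring it denotes."""
    return text[start - 1:end]


text = "ABC$"
# Implicit labels: each edge stores only two indices into the text.
implicit_edges = [(1, 4), (2, 4), (3, 4), (4, 4)]
# Explicit labels: each edge stores the substring itself.
explicit_edges = [explicit_label(text, s, e) for s, e in implicit_edges]

print(explicit_edges)  # ['ABC$', 'BC$', 'C$', '$']
```

An implicit edge always costs two numbers, while an explicit edge costs as many characters as its label is long.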
Problem Analysis • Best case for explicit and implicit suffix trees: all characters different • The best case is unlikely with DNA inputs, whose alphabet has only 4 characters • Worst case: the same character throughout
Assumptions • In implicit trees, each number counts as one unit of memory (called one "bit" here) regardless of its magnitude; for example, the number 10 takes up 1 unit • Only alphabetic characters appear in the sequence
Example: all different characters • String: ABCD$ (positions 1 2 3 4 5) • Implicit edge labels: 1,5; 2,5; 3,5; 4,5; 5,5 • N: string length • N = 5 • Memory = 10 (5 edges, 2 numbers each) • Best case
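Example • String: ABCABC$ (positions 1 2 3 4 5 6 7) • Implicit edge labels: 1,3; 2,3; 6,6; 7,7 from the root, then 4,7 and 7,7 below each of the three internal nodes • N: string length • N = 7 • Memory = 20 (10 edges, 2 numbers each)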
Example: all same character • String: AAAA$ (positions 1 2 3 4 5) • Implicit edge labels: 1,1 and 5,5; 2,2 and 5,5; 3,3 and 5,5; 4,5 and 5,5 • N: string length • N = 5, 6, 7 gives Memory = 16, 20, 24 • Memory = 4N - 4 • Worst case
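The memory counts in the three examples above can be checked with a small sketch. It assumes a naive O(N^2) suffix tree construction (not a linear-time algorithm, and not necessarily the one the project implemented), a string ending in a unique terminator such as `$`, and the slides' cost model of one unit per stored number. Explicit cost is counted here as one unit per stored character, which is an assumption of this sketch. The names `Node`, `build_suffix_tree`, `edge_labels`, and `memory_cost` are hypothetical.

```python
class Node:
    def __init__(self):
        # first character of edge -> (start, end, child), 0-based half-open [start, end)
        self.children = {}


def build_suffix_tree(text):
    """Naive compressed suffix tree; text is assumed to end with a unique terminator."""
    root, n = Node(), len(text)
    for i in range(n):
        node, pos = root, i
        while pos < n:
            ch = text[pos]
            if ch not in node.children:
                node.children[ch] = (pos, n, Node())  # new leaf edge
                break
            start, end, child = node.children[ch]
            k = 0  # number of characters matched along this edge
            while start + k < end and pos + k < n and text[start + k] == text[pos + k]:
                k += 1
            if start + k == end:
                node = child  # whole edge matched, descend
            else:
                mid = Node()  # mismatch inside the edge: split it
                mid.children[text[start + k]] = (start + k, end, child)
                node.children[ch] = (start, start + k, mid)
                node = mid
            pos += k
    return root


def edge_labels(node):
    """Yield the (start, end) pair of every edge in the tree."""
    for start, end, child in node.children.values():
        yield start, end
        yield from edge_labels(child)


def memory_cost(text):
    """Return (implicit, explicit): 2 numbers per edge vs. characters per edge label."""
    edges = list(edge_labels(build_suffix_tree(text)))
    implicit = 2 * len(edges)
    explicit = sum(end - start for start, end in edges)
    return implicit, explicit


for s in ("ABCD$", "ABCABC$", "AAAA$"):
    print(s, memory_cost(s))
# ABCD$ (10, 15)
# ABCABC$ (20, 22)
# AAAA$ (16, 9)
```

The implicit figures match the slides (10, 20, and 16). The sketch stores 0-based half-open pairs internally, so the individual index pairs differ from the 1-based labels shown on the slides, but the number of edges, and therefore the implicit memory, is the same.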
Program Input Data • DNA for all kinds of creatures: Homo sapiens, monkeys, chickens, …
Sample input: Homo sapiens • cagctcctgagactgctggcatgaaggggagccgtgccctcctgctggtggccctcaccctgttctgcatctgccggatggccacaggggaggacaacgatgagtttttcatggacttcctgcaaacactactggtggggaccccagaggagctctatgaggggaccttgggcaagtacaatgtcaacgaagatgccaaggcagcaatgactgaactcaagtcctgcagagatggcctgcagccaatgcacaaggcggagctggtcaagctgctggtgcaagtgctgggcagtcaggacggtgcctaagtggacctcagacatggctcagccataggacctgccacacaagcagccgtggacacaacgcccactaccacctcccacatggaaatgtatcctcaaaccgtttaatcaataa
Sample input 2: plants • EARPIVVGPPPPLSGGLPGTENSDQARDGTLPYTKDRFYLQPLPPTEAAQRAKVSASEILNVKQFIDRKAWPSLQNDLRLRASYLRYDLKTVISAKPKDEKKSLQELTSKLFSSIDNLDHAAKIKSPTEAEKYYGQTVSNINEVLAKLG
Sample Input: Homo Sapiens • atgaaggggagccgtgccctcctgctggtggccctcaccctgttctgcatctgccggatggccacaggggaggacaacgatgagtttttcatggacttcctgcaaacactactggtggggaccccagaggagctctatgaggggaccttgggcaagtacaatgtcaacgaagatgccaaggcagcaatgactgaactcaagtcctgcagagatggcctgcagccaatgcacaaggcggagctggtcaagctgctggtgcaagtgctgggcagtcaggacggtgcctaa
Sample Input: Monkey Virus • GGSCFKCGKKGHFAKNCHEHAHNNAEPKVPGLCPRCKRGKHWANECKSKTDNQGNPIPPH
Sample input: tobacco • SYSITTPSQFVFLSSAWADPIELINLCTNALGNQFQTQQARTVVQRQFSEVWKPSPQVTVRFPDSDFKVYRYNAVLDPLVTALLGAFDTRNRIIEVENQANPTTAETLDATRRVDDATVAIRSAINNLIVELIRGTGSYNRSSFESSSGLVWTSGPAT
Sample Input: Insects • DCLSGRYKGPCAVWDNETCRRVCKEEGRSSGHCSPSLKCWCEGC
Sample Input: Birds • IDTCRLPSDRGRCKASFERWYFNGRTCAKFIYGGCGGNGNKFPTQEACMKRCAKA
Sample Input: SARS • ALNTLVKQLSSNFGAISSVLNDILSRLDKVEAEV
Sample Input: Fish • GHHHHHHLEDPSGGTPYIGSKISLISKAEIRYEGILYTIDTENSTVALAKVRSFGTEDRPTDRPIAPRDETFEYIIFRGSDIKDLTVCEPPKPIM