Discovery of Regulatory Elements by a Phylogenetic Footprinting Algorithm Mathieu Blanchette Martin Tompa Computer Science & Engineering University of Washington
Outline • How are genes regulated? • What is phylogenetic footprinting? • First solution • Improvements and extensions • Application to regulation of several important genes
Regulation of Genes • What turns genes on and off? • When is a gene turned on or off? • Where (in which cells) is a gene turned on? • How many copies of the gene product are produced?
Regulation of Genes • [Figure labels: transcription factor, RNA polymerase, DNA, coding region, regulatory element]
Goal • Identify regulatory elements in DNA sequences. These are: • Binding sites for proteins • Short substrings (5-25 nucleotides) • Up to 1000 nucleotides (or farther) from gene • Inexactly repeating patterns (“motifs”)
Phylogenetic Footprinting (Tagle et al. 1988) • Functional sequences evolve more slowly than nonfunctional ones. • Consider a set of orthologous sequences from different species. • Identify unusually well-conserved regions.
Substring Parsimony Problem • Given: • phylogenetic tree T, • set of orthologous sequences at leaves of T, • length k of motif, • threshold d. • Problem: • Find each set S of k-mers, one k-mer from each leaf, such that the “parsimony” score of S in T is at most d. • This problem is NP-hard.
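As a sketch of what the “parsimony” score means here, assume that the distance d(s, t) between two k-mers counts substitutions (Hamming distance), which matches the mutation counts used in the example slides. The labeling λ, which assigns a k-mer to every node of T, is notation introduced only for this illustration. Note that the slides use d both for the threshold and for the distance function.

```latex
\[
\operatorname{score}_T(S) \;=\;
\min_{\substack{\lambda \,:\, \text{nodes of } T \to \{A,C,G,T\}^k \\ \lambda \text{ agrees with } S \text{ on the leaves}}}
\;\sum_{(u,v) \in E(T)} d\bigl(\lambda(u), \lambda(v)\bigr)
\]
```

A set S is then reported exactly when its score is at most the threshold d.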
Small Example • Size of motif sought: k = 4 • AGTCGTACGTGAC... (Human) • AGTAGACGTGCCG... (Chimp) • ACGTGAGATACGT... (Rabbit) • GAACGGAGTACGT... (Mouse) • TCGTGACGGTGAT... (Rat)
Solution • Parsimony score: 1 mutation • Sequences: AGTCGTACGTGAC..., AGTAGACGTGCCG..., ACGTGAGATACGT..., GAACGGAGTACGT..., TCGTGACGGTGAT... • [Figure: tree over the five sequences, with the labels ACGT, ACGT, ACGT, ACGG]
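A quick way to see where the single mutation comes from is to scan each sequence for its k-mer closest to ACGT. The sketch below is only an illustration: it uses the visible prefixes of the five sequences (the parts before the "..."), and the helper names `kmers`, `hamming` and `closest_distance` are made up for this example.

```python
def kmers(seq, k):
    """All length-k substrings of seq, in order of occurrence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def hamming(s, t):
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(s, t))

def closest_distance(seq, motif):
    """Smallest Hamming distance between motif and any k-mer of seq."""
    return min(hamming(w, motif) for w in kmers(seq, len(motif)))

# Visible prefixes of the sequences from the example slide.
seqs = {
    "Human": "AGTCGTACGTGAC",
    "Chimp": "AGTAGACGTGCCG",
    "Rabbit": "ACGTGAGATACGT",
    "Mouse": "GAACGGAGTACGT",
    "Rat": "TCGTGACGGTGAT",
}

distances = {name: closest_distance(seq, "ACGT") for name, seq in seqs.items()}
print(distances)
```

Four of the prefixes contain ACGT exactly. The rat prefix does not, but it contains ACGG, which is one substitution away, so at least one mutation is needed.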
CLUSTALW multiple sequence alignment (rbcS gene)

Cotton    ACGGTT-TCCATTGGATGA---AATGAGATAAGAT---CACTGTGC---TTCTTCCACGTG--GCAGGTTGCCAAAGATA-------AGGCTTTACCATT
Pea       GTTTTT-TCAGTTAGCTTA---GTGGGCATCTTA----CACGTGGC---ATTATTATCCTA--TT-GGTGGCTAATGATA-------AGG--TTAGCACA
Tobacco   TAGGAT-GAGATAAGATTA---CTGAGGTGCTTTA---CACGTGGC---ACCTCCATTGTG--GT-GACTTAAATGAAGA-------ATGGCTTAGCACC
Ice-plant TCCCAT-ACATTGACATAT---ATGGCCCGCCTGCGGCAACAAAAA---AACTAAAGGATA--GCTAGTTGCTACTACAATTC--CCATAACTCACCACC
Turnip    ATTCAT-ATAAATAGAAGG---TCCGCGAACATTG--AAATGTAGATCATGCGTCAGAATT--GTCCTCTCTTAATAGGA-------A-------GGAGC
Wheat     TATGAT-AAAATGAAATAT---TTTGCCCAGCCA-----ACTCAGTCGCATCCTCGGACAA--TTTGTTATCAAGGAACTCAC--CCAAAAACAAGCAAA
Duckweed  TCGGAT-GGGGGGGCATGAACACTTGCAATCATT-----TCATGACTCATTTCTGAACATGT-GCCCTTGGCAACGTGTAGACTGCCAACATTAATTAAA
Larch     TAACAT-ATGATATAACAC---CGGGCACACATTCCTAAACAAAGAGTGATTTCAAATATATCGTTAATTACGACTAACAAAA--TGAAAGTACAAGACC

Cotton    CAAGAAAAGTTTCCACCCTC------TTTGTGGTCATAATG-GTT-GTAATGTC-ATCTGATTT----AGGATCCAACGTCACCCTTTCTCCCA-----A
Pea       C---AAAACTTTTCAATCT-------TGTGTGGTTAATATG-ACT-GCAAAGTTTATCATTTTC----ACAATCCAACAA-ACTGGTTCT---------A
Tobacco   AAAAATAATTTTCCAACCTTT---CATGTGTGGATATTAAG-ATTTGTATAATGTATCAAGAACC-ACATAATCCAATGGTTAGCTTTATTCCAAGATGA
Ice-plant ATCACACATTCTTCCATTTCATCCCCTTTTTCTTGGATGAG-ATAAGATATGGGTTCCTGCCAC----GTGGCACCATACCATGGTTTGTTA-ACGATAA
Turnip    CAAAAGCATTGGCTCAAGTTG-----AGACGAGTAACCATACACATTCATACGTTTTCTTACAAG-ATAAGATAAGATAATGTTATTTCT---------A
Wheat     GCTAGAAAAAGGTTGTGTGGCAGCCACCTAATGACATGAAGGACT-GAAATTTCCAGCACACACA-A-TGTATCCGACGGCAATGCTTCTTC--------
Duckweed  ATATAATATTAGAAAAAAATC-----TCCCATAGTATTTAGTATTTACCAAAAGTCACACGACCA-CTAGACTCCAATTTACCCAAATCACTAACCAATT
Larch     TTCTCGTATAAGGCCACCA-------TTGGTAGACACGTAGTATGCTAAATATGCACCACACACA-CTATCAGATATGGTAGTGGGATCTG--ACGGTCA

Cotton    ACCAATCTCT---AAATGTT----GTGAGCT---TAG-GCCAAATTT-TATGACTATA--TAT----AGGGGATTGCACC----AAGGCAGTG-ACACTA
Pea       GGCAGTGGCC---AACTAC--------------------CACAATTT-TAAGACCATAA-TAT----TGGAAATAGAA------AAATCAAT--ACATTA
Tobacco   GGGGGTTGTT---GATTTTT----GTCCGTTAGATAT-GCGAAATATGTAAAACCTTAT-CAT----TATATATAGAG------TGGTGGGCA-ACGATG
Ice-plant GGCTCTTAATCAAAAGTTTTAGGTGTGAATTTAGTTT-GATGAGTTTTAAGGTCCTTAT-TATA---TATAGGAAGGGGG----TGCTATGGA-GCAAGG
Turnip    CACCTTTCTTTAATCCTGTGGCAGTTAACGACGATATCATGAAATCTTGATCCTTCGAT-CATTAGGGCTTCATACCTCT----TGCGCTTCTCACTATA
Wheat     CACTGATCCGGAGAAGATAAGGAAACGAGGCAACCAGCGAACGTGAGCCATCCCAACCA-CATCTGTACCAAAGAAACGG----GGCTATATATACCGTG
Duckweed  TTAGGTTGAATGGAAAATAG---AACGCAATAATGTCCGACATATTTCCTATATTTCCG-TTTTTCGAGAGAAGGCCTGTGTACCGATAAGGATGTAATC
Larch     CGCTTCTCCTCTGGAGTTATCCGATTGTAATCCTTGCAGTCCAATTTCTCTGGTCTGGC-CCA----ACCTTAGAGATTG----GGGCTTATA-TCTATA

Cotton    T-TAAGGGATCAGTGAGAC-TCTTTTGTATAACTGTAGCAT--ATAGTAC
Pea       TATAAAGCAAGTTTTAGTA-CAAGCTTTGCAATTCAACCAC--A-AGAAC
Tobacco   CATAGACCATCTTGGAAGT-TTAAAGGGAAAAAAGGAAAAG--GGAGAAA
Ice-plant TCCTCATCAAAAGGGAAGTGTTTTTTCTCTAACTATATTACTAAGAGTAC
Larch     TCTTCTTCACAC---AATCCATTTGTGTAGAGCCGCTGGAAGGTAAATCA
Turnip    TATAGATAACCA---AAGCAATAGACAGACAAGTAAGTTAAG-AGAAAAG
Wheat     GTGACCCGGCAATGGGGTCCTCAACTGTAGCCGGCATCCTCCTCTCCTCC
Duckweed  CATGGGGCGACG---CAGTGTGTGGAGGAGCAGGCTCAGTCTCCTTCTCG
An Exact Algorithm (generalizing Sankoff and Rousseau 1975) • $W_u[s]$ = best parsimony score for subtree rooted at node u, if u is labeled with string s. • Each node's table has $4^k$ entries. • [Figure: phylogenetic tree with leaf sequences AGTCGTACGTG, ACGGGACGTGC, ACGTGAGATAC, GAACGGAGTAC, TCGTGACGGTG; the table at each node shows sample entries for ACGG and ACGT]
Recurrence • $W_u[s] = \sum_{v:\ \text{child of } u} \ \min_{t} \bigl( W_v[t] + d(s, t) \bigr)$
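The recurrence can be sketched in Python on the small example. This is a minimal illustration under stated assumptions, not the authors' implementation: d(s, t) is taken to be Hamming distance, leaf tables hold 0 for k-mers that occur in the sequence and +infinity otherwise, only the visible prefixes of the five sequences are used, and the tree topology `tree` is hypothetical because the slides do not give one. All function names are made up for this sketch.

```python
from itertools import product

INF = float("inf")

def hamming(s, t):
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(s, t))

def leaf_table(seq, kmers, k):
    """W[s] = 0 if s occurs in the leaf sequence, +infinity otherwise."""
    present = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    return {s: 0 if s in present else INF for s in kmers}

def node_table(node, seqs, kmers, k):
    """W_u[s] for every k-mer s, computed bottom-up over the subtree at node."""
    if isinstance(node, str):  # a leaf is named by its species
        return leaf_table(seqs[node], kmers, k)
    w = dict.fromkeys(kmers, 0)
    for child in node:  # sum over the children v of u
        wc = node_table(child, seqs, kmers, k)
        # +infinity entries can never attain the minimum, so skip them
        finite = [(t, c) for t, c in wc.items() if c < INF]
        for s in kmers:  # min over the child's labels t
            w[s] += min(c + hamming(s, t) for t, c in finite)
    return w

k = 4
kmers = ["".join(p) for p in product("ACGT", repeat=k)]
seqs = {
    "Human": "AGTCGTACGTGAC",
    "Chimp": "AGTAGACGTGCCG",
    "Rabbit": "ACGTGAGATACGT",
    "Mouse": "GAACGGAGTACGT",
    "Rat": "TCGTGACGGTGAT",
}
# Hypothetical topology, chosen only for illustration.
tree = (("Human", "Chimp"), ("Rabbit", ("Mouse", "Rat")))

root = node_table(tree, seqs, kmers, k)
best = min(root.values())
print(best, root["ACGT"])
```

On these prefixes the best root score is 1, attained with the root labeled ACGT, in line with the solution slide. Recovering the actual k-mers at the leaves would need a traceback over the tables, which this sketch omits.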
Running Time • $W_u[s] = \sum_{v:\ \text{child of } u} \ \min_{t} \bigl( W_v[t] + d(s, t) \bigr)$ • $O(k\,4^{2k})$ time per node • Total time $O(n\,k\,(4^{2k} + l))$, where n = number of species, k = motif length, l = average sequence length
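The bounds can be read off the recurrence directly. The accounting below is a sketch that assumes d(s, t) is computed in O(k) time, that a leaf table is filled by scanning the k-mers of its sequence, and that the tree has O(n) nodes and edges.

```latex
\begin{align*}
\text{per child edge:}\quad
  & \underbrace{4^k}_{\text{strings } s} \cdot
    \underbrace{4^k}_{\text{strings } t} \cdot
    \underbrace{O(k)}_{\text{cost of } d(s,t)}
    \;=\; O\bigl(k\,4^{2k}\bigr) \\
\text{per leaf:}\quad
  & O(k\,l) \ \text{to mark the $k$-mers occurring in a sequence of length } l \\
\text{total:}\quad
  & O(n) \cdot O\bigl(k\,4^{2k} + k\,l\bigr)
    \;=\; O\bigl(n\,k\,(4^{2k} + l)\bigr)
\end{align*}
```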
Improvements • A better algorithm reduces the time from $O(n\,k\,(4^{2k} + l))$ to $O(n\,k\,(4^{k} + l))$. • Restricting to motifs with parsimony score at most d greatly reduces the number of table entries computed (exponential in d, polynomial in k). • Amenable to many useful extensions (e.g., allowing insertions and deletions).
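The slides do not say how the faster bound is reached. One standard way to remove a factor of 4^k, shown here as an assumption rather than as the authors' method, is to compute min over t of (W[t] + d(s, t)) for all s at once by relaxing one string position at a time. This works because Hamming distance is a sum of independent per-position costs. The function name `min_plus_hamming` is hypothetical.

```python
from itertools import product

ALPHABET = "ACGT"
INF = float("inf")

def min_plus_hamming(w, k):
    """Return x with x[s] = min over t of (w[t] + hamming(s, t)).

    After processing position i, x[s] is the best cost over all t that may
    differ from s only in positions 0..i. Each pass touches 4^k strings and
    3 neighbours per string; with an integer encoding of k-mers that is
    O(k * 4^k) in total (the string slicing used here adds a factor of k).
    """
    x = dict(w)
    for i in range(k):
        nxt = {}
        for s in x:
            best = x[s]
            for c in ALPHABET:
                if c != s[i]:
                    neighbour = s[:i] + c + s[i + 1:]
                    best = min(best, x[neighbour] + 1)
            nxt[s] = best
        x = nxt
    return x

# Toy child table for k = 3: only ACG and TTT are free, everything else is +infinity.
k = 3
w = {"".join(p): INF for p in product(ALPHABET, repeat=k)}
w["ACG"] = 0
w["TTT"] = 0
x = min_plus_hamming(w, k)
print(x["ACG"], x["ACT"], x["AAA"])
```

Summing these per-child tables over the children of u gives W_u, so each edge costs on the order of k·4^k instead of k·4^{2k}.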
Application to -actin Gene • Gilthead sea bream (678 bp) • Medaka fish (1016 bp) • Common carp (696 bp) • Grass carp (917 bp) • Chicken (871 bp) • Human (646 bp) • Rabbit (636 bp) • Rat (966 bp) • Mouse (684 bp) • Hamster (1107 bp)
Common carp ACGGACTGTTACCACTTCACGCCGACTCAACTGCGCAGAGAAAAACTTCAAACGACAACATTGGCATGGCTTTTGTTATTTTTGGCGCTTGACTCAGGATCTAAAAACTGGAACGGCGAAGGTGACGGCAATGTTTTGGCAAATAAGCATCCCCGAAGTTCTACAATGCATCTGAGGACTCAATGTTTTTTTTTTTTTTTTTTCTTTAGTCATTCCAAATGTTTGTTAAATGCATTGTTCCGAAACTTATTTGCCTCTATGAAGGCTGCCCAGTAATTGGGAGCATACTTAACATTGTAGTATTGTATGTAAATTATGTAACAAAACAATGACTGGGTTTTTGTACTTTCAGCCTTAATCTTGGGTTTTTTTTTTTTTTTGGTTCCAAAAAACTAAGCTTTACCATTCAAGATGTAAAGGTTTCATTCCCCCTGGCATATTGAAAAAGCTGTGTGGAACGTGGCGGTGCAGACATTTGGTGGGGCCAACCTGTACACTGACTAATTCAAATAAAAGTGCACATGTAAGACATCCTACTCTGTGTGATTTTTCTGTTTGTGCTGAGTGAACTTGCTATGAAGTCTTTTAGTGCACTCTTTAATAAAAGTAGTCTTCCCTTAAAGTGTCCCTTCCCTTATGGCCTTCACATTTCTCAACTAGCGCTTCAACTAGAAAGCACTTTAGGGACTGGGATGC
Chicken ACCGGACTGTTACCAACACCCACACCCCTGTGATGAAACAAAACCCATAAATGCGCATAAAACAAGACGAGATTGGCATGGCTTTATTTGTTTTTTCTTTTGGCGCTTGACTCAGGATTAAAAAACTGGAATGGTGAAGGTGTCAGCAGCAGTCTTAAAATGAAACATGTTGGAGCGAACGCCCCCAAAGTTCTACAATGCATCTGAGGACTTTGATTGTACATTTGTTTCTTTTTTAATAGTCATTCCAAATATTGTTATAATGCATTGTTACAGGAAGTTACTCGCCTCTGTGAAGGCAACAGCCCAGCTGGGAGGAGCCGGTACCAATTACTGGTGTTAGATGATAATTGCTTGTCTGTAAATTATGTAACCCAACAAGTGTCTTTTTGTATCTTCCGCCTTAAAAACAAAACACACTTGATCCTTTTTGGTTTGTCAAGCAAGCGGGCTGTGTTCCCCAGTGATAGATGTGAATGAAGGCTTTACAGTCCCCCACAGTCTAGGAGTAAAGTGCCAGTATGTGGGGGAGGGAGGGGCTACCTGTACACTGACTTAAGACCAGTTCAAATAAAAGTGCACACAATAGAGGCTTGACTGGTGTTGGTTTTTATTTCTGTGCTGCGCTGCTTGGCCGTTGGTAGCTGTTCTCATCTAGCCTTGCCAGCCTGTGTGGGTCAGCTATCTGCATGGGCTGCGTGCTGGTGCTGTCTGGTGCAGAGGTTGGATAAACCGTGATGATATTTCAGCAAGTGGGAGTTGGCTCTGATTCCATCCTGAGCTGCCATCAGTGTGTTCTGAAGGAAGCTGTTGGATGAGGGTGGGCTGAGTGCTGGGGGACAGCTGGGCTCAGTGGGACTGCAGCTGTGCT
Human GCGGACTATGACTTAGTTGCGTTACACCCTTTCTTGACAAAACCTAACTTGCGCAGAAAACAAGATGAGATTGGCATGGCTTTATTTGTTTTTTTTGTTTTGTTTTGGTTTTTTTTTTTTTTTTGGCTTGACTCAGGATTTAAAAACTGGAACGGTGAAGGTGACAGCAGTCGGTTGGAGCGAGCATCCCCCAAAGTTCACAATGTGGCCGAGGACTTTGATTGCATTGTTGTTTTTTTAATAGTCATTCCAAATATGAGATGCATTGTTACAGGAAGTCCCTTGCCATCCTAAAAGCCACCCCACTTCTCTCTAAGGAGAATGGCCCAGTCCTCTCCCAAGTCCACACAGGGGAGGTGATAGCATTGCTTTCGTGTAAATTATGTAATGCAAAATTTTTTTAATCTTCGCCTTAATACTTTTTTATTTTGTTTTATTTTGAATGATGAGCCTTCGTGCCCCCCCTTCCCCCTTTTTGTCCCCCAACTTGAGATGTATGAAGGCTTTTGGTCTCCCTGGGAGTGGGTGGAGGCAGCCAGGGCTTACCTGTACACTGACTTGAGACCAGTTGAATAAAAGTGCACACCTTAAAAATGAGGCCAAGTGTGACTTTGTGGTGTGGCTGGGTTGGGGGCAGCAGAGGGTG
[Figure legend: parsimony score over 10 vertebrates: 0, 1, 2]
Motifs Absent from Some Species • Find motifs • with small parsimony score • that span a large part of the tree • Example: in tree of 10 species spanning 760 Myrs, find all motifs with • score 0 spanning at least 250 Myrs • score 1 spanning at least 350 Myrs • score 2 spanning at least 450 Myrs • score 3 spanning at least 550 Myrs
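The score-versus-span rule in the example can be written as a small lookup. The sketch below encodes only the four thresholds listed on the slide. The function name `is_reported` and the candidate values are hypothetical, and treating scores above 3 as never reported is an assumption.

```python
# Minimum evolutionary span (in Myrs) required for each parsimony score,
# taken from the example thresholds on the slide.
MIN_SPAN_MYR = {0: 250, 1: 350, 2: 450, 3: 550}

def is_reported(score, span_myr, min_span=MIN_SPAN_MYR):
    """True if a motif with this score spans enough of the tree to be reported.

    Scores without a listed threshold are rejected (an assumption).
    """
    required = min_span.get(score)
    return required is not None and span_myr >= required

# Hypothetical candidates: (parsimony score, span in Myrs).
candidates = [(0, 250), (1, 300), (2, 760), (3, 550), (4, 760)]
kept = [c for c in candidates if is_reported(*c)]
print(kept)
```

The more mutations a motif needs, the larger the part of the tree it must cover, so weakly conserved motifs are kept only when they persist across distant species.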
Application to c-fos Gene • Species: puffer fish, chicken, pig, mouse, hamster, human • [Figure: phylogenetic tree with branch lengths, in extraction order: 10, 7, 2, 2, 1, 2, 2, 1, 0, 1] • Asked for motifs of length 10, with: • 0 mutations over tree of size 6 • 1 mutation over tree of size 11 • 2 mutations over tree of size 16 • 3 mutations over tree of size 21 • 4 mutations over tree of size 26 • Found: • 0 mutations over tree of size 8 • 1 mutation over tree of size 16 • 3 mutations over tree of size 21 • 4 mutations over tree of size 28
Application to c-fos Gene

Motif               Score  Conserved in         Known?
CAGGTGCGAATGTTC     0      4 mammals
TTCCCGCCTCCCCTCCCC  0      4 mammals            yes
GAGTTGGCTGcagcc     3      puffer + 4 mammals
GTTCCCGTCAATCcct    1      chicken + 4 mammals  yes
CACAGGATGTcc        4      all 6                yes
AGGACATCTG          1      chicken + 4 mammals  yes
GTCAGCAGGTTTCCACG   0      4 mammals            yes
TACTCCAACCGC        0      4 mammals
Other Genes Similar results for the following genes: • insulin • c-myc promoter and intron • growth hormone • interleukin-3 • histone H1 • -globin • dihydrofolate reductase • fibroin • myogenin • prolactin • thyroglobulin • γ-actin 3´ UTR • rbcS • rbcL
Conclusions • Guaranteed optimality for question posed • Time linear in the number of species and the total sequence lengths, exponential in the parsimony score • Practical on real biological data sets • Discovered highly conserved regions, both known and not (yet) known • Available at http://bio.cs.washington.edu/software.html