In silico reconstruction of an ancestral mammalian genome

UQAM Séminaire de bioinformatique. Mathieu Blanchette.

Presentation Transcript


  1. In silico reconstruction of an ancestral mammalian genome UQAM Seminaire de bioinformatique Mathieu Blanchette

  2. The Human genome • Sequence of ~3×10^9 nucleotides • Sequence is known (draft published in 2001) • How does it work? [Slide background: a wall of raw nucleotide sequence, CGACTGCATCAGACGACGATCAGACTACTATATCAGCAGATTACGGT ...]

  3. Comparative Genomics • Goal: Functional annotation of the genome • What is the role of each region of the genome? • Very hard to answer…. • Idea: Look not only at what our genome is now, but also at how it evolved • Different types of functional regions have different evolutionary signatures • Complete genomes are sequenced for: • Human, chimp, mouse, rat, house, chicken, zebrafish, pufferfish • Partial genomes are available for: • Dog, cow, rabbit, elephant, armadillo

  4. Mutations • Substitutions, deletions, insertions: G(t) = ACGTAGGCGATCAG---ATCGAT G(t+1) = ACGAAGG--ATCAGGGGATCGAT • Other less frequent mutations: - Duplications - Genome rearrangements (e.g. large inversions) • Mutations happen randomly • Natural selection favors mutations that improve fitness
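The three small-scale mutation types on this slide can be read directly off the two aligned sequences. The following is a minimal sketch, assuming gaps are written as `-` and both rows have equal length; the function name `classify_columns` is hypothetical and not part of the talk.

```python
def classify_columns(before: str, after: str) -> dict:
    """Count substituted, deleted and inserted bases between two aligned rows."""
    if len(before) != len(after):
        raise ValueError("aligned sequences must have equal length")
    counts = {"substitutions": 0, "deleted_bases": 0, "inserted_bases": 0}
    for a, b in zip(before, after):
        if a == "-" and b == "-":
            continue                      # empty column, nothing happened
        if a == "-":
            counts["inserted_bases"] += 1  # base present only after
        elif b == "-":
            counts["deleted_bases"] += 1   # base present only before
        elif a != b:
            counts["substitutions"] += 1
    return counts


g_t = "ACGTAGGCGATCAG---ATCGAT"    # G(t) from the slide
g_t1 = "ACGAAGG--ATCAGGGGATCGAT"   # G(t+1) from the slide
print(classify_columns(g_t, g_t1))
```

On the slide's example this counts one substitution (T to A), two deleted bases and three inserted bases.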

  5. A random walk in genome space

  6. Mammalian evolution • Rapid radiation ~75 Myrs ago • Many nearly independent lineages • Many “noisy” copies of the ancestor • Accurate reconstruction of ancestors may be feasible • Tree figure: http://www.broad.mit.edu/personal/jpvinson/phylogenetics/bigtree_1_0.jpg

  7. Ancestral Genome Reconstruction • Given: - Genomic sequences of several mammals • - Phylogenetic tree • Find: The genomic sequence of all their ancestors ARMADILLO TGCTACTAATATTTAGTACATAGAGCCCAGGGGTGCTGCTGAAAGTCTTAAAATGCACAGTGTAGCCCCTCCTCC COW GCCTCTCTTTCTGCCCTGCAGGCTAGAATGTATCACTTAGATGTTCCAAATCAGAAAGTGTTCAGCCATTTCCATACC HORSE GTCACAATTTAGGAAGTGCCACTGGCCTCTAGAGGGTAGAAGACAGGGATGCTAATAATCATCCCACGTCATCCTACAGTGCTCAGAACAGCACCCCTACCCTCACCCC CAT GTCACAGTTTAGGGGGTACTACTGGCATCTATCGGGTGGAGGATAGGGATACTGATAATCATTCTACAGTGCACAGGACAGTACCCCTACTTTCACCCC DOG GTCACAATTTGGGGGATACTACTGGCATCTAATGGGTAGAGGACAGGGATACTGATAATTGCTTTACAGTGCACAGGACAGCACCCTTATCTTCACCCC HEDGEHOG GTCATAGTTTGATTATATGGGCTTCTTAGTAGACAAAGAAAAAGATGTTCTGGTAGTCATTCTGCTTTCCATATGATAGCACTCCCATCTTCACTTC MOUSE GTCACAGTTTGGAGGATGTTACTGACATCTAGAGAGTAGACTTTAAAGATACTGATAGTCACCCCATTGTGCACCTCC RAT GTCACAATTTGGAGGATGTTACTGGCATCTAGAGAGTAGACTTTAAGGACACTGATAATCATACTATGCTGCACTTCC RABBIT ATCACAATTTGGGGAACACCACTGGCATCTCGGGTAGCAGGCCAGGCATGCTGGTAATTATACTACAGTGCACAGTACAGTTCCCCACATCCCGCACC LEMUR ATCACAATTGGGGGTGCCACGGTCCTCCAGTGGGTAGAGAACAGGGAGGCTGATAACCACCCTGCAGTGCACAGGGCAGTGCCCCACTCCCACCAC MOUSE-LEMUR ATCACAGTTGGGGGATGCCACTGGCCTCAAGTGGGTAGAGAACAGGGAGGCTGAAAACCACCCTGCAGAGCACGGGGCAGTGCCTTCACCACCACTCC VERVET GTCAGAATTTGGGGGATGCTTCTGGCTCTACTTGGGTAGAGAAACAGGGATGCTTATAATCATCCTACAGTGCACAGGACAGTACCCCCACCCACACTCC MACAQUE GTCAGAATTTGGGGGATGCTTCTGGCTCTACTTGGGTAGAGAAACAGGAATGCTTATAATCATCCTACAGTGCACAGGTCAGTACCCCCACCCACACTCC BABOON GTCAGAATTTGGGGGATGCTTCTGGCTCTACTTGGGTAGAAAAACAGGGATGCTTATAATCATCCTACAGTGCACAGGACAGTACCCCCACCCACACTCC ORANGUTAN GTCACGATTTGGGAGATGCTTCTGGCTCGACTTGGGTAGAGAAGCGGGGATGCTTATAATCATCCAACAGTGCACAGGACAGTACCCCCACCCACACTCC GORILLA GTCACGATTTGGGGGATGCTTCTGGCTCAACTTGGGTAGAGAAGTGGGGATGCTTATACTCATCCTACAGTGCACAGGACAGTACCCCCACCCACACTCC CHIMP GTCACGATTTGGGGGATGCTTCTGGCTCAACTTGGGTAGAGAAGCGGGGATGCTTATAATCATCCTACAGTGCACAGGACAGTACCCCCACCCACACTCC HUMAN 
GTCACGATTTGGGGGATGCTTCTGGCTCAACTTGGGTAGAGAAGCGGGGATGCTTATAATCATCCTACAGTGCACAGGACAGTACCCCCACCCACACTCC All of it: functional, non-functional, introns, intergenic, repeats, everything*! • Mutational operations • Small scale: substitutions, deletions, insertions (incl. transposons) • Large scale: genome rearrangements, segmental/tandem duplications • (*) Heterochromatin not included

  8. Reconstruction algorithm 1) Identify syntenic regions in each species • Blastz (Schwartz et al.) and chaining/netting programs (Kent et al.) • In the ENCODE case: targeted BAC sequencing

  9. Reconstruction algorithm 2) Compute multiple genome alignment • TBA program (Blanchette, Miller, et al.) • Goal: Phylogenetic correctness • Two nucleotides are aligned if and only if they have a common ancestor. ARMADILLO ----------------TGCTACTAATAT-----T-TAGTA-CATAGAG-CC-CAGGGGTGCTGCTGAAA----------GTCTTAAAATGCACAGTGTAGCCCCTCCTCC------------ACAAAGAATTAACTAGCCCAGAATGTCAGGA--------GT--A-CCAAG COW GCCTCTCTTT-----------CTGCCCTGCAGGC-TAGAA-TGTATCA-CT-TAGATGTTCCAA---------------ATCAGAAAGTGTTCAG----------CCATTTCCATACCACC----AGGAGCTA-CAATGTTGGGCTGCAGCTA--------TTTGGATCAAA HORSE GTCACAATTTAGGAAGTGCCACTGGCCT-----C-TAGAG-GGTAGAA-GA-CAGGGATGCTAATAATCATCCCACGTCATCCTACAGTGCTCAGAACAGCACCCCTACCCTCACCCCATCAACAAAGAATTATCCAGCCCAAAATGCCAATA--------GT--GCCCAGA CAT GTCACAGTTTAGGGGGTACTACTGGCAT-----C-TATCG-GGTGGAG-GA-TAGGGATACTGATAATC----------ATTCTACAGTGCACAGGACAGTACCCCTACTTTCACCCCACAA-CAAAGAATTATCCAGCCCAAAATGCCAACA--------GT--GCTCAGA DOG GTCACAATTTGGGGGATACTACTGGCAT-----C-TAATG-GGTAGAG-GA-CAGGGATACTGATAATT----------GCTTTACAGTGCACAGGACAGCACCCTTATCTTCACCCCAAAAGCAAAGTATTATCCAGCCCCAAATGCCAATG--------GT--GCTCAGA HEDGEHOG GTCATAGTTT----GATTATATGGGCTT-----CTTAGTA-GACAAAGAAA-AAGATGTTCTGGTAGTC----------ATTCTGCTTTCCATATGATAGCACTCCCATCTTCACTTCCAAAATTAAGAGTCATCATACTCAGTGTGCCAATA--------TG--GCCCAGA MOUSE GTCACAGTTTGGAGGATGTTACTGACAT-----C-TAGAG-AGTAGAC-TT-TAAAGATACTGATAGTC----------ACCCCATTGTGCAC---------------------CTCCAACAATAATGGCTCATCGAAACCTAAATGCCAATCTGCCAATTAT--GTCCATG RAT GTCACAATTTGGAGGATGTTACTGGCAT-----C-TAGAG-AGTAGAC-TT-TAAGGACACTGATAATC----------ATACTATGCTGCAC---------------------TTCCAACAATAATGGCTCATCTAGACCTAAATACCAATCTGCCAATTAT--ATCCATG RABBIT ATCACAATTTGGGGAACACCACTGGCAT-----C-TCGGGTAGCAGGC----CAGGCATGCTGGTAATT----------ATACTACAGTGCACAGTACAGTTCCCCACATCCCGCACCAACAACA--GGTTTATGCTGCCCAAAGTGCCAGTGTGC-----------CCACG LEMUR 
ATCACAA-TTGGGGG-TGCCACGGTCCT-----C-CAGTG-GGTAGAG-AA-CAGGGAGGCTGATAACC----------ACCCTGCAGTGCACAGGGCAGTGCC-CCACTCCCACCACAACAATGGAGAATTATTGGGCCCCAAATGCCAATA--------GT--GCCCAAG MOUSELEMUR ATCACAG-TTGGGGGATGCCACTGGCCT-----C-AAGTG-GGTAGAG-AA-CAGGGAGGCTGAAAACC----------ACCCTGCAGAGCACGGGGCAGTGCCTTCACCACCACTCCAACAACGGAGAATTATTGGGTCCCAAATGCCAATA--------GT--GCCCAGG VERVET GTCAGAATTTGGGGGATGCTTCTGGCTC-----T-ACTTG-GGTAGAG-AAACAGGGATGCTTATAATC----------ATCCTACAGTGCACAGGACAGTACCCCCACCCACACTCCAGTATCGAAGAATCATTGAACCCAAAATGTTAATA--------GT--GTCCAGG MACAQUE GTCAGAATTTGGGGGATGCTTCTGGCTC-----T-ACTTG-GGTAGAG-AAACAGGAATGCTTATAATC----------ATCCTACAGTGCACAGGTCAGTACCCCCACCCACACTCCAGTATCGAAGAATCATTGGACCCAAAATGCTAATG--------GT--GTCCAGG BABOON GTCAGAATTTGGGGGATGCTTCTGGCTC-----T-ACTTG-GGTAGAA-AAACAGGGATGCTTATAATC----------ATCCTACAGTGCACAGGACAGTACCCCCACCCACACTCCAGTATCGAAGAATCATTGGACCCAAAATGTTAATG--------GT--GTCCAGG ORANGUTAN GTCACGATTTGGGAGATGCTTCTGGCTC-----G-ACTTG-GGTAGAG-AAGCGGGGATGCTTATAATC----------ATCCAACAGTGCACAGGACAGTACCCCCACCCACACTCCAGTAATGAAGAATCACTGGACCCAAAATGTTAATG--------GT--GTCCAGG GORILLA GTCACGATTTGGGGGATGCTTCTGGCTC-----A-ACTTG-GGTAGAG-AAGTGGGGATGCTTATACTC----------ATCCTACAGTGCACAGGACAGTACCCCCACCCACACTCCAGTAATGAAGAATCATTAGACCGAAAATGTTAATG--------GT--GTCCAGG CHIMP GTCACGATTTGGGGGATGCTTCTGGCTC-----A-ACTTG-GGTAGAG-AAGCGGGGATGCTTATAATC----------ATCCTACAGTGCACAGGACAGTACCCCCACCCACACTCCAGTAATGAAGAATCATTAGACCGAAAATGTTAATG--------GT--GTCCAGA HUMAN GTCACGATTTGGGGGATGCTTCTGGCTC-----A-ACTTG-GGTAGAG-AAGCGGGGATGCTTATAATC----------ATCCTACAGTGCACAGGACAGTACCCCCACCCACACTCCAGTAATGAAGAATCATTAGACCTAAAATGTTAATG--------GT--GTCCAGG

  10. Reconstruction algorithm 3) Reconstruct insertion/deletion history • Find most likely explanation for gaps observed [Alignment as on slide 9]

  12. Reconstruction algorithm 3) Reconstruct insertion/deletion history • Find most likely explanation for gaps observed • This defines the presence/absence of a base at each position of each ancestor [Alignment as on slide 9, with an added ancestor row marking each position as present (N) or absent (-):] NNNNNNNNNNNNNNNNNNNNNNNNNNNN-----N-NNNNN-NNNNNNN-NN-NNNNNNNNNNNNNNNNN----------NNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

  13. Reconstruction algorithm 4) Infer the maximum-likelihood nucleotide at each position • Felsenstein's algorithm with a context-sensitive model • Ancestral sequences are inferred! [Alignment as on slide 9, with the inferred ancestral sequence added as a final row:] GTCACAATTTGGGGGATGCTACTGGCAT-----C-TAGTG-GGTAGAG-AA-CAGGGATGCTGATAATC----------ATCCTACAGTGCACAGGACAGTGCCCCCACCCCCACTCCAACAACAAAGAATTATCCGGCCCAAAATGCCAATA--------GT--GCCCAGG
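Step 4 can be illustrated with Felsenstein's pruning recursion on a single alignment column. This is a minimal sketch under stated assumptions: it uses the Jukes-Cantor substitution model with a uniform prior instead of the context-sensitive model named on the slide, and the tree encoding, branch lengths and function names (`jc69`, `pruning`, `ml_base`) are hypothetical.

```python
import math

BASES = ("A", "C", "G", "T")


def jc69(a, b, t):
    """Jukes-Cantor probability that base a is read as b after t subst/site."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if a == b else 0.25 - 0.25 * decay


def pruning(node):
    """Return {base: P(leaves below node | node carries base)}.

    A node is a leaf (one-letter string) or a list of (child, branch_length).
    """
    if isinstance(node, str):
        if node not in BASES:                 # gap or unknown: missing data
            return {x: 1.0 for x in BASES}
        return {x: 1.0 if x == node else 0.0 for x in BASES}
    like = {x: 1.0 for x in BASES}
    for child, t in node:
        below = pruning(child)
        for x in BASES:
            like[x] *= sum(jc69(x, y, t) * below[y] for y in BASES)
    return like


def ml_base(node):
    """Most likely base at this node and its posterior under a uniform prior."""
    like = pruning(node)
    total = sum(like.values())
    best = max(BASES, key=lambda x: like[x])
    return best, like[best] / total


# Hypothetical 4-leaf tree for one alignment column.
column_tree = [
    ([("A", 0.1), ("A", 0.1)], 0.05),
    ([("G", 0.3), ("A", 0.3)], 0.05),
]
best_base, posterior = ml_base(column_tree)
print(best_base, round(posterior, 3))
```

Running this once per column, restricted to the positions that the indel history marks as present, would yield one ancestral sequence. The sketch covers only the root of the given tree.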

  14. Optimal indel reconstruction: not so easy! NNNNNNNNNNNNNNN NN------NNNNNNN NNNN-------NNNN NNNNNN-----NNNN

  15. Reconstructing indel history: not so easy! NNNNNNNNNNNNNNN NN------NNNNNNN NNNN-------NNNN NNNNNN-----NNNN

  16. Reconstructing indel history: not so easy! NNNNNNNNNNNNNNN NN------NNNNNNN NNNN-------NNNN NNNNNN-----NNNN NNNNNNNNNNNNNNN NN------NNNNNNN NNNN-------NNNN NNNNNN-----NNNN

  17. Reconstructing indel history: not so easy! NNNNNNNNNNNNNNN NN------NNNNNNN NNNN-------NNNN NNNNNN-----NNNN NNNNNNNNNNNNNNN NN------NNNNNNN NNNN-------NNNN NNNNNN-----NNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NN----------------------NNNNNNN NNNN-----------------------NNNN NNNNNN---------------------NNNN

  19. Inferring indel history • Given: • A multiple sequence alignment, • A phylogenetic tree, • Probability model for deletions • Probability depends on deletion length and branch length • Probability model for insertions • Probability depends on insertion length, branch length, and content • Find: The most likely set of insertions and deletions that led to the given alignment • NP-hard (Chindelevitch et al. 2006) • Fredslund et al. (2003): Restricted enumeration • Blanchette et al. (2004): Greedy algorithm • Chindelevitch et al. (2006): Integer Linear Programming
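To make the input and output of this problem concrete, here is a deliberately simplified sketch. It is not one of the three cited algorithms: it treats every alignment column independently and applies Fitch parsimony to presence/absence, ignoring indel lengths, branch lengths and the probability models above. The tree encoding (nested 2-tuples of leaf names), the tie-break in favor of presence, and the name `fitch_presence` are all assumptions.

```python
def fitch_presence(tree, column):
    """Assign presence (True) or absence (False) of a base to every node.

    tree:   a leaf name (str) or a (left, right) tuple of subtrees
    column: {leaf_name: base or '-'} for one alignment column
    """
    sets = {}

    def up(node):
        # Bottom-up pass: candidate states that minimize changes below node.
        if isinstance(node, str):
            s = frozenset([column[node] != "-"])
        else:
            left, right = node
            l, r = up(left), up(right)
            s = (l & r) or (l | r)
        sets[node] = s
        return s

    states = {}

    def down(node, parent_state):
        # Top-down pass: keep the parent's state when it is a candidate.
        s = sets[node]
        state = parent_state if parent_state in s else (True in s)
        states[node] = state
        if not isinstance(node, str):
            for child in node:
                down(child, state)

    up(tree)
    down(tree, None)
    return states


primates = ("human", "chimp")
rodents = ("mouse", "rat")
tree = (primates, rodents)
column = {"human": "A", "chimp": "A", "mouse": "-", "rat": "-"}
states = fitch_presence(tree, column)
print(states[tree], states[primates], states[rodents])
```

In this example the root is inferred to carry a base and the rodent ancestor is not, which corresponds to a single deletion on the rodent branch. The full problem is harder because one indel event spans several adjacent columns.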

  20. Partial Results - Deletions only • If only deletions are allowed and all deletions have the same probability (cost), then: • Rectangle-covering problem, where the tree determines which sets of rows are admissible • NNNNNNN---NN-----N • NNNNNNNN--NN-----N • N---NNNNNNNNNN---N • NN--NNNNNNNNNNNNNN • Exact polynomial-time greedy algorithm • Idea: There always exists a “forced move”, i.e. a gap that can only be covered by a single maximal deletion

  21. Measuring accuracy • We use simulations of mammalian sequence evolution to evaluate the accuracy of the reconstruction on neutrally evolving DNA. - Start with a random (realistic) ancestral sequence AGCATAGA

  22. Measuring accuracy • We use simulations of mammalian sequence evolution to evaluate the accuracy of the reconstruction on neutrally evolving DNA. 1) Simulate evolution along the mammalian tree AGCATAGA ACGACGATA AGCATA AGCATCAG AGCAAATC AGACTACA AGCATCAGC AGG AGGCT AGGACATCA AGGACACCA AGGACACCA AGGACCCCA AGGACCCCA AGGATTC AGGATTC AGGATTC AGGGTTC AGGGTTC AGCATAGA AGCATTAGA AGCATTGAGA AGGATAGA

  23. Measuring accuracy • We use simulations of mammalian sequence evolution to evaluate the accuracy of the reconstruction on neutrally evolving DNA. - Use TBA to align the sequences generated AG-C-AT--- ACGA-CG--- A----GC--- AGC--AT--- AGCA-A---- AGAC-TA--- AGCAATC--- AGGC------ AGGC------ AGGA-CA--- AGGA-CACCA AGGA-CACCA AGGA-CCCCA AGGA-CCCCA AGGA--TTC- AGGA--TTC- AGGA--TTC- AGGG--TTC- AGGG--TTC- AGCATAGA AGCATTAGA AGCATTGAGA AGGATAGA

  24. Measuring accuracy • We use simulations of mammalian sequence evolution to evaluate the accuracy of the reconstruction on neutrally evolving DNA. - Reconstruct indel history: AG-C-AT--- ACGA-CG--- A----GC--- AGC--AT--- AGCA-A---- AGAC-TA--- AGCAATC--- AGGC------ AGGC------ AGGA-CA--- AGGA-CACCA AGGA-CACCA AGGA-CCCCA AGGA-CCCCA AGGA--TTC- AGGA--TTC- AGGA--TTC- AGGG--TTC- AGGG--TTC- AGCATAGA AGCATTAGA AGCATTGAGA AGGATAGA

  25. Measuring accuracy • We use simulations of mammalian sequence evolution to evaluate the accuracy of the reconstruction on neutrally evolving DNA. - Infer ancestral sequences at each node AG-C-AT--- ACGA-CG--- A----GC--- AGC--AT--- AGCA-A---- AGAC-TA--- AGCAATC--- AGGC------ AGGC------ AGGA-CA--- AGGA-CACCA AGGA-CACCA AGGA-CCCCA AGGA-CCCCA AGGA--TTC- AGGA--TTC- AGGA--TTC- AGGG--TTC- AGGG--TTC- AGCATAGA AGTATAGGA AGCATTAGA AGTATTTAGA AGCATTGAGA AGCTTGAGA AGGATAGA AGATCGA

  26. Measuring accuracy • We use simulations of mammalian sequence evolution to evaluate the accuracy of the reconstruction on neutrally evolving DNA. • For each node, align the true and predicted ancestors • Count: missing bases + added bases + substituted bases • Example: true ACGCATTAGA vs. predicted AGTATTTAGA, aligned as ACGCATT-AGA over A-GTATTTAGA: 3 errors/10 bp, error rate = 0.3 [Tree figure node labels: AGCATAGA AGTATAGGA AGCATTGAGA AGCTTGAGA AGGATAGA AGATCGA]
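The error count on this slide can be reproduced from the two aligned rows. The following is a minimal sketch, assuming the true/predicted alignment is already given and that the error rate is normalized by the length of the true ancestor (10 bp in the slide's example); the function name `reconstruction_errors` is hypothetical.

```python
def reconstruction_errors(true_aln: str, pred_aln: str) -> dict:
    """Count missing, added and substituted bases between aligned ancestors."""
    if len(true_aln) != len(pred_aln):
        raise ValueError("aligned sequences must have equal length")
    missing = added = substituted = 0
    for t, p in zip(true_aln, pred_aln):
        if t == p:
            continue                # match (or gap in both rows)
        if p == "-":
            missing += 1            # true base absent from the prediction
        elif t == "-":
            added += 1              # predicted base absent from the truth
        else:
            substituted += 1
    true_length = sum(c != "-" for c in true_aln)
    total = missing + added + substituted
    return {
        "missing": missing,
        "added": added,
        "substituted": substituted,
        "error_rate": total / true_length,
    }


result = reconstruction_errors("ACGCATT-AGA", "A-GTATTTAGA")
print(result)
```

For the slide's pair this gives one missing, one added and one substituted base, i.e. 3 errors over 10 bp.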

  27. Simulation details • We simulate neutrally evolving regions of 50kb • We model: - Lineage-specific neutral mutation rates - Insertions and deletions based on empirical frequency and length distributions - Insertion of transposable elements - CpG effect • We don’t model: - DNA polymerase slippage - Positive selection - Genome rearrangement, duplications • Sanity checks: Simulated sequences are similar to actual mammalian sequences: • Same pair-wise percent identity • Same frequency and length distribution of insertions and deletions • Same repetitive content and age distribution of repeats
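The overall shape of such a simulation can be sketched as follows. This is a toy version under loud assumptions: indel lengths are drawn uniformly rather than from empirical distributions, there are no lineage-specific rates, transposable elements or CpG effect, and every name and parameter value (`evolve`, `simulate`, `sub_rate`, `del_rate`, `ins_rate`, `max_indel`) is hypothetical.

```python
import random


def evolve(seq, branch_length, rng, sub_rate=1.0, del_rate=0.05,
           ins_rate=0.05, max_indel=5):
    """Return a mutated copy of seq after one branch of the tree."""
    out = []
    i = 0
    while i < len(seq):
        if rng.random() < del_rate * branch_length:
            i += rng.randint(1, max_indel)        # delete a short run
            continue
        base = seq[i]
        if rng.random() < sub_rate * branch_length:
            base = rng.choice([b for b in "ACGT" if b != base])
        out.append(base)
        if rng.random() < ins_rate * branch_length:
            length = rng.randint(1, max_indel)    # insert random bases
            out.extend(rng.choice("ACGT") for _ in range(length))
        i += 1
    return "".join(out)


def simulate(node, seq, rng, leaves=None):
    """Evolve seq down a tree; a node is a leaf name or [(child, length), ...]."""
    if leaves is None:
        leaves = {}
    if isinstance(node, str):
        leaves[node] = seq
        return leaves
    for child, t in node:
        simulate(child, evolve(seq, t, rng), rng, leaves)
    return leaves


rng = random.Random(0)
ancestor = "".join(rng.choice("ACGT") for _ in range(60))
toy_tree = [([("human", 0.02), ("chimp", 0.02)], 0.1),
            ([("mouse", 0.1), ("rat", 0.1)], 0.3)]
leaves = simulate(toy_tree, ancestor, rng)
for name, s in leaves.items():
    print(name, s)
```

Because the true ancestor is known in a simulation, the leaf sequences can be fed to the reconstruction pipeline and the result compared against `ancestor`.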

  28. Guess which ancestor can be best reconstructed? Eizirik et al. 2001

  29. Reconstructability and tree topology • Star phylogeny (root R with n independent descendants) • Leaves are independent • Accuracy approaches 100% exponentially fast as n increases • Bifurcating root (root R with children A and B, n dependent descendants) • Information lost between R and A or B can’t be recovered • Can’t do better than if A and B were reconstructed perfectly • Accuracy < 100% - ε for all n
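The star-phylogeny claim can be checked numerically in a simplified setting. This sketch assumes a binary character instead of four nucleotides, n odd, each leaf independently differing from the root with probability p < 0.5, and reconstruction by majority vote; under those assumptions the error is a binomial tail, which Hoeffding's inequality bounds by exp(-2n(1/2 - p)^2). The function names are hypothetical.

```python
from math import comb, exp


def majority_error(n: int, p: float) -> float:
    """P(majority vote over n independent leaves is wrong), n odd."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k)
               for k in range(n // 2 + 1, n + 1))


def hoeffding_bound(n: int, p: float) -> float:
    """Exponentially decaying upper bound on the majority-vote error."""
    return exp(-2 * n * (0.5 - p) ** 2)


p = 0.2
errors = [majority_error(n, p) for n in (1, 3, 5, 9, 17, 33)]
for n, err in zip((1, 3, 5, 9, 17, 33), errors):
    print(n, err, hoeffding_bound(n, p))
```

With a bifurcating root, no comparable bound tends to zero, because both subtrees share the information lost on the two branches leaving R.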

  30. Eizirik et al. 2001

  31. How many species do we need? Best choice of species: - Sample many taxa - Choose slowly evolving species

  32. What if the fast-radiation model is wrong?

  33. Reconstructing real ancestors

  34. For this set of species, simulations predict: • Expected accuracy ~95% [Tree figure leaf labels: COW; RAT; CHIMP, GORILLA, ORANGUTAN, MACAQUE, VERVET, BABOON; MOUSE-LEMUR]

  35. External validation using ancestral transposons Actual mammalian ancestor Transposon consensus Human relic

  36. External validation using ancestral transposons Reconstructed mammalian ancestor 0.314 subst/site 0.117 subst/site Actual mammalian ancestor Transposon consensus Human relic 0.391 subst/site

  37. External validation using ancestral transposons Reconstructed mammalian ancestor 0.117 subst/site Error = 0.026 subst/site 0.314 subst/site Actual mammalian ancestor Transposon consensus Human relic 0.391 subst/site

  38. What’s next? Whole genome! • Data available • Whole genomes: Human, chimp, mouse, rat, dog • Unassembled/low-coverage genomes: Cow, rabbit, armadillo, elephant • Challenges: • Fewer species • Unassembled contigs • Genome rearrangements • Recombination hotspots • We expect that 90% of the Boreoeutherian genome can be reconstructed with ~90% accuracy

  39. Why should we care? • An ancestral genome lets us see what changes happened in our genome, and when • Allows detection and “dating” of lineage-specific innovations (e.g. FOXP2) • Allows a better understanding of the forces driving genome evolution • New model organism? • Human genome is 4 times closer to the ancestral genome than to the mouse genome: better model for human phenotypes?

  40. Even if we had the full genomes of all living mammalian species: • Technological problem: • We can’t synthesize large regions of DNA • Many regions can’t be reconstructed at all: • Heterochromatin • Regions with high recombination rates • 99% base-by-base accuracy is not enough • One mistake may be enough to make life impossible

  41. Acknowledgements • David Haussler, Brian Raney UC Santa Cruz • Webb Miller Penn State Univ. • Eric Green NHGRI • UC Santa Cruz group: • Adam Siepel, Robert Baertsch, Gill Bejerano, Jim Kent • McGill group: • Leonid Chindelevitch, Zhentao Li, Eric Blais
