CS273A Lecture 11: Comparative Genomics II MW 12:50-2:05pm in Beckman B302 Profs: Serafim Batzoglou & Gill Bejerano TAs: Harendra Guturu & Panos Achlioptas http://cs273a.stanford.edu [BejeranoFall13/14]
Announcements Some midterm feedback: • You seem to like us • We like you too! • Teach us more biology / Teach us more algorithms • We’ll highlight follow-up classes towards the end of the quarter • Give us more references • Start with Wikipedia. Then ask us for any specifics on Piazza. • How do all the different topics we cover tie together? • They all teach you about the human genome! • Its functions, its evolution and its contribution to disease – it’s a big canvas • What are the most important problems in the field? • Different people will give you different answers • No topic we introduce to you is fully resolved! • Homework is very technical. Hard to focus on the insights. • This is part of our daily challenge. • We should make you like the taste of it, because we sure do! • Your project will give you a taste of real open-ended research.
Genome Evolution [Slide background: a wall of raw genomic DNA sequence.]
Comparative Genomics “Nothing in Biology Makes Sense Except in the Light of Evolution” (Theodosius Dobzhansky). “Nothing in Evolution Makes Sense Except in the Light of Computation” (Yours Truly). [Figure: species tree with leaves human, chimp, macaque, mouse, rat, cow, dog, opossum, platypus, chicken, zfish, tetra, fugu.]
Terminology Orthologs: Genes related via speciation (e.g. C, M, H3). Paralogs: Genes related through duplication (e.g. H1, H2, H3). Homologs: Genes that share a common origin (e.g. C, M, H1, H2, H3). [Figure: a gene tree descending from a single ancestral gene, drawn inside the species tree, with speciation, duplication and loss events marked.]
Conservation implies function: purifying selection vs. neutral evolution. Note: Lack of sequence conservation does NOT imply lack of function. NOR does it rule out function conservation.
Dotplots Dotplots are a simple way of seeing alignments. We really like to see good visual demonstrations, not just tables of numbers. It’s a grid: put one sequence along the top and the other down the side, and put a dot wherever they match. You see the alignment as a diagonal. Note that DNA dotplots are messier because the alphabet has only 4 letters. Smoothing by windows helps.
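The grid and the window smoothing described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the tool used to draw the lecture figures; the function names `dotplot` and `render` and the parameters `window` and `min_matches` are made up for this example.

```python
def dotplot(seq_top, seq_side, window=1, min_matches=None):
    """Return the set of (i, j) cells that get a dot.

    A dot is placed at (i, j) when the length-`window` stretches starting at
    seq_top[i] and seq_side[j] agree in at least `min_matches` positions.
    window=1 is the plain "dot wherever they match" plot.
    """
    if min_matches is None:
        min_matches = window  # require a perfect match inside the window
    dots = set()
    for i in range(len(seq_top) - window + 1):
        for j in range(len(seq_side) - window + 1):
            matches = sum(
                a == b
                for a, b in zip(seq_top[i:i + window], seq_side[j:j + window])
            )
            if matches >= min_matches:
                dots.add((i, j))
    return dots


def render(seq_top, seq_side, dots):
    """Draw the grid as text: one row per position of seq_side."""
    rows = []
    for j in range(len(seq_side)):
        rows.append(
            "".join("*" if (i, j) in dots else "." for i in range(len(seq_top)))
        )
    return "\n".join(rows)


if __name__ == "__main__":
    s = "ACGTACGT"
    print(render(s, s, dotplot(s, s, window=1)))  # noisy: every shared letter
    print()
    print(render(s, s, dotplot(s, s, window=3)))  # fewer off-diagonal dots
```

With a 4-letter alphabet, single-letter matches occur by chance all over the grid; requiring several matches inside a window keeps the diagonals and drops most of the isolated dots.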
Chaining Alignments Chaining highlights homologous regions between genomes, bridging the gulf between syntenic blocks and base-by-base alignments. Local alignments tend to break at transposon insertions, inversions, duplications, etc. Global alignments tend to force non-homologous bases to align. Chaining is a rigorous way of joining together local alignments into larger structures.
Another Chain Example [Figure: a human sequence and a mouse sequence built from blocks A, B, C, D, E and a duplicate B’. One panel shows the mouse chains as seen in the human browser, the other the human chains as seen in the mouse browser, each against its implicit reference sequence.]
Chains join together related local alignments [Figure: browser view of Protease Regulatory Subunit 3, with chains labeled “likely ortholog”, “likely paralogs” and “shared domain?”.]
Note: repeats are a nuisance. If, for example, human and mouse each have 10,000 copies of the same repeat, we will obtain and need to output 10^8 (10,000 × 10,000) alignments of all these copies to each other. Note that for the sake of this comparison interspersed repeats and simple repeats are equal nuisances. However, note that simple repeats, but not interspersed repeats, violate the assumption that similar sequences are homologous. Solution: 1. Discover all repetitive sequences in each genome. 2. Mask them when doing genome-to-genome comparison. 3. Chain your alignments. 4. Add back to the alignments only repeat matches that lie within pre-computed chains. This reintroduces into the chains (mostly) orthologous copies, which is valuable!
Chains • a chain is a sequence of gapless aligned blocks, where there must be no overlaps of blocks' target or query coords within the chain. • Within a chain, target and query coords are monotonically non-decreasing. (i.e. always increasing or flat) • double-sided gaps are a new capability (blastz can't do that) that allow extremely long chains to be constructed. • not just orthologs, but paralogs too, can result in good chains. but that's useful! • chains should be symmetrical -- e.g. swap human-mouse -> mouse-human chains, and you should get approx. the same chains as if you chain swapped mouse-human blastz alignments. • chained blastz alignments are not single-coverage in either target or query unless some subsequent filtering (like netting) is done. • chain tracks can contain massive pileups when a piece of the target aligns well to many places in the query. Common causes of this include insufficient masking of repeats and high-copy-number genes (or paralogs). [Angie Hinrichs, UCSC wiki]
Before and After Chaining
Chaining Algorithm Input: blocks of gapless alignments from (b)lastz. Dynamic program based on the recurrence relationship: $\mathrm{score}(B_i) = \max_{j<i}\left(\mathrm{score}(B_j) + \mathrm{match}(B_i) - \mathrm{gap}(B_i, B_j)\right)$. Uses Miller’s KD-tree algorithm to minimize which parts of the dynamic programming graph to traverse. Timing is O(N log N), where N is the number of blocks (which is in the hundreds of thousands). See [Kent et al, 2003] “Evolution's cauldron: Duplication, deletion, and rearrangement in the mouse and human genomes”
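The recurrence on this slide can be tried out with a small quadratic-time sketch. It is deliberately not the KD-tree version cited above: it simply compares every block with every earlier block, so it runs in O(N^2). The `Block` fields, the `gap_cost` function and its `open_cost` / `extend_cost` values are assumptions made for illustration, and a block is also allowed to start a new chain on its own (score = match(B_i)).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """A gapless aligned block (illustrative fields, same strand assumed)."""
    t_start: int   # start in the target genome
    q_start: int   # start in the query genome
    length: int    # number of aligned bases (must be >= 1)
    score: float   # match(B): score of the block itself

    @property
    def t_end(self):
        return self.t_start + self.length

    @property
    def q_end(self):
        return self.q_start + self.length


def gap_cost(prev, cur, open_cost=10.0, extend_cost=0.1):
    """Hypothetical gap penalty: grows with the distance on both genomes."""
    dt = cur.t_start - prev.t_end
    dq = cur.q_start - prev.q_end
    if dt == 0 and dq == 0:
        return 0.0
    return open_cost + extend_cost * (dt + dq)


def best_chain(blocks):
    """Return (score, blocks) of the highest-scoring chain, O(N^2) version."""
    if not blocks:
        return 0.0, []
    blocks = sorted(blocks, key=lambda b: (b.t_start, b.q_start))
    score = [b.score for b in blocks]   # a block may start its own chain
    back = [None] * len(blocks)         # predecessor index for traceback
    for i, cur in enumerate(blocks):
        for j in range(i):
            prev = blocks[j]
            # prev may precede cur only if it ends before cur on both genomes
            if prev.t_end <= cur.t_start and prev.q_end <= cur.q_start:
                cand = score[j] + cur.score - gap_cost(prev, cur)
                if cand > score[i]:
                    score[i], back[i] = cand, j
    i = max(range(len(blocks)), key=score.__getitem__)
    best = score[i]
    chain = []
    while i is not None:
        chain.append(blocks[i])
        i = back[i]
    return best, chain[::-1]
```

The ordering test inside the inner loop is what enforces the "no overlaps, monotonically non-decreasing coordinates" property of a chain.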
Netting Alignments Commonly, multiple mouse alignments can be found for a particular human region, particularly for coding regions. Net finds the best mouse match for each human region. Highest-scoring chains are used first. Lower-scoring chains fill in gaps within chains, inducing a natural hierarchy.
Net highlights rearrangements A large gap in the top level of the net is filled by an inversion containing two genes. Numerous smaller gaps are filled in by local duplications and processed pseudo-genes.
Nets attempt to capture the ortholog (they also hide everything else)
Nets/chains can reveal retrogenes (and when they jumped in!)
Nets • a net is a hierarchical collection of chains, with the highest-scoring non-overlapping chains on top, and their gaps filled in where possible by lower-scoring chains, for several levels. • a net is single-coverage for target but not for query. • because it's single-coverage in the target, it's no longer symmetrical. • the netter has two outputs, one of which we usually ignore: the target-centric net in query coordinates. The reciprocal best process uses that output: the query-referenced (but target-centric / target single-cov) net is turned back into component chains, and then those are netted to get single coverage in the query too; the two outputs of that netting are reciprocal-best in query and target coords. Reciprocal-best nets are symmetrical again. • nets do a good job of filtering out massive pileups by collapsing them down to (usually) a single level. • GB: for human inspection always prefer looking at the chains! [Angie Hinrichs, UCSC wiki]
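The "highest-scoring chains first, lower-scoring chains fill the gaps" idea can be illustrated with a heavily simplified sketch. It is not the UCSC netter: it flattens the hierarchy into a single level and only tracks target coordinates, which is enough to show why the result is single-coverage in the target. The chain representation `(name, score, [(t_start, t_end), ...])` and the function names are assumptions for this example.

```python
def subtract(interval, claimed):
    """Return the parts of (start, end) not covered by `claimed`.

    `claimed` is a list of non-overlapping (start, end) pairs sorted by start.
    """
    start, end = interval
    free = []
    for c_start, c_end in claimed:
        if c_end <= start or c_start >= end:
            continue  # no overlap with the remaining interval
        if c_start > start:
            free.append((start, c_start))
        start = max(start, c_end)
        if start >= end:
            break
    if start < end:
        free.append((start, end))
    return free


def flat_net(chains):
    """Assign every target base to at most one chain, best chains first.

    chains: iterable of (name, score, blocks), where blocks is a list of
    (t_start, t_end) target intervals. Returns sorted (t_start, t_end, name).
    """
    claimed = []  # sorted, non-overlapping (start, end, name)
    for name, _score, blocks in sorted(chains, key=lambda c: -c[1]):
        for block in blocks:
            taken = [(s, e) for s, e, _ in claimed]
            for piece in subtract(block, taken):
                claimed.append((piece[0], piece[1], name))
            claimed.sort()
    return claimed
```

For example, a high-scoring chain with blocks (0, 100) and (200, 300) keeps both blocks, and a lower-scoring chain covering (50, 250) is trimmed to (100, 200), the gap left by the better chain. Every target base ends up in one chain only, while nothing prevents two chains from using the same query bases.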
Before and After Netting
Convert / LiftOver "LiftOver chains" are actually chains extracted from nets, or chains filtered by the netting process. LiftOver: batch utility
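To make concrete what it means to carry a coordinate across a chain, here is a minimal sketch. It is not the UCSC LiftOver implementation: it handles a single chain on the same strand, and the block layout `(t_start, q_start, length)` and the function name `lift` are assumptions for illustration.

```python
from bisect import bisect_right


def lift(pos, blocks):
    """Map a target position to the query through one chain.

    blocks: list of (t_start, q_start, length) gapless blocks, sorted by
    t_start and non-overlapping. Returns the query position, or None when
    `pos` falls outside the chain or into a gap between blocks.
    """
    starts = [t_start for t_start, _, _ in blocks]
    k = bisect_right(starts, pos) - 1   # last block starting at or before pos
    if k < 0:
        return None
    t_start, q_start, length = blocks[k]
    if pos < t_start + length:
        return q_start + (pos - t_start)
    return None                         # pos lies in a gap of the chain


if __name__ == "__main__":
    chain = [(100, 1000, 50), (200, 1080, 30)]
    for p in (120, 160, 210):
        print(p, "->", lift(p, chain))
```

Positions inside a block are shifted by that block's offset; positions in a gap have no counterpart in the other assembly and are reported as unmapped.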
Drawbacks Inversions not handled optimally. [Figure: the Chains row shows forward-strand chr1, forward-strand chr1, reverse-strand chr5 and reverse-strand chr1 segments; the Nets row shows the same segments except the final reverse-strand chr1 segment.]
Self Chain reveals paralogs (self net is meaningless)
Let’s put the chains and nets to good use…
The Genotype - Phenotype divide Can we find evolutionary patterns that are distinct enough to be phenotypically revealing? Problem #1: Too many nucleotide changes between any pair of related species (or individuals). The vast majority of these are near/neutral. [Figure: Species A vs. Species B.]
Matching Genotype to Phenotype is hard Most mutations are near/neutral. [Figure labels: Phenotype, Genotype, Number of rearrangements.]
What about a tree of related species? What if we could find evolutionary patterns that were distinct enough to be phenotypically revealing? Genomes: Inherited with Modifications. Traits: Come and Go. [Figure: tree from a common ancestor to Species A, Species B, ..., Species H.]
What happens when an ancestral trait “goes”? Trait information is no longer under selection and erodes away over evolutionary time. [Figure: ancestral trait information in the ancestor, shown at the Phenotype and Genome levels.]
A lot of DNA and many traits vary between any two species.
What about independent trait loss? Vitamin C synthesis, tail, body hair, dentition features, etc.
The PG screen matches trait presence/absence pattern [Hiller et al., 2012a]
The PG screen Capture the independent genomic switch from purifying selection to neutral evolution in all and only the trait-loss species. Robust to: Different trait disabling times. Different trait disabling mutations.
Branding ;-) Forward Genetics: Search for mutations that segregate with the trait. Forward Genomics: Search for regions that are lost only in species lacking the trait. [Figure labels: phenotype, genotype.] But does it work?
Vitamin C Synthesis Rats & mice: synthesize vitamin C. Human: cannot synthesize vitamin C.
The Vitamin C synthesis “phenotree” Vitamin C synthesis was lost 3-4 times independently in mammalian evolution. Fwd Genomics asks: Do one or more genomic loci look like THAT?
Start by using chains and nets! First we use lastz, chaining & netting to align the reference genome to orthologous sequences in all other species’ genomes.
species 1  ACCCTATCGATTGCA
species 2  TCCGTATCG-TT-CA
outgroup   ACTCT-TCGATT-AA
We quantify divergence by comparing sequences to the reconstructed ancestral sequence. Mutation in species 1 or 2? Insertion in species 1 or deletion in species 2? Reconstruct the ancestral sequence:
species 1  ACCCTATCGATTGCA
species 2  TCCGTATCG-TT-CA
outgroup   ACTCT-TCGATT-AA
ancestor   ACCCTATCGATT-CA
Percent of identical bases relative to the ancestor: species 1 has 14 identical bases (93%); species 2 has 11 identical bases (79%) and is more diverged.
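The percentages on this slide can be reproduced with a short sketch. The denominator convention is an assumption inferred from the slide's numbers (14/15 ≈ 93% and 11/14 ≈ 79%): every alignment column counts unless both the ancestor and the species have a gap there. The function name `percent_identity` is made up for this example.

```python
def percent_identity(ancestor, species, gap="-"):
    """Percent of identical bases between two aligned sequences.

    Columns where both sequences have a gap are skipped (assumed
    convention); a base opposite a gap counts as a difference.
    """
    if len(ancestor) != len(species):
        raise ValueError("sequences must be aligned to the same length")
    identical = compared = 0
    for a, s in zip(ancestor, species):
        if a == gap and s == gap:
            continue
        compared += 1
        if a == s:
            identical += 1
    if compared == 0:
        raise ValueError("no comparable columns")
    return 100.0 * identical / compared


if __name__ == "__main__":
    ancestor = "ACCCTATCGATT-CA"
    print(round(percent_identity(ancestor, "ACCCTATCGATTGCA")))  # species 1
    print(round(percent_identity(ancestor, "TCCGTATCG-TT-CA")))  # species 2
```

Under this convention the same function also gives 89% and 61% for the two sequences on the sequencing-error slide.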
Sequencing errors mimic divergence
ancestor   ACCCTATCGATT-CAATGG
species 1  ACCCTATCGATTGCAAGGG  (89% identical bases)
species 2  TCCGTAACG--T-CTATCG  (61% identical bases)
Sequence quality scores indicate a high sequencing error rate in species 2: treat species 2 as missing data.
Assembly gaps mimic divergence [Figure: a conserved region aligned across species 1 to 5; in species 1 the Sanger reads leave an assembly gap (?????????) over the region.] Treat species 1 as missing data.
Reconstruct the evolutionary history of all conserved regions, coding and non-coding. For each of the 544,549 conserved regions, reconstruct the ancestral locus and record each species' percent identity (e.g. 93%, 70%, 85%), giving a matrix of 33 species x 544,549 regions. • Reconstruct ancestral sequence • Measure extant species divergence • Avoid low quality sequence and assembly gaps • Seek perfect phenotree match
We quantify the match to the vitamin C pattern by counting the number of species that violate the pattern. [Figure: two example regions with species placed on a percent identity axis (0 to 100), one with 1 violation and one with 2 violations.]
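One way to turn "number of species that violate the pattern" into code is sketched below. The slide does not spell out the counting rule, so the rule used here is an assumption: pick the percent-identity threshold that best separates trait-loss species (expected below it) from trait-preserving species (expected at or above it), and count the species on the wrong side. Species flagged as missing data (low quality sequence, assembly gaps) are passed as `None` and ignored. All names are illustrative.

```python
def count_violations(identity, lost_trait):
    """Minimum number of species on the wrong side of any identity threshold.

    identity:   dict species -> percent identity to the reconstructed
                ancestor, or None for missing data.
    lost_trait: set of species that lack the trait.
    """
    known = {sp: v for sp, v in identity.items() if v is not None}
    # Every distinct value, plus infinity, covers all possible splits.
    thresholds = sorted(set(known.values())) + [float("inf")]
    best = len(known)
    for thr in thresholds:
        violations = 0
        for sp, value in known.items():
            diverged = value < thr
            if diverged != (sp in lost_trait):
                violations += 1
        best = min(best, violations)
    return best


if __name__ == "__main__":
    region = {"human": 60.0, "guinea_pig": 55.0, "mouse": 95.0,
              "rat": 93.0, "cow": 58.0}
    # cow keeps the trait but looks diverged in this made-up region
    print(count_violations(region, {"human", "guinea_pig"}))
```

A region scores 0 only when every trait-loss species is more diverged than every trait-preserving species, which is the perfect phenotree match sought by the screen.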
Regions matching the vitamin C trait are clustered. [Figure: the 544,549 conserved regions arranged by no. of violating species, from 0 (perfect match) to 10 (no match).] The best-matching conserved regions are all exons of a single gene.
This gene is more diverged in all non-vitamin C synthesizing species
What is the function of this gene? 33 genomes x 544,549 regions, matched against the vitamin C pattern, point to Gulo, gulonolactone (L-) oxidase, which encodes the enzyme responsible for vitamin C biosynthesis. Note: No likely shared disabling mutation. We learned about both evolution and function.
The Power of Forward Genomics 33 genomes x 544,549 regions, matched against the vitamin C pattern, point to Gulo, gulonolactone (L-) oxidase. Forward genomics works. Can it work for continuous traits? With only two independent losses? And many unknown values?
Bile Bile is a fluid produced by the liver that aids the digestion of lipids in the small intestine.
Bile Phospholipids Different mammals have remarkably different levels of biliary phospholipids.
ABCB4 is a phospholipid transporter
Find “Cure” Models for Human Disease Human ABCB4 mutations lower patient biliary phospholipid levels to guinea pig levels but are detrimental. Our discovery: Guinea pig and horse have inactivated the Abcb4 gene in their natural state. How can they do it? [Diagram: create KO gene, then try to fix/treat; versus natural KO, then find nature’s cure!]