Genome evolution. Lecture 1: Introduction (and some background on Markov Processes). Amos Tanay, Ziskind 204, ext 3579 עמוס תנאי amos.tanay@weizmann.ac.il http://www.wisdom.weizmann.ac.il/~atanay/GenomeEvo/. Evolution. Code. Genome. Talking and eating….
Evolution Code Genome
(Most of us) don’t think we evolved to be optimal.. Examples: talking and eating…; the retina in human and octopus (nerve fibers); evolving tiny iPhone fingers?; the appendix.
But when looking at the genome.. • Very commonly, evolution is credited with super-powers • Any non-random pattern in the genome is assumed to have a “meaning” • ..because otherwise “evolution would have eliminated it” • In a way, the current attitude to evolution among many biologists is semi-religious. Mathematics helps demystify the inner workings of evolution.
The Lamarckian view – directed evolution Jean Baptiste Lamarck
Statistics, Genetics and Molecular Biology (Fisher, Haldane, Dobzhansky, Mayr). Frequency of the recessive allele (blue flower color) in “desert snow” flowers (Linanthus parryae), values from the figure: 0.717 0.005 0.000 0.000 0.032 0.573 0.657 0.000 0.009 0.000 0.002 0.302 0.007 0.004 0.000 0.000 0.126 0.504 0.005 0.106 0.008 0.000 0.339 0.000 0.224 0.010 0.068 0.000 0.014 0.411
Modeling evolution – take I: modeling the dynamics of allele frequencies. [Figure: genotypes AA, Aa and aa, and the blue allele A and yellow allele a, followed over generations/time from generation t to t+1]
Try it at home I: • Simulate a population of 10,000 “genomes”, all having the same genotype AA • Replace 10 individuals with heterozygous “Aa” individuals • Simulate a new generation: synthesize 20,000 random pairs and select 10,000 out of them • Plot the fraction of the “a” allele in the population • Repeat the experiment many times: can you say something (empirical/analytic) about the probability of ending up with a population lacking “a” completely?
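The exercise above can be sketched as follows. This is a minimal sketch under stated assumptions, not the course's reference solution: each individual is stored as its number of “a” alleles, a "random pair" is read as two parents drawn with replacement who each pass on one random allele, and the names `simulate` and `loss_fraction` are made up for illustration. Plotting is left out; the returned list of frequencies can be passed to any plotting tool.

```python
import random

def simulate(n=10_000, n_carriers=10, generations=200, seed=0):
    """Return the frequency of allele "a" in each generation.

    Assumption: an individual is the number of "a" alleles it carries
    (0 = AA, 1 = Aa, 2 = aa); parents are drawn with replacement.
    """
    rng = random.Random(seed)
    pop = [1] * n_carriers + [0] * (n - n_carriers)
    freqs = [sum(pop) / (2 * n)]
    for _ in range(generations):
        offspring = []
        for _ in range(2 * n):  # synthesize 2n offspring from random pairs
            mother, father = rng.choice(pop), rng.choice(pop)
            # each parent transmits "a" with probability (its "a" count) / 2
            child = (rng.random() < mother / 2) + (rng.random() < father / 2)
            offspring.append(child)
        pop = rng.sample(offspring, n)  # select n of the 2n offspring
        freqs.append(sum(pop) / (2 * n))
        if freqs[-1] in (0.0, 1.0):  # "a" was lost or fixed: nothing more changes
            break
    return freqs

def loss_fraction(runs=20, **kwargs):
    """Fraction of repeated runs (different seeds) that end with no "a" allele."""
    lost = sum(simulate(seed=s, **kwargs)[-1] == 0.0 for s in range(runs))
    return lost / runs
```

A small population (for example `simulate(n=200, generations=50)`) runs quickly and is a reasonable way to try the code before moving to the 10,000 individuals of the exercise.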
Statistics, Genetics and Molecular Biology The code – Genomic sequences …ACGAATAGCAAATGGGCAGATGGCAGTCTAGATCGAAAGCATGAAACTAGATAGCAT… Monod Jacob Crick The machine – Protein networks in cells
Statistics, Genetics and Molecular Biology. A. Quantitative description of populations, their genes and their evolution. B. Molecular understanding of the mechanisms that store, transmit and process biological information. Timeline: 1920 (Fisher, Haldane), 1930 (Dobzhansky, Mayr), 1950 (Monod, Crick, Jacob). A+B = The (only?) real quantitative theory of biology!
Neutral Evolution Selectionists: Mutations are occurring by chance - some get selected and these are the changes we see between genomes Kimura et al.: Most of the changes between genomes are neutral - not a result of selection! …ACGAATAGCAAATGGGCAGATGGCAGTCTAGATCGAAAGCATGAAACTAGATAGCAT… …ACGAATAGCAAATGGGCAGATGGCAGTCTAGATCGAAAGCATGAAACTAGATAGCAT… …ACGAATAGCAAAAGGGCAGATGGCATTCTAGATCGAAAGCATGAAACTAGATAGCAT… …ACGAATAGCAAATGGGCAGATGGCAGTCTAGATCGAAAGCATGAAACTAGATAGCAT… Kimura …ACGAATAGCAAATGGGCAGATGGCAGTCTAGATCGAAAGCATGAAACTAGATAGCAT…
Neutral Evolution. Kimura’s analytic achievement was the solution of a certain class of partial differential equations that describe the dynamics of allele frequencies under neutral evolution. But we can try to understand the essence of neutral evolution even without fancy mathematics: neutral changes along the path are fixed. [Figure: genealogy from t=1 to t=n, marking the last common ancestor and the coalescent time]
Modeling evolution – take II (Markov). One letter followed over time, t=1 to t=7: A G G G A A A. A Markov chain: a set of states (A, C, G, T) and transition probabilities between them (e.g. pAG, pGA).
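A single letter evolving as a Markov chain can be sketched in a few lines. The transition probabilities below are hypothetical (0.91 to stay, 0.03 for each change) and are only there to make the example run; `simulate_chain` is an illustrative name, not something from the lecture.

```python
import random

STATES = ["A", "C", "G", "T"]

# Hypothetical transition probabilities: P[a][b] = chance that a becomes b in one step.
P = {a: {b: (0.91 if a == b else 0.03) for b in STATES} for a in STATES}

def simulate_chain(P, start, steps, seed=0):
    """Follow one letter for `steps` time steps; the next letter depends only on the current one."""
    rng = random.Random(seed)
    seq = [start]
    for _ in range(steps):
        row = P[seq[-1]]
        weights = [row[s] for s in STATES]
        seq.append(rng.choices(STATES, weights=weights)[0])
    return "".join(seq)
```

Calling `simulate_chain(P, "A", 6)` gives a string of seven letters, comparable to the t=1 to t=7 trajectory on the slide.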
Modeling evolution – take II (Kolmogorov). A Markov process: a set of states (A, C, G, T) and transition rates. Theorem: rates
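To make the difference between transition probabilities and transition rates concrete, here is a minimal sketch of a continuous-time simulation: the process waits an exponentially distributed time in its current state and then jumps. The equal rates in `Q` are an assumption chosen for illustration, and `simulate_process` is a hypothetical name.

```python
import random

STATES = ["A", "C", "G", "T"]

# Hypothetical rates: Q[a][b] = rate of a -> b (a != b), in events per unit time.
Q = {a: {b: 0.1 for b in STATES if b != a} for a in STATES}

def simulate_process(Q, start, t_end, seed=0):
    """Return the list of (time, state) jumps of a continuous-time Markov process up to t_end."""
    rng = random.Random(seed)
    t, state = 0.0, start
    path = [(t, state)]
    while True:
        targets = list(Q[state])
        rates = [Q[state][s] for s in targets]
        t += rng.expovariate(sum(rates))  # waiting time ~ Exponential(total exit rate)
        if t >= t_end:
            return path
        state = rng.choices(targets, weights=rates)[0]  # jump target chosen in proportion to its rate
        path.append((t, state))
```

With a total exit rate of 0.3 per state, `simulate_process(Q, "A", 50.0)` should contain on the order of fifteen jumps, though any single run can differ a lot.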
From hundreds to billions of loci… Genome = many independent nucleotides (x1, x2, x3, x4, x5, x6) = multiple copies of the same Markov process, with a universal Q. Timeline 1960, 1970, 1980, 1990, 2000, 2010: protein analysis, phylogenetic reconstruction.
Conserved == because it was selected for? Conserved == Functional? “It’s conserved” Don’t change! Warn you! …ACGAATAGCAAATGGGCAGATGGCAGTCTAGATCGAAAGCATGAAACTAGATAGCAT… The everlasting power of conservation
Mutations are not simple. Mutations cannot be simple! DNA replication – wiki version.. Try it at home II: 1. Download the human-chimp pairwise alignment: http://hgdownload.cse.ucsc.edu/goldenPath/hg18/vsPanTro2/ 2. Count how many times a human T is aligned to chimp A, C, G, T 3. Recount, this time classifying according to the bases before and after (16 possibilities) 4. Do you see differences in the fraction of conserved T’s?
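Steps 2 and 3 of the exercise can be sketched as below. The sketch makes no assumption about the download's file format: it takes two already-aligned strings of equal length, with "-" marking gaps, and reading the alignment blocks into such strings is left to the reader. The function name `count_t_substitutions` is made up for illustration.

```python
from collections import Counter

def count_t_substitutions(human, chimp):
    """Count which chimp base is aligned to each human T.

    Returns (plain, by_context): `plain` is keyed by the chimp base,
    `by_context` by (human base before, human base after, chimp base).
    Assumes two equal-length aligned strings with "-" for gaps.
    """
    human, chimp = human.upper(), chimp.upper()
    plain, by_context = Counter(), Counter()
    for i, (h, c) in enumerate(zip(human, chimp)):
        if h != "T" or c not in "ACGT":
            continue  # skip non-T positions, gaps and ambiguous bases
        plain[c] += 1
        if 0 < i < len(human) - 1:
            left, right = human[i - 1], human[i + 1]
            if left in "ACGT" and right in "ACGT":
                by_context[(left, right, c)] += 1
    return plain, by_context
```

The fraction of conserved T's in a given context is then `by_context[(l, r, "T")]` divided by the sum over the four chimp bases for that `(l, r)` pair.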
Evolution of an AAAA..AA sequence. Simulation based on real parameters from yeast genomes (movie removed to save space), comparing a maximum likelihood flanking-aware model (Xt+1 given Xt and its neighbors X’t, X’’t) with a maximum likelihood independent-loci model (Xt+1 given Xt only).
What happened? Stationary distributions. [Figure labels: TTT, CTT, TCT, TAT, TGT; rate matrices QTXT, QCXT] Stationary distribution condition. Theorem (under reasonable conditions..): a Markov process will converge to a stationary distribution. Theorem: the stationary distribution is uniform if the transition probabilities are symmetric (more generally, if and only if the transition matrix is doubly stochastic, i.e. its columns also sum to 1).
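Convergence to a stationary distribution can be checked numerically by applying the transition matrix over and over. The three matrices below are made-up examples, not the yeast parameters from the lecture: a symmetric one, a non-symmetric one whose columns still sum to 1, and a biased two-state one. In this sketch the first two both end up uniform and the third does not.

```python
def step(dist, P):
    """One time step: new_dist[j] = sum_i dist[i] * P[i][j]."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]

def stationary(P, iterations=1000):
    """Approximate the stationary distribution by repeated application of P."""
    dist = [1.0] + [0.0] * (len(P) - 1)  # start with all mass on the first state
    for _ in range(iterations):
        dist = step(dist, P)
    return dist

# Hypothetical example matrices (rows sum to 1).
symmetric = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]
cyclic = [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.5, 0.0, 0.5]]  # not symmetric, columns sum to 1
biased = [[0.9, 0.1], [0.3, 0.7]]
```

A distribution `d` is stationary when `step(d, P)` returns `d` again, which is an easy check to run on the output of `stationary`.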
Try it at home III: 1. Download a human chromosome: http://hgdownload.cse.ucsc.edu/goldenPath/hg18/ 2. Count the number of appearances of each word of 6 DNA bases 3. Now do it again on windows of 1 Mb 4. Are the statistics always the same? How can this be tested? What can explain possible differences?
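Steps 2 and 3 can be sketched as follows, assuming the chromosome has already been read into a single Python string (parsing the downloaded file is not shown). The function names are illustrative, and words containing anything other than A, C, G, T (such as runs of N) are skipped by assumption.

```python
from collections import Counter

def kmer_counts(seq, k=6):
    """Count every overlapping word of length k that consists only of A, C, G, T."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if set(word) <= set("ACGT"):
            counts[word] += 1
    return counts

def windowed_kmer_counts(seq, k=6, window=1_000_000):
    """Repeat the count separately on consecutive, non-overlapping windows."""
    return [kmer_counts(seq[start:start + window], k)
            for start in range(0, len(seq), window)]
```

Comparing the per-window counters with the whole-chromosome counter is one way to approach question 4; which statistical test to use is left open, as on the slide.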
Example: CpG islands. [Figure: CG changing to CA / TG; divergence at CG: 1% diverged vs. 40% diverged] >99% of the genome: CpG is methylated. <1% of the genome: CpG is unmethylated. CpG islands: high CpG stationary distribution, due to variable mutability. Got very famous (3.4M hits in Google..) and are still very confusing. May play a role in cancer – no optimality here.
More complex example: codon bias. [Figure: nucleotides and amino acids] The genetic code is degenerate: more than one codon (triplet) for each amino acid. Codon bias: codons do not appear uniformly in the genome. Popular theory: codon usage is optimized to maximize protein yield.
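Codon bias can be measured by counting, for each amino acid, how often each of its synonymous codons is used. The sketch below assumes the standard genetic code and an in-frame coding sequence given as a plain string; `codon_usage` is an illustrative name.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ("*" = stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)}

def codon_usage(cds):
    """Relative usage of each codon among the codons of the same amino acid.

    Assumes `cds` is an in-frame coding sequence; codons with
    non-ACGT letters and a trailing partial codon are ignored.
    """
    cds = cds.upper()
    counts = Counter(cds[i:i + 3] for i in range(0, len(cds) - 2, 3))
    per_aa = defaultdict(dict)
    for codon, n in counts.items():
        if codon in CODE:
            per_aa[CODE[codon]][codon] = n
    return {aa: {codon: n / sum(d.values()) for codon, n in d.items()}
            for aa, d in per_aa.items()}
```

With no bias, the synonymous codons of an amino acid would each be used at about the same relative frequency; strong deviations from that are what the slide calls codon bias.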
Evolution Genome (Genetic code) CODE1 CODE2 (Codon bias) Function Function2 (Protein coding sequence) (Translation control)
Sequence context affects the probability of a short insertion or deletion by more than 100-fold. Insertions/deletions usually break the gene frame and destroy the derived protein (translating garbage). Codons are selected to reduce the potential of frame-shifting mutations! [Figure panels: sequence context (A, C, G, T) around deletions and around inserts, l=2, against distance from the deletion junction / distance from the insert; relative number of delete-susceptible and insert-susceptible loci against deletion length / insert length, randomized exons vs. real exons]
Evolution Mutability Genome (Genetic code) CODE1 CODE2 (Codon bias) Function Function2 (Protein coding sequence) (Translation control)
Genome sequences are something new and exciting • We can approach them using the old techniques and methodologies… • Focusing on one feature at a time • Looking at the static picture • Boring ourselves to death. Or we can take new bold approaches… This is what Lamarck, Darwin, Mayr, Crick or Kimura would have done… Jacob
Genome alignment: humans and chimps, two genomes of 3×10^9 letters over {A,C,G,T}, ~5-7 million years apart • Where are the “important” differences? • How did they happen?
Where are the “important” differences? How were new features gained? [Tree figure: Gorilla, Chimp, Gibbon, Baboon, Human, Macaque, Marmoset, Orangutan; divergence labels 9%, 1.2%, 0.8%, 3%, 1.5%, 0.5%, 0.5%]
Antibiotic resistance: Staphylococcus aureus Timeline for the evolution of bacterial resistance in an S. aureus patient (Mwangi et al., PNAS 2007) • Skin based • killed 19,000 people in the US during 2005 (more than AIDS) • Resistance to Penicillin: 50% in 1950, 80% in 1960, ~98% today • 2.9MB genome, 30K plasmid • How do bacteria become resistant to antibiotics? • Can we eliminate resistance by better treatment protocols, given understanding of the evolutionary process?
Ultimate experiment: sequence the entire genome of the evolving S. aureus. [Figure: mutations 1, 2, 3, 4-6, 7, 8, 9, 10, 11, 12, 13, 14, 15…18 against resistance to antibiotics] S. aureus found just a few “right” mutations and survived multiple antibiotics.
Yeast genome duplication • The budding yeast S. cerevisiae genome has extensive duplicates • We can trace a whole genome duplication by looking at yeast species that lack the duplicates (K. waltii, A. gossypii) • Only a small fraction (5%) of the yeast genome remains duplicated
How can an organism tolerate genome duplication and massive gene loss? • Is this critical in evolving new functionality?
“Junk” and ultraconservation. Baker’s yeast: 12 Mb, ~6000 genes, 1 cell. The worm C. elegans: 100 Mb, ~20,000 genes, ~1000 cells. Humans: 3 Gb, ~27,000 genes, ~50 trillion cells.
ENCODE Data intergenic exon intron exon intron exon intron exon intergenic
Things you need to know or catch up with: • Basic discrete probability, standard distributions • Basic graphs and combinatorics • A clue about biology will be helpful, not vital. What you’ll learn: • Foundations: Population genetics • Foundations: Molecular evolution • Inference in graphical models • Comparative genomics • Intro to genome organization and key concepts in evolution. Books: • Graur and Li, Molecular Evolution • Lynch, Origin of genome architecture • Hartl and Clark, Population genetics • Durbin et al., Biological sequence analysis • Karlin and Taylor, Markov Processes • Friedman and Koller, draft textbook (handouts; N. Friedman, D. Koller, BN and beyond) • Papers as we go along..
Course duties • 4 exercises, 40% of the grade • 1 Genomic exercise (in pairs) for 10% of the grade • Compare two genomes of your choice: mammals, worms, flies, yeasts, bacteria, plants • Exam: 60% • Master key computational results
Probabilities • Our probability space: • DNA/Protein sequences: {A,C,G,T} • Time/populations • Queries: • If a locus has an A at time t, what is the chance it will be C at time t+1? • If a locus has an A in an individual from population P, what is the chance it will be C in another individual from the same population? • What is the chance of finding the motif ACGCGT anywhere in a random individual of population P? What is the chance it will remain the same after 2 million years? Tools: conditional probability, chain rule, Bayes rule (for events A, B).
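The formulas for the three rules named on this slide did not survive in the transcript. Their standard textbook forms, for two events A and B, are given below; the notation is the usual one and is not necessarily the notation used in the lecture.

```latex
\begin{align*}
\text{Conditional probability:}\quad & P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \qquad P(B) > 0 \\
\text{Chain rule:}\quad & P(A \cap B) = P(A \mid B)\,P(B) \\
\text{Bayes rule:}\quad & P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}
\end{align*}
```

For instance, the first query on the slide asks for a conditional probability of the form P(locus is C at time t+1 | locus is A at time t).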
Modeling processes Modeling the process t Comparing end points (probability distribution
Examples of processes: Poisson process (states 0, 1, 2, 3, 4, …), random walk (states …, -1, 0, 1, 2, 3, …), Markov chain (states A, B, C, D), Brownian motion. Discrete time (T=1, T=2, T=3, T=4, T=5) vs. continuous time (t).
Markov chains. A set of states: finite or countable (e.g., integers, {A,C,G,T}). Discrete time: T=0,1,2,3,…. General stochastic process vs. the Markov property. Transition probability; one-step transitions; stationary transition probabilities (stationary process).
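The equations behind these labels are missing from the transcript. In standard notation, which is an assumption about what the slide showed, with X_t the state at time t, they read:

```latex
\begin{align*}
\text{General stochastic process:}\quad & P(X_{t+1}=j \mid X_0=i_0,\dots,X_{t-1}=i_{t-1},X_t=i) \\
\text{Markov property:}\quad & P(X_{t+1}=j \mid X_0=i_0,\dots,X_{t-1}=i_{t-1},X_t=i) = P(X_{t+1}=j \mid X_t=i) \\
\text{Stationary transition probabilities:}\quad & P(X_{t+1}=j \mid X_t=i) = P_{ij} \quad \text{for every } t
\end{align*}
```

In words: the next state depends on the past only through the current state, and the one-step transition probabilities do not change with time.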
Markov chains: examples. 4 nucleotides (A, G, C, T). 20 amino acids (A R N D C E Q G H I L K M F P S T W Y V). The loaded coin: two states A and B followed over T=1, T=2, T=3, T=4; A moves to B with probability pab and stays with probability 1-pab; B moves to A with probability pba and stays with probability 1-pba.
Markov chains. Transition matrix P: a discrete-time Markov chain is completely defined given an initial condition and a probability matrix. The Markov chain graph G is defined on the states: we connect (a,b) whenever P_ab > 0. Distribution after T time steps given x as an initial condition: matrix power.
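The distribution after T steps is the initial distribution x, written as a row vector, multiplied by the T-th power of P. A minimal sketch with plain Python lists follows; the helper names are illustrative, and the power is computed by repeated squaring, which is a choice made here and not something the slide specifies.

```python
def mat_mul(A, B):
    """Multiply two matrices given as lists of rows."""
    inner, cols = len(B), len(B[0])
    return [[sum(row[x] * B[x][j] for x in range(inner)) for j in range(cols)]
            for row in A]

def mat_pow(P, T):
    """T-th power of a square matrix by repeated squaring."""
    n = len(P)
    result = [[float(i == j) for j in range(n)] for i in range(n)]  # identity
    base = [row[:] for row in P]
    while T > 0:
        if T & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        T >>= 1
    return result

def distribution_after(x, P, T):
    """Distribution over states after T steps, starting from distribution x."""
    return mat_mul([x], mat_pow(P, T))[0]
```

For the two-state matrix `[[0.9, 0.1], [0.3, 0.7]]` (hypothetical numbers) and a start in the first state, two steps give the distribution (0.84, 0.16).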
Spectral decomposition. When an eigen-basis exists, we can find right eigenvectors and left eigenvectors, with the eigenvalue spectrum, which are bi-orthogonal, and define the spectral decomposition. [Figure: states A and B over T=1, T=2, T=3]
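The formulas themselves are missing from the transcript. One standard way to write them is shown below. The symbols are assumptions: r_i is a right (column) eigenvector, l_i a left (row) eigenvector, lambda_i the matching eigenvalue, and the superscript T denotes the T-th power, not a transpose.

```latex
\begin{align*}
P\,r_i &= \lambda_i r_i, & l_i\,P &= \lambda_i l_i, & l_i \cdot r_j &= \delta_{ij}, \\
P &= \sum_i \lambda_i\, r_i\, l_i, & P^{T} &= \sum_i \lambda_i^{T}\, r_i\, l_i
\end{align*}
```

Each term r_i l_i is an outer product (a matrix), so raising P to a power only requires raising the eigenvalues to that power.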
Spectral decomposition. To compute transition probabilities directly: O(|E|)·T ~ O(N^2)·T per initial condition, or T matrix multiplications to preprocess for time T. Using spectral decomposition: O(spectral pre-process) + 2 matrix multiplications per condition.
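For the two-state "loaded coin" chain the spectral decomposition can be written out by hand, which makes the saving visible: the T-step matrix comes from one closed-form expression instead of T multiplications. This is a sketch for that special case only; `p_ab` and `p_ba` stand for the switching probabilities of the loaded coin, the function names are illustrative, and the eigenvalues 1 and 1 - p_ab - p_ba are specific to a 2×2 chain.

```python
def two_state_power(p_ab, p_ba, T):
    """T-step transition matrix of the chain [[1-p_ab, p_ab], [p_ba, 1-p_ba]].

    Uses the eigenvalues 1 and 1 - p_ab - p_ba; assumes p_ab + p_ba > 0.
    """
    s = p_ab + p_ba
    decay = (1 - s) ** T  # second eigenvalue raised to the power T
    return [[(p_ba + p_ab * decay) / s, (p_ab - p_ab * decay) / s],
            [(p_ba - p_ba * decay) / s, (p_ab + p_ba * decay) / s]]

def naive_power(p_ab, p_ba, T):
    """The same matrix computed by T successive multiplications, for comparison."""
    P = [[1 - p_ab, p_ab], [p_ba, 1 - p_ba]]
    R = [[1.0, 0.0], [0.0, 1.0]]
    for _ in range(T):
        R = [[sum(R[i][k] * P[k][j] for k in range(2)) for j in range(2)]
             for i in range(2)]
    return R
```

For larger state spaces no such closed form is available in general, and the decomposition would be computed numerically with a linear algebra library.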
Convergence. Spec(P) = P’s eigenvalues, λ1 > λ2 > ...; λ1 = largest, always = 1. λ2 = second largest eigenvalue, controlling the rate of process convergence. Fixed point. A Markov chain is irreducible if its underlying graph is (strongly) connected. In that case there is a single eigenvalue that equals 1. What does the left eigenvector corresponding to λ1 represent?
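The role of the second eigenvalue can be seen numerically on the two-state loaded coin, where it equals 1 - p_ab - p_ba: the distance between the current probability of state A and its long-run value shrinks by exactly that factor at every step. This is a sketch for this special case, with hypothetical parameter values in the usage note and an illustrative function name.

```python
def distance_to_stationary(p_ab, p_ba, x0, steps):
    """Distance |P(state A at time t) - long-run P(state A)| for t = 0 .. steps-1.

    x0 is the initial probability of being in state A.
    """
    pi_a = p_ba / (p_ab + p_ba)  # long-run probability of state A
    x, dists = x0, []
    for _ in range(steps):
        dists.append(abs(x - pi_a))
        x = x * (1 - p_ab) + (1 - x) * p_ba  # one step of the chain
    return dists
```

With `p_ab = 0.1` and `p_ba = 0.3` the second eigenvalue is 0.6, so starting from `x0 = 1.0` the distances run 0.25, 0.15, 0.09, and so on.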