Genome evolution: a computational approach Lecture 1: Modern challenges in evolution. Markov processes. Amos Tanay, Ziskind 204, ext 3579 עמוס תנאי amos.tanay@weizmann.ac.il http://www.wisdom.weizmann.ac.il/~atanay/GenomeEvo/
The Genome — [schematic: intergenic / exon / intron / exon … / intergenic gene structure; the triplet code over A, T, C, G]
Genome alignment: humans and chimps, separated by ~5-7 million years — two strings of 3×10⁹ characters over {A,C,G,T} • Where are the “important” differences? • How did they happen?
[Primate phylogeny: human, chimp, gorilla, orangutan, gibbon, baboon, macaque, marmoset; branch divergences ranging from ~0.5% to ~9%] Where are the “important” differences? How were new features gained?
Antibiotic resistance: Staphylococcus aureus Timeline for the evolution of bacterial resistance in an S. aureus patient (Mwangi et al., PNAS 2007) • Skin-based pathogen • Killed 19,000 people in the US during 2005 (more than AIDS) • Resistance to penicillin: 50% in 1950, 80% in 1960, ~98% today • 2.9 Mb genome, 30 kb plasmid • How do bacteria become resistant to antibiotics? • Can we eliminate resistance by better treatment protocols, given an understanding of the evolutionary process?
Ultimate experiment: sequence the entire genome of the evolving S. aureus [Timeline: mutations 1-18 accumulating as resistance to antibiotics increases] S. aureus found just a few “right” mutations and survived multiple antibiotics
Yeast genome duplication • The budding yeast S. cerevisiae genome has extensive duplicates • We can trace a whole-genome duplication by looking at yeast species that lack the duplicates (K. waltii, A. gossypii) • Only a small fraction (5%) of the yeast genome remains duplicated
How can an organism tolerate genome duplication and massive gene loss? • Is this critical in evolving new functionality?
“Junk” and ultraconservation Baker’s yeast: 12 Mb, ~6,000 genes, 1 cell The worm C. elegans: 100 Mb, ~20,000 genes, ~1,000 cells Humans: 3 Gb, ~27,000 genes, ~50 trillion cells
ENCODE Data — [gene-structure schematic: intergenic / exon / intron / exon … / intergenic]
Grand unifying theory of everything: Biology (phenotype) ↔ Genomes (genotype) — strings of A,C,G,T (total DNA on earth: a lot, but only that much)
Evolution: bird’s-eye view — [schematic: species A and species B evolving under mutation, recombination, selection, and fitness] Ecology (many species) • Geography (communication barriers) • Environment (changing fitness)
Course outline (assumed background: probability, calculus/matrix theory, some graph theory, some statistics): Probabilistic models • Genome structure • Inference • Mutations • Parameter estimation • Population • Inferring selection
Models: Markov chains (discrete and continuous), Bayesian networks, Factor graphs
Inference: Dynamic programming, Sampling, Variational methods, Generalized belief propagation
Parameter estimation: EM, function optimization
Genome structure and mutations: Introduction to the human genome, Point mutations, Insertions/deletions, Repeats
Population: Basic population genetics, Drift/Fitness/Selection
Selection targets: Protein-coding genes, Transcription factor binding sites, RNA, Networks
Things you need to know or catch up with: • Graph theory: basic definitions, trees, cycles • Matrix algebra: basic definitions, eigenvalues • Probability: basic discrete probability, standard distributions What you’ll learn: • Modern methods for inference in complex probabilistic models • Intro to genome organization and key concepts in evolution • Inferring selection using comparative genomics Books: Graur and Li, Molecular Evolution; Lynch, The Origins of Genome Architecture; Hartl and Clark, Population Genetics; Durbin et al., Biological Sequence Analysis; Karlin and Taylor, Markov Processes; Friedman and Koller, draft textbook on Bayesian networks and beyond (handouts); papers as we go along…
Course duties • 5 exercises, 40% of the grade • Mainly theoretical, math questions, usually ~120 points to collect • Trade 1 exercise for ppt annotations (extensive in-line notes) • 1 Genomic exercise (in pairs) for 10% of the grade • Compare two genomes of your choice: mammals, worms, flies, yeasts, bacteria, plants • Exam: 60% (110% in total)
(0) Modeling the genome sequences — probabilistic modeling: P(data | θ). Using few parameters to explain/regenerate most of the data; hidden variables make the model explicit and mechanistic. [Schematic: model parameters and ancestral genome sequences 1, 2 generating observed genome sequences 1, 2, 3] (1) Inferring ancestral genomes: based on some model, compute the distribution of ancestral genomes (2) Learning an evolutionary model: using extant genomes, learn a “reasonable” model
[Same schematic: model parameters, ancestral genomes, observed genome sequences 1, 2, 3] (1) Decoding the genome: genomic regions with different functions evolve differently — learn to read the genome through evolutionary modeling (2) Understanding the evolutionary process: the model parameters describe evolution (3) Inferring phylogenies: which tree structure explains the data best? Is it a tree?
Probabilities • Our probability space: • DNA/protein sequences: {A,C,G,T} • Time/populations • Queries: • If a locus has an A at time t, what is the chance it will be C at time t+1? • If a locus has an A in an individual from population P, what is the chance it will be C in another individual from the same population? • What is the chance of finding the motif ACGCGT anywhere in a random individual of population P? What is the chance it will remain the same after 2M years? Conditional probability: P(A|B) = P(A∩B)/P(B). Chain rule: P(X_1,…,X_n) = P(X_1)P(X_2|X_1)…P(X_n|X_1,…,X_{n-1}). Bayes rule: P(A|B) = P(B|A)P(A)/P(B).
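A minimal numeric sketch of these three rules; the joint table below is made up purely for illustration, and numpy is assumed:

```python
import numpy as np

# Hypothetical joint distribution P(X, Y) over nucleotides at a locus in two
# individuals; the numbers are invented just to exercise the three rules.
states = "ACGT"
joint = np.array([[0.12, 0.02, 0.05, 0.01],
                  [0.02, 0.15, 0.02, 0.06],
                  [0.06, 0.02, 0.14, 0.02],
                  [0.01, 0.05, 0.02, 0.23]])
assert abs(joint.sum() - 1.0) < 1e-12

p_x = joint.sum(axis=1)                        # marginal P(X)
p_y = joint.sum(axis=0)                        # marginal P(Y)

cond_y_x = joint / p_x[:, None]                # conditional: P(Y|X) = P(X,Y)/P(X)
assert np.allclose(cond_y_x * p_x[:, None], joint)    # chain rule: P(X,Y) = P(Y|X)P(X)

bayes_x_y = cond_y_x * p_x[:, None] / p_y[None, :]    # Bayes: P(X|Y) = P(Y|X)P(X)/P(Y)
print(bayes_x_y[:, states.index("C")])         # P(X | Y = C)
```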
Random Variables & Notation • Val(X) – set of possible values of RV X • Upper case letters denote RVs (e.g., X, Y, Z) • Upper case bold letters denote set of RVs (e.g., X, Y) • Lower case letters denote RV values (e.g., x, y, z) • Lower case bold letters denote RV set values (e.g., x)
Stochastic processes and stationary distributions — [schematic: a process model evolving over time t toward a stationary model]
[Examples: continuous time — Poisson process (counts 0,1,2,3,4), Brownian motion; discrete time — random walk (-1,0,1,2,3), Markov chain over states A,B,C,D at T=1,…,5]
The Poisson process Events occur independently in disjoint time intervals. Let N_t be an r.v. that counts the number of events up to time t. Assume: the probability of exactly one event in a short interval h is λh + o(h), and the probability of two or more events in time h is o(h). Now: Pr(N_{t+h} = 0) = Pr(N_t = 0)(1 − λh + o(h)), giving Pr(N_t = 0) = e^{−λt}.
The Poisson process Probability of m events at time t: write P_m(t) = Pr(N_t = m). Conditioning on the last interval, P_m(t+h) = P_m(t)(1 − λh) + P_{m−1}(t)λh + o(h), which gives the recurrence P′_m(t) = −λP_m(t) + λP_{m−1}(t).
The Poisson process Solving the recurrence: P_m(t) = e^{−λt}(λt)^m / m! — the Poisson distribution with mean λt.
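A quick sketch checking the closed form against a simulation of the process itself (the rate lam, horizon t, and trial count are arbitrary illustration values):

```python
import numpy as np
from math import exp, factorial

rng = np.random.default_rng(0)
lam, t, n_trials = 2.0, 1.5, 50_000

# Simulate exponential(lam) waiting times and count events landing in [0, t];
# 50 waits per trial is far more than enough for lam*t = 3.
waits = rng.exponential(1.0 / lam, size=(n_trials, 50))
counts = (np.cumsum(waits, axis=1) <= t).sum(axis=1)

# Compare P_m(t) = e^(-lam*t) (lam*t)^m / m! with the empirical frequencies.
for m in range(6):
    closed = exp(-lam * t) * (lam * t) ** m / factorial(m)
    print(f"m={m}: closed form {closed:.4f}, simulated {(counts == m).mean():.4f}")
```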
Markov chains A general stochastic process: a set of states, finite or countable (e.g., the integers, or {A,C,G,T}), and discrete time T = 0,1,2,3,…. The Markov property: Pr(X_{t+1} = b | X_0,…,X_t) = Pr(X_{t+1} = b | X_t). Transition probability (one-step transitions): p_ab = Pr(X_{t+1} = b | X_t = a). The process has stationary transition probabilities when p_ab does not depend on t.
Markov chains — examples: a 4×4 transition matrix over the nucleotides {A,C,G,T}; a 20×20 matrix over the amino acids {A,R,N,D,C,E,Q,G,H,I,L,K,M,F,P,S,T,W,Y,V}; the loaded coin: two states A, B with transition probabilities p_ab and p_ba (and self-transitions 1 − p_ab, 1 − p_ba) over T = 1,2,3,4
Markov chains Transition matrix P: a discrete-time Markov chain is completely defined given an initial condition and a probability matrix. The Markov chain graph G is defined on the states: connect (a,b) whenever p_ab > 0. The distribution after T time steps given initial condition x is the matrix power xP^T.
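A short sketch of propagating an initial condition through T steps; the 4×4 matrix over {A,C,G,T} is a toy illustration, not a fitted model:

```python
import numpy as np

# Toy transition matrix over {A, C, G, T}; each row sums to one.
P = np.array([[0.91, 0.03, 0.04, 0.02],
              [0.03, 0.90, 0.02, 0.05],
              [0.04, 0.02, 0.91, 0.03],
              [0.02, 0.05, 0.03, 0.90]])
assert np.allclose(P.sum(axis=1), 1.0)

x0 = np.array([1.0, 0.0, 0.0, 0.0])        # start at A with probability 1
T = 10
print(x0 @ np.linalg.matrix_power(P, T))   # distribution after T steps: x P^T
```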
Spectral decomposition When an eigenbasis exists, we can find right eigenvectors, P r_i = λ_i r_i, and left eigenvectors, l_i P = λ_i l_i, with the eigenvalue spectrum λ_1,…,λ_N. These are bi-orthogonal, l_i · r_j = δ_ij, and define the spectral decomposition P = Σ_i λ_i r_i l_i, hence P^T = Σ_i λ_i^T r_i l_i.
Spectral decomposition To compute transition probabilities directly: O(|E|)·T ≈ O(N²)·T per initial condition, or T matrix multiplications to preprocess powers up to time T. Using spectral decomposition: one spectral preprocessing step plus two matrix multiplications per condition, for any T.
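A sketch of this spectral shortcut, assuming P is diagonalizable (true for the toy matrix above, but not guaranteed in general):

```python
import numpy as np

P = np.array([[0.91, 0.03, 0.04, 0.02],
              [0.03, 0.90, 0.02, 0.05],
              [0.04, 0.02, 0.91, 0.03],
              [0.02, 0.05, 0.03, 0.90]])

evals, R = np.linalg.eig(P)   # columns of R are right eigenvectors
L = np.linalg.inv(R)          # rows of L are the matching left eigenvectors

def P_power(T):
    """P^T = sum_i lambda_i^T r_i l_i, computed as R diag(lambda^T) L."""
    return (R * evals**T) @ L

# Changing T only rescales the eigenvalues; compare with direct matrix power.
assert np.allclose(P_power(25).real, np.linalg.matrix_power(P, 25))
```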
Convergence Spec(P) = P’s eigenvalues, λ_1 ≥ λ_2 ≥ …; λ_1 is the largest and always equals 1. Fixed point: a distribution x with xP = x. λ_2, the second-largest eigenvalue, controls the rate of convergence of the process. A Markov chain is irreducible if its underlying graph is strongly connected; in that case there is a single eigenvalue that equals 1. What does the left eigenvector corresponding to λ_1 represent?
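One way to explore the whiteboard question numerically: extract the left eigenvector of P for eigenvalue 1 (an eigenvector of P transposed), normalize it to sum to 1, and check that it is a fixed point (toy P as above):

```python
import numpy as np

P = np.array([[0.91, 0.03, 0.04, 0.02],
              [0.03, 0.90, 0.02, 0.05],
              [0.04, 0.02, 0.91, 0.03],
              [0.02, 0.05, 0.03, 0.90]])

evals, vecs = np.linalg.eig(P.T)          # left eigenvectors of P
i = int(np.argmin(np.abs(evals - 1.0)))   # the eigenvalue closest to 1
pi = np.real(vecs[:, i])
pi = pi / pi.sum()                        # normalize to a distribution

assert np.allclose(pi @ P, pi)            # pi P = pi: a fixed point of the chain
print(pi)
```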
Continuous time Think of time steps that become smaller and smaller. Markov conditions on transitions: P_ij(t) = Pr(X_{s+t} = j | X_s = i), with P(t+s) = P(t)P(s) and P(0) = I. Kolmogorov’s theorem: q_i = lim_{h→0} (1 − P_ii(h))/h exists (may be infinite), and for i ≠ j, q_ij = lim_{h→0} P_ij(h)/h exists and is finite.
Rates and transition probabilities The process’s rate matrix Q: off-diagonal entries q_ij (i ≠ j), diagonal entries q_ii = −Σ_{j≠i} q_ij, so each row sums to zero. Transition differential equations (backward form): P′(t) = QP(t).
Matrix exponential The differential equation P′(t) = QP(t) with P(0) = I has the series solution P(t) = e^{Qt} = Σ_{n≥0} (Qt)^n/n! — summing over different path lengths: 1-path, 2-path, 3-path, 4-path, 5-path, …
Computing the matrix exponential Series methods: just take the first k summands; reasonable when ||A|| ≤ 1; if the terms are converging, you are OK; can do scaling and squaring: e^A = (e^{A/2^s})^{2^s}. Eigenvalues/decomposition: good when the matrix is symmetric; problems when eigenvalues are nearly equal. Multiple further methods use other forms of B (e.g., triangular).
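A compact sketch of the scaling-and-squaring series method, checked against scipy.linalg.expm; the rate matrix Q and the truncation depth k are illustration choices:

```python
import numpy as np
from scipy.linalg import expm

def expm_series(A, k=12):
    """Scale A until its 1-norm is at most 1, sum k Taylor terms, square back."""
    norm = np.linalg.norm(A, 1)
    s = max(0, int(np.ceil(np.log2(norm)))) if norm > 1 else 0
    B = A / 2**s                          # scaling: ||B||_1 <= 1
    out, term = np.eye(len(A)), np.eye(len(A))
    for n in range(1, k + 1):             # truncated series: sum_{n<=k} B^n / n!
        term = term @ B / n
        out = out + term
    for _ in range(s):                    # squaring: e^A = (e^(A/2^s))^(2^s)
        out = out @ out
    return out

# Toy rate matrix (rows sum to zero; illustration values only).
Q = np.array([[-0.9,  0.3,  0.4,  0.2],
              [ 0.3, -1.0,  0.2,  0.5],
              [ 0.4,  0.2, -0.9,  0.3],
              [ 0.2,  0.5,  0.3, -1.0]])
assert np.allclose(expm_series(Q * 2.0), expm(Q * 2.0))
```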
Modeling: simple case Genome 1, Genome 2 → Alignment: AGCAACAAGTAAGGGAAACTACCCAGAAAA… AGCCACATGTAACGGTAATAACGCAGAAAA… → Statistics: a 4×4 table of aligned-pair counts n_ab over {A,G,C,T} → Modeling → Inference and Learning. Maximum likelihood model: p̂_ab = n_ab / Σ_c n_ac.
Modeling: simple case The same pipeline, reading the learned table as a transition matrix at t = 1: alignment → pair counts → P(t=1).
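A minimal sketch of this count-and-normalize estimator on the toy alignment from the slide, assuming equal-length, gap-free sequences over {A,C,G,T}:

```python
import numpy as np

seq1 = "AGCAACAAGTAAGGGAAACTACCCAGAAAA"
seq2 = "AGCCACATGTAACGGTAATAACGCAGAAAA"

idx = {c: i for i, c in enumerate("ACGT")}
counts = np.zeros((4, 4))
for a, b in zip(seq1, seq2):
    counts[idx[a], idx[b]] += 1   # n_ab: a in genome 1 aligned to b in genome 2

# ML transition probabilities: p_ab = n_ab / sum_c n_ac (row normalization)
P_hat = counts / counts.sum(axis=1, keepdims=True)
print(np.round(P_hat, 2))
```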
Modeling: but is it kosher? [Schematic: evolving with (Q,t) and then (Q,t′) versus evolving with (Q,t+t′) — do the estimated models compose, P(t)P(t′) = P(t+t′)?]
Symmetric processes Definition: we call a Markov process symmetric if its rate matrix is symmetric: q_ij = q_ji for all i, j. What would a symmetric process converge to? (whiteboard/exercise) Reversing time: how does a symmetric process behave when run backward? (whiteboard/exercise)
Reversibility Definition: a reversible Markov process is one for which, for any times t < s and states i, j: Pr(X_t = i, X_s = j) = Pr(X_t = j, X_s = i). Claim: a Markov process is reversible iff there exist p_i such that p_i q_ij = p_j q_ji (whiteboard/exercise). If this holds, we say the process is in detailed balance.
Reversibility Claim: a Markov process is reversible iff we can write Q = S·diag(p), where S is a symmetric matrix (whiteboard/exercise).
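A sketch constructing a reversible rate matrix as Q = S·diag(p) and checking detailed balance; S and p are arbitrary illustration values:

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(1)
p = np.array([0.1, 0.2, 0.3, 0.4])         # target stationary distribution

S = rng.uniform(0.5, 2.0, size=(4, 4))
S = (S + S.T) / 2                          # symmetric exchangeability terms

Q = S * p[None, :]                         # q_ij = s_ij * p_j for i != j
np.fill_diagonal(Q, 0.0)
np.fill_diagonal(Q, -Q.sum(axis=1))        # rate-matrix rows sum to zero

# Detailed balance p_i q_ij = p_j q_ji <=> diag(p) Q is symmetric.
assert np.allclose(p[:, None] * Q, (p[:, None] * Q).T)

# And p is stationary for the continuous-time process: p e^{Qt} = p.
assert np.allclose(p @ expm(Q * 0.7), p)
```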