1 / 41

The 1983 Kreitman Data (M. Kreitman 1983 Nature) from Hartl & Clark, 1997

The 1983 Kreitman Data (M. Kreitman 1983 Nature) from Hartl & Clark, 1997. 11 alleles 3200 bp long. 43 segregating sites (columns with variation). 1 amino replacement event 1 insertion-deletion (indel). Coalescent Theory in Biology

lucus
Download Presentation

The 1983 Kreitman Data (M. Kreitman 1983 Nature) from Hartl & Clark, 1997

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. The 1983 Kreitman Data (M. Kreitman 1983 Nature) from Hartl & Clark, 1997 11 alleles 3200 bp long. 43 segregating sites (columns with variation). 1 amino replacement event 1 insertion-deletion (indel)

  2. Coalescent Theory in Biology www://stagen.ncsu.edu/thorne/workshop.html & www.daimi.au.dk/~compbio/coalescent/ Fixed Parameters: Population Structure, Mutation, Selection, Recombination,... Reproductive Structure Genealogies of non-sequenced data Genealogies of sequenced data CATAGT CGTTAT TGTTGT Parameter Estimation Model Testing

  3. 1 2 2N 1 2 2N

  4. 10 Alleles’ Ancestry for 15 generations

  5. The same disentangeled

  6. Descendents and Extinction of a lineage

  7. Females Males 1 2 Nf 1 2 Nm 1 2 Nf 1 2 Nm

  8. Basic Idea: Waiting for common ancestors. What is the distribution until 2 alleles had a common ancestor, X2?: P(X2 > 1) = (2N-1)/2N = 1-(1/2N), P(X2 > j) = (1-(1/2N))j P(X2 = j) = (1-(1/2N))j-1 (1/2N) Mean, E(X2) = 2N. Ex.:2N = 20.000, Generation time 30 years, E(X2) = 600000 years.

  9. P(k):=P{k alleles had k distinct parents} The first can choose any (2N), the second (2N-1), .., the k’th (2N-k+1), ie. 2N * (2N-1) *..* (2N-k+1). The total number of ways to choose parents is withoput this restriction is (2N)k . P(k)=1*(1-1/2N)*..*(1-(k-1)/N) If k2 << 2N then P(k) approximately is Which again can be approximated by Precludes: Simultaneous events & Multifurcations

  10. Genealogy for sample of 3

  11. The Exponential Distribution. The Geometric Distribution: {0,1,..} Geo(p): P{Z=j)=pj(1-p) P{Z>j)=pj E(Z)=1/p. The Exponential Distribution: R+ Expo(a) Density: f(t) = ae-at, P(X>t)= e-at 0 1 2 3 Properties: X expo(a) Y expo(b) independent i. P(X>t2|X>t1) = P(X>t2-t1) (t2 > t1) ii. E(X) = 1/a. iii. P(Z>t)=(ca.)P(X>t) small a (p=e-a). iv. P(X < Y) = a/(a + b). v. min(X,Y) ~expo(a + b).

  12. Discrete --> Continous Time tc:=td/2N m mutation pr. nucleotide pr.generation. L: sequence length µ=m*L Mutation pr. allele pr.generation. 2N - allele number. Q := 4N*µ (Expected number of mutations before common anc. 2 alleles) Waiting time back to first mutation is expo(Q). Probability for two genes being identical: P(Coalescence < Mutation) = 1/(1+Q). Note: Mutation rate and population size usually appear together as a product, making separate estimation difficult. Xk is expo( ) distributed => E(Xk)=2N/{k*(k-1)/2}

  13. Continuous-Time Coalescent 1.0 corresponds to 2N generations 1.0 0.0 1 4 2 6 5 3

  14. The Coalescent 1 The tree generating process for k alleles: 1) wait expo({k*(k-1)/2}) until first coalescence time. 2) Choose random pair (i,j) and declare them one allele. Go to 1) but with k->(k-1). Continue until k=1. Consequence: The Waiting Process and the Merging Process are independent. q=4N*ms*L, where L is length of gene, ms is the probability of mutation in a nucleotide/generation. mg= ms*L would then be the probability for mutation in a gene/gneration when is very small (is in the order of 10-9.)

  15. Coalescent Combinatorics I The probability that there are k differences between two sequences. Going back in time 2 kinds of events can occur (mutations ( - or a coalescent (1). This gives a geometric distribution: --*-------*------*----- ----*----*----*----*---

  16. The Coalescent 2. ! 1 ! Infinity /\ 2 / \ 1 2 / \ /\ \ 3 / \ \ 1/3 1 k / \ \ \ \ 1/ 2/(k-1) Total height of tree: Hk= 2-(1/k) i.Infinitely many alleles finds 1 allele in finite time. ii. In takes less than twice as long for k alleles to find 1 ancestors as it does for 2 alleles. Total branch length in tree, Lk: 2*(1 + 1/2 + 1/3 +..+ 1/(k-1)) ca= 2*ln(k-1)

  17. A set of realisations(from Felsenstein) Observations: Variation great close to root. Trees are unbalanced.

  18. Sampling more sequences (from Felsenstein) The probability that the ancestor of the sample of size n is in a sub-sample of size k is Letting n go to infinity gives (k-1)/(k+1), i.e. even for quite small samples it is quite large.

  19. Coalescent Combinatorics II The basal division splits the leaves into (k,n-k) sets with probability: 1/(n-1) 18 The number of ”coalescent topologies”: 3 The number of ”unrooted tree topologies”: 1 2 3 4

  20. i j k 1 (i,j) n m k-1 1 (i,j) (n,m) k-2 1 Trees: Rooted, bifurcating & nodes time-ranked. Recursion: Tk= Tk-1 Initialisation: T1= T2=1

  21. Trees: Unrooted & valency 3 1 2 3 1 2 1 3 1 1 1 1 1 1 2 2 2 2 2 2 4 3 4 2 3 4 4 3 3 3 3 4 4 3 4 4 5 5 5 5 5 Recursion: Tn= (2n-5) Tn-1 Initialisation: T1= T2= T3=1

  22. 1 2 3 i j 2N 1 2 3 i j 2N Robustness of the Coalescent Robustness of “The Coalescent”: If offspring distribution is exchangeable and Var(n1) --> s2 & E(n1m) < Mm for all m, then Rst:= converges to The Coalescent in distribution. Exchangeable Distribution: X(1),X(2),..,X(2N) same distribution as X((1)),X(( 2)),..,X(( 2N))

  23. Infinite Allele Model 4 5 1 2 3

  24. Ewens' formula. (1972 TPB 3.87-112) The only observation made in the infinite allele models is identity/non-identity among all pairs of alleles. I.e. The central observation is a series of classes and their sizes. P5(2,0,1,0,0) is the probability of seeing 2 singles and one allele in 3 copies in a sample of 5. Obviously, a1+2a2+ +iai +nan=n Pn(a1,a2, ,an) = En(k types) = Pn(a1,a2, ,an;k) = Where Snk is the (Stirling numbers of second kind) number of ways to make k sets of {1,..,n}. k is a minimal sufficient statistic for Pn(a1,a2, ,an;k) the probability of the data conditioned on k is -less and there is no simpler such statistic.

  25. Ewens' formula - example. (1972 TPB 3.87-112) P5(2,0,1,0,0) = E5(k types) = P5(2,0,1,0,0;3) = Assume has been observed and that 0.5 mutation is expected per unit (2N) time.

  26. Infinite Site Model 4 5 1 2 3 Final Aligned Data Set:

  27. Probability of a data set under infinite allele site model(Griffiths87-95)

  28. Age-labelled trees (Griffiths-Ethier-Tavare) 1 1 2 3 1 2 3 2 1 2 3 Q

  29. Watterson’s Estimator Var(segr) = ACCTGAACGTAGTTCGAAG ACCTGAACGTAGTTCGAAT ACCTGACCGTAGTACGAAT ACATGAACGTAGTACGAAT ACATGAACGTAGTACGAAT * * * * 1 3 2 4 5 Expected Number Segregating Sites: *(1+1/2+ +1/(k-1))  := Segr/(1+1/2+ +1/(k-1))

  30. Finite Site Model acctgcat acctgcat acctgcat acgtgcat acgtgctt acctgcat tcctgcat acgtgcgt 4 5 1 2 3 tcctgcat tcctgcat acgtgctt acgtgctt acgtgcgt acctgcat tcctgcat tcctgcat s s s Final Aligned Data Set:

  31. Jukes-Cantor: Total Symmetry. Rate-matrix, R: T O A C G T F A -3* R C  -3* O G  -3* M T  -3* Transition prob. after time t, a = *t: P(equal) = 1/4 + 3/4*e-4*a = 1 - *t P(diff.) = 3/4 - 3/4* e-4*a = *t Stationary Distribution: (1,1,1,1)/4.

  32. Felsenstein81 for each column Multiply for all columns GCAGGTT TCAGCCT TCAGCAT Probability of Data – Finite Site. 3 step approach: I Probability of Data given topology and branch lengths II Integrate over branch lengths III Sum over topologies Conclusion: Exact Calculation Computationally Intractible!!

  33. Pairwise Distance Estimator 1 3 2 4 5 s1 = ACCTGT.....AG s2 = ACTTGT.....AG s3 = ACCTGT.....AC s4 = TCTTGC.....AC * * * * 6 6 4 4 4 4 4 A mutation on a (n,n-k) branch will be counted n*(n-k) times. I.e. deep branches have higher weights. PD := Average Pairwise Distance (above 2.166) VarPD) = (n+1)/3(n-1) + 2 2(n2+n+3)/9n(n-1)

  34. The Coalescent & Population Growth Growth will elongate leaf edges relative to deep edges! If the population size is known as function of time N(t), time can be scaled as tt = . For exponential growth, e-at this gives tt(eat-1)/a

  35. Tajima’s Test D = (PD -W)/Sd(PD -W) A large value indicates shortened tips A small value indicates shortened deep branches. Mitochondria (Ingman et al. 2000) 52 complete molecules 521 segregating sites PD = 44.2 W = 115.3 V(D) 31.8 D = -2.23

  36. Ancestors to Ancestors hi,j = probability that i individuals has j ancestors after time t. i[k] = i(i-1)..(i-k+1) i (k) = i(i+1)..(i+k-1)

  37. 1,2 1,2,3 2,3 1,3 3 Ancestors to 2 Ancestors = (3/2)(e-t - e-3t) 1,2,3 1,2,3 1,2,3 3 areas: No coalescence A: e-3t 1 specific B: 1-e-t 2 coalescences C: ? A + 3B - 2C = 1 Exactly one coalescence: 1-A-C

  38. 7 7 6 5 4 4 t1 t 0 Pk(t1) = hi,k(t1)* hk,j(t- t1)/ hi,j(t)

  39. 986 H. sapiens ts Te Neanderthal Tt Homo Sapiens & the Neanderthal (nordborg) Two Scenarios: Constant Female Pop.Size 3.400 Growing for 50.000 years to 5*108. Problem: Can the observed be explained by one common H.sapiens - Neanderthal population? Constant Pop.size Recent Growth 30.000 100.000 30.000 100.000 E(A()) 4.86 1.75 782 2.86 P(topology) .085 .56 3.3 10-6 .24 P(topology & Tt > 4Te) .0063 .035 3.7 10-8 .002

  40. G H C G H C Gorilla Chimp H.sapiens The Coalescent & Phylogenetics Gene Trees (GT) & Specie Trees (ST) Pop. Size N ?? T ?? = C + H or (CH) P{GT = ST} = 1 - e-T/2N * (2/3)

  41. Basic Coalescent Summary Genealogical approach to population genetics ”The Coalescent” - generic probability distribution on allele trees. Without recombination, one data set only ”samples” one genealogy!

More Related