1 / 65

Phylogeny of Mixture Models

Phylogeny of Mixture Models. Daniel Štefankovič Department of Computer Science University of Rochester joint work with Eric Vigoda College of Computing Georgia Institute of Technology. Outline. Introduction (phylogeny, molecular phylogeny) Mathematical models (CFN, JC, K2, K3)

pepin
Download Presentation

Phylogeny of Mixture Models

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Phylogeny of Mixture Models • Daniel Štefankovič • Department of Computer Science • University of Rochester • joint work with • Eric Vigoda • College of Computing • Georgia Institute of Technology

  2. Outline Introduction (phylogeny, molecular phylogeny) Mathematical models (CFN, JC, K2, K3) Maximum likelihood (ML) methods Our setting: mixtures of distributions ML, MCMC for ML fails for mixtures Duality theorem: tests/ambiguous mixtures Proofs (strictly separating hyperplanes, non-constructive ambiguous mixtures)

  3. Phylogeny development of a group: the development over time of a species, genus, or group, as contrasted with the development of an individual (ontogeny) orangutan gorilla chimpanzee human

  4. Phylogeny – how? development of a group: the development over time of a species, genus, or group, as contrasted with the development of an individual (ontogeny) past – morphologic data (beak length, bones, etc.) present – molecular data (DNA, protein sequences)

  5. Molecular phylogeny INPUT: aligned DNA sequences Human: Chimpanzee: Gorilla: Orangutan: ATCGGTAAGTACGTGCGAA TTCGGTAAGTAAGTGGGATTTAGGTCAGTAAGTGCGTTTTGAGTCAGTAAGAGAGTT OUTPUT: phylogenetic tree orangutan gorilla chimpanzee human

  6. Example of a real phylogenetic tree Eucarya Universal phylogeny deduced from comparison of SSU and LSU rRNA sequences (2508 homologous sites) using Kimura’s 2-parameter distance and the NJ method. The absence of root in this tree is expressed using a circular design. Archaea Bacteria Source: Manolo Gouy, Introduction to Molecular Phylogeny

  7. Dictionary Leaves = Taxa = {chimp, human, ...} Vertices = Nodes Edges = Branches Tree = Tree orangutan gorilla Unrooted/Rooted trees chimpanzee human

  8. Outline Introduction (phylogeny, molecular phylogeny) Mathematical models (CFN, JC, K2, K3) Maximum likelihood (ML) methods Our setting: mixtures of distributions ML, MCMC for ML fails for mixtures Duality theorem: tests/ambiguous mixtures Proofs (strictly separating hyperplanes, non-constructive ambiguous mixtures)

  9. Cavender-Farris-Neyman (CFN) model Weight of an edge = probability that 0 and 1 get flipped 0.15 0.06 0.32 0.22 0.09 0.12 orangutan gorilla chimpanzee human

  10. CFN model Weight of an edge = probability that 0 and 1 get flipped 0 0.15 0.06 0.32 0.22 0.09 0.12 orangutan gorilla chimpanzee human

  11. CFN model Weight of an edge = probability that 0 and 1 get flipped 0 0.15 0.06 0.32 0.22 0.09 0.12 orangutan gorilla chimpanzee human 1 with probability 0.32 0 with probability 0.68

  12. CFN model Weight of an edge = probability that 0 and 1 get flipped 0 0.15 0.06 0.32 0.22 0.09 0.12 1 orangutan gorilla chimpanzee human 1 with probability 0.32 0 with probability 0.68

  13. CFN model Weight of an edge = probability that 0 and 1 get flipped 0 0.15 0.06 0.32 0.22 0.09 0.12 1 orangutan gorilla chimpanzee human

  14. CFN model Weight of an edge = probability that 0 and 1 get flipped 0 0.15 0.06 0.32 0.22 0.09 0.12 1 orangutan gorilla chimpanzee human 1 with probability 0.15 0 with probability 0.85

  15. CFN model Weight of an edge = probability that 0 and 1 get flipped 0 0.15 0 0.06 0.32 0.22 0 0.09 0.12 0 1 0 1 orangutan gorilla chimpanzee human

  16. CFN model Weight of an edge = probability that 0 and 1 get flipped 0 0.15 0 0.06 0.32 0.22 0 0.09 0.12 0 1 0 1 orangutan gorilla chimpanzee human 0……….. 0……….. 1……….. 1………..

  17. CFN model Weight of an edge = probability that 0 and 1 get flipped 1 0.15 0 0.06 0.32 0.22 0 0.09 0.12 0 0 1 1 orangutan gorilla chimpanzee human 01……….. 00……….. 10……….. 11………..

  18. CFN model Weight of an edge = probability that 0 and 1 get flipped 1 0.15 1 0.06 0.32 0.22 1 0.09 0.12 1 1 1 0 orangutan gorilla chimpanzee human 011…….. 001…….. 101…….. 110……..

  19. CFN model Weight of an edge = probability that 0 and 1 get flipped 0.15 0.06 0.32 0.22 0.09 0.12 orangutan gorilla chimpanzee human 0000,0001,0010,0011,0100,0101,0110,0111,… Denote the distribution on leaves (T,w) T = tree topology w = set of weights on edges

  20. Generalization to more states Weight of an edge = probability that 0 and 1 get flipped transition matrix A G C T A G C T A A A G C A T orangutan gorilla chimpanzee human

  21. Models: Jukes-Cantor (JC) Rate matrix exp( t.R ) A G C T A there are 4 states G C T

  22. Models: Kimura’s 2 parameter (K2) Rate matrix exp( t.R ) A G C T A purine/pyrimidine mutations less likely G C T

  23. Models: Kimura’s 3 parameter (K3) Rate matrix exp( t.R ) A G C T A take hydrogen bonds into account G C T

  24. Reconstructing the tree? Let D be samples from (T,w). Can we reconstruct T (and w) ? • parsimony • distance based methods • maximum likelihood methods (using MCMC) • invariants • ? Main obstacle for all methods: too many leaf-labeled trees (2n-3)!!=(2n-3)(2n-5)…3.1

  25. Outline Introduction (phylogeny, molecular phylogeny) Mathematical models (CFN, JC, K2, K3) Maximum likelihood (ML) methods Our setting: mixtures of distributions ML, MCMC for ML fails for mixtures Duality theorem: tests/ambiguous mixtures Proofs (strictly separating hyperplanes, non-constructive ambiguous mixtures)

  26. Maximum likelihood method Let D be samples from (T,w). Likelihood of tree S is L(S) = maxw Pr(D | S,w) For |D|!1 then the maximum likelihood tree is T

  27. MCMC Algorithms for max-likelihood Combinatorial steps: NNI moves (Nearest Neighbor Interchange) Numerical steps (i.e., changing the weights) Move with probability min{1,L(Tnew)/L(Told)}

  28. MCMC Algorithms for max-likelihood Only combinatorial steps: NNI moves (Nearest Neighbor Interchange) Does this Markov Chain mix rapidly? Not known!

  29. Outline Introduction (phylogeny, molecular phylogeny) Mathematical models (CFN, JC, K2, K3) Maximum likelihood (ML) methods Our setting: mixtures of distributions ML, MCMC for ML fails for mixtures Duality theorem: tests/ambiguous mixtures Proofs (strictly separating hyperplanes, non-constructive ambiguous mixtures)

  30. Mixtures one tree topology multiple mixtures Can we reconstruct the tree T? The mutation rates differ for positions in DNA

  31. Reconstruction from mixtures - ML Theorem 1: maximum likelihood: fails to for CFN, JC, K2, K3 For all 0<C<1/2, all x sufficiently small: (i) maximum likelihood tree ≠ true tree (ii) 5-leaf version: MCMC torpidly mixing Similarly for JC, K2, and K3 models

  32. Reconstruction from mixtures - ML Related results: [Kolaczkowski,Thornton] Nature, 2004. Experimental results for JC model [Chang] Math. Biosci., 1996. Different example for CFN model.

  33. Reconstruction from mixtures - ML Proof: Difficulty: finding edge weights that maximize likelihood. For x=0, trees are the same -- pure distribution, tree achievable on all topologies. So know max likelihood weights for every topology. (observed)T log (T,w) If observed comes from \mu(S,v) then it is optimal to take T=S and w=v (basic property of log-likelihood)

  34. Reconstruction from mixtures - ML Proof: Difficulty: finding edge weights that maximize likelihood. For x=0, trees are the same -- pure distribution, tree achievable on all topologies. So know max likelihood weights for every topology. For x small, look at Taylor expansion bound max likelihood in terms of x=0 case and functions of Jacobian and Hessian. =

  35. Outline Introduction (phylogeny, molecular phylogeny) Mathematical models (CFN, JC, K2, K3) Maximum likelihood (ML) methods Our setting: mixtures of distributions ML, MCMC for ML fails for mixtures Duality theorem: tests/ambiguous mixtures Proofs (strictly separating hyperplanes, non-constructive ambiguous mixtures)

  36. Reconstruction – other algorithms? • Duality theorem: Every model has either: • A) ambiguous mixture distributions on 4 leaf trees • (reconstruction impossible) • B) linear tests (reconstruction easy) GOAL: Determine tree topology The dimension of the space of possible linear tests: CFN = 2, JC = 2, K2 = 5, K3 = 9

  37. Ambiguity in CFN model For all 0<a,b<1/2, there is c=c(a,b) where: above mixture distribution on tree T is identical to below mixture distribution on tree S. Previously: non-constructive proof of nicer ambiguity in CFN model [Steel,Szekely,Hendy,1996]

  38. What about JC?

  39. What about JC? Reconstruction of the topology from mixture possible. Linear test = linear function which is >0 for mixture from T2 <0 for mixture from T3 There exists a linear test for JC model. Follows immediately from Lake’1987 – linear invariants.

  40. Lake’s invariants ! Test f=μ(AGCC) + μ(ACAC) + μ(AACT) +μ(ACGT) - μ(ACGC) - μ(AACC) - μ(ACAT) - μ(AGCT) For μ=μ(T1,w), f=0 For μ=μ(T2,w), f<0 For μ=μ(T3,w), f>0

  41. Linear invariants v. Tests Linear invariant = hyperplane containing mixtures from T1 Test = hyperplane strictly separating mixtures from T2 from mixtures from T3 pure distributions from T2

  42. Linear invariants v. Tests Linear invariant = hyperplane containing mixtures from T1 Test = hyperplane strictly separating mixtures from T2 from mixtures from T3 pure distributions from T2 mixtures from T2

  43. Linear invariants v. Tests Linear invariant = hyperplane containing mixtures from T1 Test = hyperplane strictly separating mixtures from T2 from mixtures from T3 mixtures from T3 mixtures from T2

  44. Linear invariants v. Tests Linear invariant = hyperplane containing mixtures from T1 Test = hyperplane strictly separating mixtures from T2 from mixtures from T3 mixtures from T3 mixtures from T2 test

  45. Duality theorem: Every model has either: • A) ambiguous mixture distributions on 4 leaf trees • (reconstruction impossible) • B) linear tests (reconstruction easy) Separating hyperplanes Separating hyperplane theorem: ambiguous mixture separating hyperplane

  46. Duality theorem: Every model has either: • A) ambiguous mixture distributions on 4 leaf trees • (reconstruction impossible) • B) linear tests (reconstruction easy) Strictly separating hyperplanes ??? Separating hyperplane theorem ?: ambiguous mixture strictly separating hyperplane?

  47. Strictly separating not always possible Separating hyperplane theorem ?: ambiguous mixture strictly separating hyperplane? NO strictly separating hyperplane {(0,0)} { (x,y) | x>0 } [ { (0,y) | y>0 }

  48. When strictly separating possible? NO strictly separating hyperplane {(0,0)} { (x,y) | x>0 } [ { (0,y) | y>0 } (x,y2 – xz) x¸0, y>0 Lemma: Sets which are convex hulls of images of open sets under a multi-linear polynomial map have a strictly separating hyperplane. standard phylogeny models satisfy the assumption

  49. Outline Introduction (phylogeny, molecular phylogeny) Mathematical models (CFN, JC, K2, K3) Maximum likelihood (ML) methods Our setting: mixtures of distributions ML, MCMC for ML fails for mixtures Duality theorem: tests/ambiguous mixtures Proofs (strictly separating hyperplanes, non-constructive ambiguous mixtures)

  50. Proof Lemma: For sets which are convex hulls of images of open sets under a multi-linear polynomial map – strictly separating hyperplane. Proof: P1(x1,…,xm),…,Pn(x1,…,xm), x=(x1,…,xm) 2 O WLOG linearly independent

More Related