Incomplete Lineage Sorting: Consistent Phylogeny Estimation From Multiple Loci (& a couple of unrelated observations). Elchanan Mossel, UC Berkeley. Joint work with Sebastien Roch, Microsoft Research. Newton Institute, Dec '07.
Lecture Plan • A simple observation about gene trees and population trees. • A comment on "optimal" and "absolutely converging" tree reconstruction. • A comment on "generic models". • A comment on "network reconstruction". • Disclaimer: last talk – a bit philosophical (but would be happy to provide hard technical proofs).
Gene Trees and Population Trees • Main goal in phylogenetics: recovering species/population histories. • Data: current genes. • Issue: in recent populations, gene trees may differ from population trees. • Model for the evolution of trees in populations – coalescence: • Fixed-size population N. • Each individual chooses a random parent in the previous generation. • # generations = N × branch length. • Main question: how to reconstruct population trees from gene trees?
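The coalescent model above is easy to simulate: in a fixed-size population where each individual picks a uniformly random parent, two lineages merge exactly when they pick the same parent. A minimal sketch (function names are illustrative, not from the talk):

```python
import random

def coalescence_time(N, rng):
    """Generations until two lineages coalesce in a fixed-size
    population of N individuals: each generation, each lineage picks
    a uniformly random parent; they coalesce when the parents match."""
    t = 0
    while True:
        t += 1
        if rng.randrange(N) == rng.randrange(N):  # same parent chosen
            return t

# The coalescence time is geometric with mean ~N generations, which is
# why branch lengths in generations scale as N x (coalescent-unit length).
N = 100
rng = random.Random(0)
avg = sum(coalescence_time(N, rng) for _ in range(2000)) / 2000
```

With 2000 trials the empirical mean concentrates near N = 100, illustrating the "# generations = N × branch length" scaling on the slide.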
Gene Trees: The Engineering Approach • Two common "engineering" approaches: • Approach 1: assume all genes come from a single tree. • Kubatko-Degnan: inconsistent. • Approach 2: build a tree for each gene on its own, then take the majority tree. • Degnan-Rosenberg: inconsistent. • Q: What should be done instead?
Gene Trees: A Rigorous Approach • M-Roch: a consistent estimator of the molecular distance d(P1,P2) between two populations is: • D(P1,P2) = min { d_g(P1,P2) : g ∈ Genes } • ⇒ distances between populations are identifiable ⇒ the tree is identifiable. • Under standard coalescence assumptions, get a good rate: • P(topology error) ≤ (# pops) × exp(−c · # genes), • c = shortest branch length. • The estimator can be "plugged in" to any distance-based method for reconstructing trees. • M-Roch use NJ, but the same works for: • short quartets (ESSW), • distorted metrics and forests (M), • etc.
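The estimator itself is one line: take the minimum per-gene distance over loci for each pair of populations, then hand the resulting matrix to any distance-based method (NJ in M-Roch). A sketch with an illustrative data layout (the pair keys and numbers are made up):

```python
def population_distance(gene_dists):
    """D(P1,P2) = min over genes g of d_g(P1,P2): the M-Roch
    consistent estimator of the distance between two populations."""
    return min(gene_dists)

def distance_matrix(per_pair_gene_dists):
    """per_pair_gene_dists: dict mapping a population pair to its list
    of per-gene distances. The output matrix can be plugged into any
    distance-based reconstruction method (e.g. neighbor-joining)."""
    return {pair: population_distance(ds)
            for pair, ds in per_pair_gene_dists.items()}

D = distance_matrix({('P1', 'P2'): [0.42, 0.35, 0.51],
                     ('P1', 'P3'): [0.60, 0.58],
                     ('P2', 'P3'): [0.55, 0.49, 0.50]})
```

Taking the minimum works because incomplete lineage sorting can only make a gene tree's divergence *older* than the population split, never younger, so the smallest observed distance tracks the true split.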
Comments on Absolute Convergence • Algorithmic paradigm: Want to reconstruct tree on • n species using • sequence length L and • running time T. • “Absolute Convergence”: L = poly(n); T = poly(n). • Q: Is this the best we can do?
Resolution of Steel's Conjecture: ancestral reconstruction vs. phylogenetic reconstruction. • Short branches := all branches < l_c; long branches := all branches > l_c. • l_c depends on the mutation model but not on the tree, tree size, etc. • [Daskalakis-M-Roch '06]: short branches ⇒ sequence length L = c log n suffices. • [M '04]: long branches ⇒ sequence length L = n^C. • (n = # species.)
The Algorithmic Challenge • Conj: for short branches, if the data is generated from the model, ML identifies the correct tree using L = O(log n) samples (the best bound known is L = exp(O(n))). • Conclusion: in order to "beat" ML, need algorithms with L = O(log n). • Challenge: the constant in the O is important! • Challenge: deal with short/long branches (contract edges; output a forest). • Challenge: general mutation models (not just CFN, JC). • Comment: rigorous methods have a running-time guarantee. • Comment: for L = poly(n), know how to deal with all challenges: ESSW; M '07 (forests – long edges); Gornieu et al. (short edges).
On Generic Parameters • From Rhodes's talk: "Generic models are easier to identify." • Typically this means generic parameters. How about generic trees?
Mixtures and Phenomena in High Dimensions • The geometry of high dimensions: "almost every collection of k vectors is almost orthogonal in high enough dimension n". • M-Roch (in preparation): for every k, as n → ∞ the probability that a mixture of k trees on n leaves is identifiable goes to 1. • Holds for most reasonable measures on the space of trees and most mutation models. • Basic idea: in generic situations one can (almost) cluster samples according to trees. • Gives an efficient algorithm. • Similar results hold for rates across sites.
A Comment on Dynamic Programming • Q (Zhang): given a tree, is it possible to find the most informative k species? • In terms of parsimony? In terms of ML? • Note: if we know the parsimony/ML score for the left/right subtree, we know it at the root. • Q: Can we use dynamic programming? • A: Yes – but with the right "data structure". • Information per node: a discrete version of the set of achievable distributions. • Called "density evolution" in coding theory / spin-glass theory. • Additive error = 1/poly(n).
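The bottom-up combination in the note (root score determined by left/right subtree scores) is exactly the recursion of Fitch's small-parsimony algorithm. A minimal single-character sketch, with an illustrative nested-tuple tree encoding:

```python
def fitch(tree, leaf_states):
    """Fitch's algorithm for one character on a rooted binary tree.
    tree: nested tuples with string leaves, e.g. (('a','b'),('c','d'));
    leaf_states: dict leaf -> state.
    Returns (state set, parsimony score) at the root."""
    if isinstance(tree, str):                 # leaf: its own state, cost 0
        return {leaf_states[tree]}, 0
    left, right = tree
    ls, lc = fitch(left, leaf_states)         # solve left subtree
    rs, rc = fitch(right, leaf_states)        # solve right subtree
    inter = ls & rs
    if inter:                                 # subtrees agree: no new change
        return inter, lc + rc
    return ls | rs, lc + rc + 1               # disagree: charge one mutation

states, score = fitch((('a', 'b'), ('c', 'd')),
                      {'a': 0, 'b': 1, 'c': 1, 'd': 1})
```

For richer scores (the "achievable distributions" on the slide), the per-node value stored by the DP is a set or discretized distribution rather than a single number, but the combine-at-the-root structure is the same.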
Hardness of Distinguishing Network Models with Hidden Nodes • Basic question: is it possible to recover a network G from observations at a subset of the nodes? • Easier question: suppose we observe X1,…,Xr. Is it possible to determine whether they come from nodes S in G1 or nodes T in G2? • Problem: it may be that the two distributions are the same. • Assume: the two distributions are different (large total-variation distance). • Q: Assuming the two distributions are different, how hard is it to tell whether the data come from G1 or G2? • Related question: what is a computational model of a biologist?
The Distinguishing Problem for Trees • Q: Assuming the two distributions are different, how hard is it to tell whether the data come from T1 or T2? • Note: for trees the problem is easy: • Perform a likelihood test. • Easy to do efficiently (peeling/pruning, dynamic programming). • # samples needed: poly(n).
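The likelihood test on the slide can be sketched directly: Felsenstein's pruning (peeling) recursion computes each candidate tree's likelihood in time linear in the tree, and the test picks the larger one. A toy two-state symmetric (CFN-like) model, one site, with an illustrative tree encoding; the flip probability and trees are made up:

```python
def transition(p_flip):
    """2x2 symmetric substitution matrix for a single branch."""
    return [[1 - p_flip, p_flip], [p_flip, 1 - p_flip]]

def prune(tree, leaf_state, P):
    """Felsenstein pruning: return [L(subtree | root=0), L(... | root=1)]."""
    if isinstance(tree, str):                 # leaf: indicator of its state
        v = [0.0, 0.0]
        v[leaf_state[tree]] = 1.0
        return v
    left, right = tree
    lv, rv = prune(left, leaf_state, P), prune(right, leaf_state, P)
    return [sum(P[s][t] * lv[t] for t in range(2)) *
            sum(P[s][t] * rv[t] for t in range(2)) for s in range(2)]

def likelihood(tree, leaf_state, p_flip=0.1):
    root = prune(tree, leaf_state, transition(p_flip))
    return 0.5 * (root[0] + root[1])          # uniform prior at the root

# Distinguishing test: which candidate tree better explains the data?
T1 = (('a', 'b'), ('c', 'd'))
T2 = (('a', 'c'), ('b', 'd'))
data = {'a': 0, 'b': 0, 'c': 1, 'd': 1}
better = T1 if likelihood(T1, data) > likelihood(T2, data) else T2
```

Here the data splits {a,b} vs {c,d}, so the test favors T1; with poly(n) i.i.d. sites the same comparison separates any two topologically distinct trees, which is why the tree case is easy.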
Two Models of a Biologist • The computationally limited biologist: cannot solve hard computational problems; in particular, cannot sample from a general G-distribution. • The computationally unlimited biologist: can sample from any distribution. • Related to the following problem: can nature solve computationally hard problems? (From Shapiro at Weizmann.)
Hardness Results • The computationally limited biologist (Bogdanov-M): the distinguishing problem can be solved efficiently iff NP = RP. • The computationally unlimited biologist (Bogdanov-M): the problem is at least zero-knowledge hard. • Zero-knowledge problem: can we decide whether samples from a computationally efficient distribution are coming from the uniform distribution? • Related to cryptography.
Reconstructing Networks • Motivation: abundance of stochastic networks in biology, social networks, neuroscience, etc. • A network defines a distribution as follows: • G = (V,E) = graph on [n] = {1,2,…,n}. • Distribution defined on A^V, where A is some finite set. • To each clique C in G, associate a function ψ_C : A^C → R+, and: P[σ] ∝ ∏_C ψ_C(σ_C). • Called a Markov random field, factorized distribution, etc. • Directed models are also common. • Markov property: if S separates A from B, then A and B are conditionally independent given S.
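The factorized definition above is short enough to execute: the unnormalized probability of an assignment σ is the product of clique potentials ψ_C evaluated on σ restricted to C, and the partition function is the brute-force sum over all assignments. The cliques and potentials below are illustrative:

```python
from itertools import product

def unnormalized_prob(sigma, cliques):
    """sigma: dict node -> value; cliques: list of (nodes, psi) where
    psi maps a tuple of values on `nodes` to a positive number.
    Computes prod over cliques C of psi_C(sigma restricted to C)."""
    p = 1.0
    for nodes, psi in cliques:
        p *= psi(tuple(sigma[v] for v in nodes))
    return p

def partition_function(nodes, values, cliques):
    """Brute-force normalizer Z = sum over all assignments in values^nodes.
    Exponential in n, so only for tiny examples."""
    return sum(unnormalized_prob(dict(zip(nodes, a)), cliques)
               for a in product(values, repeat=len(nodes)))

# Toy pairwise (Ising-like) potential on the path 1 - 2 - 3:
# agreeing neighbors get weight 2, disagreeing neighbors weight 1.
psi = lambda ab: 2.0 if ab[0] == ab[1] else 1.0
cliques = [((1, 2), psi), ((2, 3), psi)]
Z = partition_function([1, 2, 3], [0, 1], cliques)
```

Note the Markov property falls out of this form: conditioning on node 2 makes the two potential factors independent, so nodes 1 and 3 are conditionally independent given the separator {2}.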
Reconstructing Networks • Task 1: given samples of σ, find G. • Task 2: given samples of σ restricted to a set S, find G. • Will consider the problem when n is large and the maximum degree d is small. • (Note that the specification of the model is of size max(n, exp(max |C|)).)
Reconstructing Networks – A Trivial Algorithm • Lower bound (Bresler-M-Sly): in order to recover G of max degree d, need at least c·d·log n samples. • Proof follows by "counting the number of networks". • Upper bound (Bresler-M-Sly): if the distribution is "non-degenerate", c·d·log n samples suffice. • Trivial algorithm: • For each v ∈ V: • Enumerate over candidate neighborhoods N(v). • For each w ∈ V, check whether v is independent of w given N(v). • Non-degeneracy: for every v and every w ∈ N(v), there exist two assignments σ1 and σ2 to N(v) that differ at w with: dTV(P(v | σ1), P(v | σ2)) ≥ ε. • For the soft-core model it suffices to have, for all ψ = ψ_{u,v}: max_{a,b,c,d} |ψ(c,a) − ψ(d,a) + ψ(c,b) − ψ(d,b)| > ε. • Running time = O(n^{d+1} log n).
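The trivial algorithm can be sketched in a few lines: for each vertex v, enumerate candidate neighborhoods U of size at most d and accept U if every other vertex looks conditionally independent of v given U. The independence test below is a crude empirical conditional-covariance check on binary samples; the threshold, bucket cutoff, and helper names are illustrative, not from Bresler-M-Sly:

```python
from collections import defaultdict
from itertools import combinations
import random

def cond_dependent(samples, v, w, U, thresh=0.05):
    """Crude empirical test: does w stay correlated with v after
    conditioning on the values of the candidate neighborhood U?"""
    buckets = defaultdict(list)
    for s in samples:
        buckets[tuple(s[u] for u in U)].append((s[v], s[w]))
    for pairs in buckets.values():
        if len(pairs) < 10:                # too few samples in this bucket
            continue
        mv = sum(a for a, _ in pairs) / len(pairs)
        mw = sum(b for _, b in pairs) / len(pairs)
        cov = sum((a - mv) * (b - mw) for a, b in pairs) / len(pairs)
        if abs(cov) > thresh:
            return True
    return False

def neighborhood(samples, v, vertices, d):
    """Trivial algorithm for one vertex: smallest candidate U (|U| <= d)
    such that every other vertex w looks cond. independent of v given U."""
    others = [u for u in vertices if u != v]
    for size in range(d + 1):
        for U in combinations(others, size):
            if all(not cond_dependent(samples, v, w, U)
                   for w in others if w not in U):
                return set(U)
    return None

# Usage demo on the 3-node chain 0 - 1 - 2 (each value copied from the
# previous node with probability 0.8): the true neighborhood of 0 is {1}.
rng = random.Random(1)
samples = []
for _ in range(3000):
    x0 = rng.randrange(2)
    x1 = x0 if rng.random() < 0.8 else 1 - x0
    x2 = x1 if rng.random() < 0.8 else 1 - x1
    samples.append({0: x0, 1: x1, 2: x2})
nbhd = neighborhood(samples, 0, [0, 1, 2], 1)
```

Enumerating all size-≤-d subsets for each of the n vertices is what gives the O(n^{d+1} log n) running time on the slide.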
A Trivial Algorithm – Related Results • Trivial algorithm: • For each v ∈ V: • Enumerate over candidate neighborhoods N(v). • For each w ∈ V, check whether v is independent of w given N(v). • Related work (the algorithm was suggested before): • P. Abbeel, D. Koller, A. Ng: without restrictions, learn a model whose KL distance from the generating model is small (no guarantee of obtaining the true model; in order to get O(1) KL distance, need poly samples). • M. J. Wainwright, P. Ravikumar, J. D. Lafferty: use L1 regularization to get the true model for Ising models; sampling complexity O(d^5 log n) – no running-time bounds. • Other related work: assuming a special form of potentials.
Variants of the Trivial Algorithm • If the graph has exponential decay of correlations, Corr(u,v) ≤ exp(−c·d(u,v)), it suffices to enumerate candidates for N(v) among the possible w's correlated with v. • Running time: O(n^2 log n + n·f(d)). • Missing nodes: suppose G is triangle-free; then a variant of the algorithm can find one hidden node. • Idea (with M. Biskup's help): run the algorithm as if the node is not hidden. • Noise: the algorithm tolerates small amounts of noise (statistical robustness). • Q: What about higher amounts of noise? • (From Bresler-M-Sly.)
Higher Noise & a Non-Identifiable Example • Bresler-M-Sly: an example of non-identifiability. • Consider • G1 = path of length 2, • G2 = triangle, • plus noise (interior nodes hidden, endpoints observed). • Assume an Ising model with random interactions and random noise. • Then with constant probability, one cannot distinguish between the models. • Ising: P[σ] ∝ ∏_{(u,v) ∈ E} exp(β_{uv} σ(u) σ(v)). • Intuitive reason: the dimension of the distribution is 3 in both cases.
Thanks !! • Sebastien Roch • Costis Daskalakis • Andrej Bogdanov
Thanks !! Fascinating workshop: Principal Organiser: Professor Mike Steel (University of Canterbury, NZ). Organisers: Professor Vincent Moulton (University of East Anglia) and Dr Katharina Huber (University of East Anglia). Sponsored by the Allan Wilson Centre for Molecular Ecology and Evolution. As part of a great program: Organisers: Professor V Moulton (East Anglia), Professor M Steel (Canterbury) and Professor D Huson (Tübingen).