ReCombinatorics: Phylogenetic Networks with Recombination. Two Recent Results and Two Open Questions. CPM, June 18, 2008, Pisa, Italy.
What is population genomics? • The Human genome “sequence” is done. • Now we want to sequence many individuals in a population to correlate similarities and differences in their sequences with genetic traits (e.g. disease or disease susceptibility). • Presently, we can’t sequence large numbers of individuals, but we can sample the sequences at SNP sites.
SNP Data • A SNP is a Single Nucleotide Polymorphism - a site in the genome where two different nucleotides appear with sufficient frequency in the population (say each with 5% frequency or more). • SNP maps have been compiled with a density of about 1 site per 1000. • SNP data is what is mostly collected in populations - it is much cheaper to collect than full sequence data, and focuses on variation in the population, which is what is of interest.
Haplotype Map Project: HAPMAP • NIH-led project ($100M) to find common SNP haplotypes (“SNP sequences”) in the Human population. • Association mapping: HAPMAP is used to try to associate genetically influenced diseases with specific SNP haplotypes, either to find causal haplotypes or to find the region near causal mutations. • The key to the logic of association mapping is historical recombination in populations. Nature has done the experiments; now we try to make sense of the results.
The Perfect Phylogeny Model for SNP sequences. Only one mutation per site allowed. The tree derives the set M: 10100, 10000, 01011, 01010, 00010.
[Figure: perfect phylogeny over sites 1-5, rooted at the ancestral sequence 00000; site mutations label the edges and the extant sequences are at the leaves.]
When can a set of sequences be derived on a perfect phylogeny? Classic necessary and sufficient condition (NASC): Arrange the sequences in a matrix. Then (with no duplicate columns), the sequences can be generated on a unique perfect phylogeny if and only if no two columns (sites) contain all four pairs: 0,0 and 0,1 and 1,0 and 1,1. This is the 4-Gamete Test.
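The 4-gamete test can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk: the function names `incompatible_pairs` and `has_perfect_phylogeny` are hypothetical, and sequences are assumed to be equal-length strings over 0 and 1.

```python
from itertools import combinations

def incompatible_pairs(sequences):
    """Return the 1-based site pairs that contain all four gametes 00, 01, 10, 11."""
    n_sites = len(sequences[0])
    bad = []
    for i, j in combinations(range(n_sites), 2):
        gametes = {(s[i], s[j]) for s in sequences}
        if len(gametes) == 4:
            bad.append((i + 1, j + 1))
    return bad

def has_perfect_phylogeny(sequences):
    """True iff no pair of sites fails the 4-gamete test."""
    return not incompatible_pairs(sequences)

# The set M derived by the perfect phylogeny on the earlier slide.
M = ["10100", "10000", "01011", "01010", "00010"]
print(has_perfect_phylogeny(M))
# Adding 10101 introduces incompatible pairs.
print(incompatible_pairs(M + ["10101"]))
```

On the slide's set M the test passes; once 10101 is added, the reported pairs include sites 4 and 5.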
A richer model. M: 10100, 10000, 01011, 01010, 00010, with 10101 added.
[Figure: the perfect phylogeny for the first five sequences, over sites 1-5 and rooted at 00000.]
Pair 4, 5 fails the four-gamete test. The sites 4, 5 are incompatible. Real sequence histories often involve recombination.
Sequence Recombination. Single crossover recombination.
[Figure: parents P = 10100 and S = 01011 recombine at point 5 to give 10101.]
A recombination of P and S at recombination point 5. The first 4 sites come from P (Prefix) and the sites from 5 onward come from S (Suffix).
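A single-crossover recombination is just a prefix/suffix splice. The sketch below assumes 1-based recombination points as on the slide; the function name `recombine` is hypothetical.

```python
def recombine(prefix_parent, suffix_parent, point):
    """Single crossover: sites 1..point-1 come from prefix_parent,
    sites point..end come from suffix_parent (sites are 1-based)."""
    if len(prefix_parent) != len(suffix_parent):
        raise ValueError("parents must have the same number of sites")
    if not 1 <= point <= len(prefix_parent):
        raise ValueError("recombination point out of range")
    return prefix_parent[:point - 1] + suffix_parent[point - 1:]

# P = 10100 supplies the first 4 sites, S = 01011 supplies site 5 onward.
print(recombine("10100", "01011", 5))  # prints 10101
```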
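Network with Recombination: ARG. M: 10100, 10000, 01011, 01010, 00010, and the new sequence 10101.
[Figure: the previous tree over sites 1-5, rooted at 00000, extended with one recombination node whose parents are P = 10100 and S = 01011, with recombination point 5, producing 10101.]
The previous tree with one recombination event now derives all the sequences.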
A Min ARG for Kreitman’s data. ARG created by SHRUB.
An illustration of why we are interested in recombination: Association Mapping of Complex Diseases Using ARGs
Association Mapping • A major strategy being practiced to find genes influencing disease from haplotypes of a subset of SNPs. • Disease mutations: unobserved. • A simple example to explain association mapping and why ARGs are useful, assuming the true ARG is known.
[Figure: SNP haplotype 0 1 0 0 1 with the disease mutation site marked.]
Very Simplistic: Mapping the Unobserved Mutation of Mendelian Diseases with ARGs. Assumption (for now): A sequence is diseased iff it carries the single disease mutation.
[Figure: ARG over sites 1-5, rooted at 00000, with two recombination nodes (parents marked P and S; recombination points 3 and 4) and leaves a:00010, b:10010, c:00100, d:10100, e:01100, f:01101, g:00101; the diseased leaves are marked.]
Where is the disease mutation? What part of 01100 do d, e, f inherit? The single disease mutation occurs between sites 2 and 3!
Mapping Disease Genes with Inferred ARGs • “..the best information that we could possibly get about association is to know the full coalescent genealogy…” – Zollner and Pritchard, 2005 • But we do not know the true ARG! • Goal: infer ARGs from SNP data for association mapping. • Not easy, and often only approximated (e.g. Zollner and Pritchard). • Improved results for the inference: Y. Wu (RECOMB 2007).
Results on Reconstructing the Evolution of SNP Sequences • Part I: Clean mathematical and algorithmic results: Galled-Trees, near-uniqueness, graph-theory lower bound, the Decomposition Theorem, the Forest Theorem and the History Lower Bound. • Part II: Practical computation of Lower and Upper bounds on the number of recombinations needed. Construction of (optimal) phylogenetic networks; uniform sampling; haplotyping with ARGs; LD mapping … • Part III: Varied Biological Applications • Part IV: Extension to Gene Conversion • Part V: The Minimum Mosaic Model of Recombination (CPM 2007). This talk will discuss two topics in Part I.
Minimizing Recombinations in unconstrained networks • Problem: given a set of sequences M, find a phylogenetic network generating M, minimizing the number of recombinations used to generate M, allowing only one mutation per site. This has biological meaning in appropriate contexts. • The general minimization problem is NP-hard. • We can solve this problem in poly-time for the special case of Galled-Trees, to be defined.
The Decomposition Theorem. Since the minimization problem is NP-hard, we want to break a problem into subproblems that can be solved separately and then combined.
Incompatible Sites A pair of sites (columns) of M that fail the 4-gametes test are said to be incompatible. A site that is not in such a pair is compatible.
Incompatibility Graph G(M)
Matrix M (rows a-g, sites 1-5):
   1 2 3 4 5
a  0 0 0 1 0
b  1 0 0 1 0
c  0 0 1 0 0
d  1 0 1 0 0
e  0 1 1 0 0
f  0 1 1 0 1
g  0 0 1 0 1
[Figure: G(M) drawn on nodes 1-5.]
Two nodes are connected iff the pair of sites is incompatible, i.e., fails the 4-gamete test. THE MAIN TOOL: We represent the pairwise incompatibilities in an incompatibility graph.
The connected components of G(M) are very informative. For example, we have the Theorem: The number of non-trivial connected components is a lower bound on the number of recombinations needed in any network. We will see that the non-trivial connected components are the key to the finest possible decomposition, and have other essential uses.
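As an illustration, the incompatibility graph and the connected-component lower bound can be computed directly from the matrix. This is a sketch under simple assumptions (equal-length 0/1 strings, 1-based site numbers); the function names are hypothetical and not taken from the talk.

```python
from itertools import combinations

def incompatibility_graph(sequences):
    """Adjacency sets over 1-based sites; an edge joins two sites that fail the 4-gamete test."""
    n_sites = len(sequences[0])
    adj = {site: set() for site in range(1, n_sites + 1)}
    for i, j in combinations(range(n_sites), 2):
        if len({(s[i], s[j]) for s in sequences}) == 4:
            adj[i + 1].add(j + 1)
            adj[j + 1].add(i + 1)
    return adj

def connected_components(adj):
    """Connected components of the graph, each as a sorted list of sites."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            v = stack.pop()
            if v in comp:
                continue
            comp.add(v)
            stack.extend(adj[v] - comp)
        seen |= comp
        components.append(sorted(comp))
    return components

def component_lower_bound(sequences):
    """Number of non-trivial (more than one site) components of G(M)."""
    comps = connected_components(incompatibility_graph(sequences))
    return sum(1 for c in comps if len(c) > 1)

# Rows a-g of the example matrix M.
M = ["00010", "10010", "00100", "10100", "01100", "01101", "00101"]
print(connected_components(incompatibility_graph(M)))
print(component_lower_bound(M))
```

For the example matrix this yields the components {1, 3, 4} and {2, 5}, so the bound is 2.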
Recombination Cycles • In a Phylogenetic Network, with a recombination node x, if we trace two paths backwards from x, then the paths will eventually meet. • The cycle specified by those two paths is called a “recombination cycle”.
A maximal set of intersecting cycles forms a Blob.
[Figure: example network rooted at 00000 with two recombination nodes (parents marked P and S) whose recombination cycles form a blob.]
If directions on the edges are removed, a blob is a bi-connected component of the network.
Blobbed Trees • Contracting each blob in a network results in a directed, rooted tree; otherwise one of the “blobs” was not maximal. Simple, but key insight. • So every phylogenetic network can be viewed as a directed tree of blobs - a blobbed-tree. The blobs are the non-tree-like parts of the network. A blob that is just a single cycle is called a “gall”, and a network where all blobs are galls is called a “Galled-Tree”.
Every network is a tree of blobs. A network where every blob is a single cycle is a Galled-Tree. Ugly tangled network inside the blob.
A Simple Observation In any network N for M, all sites from the same non-trivial connected component of G(M) must appear together in a single blob in N.
The Decomposition Theorem. Theorem: For any set of sequences M, there is a phylogenetic network that derives M, where each blob contains all and only the sites in one non-trivial connected component of G(M). The compatible sites can always be put on edges outside of any blob. This “fully-decomposed” network is the finest decomposition possible.
Example: Network for input M with one blob.
[Figure: network rooted at 00000 with two recombination nodes (parents marked P and S) inside a single blob; leaves a:00010, b:10010, c:00100, d:10100, e:01100, f:01101, g:00101.]
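The fully-decomposed network for M.
[Figure: the incompatibility graph on nodes 1-5, next to the fully-decomposed network with two recombination nodes (parents marked p and s) and leaves a: 00010, b: 10010, c: 00100, d: 10100, e: 01100, f: 01101, g: 00101.]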
Moreover, the backbone tree and the partition of sites into blobs, and the sequences exported from any blob are all invariant features of the fully-decomposed networks for M, and can be determined in polynomial-time.
So, we can find a network for M by solving the (rooted) recombination minimization problem for each connected component of G(M) separately, and then connect those subnetworks in an invariant way. The resulting network will be a network with the fewest recombination nodes over all fully-decomposed networks for M.
Algorithmically • Finding the tree part of the blobbed-tree is easy. • Determining the sequences labeling the exterior nodes on any blob is easy. • Determining a “good” structure inside a blob B is the problem of generating the sequences of the exterior nodes of B. • It is easy to test whether the exterior sequences on B can be generated with only a single recombination. The original galled-tree problem is now just the problem of testing whether one single-crossover recombination is sufficient for each blob. • That can be solved by successively removing each exterior sequence and testing if the remaining sequences can be generated on a perfect phylogeny of the correct form.
Proof Ideas Let C be a connected component of G(M). Define M[C] as the sequences in M restricted to the sites in C.
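[Figure: G(M) on nodes 1-5 with its two components C1 and C2 marked (figure labels B1, B2).]
Matrix M (rows a-g, sites 1-5), with M[C1] on sites 1, 3, 4 and M[C2] on sites 2, 5:
   M: 1 2 3 4 5    M[C1]: 1 3 4    M[C2]: 2 5
a     0 0 0 1 0           0 0 1           0 0
b     1 0 0 1 0           1 0 1           0 0
c     0 0 1 0 0           0 1 0           0 0
d     1 0 1 0 0           1 1 0           0 0
e     0 1 1 0 0           0 1 0           1 0
f     0 1 1 0 1           0 1 0           1 1
g     0 0 1 0 1           0 1 0           0 1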
Now for each connected component C in G(M), call each distinct sequence in M[C] a supercharacter, and let W be the indicator matrix for the supercharacters. So W indicates which rows of M contain which particular supercharacters.
Supercharacters 1-4 are the distinct rows of M[C1] (sites 1, 3, 4) and supercharacters 5-8 are the distinct rows of M[C2] (sites 2, 5):
   M[C1]  supercharacter   M[C2]  supercharacter   W: 1 2 3 4 5 6 7 8
a  0 0 1  1                0 0    5                   1 0 0 0 1 0 0 0
b  1 0 1  2                0 0    5                   0 1 0 0 1 0 0 0
c  0 1 0  3                0 0    5                   0 0 1 0 1 0 0 0
d  1 1 0  4                0 0    5                   0 0 0 1 1 0 0 0
e  0 1 0  3                1 0    6                   0 0 1 0 0 1 0 0
f  0 1 0  3                1 1    7                   0 0 1 0 0 0 1 0
g  0 1 0  3                0 1    8                   0 0 1 0 0 0 0 1
Proof Ideas Lemma: No pair of supercharacters are incompatible. So by the NASC for a Perfect Phylogeny, there is a unique perfect phylogeny T for W.
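The supercharacter construction and the lemma can be checked on the running example with a short sketch. The helper names below (`restrict`, `supercharacter_matrix`, `columns_compatible`) are hypothetical, and the components are passed in by hand rather than computed; supercharacters are numbered in order of first appearance, which is an assumption about the slide's numbering.

```python
from itertools import combinations

def restrict(sequences, sites):
    """M[C]: each sequence restricted to the given 1-based sites."""
    return ["".join(s[j - 1] for j in sites) for s in sequences]

def supercharacter_matrix(sequences, components):
    """Indicator matrix W: one column per distinct row of M[C], for each component C."""
    columns = []
    for sites in components:
        restricted = restrict(sequences, sites)
        # dict.fromkeys keeps distinct patterns in first-seen order.
        for pattern in dict.fromkeys(restricted):
            columns.append([1 if r == pattern else 0 for r in restricted])
    return [[col[i] for col in columns] for i in range(len(sequences))]

def columns_compatible(matrix):
    """True iff no pair of columns contains all four gametes."""
    n_cols = len(matrix[0])
    return all(
        len({(row[i], row[j]) for row in matrix}) < 4
        for i, j in combinations(range(n_cols), 2)
    )

# Rows a-g of M and the two components of G(M).
M = ["00010", "10010", "00100", "10100", "01100", "01101", "00101"]
W = supercharacter_matrix(M, [[1, 3, 4], [2, 5]])
for name, row in zip("abcdefg", W):
    print(name, row)
print(columns_compatible(W))
```

On this input every pair of columns of W passes the 4-gamete test, in line with the lemma.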
Proof Ideas For each connected component C of G(M), all supercharacters that originate from C label edges in T that are incident with one single node v[C] in T. So, if we expand each node v[C] to be a network that generates the supercharacters from C (the sequences in M[C]), and connect each network correctly to the edges in T, the resulting network is a fully-decomposed blobbed-tree that generates M.
However … While fully-decomposed networks always exist, they do not necessarily minimize the number of recombination nodes, over all possible networks. That is, sometimes it pays to put sites from different connected components together on the same blob.
Sufficient Conditions. But we can prove several useful sufficient conditions for when there is a fully-decomposed network that minimizes the number of recombinations, over all possible networks. The deepest result: Theorem: Let N be a phylogenetic network for input M, let L be the set of sequences that label the nodes of N, and let G(L) be the incompatibility graph for L. If G(L) and G(M) have the same number of connected components, then there is a fully-decomposed network for M with the same number of recombinations as in N. (JCB, December 2007)
Corollary A fully-decomposed network exists that minimizes the number of recombinations, unless every optimal network uses some recombination node(s) labeled by sequence(s) not in M, and the addition of those sequences to M creates an incompatibility between sites in different components of G(M).
[Figure: network over sites 1-6, rooted at 000000, with three recombination nodes (parents marked p and s); sequences shown include 000100, 001000, 011010, 100101, 010010, 100001 and the interior sequence 100010.]
Sequences in M are in black. Sequence 100010 is not in M. G(M) has two components. Each requires two recombinations, but this combined network needs only three. G(L) has one component. The addition of sequence 100010 reduces the number of components from 2 to 1.
G(M) for the original data.
[Figure: G(M) on nodes 1-6.]
Two components, so two blobs; each blob requires two recombinations, by the HK lower bound theorem, so a fully-decomposed network needs at least four recombinations.
[Figure: G(L) on nodes 1-6.]
G(L) is created from the original data and the addition of the new interior sequence 100010. G(L) has only one connected component, compared to two components for G(M).
A Practical Sufficient Condition If M can be derived on a network N in which every edge contains at most one site, and every node is labeled with a sequence in M, then there is a fully-decomposed network for M which minimizes the number of recombinations over all possible networks for M.
Another Practical Sufficient Condition If M can be derived on a network N where the number of recombinations equals the (poly-computable) Haplotype Lower Bound, then there is a fully decomposed network for M which minimizes the number of recombinations over all possible networks.
Theorem: For any K, there is a dataset where the best fully-decomposed network uses K recombinations more than optimal. In that construction, the ratio of the number of recombinations in the best fully-decomposed network to the optimal is constant as K grows. Open Question: Construct examples that show that the ratio can be arbitrarily large.
A New Recombination Lower Bound and The Minimum Perfect Phylogenetic Forest Problem. Yufeng Wu and Dan Gusfield, UC Davis. COCOON’07, July 16, 2007.
History Bound (Myers & Griffiths 2003) • Iterate the following operations: (1) Remove a column with a single 0 or 1. (2) Remove a duplicate row. (3) Remove any row. • History bound: the minimum number of type-3 operations needed to reduce the matrix to empty.
Example: M = 000, 100, 010, 011, 111.
[Figure: successive reductions of M, through 000, 100, 010, 011, then 00, 10, 01, 01, then 00, 10, 01, down to the empty matrix.]
Empty. One type-3 operation.
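For very small matrices the history bound can be computed by exhaustive search, as in the sketch below. This is an illustration rather than the method of the cited paper: it runs in exponential time, the names `reduce_matrix` and `history_bound` are hypothetical, and it assumes that operation 1 also covers columns that have become constant (at most one 0 or at most one 1), so that a matrix with at most one distinct row counts as empty.

```python
from functools import lru_cache

def reduce_matrix(rows):
    """Apply operations 1 and 2 until neither applies; rows are 0/1 strings."""
    rows = set(rows)  # a set drops duplicate rows (operation 2)
    while True:
        if not rows:
            return frozenset()
        width = len(next(iter(rows)))
        # Keep only columns with at least two 0s and at least two 1s.
        keep = [
            j for j in range(width)
            if min(sum(r[j] == "1" for r in rows),
                   sum(r[j] == "0" for r in rows)) >= 2
        ]
        if len(keep) == width:
            return frozenset(rows)
        rows = {"".join(r[j] for j in keep) for r in rows}

@lru_cache(maxsize=None)
def _history_bound(rows):
    reduced = reduce_matrix(rows)
    if len(reduced) <= 1:
        return 0  # the matrix is (effectively) empty
    # Operation 3: try removing each remaining row and keep the best outcome.
    return 1 + min(_history_bound(reduced - {r}) for r in reduced)

def history_bound(sequences):
    """Minimum number of type-3 operations needed to empty the matrix."""
    return _history_bound(frozenset(sequences))

print(history_bound(["000", "100", "010", "011", "111"]))  # prints 1
```

Memoizing on the reduced set of rows avoids re-exploring the same sub-matrix when different removal orders lead to it.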
Graphical interpretation of history bound (HistB) • Each operation in the history computation corresponds to an operation that deconstructs the optimal, but unknown ARG. • We deconstruct the optimal ARG by removing tree parts as long as possible; then remove an exposed recombination node; repeat. • Removing an exposed recombination node in the ARG corresponds to a single type-3 operation. So when deconstructing the optimal ARG, the number of recombination nodes = number of type-3 operations. • Since the optimal ARG is unknown, the history bound is the minimum number of type-3 operations needed to make the matrix empty.
Operations on M correspond to operations on the optimal ARG.
M: a: 00010, b: 10010, c: 00100, d: 10100, e: 01100, f: 00101, g: 00101.
[Figure: the ARG with a recombination node (parents marked s and p) and leaves a: 00010, b: 10010, c: 00100, d: 10100, e: 01100, f: 00101, g: 00101.]
Type-2 operation.
M (sites 12345): a: 00010, b: 10010, c: 00100, d: 10100, e: 01100, f: 00101.
[Figure: the ARG with a recombination node (parents marked s and p) and leaves a: 00010, b: 10010, c: 00100, d: 10100, e: 01100, f: 00101.]