L6: Haplotype phasing
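Genotypes and Haplotypes
• Each individual has two “copies” of each chromosome.
• At each site, each chromosome has one of two alleles (0 or 1).
• Current genotyping technology doesn’t give phase.
Haplotype 1: 0 1 1 1 0 0 1 1 0
Haplotype 2: 1 1 0 1 0 0 1 0 0
Genotype for the individual: 2 1 2 1 0 0 1 2 0
(In the genotype, 2 marks a heterozygous site; 0 or 1 marks a site where both copies carry that allele.)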
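
The encoding on this slide can be sketched in a few lines of Python. The function name `genotype` is hypothetical; the 0/1/2 coding is read off the example above.

```python
def genotype(h1, h2):
    """Combine two haplotypes (lists of 0/1) into an unphased genotype.

    A site where both copies agree keeps that allele (0 or 1);
    a site where they differ is reported as 2 (heterozygous).
    """
    if len(h1) != len(h2):
        raise ValueError("haplotypes must cover the same sites")
    return [a if a == b else 2 for a, b in zip(h1, h2)]


h1 = [0, 1, 1, 1, 0, 0, 1, 1, 0]
h2 = [1, 1, 0, 1, 0, 0, 1, 0, 0]
print(genotype(h1, h2))  # [2, 1, 2, 1, 0, 0, 1, 2, 0]
```

Swapping `h1` and `h2` gives the same genotype, which is exactly the phase information that is lost.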
Haplotype Phasing
• Haplotype phasing is the resolution of a genotype into its two haplotypes.
• Haplotypes increase the power to detect associations between marker loci and phenotypic traits.
• Current approaches to haplotyping:
  • Via technological innovations (expensive)
  • Statistical methods (ML, Phase, PL)
  • Combinatorial approach to the phasing problem
    • Efficient, provable quality of solution
    • Not completely generalizable (as yet)
Clark’s idea
• Using the HWE principle, infer phase using homozygous sites.
• Not described as an algorithm, but as a methodology to infer phase.
[Figure: 0 1 1 1 0 0 1 1 0 1 1 0 2 0 0 2 0 0 2 1 2 0 0 0 0 0 0]
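
One commonly described step of this kind of inference is: once a haplotype is known (for example from an individual with no heterozygous sites), use it to resolve any genotype it is compatible with. The sketch below illustrates only that step; the function name `resolve_with` is hypothetical, and the slide itself does not spell out a procedure.

```python
def resolve_with(known, g):
    """Try to explain genotype g using the known haplotype.

    Returns the complementary haplotype if `known` is compatible
    with g, otherwise None. Genotype coding: 0/1 = both copies
    carry that allele, 2 = heterozygous.
    """
    other = []
    for h, x in zip(known, g):
        if x == 2:
            other.append(1 - h)   # heterozygous: the other copy differs
        elif x == h:
            other.append(h)       # homozygous: both copies agree
        else:
            return None           # known haplotype contradicts g
    return other


known = [0, 1, 1, 1, 0, 0, 1, 1, 0]
g = [2, 1, 2, 1, 0, 0, 1, 2, 0]
print(resolve_with(known, g))  # [1, 1, 0, 1, 0, 0, 1, 0, 0]
```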
Maximum likelihood estimation of phase
• Input: Genotypes 1…m with counts n1, n2, …
• Output: Haplotype frequencies (also individual haplotype assignments)
• Define (unknown) genotype probabilities P1, P2, P3, …
• Likelihood function (based on genotype probabilities)
Genotypes and Haplotypes
• Let c_j be the number of haplotype pairings that give genotype j. Then P_j is the sum of Pr(h_k, h_l) over those c_j pairings.
• Use HWE to compute Pr(h_k, h_l)
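
For reference, one standard way to write the quantities named on the last two slides. The symbols n (total number of individuals) and p_k (population frequency of haplotype h_k) are assumptions introduced here, not taken from the slides; (h_{k_i}, h_{l_i}) denotes the i-th pairing compatible with genotype j.

```latex
\begin{align*}
L(P_1,\dots,P_m) &= \frac{n!}{n_1!\,n_2!\cdots n_m!}\,\prod_{j=1}^{m} P_j^{\,n_j},
  \qquad n = \sum_{j=1}^{m} n_j \\
P_j &= \sum_{i=1}^{c_j} \Pr(h_{k_i}, h_{l_i}) \\
\Pr(h_k, h_l) &=
  \begin{cases}
    p_k^{2}      & \text{if } k = l \\
    2\,p_k\,p_l  & \text{if } k \neq l
  \end{cases}
\end{align*}
```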
The Expectation Step
• Q: Given haplotype frequencies, what are the paired haplotype frequencies?
• A: Initially
• Subsequently, (gth iteration)
The M Step
• δ_it is 0, 1, or 2 (the number of times haplotype t occurs in haplotype pair i)
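
The E and M steps can be sketched as a small EM loop. This is a minimal illustration under stated assumptions, not the method as implemented in any particular tool: genotypes are given one per individual (rather than as counts n_j), the starting frequencies are uniform, and the names `expansions` and `em_phase` are hypothetical. It enumerates every compatible pairing, so it is only practical for a handful of heterozygous sites.

```python
from collections import defaultdict
from itertools import product


def expansions(g):
    """All unordered haplotype pairs consistent with genotype g (2 = heterozygous)."""
    het = [i for i, x in enumerate(g) if x == 2]
    pairs = set()
    for bits in product((0, 1), repeat=len(het)):
        h1, h2 = list(g), list(g)
        for i, b in zip(het, bits):
            h1[i], h2[i] = b, 1 - b
        pairs.add(tuple(sorted((tuple(h1), tuple(h2)))))
    return sorted(pairs)


def em_phase(genotypes, iterations=50):
    """Estimate haplotype frequencies from unphased genotypes by EM under HWE."""
    options = [expansions(g) for g in genotypes]
    haps = sorted({h for opts in options for pair in opts for h in pair})
    p = {h: 1.0 / len(haps) for h in haps}  # uniform start (an assumption)
    for _ in range(iterations):
        counts = defaultdict(float)
        for opts in options:
            # E step: posterior weight of each pairing, using HWE probabilities
            w = [p[a] * p[b] * (1 if a == b else 2) for a, b in opts]
            total = sum(w)
            for (a, b), wi in zip(opts, w):
                counts[a] += wi / total
                counts[b] += wi / total
        # M step: expected haplotype count divided by number of chromosomes
        p = {h: counts[h] / (2 * len(genotypes)) for h in haps}
    return p


freqs = em_phase([(0, 0), (1, 1), (2, 2)])
for hap, f in sorted(freqs.items()):
    print(hap, round(f, 4))
```

In this toy input the two homozygous individuals make 00 and 11 common, so the ambiguous individual is resolved as 00/11 rather than 01/10.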
Bayesian approach to phasing
• Idea: Small variants of common haplotypes should also be considered common even though they have low frequency.
Phase
• As described, each haplotype arises from the prior set only through mutations. Recombination is not considered.
• In subsequent versions, recombination is explicitly considered in the equation.
Phase results
• Phase versus EM versus Clark
• Error rate: Proportion of individuals incorrectly predicted
The Perfect Phylogeny Model
• We assume that the evolution of extant haplotypes can be displayed on a rooted, directed tree, with the all-0 haplotype at the root, where each site changes from 0 to 1 on exactly one edge, and each extant haplotype is created by accumulating the changes on a path from the root to a leaf, where that haplotype is displayed.
• In other words, the extant haplotypes evolved along a perfect phylogeny with all-0 root.
[Figure: perfect phylogeny on sites 1 to 5 with root 00000; node labels 00010, 10100, 10000, 01010, 01011; caption “Extant Haplotypes”]
Haplotyping via Perfect Phylogeny
PPH: Given a set of genotypes, find an explaining set of haplotypes that fits a perfect phylogeny.
[Figure: genotypes a, b, c over sites 1 and 2, an explaining set of haplotypes, and the tree they fit]
The Alternative Explanation
No tree possible for this explanation.
The 4 Gamete Test for Perfect Phylogeny
• Arrange the haplotypes in a matrix, two haplotypes for each individual.
• Then (with no duplicate columns), the haplotypes fit a unique perfect phylogeny if and only if no two columns contain all four pairs 0,0 and 0,1 and 1,0 and 1,1 (Buneman).
[Figure: the four gametes 00, 10, 01, 11]
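
The test itself is short to write down. A minimal sketch, assuming haplotypes are given as equal-length strings or sequences over 0/1 (the function name is hypothetical):

```python
from itertools import combinations


def passes_four_gamete_test(haplotypes):
    """True if no pair of columns shows all four of 00, 01, 10, 11."""
    n_sites = len(haplotypes[0])
    for i, j in combinations(range(n_sites), 2):
        gametes = {(h[i], h[j]) for h in haplotypes}
        if len(gametes) == 4:  # binary data: four distinct pairs means all four
            return False
    return True


# Node labels from the perfect phylogeny slide
tree_haps = ["00010", "10100", "10000", "01010", "01011"]
print(passes_four_gamete_test(tree_haps))                  # True
print(passes_four_gamete_test(["00", "10", "01", "11"]))   # False
```

Checking every pair of columns against every row costs O(n s^2) for n haplotypes and s sites.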
The Tree Explanation Again
[Figure: genotypes a, b, c over sites 1 and 2 with the haplotype expansion that fits a tree]
The Combinatorial Problem
• Input: A ternary matrix (0,1,2) M with N rows
• Output: A binary matrix M’ with 2N rows, created from M by replacing each row with two rows in which every 2 becomes a 0 in one row and a 1 in the other, such that M’ passes the 4 gamete test
• Gusfield (RECOMB 2002) proposed a solution which used a reduction to matroids.
• We present a (slightly inefficient) solution using elementary techniques.
• Independently solved by Eskin, Halperin, and Karp (2002).
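
To make the input/output contract concrete, here is a brute-force sketch that tries every expansion of M and returns the first M’ that passes the 4 gamete test. It is exponential in the number of 2s and is only meant to pin down the problem statement; it is not the algorithm presented in these slides, and all function names are hypothetical.

```python
from itertools import combinations, product


def expand_row(row):
    """Yield each way to split a genotype row into two haplotype rows."""
    het = [i for i, x in enumerate(row) if x == 2]
    if not het:
        yield tuple(row), tuple(row)
        return
    # Fix the first heterozygous site to 0 in the first haplotype so that
    # mirror-image pairs are not generated twice.
    for bits in product((0, 1), repeat=len(het) - 1):
        h1, h2 = list(row), list(row)
        for i, b in zip(het, (0,) + bits):
            h1[i], h2[i] = b, 1 - b
        yield tuple(h1), tuple(h2)


def four_gamete_ok(rows):
    for i, j in combinations(range(len(rows[0])), 2):
        if len({(r[i], r[j]) for r in rows}) == 4:
            return False
    return True


def brute_force_pph(M):
    """Return some M' (list of 2N binary rows) passing the test, or None."""
    for choice in product(*(list(expand_row(r)) for r in M)):
        rows = [h for pair in choice for h in pair]
        if four_gamete_ok(rows):
            return rows
    return None


print(brute_force_pph([[2, 2], [0, 2], [2, 0]]))
print(brute_force_pph([[2, 0], [1, 2], [0, 2]]))  # None
```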
Initial Observations
• Forced Expansions:
• EX 1: If two columns (sites) of M contain the rows
  2 0
  0 2
  then M’ will contain a row with 1 0 and a row with 0 1 in those columns.
• EX 2: Similarly, if two columns of M contain the rows
  2 1
  2 0
  then M’ will contain rows with 1 1 and 0 0 in those columns.
Initial Observations
• If a forced expansion of two columns creates rows 0 1 and 1 0 in those columns, then any row 2 2 in those columns must be set to
  0 1
  1 0
  We say that the two columns are forced out-of-phase.
• If a forced expansion of two columns creates rows 1 1 and 0 0 in those columns, then any row 2 2 in those columns must be set to
  1 1
  0 0
  We say that the two columns are forced in-phase.
Immediate Failure
It can happen that the forced expansion of cells creates a 4x2 submatrix that fails the 4-Gamete Test. In that case, there is no PPH solution for M.
Example (will fail the 4-Gamete Test):
  2 0
  1 2
  0 2
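
The forced-expansion observations on the last three slides can be sketched for a single pair of columns as follows. The function names and the returned labels are hypothetical; the rule is the one stated above: rows with at most one 2 in the pair contribute forced gametes, and the forced gametes decide the phase.

```python
def forced_gametes(M, c1, c2):
    """Gametes that every expansion of M must contain in columns c1, c2."""
    forced = set()
    for row in M:
        a, b = row[c1], row[c2]
        if a == 2 and b == 2:
            continue  # a 2 2 row does not force anything by itself
        xs = (0, 1) if a == 2 else (a,)
        ys = (0, 1) if b == 2 else (b,)
        forced.update((x, y) for x in xs for y in ys)
    return forced


def phase_relation(M, c1, c2):
    """Classify a column pair as 'in-phase', 'out-of-phase', 'unforced' or 'fail'."""
    forced = forced_gametes(M, c1, c2)
    out_forced = {(0, 1), (1, 0)} <= forced
    in_forced = {(0, 0), (1, 1)} <= forced
    if out_forced and in_forced:
        return "fail"  # all four gametes are forced: no PPH solution
    if out_forced:
        return "out-of-phase"
    if in_forced:
        return "in-phase"
    return "unforced"


print(phase_relation([[2, 0], [0, 2]], 0, 1))          # out-of-phase (EX 1)
print(phase_relation([[2, 0], [1, 2], [0, 2]], 0, 1))  # fail (Immediate Failure example)
```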
An O(ns^2)-time Algorithm (n genotypes, s sites)
• Find all the forced phase relationships by considering columns in pairs.
• Find all the inferred, invariant, phase relationships.
• Find a set of column pairs whose phase relationship can be arbitrarily set, so that all the remaining phase relationships can be inferred.
• Result: An implicit representation of all solutions to the PPH problem.
A Running Example
[Figure: genotype matrix M with columns 1 to 7 and rows A to F]
Companion Graph G_c
• Each node represents a column in M, and each edge indicates that the pair of columns has a row with 2’s in both columns.
• The algorithm builds this graph, and then checks whether any pair of nodes is forced in or out of phase.
[Figure: matrix M (columns 1 to 7, rows A to F) and its companion graph on nodes 1 to 7]
Phasing Edges in G_c
• Each Red edge indicates that the columns are forced in-phase.
• Each Blue edge indicates that the columns are forced out-of-phase.
• Let G_f be the sub-graph of G_c defined by the red and blue edges.
[Figure: G_c on nodes 1 to 7 with red and blue edges]
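
Building G_c and coloring its edges can be sketched as below. This is an illustration under assumptions: columns are 0-indexed, edges are stored as a dict keyed by column pairs, and edges that are not forced either way are labeled "black" (the color used for them on the following slides). The function name is hypothetical.

```python
from itertools import combinations

IN_PHASE = {(0, 0), (1, 1)}
OUT_OF_PHASE = {(0, 1), (1, 0)}


def companion_graph(M):
    """Return {(c1, c2): color} for every column pair that shares a 2 2 row."""
    edges = {}
    for c1, c2 in combinations(range(len(M[0])), 2):
        pairs = {(row[c1], row[c2]) for row in M}
        if (2, 2) not in pairs:
            continue  # no row with 2's in both columns: no edge in G_c
        forced = set()
        for a, b in pairs - {(2, 2)}:
            xs = (0, 1) if a == 2 else (a,)
            ys = (0, 1) if b == 2 else (b,)
            forced.update((x, y) for x in xs for y in ys)
        if IN_PHASE <= forced and OUT_OF_PHASE <= forced:
            raise ValueError(f"columns {c1} and {c2} fail the 4-gamete test")
        if IN_PHASE <= forced:
            edges[(c1, c2)] = "red"      # forced in-phase
        elif OUT_OF_PHASE <= forced:
            edges[(c1, c2)] = "blue"     # forced out-of-phase
        else:
            edges[(c1, c2)] = "black"    # not forced by this pair alone
    return edges


M = [[2, 2, 2],
     [0, 2, 0],
     [2, 0, 0]]
print(companion_graph(M))
# {(0, 1): 'blue', (0, 2): 'black', (1, 2): 'black'}
```

The red and blue entries of the returned dict are the edges of G_f. Scanning all column pairs over all rows is where the O(ns^2) cost comes from.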
Connected Components in G_f
• Graph G_f has three connected components.
[Figure: G_f on nodes 1 to 7]
Phase-parity Lemma
• Lemma 1: There is a solution to the PPH problem for M if and only if there is a coloring of the black edges of G_c (the edges that are neither forced in-phase nor forced out-of-phase) with the following property: for any triangle in G_c containing at least one black edge, the coloring makes either 0 or 2 of the edges blue (i.e., out of phase).
That’s nice, but how do we assign the colors?
A Weak Triangulation Rule
• Theorem 1: If there are any black edges whose ends are in the same connected component of G_f, at least one such edge is in a triangle where the other edges are not black.
• In every PPH solution, it must be colored so that the triangle has an even number of Blue (out of phase) edges.
• This is an “inferred” coloring.
[Figure: graph G_f on nodes 1 to 7]
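
The inferred coloring can be sketched as a simple fixed-point loop: whenever a triangle has exactly one black edge, color it so that the triangle ends up with 0 or 2 blue edges. This naive version rescans all triangles and is not the linear-time ordering mentioned later in the slides; the function name and the edge representation (a dict keyed by sorted node pairs) are assumptions.

```python
from itertools import combinations


def propagate(edges):
    """Color black edges that close a triangle whose other two edges are colored."""
    colors = dict(edges)
    nodes = sorted({v for e in colors for v in e})
    changed = True
    while changed:
        changed = False
        for a, b, c in combinations(nodes, 3):
            tri = [(a, b), (a, c), (b, c)]
            if not all(e in colors for e in tri):
                continue  # not a triangle in the graph
            black = [e for e in tri if colors[e] == "black"]
            if len(black) == 1:
                blues = sum(colors[e] == "blue" for e in tri)
                # one blue already: the third must be blue; otherwise red
                colors[black[0]] = "blue" if blues == 1 else "red"
                changed = True
    return colors


print(propagate({(1, 2): "blue", (2, 3): "red", (1, 3): "black"}))
# {(1, 2): 'blue', (2, 3): 'red', (1, 3): 'blue'}
```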
Corollary
• Inside any connected component of G_f, ALL the phase relationships on edges (columns of M) are uniquely determined, either as forced relationships based on pair-wise column comparisons, or by triangle-based inferred colorings.
• Hence, the phase relationships of all the columns in a connected component of G_f are INVARIANT over all the solutions to the PPH problem.
• The black edges inside a connected component of G_f can be ordered so that the inferred colorings can be done in linear time (a modification of DFS).
Phase Parity Lemma: Proof
• If X ≠ 2 and Y ≠ 2, then the two columns are forced.
Phase Parity Lemma: Proof
• Lemma: If a triangle contains a black edge, then a PPH solution exists only if there are 0 or 2 blue edges in the final coloring.
• Proof:
  • There is no black edge unless x = 2, y = 2, or z = 2 (previous lemma).
  • If there is a row with all 2s, then there must be an even number of blue edges.
[Figure: triangle on columns A, B, C]
Proof of Weak Triangulation Theorem
• Arbitrary chordless cycles are possible in the graph, with forced edges.
• See example. The pattern 0,2; 2,0; and 2,2 implies a blue (out of phase) edge.
• A single unforced edge changes the picture.
[Figure: cycle on columns A, B, C, D, E]
Proof of Weak Triangulation Theorem
• Let (J,J’) be a black edge connecting the ends of a ‘long’ path J, K, …, K’, J’ of forced edges.
• In the matrix, x ≠ 2, otherwise there is a chord. Likewise y ≠ 2.
• By the previous lemma, (J,J’) is forced.
[Figure: path J, K, …, K’, J’ and the corresponding matrix columns]
Finishing the Solution
Problem: A connected component C of G_c may contain several connected components of G_f, so any edge crossing two components of G_f will still be black.
• How should we color the remaining black edges in a connected component C of G_c?
Answer
• For a connected component C of G_c with k connected components of G_f, select any subset S of k-1 black edges in C, so that S together with the red and blue edges spans all the nodes of C.
• Arbitrarily color each edge in S either red or blue.
• Infer the color of any remaining black edges by successive use of the triangle rule.
Theorem 2
• Any selected S works (allows the triangle rule to work), and any coloring of the edges in S determines the colors of any remaining black edges.
• Different colorings of S determine different colorings of the remaining black edges.
• Each different coloring of S determines a different solution to the PPH problem.
• All PPH solutions can be obtained in this way, i.e. using just one selected S set, but coloring it in all 2^(k-1) ways.
Corollary
• In a single connected component C of G_c with k connected components in G_f, there are exactly 2^(k-1) different solutions to the PPH problem in the columns of M represented by C.
• If G_c has r connected components and G_f has t connected components, then there are exactly 2^(t-r) solutions to the PPH problem.
• There is one unique PPH solution if and only if each connected component in G_c is a connected component in G_f.
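
Selecting S and counting solutions can be sketched with a union-find structure: first merge nodes along red and blue edges (the components of G_f), then keep each black edge that joins two components not yet connected. The number of kept edges is t - r, so the count is 2 to that power. The function name and edge representation are assumptions, and the sketch presumes a PPH solution exists.

```python
def select_free_edges(nodes, colored_edges):
    """Pick a set S of black edges that, with the forced edges, spans each component of G_c."""
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    # Merge along forced (red/blue) edges: these give the components of G_f.
    for (a, b), color in colored_edges.items():
        if color != "black":
            parent[find(a)] = find(b)

    # Greedily keep black edges that join two components not yet connected.
    free = []
    for (a, b), color in sorted(colored_edges.items()):
        if color == "black" and find(a) != find(b):
            parent[find(a)] = find(b)
            free.append((a, b))
    return free


edges = {(0, 1): "blue", (0, 2): "black", (1, 2): "black"}
S = select_free_edges([0, 1, 2], edges)
print(S, 2 ** len(S))  # [(0, 2)] 2
```

Each edge in S can then be colored red or blue arbitrarily, and the remaining black edges follow from the triangle rule.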
Conclusion
• In the special case of blocks with no recombination and no recurrent mutations, the haplotypes satisfy a perfect phylogeny.
• Given a set of genotypes, there is an efficient (O(ns^2)) algorithm for representing all possible haplotype solutions that satisfy a perfect phylogeny.
• Efficiency:
  • Input is size O(ns).
  • All operations except building the graph are O(ns+s^2).
  • Valid PPH only if s = O(n). Is O(ns) possible?
  • Current best solution is O(ns+n^(1-e) s^2) using a matrix multiplication idea.
• Future work involves combining this with some heuristics to deal with general cases (low recombination/high recombination).
Simulated Data
• Coalescent model (Hudson)
• No Recombination
  • 400 chromosomes, 100 sites
  • Infinite sites
• Recombination
  • 100 chromosomes
  • Infinite sites
  • R=4.0 2501
  • Pr(Recombination) = 4*10^(-9) between adjacent bases
Error Measurement
• Discrepancy = 1 (Num Haplotypes incorrectly predicted)
• Switch Error = 2
[Figure: example with the rows 02222 22222 01010 00101 01010 10101 00101 01010 00000 11111]
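
The discrepancy count can be sketched as below. Two things here are assumptions rather than facts from the slide: the figure is read as two individuals with true pairs 01010/00101 and 01010/10101 and predicted pairs 00101/01010 and 00000/11111, and discrepancy is taken to count individuals whose predicted pair differs from the true pair (ignoring the order within a pair). Under that reading the value matches the slide. Switch error is not sketched because the slide does not define it.

```python
def discrepancy(true_pairs, predicted_pairs):
    """Number of individuals whose predicted haplotype pair is not the true pair."""
    return sum(
        frozenset(t) != frozenset(p)   # order within a pair does not matter
        for t, p in zip(true_pairs, predicted_pairs)
    )


true_pairs = [("01010", "00101"), ("01010", "10101")]
predicted_pairs = [("00101", "01010"), ("00000", "11111")]
print(discrepancy(true_pairs, predicted_pairs))  # 1
```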