L6: Haplotype phasing

Genotypes and Haplotypes
• Each individual has two “copies” of each chromosome.
• At each site, each chromosome has one of two alleles (0 or 1).
• Current genotyping technology does not give phase: at each site the genotype is 0 or 1 when both copies agree, and 2 when they differ.

0 1 1 1 0 0 1 1 0   (haplotype 1)
1 1 0 1 0 0 1 0 0   (haplotype 2)
2 1 2 1 0 0 1 2 0   (genotype for the individual)
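The 0/1/2 encoding in the example can be checked with a few lines of Python. This is a minimal sketch; the function name `genotype_from_haplotypes` is an illustrative choice, not something from the slides.

```python
def genotype_from_haplotypes(h1, h2):
    """Combine two 0/1 haplotypes into a 0/1/2 genotype.

    A site where both copies agree keeps that allele (0 or 1);
    a site where they differ is heterozygous and is written as 2.
    """
    return [a if a == b else 2 for a, b in zip(h1, h2)]


# The two haplotypes and the genotype shown on the slide.
h1 = [0, 1, 1, 1, 0, 0, 1, 1, 0]
h2 = [1, 1, 0, 1, 0, 0, 1, 0, 0]
genotype = genotype_from_haplotypes(h1, h2)
```

Phasing is the inverse problem: recovering `h1` and `h2` from `genotype`, which is ambiguous as soon as two or more sites are 2.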

Haplotype Phasing
• Haplotype phasing is the resolution of a genotype into its two haplotypes.
• Haplotypes increase the power of an association between marker loci and phenotypic traits.
• Current approaches to haplotyping:
  • Via technological innovations (expensive)
  • Statistical methods (ML, Phase, PL)
  • Combinatorial approach to the phasing problem
    • Efficient, with provable quality of solution
    • Not completely generalizable (as yet)

Clark’s idea
• Using the HWE principle, infer phase using homozygous sites.
• Not described as an algorithm, but as a methodology to infer phase.

0 1 1 1 0 0 1 1 0
1 1 0 2 0 0 2 0 0
2 1 2 0 0 0 0 0 0
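One common reading of this kind of inference is: once a haplotype is known (for example from an individual with no heterozygous sites), any ambiguous genotype compatible with it can be resolved by taking the complement at the heterozygous sites. The sketch below illustrates that single step under that assumption; `resolve_with_known` is a hypothetical helper name, and the slides do not specify the procedure at this level of detail.

```python
def resolve_with_known(genotype, known):
    """Return the second haplotype if `known` is compatible with `genotype`.

    genotype: sequence over {0, 1, 2} (2 = heterozygous site)
    known:    sequence over {0, 1}, an already resolved haplotype
    Returns None when `known` cannot be one of the two haplotypes.
    """
    other = []
    for g, h in zip(genotype, known):
        if g == 2:
            other.append(1 - h)   # heterozygous: the other copy has the opposite allele
        elif g == h:
            other.append(g)       # homozygous: both copies carry the same allele
        else:
            return None           # known haplotype contradicts a homozygous site
    return tuple(other)


genotype = (2, 1, 2, 1, 0, 0, 1, 2, 0)
known = (0, 1, 1, 1, 0, 0, 1, 1, 0)
partner = resolve_with_known(genotype, known)
```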

Maximum likelihood estimation of phase
• Input: genotypes 1…m with counts n1, n2, …
• Output: haplotype frequencies (and also individual haplotype assignments)
• Define the (unknown) genotype probabilities P1, P2, P3, …
• Likelihood function (based on genotype probabilities)

Genotypes and Haplotypes
• Let cj be the number of haplotype pairings that give genotype j. Then Pj is the sum of Pr(hk,hl) over those cj pairings.
• Use HWE to compute Pr(hk,hl)

Likelihood using haplotype frequencies
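The formulas for these slides are not present in the transcript. A standard multinomial formulation that is consistent with the bullets (counts n_j, genotype probabilities P_j, c_j pairings per genotype, HWE for pair probabilities) is sketched below. The symbols n, p_k and H (the number of distinct haplotypes) are assumptions introduced here for illustration.

```latex
% n = n_1 + ... + n_m individuals; P_j = probability of genotype j
L(P_1,\dots,P_m) \;=\; \frac{n!}{n_1!\,n_2!\cdots n_m!}\;\prod_{j=1}^{m} P_j^{\,n_j}

% c_j = number of haplotype pairings that give genotype j
P_j \;=\; \sum_{i=1}^{c_j} \Pr(h_{ik}, h_{il})

% Hardy-Weinberg equilibrium, with p_k the frequency of haplotype h_k
\Pr(h_k, h_l) \;=\;
\begin{cases}
  p_k^{2}      & \text{if } k = l \\
  2\,p_k\,p_l  & \text{if } k \neq l
\end{cases}

% the same likelihood written in terms of haplotype frequencies
L(p_1,\dots,p_H) \;=\; \frac{n!}{n_1!\cdots n_m!}\;
  \prod_{j=1}^{m}\Bigl(\,\sum_{i=1}^{c_j} \Pr(h_{ik}, h_{il})\Bigr)^{n_j}
```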

The Expectation Step
• Q: Given haplotype frequencies, what are the paired haplotype frequencies?
• A: Initially
• Subsequently (g-th iteration)

The M Step
• δ_it is 0, 1, or 2 (the number of times haplotype t occurs in haplotype pair i)
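The E and M steps can be sketched as a short EM loop. This is a minimal illustration under assumptions not stated in the slides: haplotype frequencies start uniform over all haplotypes compatible with some genotype, pair probabilities follow HWE, and the loop runs for a fixed number of iterations. The names `compatible_pairs` and `em_haplotype_frequencies` are hypothetical, and enumerating all pairings is exponential in the number of heterozygous sites.

```python
from collections import defaultdict
from itertools import product


def compatible_pairs(genotype):
    """All unordered haplotype pairs that explain a 0/1/2 genotype."""
    het = [i for i, g in enumerate(genotype) if g == 2]
    pairs = set()
    for bits in product((0, 1), repeat=len(het)):
        h1, h2 = list(genotype), list(genotype)
        for site, b in zip(het, bits):
            h1[site], h2[site] = b, 1 - b
        pairs.add(tuple(sorted((tuple(h1), tuple(h2)))))
    return sorted(pairs)


def em_haplotype_frequencies(genotypes, iterations=50):
    """Estimate haplotype frequencies from unphased genotypes by EM."""
    pairs_per_genotype = [compatible_pairs(g) for g in genotypes]
    haplotypes = sorted({h for pairs in pairs_per_genotype
                         for pair in pairs for h in pair})
    freq = {h: 1.0 / len(haplotypes) for h in haplotypes}
    for _ in range(iterations):
        expected = defaultdict(float)
        for pairs in pairs_per_genotype:
            # E step: posterior weight of each pairing under HWE
            # (p_k^2 for identical haplotypes, 2 p_k p_l otherwise).
            weights = [freq[a] * freq[b] * (1 if a == b else 2)
                       for a, b in pairs]
            total = sum(weights)
            for (a, b), w in zip(pairs, weights):
                # A haplotype is counted 0, 1, or 2 times per pairing.
                expected[a] += w / total
                expected[b] += w / total
        # M step: expected haplotype counts divided by 2n chromosomes.
        freq = {h: expected[h] / (2 * len(genotypes)) for h in haplotypes}
    return freq


# Two 00/00 individuals, two 11/11 individuals, one ambiguous 2 2 individual.
sample = [(0, 0), (0, 0), (1, 1), (1, 1), (2, 2)]
freq = em_haplotype_frequencies(sample)
```

In this small sample the ambiguous individual is pulled toward the pairing 00/11, because those haplotypes are already common among the unambiguous individuals.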

Bayesian approach to phasing
• Idea: Small variants of common haplotypes should also be considered common, even though they have low frequency.

Phase

Phase
• As described, each haplotype arises from the prior set only through mutations. Recombination is not considered.
• In subsequent versions, recombination is explicitly considered in the equation.

Phase results
• Phase versus EM versus Clark
• Error rate: proportion of individuals incorrectly predicted

Combinatorial Approach to Haplotyping

The Perfect Phylogeny Model
• We assume that the evolution of extant haplotypes can be displayed on a rooted, directed tree, with the all-0 haplotype at the root, where each site changes from 0 to 1 on exactly one edge, and each extant haplotype is created by accumulating the changes on a path from the root to a leaf, where that haplotype is displayed.
• In other words, the extant haplotypes evolved along a perfect phylogeny with all-0 root.
[Figure: perfect phylogeny over sites 1 to 5, rooted at 00000, with one edge per site (labelled 1, 4, 3, 2, 5) and node labels 00010, 10100, 10000, 01010, 01011. Caption: Extant Haplotypes]

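Haplotyping via Perfect Phylogeny
• PPH: Given a set of genotypes, find an explaining set of haplotypes that fits a perfect phylogeny.
[Figure: tree rooted at 00 with edges labelled 1 and 2; leaves labelled a, a, b, b, c, c carry the haplotypes 00, 01, 01, 10, 10, 10]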

The Alternative Explanation
• No tree is possible for this explanation.

The 4 Gamete Test for Perfect Phylogeny
• Arrange the haplotypes in a matrix, two haplotypes for each individual.
• Then (with no duplicate columns), the haplotypes fit a unique perfect phylogeny if and only if no two columns contain all four pairs (Buneman): 0,0 and 0,1 and 1,0 and 1,1.
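The test as stated can be sketched directly: look at every pair of columns and check whether all four pairs occur. The function name `passes_four_gamete_test` is illustrative. The first input below reuses the node labels from the perfect phylogeny slide; the extra haplotype 11000 is a made-up row added only to show a failure.

```python
from itertools import combinations


def passes_four_gamete_test(haplotypes):
    """Return True if no two columns contain all of 00, 01, 10, 11.

    haplotypes: list of equal-length strings (or sequences) over '0'/'1'.
    """
    n_sites = len(haplotypes[0])
    for i, j in combinations(range(n_sites), 2):
        gametes = {(h[i], h[j]) for h in haplotypes}
        if len(gametes) == 4:
            return False
    return True


tree_haplotypes = ["10100", "10000", "01010", "01011", "00010"]
ok = passes_four_gamete_test(tree_haplotypes)
# Hypothetical extra row: it adds the pair 1,1 in columns 1 and 2.
broken = passes_four_gamete_test(tree_haplotypes + ["11000"])
```

This pairwise check costs O(n s^2) for n haplotypes and s sites.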

The Alternative Explanation
• No tree is possible for this explanation.

The Tree Explanation Again
[Figure: the tree on sites 1 and 2 with leaves labelled a, a, b, b, c, c, shown again]

The Combinatorial Problem
• Input: A ternary matrix (0,1,2) M with N rows.
• Output: A binary matrix M’ with 2N rows, created from M by replacing each row with two rows in which every 2 becomes a 0 in one row and a 1 in the other (0 and 1 entries are copied to both rows), such that M’ passes the 4 gamete test.
• Gusfield (RECOMB 2002) proposed a solution which used a reduction to matroids.
• We present a (slightly inefficient) solution using elementary techniques.
• Independently by (Eskin, Halperin, Karp ’02).

Initial Observations
• Forced expansions:
• EX 1: If two columns (sites) of M contain the following rows
    2 0
    0 2
  then M’ will contain a row with 1 0 and a row with 0 1 in those columns.
• EX 2: Similarly, if two columns of M contain the rows
    2 1
    2 0
  then M’ will contain rows with 1 1 and 0 0 in those columns.

Initial Observations
• If a forced expansion of two columns creates rows 0 1 and 1 0 in those columns, then any 2 2 in those columns must be set to
    0 1
    1 0
  We say that the two columns are forced out-of-phase.
• If a forced expansion of two columns creates rows 1 1 and 0 0 in those columns, then any 2 2 in those columns must be set to
    1 1
    0 0
  We say that the two columns are forced in-phase.

Immediate Failure
• It can happen that the forced expansion of cells creates a 4x2 submatrix that fails the 4-Gamete Test. In that case, there is no PPH solution for M.
• Example (will fail the 4-Gamete Test):
    2 0
    1 2
    0 2
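The forced-expansion rules for a single pair of columns can be sketched as follows. The sketch collects the pairs that rows other than 2 2 are forced to produce, then applies the in-phase / out-of-phase / failure rules above. The function names and the string labels returned are illustrative choices, not part of the slides.

```python
def forced_gametes(col_a, col_b):
    """Pairs that must appear in M' for these two columns.

    Rows with 2 in both columns are skipped: their phase is what
    the rule is trying to decide.
    """
    gametes = set()
    for x, y in zip(col_a, col_b):
        if x == 2 and y == 2:
            continue
        xs = (0, 1) if x == 2 else (x,)   # a single 2 expands to both alleles
        ys = (0, 1) if y == 2 else (y,)
        gametes.update((a, b) for a in xs for b in ys)
    return gametes


def classify_pair(col_a, col_b):
    """Classify a column pair as forced in-phase, out-of-phase, failing, or unforced."""
    g = forced_gametes(col_a, col_b)
    out_forced = {(0, 1), (1, 0)} <= g
    in_forced = {(0, 0), (1, 1)} <= g
    if out_forced and in_forced:
        return "fail"            # all four pairs already present
    if out_forced:
        return "out-of-phase"
    if in_forced:
        return "in-phase"
    return "unforced"


ex1 = classify_pair([2, 0], [0, 2])          # rows 2 0 and 0 2
failure = classify_pair([2, 1, 0], [0, 2, 2])  # rows 2 0, 1 2, 0 2
```

Under this rule the two rows 2 1 and 2 0 on their own expand to 0 1, 1 1, 0 0 and 1 0, so the sketch reports that pair as a failure rather than as in-phase.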

An O(ns^2)-time Algorithm (n genotypes, s sites)
• Find all the forced phase relationships by considering columns in pairs.
• Find all the inferred, invariant phase relationships.
• Find a set of column pairs whose phase relationship can be arbitrarily set, so that all the remaining phase relationships can be inferred.
• Result: An implicit representation of all solutions to the PPH problem.

A Running Example
[Figure: matrix M with rows A to F and columns 1 to 7]

Companion Graph G_c
• Each node represents a column in M, and each edge indicates that the pair of columns has a row with 2’s in both columns.
• The algorithm builds this graph, and then checks whether any pair of nodes is forced in or out of phase.
[Figure: graph G_c on nodes 1 to 7, next to the matrix with rows A to F and columns 1 to 7]
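Building the companion graph from the definition is straightforward. The sketch below uses 0-based column indices and a small made-up matrix, since the entries of the running example are not recoverable from the transcript; `companion_graph` is an illustrative name.

```python
from itertools import combinations


def companion_graph(matrix):
    """Edges (i, j) with i < j for column pairs that share a row with 2 in both."""
    n_cols = len(matrix[0])
    edges = set()
    for row in matrix:
        het = [c for c in range(n_cols) if row[c] == 2]
        edges.update(combinations(het, 2))  # every pair of 2s in this row
    return edges


# Hypothetical 3-column genotype matrix, for illustration only.
example = [
    [2, 2, 0],
    [0, 2, 2],
    [1, 0, 2],
]
edges = companion_graph(example)
```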

Phasing Edges in G_c
• Each red edge indicates that the columns are forced in-phase.
• Each blue edge indicates that the columns are forced out-of-phase.
• Let G_f be the sub-graph of G_c defined by the red and blue edges.
[Figure: G_c on nodes 1 to 7 with red and blue edges]

Connected Components in G_f
• Graph G_f has three connected components.
[Figure: graph G_f on nodes 1 to 7]

Phase-parity Lemma
• Lemma 1: There is a solution to the PPH problem for M if and only if there is a coloring of the black edges of G_c (the edges that are not forced red or blue) with the following property: for any triangle in G_c containing at least one black edge, the coloring makes either 0 or 2 of the edges blue (i.e., out of phase).
• That’s nice, but how do we assign the colors?

A Weak Triangulation Rule
• Theorem 1: If there are any black edges whose ends are in the same connected component of G_f, at least one such edge is in a triangle where the other two edges are not black.
• In every PPH solution, that edge must be colored so that the triangle has an even number of blue (out-of-phase) edges.
• This is an “inferred” coloring.
[Figure: graph G_f]
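The triangle rule amounts to a parity computation: with 0 or 2 blue edges allowed, the third edge is blue exactly when one of the other two is blue. The sketch below applies the rule repeatedly until no more black edges can be resolved. It is a naive fixed-point loop for illustration, not the linear-time DFS ordering; the data layout (a dict from `frozenset` node pairs to color strings) and both function names are assumptions made here.

```python
def infer_third_edge(color_a, color_b):
    """Color forced on the third edge of a triangle by the other two."""
    blue_count = [color_a, color_b].count("blue")
    return "blue" if blue_count == 1 else "red"


def propagate(colors, nodes):
    """Resolve black edges that close a triangle with two colored edges.

    colors: dict mapping frozenset({u, v}) to "red", "blue", or "black";
            modified in place and returned.
    """
    changed = True
    while changed:
        changed = False
        for edge, color in list(colors.items()):
            if color != "black":
                continue
            u, v = tuple(edge)
            for w in nodes:
                e1, e2 = frozenset((u, w)), frozenset((v, w))
                c1 = colors.get(e1, "black")
                c2 = colors.get(e2, "black")
                if c1 != "black" and c2 != "black":
                    colors[edge] = infer_third_edge(c1, c2)
                    changed = True
                    break
    return colors


# Hypothetical triangle: 1-2 forced in-phase, 2-3 forced out-of-phase.
colors = {
    frozenset((1, 2)): "red",
    frozenset((2, 3)): "blue",
    frozenset((1, 3)): "black",
}
propagate(colors, [1, 2, 3])
```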

[Figure: graph G_f]

Corollary
• Inside any connected component of G_f, ALL the phase relationships on edges (pairs of columns of M) are uniquely determined, either as forced relationships based on pair-wise column comparisons, or by triangle-based inferred colorings.
• Hence, the phase relationships of all the columns in a connected component of G_f are INVARIANT over all the solutions to the PPH problem.
• The black edges inside a connected component of G_f can be ordered so that the inferred colorings can be done in linear time, using a modification of DFS.

Phase Parity Lemma: Proof
• If X ≠ 2 and Y ≠ 2, then the two columns are forced.

Phase Parity Lemma: Proof
• Lemma: If a triangle contains a black edge, then a PPH solution exists only if there are 0 or 2 blue edges in the final coloring.
• Proof:
  • There is no black edge unless x = 2, or y = 2, or z = 2 (previous lemma).
  • If there is a row with all 2s, then there must be an even number of blue edges.
[Figure: triangle on columns A, B, C]

Proof of Weak Triangulation Theorem
• Arbitrary chordless cycles are possible in the graph, with forced edges.
• See example. The pattern 0,2; 2,0; and 2,2 implies a blue (out-of-phase) edge.
• A single unforced edge changes the picture.
[Figure: example cycle on columns A, B, C, D, E, with the corresponding matrix]

Proof of Weak Triangulation Theorem
• Let (J,J’) be a black edge connecting a ‘long’ path J, K, …, K’, J’ of forced edges.
• In the matrix, x ≠ 2, otherwise there is a chord. Likewise y ≠ 2.
• By the previous lemma, (J,J’) is forced.
[Figure: cycle J, K, …, K’, J’ and the corresponding matrix columns]

Finishing the Solution
• Problem: A connected component C of G_c may contain several connected components of G_f, so any edge crossing two components of G_f will still be black.
• How should these edges be colored?

• How should we color the remaining black edges in a connected component C of G_c?
[Figure: graph on nodes 1 to 7]

Answer
• For a connected component C of G_c with k connected components of G_f, select any subset S of k-1 black edges in C, so that S together with the red and blue edges spans all the nodes of C.
• Arbitrarily color each edge in S either red or blue.
• Infer the color of any remaining black edges by successive use of the triangle rule.


Theorem 2
• Any selected S works (allows the triangle rule to work), and any coloring of the edges in S determines the colors of any remaining black edges.
• Different colorings of S determine different colorings of the remaining black edges.
• Each different coloring of S determines a different solution to the PPH problem.
• All PPH solutions can be obtained in this way, i.e. using just one selected S set, but coloring it in all 2^(k-1) ways.

Corollary
• In a single connected component C of G_c with k connected components in G_f, there are exactly 2^(k-1) different solutions to the PPH problem in the columns of M represented by C.
• If G_c has r connected components and G_f has t connected components, then there are exactly 2^(t-r) solutions to the PPH problem.
• There is one unique PPH solution if and only if each connected component in G_c is a connected component in G_f.
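The count 2^(t-r) only needs the number of connected components of the two graphs. The sketch below computes both with a small union-find and assumes that at least one PPH solution exists. The edge lists are made up for illustration (three components in G_f, one in G_c), since the edges of the running example are not recoverable from the transcript; the function names are also illustrative.

```python
def count_components(nodes, edges):
    """Number of connected components, via union-find with path halving."""
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for u, v in edges:
        parent[find(u)] = find(v)
    return len({find(v) for v in nodes})


def count_pph_solutions(nodes, gc_edges, gf_edges):
    """2^(t - r): t components in G_f, r components in G_c.

    Assumes the instance has at least one PPH solution.
    """
    r = count_components(nodes, gc_edges)
    t = count_components(nodes, gf_edges)
    return 2 ** (t - r)


nodes = range(1, 8)
# Hypothetical forced (red/blue) edges: components {1,2,3}, {4,5}, {6,7}.
gf_edges = [(1, 2), (2, 3), (4, 5), (6, 7)]
# Hypothetical black edges joining those components inside G_c.
gc_edges = gf_edges + [(3, 4), (5, 6), (3, 6)]
solutions = count_pph_solutions(nodes, gc_edges, gf_edges)
```

Columns that are isolated in G_c are also isolated in G_f, so they add one to both r and t and do not change the count.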

Conclusion
• In the special case of blocks with no recombination and no recurrent mutations, the haplotypes satisfy a perfect phylogeny.
• Given a set of genotypes, there is an efficient (O(ns^2)) algorithm for representing all possible haplotype solutions that satisfy a perfect phylogeny.
• Efficiency:
  • Input is of size O(ns).
  • All operations except building the graph are O(ns+s^2).
  • Valid PPH only if s = O(n). Is O(ns) possible?
  • Current best solution is O(ns+n^(1-e) s^2), using a matrix multiplication idea.
• Future work involves combining this with some heuristics to deal with general cases (low recombination/high recombination).

Simulated Data
• Coalescent model (Hudson)
• No Recombination
  • 400 chromosomes, 100 sites
  • Infinite sites
• Recombination
  • 100 chromosomes
  • Infinite sites
  • R=4.0 2501
  • Pr(Recombination) = 4*10^(-9) between adjacent bases

Error Measurement
• Discrepancy = 1 (number of haplotypes incorrectly predicted)
• Switch Error = 2

02222   22222
01010   00101
01010   10101
00101   01010
00000   11111

No Recombination
