A Linear-Time Algorithm for the Perfect Phylogeny Haplotyping (PPH) Problem

Zhihong Ding, Vladimir Filkov, Dan Gusfield
Department of Computer Science, University of California, Davis
RECOMB 2005

Haplotypes to Genotypes
• Each individual has two “copies” of each chromosome.
• At each site, each chromosome has one of two states, denoted 0 and 1.
• From haplotypes to genotypes: for each site of an individual, if both haplotypes have state 0, the genotype has state 0; if both have state 1, the genotype has state 1. If the two haplotypes have states 0 and 1, or 1 and 0, the genotype has state 2.

Haplotypes to Genotypes
Sites:        1 2 3 4 5 6 7 8 9
Haplotype 1:  0 1 1 1 0 0 1 1 0
Haplotype 2:  1 1 0 1 0 0 1 0 0
(two haplotypes per individual; merge the haplotypes)
Genotype:     2 1 2 1 0 0 1 2 0
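The merging rule can be sketched in a few lines of Python. This is only an illustration of the rule stated on the slides, not the authors' code; the function name `merge_haplotypes` is a hypothetical choice, and the data are the two haplotypes from the slide above.

```python
def merge_haplotypes(h1, h2):
    """Merge two 0/1 haplotypes into one 0/1/2 genotype, site by site."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must cover the same sites")
    # Equal states carry over unchanged; differing states (0/1 or 1/0) become 2.
    return [a if a == b else 2 for a, b in zip(h1, h2)]


# The two haplotypes from the slide (sites 1 to 9).
h1 = [0, 1, 1, 1, 0, 0, 1, 1, 0]
h2 = [1, 1, 0, 1, 0, 0, 1, 0, 0]
print(merge_haplotypes(h1, h2))
```

For the slide's haplotypes this reproduces the genotype 2 1 2 1 0 0 1 2 0.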

Genotypes to Haplotypes
For each site, if the genotype has state 0 or 1, then the two haplotypes must have states 0, 0 or 1, 1. If the genotype has state 2, the two haplotypes can either have states 0, 1 or 1, 0.
Genotype for the individual:    2 1 2 1 0 0 1 2 0
Two haplotypes per individual:  0 1 1 1 0 0 1 1 0
                                1 1 0 1 0 0 1 0 0

Haplotype Inference Problem
• For disease association studies, haplotype data is more valuable than genotype data, but it is harder and more expensive to collect.
• Haplotype Inference Problem: given a set of n genotypes, determine the original set of n haplotype pairs that generated the n genotypes.
• The NIH leads the HapMap project to find common haplotypes in the human population.

Haplotype Inference Problem
• If the genotype has state 2 at k sites (k ≥ 1), there are 2^(k-1) possible explaining haplotype pairs.
• How do we determine which haplotype pair is the original one that generated the genotype?
• We need a model of haplotype evolution to help solve the haplotype inference problem.
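To make the count of explaining pairs concrete, here is a brute-force sketch that lists every unordered haplotype pair explaining a single genotype. It is an illustration only (the name `explaining_pairs` is hypothetical) and is not part of the linear-time algorithm, which never enumerates pairs explicitly.

```python
from itertools import product


def explaining_pairs(genotype):
    """List all unordered haplotype pairs that merge to the given 0/1/2 genotype."""
    het = [i for i, g in enumerate(genotype) if g == 2]
    if not het:
        # No ambiguous site: both haplotypes equal the genotype.
        return [(tuple(genotype), tuple(genotype))]
    pairs = []
    # Fix the first ambiguous site to 0 in the first haplotype so that each
    # unordered pair is produced exactly once.
    for rest in product((0, 1), repeat=len(het) - 1):
        h1, h2 = list(genotype), list(genotype)
        for site, bit in zip(het, (0,) + rest):
            h1[site] = bit
            h2[site] = 1 - bit
        pairs.append((tuple(h1), tuple(h2)))
    return pairs


# Genotype from the earlier slide; it has state 2 at k = 3 sites.
genotype = [2, 1, 2, 1, 0, 0, 1, 2, 0]
for h1, h2 in explaining_pairs(genotype):
    print(h1, h2)
```

With k = 3 ambiguous sites the sketch lists 4 pairs, one of which is the pair of haplotypes shown on the earlier slide.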

The Perfect Phylogeny Model of Haplotype Evolution
[Figure: a tree over sites 1 to 5, rooted at the ancestral haplotype 00000. Site mutations label the edges, and the extant haplotypes sit at the leaves: 00010, 10100, 10000, 01010, 01011.]

Assumptions of Perfect Phylogeny Model
• No recombination, only mutation.
• Infinite-site assumption: one mutation per site.
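One way to see what these assumptions imply for data is the standard pairwise test for binary haplotypes with an all-zero ancestral haplotype: a perfect phylogeny exists exactly when no two sites show all three of the combinations (0,1), (1,0) and (1,1). The sketch below applies that test to the five leaf haplotypes from the model slide. The test is a well-known characterization rather than something stated on the slides, and the function name is hypothetical.

```python
from itertools import combinations


def fits_rooted_perfect_phylogeny(haplotypes):
    """Pairwise test, assuming the ancestral haplotype is all zeros."""
    if not haplotypes:
        return True
    forbidden = {(0, 1), (1, 0), (1, 1)}
    for i, j in combinations(range(len(haplotypes[0])), 2):
        seen = {(h[i], h[j]) for h in haplotypes}
        # All three combinations together would need a repeated mutation
        # or a recombination, which the model rules out.
        if forbidden <= seen:
            return False
    return True


# Leaf haplotypes from the model slide, parsed from their string form.
leaves = [[int(c) for c in s] for s in ("00010", "10100", "10000", "01010", "01011")]
print(fits_rooted_perfect_phylogeny(leaves))
```

The quadratic scan over site pairs is used here only for clarity; it says nothing about how the linear-time algorithm works.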

The Perfect Phylogeny Haplotyping (PPH) Problem
Given a set of genotypes, find an explaining set of haplotypes that fits a perfect phylogeny.
[Figure: a genotype matrix over sites 1 and 2 for individuals a, b and c, an explaining haplotype matrix, and the corresponding perfect phylogeny.]

Prior Work
• Several existing algorithms solve the PPH problem, but none of them runs in linear time.
• Our contribution:
• A linear-time algorithm.
• Our implementation is about 250 times faster than the fastest previous algorithm on large data sets.

A P-Class of PPH Solutions
• P-Class: maximum common subgraph in all PPH solutions
• Each P-Class consists of two subtrees
Genotype matrix:
Sites:  1 2 3 4 5
a       2 2 2 0 0
b       2 0 0 2 2
c       2 2 2 0 2
d       2 0 0 2 0
[Figure: one PPH solution for this matrix, a rooted tree with edges labeled by sites 1 to 5 and leaves labeled by genotypes: b,d; a,d; a,c; b,c.]

P-Class Property of PPH Solutions
• All PPH solutions can be obtained by choosing how to flip each P-Class.
[Figure: one PPH solution and a second PPH solution shown side by side, each a rooted tree with edges labeled 1 to 5 and leaves labeled a,d; b,c; b,d; a,c, with the switching points marked.]

The Key Theorem
• Every PPH solution can be obtained by choosing a flip for each P-Class.
• Conversely, after fixing one P-Class, every distinct choice of flips of the P-Classes leads to a distinct PPH solution.
• If there are k P-Classes, there are 2^(k-1) distinct PPH solutions.

Shadow Tree
• Contains classes
• Each class in the shadow tree is a subgraph of a P-Class
• Merging classes results in larger classes; classes are never split
• Contains tree edges and shadow edges

The Algorithm
• Process the genotype matrix one row at a time, starting at the first row, and modify the shadow tree
• The genotype matrix only contains entries of value 0 and 2.

Overview of the Algorithm for One Row
• Procedure FirstPath
• Procedure SecondPath
• Procedure FixTree
• Procedure NewEntries

OldEntryList
• OldEntryList: column indices that have entries of value 2 in this row and also have entries of value 2 in some previous row
Genotype matrix:
2 2 2 0 0
2 0 0 2 2
2 2 2 0 2   (row 3)
2 0 0 2 0
OldEntryList for row 3: 1, 2, 3, 5
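The definition of OldEntryList can be sketched as a single pass over the matrix that remembers which columns have already shown a 2. This is a minimal illustration of the definition only; the function name `old_entry_lists` and the `seen` array are assumptions, and the slides do not say how the authors maintain this list.

```python
def old_entry_lists(matrix):
    """Return OldEntryList (1-based column indices) for every row of the matrix."""
    if not matrix:
        return []
    seen = [False] * len(matrix[0])  # seen[j]: column j had a 2 in an earlier row
    result = []
    for row in matrix:
        # Columns with a 2 in this row that also had a 2 in some previous row.
        result.append([j + 1 for j, v in enumerate(row) if v == 2 and seen[j]])
        for j, v in enumerate(row):
            if v == 2:
                seen[j] = True
    return result


# Genotype matrix from the slide (rows a, b, c, d; sites 1 to 5).
matrix = [
    [2, 2, 2, 0, 0],
    [2, 0, 0, 2, 2],
    [2, 2, 2, 0, 2],
    [2, 0, 0, 2, 0],
]
print(old_entry_lists(matrix)[2])  # entry for row 3
```

Each matrix entry is visited a constant number of times, so computing all the lists this way takes O(nm) time in total.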

Procedures FirstPath and SecondPath
• FirstPath: construct a first path towards the root of the shadow tree that passes through tree edges of as many columns in OldEntryList as possible
• SecondPath: construct a second path towards the root of the shadow tree that passes through tree edges of columns that are in OldEntryList and not on the first path

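Shadow Tree After Processing the First Two Rows
Genotype matrix:
2 2 2 0 0
2 0 0 2 2
2 2 2 0 2   (row 3)
2 0 0 2 0
OldEntryList for row 3: 1, 2, 3, 5
[Figure: the shadow tree after the first two rows, with its root and edges labeled by columns 1 to 5; each label appears twice.]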

Algorithm – FirstPath
OldEntryList: 1, 2, 3, 5
CheckList: 3, 2
Edges 4 and 5 cannot be on the same path to the root in any PPH solution.
[Figure: the shadow tree, with its root and edges labeled by columns 1 to 5; each label appears twice.]

Algorithm – SecondPath
OldEntryList: 1, 2, 3, 5
CheckList: 2, 3
[Figure: the shadow tree, with its root and edges labeled by columns 1 to 5; each label appears twice.]

Shadow Tree to PPH Solutions
Genotype matrix:
Sites:  1 2 3 4 5
a       2 2 2 0 0
b       2 0 0 2 2
c       2 2 2 0 2
d       2 0 0 2 0
[Figure: the final shadow tree and one PPH solution obtained from it.]

Shadow Tree to PPH Solutions
[Figure: the final shadow tree and a second PPH solution obtained from it, with leaves labeled a,d; b,c; b,d; a,c.]

Implementation – Leaf Count
• Leaf count of column i (L[i]): the number of 2's plus twice the number of 1's in column i.
• L[i] is the number of leaves below mutation i, in every perfect phylogeny for the genotype matrix.
• Along any path to the root in any PPH solution, the successive edges are labeled by columns with strictly increasing leaf counts.
[Figure annotation: leaf counts 4, 3, 2, 1.]
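The leaf count definition translates directly into a column-wise sum. The sketch below is an illustration of the definition only (the name `leaf_counts` is hypothetical), applied to the four-genotype matrix used on the earlier slides.

```python
def leaf_counts(matrix):
    """L[i] = (number of 2's) + 2 * (number of 1's) in column i."""
    weight = {0: 0, 1: 2, 2: 1}  # contribution of each genotype state
    # zip(*matrix) iterates over the columns of the row-major matrix.
    return [sum(weight[v] for v in column) for column in zip(*matrix)]


# Genotype matrix from the earlier slides (rows a, b, c, d; sites 1 to 5).
matrix = [
    [2, 2, 2, 0, 0],
    [2, 0, 0, 2, 2],
    [2, 2, 2, 0, 2],
    [2, 0, 0, 2, 0],
]
print(leaf_counts(matrix))
```

For this matrix, column 1 has leaf count 4 and every other column has leaf count 2.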

Time Complexity
• A constant number of simple operations on each edge per row.
• Each traversal in the shadow tree goes through O(m) edges.
• The algorithm does a constant number of traversals in the shadow tree for each row.
• Total time: O(nm), where n and m are the number of rows and columns in the genotype matrix.

Results

Thank you!
Paper and program can be downloaded at: http://wwwcsif.cs.ucdavis.edu/~gusfield/lpph/
