Algorithms for Imperfect Phylogeny Haplotyping (IPPH) with a Single Homoplasy or Recombination Event. Yun S. Song, Yufeng Wu and Dan Gusfield, University of California, Davis. Haplotyping Problem. Diploid organisms have two (not necessarily identical) copies of each chromosome.
Algorithms for Imperfect Phylogeny Haplotyping (IPPH) with a Single Homoplasy or Recombination Event. Yun S. Song, Yufeng Wu and Dan Gusfield, University of California, Davis. WABI 2005
Haplotyping Problem • Diploid organisms have two (not necessarily identical) copies of each chromosome. • A single copy is a haplotype, a vector of Single Nucleotide Polymorphisms (SNPs). • SNP: a site where two nucleotide types occur frequently, encoded as 0 or 1. • The mixed description of the two copies is a genotype, a vector over {0, 1, 2}. • If both haplotypes are 0, the genotype is 0. • If both haplotypes are 1, the genotype is 1. • If one is 0 and the other is 1, the genotype is 2.
Haplotypes and Genotypes • Haplotype Inference (HI) Problem: given a set of n genotypes, infer n haplotype pairs that form the given genotypes. • Example: two haplotypes per individual; merging the haplotypes gives the genotype for the individual.
Sites:       1 2 3 4 5 6 7 8 9
Haplotype 1: 0 1 1 1 0 0 1 1 0
Haplotype 2: 1 1 0 1 0 0 1 0 0
Genotype:    2 1 2 1 0 0 1 2 0
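The merging rule can be written out in a few lines. This is a minimal sketch, not code from the talk; the function name `merge_haplotypes` is made up for illustration, and the two haplotypes are the ones shown on the slide.

```python
def merge_haplotypes(h1, h2):
    """Combine two 0/1 haplotypes into a 0/1/2 genotype, site by site."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must have the same number of sites")
    # Equal alleles keep their value; differing alleles are recorded as 2.
    return [a if a == b else 2 for a, b in zip(h1, h2)]


# The two haplotypes from the slide (sites 1..9).
h1 = [0, 1, 1, 1, 0, 0, 1, 1, 0]
h2 = [1, 1, 0, 1, 0, 0, 1, 0, 0]
print(merge_haplotypes(h1, h2))  # [2, 1, 2, 1, 0, 0, 1, 2, 0]
```

Haplotype inference is the inverse direction: a genotype with k sites equal to 2 can come from 2^(k-1) distinct unordered haplotype pairs (for k ≥ 1), which is why a model such as perfect phylogeny is needed to choose among them.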
Perfect Phylogeny Haplotyping (PPH) • Finding the original haplotypes in nature is hopeless without a genetic model to guide the choice of solution. • Gusfield (2002) introduced the PPH problem. • PPH: find HI solutions that fit a perfect phylogeny. • Nice results exist for PPH, including a linear-time algorithm.
The Perfect Phylogeny Model for Haplotypes • Assume at most 1 mutation at each site. • Sites: 1 2 3 4 5 • Ancestral sequence: 00000 • Site mutations on edges • Extant sequences at the leaves • The tree derives the set M: 10100, 10000, 01011, 01010, 00010 • [Figure: perfect phylogeny rooted at 00000 with one edge mutation for each of sites 1 to 5]
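Whether a set of 0/1 sequences fits this model can be checked without building the tree. The sketch below uses the classical pairwise condition for an all-zero ancestral sequence: the sequences fit a perfect phylogeny exactly when no pair of sites shows all three of the combinations 01, 10 and 11. The function name is hypothetical and the check is a standard textbook test, not necessarily how the authors' software works.

```python
from itertools import combinations


def fits_rooted_perfect_phylogeny(haplotypes):
    """True if the 0/1 rows fit a perfect phylogeny rooted at all zeros.

    Pairwise test: with an all-0 ancestor, no two sites may together
    show all three of the combinations (0,1), (1,0) and (1,1).
    """
    n_sites = len(haplotypes[0])
    forbidden = {(0, 1), (1, 0), (1, 1)}
    for i, j in combinations(range(n_sites), 2):
        seen = {(row[i], row[j]) for row in haplotypes}
        if forbidden <= seen:
            return False
    return True


# The set M from the slide.
M = ["10100", "10000", "01011", "01010", "00010"]
rows = [[int(c) for c in seq] for seq in M]
print(fits_rooted_perfect_phylogeny(rows))  # True
```

When the ancestral sequence is not known, the analogous condition is the four-gamete test, which also forbids 00 appearing together with the other three combinations.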
PPH Example • [Figure: genotypes, inferred haplotypes, and the corresponding perfect phylogeny]
Imperfect Phylogeny Haplotyping (IPPH): Extending PPH • Often, real biological data do not have PPH solutions. • Eskin et al. (2003) found that deleting a small part of the data may lead to a PPH solution (heuristic). • Our approach: IPPH with an explicit genetic model that allows a small amount of • Homoplasy, i.e. back or recurrent mutation • Recombination • Goal: extend the usage of PPH • Real data may be a small perturbation from PPH • Haplotype blocks: low recombination or homoplasy
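Back/Recurrent Mutation for Haplotypes • More than one mutation at a site • [Figure: tree rooted at 000 with edge mutations at sites 2, 1, 1 and 3, so site 1 mutates twice; internal nodes 010 and 100; data at the leaves: 010, 110, 101, 000]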
Recombinations: Single Crossover • Recombination is one of the principal genetic forces shaping genetic variation. • Two equal-length sequences generate a third sequence of the same length. • Example: the prefix 11000 of 110001111111001 joined at the breakpoint to the suffix 0000001111 of 000110000001111 gives 110000000001111.
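A single crossover is just a prefix of one parent followed by the suffix of the other. The sketch below reproduces the slide's example; the function name and the convention that `breakpoint` counts the sites taken from the first parent are assumptions made for illustration.

```python
def single_crossover(prefix_parent, suffix_parent, breakpoint):
    """Return the recombinant made of prefix_parent[:breakpoint]
    followed by suffix_parent[breakpoint:]."""
    if len(prefix_parent) != len(suffix_parent):
        raise ValueError("parents must have equal length")
    if not 0 <= breakpoint <= len(prefix_parent):
        raise ValueError("breakpoint out of range")
    return prefix_parent[:breakpoint] + suffix_parent[breakpoint:]


# Parents from the slide; the first 5 sites come from the first parent.
child = single_crossover("110001111111001", "000110000001111", 5)
print(child)  # 110000000001111
```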
IPPH (Imperfect Phylogeny Haplotyping) Problems • Small deviation from PPH • H1-IPPH problem • Find a tree that allows exactly one site to mutate twice • All other sites mutate at most once • Derive haplotypes for the given genotypes • R1-IPPH problem • Find a network that has exactly one recombination event • Each site mutates at most once • Derive haplotypes for the given genotypes
Number of Minimum Recombinations for Haplotypes • [Figure: frequency of the minimum number of recombinations for small rho (scaled recombination rate); 20 sequences, 30 sites, 500 simulations]
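Haplotyping with One Homoplasy • More than one mutation at a site • [Figure: homoplasy tree rooted at 000 with edge mutations at sites 2, 1, 1 and 3 and internal nodes 010 and 100; leaves labelled a1, a2, b1, b2; panels labelled Haplotype and Genotype]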
Algorithm for H1-IPPH • For each site s in the input genotype data M • Test whether M-{s} has PPH solutions • If not, move to the next site. • Otherwise, check whether 1 homoplasy at site s can lead to HI solutions • If yes, stop and report the result • Assume only one PPH solution for M-{s} • But how can solutions with 1 homoplasy at s be found efficiently?
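The outer loop can be sketched for very small inputs. The code below covers only the first test (does M-{s} have a PPH solution?) and replaces a real PPH solver with exhaustive enumeration, so it runs in exponential time and is only a toy. All function names are hypothetical, the root is assumed unknown (hence the four-gamete test), and the second step, checking for one homoplasy at s, is deliberately left out.

```python
from itertools import combinations, product


def expansions(genotype):
    """Yield every unordered haplotype pair that merges to the genotype."""
    het = [i for i, g in enumerate(genotype) if g == 2]
    # Fix the first heterozygous site to (0, 1) to skip mirrored pairs.
    for bits in product([0, 1], repeat=max(len(het) - 1, 0)):
        h1, h2 = list(genotype), list(genotype)
        for site, b in zip(het, (0,) + bits):
            h1[site], h2[site] = b, 1 - b
        yield tuple(h1), tuple(h2)


def compatible(haplotypes):
    """Four-gamete test: no site pair shows all of 00, 01, 10, 11."""
    n_sites = len(haplotypes[0])
    return all(
        len({(h[i], h[j]) for h in haplotypes}) < 4
        for i, j in combinations(range(n_sites), 2)
    )


def has_pph_solution(genotypes):
    """Brute force: try every combination of per-genotype expansions."""
    for choice in product(*(list(expansions(g)) for g in genotypes)):
        haplotypes = [h for pair in choice for h in pair]
        if compatible(haplotypes):
            return True
    return False


def candidate_homoplasy_sites(genotypes):
    """Sites s (0-based) for which the data without column s has a PPH solution."""
    n_sites = len(genotypes[0])
    return [
        s for s in range(n_sites)
        if has_pph_solution([g[:s] + g[s + 1:] for g in genotypes])
    ]


# Illustrative data (an assumption, not from the talk): four homozygous genotypes.
data = [[0, 1, 0], [1, 1, 0], [1, 0, 1], [0, 0, 0]]
print(has_pph_solution(data))           # False
print(candidate_homoplasy_sites(data))  # [0, 1]
```

Each candidate site would then go to the second, harder check; the slides that follow describe how the authors avoid enumerating pairings at that stage.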
Example • [Figure: genotype matrix M separated into M-{i3} and the single site {i3}]
Combine Mh-{i3} with h{i3} • Assume Mh-{i3} is fixed. • Haplotypes for the same genotype must pair up. • Two ways to pair (figure labels: r2, s2, r2’, s2’) • [Figure: PPH solution Mh-{i3} for M-{i3}, and h{i3} for {i3}]
[Figure: candidate combinations Mh1, Mh2, ... of Mh-{i3} with h{i3}] • 4 ways to try pairing at i3 in this example. • Exponentially many in general, even for a single PPH solution. • Need a polynomial-time method that avoids trying all the pairings.
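The blow-up is easy to see by enumerating the pairings directly. In this sketch the haplotype pairs on the other sites are held fixed, and every genotype that is heterozygous (2) at the extra site contributes an independent two-way choice, giving 2^k combinations for k such genotypes. The function name and the toy data are assumptions for illustration; if the two fixed haplotypes of an individual are identical, its two choices coincide, so 2^k is an upper bound on distinct outcomes.

```python
from itertools import product


def pairings_at_site(hap_pairs, site_genotypes):
    """Yield every way to extend fixed haplotype pairs by one more site.

    hap_pairs:      list of (h1, h2) tuples, one pair per individual
    site_genotypes: genotype value (0, 1 or 2) of each individual at the new site
    """
    het = [k for k, g in enumerate(site_genotypes) if g == 2]
    for flips in product([0, 1], repeat=len(het)):
        flip = dict(zip(het, flips))
        extended = []
        for k, ((h1, h2), g) in enumerate(zip(hap_pairs, site_genotypes)):
            if g == 2:
                a, b = flip[k], 1 - flip[k]   # heterozygous: two possible orders
            else:
                a = b = g                     # homozygous: no choice
            extended.append((h1 + (a,), h2 + (b,)))
        yield extended


# Toy data: three individuals, two of them heterozygous at the new site.
fixed = [((0, 0), (0, 1)), ((1, 0), (0, 0)), ((1, 1), (1, 1))]
new_site = [2, 2, 0]
options = list(pairings_at_site(fixed, new_site))
print(len(options))  # 4
```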
Move to Trees • Convert the perfect phylogeny tree from the PPH solution to an unrooted tree • [Figure: Mh-{i3} and h{i3} drawn as trees]
1 Homoplasy: from T to Tr, Ts • Tree T has a recurrent mutation at site s • Deleting s induces the tree Tr • s induces a split Ts • [Figure: tree T with site s on two edges and subtrees O1, L1, L2, O2; tree Tr; split Ts separating L1, L2 from O1, O2]
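From Tr, Ts to T • Find two subtrees Ts1, Ts2 in Tr such that Ts1, Ts2 correspond to one side of Ts • [Figure: tree Tr with subtrees L1 and L - L1; split Ts with sides L and O; resulting tree T with site s on two edges and subtrees O1, L1, L - L1, O2]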
1. Pick one side of the partition from Ts 2. Pick the leaves of Tr corresponding to the chosen partition side 3. Check whether the selected leaves fit into two subtrees
s2 can pair with r2’ • May need to refine a non-binary vertex before picking the subtrees
Algorithms and Results • Efficient graph-coloring-based method to select the two subtrees (skipped) • Implemented in C++ • Simulated data generated with the program ms • Compared to PHASE (a haplotyping program) • Accuracy: comparable • Speed: at least 10x faster • 100x100 data: about 3 seconds • Can identify the homoplasy site with high accuracy: >95% in simulation
Algorithm for R1-IPPH • Split M into ML and MR by cutting between two sites
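The split itself is a column cut of the genotype matrix. The sketch below shows one cut and, as an assumption not stated on the slide, a helper that enumerates every possible cut position so that each (ML, MR) pair could be examined in turn. The names are hypothetical.

```python
def split_between_sites(M, k):
    """Cut the genotype matrix after its first k sites: returns (ML, MR)."""
    left = [row[:k] for row in M]
    right = [row[k:] for row in M]
    return left, right


def all_splits(M):
    """Yield (k, ML, MR) for every cut that leaves both sides non-empty."""
    n_sites = len(M[0])
    for k in range(1, n_sites):
        left, right = split_between_sites(M, k)
        yield k, left, right


# Toy genotype matrix: 2 individuals, 4 sites.
M = [[2, 1, 0, 2],
     [0, 1, 2, 2]]
ML, MR = split_between_sites(M, 2)
print(ML)  # [[2, 1], [0, 1]]
print(MR)  # [[0, 2], [2, 2]]
```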
PPH Solutions • Build a perfect phylogeny for each of the two partitions
1-SPR operation • SPR: subtree-prune-regraft operation • The 1-recombination condition is equivalent to distance-SPR(TL, TR) = 1
Algorithm for R1-IPPH • The brute-force 1-SPR idea leads to exponential time when TL or TR is not binary. • Trickier than H1-IPPH, but with care, R1-IPPH can be solved in polynomial time. (not in paper)
Conclusions • Contributions • Assuming a bounded number of PPH solutions • Polynomial-time algorithm for the H1-IPPH problem • Polynomial-time algorithm for the R1-IPPH problem • Possible extension to more than 1 homoplasy event. • Open problems • Efficient haplotyping with more than 1 recombination. • Remove the assumption that the number of PPH solutions for M-{s} is bounded.
Thank you • Questions?