Explore methods for inferring haplotypes from genotypes, applying heuristics and lower bounds to minimize recombinations in DNA sequencing.
Efficient Computation of Minimum Recombination With Genotypes (Not Haplotypes)
Yufeng Wu and Dan Gusfield, University of California, Davis
CSB 2006
Haplotypes/Genotypes • Diploid organisms have two (not identical) copies of each chromosome. A single copy is a haplotype, a vector of 0s and 1s. The mixed description of the two copies is a genotype, a vector of 0s, 1s and 2s. At each site: • If both haplotypes are 0, the genotype is 0 • If both haplotypes are 1, the genotype is 1 • If one is 0 and the other is 1, the genotype is 2 • Key fact: genotypes are easier to collect, but many downstream applications work better with haplotypes
Haplotyping
Sites:       1 2 3 4 5 6 7 8 9
Genotype:    2 1 2 1 0 0 1 2 0
Phasing the 2s:
Haplotype 1: 0 1 1 1 0 0 1 1 0
Haplotype 2: 1 1 0 1 0 0 1 0 0
Haplotype Inference (HI) Problem: given a set of n genotypes, infer the n real haplotype pairs that form the given genotypes
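The encoding above can be checked with a small sketch. The helper names `genotype` and `phasings` are illustrative assumptions, not part of the authors' software. The sketch builds a genotype from a haplotype pair and lists every unordered pair consistent with a genotype; a genotype with k sites equal to 2 has 2^(k-1) such pairs for k ≥ 1.

```python
from itertools import product

def genotype(h1, h2):
    """Combine two haplotype strings: equal sites keep their value, unequal sites become 2."""
    return "".join(a if a == b else "2" for a, b in zip(h1, h2))

def phasings(g):
    """List all unordered haplotype pairs that produce genotype string g."""
    het = [i for i, c in enumerate(g) if c == "2"]
    pairs = []
    # The first heterozygous site is fixed to 0 on h1 so mirrored pairs are not listed twice.
    for bits in product("01", repeat=max(len(het) - 1, 0)):
        h1, h2 = list(g), list(g)
        for i, b in zip(het, ("0",) + bits):
            h1[i], h2[i] = b, "1" if b == "0" else "0"
        pairs.append(("".join(h1), "".join(h2)))
    return pairs

example = genotype("011100110", "110100100")  # the haplotype pair from the slide
candidates = phasings(example)
```

Here `example` reproduces the genotype on the slide, and `candidates` contains the slide's haplotype pair along with the other consistent pairs, which is why HI needs extra criteria to pick one.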
Two-stage Approach • Given a set of genotypes G, we are interested in downstream problems • There are many HI solutions for G • Two-stage: first infer the “correct” HI solution from the genotypes, then do the downstream analysis with the inferred haplotypes • Haplotype inference: extensively studied and believed to be accurate to a certain extent
One-stage Approach • What effect does haplotyping inaccuracy have on downstream questions? • Our work: directly use genotype data for downstream problems • Without fixing a choice for the HI solution • Minimum recombination problem
Recombination: Single Crossover • Recombination is one of the principal genetic forces shaping variation within species • Two equal-length sequences generate a third equal-length sequence
Parent 1:    110001111111001 (contributes the prefix 11000)
Parent 2:    000110000001111 (contributes the suffix 0000001111)
Recombinant: 110000000001111 (breakpoint between sites 5 and 6)
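A single crossover is just a prefix of one parent joined to a suffix of the other. A minimal sketch, assuming 0/1 strings and a hypothetical helper name `recombine`:

```python
def recombine(prefix_parent, suffix_parent, breakpoint):
    """Single crossover: first `breakpoint` sites from one parent, the rest from the other."""
    if len(prefix_parent) != len(suffix_parent):
        raise ValueError("parents must have equal length")
    return prefix_parent[:breakpoint] + suffix_parent[breakpoint:]

# The two parent sequences from the slide, with the breakpoint after site 5.
child = recombine("110001111111001", "000110000001111", 5)
```

With the slide's parents, `child` is the prefix 11000 followed by the suffix 0000001111.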
Kreitman’s Data (1983)
0000000011000000001101110111100000000000000
0010000000000000001101110111100000000000000
0000000000000000000000000000000000010000101
0000000000000000110000000000000000010011000
0001100010110011110000000000000000001000000
0010000000000001000000000000001010111000010
0010000000000001000000000000011111101000000
1111100010111001000000000000011111101100000
1111100010111001000000000000011111101100000
1111100010111001000000000000011111101100000
1111111110000101000010001000011111101000000
Question: what is the minimum number of recombinations needed to derive these sequences? Assume at most 1 mutation per site
Minimizing Recombination • Compute the minimum number of recombinations (Rmin) for deriving a set of haplotypes, assuming at most 1 mutation per site • NP-hard in general • Heuristics • Lower bounds on Rmin
Lower Bounds on Genotypes • For a particular recombination lower bound method L, what is the range of possible bounds for L over all possible HI solutions? • MinL(G): minimum L over all HI solutions for G. • MaxL(G): maximum L over all HI solutions for G. • This paper: HK bound, connected component bound and relaxed haplotype bound. • Polynomial-time algorithms for MaxHK, MinCC. • Heuristic method for relaxed haplotype bound.
Lower Bound: Incompatibility
M    1 2 3 4 5
a    0 0 0 1 0
b    1 0 0 1 0
c    0 0 1 0 0
d    1 0 1 0 0
e    0 1 1 0 0
f    0 1 1 0 1
g    0 0 1 0 1
• Two sites (columns) p, q are incompatible if columns p, q contain all four ordered pairs (gametes): 00, 01, 10, 11
• Sites p, q are incompatible ⇒ a recombination must occur between p and q
• Incompatibility Graph (IG): a node for each site, an edge between each incompatible pair
HK Bound (1985) • Arrange the nodes of the incompatibility graph on a line in the order in which the sites appear in the sequence. • HK bound = maximum number of non-overlapping edges in the incompatibility graph (IG). • Easy to compute for haplotype data. • For the matrix M on sites 1 to 5: HK Lower Bound = 1
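The four-gamete test and the HK bound can be sketched in a few lines. This is an illustrative sketch, not the authors' code: the function names are assumptions, sites are 0-indexed, and two edges are treated as non-overlapping when their site intervals share at most an endpoint (the convention under which {12, 23, 35} counts as non-overlapping later in the slides). The maximum set is found with the standard greedy scan by right endpoint.

```python
from itertools import combinations

def incompatible(rows, p, q):
    """Four-gamete test: True if columns p and q together show 00, 01, 10 and 11."""
    return len({(r[p], r[q]) for r in rows}) == 4

def incompatibility_edges(rows):
    """All incompatible site pairs (p, q) with p < q, using 0-indexed sites."""
    m = len(rows[0])
    return [(p, q) for p, q in combinations(range(m), 2) if incompatible(rows, p, q)]

def hk_bound(rows):
    """Maximum number of edges whose site intervals overlap in at most an endpoint."""
    count, last_right = 0, 0
    # Greedy interval scheduling: take edges in order of right endpoint.
    for p, q in sorted(incompatibility_edges(rows), key=lambda e: e[1]):
        if p >= last_right:
            count, last_right = count + 1, q
    return count

# Rows a..g of the matrix M from the incompatibility slide.
M = ["00010", "10010", "00100", "10100", "01100", "01101", "00101"]
edges = incompatibility_edges(M)
bound = hk_bound(M)
```

For M the incompatible pairs are sites (1,3), (1,4) and (2,5) in the slides' 1-based numbering; all three intervals overlap, so the bound is 1.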
IG for HI Solutions
Genotypes G: 01010 10101 00202 22200
HI1: 01010 01010 10101 10101 00000 00101 01000 10100 (HK = 1)
HI2: 01010 01010 10101 10101 00001 00100 00000 11100 (HK = 3)
[Figure: incompatibility graphs on sites 1 to 5 for HI1 and HI2]
HK Bounds on Genotypes • Known efficient algorithm for MinHK(G) (Wiuf, 2004). • This paper: polynomial-time algorithm for MaxHK(G)
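For a tiny input, the range of the HK bound over all HI solutions can be checked by brute force. The sketch below is exponential in the number of heterozygous sites and is not the polynomial-time method of the paper or of Wiuf (2004); function names are assumptions, and genotypes are tuples of 0, 1, 2.

```python
from itertools import combinations, product

def hk_bound(haps):
    """HK bound of a haplotype list (edges sharing only an endpoint do not overlap)."""
    m = len(haps[0])
    edges = [(p, q) for p, q in combinations(range(m), 2)
             if len({(h[p], h[q]) for h in haps}) == 4]
    count, last_right = 0, 0
    for p, q in sorted(edges, key=lambda e: e[1]):
        if p >= last_right:
            count, last_right = count + 1, q
    return count

def phasings(g):
    """All unordered haplotype pairs consistent with one genotype."""
    het = [i for i, x in enumerate(g) if x == 2]
    out = []
    for bits in product((0, 1), repeat=max(len(het) - 1, 0)):
        h1, h2 = list(g), list(g)
        for i, b in zip(het, (0,) + bits):
            h1[i], h2[i] = b, 1 - b
        out.append((tuple(h1), tuple(h2)))
    return out

def hk_range(genos):
    """(MinHK, MaxHK) by enumerating every HI solution; only feasible for toy data."""
    values = []
    for choice in product(*(phasings(g) for g in genos)):
        haps = [h for pair in choice for h in pair]
        values.append(hk_bound(haps))
    return min(values), max(values)

# The four genotypes used in the slides' running example.
G = [(0, 1, 0, 1, 0), (1, 0, 1, 0, 1), (0, 0, 2, 0, 2), (2, 2, 2, 0, 0)]
lo, hi = hk_range(G)
```

On the running example the eight HI solutions give bounds between 1 and 3, matching the HK values shown for HI1 and HI2.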
Maximal Incompatibility Graph MIG(G)
G: 01010 10101 00202 22200
• An edge between sites p and q if there is a phasing of p, q so that p and q are incompatible • Each pair of sites is considered independently • E(G): a maximum-sized set of non-overlapping edges in MIG(G)
E(G) = {12, 23, 35}
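The pairwise test behind MIG(G) can be sketched as follows. The reasoning used here is an assumption spelled out in the comments: a genotype that is 2 at both sites contributes either {00, 11} or {01, 10} depending on its phasing, while every other genotype contributes fixed gametes. Function names are illustrative and sites are 0-indexed.

```python
from itertools import combinations

ALL_GAMETES = {(0, 0), (0, 1), (1, 0), (1, 1)}

def can_be_incompatible(genos, p, q):
    """True if some phasing of sites p and q alone yields all four gametes."""
    fixed, double_het = set(), 0
    for g in genos:
        if g[p] == 2 and g[q] == 2:
            double_het += 1          # contributes {00,11} or {01,10}, our choice
            continue
        for a in ((0, 1) if g[p] == 2 else (g[p],)):
            for b in ((0, 1) if g[q] == 2 else (g[q],)):
                fixed.add((a, b))    # these gametes appear under every phasing
    missing = ALL_GAMETES - fixed
    if double_het == 0:
        return not missing
    if double_het == 1:
        return missing <= {(0, 0), (1, 1)} or missing <= {(0, 1), (1, 0)}
    return True                      # two double heterozygotes can supply all four

def mig_edges(genos):
    m = len(genos[0])
    return [(p, q) for p, q in combinations(range(m), 2)
            if can_be_incompatible(genos, p, q)]

def max_nonoverlapping(edges):
    """Greedy maximum set of edges whose intervals share at most an endpoint."""
    chosen, last_right = [], 0
    for p, q in sorted(edges, key=lambda e: e[1]):
        if p >= last_right:
            chosen.append((p, q))
            last_right = q
    return chosen

G = [(0, 1, 0, 1, 0), (1, 0, 1, 0, 1), (0, 0, 2, 0, 2), (2, 2, 2, 0, 0)]
E = max_nonoverlapping(mig_edges(G))
```

For the slide's G this selects the 0-indexed edges (0,1), (1,2), (2,4), that is E(G) = {12, 23, 35}, so |E(G)| = 3.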
Claim: MaxHK(G) = |E(G)| • MaxHK(G) ≤ |E(G)|: MIG(G) is a supergraph of IG(H) for any HI solution H • If we can find an HI solution H in which every pair of sites in E(G) is incompatible, then HK(H) ≥ |E(G)|, so MaxHK(G) ≥ HK(H) ≥ |E(G)| • Together, MaxHK(G) = |E(G)|
Finding such an H • Phase sites from left to right. • Each component in E(G) is a simple path • Each site is constrained by at most one site to its left
Phasing G for Incompatibility
After phasing site 1: 01010 01010 10101 10101 00?0? 00?0? 0??00 1??00
After phasing site 2: 01010 01010 10101 10101 00?0? 00?0? 00?00 11?00
After phasing site 3: 01010 01010 10101 10101 0010? 0000? 00000 11100
• No matter how a previous site p is phased, we can always phase the current site q to make p, q incompatible
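The left-to-right phasing can be sketched as below. This is one possible reading of the slides, not the authors' implementation: it assumes the input edges are non-overlapping MIG edges (so each site has at most one constraining site to its left), uses 0-indexed sites, and the name `phase_for_edges` is hypothetical. For a genotype that is 2 at both p and q, a "parallel" phasing copies site p to site q (gametes 00 and 11) and a "crossed" phasing flips it (gametes 01 and 10).

```python
ALL_GAMETES = {(0, 0), (0, 1), (1, 0), (1, 1)}

def phase_for_edges(genos, edges):
    """Phase genotypes so that every (p, q) in edges becomes an incompatible pair."""
    left = {q: p for p, q in edges}          # the single constraining site of q, if any
    n = len(genos[0])
    pairs = [([0] * n, [0] * n) for _ in genos]
    for q in range(n):
        p = left.get(q)
        need = set()
        if p is not None:
            fixed = set()                    # gametes present under every phasing
            for g in genos:
                if g[p] == 2 and g[q] == 2:
                    continue
                for a in ((0, 1) if g[p] == 2 else (g[p],)):
                    for b in ((0, 1) if g[q] == 2 else (g[q],)):
                        fixed.add((a, b))
            need = ALL_GAMETES - fixed
        for g, (h1, h2) in zip(genos, pairs):
            if g[q] != 2:
                h1[q] = h2[q] = g[q]
            elif p is None or g[p] != 2:
                h1[q], h2[q] = 0, 1          # orientation is irrelevant for the pair (p, q)
            elif need & {(0, 0), (1, 1)} or not need:
                h1[q], h2[q] = h1[p], h2[p]  # parallel: supplies 00 and 11
                need -= {(0, 0), (1, 1)}
            else:
                h1[q], h2[q] = 1 - h1[p], 1 - h2[p]  # crossed: supplies 01 and 10
                need -= {(0, 1), (1, 0)}
    return pairs

def incompatible(haps, p, q):
    return len({(h[p], h[q]) for h in haps}) == 4

G = [(0, 1, 0, 1, 0), (1, 0, 1, 0, 1), (0, 0, 2, 0, 2), (2, 2, 2, 0, 0)]
E = [(0, 1), (1, 2), (2, 4)]                 # E(G) = {12, 23, 35} in 0-indexed form
H = phase_for_edges(G, E)
haps = [h for pair in H for h in pair]
```

On the running example this yields the pairs 00001/00100 and 00000/11100 for the last two genotypes, the same haplotypes as HI2, and every edge of E(G) is incompatible in the result.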
Haplotyping With Minimum Number of Recombinations • Compute Rmin(G) • Haplotyping on a network with the fewest recombinations • NP-hard • This paper: a branch and bound method computing the exact Rmin(G) for data with a small number of sites • APOE data: 47 non-trivial genotypes, 9 sites • Our method: 2 minutes, Rmin(G) = 5
Application: Recombination Hotspot • Recombination hotspot: a region where the recombination rate is much higher than in neighboring regions • Previous study (Bafna and Bansal, 2005): a recombination lower bound on inferred haplotypes was used to identify recombination hotspots • Our work: compute the exact Rmin(G) with genotypes for a sliding window of a small number of SNPs to detect recombination hotspots
[Figure: MS32 data (Jeffreys et al., 2001). Two panels: result from haplotypes (Bafna and Bansal, 2005); result from original genotypes (this paper)]
Other Applications • Finding the true Rmin from genotypes G • Two-stage approach: run PHASE to get an HI solution H, and compute Rmin(H) • One-stage approach: directly compute Rmin(G) • Accuracy of haplotype inference on a minimum network • Simulation results: comparable, slightly weaker and inconclusive
Summary • Main goal of this paper: develop computational tools for the minimum recombination problem with genotypes • Polynomial-time algorithms for the MaxHK and MinCC problems • Practical heuristics for other problems • Simulation results for several application questions are not conclusive • Our tools facilitate the study of these problems
Thank You • Software: available upon request