Combinatorial methods in Bioinformatics: the haplotyping problem
Paola Bonizzoni, DISCo, Università di Milano-Bicocca
Content
• Motivation: biological terms
• Combinatorial methods in haplotyping
• Haplotyping via perfect phylogeny: the PPH problem
• Inference of incomplete perfect phylogeny: algorithms
• Incomplete PPH and missing data
• Other models: open problems
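Biological terms
• Diploid organism: each chromosome has a maternal and a paternal copy
• Haplotype: the sequence of alleles on one copy; genotype: the unordered pair of alleles at each site
• Biallelic site i: Value(i) ⊆ {A,C,G,T} with |Value(i)| = 2
• A site is homozygous if both copies carry the same allele, heterozygous otherwise
[Figure: maternal and paternal haplotypes over sites i, i+1, i+2 and the resulting genotype (alleles shown: A G A C A A); homozygous and heterozygous sites are marked]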
Motivations
• Human genetic variations are related to diseases (cancers, diabetes, osteoporosis); the most common variation is the Single Nucleotide Polymorphism (SNP) on haplotypes in chromosomes
• The Human Genome Project produces genotype sequences of humans
• Computational methods to derive haplotypes from genotype data are needed
• Ongoing international HapMap project: find haplotype differences on large-scale population data
• Combinatorial methods: graphs, set-cover problems, optimization problems
Haplotyping: the formal model
• Haplotype: m-vector h = <0, 1, …, 0> in {0,1}^m
• Genotype: m-sequence of unordered allele pairs g = <{0,1}, …, {0,0}, …, {1,1}>, written over {0,1,*} with {0,0} → 0, {1,1} → 1, {0,1} → *; e.g. g = <*, …, 0, …, 1>
Def. Haplotypes <h, k> solve genotype g iff, for each position i:
• g(i) = * implies h(i) ≠ k(i)
• h(i) = k(i) = g(i) otherwise
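The definition above can be checked position by position. The following is a minimal sketch, assuming haplotypes and genotypes are encoded as equal-length strings over "01" and "01*" respectively; the function name `solves` is illustrative and not part of the slides.

```python
def solves(h, k, g):
    """Return True if the haplotype pair <h, k> solves genotype g.

    Assumed encoding (not from the slides): h and k are strings over "01",
    g is a string over "01*", all of the same length.
    """
    if not (len(h) == len(k) == len(g)):
        return False
    for hi, ki, gi in zip(h, k, g):
        if gi == "*":
            # Heterozygous site: the two haplotypes must differ.
            if hi == ki:
                return False
        elif not (hi == ki == gi):
            # Homozygous site: both haplotypes must carry the genotype value.
            return False
    return True


# Values taken from the "Examples" slide.
k = "0011011"
h = "0110011"
g = "0*1*011"
print(solves(h, k, g))
```

The check is symmetric in h and k, which matches the genotype being an unordered pair at each site.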
Examples
• k = <0,0,1,1,0,1,1>, h = <0,1,1,0,0,1,1>, g = <0,*,1,*,0,1,1>: g is solved by <k,h>
Clark inference rule
• Known haplotypes: h1 = <0,0,1,1,0,1,1>, h2 = <0,1,1,0,0,1,1>
• g1 = <0,*,1,*,0,1,1> is solved by <h1,h2>
• g2 = <0,1,*,0,0,1,1> is solved by h2 together with the new haplotype h3 = <0,1,0,0,0,1,1>
• g3 = <0,1,0,*,0,1,1> (one panel of the slide shows g3 = <0,0,*,*,1,1,1> instead)
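The inference step in the example (a known haplotype plus a compatible genotype yields the complementary haplotype) can be sketched as follows. This is an illustrative greedy loop in the spirit of Clark's rule, not a reference implementation: the string encoding, the names `complement` and `clark`, and the order in which genotypes and haplotypes are tried are all assumptions, and a different order can give a different result.

```python
def complement(h, g):
    """Return k such that <h, k> solves g, or None if h is incompatible with g."""
    k = []
    for hi, gi in zip(h, g):
        if gi == "*":
            k.append("1" if hi == "0" else "0")  # heterozygous: take the other allele
        elif hi == gi:
            k.append(gi)                          # homozygous: same allele
        else:
            return None                           # h disagrees with g at this site
    return "".join(k)


def clark(known, genotypes):
    """Greedily resolve genotypes with known haplotypes, adding each new complement."""
    known = list(known)
    unresolved = list(genotypes)
    resolved = {}
    progress = True
    while unresolved and progress:
        progress = False
        for g in list(unresolved):
            for h in known:
                k = complement(h, g)
                if k is not None:
                    resolved[g] = (h, k)
                    if k not in known:
                        known.append(k)
                    unresolved.remove(g)
                    progress = True
                    break
    return known, resolved, unresolved


# h1, h2 and g1, g2, g3 as listed on the slide.
known, resolved, unresolved = clark(
    ["0011011", "0110011"],
    ["0*1*011", "01*0011", "010*011"],
)
print(known)
print(unresolved)
```

Genotypes that no known haplotype is compatible with stay in `unresolved`; the sketch does not try to seed new haplotypes from homozygous genotypes.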
Haplotype inference: the general problem
• Problem HI
Instance: a set G = {g1, …, gm} of genotypes and a set H = {h1, …, hn} of haplotypes
Solution: a set H' of haplotypes that solves each genotype g in G, s.t. H ⊆ H'
• H' derives from an inference RULE
Types of inference rules
• Clark's rule: haplotypes solve g by an iterative rule
• Gusfield coalescent model: haplotypes are related to genotypes by a tree model
• Pedigree data: haplotypes are related to genotypes by a directed graph
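HI by the perfect phylogeny model
• IDEA: genotypes are the mating of haplotypes in a tree
• Given G, find H and T that explain G!
Example:
G: g1 = <0,1,*,*,1>, g2 = <*,0,0,0,1>
H: <0,1,1,0,1> and <0,1,0,1,1> (solve g1); <1,0,0,0,1> and <0,0,0,0,1> (solve g2)
[Figure: tree T with root 00000 relating the haplotypes of H]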
Perfect Phylogeny models
• Input data: 0-1 matrix A of characters (columns c1, …, c5) and species (rows s1, …, s4)
• Output data: a phylogeny for A
[Figure: matrix A and a tree with root R, leaves s1, …, s4 and edges labelled c1, …, c5; the path c3 c4 is marked]
Perfect phylogeny
Def. A pp T for a 0-1 matrix A is a rooted tree such that:
• each row s_i labels exactly one leaf of T
• each column c_j labels exactly one edge of T
• each internal edge is labelled by at least one column c_j
• row s_i gives the 0-1 path from the root to s_i: A[i,j] = 1 iff c_j labels an edge on that path
[Figure: tree with leaves s1, …, s4 and edges labelled c1, …, c5; the path c3 c4 is marked]
pp model: another view
• L(x), the cluster of a node x: the set of leaves of the subtree T_x
• A pp is associated to a tree-family (S,C) with S = {s1, …, sn} and C = {S' ⊆ S : S' is a cluster}, s.t. for all X, Y in C, if X ∩ Y ≠ ∅ then X ⊆ Y or Y ⊆ X
[Figure: tree with leaves s1, …, s4]
pp: another view
A tree-family (S,C) is represented by a 0-1 matrix B = (b_ji):
• each column c_i corresponds to a set S' in C: s_j ∈ S' iff b_ji = 1
• for each set in C there is at least one column
Lemma. A 0-1 matrix has a pp iff it represents a tree-family
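The lemma suggests a direct test: read each column as the set of rows holding a 1 and check that any two such sets are either disjoint or nested. Below is a small quadratic-time sketch of that check; the list-of-lists encoding and the names `column_clusters` and `is_tree_family` are assumptions made for illustration, and the test is for the directed case in which the root is the all-zero vector.

```python
from itertools import combinations


def column_clusters(matrix):
    """Return, for each column, the set of row indices that hold a 1."""
    n_rows = len(matrix)
    n_cols = len(matrix[0]) if matrix else 0
    return [
        frozenset(i for i in range(n_rows) if matrix[i][j] == 1)
        for j in range(n_cols)
    ]


def is_tree_family(clusters):
    """True if every two clusters are disjoint or one contains the other."""
    for x, y in combinations(clusters, 2):
        if x & y and not (x <= y or y <= x):
            return False
    return True


# The four haplotypes from the "HI by the perfect phylogeny model" slide.
H = [
    [1, 0, 0, 0, 1],
    [0, 0, 0, 0, 1],
    [0, 1, 1, 0, 1],
    [0, 1, 0, 1, 1],
]
print(is_tree_family(column_clusters(H)))
```

Comparing all pairs of columns costs O(nm^2) set operations in the worst case, so this is only meant to make the lemma concrete.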
Haplotyping by the pp
A 0-1 matrix B represents the phylogenetic tree for a set H of haplotypes:
• rows s_i: haplotypes
• columns c_i: SNP sites
• the 0-1 switch in position i occurs only once in the tree!
[Figure: tree rooted at 00000 with haplotypes 01000, 01001 and 11000; each edge is labelled by the SNP site that switches]
Haplotyping and the pp: observations
• The root of T may not be the all-zero haplotype 00000
• 0-1 switch or 1-0 switch (directed case)
[Figure: trees over 5-site haplotypes such as 00000, 00011, 01000 and 11000, with edges marked as 0-1 switch or 1-0 switch]
HI problem in the pp model
• Input data: a 0-1-* matrix B (n × m) of genotypes G, e.g. rows 01*1*001*, 001*11*11, 0000*1*1*
• Output data: a 0-1 matrix B' (2n × m) of haplotypes s.t.
(1) each g ∈ G is solved by a pair of rows <h,k> in B'
(2) B' has a pp (tree family)
• DECISION problem: does such a B' exist?
An example
Genotypes: a = <*,*>, b = <0,*>, c = <1,0>
Haplotypes:
a = <1,0>, a' = <0,1>
b = <0,1>, b' = <0,0>
c = <1,0>, c' = <1,0>
[Figure: tree with leaves b', a, c', c, a', b]
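For instances as small as this example, the decision problem can be explored by exhaustive search: try every way of resolving each genotype into a haplotype pair and keep the first choice whose 2n haplotypes form a tree family. The sketch below is exponential in the number of * entries and is not one of the algorithms cited in these slides; the string encoding, the function names, and the restriction to the directed case (all-zero root) are assumptions.

```python
from itertools import combinations, product


def resolutions(g):
    """Yield every unordered haplotype pair (h, k) that solves genotype g."""
    stars = [i for i, x in enumerate(g) if x == "*"]
    if not stars:
        yield g, g
        return
    # Fix h to 0 at the first heterozygous site so each unordered pair appears once.
    for bits in product("01", repeat=len(stars) - 1):
        h, k = list(g), list(g)
        for pos, b in zip(stars, ("0",) + bits):
            h[pos] = b
            k[pos] = "1" if b == "0" else "0"
        yield "".join(h), "".join(k)


def has_pp(haplotypes):
    """Directed-case test: column clusters must be pairwise disjoint or nested."""
    if not haplotypes:
        return True
    m = len(haplotypes[0])
    clusters = [{i for i, h in enumerate(haplotypes) if h[j] == "1"} for j in range(m)]
    return all(not (x & y) or x <= y or y <= x for x, y in combinations(clusters, 2))


def pph_bruteforce(genotypes):
    """Return 2n haplotype rows solving all genotypes with a pp, or None."""
    for choice in product(*(list(resolutions(g)) for g in genotypes)):
        rows = [h for pair in choice for h in pair]
        if has_pp(rows):
            return rows
    return None


# Genotypes a, b, c from the example slide.
print(pph_bruteforce(["**", "0*", "10"]))
```

On the example, the search rejects the resolution a = (00, 11) and accepts a = (01, 10), b = (00, 01), c = (10, 10), the same pairs as listed on the slide.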
The PPH problem: solutions
• An undirected algorithm: Gusfield, RECOMB 2002
• An O(nm^2) algorithm: Karp et al., RECOMB 2003
• A linear-time O(nm) algorithm? Optimal algorithm: open
A related problem: the incomplete directed pp (IDP)
• Inferring a pp from a 0-1-* matrix
• O(nm + k log^2(n+m)) algorithm: I. Pe'er, T. Pupko, R. Shamir, R. Sharan, SIAM 2004
IDP problem
Instance: a 0-1-? matrix A
Solution: resolve each ? into 0 or 1 to obtain a matrix A' and a pp for A', or say "no pp exists"
OPEN PROBLEM: find an optimal algorithm
Example (rows S1, S2, S3; columns 1 to 5):
A:
S1 1 ? 0 0 1
S2 ? ? 0 1 0
S3 ? 0 1 ? ?
A':
S1 1 0 0 0 1
S2 1 1 0 1 0
S3 1 0 1 0 1
[Figure: the ? entries are resolved step by step; tree with edges C1, …, C5 and leaves S1, S2, S3]
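To make the instance/solution statement concrete, here is a brute-force sketch that tries every 0/1 assignment of the ? entries and returns the first completion whose columns form a tree family. It is exponential in the number of ? entries and is unrelated to the cited polynomial algorithm. The matrix encoding (integers 0 and 1, the string "?"), the function names, and the sample instance (a 3 × 5 matrix read from the slide's garbled figure) are assumptions.

```python
from itertools import combinations, product


def has_directed_pp(matrix):
    """Column clusters must be pairwise disjoint or nested (all-zero root assumed)."""
    if not matrix:
        return True
    clusters = [
        {i for i, row in enumerate(matrix) if row[j] == 1}
        for j in range(len(matrix[0]))
    ]
    return all(not (x & y) or x <= y or y <= x for x, y in combinations(clusters, 2))


def idp_bruteforce(matrix):
    """Return a 0-1 completion of the 0-1-? matrix that has a pp, or None."""
    holes = [(i, j) for i, row in enumerate(matrix) for j, x in enumerate(row) if x == "?"]
    for bits in product((0, 1), repeat=len(holes)):
        completed = [list(row) for row in matrix]
        for (i, j), b in zip(holes, bits):
            completed[i][j] = b
        if has_directed_pp(completed):
            return completed
    return None


A = [
    [1, "?", 0, 0, 1],
    ["?", "?", 0, 1, 0],
    ["?", 0, 1, "?", "?"],
]
print(idp_bruteforce(A))
```

An instance can have several valid completions; the sketch returns the first one in enumeration order, which need not be the one drawn on the slide.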
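Decision algorithms for incomplete pp
Based on characterizations of a 0-1 matrix A that has a pp:
• Tree family
• Forbidden submatrix (gives a "no" certificate): two columns c, c' and three rows s1, s2, s3 whose entries in (c, c') are 10, 01 and 11
• Bipartite graph G(A) = (S,C,E) with E = {(s_i,c_j) : b_ij = 1}: forbidden subgraph
[Figure: forbidden submatrix on columns c, c' and rows s1, s2, s3; forbidden subgraph on X, Y with nodes 10, 11, 01]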
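A "no" certificate of the forbidden-submatrix kind can be searched for directly: look for two columns in which some row reads 10, another reads 01 and a third reads 11. The sketch below scans all column pairs; the function name, the return format and the list-of-lists encoding are illustrative assumptions.

```python
def find_forbidden_submatrix(matrix):
    """Return (c, c2, row_10, row_01, row_11) for a forbidden submatrix, or None.

    c and c2 are column indices; row_10, row_01, row_11 are indices of rows
    whose entries in columns (c, c2) are (1,0), (0,1) and (1,1).
    """
    n_cols = len(matrix[0]) if matrix else 0
    for c in range(n_cols):
        for c2 in range(c + 1, n_cols):
            witness = {}
            for i, row in enumerate(matrix):
                pair = (row[c], row[c2])
                if pair != (0, 0):
                    witness.setdefault(pair, i)  # keep the first row per pattern
            if len(witness) == 3:
                return c, c2, witness[(1, 0)], witness[(0, 1)], witness[(1, 1)]
    return None


print(find_forbidden_submatrix([[1, 0], [0, 1], [1, 1]]))
```

When the function returns None, no pair of columns overlaps without nesting, which is the tree-family condition for the directed case.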
Test: does a 0-1 matrix A have a pp?
• O(nm) algorithm (Gusfield 1991)
Steps:
• Given A, order the columns {c1, …, cm} as decreasing binary numbers, obtaining A'
• Let L(i,j) = max{l < j : A'[i,l] = 1} for each (i,j) s.t. A'[i,j] = 1 (L(i,j) = 0 if there is no such l)
• Let index(j) = max{L(i,j) : A'[i,j] = 1}
• Then apply the theorem
TH. A' has a pp iff L(i,j) = index(j) for each (i,j) s.t. A'[i,j] = 1
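The steps above can be sketched as follows. This version sorts the columns with Python's comparison sort rather than a radix sort, so it does not reach the O(nm) bound; it also drops duplicate columns before the test. The function name and the list-of-lists encoding are assumptions.

```python
def has_perfect_phylogeny(matrix):
    """Column-sorting test for a directed pp on a 0-1 matrix (list of rows)."""
    if not matrix:
        return True
    n, m = len(matrix), len(matrix[0])
    # Each column read top to bottom is a binary number; tuples compare the same way.
    columns = sorted({tuple(row[j] for row in matrix) for j in range(m)}, reverse=True)
    # last[i] = 1-based index of the latest processed column with a 1 in row i,
    # or 0 if there is none; for a 1-entry (i, j) this is L(i, j).
    last = [0] * n
    for j, col in enumerate(columns, start=1):
        values = {last[i] for i in range(n) if col[i] == 1}
        # L(i, j) = index(j) for every 1 in column j means all values coincide.
        if len(values) > 1:
            return False
        for i in range(n):
            if col[i] == 1:
                last[i] = j
    return True


print(has_perfect_phylogeny([[1, 0], [0, 1], [1, 1]]))
```

Checking that the set of L(i,j) values in a column has at most one element is the same as comparing each value with index(j), the maximum of that set.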
The IDP algorithm
[Figure: columns c, c' and rows s1, s2, s3]
Other HI problems via the pp model
• Incomplete 0-1-*-? matrix because of missing data:
haplotype pp (Ihpp): haplotype rows
genotype pp (Igpp): genotype rows
Algorithms:
• Ihpp = IDP given a row as a root (polynomial time); NP-complete otherwise
• Igpp has a polynomial solution under the rich data hypothesis (Karp et al., RECOMB 2004 – ICALP 2004); NP-complete otherwise
HI problem and other models
• Haplotype inference in pedigree data under the recombination model
[Figure: maternal and paternal haplotypes and a child haplotype produced by recombination]
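As a small illustration of what a recombination count means in this setting, the sketch below takes the two haplotypes of one parent and the haplotype that parent passed to a child, and computes the fewest switches between the two parental copies that explain the child haplotype. This is only the per-parent counting step, not a solver for haplotype inference on a whole pedigree; the string encoding and the function name are assumptions.

```python
def min_recombinations(hap_a, hap_b, child):
    """Fewest switches between hap_a and hap_b needed to copy out `child`.

    Returns None if some site of `child` matches neither parental haplotype.
    """
    inf = float("inf")
    # cost[s] = fewest switches so far, given the current site is copied from copy s.
    cost = [0, 0]
    for a, b, c in zip(hap_a, hap_b, child):
        matches = (a == c, b == c)
        cost = [
            min(cost[s], cost[1 - s] + 1) if matches[s] else inf
            for s in (0, 1)
        ]
        if cost == [inf, inf]:
            return None
    return min(cost)


print(min_recombinations("0000", "1111", "0011"))
```

Sites where the two parental copies carry the same allele are uninformative: both states stay reachable there, so they never force a switch.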
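Pedigree graph
• Nuclear family: father, mother, child
• Single-mating pedigree tree
• Pedigree graph with a mating loop
[Figure: examples of a nuclear family, a single-mating pedigree tree and a pedigree graph with a mating loop]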
Haplotype inference in pedigree
[Figure: pedigree with a two-site genotype at each member (e.g. 00, 01, 10, 11) and the inferred alleles written as paternal|maternal (e.g. 0|1, 1|0)]
Problems:
• MPT-MRHI (pedigree tree, multi-mating, minimum recombination HI)
• SPT-MRHI (pedigree tree, single-mating, minimum recombination HI)
NP-complete even if the graph is acyclic, but with an unbounded number of children… OPEN