Combinatorial methods in Bioinformatics: the haplotyping problem

Paola Bonizzoni, DISCo, Università di Milano-Bicocca. Content: motivation and biological terms; combinatorial methods in haplotyping; haplotyping via perfect phylogeny (the PPH problem).

Presentation Transcript


  1. Combinatorial methods in Bioinformatics: the haplotyping problem Paola Bonizzoni DISCo Università di Milano-Bicocca

  2. Content
     • Motivation: biological terms
     • Combinatorial methods in haplotyping
     • Haplotyping via perfect phylogeny: the PPH problem
     • Inference of incomplete perfect phylogeny: algorithms
     • Incomplete PPH and missing data
     • Other models: open problems

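  3. Biological terms
     • Diploid organism: each individual carries a maternal and a paternal haplotype; the genotype at a site is the unordered pair of their values
     • Biallelic site i: Value(i) ⊆ {A,C,G,T} with |Value(i)| ≤ 2
     • A site is homozygous if the two haplotypes agree there, heterozygous otherwise
     [Figure: maternal and paternal sequences at sites i, i+1, i+2, with the genotype, a homozygous site and a heterozygous site marked.]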

  4. Motivations
     • Human genetic variations are related to diseases (cancers, diabetes, osteoporosis); the most common variation is the Single Nucleotide Polymorphism (SNP) on haplotypes in chromosomes
     • The human genome project produces genotype sequences of humans
     • Computational methods to derive haplotypes from genotype data are needed
     • Ongoing international HapMap project: find haplotype differences on large-scale population data
     • Combinatorial methods: graphs, set-cover problems, optimization problems

  5. Haplotyping: the formal model
     • Haplotype: an m-vector h = <0, 1, …, 0> over {0,1}^m
     • Genotype: an m-sequence of unordered pairs g = <{0,1}, …, {0,0}, …, {1,1}>, written over {0,1,*} as g = <*, …, 0, …, 1>, where * = {0,1}, 0 = {0,0} and 1 = {1,1}
     Def. Haplotypes <h, k> solve genotype g iff, for each site i:
     • g(i) = * implies h(i) ≠ k(i)
     • h(i) = k(i) = g(i) otherwise
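The definition above can be checked mechanically. The following is a minimal sketch, assuming haplotypes are sequences of 0/1 integers and genotypes are sequences over 0, 1 and the string "*"; the function name `solves` is hypothetical and not part of the slides.

```python
def solves(h, k, g):
    """Return True iff the haplotype pair <h, k> solves genotype g.

    Assumed encoding (not from the slides): h and k hold ints 0/1,
    g holds 0, 1 or the string "*" for a heterozygous site.
    """
    if not (len(h) == len(k) == len(g)):
        raise ValueError("h, k and g must have the same length")
    for hi, ki, gi in zip(h, k, g):
        if gi == "*":
            # heterozygous site: the two haplotypes must differ
            if hi == ki:
                return False
        elif not (hi == ki == gi):
            # homozygous site: both haplotypes must equal the genotype value
            return False
    return True


# The pair from slide 6: g = <0,*,1,*,0,1,1> with k and h.
g = [0, "*", 1, "*", 0, 1, 1]
k = [0, 0, 1, 1, 0, 1, 1]
h = [0, 1, 1, 0, 0, 1, 1]
print(solves(h, k, g))
```

The check is symmetric in h and k, which matches the definition: the pair is unordered.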

  6. Examples
     • Haplotypes: h1 = <0,0,1,1,0,1,1>, h2 = <0,1,1,0,0,1,1>, h3 = <0,1,0,0,0,1,1>
     • Genotypes: g1 = <0,*,1,*,0,1,1>, g2 = <0,1,*,0,0,1,1>, g3 = <0,1,0,*,0,1,1> (the slide also shows g3 = <0,0,*,*,1,1,1>)
     • Clark inference rule: g = <0,*,1,*,0,1,1> is solved by <k, h> with k = <0,0,1,1,0,1,1> and h = <0,1,1,0,0,1,1>; given g and a known k, the rule infers h

  7. Haplotype inference: the general problem
     • Problem HI
       Instance: a set G = {g1, …, gm} of genotypes and a set H = {h1, …, hn} of haplotypes
       Solution: a set H’ of haplotypes that solves each genotype g in G, s.t. H ⊆ H’
     • H’ is derived by applying an inference RULE
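As an illustration of deriving H’ from H with an inference rule, here is a minimal sketch of a greedy, Clark-style iteration. It assumes the same encoding as the formal model (0, 1 and "*"); the names `complement` and `clark_inference` are hypothetical, and the greedy order in which genotypes and haplotypes are tried is a simplification, so different orders may give different H’.

```python
def complement(h, g):
    """Return the haplotype k such that <h, k> solves g, or None if h
    disagrees with g at some homozygous site."""
    k = []
    for hi, gi in zip(h, g):
        if gi == "*":
            k.append(1 - hi)      # heterozygous: k takes the other allele
        elif hi == gi:
            k.append(gi)          # homozygous: k repeats the allele
        else:
            return None           # h cannot be one of the two haplotypes of g
    return tuple(k)


def clark_inference(genotypes, known):
    """Greedy sketch: repeatedly resolve a genotype with an already known
    haplotype and add the inferred complement to the haplotype set."""
    haplotypes = list(known)
    unresolved = list(genotypes)
    progress = True
    while unresolved and progress:
        progress = False
        for g in list(unresolved):
            for h in haplotypes:
                k = complement(h, g)
                if k is not None:
                    if k not in haplotypes:
                        haplotypes.append(k)
                    unresolved.remove(g)
                    progress = True
                    break
    return haplotypes, unresolved


# Data from slide 6, starting from H = {h1}.
h1 = (0, 0, 1, 1, 0, 1, 1)
g1 = (0, "*", 1, "*", 0, 1, 1)
g2 = (0, 1, "*", 0, 0, 1, 1)
g3 = (0, 1, 0, "*", 0, 1, 1)
inferred, left = clark_inference([g1, g2, g3], [h1])
print(inferred, left)
```

On this input the sketch infers h2 from g1, h3 from g2, and then <0,1,0,1,0,1,1> from g3, leaving no genotype unresolved.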

  8. Types of inference rules
     • Clark’s rule: haplotypes solve g by an iterative rule
     • Gusfield coalescent model: haplotypes are related to genotypes by a tree model
     • Pedigree data: haplotypes are related to genotypes by a directed graph

  9. HI by the perfect phylogeny model
     • IDEA: genotypes are the mating of haplotypes in a tree
     • G: g1 = <0,1,*,*,1>, g2 = <*,0,0,0,1>
     • H: <0,1,0,1,1>, <0,1,1,0,1>, <1,0,0,0,1>, <0,0,0,0,1>, in a tree T rooted at 00000
     • Given G, find H and T that explain G!

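  10. Perfect phylogeny models
     • Input data: a 0-1 matrix A with characters c1, …, c5 as columns and species s1, …, s4 as rows
     • Output data: a phylogeny for A
     [Figure: matrix A and a tree with root R, edges labelled c1, …, c5 and leaves s1, …, s4; one root-to-leaf path is labelled c3 c4.]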

  11. Perfect phylogeny
     Def. A pp T for a 0-1 matrix A:
     • each row si labels exactly one leaf of T
     • each column cj labels exactly one edge of T
     • each internal edge is labelled by at least one column cj
     • row si gives the 0,1 path from the root to si: A[i,j] = 1 iff cj labels an edge on that path
     [Figure: tree with edges labelled c1, …, c5 and leaves s1, …, s4, with the path c3 c4 marked.]

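  12. pp model: another view
     • L(x), the cluster of a node x: the set of leaves of the subtree T_x rooted at x
     • A pp is associated to a tree-family (S,C) with S = {s1, …, sn} and C = {S’ ⊆ S : S’ is a cluster}, s.t. for X, Y in C, if X ∩ Y ≠ ∅ then X ⊆ Y or Y ⊆ X
     [Figure: tree with leaves s1, …, s4 and a node x with its cluster L(x).]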

  13. pp: another view
     A tree-family (S,C) is represented by a 0-1 matrix B:
     • each column ci corresponds to a set S’ in C: sj ∈ S’ iff b_ji = 1
     • for each set in C there is at least one column
     Lemma. A 0-1 matrix has a pp iff it represents a tree-family
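To make the lemma concrete, the sketch below reads the clusters off the columns of a 0-1 matrix, checks the tree-family condition, and links each cluster to its smallest strict superset, which gives the edges of a tree. It assumes the directed case (all-zero root) and ignores all-zero columns; the function name `cluster_tree` and the sample matrix are illustrative assumptions, not taken from the slides.

```python
def cluster_tree(matrix):
    """Return {cluster: parent cluster or None} if the columns of the 0-1
    matrix form a tree-family, otherwise return None.

    Clusters are frozensets of row indices; a parent of None means the
    cluster hangs directly below the root.
    """
    n = len(matrix)
    m = len(matrix[0]) if n else 0
    clusters = {frozenset(i for i in range(n) if matrix[i][j] == 1) for j in range(m)}
    clusters.discard(frozenset())            # an all-zero column defines no cluster
    ordered = sorted(clusters, key=len)      # smaller clusters first

    # Tree-family condition: two clusters are disjoint or nested.
    for a, x in enumerate(ordered):
        for y in ordered[a + 1:]:
            if x & y and not x <= y:
                return None

    # The parent of x is its smallest strict superset (unique in a tree-family).
    parent = {}
    for a, x in enumerate(ordered):
        parent[x] = next((y for y in ordered[a + 1:] if x < y), None)
    return parent


A = [
    [1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 0, 1],
    [0, 0, 1, 1, 0],
]
print(cluster_tree(A))
```

Duplicate columns collapse into one cluster here, which mirrors an edge labelled by more than one column.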

  14. Haplotyping by the pp
     A 0-1 matrix B represents the phylogenetic tree for a set H of haplotypes:
     • rows si: haplotypes
     • columns ci: SNP sites
     • the 0-1 switch in position i occurs only once in the tree (e.g. from 00000 to 01000)
     [Figure: tree rooted at 00000 with nodes 01000, 11000 and 01001.]

  15. Haplotyping and the pp: observations
     • The root of T may not be the haplotype 00000
     • 0-1 switch or 1-0 switch (directed case)
     [Figure: trees rooted at 00000 and at 00011 over haplotypes such as 01000, 11000, 01001, 01010 and 01100, with 0-1 and 1-0 switches marked on the edges.]

  16. HI problem in the pp model (DECISION problem)
     • Input data: a 0-1-* matrix B (n × m) of genotypes G
     • Output data: a 0-1 matrix B’ (2n × m) of haplotypes s.t.
       (1) each g ∈ G is solved by a pair of rows <h,k> in B’
       (2) B’ has a pp (tree family)
     [Figure: genotype rows such as 01*1*001*, 001*11*11 and 0000*1*1*.]

  17. An example
     Genotypes      Haplotypes
     a   * *        a   1 0
                    a’  0 1
     b   0 *        b   0 1
                    b’  0 0
     c   1 0        c   1 0
                    c’  1 0
     [Figure: tree with leaves a, a’, b, b’, c, c’.]
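The example can be reproduced with a brute-force search over all ways of resolving the * entries. This is only a sketch for tiny inputs: it is exponential in the number of heterozygous sites, it assumes the directed case with an all-zero root, and it tests the pp condition through the tree-family property of the columns. The names `expansions`, `is_tree_family` and `solve_pph` are hypothetical and this is not one of the published PPH algorithms.

```python
from itertools import product


def expansions(g):
    """Yield every unordered haplotype pair (h, k) that solves genotype g."""
    stars = [i for i, x in enumerate(g) if x == "*"]
    if not stars:
        yield tuple(g), tuple(g)
        return
    # Fix h = 0 at the first heterozygous site to avoid listing (k, h) again.
    for bits in product((0, 1), repeat=len(stars) - 1):
        h, k = list(g), list(g)
        for pos, b in zip(stars, (0,) + bits):
            h[pos], k[pos] = b, 1 - b
        yield tuple(h), tuple(k)


def is_tree_family(rows):
    """True iff the column clusters of the 0-1 rows are pairwise disjoint or nested."""
    m = len(rows[0])
    cols = [{i for i, r in enumerate(rows) if r[j] == 1} for j in range(m)]
    return all(
        not (x & y) or x <= y or y <= x
        for a, x in enumerate(cols)
        for y in cols[a + 1:]
    )


def solve_pph(genotypes):
    """Return 2n haplotype rows (two consecutive rows per genotype) whose
    matrix has a pp, or None if no resolution of the * entries works."""
    for choice in product(*(list(expansions(g)) for g in genotypes)):
        rows = [hap for pair in choice for hap in pair]
        if is_tree_family(rows):
            return rows
    return None


# Genotypes a, b, c from the example slide.
print(solve_pph([("*", "*"), (0, "*"), (1, 0)]))
```

On the three genotypes of the example the search returns the same haplotypes as the slide, up to the order within each pair.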

  18. The pph problem: solutions
     • An undirected algorithm (Gusfield, RECOMB 2002)
     • An O(nm^2) algorithm (Karp et al., RECOMB 2003)
     • A linear-time O(nm) algorithm, i.e. an optimal algorithm: open (??)
     A related problem: the incomplete directed pp (IDP), inferring a pp from a 0-1-* matrix
     • O(nm + k log^2(n+m)) algorithm (I. Pe’er, T. Pupko, R. Shamir, R. Sharan, SIAM 2004)

  19. IDP problem
     • Instance: a 0-1-? matrix A
     • Solution: resolve each ? into 0 or 1 to obtain a matrix A’ and a pp for A’, or say “no pp exists”
     • OPEN PROBLEM: find an optimal algorithm ??
     Example (the slide fills in the ? entries step by step; left: instance A, right: completed A’):
          C1 C2 C3 C4 C5            C1 C2 C3 C4 C5
     S1    1  ?  0  0  1       S1    1  0  0  0  1
     S2    ?  ?  0  1  0       S2    1  1  0  1  0
     S3    ?  0  1  ?  ?       S3    1  0  1  0  1

  20. Decision algorithms for incomplete pp
     Based on characterizations of a 0-1 matrix A that has a pp:
     • Tree family
     • Forbidden submatrix, which gives a “no” certificate: two columns c, c’ and three rows s1, s2, s3 that, restricted to c and c’, read 10, 01 and 11 (in some order)
     • Bipartite graph G(A) = (S,C,E) with E = {(si,cj) : bij = 1}; the forbidden submatrix corresponds to a forbidden subgraph
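The “no” certificate can be searched for directly. The sketch below scans every pair of columns for rows reading 10, 01 and 11; it runs in O(nm^2) time, assumes the directed case (all-zero root), and the function name `find_forbidden_submatrix` is a hypothetical choice.

```python
from itertools import combinations


def find_forbidden_submatrix(matrix):
    """Return ((c, c2), (r10, r01, r11)) for the first pair of columns that
    contains rows reading 10, 01 and 11, or None if no such pair exists."""
    n = len(matrix)
    m = len(matrix[0]) if n else 0
    for c, c2 in combinations(range(m), 2):
        witness = {}
        for i in range(n):
            pair = (matrix[i][c], matrix[i][c2])
            if pair in ((1, 0), (0, 1), (1, 1)):
                witness.setdefault(pair, i)   # remember the first row of each kind
        if len(witness) == 3:
            return (c, c2), (witness[(1, 0)], witness[(0, 1)], witness[(1, 1)])
    return None


print(find_forbidden_submatrix([[1, 0], [0, 1], [1, 1]]))
```

A result of None means the two clusters of every column pair are disjoint or nested, so the matrix represents a tree-family.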

  21. Test: does a 0-1 matrix A have a pp?
     • O(nm) algorithm (Gusfield 1991)
     Steps:
     • Given A, order the columns {c1, …, cm} as decreasing binary numbers, obtaining A’
     • For each (i,j) with A’[i,j] = 1, let L(i,j) = max{l < j : A’[i,l] = 1} (L(i,j) = 0 if there is no such l)
     • Let index(j) = max{L(i,j) : A’[i,j] = 1}
     • Then apply the theorem
     TH. A’ has a pp iff L(i,j) = index(j) for each (i,j) s.t. A’[i,j] = 1
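The steps above can be sketched as follows. Two simplifications are assumptions of this sketch rather than part of the slide: columns are ordered with Python's comparison sort, so the running time is O(nm log m) instead of the O(nm) obtained with radix sort, and duplicate columns are merged before the test. The function name `has_perfect_phylogeny` is hypothetical.

```python
def has_perfect_phylogeny(matrix):
    """Directed pp test in the style of the slide: sort columns as decreasing
    binary numbers, then require L(i, j) to be the same for all rows i that
    have a 1 in column j."""
    n = len(matrix)
    m = len(matrix[0]) if n else 0
    columns = {tuple(row[j] for row in matrix) for j in range(m)}
    # Tuples compare lexicographically, i.e. as binary numbers with row 0
    # as the most significant bit.
    ordered = sorted(columns, reverse=True)

    last_one = [0] * n   # per row: 1-based index of the latest column with a 1
    for j, col in enumerate(ordered, start=1):
        rows = [i for i in range(n) if col[i] == 1]
        # L(i, j) for each row with a 1 in column j; they must all agree
        if len({last_one[i] for i in rows}) > 1:
            return False
        for i in rows:
            last_one[i] = j
    return True


A = [
    [1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 0, 1],
    [0, 0, 1, 1, 0],
]
print(has_perfect_phylogeny(A))
print(has_perfect_phylogeny([[1, 0], [0, 1], [1, 1]]))
```

Requiring all L(i,j) in a column to agree is the same as requiring each of them to equal index(j), since index(j) is their maximum.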

  22. Idea:

  23. The IDP algorithm
     [Figure: columns c, c’ and rows s1, s2, s3.]

  24. Other HI problems via the pp model
     • Incomplete 0-1-*-? matrix because of missing data:
       incomplete haplotype pp (Ihpp): haplotype rows
       incomplete genotype pp (Igpp): genotype rows
     Algorithms:
     • Ihpp = IDP given a row as a root (polynomial time); NP-complete otherwise
     • Igpp has a polynomial solution under the rich data hypothesis (Karp et al., RECOMB 2004, ICALP 2004); NP-complete otherwise

  25. HI problem and other models
     • Haplotype inference in pedigree data under the recombination model
     [Figure: maternal and paternal haplotypes recombining into a child haplotype.]

  26. Pedigree graph
     • Single-mating pedigree tree
     • Pedigree graph with a mating loop
     • Nuclear family: father, mother, child

  27. Haplotype inference in pedigree
     [Figure: pedigree whose members carry genotypes such as 01, 11, 10, resolved into paternal|maternal allele pairs such as 0|1, 1|1, 1|0.]

  28. Problems:
     • MPT-MRHI (pedigree tree, multi-mating, minimum recombination HI)
     • SPT-MRHI (pedigree tree, single-mating, minimum recombination HI)
     NP-complete even if the graph is acyclic, but with an unbounded number of children… OPEN

  29. Conclusions

  30. References