490 likes | 692 Views
Spooky halloween haplotype assembly. CSCI2820: Medical Bioinformatics Derek Aguiar. http://paradigmthrift.blogspot.com/2010/06/here-kitty-kitty.html. Important bits from last class. Data generation and preparation Sequencing, quality control, read alignment to a reference genome
E N D
Spooky halloween haplotype assembly CSCI2820: Medical Bioinformatics Derek Aguiar http://paradigmthrift.blogspot.com/2010/06/here-kitty-kitty.html
Important bits from last class • Data generation and preparation • Sequencing, quality control, read alignment to a reference genome • Problem optimizations • Minimum fragment removal • remove minimum number of vertices in the fragment conflict graph to make it bipartite Haplotype A: 1 0 0 1 − 0 1 0 − − 0 0 0 1 − 0 − 1 − 1 1 Haplotype B: 0 1 1
Important bits from last class • Problem optimizations • Minimum SNP removal • Maximum independent set in the SNP conflict graph • Minimum error correction • Correct the minimum number of base call errors such that the fragment conflict graph is bipartite Schwartz 2010
Important bits from last class • Levy et al. 2007 algorithm • Greedily seed haplotypes • Extend haplotype seeds with maximum overlapping fragments • Iterative refinement in terms of variants and reads
Important bits from last class • HapCUT algorithm • Compute large cuts in a graph which correspond to improvements in the haplotype solution • Recall: Graph cuts • Partition of the vertices of a graph into two disjoint subsets A and B such that at least one edge exists from a vertex in set A to a vertex in set B s5 s1 s7 s3 s4 s6 s2
Today • HapCompass • Evaluating the quality of a haplotype assembly • Polyploid and cancer haplotype assembly
Devising a new graph model • What are the disadvantages of using reads for vertices (fragment conflict graph) ? • Number of reads >> number of variants • What are the disadvantages of using a local maximum cut approximation step (hapCUT) ? • Maximum cut difficult to approximate and expensive computation, best known approximation ratio is 0.878 (and not practical implementation)
The compass graph GC • Vertices are variants and edges connect vertices spanned by reads • Edges are phased based on overlapping read content • Edge weights: w(si,sj) is the difference between 00 or 11 fragments and 01 or 10 fragments. • Note: Because each edge represents the phasing between two SNPs, and a SNP has exactly two alleles, there are only two possible phasings: or • ( ) 00 11 • ( ) 01 10
Key properties of compass graphs Compass Graph s6 -2 -2 s5 3 4 s1 s3 -2 2 -3 s2 s4 -2
Key properties of compass graphs Compass Graph s6 (I) If w(si,sj) > 0 then the phasing is else if w(si,sj) < 0 then the phasing is -2 • ( ) 00 11 -2 s5 3 • ( ) 4 01 10 s1 s3 -2 2 -3 s2 s4 -2
Key properties of compass graphs Compass Graph s6 10 01 (II) Unique pairwise phasings of edges can be extended to paths, that is, the phasing is transitive among the SNPs along a path. E.g. the path (s6,s5,s3) defines the phasing -2 -2 s5 00 11 3 4 • ( ) s1 s3 100 011 -2 2 -3 s2 s4 -2
Key properties of compass graphs Compass Graph s6 -2 Lemma A cycle with an odd number of negative weight edges is conflicting. (III) A non-conflicting cycle is a simple cycle that has a unique phasing for any path. (IV) A conflicting cycle is a simple cycle that does not. -2 s5 3 4 s1 s3 -2 2 -3 s2 s4 -2
Intuition – Cycling through trees • View cycles as vectors of edges • Cycle space is the collection of all cycles • Cycle basis is maximal set of linearlyindependent cycles v8 v7 v4 v1 v3 v5 v6 v2
Optimization • Minimum weighted edge removal (MWER): Given a compass graph G, remove a set of edges of minimum weight such that G is conflict-free.
Basic HapCompass Algorithm Compass Graph Compute simple cycles s6 -2 -2 s5 3 4 s1 s3 -2 2 -3 s2 s4 -2
Basic HapCompass Algorithm Compass Graph Remove an edge from a conflicting cycle s6 -2 -2 s5 3 4 s1 s3 -2 2 -3 s2 s4 -2
Basic HapCompass Algorithm Compass Graph If graph still has conflicts, iterate. s6 -2 -2 s5 3 4 s1 s3 2 -3 s2 s4 -2
Phasings of compass graphs Conflict-free Compass Graph s6 -2 Example spanning tree Theorem Every spanning tree of a compass graph is a conflict-free graph. Every spanning tree of a conflict-freecompass graph has the same unique phasing. SNPs -2 s5 s1 s2 s3 s4 s5 s6 edges {s1,s3} {s1,s2} {s5,s6} {s3,s4} {s3,s5} 1 1 3 1 1 4 s1 s3 1 0 1 0 2 -3 1 1 haplotype 1 1 1 0 1 0 s2 s4 -2
HapCompass Algorithm (diploid) input: Sequence reads, variant calls output: 2 haplotypes GC← maximum spanning tree cycle basis CC ← set of conflicting cycles in respect to GC forcinCCdo: resolve(c) GC← rebuild() output: haplotype assembly according to maximum weight spanning tree of Gg Lemma The graph Gcremains connected after each resolve(c). Runs in O(m(m-n+1)2+ (m-n+1)(m log n)) time, m=|E|, n=|V|
Advanced local step – set cover Compass Graph Two conflicting cycles s6 -2 -2 s5 3 4 s1 s3 -2 2 -3 s2 s4 -2
Set cover 2 4 3 2 s3 s1 s3 s2 s4 s2 2
The haplotype assembly workflow DNA sequencing read alignment; retain SNPs computational modeling of reads and sequencing errors problem optimization haplotypes
Evaluation • How can we evaluate a haplotype assembly? true haplotypes computed haplotypes
Evaluation • Error type I: disconnected assemblies computed haplotypes 4 s5 s1 4 8 s3 s4 8 9 s6 s2 9 7 true haplotypes
Evaluation • Error type II: base call error computed haplotypes 4 s5 s1 4 7 s3 s4 -1 9 s6 s2 9 1 true haplotypes
Evaluation • Error type III: phase switch computed haplotypes 4 s5 s1 4 -1 s3 s4 8 9 s6 s2 9 7 true haplotypes
Evaluation s99 • Error type IV: bad alignment computed haplotypes 1 4 s5 s1 4 3 s3 s4 8 9 s6 s2 9 7 true haplotypes
evaluation • 6 main metrics for evaluating accuracy of haplotypes • (1) Completeness • Error(s) type I • Not all algorithms use all of the data • However, longer haplotypes does not necessarily mean better
evaluation • 6 main metrics for evaluating accuracy of haplotypes • (2) Minimum error correction • Error(s) type I, II, III, IV
minimum error correction • Counts the number of bit flips in the reads to construct the haplotypes • The distance d(r,h) between a read r and a haplotype h is just the Hamming distance (number of positions they differ) • Define the distance d of a read r and the computed haplotypes h and h’: d(r,h,h’)=min(d(r,h),d(r,h’)) • So, the total distance is the sum over all read fragments
evaluation • 6 main metrics for evaluating accuracy of haplotypes • (3) Haplotype reconstruction rate • Error(s) type I, II
reconstruction rate • Using the same distance function now counting the distance between two haplotypes, the reconstruction rate is the percentage of correctly computed bases • Requires knowledge of the true haplotypes • Given the computed haplotypes h and h’, known haplotypes g and g’, and n SNPs
evaluation • 6 main metrics for evaluating accuracy of haplotypes • (4) Haplotype switch error • Error(s) type II, III, IV
haplotype switch error • Counts the number of haplotype switches when comparing computed haplotypes to true haplotypes • Requires knowledge of true haplotypes • Was originally developed for haplotype phasing and not well suited for haplotype assembly assembled haplotypes true haplotypes SE = 4 1 1 0 1 – – 1 0 1 1 0 0 1 0 – – 0 1 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0
evaluation • 6 main metrics for evaluating accuracy of haplotypes • (5) Fragment mapping phase relationship • Error(s) type I, II, III, IV • Fragment mapping phase relationship (FMPR) counts the number of pairwise phase relationships in the reads that are not present in the solution 1 1 1 0 1 1 1 0 1 1 1 1 1 1 1 1 1 0 FMPR = 2 or 2/6 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 true haplotypes
evaluation • 6 main metrics for evaluating accuracy of haplotypes • (6) Boolean fragments mapped • Error(s) type I, II, III, IV • Boolean fragment mapping (BFM) counts the percentage of fragments that map with errors 1 1 1 1 1 0 BFM = 1 or 1/2 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 true haplotypes http://b3ta.hnldesign.nl/519-muhahaha-lolshark.html
Polyploidy k=2 k=4,6,8 k=3 k=4 k=4
GC haplotype phasing ## 11 11 01 11 01 10 00 00 10 00 2 ## v3 v0 v1 v2
G`C haplotype phasing ## 10 11 11 11 11 ## 01 00 00 00 00 ## v3 v0 v1 v2 00 00 00 00 00
Edge decidability • New to the haplotype assembly model • Phasings of edges are no longer disjoint triploid diploid 11 01 11 01 00 10 00 10 00 00 shared haplotypes={} shared haplotypes={00}
Polyploid edge decidability • Edge phasing is decided by maximum likelihood
Probability of read • Calculating P(r | se, pi)? • Assume read may be generated from any of the haplotypes of pi with equal probability.
Disjoint paths • Given k source and sink pairs in a graph, find k pairwise disjoint paths between each source-sink pair that are • node disjoint • edge disjoint • If k is not part of the input, NP-complete (hard) • If k is part of the input, polynomial time solutions exist (although not practical)
2 v3 v1 v2 GC subgraph(no conflict) The chain graph is a trellis graph. 11 11 11 v1v2 v2v3 v3v1 00 00 00 s1 t1 The chain graph 11 11 11 • Vertex for each 2 variant phased haplotype organized in levels • Edges connect valid haplotype extensions (shared variant allele) Disjoint paths in trellis graphs are polynomially solvable and algorithms are fast in practice. 00 00 00 s0 t0
GC subgraph(conflict) v0 v1 v2 v0v1v1v2v2v0 01 11 11 10 00 00 s1 t1 The chain graph 10 11 11 • Vertex for each 2 variant phased haplotype organized in levels • Edges connect valid haplotype extensions (shared variant allele) 01 00 00 s0 t0