HapCompass: A Fast Cycle Basis Algorithm for Accurate Haplotype Assembly of Sequence Data. By Derek Aguiar and Sorin Istrail (Brown University). Journal of Computational Biology, June 2012. Presented by KWOK Tsz Piu (Bill), 19/12/2013.
Introduction • Genetic variation is present in the form of single nucleotide polymorphisms (SNPs), insertions/deletions, inversions, translocations, copy number variations, etc. • SNPs are abundant in the human genome, and high-throughput genotyping technologies have been developed to measure them • As a result, SNPs have become the marker of choice for understanding human genetic variation.
Introduction • The human genome contains a pair of DNA sequences, one from each parent, called haploid sequences or haplotypes • Haplotypes differ in SNP/insertion/deletion… • SNPs are single-bp mutations (~0.1% of positions; non-uniformly distributed) • SNP positions contain one of two possible alleles … ataggtccCtatttcgcgcCgtatacacgggActata … … ataggtccGtatttcgcgcTgtatacacgggTctata … … ataggtccCtatttcgcgcCgtatacacgggTctata …
Haplotypes and Genotypes • Haplotype: description of SNP alleles on a chromosome • 0 for major allele, 1 for minor • Diploids: two homologous copies of each autosomal chromosome • One inherited from the mother and one from the father • Genotype: description of alleles on both chromosomes • 0 - both chromosomes contain the major allele; • 1 - both chromosomes contain the minor allele; • 2 - the chromosomes contain different alleles • Example: genotype 021200210 = haplotypes 011000110 + 001100010 (one genotype and two haplotypes per individual)
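The encoding on this slide can be checked with a minimal Python sketch. The function name `genotype_from_haplotypes` is hypothetical and only illustrates the 0/1/2 coding described above; it is not taken from the paper.

```python
def genotype_from_haplotypes(h1: str, h2: str) -> str:
    """Combine two haplotype strings (0 = major, 1 = minor allele)
    into a genotype string using the slide's coding:
    0 = both major, 1 = both minor, 2 = alleles differ."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must cover the same SNPs")
    out = []
    for a, b in zip(h1, h2):
        # Homozygous sites keep the shared allele; heterozygous sites become 2.
        out.append(a if a == b else "2")
    return "".join(out)


print(genotype_from_haplotypes("011000110", "001100010"))
```

Applied to the two haplotypes on the slide, this reproduces the genotype 021200210. Note that the mapping is not invertible: every site coded 2 can be phased two ways, which is why haplotypes have to be assembled from reads.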
Goal of Haplotype assembly • Reconstruct the two haplotypes from the aligned sequence fragments
Goal of Haplotype assembly • Sequence reads are sampled from haploid fragments
Gene‐Disease Association Studies • Haplotypes increase power of association
Haplotype assembly problem • If the sequenced reads contain no errors, the correct haplotype assembly is unique. • With real data, the problem becomes finding the haplotype assembly that optimizes a chosen objective function • E.g., minimize the number of conflicts with the sequenced reads (minimum error correction, MEC).
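As a rough illustration of the MEC objective, the sketch below counts conflicts between a candidate phasing and a set of reads. It assumes every SNP is heterozygous, so the second haplotype is the complement of the first, and it represents each fragment as a hypothetical dict from SNP index to observed allele; neither choice comes from the slides.

```python
def mec_score(haplotype, fragments):
    """Number of allele calls that must be corrected so that every
    fragment agrees with one of the two complementary haplotypes."""
    total = 0
    for frag in fragments:
        # Mismatches against the first haplotype.
        mismatches = sum(1 for pos, allele in frag.items()
                         if haplotype[pos] != allele)
        # Mismatches against the complement are the remaining positions.
        total += min(mismatches, len(frag) - mismatches)
    return total


h = [0, 1, 1, 0]
reads = [
    {0: 0, 1: 1},        # matches h exactly
    {1: 0, 2: 0, 3: 1},  # matches the complement of h exactly
    {0: 0, 1: 0, 2: 1},  # one conflicting call
]
print(mec_score(h, reads))
```

Each fragment is charged the smaller of its two mismatch counts because a read may originate from either chromosome copy.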
Compass Graph • Weight = number of 00/11 phasings – number of 01/10 phasings • Positive => suggests the 00/11 phasing • Negative => suggests the 01/10 phasing • Zero (or small absolute value) => both phasings are equally plausible.
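A minimal sketch of how such edge weights could be accumulated from reads, assuming the weight of an edge is the number of fragments in which the two SNPs carry the same allele (00 or 11) minus the number in which they carry different alleles (01 or 10). The fragment representation (dict from SNP index to allele) and the name `compass_weights` are illustrative assumptions.

```python
from collections import defaultdict
from itertools import combinations


def compass_weights(fragments):
    """Return {(i, j): weight} for every SNP pair i < j that is
    covered together by at least one fragment."""
    weights = defaultdict(int)
    for frag in fragments:
        # Every pair of SNPs on the same fragment votes for one phasing.
        for (i, a), (j, b) in combinations(sorted(frag.items()), 2):
            weights[(i, j)] += 1 if a == b else -1
    return dict(weights)


reads = [{1: 0, 2: 0}, {1: 1, 2: 1}, {1: 0, 2: 1}, {2: 0, 3: 1}]
print(compass_weights(reads))
```

In this toy input, SNPs 1 and 2 receive two votes for the same-allele phasing and one against, giving weight 1, while SNPs 2 and 3 receive a single vote for the opposite-allele phasing, giving weight -1.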
Properties of compass graph • There is a unique phasing between two SNPs si and sj if and only if for any two simple edge-disjoint paths p and q in GC between si and sj, the number of negative edges of p plus the number of negative edges of q is even, and p and q include no 0-weight edges. • S1->S2->S4 • S1->S3->S4
Definitions • A conflicting cycle is a simple cycle that: • Contains an odd number of negative edges • Or has at least one 0-weight edge • A compass graph GC with no conflicting cycles is happy • A happy graph can be uniquely phased • We can observe that • Every spanning tree of a compass graph is a happy graph, since a tree contains no cycles
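One way to test happiness without enumerating cycles is the standard signed-graph balance check: after dropping 0-weight edges, try to give each SNP a phase bit so that positive edges join equal bits and negative edges join different bits. This is a swapped-in technique used here for illustration, not the cycle-basis procedure from the paper, and the name `phase_if_happy` is hypothetical.

```python
from collections import defaultdict, deque


def phase_if_happy(edges):
    """edges: {(u, v): weight}. 0-weight edges are ignored.
    Returns {snp: 0 or 1} if no cycle has an odd number of negative
    edges, otherwise None."""
    adj = defaultdict(list)
    for (u, v), w in edges.items():
        if w != 0:
            adj[u].append((v, w < 0))
            adj[v].append((u, w < 0))
    phase = {}
    for start in list(adj):
        if start in phase:
            continue
        phase[start] = 0  # each component is phased relative to its first SNP
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v, flip in adj[u]:
                expected = phase[u] ^ int(flip)
                if v not in phase:
                    phase[v] = expected
                    queue.append(v)
                elif phase[v] != expected:
                    return None  # an odd-negative cycle exists
    return phase


print(phase_if_happy({(1, 2): 3, (2, 4): -2, (1, 3): 4, (3, 4): -1}))
print(phase_if_happy({(1, 2): 3, (2, 4): 2, (1, 3): 4, (3, 4): -1}))
```

The first graph has two negative edges on its only cycle and can be phased; the second has one negative edge on that cycle, so the two paths between SNP 1 and SNP 4 disagree and the function returns None.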
Problem formulations • Target: • Remove conflicting cycles with Minimum weighted edge removal (MWER)
Algorithm 1 • 1. Remove all 0-weight edges from GC. • 2. Construct a maximum spanning tree T. • 3. Mark all conflicting cycles. • 4. Repeat 4.1 and 4.2 until GC is happy: • 4.1 Randomly select a conflicting cycle and remove the edge e on that cycle whose weight is closest to 0. • 4.2 Re-mark the conflicting cycles. • 5. Output the phasing corresponding to any spanning tree of GC. • With m = |Ec| and n = |Vc|, time complexity: O(m(m - n + 1)^2 + (m - n + 1)(m log n))
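The steps of Algorithm 1 can be sketched in Python as follows. This is a simplified illustration under several assumptions that are not stated on the slide: the spanning tree maximizes absolute edge weight, conflicting cycles are found as the fundamental cycles of that tree, the tree is rebuilt after every removal, and edge keys are sorted tuples `(u, v)` with `u < v`. All function names are hypothetical.

```python
import random
from collections import defaultdict, deque


def max_spanning_forest(nodes, g):
    """Kruskal's algorithm on |weight|, heaviest edges first."""
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    tree = []
    for (u, v), _ in sorted(g.items(), key=lambda kv: -abs(kv[1])):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            tree.append((u, v))
    return tree


def tree_path(adj, src, dst):
    """Edges on the unique tree path from src to dst (BFS)."""
    prev = {src: None}
    queue = deque([src])
    while queue:
        x = queue.popleft()
        if x == dst:
            break
        for y in adj[x]:
            if y not in prev:
                prev[y] = x
                queue.append(y)
    path, x = [], dst
    while prev[x] is not None:
        path.append(tuple(sorted((x, prev[x]))))
        x = prev[x]
    return path


def make_happy(edges, seed=0):
    rng = random.Random(seed)
    g = {e: w for e, w in edges.items() if w != 0}        # step 1
    while True:
        nodes = {v for e in g for v in e}
        tree = max_spanning_forest(nodes, g)              # step 2
        tree_set, adj = set(tree), defaultdict(list)
        for u, v in tree:
            adj[u].append(v)
            adj[v].append(u)
        conflicting = []                                  # steps 3 and 4.2
        for e in g:
            if e not in tree_set:
                cycle = tree_path(adj, e[0], e[1]) + [e]
                if sum(g[c] < 0 for c in cycle) % 2 == 1:
                    conflicting.append(cycle)
        if not conflicting:
            return g                                      # graph is happy
        cycle = rng.choice(conflicting)                   # step 4.1
        del g[min(cycle, key=lambda c: abs(g[c]))]


print(make_happy({(1, 2): 5, (2, 4): 3, (1, 3): 4, (3, 4): -1, (2, 3): 0}))
```

Checking only the fundamental cycles is sufficient here because the parity of negative edges is additive over the cycle space: if every fundamental cycle has an even count, so does every other cycle. In the toy graph, the cycle through S1, S2, S4, S3 has a single negative edge, so its weakest edge (3, 4) is removed.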
Improvement • Idea: • Prefer removing edges that lie on multiple conflicting cycles • Formulate the problem as a set cover problem: • Sets: edges (each edge is the set of conflicting cycles it lies on) • Elements: conflicting cycles • Target: find a collection of edges (sets) of minimum total weight that covers all of the conflicting simple cycles (elements) • Example: Universe = {1, 2, 3, 4, 5} (5 elements), Sets = {{1, 2, 3}, {2, 4}, {3, 4}, {4, 5}}, Best = {{1, 2, 3}, {4, 5}}
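The toy instance on this slide can be solved with the classic greedy heuristic for weighted set cover, sketched below. Greedy selection is a well-known approximation and is used here only as an illustration; the slide does not say which set cover method the authors apply, and `greedy_set_cover` is a hypothetical name. With no weights given, every set costs 1.

```python
def greedy_set_cover(universe, sets, weights=None):
    """Return indices of chosen sets. At each step pick the set with the
    lowest weight per newly covered element."""
    if weights is None:
        weights = [1.0] * len(sets)
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best, best_ratio = None, None
        for i, s in enumerate(sets):
            gain = len(uncovered & s)
            if gain == 0:
                continue
            ratio = weights[i] / gain
            if best_ratio is None or ratio < best_ratio:
                best, best_ratio = i, ratio
        if best is None:
            raise ValueError("universe cannot be covered by the given sets")
        chosen.append(best)
        uncovered -= sets[best]
    return chosen


sets = [{1, 2, 3}, {2, 4}, {3, 4}, {4, 5}]
picked = greedy_set_cover({1, 2, 3, 4, 5}, sets)
print([sets[i] for i in picked])
```

On the slide's example the heuristic first takes {1, 2, 3}, which covers three elements, and then {4, 5}, which covers the remaining two, matching the stated best cover. In the haplotype setting, each set would be an edge, its weight the absolute edge weight, and its elements the conflicting cycles that contain it.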
Results • Real data: 1000 Genomes Project data, chr 22 of NA12878 • FMPR: number of mismatches between each fragment and the haplotypes • BFM: number of fragments that do not perfectly match the haplotypes • Block size = number of SNPs
Results • Simulated data: • Chr 22, NA12878 • 10M simulated reads, error rate = 0.05, read length = 100bp
Conclusion • Haplotype assembly is becoming increasingly important • The cost of sequencing is decreasing • More genome-wide and whole-exome studies are being conducted • A new haplotype assembly algorithm • New formulation of the graph • Some useful observations that make the algorithm work • Quality of SNP calls and sequence base call scores will be included in the future.