1 / 25

A n MCMC algorithm for haplotype assembly from whole-genome sequence data

A n MCMC algorithm for haplotype assembly from whole-genome sequence data. Vikas Bansal , Aaron L. Halpern, Nelson Axelrod, et al. Genome Research. 2008 Aug Presented by KWOK Tsz Piu (Bill) 5 /9/2013. Introduction.

mort
Download Presentation

A n MCMC algorithm for haplotype assembly from whole-genome sequence data

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. An MCMC algorithm for haplotype assembly from whole-genome sequence data VikasBansal, Aaron L. Halpern, Nelson Axelrod, et al. Genome Research. 2008 Aug Presented by KWOK TszPiu (Bill) 5/9/2013

  2. Introduction • Genetic variation is present in the form of single nucleotide polymorphisms(SNPs), insertions/deletions, inversions, translocations, copy number variations, etc. • The abundance of SNPs in human genome and the development of high-throughput genotyping technologies have made SNPs the marker of choice for understanding human genetic variation.

  3. Introduction • Human genome contains a pair of DNA sequences : one from each parent called haploid sequences or haplotypes • Haplotypes differ at SNP/insertion/deletion… positions • SNPs (single nucleotide polymorphism) are single bpmutations (~0.1%; non-uniform) • SNP positions contain one of two possible alleles … ataggtccCtatttcgcgcCgtatacacgggActata … … ataggtccGtatttcgcgcCgtatacacgggTctata … … ataggtccCtatttcgcgcCgtatacacgggTctata …

  4. Haplotypes and Genotypes • Haplotype: description of SNP alleles on a chromosome • 0/1 vector: 0 for major allele, 1 for minor • Diploids: two homologous copies of each autosomal chromosome • One inherited from mother and one from father • Genotype: description of alleles on both chromosomes • 0/1/2 vector: 0 (1) - both chromosomes contain the major (minor) allele; 2 - the chromosomes contain different alleles 021200210 011000110 001100010 genotype + two haplotypes per individual

  5. Gene‐Disease Association Studies • Haplotypes increase power of association

  6. Goal of Haplotype assembly problem • Goal of haplotype assembly(phasing)is to use aligned sequence fragments to reconstruct the two haplotypes

  7. Haplotype assembly problem • For haplotype assembly to be feasible • High sequence coverage • Reads to be long enough to span multiple variant sites • However, even with paired ends, it is not possible to link all variants • A haplotype assembly for a diploid genome is a collection of haplotype segments.

  8. Haplotype assembly problem • In the absence of error in sequenced read, the correct haplotype assembly is unique. • In the real case, the problem become finding the haplotype assembly that optimizes a certain objective function • E.g., minimize the number of conflicts with the sequenced reads. (MEC)

  9. Input • Each sequenced read is mapped to the reference genomic sequence • For a variant, reads with sequence matching the consensus are assigned as 0, otherwise 1 • Each fragment i is represented by a ternary string Xi = {0, 1, -}n • ‘-’ corresponds to the heterozygous loci not covered by the fragment

  10. Error • For the jthvariant site of fragment i, • Quality scores si[j] • Which can be converted to error probability qi[j],

  11. Haplotype • Let H = (h, ) represent the unordered pair of haplotypes • where h is a binary string of length n • is the bitwise complement of h • i.e., h[j] = 1 - h[j]. • h = 100100... • = 011011...

  12. Probability X = input matrix H = H(h,) = pair of haplotype q = error probability matrix • Target: • By Bayes’ rule: • Assume variant calls are independent • δ(Xi[j],h[j]) = 1 if Xi[j] = h[j] and 0 otherwise

  13. Probability X = input matrix H = H(h,) = pair of haplotype q = error probability matrix • Assume each fragment is randomly generated from one of the two haplotypes • Assume fragments are independently generated

  14. MCMC algorithm • States of the Markov chain are the set of possible haplotypes • Transitions of the Markov chain are governed by subsets S of columns of the fragment matrix X.

  15. MCMC algorithm • Γ = {S1, S2, . . . , Sk} is a collection of subsets of columns of X • for each state H, there are k + 1 possible moves to choose from, including the self-loop. • E.g., Γ = {{3, 4, 5, 11}, {1, 2, 3, 4}, … } • Algorithm 1:

  16. MCMC algorithm • What should be Γ? • A natural choice for Γis Γ1= {{1},{2}, . . . , {n}}. • However, the mixing time of it grows exponentially with d, the depth of coverage • Choose by a graph-partitioning approach

  17. A graph-partitioning approach • Node: Variant site • Weight of Edge: +1 between two nodes if the phase suggested by the fragment is consistent with the current haplotype assembly, -1 if not.

  18. A graph-partitioning approach

  19. A graph-partitioning approach • Hence, a cut with low weight corresponds to a subset of columns whose current phase with respect to the rest of the columns is inconsistent with the fragment matrix • Partition the graph G(X) into 2 set of node using min-cut algorithm and add the two subsets to Γ. • Apply the same procedure recursively.

  20. MCMC algorithm • The collection of subsets is dependent upon the haplotype pair H. • Final Algorithm HASH (haplotype assembly for single human) Γ1 = {{1},{2}, . . . , {n}}.

  21. Result • Source: Human genome (HufRef genome assembly) • ~32 million reads, 7.5 coverage • ~1.8 million heterozygous variants • E.g., Chromosome 22 • 24,967 heterozygous variants • 103,356 rows (53,279 links two or more variants) • Partitioned into 607 disjoint haplotypes

  22. Result

  23. Result

  24. Summary • Describe a MCMC algorithm for haplotype assembly that samples haplotypes • Given a list of all heterozygous variants and sequenced reads mapped to a genome assembly • Graph-partitioning approach to make the Markov chain more efficient

  25. Thank you!

More Related