680 likes | 1.15k Views
Haplotype assembly. CSCI2820: Medical Bioinformatics Derek Aguiar. Outline. Haplotype assembly workflow Data generation and preparation Error modeling Problem optimizations Algorithms HapCUT Levy et al. 2007 HapCompass Polyploidy and cancer. Haplotype reconstruction. solution space S
E N D
Haplotype assembly CSCI2820: Medical Bioinformatics Derek Aguiar
Outline • Haplotype assembly workflow • Data generation and preparation • Error modeling • Problem optimizations • Algorithms • HapCUT • Levy et al. 2007 • HapCompass • Polyploidy and cancer
Haplotype reconstruction solution space S k-ploidy n 2’s Sequence reads or genotypes Haplotype assembly Haplotype phasing Haplotypes
The haplotype assembly workflow DNA sequencing read alignment; retain SNPs computational modeling of reads and sequencing errors problem optimization haplotypes
The haplotype assembly workflow DNA sequencing read alignment; retain SNPs computational modeling of reads and sequencing errors problem optimization haplotypes
DNA ...A A C G T C A G T C A G C C T... 5.4 thousand bases 150 billion bases 3.2 billion bases
Variation structural variation Single nucleotide variant (SNV)or polymorphism (SNP) del/no del alleles C/T alleles http://en.wikipedia.org/wiki/Single-nucleotide_polymorphism https://sites.google.com/site/lifesciencesinmaine/5-cell-division-reproduction-and-dna
Genotypes and haplotypes Genotype Haplotype ACGCCA ACGCCA CCTTAA {A,C} {C} {G,T} {C,T} {C,A} {A} TGCGGT CCTTAA GGAATT 000001 101111 2 0 2 2 2 1
High-throughput Sequencing • Also termed next-generation sequencing • Illumina • 454 • SOLiD • DNA is fractured, amplified, fixated onto an array, bases are added • Single molecule or 3rd generation technologies Source of bias Error signature Short reads: ~50-200bp (454 can get up to 1kb) Generally more error prone than Sanger Extremely fast and parallel
Wetterstrand KA. DNA Sequencing Costs: Data from the NHGRI Genome Sequencing Program (GSP) Available at: www.genome.gov/sequencingcosts. Accessed April 2013.
Illumina FASTQ format • Four lines per sequence • identifier line • sequence line • + line • base quality line
Illumina File Snippet • @D4ZHLFP1:10:C03D7ACXX:2:2305:7378:90088 2:N:0:ATCACG • CGTGACAATATTTAGCTTTAGATGGAGTTTACCACCCATTTTGGGCTGCATTCCCAAACAACCCGACTCGTCGAAAGCACATCGTGAACGACCGAGGTCGC • + • BBBDDFFFFHGHH@D@GHIGIFIJJJIEEHHHIHIHGHIJJIJFIHII;BFFFBDGIGIGIJJGIAEEDFDBD>BBBA:<AC@@BDCBCD@BB@>BD5399
ID line in detail • @D4ZHLFP1:10:C03D7ACXX:2:2305:7378:90088 2:N:0:ATCACG
Illumina Base QualityEncoding http://en.wikipedia.org/wiki/FASTQ_format
The haplotype assembly workflow DNA sequencing read alignment; retain SNPs computational modeling of reads and sequencing errors problem optimization haplotypes
sequence read alignment • Report all read alignments between sequence reads ri with lengths li for 1≤i≤n and a genome g of length L
sequence read alignment • The real problem is li<<L for 1≤i≤n • What can we do to overcome this problem?
BLAT • Algorithm • Construct an index of non-overlapping k-mersfor the reference • Use index to quickly find regions of the genome similar to the query • Compute an alignment between read and genomic region
SAM/BAM file format • Used to store read alignments • Tab delimited • samtools program used to manipulate SAM/BAM files • Picard java port
VCF file format • Used to store variant calls • Developed as part of the 1000 Genomes Project
Motivation • Sequence reads are sampled from haploid fragments (fi) gap between sequences covered SNPs f1 C T f2 A C ···········A······C······A······T······G·········· ···········C······G······A······A······T·········· ···········C······G······G······A······T·········· C G T f3
The haplotype assembly workflow DNA sequencing read alignment; retain SNPs computational modeling of reads and sequencing errors problem optimization haplotypes
errors • Two fundamental sources of errors • Sequencing • A base was miscalled • An insertion or deletion was miscalled • Read alignment • A read was aligned to the wrong location or the alignment itself is wrong
errors • We can explicitly model sequencing errors based on the technology used to generate the reads • Also, we can use the phred quality scores
fragments • For now, we assume the organism we are sequencing has ploidy=2, or 2 sets of chromosomes • Thus, each SNP can only have two alleles • Our computational model of an aligned read is termed a fragment • Given n SNPs, the ith fragment fiis an n-dimensional vector with alphabet {0,1,-} where 0 and 1 represent the major and minor alleles and – represents a gap in coverage
formal definition of input • The input is a matrix M with m rows corresponding to m fragments and n columns corresponding to the n SNPs. f1 A C ⋅⋅⋅A⋅⋅⋅⋅C⋅⋅⋅⋅A⋅⋅⋅⋅T⋅⋅⋅ ⋅⋅⋅C⋅⋅⋅⋅G⋅⋅⋅⋅G⋅⋅⋅⋅A⋅⋅⋅ G A f2 SNP-fragment matrix s1 s2 s3 s4 f1 f2
The haplotype assembly workflow DNA sequencing read alignment; retain SNPs computational modeling of reads and sequencing errors problem optimization haplotypes
The haplotype assembly problem (diploid) • Given: a set aligned sequence reads and variants • Compute: two haplotypes for each chromosome/scaffold
imagine a world without errors How would you assemble the fragments?
fragment conflict • Let fik be the allele or ‘-’ at position k of fragment i, then two fragments conflict if • Informally, two fragments fi and fj are in fragment conflict if they cover a common SNP and have different alleles at that site So, how would you assemble the fragments?
fragment conflict graph Lancia et al., 2001 1 − 0 • The fragment conflict graph is a representation of the fragments which share a SNP but differ in alleles. • The fragment conflict graph GF(VF,EF) has a vertex for every fragment and an edge between fragments that conflict (where d(fi,fj) is the number of fragment conflicts between fi and fj): 1 0 − − 0 0 0 1 − 0 − 1 − 1 1
fragment conflict graph • Find two sets such that there are no fragment conflicts internal to each set (bipartition of GF) • The shores of a bipartition of GF give the haplotype Haplotype A: 1 0 0 1 − 0 1 0 − − 0 0 0 1 − 0 − 1 − 1 1 Haplotype B: 0 1 1
minimum fragment removal • Optimization: Compute a subset of vertices VS in VF such that G(VF-VS) induces a bipartite graph. • Sometimes referred to as the bipartization problem and is NP-hard Lancia et al., 2001 But, exist in real data, sequencing errors do. Hmmmmm. disney.wikia.com
minimum edge removal • Optimization: Compute a subset of edges ES in EF such that G(EF-ES) induces a bipartite graph. • Also referred to as bipartization and NP-hard • But what exactly does an edge removal in GF mean? • More on this later… Lancia et al., 2001
SNP conflict graph • GS(VS,ES) has a vertex for every SNP and an edge between SNPs that exhibit 3 or more haplotypes: Schwartz 2010
minimum error correction • The most used optimization in the literature is the Minimum Error Correction (MEC) • FastHare4,HAPCUT1, HASH2, He et al.3 • Compute a minimum set of error corrections in the fragments (rows) of M such that GF is bipartite. Lancia et al., 2001 1 Bansal, V. and Bafna, V. (2008). Hapcut: an efficient and accurate algorithm for the haplotype assembly problem. Bioinformatics, 24(16), 153–159. 2 Bansal, V., Halpern, A. L., Axelrod, N., and Bafna, V. (2008). An mcmc algorithm for haplotype assembly from whole-genome sequence data. Genome Research, 18(8), 1336–1346. 3 He, D., Choi, A., Pipatsrisawat, K., Darwiche, A., and Eskin, E. (2010). Optimal algorithms for haplotype assembly from whole-genome sequence data. Bioinformatics, 26(12), i183–i190. 4 Panconesi, A. and Sozio, M. (2004). Fast hare: A fast heuristic for single individualsnp haplotype reconstruction. In Proceedings of the 4th International Workshop on Algorithms in Bioinformatics WABI04, volume 3240, pages 266 – 277.
minimum error correction • A single A->T change creates a bipartite GF Lancia et al., 2001 Schwartz 2010
Input v0 v1 CCTA read 1 AGTA read 2 AGGA consensus CGGA read 3
Fragment conflicts • Two fragments conflict if they share a variant, differ in allele • Haplotype assembly = two sets with no internal conflicts (bipartite) 10 01 f1 f3 v0 v1 11 read 1 f2 read 2 read 3
Optimizations • Remove minimum number of • Fragments • Variants • Correct minimum number of errors f1=01 f2=10 f3=11 f3=10 f1 f3 v0 v1 0- 01 01 1- 10 10 f2
Prior Work • Minimum Error Correction (MEC) • Levy et al. 2007, FastHare4,HAPCUT1, HASH2, He et al.3 • Minimum SNP Removal • Minimum Fragment Removal • HapCompass5 1Bansal, V. and Bafna, V. (2008) 2 Bansal, V., Halpern, A. L., Axelrod, N., and Bafna, V. (2008) 3 He, D., Choi, A., Pipatsrisawat, K., Darwiche, A., and Eskin, E. (2010) 4 Panconesi, A. and Sozio, M. (2004) 5 Derek Aguiar and SorinIstrail, (2012)
Levy et al. 2007 • Venter diploid genome • First diploid genome assembly • Greedy haplotype assembly algorithm
Levy et al. 2007 • Initialize set S containing all reads • while S has a read in which all variants are not currently assigned • choose a read containing the most uncalled variants in the set of haplotypes • use read to seed haplotype; assign alleles to haplotype and the compliment alleles to the compliment haplotype • iterate: • extend haplotypes using read with strongest signal (|number of variant agreements for haplotype A - number of variant agreements for haplotype B|)
Levy et al. 2007 (continued) • Iterate: • For each variant, assign allele by majority rule of the reads used to determine the alleles • For each read, assign the read to the haplotype with minimum distance
Levy et al. 2007 example v0 v1 v2 v3 v4 read 0 read 1 read 2 read 3 read 4 hap A hap A’