Haplotype assembly

Haplotype assembly CSCI2820: Medical Bioinformatics Derek Aguiar

Outline • Haplotype assembly workflow • Data generation and preparation • Error modeling • Problem optimizations • Algorithms • HapCUT • Levy et al. 2007 • HapCompass • Polyploidy and cancer

Haplotype reconstruction solution space S k-ploidy n 2’s Sequence reads or genotypes Haplotype assembly Haplotype phasing Haplotypes

The haplotype assembly workflow DNA sequencing read alignment; retain SNPs computational modeling of reads and sequencing errors problem optimization haplotypes

DNA ...A A C G T C A G T C A G C C T... 5.4 thousand bases 150 billion bases 3.2 billion bases

Variation structural variation Single nucleotide variant (SNV)or polymorphism (SNP) del/no del alleles C/T alleles http://en.wikipedia.org/wiki/Single-nucleotide_polymorphism https://sites.google.com/site/lifesciencesinmaine/5-cell-division-reproduction-and-dna

Genotypes and haplotypes Genotype Haplotype ACGCCA ACGCCA CCTTAA {A,C} {C} {G,T} {C,T} {C,A} {A} TGCGGT CCTTAA GGAATT 000001 101111 2 0 2 2 2 1

High-throughput Sequencing • Also termed next-generation sequencing • Illumina • 454 • SOLiD • DNA is fractured, amplified, fixated onto an array, bases are added • Single molecule or 3rd generation technologies Source of bias Error signature Short reads: ~50-200bp (454 can get up to 1kb) Generally more error prone than Sanger Extremely fast and parallel

Wetterstrand KA. DNA Sequencing Costs: Data from the NHGRI Genome Sequencing Program (GSP) Available at: www.genome.gov/sequencingcosts. Accessed April 2013.

http://www.ncbi.nlm.nih.gov/Traces/sra/

Illumina FASTQ format • Four lines per sequence • identifier line • sequence line • + line • base quality line

Illumina File Snippet • @D4ZHLFP1:10:C03D7ACXX:2:2305:7378:90088 2:N:0:ATCACG • CGTGACAATATTTAGCTTTAGATGGAGTTTACCACCCATTTTGGGCTGCATTCCCAAACAACCCGACTCGTCGAAAGCACATCGTGAACGACCGAGGTCGC • + • BBBDDFFFFHGHH@D@GHIGIFIJJJIEEHHHIHIHGHIJJIJFIHII;BFFFBDGIGIGIJJGIAEEDFDBD>BBBA:<AC@@BDCBCD@BB@>BD5399

ID line in detail • @D4ZHLFP1:10:C03D7ACXX:2:2305:7378:90088 2:N:0:ATCACG

Illumina Base QualityEncoding http://en.wikipedia.org/wiki/FASTQ_format

sequence read alignment • Report all read alignments between sequence reads ri with lengths li for 1≤i≤n and a genome g of length L

sequence read alignment • The real problem is li<<L for 1≤i≤n • What can we do to overcome this problem?

BLAT • Algorithm • Construct an index of non-overlapping k-mersfor the reference • Use index to quickly find regions of the genome similar to the query • Compute an alignment between read and genomic region

SAM/BAM file format • Used to store read alignments • Tab delimited • samtools program used to manipulate SAM/BAM files • Picard java port

VCF file format • Used to store variant calls • Developed as part of the 1000 Genomes Project

Motivation • Sequence reads are sampled from haploid fragments (fi) gap between sequences covered SNPs f1 C T f2 A C ···········A······C······A······T······G·········· ···········C······G······A······A······T·········· ···········C······G······G······A······T·········· C G T f3

errors • Two fundamental sources of errors • Sequencing • A base was miscalled • An insertion or deletion was miscalled • Read alignment • A read was aligned to the wrong location or the alignment itself is wrong

errors • We can explicitly model sequencing errors based on the technology used to generate the reads • Also, we can use the phred quality scores

relationship between quality score and base call

fragments • For now, we assume the organism we are sequencing has ploidy=2, or 2 sets of chromosomes • Thus, each SNP can only have two alleles • Our computational model of an aligned read is termed a fragment • Given n SNPs, the ith fragment fiis an n-dimensional vector with alphabet {0,1,-} where 0 and 1 represent the major and minor alleles and – represents a gap in coverage

formal definition of input • The input is a matrix M with m rows corresponding to m fragments and n columns corresponding to the n SNPs. f1 A C ⋅⋅⋅A⋅⋅⋅⋅C⋅⋅⋅⋅A⋅⋅⋅⋅T⋅⋅⋅ ⋅⋅⋅C⋅⋅⋅⋅G⋅⋅⋅⋅G⋅⋅⋅⋅A⋅⋅⋅ G A f2 SNP-fragment matrix s1 s2 s3 s4 f1 f2

The haplotype assembly problem (diploid) • Given: a set aligned sequence reads and variants • Compute: two haplotypes for each chromosome/scaffold

imagine a world without errors How would you assemble the fragments?

fragment conflict • Let fik be the allele or ‘-’ at position k of fragment i, then two fragments conflict if • Informally, two fragments fi and fj are in fragment conflict if they cover a common SNP and have different alleles at that site So, how would you assemble the fragments?

fragment conflict graph Lancia et al., 2001 1 − 0 • The fragment conflict graph is a representation of the fragments which share a SNP but differ in alleles. • The fragment conflict graph GF(VF,EF) has a vertex for every fragment and an edge between fragments that conflict (where d(fi,fj) is the number of fragment conflicts between fi and fj): 1 0 − − 0 0 0 1 − 0 − 1 − 1 1

fragment conflict graph • Find two sets such that there are no fragment conflicts internal to each set (bipartition of GF) • The shores of a bipartition of GF give the haplotype Haplotype A: 1 0 0 1 − 0 1 0 − − 0 0 0 1 − 0 − 1 − 1 1 Haplotype B: 0 1 1

minimum fragment removal • Optimization: Compute a subset of vertices VS in VF such that G(VF-VS) induces a bipartite graph. • Sometimes referred to as the bipartization problem and is NP-hard Lancia et al., 2001 But, exist in real data, sequencing errors do. Hmmmmm. disney.wikia.com

minimum edge removal • Optimization: Compute a subset of edges ES in EF such that G(EF-ES) induces a bipartite graph. • Also referred to as bipartization and NP-hard • But what exactly does an edge removal in GF mean? • More on this later… Lancia et al., 2001

SNP conflict graph • GS(VS,ES) has a vertex for every SNP and an edge between SNPs that exhibit 3 or more haplotypes: Schwartz 2010

minimum error correction • The most used optimization in the literature is the Minimum Error Correction (MEC) • FastHare4,HAPCUT1, HASH2, He et al.3 • Compute a minimum set of error corrections in the fragments (rows) of M such that GF is bipartite. Lancia et al., 2001 1 Bansal, V. and Bafna, V. (2008). Hapcut: an efficient and accurate algorithm for the haplotype assembly problem. Bioinformatics, 24(16), 153–159. 2 Bansal, V., Halpern, A. L., Axelrod, N., and Bafna, V. (2008). An mcmc algorithm for haplotype assembly from whole-genome sequence data. Genome Research, 18(8), 1336–1346. 3 He, D., Choi, A., Pipatsrisawat, K., Darwiche, A., and Eskin, E. (2010). Optimal algorithms for haplotype assembly from whole-genome sequence data. Bioinformatics, 26(12), i183–i190. 4 Panconesi, A. and Sozio, M. (2004). Fast hare: A fast heuristic for single individualsnp haplotype reconstruction. In Proceedings of the 4th International Workshop on Algorithms in Bioinformatics WABI04, volume 3240, pages 266 – 277.

minimum error correction • A single A->T change creates a bipartite GF Lancia et al., 2001 Schwartz 2010

Input v0 v1 CCTA read 1 AGTA read 2 AGGA consensus CGGA read 3

Fragment conflicts • Two fragments conflict if they share a variant, differ in allele • Haplotype assembly = two sets with no internal conflicts (bipartite) 10 01 f1 f3 v0 v1 11 read 1 f2 read 2 read 3

Optimizations • Remove minimum number of • Fragments • Variants • Correct minimum number of errors f1=01 f2=10 f3=11 f3=10 f1 f3 v0 v1 0- 01 01 1- 10 10 f2

Prior Work • Minimum Error Correction (MEC) • Levy et al. 2007, FastHare4,HAPCUT1, HASH2, He et al.3 • Minimum SNP Removal • Minimum Fragment Removal • HapCompass5 1Bansal, V. and Bafna, V. (2008) 2 Bansal, V., Halpern, A. L., Axelrod, N., and Bafna, V. (2008) 3 He, D., Choi, A., Pipatsrisawat, K., Darwiche, A., and Eskin, E. (2010) 4 Panconesi, A. and Sozio, M. (2004) 5 Derek Aguiar and SorinIstrail, (2012)

Levy et al. 2007 • Venter diploid genome • First diploid genome assembly • Greedy haplotype assembly algorithm

Levy et al. 2007 • Initialize set S containing all reads • while S has a read in which all variants are not currently assigned • choose a read containing the most uncalled variants in the set of haplotypes • use read to seed haplotype; assign alleles to haplotype and the compliment alleles to the compliment haplotype • iterate: • extend haplotypes using read with strongest signal (|number of variant agreements for haplotype A - number of variant agreements for haplotype B|)

Levy et al. 2007 (continued) • Iterate: • For each variant, assign allele by majority rule of the reads used to determine the alleles • For each read, assign the read to the haplotype with minimum distance

Levy et al. 2007 example v0 v1 v2 v3 v4 read 0 read 1 read 2 read 3 read 4 hap A hap A’

Haplotype assembly

Haplotype assembly

Presentation Transcript

SNP Haplotype

Haplotype Blocks

Combinatorial Algorithms for Haplotype Inference

Polyploid haplotype assembly

A n MCMC algorithm for haplotype assembly from whole-genome sequence data

1000 Genomes Project Haplotype Integration

Spooky halloween haplotype assembly

Haplotype Discovery and Modeling

Haplotype Trees

Haplotype analysis

Computational Approaches to Haplotype Inference

Linear Reduction for Haplotype Inference

HAPLOTYPE ANALYSIS

Introduction to Haplotype Estimation

Haplotype inference and haplotype-based transmission disequilibrium test (Hap-TDT)

L6: Haplotype phasing

Haplotype analysis

METHODS FOR HAPLOTYPE RECONSTRUCTION

HapCompass : A Fast Cycle Basis Algorithm for Accurate Haplotype Assembly of Sequence Data

Haplotype Reconstruction

Haplotype analysis

L6: Haplotype phasing