Genome Alignment & Assembly Chandrasekar A.
Sequence Assembly
• In bioinformatics, sequence assembly refers to aligning and merging fragments of a much longer DNA sequence in order to reconstruct the original sequence.
• This is needed because DNA sequencing technology cannot read whole genomes in one stretch; it reads small pieces of between 20 and 1000 bases, depending on the technology used.
Genome Assemblers
• Genome assemblers are variants of simpler sequence alignment programs, built to piece together the vast quantities of fragments generated by automated sequencing instruments.
Tools and Software for Assembly
• TIGR Assembler
• Velvet (de novo)
• Maq (reference)
• Reference assembly and alignment using the BWA tool, and visualization of the alignment using SAM Tools
Anatomy of a WGS Assembly
[Figure: a chromosome with STS markers; STS-mapped scaffolds; contigs separated by gaps (mean and std. dev. known); read pairs (mates); consensus; reads (of several haplotypes); SNPs; external "reads"]
Order & Orientation
• Assembly without pairs results in contigs whose order and orientation are not known.
• Pairs, especially groups of corroborating ones, link the contigs into scaffolds where the size of gaps is well characterized (mean and std. dev. are known).
[Figure labels: contig, consensus (15-30 Kbp), reads, 2-pair, scaffold]
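One way to see how pairs characterize a gap: assuming a library whose insert size has a known mean and standard deviation, every pair that spans two contigs gives one estimate of the gap, and corroborating pairs tighten it. The function name, the input layout and the plain averaging below are illustrative assumptions, not the method of any particular assembler.

```python
import math

def estimate_gap(insert_mean, insert_sd, spans):
    """Estimate the gap between two contigs from spanning read pairs.

    spans is a hypothetical input: one (bases_in_left_contig,
    bases_in_right_contig) tuple per pair, i.e. how much of each insert
    lies inside the two contigs.
    """
    if not spans:
        raise ValueError("need at least one spanning pair")
    # Each pair implies: gap = insert length - sequence covered in both contigs.
    estimates = [insert_mean - left - right for left, right in spans]
    gap_mean = sum(estimates) / len(estimates)
    # Assuming independent pairs, the standard error shrinks with sqrt(n).
    gap_sd = insert_sd / math.sqrt(len(estimates))
    return gap_mean, gap_sd

# Four corroborating pairs from a 2000 +/- 200 bp library (made-up numbers).
gap_mean, gap_sd = estimate_gap(
    2000, 200, [(900, 700), (850, 770), (950, 640), (800, 790)]
)
print(gap_mean, gap_sd)
```

With a single pair the uncertainty is the full library standard deviation, which is why groups of corroborating pairs matter.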
Overlap between two sequences
…AGCCTAGACCTACAGGATGCGCGGACACGTAGCCAGGAC
       CAGTACTTGGATGCGCTGACACGTAGCTTATCCGGT…
(overlap: 19 bases; overhang: 6 bases)
• % identity = 18/19 = 94.7%
• overlap: region of similarity between the two sequences
• overhang: unaligned ends of the sequences
• The assembler screens merges based on:
  • length of overlap
  • % identity in overlap region
  • maximum overhang size
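The identity figure on this slide can be reproduced with a few lines of Python. This is a minimal sketch that assumes an ungapped overlap (position-by-position comparison); the default thresholds in `accept_overlap` are placeholders chosen for illustration, not values from any specific assembler.

```python
def percent_identity(a, b):
    """Return (matches, percent identity) for two equal-length overlap regions."""
    if len(a) != len(b) or not a:
        raise ValueError("overlap regions must be non-empty and equal in length")
    matches = sum(x == y for x, y in zip(a, b))
    return matches, 100.0 * matches / len(a)

def accept_overlap(length, identity, overhang,
                   min_length=40, min_identity=94.0, max_overhang=10):
    """Screen a candidate merge on the three criteria listed on the slide.

    All three default thresholds are hypothetical.
    """
    return (length >= min_length
            and identity >= min_identity
            and overhang <= max_overhang)

# The 19-base overlap regions of the two sequences shown above.
region_a = "GGATGCGCGGACACGTAGC"
region_b = "GGATGCGCTGACACGTAGC"
matches, identity = percent_identity(region_a, region_b)
print(matches, round(identity, 1))  # 18 94.7
```

Under these placeholder thresholds the slide's own example would be rejected, because a 19-base overlap is shorter than `min_length`.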
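Assembly Pipeline
Trim & Screen -> Overlapper -> Unitiger -> Scaffolder -> Repeat Rez I, II
• Overlapper: find all overlaps of at least 40bp, allowing 6% mismatch.
[Figure: an observed overlap between reads A and B implies either a TRUE overlap or a REPEAT-INDUCED overlap]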
Assembly Pipeline
Trim & Screen -> Overlapper -> Unitiger -> Scaffolder -> Repeat Rez I, II
• Unitiger: compute all overlap-consistent sub-assemblies, called unitigs (uniquely assembled contigs).
Overlap Graph
• Edge types: regular dovetail, prefix dovetail, suffix dovetail
• Edges are annotated with deltas of overlaps
[Figure: example overlaps between reads A and B for each edge type]
The Unitig Reduction
1. Remove "transitively inferable" overlaps.
[Figure: overlaps among reads A, B and C]
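A rough sketch of this first reduction step, assuming the overlap graph is stored as a dictionary from each read to the set of reads it overlaps on its right. It only looks for two-step paths (A to B to C) and ignores overlap lengths, which a real unitigger would also take into account.

```python
def remove_transitive_overlaps(overlaps):
    """Drop every edge A->C that can be inferred from A->B and B->C.

    overlaps: dict mapping read -> set of reads it overlaps to the right.
    Returns a new dict; the input is left unchanged.
    """
    reduced = {read: set(targets) for read, targets in overlaps.items()}
    for a, targets in overlaps.items():
        for b in targets:
            for c in overlaps.get(b, ()):
                # A->C is implied by A->B and B->C, so it carries no new information.
                reduced[a].discard(c)
    return reduced

graph = {"A": {"B", "C"}, "B": {"C"}, "C": set()}
reduced = remove_transitive_overlaps(graph)
print(reduced)
```

After the reduction only the chain A -> B -> C remains.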
The Unitig Reduction
2. Collapse "unique connector" overlaps.
[Figure: reads A and B with labels 412, 352 and 45]
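The second step can be pictured as chain compression: an overlap A -> B is treated here as a "unique connector" when it is A's only outgoing edge and B's only incoming edge, and runs of such edges are collapsed into one unit. That reading of the term, the graph representation and the function name are all assumptions made for this sketch.

```python
def collapse_unique_connectors(graph):
    """Group reads into chains linked only by unique-connector edges.

    graph: dict mapping read -> set of successor reads (ideally after
    transitive reduction). Returns a list of chains (lists of reads).
    """
    preds = {node: set() for node in graph}
    for node, targets in graph.items():
        for target in targets:
            preds.setdefault(target, set()).add(node)

    def unique_edge(a, b):
        # a has exactly one successor and b has exactly one predecessor.
        return len(graph.get(a, ())) == 1 and len(preds[b]) == 1

    chains = []
    for node in preds:
        incoming = preds[node]
        if len(incoming) == 1 and unique_edge(next(iter(incoming)), node):
            continue  # node sits inside a chain started elsewhere
        chain = [node]
        while True:
            successors = graph.get(chain[-1], set())
            if len(successors) != 1:
                break
            nxt = next(iter(successors))
            if not unique_edge(chain[-1], nxt):
                break
            chain.append(nxt)
        chains.append(chain)
    return chains

graph = {"A": {"B"}, "B": {"C"}, "C": {"D", "E"}, "D": set(), "E": set()}
chains = collapse_unique_connectors(graph)
print(chains)
```

Reads A, B and C collapse into one chain; the branch at C stops the chain, so D and E stay separate.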
Identifying Unique DNA Stretches
• The discriminator statistic is the log-odds ratio of the probability that a unitig is unique DNA versus 2-copy DNA.
• Scale: -10, definitely repetitive; 0, don't know; +10, definitely unique.
[Figure: read arrival intervals in a repetitive DNA unitig and in a unique DNA unitig; distributions of the statistic for unique and for repetitive unitigs]
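A sketch of how such a log-odds score could be computed, assuming read starts arrive along the genome as a Poisson process. Under that assumption, a unique unitig expecting λ reads and a 2-copy unitig expecting 2λ reads give ln[P(k | unique) / P(k | 2-copy)] = λ - k ln 2 for k observed reads. The use of the natural log, the parameter names and the exact handling of the ±10 cut-offs are assumptions; the slide does not specify them.

```python
import math

def discriminator(unitig_length, reads_in_unitig, total_reads, genome_length):
    """Log-odds (natural log) that a unitig is unique rather than 2-copy DNA."""
    # Reads expected in the unitig if it is a single-copy stretch.
    expected = unitig_length * total_reads / genome_length
    # Poisson likelihood ratio: e^(-L) L^k / k!  over  e^(-2L) (2L)^k / k!
    return expected - reads_in_unitig * math.log(2)

def classify(score, threshold=10.0):
    """Map a score onto the three zones shown on the slide."""
    if score >= threshold:
        return "definitely unique"
    if score <= -threshold:
        return "definitely repetitive"
    return "don't know"

# Made-up example: 10,000 reads over a 1 Mbp genome, one 5,000 bp unitig.
for reads in (50, 60, 100):
    score = discriminator(5000, reads, 10_000, 1_000_000)
    print(reads, round(score, 2), classify(score))
```

A unitig holding about the expected number of reads scores well above zero, while one holding twice as many reads scores well below zero.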
Assembly Pipeline
Trim & Screen -> Overlapper -> Unitiger -> Scaffolder -> Repeat Rez I, II
• Scaffolder: scaffold U-unitigs with confirmed pairs.
[Figure: mated reads linking unitigs]
Assembly Pipeline
Trim & Screen -> Overlapper -> Unitiger -> Scaffolder -> Repeat Rez I, II
• Repeat Rez: fill repeat gaps with doubly anchored positive unitigs (Unitig > 0).
Assembly gaps
• Sequencing gap: we know the order and orientation of the contigs and have at least one clone spanning the gap.
• Physical gap: no information is known about the adjacent contigs, nor about the DNA spanning the gap.
Assembly paradigms
• Overlap-layout-consensus
  • greedy (TIGR Assembler, phrap, CAP3...)
  • graph-based (Celera Assembler, Arachne)
• Eulerian path (especially useful for short read sequencing)
TIGR Assembler/phrap: Greedy
• Build a rough map of fragment overlaps
• Pick the highest-scoring overlap
• Merge the two fragments
• Repeat until no more merges can be done
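The greedy loop above can be sketched in a few lines. This toy version scores an overlap by the length of an exact suffix-prefix match and ignores sequencing errors, reverse complements and quality values, so it illustrates the strategy only and is not how TIGR Assembler or phrap are implemented.

```python
def overlap_length(a, b, min_length=3):
    """Length of the longest suffix of a that exactly matches a prefix of b."""
    for length in range(min(len(a), len(b)), min_length - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0

def greedy_assemble(reads, min_length=3):
    """Repeatedly merge the pair of fragments with the longest overlap."""
    fragments = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best = (0, None, None)
        for i, a in enumerate(fragments):
            for j, b in enumerate(fragments):
                if i != j:
                    length = overlap_length(a, b, min_length)
                    if length > best[0]:
                        best = (length, i, j)
        length, i, j = best
        if length == 0:
            return fragments  # no more merges can be done
        merged = fragments[i] + fragments[j][length:]
        fragments = [f for k, f in enumerate(fragments) if k not in (i, j)]
        fragments.append(merged)

contigs = greedy_assemble(["ATGGCGT", "GCGTACC", "TACCGGA"])
print(contigs)  # ['ATGGCGTACCGGA']
```

Because each step commits to the locally best merge, repeats longer than the reads can mislead this approach, which is one motivation for the graph-based paradigms.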
Overlap-layout-consensus
• Main entity: read
• Relationship between reads: overlap
[Figure: overlap graph of reads 1-9, the resulting layout of the reads, and a consensus example with the sequences ACCTGA, ACCTGA, AGCTGA, ACCAGA]
Paths through graphs and assembly
• Hamiltonian circuit: visit each node (city) exactly once, returning to the start.
[Figure: genome]
All pairs alignment
• Needed by the assembler
• Try all pairs: must consider ~n^2 pairs
• Smarter solution: only n × coverage (e.g. 8) pairs are possible
  • Build a table of k-mers contained in the sequences (single pass through the genome)
  • Generate the pairs from the k-mer table (single pass through the k-mer table)
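A minimal sketch of the k-mer table idea: index every k-mer once, then emit only the read pairs that share at least one k-mer as candidates for full alignment. The function name and the choice of k are illustrative, and reverse-complement k-mers are ignored to keep the example short.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(reads, k):
    """Return the set of read-index pairs that share at least one k-mer."""
    # Pass 1: table from each k-mer to the reads that contain it.
    table = defaultdict(set)
    for index, read in enumerate(reads):
        for start in range(len(read) - k + 1):
            table[read[start:start + k]].add(index)
    # Pass 2: every pair of reads listed under the same k-mer is a candidate.
    pairs = set()
    for indices in table.values():
        pairs.update(combinations(sorted(indices), 2))
    return pairs

reads = ["ATGGCGT", "GCGTACC", "TACCGGA", "TTTTTTT"]
pairs = candidate_pairs(reads, k=4)
print(sorted(pairs))  # [(0, 1), (1, 2)]
```

Only two of the six possible pairs go on to alignment here; k-mers from repeats would produce long index lists, so practical tools usually cap or mask highly frequent k-mers.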
Assessing Assembly Quality
• Number and sizes of contigs
• Assumption: a few large contigs are better than many small contigs.
• This holds in that the former has fewer gaps, but it does not account for the possibility of misassemblies.
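A small sketch of how the number and sizes of contigs might be summarized. N50 is included as one commonly used size summary (the contig length at which the largest contigs together cover at least half of the assembled bases); the slide itself does not name a specific statistic, and none of these numbers detect misassemblies.

```python
def contig_summary(lengths):
    """Summarize an assembly by contig count, total size, largest contig and N50."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    n50 = 0
    running = 0
    for length in ordered:
        running += length
        if running * 2 >= total:
            n50 = length  # first contig at which half the total is reached
            break
    return {
        "count": len(ordered),
        "total": total,
        "largest": ordered[0] if ordered else 0,
        "n50": n50,
    }

few_large = contig_summary([500, 300, 200])
many_small = contig_summary([100] * 10)
print(few_large)
print(many_small)
```

Both made-up assemblies contain 1000 bases, but the first has fewer contigs (hence fewer gaps) and a larger N50.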
Reference assembly – BWA tool
• BWA: Burrows-Wheeler Aligner
• Aligns relatively short nucleotide sequences against a long reference sequence such as the human genome.
• It implements two algorithms, bwa-short and BWA-SW.
• The former works for query sequences shorter than 200bp and the latter for longer sequences up to around 100kbp.
• Both algorithms do gapped alignment.
• They are usually more accurate and faster on queries with low error rates.
Reference assembly – BWA tool
• Given high-quality reads, it is an order of magnitude faster than MAQ while achieving similar alignment accuracy.
• Platform: Illumina; SOLiD; 454; Sanger
• Features: PET (paired end tags) mapping (short reads only); gapped alignment; mapping quality; counting suboptimal occurrences (short reads only); SAM output
• Advantages: fast
• Limitations: the short read algorithm is slow for long reads and for reads with a high error rate
• Availability: GPL
Reference assembly – SAMtools
• SAM (Sequence Alignment/Map) format is a generic format for storing large nucleotide sequence alignments.
• SAM Tools provide various utilities for manipulating alignments in the SAM format, including sorting, merging, indexing and generating alignments in a per-position format.
Reference assembly – SAMtools
The SAM format:
• Is flexible enough to store all the alignment information generated by various alignment programs;
• Is simple enough to be easily generated by alignment programs or converted from existing alignment formats;
• Is compact in file size;
• Allows most operations on the alignment to work on a stream without loading the whole alignment into memory;
• Allows the file to be indexed by genomic position to efficiently retrieve all reads aligning to a locus.
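To make the "simple" and "stream" points concrete, here is a sketch that reads SAM text one line at a time. It relies on the eleven mandatory tab-separated fields of an alignment line (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL) and on header lines starting with "@". The helper names and the sample records are made up for illustration; this is not the SAM Tools API, and optional tags are ignored.

```python
from typing import NamedTuple

class SamRecord(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int      # 1-based leftmost mapping position
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str

def parse_sam_line(line):
    """Parse the eleven mandatory fields of one SAM alignment line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line has at least 11 fields")
    return SamRecord(fields[0], int(fields[1]), fields[2], int(fields[3]),
                     int(fields[4]), fields[5], fields[6], int(fields[7]),
                     int(fields[8]), fields[9], fields[10])

def records_starting_in(lines, rname, start, end):
    """Stream over SAM lines, yielding records that start in [start, end]."""
    for line in lines:
        if line.startswith("@"):
            continue  # header line
        record = parse_sam_line(line)
        if record.rname == rname and start <= record.pos <= end:
            yield record

sam_lines = [
    "@HD\tVN:1.6\tSO:coordinate",
    "read1\t0\tchr1\t100\t37\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII",
    "read2\t16\tchr1\t5000\t20\t8M\t*\t0\t0\tTTGCAACG\tIIIIIIII",
]
hits = list(records_starting_in(sam_lines, "chr1", 1, 1000))
print([r.qname for r in hits])  # ['read1']
```

Because the generator handles one line at a time, the same function works on a file object without loading the whole alignment into memory; efficient lookup by locus additionally needs a sorted, indexed file.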
De novo assembly – Velvet
• De novo genomic assembler specially designed for short read sequencing technologies, such as Solexa or 454.
• Developed by Daniel Zerbino and Ewan Birney at the European Bioinformatics Institute (EMBL-EBI), near Cambridge, in the United Kingdom.
• Takes in short read sequences, removes errors, then produces high-quality unique contigs.
• It then uses paired-end read and long read information, when available, to retrieve the repeated areas between contigs.
Applications of Genome Assembly
• Generating and interpreting alignment status and reports
• Genome variation calling (finding SNPs, indels)
• Variation annotation and viewing