CS 6293 Advanced Topics: Current Bioinformatics
Genome Assembly: a brief introduction
Slides adapted from Mihai Pop, Art Delcher, and Steven Salzberg
Homework #2
• #1: Questions will be posted online before Monday's class
• #2: Form groups of 3
• Each group reads two papers on one topic: short-read alignment or assembly
• Present the papers and compare them
• ~8-minute presentation
• You can go into some of the really cool details
• Or give the main idea of each paper
• Other teams (and I) will judge you
• Send me the names in your group and, optionally, the papers you want to present
• List of papers: http://www.oxfordjournals.org/our_journals/bioinformatics/nextgenerationsequencing.html
Genome sequencing
AGTAGCACAGACTACGACGAGACGATCGTGCGAGCGACGGCGTAGTGTGCTGTACTGTCGTGTGTGTGTACTCTCCTAGTAGCACAGACTACGACGAGACGATCGTGCGAGCGACGGCGTAGTGTGCTGTACTGTCGTGTGTGTGTACTCTCCT
• Genome: 3 × 10^9 nucleotides
• Each read: ~500 nucleotides
• A big puzzle: ~60 million pieces
Computational Fragment Assembly
• Introduced ~1980
• 1995: assemble up to 1,000,000 long DNA pieces
• 2000: assemble whole human genome
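The "~60 million pieces" figure can be checked with back-of-the-envelope arithmetic: the number of reads is roughly coverage × genome length / read length. The sketch below assumes 10X coverage, which this slide does not state; the function names are illustrative only.

```python
def reads_needed(genome_len, read_len, coverage):
    """Reads required so that reads * read_len == coverage * genome_len."""
    return coverage * genome_len / read_len

def coverage_of(n_reads, read_len, genome_len):
    """Average number of reads covering each base of the genome."""
    return n_reads * read_len / genome_len

# Assumed 10X coverage of a 3 x 10^9 nt genome with ~500 nt reads.
print(reads_needed(3_000_000_000, 500, 10))  # 60000000.0
```

At 1X coverage the same genome would need only about 6 million reads, so the piece count is driven by the redundancy needed for reads to overlap.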
Shotgun DNA Sequencing (Technology)
[Figure: DNA target sample → SHEAR → SIZE SELECT (e.g., 10 Kbp ± 8% std. dev.) → LIGATE & CLONE (vector) → SEQUENCE (primer) → end reads (mates), 550 bp]
Whole Genome Shotgun Sequencing
+ single highly automated process
+ only three library constructions
– assembly is much more difficult
• Collect 10X sequence in a 1-to-1 ratio of two types of read pairs, short (2 Kbp) and long (10 Kbp): ~35 million reads for human.
• Collect another 20X in clone coverage of 50 Kbp end-sequence pairs (BAC 5' and BAC 3' ends): ~1.2 million pairs for human.
• Early simulations showed that if repeats were considered black boxes, one could still cover 99.7% of the genome unambiguously.
Celera’s Sequencing Factory (circa 2001)
• 300 ABI 3700 DNA Sequencers
• 50 Production Staff
• 20,000 sq. ft. of wet lab
• 20,000 sq. ft. of sequencing space
• 800 tons of A/C (160,000 cfm)
• $1 million / year for electrical service
• $10 million / month for reagents
Human Data (April 2000)
• Collected 27.27 million reads = 5.11X coverage
• 21.04 million are paired (77%) = 10.52 million pairs

  Insert size   Pairs     True*    Std. dev.
  2 Kbp         5.045M    98.6%    <6%
  10 Kbp        4.401M    98.6%    <8%
  50 Kbp        1.071M    90.0%    <15%
  * validated against finished Chrom. 21 sequence

• The clones cover the genome 38.7X
• Data is from 5 individuals (one at roughly 3X, 4 others at 0.5X)
Pairs Give Order & Orientation
• Assembly without pairs results in contigs whose order and orientation are not known.
• Pairs, especially groups of corroborating ones, link the contigs into scaffolds where the size of gaps is well characterized (mean and std. dev. are known).
[Figure labels: Contig; Consensus (15-30 Kbp); Reads; 2-pair; Scaffold]
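To illustrate why corroborating pairs make gap sizes "well characterized", here is a minimal sketch of a gap estimate from mate pairs that span two adjacent contigs A and B. It assumes each pair contributes an independent estimate drawn from a library with known insert mean and standard deviation; the function name and the input layout are hypothetical, not taken from any particular assembler.

```python
from math import sqrt
from statistics import mean

def gap_estimate(links, insert_mean, insert_sd):
    """Estimate the gap between contigs A and B from spanning mate pairs.

    Each link is (a_tail, b_head): bases from the start of the read in A
    to A's right end, and bases from B's left end to the far end of the
    mate in B. One pair implies gap = insert_mean - a_tail - b_head;
    averaging n independent pairs shrinks the standard error by sqrt(n).
    """
    estimates = [insert_mean - a_tail - b_head for a_tail, b_head in links]
    return mean(estimates), insert_sd / sqrt(len(estimates))

# Two pairs from an assumed 10 Kbp library with an 800 bp std. dev.
gap, std_err = gap_estimate([(4000, 5000), (3000, 6000)], 10_000, 800)
print(gap, round(std_err, 1))  # 1000 565.7
```

A single pair gives the gap only to within the library's standard deviation; each additional corroborating pair tightens the estimate.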
Anatomy of a WGS Assembly
[Figure labels: Chromosome; STS; STS-mapped scaffolds; Contig; Gap (mean & std. dev. known); Read pair (mates); Consensus; Reads (of several haplotypes); SNPs; External “Reads”]
Assembly gaps
• Sequencing gap: we know the order and orientation of the contigs and have at least one clone spanning the gap
• Physical gap: no information is known about the adjacent contigs, nor about the DNA spanning the gap
Assembly paradigms
• Overlap-layout-consensus
  • greedy (TIGR Assembler, phrap, CAP3...)
  • graph-based (Celera Assembler, Arachne)
• Eulerian path (especially useful for short read sequencing)
TIGR Assembler/phrap
Greedy:
• Build a rough map of fragment overlaps
• Pick the highest-scoring overlap
• Merge the two fragments
• Repeat until no more merges can be done
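The greedy loop can be sketched in a few lines. This is a toy illustration, not how TIGR Assembler or phrap are implemented: it assumes error-free reads, scores an overlap by its exact-match length, ignores reverse complements and contained reads, and recomputes all overlaps on every round. The function names are hypothetical.

```python
def overlap_len(a, b, min_len=3):
    """Length of the longest exact suffix(a)/prefix(b) match, or 0."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of fragments with the longest overlap."""
    frags = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best = (0, None, None)
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap_len(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:                       # no more merges can be done
            return frags
        merged = frags[i] + frags[j][n:]
        frags = [f for k, f in enumerate(frags) if k not in (i, j)] + [merged]

print(greedy_assemble(["ATGGCG", "GCGTGC", "GTGCA"]))  # ['ATGGCGTGCA']
```

Because each merge is final, a locally best overlap that comes from a repeat can lock in a wrong join, which is the weakness of the greedy approach.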
[Figure] (A) Overlap between two reads; note that agreement within the overlapping region need not be perfect. (B) Correct assembly of a genome with two repeats (boxes) using four reads A–D. (C) Assembly produced by the greedy approach.
Source: Pop M, Brief Bioinform 2009;10:354-366. © The Author 2009. Published by Oxford University Press. For Permissions, please email: journals.permissions@oxfordjournals.org
Overlap-layout-consensus
• Main entity: read
• Relationship between reads: overlap
[Figure: overlap graph and layout of reads 1-9; consensus example with sequences ACCTGA, ACCTGA, AGCTGA, ACCAGA]
Paths through graphs and assembly
• Hamiltonian circuit: visit each node (city) exactly once, returning to the start
• Hamiltonian path: visit each node (city) exactly once
[Figure label: Genome]
Overlap between two sequences
        GGATGCGCGGACACGTAGCCAGGAC
CAGTACTTGGATGCGCTGACACGTAGC
overlap: 19 bases; overhangs: 8 bases (left end of the bottom sequence) and 6 bases (right end of the top sequence); % identity = 18/19 = 94.7%
• overlap: region of similarity between sequences
• overhang: unaligned ends of the sequences
• The assembler screens merges based on:
  • length of overlap
  • % identity in overlap region
  • maximum overhang size
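The numbers on this slide can be reproduced with a small ungapped overlap scan. This is a simplified sketch: real assemblers use gapped alignment, and the thresholds `min_len` and `min_identity` below are illustrative assumptions, not values from any specific tool.

```python
def best_overlap(a, b, min_len=10, min_identity=0.9):
    """Longest ungapped suffix(a)/prefix(b) overlap passing both thresholds.

    Returns (overlap_len, matches, left_overhang, right_overhang) or None.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        matches = sum(x == y for x, y in zip(a[-n:], b[:n]))
        if matches / n >= min_identity:
            return n, matches, len(a) - n, len(b) - n
    return None

bottom = "CAGTACTTGGATGCGCTGACACGTAGC"
top = "GGATGCGCGGACACGTAGCCAGGAC"
print(best_overlap(bottom, top))  # (19, 18, 8, 6)
```

The result is a 19-base overlap with 18 matching positions (94.7% identity) and unaligned ends of 8 and 6 bases, the three quantities the assembler screens on.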
All pairs alignment
• Needed by the assembler
• Try all pairs: must consider ~n^2 pairs
• Smarter solution: only n × coverage (e.g. 8) pairs are possible
  • Build a table of k-mers contained in sequences (single pass through the genome)
  • Generate the pairs from k-mer table (single pass through k-mer table)
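The k-mer table idea can be sketched as follows. The function name is hypothetical, and the sketch keeps every k-mer; a practical implementation would skip very frequent k-mers, which would otherwise generate a quadratic number of candidate pairs in repeats.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(reads, k):
    """Return pairs of read indices that share at least one k-mer."""
    table = defaultdict(set)            # k-mer -> indices of reads containing it
    for idx, read in enumerate(reads):  # single pass through the reads
        for i in range(len(read) - k + 1):
            table[read[i:i + k]].add(idx)
    pairs = set()
    for ids in table.values():          # single pass through the k-mer table
        pairs.update(combinations(sorted(ids), 2))
    return pairs

reads = ["ATGGCG", "GCGTGC", "GTGCA", "TTTTTT"]
print(sorted(candidate_pairs(reads, 3)))  # [(0, 1), (1, 2)]
```

Only the candidate pairs are then passed to the expensive alignment step, instead of all ~n^2 pairs.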
BWT-based overlap detection
• Efficient construction of an assembly string graph using the FM-index, Jared T. Simpson and Richard Durbin, Bioinformatics, 26(12): i367-i373 (2010)
• Read it yourself for more details
[Figure: BWT for multiple sequences]
OVERLAP GRAPH
• Edge types: regular dovetail, prefix dovetail, suffix dovetail
• Edges are annotated with deltas of overlaps
[Figure: reads A and B shown in each edge configuration]
The Unitig Reduction
1. Remove “Transitively Inferrable” Overlaps [Figure: reads A, B, C]
2. Collapse “Unique Connector” Overlaps [Figure: reads A and B; edge deltas 412, 352, 45]
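Step 1 can be illustrated on a bare directed graph. This is a simplified sketch that treats an overlap A→C as inferable whenever some B has overlaps A→B and B→C; it ignores the overlap deltas that a real unitigger would also check for consistency, and the function name is hypothetical.

```python
def transitive_reduction(edges):
    """Drop every edge a->c for which some b gives a->b and b->c."""
    succ = {}
    for a, b in edges:
        succ.setdefault(a, set()).add(b)
    keep = set()
    for a, c in edges:
        # Check only one-step detours a -> b -> c through a's other successors.
        inferable = any(c in succ.get(b, ()) for b in succ[a] if b != c)
        if not inferable:
            keep.add((a, c))
    return keep

edges = {("A", "B"), ("B", "C"), ("A", "C")}
print(sorted(transitive_reduction(edges)))  # [('A', 'B'), ('B', 'C')]
```

After this step, chains of reads joined by a single remaining overlap on each side are the candidates for collapsing in step 2.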
Celera Assembly Pipeline
Stages: Trim & Screen → Overlapper → Unitiger → Scaffolder → Repeat Rez I, II
• Overlapper: find all overlaps of at least 40 bp allowing 6% mismatch. An overlap between A and B implies either a TRUE overlap or a REPEAT-INDUCED one.
• Unitiger: compute all overlap-consistent sub-assemblies: unitigs (uniquely assembled contigs)
• Scaffolder: scaffold U-unitigs with confirmed pairs (mated reads)
• Repeat Rez I, II: fill repeat gaps with doubly anchored positive unitigs (Unitig > 0)
Handling repeats
• Repeat detection
  • pre-assembly: find fragments that belong to repeats
    • statistically (most existing assemblers)
    • repeat database (RepeatMasker)
  • during assembly: detect "tangles" indicative of repeats (Pevzner, Tang, Waterman 2001)
  • post-assembly: find repetitive regions and potential mis-assemblies
    • REPuter, RepeatMasker
    • "unhappy" mate-pairs (too close, too far, mis-oriented)
• Repeat resolution
  • find DNA fragments belonging to the repeat
  • determine correct tiling across the repeat
Statistical repeat detection
• Significant deviations from average coverage are flagged as repeats.
  • frequent k-mers are ignored
  • “arrival” rate of reads in contigs compared with theoretical value
• Problem 1: assumption of uniform distribution of fragments leads to false positives
  • non-random libraries
  • poor clonability regions
• Problem 2: repeats with low copy number are missed, leading to false negatives
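One way to make the "arrival rate" comparison concrete is a log-odds score under a Poisson model of read start positions. The sketch below is an illustration under that assumption and is not claimed to be the exact statistic used by any assembler: if a unique contig of a given length expects λ reads and a collapsed two-copy repeat expects 2λ, then observing k reads gives ln[P(k | λ) / P(k | 2λ)] = λ − k ln 2.

```python
from math import log

def arrival_log_odds(contig_len, n_reads, total_reads, genome_len):
    """Natural-log odds that a contig is unique rather than a collapsed
    two-copy repeat, assuming Poisson-distributed read starts."""
    expected = contig_len * total_reads / genome_len  # reads expected if unique
    return expected - n_reads * log(2)

# Hypothetical numbers: 10 Kbp contig, 10,000 reads over a 1 Mbp genome.
print(arrival_log_odds(10_000, 100, 10_000, 1_000_000) > 0)  # True: looks unique
print(arrival_log_odds(10_000, 200, 10_000, 1_000_000) > 0)  # False: looks repetitive
```

The score depends on the uniform-sampling assumption, which is exactly what Problem 1 calls into question, and short or low-copy repeats shift it too little to be detected, as in Problem 2.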
Mis-assembled repeats
• excision
• collapsed tandem
• rearrangement
Eulerian path-based assembly
• Break each read into k-mers (typically k >= 19)
• Construct a de Bruijn graph using the k-mers from all reads
  • Each k-mer is a node
  • v1 has a directed edge to v2 if the last k-1 characters of v1 equal the first k-1 characters of v2, i.e. v2 is v1 with its first character removed and a new character added at the end. E.g. v1 = acgtctgact, v2 = cgtctgactg
• Find an Eulerian path in the graph
  • visits each edge exactly once
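The steps above can be sketched as a toy implementation. It assumes error-free, single-strand reads, adds an edge only between k-mers that are adjacent in some read, keeps one edge per distinct (k+1)-mer, and assumes an Eulerian path exists. The path is found with Hierholzer's algorithm; the function names are illustrative.

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Nodes are k-mers; one edge v1 -> v2 per distinct (k+1)-mer in the reads."""
    edges = set()
    for read in reads:
        for i in range(len(read) - k):
            edges.add((read[i:i + k], read[i + 1:i + k + 1]))
    graph = defaultdict(list)
    for v1, v2 in sorted(edges):
        graph[v1].append(v2)
    return graph

def eulerian_path(graph):
    """Hierholzer's algorithm; assumes the graph has an Eulerian path."""
    in_deg = {}
    for targets in graph.values():
        for w in targets:
            in_deg[w] = in_deg.get(w, 0) + 1
    # Start where out-degree exceeds in-degree by one, else at any node.
    start = next((v for v in graph if len(graph[v]) - in_deg.get(v, 0) == 1),
                 next(iter(graph)))
    remaining = {v: list(ws) for v, ws in graph.items()}
    stack, path = [start], []
    while stack:
        v = stack[-1]
        if remaining.get(v):
            stack.append(remaining[v].pop())  # follow an unused edge
        else:
            path.append(stack.pop())          # dead end: emit node
    return path[::-1]

def spell(path):
    """Concatenate a node path back into a sequence."""
    return path[0] + "".join(v[-1] for v in path[1:])

reads = ["ATGGCGT", "GGCGTGC", "CGTGCA"]
print(spell(eulerian_path(de_bruijn(reads, 3))))  # ATGGCGTGCA
```

With k = 3 every k-mer in this toy genome is unique, so the path is unambiguous; a repeated k-mer would create a branch or cycle and possibly several valid Eulerian paths.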
[Figure panels: 1. Sequencing; 2. Constructing a de Bruijn graph; 3. Simplification; 4. Error removal]
Eulerian path-based assembly
• No need to compute pairwise overlaps: important for NGS data
• Eulerian paths are much easier to find than Hamiltonian paths
• Catch: multiple Eulerian paths may exist
• Loss of information
• Repeats appear as cycles in the graph
• Less likely to cause mis-assembly
• More suitable for short-read assembly
• Newbler
• VELVET
• EDENA
• ABySS
• See Flicek & Birney, Nat Methods, 2009
References
• Sense from sequence reads: methods for alignment and assembly, Paul Flicek & Ewan Birney, Nature Methods 6, S6-S12 (2009)
• Genome assembly reborn: recent computational challenges, Mihai Pop, Briefings in Bioinformatics, 10(4): 354-366 (2009)