Whole Genome Shotgun Assembly

Whole Genome Shotgun Assembly. Two strategies for sequencing: the clone-by-clone approach and the whole-genome shotgun approach (Celera, Gene Myers). Shotgun sequencing was introduced by F. Sanger et al. (1977) and has remained the mainstay of genome sequence assembly for nearly 25 years.

Presentation Transcript


  1. Whole Genome Shotgun Assembly Two strategies for sequencing: the clone-by-clone approach and the whole-genome shotgun approach (Celera, Gene Myers). Shotgun sequencing was introduced by F. Sanger et al. (1977) and has remained the mainstay of genome sequence assembly for nearly 25 years. ED Green, Nat Rev Genet 2, 573 (2001) Bioinformatics III

  2. Automatic sequencing

  3. Automated Sequencing • Nearly all automated sequencing uses the enzymatic dideoxy chain-termination method of Sanger (1977). • Fragments are separated by gel electrophoresis. • Fragments labeled with fluorescent dyes are read out. • Computer analysis of gel images: lane tracking (identify gel boundaries); lane profiling (sum each of the 4 signals across the lane width to create a profile); trace processing (deconvolve and smooth the signal estimates and reduce noise); base calling (translate the processed trace into a sequence of bases). • The program Phred is the quasi-standard for the last step (base calling).

  4. Base Calling - Phred B. Ewing, L. Hillier, M.C. Wendl, P. Green. Base-calling of automated sequencer traces using Phred. I. Accuracy assessment. Genome Res 8, 175-185 (1998). B. Ewing, P. Green. Base-calling of automated sequencer traces using Phred. II. Error probabilities. Genome Res 8, 186-194 (1998). The processed traces are displayed as chromatograms of 4 curves of different colors, each curve representing the signal of 1 of the 4 bases.

  5. Base Calling - Phred Idealized traces would consist of evenly spaced, nonoverlapping peaks. Real traces deviate from this ideal due to imperfections of the sequencing reactions, of gel electrophoresis, and of trace processing. The first 50 or so peaks and the peaks beyond 500 or so are particularly noisy. Quality: high (no ambiguities); medium (some ambiguities); poor (low confidence).

  6. Base Calling Algorithm 1. Locate predicted peaks: find the idealized locations of the base peaks using Fourier methods. 2. Locate observed peaks: scan the 4 trace arrays for concave regions satisfying $2\,v(i) \ge v(i+1) + v(i-1)$. 3. Match observed and predicted peaks: a) find easy matches; b) use dynamic programming to align those peaks not matched in a); c) match remaining observed peaks that seem to represent genuine bases. 4. Find missed peaks.
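Step 2 of this slide can be illustrated with a minimal sketch. It assumes a single trace is given as a plain list of intensities and reports maximal runs of indices where the concavity condition holds. The function name `find_concave_regions` is hypothetical, and this is not Phred's actual code, which applies further filtering to the candidate peaks.

```python
def find_concave_regions(trace):
    """Return (start, end) index pairs (inclusive) of maximal runs where
    2*v[i] >= v[i-1] + v[i+1], i.e. where the trace is locally concave."""
    regions = []
    start = None
    for i in range(1, len(trace) - 1):
        concave = 2 * trace[i] >= trace[i - 1] + trace[i + 1]
        if concave and start is None:
            start = i                      # a new concave run begins
        elif not concave and start is not None:
            regions.append((start, i - 1))  # the run ended at i - 1
            start = None
    if start is not None:
        regions.append((start, len(trace) - 2))
    return regions


# Two synthetic peaks (illustrative values only).
trace = [0, 1, 4, 9, 10, 9, 4, 1, 0, 1, 4, 9, 10, 9, 4, 1, 0]
regions = find_concave_regions(trace)
```

Each reported region brackets one candidate observed peak; a perfectly flat stretch also satisfies the inequality with equality, which is one reason a real base caller needs additional checks.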

  7. Phred quality values $q = -10 \log_{10}(p)$, where q is the quality value and p is the estimated error probability of a base call. Examples: q = 20 means $p = 10^{-2}$ (1 error in 100 bases); q = 40 means $p = 10^{-4}$ (1 error in 10,000 bases).
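The relation between quality value and error probability can be checked numerically. The following is a small sketch; the function names are hypothetical helpers and are not part of the Phred program.

```python
import math


def error_prob_to_quality(p):
    """Quality value q = -10 * log10(p) for an error probability p."""
    return -10 * math.log10(p)


def quality_to_error_prob(q):
    """Inverse relation: p = 10 ** (-q / 10)."""
    return 10 ** (-q / 10)


def expected_errors(qualities):
    """Expected number of miscalled bases in a read, given per-base quality values."""
    return sum(quality_to_error_prob(q) for q in qualities)
```

For example, a read of 500 bases that all have q = 20 is expected to contain about 5 miscalled bases, while the same read at q = 40 is expected to contain about 0.05.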

  8. Phred Phred performs several tasks: a. Reads trace files – compatible with most file formats: SCF (standard chromatogram format), ABI (373/377/3700), ESD (MegaBACE) and LI-COR. b. Calls bases – attributes a base for each identified peak with a lower error rate than the standard base calling programs. c. Assigns quality values to the bases – a “Phred value” based on an error rate estimation calculated for each individual base. d. Creates output files – base calls and quality values are written to output files.

  9. Whole genome assembly: problem description The goal is to reconstruct an unknown source sequence (the genome) over the alphabet {A, C, G, T}, given many random short segments of the sequence, the shotgun reads. A read is a contiguous stretch of around 500 nucleotides taken from a random place in the genome. The orientation of a read is either forward or reverse complement. Reads contain two kinds of errors: base substitutions and indels. Base substitutions occur with a frequency of ca. 0.5–2%. Indels occur roughly 10 times less frequently. Reads can come from short plasmid inserts (2-12 kb), cosmids (40 kb) or BACs (150 kb). Batzoglou PhD thesis (2002)
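To make the problem setup concrete, here is a minimal sketch that samples error-free shotgun reads from a known sequence in random orientation. The function names and the 50/50 orientation choice are assumptions for illustration; substitution and indel errors are deliberately left out.

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA string over {A, C, G, T}."""
    return seq.translate(COMPLEMENT)[::-1]


def sample_reads(genome, n_reads, read_len, seed=0):
    """Draw n_reads substrings of length read_len from random positions,
    each reported either forward or reverse-complemented."""
    rng = random.Random(seed)
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_len + 1)
        read = genome[start:start + read_len]
        if rng.random() < 0.5:          # assumed: both orientations equally likely
            read = reverse_complement(read)
        reads.append(read)
    return reads


genome = "ACGTTGCAAGGCTTACGATCGGATCCTAGA"
reads = sample_reads(genome, n_reads=5, read_len=8)
```

The assembler sees only `reads`; neither the start positions nor the orientations are known to it.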

  10. Whole Genome Assemblers TIGR Assembler: G.G. Sutton et al., Genome Sci Technol 1, 9-19 (1995); PHRAP: P. Green (1996); Celera Assembler; CAP3: X. Huang, A. Madan, Genome Res 9, 868-877 (1999); RePS: J. Wang et al., Genome Res 12, 824-831 (2002); Phusion (Sanger): J.C. Mullikin, Z. Ning, Genome Res 13, 81-90 (2003); Arachne (Whitehead/MIT); Euler (UCSD, USC): P.A. Pevzner, H. Tang, M.S. Waterman, RECOMB (2001). Most assemblers follow the same approach: overlap, layout, consensus.

  11. CAP3 Assembler (1) Removal of poor end regions of reads. (2) Computation of overlaps between reads. (3) Removal of false overlaps. (4) Construction of contigs. (5) Construction of multiple sequence alignments and generation of consensus sequences.

  12. CAP3: Clipping of Low-Quality Regions • Use base quality values (from Phred) and sequence similarities to compute the 5′ and 3′ clipping positions of reads. • Definition of good regions of a read: any sufficiently long region of high-quality values that is similar to a region of another read, OR any sufficiently long region that is highly similar to a good high-quality region of another read. Figure: computation of the 5′ and 3′ clipping positions of read f. Read f has high local similarities to reads g and h. A pair of broken lines shows the start and end positions of a similarity. A thick line indicates the high-quality region of a read. Huang, Madan, Genome Res 9, 868 (1999)

  13. Celera – compartmentalized shotgun assembler Uses preliminary data from both human genome assembly projects. Huson et al. Bioinformatics 17, S132 (2001)

  14. Arachne program • By Serafim Batzoglou (MIT, PhD thesis 2000). • Create a graph G of overlaps between pairs of reads of shotgun data. • Process G to construct supercontigs of mapped reads. Batzoglou et al. Genome Res 12, 177 (2002)

  15. Earmuff links An important variation of whole-genome shotgun sequencing obtains reads from both ends of an insert, one forward and one backward. Since inserts are size-selected, the approximate distance between the pair of reads obtained from the two ends of a fragment is known. Such pairs are called earmuff links.

  16. Arachne: creation of overlap graph The input is a list of reads R = (r1, ..., rN), where N is the number of reads. Each read ri has length li < 1000. If two reads ri and rj are taken from the endpoints of the same clone (earmuff link), ri has a link to rj at a specified distance dij. First, create a graph G of overlaps (edges) between pairs of reads (nodes). This requires aligning pairs of reads in R. Since R can be very long, computing all $N^2$ pairwise alignments is infeasible. Instead, create a table of occurrences of k-mers (strings of length k) in the reads and count the number of k-mer matches for each pair of reads. Then perform pairwise alignments only between pairs of reads that share more than a cutoff number of k-mers. Batzoglou PhD thesis (2002)

  17. Arachne: table of k-mer occurrences Find the number of k-mer matches in the forward or reverse complement direction between each pair of reads in R. (1) Obtain all triplets (r, t, v), where r is a read in R, t is the index of a k-mer occurring in r, and v is the direction of occurrence (forward or reverse complement). (2) Sort the set of triplets according to the k-mer indices t. (3) Use the sorted list to create a table T of quadruplets (ri, rj, f, v), where ri and rj are reads that contain at least one common k-mer, v is a direction, and f is the number of k-mers in common between ri and rj in direction v. Batzoglou PhD thesis (2002)
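The three steps on this slide can be sketched in a few lines. This is an illustrative in-memory version, not Arachne's implementation. It makes several assumptions: the k-mer string itself stands in for the index t, f counts distinct shared k-mers, pairs are stored with i < j, and the table is a dictionary keyed by (i, j, direction). The function names are hypothetical.

```python
from collections import defaultdict
from itertools import groupby

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def kmer_occurrences(reads, k):
    """Step 1: triplets (read index, k-mer, direction of occurrence)."""
    triplets = []
    for r, read in enumerate(reads):
        rc = reverse_complement(read)
        for i in range(len(read) - k + 1):
            triplets.append((r, read[i:i + k], "fwd"))
            triplets.append((r, rc[i:i + k], "rev"))
    return triplets


def shared_kmer_table(reads, k):
    """Steps 2 and 3: sort by k-mer, then count shared k-mers per read pair."""
    triplets = sorted(kmer_occurrences(reads, k), key=lambda t: t[1])
    table = defaultdict(int)
    for _, group in groupby(triplets, key=lambda t: t[1]):
        fwd, rev = set(), set()
        for r, _, v in group:
            (fwd if v == "fwd" else rev).add(r)
        for i in fwd:
            for j in fwd:
                if i < j:
                    table[(i, j, "fwd")] += 1   # same-strand match
            for j in rev:
                if i < j:
                    table[(i, j, "rev")] += 1   # read j matches as reverse complement
    return dict(table)


reads = ["AAACCC", "ACCCTT", "AAGGGT"]  # read 2 is the reverse complement of read 1
table = shared_kmer_table(reads, k=4)
```

In this toy input, reads 1 and 2 share three 4-mers in the reverse complement direction, so that pair would be the strongest candidate for a pairwise alignment.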

  18. Arachne: table of k-mer occurrences Here: k = 3 Batzoglou PhD thesis (2002)

  19. Arachne: table of k-mer occurrences • If a k-mer occurs "too often", it is likely part of a repeat sequence and should not be used for detecting overlaps. • Implementation: find the k-mer occurrences (r, t, v) and sort them into 64 files according to the first three nucleotides of each k-mer. For i = 1, ..., 64: load file i into memory, sort it according to t, and store the sorted file. Then load the 64 sorted files into memory sequentially and create the table T incrementally. • In practice, k = 8 to 24. Batzoglou PhD thesis (2002)
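A rough sketch of the two ideas on this slide follows: partitioning occurrences by their three-nucleotide prefix (4^3 = 64 possible prefixes) and discarding k-mers that occur too often. It uses in-memory lists instead of files, and the function names and the `max_count` parameter are hypothetical.

```python
from collections import Counter, defaultdict


def bucket_by_prefix(occurrences):
    """Partition (read, kmer, direction) triplets into at most 64 buckets,
    keyed by the first three nucleotides of the k-mer."""
    buckets = defaultdict(list)
    for occ in occurrences:
        buckets[occ[1][:3]].append(occ)
    return dict(buckets)


def drop_frequent_kmers(occurrences, max_count):
    """Remove every occurrence of a k-mer that appears more than max_count times."""
    counts = Counter(kmer for _, kmer, _ in occurrences)
    return [occ for occ in occurrences if counts[occ[1]] <= max_count]


occurrences = [
    (0, "ACGT", "fwd"), (1, "ACGT", "fwd"), (2, "ACGT", "fwd"),
    (0, "ACGA", "fwd"), (1, "TTTT", "fwd"),
]
buckets = bucket_by_prefix(occurrences)
kept = drop_frequent_kmers(occurrences, max_count=2)
```

Bucketing by prefix keeps each sort small enough to fit in memory, since all occurrences of a given k-mer necessarily land in the same bucket.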

  20. Arachne: pairwise read alignments Perform pairwise alignments between reads that share more than a cutoff number of k-mers. When k-mers that are too common (occurring more often than a second cutoff) are excluded, it is guaranteed that only O(N) pairwise alignments will be performed. Only a small number of base substitutions and indels is allowed in the overlapping region of two aligned reads, so a dynamic programming alignment is used that disallows deviations of more than a few characters. Output of the alignment algorithm for reads ri, rj: quadruplets (b1, b2, e1, e2) giving the beginning positions b1, b2 and end positions e1, e2 of the detected overlap region. If a significant overlap region is detected, (ri, rj, b1, b2, e1, e2) becomes a link in the overlap graph G. Batzoglou PhD thesis (2002)
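The idea of a dynamic programming alignment that stays close to the diagonal can be illustrated with a banded edit distance. This is a generic textbook sketch, not Arachne's alignment routine: it aligns two strings globally rather than detecting an overlap region, and the `band` parameter is an assumed stand-in for "a few characters".

```python
def banded_edit_distance(a, b, band):
    """Edit distance between a and b using only DP cells with |i - j| <= band.
    Returns None when the length difference alone exceeds the band.
    If the true distance exceeds the band, the result is only an upper bound."""
    inf = float("inf")
    n, m = len(a), len(b)
    if abs(n - m) > band:
        return None
    # Row 0: aligning the empty prefix of a with b[:j] costs j insertions.
    prev = {j: j for j in range(0, min(m, band) + 1)}
    for i in range(1, n + 1):
        cur = {}
        for j in range(max(0, i - band), min(m, i + band) + 1):
            if j == 0:
                cur[j] = i                  # delete the first i characters of a
                continue
            substitute = prev.get(j - 1, inf) + (a[i - 1] != b[j - 1])
            delete = prev.get(j, inf) + 1
            insert = cur.get(j - 1, inf) + 1
            cur[j] = min(substitute, delete, insert)
        prev = cur
    return prev[m]
```

Because each row visits at most 2 * band + 1 cells, the cost is linear in the read length for a fixed band, which is what makes aligning many candidate pairs affordable.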

  21. Correcting errors in reads Shown is a portion of a multiple alignment between 5 reads. A base T of quality 30 is aligned to bases C, some of which are of quality greater than 30. The base T is subsequently changed to a base C of quality 30. Batzoglou et al. Genome Res 12, 177 (2002)
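The correction described on this slide can be mimicked for a single alignment column. The rule below is a hypothetical simplification, since the slide does not state Arachne's exact criterion: a lone dissenting base is overwritten by the consensus base when some consensus base has a higher quality, and the corrected base keeps its original quality value, as in the slide's example.

```python
from collections import Counter


def correct_column(column):
    """column: list of (base, quality) pairs from one multiple-alignment column.
    Returns a corrected copy under the assumed single-dissenter rule."""
    counts = Counter(base for base, _ in column)
    consensus, support = counts.most_common(1)[0]
    # Assumed rule: act only if at most one base disagrees and
    # at least two bases support the consensus.
    if support < max(2, len(column) - 1):
        return list(column)
    best_consensus_quality = max(q for b, q in column if b == consensus)
    corrected = []
    for base, quality in column:
        if base != consensus and best_consensus_quality > quality:
            corrected.append((consensus, quality))  # keep the original quality
        else:
            corrected.append((base, quality))
    return corrected


column = [("C", 40), ("C", 35), ("T", 30), ("C", 25), ("C", 30)]
fixed = correct_column(column)
```

A dissenting base whose quality exceeds every consensus base is left alone under this rule, because it may reflect a genuine difference such as a repeat copy rather than a sequencing error.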

  22. Partial alignments Three partial alignments of length k = 6 between a pair of reads coalesce to yield a single full alignment of length 19. Vertical bars denote matching bases, whereas x's denote mismatches. This illustrates the commonly occurring situation where an extended k-mer hit is a full alignment between two reads. Batzoglou et al. Genome Res 12, 177 (2002)

  23. Ambiguity created by the presence of repeats In the absence of sequencing errors and repeats, it would be simple to retrieve all retrievable pairwise distances of reads and to construct G. In the presence of repeats, a link between two reads in G does not necessarily imply true overlap. A "repeat link" is a link in G between two reads that come from different regions of the genome and overlap in a repeated segment. Batzoglou PhD thesis (2002)

  24. Arachne: processing of overlap graph Some of the repetition in the genome is efficiently masked before the creation of G by throwing away k-mers of high frequency when building T. Furthermore some heuristic algorithms are used to detect and delete repetitive links (not discussed here). Batzoglou PhD thesis (2002)

  25. Merging contigs Sequence contigs are formed by merging together pairs of reads that can be merged without ambiguity. In practice the situation is much worse than shown here. Repeats are not 100% conserved between copies. Batzoglou PhD thesis (2002)

  26. Sequence contigs Batzoglou PhD thesis (2002)

  27. Using paired pairs of overlaps to merge reads Arachne searches for instances of two plasmids of similar insert size with sequence overlaps occurring at both ends; these are called paired pairs. (A) A paired pair of overlaps. The top two reads are end sequences from one insert, and the bottom two reads are end sequences from another. The two overlaps must not imply too large a discrepancy between the insert lengths. (B) Initially, the top two pairs of reads are merged. Then the third pair of reads is merged in, based on having an overlap with one of the top two left reads, an overlap with one of the top two right reads, and consistent insert lengths. The bottom pair is similarly merged. Bottom: collections of paired pairs are merged into contigs, and consensus sequences are formed. Batzoglou et al. Genome Res 12, 177 (2002)

  28. Detection of repeat contigs Some of the identified contigs are repeat contigs, in which nearly identical sequences from distinct regions are collapsed together. They can be detected because (a) repeat contigs usually have an unusually high depth of coverage, and (b) they typically have conflicting links to other contigs. Figure: contig R is linked to contigs A and B to the right. The distances estimated between R and A and between R and B are such that A and B cannot be positioned without substantial overlap between them. If there is no corresponding detected overlap between A and B, then R is probably a repeat linking to two unique regions to the right. After marking repeat contigs, the remaining contigs should represent the correctly assembled sequence. Batzoglou et al. Genome Res 12, 177 (2002)
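Criterion (b) can be sketched as a simple interval check. The coordinate convention (distances measured from the right end of R to the left end of each linked contig) and the `min_overlap` threshold are assumptions for illustration, not values taken from Arachne.

```python
def implied_overlap(d_ra, len_a, d_rb, len_b):
    """Overlap length between A and B implied by placing both to the right of R.
    d_ra and d_rb are the estimated gaps from R's right end to each contig."""
    start_a, end_a = d_ra, d_ra + len_a
    start_b, end_b = d_rb, d_rb + len_b
    return max(0, min(end_a, end_b) - max(start_a, start_b))


def looks_like_repeat(d_ra, len_a, d_rb, len_b, overlap_detected, min_overlap=100):
    """Flag R as a likely repeat contig when its links force A and B to overlap
    substantially although no sequence overlap between A and B was detected."""
    forced = implied_overlap(d_ra, len_a, d_rb, len_b)
    return forced >= min_overlap and not overlap_detected
```

With A placed at 500 to 3500 and B at 800 to 2800 relative to R, the links imply a 2000-base overlap; if A and B share no detected overlap, R is flagged.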

  29. Supercontig creation and gap filling Unmarked contigs = unique contigs. Iteratively merge contigs into supercontigs. (A) A supercontig is constructed by successively linking pairs of contigs that share at least two forward-reverse links. Here, 3 contigs are joined into one supercontig. The layout now consists of a number of supercontigs with interleaved gaps. Most gaps belong to regions marked as repeat contigs; some correspond to regions with insufficient shotgun reads. (B) Arachne attempts to fill gaps by using paths of contigs. The first gap in the supercontig shown here is filled with one contig, and the second gap is filled by a path consisting of two contigs. Batzoglou et al. Genome Res 12, 177 (2002)

  30. Contig assembly If (a,b) and (a,c) overlap, then (b,c) are expected to overlap. Moreover, one can calculate that $\mathrm{shift}(b,c) = \mathrm{shift}(a,c) - \mathrm{shift}(a,b)$. A repeat boundary is detected toward the right of read a if there is no overlap (b,c), nor any path of reads x1, ..., xk such that (b,x1), (x1,x2), ..., (xk,c) are all overlaps and $\mathrm{shift}(b,x_1) + \dots + \mathrm{shift}(x_k,c) \approx \mathrm{shift}(a,c) - \mathrm{shift}(a,b)$. Batzoglou et al. Genome Res 12, 177 (2002)
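The repeat-boundary test on this slide amounts to a bounded path search. The sketch below makes several assumptions: overlaps are stored as a dictionary from a read to a list of (neighbor, shift) pairs, a direct overlap (b,c) is treated as a path of length one that must also have a consistent shift, read a itself is excluded from paths, and `tol` and `max_steps` are hypothetical parameters.

```python
def consistent_path_exists(overlaps, b, c, target, tol, exclude=(), max_steps=4):
    """Depth-first search for a path b -> ... -> c whose shifts sum to
    approximately target (within tol), using at most max_steps overlaps."""
    stack = [(b, 0, 0, frozenset([b, *exclude]))]
    while stack:
        node, total, steps, seen = stack.pop()
        if node == c and steps > 0:
            if abs(total - target) <= tol:
                return True
            continue
        if steps == max_steps:
            continue
        for neighbor, shift in overlaps.get(node, []):
            if neighbor not in seen:
                stack.append((neighbor, total + shift, steps + 1, seen | {neighbor}))
    return False


def repeat_boundary_right_of(a, b, c, overlaps, tol=5):
    """True if no overlap path from b to c matches shift(a,c) - shift(a,b)."""
    shifts_from_a = dict(overlaps[a])
    expected = shifts_from_a[c] - shifts_from_a[b]
    return not consistent_path_exists(overlaps, b, c, expected, tol, exclude=(a,))


overlaps = {"a": [("b", 100), ("c", 250)], "b": [("x", 80)], "x": [("c", 72)]}
broken = {"a": [("b", 100), ("c", 250)], "b": [("x", 80)]}
```

In `overlaps`, the path b, x, c has total shift 152 against an expected 150, so no repeat boundary is reported; in `broken`, the link from x to c is missing and read a is flagged.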

  31. Consistency of forward-reverse links • The distance d(A,B) (length of gap or negated length of overlap) between two linked contigs A and B can be estimated using the forward-reverse linked reads between them. • The distance d(B,C) between two contigs B, C that are linked to the same contig A can be estimated from their respective distances to the linked contig. Batzoglou et al. Genome Res 12, 177 (2002)
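Both estimates on this slide follow from simple interval arithmetic. The sketch assumes a particular geometry that the slide does not spell out: A lies to the left of B, each link is given as (insert length, start of the forward read measured from A's left end, outer end of the reverse read measured from B's left end), and B and C both lie to the right of A. All function names are hypothetical.

```python
from statistics import mean


def gap_from_link(insert_len, len_a, pos_in_a, pos_in_b):
    """Gap implied by one forward-reverse link: the insert covers the tail of A,
    the gap, and the head of B. A negative result means the contigs overlap."""
    return insert_len - (len_a - pos_in_a) - pos_in_b


def estimate_gap(links, len_a):
    """Average the gap estimates over all links between A and B."""
    return mean(gap_from_link(insert_len, len_a, pa, pb) for insert_len, pa, pb in links)


def gap_between_siblings(d_ab, len_b, d_ac):
    """Distance d(B, C) for contigs B and C that are both linked to the right of A,
    with B closer to A: what remains of d(A, C) after the gap to B and B itself."""
    return d_ac - d_ab - len_b


links = [(4000, 8000, 1500), (4000, 9000, 2400)]
d_ab = estimate_gap(links, len_a=10000)
d_bc = gap_between_siblings(d_ab, len_b=3000, d_ac=5000)
```

Here the two links imply gaps of 500 and 600 bases, giving an average of 550; averaging over several links reduces the effect of the variation in insert sizes.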

  32. Types of misassemblies (A) Three types of simple minor misassemblies are shown: insertions, deletions, and hanging ends. In all cases, a contiguous segment (of a contig or the genome) of less than 10 kb does not align in the expected location (with the genome or contig). (B) More misassemblies. First, two pieces of a contig align to distant parts of the genome. Second, adjacent contigs in a supercontig are aligned to distant parts of the genome. Batzoglou et al. Genome Res 12, 177 (2002)

  33. Filling gaps in supercontigs • Contigs A and B are connected by a path p of contigs X1, ..., Xk. The distance dp(A,B) between A and B (along the path p) is the length of the sequence in the path that does not overlap A and B. • Contigs Y1 and Y2 share forward-reverse links with the supercontig S. These links position them in the vicinity of the gap between A and B. Therefore, Y1 and Y2 will be used as possible stepping points in the path closing the gap from A to B. Batzoglou et al. Genome Res 12, 177 (2002)

  34. Detection of chimeric reads Reads l1, l2, l3, r1, r2, and r3, and the absence of a read n (having long overlaps on both sides of a point x) suggest that read c may be chimeric, consisting of the juxtaposition of two disparate genomic segments: one corresponding to the part of c before x, and one corresponding to the part of c after x. Note that reads l3 and r3 extend slightly beyond x, as often happens for real chimeric reads. Batzoglou et al. Genome Res 12, 177 (2002)

  35. Contig Coverage and Read Usage Batzoglou et al. Genome Res 12, 177 (2002)

  36. Characterization of Contigs Batzoglou et al. Genome Res 12, 177 (2002)

  37. Characterization of Supercontigs Batzoglou et al. Genome Res 12, 177 (2002)

  38. Base Pair Accuracy Batzoglou et al. Genome Res 12, 177 (2002)

  39. Misassemblies Batzoglou et al. Genome Res 12, 177 (2002)

  40. Computational Performance Batzoglou et al. Genome Res 12, 177 (2002)

  41. Contig Coverage and Read Usage Batzoglou et al. Genome Res 12, 177 (2002)

  42. Comparison of different assemblers • You should look out for: the smallest number of contigs and misassembled contigs; the highest possible coverage by contigs; the lowest possible coverage by misassembled contigs. Pevzner, Tang, Waterman PNAS 98, 9748 (2001)

  43. There is no error-free assembler to date Comparative analysis of the EULER, PHRAP, CAP, and TIGR assemblers (NM sequencing project). Every box corresponds to a contig in the NM assembly produced by these programs, with colored boxes corresponding to assembly errors. Boxes in the IDEAL assembly correspond to islands in the read coverage. Boxes of the same color show misassembled contigs. Repeats with similarity higher than 95% are indicated by numbered boxes at the solid line showing the genome. To check the accuracy of the assembled contigs, we fit each assembled contig into the genomic sequence. Inability to fit a contig into the genomic sequence indicates that the contig is misassembled. For example, PHRAP misassembles 17 contigs in the NM sequencing project, each contig containing from two to four fragments from different parts of the genome. "Biologists 'pay' for these errors at the time-consuming finishing step". Pevzner, Tang, Waterman PNAS 98, 9748 (2001)

  44. What comes next? Finishing the genome Usually, the assembly of shotgun data ends with a number of contigs and some remaining gaps. Also, within each contig there are some regions with a high error rate. The goal of the finishing phase is to obtain a single continuous contig with a low error rate. "Finishers" apply ad hoc rules to decide where additional data are necessary. These additional data may then be generated in experiments using different chemistry or higher coverage. Autofinish (phrap group) is a program that helps humans decide which new reads to obtain.

  45. Human experts are only rarely needed ... D. Gordon, C. Desmarais, P. Green, Genome Res, 11, 614 (2001)
