ECE697S: Topics in Computational Biology Lecture 4: Sequence Assembly
Modern Sequencing Methods • Sanger (1982) introduced a sequencing method amenable to automation. • Whole-genome sequencing: Clone-By-Clone vs. Shotgun Assembly • Drosophila melanogaster sequenced (Myers et al. 2000) • Homo sapiens sequenced (Venter et al. 2001)
Sanger (1982) introduced chain-termination sequencing. Main idea: Obtain fragments of all possible lengths, each ending in a known base (A, C, T, or G). Using gel electrophoresis, we can separate the fragments by length, and then read off the sequence from the terminating base at each length.
Automated Sequencing Perkin-Elmer 3700: Can sequence ~500bp with 98.5% accuracy
Reads and Contigs Sequencing machines are limited to reads of about 500-750bp, so we must break up DNA into short and long fragments, with reads on either end. Reads are then assembled into contigs, and contigs into scaffolds.
Clone-by-Clone vs. Shotgun • Traditionally, long fragments are mapped, and a minimum tiling path of fragments covering the genome is chosen. Shotgun assembly is then used to sequence each long fragment. • Whole-genome shotgun assembly is cheaper, but requires more computational resources. • Drosophila was successfully sequenced using shotgun assembly.
Difficulties • Good coverage does not guarantee that we can “see” repeats. • Read coverage is generally not “truly” random, due to complications in fragmentation and cloning. • Any automated approach requires extensive post-processing.
The Fruit Fly • Drosophila melanogaster was sequenced in 2000 using whole genome shotgun assembly. • Genome size is ~120Mbp for euchromatic (coding) portion, with roughly 13,600 genes. • The genome is still being refined.
NIH used a Clone-By-Clone strategy; Celera used shotgun assembly. Celera used 300 sequencing machines in parallel to obtain 175,000 reads per day. Efforts were combined, resulting in 8x coverage of the human genome; consensus sequence is 2.91 billion base pairs.
Human Genome Sequence • Taken as the consensus sequence among 5 subjects. • Gene identification is by homology; we have roughly 20,000 genes. • Euchromatic DNA is “coding”, rest is “junk”.
Abstraction • The basic question is: given a set of fragments from a long string, can we reconstruct the string? • What is the shortest common superstring of the given fragments?
Overlap-Layout-Consensus • Construct a (directed) overlap graph, where nodes represent reads and edges represent overlaps. Paths in this graph are contigs. • Problem: Find the consensus sequence by finding a path that visits all nodes in the layout graph (a Hamiltonian path). • Note: This is an idealization, since we must handle errors!
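The overlap step can be sketched as follows. This is a minimal illustration that assumes error-free, distinct reads and exact suffix-prefix matches; the names `overlap`, `overlap_graph`, and the `min_len` threshold are hypothetical and not taken from any particular assembler.

```python
from itertools import permutations


def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 when the longest match is shorter than `min_len`.
    """
    max_len = min(len(a), len(b))
    for n in range(max_len, min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def overlap_graph(reads, min_len=3):
    """Directed overlap graph as {(a, b): overlap_length}.

    Assumes reads are distinct and error-free (an idealization).
    """
    graph = {}
    for a, b in permutations(reads, 2):
        n = overlap(a, b, min_len)
        if n > 0:
            graph[(a, b)] = n
    return graph


reads = ["ATGCG", "GCGTA", "GTACC"]
graph = overlap_graph(reads)
```

Comparing all pairs of reads is quadratic in the number of reads; the sketch favors clarity over speed.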
Approximation Algorithms • The shortest common superstring problem is NP-hard (its decision version is NP-complete). • Greedily choosing edges is a 4-approximation, conjectured to be a 2-approximation. • Another idea: TSP has a 2-approximation if the edge weights are metric (Waterman et al. 1976 gives such metrics).
Handling Repeats • We can estimate how many reads should cover a given region, based on the overall coverage. • Repeats will “seem” to have unusually good coverage. • Celera’s algorithms are proprietary, but there is no explicit way to handle repeats in the overlap-layout-consensus paradigm.
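A minimal sketch of the greedy heuristic: repeatedly merge the pair of fragments with the largest suffix-prefix overlap. It assumes error-free fragments, and the function names are illustrative only. The result is always a common superstring, but not necessarily the shortest one.

```python
def overlap(a, b):
    """Length of the longest suffix of `a` that is a prefix of `b`."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_superstring(fragments):
    """Greedy merge heuristic for the shortest common superstring."""
    # Drop duplicates and fragments contained in another fragment.
    frags = list(dict.fromkeys(fragments))
    frags = [f for f in frags if not any(f != g and f in g for g in frags)]
    while len(frags) > 1:
        best_n, best_i, best_j = -1, 0, 1
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        # Merge the best pair, writing the shared overlap only once.
        merged = frags[best_i] + frags[best_j][best_n:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags.append(merged)
    return frags[0] if frags else ""


fragments = ["ATGCG", "GCGTA", "GTACC"]
superstring = greedy_superstring(fragments)
```

Each round rescans all pairs, so this version is far from efficient; it is meant only to make the "greedily choosing edges" step concrete.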
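As a rough illustration of the coverage argument, expected coverage is (number of reads × read length) / genome length, and regions whose observed depth is far above that are repeat candidates. The threshold `factor=2.0`, the contig names, and all numbers below are assumptions made up for this sketch, not values from any real pipeline.

```python
def expected_coverage(num_reads, read_len, genome_len):
    """Average number of reads covering each base: N * L / G."""
    return num_reads * read_len / genome_len


def flag_repeat_candidates(contig_depths, expected, factor=2.0):
    """Names of contigs whose read depth is at least `factor` times expected.

    `factor` is an arbitrary illustrative threshold.
    """
    return [name for name, depth in contig_depths.items()
            if depth >= factor * expected]


# Hypothetical numbers: 2 million reads of 500bp over a 120Mbp genome.
cov = expected_coverage(2_000_000, 500, 120_000_000)

# Hypothetical observed depths for three contigs.
depths = {"contig1": 8.1, "contig2": 17.5, "contig3": 7.6}
suspects = flag_repeat_candidates(depths, expected=8.0)
```

A contig at roughly twice the expected depth is consistent with two collapsed copies of a repeat, though uneven read coverage can produce the same signal.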
Hybridization Suppose we had a way to probe fragments of length k that were present in our sequence, from a hybridization assay. Commercial products: Affymetrix GeneChip, Agilent, Amersham, etc.
Sequencing-By-Hybridization • Then instead of reads, we have regularly sized fragments, k-mers. • Construct a multigraph G with (k-1)-mers as nodes, with edges representing k-mers. G is a de Bruijn graph. • Idea: An Eulerian path in G corresponds to the assembled sequence, and we don’t lose repeats (Pevzner 1989).
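The graph construction can be sketched as below, under the assumption of an error-free k-mer spectrum. The function names are illustrative; the graph is stored as an adjacency list in which a repeated k-mer yields a repeated edge, so it is a multigraph.

```python
from collections import defaultdict


def kmer_spectrum(seq, k):
    """All k-mers of `seq`, in order of occurrence (with multiplicity)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def de_bruijn(kmers):
    """Adjacency lists {(k-1)-mer prefix: [(k-1)-mer suffixes]}.

    Each k-mer contributes one edge from its prefix to its suffix.
    """
    graph = defaultdict(list)
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


kmers = kmer_spectrum("ATGGCGTGCA", 3)
graph = de_bruijn(kmers)
```

In this example the nodes `TG` and `GC` each have two outgoing edges, which is where the repeated (k-1)-mers of the sequence show up in the graph.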
Bridges of Königsberg Theorem (Euler 1736): A graph has a path visiting every edge exactly once if and only if it is connected and has either 0 or 2 vertices of odd degree.
Pros and Cons • An Eulerian path in a graph can be found in linear time, if one exists. • Errors in the hybridization experiments may prevent us from finding a solution. • Can we just use reads as “virtual” hybridization data?
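One standard linear-time method is Hierholzer's algorithm, sketched here for the directed case used in sequencing-by-hybridization. The sketch assumes an error-free k-mer list; `eulerian_path` and `assemble` are illustrative names. For directed graphs the degree condition becomes: every node has equal in-degree and out-degree, except possibly one start node (one extra outgoing edge) and one end node (one extra incoming edge).

```python
from collections import defaultdict


def eulerian_path(edges):
    """Eulerian path over directed `edges` [(u, v), ...], or None."""
    if not edges:
        return []
    out_edges = defaultdict(list)
    balance = defaultdict(int)  # out-degree minus in-degree
    for u, v in edges:
        out_edges[u].append(v)
        balance[u] += 1
        balance[v] -= 1
    nonzero = sorted(b for b in balance.values() if b != 0)
    if nonzero not in ([], [-1, 1]):
        return None  # degree condition violated
    start = next((n for n, b in balance.items() if b == 1), edges[0][0])
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if out_edges[node]:
            stack.append(out_edges[node].pop())  # follow an unused edge
        else:
            path.append(stack.pop())  # dead end: emit node
    path.reverse()
    if len(path) != len(edges) + 1:
        return None  # some edges unreachable: graph not connected
    return path


def assemble(kmers):
    """Spell a sequence from an Eulerian path over the k-mers."""
    path = eulerian_path([(k[:-1], k[1:]) for k in kmers])
    if not path:
        return None
    return path[0] + "".join(node[-1] for node in path[1:])


kmers = ["ATG", "TGG", "GGC", "GCG", "CGT", "GTG", "TGC", "GCA"]
sequence = assemble(kmers)
```

Each edge is pushed and popped once, which gives the linear running time. A graph can have several Eulerian paths, so the spelled sequence is only guaranteed to have the same k-mers as the input, not to match the original sequence.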
Graph Preprocessing • Each read error produces up to k missing/erroneous edges. But we cannot correct this until we are done assembling! • Greedily mutate reads to minimize the size of the set of k-mers. • We also need to deal with repeats, which requires contracting certain paths to single edges…
Multiple Sequence Alignment • Construct a de Bruijn graph as before, where each sequence is a path in the graph. • Find the heaviest paths in this graph; these are “consensus” sequences. • Align each consensus sequence with library sequences to find common subsequences.
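To see why a single read error affects up to k edges, the sketch below compares the k-mers of a read containing one substitution against those of the true sequence. The sequences and function names are made up for illustration; in practice the true sequence is of course unknown.

```python
def kmer_set(seq, k):
    """Set of distinct k-mers in `seq`."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def erroneous_kmers(true_seq, read, k):
    """k-mers present in the read but absent from the true sequence."""
    return kmer_set(read, k) - kmer_set(true_seq, k)


true_seq = "ATGGCGTGCA"
read = "ATGGAGTGCA"  # one substitution (C -> A) at position 4, 0-based

bad = erroneous_kmers(true_seq, read, k=3)
missing = kmer_set(true_seq, 3) - kmer_set(read, 3)
```

Every k-mer window that spans the erroneous base is corrupted, so one substitution here introduces three spurious 3-mers and hides three correct ones. Correcting the read shrinks the overall k-mer set, which is the intuition behind the greedy mutation step.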