Mohamed Tikah Marrakchi, BIN6002, Summer 2005. An Eulerian path approach to DNA fragment assembly. Pavel A. Pevzner, Haixu Tang, and Michael S. Waterman, PNAS 2001.
Genome sequencing
• Analysing DNA with improved technologies opens a new era in biomedical research (and not only there).
• In the past few years, the genome sequences of many organisms have been generated:
  • a yeast (Saccharomyces cerevisiae)
  • a nematode (C. elegans)
  • a fly (Drosophila melanogaster)
  • a plant (Arabidopsis thaliana)
  • human (Homo sapiens)
Genome sequencing
• The sequencing strategy is chosen according to the intended use of the genome sequence data:
  • a detailed 'blueprint' (for example, to establish a gene catalogue)
  • a less detailed sequence (to acquire information about repetitive sequences, carry out simple comparisons with other organisms...)
Genome sequencing
• Central to almost all past and current genome sequencing projects is the 'dideoxy chain termination' sequencing method (developed by Fred Sanger and colleagues in the 1970s).
• It relies on the electrophoretic separation and detection of in vitro synthesized, single-stranded DNA molecules terminated with dideoxynucleotides.
Genome sequencing
• Many improvements have been made to the Sanger sequencing method:
  • laser-based instrumentation that detects fluorescently labeled DNA molecules
  • thermostable polymerases
  • more robust fluorescent dye systems
  • robotic systems that automate specific steps in the sequencing process (preparing sequencing reactions, loading samples on gels...)
Genome sequencing
• The quality and overall throughput of DNA sequencing have improved, and its cost has decreased (about 100-fold in the past decade).
Genome sequencing
• Software tools were developed for analysing sequence data and for carrying out sequence assembly:
  • calling the nucleotide base at each position and assigning a corresponding quality score (e.g. phred)
  • assembling sequences, using the quality scores to calculate accuracy rates that are helpful for the analysis and finishing steps (e.g. phrap)
  • user-friendly viewers (e.g. consed)
Genome sequencing
• Many strategies have been tested for sequencing large genomes.
• 'Shotgun' sequencing (described in the 1980s) was found to be very efficient:
  • a large piece of DNA is first fragmented into smaller pieces, generating a redundant amount of sequence
  • sequence data are obtained from the random fragments, and the sequence reads are then pieced together
Genome sequencing
• Strategies in shotgun sequencing:
  • a strategy using large-insert clones and associated physical maps
  • a strategy taking a whole genome approach (without using clone-based physical maps)
  • hybrid strategies involving both approaches
Clone by clone shotgun sequencing
• Also referred to as hierarchical shotgun sequencing or map-based shotgun sequencing.
• Map construction: pieces of genomic DNA are cloned using a host-vector system (bacteria or yeast).
• Individual clones are analysed for the presence of unique DNA landmarks (STSs, restriction sites...), which are used to assemble maps of overlapping clones.
  • YACs (yeast artificial chromosomes): up to a megabase pair in size (used for the first physical maps of the human and mouse genomes)
  • BACs (bacterial artificial chromosomes): 100 - 200 kb
Clone by clone shotgun sequencing [2] Eric D. Green. Nature Reviews Genetics 2, 573-583 (2001)
Clone by clone shotgun sequencing
• Clone selection: using the assembled BAC contig map, minimally overlapping clones are selected for shotgun sequencing.
• A BAC is usually selected for sequencing if each of the restriction fragments in its fingerprint is also present in an overlapping clone.
Clone by clone shotgun sequencing [2] Eric D. Green. Nature Reviews Genetics 2, 573-583 (2001)
Clone by clone shotgun sequencing
• Subclone library selection: the cloned DNA in each selected BAC is randomly fragmented, and the fragments are subcloned into a plasmid.
  • even assemblies generated with low coverage (3 - 5x) can be used for important analyses, providing a 'working draft sequence'
  • highly accurate sequences (>99.99% accurate) are obtained with 8 - 10x coverage
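To make the coverage figures concrete, here is a minimal sketch of the usual fold-coverage arithmetic (total bases sequenced divided by the length of the target). The 150 kb BAC size and the 500-base read length are illustrative assumptions, not values from the slides.

```python
import math


def fold_coverage(num_reads: int, read_length: int, target_length: int) -> float:
    """Average fold coverage: total sequenced bases divided by target length."""
    return num_reads * read_length / target_length


def reads_needed(coverage: float, read_length: int, target_length: int) -> int:
    """Smallest number of reads whose total length reaches the requested coverage."""
    return math.ceil(coverage * target_length / read_length)


# Hypothetical numbers: a 150 kb BAC and 500-base reads.
BAC_LENGTH = 150_000
READ_LENGTH = 500

draft = reads_needed(4, READ_LENGTH, BAC_LENGTH)     # within the 3 - 5x draft range
finished = reads_needed(8, READ_LENGTH, BAC_LENGTH)  # start of the 8 - 10x range
print(draft, finished)
```

Under these assumptions, doubling the coverage simply doubles the number of reads; the calculation says nothing about how evenly the reads fall on the clone.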
Clone by clone shotgun sequencing
• Directed finishing phase:
  • sequence finishing is the process in which remaining problems with the assembly are resolved: discontinuities between sequence contigs (gaps), areas of low quality, ambiguous bases in the consensus sequence...
  • software facilitates the process of sequence finishing:
    • phred quality scores provide the statistical foundation for sequence assembly programs (phrap)
    • Autofinish automates the finishing process (it recommends specific additional sequencing reactions)
Clone by clone shotgun sequencing [2] Eric D. Green. Nature Reviews Genetics 2, 573-583 (2001)
Whole genome shotgun sequencing
• Assembly of sequence reads generated in a random, genome-wide fashion.
• Bypasses the need for a clone-based physical map.
• Pieces of the entire genome are subcloned in suitable plasmid vectors.
• Sequence reads are generated from both ends of the subclones to produce highly redundant sequence coverage, which helps deal with the problem of repetitive sequences.
Whole genome shotgun sequencing
• Examples: Drosophila, Haemophilus influenzae.
• Using several size classes of subclones is important.
• The availability of long-range mapping data is also crucially important.
• Software tools.
Whole genome shotgun sequencing [2] Eric D. Green. Nature Reviews Genetics 2, 573-583 (2001)
Hybrid strategies for shotgun sequencing
The two strategies, clone-by-clone and whole genome shotgun sequencing, are not mutually exclusive.
• A mixed approach captures the advantages of both approaches:
  • it provides rapid insight into the sequence of the entire genome
  • it minimizes the likelihood of serious misassemblies
• The challenge is finding the optimal balance between generating sequence reads in a clone-by-clone versus a whole genome fashion.
Hybrid strategies for shotgun sequencing [2] Eric D. Green. Nature Reviews Genetics 2, 573-583 (2001)
Sequencing of genomes from multicellular organisms [2] Eric D. Green. Nature Reviews Genetics 2, 573-583 (2001)
DNA Fragment Assembly
• Fragment assembly is like trying to assemble a big puzzle.
• It follows the "overlap - layout - consensus" paradigm, which is used in almost all available assembly tools (phrap, cap, tigr, celera).
• No polynomial-time algorithm is known for the layout step.
• The finishing step is time consuming.
DNA Fragment Assembly
• Fragment assembly problem: finding a path in the overlap graph that visits every read (vertex) exactly once.
• This is the Hamiltonian path problem, which is NP-complete: a difficult problem.
[1] Pavel A. Pevzner, Haixu Tang, Michael S. Waterman. PNAS, August 2001.
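The overlap graph view can be illustrated with a toy sketch. The reads, the minimum overlap length and the function names below are illustrative assumptions, and the search over all orderings is deliberately brute force: it only works for a handful of reads, which is the point of the NP-completeness remark.

```python
from itertools import permutations


def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of a equal to a prefix of b (0 if below min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def overlap_graph(reads, min_len):
    """Directed edges (a, b) -> overlap length, for every ordered pair of distinct reads."""
    edges = {}
    for a in reads:
        for b in reads:
            if a != b:
                k = overlap(a, b, min_len)
                if k > 0:
                    edges[(a, b)] = k
    return edges


def hamiltonian_paths(reads, min_len):
    """Brute force: every ordering of the reads in which consecutive reads overlap."""
    edges = overlap_graph(reads, min_len)
    return [p for p in permutations(reads)
            if all((p[i], p[i + 1]) in edges for i in range(len(p) - 1))]


reads = ["ATGGC", "GGCGT", "CGTGC"]  # toy reads from the sequence ATGGCGTGC
print(hamiltonian_paths(reads, min_len=3))
```

With n reads the loop inspects n! orderings, so this is a demonstration of the problem statement rather than a usable layout algorithm.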
DNA Fragment Assembly
• EULER is a new algorithm and software tool that solves the repeat problem.
• It uses a counter-intuitive approach, which consists in breaking the puzzle into even more pieces!
  • the loss of information is minimal (if we still use 'big' pieces)
  • the information can be restored in later stages
• It has no overlap step.
• It reduces the NP-complete Hamiltonian path problem to an easy-to-solve Eulerian path problem.
DNA Fragment Assembly [1] Pavel A. Pevzner, Haixu Tang, Michael S. Waterman. PNAS, August 2001.
DNA Fragment Assembly - EULER
• A repeat corresponds to an edge rather than a collection of vertices.
• The problem is transformed into finding a path that visits every edge of the graph exactly once.
• This is the Eulerian path problem.
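Unlike the Hamiltonian path problem, the Eulerian path problem can be solved in time linear in the number of edges. The sketch below uses Hierholzer's algorithm on a directed graph given as a list of edges; it is a generic textbook routine, not the EULER implementation, and the four-edge graph is a made-up example.

```python
from collections import defaultdict


def eulerian_path(edges):
    """Return the vertices of a path that uses every directed edge exactly once."""
    out_edges = defaultdict(list)
    balance = defaultdict(int)  # out-degree minus in-degree of each vertex
    for u, v in edges:
        out_edges[u].append(v)
        balance[u] += 1
        balance[v] -= 1

    # A path (rather than a cycle) must start at the vertex with one extra out-edge.
    starts = [v for v, b in balance.items() if b == 1]
    ends = [v for v, b in balance.items() if b == -1]
    if len(starts) > 1 or len(ends) > 1 or any(abs(b) > 1 for b in balance.values()):
        raise ValueError("graph has no Eulerian path")
    start = starts[0] if starts else edges[0][0]

    # Hierholzer's algorithm with an explicit stack.
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if out_edges[u]:
            stack.append(out_edges[u].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    if len(path) != len(edges) + 1:
        raise ValueError("edges are not all reachable from the start vertex")
    return path


edges = [("A", "B"), ("B", "C"), ("C", "A"), ("A", "D")]
print(eulerian_path(edges))
```

Each edge is pushed and popped once, so the running time grows linearly with the number of edges.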
DNA Fragment Assembly - EULER
How to construct the de Bruijn graph from sequencing reads?
• The finished DNA sequence is not available... it is actually what we are looking for!
• Consensus (error correction in reads) is the first step in the proposed approach.
• But again, how can we correct the errors without having the final sequence?
DNA Fragment Assembly - EULER
• Unlike existing tools, EULER starts with the 'consensus' step:
  • Spectral Alignment Problem
  • Error Correction Problem
DNA Fragment Assembly - EULER
• Spectral Alignment Problem: two types of l-tuples (contiguous subsequences of length l) are defined:
  • solid l-tuples, which belong to more than M reads (M is a threshold)
  • weak l-tuples otherwise
• Let T be a set of l-tuples (called a spectrum).
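The solid/weak distinction can be sketched as a simple count of how many reads contain each l-tuple. This is a simplified illustration: it ignores reverse complements, and the reads, l and M below are made-up values.

```python
from collections import Counter


def l_tuples(read: str, l: int):
    """All contiguous substrings of length l of a read."""
    return [read[i:i + l] for i in range(len(read) - l + 1)]


def classify_l_tuples(reads, l: int, m: int):
    """Split the l-tuples of the reads into solid (in more than m reads) and weak."""
    counts = Counter()
    for read in reads:
        counts.update(set(l_tuples(read, l)))  # count each read at most once per l-tuple
    solid = {t for t, c in counts.items() if c > m}
    weak = set(counts) - solid
    return solid, weak


# Three identical reads and one read with a single substitution (G -> C).
reads = ["ACGTA", "ACGTA", "ACGTA", "ACCTA"]
solid, weak = classify_l_tuples(reads, l=3, m=2)
print(sorted(solid), sorted(weak))
```

In this toy case the l-tuples created by the substitution appear in a single read and therefore end up weak.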
DNA Fragment Assembly - EULER
• Given a string s and a spectrum T, find the minimum number of mutations in s that transform s into a T-string (i.e. a string whose l-tuples all belong to T).
• The problem is solved using dynamic programming.
• Each read is spectrally aligned against the set of all solid l-tuples.
• Iterating spectral alignments over the set of reads reduces the number of weak l-tuples (and increases the number of solid l-tuples).
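To illustrate the problem statement only, here is a brute-force sketch that tries every set of at most `max_mutations` substitutions. It is a stand-in for the dynamic programming solution, which is not reproduced here, and it considers substitutions only (no insertions or deletions). The function names and the `max_mutations` bound are assumptions made for this example.

```python
from itertools import combinations, product


def is_t_string(s: str, spectrum, l: int) -> bool:
    """True if every l-tuple of s belongs to the spectrum."""
    return all(s[i:i + l] in spectrum for i in range(len(s) - l + 1))


def spectral_alignment(s: str, spectrum, l: int, max_mutations: int = 2):
    """Fewest substitutions turning s into a T-string, as (count, new string), or None."""
    for d in range(max_mutations + 1):
        for positions in combinations(range(len(s)), d):
            for letters in product("ACGT", repeat=d):
                # Skip "substitutions" that leave a position unchanged.
                if any(s[p] == c for p, c in zip(positions, letters)):
                    continue
                chars = list(s)
                for p, c in zip(positions, letters):
                    chars[p] = c
                candidate = "".join(chars)
                if is_t_string(candidate, spectrum, l):
                    return d, candidate
    return None


spectrum = {"ACG", "CGT", "GTA"}  # e.g. a set of solid 3-tuples
print(spectral_alignment("ACCTA", spectrum, l=3))
```

The search cost grows exponentially with the number of mutations allowed, which is why a dynamic programming formulation matters in practice.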
DNA Fragment Assembly - EULER
• Error Correction Problem
  • Given a collection of reads S and a maximum of d errors per read, introduce up to d corrections in each read in such a way that |Sl| is minimised, where Sl is the spectrum of S (all l-tuples of the reads and of their reverse complements).
  • An error in a read s affects at most 2l l-tuples (l in the read and l in its reverse complement), all of which point to the same error,
  • or 2x l-tuples if the error lies at a distance x < l from an endpoint of the read.
DNA Fragment Assembly - EULER
• Error Correction Problem:
  • Look for error corrections that reduce the size of Sl by 2l (or 2x).
  • EULER uses a more elaborate approach. It eliminates 97.7% of sequencing errors (in some cases going from 4.8 errors per read to 0.11 errors per read).
  • Error correction is not perfect: it can introduce errors, but this is acceptable as long as the errors are consistent.
  • Errors introduced are corrected in a later stage. Eliminating the false edges in the de Bruijn graph being built is more important.
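The simple greedy idea (look for a correction that shrinks the spectrum Sl) can be sketched as follows. This is only the naive version with one substitution per read, not the more elaborate procedure EULER uses, and the reads are made up for illustration.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(s: str) -> str:
    return s.translate(COMPLEMENT)[::-1]


def spectrum(reads, l: int):
    """All l-tuples of the reads and of their reverse complements."""
    tuples = set()
    for read in reads:
        for strand in (read, reverse_complement(read)):
            tuples.update(strand[i:i + l] for i in range(len(strand) - l + 1))
    return tuples


def correct_reads(reads, l: int):
    """Greedy sketch: apply to each read the single substitution that shrinks |Sl| most."""
    reads = list(reads)
    for idx, read in enumerate(reads):
        best_size, best_read = len(spectrum(reads, l)), read
        for pos in range(len(read)):
            for base in "ACGT":
                if base == read[pos]:
                    continue
                candidate = read[:pos] + base + read[pos + 1:]
                size = len(spectrum(reads[:idx] + [candidate] + reads[idx + 1:], l))
                if size < best_size:  # keep a change only if the spectrum gets smaller
                    best_size, best_read = size, candidate
        reads[idx] = best_read
    return reads


# Three error-free reads and one read with a single error (T -> G at position 4).
reads = ["ACGGTCAT"] * 3 + ["ACGGGCAT"]
corrected = correct_reads(reads, l=3)
print(len(spectrum(reads, 3)), len(spectrum(corrected, 3)))
```

In this example the erroneous read contributes 2l = 6 extra l-tuples, and the correction removes all of them. Recomputing the whole spectrum for every candidate is wasteful and is done here only to keep the sketch short.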
DNA Fragment Assembly - EULER
• Eulerian Superpath:
  • S is a set of reads. The de Bruijn graph is defined as follows:
    • Sl is the set of l-tuples of S
    • the vertices of the graph are the elements of S(l-1), the set of (l-1)-tuples of S
    • if Sl contains an l-tuple whose first l-1 elements form the vertex v and whose last l-1 elements form the vertex w, then add a directed edge from v to w
  • If S consists of a single read, the 'assembly problem' amounts to finding a Chinese Postman path, which can easily be transformed into the Eulerian path problem.
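The definition translates almost directly into code. The sketch below builds the graph as an adjacency list with one edge per distinct l-tuple; it ignores reverse complements and sequencing errors, and the reads are a toy example.

```python
from collections import defaultdict


def de_bruijn_graph(reads, l: int):
    """Adjacency list: one directed edge (prefix -> suffix) per distinct l-tuple."""
    l_tuples = set()
    for read in reads:
        l_tuples.update(read[i:i + l] for i in range(len(read) - l + 1))

    graph = defaultdict(list)
    for t in sorted(l_tuples):          # sorted only to make the output deterministic
        graph[t[:-1]].append(t[1:])     # first l-1 letters -> last l-1 letters
    return dict(graph)


reads = ["ATGGC", "GGCGT", "CGTGC"]  # toy reads from the sequence ATGGCGTGC
for vertex, successors in de_bruijn_graph(reads, l=3).items():
    print(vertex, "->", successors)
```

Note that no pairwise read comparison is needed: overlapping reads share l-tuples, so they are glued together automatically when their edges are added.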
DNA Fragment Assembly - EULER
• Definitions:
  • source vertex
  • sink vertex
  • branching vertex
  • repeat
  • entrance
  • exit
[1] Pavel A. Pevzner, Haixu Tang, Michael S. Waterman. PNAS, August 2001.
DNA Fragment Assembly - EULER
• The de Bruijn graph is very complicated, even in error-free cases.
• Use the information about which l-tuples belong to the same reads.
• Covering reads (reads containing an entrance and an exit) reveal information about the pairing between entrances and exits.
• Tangles are repeats that do not have a covering read.
DNA Fragment Assembly - EULER
• Eulerian Superpath Problem: find in a given graph G an Eulerian path which contains a given set of paths P as subpaths.
  • the graph G is the de Bruijn graph
  • the subpaths in P are the reads
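The superpath condition is easy to check for a given candidate path, even though finding such a path is the hard part. In this sketch a path is a list of (l-1)-tuple vertices, and both candidate paths are Eulerian paths of the same toy de Bruijn graph (reads ATGGC, GGCGT, CGTGC with l = 3); the helper names are assumptions made for this example.

```python
def read_to_path(read: str, l: int):
    """The path of (l-1)-tuple vertices that a read follows in the de Bruijn graph."""
    k = l - 1
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def contains_subpath(path, sub) -> bool:
    """True if sub occurs as a contiguous run of vertices in path."""
    n = len(sub)
    return any(path[i:i + n] == sub for i in range(len(path) - n + 1))


def is_superpath(euler_path, reads, l: int) -> bool:
    """True if the Eulerian path contains the path of every read as a subpath."""
    return all(contains_subpath(euler_path, read_to_path(r, l)) for r in reads)


reads = ["ATGGC", "GGCGT", "CGTGC"]

# Two Eulerian paths of the same graph; they spell ATGGCGTGC and ATGCGTGGC.
path_a = ["AT", "TG", "GG", "GC", "CG", "GT", "TG", "GC"]
path_b = ["AT", "TG", "GC", "CG", "GT", "TG", "GG", "GC"]

print(is_superpath(path_a, reads, 3), is_superpath(path_b, reads, 3))
```

Both paths use every edge exactly once, but only the first is consistent with the reads, which is why the read paths are kept as constraints.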
DNA Fragment Assembly - EULER
• Solving the superpath problem:
  • carry out a series of k transformations of the graph G and the system of paths P in order to obtain a new 'equivalent' graph Gk and system of paths Pk, in which every path is a single edge
  • every solution of the Eulerian path problem in (Gk, Pk) then provides a solution of the Eulerian superpath problem in (G, P)
DNA Fragment Assembly - EULER
• Definitions: Px,y, P->x, Py->
• x,y-detachment reduces the number of edges in G, eventually leaving one path per edge.
• However, in the case of multiple edges some extra work has to be done.
DNA Fragment Assembly - EULER [1] Pavel A. Pevzner, Haixu Tang, Michael S. Waterman. PNAS, August 2001.
DNA Fragment Assembly - EULER
• Resolvable path, resolvable edge.
• Some edges cannot be resolved even after detachment of all resolvable edges.
• Usually this situation corresponds to tangles (x-cuts).
[1] Pavel A. Pevzner, Haixu Tang, Michael S. Waterman. PNAS, August 2001.
DNA Fragment Assembly - EULER
• Neisseria meningitidis (NM) project:
  • better results with real sequencing data than those obtained with other tools using error-free sequencing data
DNA Fragment Assembly - EULER [1] Pavel A. Pevzner, Haixu Tang, Michael S. Waterman. PNAS, August 2001.
Bibliography
• [1] An Eulerian path approach to DNA fragment assembly. Pavel A. Pevzner, Haixu Tang, Michael S. Waterman. PNAS, August 2001.
• [2] Strategies for the systematic sequencing of complex genomes. Eric D. Green. Nature Reviews Genetics 2, 573-583 (2001).
• [3] Eulerian Cycle / Chinese Postman. The Stony Brook Algorithm Repository. Steven S. Skiena. http://www.cs.sunysb.edu/~algorith/