620 likes | 807 Views
Genomic Sequencing. DNA Sequencing. Fred Sanger, Cambridge, England Copy DNA with one of four bases starved ACGTAAGCTA with T starved produces ACG and ACGTAAGC Run experiment with each of four bases starved, producing a ladder (all sub-fragments ending at the base)
E N D
DNA Sequencing • Fred Sanger, Cambridge, England • Copy DNA with one of four bases starved • ACGTAAGCTA with T starved produces ACG and ACGTAAGC • Run experiment with each of four bases starved, producing a ladder (all sub-fragments ending at the base) • Separate resulting fragments by length • Animations • http://www.youtube.com/watch?v=UT9wqaVCH5s • http://www.mun.ca/biology/scarr/4241_RMC_Sequencing.html • http://dnalc.org/view/15479-Sanger-method-of-DNA-sequencing-3D-animation-with-narration.html
DNA Sequencing • Later, sequencing machines sequence 500-700 nt fragments, called reads • Reads are assembled into a continuous genome (difficult) • Shotgun sequencing • Current • Next Generation Sequencing (NGS) • www.cs.uml.edu/~kim/580/10_ngs.pdf
Shot-gun Method • Shotgun method • Break up DNA into small fragments, each of which is sequenced • Use computer to search for overlap • Build a master sequence • Good for short prokaryote genomes • For n fragments, # of possible overlaps is 2n(n-1) • Repeats in sequences are problems
Genetic Maps • For long genomes, use genetic markers • Use shot gun method and locate known markers in the master sequence • Known genes can be markers
Restriction Map • Restriction endonuclease • An enzyme binding to specific DNA sequences, and making double-stranded cut at or near the sequences • Type II always cut at the same place (over 2,500 type II) • e.g., HindII cuts at GTGCAC or GTTAAC
Complete and Partial Digest • Probability of restriction site being cut • = 1: complete digest • Distance between successive cuts is known and accurate • <1 : partial digest • Distances across more than one restriction site are generated
Partial Digest Problem (PDP) • X = {x1=0, x2, . . ., xn}: an ordered set of n points on a line • ΔX = {xi- xj| 1 ≤i<j ≤ n}: a multiset of pairwise distances with ( n2) elements • Partial Digest Problem (PDP) • Given a multiset L containing ( n2) integers of pairwise distances • Find a set X of n integers such than ΔX = L • Also, called Turnpike problem, reconstructing highway from pairs of exits • Unique set X is not always possible • e.g., if ΔA = Δ(A+v), where Δ(A+v) = {a+v|a ЄA} (one set is a shift of another set) • A = {0,2,4,7,10}, Δ(A+100) = {100, 102, 104, 107, 110} • e.g., if ΔA = Δ(-A) • A = {0,2,4,7,10}, Δ(-A) = {-10, -7, -4, -2, 0} • In general, U + V and U – V are homometric
PDP(1) • Brute force approach • Given L, • Compute ΔX for every possible combination of X • Until X is found such that ΔX = L • Need to examine (M-1n-2) different set of positions • => O(Mn-2) BruteForcePDP(L, n) M← max(L) for every set of n-2 integers 0< x2 < . . . <xn-1 <M X ← {0 < x2 < . . . <xn-1 <M } FormΔX from X if ΔX = L return X return “No Solution”
PDP(2) • Brute force approach • Given L, • Identical to BruteForcePDP() except that xiЄ L • Need to examine (|L|n-2) different set of positions • => O(M2n-4) BruteForcePDP(L, n) M <- max(L) for every set of n-2 integers 0< x2 < . . . <xn-1 <M from L X ← {0, < x2 < . . . <xn-1 <M } FormΔX from X if ΔX = L return X return “No Solution”
PDP(3) • Steven Skiena, 1990 • Largest in L determines the two outermost points in X • e.g. L = {2,2,3,3,4,5,6,7,8,10} • Pick 10: • X={0,10} • L = {2,2,3,3,4,5,6,7,8) • Pick 8: X={0,2,10} or X={0,8,10} • L = {2,3,3,4,5,6,7} • Pick 7: x3=3 should include x3-x2=1 • X={0,2,7,10} • L = {2,3,3,4,5,6} • . . .
PLACE(L, X) if L is empty output X return y← max(L) if Δ(y, X) is subset of L add y to X and remove Δ(y, X) from L PLACE(L,X) remove y from X and add Δ(y, X) to L if Δ(width-y, X) is subset of L add width-y to X and removeΔ(width-y, X) PLACE(L, X) remove width-y from X and add Δ(width-y, X) to L return PartialDigest(L) width ← max(L) DELETE(width, L) X← {0, width} PLACE(L, X) [ Δ(y, X): multiset of distances between a point y and all points in set X]
Shortest Superstring Problem • Find superstring of the reads, but shortest one Shortest Superstring Problem Given a set of strings, find a shortest string that contain all of them Input: Strings s1, s2, …., sn Output: A shortest string s that contains all strings s1, s2, …., sn { 000 001 010 011 100 101 110 111 } 0 1 0 1 1 0 0 1 1 0 0 0 0 0 0 1 1 1 0 1 0 0 0 0 1 1 1 1 1 0 1 1 0 0
Shortest Superstring Problem -2 • Define overlap(si, sj) • The length of the longest prefex of sj that matches a suffix of si • Shortest Superstring problem becomes • Traveling salesman problem with vertices for strings and edges of overlaps
DNA Arrays • Sequencing by Hybridization (SBH) • millions of short DNA fragments called probes in a chip • Input DNA sequence reacts to fragments in an array (chip) via base complementary property
Base coverage • A sample (genome) is amplified • A base is the sample is copied into many reads • But, reads are randomly generated • Poisson distribution • Similarly, k-mers • Still, Poisson distribution, but different
Coverage Depth and Extent • Coverage Depth • The avg number of times each base or k-mer is sequenced • Coverage Extent • The ratio of genome covered by at least one base or k-mer • Given a genome of size G, read length L, read number N • Total number of bases (nb) and k-mers (nk) • nb = N*L; nk = N*(L-k+1) • nb/nb= L/(L-k+1)
Coverage Depth and Extent • Coverage Depth of bases (db) and k-mers (db) • db = nb/G; dk = nk/G • db / dk= L/(L-k+1) • For the de novo sequencing, these relationships can be used to estimate the unknown genome size (G) and coverage depth for bases (db) from read data before assembly from • G = nk /dk and db = dk* L/(L-k+1)
Coverage Depth or Sequencing Depth • Coverage Depth (db) is called sequencing depth (c) • From Poisson, prob. of non-coverage is • P(X=0) = exp(-c) • Coverage extent is P(X>0) = 1- exp(-c) • To cover >99% of a genome, c>4.6 • To ensure the whole genome is covered, # of uncovered bases G*exp(-c)<1 • Human genome (3 Gb): c>22
SBH • Given an unknown DNA sequence, DNA array provides • All strings of length l that the sequence contains • No information about their positions • Spectrum (s, l) • For string s of length n, the l-mer composition with multiset of n-l+1 l-mers in s • l=3, s=TATGGTGC • Spectrum(s.l) = {TAT, ATG, TGG, GGT, GTG, TGC}
SBH as a Hamiltonian Path Problem • Two l-mers overlap if overlap(p,q) = l-1 • Hamiltonian Path Problem • Given Spectrum (s, l), and a vertex for every l-mer in Spectrum (s, l) • Connect every two vertices if two vertices overlap, • So that visit every vertex • Overlap-Layout-Consensus (OLC) • NP-complete
OLC • Conventional shotgun sequencing • Overlap-layout-consensus • Use computer to search for overlap: trying for all possible pairs of fragments • Layout: putting fragments together • Consensus: error correction • Good for short prokaryote genomes • For n fragments, # of possible overlaps is 2n(n-1) • Difficult • No solution for “repeat problem” to find correct path in the layout step • Produce sequencing errors • Programs • PHRAP, CAP, TIGR, CELERA
SBH as an Eulerian Path Problem • A graph with all (l-1)-mers (later) • edges corresponding to l-mers from Spectrum (s, l) • Find a path visiting every edge exactly once
Eulerian Path Problem • Repeatedly find Eulerian cycles in the graph • Linear time
De Bruijn Graph • Partition read fragments into fixed-size k-mers • k = 27, for example • Each (k-1)-mer becomes a graph node
De Bruijn Graph • Eulerian Graph • De Bruijn Graph • Glue parallel links with multiplicity (e.g., multiplicity of 3) • Tangle: # of input edges is not equal to # of output edges
De Bruijn Graph • How to construct de Bruijn graph from collections of sequencing reads ? • Gluing requires knowledge of finished sequence • Cannot construct de Bruijn graph from collection of sequencing reads until sequencing is completed • Let s be a sequencing read with error • If genome sequence G is known, errors in s can be done by aligning s against G • But G is not known until the last “consensus” step • EULER uses SA to minimize errors in the first step
Programs • ABySS (Assembly By Short Sequencing) • Simpson, 2009 • www.cs.uml.edu/~kim/580/08_abyss.pdf • Velvet • Zerbino and Birny, 2008 • www.cs.uml.edu/~kim/580/08_velvet.pdf • Euler • Pevzner, 2001 • www.cs.uml.edu/~kim/580/01_pevzner.pdf • www.cs.uml.edu/~kim/580/09_chaisson.pdf • SOAPdenovo (Short Oligonucleotide Alignment Program) • Beijing Genomics Institute • www.cs.uml.edu/~kim/580/09_soap.pdf
ABySS • Proceeds in two stages • Stage 1 • All possible k-mers are generated from reads • Remove read errors and construct initial contigs • Stage 2 • Use mate-pairs to extend contigs • Distributed implementation of de Bruijn graph in a cluster using Message Passing Interface (MPI) over multiple computers
ABySS – Stage 1 • Three steps • Load read data into distributed de Bruijn graph • Resolve read errors • Merge graph nodes • Load read data into distributed de Bruijn graph • Reads with unknown bases are discarded • Each read is broken into (read_length-k+1) overlapping k-mers • A k-mer is assigned to one cluster node • Compute adjacency of k-mers • For each k-mer, a message is sent to its eight possible neighbors • If a neighbor exists, there must be a k-1 bp overlap
ABySS – Stage 1 (cont’d) • Resolve read errors • Remove dead-ends • When correct k-mers of a read connect to incorrect k-mers, • They are likely to be unique and most will not have an extension • One end of the branch will terminate with no extension • Dead-end branches are traced backward to the ambiguous point and are removed if their lengths are shorter than a threshold • Remove bubbles • A branch diverges and rejoins later • Caused by single base differences
ABySS – Stages 1 and 2 • Vertex merging • Merge vertices linked by unambiguous edges • Contig merging • Use paired-end info
ABySS Results • Genome of African male from NCBI Short Read Archive: Accession # SRA000271 • 3.5 Billion mate-paired reads, x42 • Read length: 36-42 bp, median fragment 210 bp • At k=27, 15h run time without paired-end info
Velvet • Construct a graph • Transform reads into roadmaps • From a read, generate k-mers with read ID and position in the read (called roadmaps) • Each read is transformed to a set of k-mers with overlaps and hash links to previous reads with the same k-mers • 2nd database • For each read, which k-mers are overlapped by subsequent reads • Trace reads through the graph using roadmaps
Velvet • Graph simplification • A node with one outlink can be combined with a next node with one input link • Error removal • Focus on topological features • Tips (dead-ends) shorter than 2k • bubbles due to internal read errors (Tour Bus algorithm) • Erroneous connections due to distant merging tips • Breadcrumb – use read pairs to extend contigs
EULER, 2001 • EULER • Implement Eulerian Path problem • Issues with real data • Reads may have errors • Error correction is typically done in ‘consensus’ stage • EULER corrects errors in the first step • SA (Spectral Alignment) • Repeat problem • De Bruijn graph
Spectral Alignment (SA) • Genome sequence G is not known, set Gl of all l-mers present in G can be accurately predicted • An l-mer is called solid if it belongs to more than M reads • EULER Approach – approximate Gl as a set of all solid l-mers • SBH problem without read errors • Construct a graph with edges corresponding to l-mers from Spectrum(s, l) • Find a path visiting every edge exactly once • SA • Given a string s and Gl, find the minimum number of mutations in s that transform s such that Spectrum(s, l) = Gl • Can be efficiently programmed by dynamic programming
Spectral Alignment (SA) • Formulation • Given a set of reads R = {r1, .., rn}, integer l, and upper bound Δ on the number of errors in each read • Spectrum Sl is a set of all l-mers from reads r1, .., rn and reverse complements r1’, .., rn’ • Introduce up to Δ corrections in each read in R such that | Sl | is minimized • Result • One correction in a read can correct l from R andl from R’ • Reduces 86.5% of read errors • But, can create errors • One change in a read may change all reads in the region • Error introduction is OK as long as the errors from overlapping reads covering the same position are consistent, corresponding to a single mutation in a genome • Correct 234,410 errors, introduce 1,452 errors in NM
EULER - Results • No incorrect contigs
Summary of EULER • Eulerian Path approach – de Bruijn graph • Do error correction early – SA • Fill gaps ASAP
EULER +, 2004 • EULER+, 2004 • A-Bruijn graph • To handle errors in reads, introduce vertices with ungapped alignments that allow mismtaches rather than exact l-mer in de Bruijn assembly • Graph simplication algorithms to remove errors in edges • De Bruijn graph is proportional to the coverage and requires a large memory with a higher coverage with short reads than long–read sequencing