I519 Introduction to Bioinformatics, 2011. Sequencing techniques and genome assembly. Yuzhen Ye (yye@indiana.edu) School of Informatics & Computing, IUB.
Start with reads
>read1
aatgcatgcggctatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaat
>read2
gctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctatgctaagctgggatccgatgacaatgcatgcggctatgctaatgcatgcggctatgcaagctgggatccgatgactatgctaagctgcggctatgctaatgcatgcggctatgctaagctgggatccgatgacaatgca
>read3
tgcggctatgctaatgcatgcggctatgcaagctgggatcctgcggctatgctaatgaatggtcttgggatttaccttggaatgctaagctgggatccgatgacaatgcatgcggctatgctaatgaatggtcttgggatttaccttggaatatgctaatgcatgcggctatgcta
……
What can be done • Assemble the short reads into a genome (hopefully a complete genome) • Assembly problem • Comparative analysis • Whole genome level: whole genome comparison • Individual gene level • Genome variation & SNP • Annotate the genome • What are the genes (gene structure prediction) • What are the functions of the genes
How are genome sequences generated? • Limitation on read length (new sequencers produce even shorter reads than Sanger sequencing machines) • Sequencing of long DNA sequences (a chromosome or a whole genome) relies on sequencing of short segments (carried in cloning vectors) • Two approaches to sequencing large pieces of DNA • Chromosome walking / primer walking: progresses through the entire strand, piece by piece • Shotgun sequencing: cut DNA randomly into smaller pieces; with sufficient oversampling (?), the sequence of the target can be inferred by piecing the sequence reads together into an assembly.
Cloning vectors • Cloning vectors: DNA vehicles into which foreign DNA can be inserted and in which it stays stable • Various types • Cosmid (plasmid, containing 37-52 kbp of DNA) • BAC (Bacterial Artificial Chromosome; takes in 100-300 kbp of foreign DNA) • YAC (Yeast Artificial Chromosome)
Shotgun sequencing • DNA: too long to be sequenced • Cut randomly (shotgun) • Each short read can be sequenced • Fragment assembly (an inverse problem)
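The random cutting step can be imitated in a few lines. This is only a toy sketch: the function name `shotgun_reads`, the fixed read length, the error-free reads, and the uniform choice of start positions are all illustrative assumptions, not part of the lecture.

```python
import random

def shotgun_reads(genome, read_len, coverage, seed=0):
    """Sample fixed-length, error-free reads uniformly at random from `genome`.

    The number of reads n is chosen so that n * read_len / len(genome)
    is approximately equal to the requested coverage.
    """
    if read_len > len(genome):
        raise ValueError("read_len must not exceed the genome length")
    rng = random.Random(seed)  # seeded so the toy example is reproducible
    n_reads = round(coverage * len(genome) / read_len)
    last_start = len(genome) - read_len
    reads = []
    for _ in range(n_reads):
        start = rng.randint(0, last_start)  # randint is inclusive on both ends
        reads.append(genome[start:start + read_len])
    return reads

# Hypothetical 1 kbp "genome" made of random bases.
genome = "".join(random.Random(1).choice("acgt") for _ in range(1000))
reads = shotgun_reads(genome, read_len=50, coverage=8)
print(len(reads), "reads; first read:", reads[0])
```

Real reads also come from both strands and contain errors, which this sketch ignores.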
Shotgun sequencing: from small viral genomes to larger genomes • Early applications of shotgun approach • small viral genomes (e.g., lambda virus; 1982) • 30- to 40-kbp segments of larger genomes that could be manipulated and amplified in cosmids or other clones (physical mapping) -- hierarchical genome sequencing (divide-and-conquer sequencing) • 1995, Haemophilus influenzae -- whole-genome shotgun (WGS) sequencing • Critical to this accomplishment: use of pairs of reads, called mates, from the ends of 2-kbp and 16-kbp inserts randomly sampled from the genome (which were used for ordering the contigs) • 2001, whole-genome shotgun sequencing of the human genome
DNA sequencing technology • Sanger sequencing • The main method for sequencing DNA for the past thirty years! • 2nd generation sequencing techniques (next generation sequencing) • Differ from Sanger sequencing in their basic chemistry • Massively increased throughput • Smaller DNA concentration required • 454 pyrosequencing, Illumina/Solexa, SOLiD • 3rd generation? (single-molecule)
DNA sequencing: history • Gilbert method (1977): chemical method to cleave DNA at specific points (G, G+A, T+C, C). • Sanger method (1977): labeled ddNTPs terminate DNA copying at random points. • Both methods generate labeled fragments of varying lengths that are then separated by electrophoresis
Sanger method: generating reads • Start at primer (restriction site) • Grow DNA chain • Include ddNTPs • Stops reaction at all possible points • Separate products by length, using gel electrophoresis • Chain terminators: dideoxynucleotide triphosphates (ddNTPs)
Radioactive sequencing versus dye-terminator sequencing ddNTPs (chain terminators) are labeled with different fluorescent dyes, each fluorescing at a different wavelength.
Automatic DNA sequencing Output: chromatograms (fluorescent peak trace)
Trace archive NCBI trace archive: TI# 422835669 (http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?)
New sequencing techniques • Next Generation Sequencing (NGS) (Second Generation) • Pyrosequencing • Illumina • SOLiD • Third generation sequencing • single-molecule sequencing technologies • NHGRI funds development of third generation DNA sequencing technologies • “More than $18 million in grants to spur the development of a third generation of DNA sequencing technologies was announced today by the National Human Genome Research Institute (NHGRI). …The cost to sequence a human genome has now dipped below $40,000. Ultimately, NHGRI's vision is to cut the cost of whole-genome sequencing of an individual's genome to $1,000 or less, which will enable sequencing to be a part of routine medical care.” • http://www.nih.gov/news/health/sep2010/nhgri-14.htm
Next-generation sequencing transforms today's biology [Figure: Sanger sequencers versus a 454 sequencer] Ref: Nature Methods 5, 16-18 (2008)
Next-generation sequencing transforms today's biology • Genome re-sequencing • Metagenomics • Transcriptomics (RNA-seq) • Personal genomics ($1000 for sequencing a person’s genome)
Pyrosequencing • Pyrosequencing principles • the polymerase reaction is modified to emit light as each base gets incorporated.
Solexa/Illumina sequencing • Ultrahigh-throughput sequencing • Keys • attachment of randomly fragmented genomic DNA to a planar, optically transparent surface • solid phase amplification to create an ultra-high density sequencing flow cell with > 10 million clusters per sq. cm, each containing ~1,000 copies of template • Short reads • Used for gene expression, small RNA discovery etc.
Solexa/Illumina sequencing More details at http://www.illumina.com/pages.ilmn?ID=203
Applied Biosystems SOLiD sequencer • SOLiD: Sequencing by Oligo Ligation and Detection • Commercial release in October 2007 • ~5 days to run / produces 3-4 Gb • The chemistry is based on template-directed ligation of short, “dinucleotide-encoding”, 8-mer oligonucleotides. • Dinucleotide encoding permits discrimination of SNPs from most chemistry and imaging errors, and subsequent in silico correction of those errors. • Ref: http://appliedbiosystems.cnpg.com/Video/flatFiles/699/index.aspx
Comparison of new sequencing techniques (“The new science of metagenomics” Table 4-2) [Table not reproduced; slide annotations on it: “Increased!!”, “Yes now!”]
Base calling • Determine the sequence of nucleotides from chromatograms or flowgrams (trace files often in SCF format) • Peak detection • Phred quality score: $Q = -10 \log_{10}(P_e)$, where $P_e$ is the estimated probability that the base call is wrong
Phred quality score (for high values the two scores are asymptotically equal)
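A short sketch of the quality-score arithmetic may help. The Phred conversion follows the formula Q = -10 log10(Pe). The slide text does not say which second score is meant; the function `solexa_from_error` below assumes, as a guess, that it is the odds-based Solexa score, which does approach the Phred score as the error probability shrinks. All function names are illustrative.

```python
import math

def phred_from_error(p_error):
    """Phred quality: Q = -10 * log10(Pe)."""
    return -10.0 * math.log10(p_error)

def error_from_phred(q):
    """Inverse mapping: Pe = 10^(-Q/10)."""
    return 10.0 ** (-q / 10.0)

def solexa_from_error(p_error):
    """Odds-based score, -10 * log10(Pe / (1 - Pe)).

    ASSUMPTION: this is taken to be the 'second score' on the slide;
    the source does not name it.
    """
    return -10.0 * math.log10(p_error / (1.0 - p_error))

# The two scores differ for poor base calls and converge for good ones.
for p in (0.5, 0.1, 0.01, 0.001):
    print(p, round(phred_from_error(p), 2), round(solexa_from_error(p), 2))
```

In this convention Q = 20 corresponds to one expected error in 100 base calls, and Q = 30 to one in 1,000.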
DNA Fragment assembly (Genome assembly) ?
Assembly • Comparative assembly • comparative (re-sequencing) approaches that use the sequence of a closely related organism as a guide during the assembly process. • De novo assembly • reconstructing genomes that are not similar to any organisms previously sequenced • proven to be difficult, falling within a class of problems (NP-hard) • main strategies: greedy, overlap-layout-consensus, and Eulerian
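Of the three strategies, the greedy one is the simplest to sketch: repeatedly merge the two reads with the longest overlap. The version below is a toy under strong assumptions (exact overlaps only, error-free reads, one strand, no read contained inside another); `overlap_len` and `greedy_assemble` are illustrative names, not functions from any assembler.

```python
def overlap_len(a, b, min_len=1):
    """Length of the longest suffix of `a` that exactly equals a prefix of `b`."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads, min_len=3):
    """Merge the best-overlapping pair until no overlap >= min_len remains."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best = (0, None, None)  # (overlap length, index of a, index of b)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap_len(a, b, min_len)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        if k == 0:
            break  # no overlaps left: the remaining strings are separate contigs
        merged = reads[i] + reads[j][k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (i, j)] + [merged]
    return reads

print(greedy_assemble(["TAGATTAC", "TTACACAG", "ACAGATTA"]))
```

Greedy merging can commit to a repeat-induced overlap early and never recover, which is one reason the other two strategies exist.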
Fragment assembly: overlap-layout-consensus • Assemblers: ARACHNE, PHRAP, CAP, TIGR, CELERA • Overlap: find potentially overlapping reads • Layout: merge reads into contigs • Consensus: derive the DNA sequence and correct read errors (..ACGATTACAATAGGTT..)
Overlap • Find the best match between the suffix of one read and the prefix of another • Due to sequencing errors, need to use dynamic programming to find the optimal overlap alignment • Apply a filtration method to filter out pairs of fragments that do not share a significantly long common substring
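The dynamic-programming step can be sketched as a suffix-prefix ("overlap") alignment: skipping a prefix of the first read is free, and the best score is read from the last row, where the first read ends. The scoring values (match +1, mismatch -1, gap -2) and the function name are assumptions chosen for illustration.

```python
def overlap_alignment(a, b, match=1, mismatch=-1, gap=-2):
    """Best-scoring alignment of a suffix of `a` with a prefix of `b`.

    Returns (score, j), where b[:j] is the prefix of `b` in the overlap.
    """
    n, m = len(a), len(b)
    # Row 0: an empty suffix of `a` against b[:j] costs j gaps.
    prev = [j * gap for j in range(m + 1)]
    for i in range(1, n + 1):
        # Column 0 stays 0: the overlap may start anywhere in `a` for free.
        cur = [0] * (m + 1)
        for j in range(1, m + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur[j] = max(diag, prev[j] + gap, cur[j - 1] + gap)
        prev = cur
    # The alignment must run to the end of `a`, so only the last row counts.
    best_j = max(range(m + 1), key=lambda j: prev[j])
    return prev[best_j], best_j

print(overlap_alignment("TAGATTACACAG", "ACACAGATTAC"))  # exact 6-base overlap
print(overlap_alignment("TAGATTACACAG", "ACTCAGATT"))    # overlap with one mismatch
```

This costs time proportional to the product of the two read lengths per pair, which is why a cheap filter is applied before running it on all pairs.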
Overlapping reads • Sort all k-mers in reads (k ~ 24) • Find pairs of reads sharing a k-mer • Extend to full alignment; throw away if not >95% similar [Figure: a poorly matching pair of reads next to a well-matching pair, TAGATTACACAGATTAC aligned to TAGATTACACAGATTAC]
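The k-mer filter can be sketched as follows. Note one substitution: the slide sorts all k-mers, whereas this sketch uses a hash index (a dictionary from k-mer to read indices), which finds the same candidate pairs. The function name is illustrative, and reverse-complement k-mers are ignored for brevity.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(reads, k):
    """Return index pairs (i, j), i < j, of reads sharing at least one k-mer."""
    index = defaultdict(set)  # k-mer -> indices of the reads that contain it
    for idx, read in enumerate(reads):
        for pos in range(len(read) - k + 1):
            index[read[pos:pos + k]].add(idx)
    pairs = set()
    for read_ids in index.values():
        pairs.update(combinations(sorted(read_ids), 2))
    return sorted(pairs)

reads = ["TAGATTACACAG", "TTACACAGATTA", "GGGGCCCCGGGG"]
print(candidate_pairs(reads, k=6))
```

Only the pairs returned here would be passed on to the full alignment and the >95% similarity check.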
Layout: create local multiple alignments from the overlapping reads ("-" marks a gap)
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAG-TTACACAGATTATTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAGATTACACAGATTACTGA
TAG-TTACACAGATTATTGA
TAGATTACACAGATTACTGA
Derive consensus sequence • Derive multiple alignment from pairwise read alignments • Derive each consensus base by weighted voting • Alignment rows from the slide ("-" marks a gap):
TAGATTACACAGATTACTGA-TTGATGGCGTAA-CTA
TAGATTACACAGATTACTGACTTGATGGCGTAAACTA
TAG-TTACACAGATTATTGACTTCATGGCGTAA-CTA
TAGATTACACAGATTACTGACTTGATGGCGTAA-CTA
TAGATTACACAGATTACTGACTTGATGGGGTAA-CTA
TAGATTACACAGATTACTGACTTGATGGCGTAA-CTA
Consensus • A consensus sequence is derived from a profile of the assembled fragments • A sufficient number of reads is required to ensure a statistically significant consensus. • Reading errors are corrected
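Column-wise weighted voting over an alignment can be sketched as below. The assumptions are labeled here and in the code: rows are already aligned to equal length, "-" marks a gap, each read carries a single weight (real assemblers would more likely weight each base by its quality score), and ties go to the symbol seen first.

```python
from collections import defaultdict

def consensus(aligned_reads, weights=None):
    """Weighted majority vote per alignment column; '-' denotes a gap."""
    width = len(aligned_reads[0])
    if any(len(row) != width for row in aligned_reads):
        raise ValueError("all aligned rows must have the same length")
    if weights is None:
        weights = [1.0] * len(aligned_reads)  # ASSUMPTION: equal weights by default
    out = []
    for col in range(width):
        votes = defaultdict(float)
        for row, w in zip(aligned_reads, weights):
            votes[row[col]] += w
        winner = max(votes, key=votes.get)
        if winner != "-":  # a column won by the gap symbol is dropped
            out.append(winner)
    return "".join(out)

rows = [
    "TAGATTACACAGATTACTGA",
    "TAG-TTACACAGATTATTGA",
    "TAGATTACACAGATTACTGA",
]
print(consensus(rows))
```

With only three rows a single erroneous read is outvoted, but two concordant errors would win, which is why sufficient read depth matters.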
Gaps and contigs [Figure: Contig 1, gap, Contig 2] • Filling gaps: close up the gaps by further experiments • Mates for ordering the contigs
Read coverage • Assuming uniform distribution of reads: • Length of genomic segment: $L$ • Number of reads: $n$ • Length of each read: $l$ • Coverage: $\lambda = n l / L$ • How much coverage is enough (or what is sufficient oversampling)? • Lander-Waterman model: $P(x) = \lambda^{x} e^{-\lambda} / x!$, so $P(x=0) = e^{-\lambda}$, where $\lambda$ is the coverage
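The coverage formulas can be evaluated directly. In this sketch coverage is computed as (number of reads × read length) / genome length, and the Poisson term gives the probability that a base is covered exactly x times. The function names and the example numbers (16,000 reads of 500 bp on a 1 Mbp genome) are illustrative assumptions.

```python
import math

def coverage(n_reads, read_len, genome_len):
    """Average coverage: number of reads * read length / genome length."""
    return n_reads * read_len / genome_len

def prob_covered_x_times(x, cov):
    """Poisson probability that a given base is covered exactly x times."""
    return cov ** x * math.exp(-cov) / math.factorial(x)

def expected_uncovered_bases(genome_len, cov):
    """Expected number of bases covered by no read: L * e^(-coverage)."""
    return genome_len * math.exp(-cov)

cov = coverage(n_reads=16000, read_len=500, genome_len=1_000_000)
print("coverage:", cov)
print("P(base uncovered):", prob_covered_x_times(0, cov))
print("expected uncovered bases:", round(expected_uncovered_bases(1_000_000, cov), 1))
```

Even at 8-fold coverage a few hundred bases of a 1 Mbp genome are expected to remain unsequenced under this model.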
Contig numbers vs read coverage Using a genome of 1Mbp
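The relationship in this plot can be approximated with the Lander-Waterman estimate of the expected number of contigs, n · exp(-c · (1 - T/l)), where c is the coverage, l the read length, and T the minimum overlap needed to join two reads. The read length of 500 bp and T = 0 are assumptions made for this sketch; the slide specifies only the 1 Mbp genome.

```python
import math

def expected_contigs(genome_len, read_len, cov, min_overlap=0):
    """Lander-Waterman expected contig count: n * exp(-c * (1 - T / l))."""
    n_reads = cov * genome_len / read_len
    sigma = 1.0 - min_overlap / read_len  # fraction of a read usable for extension
    return n_reads * math.exp(-cov * sigma)

# ASSUMPTION: 500 bp reads; the slide gives only the genome size (1 Mbp).
for cov in (0.5, 1, 2, 4, 6, 8, 10):
    print(cov, round(expected_contigs(1_000_000, 500, cov), 1))
```

The count first rises as reads accumulate, peaks near 1-fold coverage, and then falls roughly exponentially as contigs merge. The formula is an approximation and can drop below one contig at very high coverage.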
How much coverage is needed • Cover region with >7-fold redundancy • Overlap reads and extend to reconstruct the original DNA sequence
Repeats complicate fragment assembly [Figure: a true overlap versus a repeat-induced overlap]
Challenges in fragment assembly • Repeats: a major problem for fragment assembly • > 50% of the human genome consists of repeats: - over 1 million Alu repeats (about 300 bp) - about 200,000 LINE repeats (1000 bp and longer) [Figure: three copies of a repeat; green and blue fragments are interchangeable when assembling repetitive DNA]
Repeat types • Low-complexity DNA (e.g. ATATATATACATA…) • Microsatellite repeats (a1…ak)^N where k ~ 3-6 (e.g. CAGCAGTAGCAGCACCAG) • Transposons/retrotransposons • SINE: Short Interspersed Nuclear Elements (e.g., Alu: ~300 bp long, 10^6 copies) • LINE: Long Interspersed Nuclear Elements (~500 - 5,000 bp long, 200,000 copies) • LTR retroposons: Long Terminal Repeats (~700 bp) at each end • Gene families: genes duplicate & then diverge • Segmental duplications: very long, very similar copies
Celera assembler • “The key to not being confused by repeats is the exploitation of mate pair information to circumnavigate and to fill them” • A mate pair is two reads from the same clone -- we know the distance between the two reads • Ref: Myers et al. 2000 “A Whole-Genome Assembly of Drosophila”. Science, 287:2196-2204
Celera assembler: unitig • Unitig: a maximal interval subgraph of the graph of all fragment overlaps for which there are no conflicting overlaps to an interior vertex • A-statistic: log-odds ratio of the probability that the distribution of fragment start points is representative of a “correct” unitig versus an overcollapsed unitig of two repeat copies.
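One way to make the A-statistic concrete is to assume that fragment start points arrive as a Poisson process. With n fragments sampled from a genome of length G, a unique unitig of length ρ sees starts at rate n/G, while a unitig collapsing two repeat copies sees them at rate 2n/G. The log of the ratio of the two Poisson probabilities for k observed fragments simplifies to ρ·n/G - k·ln 2. This is a derivation under those stated assumptions, in natural-log units; the exact form and the decision threshold used by the Celera Assembler may differ.

```python
import math

def a_statistic(unitig_len, frags_in_unitig, frags_total, genome_len):
    """Natural-log odds that a unitig is unique rather than two collapsed copies.

    Derivation (ASSUMED Poisson model of fragment start points):
      ln[ Poisson(k; rho*n/G) / Poisson(k; 2*rho*n/G) ] = rho*n/G - k*ln(2)
    """
    arrival_rate = frags_total / genome_len  # expected fragment starts per base
    return unitig_len * arrival_rate - frags_in_unitig * math.log(2)

# Hypothetical numbers: 10 kbp unitig, 1,000,000 fragments, 100 Mbp genome,
# so about 100 fragments are expected in a unique unitig of this length.
print(round(a_statistic(10_000, 100, 1_000_000, 100_000_000), 2))  # as expected
print(round(a_statistic(10_000, 200, 1_000_000, 100_000_000), 2))  # twice as deep
```

A positive value favors a unique unitig; a unitig holding about twice the expected number of fragments scores negative and is suspect as a collapsed repeat.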
Celera Assembler: scaffold • Contigs are ordered and oriented into scaffolds with approximately known distances between them (using mate pairs or BAC ends)
Human genome • 2001: two assemblies of initial human genome sequences published • International Human Genome Project (hierarchical sequencing; BAC shotgun) • Celera Genomics: WGS approach • Initial impact of the sequencing of the human genome (Nature 470:187–197, 2011)
Assembly of human genome [Figure label: sequence tagged site (STS) markers] Ref: J. C. Venter et al., Science 291, 1304-1351 (2001)
Fragment assembly: two alternative choices • Find a path visiting every VERTEX exactly once in the OVERLAP graph: Hamiltonian path problem (NP-complete; no efficient algorithm is known) • Find a path visiting every EDGE exactly once in the REPEAT graph: Eulerian path problem (linear-time algorithms are known)
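The Eulerian side of this contrast can be sketched with Hierholzer's algorithm, which runs in time linear in the number of edges. As a stand-in for the repeat graph, the sketch builds a de Bruijn graph in which every k-mer is an edge between its (k-1)-base prefix and suffix; that substitution, the function names, and the toy k-mers are assumptions for illustration. The graph is assumed to actually contain an Eulerian path.

```python
from collections import defaultdict

def de_bruijn(kmers):
    """Each k-mer becomes an edge from its (k-1)-prefix to its (k-1)-suffix."""
    graph = defaultdict(list)
    for kmer in kmers:
        graph[kmer[:-1]].append(kmer[1:])
    return graph

def eulerian_path(graph):
    """Hierholzer's algorithm; assumes the graph has an Eulerian path."""
    in_deg = defaultdict(int)
    for targets in graph.values():
        for v in targets:
            in_deg[v] += 1
    # Start where out-degree exceeds in-degree by one, else at any node with edges.
    start = next((u for u in graph if len(graph[u]) - in_deg[u] == 1),
                 next(iter(graph)))
    remaining = {u: list(vs) for u, vs in graph.items()}  # unused edges per node
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if remaining.get(u):
            stack.append(remaining[u].pop())  # follow an unused edge
        else:
            path.append(stack.pop())          # dead end: emit node, backtrack
    return path[::-1]

def spell(path):
    """Reconstruct the sequence spelled by a walk through (k-1)-mer nodes."""
    return path[0] + "".join(node[-1] for node in path[1:])

kmers = ["ATG", "TGG", "GGC", "GCG", "CGT", "GTG", "TGC", "GCA"]
assembled = spell(eulerian_path(de_bruijn(kmers)))
print(assembled)
```

Repeats show up here as more than one valid Eulerian path: these eight 3-mers are consistent with both ATGGCGTGCA and ATGCGTGGCA, and the algorithm returns one of them.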
Overlap graph • Thick edges (a Hamiltonian cycle) correspond to the correct layout of the reads along the genome • The other edges are false overlaps induced by repeats