500 likes | 821 Views
Computational Genomics: Genome assembly. Andrey Kislyuk 25 January 2010. Why do we need to assemble genomes?. DNA sequencing methods can’t sequence more than about 1000 nt at a time Sanger method (1975) chain termination with labeled ddNTPs Maxam-Gilbert method (1976) cleaving agents
E N D
Computational Genomics:Genome assembly Andrey Kislyuk 25 January 2010
Why do we need to assemble genomes? • DNA sequencing methods can’t sequence more than about 1000 nt at a time • Sanger method (1975) • chain termination with labeled ddNTPs • Maxam-Gilbert method (1976) • cleaving agents • Both require primers for DNA Polymerase, but we can’t make the primers until we know the sequence! • Both limited to ~1000 nt on the gel/capillary • Both require cloning and/or PCR amplification • Large-scale sequencing: primer walking • Slow, costly, and error-prone: not practical beyond ~10Kbp 5 June 2014·Computational Genomics
Shotgun sequencing • Whole genome shotgun sequencing (1995) • Hydroshearing: prepare several libraries of random fragments approx. 2, 5, 10, 50… Kbp long • Cloning: use bacterial plasmids to grow DNA – problems arise if DNA contains a gene harmful to the host bacteria • Picking, amplification • Sanger sequencing, capillary electrophoresis, read out fluorescent dyes with a laser – 4 different colors • Result: lots of ~1000 nt Sanger reads • Assemble them with pairwise sequence alignment • Multiple coverage corrects errors • Seems straightforward now, but many did not believe it could be done! 5 June 2014·Computational Genomics
How much shotgun sequencing? • So, can we really sequence the whole genome with this? (No, we can’t.) • Lander and Waterman (1988): • Assuming random distribution of reads and ignoring repeat resolution issues, Define: • G = genome length • L = length of a single read • Then overall coverage is C = LN/G • N = number of reads sequenced • T = minimum overlap to align the reads together • Coverage for any given base obeys the Poisson distribution: • The number of gaps (bases with 0 coverage) is: 5 June 2014·Computational Genomics
How much shotgun sequencing? 5 June 2014·Computational Genomics
How much shotgun sequencing? 5 June 2014·Computational Genomics
Pioneers of sequence assembly • J. Craig Venter’s group at TIGR (later JCVI) • Created TIGR Assembler, Celera Assembler, and associated tools • Jim Kent (UC Santa Cruz) • Created GigAssembler • Allowed the Human Genome Project to compete with Celera • Philip Green’s group at the University of Washington • Created Phred, Phrap and Consed tools • Sequencing centers: JCVI, Sanger Institute, Whitehead/MIT, DOE JGI, Baylor HGSC, WUSTL 5 June 2014·Computational Genomics
Why do we need to assemble genomes? • 2nd Generation sequencing methods • Cheaper and more processive (sequence more data), but shorter read length • 454 Pyrosequencing: 200-600 nt average read length • Illumina: 50-70 nt average read length • ABI SOLiD: 50 nt average read length • Same idea: randomly hydrosheared library • Random reads from across the genome form a big puzzle 5 June 2014·Computational Genomics
Next Generation Sequencing Technologies Sequencing by synthesis • 2nd generation • 454 Pyrosequencing • Solexa/Illumina • SOLiD • 3rd generation • Single-molecule sequencing • Nanopore sequencing 5 June 2014·Computational Genomics
454 Pyrosequencing A + PCR Reagents + Emulsion Oil B Mix DNA library & capture beads (limited dilution) Create “Water-in-oil” emulsion “Break micro-reactors” Isolate DNA containing beads Perform emulsion PCR 5 June 2014·Computational Genomics
454 Pyrosequencing 44 μm Load enzyme beads Load beads into PicoTiter™Plate PicoTiter™Plate Diameter = 44 μm Depth = 55 μm Well size = 75 pl Well density = 480 wells mm-2 1.6 million wells per slide 5 June 2014·Computational Genomics
454 Pyrosequencing Sequencing by synthesis Photonsgenerated are captured by CCD camera Reagent flow Margulies et al., 2005
Raw sequencer output 4-mer 3-mer Measures the presence or absence of each nucleotide at any given position TACG Flow Order 2-mer KEY (TCAG) 1-mer • Sanger: Trace (usually .ab1/.scf file format) • 454: Flowgram (.sff file format) 5 June 2014·Computational Genomics
Assembly algorithms • Paradigms • Overlap-Layout-Consensus • De Bruijn graphs 5 June 2014·Computational Genomics
Differences between an overlap graph and a de Bruijn graph Schatz M C et al. Genome Res. 2010;20:1165-1173 ©2010 by Cold Spring Harbor Laboratory Press
Overlap-Layout-Consensus Assemblers: ARACHNE, PHRAP, CAP, TIGR, CELERA Overlap: find potentially overlapping reads Layout: merge reads into contigs and contigs into supercontigs Consensus: derive the DNA sequence and correct read errors ..ACGATTACAATAGGTT.. Credit: Serafim Batzoglou
Overlap: A pairwise alignment problem • Find the best match between the suffix of one read and the prefix of another • Due to sequencing errors, need to use dynamic programming to find the optimal overlap alignment • Apply a filtration method to filter out pairs of fragments that do not share a significantly long common substring Credit: Serafim Batzoglou
Overlapping Reads T GA TACA | || || TAGA TAGT • Sort all k-mers in reads (k ~ 24) • Find pairs of reads sharing a k-mer • Extend to full alignment – throw away if not >95% similar TAGATTACACAGATTAC ||||||||||||||||| TAGATTACACAGATTAC Credit: Serafim Batzoglou
Overlapping Reads and Repeats • A k-mer that appears N times initiates N2 comparisons • For an Alu that appears 106 times 1012 comparisons – too much • Solution: Discard all k-mers that appear more than t Coverage, (t ~ 10) Credit: Serafim Batzoglou
Finding Overlapping Reads Create local multiple alignments from the overlapping reads TAGATTACACAGATTACTGA TAGATTACACAGATTACTGA TAG TTACACAGATTATTGA TAGATTACACAGATTACTGA TAGATTACACAGATTACTGA TAGATTACACAGATTACTGA TAG TTACACAGATTATTGA TAGATTACACAGATTACTGA Credit: Serafim Batzoglou, Masahiro Kasahara
Finding Overlapping Reads (cont’d) • Correcterrors using multiple alignment C: 20 C: 20 C: 35 C: 35 C: 0 T: 30 C: 35 C: 35 TAGATTACACAGATTACTGA C: 40 C: 40 TAGATTACACAGATTACTGA TAG TTACACAGATTATTGA TAGATTACACAGATTACTGA TAGATTACACAGATTACTGA A: 15 A: 15 A: 25 A: 25 - A: 0 A: 40 A: 40 A: 25 A: 25 • Score alignments • Accept alignments with good scores Credit: Serafim Batzoglou
Layout • Repeats are a major challenge • Do two aligned fragments really overlap, or are they from two copies of a repeat? Credit: Serafim Batzoglou
The k-mer uniqueness ratio Schatz M C et al. Genome Res. 2010;20:1165-1173 ©2010 by Cold Spring Harbor Laboratory Press
Merge Reads into Contigs repeat region Merge reads up to potential repeat boundaries Credit: Serafim Batzoglou
Merge Reads into Contigs (cont’d) repeat region • Ignore non-maximal reads • Merge only maximal reads into contigs Credit: Serafim Batzoglou
Merge Reads into Contigs (cont’d) repeat boundary??? • Ignore “hanging” reads when detecting repeat boundaries sequencing error b a Credit: Serafim Batzoglou
Merge Reads into Contigs (cont’d) ????? Unambiguous • Insert non-maximal reads whenever unambiguous Credit: Serafim Batzoglou
Link Contigs into Supercontigs (aka scaffolds) Normal density Too dense: Overcollapsed? (Myers et al. 2000) Inconsistent links: Overcollapsed? Credit: Serafim Batzoglou
Link Contigs into Supercontigs (cont’d) Find all links between unique contigs Connect contigs incrementally, if 2 links Credit: Serafim Batzoglou
Link Contigs into Supercontigs (cont’d) Fill gaps in supercontigs with paths of overcollapsed contigs Credit: Serafim Batzoglou
Link Contigs into Supercontigs (cont’d) d ( A, B ) Contig A Contig B • Define G = ( V, E ) • V := contigs • E := ( A, B ) such that d( A, B ) < C • Reason to do so: Efficiency; full shortest paths cannot be computed Credit: Serafim Batzoglou
Link Contigs into Supercontigs (cont’d) Contig A Contig B Define T: contigs linked to either A or B Fill gap between A and B if there is a path in G passing only from contigs in T Credit: Serafim Batzoglou
Consensus • A consensus sequence is derived from a profile of the assembled fragments • A sufficient number of reads is required to ensure a statistically significant consensus • Reading errors are corrected Credit: Serafim Batzoglou
Derive Consensus Sequence Derive multiple alignment from pairwise read alignments TAGATTACACAGATTACTGA TTGATGGCGTAA CTA TAGATTACACAGATTACTGACTTGATGGCGTAAACTA TAG TTACACAGATTATTGACTTCATGGCGTAA CTA TAGATTACACAGATTACTGACTTGATGGCGTAA CTA TAGATTACACAGATTACTGACTTGATGGGGTAA CTA TAGATTACACAGATTACTGACTTGATGGCGTAA CTA Derive each consensus base by weighted voting Credit: Serafim Batzoglou
Mate pairs and paired-end reads • Mate pairs: Circularize and trim size-selected fragments during library preparation. Inserts can be approx. 1, 5, 10, 20 Kbp long. • Paired-end reads: Sequence a short amplified fragment from both ends. Fragment length is more precise but limited to about 300 bp. 5 June 2014·Computational Genomics
Mate pairs/Paired-end reads 5 June 2014·Computational Genomics
Paired end reads (aka mate pairs) 5 June 2014·Computational Genomics Credit: 454 Life Sciences
Base Calling and Trimming • Base Calling: the process of translating the raw sequencer output into • The most likely nucleotide sequence • Confidence scores for each position • Trimming: the process of removing adapter, key, vector, and/or low quality sequence from a read 5 June 2014·Computational Genomics
Reference-based assembly • Reference-based assembly • Replaces overlap detection with alignment against a similar genome • Also called mapping, mapped assembly Credit: M. Schatz 5 June 2014·Computational Genomics
Reference-based assembly • Use a related genome to ease the layout task • Much faster computationally • Arranges reads with more confidence, so a better assembly is possible • Allows other types of analysis: somatic mutations, organismal SNPs, structural variation, RNA-Seq, … 5 June 2014·Computational Genomics
Assembly quality control • QC/QA • Metrics: Size, number of contigs, N50 • Diagnostic procedures 5 June 2014·Computational Genomics
Genome size as predicted from the assembly 5 June 2014·Computational Genomics
Read length, paired-end reads, coverage • Read length and paired-end reads matter. • Long reads can span repeat regions • Paired-end reads can reach into repeat regions and bridge gaps • Combination of the two maximizes shotgun sequencing performance • Coverage also matters. • High coverage allows very high confidence in base calling • Can do repeat resolution based on coverage fitting • More likely that a read will span an ambiguous region 5 June 2014·Computational Genomics
Scaffolding • If paired end reads are available, scaffolding is already done. • If not (our case)… • Sequenced relatives may exist (our case) • Use reference-based assembly to predict scaffolding • No ready-made tools available for this • Can be inaccurate • Assemblers can get confused by repeats or overlaps that are too short • May be able to join by hand • Manual gap fill • Automated gap fill (no tools exist yet) 5 June 2014·Computational Genomics
Finishing • Finishing is the process of completing the chromosome sequence. • Close all gaps (usually by PCR, but large gaps in big genomes can be sent back to make BACs for resequencing) • Re-sequence areas with less than 2x, 3x, 5x coverage (depending on quality standard) – same procedure as gaps • Check and manually assemble unresolved repeat regions • Check for mis-assembly by analyzing the overlap graph • Lots of Consed work! • This is the most expensive and time-consuming part of sequencing. • Lots of small projects omit finishing and work with draft genomes 5 June 2014·Computational Genomics
Assemblers we used and our results 5 June 2014·Computational Genomics
Homopolymer errors 4-mer 3-mer Measures the presence or absence of each nucleotide at any given position TACG Flow Order 2-mer KEY (TCAG) 1-mer • Specific to 454 pyrosequencing • Sequencing errors usually result in frameshifts! 5 June 2014·Computational Genomics
Visualization tools: Mauve 5 June 2014·Computational Genomics
More topics • Currently popular assemblers • Newbler (demo) • Velvet • ALLPATHS 2 • ABYSS • SHRiMP • Celera/WGS • PHRAP • Other visualization tools (Consed, MAQ, Prospector 2, ABySS-Explorer…) • Microread assembly (Solexa and SOLiD) • de Bruijn graph assembly paradigm 5 June 2014·Computational Genomics