Computational assembly for prokaryotic sequencing projects. Lee Katz, Ph.D., Bioinformatician, Enteric Diseases Laboratory Branch. January 15, 2014.
Disclaimers: The findings and conclusions in this presentation have not been formally disseminated by the Centers for Disease Control and Prevention and should not be construed to represent any agency determination or policy. The findings and conclusions in this [report/presentation] are those of the author(s) and do not necessarily represent the official position of CDC.
Lee Katz, Present • Currently in the National Enteric Reference Laboratory: Vibrio, Campylobacter, Escherichia, Shigella, Yersinia, Salmonella • Focusing on Listeria and Vibrio
One of my projects is #2 on CDC’s list of accomplishments for 2013! http://www.cdc.gov/features/endofyear/
Outline • Sequencing: 1st gen, 2nd gen, 3rd gen • Reads: quality control (Q/C), read metrics, read cleaning • Assembly: algorithms, assembly metrics
Prokaryotic Sequencing Projects • Stages: sequencing, assembly, feature prediction, functional annotation, …analysis…, display (Genome Browser) • Examples: Haemophilus influenzae, Neisseria meningitidis, Bordetella bronchiseptica, Vibrio cholerae, Listeria monocytogenes • Fleischmann et al. (1995) “Whole-Genome Random Sequencing and Assembly of Haemophilus influenzae Rd” Science 269:5223 • Kislyuk et al. (2010) “A computational genomics pipeline for prokaryotic sequencing projects” Bioinformatics 26:15
Out with the old; in with the new: two new technologies for the computational genomics class! • 454 • Illumina single-end reads • Illumina paired-end reads • PacBio
Sequencing: first generation • Margulies et al. (2005) “Genome sequencing in open microfabricated high density picoliter reactors.” Nature 437:7057
Sanger sequencing output • Usually .ab1/.scf file format
454 Pyrosequencing: emulsion PCR • Mix DNA library & capture beads (limited dilution) • Create “water-in-oil” emulsion (PCR reagents + emulsion oil) • Perform emulsion PCR • “Break micro-reactors” • Isolate DNA-containing beads
454 Pyrosequencing • Load beads into PicoTiter™ Plate • Load enzyme beads • PicoTiter™ Plate wells: diameter = 44 μm, depth = 55 μm, well size = 75 pl, well density = 480 wells mm⁻², 1.6 million wells per slide
454 Pyrosequencing: sequencing by synthesis • Reagent flow • Photons generated are captured by CCD camera • Margulies et al., 2005
454 sequencing output • Flowgram (.sff file format) • Measures the presence or absence of each nucleotide at any given position • Flow order: TACG • Key: TCAG • Signal height indicates homopolymer length (1-mer, 2-mer, 3-mer, 4-mer)
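To make the flowgram idea concrete, here is a minimal Python sketch of turning flow signals into bases. It assumes the TACG flow order from the slide and simply rounds each signal to the nearest whole number to get the homopolymer length. The function name and the example signal values are made up for illustration; real base callers for .sff data are more sophisticated than this.

```python
def flowgram_to_bases(flow_values, flow_order="TACG"):
    """Convert flow signal intensities into a base sequence.

    Assumption (illustrative only): a signal near 1.0 means one base,
    near 2.0 means a 2-mer, and so on; a signal near 0 means the
    flowed nucleotide was not incorporated.
    """
    bases = []
    for i, signal in enumerate(flow_values):
        nucleotide = flow_order[i % len(flow_order)]
        homopolymer_length = int(signal + 0.5)  # round to nearest integer
        bases.append(nucleotide * homopolymer_length)
    return "".join(bases)


# Hypothetical signals for the first eight flows (T, A, C, G, T, A, C, G).
key_signals = [1.02, 0.05, 0.98, 0.03, 0.10, 1.10, 0.00, 0.95]
print(flowgram_to_bases(key_signals))
```

With these made-up signals the first eight flows spell out the TCAG key, which is how the key sequence at the start of each read can be recognized.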
The following animations are courtesy of Illumina, Inc. • Figure labels: flow cell surface, P5 grafting primer, P7 grafting primer, region complementary to P5 grafting primer, Index 2, P5 primer, DNA insert, P7 primer, Index 1
SBS Sequencing Primer Hybridization
Sequence (Cycle 1)
7 dark cycles • P5 grafting primer
Index 2 index read: 8 cycles • 7 dark cycles • P5 grafting primer
Linearization • Original strand • New strand
Illumina sequencing video http://www.youtube.com/watch?v=womKfikWlxM
PacBio sequencing* (3rd Gen) *Pacific Biosciences
SMRTbell • Zero-mode waveguide (ZMW), a very fancy and very small well • Thanks to PacBio for donating some slide materials in this section • Eid et al., Science, January 2009, doi:10.1126/science.1162986 • http://www.youtube.com/watch?v=NHCJ8PtYCFc
PacBio video http://www.youtube.com/watch?v=NHCJ8PtYCFc
Reads: Q/C + cleaning + metrics
Q/C You need to know if your data are good! Example software • FastQC • Computational Genomics Pipeline (CG-Pipeline)
Quality Control FastQC output
Quality Control • FastQC output (continued)
The CG-Pipeline way: run_assembly_readMetrics.pl
File       avgReadLength  totalBases  minReadLength  maxReadLength  avgQuality
tmp.fastq  80.00          177777760   80             80             35.39
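For readers who want to see what these columns mean, the sketch below computes the same kinds of metrics in Python. It is not the CG-Pipeline code: it assumes a simple four-line FASTQ with Phred+33 qualities, and it assumes avgQuality is the mean over all bases, which may differ from how run_assembly_readMetrics.pl defines it. The function name and dictionary keys are chosen here only to mirror the column names.

```python
import io


def read_metrics(fastq_handle, phred_offset=33):
    """Summarize read lengths and qualities from a four-line FASTQ stream."""
    lengths = []
    quality_sum = 0
    while True:
        header = fastq_handle.readline()
        if header == "":          # end of file
            break
        if not header.strip():    # tolerate blank lines between records
            continue
        seq = fastq_handle.readline().strip()
        fastq_handle.readline()   # the '+' separator line
        qual = fastq_handle.readline().strip()
        lengths.append(len(seq))
        quality_sum += sum(ord(c) - phred_offset for c in qual)

    total_bases = sum(lengths)
    return {
        "avgReadLength": total_bases / len(lengths) if lengths else 0.0,
        "totalBases": total_bases,
        "minReadLength": min(lengths) if lengths else 0,
        "maxReadLength": max(lengths) if lengths else 0,
        # Assumption: average taken over every base, not per read.
        "avgQuality": quality_sum / total_bases if total_bases else 0.0,
    }


# Tiny in-memory example; 'I' encodes Q40 and '!' encodes Q0 in Phred+33.
example = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nAC\n+\n!!\n")
print(read_metrics(example))
```

In practice the handle would come from `open("tmp.fastq")` rather than an in-memory string.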
Read cleaning with CG-Pipeline (not validated; please use with caution) • Figure panels: %ACGT and Phred, for forward (F.) and reverse (R.) reads • http://sourceforge.net/projects/cg-pipeline/ • Graphs made with FastqQC (AMOS)
1. Trimming low-quality ends: run_assembly_trimLowQualEnds.pl • 1A. %ACGT • 1B. Phred • Figure panels: forward (F.) and reverse (R.) reads • http://sourceforge.net/projects/cg-pipeline/ • Graphs made with FastqQC (AMOS)
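As a rough illustration of end trimming, the Python sketch below removes bases from both ends of a read until it reaches a base at or above a quality threshold. This covers only the Phred part of the step and is an assumption about the general technique, not a description of what run_assembly_trimLowQualEnds.pl does internally. The threshold of 20 and the Phred+33 encoding are also assumed defaults.

```python
def trim_low_quality_ends(seq, qual, min_quality=20, phred_offset=33):
    """Trim low-quality bases from the 5' and 3' ends of one read.

    Returns the trimmed (sequence, quality string) pair; both are empty
    if no base reaches min_quality.
    """
    scores = [ord(c) - phred_offset for c in qual]
    start, end = 0, len(scores)
    # Walk inward from the 5' end past low-quality bases.
    while start < end and scores[start] < min_quality:
        start += 1
    # Walk inward from the 3' end past low-quality bases.
    while end > start and scores[end - 1] < min_quality:
        end -= 1
    return seq[start:end], qual[start:end]


# 'I' is Q40 and '!' is Q0, so two bases are removed from each end.
print(trim_low_quality_ends("ACGTACGT", "!!IIII!!"))
```

Low-quality bases in the middle of a read are left alone here; only the ends are touched.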
2a. Removing duplicate reads • 2b. Sometimes: downsampling • run_assembly_removeDuplicateReads.pl • Trimmed reads • http://sourceforge.net/projects/cg-pipeline/
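Both ideas on this slide can be sketched in a few lines of Python. The version below treats two reads as duplicates when their sequences are identical and keeps the first one seen, and it downsamples by picking a fixed fraction of reads at random. These are assumptions made for illustration; run_assembly_removeDuplicateReads.pl may define duplicates and downsampling differently. Reads are represented as hypothetical (name, sequence, quality) tuples.

```python
import random


def remove_duplicate_reads(reads):
    """Keep the first read for each distinct sequence (exact matches only)."""
    seen = set()
    unique = []
    for name, seq, qual in reads:
        if seq in seen:
            continue
        seen.add(seq)
        unique.append((name, seq, qual))
    return unique


def downsample(reads, fraction, seed=0):
    """Randomly keep about `fraction` of the reads, preserving input order."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    keep = int(len(reads) * fraction)
    chosen = sorted(rng.sample(range(len(reads)), keep))
    return [reads[i] for i in chosen]


reads = [("r1", "ACGT", "IIII"), ("r2", "ACGT", "IIII"), ("r3", "TTTT", "IIII")]
unique = remove_duplicate_reads(reads)
print([name for name, _, _ in unique])
print(len(downsample(unique, 0.5)))
```

Holding every sequence in a set is fine for a sketch but can use a lot of memory on a full sequencing run.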
3. Trimming and filtering: run_assembly_trimClean.pl • 3A. Trimming (min avg. quality, min length) • 3B. Filtering (min avg. quality, min length) • http://sourceforge.net/projects/cg-pipeline/
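The filtering half of this step comes down to two checks per read: is it long enough, and is its average quality high enough? The Python sketch below shows that logic under assumed thresholds (minimum length 50, minimum average quality 20, Phred+33 encoding). The function name and defaults are hypothetical and are not taken from run_assembly_trimClean.pl.

```python
def passes_filter(qual, min_avg_quality=20.0, min_length=50, phred_offset=33):
    """Return True if a read meets both the length and average-quality cutoffs."""
    if not qual or len(qual) < min_length:
        return False
    avg_quality = sum(ord(c) - phred_offset for c in qual) / len(qual)
    return avg_quality >= min_avg_quality


# Hypothetical reads as (name, sequence, quality); 'I' is Q40 and '!' is Q0.
reads = [
    ("good", "ACGT", "IIII"),
    ("short", "ACG", "III"),
    ("lowqual", "ACGT", "!!!!"),
]
kept = [name for name, seq, qual in reads if passes_filter(qual, min_length=4)]
print(kept)
```

For paired-end data, a real tool also has to decide what to do with the mate of a discarded read; this sketch looks at each read on its own.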
More • Software: Fastx toolkit http://hannonlab.cshl.edu/fastx_toolkit/ • EA-utils https://code.google.com/p/ea-utils/ • AMOS (SourceForge.net) • … and more is out there! • Evaluation: Del Fabbro et al. 2013, “An extensive evaluation of read trimming effects on Illumina NGS data analysis”