Genome Assembly Kelley Bullard, Henry Dewhurst, Kizee Etienne, Esha Jain, VivekSagar KR, Benjamin Metcalf, Raghav Sharma, Charles Wigington, Juliette Zerick
Outline
• Input Data
• Sequence read data
• Pipeline Review
• Un-processed data
• Assemblers
• Preliminary data – assembler comparison
• Visualization
• Future
Input Data Input Data / Sequence Read Data / Pipeline Review / Un-processed data / Assemblers / Preliminary Data / Visualization / Future
Vibrio navarrensis - 454
Vibrio vulnificus - 454
Vibrio navarrensis - Illumina
Vibrio vulnificus - Illumina
Pipeline: Revisited
[Flowchart. Stages and labels recovered from the diagram:]
PRE-PROCESSING
• Input: 454 raw reads, Illumina raw reads
• Tools: FastQC, PRINSEQ, NGS QC
• Output: processed 454 reads, processed Illumina reads, read stats
DE NOVO ASSEMBLY (Illumina / 454 / hybrid), with parameter optimization
• Illumina de novo: ALLPATHS-LG, SOAPdenovo, Velvet, Taipan, SUTTA
• 454 de novo: Newbler, CABOG, SUTTA
• Hybrid de novo: Ray, MIRA
• Evaluation and statistical analysis: GAGE, Hawkeye
• Output: contigs × 3 (assemblers chosen)
CONTIG MERGING
• All possible combinations of the best 3
• Tools: Minimus, MAIA
• Align Illumina reads against 454 contigs; unmapped reads
GENOME FINISHING
• Scaffolds, gap filling, finished genome
• Tools: MUMmer, PAGIT, Mauve; MacVector, CLC Workbench; GRASS, built-in
REFERENCE SELECTION
• Published genomes from public databases: V. vulnificus YJ016, V. vulnificus CMCP6, V. vulnificus MO6-24/O
• Align Illumina reads against the reference: bwa, samstats
• Compare mapping statistics; nucleotide identity
• Output: chosen reference genome
REFERENCE BASED ASSEMBLY
• Illumina/(454?) reference-based assembly: AMOScmp
• Reference evaluation: MUMmer, DNA Diff
• Output: draft/finished genome; unmapped reads
Vibrio navarrensis - 454; unprocessed data
Vibrio vulnificus - Illumina; unprocessed data
Vibrio navarrensis - Illumina; unprocessed data
Per base sequence quality [figure panels: vul_454_07-2444, nav_454_2541-90, vul_ill_06-2432, nav_ill_08-2462]
Per base sequence content [figure panels: vul_454_06-2432, nav_454_08-2462, vul_ill_06-2432, nav_ill_06-2756-81]
Sequence duplication levels [figure panels: vul_454_08-2435, nav_454_2541-90, nav_ill_08-2462, vul_ill_06-2432]
Pre-processing stats
Assemblers
CLC Genomics
• Word size: automatic. CLC bio's de novo assembly algorithm uses de Bruijn graphs. It builds a table of all sub-sequences of a certain length (called words) found in the reads.
• Bubble size: automatic. A bubble is a bifurcation in the graph where a path splits into two nodes and then merges back into one.
• Minimum contig length: 200
• Mismatch cost: 2. The cost of a mismatch between the read and the reference sequence.
• Insertion cost: 3. The cost of an insertion in the read (causing a gap in the reference sequence).
• Deletion cost: 3. The cost of having a gap in the read. The score for a match is always 1.
• Length fraction: 0.5. The minimum fraction of a read's length that must match the reference sequence. A value of 0.5 means that at least half the read must match the reference for the read to be included in the final mapping.
• Similarity: 0.8. The minimum fraction of identity between the read and the reference sequence. For example, to require at least 90% identity for a read to be included in the final mapping, set this value to 0.9.
• Update contigs based on mapped reads. The original contig sequences produced by the de novo assembly are updated to reflect the mapping of the reads.
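The word table and graph described for CLC Genomics can be illustrated with a minimal sketch. This is a generic de Bruijn construction, not CLC bio's implementation; the function names, the toy reads, and the word size k=4 are assumptions chosen only for illustration.

```python
from collections import Counter, defaultdict


def word_table(reads, k):
    """Count every length-k sub-sequence (word) found in the reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def de_bruijn_edges(words):
    """Link the (k-1)-mer prefix of each word to its (k-1)-mer suffix."""
    graph = defaultdict(set)
    for word in words:
        graph[word[:-1]].add(word[1:])
    return graph


# Toy reads (hypothetical, not project data) that overlap by five bases.
reads = ["ACGTAC", "CGTACG"]
table = word_table(reads, k=4)
graph = de_bruijn_edges(table)

for prefix in sorted(graph):
    print(prefix, "->", sorted(graph[prefix]))
```

A node with more than one successor marks a branch point; a bubble, as defined above, is a pair of branches that rejoin at a later node.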
Velvet
• De Bruijn graph assembler
• Max k-mer length: 31, default 29
• Commands
• velveth directory hash_length -file_format -read_type filename
• velvetg VAssemILL -exp_cov auto -cov_cutoff auto
• exp_cov auto: let the system infer the expected coverage of unique regions
• cov_cutoff auto: let the system infer the coverage cutoff used to remove low-coverage nodes
• Designed for very short reads (25-50 bp)
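Velvet expresses coverage values such as exp_cov and cov_cutoff as k-mer coverage rather than base coverage. The Velvet manual gives the conversion Ck = C * (L - k + 1) / L, where C is base coverage, L is read length, and k is the hash length. The sketch below applies that formula; the function name and the example numbers are assumptions for illustration, not values from this project.

```python
def kmer_coverage(base_coverage, read_length, k):
    """Convert base coverage C to k-mer coverage Ck = C * (L - k + 1) / L."""
    if k > read_length:
        raise ValueError("k must not exceed the read length")
    return base_coverage * (read_length - k + 1) / read_length


# Hypothetical example: 30x base coverage, 100 bp reads, k = 31.
print(kmer_coverage(30, 100, 31))
```

Because longer k-mers fit into each read fewer times, the same data set yields a lower k-mer coverage as k grows, which is worth keeping in mind when comparing runs at different hash lengths.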
Newbler
• De novo OLC (overlap-layout-consensus) assembler
• Uses k-mer based hashing
• Command: runAssembly [filename]
• Designed for longer reads (454)
SOAPdenovo2
• Short-read de novo assembler
• Designed to assemble Illumina GAII reads
• Command: SOAPdenovo-127mer all -s test.config -K 30 -R -p 4 -N 4600000 -o test_OP 1>ass.log 2>ass.err
• Parameters specified:
• Insert_size: 0, single-end reads
• Kmer_size: 23, default
• asm_flag: both contigs and scaffolds
Assembler comparison - 454 [figure panels: nav_454_2541-90, vul_454_06-2432]
Assembler comparison - Illumina [figure panels: nav_ill_2541-90, vul_ill_06-2432]
Pipeline: Revisited
[Flowchart. Stages and labels recovered from the diagram:]
PRE-PROCESSING
• Input: 454 raw reads, Illumina raw reads
• Tools: FastQC, PRINSEQ, NGS QC
• Output: processed 454 reads, processed Illumina reads, read stats
DE NOVO ASSEMBLY (Illumina / 454 / hybrid), with parameter optimization
• Illumina de novo: ALLPATHS-LG, SOAPdenovo, Velvet, SUTTA
• 454 de novo: Newbler, CABOG, SUTTA
• Hybrid de novo: Ray
• Evaluation and statistical analysis: GAGE, Hawkeye
• Output: contigs × 3 (assemblers chosen)
CONTIG MERGING
• All possible combinations of the best 3
• Tools: Minimus, MAIA
• Align Illumina reads against 454 contigs; unmapped reads
GENOME FINISHING
• Scaffolds, gap filling, finished genome
• Tools: MUMmer, PAGIT, Mauve; MacVector, CLC Workbench; GRASS, built-in
REFERENCE SELECTION
• Published genomes from public databases: V. vulnificus YJ016, V. vulnificus CMCP6, V. vulnificus MO6-24/O
• Align Illumina reads against the reference: bwa, samstats
• Compare mapping statistics; nucleotide identity
• Output: chosen reference genome
REFERENCE BASED ASSEMBLY
• Illumina/454? reference-based assembly: AMOScmp
• Reference evaluation: DNA Diff
• Output: draft/finished genome; unmapped reads
Reference Genomes • V. vulnificus MO6-24/O • V. vulnificus YJ016 • V. vulnificus CMCP6
Reference vs. all contigs - 454 [figure panels: nav_454_2541-90, vul_454_06-2432]
Reference vs. all contigs - Illumina [figure panels: nav_ill_2541-90, vul_ill_06-2432]
Visualization
Road ahead
• Get all the tools working
• Optimize tool parameters
• Use Illumina reads to finish 454 contigs
• Performance considerations for the tool