450 likes | 603 Views
Assembly group Presentation II. Robert Arthur Kevin Lee Xing Liu Pushkar Pande Gena Tang Racchit Thapliyal Tianjun Ye. Two problematic libraries. M19107 & M21639 were resequenced due to low coverage M19107 still had low coverage after re- seq M21639 coverage increased to ~80X.
E N D
Assembly group Presentation II Robert Arthur Kevin Lee Xing Liu Pushkar Pande Gena Tang Racchit Thapliyal Tianjun Ye
Two problematic libraries • M19107 & M21639 were resequenced due to low coverage • M19107 still had low coverage after re-seq • M21639 coverage increased to ~80X
Mixed culture? • Is this a mixed culture? • What does a mixed culture assembly look like? • newProject, addRun, addRun, runProject … • M19501 & M21127 • M21127 & M21621
Possible explanations for poor assembly • About 20% larger genome • Recent plasmid or other large genome insertion • Also lost hemolytic ability and H2S production
De novo assembly • Limited by our data, 36 contigs is lower limit • Number of repeat elements • rRNA • 20X coverage is sufficient
Assembler Evaluation Strategy • Single assembler evaluation • Minimus2 assembler evaluation • Feedback by gene prediction group
Single Assembler Evaluation (“Hard Measurement”)
Single assembler evaluation • “Soft” measurement: satisfy gene prediction group's requirement. • RNA prediction group requires a file which can trace back the depth of the reads. For this, we use the .tsv file in the Newbler output.
Newbler Output • 454AlignmentInfo.tsv (-infoall/-info/-noinfo) base consensus, quality, depth and flow-signal, at each position in each contig. A very useful file. • eg: Position Consensus Quality Unique Align Signal Signal Score Depth Depth StdDev (incl. duplicates) >contig00008 1 G 64 26 32 0.98 0.05 2 A 64 27 33 0.94 0.13 3 T 64 27 33 1.97 0.14 4 T 64 27 33 1.97 0.14 5 G 64 27 33 0.97 0.06 ...etc...
Single Assembler Evaluation • For “hard” measurement: we focused mainly on “Total Contigs” “N50 Contigs Bases”, “Total Big Contigs”, “Big Contigs Percent Bases”, “Big Contig Reads”, “Singleton Reads”. • For “soft” measurement: we focused on the trace back of depth for each base pair.
Final Rank of Single Assembler • Combine the “hard” and “soft” measurement manually. We get as a result: • 1. Newbler De Novo • 2. Mira3 • 3. Amos • Eliminated : Newbler reference & velvet
Minimus2 Evaluation • Since our top choice is Newbler, we want to include Newbler’s results in the merged contigs. Thus, we analysed the statistics of: 1. Newbler merged with Amos 2. Newbler merged with Mira • Visulization tool: hawkeye
Final Rankings of Single Assembler • Combined the “hard” and “soft” measurement manually and we got: • 1. Newbler De Novo • 2. Mira3 • 3. Amos • Eliminated : Newbler reference & velvet
Minimus2 Evaluation • Using the same strategy as above, our rankings are: • 1. Newbler merged with Mira; • 2. Newbler merged with Amos;
Final Recommendation • 1. Merged Contigs: • (1) Newbler merged with Mira • (2) Newbler merged with Amos • 2. Single Assembler: • (1) Newbler • (2) Mira • (3) Amos
Feedback Strategy • Gene Prediction Group may use predicted genes and RNAs to evaluate our assembly results.
Minimus2 Overview • Minimus2 is a modified minimus pipeline • It is designed to merge one or two sequence sets hereafter referred to as S1 or S2 • Uses Nucmer based Overlap Detector instead of the Smith-Waterman hash overlap Minimus1 uses (much faster)
Minimus2 Usage • minimus2 prefix \ • -D REFCOUNT=n \ # Number of sequences is the first set • -D OVERLAP=n \ # Minimum overlap (Default 40bp) • -D CONSERR=f \ # Maximum consensus error (0..1) (Def 0.06) • -D MINID=n \ # Minimum overlap %id for align. (Def 94) • -D MAXTRIM=n # Maximum sequence trimming length (Def 20bp) • Prefix refers to an .AFG filename
Minimus2 Usage, Cont’d. • REFCOUNT should be set to number of sequences in the first set • “all vs all” alignment is ran by default and sets REFCOUNT to 0 unless user-specified • S1 & S2 should be merged and converted to AMOS format using toAmos command • toAmos –s S1-S2.seq –o S1-S2.afg
Minimus2 Usage, Cont’d. • After you merge the data, you actually run minimus on it • Minimus2 S1-S2 –D REFCOUNT=## • Input • S1-S2.afg • Output • S1-S2.fasta (contig) • S1-S2.singletons.seq (single)
Nucmer Algorithm • A modification of the MUMmer package matching algorithm • Operates via building and then searching a suffix tree data structure • This is a significant upgrade from the minimus1 approach as searching using suffix trees is O(n) and minimus1’s method is O(n2) • Linear time versus polynomial time • MUMmer link : http://mummer.sourceforge.net/
Nucmer Algorithm • The Nucmer strategy uses approximately 17 bytes of memory for each basepair in the reference sequence • The query supplied by the user is streamed past the reference suffix tree so that the memory requirements do not depend on the size of the query sequence • In English: Bigger query does not mean order of magnitude longer operating time • Unique algorithm that can be found and analyzed as it is open source on Sourceforge
MIRA • Based on OLC approach • Strategies: • Preprocessing: high confidence region (HCR) • Use a quick heuristic algorithm to alignment the HCR of reads • Overlaps are reviewed with Smith-Waterman alignment algorithm • Contigs can be be optionally analysed and corrected by an incorporated version of an automatic editor • Repeats are resolved by searching for typical mis-assembly patterns • Optional pre-assembly read extension step: the assembler can try to extend HCRs of reads by analysing the overlap pairs from the previous alignments.
MIRA outputs • d_results: this directory contains all the output files of the assembly in different formats. • d_info: this directory contains information files of the final assembly. • d_log: this directory contains log files and temporary assembly files. • d_chkpt: this directory contains checkpoint files needed to resume assemblies that crashed or were stopped (not implemented yet, but soon)
d_results • out.padded.fasta: this file contains as FASTA sequence the consensus of the contigs that were assembled in the process. Positions in the consensus containing gaps (also called 'pads', denoted by an asterisk) are still present. • out.unpadded.fasta: this file contains as FASTA sequence the consensus of the contigs that were assembled in the process, put positions in the consensus containing gaps were removed. • qual files • Outputs with other formats • caf, ace, gap4d
d_info • info_assembly.txt: some statistics as well as whether or not problematic areas remain in the result. • info_callparameters.txt: This file contains the parameters as given on the mira command line when the assembly was started. • info_contigstats.txt: This file contains in tabular format statistics • info_contigreadlist.txt: This file contains information which reads have been assembled into which contigs. • info_readstooshort: A list containing the names of those reads that have been sorted out of the assembly only due to the fact that they were too short, before any processing started. • error_reads_invalid: A list of sequences that have been found to be invalid due to various reasons (given in the output of the assembler).
Newbler • Another OLC assembler • Starts with `indexing` • Scans the .sff file, trims the reads, • Performs some checks for possible 3’ and 5’ primers. • Finds overlaps between reads • Splits the phase between long reads and short reads. • Alignments proceed using seed and extend. • Simplifies overlap graph and generates consensus contigs. • Uses the quality information for base calling.
Newbler: Metrics file • 454NewblerMetrics.txt • runData • Total number of reads and bases in the file, also the number of reads and bases after trimming. • runMetrics • Number of searches, seeds and overlaps during the alignment phase of assembly • readAlignmentResults • Number of reads and bases aligned to other reads, • Inferred read error – No. of errors, mainly indels, between the contigs and the reads. • consensusDistribution • This section deals with base calling of the consensus contigs. • consensusResults • A summary of the read alignments and assembly statistics
Newbler: Contigs • 454AllContigs.fna (>100 bp default) • 454LargeContigs.fna (>500 bp default) • The fasta header: • Gives the the unique contig number, its length in bp and the number of reads in the alignment used to build this contig • lower case bases correspond to quality values below 40. • 454Contigs.ace • All contigs in .ace format. Useful for visualization using for e.g. Eagleview >contig00001 length=381 numreads=158 >contig00002 length=144 numreads=560 GGGAGAACTCATCTCTTGGCAAGTTTCGTGCTTAGATGCTTTCAGCACTTATCTCTTCCG CACTTAGCTACCCGGCAATGCGTCTGGCGACACAACCGGAACACCAGTGaTGCGTCCACT CCGGTCCTCTCGTACTAGGAGCAG >contig00002 length=144 numreads=560 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 20 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64 64
Newbler: The status files • 454TrimStatus.txt: This file describes what (trimmed) part of the read was considered for alignment. • read Id, trim points used, used trimmed length, raw length • 454ReadStatus.txt: This file describes where reads ended up after assembly was complete. • Id, Status (assembled, partially assembled, singleton, outlier, too short). • 454AlignmentInfo.tsv: This file gives a consensus alignment overview for each position in each contig. • Position, consensus, quality score, depth, signal, std. deviation
Automating assembly • Motivation • Troublesome installation, number of dependencies • Difficult to remember command line parameters • Automation • install.sh : A script to install assemblers and their dependencies • assembler.sh: Script to run assemblers with default arguments.
Future work • Continue to dialog with G.P. to determine assembly of choice. • Metrics: • tRNA count • rRNA count • Protein coding regions