Overlap Layout Consensus Assembly S.O.P.
Reference Assembly
• Align reads to a reference sequence
• ???
• PROFIT!!!!!
Reference Assembly by Newbler
from The Genome Sequencer Data Analysis Software Manual, p.147
• For each read, search for a suitable alignment, or alignments, of the read to the reference sequence(s) (a read may align to multiple positions in the reference sequence); this is done in "nucleotide" space
• Construct contigs and compute a consensus basecall sequence from the signals of the aligned reads (performed in "flowspace")
• Identify the positions in the aligned reads (consensus) that differ from the reference sequence(s); alternatively, identify subsets of the aligned reads that are identical within each subset but differ between subsets (these are the "putative differences")
• Evaluate the list of putative differences to identify High-Confidence differences
• Output the following information:
  • contig consensus sequence(s) and associated quality values;
  • alignments of the reads and contigs to the reference, position-by-position metrics of the depth and consensus accuracy (quality values) for each position in the aligned reference;
  • and the positions and alignments of identified differences
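The consensus and "putative difference" steps above can be illustrated with a small sketch. This is not Newbler's method: Newbler computes its consensus in flowspace from signal values, whereas this sketch takes a simple majority vote in nucleotide space and assumes the reads are already placed on the reference without gaps. The function name and the `(start, sequence)` read format are hypothetical, chosen only for illustration.

```python
from collections import Counter


def consensus_and_differences(reference, aligned_reads):
    """Majority-vote consensus plus the positions where it differs from the reference.

    aligned_reads is a list of (start, sequence) pairs; each read is assumed
    to be placed on the reference without gaps (a simplifying assumption).
    """
    # One base counter per reference position (the "pileup").
    columns = [Counter() for _ in reference]
    for start, seq in aligned_reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < len(reference):
                columns[pos][base] += 1

    consensus = []
    differences = []
    for pos, (ref_base, counts) in enumerate(zip(reference, columns)):
        if not counts:
            # No read covers this position.
            consensus.append("N")
            continue
        base, support = counts.most_common(1)[0]
        consensus.append(base)
        if base != ref_base:
            depth = sum(counts.values())
            differences.append((pos, ref_base, base, support, depth))
    return "".join(consensus), differences


reads = [(0, "ACGA"), (2, "GAAC"), (3, "AACGT")]
consensus, differences = consensus_and_differences("ACGTACGT", reads)
print(consensus)
print(differences)
```

Each reported difference carries the supporting read count and the depth at that position, which is the kind of evidence a later filtering step could use to separate high-confidence differences from noise.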
Reference Assembly by AMOScmp
• AMOS Is Not An Assembler
• AMOScmp uses NUCmer to align reads to a reference sequence
#!/usr/local/bin/amos-2.0.4/bin/runAmos -C
# `AMOScmp' - The AMOS Comparative Assembler Pipeline

#--------------------------------------- USER DEFINED VALUES ------------------#
TGT = $(PREFIX).afg
REF = $(PREFIX).1con
#------------------------------------------------------------------------------#

BINDIR=/usr/local/bin/amos-2.0.4/bin
NUCMER=/usr/local/bin/MUMmer3.21/nucmer

SEQS = $(PREFIX).seq
BANK = $(PREFIX).bnk
ALIGN = $(PREFIX).delta
LAYOUT = $(PREFIX).layout
CONFLICT = $(PREFIX).conflict
CONTIG = $(PREFIX).contig
FASTA = $(PREFIX).fasta

INPUTS = $(TGT) $(REF)
OUTPUTS = $(CONTIG) $(FASTA)

## Building AMOS bank
10: $(BINDIR)/bank-transact -c -z -b $(BANK) -m $(TGT)

## Collecting clear range sequences
20: $(BINDIR)/dumpreads $(BANK) > $(SEQS)

## Running nucmer
30: $(NUCMER) --maxmatch --prefix=$(PREFIX) $(REF) $(SEQS)

## Running layout
40: $(BINDIR)/casm-layout -U $(LAYOUT) -C $(CONFLICT) -b $(BANK) $(ALIGN)

## Running consensus
50: $(BINDIR)/make-consensus -B -b $(BANK)

## Outputting contigs
60: $(BINDIR)/bank2contig $(BANK) > $(CONTIG)

## Outputting fasta
70: $(BINDIR)/bank2fasta -b $(BANK) > $(FASTA)

The AMOScmp pipeline script
NUCmer
• MUM: maximal unique matches
• A MUM is a subsequence that occurs in two exactly matching copies, once in each input sequence, and that cannot be extended in either direction
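To make the MUM definition concrete, here is a brute-force sketch that checks the three conditions directly: the match is exact, it cannot be extended to the left or right, and it occurs exactly once in each sequence. MUMmer itself uses a far more efficient index structure; this quadratic version is only for illustration, and the function names and the `min_len` parameter are hypothetical.

```python
def count_occurrences(text, pattern):
    """Count occurrences of pattern in text, including overlapping ones."""
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def find_mums(a, b, min_len=3):
    """Return (pos_in_a, pos_in_b, length) for every MUM of at least min_len."""
    mums = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            # Skip matches that could still be extended to the left.
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            # Extend to the right for as long as the sequences agree.
            length = 0
            while (i + length < len(a) and j + length < len(b)
                   and a[i + length] == b[j + length]):
                length += 1
            if length < min_len:
                continue
            match = a[i:i + length]
            # Keep the match only if it is unique in both sequences.
            if count_occurrences(a, match) == 1 and count_occurrences(b, match) == 1:
                mums.append((i, j, length))
    return mums


print(find_mums("GATTACA", "CATTAGA"))  # → [(1, 1, 4)]
print(find_mums("ACGACG", "ACG"))       # → []
```

In the first call the shared substring "ATTA" is reported once. In the second, "ACG" matches exactly but occurs twice in the first sequence, so it is not unique and nothing is reported.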
NUCmer alignment procedure
• Create a map of all contig positions within each of the multi-fasta files.
• Concatenate the sequences within each of the two files separately.
• Run MUMmer to find all exact matches between the two genomes.
• Map the resulting matches back to the separate contigs.
• Run a clustering algorithm for all the MUMs along each contig. MUMs are clustered together if they are separated by no more than a user-specified distance.
• Run a modified Smith-Waterman dynamic programming alignment algorithm to align the sequences between the MUMs. To avoid excessive computation in this step, the algorithm permits only limited mismatches in the gaps between MUMs. The exact amount of mismatch is specified by the user.
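The clustering step above can be sketched as follows. This is a simplified illustration, not NUCmer's actual clustering code: it assumes forward-strand, non-overlapping matches given as `(ref_pos, qry_pos, length)` tuples, and it starts a new cluster whenever the gap to the previous match exceeds `max_gap` in either sequence. The function name and the tuple layout are hypothetical.

```python
def cluster_mums(mums, max_gap):
    """Group (ref_pos, qry_pos, length) matches separated by at most max_gap.

    Simplifying assumptions: matches are on the forward strand and do not
    overlap; a match joins the current cluster only if its gap to the
    previous match is between 0 and max_gap in both sequences.
    """
    clusters = []
    for mum in sorted(mums):  # sort by reference position
        if clusters:
            prev_ref, prev_qry, prev_len = clusters[-1][-1]
            ref_gap = mum[0] - (prev_ref + prev_len)
            qry_gap = mum[1] - (prev_qry + prev_len)
            if 0 <= ref_gap <= max_gap and 0 <= qry_gap <= max_gap:
                clusters[-1].append(mum)
                continue
        clusters.append([mum])
    return clusters


matches = [(0, 0, 10), (15, 14, 8), (200, 300, 12), (215, 316, 9)]
for cluster in cluster_mums(matches, max_gap=10):
    print(cluster)
```

With `max_gap=10`, the first two matches form one cluster and the last two form another. The regions between consecutive matches inside a cluster are the gaps that the final dynamic programming step would then align.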