780 likes | 1k Views
Metagenomics or marker gene studies. Jarno Tuimala, PhD RS-koulutus / SPR Veripalvelu. Program. 10.15-11.45 data preprocessing and alignment 11.45-12.30 lunch 12.30-14.00 sequence filtering 14.00-14.15 coffee break 14.15-16.15 sequence classification, statistics
E N D
Metagenomicsormarker gene studies Jarno Tuimala, PhD RS-koulutus / SPR Veripalvelu
Program 10.15-11.45 data preprocessing and alignment 11.45-12.30 lunch 12.30-14.00 sequence filtering 14.00-14.15 coffee break 14.15-16.15 sequence classification, statistics 16.15-16.45 wrap-up, feedback etc. And other breaks when needed
Aims of the course • To learn the essentialanalysissteps for markergenestudies • To learnhow to usemothur software for performing the analyses • Alternatingbetweenlectures and hands-onexercises
Metagenomics • Shotgun sequencing of niche specific samples • Needs assembly (putting the sequence pieces in a correct order and to the correct species) • Sequencing certain markers genes, such as 16S rRNA, RecA, RpoB • Sequence alignment is enough • In fact, calling the latter metagenomics is misleading. Marker gene studies have been performed for at least 20 years.
Sanger Sequencing from Wikipedia
Roche 454 pyrosequencing Review: Medini et al, 2008
16S rRNA sequences from 454 • Primer (adaptor key in the end) + tag + template specificprimer + 16S rRNA sequence • Adaptor contains a biotin tag and is needed during sequencing of the template • Tag allows discrimination between samples (barcoding) • Primer is needed for amplification of the sequence • These are usually removed from the sequences before the actual analysis • Example: TCAGTACTCGGCCTACGGGAGGCAGCAG
Base calling bias (454) http://www.biomedcentral.com/1471-2164/12/245
Files needed on this course • Demodata 1 (SOP) and Demodata 2 (Turku) • Extract to the Desktop • Software • Extract to the Desktop • Copy mothur.exe and uchime.exe to the demo data folder • Seaview • R • Reference sequences • Originally from Silva.bacteria.zip, Silva.gold.bacteria.zip and Trainset9_032012.rdp.zip (from mothur wiki), but repackaged for the course • Extract to the Desktop • After preparing the input data, copy to the data folder • Scripts • Used for the statistical analyses
Demodata 1 • SOP data from mothur • Mice gut microbiome followed after weaning • Data is for one animal only, but ten time points • We are using sequence files in fasta format and corresponding quality info files • If you need to process sff files, please see SOP data example in mothur wiki, and the example on the forth coming slides • You’ll need a lookup file from http://www.mothur.org/wiki/Lookup_files
Demodata 2 • Ruminant data from Turku AMK • Sample numbers indicate the individuals cows, the sample character indicate the primer set. • Altogether 12 samples. • All in SFF format!
Keep a lab book! • Use some text editor (Notepad!) to first type in the commands, and then copy and paste them into mothur. • This creates a file of mothur commands that can then later on be used as batch file for processing the data. • It also documents what you have done. • Better to adapt this type of a habit sooner than later.
Work flow • Prepare input data • Trim sequences • Unique sequences • Align sequences • Screen sequences • Filter sequences • Pre-cluster • Remove chimeric sequences • Classify sequences • Analyses...
Work flow • Prepare input data • Trim sequences • Unique sequences • Align sequences • Screen sequences • Filter sequences • Pre-cluster • Remove chimeric sequences • Classify sequences • Analyses...
File formats • FastA >ERR051700.1 FXCMJDV02HWG0G length=10 TCAGTACTCG >ERR051700.2 FXCMJDV02GE0HC length=10 TCAGTACTCG • Qual >ERR051700.1 FXCMJDV02HWG0G length=10 37 37 37 37 34 28 29 29 29 26 >ERR051700.2 FXCMJDV02GE0HC length=10 37 37 37 37 40 40 40 40 40 40
Preparing the data, option 1 • Open the command shell (cmd.exe), and change to the directory (cd) containing the data files • Append all the fasta and all the qual files together • copy /b *.fasta all.fasta • copy /b *.qual all.qual
Preparing the data, option 2 • If the data is in sff format, fasta and qual files can be extracted from it with the mothur command sffinfo(): • sffinfo(sff=454Reads.RL1.sff) • And then continue with the fasta and qual files as laid out on the previous slide
Exercise A • Extract demo data to Desktop (or some other folder where you have write rights) • Prepare the data for further analysis by appending the fasta and qual files into a single fasta file and a single qual file • Once done, extract mothur to the same place, and copy mothur.exe and uchime.exe from the subfolder mothur to the demo data folder
Mothur • Mothur is one of the tools meant for analysis of marker gene studies • Developed by Pat Schloss’ research group at University of Michigan • Freely available command line program • http://www.mothur.org/ • http://www.mothur.org/wiki/Mothur_manual • http://www.mothur.org/wiki/Analysis_examples
Using mothur • Installation, several possibilities • Copy to a certain folder, add to Windows’ path variable, invoke from command line (cmd.exe) • Copy to the folder where the data resides, and start by double clicking on it • Once mothur starts, it takes you to its own environment. • mothur >
Mothur commands • See the manual on the web for commands. • http://www.mothur.org/wiki • Alternatively type help() to mothur prompt to list all commands. • Help on a particular command can be invoked by, e.g., summary.seqs(help). • Command are always associated with brackets, and the command arguments are given inside the brackets. • summary.seqs(fasta=all.fasta)
Mothur data • Mothur does all the operations in the memory, but saves the data files directly on the hard drive to the working directory. • It also keeps a logfile that records everything that is written on the screen (mothur console), and sometimes a few extra things. • After you’ve run some command, please follow the screen output closely, and check the files that were created to the working directory.
Work flow • Prepare input data • Trim sequences • Unique sequences • Align sequences • Screen sequences • Filter sequences • Pre-cluster • Remove chimeric sequences • Classify sequences • Analyses...
Trim sequences 1 • Input • Fasta file (prepared at prepare data step) • Quality file (prepared at prepare data step) • Oligo file (Will be prepared now) • Output • Fasta file of good sequences (all.trim.fasta) • Fasta file of scrap sequences (all.scrap.fasta) • Groups file (indicates into which group/sample the sequences belong to) (all.groups)
Trim sequences 2 • Oligo file • Tab-delimited • Lines can be commented out with a # • “The oligos option takes a file that can contain the sequences of the forward and reverse primers and barcodes and their sample identifier. Each line of the oligos file can start with the key words "forward", "reverse", "barcode", "linker" and "spacer" or it can start with a "#" to tell mothur to ignore that line of the oligos file.” [Mothur wiki] • Example: forward CCTACGGGAGGCAGCAG #reverse GTATTACCGCGGCTGCTG barcode TCAGTAGCGCG ERR051699 barcode TCAGTACTCGG ERR051702 barcode TCAGTAGCGCC ERR051705
Trim sequences 3 • Options • fasta fasta file • oligos oligo file • qfile quality file • qaverage average sequence quality • qwindowaverage average quality of window • qwindowsize window width • flip reverse complement seqs • maxambig max. of ambig. bases • maxhomop max. lenght of homopolymer • bdiffs max. diffs in barcode • pdiffs max. diffs in primers • ldiffs max. diffs in linkers • sdiffs max. diffs in spacers • tdiffs max. diffs in total
Trim sequences 4 T C A G T A C T C G 40 40 40 40 40 40 40 40 37 35 • qaverage = 40 -> delete the sequence T C A G T A C T C G 40 40 40 40 40 40 40 40 37 35 • qwindowaverage = 40, qwindowsize = 3
Trim sequences 4 T C A G T A C T C G 40 40 40 40 40 40 40 40 37 35 • qaverage = 40 -> delete the sequence T C A G T A C T C G 40 40 40 40 40 40 40 40 37 35 • qwindowaverage = 40, qwindowsize = 3
Trim sequences 4 T C A G T A C T C G 40 40 40 40 40 40 40 40 37 35 • qaverage = 40 -> delete the sequence T C A G T A C T C G 40 40 40 40 40 40 40 40 37 35 • qwindowaverage = 40, qwindowsize = 3
Trim sequences 4 T C A G T A C T C G 40 40 40 40 40 40 40 40 37 35 • qaverage = 40 -> delete the sequence T C A G T A C T C G 40 40 40 40 40 40 40 40 37 35 • qwindowaverage = 40, qwindowsize = 3
Trim sequences 4 T C A G T A C T C G 40 40 40 40 40 40 40 40 37 35 • qaverage = 40 -> delete the sequence T C A G T A C T C G 40 40 40 40 40 40 40 40 37 35 • qwindowaverage = 40, qwindowsize = 3
Trim sequences 4 T C A G T A C T C G 40 40 40 40 40 40 40 40 37 35 • qaverage = 40 -> delete the sequence T C A G T A C T C G 40 40 40 40 40 40 40 40 37 35 • qwindowaverage = 40, qwindowsize = 3
Trim sequences 4 T C A G T A C T C G 40 40 40 40 40 40 40 40 37 35 • qaverage = 40 -> delete the sequence T C A G T A C T C G 40 40 40 40 40 40 40 40 37 35 • qwindowaverage = 40, qwindowsize = 3
Trim sequences 4 T C A G T A C T C G 40 40 40 40 40 40 40 40 37 35 • qaverage = 40 -> delete the sequence T C A G T A C T C G 40 40 40 40 40 40 40 40 37 35 • qwindowaverage = 40, qwindowsize = 3
Trim sequences 4 T C A G T A C T C G 40 40 40 40 40 40 40 40 37 35 • qaverage = 40 -> T C A G T A C T C G 40 40 40 40 40 40 40 40 37 35 • qwindowaverage = 40, qwindowsize = 3
Trim sequences 4 • flip=T • TCAGTACTCG -> CGAGTACTGA • AGTCATGAGC • maxambiq = 1 • TCAGTACNCG • maxhomp = 3 • TCAGTTTTCG
Common problems • The sequences are not in the same order in the fasta and in the qual files. • But, mothur assumes that they are, and gives an error if this is not the case. • Solutions: • Do not use quality file when trimming the sequences • Sort the sequences and the quality files • Sometimes not all sequences in the fasta file have a match in the qual file, then you don’t have any other option but to not to use the quality files during the trimming
Trimsequences 5 • Examplecommands > summary.seqs(fasta=all.fasta) > trim.seqs(fasta=all.fasta, qfile=all.qual, oligos=oligo.tsv, pdiffs=2, bdiffs=1, ldiffs=1, sdiffs=1, qaverage=25, flip=T) > trim.seqs(fasta=all.fasta, oligos=GQY1XT001.oligos, pdiffs=2, bdiffs=1, ldiffs=1, sdiffs=1, maxambig=0, maxhomop=8, flip=T) > get.current() > summary.seqs(fasta=all.trim.fasta)
Exercise B • Runmothurbydoubleclicking on it. Itshould open in a new DOS window. • Trim the sequences of demodata using a) an average quality score of 25 for each sequence OR b) a sliding window method with a windows quality score of 25 OR c) removal of oligo sequences and too ambiguous or polymeric (this is handy if other options fail). • Runsummary.seqs() aftereachtrimmingto checkhowmanysequenceswereretained in the data. • Whichmethodwouldyoupick for furtherprocessingsteps, and why?
Work flow • Prepare input data • Trim sequences • Unique sequences • Align sequences • Screen sequences • Filter sequences • Pre-cluster • Remove chimeric sequences • Classify sequences • Analyses...
Removenon-uniqueseqs • Why? • Sequence amplification can sometimes produces large amounts of identical sequences, even if there are really not that many organisms present that have that particular sequence -> aftefacts. unique.seqs(fasta=all.trim.fasta) get.current() summary.seqs(fasta=current) • Output files: • Sequences (all.trim.unique.fasta) • List of sequence groups (all.trim.names)
Work flow • Prepare input data • Trim sequences • Unique sequences • Align sequences • Screen sequences • Filter sequences • Pre-cluster • Remove chimeric sequences • Classify sequences • Analyses...
Sequencealignment • Idea is to lineup the sequencessothat the similarity (score) of sequences is maximized (description of what the computerdoes) • Example of a pairwiseglobalalignment (an alignment of twosequencesalongtheirwholelenghts) tgagttgaact tgagt-gagc- • Minus (-) signsarecalledgaps • Gapscanbemodelledusinggapopening and extensionpenalties, e.g., match=5, mismatch=0, opening=-2, extension=-1. Total score for the alignmentabove is 31, and computertries to maximizethisalignmentscore.
Alignment in mothur • Sequencesarealigned, onebyone, againsta set of referencesequences. • Thisusuallymakes the 16 S rRNAalignmentbetterthanotheroptions (multiplealignmentoralignment of severalsequences), butthisdoesnotnecessarilyhold for allpossiblegenes! • By default, mothurperforms a globalpairwisealignment (Needleman-Wunchalgorithm) wheregapopening and extensionarepenalizedequally. • See, e.g., http://koti.mbnet.fi/tuimala/oppaat/bioinfo-laaja.pdf for more info on alignments.
Alignment in mothur • Examplecommands align.seqs(fasta=all.trim.unique.fasta, reference=silva.bacteria.fasta) summary.seqs(fasta=current, name=current) • The reference set silva.bacteria.fastaorsomeotherofferedby, e.g., mothursite, is obligatory! • After the alignment is ready, it can be checked in some alignment software, such as Seaview. • Output files: • aligned sequences (all.trim.unique.align) • alignment report listing the files and the location they were aligned against (all.trim.unique.align.report)
Exercise C • Remove the non-uniquesequences • Copy the referencesequencefilesilva.bacteria.fasta to the data folder • Align the uniquesequencesagainst the referensesequence set • Open the resultingalignmentfile in the Seavieweditor, and check the resultsvisually
Work flow • Prepare input data • Trim sequences • Unique sequences • Align sequences • Screen sequences • Filter sequences • Pre-cluster • Remove chimeric sequences • Classify sequences • Analyses...
Filteringalignedsequences • Afteraligning the sequences, it is a good idea to make the alignmentmoreneat tgagttgaact(reference) ..agt-gag.. ..agg-gaa.. • Removeallgaps-onlycolumns (-) • Removeallcolumnscontaing a specificcharacter (.). align.seqs() willprecedeany position before the sequencestart with dots. But a single mis-alignedsequencecanlead to the wholealignmentbeingdeleted!