720 likes | 1.28k Views
Meta’omic Analysis with MetaPhlAn, HUMAnN, and LEfSe. Curtis Huttenhower 08- 08-13. Harvard School of Public Health Department of Biostatistics. Some setup notes. Slides with green titles or text include instructions not needed today, but useful for your own analyses
E N D
Meta’omic Analysis withMetaPhlAn, HUMAnN, and LEfSe Curtis Huttenhower 08-08-13 Harvard School of Public Health Department of Biostatistics
Some setup notes • Slides with green titles or text include instructions not needed today, but useful for your own analyses • Keep an eye out for red warnings of particular importance • Command lines and program/file names appear in a monospaced font.
Getting some HMP data Click “Get Data” • Go to http://hmpdacc.org
Getting some HMP data • Check out what’s available Click “HMIWGS”
Getting some HMP data • Check out what’s available Click on your favorite body site
Getting some HMP data Don’t click on anything! • Check out what’s available
Getting some (prepped) HMP data • Connect to the server instead and run: • for S in `ls /class/stamps-shared/chuttenh/7*.fasta`;do ln -s $S; done • These are subsamples of six HMP files: • SRS014459.tar.bz2 763577454-SRS014459-Stool.fasta • SRS014464.tar.bz2 763577454-SRS014464-Anterior_nares.fasta • SRS014470.tar.bz2 763577454-SRS014470-Tongue_dorsum.fasta • SRS014472.tar.bz2 763577454-SRS014472-Buccal_mucosa.fasta • SRS014476.tar.bz2 763577454-SRS014476-Supragingival_plaque.fasta • SRS014494.tar.bz2 763577454-SRS014494-Posterior_fornix.fasta • All six shotgunned body sites from • One subject, first visit • Subsampled to 20,000 reads
Who’s there: MetaPhlAn X is a core gene for clade Y X is a unique marker gene for clade Y Gene X
Who’s there: MetaPhlAn • Go to http://huttenhower.sph.harvard.edu/metaphlan Scroll down
Who’s there: MetaPhlAn • You could download MetaPhlAn by clicking here
Who’s there: MetaPhlAn • But don’t! Instead, we’ve downloaded MetaPhlAn already for you by clicking here
Who’s there: MetaPhlAn • And saving just the file metaphlan.py in stamps-software/bin
Who’s there: MetaPhlAn • You could download the bowtie2 database here
From the command line... • But don’t! Instead, get it by: • for S in `ls /class/stamps-shared/chuttenh/*.bt2`; do ln -s $S; done • To see what you can do, run: • module load stamps • metaphlan.py-h| less • Use the arrow keys to move up and down,q to quit back to the prompt
Who’s there: MetaPhlAn • For future reference: these extra options aren’t necessary if you download the whole “default” MetaPhlAn package • metaphlan.pyyour_input.fasta > your_output.txt • To launch your first analysis, run: • metaphlan.py--bowtie2db mpa 763577454-SRS014459-Stool.fasta > 763577454-SRS014459-Stool.txt
Who’s there: MetaPhlAn • What did you just do? • Two new output files: • 763577454-SRS014459-Stool.fasta.bowtie2out.txt • Contains a mapping of reads to MetaPhlAn markers • 763577454-SRS014459-Stool.txt • Containstaxonomicabundances as percentages
Who’s there: MetaPhlAn • less 763577454-SRS014459-Stool.fasta.bowtie2out.txt
Who’s there: MetaPhlAn • less 763577454-SRS014459-Stool.txt
Who’s there: MetaPhlAn • Now finish the job: • metaphlan.py--bowtie2db mpa763577454-SRS014464-Anterior_nares.fasta > 763577454-SRS014464-Anterior_nares.txt • ... • Notethatyoucanuse the up arrowkeyto make your life easier!
Who’s there: MetaPhlAn • Let’s make a single table containing all six samples: • mkdirtmp • mv *.bowtie2out.txt tmp • merge_tables.py-l -d *.txt |zero.py> 763577454.tsv • Youcan look at this file usingless • Note 1: The argumentsless -x4 -Swill help • Note 2: Youcan set this “permanently” usingexport LESS="-x4 -S"
Who’s there: MetaPhlAn • But it’s easier using MeV; go to http://www.tm4.org/mev.html Click “Download”
An interlude: MeV • Don’t forget to transfer your 763577454.tsv file locally for viewing using scp • Unzip, launch MeV, and select File/Load data
An interlude: MeV • Click “Browse” to your TSV file, then • Tell MeV it’s a two-color array • Uncheck “Load annotation” • Click on the upper-leftmost data value
An interlude: MeV • “Load” your data, then make is visible by: • Display/Set Color Scale Limits • Choose Single Gradient, min 0, max 10
An interlude: MeV • Finally, to play around a bit: • Display/Set Element Size/whatever you’d like • Clustering/Hierarchical Clustering • Optimize both gene and sample order • And select Manhattan Distance (imperfect!)
An interlude: MeV • If you’d like, you can • Display/Sample-Column Labels/Abbr. Names
An interlude: MeV • MeV is a tool; imperfect, but convenient • You should likely include just “leaf” nodes • Species, whose names start include “s__” • You can filter your file using:cat 763577454.tsv | grep -E '(Stool)|(s__)'> 763577454_species.tsv • You can, but might not want to, z-score normalize • Adjust Data/Gene-Row Adjustments/Normalize Genes-Rows • Many other tools built in – experiment!
What they’re doing: HUMAnN • Back to the task at hand; you could download HUMAnN at: http://huttenhower.sph.harvard.edu/humann Click here
What they’re doing: HUMAnN • ...but instead we’ve already downloaded it • Expand HUMAnN (no install!) • tar -xzf/class/stamps-shared/chuttenh/sources/humann-0.98.tar.gz • Set up a link to the KEGG reference DB: • ln-s /class/stamps-shared/chuttenh/kegg.reduced.udb • Andalthoughyouwouldnormally download USEARCH fromhere: • http://www.drive5.com/usearch/download.html We’regoingtouseitpreinstalledinstead
What they’re doing: HUMAnN • If we weren’t all running this, you’d need to: • Get KEGG – used to be free, now it’s not! • Fortunately, we have a HUMAnN-compatible distributable version; contact me... • Index it for USEARCH: • usearch6 -makeudb_usearchkegg.reduced.fasta-output kegg.reduced.udb • This takes a minute or two, sowe’veprecomputedit; thus, forgeahead...
What they’re doing: HUMAnN • Did you notice that we didn’t QC our data at all? • MetaPhlAn is very robust to junk sequence • HUMAnN is pretty robust, but not quite as much • We’ve already run a standard metagenomic QC: • Quality trim by removing bad bases (typically Q ~15) • Length filter to remove short sequences (typically <75%)
What they’re doing: HUMAnN • Must start from FASTQ files to do this • Quality trim by removing bad bases: • TrimBWAstyle.py< your_data.fastq > your_trimmed_data.fastq • Length filter by removing short sequences: • 75% of original length is standard (thus 75nt from 100nt reads) • remove_bad_seqs.py 75 < your_trimmed_data.fastq > your_filtered_data.fastq • Now convert your FASTQ to a FASTA: • fastq2fasta.py < your_filtered_data.fastq> your_filtered_data.fasta • Some final caveats: • If you’re using paired end reads, match filters! • See my course homeworks at http://huttenhower.sph.harvard.edu/bio508 • Aren’t you glad you’re not doing this today?
What they’re doing: HUMAnN • Enter the humann directory • cd humann-0.98 • Run your first translated BLAST search: • usearch6.0.192_i86linux32-usearch_local ../763577454-SRS014459-Stool.fasta -db../kegg.reduced.udb-id 0.8 -blast6out input/763577454-SRS014459-Stool.txt • What did you just do? • less input/763577454-SRS014459-Stool.txt • Recall BLAST’s tab-delimited output headers: • qseqidsseqidpident length mismatch gapopenqstartqendsstart send evaluebitscore • Rinse andrepeatfor the remaining samples
What they’re doing: HUMAnN • Normally you’d need to install SCons from: • http://www.scons.org • Instead, we’ll use it preinstalled as well, so... GO! • /class/stamps-software/scons/bin/scons • Youshouldsee a bunch of textscrollby • Note: youcan run scons -j8toparallelizetasks
What they’re doing: HUMAnN • After a minute or two, you should see:
What they’re doing: HUMAnN • This has created four main files: • Two each for pathways (big) and modules (small) • Two each for coverage and relative abundance • Each is tab-delimitedtextwithone column per sample • Allfour are in the output directory: • output/04a-hit-keg-mpt-cop-nul-nve-nve-xpe.txt • Coverage (a) of pathways (t) • output/04a-hit-keg-mpm-cop-nul-nve-nve-xpe.txt • Coverage (a) of modules (m) • output/04b-hit-keg-mpt-cop-nul-nve-nve.txt • Abundance (b) of pathways (t) • output/04b-hit-keg-mpm-cop-nul-nve-nve.txt • Abundance (b) of modules (m) • I almostalwaysjustuse 04b-mpm (module abundances)
What they’re doing: HUMAnN • Let’s take a look: • less output/04b-hit-keg-mpm-cop-nul-nve-nve.txt
What they’re doing: HUMAnN • That’s ugly; it gets much better in Excel • Note: this is very sparse since we’re using a small subset of KEGG • Note: the mock community demo data is included on the right
What they’re doing: HUMAnN • And there’s nothing stopping us from using MeV • Or R, or QIIME, or anything that’ll read tab-delimited text
What matters: LEfSe • All of these analyses give you tables • 16S OTUs • MetaPhlAn Species • HUMAnN Modules (or pathways or genes) • If you know something about your samples... • Case/control • Different habitats • Different time points • How can you identify features that change?
What matters: LEfSe • Let’s get all of the HMP species data:http://hmpdacc.org/resources/data_browser.php Click “HMSMCP”
What matters: LEfSe • Download the MetaPhlAn table for all 700 samples Right click this irritatingly tiny icon
Downloading from the command line • Instead of saving this, download it by: • Right-click to copy the URL • Run wget <paste URL here> • Note: curl –O <URL> works just as well
What matters: LEfSe • Make sure this file is in your home directory, and expand it: • bunzip2 HMP.ab.txt.bz2 • Look at the result • less HMP.ab.txt • IMPORTANT!!! • This file’s too big to analyze directly today • ln -s /class/stamps-shared/chuttenh/HMP.ab.filtered.txt • This is great – tons of data, but no metadata • HUMAnN to the rescue • From the humann-0.98 directory:python src/metadata.py input/hmp_metadata.dat < ../HMP.ab.filtered.txt> ../HMP.ab.filtered.metadata.tsv • NOW take a look again
What matters: LEfSe • Let’s modify this file to be LEfSe-compatible • Open it up in Excel
What matters: LEfSe • Delete all of the metadata rows except: • RANDSID and STSite • Save it as tab-delimited text: HMP.ab.filtered.metadata.txt
What matters: LEfSe • Visit LEfSe at: http://huttenhower.sph.harvard.edu/lefse Click here
What matters: LEfSe • Then upload your formatted table • After you upload, wait for the progress meter to turn green! 2. Then here 1. Click here 4. Then watch here 3. Then here
What matters: LEfSe • Then tell LEfSe about your metadata: 1. Click here 2. Then select STSite 3. Then select RANDSID 4. Then here