410 likes | 426 Views
Explore the latest in RNA-Seq software, tools, and workflows for differential gene expression, splice variant identification, and more. Learn about experimental design considerations, strand-specific RNA-Seq, and differential gene expression workflows.
E N D
RNA-Seq Software, Tools, and Workflows Monica Britton, Ph.D. Sr. Bioinformatics Analyst September 1, 2016
Some mRNA-Seq Applications • Differential gene expression analysis • Transcriptional profiling • Identification of splice variants • Novel gene identification • Transcriptome assembly • SNP finding • RNA editing Assumption: Changes in transcription/mRNA levels correlate with phenotype (protein expression)
Experimental Design • What biological question am I trying to answer? • What types of samples (tissue, timepoints, etc.)? • How much sequence do I need? • Length of read? • Platform? • Single-end or paired-end? • Barcoding? • Pooling? • Biological replicates: how many? • Technical replicates: how many? • Protocol considerations?
Strand-Specific (Directional) RNA-Seq • Now the default for Illumina TruSeq kits • Preserves orientation of RNA after reverse transcription to cDNA • Informs alignments to genome • Determine which genomic DNA strand is transcribed • Identify anti-sense transcription (e.g., lncRNAs) • Quantify expression levels more precisely • Demarcate coding sequences in microbes with overlapping genes • Very useful in transcriptome assemblies • Allows precise construction of sense and anti-sense transcripts 5’-- --3’ 5’-- --3’
Influence of the Organism • Novel– little/no previous sequencing (may need assembly) • Non-Model • some sequence available (draft genome or transcriptome assembly) • Thousands of scaffolds, maybe tens of chromosomes • Some annotation (ab initio, EST-based, etc.) • Model – genome fully sequenced and annotated • Multiple genomes available for comparison • Well-annotated transcriptome based on experimental evidence • Genetic maps with markers available • Basic research can be conducted to verify annotations (mutants available)
Read QC and Trimming Stringency of trimming depends on read length and downstream processing. Alignment can often tolerate some amount of contamination. Various sizes of trimmed reads can introduce biases in alignments. Transcriptome assembly works best with aggressive trimming. Scythe Sickle
QC and Alignment You may want to align your reads to a few different datasets, including • rRNA of your organism (or related organisms) • Contaminant screening, if desirable • Genome or transcriptome to generate counts for differential expression analysis First we’ll go through alignment to a gene set/transcriptome, then we’ll look at aligning to a genome…
Choosing an Aligner for a Gene Set Reference • Examples of gene sets: • rRNA sequences • potential contaminant sequences (such as PhiX), • a transcriptome (eukaryotic or bacterial), where each sequence in the fastarepresents a different transcript (mRNA) • De novo transcriptome assembly • A splicing-aware aligner is not needed • bwaaln(<76 bp), bwa mem (>75 bp), or bowtie2 • Requiring strand-specific orientation can be more difficult (requires filtering the bam file).
Library QC – rRNA Contamination • Total RNA contains >95% rRNA • Poly A selected libraries should contain <5% rRNA • rRNA-depleted libraries should contain <15% rRNA • Align reads to rRNA sequences from organism or relatives • Higher than expected or inconsistent rRNA content may indicate degraded RNA or problems in library prep. • Generally, don’t need to remove rRNA reads rRNA rRNA mRNA rRNA mRNA mRNA Total RNA sample After PolyA isolation After rRNA depletion
Insert Size (determined from gene set alignments) The “insert” is the cDNA (or RNA) ligated between the adapters. (The SAM spec calls this the “template”.) Typical insert size is 160-200 bases, but can be larger. Insert size distribution depends on library prep method. Adapter ~60 bases cDNA 100-800 bases =Insert Size Adapter ~60 bases Overlapping paired-end reads Gapped paired-end reads Single-end read 0 20 40 60 80 100 120 140 160 180 200
Paired-End Reads Read 1 Read 2 0 0 20 20 40 40 60 60 80 80 100 100 120 120 140 140 160 160 180 180 200 200 INSERT SIZE (TLEN) INNER DISTANCE (+ or -) Read 1 Read 2
Generating Raw Counts from Gene Set Alignments The sam/bam file can be filtered for just those pairs where both reads align in concordantly (proper pairs) to one transcript. sam2countsor samtoolsidxstatscan be used for counting. Counts per transcript need to be added up to counts per gene before running stats software
Alignment to a Genome - Choosing a Reference • Human/mouse: GENCODE (uses Ensembl IDs) (http://www.gencodegenes.org/), but may need some manipulation to work with certain software • Ensembl genomes (http://ensemblgenomes.org/) and Biomart(http://www.ensembl.org/biomart/martview/) • Illumina igenomes (http://support.illumina.com/sequencing/ sequencing_software/igenome.html) provides indexes for some software, and files with extra info for Tophat/cufflinks. • NCBI genomes (http://www.ncbi.nlm.nih.gov/genome/) • Many specialized databases (Phytozome, Patric, VectorBase, FlyBase, WormBase) • “Do it yourself” genome assembly and gene-finding (don’t forget functional annotation)
Choosing an Aligner for a Genome Reference Eukaryotic genes are separated into exons and introns, so if you are using a genome reference, you need an aligner that is either “splicing-aware” or won’t choke on spliced RNA reads. A splicing-aware aligner will recognize the difference between a short insert and a read that aligns across exon-intron boundaries Spliced Reads (joined by dashed line) Read Pairs (joined by solid line)
(Some) Splicing-Aware or Splicing-Agnostic Aligners • Tophat2 (bowtie2) – what we’ll be using in these exercises. (remember, igenomes); also integrates with cufflinks to find novel genes and transcripts • HISAT/HISAT2 – successor to tophat2, and also works for alignment of populations (i.e., multiple human samples). I haven’t tried this yet. It’s on our Galaxy … try it out and report what happens. • GSNAP – I haven’t used this; can handle (and produce) non-standard file formats. • STAR – this is what I use. It’s very fast and doesn’t need a separate counting step. But it does need a lot of memory (>128G RAM to index a mammalian genome). • bwa mem – a local aligner, is not splicing-aware, but can handle local alignments (doesn’t require the entire read to align).
Generating Raw Counts from Genome Alignments Reads aligning to a chromosome or scaffold must be parsed to assign them to a gene. We’ll need a “roadmap” to indicate where each gene (exon) is located in the genome
Generating Raw Counts from Genome Alignments This roadmap is the GTF (Gene Transfer Format) file: Let’s look at the GTF fields in more detail … The left columns list source, feature type, and genomic coordinates The right column includes attributes, including gene ID, etc.
Fields in the GTF File (left to right) Sequence Name (i.e., chromosome, scaffold, etc.) chr12 Source (program that generated the gtf file or feature) unknown Feature (i.e., gene, exon, CDS, start codon, stop codon) CDS Start (starting location on sequence) 3677872 End (end position on sequence) 3678014 Score. Strand (+ or -) + Frame (0, 1, or 2: which is first base in codon, zero-based) 2 Attribute (“;”-delimited list of tags with additional info) gene_id "PRMT8"; gene_name "PRMT8"; p_id "P10933"; transcript_id"NM_019854"; tss_id "TSS4368";
Generating Gene Counts from Genome Alignments • HTSeq-count – runs rather slowly; doesn’t work well on coordinate-sorted bam files. • FeatureCounts – Runs faster, and will work on coordinate-sorted bam files. Has more parameters than HTSeq-count. • STAR – Will generate counts equivalent to HTSeq-count with default parameters, in the same run as alignment. • All of these can utilize strand-specific information in GTF file to assign reads from strand-specific libraries. • All will also report reads that can’t be assigned unambiguously to a gene feature.
Multiple-Mapping Reads (“Multireads”) • Some reads will align to more than one place in the reference, due to: • Common domains, gene families • Paralogs, pseudogenes, polyploidy etc. • This can distort counts, leading to misleading expression levels • If a read can’t be uniquely mapped, how should it be counted? • Should it be ignored (not counted at all)? • Should it be randomly assigned to one location among all the locations to which it aligns equally well? • This may depend on the question you’re asking… • …and also depends on the alignment and counting software you use.
Reads That Align to Gene Features This read unambiguously aligns to Gene A This read also uniquely aligns to Gene A This read aligns uniquely, but could be unspliced This spliced read uniquely aligns to Gene A Genes A and B overlap, but the read is uniquely within Gene A Genes A and B overlap, the read is within Gene A, but overlaps Gene B Genes A and B overlap, and the read ambiguously maps to both genes From: http://www-huber.embl.de/users/anders/HTSeq/doc/count.html
Read Count Definitions • When using tools such as HTSeq-count, FeatureCounts, and STAR, the output table will include counts of reads that can be assigned to each gene and some additional lines for reads that can’t be assigned, including: • Ambiguous – read overlaps more than one gene • Multiple Mapping – read maps equally well to more than one place in the genome • No Feature – read maps uniquely to genome, but in an intergenic (or intronic) region • Unmapped – read cannot be mapped to genome (based on run parameters)
How Well Did Your Reads Align to the Reference? Calculate percentage of reads mapped per sample % Reads Mapped Great! Sample Contamination? Good Incomplete Reference?
Checking Your Results • Key genes that may confirm sample ID • Knock-out or knock-down genes • Genes identified in previous research • Specific genes of interest • Hypothesis testing • Important pathways • Experimental validation (e.g., qRT-PCR) • Ideally, validation should be conducted on a different set of samples
Differential Gene Expression Generalized Workflow Bioinformatics analyses are insilico experiments The tools and parameters you choose will be influenced by factors including: • Available reference/annotation • Experimental design (e.g., pairwise vs. multi-factor) The “right” tools are the ones that best inform on your experiment Don’t just shop for methods that give you the answer you want
Differential Gene Expression Generalized Workflow File Types Fastq Fasta (reference) gtf (gene features) sam/bam Text tables
Today’s Exercises – Differential Gene Expression Today, we’ll analyze a few types of data you’ll typically encounter: • Single-end reads (“tag counting” with well-annotated genomes) • Paired-end reads (can also be used to discover novel transcripts in a genome with incomplete annotation) • Paired-end reads for differential gene expression when only a transcriptome is available (such as after a transcriptome assembly)
Today’s Exercises – Differential Gene Expression And we’ll be using a few different software: • Tophat to align spliced reads to a genome • FeatureCountsto generate raw counts tables for… • edgeR, for differential expression of genes • Cufflinks to find novel genes and transcripts • Bowtie2 to align reads to a transcriptome reference • sam2counts to extract raw counts from bowtie2 alignments
Alignment and Differential Expression Read set(s) TopHat bam file(s) We will follow these steps in the first exercise. Existing annotation (GTF) Toptables, etc. edgeR/limma FeatureCounts
But, do we have all the genes? • For organisms with genomes, gene models are stored in gtf files • Assumptions: • The gtf file contains annotation for ALL transcripts and genes • All splice sites, start/stop codons, etc. are correct • Are these assumptions correct for every sequenced organism? • RNA-Seq reads can be used to independently construct genes and splice variants using limited or no annotation • Method used depends on how much sequence information there is for the organism…
Gene Construction (Alignment) vs. Assembly Trinity software Genome-Sequenced Organisms Novel or Non-Model Organisms Haas and Zody (2010) Nat. Biotech. 28:421-3
Gene / TranscriptomeConstruction • Annotation can be improved – even for well-annotated model organisms • Identify all expressed exons • Combine expressed exons into genes • Find all splice variants for a gene • Discover novel transcripts • For newly sequenced organisms • Validate ab initio annotation • Comparison between different annotation sets • Can assist in finding some types of contamination • Reconstruction of rRNA genes • Genomic/mitochondrial DNA in RNA library preps.
Reference Annotation Based Transcript (RABT) Assembly Read set(s) TopHat bam file(s) Read-set specific GTF(s) Cufflinks Existing annotation (GTF) [optional] Cuffmerge Merged GTF Final assembly (GTF and stats) Cuffcompare edgeR/limma FeatureCounts Toptables, etc.
Eukaryotic and Bacterial Gene Structures are Different • Eukaryotes • Gene structure includes introns and exons • Splicing, poly-adenylation • Each mRNA is a discrete molecule when translated • Bacteria / Prokaryotes • Individual genes and groups of genes in operons • Generally, no splicing, no polyA • One mRNA can contain coding sequences for multiple proteins
Bacterial RNA-Seq Considerations • rRNA depletion strategies may leave considerable amounts of non-coding RNA molecules • Splicing-aware aligners (such as Tophat) may not be useful • Reads from polycistronic mRNA may overlap two genes • How would HTSeq-Count or FeatureCountshandle this? • Compare alignments to the genome to alignments to transcriptome. • Some aligners, such as bwa-mem, will report secondary alignments • Transcriptome alignments can be used to generate counts table for edgeR • Specialized software, such as Rockhopper (stand-alone, http://cs.wellesley.edu/~btjaden/Rockhopper/)