Introduction to RNAseq
NGS - Quick Recap • Many applications -> research intent determines technology platform choice • High volume data BUT error prone • FASTQ is accepted format standard • Must assess quality scores before proceeding • ‘Bad’ data can be rescued
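As a rough illustration of "assess quality scores before proceeding", the sketch below decodes FASTQ quality strings and reports a mean score per read. It assumes the Phred+33 (Sanger) encoding and the simple four-lines-per-record layout; the read names and sequences are made up for the example, and real data is better checked with a dedicated QC tool.

```python
from statistics import mean

def phred33_scores(quality_string):
    """Convert a FASTQ quality string to integer Phred scores (assumes Phred+33)."""
    return [ord(ch) - 33 for ch in quality_string]

def parse_fastq(lines):
    """Yield (read_id, sequence, quality_string) from FASTQ lines.

    Assumes exactly four lines per record (no wrapped sequences).
    """
    records = [ln.rstrip("\n") for ln in lines if ln.strip()]
    for i in range(0, len(records) - 3, 4):
        header, seq, _plus, qual = records[i:i + 4]
        yield header[1:], seq, qual

def mean_quality(quality_string):
    """Mean Phred score of one read."""
    return mean(phred33_scores(quality_string))

# Hypothetical two-read example
example = [
    "@read1", "ACGT", "+", "IIII",
    "@read2", "ACGN", "+", "II5#",
]
for read_id, seq, qual in parse_fastq(example):
    print(read_id, seq, mean_quality(qual))
```

Reads whose mean score falls below a chosen cutoff can then be trimmed or dropped; the cutoff itself depends on the platform and the experiment.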
The Central Dogma of Molecular Biology Reverse Transcription
RNAseq Protocols • cDNA, not RNA sequencing • Types of libraries available: • Total RNA sequencing (not advised) • polyA+ RNA sequencing • Small RNA sequencing (specific size range targeted)
Genome-scale Applications • Transcriptome analysis • Identifying new transcribed regions • Expression profiling • Resequencing to find genetic polymorphisms: • SNPs, micro-indels • CNVs • Question: Why even bother with exome sequencing then?
What about microarrays??!!! • Assumes we know all transcribed regions and that spliceforms are not important • Cannot find anything novel • BUT may be the best choice depending on QUESTION
Arrays vs RNAseq (1) • Correlation of fold change between arrays and RNAseq is similar to correlation between array platforms (0.73) • Technical replicates are almost identical • RNAseq allows extra analysis: prediction of alternative splicing, SNPs • Results for low- and high-expressed genes do not match
RNA-Seq promises/pitfalls • Can reveal in a single assay: • new genes • splice variants • genome-wide gene expression levels • BUT • Data are voluminous and complex • Needs scalable, fast and mathematically principled analysis software and LOTS of computing resources
Experimental considerations • Comparative conditions must make biological sense • Biological replicates are always better than technical ones • Aim for at least 3 replicates per condition • ISOLATE the target mRNA species you are after
Analysis strategies • De novo assembly of transcripts: + reconstructs actual spliced transcripts + does not require a genome sequence + easier to work with post-transcriptional modifications - requires huge computational resources (RAM) - low sensitivity: hard to capture low-abundance transcripts • Alignment to the genome => transcript assembly: + computationally feasible + high sensitivity + easier to annotate using genomic annotations - need to take special care of splice junctions
Basic analysis flowchart • Illumina reads • Remove artifacts (AAA..., ...N...) • Clip adapters (small RNA) • "Collapse" identical reads • Align to the genome -> mapped / un-mapped • Pre-filter: low complexity, synthetic (count and discard) • Re-align un-mapped reads with a different number of mismatches etc. -> mapped / un-mapped • Assemble: contigs (exons) + connectivity • Filter out low confidence contigs (singletons) • Annotate
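The first steps of the flowchart (removing artifact reads and collapsing identical reads) can be sketched in a few lines of Python. The thresholds (`max_n`, `homopolymer_fraction`) and the example reads are assumptions chosen for illustration, not values from the slides.

```python
from collections import Counter

def is_artifact(seq, max_n=0, homopolymer_fraction=0.9):
    """Flag reads with uncalled bases (N) or dominated by one base (e.g. AAAA...)."""
    if not seq:
        return True
    if seq.count("N") > max_n:
        return True
    most_common_count = Counter(seq).most_common(1)[0][1]
    return most_common_count / len(seq) >= homopolymer_fraction

def collapse_reads(seqs):
    """Collapse identical reads into {sequence: count}, skipping artifacts."""
    return Counter(s for s in seqs if not is_artifact(s))

# Hypothetical reads: one duplicate pair, one poly-A read, one read with an N
reads = ["ACGTACGT", "ACGTACGT", "AAAAAAAA", "ACGNACGT", "TTGCAAGC"]
collapsed = collapse_reads(reads)
for seq, count in collapsed.items():
    print(seq, count)
```

Keeping the counts alongside the collapsed sequences means each distinct read is aligned only once while its abundance is still available for expression estimates.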
Software • Short-read aligners • BWA, Novoalign, Bowtie, TOPHAT (eukaryotes) • Data preprocessing • Fastx toolkit, samtools • Expression studies • Cufflinks package, R packages (DESeq, edgeR, more…) • Alternative splicing • Cufflinks, Augustus
The ‘Tuxedo’ protocol • TOPHAT + CUFFLINKS • TopHat aligns reads to the genome and discovers splice sites • Cufflinks predicts transcripts present in the dataset • Cuffdiff identifies differential expression • Very widely adopted suite
Read alignment with TopHat • Uses the BOWTIE aligner to align reads to the genome • BOWTIE cannot deal with large gaps (introns) • TopHat splits reads that remain unaligned into smaller segments • The smaller segments mostly end up aligning
Read alignment with TopHat (3) • When there is a large gap between segments of the same read -> probable INTRON • TopHat uses this to build an index of probable splice sites • Allows accurate measurement of spliceform expression
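The segment idea can be illustrated with a toy Python sketch: split a read into fixed-length segments, then look for large genomic gaps between the positions where consecutive segments align. This is a simplification for teaching purposes and not TopHat's actual algorithm; the segment length, the minimum intron size and the alignment positions are all assumed values, and junctions that fall inside a segment are ignored here.

```python
def split_read(seq, segment_length=25):
    """Split a read into consecutive segments of at most segment_length bases."""
    return [seq[i:i + segment_length] for i in range(0, len(seq), segment_length)]

def candidate_introns(segment_starts, segment_length=25, min_intron=50):
    """Infer probable introns from where consecutive segments of one read align.

    segment_starts: 0-based genomic start of each segment, in read order
    (forward strand assumed). Returns half-open (start, end) gap intervals.
    """
    introns = []
    for left, right in zip(segment_starts, segment_starts[1:]):
        expected_next = left + segment_length  # where the next segment would start without a gap
        if right - expected_next >= min_intron:
            introns.append((expected_next, right))
    return introns

# Hypothetical 75 bp read: segments 1 and 2 are contiguous, segment 3 lands 2 kb downstream
segments = split_read("ACGT" * 18 + "ACG")
hits = [1000, 1025, 3050]
print(len(segments), candidate_introns(hits))
```

Collecting such gaps over many reads gives the kind of index of probable splice sites described on the slide.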
Cufflinks package • http://cufflinks.cbcb.umd.edu/ • Cufflinks: • Expression values calculation • Transcripts de novo assembly • Cuffdiff: • Differential expression analysis
Cufflinks: Transcript assembly • Assembles individual transcripts based on aligned reads • Infers likely spliceforms of each gene • Quantifies expression level of each
Cuffmerge • Merges transfrags into transcripts where appropriate • Also performs a reference based assembly of transcripts using known transcripts • Produces single annotation file which aids downstream analysis
Cuffdiff: Differential expression • Calculates expression level in two or more samples • Expression level relates to read abundance • Because there are several sources of bias, Cuffdiff models the variance when calculating significance
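To show how TopHat, Cufflinks, Cuffmerge and Cuffdiff chain together, the sketch below only builds the command lines as Python lists; it does not run anything. All file and directory names (index prefix, FASTQ files, `assemblies.txt`, output folders) are hypothetical, and the options shown (`-p`, `-o`, `-s`, `-L`) should be checked against the installed versions before use.

```python
import shlex

def tuxedo_commands(index, genome_fa, samples, threads=4):
    """Build Tuxedo command lines.

    samples: dict mapping condition label -> list of FASTQ paths (replicates).
    Returns (commands, assembly_gtfs); the caller is expected to write
    assembly_gtfs, one path per line, to assemblies.txt before cuffmerge runs.
    """
    commands, assembly_gtfs, bams = [], [], {}
    for label, fastqs in samples.items():
        for rep, fastq in enumerate(fastqs, start=1):
            th_out = f"tophat_{label}_{rep}"
            cl_out = f"cufflinks_{label}_{rep}"
            bam = f"{th_out}/accepted_hits.bam"
            # Align reads to the genome, then assemble transcripts per sample
            commands.append(["tophat", "-p", str(threads), "-o", th_out, index, fastq])
            commands.append(["cufflinks", "-p", str(threads), "-o", cl_out, bam])
            assembly_gtfs.append(f"{cl_out}/transcripts.gtf")
            bams.setdefault(label, []).append(bam)
    # Merge the per-sample assemblies into a single annotation
    commands.append(["cuffmerge", "-s", genome_fa, "-p", str(threads),
                     "-o", "merged_asm", "assemblies.txt"])
    # Compare conditions; replicates of one condition are comma-separated
    cuffdiff = ["cuffdiff", "-p", str(threads), "-o", "diff_out",
                "-L", ",".join(bams), "merged_asm/merged.gtf"]
    cuffdiff += [",".join(paths) for paths in bams.values()]
    commands.append(cuffdiff)
    return commands, assembly_gtfs

commands, gtfs = tuxedo_commands(
    "genome_index", "genome.fa",
    {"q1": ["q1_rep1.fastq"], "q2": ["q2_rep1.fastq"]},
)
for cmd in commands:
    print(shlex.join(cmd))
```

Building the commands as lists keeps the pipeline easy to inspect, and each list can later be passed to `subprocess.run` once the paths are real.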
FPKM (RPKM): Expression Values • Fragments (Reads) Per Kilobase of exon model per Million mapped fragments (reads) • C = the number of reads mapped onto the gene's exons • N = the total number of reads in the experiment • L = the sum of the exon lengths in base pairs • Reference: Mortazavi A et al. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods. 2008.
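With C, N and L as defined on the slide, the RPKM of Mortazavi et al. works out to 10^9 × C / (N × L): the factor 10^9 combines "per kilobase" (10^3) and "per million reads" (10^6). A minimal Python sketch, with made-up numbers for the worked example:

```python
def rpkm(read_count, exon_length_bp, total_reads):
    """RPKM = 1e9 * C / (N * L).

    read_count (C): reads mapped onto the gene's exons
    exon_length_bp (L): summed exon length in base pairs
    total_reads (N): total number of reads in the experiment
    """
    if exon_length_bp <= 0 or total_reads <= 0:
        raise ValueError("exon length and total read count must be positive")
    return 1e9 * read_count / (total_reads * exon_length_bp)

# Hypothetical gene: 500 reads on 2,000 bp of exons, 20 million reads in total
print(rpkm(500, 2000, 20_000_000))
```

FPKM uses the same formula with fragments (read pairs) counted instead of individual reads.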
Cuffdiff (differential expression) • Pairwise or time series comparison • Normal distribution of read counts • Fisher’s test
Example output:
test_id          gene    locus                     sample_1  sample_2  status  value_1  value_2  ln(fold_change)  test_stat  p_value   significant
ENSG00000000003  TSPAN6  chrX:99883666-99894988    q1        q2        NOTEST  0        0        0                0          1         no
ENSG00000000005  TNMD    chrX:99839798-99854882    q1        q2        NOTEST  0        0        0                0          1         no
ENSG00000000419  DPM1    chr20:49551403-49575092   q1        q2        NOTEST  15.0775  23.8627  0.459116         -1.39556   0.162848  no
ENSG00000000457  SCYL3   chr1:169631244-169863408  q1        q2        OK      32.5626  16.5208  -0.678541        15.8186    0         yes
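Cuffdiff writes its results as tab-delimited text, so the output is straightforward to parse. The Python sketch below reads two of the example rows from an in-memory string (standing in for a real output file), lists the genes flagged as significant, and checks that the reported ln(fold_change) equals ln(value_2 / value_1). The column names follow the example output on the slide; other Cuffdiff versions may name or order the columns differently.

```python
import csv
import io
import math

# Two rows from the example output, tab-delimited as Cuffdiff writes them
EXAMPLE = (
    "test_id\tgene\tlocus\tsample_1\tsample_2\tstatus\tvalue_1\tvalue_2\t"
    "ln(fold_change)\ttest_stat\tp_value\tsignificant\n"
    "ENSG00000000419\tDPM1\tchr20:49551403-49575092\tq1\tq2\tNOTEST\t"
    "15.0775\t23.8627\t0.459116\t-1.39556\t0.162848\tno\n"
    "ENSG00000000457\tSCYL3\tchr1:169631244-169863408\tq1\tq2\tOK\t"
    "32.5626\t16.5208\t-0.678541\t15.8186\t0\tyes\n"
)

NUMERIC = ("value_1", "value_2", "ln(fold_change)", "test_stat", "p_value")

def read_diff(handle):
    """Parse Cuffdiff-style rows into dicts, converting numeric columns to float."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        for key in NUMERIC:
            row[key] = float(row[key])
        rows.append(row)
    return rows

def significant_genes(rows):
    """Gene names for tests that ran (status OK) and were called significant."""
    return [r["gene"] for r in rows if r["status"] == "OK" and r["significant"] == "yes"]

rows = read_diff(io.StringIO(EXAMPLE))
print(significant_genes(rows))
for r in rows:
    # The reported value is the natural log of value_2 / value_1
    print(r["gene"], r["ln(fold_change)"], math.log(r["value_2"] / r["value_1"]))
```

For a real run, `io.StringIO(EXAMPLE)` would be replaced by an open file handle on the Cuffdiff output file.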
Recommendations • You can use either BOWTIE or BOWTIE2 for alignment • Use CUFFDIFF2 for differential expression: • Better statistical model • Detection of truly differentially expressed genes • VERY easy to parse output file (see example on course page)