1 / 41

High throughput expression analysis using RNA sequencing ( RNAseq )

High throughput expression analysis using RNA sequencing ( RNAseq ). June 18, 2019. Lecture objectives. Theory and practice of RNA sequencing (RNA- seq ) analysis Rationale for sequencing RNA Challenges specific to RNA- seq Types of libraries for RNA- seq Sequencing coverage (depth)

zasha
Download Presentation

High throughput expression analysis using RNA sequencing ( RNAseq )

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. High throughput expression analysisusing RNA sequencing (RNAseq) June 18, 2019

  2. Lecture objectives • Theory and practice of RNA sequencing (RNA-seq) analysis • Rationale for sequencing RNA • Challenges specific to RNA-seq • Types of libraries for RNA-seq • Sequencing coverage (depth) • Basic themes of RNA-seq analysis work flows • Terminology for RNA-seq • Steps in RNAseq analysis • Downstream interpretation of expression and differential estimates

  3. Gene expression

  4. RNA sequencing Generate cDNA, fragment, size select, add linkers Isolate RNAs Samples of interest Condition 1 (normal colon) Condition 2 (colon tumor) Sequence ends Map to genome, transcriptome, and predicted exon junctions 100s of millions of paired reads 10s of billions bases of sequence Downstream analysis

  5. Before you start: • What is your experimental question? • Is a global expression experiment the best approach? • What is your experimental model? • Is global expression technically feasible? • How many replicates are needed? • What is your budget? • Minimal 3 replicates/condition to do statistics • Number of replicates depends on biological variability • Cell culture < inbred mice < human tumors Number of replicates needed increases

  6. Common analysis goals of RNA-Seq analysis • Gene expression and differential expression • Alternative expression analysis • Transcript discovery and annotation • Allele specific expression • Relating to SNPs or mutations • Mutation discovery • Fusion detection • RNA editing • miRNA & other non-coding RNA detection/differences

  7. Why sequence RNA (versus DNA)? • Functional studies • Genome may be constant but an experimental condition has a pronounced effect on gene expression • e.g. Drug treated vs. untreated cell line • e.g. Wild type versus knock out mice • Some molecular features can only be observed at the RNA level • Alternative isoforms, fusion transcripts, RNA editing • Predicting all possible transcript sequences from genome sequence is difficult • Alternative splicing, RNA editing, etc.

  8. Why sequence RNA (versus DNA)? • Interpreting mutations that do not have an obvious effect on protein sequence • ‘Regulatory’ mutations that affect what mRNA isoform is expressed and how much • e.g. splice sites, promoters, exonic/intronic splicing motifs, etc. • Prioritizing protein coding somatic mutations (often heterozygous) • If the gene is not expressed, a mutation in that gene would be less interesting • If the gene is expressed but only from the wild type allele, this might suggest loss-of-function (haploinsufficiency) • If the mutant allele itself is expressed, this might suggest a candidate drug target

  9. Challenges • Sample • Purity?, quantity?, quality? • RNAs consist of small exons that may be separated by large introns • Mapping reads to genome is challenging • The relative abundance of RNAs vary wildly • 105 – 107 orders of magnitude • Since RNA sequencing works by random sampling, a small fraction of highly expressed genes may consume the majority of reads • Ribosomal and mitochondrial genes most highly expressed • RNAs come in a wide range of sizes • Small RNAs must be captured separately • PolyA selection of large RNAs may result in 3’ end bias • RNA is fragile compared to DNA (easily degraded)

  10. Challenges, cont. • Experimental model: • Mammalian, eukaryote, bacteria or virus? • Patient derived tissue or cells? • Minimum number biological replicates to achieve statistical significance? • Three biological replicates for cultured cells • Three biological replicates for inbred mice, for each condition • Human samples? • As many as possible, assuming that you can identify reasonable controls

  11. Biological versus technical replicates Technical replicate 1 Library prep & sequencing Library prep & sequencing Technical replicate 2 Library prep & sequencing Biological replicate 1 Library prep & sequencing Biological replicate 2 Biological replicates are preferred (necessary) Technical variation between library preps is not large for experienced technicians

  12. Variability of RNA-seq libraries • The relative log expression should be constant across the libraries • Libraries from different conditions should cluster together Reference: Risso D. et.al. Nature Biotech. (2014) 32:896-902

  13. PCA of time-course RNAseq experiment

  14. RNA quality is key to success • Garbage in…garbage out • Amount of RNA influences which type of library you can construct • Quality of the RNA determines if the library construction will work well enough to give you usable data

  15. Purity of RNA • Pure RNA absorbs strongly at 260 nm • Ratio of absorbance 260/280 used to assess purity • Ratio = 2.1 if pure (1.8-2.0 acceptable) • Ratio depends on pH if there are protein contaminants • A260 remains constant, A280 changes if proteins are present • Ratio 260/230 should also be close to 2 • If <2, proteins, guanidinium isothiocyanate or phenol may be contaminants • Always take a scanning measurement (220-300 nm) Reference: http://biomedicalgenomics.org/RNA_quality_control.html

  16. RNA purity, cont. • RNA quality determined using a Agilent Bioanalyzer • Lab on a chip to perform capillary electrophoresis with a fluorescent dye that binds to RNA • Determines RNA concentration & integrity • Works with ng quantities of RNA (50 – 300 ng/ul) • QC determined by 28S/18S ratio

  17. MW of 28S RNA is ~2.5x more than 18S RNA (mammalian) • So a 28S/18S ratio of 2 is considered ideal • Can vary between species

  18. Agilent example / interpretation • RNA quality assessed by Agilent Bioanalyzer • “RIN” = RNA integrity number • 0 (bad) to 10 (good) RIN = 6.0 RIN = 10 RIN values of >7 sufficient for library prep

  19. Enrichment of desired RNA • mRNA makes up 1-3% of total RNA • Ribosomal RNA >80%, with majority of that 18S/28S RNA • polyA selection • Not species specific • Works reasonably well for gene expression analyses • Ribosomal depletion (Ribo-Minus) • Based on oligonucleotide probes that capture rRNA and remove it via binding to streptavadin-coated magnetic beads • Species specific, or at least specific to taxonomic groups • Necessary if interested in non-coding RNA analysis

  20. Other RNA-seq library construction strategies • Size selection (before and/or after cDNA synthesis) • Small RNAs (microRNAs) vs. large RNAs? • Less bias with RNA fragmentation prior to cDNA synthesis • Linear amplification • insertion sites (viral, gene therapy) • Ref: Nature Protocols (2011) 6:1026 • Strand-specific sequencing • Ref: Current Genomics (2013) 14:173 • Exome capture • Cancer Genomes, 1000 genomes, NHLBI genome sequencing • Review on exomevs transcriptome for variant detection • Exp. Rev. Mol. Diag. (2012) 12:241-251

  21. Sequence coverage (depth) needed for RNA-seq • Question being asked of the data • Gene expression? Alternative expression? Mutation calling? • Tissue type • RNA preparation, quality of input RNA, library construction method, etc. • Sequencing type • read length, paired vs. unpaired, etc. • Computational approach and resources • Identify publications with similar goals • Pilot experiment • Work with your Core facility

  22. Sequence coverage (depth) C = LN/G • C = coverage • L = read length • N = number of reads (depends on sequencer; Illumina v3 capacity is 189 million reads /lane) • G = haploid genome length 100 bp reads of human DNA on Illumina (single lane) C = (100 bp)*(189 x 106)/(3 x 109bp) = 6.3 That is, each base will be sequenced between 6 & 7 times SNP discovery requires 10-30X coverage

  23. Sequence coverage, cont. 100 bp reads of human transcriptome (RNA) on Illumina v3 Human transcriptome: 50 million bp C = (100 bp)*(189 x 106)/(50 x 106bp) = 378 • Only need 30X coverage for gene expression analysis • Run 10 RNAseq libraries on a single lane and still get enough coverage for the analysis • Use barcodes that distinguish different libraries for analysis

  24. How many replicates? • Statistical considerations suggest that you need at least 3 biological replicates to make valid comparisons • More complex samples need more replicates to increase the signal above the biological variability • Cost of experiment which is dependent on the depth of sequencing required and thus how many chips (Ion Torrent) or lanes (HiSeq) of sequencing you need • Plan this well in advance in consultation with the core personnel • Do as many replicates as is feasible and affordable

  25. Steps in RNAseq workflow • Obtain raw data from sequencer • Align/assemble reads • Process alignment to quantify reads/gene feature • Conduct differential analysis • Summarize and visualize • Create gene lists, prioritize candidates for validation, etc. • Conduct gene enrichment or pathway analysis

  26. Some nomenclature Sequencing file types: Fasta (sequence only) Fastq (contains quality scores) >SEQ_ID GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT @SEQ_ID GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT + !''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

  27. Alignment files • SAM – Sequence Alignment/MAP coordinate file • Tab-delimited ASCII file with all of the information needed for the alignment of the reads to reference HWI-ST155_0544:7:2:7658:34048#0 16 Chr1 2082 1 42M * 0 0 GTAGAGGTAGGACCAACAAGGACCAAGTTTCCCTGTTCCAAC ghghhhhgcghh_hhhhhhhhhhhhhhhhghhhhhghhhhhh NM:i:0 NH:i:4 CC:Z:Chr10 CP:i:1057787 HWI-ST155_0544:7:61:11040:141129#0 0 Chr1 2956 1 42M * 0 0 CTTTTCTGGCGTAACTTGGTTCCCTTTAGTTTGGAACAGATA hhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhghhchfhhhh NM:i:0 NH:i:3 CC:Z:Chr10 CP:i:1056913 HWI-ST155_0544:7:8:11214:130793#0 0 Chr1 3645 1 42M * 0 0 CAAATAGTGGTTTGAAACCTATCAATCAAGTCACTTTCAAGT gggggggggggggggggeggdffffggggggggggggdggfc NM:i:0 NH:i:3 CC:Z:Chr10 CP:i:1056224 • BAM – Binary version of SAM • Machine readable, smaller more compact version • More commonly used by viewers and analysis programs

  28. Reference guided RNAseq alignment • You have a sequenced genome and a file with the gene model annotations • Genome reference file is usually a fasta file with each chromosome as a separate entry • GTF = gene transcript file • GFF = gene feature file (GFF3) • The GTF/GFF chromosome name must match the chromosome name in the genome reference file CHR Source Type Start Stop Strand Feature Chr_1 CNA3 gene 14025 16421 . - . ID=CNAG_00003; Name=transporter Chr_1 CNA3 mRNA 14025 16421 . - . ID=CNAG_00003T0; Parent=CNAG_0003; Name=transporter Chr_1 CNA3 exon 16002 16421 . - . ID=7000009599314867;Parent=CNAG_00003T0 Chr_1 CNA3 CDS 16002 16370 . - . ID=CNAG_00003T0.cds;Parent=CNAG_00003T0

  29. RNAseq alignment

  30. How to quantify reads per feature?

  31. Gene expression differences • Once you have your data as counts (number of reads/feature), you can use a variety of tools to compare the different conditions and determine what genes are differentially expressed • EdgeR, DESeq2, NOISeq • Employ different statistical models and normalization methods

  32. Multiple approaches advisable

  33. Multiple testing correction • As more attributes are compared, it becomes more likely that the treatment and control groups will appear to differ on at least one attribute by random chance alone. • Well known from array studies • 10,000s genes • 100,000s exons • With RNA-seq, even greater problem • All the complexity of the transcriptome • Large variety and number of potential features • Genes, transcripts, exons, junctions, retained introns, microRNAs, lncRNAs

  34. Lessons learned from microarray days • Hansen et al. “Sequencing Technology Does Not Eliminate Biological Variability.” Nature Biotechnology 29, no. 7 (2011): 572–573 • You still need to do replicates!! • Minimum of 3 replicates for statistical analysis to be robust

  35. Source of data Carreras-Gonzalez A. et. al. “A multi-omics analysis reveals the regulatory role of CD180 during the response of macrophages to Borrelia burgdorferi” Emerging Microbes and Infections (2018) 7: 1-13 Lyme Borreliosis Ixodesscapularis Borrelia burgdorferi LB most prevalent arthropod-borne infectious disease in N. America Photo Credit: Content Providers(s): CDC - This media comes from the Centers for Disease Control and Prevention's Public Health Image Library (PHIL), with identification number #6631. 

  36. Carreras-Gonzalez et. al. approach • Large-scale analytical approach (transcriptomics and proteomics) to identify biomarkers or define the disease stage and improve early detection and diagonosis • Measured global transcriptional profile in murine macrophages and human monocytes after exposure to B. burgdorferi • Proteomics analysis of murine macrophages after stimulation with B. burgdorferi • Genes and proteins identified are mouse or human, not from the bacteria or tick

  37. For Exercise 2 • Gene list: • List of genes showing differential expression in B. burgdorferi stimulated murine macrophages vs unstimulated macrophages. • You will use Excel to do simple transformations of data, sort and cull data and compare lists of genes between Excel worksheets • You will then use db2db program within bioDBnet to input a list of Gene IDs and get back a list of additional database identifiers, such as gene symbols, names and other annotation • Write a narrative of how you generated the final gene list

  38. Why do you need other IDs? • Motivation: • Quick way to find basic information about a list of genes • Some websites for analysis prefer specific identifiers • To compare to a list of genes from a different experiment, may need different identifiers • Find homologs in other species which might be more tractable for experiments

  39. A few comments on the gene list • This list was generated by the program DESeq2, one of the more commonly used RNAseq analysis programs. • By default it generates output with 7 columns

  40. What do you do with this list? • Remove the unnecessary columns: • baseMean, lfcSE, stat and pvalue • Cull the data to look at only the significantly differentially expressed genes • Sort by padj and remove all rows with values >0.01 • Sort by log2FC; delete all rows with values between 1 and -1 • A log2FC of 1 = 2-fold change in expression • The log2FoldChange (log2FC) represents the comparison of B. burgdorferi stimulated/unstimulated murine macrophages

  41. bioDBnet https://biodbnet-abcc.ncifcrf.gov/ https://frederick.cancer.gov/

More Related