RNA-seq Sequence and Expression Profile Analysis

RNA-seq Sequence and Expression Profile Analysis. 唐海宝 (Tang Haibao). Measuring Expression. What & Why: What is expression and why do we care? Platforms / Technology: Closed approaches – Microarray; Open approaches – Sequencing. Experimental Design. Quality control. Analysis: “Align-first”, “Assemble-first”.

Presentation Transcript


  1. RNA-seq Sequence and Expression Profile Analysis. 唐海宝 (Tang Haibao)

  2. Measuring Expression • What & Why • What is expression and why do we care? • Platforms / Technology • Closed approaches – Microarray • Open approaches - Sequencing • Experimental Design • Quality control • Analysis • “Align-first” • “Assemble-first” • Statistical Issues and Analysis

  3. What is expression / transcriptome [Figure: DNA and the RNA classes transcribed from it: mRNA, tRNA, rRNA, siRNA, microRNA, piRNA, tasiRNA, lncRNA]

  4. Why expression? [Figure: Genome, Transcriptome, Proteome; labels: High-throughput friendly, Predicts biology, Regulatory network, Context dependent]

  5. Measuring Expression? • Parts description • Function? • Interconnectedness? • Comparisons • Population-level • Between genomes

  6. Measuring Expression ? • What are important members of a transcriptome? • mRNA • polyadenylated, coding • alternatively spliced • Noncoding RNA (small RNA) • varying lengths, functions (18 – 32 bases) • microRNA, siRNA, piRNA, tasiRNA, long non-coding RNA • “Dark” RNA • transcription outside of annotated genes • Non-polyadenylated • Anti-sense transcription

  7. Measuring Expression? • How does the transcriptome vary to give rise to phenotype? • Changes in Abundance • Abundance = Rate of Transcription – Rate of Decay • Changes in Function • Availability for function – polyadenylation, silencing, localisation • Suitability for function – alternative splicing

  8. 1. PLATFORMS / TECHNOLOGY

  9. Microarray is based on hybridization

  10. Single and two colour arrays [Figure: workflow of Probe Library, Array Manufacture, Labelling, Hybridisation, Scanning; single colour: Sample A labelled and hybridised to an array; two colour: Experimental and Control samples labelled and hybridised to an array]

  11. Array profiling Affymetrix Array Targets • Arabidopsis Genome 24,000 • C. elegans Genome 22,500 • Drosophila Genome 18,500 • E. coli Genome 20,366 • Human Genome U133 Plus 47,000 • Mouse Genome 39,000 • Yeast Genome • S. cerevisiae 5,841 • S. pombe 5,031 • Rat Genome 30,000 • Zebrafish 14,900 • Plasmodium / Anopheles • P. falciparum 4,300 • A. gambiae 14,900 • Barley (25,500), Soybean (37,500 + 23,300 pathogen), Grape (15,700) • Canine (21,700), Bovine (23,000) • B. subtilis (5,000), S. aureus (3,300 ORFs), Xenopus (14,400)

  12. Closed System – Microarray • Pros • High-throughput • Targeted profiling • Inexpensive – “population friendly” • Analytical methods are standardised • Cons • “Closed system”, novel = invisible • Difficult to see allele-specific expression • Biases due to hybridisation • SNPs • Competitive and non-specific hybridisation

  13. Open systems – RNA Sequencing Technology: • Illumina • SOLiD, IonTorrent • 454 Pros: • Transcript discovery • Allelic expression • High resolution abundance measures Cons: • Analysis can be complex • Expensive • Sensitivity is sequencing depth dependent

  14. How RNA-seq data is generated • Isolate transcript RNA (polyadenylated, AAAAAA tails) • Reverse transcription • Fragment cDNA • Size selection • Illumina sequencing of each end (example end reads: CAGG, CAAA, GGAG, AAAA, CTGG, GAAA) *based on Illumina approach **strand-specific RNA-seq protocols exist for both Illumina and SOLiD Slide courtesy of Andrew McPherson

  15. RNA Sequencing Mortazavi et al., 2008

  16. RNASeq - Correspondence • Range > 5 orders of magnitude • Better detection of low abundance transcripts Marioni et al., 2009

  17. Platform Choice / Sample Preparation Choice What do you want to profile ? • Polyadenylated • PolyA RNA extraction • Small RNA (< 100 bases) • Size filtering by gel • Strand-specific • RNA – Protein Interactions • RNA Immunoprecipitation (IP)

  18. RNASeq - Workflow [Figure: Sample; Total RNA / PolyA RNA / Small RNA; Library Construction; Sequencing; Base calling & QC; Mapping to Genome or Assembly to Contigs; outputs: Differential Expression, SNP detection, Transcript structure, Secondary structure, Targets or Products]

  19. Illumina RNASeq : TruSeq

  20. Strand-specificity • Using adaptors • Ligation: 3’ and 5’ adaptors added sequentially • SMART: addition of C’s on 5’ end • Using chemical modification • dUTP: addition and removal after selection Levin et al., 2010

  21. Strand-specific data Levin et al., 2010

  22. Non-polyA methods • Total RNA extraction • Ribosomal RNA and tRNA make up > 95-97% of total RNA • Ribosomal reduction methods • Subtractive hybridisation with rRNA probes • Exonuclease cleavage of rRNA • NuGen – “proprietary combination of reverse transcriptase and primers in the Ovation RNA-Seq System” • cDNA normalisation methods • Partial digestion of any highly abundant species (Evrogen)

  23. 2. EXPERIMENTAL DESIGN

  24. RNASeq Experimental Design • Issues: • Single-end vs Paired-end • sequencing depth - how much ? • number of replicates – how many ?

  25. Single-end vs. paired-end sequencing

  26. Depth • Sequencing depth is the average read coverage of the target sequences • Sequencing depth = total number of reads × read length / estimated target sequence length • Example: for a 5 Mb transcriptome, if 1 million 50 bp reads are produced, the depth is 1 M × 50 bp / 5 Mb = 10×
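
The depth formula on this slide can be checked with a few lines of Python. This is a minimal sketch; the function name `sequencing_depth` and its parameter names are illustrative choices, not part of the presentation.

```python
def sequencing_depth(num_reads, read_length, target_length):
    """Average coverage = total sequenced bases / target length (all in bases)."""
    if target_length <= 0:
        raise ValueError("target_length must be positive")
    return num_reads * read_length / target_length


# Slide example: 1 million 50 bp reads over a 5 Mb transcriptome.
depth = sequencing_depth(num_reads=1_000_000, read_length=50, target_length=5_000_000)
print(f"{depth:.0f}x")  # 10x
```

For paired-end data, one would count both mates as reads (or double the read length), since each fragment contributes two sequenced ends.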

  28. Defining Replicates • Technical Replicates • Biological Replicates [Figure: Libraries 1-4 multiplexed in Lane 1 (L1-L4) give 25% of a lane per sample; libraries from one individual, or from Individual 1 and Individual 2, run in separate lanes give 100% of a lane per sample; label: Depth = 2 x]

  29. Number of Replicates [Figure: thresholds edgeR <= 0.01, DESeq <= 0.01] • For differential expression, there is more information in biological replicates than in depth

  30. RNASeq – Compositional properties Depth of Sequence • Sequence count ≈ Transcript Abundance • Majority of the data can be dominated by a small number of highly abundant transcripts • Ability to observe transcripts of smaller abundance is dependent upon sequence depth

  31. 3. Quality control

  32. Quality control • Is NOT NEEDED, if: • Reads are in the right format • Read quality is good • Phred score per base & per sequence >= 20 (better if >= 30) • No contamination detected • Paired reads are synchronized • Bad mapping efficiency of PE reads is symptomatic of de-synchronization
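
Two of the checks listed above, the Phred thresholds and paired-read synchronization, can be sketched in Python. The sketch assumes Phred+33 quality encoding and read names that differ only by a trailing `/1` or `/2`; both are assumptions about the data, and the function names are illustrative. The per-sequence score is taken here as the mean of the per-base scores.

```python
def phred_scores(quality, offset=33):
    """Decode a FASTQ quality string into per-base Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality]


def passes_quality(quality, min_base=20, min_mean=20):
    """True if every base reaches min_base and the read's mean score reaches min_mean."""
    scores = phred_scores(quality)
    if not scores:
        return False
    return min(scores) >= min_base and sum(scores) / len(scores) >= min_mean


def read_id(header):
    """Strip the leading '@', any comment after whitespace, and a trailing /1 or /2."""
    name = header.lstrip("@").split()[0]
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


def pairs_synchronized(headers_r1, headers_r2):
    """True if both files list the same read names in the same order."""
    if len(headers_r1) != len(headers_r2):
        return False
    return all(read_id(a) == read_id(b) for a, b in zip(headers_r1, headers_r2))


print(phred_scores("I5+"))                      # [40, 20, 10]
print(passes_quality("IIII"))                   # True
print(pairs_synchronized(["@r1/1"], ["@r1/2"])) # True
```

In practice, a dedicated QC tool would report these metrics per position and per file; the functions above only show what the thresholds mean for a single read.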

  33. Base quality

  34. Adapter contamination

  35. 4. ANALYSIS – align first

  36. RNASeq Analysis • Overall Aim: • To get an accurate measurement of transcript abundance, structure and identity • Alignment • Bowtie / TopHat / Cufflinks • Assembly • Trinity

  37. Two assembly strategies • There is no one ‘correct’ way to analyze RNA-seq data (though there are some incorrect ways) • Two major branches 1. Direct alignment of reads (spliced or un-spliced) to genome or transcriptome 2. Assembly of reads followed by alignment • Assembly is the only option when working with a creature with no genome sequence Image from Haas & Zody, 2010

  38. Read density variability

  39. RNASeq – Alignment Considerations • Reads with multiple locations • Discard / Random Allocation • Clustering - local coverage • Weighting • Reads Spanning Exons • Make and align to exon junction libraries • De novo junction detection • Summarisation of counts • Exons • Transcript boundaries • Inferred read boundaries
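
Two of the options for reads with multiple locations, discarding and weighting, can be contrasted in a short Python sketch. The input layout (a dict from read name to the list of features it aligns to) and the equal 1/n weighting are assumptions made for illustration; real tools use their own data structures and often more refined weights.

```python
from collections import defaultdict


def unique_only_counts(alignments):
    """'Discard' strategy: count only reads that align to exactly one feature."""
    counts = defaultdict(float)
    for features in alignments.values():
        if len(features) == 1:
            counts[features[0]] += 1.0
    return dict(counts)


def weighted_counts(alignments):
    """'Weighting' strategy: a read aligning to n features adds 1/n to each."""
    counts = defaultdict(float)
    for features in alignments.values():
        if not features:
            continue  # unaligned read
        weight = 1.0 / len(features)
        for feature in features:
            counts[feature] += weight
    return dict(counts)


# Hypothetical toy alignments: read r2 maps equally well to geneA and geneB.
alignments = {"r1": ["geneA"], "r2": ["geneA", "geneB"], "r3": ["geneB"]}
print(unique_only_counts(alignments))  # {'geneA': 1.0, 'geneB': 1.0}
print(weighted_counts(alignments))     # {'geneA': 1.5, 'geneB': 1.5}
```

Discarding is simple but underestimates genes with paralogs or repeats; weighting keeps every read at the cost of spreading signal across candidate locations.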

  40. Which aligner to choose? • Paired end alignment improves sensitivity • Use of (trusted) base quality could improve alignment • BWA and Bowtie are the fastest (~7Gbp per CPU day) Li, H and Homer, N (2010) Briefings in Bioinformatics 11:473

  41. TopHat Multimapping : ≤10 sites Assembly : consensus ‘island’ exon Trapnell et al., 2009; Roberts et al., 2011

  42. Cufflinks Trapnell, C., et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature Biotechnology (2010)

  43. Alignment • Great if you have a fully annotated reference • Okay if you have a partially annotated reference • “Different” if you have a big bunch of ESTs Options: • Align to a neighbouring genome or EST library • De novo transcriptome assembly Tools: • ABySS, Mira, Trinity, HT-Seq, SAMtools

  44. 5. ANALYSIS – assemble first

  45. Assembly – Kmer graphs K = 4 Miller et al., 2010

  46. Assembly – Kmer graphs • Spurs • Sequencing error • Bubbles • Sequencing error • Polymorphism • Frayed Rope / Cycles • Repeats Miller et al., 2010
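
A k-mer graph with K = 4 can be built in a few lines of Python to show how a single sequencing error creates a branch. This is a toy sketch: the reads are made up, nodes are k-mers, an edge joins k-mers that are adjacent in a read (overlapping by k-1 bases), and the function names are illustrative.

```python
from collections import defaultdict


def kmer_graph(reads, k=4):
    """Map each k-mer to the sorted list of k-mers that directly follow it in a read."""
    graph = defaultdict(set)
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        for current, following in zip(kmers, kmers[1:]):
            graph[current].add(following)
    return {node: sorted(successors) for node, successors in graph.items()}


def branch_nodes(graph):
    """Nodes with more than one successor, where spurs and bubbles begin."""
    return sorted(node for node, successors in graph.items() if len(successors) > 1)


clean = ["AACCGGTT"]
with_error = ["AACCGGTT", "AACCGATT"]  # second read has one G->A error near its end

print(branch_nodes(kmer_graph(clean)))       # []
print(branch_nodes(kmer_graph(with_error)))  # ['ACCG']
```

Because the erroneous read ends shortly after the mismatch, its path never rejoins the main path, which is the spur pattern. An error or polymorphism in the middle of a read would rejoin and form a bubble instead.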

  48. De novo transcriptome assembly • Will run on reasonable computer resources for large genomes • (e.g. < 1 TB of RAM) • Paired end data handling • Platform flexible • Handles haplotype complexity and polyploid genomes • ABySS • MIRA • Trinity • Velvet • AllPaths • SOAPdenovo • Euler • CABOG • Edena • SHARCGS • VCAKE • SSAKE • CAP3

  49. Assessing assembly quality? • Comparisons between assembly algorithms • Contig summary statistics • Comparisons to known resources (e.g. ESTs) Trial on Rice Transcriptome: • 120 million 75 bp single-end Illumina reads – embryo • ABySS: • Number of contigs = 6,804 • Contig length range = 38 – 2,818 [mean = 203] • Database comparisons: • Rice public cDNA sequences: 67,393 • Contigs with high quality matches to cDNA: 6,555 (96%)
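
Contig summary statistics of the kind reported for the rice trial can be computed as in the sketch below. The contig lengths are made-up toy values. N50 is not on the slide; it is included as an assumption because it is a commonly reported assembly statistic. The last lines recompute the matched fraction from the slide's own numbers.

```python
def contig_stats(lengths):
    """Count, length range, mean and N50 for a list of contig lengths."""
    if not lengths:
        raise ValueError("no contigs given")
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    n50 = ordered[-1]
    for length in ordered:
        running += length
        if running * 2 >= total:  # contigs this long or longer hold half the bases
            n50 = length
            break
    return {
        "count": len(ordered),
        "min": ordered[-1],
        "max": ordered[0],
        "mean": total / len(ordered),
        "n50": n50,
    }


print(contig_stats([100, 200, 300, 400]))
# {'count': 4, 'min': 100, 'max': 400, 'mean': 250.0, 'n50': 300}

# Slide figures: 6,555 of 6,804 contigs had high quality matches to cDNA.
matched_percent = round(100 * 6555 / 6804)
print(matched_percent)  # 96
```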

  50. Assembly QC All 380 assemblies (coverage cutoffs 2 to 20 and k-mer sizes 25 to 63) were screened for complete transcripts of five genes. If a complete transcript was present in an assembly, it was marked in grey, and in black otherwise. Gruenheit et al., 2012
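
The count of 380 assemblies is consistent with a full grid of 19 coverage cutoffs and 20 k-mer sizes, if only odd k-mer sizes were used. That restriction is an assumption, not stated on the slide. The sketch below enumerates such a grid and shows one simple way to screen an assembly for a complete transcript; the helper name and the toy sequences are hypothetical, and reverse-complement matches are ignored for brevity.

```python
from itertools import product

coverage_cutoffs = range(2, 21)  # 2 to 20 inclusive: 19 values
kmer_sizes = range(25, 64, 2)    # 25 to 63, odd sizes only (assumed): 20 values

grid = list(product(coverage_cutoffs, kmer_sizes))
print(len(grid))  # 380


def contains_complete_transcript(contigs, transcript):
    """True if the full transcript sequence occurs inside at least one contig."""
    return any(transcript in contig for contig in contigs)


# Hypothetical toy assembly and transcript.
toy_contigs = ["TTACGTACGGA", "GGCATT"]
print(contains_complete_transcript(toy_contigs, "ACGTACG"))  # True
print(contains_complete_transcript(toy_contigs, "ACGTTTT"))  # False
```

Running the screen for every (cutoff, k) pair and every gene of interest gives the kind of presence/absence matrix the slide describes.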
