RNA-seq Sequence and Expression Profile Analysis 唐海宝 (Tang Haibao)
Measuring Expression • What & Why • What is expression and why do we care? • Platforms / Technology • Closed approaches – Microarray • Open approaches - Sequencing • Experimental Design • Quality control • Analysis • “Align-first” • “Assemble-first” • Statistical Issues and Analysis
What is expression / transcriptome • DNA • RNA species: mRNA, tRNA, rRNA, siRNA, microRNA, piRNA, tasiRNA, lncRNA
Why expression? • Genome, Transcriptome, Proteome • High-throughput friendly • Context dependent • Regulatory network • Predicts biology
Measuring Expression ? Parts Description • Function? • Interconnectedness? • Comparisons • Population - level • Between genomes
Measuring Expression ? • What are important members of a transcriptome? • mRNA • polyadenylated, coding • alternatively spliced • Noncoding RNA (small RNA) • varying lengths, functions (18 – 32 bases) • microRNA, siRNA, piRNA, tasiRNA, long non-coding RNA • “Dark” RNA • transcription outside of annotated genes • Non-polyadenylated • Anti-sense transcription
Measuring Expression ? • How does the transcriptome vary to give rise to phenotype ? • Changes in Abundance • Abundance = Rate of Transcription – Rate of Decay • Changes in Function • Availability for function – polyadenylation, silencing, localisation • Suitability for function – alternate splicing
Single and two colour arrays • Array manufacture (probe library) • Labelling: single colour (Sample A) or two colour (Experimental and Control) • Hybridisation • Scanning
Array profiling Affymetrix Array Targets • Arabidopsis Genome 24,000 • C. elegans Genome 22,500 • Drosophila Genome 18,500 • E. coli Genome 20,366 • Human Genome U133 Plus 47,000 • Mouse Genome 39,000 • Yeast Genome • S. cerevisiae 5,841 • S. pombe 5,031 • Rat Genome 30,000 • Zebrafish 14,900 • Plasmodium / Anopheles • P. falciparum 4,300 • A. gambiae 14,900 • Barley (25,500), Soybean (37,500 + 23,300 pathogen), Grape (15,700) • Canine (21,700), Bovine (23,000) • B. subtilis (5,000), S. aureus (3,300 ORFs), Xenopus (14,400)
Closed System – Microarray • Pros • High-throughput • Targeted profiling • Inexpensive – “population friendly” • Analytical methods are standardised • Cons • “Closed system”, novel = invisible • Difficult to see allele-specific expression • Biases due to hybridisation • SNPs • Competitive and non-specific hybridisation
Open systems – RNA Sequencing Technology: • Illumina • SOLiD, IonTorrent • 454 Pros: • Transcript discovery • Allelic expression • High resolution abundance measures Cons: • Analysis can be complex • Expensive • Sensitivity is sequencing depth dependent
How RNA-seq data is generated • Isolate transcript RNA • Reverse transcription • Fragment cDNA • Size selection • Illumina sequencing of each end *based on Illumina approach **strand-specific RNA-seq protocols exist for both Illumina and SOLiD Slide compliments of Andrew McPherson
RNA Sequencing Mortazavi et al., 2008
RNASeq - Correspondence • Range > 5 orders of magnitude • Better detection of low abundance transcripts Marioni et al., 2009
Platform Choice / Sample Preparation Choice What do you want to profile ? • Polyadenylated • PolyA RNA extraction • Small RNA (< 100 bases) • Size filtering by gel • Strand-specific • RNA – Protein Interactions • RNA Immunoprecipitation (IP)
RNASeq - Workflow • Sample • Total RNA / PolyA RNA / Small RNA • Library Construction • Sequencing • Base calling & QC • Mapping to Genome or Assembly to Contigs • Outputs: Differential Expression, SNP detection, Transcript structure, Secondary structure, Targets or Products
Strand-specificity • Two approaches: using adaptors, using chemical modification • Ligation: 3’ and 5’ adaptors added sequentially • dUTP: addition and removal after selection • SMART: addition of C’s on 5’ end Levin et al., 2010
Strand-specific data Levin et al., 2010
Non-polyA methods • Total RNA extraction • Ribosomal RNA and tRNA > 95-97% of total RNA • Ribosomal reduction methods • Subtractive hybridisation with rRNA probes • Exonuclease cleavage of rRNA • NuGen – “proprietary combination of reverse transcriptase and primers in the Ovation RNA-Seq System” • cDNA normalisation methods • Partial digestion of any highly abundant species (Evrogen)
RNASeq Experimental Design • Issues: • Single-end vs Paired-end • sequencing depth - how much ? • number of replicates – how many ?
Depth • Sequencing depth is the average read coverage of the target sequences • Sequencing depth = total number of reads × read length / estimated target sequence length • Example: for a 5 Mb transcriptome, if 1 million 50 bp reads are produced, the depth is 1 M × 50 bp / 5 Mb = 10X
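The depth formula on this slide can be sketched in a few lines of Python. The function name `sequencing_depth` is only illustrative (it is not from the slides); the numbers are the ones in the example above.

```python
def sequencing_depth(num_reads, read_length, target_length):
    """Average coverage = total sequenced bases / estimated target length."""
    if target_length <= 0:
        raise ValueError("target_length must be positive")
    return num_reads * read_length / target_length


# Slide example: 1 million 50 bp reads over a 5 Mb transcriptome.
depth = sequencing_depth(1_000_000, 50, 5_000_000)
print(f"{depth:.0f}X")
```

Because the target length is itself an estimate, the result is best read as a rough planning figure rather than an exact coverage.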
Defining Replicates • Multiplexing: Library 1 to Library 4 pooled in Lane 1 (L1, L2, L3, L4) gives 25% of a lane per sample • Technical Replicates • Biological Replicates • [Figure: libraries from a single individual, or from Individual 1 and Individual 2, are assigned to lanes at 100% of a lane per sample; one layout is labelled Depth = 2 x]
Number of Replicates • For differential expression, there is more information in biological replicates than in depth • [Figure legend: edgeR <= 0.01, DESeq <= 0.01]
RNASeq – Compositional properties Depth of Sequence • Sequence count ≈ Transcript Abundance • Majority of the data can be dominated by a small number of highly abundant transcripts • Ability to observe transcripts of smaller abundance is dependent upon sequence depth
Quality control • Is NOT NEEDED, if: • Reads are in the right format • Read quality is good • Phred score per base & per sequence >= 20 (better if >= 30) • No contamination is detected • Paired reads are synchronized • Poor mapping efficiency of PE reads is symptomatic of de-synchronization
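A minimal sketch of the per-sequence and per-base Phred checks listed above. It assumes the FASTQ quality strings use the Phred+33 encoding, which the slides do not state; the function names and the toy quality strings are hypothetical.

```python
def phred_scores(quality, offset=33):
    """Decode a FASTQ quality string into Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality]


def read_passes(quality, min_q=20, offset=33):
    """Per-sequence check: the mean Phred score of the read is at least min_q."""
    scores = phred_scores(quality, offset)
    return bool(scores) and sum(scores) / len(scores) >= min_q


def per_base_means(qualities, offset=33):
    """Per-base check: mean Phred score at each position across reads.

    zip() stops at the shortest read, so equal-length reads are assumed.
    """
    columns = zip(*(phred_scores(q, offset) for q in qualities))
    return [sum(col) / len(col) for col in columns]


# Toy quality strings: 'I' = Q40, '+' = Q10, '#' = Q2 under Phred+33.
reads = ["IIII", "II##", "++++"]
for q in reads:
    print(q, "pass" if read_passes(q) else "fail")
print(per_base_means(reads))
```

In practice a dedicated QC tool would report these summaries; the sketch only shows what the Q20 / Q30 thresholds are measuring.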
RNASeq Analysis • Overall Aim : • To get an accurate measurement of transcript abundance, structure and identity • Alignment • Bowtie / TopHat / Cufflinks • Assembly • Trinity
Two assembly strategies • There is no one ‘correct’ way to analyze RNA-seq data (though there are some incorrect ways) • Two major branches 1. Direct alignment of reads (spliced or un-spliced) to genome or transcriptome 2. Assembly of reads followed by alignment • Assembly is the only option when working with a creature with no genome sequence Image from Haas & Zody, 2010
RNASeq – Alignment Considerations • Reads with multiple locations • Discard / Random Allocation • Clustering - local coverage • Weighting • Reads Spanning Exons • Make and align to exon junction libraries • De novo junction detection • Summarisation of counts • Exons • Transcript boundaries • Inferred read boundaries
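One way to read the "Weighting" option for reads with multiple locations is fractional allocation: a read that aligns to n features contributes 1/n to each. The sketch below illustrates that idea only; it is an assumption about what the slide means, not the method of any particular aligner, and the function name, read ids and gene names are hypothetical. Setting `max_sites` shows the alternative "Discard" option for reads with too many locations.

```python
from collections import defaultdict


def weighted_counts(alignments, max_sites=None):
    """Sum fractional read counts per feature.

    alignments maps a read id to the list of features it aligns to.
    Reads with more than max_sites locations are discarded when set.
    """
    counts = defaultdict(float)
    for read_id, features in alignments.items():
        n = len(features)
        if n == 0:
            continue  # unaligned read
        if max_sites is not None and n > max_sites:
            continue  # too ambiguous: discard
        for feature in features:
            counts[feature] += 1.0 / n  # each location gets an equal share
    return dict(counts)


toy = {
    "read1": ["geneA"],
    "read2": ["geneA", "geneB"],
    "read3": ["geneB", "geneC", "geneD", "geneE"],
}
print(weighted_counts(toy))
print(weighted_counts(toy, max_sites=2))
```

Equal shares are the simplest choice; the "Clustering - local coverage" bullet suggests weighting by nearby unique coverage instead, which this sketch does not attempt.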
Which aligner to choose? • Paired end alignment improves sensitivity • Use of (trusted) base quality could improve alignment • BWA and Bowtie are the fastest (~7Gbp per CPU day) Li, H and Homer, N (2010) Briefings in Bioinformatics 11:473
TopHat Multimapping : ≤10 sites Assembly : consensus ‘island’ exon Trapnell et al., 2009; Roberts et al., 2011
Cufflinks Trapnell, C., et al. Transcript assembly and quantification by RNA‐Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature Biotechnology (2010)
Alignment • Great if you have a fully annotated reference • Okay if you have a partially annotated reference • “Different” if you have a big bunch of ESTs Options: • Align to a neighbouring genome or EST library • De novo transcriptome assembly Tools: • ABySS, MIRA, Trinity, HTSeq, SAMtools
Assembly – Kmer graphs K = 4 Miller et al., 2010
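A small sketch of a k-mer graph with K = 4, assuming the usual construction in which nodes are k-mers and an edge joins two k-mers that occur consecutively in a read (so they overlap by K - 1 bases). The reads are made-up toy sequences, not data from the slides.

```python
from collections import defaultdict


def kmer_graph(reads, k=4):
    """Build a directed k-mer graph: node -> sorted list of successor k-mers."""
    edges = defaultdict(set)
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        # Consecutive k-mers in a read overlap by k-1 bases.
        for left, right in zip(kmers, kmers[1:]):
            edges[left].add(right)
    return {node: sorted(successors) for node, successors in edges.items()}


# Two overlapping toy reads, plus one read whose last base differs.
reads = ["AACCGG", "CCGGTT", "AACCGT"]
for node, successors in kmer_graph(reads).items():
    print(node, "->", successors)
```

The third read makes node ACCG branch to both CCGG and CCGT; a short dead-end branch like this is how a sequencing error at the end of a read appears in the graph.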
Assembly – Kmer graphs • Spurs • Sequencing error • Bubbles • Sequencing error • Polymorphism • Frayed Rope / Cycles • Repeats Miller et al., 2010
De novo transcriptome assembly • Desirable features: • Will run on reasonable computer resources for large genomes (e.g. < 1 TB of RAM) • Paired end data handling • Platform flexible • Handles haplotype complexity and polyploid genomes • Assemblers: • ABySS • MIRA • Trinity • Velvet • AllPaths • SOAPdenovo • Euler • CABOG • Edena • SHARCGS • VCAKE • SSAKE • CAP3
Assessing assembly quality ? • Comparisons between assembly algorithms • Contig summary statistics • Comparisons to known resources (e.g. ESTs) Trial on Rice Transcriptome: • 120 million 75 bp single end Illumina reads – embryo • ABySS : • Number of contigs = 6,804 • Contig length range = 38 – 2,818 [mean = 203] • Database comparisons : • Rice public cDNA sequences : 67,393 • Contigs with high quality matches to cDNA : 6,555 (96%)
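Contig summary statistics of the kind reported above (count, length range, mean) can be computed as in this sketch. N50 is added here as a commonly used extra statistic; it does not appear on the slide. The function name and the toy contig lengths are hypothetical and are not the rice data.

```python
def contig_stats(lengths):
    """Return count, min, max, mean and N50 for a list of contig lengths."""
    if not lengths:
        raise ValueError("no contigs given")
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    # N50: length of the contig at which the running sum of the
    # longest contigs first reaches half of the total assembly length.
    running = 0
    n50 = ordered[-1]
    for length in ordered:
        running += length
        if running * 2 >= total:
            n50 = length
            break
    return {
        "count": len(ordered),
        "min": ordered[-1],
        "max": ordered[0],
        "mean": total / len(ordered),
        "n50": n50,
    }


print(contig_stats([100, 200, 300, 400, 500]))
```

Comparing these numbers across assemblers or parameter settings gives a first view of assembly quality, which the database comparison against known cDNAs then complements.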
Assembly QC • All 380 assemblies (coverage cutoffs 2 to 20 and k-mer sizes 25 to 63) were screened for complete transcripts of five genes. • An assembly was marked in grey if a complete transcript was present, and in black otherwise. Gruenheit et al., 2012