RNA-Seq: An assessment of technical reproducibility and comparison with gene expression arrays • Wei Zhang, University of Minnesota - Twin Cities • March 21st, 2011
Outline • DNA Microarrays • RNA-Seq Overview • Experimental Design • Normalization of RNA-Seq Data • RNA-Seq Global Data Properties • Identifying Differentially Expressed Genes • Comparison of Results Across Technologies
DNA Microarrays • Since the mid-1990s, DNA microarrays have been the technology of choice for large-scale studies of gene expression levels. • The ability of these arrays to simultaneously interrogate thousands of transcripts has led to important advances in a wide range of biological problems: • Identification of differentially expressed genes. • New insights into developmental processes and pharmacogenomic responses. • Evolution of gene regulation.
Limitations of DNA Microarrays • Post-transcriptional regulation: not all mRNAs are translated. • Arrays seriously limit the detection of RNA splice patterns and previously unmapped genes. • Lack of synchronization: translation rates, alternative splicing, translation lag.
RNA-Seq Overview • High-throughput sequencing technology for sequencing RNAs (in practice, cDNAs that carry the RNAs' content). • Avoids the need for bacterial cloning of the cDNA input. • The resulting sequence reads are individually mapped to the source genome and counted to obtain the number and density of reads corresponding to RNA from each known exon.
RNA-Seq Motivation • Allows researchers to obtain information such as: • gene/transcript/exon expression • alternative splicing • gene fusions • post-transcriptional mutations • single nucleotide variations
RNA-Seq Details • Briefly, long RNAs are first converted into a library of cDNA fragments through either RNA fragmentation or DNA fragmentation. Sequencing adaptors (blue in the figure) are subsequently added to each cDNA fragment, and a short sequence is obtained from each cDNA using high-throughput sequencing technology. The resulting sequence reads are aligned with the reference genome or transcriptome and classified into three types: exonic reads, junction reads and poly(A) end-reads. These three types are used to generate a base-resolution expression profile for each gene.
Fusion Gene • A fusion gene is a hybrid gene formed from two previously separate genes. It can occur as the result of a translocation, interstitial deletion, or chromosomal inversion. Often, fusion genes are oncogenes. • A large set of reads would be expected to map to known exon-exon junctions. The remaining unmapped short reads would then be further analyzed to determine whether they match an exon-exon junction where the exons come from different genes.
Normalization Method • Reads that fell onto exons were summed for each locus and normalized by the predicted mRNA length into an expanded exonic read density. • Reads per kilobase per million mapped reads (RPKM): $\mathrm{RPKM} = \frac{10^9 \cdot C}{N \cdot L}$, where • C is the number of mappable reads that fell onto the gene's exons • N is the total number of mappable reads in the experiment • L is the sum of the exon lengths in base pairs
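The RPKM definition above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used in the study: the function name `rpkm` and the example numbers are assumptions made for the sketch, and the factor of 10^9 combines the per-kilobase (10^3) and per-million-reads (10^6) scalings.

```python
def rpkm(exon_reads, total_mapped_reads, exon_length_bp):
    """Reads per kilobase of exon per million mapped reads.

    exon_reads         -- C, mappable reads falling on the gene's exons
    total_mapped_reads -- N, total mappable reads in the experiment
    exon_length_bp     -- L, summed exon length in base pairs
    """
    if total_mapped_reads <= 0 or exon_length_bp <= 0:
        raise ValueError("N and L must be positive")
    # 1e9 = 1e3 (per kilobase) * 1e6 (per million reads)
    return 1e9 * exon_reads / (total_mapped_reads * exon_length_bp)


# Hypothetical gene: 500 reads, 2,000 bp of exons, 10 million mapped reads.
example = rpkm(500, 10_000_000, 2_000)
print(example)
```

Because the value is scaled by both gene length and sequencing depth, two genes with the same RPKM but different lengths will have different raw read counts.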
RNA-Seq Global Data Properties • In the technical replicate data, the Spearman correlation is 0.96. This result suggests that RNA-Seq is highly reproducible.
RNA-Seq Global Data Properties • Distribution of uniquely mappable reads across gene parts in the liver sample: 93% of the reads fall onto exons or predicted exon regions, 4% fall onto introns, and 3% fall in intergenic regions.
Identifying Differentially Expressed Genes • Affymetrix microarray: t-test • RNA-Seq: likelihood ratio test • Each procedure leads to a P-value for each gene. The significance threshold to control the FDR at a given value was calculated using the method of Storey and Tibshirani (2003).
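As a rough illustration of how per-gene P-values can be turned into FDR estimates, the sketch below computes q-values in the spirit of Storey and Tibshirani (2003). It is a simplification under stated assumptions: the proportion of true nulls (`pi0`) is estimated at a single fixed tuning value `lam = 0.5` rather than by the smoothing over many values that the published method describes, and the function name `q_values` and the example P-values are hypothetical.

```python
def q_values(p_values, lam=0.5):
    """Simplified q-value estimate from a list of P-values.

    Assumption: pi0 (fraction of true nulls) is estimated at one fixed
    tuning value `lam`, not by the spline fit of the published method.
    """
    m = len(p_values)
    if m == 0:
        return []
    # P-values above lam are assumed to come mostly from true nulls.
    pi0 = min(1.0, sum(p > lam for p in p_values) / (m * (1.0 - lam)))
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so q-values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pi0 * m * p_values[i] / rank)
        q[i] = running_min
    return q


# Hypothetical P-values for four genes.
p = [0.001, 0.01, 0.6, 0.9]
q = q_values(p)
significant = [i for i, qi in enumerate(q) if qi <= 0.05]
print(q, significant)
```

Calling a gene significant when its q-value is at most 0.05 corresponds to controlling the estimated FDR at 5%. With `pi0` fixed at 1, the same loop reduces to the Benjamini-Hochberg adjustment.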
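Likelihood ratio test • Given samples from group A and group B, estimate $\lambda_A$, $\lambda_B$ and $\lambda$ for the Poisson model, where $\lambda$ is the average of all the samples. • Now compute the likelihood ratio as follows: $\Lambda = \frac{\prod_{x \in X} \mathrm{Pois}(x; \lambda)}{\prod_{x \in X_A} \mathrm{Pois}(x; \lambda_A) \, \prod_{x \in X_B} \mathrm{Pois}(x; \lambda_B)}$, where $X_A$, $X_B$ and $X$ are the samples in A, the samples in B and all samples. • $\mathrm{Pois}(k; \lambda) = \frac{\lambda^k e^{-\lambda}}{k!}$ is the probability that there are exactly k occurrences.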
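A minimal Python sketch of a Poisson likelihood ratio test for one gene follows. Several choices here are assumptions rather than details given on the slide: the group rates are estimated as the group means, the ratio is taken as pooled-rate likelihood over separate-rate likelihood, the statistic -2 log(ratio) is compared with a chi-square distribution with one degree of freedom, and all samples are treated as having equal sequencing depth. The name `poisson_lrt` and the example counts are hypothetical.

```python
import math


def poisson_log_pmf(k, lam):
    """log of Pois(k; lam) = lam^k * exp(-lam) / k!"""
    if lam == 0:
        return 0.0 if k == 0 else float("-inf")
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def poisson_lrt(counts_a, counts_b):
    """Return (statistic, p_value) for one gene's read counts.

    Assumes equal sequencing depth across samples (no offset term).
    """
    all_counts = list(counts_a) + list(counts_b)
    lam_a = sum(counts_a) / len(counts_a)    # rate in group A
    lam_b = sum(counts_b) / len(counts_b)    # rate in group B
    lam = sum(all_counts) / len(all_counts)  # pooled rate (null model)

    log_null = sum(poisson_log_pmf(x, lam) for x in all_counts)
    log_alt = (sum(poisson_log_pmf(x, lam_a) for x in counts_a)
               + sum(poisson_log_pmf(x, lam_b) for x in counts_b))

    # -2 log(likelihood ratio); clamp tiny negative rounding errors.
    stat = max(0.0, -2.0 * (log_null - log_alt))
    # Chi-square survival function with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical counts for one gene in two groups of two samples each.
stat_same, p_same = poisson_lrt([10, 12], [10, 12])
stat_diff, p_diff = poisson_lrt([100, 110], [10, 12])
print(stat_same, p_same, stat_diff, p_diff)
```

Working with log-likelihoods avoids the numerical underflow that multiplying many small Poisson probabilities would cause.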
Comparing counts from Illumina sequencing with normalized intensities from the array • Compare the number of sequence reads mapped to each gene with the corresponding (normalized) absolute intensities from the array. • These two independent measures of transcript abundance are highly correlated (Spearman correlation = 0.73 for liver, 0.75 for kidney). • The array intensities are large and the sequence counts small.
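The Spearman correlation used in this comparison is the Pearson correlation of the ranks, so it only asks whether the two platforms order genes the same way, which suits measurements on different scales such as read counts and array intensities. Below is a small self-contained sketch; the function names and the four example genes are hypothetical and are not data from the study.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences of length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)


# Hypothetical per-gene values: read counts and array intensities.
read_counts = [5, 40, 300, 1200]
intensities = [60.0, 150.0, 900.0, 700.0]
rho = spearman(read_counts, intensities)
print(rho)
```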
Comparison of estimated log2 fold changes from Illumina and Affymetrix • Figure legend: red, DE by ILM (>250); green, DE by ILM (<250); black, not DE by ILM