530 likes | 758 Views
Cufflinks. Matt Paisner , Hua He, Steve Smith and Brian Lovett. The Vision. RNAseq can be used for transcript discovery and abundance estimation
E N D
Cufflinks Matt Paisner, Hua He, Steve Smith and Brian Lovett
The Vision • RNAseq can be used for transcript discovery and abundance estimation • What’s missing: algorithms which aren’t restricted by prior gene annotations (which are often incomplete) and account for alternative transcription and splicing. • Hence, Cufflinks.
The Need • Evidence of ambiguous assignment of isoforms. TSS site/promoter changes and splice site changes were found previously by the authors • Longer reads and pair end reads do not do enough
The Biology • General assumption of randomization of reads • Central Dogma • Transcription Start Site (TSS) • Splice site • Isoform
Splicing • Thisblahblahblahblahblahblahisblahblahimportant • Thisblahblahblahblahblahblahisblahblahimportant • “This” “is” “important” - Exons • “blah” - Introns (Intrusions)
Why it matters: Isoforms • Not only different sizes, but different shapes • Shape determines function • Isoforms would map to the same section of the genome: undetected without Cufflinks • Separating transcripts into isoforms elucidates a more realistic representation of what is happening
TopHat Mapping short reads Trapnellet. al,Bioinformatics, 2009
TopHat • No genome reference annotations are needed • The output of TopHat is the input of Cufflinks. • Input: Reads and genome • Output: Read mappings • Short reads present computational challenges • BOWTIE
How does TopHat Work?! Big Idea: “Exon Inference”!!
Step 1: Initial Mapping via Bowtie • Group 1: Mapped Reads (Segments) • Group 2: Initially Unmapped (IUM)Reads • possibly intron-spanning read • Based on Group 1, we want to get intron-spanning reads from Group 2 Reference Mapped Reads
Step 3: Look for Potential Splice Signals Putative Exons
Cufflinks Isoform/Transcript Detection and Quantification Trapnellet al, Nature Biotech, 2010
Step 5: Identify Compatible Reads Two reads are compatible if their overlap contains the exact same implied introns (or none). If two reads are not compatible they are incompatible.
Step 6: Less BIOLOGY, and NOW it is the time for some GRAPH THEORIES……. “We emphasize that the definition of a transcription locus is not biological……” - Authors
Step 6: Create Overlap Graph Connect compatible reads in order Create a DAG
Theory • Solving minimum pathcover (isoforms) in the overlap graph implies the fewest transcriptsnecessary to explain the reads. • Solve minimumpath cover by finding largest set of individual reads such that no two are compatible. • According to Dilworth Thereom, find a maximum matching in a bipartitegraph
Step 9: Looking for Maximum Matching inside a bipartite graph via Bipartite Matching Algorithm BIPARTITE-MATCHING Algorithm: Add augmenting path via BFS, repeatedly adding the paths into the matching until none can be added.
Projective normalization underestimates expression isoform a isoform b project all isoforms into genome coordinates • R reads total, r reads for the gene: • ra for isoform a • rb for isoform b so but
How should expression levels be estimated? • A-B are distinguished by the presence of splice junction (a) or (b). • A-C are distinguished by the presence of splice junction (a) and change in UTR • B-C are distinguished by the presence of splice junction (b) and change in UTR (a) (b)
How should expression levels be estimated? • Longer transcripts contain more reads. • Reads that could have originated from multiple transcripts are informative. • Relative abundance estimation requires “discriminatory reads”. (a) (b)
A model for RNA-Seq • r = Transcript proportions for assignment of reads to transcripts • L = Likelihood of this assignment • R = all reads
A model for RNA-Seq • T = All transcripts
A model for RNA-Seq • Define: • Expected possible positions for an arbitrary fragment in Transcript t • F(i) = pr(random fragment has length i) • l(t) = Full length of transcript t
A model for RNA-Seq • It (r) = Implied length of r’s fragment if r is assigned to transcript t • Recall: F(i) = pr(fragment length = i)
Projective normalization underestimates expression isoform a isoform b project all isoforms into genome coordinates • R reads total, r reads for the gene: • ra for isoform a • rb for isoform b so but
A model for RNA-Seq • Now we have a maximum likelihood function in terms of r, the distribution of reads among transcripts. • Non-negative linear model
Inference with the sequencing model • Maximum likelihood function is concave - optimization using the EM algorithm. • Asymptotic MLE theory leads to a covariance matrix for the estimator in the form of the inverse of the observed Fisher information matrix • Importance sampling from the posterior distribution used for estimating the abundances from the posterior expectation, and 95% confidence intervals for the estimates. • This approach extends the log linear model of H. Jiang and W. Wong, Bioinformatics 2009 to a linear model for paired end reads. • For more background see Li et al., Bioinformatics, 2010 and Bullard et al., BMC Bioinformatics, 2010.
Utility of Cufflinks • mRNA as proxy for gene expression & action • Control points • transcriptional vs • post transcriptional • Does isoform-level discovery & quantification matter? • Apparently, yes • Putatively discovered about 12K new isoforms while recovering about 13K known • Plus other stuff…
The skeletal myogenesistranscriptomeRNA-Seq (2x75bp GAIIx) along time course of mouse C2C12 differentiation 168 hours 120 hours -24 hours 60 hours myotube myoctyte • 123,575,666reads • 84,369,078 reads • 140,384,062reads • 82,138,212reads • 66,541,668alignments • 103,681,081alignments • 47,431,271alignments • 89,162,512alignments • 10,754,363to junctions • 19,194,697to junctions • 9,015,806to junctions • 17,449,848to junctions • 58,008transfrags • 69,716transfrags • 55,241transfrags • 63,664transfrags fusion differentiation (starting at 0 hours) Illustration based on: Ohtake et al, J. Cell Sci., 2006; 119:3822-3832 Slide courtesy of Hector CorradaBravo
Projective normalization underestimates expression isoform a isoform b project all isoforms into genome coordinates • R reads total, r reads for the gene: • ra for isoform a • rb for isoform b so but Slide courtesy of Hector CorradaBravo
Discovery is necessary for accurate abundance estimates Slide courtesy of Hector CorradaBravo
Some Questions… • Do isoforms of a given gene have interesting temporal patterns? • Increasing, decreasing, more complex… • What does this mean biologically? • What about transcriptional versus post transcriptional regulation? • Differential transcription • Differential splicing
Dynamics of Myc expression Slide courtesy of Hector CorradaBravo
Overloading Metric using Jensen-Shannon Divergence Average Entropy Entropy of Average Metric: One-sided t-test under the null hypothesis that there is no difference in abundance; Type I errors controlled with Benjamini-Hotchbergcorrection (FDR)
Regulatory Overloading # Genes (FDR < 0.05) Differential splicing Fibronectin Tropomyosin 1 Mef2d … 231 17 Fhl3 Fhl1 Myl1 … 101 Differential TSS preference
Dynamics of Myc expression d( , ) Slide courtesy of Hector CorradaBravo
New TSS = New Points of Regulation Whatwould a “collapsed” RNA-seq alignment look like? Microarray? TSS=Transcription Start Site