Finding, Aligning and Analyzing Non-Coding RNAs. Cédric Notredame, Comparative Bioinformatics Group, Bioinformatics and Genomics Program
They are Everywhere… • And ENCODE said… “nearly the entire genome may be represented in primary transcripts that extensively overlap and include many non-protein-coding regions” • Who Are They? • tRNA, rRNA, snoRNAs • microRNAs, siRNAs • piRNAs • long ncRNAs (Xist, Evf, Air, CTN, PINK…) • How Many of Them? • Open question • 30,000 is a common guess • Harder to detect than proteins
Searching • “…When Looking for a Needle in a Haystack, the Optimist Wears Gloves…”
ncRNAs Can Evolve Rapidly
[Figure: two hairpin structures; stem labels CTTGCCTCC, GAACGGACC, CTTGCCTGG, GAACGGAGG]
CCAGGCAAGACGGGACGAGAGTTGCCTGG
CCTCCGTTCAGAGGTGCATAGAACGGAGG
**-------*--**---*-**------**
ncRNAs are Difficult to Align
CCAGGCAAGACGGGACGAGAGTTGCCTGG
CCTCCGTTCAGAGGTGCATAGAACGGAGG
**-------*--**---*-**------**
Regular Alignment
--CCAGGCAAGACGGGACGAGAGTTGCCTGG
CCTCCGTTCAGAGGTGCATAGAACGGAGG--
---*-*---***-*-*---***---*-----
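Counting identical columns in the two alignments of this slide shows why a sequence-only method prefers the shifted one. This is a small sketch; the function name `identities` is made up for illustration.

```python
# Sketch: count identical, non-gap columns in each of the two alignments.
def identities(row_a, row_b):
    """Number of columns where both rows carry the same nucleotide."""
    return sum(1 for x, y in zip(row_a, row_b) if x == y and x != "-")

seq1 = "CCAGGCAAGACGGGACGAGAGTTGCCTGG"
seq2 = "CCTCCGTTCAGAGGTGCATAGAACGGAGG"

ungapped = identities(seq1, seq2)               # stems aligned to stems
shifted = identities("--" + seq1, seq2 + "--")  # the "Regular Alignment"

print("ungapped:", ungapped)  # 10
print("shifted: ", shifted)   # 11
```

The shifted alignment has one more identity, so an identity-driven aligner picks it even though it no longer places the stems on top of each other.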
ncRNAs are Difficult to Align • Same Structure, Low Sequence Identity • Small Alphabet, Short Sequences: Alignments Often Non-Significant
Obtaining the Structure of an ncRNA is Difficult • Hard to Align the Sequences Without the Structure • Hard to Predict the Structures Without an Alignment
The Holy Grail of RNA Comparison: Sankoff's Algorithm • Simultaneous Folding and Alignment • Time Complexity: O(L^{3n}) for n sequences of length L • Space Complexity: O(L^{2n}) • In Practice, for Two Sequences: • 50 nucleotides: 1 min., 6 M. • 100 nucleotides: 16 min., 256 M. • 200 nucleotides: 4 hours, 4 G. • 400 nucleotides: 3 days, 3 T. • Forget about • Multiple sequence alignments • Database searches
The next best Thing: Consan • Consan = Sankoff + a few constraints • Use of Stochastic Context Free Grammars • Tree-shaped HMMs • Made sparse with constraints • The constraints are derived from the most confident positions of the alignment • Equivalent of Banded DP
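To illustrate the "banded DP" analogy only, here is a minimal sketch of banded dynamic programming on plain sequence edit distance. It is not Consan's grammar-based algorithm: the function name and the fixed `band` width are assumptions made for illustration. Cells further than `band` from the main diagonal are never computed, which is where the saving comes from.

```python
# Sketch: edit distance restricted to a diagonal band (not Consan itself).
def banded_edit_distance(a, b, band):
    """Edit distance using only cells with |i - j| <= band.

    Returns None when the length difference alone exceeds the band,
    because the final cell then lies outside the band.
    """
    inf = float("inf")
    n, m = len(a), len(b)
    if abs(n - m) > band:
        return None
    # Row 0: turning the empty prefix of a into b[:j] costs j insertions.
    prev = {j: j for j in range(min(m, band) + 1)}
    for i in range(1, n + 1):
        cur = {}
        for j in range(max(0, i - band), min(m, i + band) + 1):
            if j == 0:
                cur[j] = i  # i deletions
                continue
            best = prev.get(j - 1, inf) + (a[i - 1] != b[j - 1])  # match or substitution
            best = min(best, prev.get(j, inf) + 1)                # deletion from a
            best = min(best, cur.get(j - 1, inf) + 1)             # insertion into a
            cur[j] = best
        prev = cur
    return prev[m]

print(banded_edit_distance("kitten", "sitting", band=1))
```

With a band of width w the work drops from about n × m cells to about n × (2w + 1), at the price of missing any alignment that strays outside the band.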
Consan for Databases: Infernal • Infernal is a Faster Version of Consan • For Database Search • Still Very Slow [Figure: receiver operating characteristic (ROC) comparison of Infernal with BLAST]
Consan for Databases: Infernal • BLAST: 360 s. • Fast Infernal: 182 000 s. • Slow Infernal: 5 320 000 s.
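The three timings are easier to compare once converted to days and to multiples of the BLAST run. A short sketch using only the numbers on the slide:

```python
# Sketch: express the slide's timings in days and relative to BLAST.
timings_s = {"BLAST": 360, "Fast Infernal": 182_000, "Slow Infernal": 5_320_000}
baseline = timings_s["BLAST"]
SECONDS_PER_DAY = 86_400

for name, seconds in timings_s.items():
    days = seconds / SECONDS_PER_DAY
    ratio = seconds / baseline
    print(f"{name}: {days:.2f} days, {ratio:.0f}x BLAST")
```

Fast Infernal comes out at roughly two days (about 500 times the BLAST run) and slow Infernal at roughly two months (about 15,000 times).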
Rfam: In Practice • Rfam contains RNA families • Families → Multiple Sequence Alignments → Models • Models are like Pfam Profiles • Use Consan or Cmsearch rather than HMMER • Much Slower • Too expensive to search the models • Models are used to build Rfam • People usually BLAST Rfam
Where do Rfam Families Come From? • Infernal Requires a Model • A Model Requires an MSA • The MSA Requires a Family • It All Starts with a BlastN (Rfam: Gardner et al., NAR 2008)
Can we make BlastN more accurate? • BlastN is not very accurate because: • Poor substitution models for Nucleic Acids • Low information density (4 symbols) • BlastN assumes • Equal evolution rates for all nucleotides • Independence from Neighbors
Love Thy Neighbor [Figure: measured nearest-neighbor dependencies on Rfam sequences]
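One common way to quantify nearest-neighbor dependence is the ratio of observed to expected dinucleotide frequency; whether the slide's figure used this exact statistic is an assumption. The sketch below uses a toy input, not Rfam data, and the function name is made up.

```python
# Sketch: observed/expected dinucleotide ratios as a neighbor-dependence measure.
from collections import Counter

def neighbor_dependence(seqs):
    """Map each observed dinucleotide to observed / expected frequency.

    Expected frequency assumes independent neighbors: p(x) * p(y).
    A ratio far from 1 suggests that neighbors are not independent.
    """
    mono, di = Counter(), Counter()
    for s in seqs:
        mono.update(s)
        di.update(s[i:i + 2] for i in range(len(s) - 1))
    n_mono, n_di = sum(mono.values()), sum(di.values())
    ratios = {}
    for pair, count in di.items():
        expected = (mono[pair[0]] / n_mono) * (mono[pair[1]] / n_mono)
        ratios[pair] = (count / n_di) / expected
    return ratios

toy = ["ACACACAC"]  # toy sequence in which A is always followed by C
for pair, ratio in sorted(neighbor_dependence(toy).items()):
    print(pair, round(ratio, 2))
```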
Measuring Di-Nucleotide Evolution • Each Nucleotide can be made more informative • It can incorporate the “name” of its Neighbor • AA => a • AG => b • AC => c • AT => d • … • A 16 Letter alphabet can be used to recode all nucleotide sequences • We name these extended Nucleotides
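The recoding can be sketched in a few lines. Only the first four codes (AA, AG, AC, AT) are given on the slide; the rest of the table below, the choice of the right-hand neighbor, and dropping the last nucleotide (which has no right neighbor) are all assumptions made for illustration.

```python
# Sketch: recode a nucleotide sequence into a 16-letter "extended" alphabet.
from itertools import product
import string

ORDER = "AGCT"  # neighbor order seen in AA, AG, AC, AT; reusing it for the first base is an assumption
ENCODE = {
    first + neighbor: letter
    for (first, neighbor), letter in zip(product(ORDER, repeat=2), string.ascii_lowercase)
}  # AA->a, AG->b, AC->c, AT->d, then a hypothetical continuation GA->e ... TT->p

def recode(seq):
    """Replace each nucleotide by the letter naming it and its right neighbor."""
    seq = seq.upper().replace("U", "T")
    return "".join(ENCODE[seq[i:i + 2]] for i in range(len(seq) - 1))

print(recode("AAGCT"))  # abgl
```

Any symbol outside A, C, G, T/U raises a `KeyError` here; a real tool would need a policy for ambiguity codes.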
Substitutions? • How much does it cost to turn one nucleotide into another? • Blosum/PAM-style matrix • Matrices estimated on Rfam families
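A Blosum-style matrix is a table of log-odds scores: how often two symbols are seen aligned, relative to chance. The sketch below shows that calculation on toy aligned pairs. It skips the sequence clustering step of the real BLOSUM procedure, and the function name, the `scale` factor and the input format are assumptions, not the procedure used to build the Rfam-derived matrices.

```python
# Sketch: log-odds substitution scores from pairwise aligned rows.
import math
from collections import Counter

def log_odds_matrix(aligned_pairs, scale=2.0):
    """Return {(x, y): score} from aligned row pairs; gap columns are ignored."""
    pair_counts, symbol_counts = Counter(), Counter()
    for row_a, row_b in aligned_pairs:
        for x, y in zip(row_a, row_b):
            if x == "-" or y == "-":
                continue
            pair_counts[tuple(sorted((x, y)))] += 1  # unordered pair
            symbol_counts[x] += 1
            symbol_counts[y] += 1
    total_pairs = sum(pair_counts.values())
    total_symbols = sum(symbol_counts.values())
    scores = {}
    for (x, y), count in pair_counts.items():
        observed = count / total_pairs
        px = symbol_counts[x] / total_symbols
        py = symbol_counts[y] / total_symbols
        expected = px * py if x == y else 2 * px * py  # unordered pairs occur two ways
        scores[(x, y)] = scores[(y, x)] = round(scale * math.log2(observed / expected))
    return scores

toy_scores = log_odds_matrix([("aabb", "aabb")])
print(toy_scores)
```

The same function works on the 4-letter alphabet or on recoded 16-letter sequences, since it only counts symbols.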
Using BlastR • When Nucleic Acids Look Like Proteins • They can be aligned with Protein Methods • BlastN → BlastP • BlastP with eRNA is BlastR
Benchmarking BlastR [Figure: Query, Blast, Rfam; hits ordered by E-values and labeled PP / PN]
Benchmarking BlastR [Figure: Rfam 001, Rfam 002, … each run with Blast against Rfam; results summarized as ROC]
Benchmarking BlastR [Figure: ROC plot; axes: False Positives, True Positives; regions labeled Good and Bad]
Benchmarking BlastR [Figure: ROC plot; axes: False Positives, True Positives; regions labeled Good and Bad; Area Under Curve (AUC): Small AUC = Better]
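The area measure can be sketched from a ranked hit list. The orientation here is an assumption read from the slide's labels: false positives are accumulated against true positives, so a small area is better. The function name and the toy hit list are hypothetical.

```python
# Sketch: normalised area under a false-positives-vs-true-positives curve.
def fp_vs_tp_area(ranked_is_true):
    """ranked_is_true: booleans for hits in rank order (best E-value first).

    Returns a value in [0, 1]: 0 when every true hit outranks every false
    hit, 1 when every false hit outranks every true hit.
    """
    total_true = sum(ranked_is_true)
    total_false = len(ranked_is_true) - total_true
    if total_true == 0 or total_false == 0:
        raise ValueError("need at least one true and one false hit")
    area = 0
    false_seen = 0
    for is_true in ranked_is_true:
        if is_true:
            area += false_seen  # false positives already ranked above this true hit
        else:
            false_seen += 1
    return area / (total_true * total_false)

# Toy hits: (E-value, belongs to the query's family)
hits = [(1e-8, True), (0.5, False), (1e-30, True), (1e-12, False)]
labels = [is_true for _, is_true in sorted(hits)]
print(fp_vs_tp_area(labels))  # 0.25
```

With no tied E-values, this quantity equals one minus the conventional ROC AUC, which is why smaller is better in this orientation.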
The 3 Components of BlastR • BlastP is better than BlastN • BlosumR makes BlastP a little bit better • Blast: wuBlast
The 3 Components of BlastR • BlastP is better than BlastN • BlosumR makes BlastP a little bit better • And Faster
BlastR and Clustering • Given all Rfam in Bulk • How good is BlastR at reconstituting all the families? [Figure: plot of Sensitivity against 1-Specificity]
BlastR: In Practice [Figure: BlastR compared with BlastN; E-Value Threshold: 10^-20]
Take Home • Searching Nucleotides is Difficult • BlastN is not a very good algorithm • Simple Adaptations can improve the situation • Changing the algorithm (BlastP) • Changing the Scoring Scheme (BlastP-Nuc) • Changing the alphabet (BlastR)