Exploring Non-Coding RNAs: Finding, Aligning, Analyzing

Discover the world of non-coding RNAs - tRNA, rRNA, microRNAs, and more. Uncover their structures, evolution, and alignment challenges. Learn about tools like Sankoff’s Algorithm, Consan, and Infernal for RNA comparison.

jmeade

Presentation Transcript


  1. Finding, Aligning and Analyzing Non-Coding RNAs • Cédric Notredame • Comparative Bioinformatics Group • Bioinformatics and Genomics Program

  2. They are Everywhere… • And ENCODE said… “nearly the entire genome may be represented in primary transcripts that extensively overlap and include many non-protein-coding regions” • Who Are They? • tRNA, rRNA, snoRNAs • microRNAs, siRNAs • piRNAs • long ncRNAs (Xist, Evf, Air, CTN, PINK…) • How Many of Them? • Open question • 30,000 is a common guess • Harder to detect than proteins

  3. Searching “…When Looking for a Needle in a Haystack, the Optimist Wears Gloves…”

  4. ncRNAs Can Have Different Sequences and Similar Structures

  5. ncRNAs Can Evolve Rapidly [Figure: secondary-structure drawings of the two sequences below]
       CCAGGCAAGACGGGACGAGAGTTGCCTGG
       CCTCCGTTCAGAGGTGCATAGAACGGAGG
       **-------*--**---*-**------**
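The two sequences on this slide can be checked by hand or with a few lines of code. The sketch below counts identical columns in the ungapped alignment and measures how many Watson-Crick pairs each sequence can form between its own 5' and 3' ends. The function names are illustrative, and G-U wobble pairs are deliberately ignored to keep the check simple.

```python
# Minimal sketch: low sequence identity, yet both sequences keep a terminal stem.
# Function names are illustrative; only Watson-Crick pairs are counted (no G-U).

SEQ1 = "CCAGGCAAGACGGGACGAGAGTTGCCTGG"
SEQ2 = "CCTCCGTTCAGAGGTGCATAGAACGGAGG"

WATSON_CRICK = {("A", "T"), ("T", "A"), ("G", "C"), ("C", "G")}


def identical_columns(a, b):
    """Count positions where two equal-length sequences carry the same base."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(1 for x, y in zip(a, b) if x == y)


def terminal_stem_length(seq):
    """Count consecutive base pairs formed between the 5' end and the 3' end."""
    length = 0
    while length < len(seq) // 2 and (seq[length], seq[-1 - length]) in WATSON_CRICK:
        length += 1
    return length


if __name__ == "__main__":
    same = identical_columns(SEQ1, SEQ2)
    print(f"identical columns: {same} of {len(SEQ1)}")
    print(f"terminal stem, sequence 1: {terminal_stem_length(SEQ1)} pairs")
    print(f"terminal stem, sequence 2: {terminal_stem_length(SEQ2)} pairs")
```

Only 10 of the 29 columns are identical, while the two sequences form terminal stems of 8 and 9 base pairs respectively: the paired positions have changed together.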

  6. ncRNAs are Difficult to Align
       CCAGGCAAGACGGGACGAGAGTTGCCTGG
       CCTCCGTTCAGAGGTGCATAGAACGGAGG
       **-------*--**---*-**------**
     Regular Alignment
       --CCAGGCAAGACGGGACGAGAGTTGCCTGG
       CCTCCGTTCAGAGGTGCATAGAACGGAGG--
          * *   *** * *   ***   *

  7. ncRNAs are Difficult to Align • Same Structure, Low Sequence Identity • Small Alphabet, Short Sequences → Alignments often Non-Significant

  8. Obtaining the Structure of a ncRNA is difficult • Hard to Align The Sequences Without the Structure • Hard to Predict the Structures Without an Alignment

  9. The Holy Grail of RNA Comparison: Sankoff’s Algorithm

  10. The Holy Grail of RNA Comparison: Sankoff’s Algorithm • Simultaneous Folding and Alignment • Time Complexity: O(L^(2n)) • Space Complexity: O(L^(3n)) • In Practice, for Two Sequences: • 50 nucleotides: 1 min, 6 MB • 100 nucleotides: 16 min, 256 MB • 200 nucleotides: 4 hours, 4 GB • 400 nucleotides: 3 days, 3 TB • Forget about • Multiple sequence alignments • Database searches
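As a rough check on the running times quoted on this slide, they can be extrapolated from the 50-nucleotide data point, assuming time grows as L to the power 2n with n = 2 sequences, that is as L^4. The function name and its default arguments are illustrative only, and the memory column is not modeled.

```python
# Rough sketch: extrapolate pairwise run time from the 50-nucleotide data point,
# assuming time scales as L**(2n) with n = 2 (exponent 4), as stated on the slide.
# The function name and defaults are illustrative, not part of any tool.

def extrapolate_minutes(length, ref_length=50, ref_minutes=1.0, exponent=4):
    """Scale a reference run time by (length / ref_length) ** exponent."""
    return ref_minutes * (length / ref_length) ** exponent


if __name__ == "__main__":
    for length in (50, 100, 200, 400):
        minutes = extrapolate_minutes(length)
        print(f"{length:4d} nt: {minutes:7.0f} min "
              f"= {minutes / 60:6.1f} h = {minutes / 1440:5.2f} days")
```

Doubling the length multiplies the time by 16, giving 16 minutes at 100 nucleotides, 256 minutes (a little over 4 hours) at 200, and 4096 minutes (just under 3 days) at 400, in line with the slide.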

  11. The next best Thing: Consan • Consan = Sankoff + a few constraints • Use of Stochastic Context Free Grammars • Tree-shaped HMMs • Made sparse with constraints • The constraints are derived from the most confident positions of the alignment • Equivalent of Banded DP

  12. Consan for Databases: Infernal • Infernal is a Faster version of Consan • For Database Search • Still Very Slow [Figure: Receiver operating characteristic (ROC) comparison of Infernal with BLAST]

  13. Consan for Databases: Infernal • BLAST: 360 s. • Fast Infernal: 182 000 s. • Slow Infernal: 5 320 000 s.

  14. Searching Databases for New RNAs

  15. Rfam: In practice • Rfam contains RNA families • Families → Multiple Sequence Alignment → Models • Models are like Pfam Profiles • Use Consan or Cmsearch rather than HMMer • Much Slower • Too expensive to search the models • Models are used to build Rfam • People usually BLAST Rfam

  16. Where do Rfam Families Come From? • Infernal Requires a Model • A Model Requires an MSA • The MSA Requires a Family • It all starts with a BlastN (Rfam, Gardner et al., NAR 2008)

  17. Can we make BlastN more accurate? • BlastN is not very accurate because: • Poor substitution models for Nucleic Acids • Low information density (4 symbols) • BlastN assumes • Equal evolution rates for all nucleotides • Independence from Neighbors

  18. Love Thy Neighbor Measured Nearest Neighbor Dependencies on Rfam sequences
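One simple way to look for nearest-neighbor dependence is to compare how often each dinucleotide is observed with how often it would be expected if neighboring positions were independent. The sketch below illustrates that idea only. It is not necessarily the measure plotted on this slide, and the function name is made up for the example.

```python
# Sketch: observed / expected dinucleotide ratios under an independence model.
# A ratio far from 1.0 suggests that a nucleotide depends on its neighbor.
# This illustrates the general idea; it is not claimed to be the slide's measure.
from collections import Counter


def dinucleotide_odds(sequences):
    """Return {dinucleotide: observed frequency / product of base frequencies}."""
    mono = Counter()
    di = Counter()
    for seq in sequences:
        seq = seq.upper()
        mono.update(seq)
        di.update(seq[i:i + 2] for i in range(len(seq) - 1))
    n_mono = sum(mono.values())
    n_di = sum(di.values())
    odds = {}
    for pair, count in di.items():
        expected = (mono[pair[0]] / n_mono) * (mono[pair[1]] / n_mono)
        odds[pair] = (count / n_di) / expected
    return odds


if __name__ == "__main__":
    for pair, ratio in sorted(dinucleotide_odds(["ACACAC"]).items()):
        print(pair, round(ratio, 2))
```

Dinucleotides that never occur in the input are simply absent from the result, so a real analysis would need enough sequence data for every pair to be observed.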

  19. High Rate of CpG mutations

  20. Measuring Di-Nucleotide Evolution • Each Nucleotide can be made more informative • It can incorporate the “name” of its Neighbor • AA => a • AG => b • AC => c • AT => d • … • A 16 Letter alphabet can be used to recode all nucleotide sequences • We name these extended Nucleotides
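The recoding described on this slide can be sketched as follows. The slide only gives the first four codes (AA, AG, AC, AT as a, b, c, d); the remaining twelve assignments below are a hypothetical continuation in the same A, G, C, T order. Pairing each nucleotide with its right-hand neighbor, and dropping the last nucleotide because it has no such neighbor, are also assumptions made for this example.

```python
# Sketch of recoding a nucleotide sequence into a 16-letter alphabet.
# Only AA=>a, AG=>b, AC=>c, AT=>d come from the slide. The other twelve codes
# are a HYPOTHETICAL continuation of the same A, G, C, T ordering.
# Assumptions: the neighbor is the next nucleotide, and the final nucleotide
# (which has no next neighbor) is dropped.
from itertools import product
import string

ORDER = "AGCT"
CODE = {
    first + second: string.ascii_lowercase[index]
    for index, (first, second) in enumerate(product(ORDER, repeat=2))
}


def recode(sequence):
    """Map each nucleotide plus its right-hand neighbor to one extended letter."""
    sequence = sequence.upper().replace("U", "T")
    return "".join(CODE[sequence[i:i + 2]] for i in range(len(sequence) - 1))


if __name__ == "__main__":
    print(recode("AAGC"))
```

A sequence of length L becomes an extended sequence of length L - 1 under these assumptions, and each extended letter carries the information of two nucleotides.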

  21. Blosum-R and eRNA

  22. Substitutions? • How much does it cost to turn one nucleotide into another one? • Blosum/Pam style matrix • Matrices estimated on Rfam families
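A Blosum-style matrix scores each pair of symbols by the log of how often the pair is seen aligned, divided by how often it would be expected by chance. The sketch below shows that log-odds calculation on a toy alignment. It is a simplification: it skips the sequence clustering and the integer scaling used for real Blosum matrices, the function name is illustrative, and it is not claimed to be the procedure used to build Blosum-R.

```python
# Simplified log-odds estimation in the Blosum style (no clustering, no rounding).
# Works on any symbols, e.g. plain or extended nucleotides. Illustrative only.
import math
from collections import Counter
from itertools import combinations


def log_odds_scores(aligned_sequences, gap="-"):
    """Return {(x, y): log2(observed pair frequency / expected frequency)}."""
    pair_counts = Counter()
    for column in zip(*aligned_sequences):
        for x, y in combinations(column, 2):
            if gap in (x, y):
                continue
            # Count both orders so that the resulting matrix is symmetric.
            pair_counts[(x, y)] += 1
            pair_counts[(y, x)] += 1
    total = sum(pair_counts.values())
    symbol_counts = Counter()
    for (x, _), count in pair_counts.items():
        symbol_counts[x] += count
    scores = {}
    for (x, y), count in pair_counts.items():
        observed = count / total
        expected = (symbol_counts[x] / total) * (symbol_counts[y] / total)
        scores[(x, y)] = math.log2(observed / expected)
    return scores


if __name__ == "__main__":
    toy_alignment = ["AC", "AC", "AG"]
    for pair, score in sorted(log_odds_scores(toy_alignment).items()):
        print(pair, round(score, 3))
```

Pairs that are never observed in the alignment get no entry here; a production matrix would need pseudocounts or a much larger training set, such as the Rfam families mentioned on the slide.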

  23. Blosum-R and eRNA

  24. Using BlastR • When Nucleic Acids look like Proteins • They can be aligned with Protein Methods • BlastN → BlastP • BlastP with eRNA is BlastR

  25. Validating BlastR

  26. Benchmarking BlastR [Figure: a Query is run with Blast against Rfam; hits are listed by E-value; labels PP and PN]

  27. Benchmarking BlastR [Figure: Rfam 001, Rfam 002, … are each run with Blast against Rfam; the results feed a ROC]

  28. Benchmarking BlastR [Figure: plot with axes False Positives and True Positive; regions marked Good and Bad]

  29. Benchmarking BlastR [Figure: plot with axes False Positives and True Positive; regions marked Good and Bad] • Area Under Curve • Small AUC → Better
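The slide states that a smaller area is better, which fits a curve drawn with false positives on the vertical axis against true positives on the horizontal axis. The sketch below assumes that orientation (the transcript does not spell out the axes) and computes a normalized area from a list of hits ranked by E-value. The function name and the hit format, a list of (E-value, is_true_positive) tuples, are hypothetical.

```python
# Sketch: normalized area under a false-positives-versus-true-positives curve.
# ASSUMPTION: false positives are on the vertical axis, so smaller is better.
# Hits are hypothetical (e_value, is_true_positive) tuples; ties in E-value are
# left in input order.

def fp_vs_tp_area(hits):
    """Return a value in [0, 1]; 0 means every true hit outranks every false hit."""
    ranked = sorted(hits, key=lambda hit: hit[0])  # best (smallest) E-value first
    n_true = sum(1 for _, is_true in ranked if is_true)
    n_false = len(ranked) - n_true
    if n_true == 0 or n_false == 0:
        raise ValueError("need at least one true and one false hit")
    false_seen = 0
    area = 0
    for _, is_true in ranked:
        if is_true:
            # Each true hit adds a column whose height is the false hits so far.
            area += false_seen
        else:
            false_seen += 1
    return area / (n_true * n_false)


if __name__ == "__main__":
    example = [(1e-30, True), (1e-10, False), (1e-5, True), (1.0, False)]
    print(fp_vs_tp_area(example))
```

Under this convention the value equals one minus the conventional ROC AUC, so a perfect ranking scores 0 and a fully inverted ranking scores 1.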

  30. BlastR vs The World

  31. The 3 Components of BlastR • BlastP is better than BlastN • BlosumR makes BlastP a little bit better (Blast: wuBlast)

  32. The 3 Components of BlastR • BlastP is better than BlastN • BlosumR makes BlastP a little bit better • And Faster

  33. BlastR and Clustering • Given all Rfam in Bulk • How good is BlastR at reconstituting all the families? [Figure: Sensitivity against 1-Specificity]

  34. BlastR and Clustering • Given all Rfam in Bulk • How good is BlastR at reconstituting all the families? [Figure: Sensitivity against 1-Specificity]

  35. BlastR: In Practice

  36. BlastR: In Practice [Figure: BlastR compared with BlastN; E-Value Threshold: 10^-20]

  37. Take Home • Searching Nucleotides is Difficult • BlastN is not a very good algorithm • Simple Adaptations can improve the situation • Changing the algorithm (BlastP) • Changing the Scoring Scheme (BlastP-Nuc) • Changing the alphabet (BlastR)