C omputational ncRNA gene finding (& nc RNA structure prediction)

Computational ncRNA gene finding (& ncRNA structure prediction) ncRNA structure prediction (& computational ncRNA gene finding) Liming Cai (BINF8210@UGA, Fall 2010)

Non-coding RNAs • Functions other than coding proteins, e.g., structural, catalytic, and regulatory factors functional RNAs = ncRNAs + UTR motifs • (-) No strong statistical features, such as ORFs, or polyadenylated, demonstrated in coding genes • (+) Transcribed ncRNA molecules can fold into secondary and tertiary structures (more conserved than sequences)

Sources of ncRNAs • Non-coding RNA genes encode RNAs, e.g., miRNAs, rox1 and rox2 RNAs in male Drosophila melanogaster. • In introns and intergenic regions, e.g., snoRNAs • In 5’ and 3’ UTRs, e.g., regulatory motifs (functional RNAs)

Functions of ncRNAs • rRNAs and tRNAs • RNA maturation: snRNA in recognizing splicing sites • RNA modification: snoRNA converting uridine to pseudo-uridine • Regulation of gene expression and translation: e.g., miRNAs • DNA replication: e.g., telomerase RNAs - template for addition of telomeric repeats • Etc.

Classes of ncRNAs(Bompfunewerer, et al, 2005)

Some ncRNAs databases • Rfam (280,000 regions of 379 families) • NONCODE (109 transitional classes and 9 groups) • RNAdb (800 mammalian ncRNAs, excluding tRNAs, rRNAs and snRNAs) • Arabidposis small RNA Project (ASRP) • Etc.

ncRNA gene finding strategies • Computational predictive methods • cDNA cloning to enrich ncRNAs • Detecting new transcripts with oligonucleotide microarrays

ncRNA gene finding: a computational challenge • ncRNA genes do not have significant statistical signals • large in number • diverse, 20 nts to 22,000 nts • Not sure what to look for • Computationally intensive • - Simply no good method • - Methods compromising accuracy

Computational ncRNA gene finding methods • Specific (custom-designed) ncRNA search and annotation (e.g., tRNAscan, methylattion-guide snoRNA, miRNA, tmRNA) • Reconfigurable search systems (e.g., Infernal, ERPIN, RNATOPS,FastR) • mechanism to profile the target ncRNA (structure) - need training data • De novo ncRNA gene detection with • base composition (e.g., G+C %) • structure fold (e.g., RNAz) • Comparative analysis (e.g., QRNA, EvolFold) - consensus structure • ncRNA “holy grail” ?

Review literature in computational ncRNA gene finding and annotation • A. Laederach (2007) Informatics challenges in Structural RNA, Brief Bioinformatics 8(5) 294-303. • S. Eddy (2001) Non-coding RNA genes and modern RNA world, Nature Reviews Genetics, 2(12), 919-929. • S. Griffiths-Jones (2007) Annotating noncoding RNA genes, Annual Rev. Genomics & Human Genetics, 8:279-298. • Machado-Lima et al (2008) Computational methods in noncoding RNA research, Mathematical Biology, 56: 15-49.

506 miRNAs Comparison between NUPACK and Triple 499 tRNAs Comparison between NUPACK and Triple Data were from Bonnet et al, 2004

499 tRNA Comparisons between HG, Triple, NUPACK 499 tRNA Comparisons between HG and NUPACK Data were from Bonnet et al, 2004

[tRNA unfolding pathway] [Doudna,et al, 1999] What are in this lecture? • RNA secondary structure prediction 1. ab initio structure prediction 2. consensus structure prediction 3. structural model-based prediction but why just secondary structure?

What else are in this lecture? • ncRNA gene finding and annotation 4. Structural profile-based ncRNA gene annotation 5. comparative analysis based ncRNA gene finding 6. ab initio ncRNA gene detection

Base pairings of RNAs • Base pairings allow RNA to fold • Watson-Crick base pairs: A-U, C-G • Wobble pair G-U • called canonical pairs for secondary structure Note: all 16 (including non-canonical) base pairs are possible for RNA tertiary structure

P a H g P N H O c P N N N H N P u N a O H N N P H H N H O N N N H N N N O 5’-u-u-c-c-g-a-a-g-c-u-c-a-a-c-g-g-g-a-a-a-u-g-a-g-c-u-3’ 3’ 5’ CYTOSINE GUANINE URACIL ADENINE

Secondary structure is important to tertiary structure

acc acc Stems in nested or parallel pattern c guu aga aac c ucu cccc gc gca ggg ugc ggu cc stem (double helix): stacked base pairs loop: strand of unpaired bases

Stems in crossing patterns c guu aga aac c ucu cccc acc gc gca ggg ugc acc ggu cc Pseudoknots: crossing patterns of stems

RNA secondary structure elements Pseudoknot Stem Interior Loop Single-Stranded Bulge Loop Junction (Multiloop) Hairpin loop Image– Wuchty

RNA stem-loop (pseudoknot-free) structure example

RNA secondary structure prediction • ab inito structure prediction to predict the structure of a single sequence 2. Consensus structure prediction to predict the structure shared by more than one sequences 3. Statistical model-based prediction and alignment to search for desirable structures on genomes or data bases

1. ab initio structure prediction • Hydrogen bonds consume energy contained in the molecule. • The smaller the free energy is, the more stable the structure folded.

ab initio structure prediction (cont’) • Consider only canonical base pairs A-U, C-G, and G-U. Base pairings reduce the amount of free energy contained in the molecule. • Maximizing the number of base pairs would minimize the free energy in the molecule. (Only an approximate model)

ab initio structure prediction (cont’) • But how to count? An RNA could be very long; there may be many possible ways that base pairs can be formed: e.g., ……ACGGUACGUC….. conflicting pairs A-U, A-U G-C, G-C etc. Even the number of non-conflicting combinations of base pairs is exponentially large.

j i (1) head paired with tail (2) tail is unpaired (3) head is unpaired (4) two subfolds j i k ab initio structure prediction (cont’)

looking at shorter (e.g., very short) subsequences in a long sequence ACGGU…ACGUC • For subsequences of length 1, A, C, G, G, U, …, A, C, G, U, C #of base pairs 0, 0, 0, 0, 0, …, 0, 0, 0, 0, 0 • For subsequences of length 2, AC, CG, GG, GU, …, AC, CG, GU, UC # 0, 1. 0, 1, …, 0, 1, 1, 0 • For subsequence of length 3, ACG, CGG, GGU, …, UAC, ACG, CGU, GUC, UUC ?: e.g., GUC (1) G-C + U --> 1+0 =1 head-tail (2) G + UC --> 0+0 =0 head unpaired (3) GU + C --> 1+0 =1 tail unpaired (4) GU + C --> 1+0 =1 split (5) G + UC --> 0+0 =0 split

examine a little longer sequence …..ACGGUACGU….. i j ==> max of {cases 1, 2, 3, 4} • Head-tail paired, count = 1 + max count in subsequence CGGUACG i+1 j-1 2. Head unpaired, count = max count in subsequence CGGUACGU i+1 j • Tail unpaired, count = max count in subsequence ACGGUACG i j-1 • Split (why needed and where to split ?) ACGGUACGU when k=i+2 i j ==> ACG + GUACGU <---- k ---> count = max count in ACG + max count in GUACGU

simple model: (i, j) = 1 Ab initio structure prediction (cont’) • Maximizing the number of base pairs (Nussinov et al, 1978)

G G G A A A U C C 0 0 0 0 0 0 1 2 3 0 0 0 0 0 1 2 3 0 0 0 0 1 2 2 GAAAUC 0 0 0 1 1 1 0 0 1 1 1 0 1 1 1 0 0 0 0 0 0 G G G A A A U C C GGGAAAUCC Ci,j = 0 when i=j AAUC AU

Example 2: ACGGUU subsequence of length 0: empty sequence, 0 pairs subsequences of length 1: A, C, G, G, U, U 0 0 0 0 0 0 pairs subsequences of length 2: AC, CG, GG, GU, UU 0 1 1 0 0 pairs subsequences of length 3: ACG, CGG, GGU, GUU 1 1 1 1 pairs Subsequences of length 4: ACGG, CGGU, GGUU 1 2 2 pairs Subsequences of length 5: ACGGU, CGGUU 2 2 pairs subsequence of length 6: ACGGUU 3 pairs

Prediction Algorithm Web Server • http://frontend.bioinfo.rpi.edu/applications/mfold/cgi-bin/rna-form1.cgi • Sample sequence: (1) tRNA GGGGUCAUAGCUCAGUUGGUAGAGCGCUACAAUGGCAUUGUAGAGGUCAGCGGUUCGAUCCCGCUUGGCUCCACCA (2) a part of tmRNA CCUCUCUCCCUAGCCUCCGCUCUUAGGACGGGGAUCAAGAGAGGUCAAACCCAAAAGAGA • Simple matrix, • simple matrix with G-U pair • Complex matrix Rfam database: http://www.sanger.ac.uk/Software/Rfam/

Thermodynamic energy based structure prediction • Energy minimization algorithm predicts the correct secondary structure by minimizing the free energy (G) • G calculated as sum of individual contributions of: • loops • base pairs • secondary structure elements

Free-energy values (kcal/mole at 37oC ) • Energies of stems calculated as stacking contributions between neighboring base pairs

Free-energy values (kcal/mole at 37oC )

Zuker’s algorithm MFOLD: computing loop dependent energies

Assumptions in such algorithms • Most likely structure corresponds to energetically most stable structure • Energy associated with any position is only influenced by local sequence and structure • Structure formed does not produce pseudoknots

RNA structure prediction web servers • MFOLD http://www.bioinfo.rpi.edu/applications/mfold/rna/form1.cgi • RNAfold ( a part of Vienna Package) http://rna.tbi.univie.ac.at/cgi-bin/RNAfold.cgi Examples: GCTTACGACCATATCACGTTGAATGCACGC CATCCCGTCCGATCTGGCAAGTTAAGCAAC GTTGAGTCCAGTTAGTACTTGGATCGGAGA CGGCCTGGGAATCCTGGATGTTGTAAGCT

RNA pseudoknot (tmRNAs) Bacterial tmRNA consensus structure (Felden et al. 2001. NAR 29) terminates translation errors

Functions of pseudoknots (TMV 3’ UTR) Promotes efficient translation Binds EF1A, cooperates with 5’UTR (Leathers et al. 1993 MCB 13 Zeenko et al. 2002 JVI 76)

Pseudoknots drastically increase computational complexity

RNA pseudoknot prediction web servers • Pknots-RG: http://bibiserv.techfak.uni-bielefeld.de/pknotsrg/ • Pknots-RE (the first pseudoknot prediction algorithm) • Kinefold: http://kinefold.curie.fr/cgi-bin/form.pl • ILM http://cic.cs.wustl.edu/RNA/

Computational complexity issues • Pseudoknot-free structures: O(n3) CUP time • Pseudoknots: NP-hard, restricted cases O(n5) • Heuristics added: O(n4) • Difficult for search RNA structures in genomes

2. Consensus structure prediction • Covariance fact for RNAs: • Variations in RNA sequence maintain base-pairing patterns for secondary structures • When a nucleotide in one base changes, the base it pairs to must also change to maintain the same structure

query: GGGGGCAACCCC query: GGGGGCAACCCC query: GGGGGCAACCCC     | | |        |  | |        |  | |    A: AUCCGAAAGGAU B: CCUAGAAAGGAU B: CCUAGAAAGGAU query: GGGGGCAACCCC |||||  | | | | | | A: AUCCGAAAGGAU Structure alignments (example) C A G A G•C G•C G•C G•C A A G A C•G C•G U•A A•U G A G A AG UG CA CU Query RNA structure A: structural homolog B: nonhomologous primary sequence alignment scoring: -6 -6 structure + sequence alignment scoring: +11 -6

Covariance Mutlipel Structural Alignment of 13 tmRNA genes from the β-proteobacteria [Felden et al’01]

j i p q Dynamic programming approach • (Sankoff 1984) This can be regarded as running ‘two Nussinov algorithms at the same time’ to simultaneously fold two RNAs ‘the coordinated fold’ is found through computing Ci,j,p,q, needs: O(n6) time for two sequences and O(n3k) for k seqs

Inferring structure by comparative sequence analysis • (1) calculate a multiple sequence alignment Requires sequences to be similar enough so that they can be initially aligned Sequences should be dissimilar enough for covarying substitutions to be detected

C omputational ncRNA gene finding (& nc RNA structure prediction)