Comparative Study of Small RNA and Small Peptides in Complete Genome Sequences
Stanislav Luban (1,2), Daisuke Kihara (2,1)
1. Department of Computer Sciences
2. Department of Biological Sciences
Purdue University, West Lafayette, IN
BIOL497 Undergraduate Presentation, Stanislav Luban, Member of Kihara Lab, Purdue Univ.
Introduction: Structural Small RNA (sRNA)
• Genes that produce non-coding transcripts which function directly as structural, regulatory, or catalytic RNAs
• Include rRNAs, tRNAs, small nucleolar RNAs, spliceosomal RNAs, viral associated RNAs, microRNAs, ctRNAs, and others
• The Rfam (RNA families) database stores 34496 sRNA entries distributed among 352 known families
• In E. coli, about 50 sRNAs are known
(Figure from the Rfam database: http://www.sanger.ac.uk/Software/Rfam/)
Methods: QRNA
Models distinctive patterns of mutation:
• Conserved Structural RNA
  • Pattern of compensatory mutations consistent with base-paired secondary structure
  • Pair Stochastic Context-Free Grammar Model
• Conserved Coding Region
  • Pattern of synonymous codon substitutions
  • Pair Hidden Markov Model
• Other Types of Conserved Regions
  • Approximated by a “null hypothesis” that mutations occur position-independently, without pattern
  • Pair Hidden Markov Model
• Scores are log likelihoods, which are used to calculate the final log-odds score for the RNA model compared to the other two models
(Figure: Rivas et al., Current Biol. 2001)
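The slide does not give the scoring formula, so the following is only a rough sketch of the idea. It assumes each of the three models returns a base-2 log likelihood for the same pairwise alignment, that the winning model is the one with the highest log likelihood, and that the reported score is the log-odds of the RNA model against the null model. The function name and the example numbers are hypothetical, and QRNA's real calculation includes details not shown here.

```python
def classify_alignment(loglik_rna, loglik_cod, loglik_oth):
    """Pick the best-fitting model and compute a log-odds score for RNA.

    All three inputs are assumed to be log2 likelihoods of the same
    alignment under the RNA, coding (COD), and null (OTH) models.
    """
    scores = {"RNA": loglik_rna, "COD": loglik_cod, "OTH": loglik_oth}
    winner = max(scores, key=scores.get)
    # Log-odds of the RNA model against the null model (assumption:
    # the difference of the two log likelihoods, equal priors).
    log_odds_rna = loglik_rna - loglik_oth
    return winner, log_odds_rna


# Hypothetical log likelihoods for one alignment.
print(classify_alignment(-100.0, -110.0, -105.0))
```

A positive log-odds value means the alignment is better explained by the structural RNA model than by position-independent mutation.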
Procedure for Extracting sRNAs
1. Extract Intergenic Regions From 30 Sequenced Genomes
2. Perform All vs. All Nucleotide-Nucleotide BLAST
3. Select Significant Alignments, Concatenate and Format into QRNA Program Input
4. Run QRNA, Extract Alignments Scoring as sRNAs vs. Coding and Null Hypothesis Regions
5. Eliminate Alignment Regions Which Overlap >50% with E. coli Regulatory Regions
6. Extend Regions Within 25 nt of Other Regions, Causing Them to Include Each Other
7. Merge sRNA Regions Which Align or Exactly Overlap Into Families
8. Eliminate Family Regions Not Found Using Both Query and Database Organism as Source
9. Verify Results Computationally and Experimentally (Yet to Be Done)
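Two of the steps in this procedure are simple interval operations: dropping regions that overlap a regulatory region by more than 50%, and joining regions that lie within 25 nt of each other. A minimal sketch follows, assuming regions are (start, end) tuples on one strand of one genome with end greater than start. The function names and the toy coordinates are hypothetical and are not taken from the original pipeline.

```python
def overlap_fraction(region, other):
    """Fraction of `region` that is covered by `other`."""
    start, end = region
    o_start, o_end = other
    shared = min(end, o_end) - max(start, o_start)
    length = end - start
    return max(shared, 0) / length if length > 0 else 0.0


def drop_regulatory(regions, regulatory, max_fraction=0.5):
    """Keep regions whose overlap with every regulatory region is <= 50%."""
    return [
        r for r in regions
        if all(overlap_fraction(r, g) <= max_fraction for g in regulatory)
    ]


def merge_nearby(regions, max_gap=25):
    """Join regions separated by at most `max_gap` nt into one region."""
    merged = []
    for start, end in sorted(regions):
        if merged and start - merged[-1][1] <= max_gap:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Toy coordinates, for illustration only.
kept = drop_regulatory([(0, 100), (300, 400)], regulatory=[(40, 200)])
print(kept)
print(merge_nearby([(100, 150), (170, 220), (400, 450)]))
```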
Genome Data Set
30 Microbial Genomes Used as Queries and Databases:
• Gammaproteobacteria: Acinetobacter calcoaceticus, Blochmannia floridanus, Buchnera aphidicola, Coxiella burnetii, Erwinia carotovora, Escherichia coli, Haemophilus ducreyi, Haemophilus influenzae, Pasteurella multocida, Photorhabdus luminescens, Pseudomonas aeruginosa, Pseudomonas putida, Pseudomonas syringae, Salmonella enterica, Salmonella typhimurium, Shewanella oneidensis, Shigella flexneri, Vibrio cholerae, Vibrio parahaemolyticus, Vibrio vulnificus, Wigglesworthia brevipalpis, Xanthomonas campestris, Xanthomonas citri, Xylella fastidiosa, Yersinia pestis
• Alphaproteobacteria: Agrobacterium tumefaciens, Brucella melitensis, Caulobacter crescentus, Mesorhizobium loti
• Deinococci: Deinococcus radiodurans
Result Statistics
• Total number of intergenic regions: 94464
• Average number of intergenic regions per organism: 3148.8
• Total combined length of intergenic regions: 16663732 nt
• Average length of intergenic region: 176.4 nt
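The two averages follow directly from the totals on this slide and the 30 genomes in the data set. A short check, with variable names chosen here for illustration:

```python
n_genomes = 30
n_regions = 94464        # total intergenic regions
total_nt = 16663732      # combined length of all intergenic regions

regions_per_genome = n_regions / n_genomes
mean_region_length = total_nt / n_regions

print(round(regions_per_genome, 1), round(mean_region_length, 1))
```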
sRNA Length vs. Score Plot
Total: 29488 sRNAs
Number of sRNA Entries by Organism
1 - Pseudomonas putida
2 - Shigella flexneri
3 - Xanthomonas citri
4 - Shewanella oneidensis
5 - Wigglesworthia brevipalpis
6 - Haemophilus ducreyi
7 - Pseudomonas syringae
8 - Erwinia carotovora
9 - Escherichia coli
10 - Vibrio parahaemolyticus
11 - Mesorhizobium loti
12 - Buchnera aphidicola
13 - Brucella melitensis
14 - Yersinia pestis
15 - Xylella fastidiosa
16 - Pseudomonas aeruginosa
17 - Salmonella enterica
18 - Caulobacter crescentus
19 - Agrobacterium tumefaciens
20 - Blochmannia floridanus
21 - Pasteurella multocida
22 - Deinococcus radiodurans
23 - Vibrio cholerae
24 - Photorhabdus luminescens
25 - Coxiella burnetii
26 - Vibrio vulnificus
27 - Salmonella typhimurium
28 - Acinetobacter calcoaceticus
29 - Xanthomonas campestris
30 - Haemophilus influenzae
Total: 29488 sRNAs
Conservation of sRNAs
Total: 3768 families
Conservation of sRNAs
Along with the statistics for all families, statistics for families containing at least one entry from E. coli were added for comparison.
• Families with an E. coli entry: 554 total
• All families: 3768 total
Common Organism Combinations in Families
• Top 5 most frequent combinations of 4 and 7 organisms:
Combinations of 4 organisms (occurrences):
  Ecoli, Senterica, Sflexneri, Styphimurium: 117
  Ecarotovora, Ecoli, Senterica, Styphimurium: 26
  Ecoli, Senterica, Styphimurium, Ypestis: 20
  Ecarotovora, Ecoli, Sflexneri, Styphimurium: 18
  Ecoli, Sflexneri, Styphimurium, Ypestis: 17
Combinations of 7 organisms (occurrences):
  Ecarotovora, Ecoli, Pluminescens, Senterica, Sflexneri, Styphimurium, Ypestis: 4
  Acalcoaceticus, Ccrescentus, Mloti, Paeruginosa, Pputida, Psyringae, Xcampestris: 2
  Acalcoaceticus, Atumefaciens, Ccrescentus, Mloti, Pputida, Psyringae, Xcampestris: 2
  Acalcoaceticus, Atumefaciens, Ccrescentus, Mloti, Paeruginosa, Psyringae, Xcampestris: 2
  Acalcoaceticus, Atumefaciens, Ccrescentus, Mloti, Paeruginosa, Pputida, Xcampestris: 2
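A tally like the one above can be produced by counting how often each set of organisms occurs across families. The sketch below assumes each family is available as a list of organism names. The function name and the toy families are hypothetical and are not the study's data.

```python
from collections import Counter


def count_combinations(families, size):
    """Count organism combinations of exactly `size` organisms.

    `families` is an iterable of organism-name lists, one per family.
    Returns (combination, occurrences) pairs, most frequent first.
    """
    counts = Counter()
    for members in families:
        organisms = tuple(sorted(set(members)))  # ignore order and repeats
        if len(organisms) == size:
            counts[organisms] += 1
    return counts.most_common()


# Toy input, for illustration only.
families = [
    ["Ecoli", "Senterica", "Sflexneri", "Styphimurium"],
    ["Styphimurium", "Ecoli", "Sflexneri", "Senterica"],
    ["Ecoli", "Senterica", "Styphimurium", "Ypestis"],
    ["Ecoli", "Sflexneri"],
]
for combination, occurrences in count_combinations(families, size=4):
    print(", ".join(combination), occurrences)
```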
Result Verification
• 71 total sRNAs related to E. coli that were already annotated in the Rfam database were used as a benchmark
• Of those:
  • 15 – found by the computational method, also listed in Rfam, and not tRNAs
  • 6 – not found due to shortcomings of the method
  • 29 – tRNAs already annotated as gene loci in the E. coli genome sequence used
  • 10 – E. coli plasmid loci not found in the full E. coli genome sequence used
  • 2 – 4.5S RNAs already annotated as gene loci in the E. coli genome sequence used
  • 2 – E. coli reverse transcriptase loci not found in the full E. coli genome sequence used
  • 1 – E. coli insertion sequence not found in the full E. coli genome sequence used
  • 1 – E. coli small RNA annotated separately, not found in the full E. coli genome sequence used
  • 1 – antisense RNA already annotated as a gene locus in the E. coli genome sequence used
  • 1 – cloning vector with E. coli promoter not found in the full E. coli genome sequence used
  • 1 – E. coli transposable element not found in the full E. coli genome sequence used
  • 1 – reporter vector not found in the full E. coli genome sequence used
  • 1 – E. coli retron not found in the full E. coli genome sequence used
Candidates for Experimental Verification of Findings
For the following 2 slides:
• The family designation is expressed as [Organism name] [locus absolute start location] [locus absolute end location] and is synonymous with the first (header) entry of that family
• Entries is the number of sRNA entries in the family from different organisms (2 chromosomes are counted separately)
• Length (nt) and score refer only to the header entry of the family
• Scores are the QRNA program's “log odds post” values for the RNA model as opposed to the null hypothesis
Candidates for Experimental Verification of Findings
• Top 10 highest statistically scoring E. coli sRNA loci found by the computational method:
Family designation        Length (nt)   Score
Ecoli 3941194 3941327     133           34.114
Ecoli 2744345 2744445     100           29.631
Ecoli 780875 781068       193           29.194
Ecoli 2687537 2687689     152           27.734
Ecoli 2519348 2519548     200           23.876
Ecoli 4169337 4169400     63            21.625
Ecoli 4038218 4038281     63            21.596
Ecoli 2751994 2752022     28            20.893
Ecoli 3420989 3421058     69            20.821
Ecoli 3808832 3808858     26            16.995
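The family designation carries the locus coordinates, and on this slide the length equals end minus start. A small sketch of parsing a designation and ranking loci by score follows. The `Family` type and `parse_family` helper are hypothetical names introduced for illustration; the two example rows are copied from the list above.

```python
from typing import NamedTuple


class Family(NamedTuple):
    organism: str
    start: int
    end: int
    score: float

    @property
    def length(self):
        # Matches the slide: length is end minus start.
        return self.end - self.start


def parse_family(designation, score):
    """Parse '[organism] [start] [end]' plus a score into a Family."""
    organism, start, end = designation.split()
    return Family(organism, int(start), int(end), float(score))


loci = [
    parse_family("Ecoli 2744345 2744445", 29.631),
    parse_family("Ecoli 3941194 3941327", 34.114),
]
for locus in sorted(loci, key=lambda f: f.score, reverse=True):
    print(locus.organism, locus.start, locus.end, locus.length, locus.score)
```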
Candidates for Experimental Verification of Findings
• Top 10 largest sRNA families found by the computational method:
Family designation               Entries   Length (nt)   Score
Styphimurium 3358766 3358804     18        38            4.590
Ecarotovora 3161909 3161946      15        37            12.604
Ecarotovora 1144121 1144141      12        20            5.265
Styphimurium 3342804 3342899     12        95            4.328
Ecarotovora 2597534 2597593      10        59            3.343
Paeruginosa 2508264 2508282      9         18            7.068
Styphimurium 975191 975219       8         28            16.296
Styphimurium 3746886 3746903     8         17            1.146
Ecarotovora 3477891 3477922      8         31            2.697
Ecarotovora 4490537 4490683*     7         146           16.753
*This last entry was used as a sample for detailed study and is discussed subsequently.
Detailed Study of Located Sample sRNA
• Hit to Alpha_RBS RNA (Rfam: RF00140) (115 nt)
Rfam sequence: GUCCUUGAUAUUCUGUUUGAGUAUCCUGAAAACGGGCUUUUCAAGAUCAGAAUAUCAAAUUAAUUAAAAUAUAGGAGUGCAUAGUGGCCCGUAUUGCAGGCAUUAACAUUCCUGAU
Organism       Location (in genome)   Length (nt)   Score    Neighboring genes
Ecarotovora    4490537-4490683        146           16.753   rpsM - rpmJ
Pluminescens   5487752-5487866        114           10.791   rpsM - secY
Ypestis        232330-232476          146           15.757   rpmJ - rpsM
Styphimurium   3585744-3585879        135           41.980   rpsM - rpmJ
Senterica      4243623-4243770        147           40.046   rpmJ - rpsM
Ecoli          3440108-3440255        147           43.556   rpsM - rpmJ
Sflexneri      3426855-3427002        147           41.980   rpsM - rpmJ
Detailed Study of Located Sample sRNA
• Most likely (lowest free energy) predicted fold of an 80 nt segment of the sequence
• Fold predicted with Mfold (Zuker et al., 2004)
Another Approach to Finding sRNAs in E. coli: Paper Summary
Method Used in Paper to Find Putative sRNAs
• A database of all E. coli intergenic DNA sequences was created from the gene annotations in an early release of the EcoGene database and used as input to a profile search program (pftools2.2, Swiss Institute of Bioinformatics) set to find sigma-70 promoters
• The database was searched for a terminator motif using the following criteria: (1) an 11-nt A-rich region; (2) a variable-length hairpin; (3) a variable-length spacer; (4) a 5-nt T-rich region nearest the hairpin; and (5) a 7-nt distal extra T-rich region
• Predicted promoter and terminator pairs were combined to generate putative sRNAs if (1) the pair was on the same strand; and (2) the pair was more than 45 but less than 350 nt apart
• To verify, open reading frames and possible ribosome binding sites were searched for downstream of each promoter
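The pairing rule on this slide (same strand, more than 45 but less than 350 nt apart) can be sketched as below. This is an illustration only and not the paper's code: promoters and terminators are assumed to be (position, strand) tuples, and the requirement that the terminator lie downstream of the promoter in the direction of transcription is an added assumption. All names and coordinates are hypothetical.

```python
def pair_promoters_terminators(promoters, terminators, min_gap=45, max_gap=350):
    """Combine promoter and terminator predictions into candidate sRNAs.

    Returns (left, right, strand) tuples for every same-strand pair whose
    promoter-to-terminator distance is strictly between min_gap and max_gap.
    """
    candidates = []
    for p_pos, p_strand in promoters:
        for t_pos, t_strand in terminators:
            if p_strand != t_strand:
                continue
            # Assumption: the terminator must be downstream of the promoter,
            # i.e. at a larger coordinate on "+" and a smaller one on "-".
            distance = t_pos - p_pos if p_strand == "+" else p_pos - t_pos
            if min_gap < distance < max_gap:
                candidates.append((min(p_pos, t_pos), max(p_pos, t_pos), p_strand))
    return candidates


# Toy predictions, for illustration only.
promoters = [(1000, "+"), (5000, "-")]
terminators = [(1200, "+"), (1030, "+"), (4800, "-"), (1200, "-")]
print(pair_promoters_terminators(promoters, terminators))
```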
Synopsis of Method Used in Paper
• Using the E. coli MG1655 genome, the paper searched for DNA regions that contained a sigma-70 promoter within a short distance of a rho-independent terminator
• The paper predicted 227 putative sRNAs between 80 and 400 nt in length in E. coli, 32 of which were already known to be sRNAs
• Transcripts of some of the candidate loci were verified using Northern hybridization
• The approach may be usable for annotating sRNA loci in other bacterial genomes
Verification of Paper Results with Results Using Our Method
• Along with other results, the paper gives a detailed listing of the 227 sRNAs predicted, including the designation, strand orientation (forward or reverse), left and right boundaries (nt from genome start position), and length (nt) of each sRNA
• The left and right boundary positions given by the paper were compared with the left and right boundary positions of the putative sRNAs found by our method
• If an sRNA candidate from the paper was within 100 nt of any sRNA predicted by our method, that sRNA was scored as ‘found’
Results of Verification
• 227 candidate sRNAs were predicted in E. coli by the paper
• Among them, 150 (66.1%) were localized by our method using the 100 nt criterion
• The test was re-run with a 50 nt threshold, yielding 140 hits (61.7%); a 10 nt threshold, yielding 128 hits (56.4%); and a 1000 nt threshold, yielding 187 hits (82.4%)
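The comparison and the percentages can be sketched as follows. The slides do not say exactly how "within N nt" was measured, so this sketch assumes it means the gap between the two intervals is at most N nt, with overlapping intervals having a gap of 0. The function names and toy coordinates are hypothetical; the hit counts in the last loop are the ones reported on this slide.

```python
def interval_gap(a, b):
    """Distance in nt between two (left, right) intervals; 0 if they overlap."""
    return max(0, max(a[0], b[0]) - min(a[1], b[1]))


def count_found(paper_candidates, our_predictions, threshold=100):
    """Count paper candidates lying within `threshold` nt of any prediction."""
    return sum(
        1
        for cand in paper_candidates
        if any(interval_gap(cand, ours) <= threshold for ours in our_predictions)
    )


def percent(found, total):
    return round(100 * found / total, 1)


# Toy intervals, for illustration only.
print(count_found([(100, 200), (5000, 5100)], [(250, 300)], threshold=100))

# Percentages for the reported hit counts out of 227 candidates.
for threshold, hits in [(10, 128), (50, 140), (100, 150), (1000, 187)]:
    print(threshold, hits, percent(hits, 227))
```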
Preliminary Procedure for Extracting Small Peptides
1. Extract Intergenic Regions From 30 Sequenced Genomes
2. Perform All vs. All Nucleotide-Nucleotide BLAST
3. Select Significant Alignments, Concatenate and Format into QRNA Program Input
4. Run QRNA, Extract Alignments Scoring as Coding vs. sRNA and Null Hypothesis Regions
5. Score Regions Based on Quality of Fit Inside a Nearby Open Reading Frame
6. Extend Regions Within 25 nt of Other Regions, Causing Them to Include Each Other
7. Merge sRNA Regions Which Align or Exactly Overlap Into Families
8. BLAST Resulting Family Entries Against the SwissProt Database
9. Observe Results and Refine Extraction Method
Preliminary Results of Small Peptide Search
• Tblastx Alignment
Query:   133  LPPNAGTYVPACWPSPALPYRQIPPEYPDSNP 38
Subject: 1373 LPPLXTSXXPPPPPPPSXPLXSLPPSXPPSLP 1278
• Query Sequence Information
Organism: Erwinia carotovora
Location (in genome): 843815-843948
Length (nt): 133
E-value: 0.69
Aligns to: gb|AAF36091.1| flagelliform silk protein [Nephila madagascariensis]
Sequence: aattccgtcgcatgttctctggtgagtacgacagcgcggattgctatctggatattcaggcgggatctggcggtacggaagcgcaggactgggccagcatgctggtacgtatgtacctgcgttgggcggaagc
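Tblastx compares translated nucleotide sequences, so the query peptide in the alignment is one of the six reading-frame translations of the query. The sketch below is a plain-Python illustration of six-frame translation with the standard genetic code, not BLAST itself. The frame labels and function names are chosen here for illustration; the query sequence and peptide are the ones shown on this slide.

```python
# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(seq):
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons; unknown codons become 'X'."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frames(seq):
    """Return translations for frames +1..+3 and -1..-3."""
    frames = {}
    for sign, strand in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{sign}{offset + 1}"] = translate(strand[offset:])
    return frames


query = (
    "aattccgtcgcatgttctctggtgagtacgacagcgcggattgctatctggatattcaggcgggatctgg"
    "cggtacggaagcgcaggactgggccagcatgctggtacgtatgtacctgcgttgggcggaagc"
)
peptide = "LPPNAGTYVPACWPSPALPYRQIPPEYPDSNP"
print([name for name, protein in six_frames(query).items() if peptide in protein])
```

The descending query coordinates in the alignment (133 to 38) indicate that the peptide comes from a reverse-strand frame.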
Preliminary Results of Small Peptide Search
• Tblastx Alignment
Query:   62 PRATAPHPDPVRPAPETAPTP 124
Subject: 90 PPAPAPRPPPVAPAPRPLPPP 28
• Query Sequence Information
Organism: Pseudomonas syringae
Location (in genome): 6171796-6172006
Length (nt): 210
E-value: 0.23
Aligns to: emb|CAD88221.2| C. elegans GRL-25 protein (corresponding sequence ZK 643.8)
Sequence: tgagttccggcagctcgtcatccagcttctgacgcaaccgcccggtcagaaacgcaaagccctcgagcaaccgctccacatccggatcccgtccggcctgccccagaaacggcgccaacgccggactacgctcggcgaagcgacgaccaagctggcgcagtgcagtgagttcgctctggtagtaatggttaaaggacacgggttacctgc
Conclusions
• Possible sRNAs are found in 20 to 39% of the intergenic regions in each organism
• Among them, ~31% of the sRNAs have a log-odds score at or above the threshold of 5.0
• 137 “families” are conserved in 5 or more organisms
• Being well conserved, sRNAs may be responsible for fundamental functions of living organisms
Future Direction
• The search for sRNAs will be expanded to a larger number of more diverse genomes
• Secondary structure prediction will later be employed in greater detail to verify well-conserved sRNA regions among multiple evolutionarily distant organisms
• Experimental verification of the findings of this study is under way (particularly for Shewanella oneidensis)
• Comparative genomics will be used to discover the function associated with each sRNA and possibly to learn its role in a pathway