GENOME SIGNATURES OF MICROBIAL ORGANISMS IDENTIFIED BY AMINO ACID N-GRAM ANALYSIS
B. Suman Bharathi
Advisor: Judith Klein-Seetharaman
Forschungszentrum, Juelich, Germany
Genome Signatures
• Short peptide sequences that occur with unusually high frequency in one particular organism or pathogen, but not in others
• Potential applications:
  • Drug development: synthesize drugs that target the genome signature in a pathogen
  • Sensor development: use the genome signature to identify an organism quickly with an antibody
Approach
• Linguistic approach
• N-gram analysis using the BLMT toolkit
• What the BLMT toolkit provides
• N-gram statistical analysis
• Definition of signature sequences
• Use of the toolkit on Neisseria meningitidis
• n-gram = sequence of length n
[Figure: Neisseria meningitidis versus other species, n = 4. Y-axis: occurrence of n-gram (%), 0 to 0.09. X-axis, 4-grams: SDGI, LAAL, AALL, LLAA, ALLA, AAAL, LAAA, ALAA, AALA, AVLA, AAAA, AVAA, AAAV, EAAA, AEAA, AAEA, AAVA, AAAE, GRLK, MPSE]
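To make the definition "n-gram = sequence of length n" concrete, here is a minimal Python sketch that counts overlapping amino acid n-grams and reports each one's occurrence as a percentage, the quantity plotted on the y-axis above. It does not use BLMT; the function name `ngram_percentages` and the toy sequence are assumptions made purely for illustration.

```python
from collections import Counter


def ngram_percentages(sequence: str, n: int = 4) -> dict[str, float]:
    """Return each n-gram's share (in %) of all overlapping n-grams in `sequence`."""
    seq = sequence.upper()
    total = len(seq) - n + 1  # number of overlapping windows of length n
    if total <= 0:
        return {}
    counts = Counter(seq[i:i + n] for i in range(total))
    return {gram: 100.0 * count / total for gram, count in counts.items()}


# Toy sequence for illustration only (not real proteome data).
toy = "AAAALAAALLAA"
freqs = ngram_percentages(toy, n=4)
for gram, pct in sorted(freqs.items(), key=lambda item: -item[1]):
    print(f"{gram}: {pct:.2f}%")
```

For a whole proteome one would probably count within each protein separately, so that no n-gram spans the boundary between two proteins; that choice is not stated on the slides.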
Use of BLMT
• N-gram statistical analysis gives detailed statistics: the frequency of each n-gram, together with its mean and standard deviation.
• We considered 45 organisms: bacteria, archaea, mycoplasmas and human.
• Search for n-grams whose frequencies lie many standard deviations away from the mean.
• This indicates the difference between the expected and observed frequencies of the n-grams.
• It eventually shows how unusual an n-gram is in one organism compared with the others.
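The search described above can be sketched as a z-score calculation. The slides do not say exactly how BLMT defines the mean and standard deviation, so this sketch assumes they are taken over the frequencies of one n-gram across all organisms in the comparison. The function name `z_scores`, the organism labels and the frequency values are all hypothetical.

```python
from statistics import mean, pstdev


def z_scores(freq_by_organism: dict[str, dict[str, float]], target: str) -> dict[str, float]:
    """For each n-gram, the number of standard deviations by which the target
    organism's frequency differs from the mean over all organisms (assumed definition)."""
    grams = set().union(*(f.keys() for f in freq_by_organism.values()))
    scores = {}
    for gram in grams:
        # An n-gram missing from an organism counts as frequency 0.
        values = [f.get(gram, 0.0) for f in freq_by_organism.values()]
        mu, sigma = mean(values), pstdev(values)
        if sigma > 0:  # skip n-grams with identical frequency everywhere
            scores[gram] = (freq_by_organism[target].get(gram, 0.0) - mu) / sigma
    return scores


# Made-up frequencies (%) for three placeholder organisms.
toy = {
    "org_A": {"AAAA": 0.01, "LAAL": 0.02},
    "org_B": {"AAAA": 0.01, "LAAL": 0.02},
    "org_C": {"AAAA": 0.04, "LAAL": 0.02},
}
scores = z_scores(toy, target="org_C")
print(scores)
```

Here "AAAA" is reported as over-represented in `org_C`, while "LAAL" is dropped because its frequency does not vary between organisms.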
Difference Between Expected and Observed Frequencies
[Figure: difference between expected and observed frequency, plotted per n-gram. Legend: Xylella (black), Vibrio (red), Ureaplasma (green), Treponema (blue), Thermotoga (yellow)]
Positive values indicate over-represented n-grams; negative values indicate under-represented n-grams.
Initial points of the graph of difference between expected and observed frequency
[Figure legend: Xylella (black), Vibrio (red), Ureaplasma (green), Treponema (blue), Thermotoga (yellow)]
Ureaplasma shows high difference values (approx. 0.00021), indicating that these n-grams are over-represented relative to their expected probability of occurrence in the organism.
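One way to obtain the difference plotted in these graphs is sketched below. The slides do not state how the expected frequency is derived; this sketch assumes a simple independence model in which the expected frequency of an n-gram is the product of the frequencies of its individual amino acids. The function name `observed_minus_expected` and the toy sequence are illustrative only.

```python
from collections import Counter
from math import prod


def observed_minus_expected(sequence: str, n: int = 4) -> dict[str, float]:
    """Observed n-gram frequency minus the frequency expected if residues
    were independent (an assumed null model, not necessarily BLMT's)."""
    seq = sequence.upper()
    total = len(seq) - n + 1
    if total <= 0:
        return {}
    residue_freq = {aa: count / len(seq) for aa, count in Counter(seq).items()}
    counts = Counter(seq[i:i + n] for i in range(total))
    return {
        gram: count / total - prod(residue_freq[aa] for aa in gram)
        for gram, count in counts.items()
    }


# Toy sequence for illustration only; n = 2 keeps the arithmetic easy to follow.
diffs = observed_minus_expected("AAAALLLL", n=2)
for gram, diff in sorted(diffs.items()):
    sign = "over" if diff > 0 else "under"
    print(f"{gram}: {diff:+.4f} ({sign}-represented)")
```

Only n-grams that actually occur are returned. An n-gram that never occurs is under-represented by its full expected frequency and would need to be enumerated separately.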
Standard deviation away from the mean
• Mycoplasma genitalium (black)
• M. tuberculosis (red)
• M. leprae (green)
• Mesorhizobium (blue)
• Lactococcus (yellow)
The graph shows the distribution of n-gram standard deviations, with both high and low values, indicating the over-represented and under-represented n-grams.
Highest standard deviations away from the mean
• Mycoplasma genitalium (black)
• M. tuberculosis (red)
• M. leprae (green)
• Mesorhizobium (blue)
• Lactococcus (yellow)
The graph shows the initial (highest) values of standard deviation away from the mean. The values for M. tuberculosis n-grams are much higher than those for M. leprae.
Comparison of genome size with varying standard deviations
• Examine the relationship between genome size and the distribution of n-gram standard deviations for each organism.
• The human genome is taken as the reference.
• Compare genome size and standard deviations within the same genus but across different species.
Size Distribution of Genomes
1. Human 22889476
2. Bacteria_Mesorhizobium_loti 4080256
3. Bacteria_Pseudomonas_aeruginosaPA01 3730192
4. Bacteria_Escherichia_coliO157H7 3229098
5. Bacteria_Escherichia_coliO157H7EDL933 3228100
6. Bacteria_Escherichia_coliK12 2726558
7. Bacteria_Mycobacterium_tuberculosisH37Rv 2666338
8. Bacteria_Bacillus_subtilis 2442200
9. Bacteria_Bacillus_halodurans_C125 2384352
10. Bacteria_SynechocystisPCC6803 2072748
11. Bacteria_Vibrio_cholerae_chr1 1725852
12. Bacteria_Deinococcus_radioduransR1_chr1 1559376
13. Bacteria_Xylella_fastidiosa 1490262
14. Archaea_Archaeoglobus_fulgidus 1343990
15. Bacteria_Pasteurella_multocida 1340102
16. Bacteria_Lactococcus_lactis_subsp_lactis 1335222
17. Archaea_Aeropyrum_pernix 1280062
18. B_Neisseria_meningitidis_serogroupBstrainMC58 1178096
19. Archaea_Halobacterium_spNRC1 1178038
20. B_Neisseria_meningitidis_serogroupAstrainZ2491 1176104
21. Bacteria_thermotoga_maritima 1167344
22. Bacteria_Pyrococcus_horikoshiiOT3 1141216
23. Bacteria_Mycobacterium_leprae_strinTN 1080756
24. A_Methanobacterium_thermoautotrophicum_deltaH 1054752
25. Bacteria_Haemophilus_influenzaeRd 1045572
26. Bacteria_Campylobacter_jejuni 1020944
27. Bacteria_Helicobacter_pylori_strianJ99 990942
28. Bacteria_Helicobacter_pylori26695 986258
29. Archaea_Methanococcus_jannaschii 970558
30. Bacteriae_Aquifex_aeolicus 968068
31. Archaea_Thermoplasma_acidophilum 909164
32. Archaea_thermoplasma_volcanium 903228
33. Bacteria_Chlamydophila_pneumonieaeJ138 735350
34. Bacteria_Chlamydophila_pneumonieaCWL029 725492
35. Bacteria_Chlamydophila_pneumonieaeAR39 729896
36. Bacteria_Treponema_pallidum 703414
37. Bacteria_Chlamydia_muridarum 646712
38. Bacteria_Chlamydia_trachomatis 626142
39. Bacteria_Rickettsia_prowazekii_strain_MadridE 559828
40. Bacteria_Mycoplasma_pneumoniae 480870
41. Bacteria_Ureaplasma_urealyticum 457608
42. Bacteria_Buchnera_sp_APS 371470
43. Mycoplasma genitalium 352826
44. Bacteria_Borrelia_burgdorferi 300106
Genome size graph and varying standard deviation values
• Human (black, 22889476)
• Mesorhizobium (red, 4080256)
• P. aeruginosa (green, 3730192)
• E. coli O157:H7 (blue, 3229098)
• E. coli O157:H7 EDL933 (yellow, 3228100)
The organisms are listed in descending order of genome size. The distribution of n-gram standard deviations is compared with genome size.
Tail end of genome size and n-gram distribution of standard deviations
• Human (black, 22889476)
• Mesorhizobium (red, 4080256)
• P. aeruginosa (green, 3730192)
• E. coli O157:H7 (blue, 3229098)
• E. coli O157:H7 EDL933 (yellow, 3228100)
The human genome, though the largest, has n-grams that lie fewer standard deviations away from the mean than those of the smaller genomes.
Initial points: genome size and n-gram distribution of standard deviations
• Human (black, 22889476)
• Mesorhizobium (red, 4080256)
• P. aeruginosa (green, 3730192)
• E. coli O157:H7 (blue, 3229098)
• E. coli O157:H7 EDL933 (yellow, 3228100)
The human n-gram standard deviation values are almost equal to those of Mesorhizobium, although Mesorhizobium has a much smaller genome.
Genome size and n-gram distribution of standard deviations
• Human (black, 22889476)
• E. coli K12 (red, 2726558)
• M. tuberculosis (green, 2666338)
• B. subtilis (blue, 2442200)
• B. halodurans (yellow, 2384352)
• Synechocystis (brown, 2072748)
M. tuberculosis has very high n-gram standard deviation values. They exceed the human values, despite its smaller genome size.
Initial points of genome size and n-gram distribution of standard deviations
• Human (black, 22889476)
• E. coli K12 (red, 2726558)
• M. tuberculosis (green, 2666338)
• B. subtilis (blue, 2442200)
• B. halodurans (yellow, 2384352)
• Synechocystis (brown, 2072748)
The thickness of the lines indicates the genome size. The thinnest line represents E. coli K12. Mycobacterium tuberculosis shows the highest values.
Final points of genome size and n-gram distribution of standard deviations
• Human (black, 22889476)
• E. coli K12 (red, 2726558)
• M. tuberculosis (green, 2666338)
• B. subtilis (blue, 2442200)
• B. halodurans (yellow, 2384352)
• Synechocystis (brown, 2072748)
M. tuberculosis and all the other organisms shown here have n-grams with higher difference values than human.
Same genus / different species
• 4-grams in M. tuberculosis lie many more standard deviations from the mean than those in M. leprae
[Figure: Mycobacterium, M. tuberculosis versus M. leprae]
[Figure panels: Neisseria meningitidis, Thermotoga maritima, Synechocystis sp., Haemophilus influenzae, Human, Other organisms]
Conclusions
• n-grams that are at least 30 standard deviations away from the mean are significant candidates for genome signatures.
• Difference graphs: estimate the likelihood of an n-gram being observed in an organism.
• Genome size graphs: there is no specific relationship between the size of a genome and its standard deviation values.
• Same genus, different species (with genome size specified): there is a noticeable difference between the Mycobacterium species (M. leprae and M. tuberculosis).
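The first conclusion amounts to a threshold filter, which can be sketched as follows. The cutoff of 30 comes from the slide. Everything else is an assumption: the function name `signature_candidates`, the restriction to over-represented n-grams (following the definition of a genome signature as unusually frequent), and the invented z-score values.

```python
def signature_candidates(
    z_by_ngram: dict[str, float], threshold: float = 30.0
) -> list[tuple[str, float]]:
    """Return the n-grams at least `threshold` standard deviations above the mean,
    strongest first. Only over-represented n-grams are kept (an assumption)."""
    hits = [(gram, z) for gram, z in z_by_ngram.items() if z >= threshold]
    return sorted(hits, key=lambda item: item[1], reverse=True)


# Invented z-scores for placeholder n-grams, for illustration only.
toy_scores = {"AAAA": 3.5, "CCCC": 42.0, "DDDD": 31.2, "EEEE": -35.0}
candidates = signature_candidates(toy_scores)
print(candidates)
```

If strongly under-represented n-grams should also count, the comparison could use `abs(z)` instead.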
Current and future work
• Find n-gram signatures in E. coli.
• Explore the relationship between genome size and the distribution of n-gram standard deviations across different species of the same genus.
• Find more specific targets to differentiate species in terms of signature peptides for all 44 organisms in the study.