
S. Krishnaswamy Centre of Excellence in Bioinformatics School of Biotechnology

Information theoretic perspective of whole genome sequences. S. Krishnaswamy, Centre of Excellence in Bioinformatics, School of Biotechnology, Madurai Kamaraj University, Madurai 625 021. mkukrishna@rediffmail.com, mkukrishna@gmail.com


Presentation Transcript


  1. Information theoretic perspective of whole genome sequences • S. Krishnaswamy, Centre of Excellence in Bioinformatics, School of Biotechnology, Madurai Kamaraj University, Madurai 625 021 • mkukrishna@rediffmail.com, mkukrishna@gmail.com

  2. Alphabet + Grammar + Vocabulary → Language • Binary: 0, 1 • Text: a b c d e f . . m M , < . . • DNA: A T C G • Protein: A C D E F G H I K L M N P Q R S T V W Y • a b . .

  3. We are but aliens looking at the world of molecules. The question is: Can we learn the language of the molecules?

  4. Hartley, 1928; Shannon, 1948 • Quantifying information: $H = -\sum_i p_i \log_2 p_i$ (bits per symbol) • H is the average uncertainty and $p_i$ is the probability of occurrence of the ith symbol of the alphabet; the summation is over all the symbols in the language
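The formula on this slide can be tried directly on a short string. The following is a minimal sketch, assuming symbol probabilities are estimated from observed frequencies; the function name `shannon_entropy` is illustrative and does not come from the presentation.

```python
from collections import Counter
from math import log2


def shannon_entropy(sequence: str) -> float:
    """Return H = -sum(p_i * log2(p_i)) in bits per symbol.

    Probabilities are estimated as observed symbol frequencies
    (an assumption of this sketch).
    """
    if not sequence:
        raise ValueError("sequence must not be empty")
    total = len(sequence)
    counts = Counter(sequence)
    # p * log2(1/p) is the same term as -p * log2(p), written to avoid -0.0.
    return sum((n / total) * log2(total / n) for n in counts.values())


if __name__ == "__main__":
    # Four equally frequent symbols give the maximum for a 4-letter alphabet.
    print(shannon_entropy("ATCG"))
    # A single repeated symbol carries no uncertainty.
    print(shannon_entropy("AAAA"))
```

For a DNA alphabet the value ranges from 0 bits (one base only) to 2 bits (all four bases equally frequent).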

  5. Information: the reduction in uncertainty caused when a string of symbols is received through a noisy channel • Gatlin (1972) extended the formalism to allow application to biological sequences • Tom Schneider (http://www-lmmb.ncifcrf.gov/~toms/): popularising the application of information theory to molecular analysis

  6. Sequence logos
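A sequence logo stacks the letters at each aligned position to a total height equal to the information at that position. The sketch below assumes the commonly used form R(l) = 2 - H(l) bits for DNA and leaves out the small-sample correction; the function name and the toy sites in the usage note are hypothetical.

```python
from collections import Counter
from math import log2


def position_information(aligned_sites: list[str]) -> list[float]:
    """Information in bits at each column of equal-length DNA sites.

    Uses R(l) = 2 - H(l); the small-sample correction term is
    deliberately omitted in this sketch.
    """
    if not aligned_sites:
        raise ValueError("need at least one site")
    length = len(aligned_sites[0])
    if any(len(site) != length for site in aligned_sites):
        raise ValueError("all sites must have the same length")
    info = []
    for column in zip(*aligned_sites):
        n = len(column)
        counts = Counter(column)
        # Column entropy H(l) from observed base frequencies.
        entropy = sum((c / n) * log2(n / c) for c in counts.values())
        info.append(2.0 - entropy)
    return info
```

For the toy alignment `["AC", "AG", "AT", "AA"]` the first column is fully conserved (2 bits) and the second is uniform (0 bits). In a logo, each letter's height is its frequency multiplied by the column's information.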

  7. T. D. Schneider et al., J. Mol. Biol., 188, 415-431, 1986 • R. M. Stephens and T. D. Schneider, J. Mol. Biol., 228, 1124-1136, 1992

  8. Schneider (2000) Nucl Acid Res 28:2794-2799

  9. Schneider (2000) Nucl Acid Res 28:2794-2799

  10. Sub-classification of HNHc class of proteins and identification of commonality in the His-Me endonuclease superfamily • Preeti Mehta, Krishnamohan Katta and S. Krishnaswamy. Sub-classification of HNHc class of proteins and identification of commonality in the His-Me endonuclease superfamily. Protein Science (2004), 13:295–300.

  11. Subset classification of HNHc domain family • The HNHc domain family consists of a range of DNA-cutting proteins (homing endonucleases, recombinases, RE, toxins) • It belongs to the His-Me endonuclease superfamily along with His-Cys box, Sm endonuclease and T4 endo VII proteins • It is characterized by the presence of a central conserved Asn/His residue flanked by conserved His (N-terminal) and His/Asn/Glu (C-terminal) residues at some distance • The family could be sub-classified into at least 35 subsets by iterative refinement of HMM profiles

  12. McrA: GICENCGKNAPFYLNDGNPYLEVHHVIPLSSGGADTTDNCVALCPNCHRELHYS

  13. Gatlin, 1972

  14. Highlights of genome analysis • 141 prokaryotic chromosomes • 157 eukaryotic chromosomes • Provides a framework for understanding messaging strategies • Evolutionary aspects of genomes • Server to calculate information content • Preeti et al. (submitted for publication)

  15. Eukaryotes and prokaryotes • Despite size and compositional variations, neither prokaryotic nor eukaryotic genomes deviate significantly from an equiprobable and random situation, but their distributions differ.

  16. Chargaff's inter- and intra-strand A=T and G=C rules are broadly adhered to in all genomes.
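The intra-strand part of this statement can be checked on a single strand by comparing base counts. The following is a minimal sketch with a hypothetical helper, `chargaff_skews`, which reports the relative imbalance of A versus T and of G versus C; values near zero indicate agreement with the rule.

```python
from collections import Counter


def chargaff_skews(strand: str) -> tuple[float, float]:
    """Return (|A-T|/(A+T), |G-C|/(G+C)) for one DNA strand.

    Both values are 0.0 when the strand obeys A=T and G=C exactly.
    Symbols other than A, T, G and C are ignored.
    """
    counts = Counter(strand.upper())
    a, t, g, c = (counts[base] for base in "ATGC")
    at_skew = abs(a - t) / (a + t) if a + t else 0.0
    gc_skew = abs(g - c) / (g + c) if g + c else 0.0
    return at_skew, gc_skew
```

The inter-strand rule follows from base pairing itself, so only the within-strand counts need to be measured.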

  17. For prokaryotes, information density varies from 0.022 bits to 0.263 bits (0.083 ± 0.052).
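The slides do not spell out how information density is computed. The sketch below assumes the first-order definitions usually attributed to Gatlin (1972): D1 = log2(4) - H1 (divergence from equiprobability), D2 = H1 - H_M (divergence from independence, with H_M the conditional entropy of a base given its predecessor), and Id = D1 + D2. The trinucleotide term is not included, and the sequence is treated as circular so that single-base and pair frequencies agree; both choices are assumptions of this sketch, as is the function name.

```python
from collections import Counter
from math import log2


def gatlin_divergences(seq: str, alphabet_size: int = 4) -> dict[str, float]:
    """Return D1, D2 and Id = D1 + D2 in bits for one sequence.

    Assumed definitions (first-order only):
      D1 = log2(alphabet_size) - H1
      D2 = H1 - H_M, where H_M = -sum p(i,j) * log2 p(j|i)
    The sequence is wrapped around (last base pairs with first).
    """
    n = len(seq)
    if n < 2:
        raise ValueError("sequence must contain at least two symbols")
    singles = Counter(seq)
    pairs = Counter(zip(seq, seq[1:] + seq[0]))
    # H1: entropy of single-base frequencies.
    h1 = sum((c / n) * log2(n / c) for c in singles.values())
    # H_M: entropy of the next base given the current one.
    h_m = sum(
        (c / n) * log2(singles[first] / c)
        for (first, _second), c in pairs.items()
    )
    d1 = log2(alphabet_size) - h1
    d2 = h1 - h_m
    return {"D1": d1, "D2": d2, "Id": d1 + d2}
```

As a toy check, `gatlin_divergences("ATAT")` gives D1 = 1 bit (only two of four bases are used) and D2 = 1 bit (each base fully determines the next). Real genomes sit far closer to zero, as the range quoted on this slide shows.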

  18. Chromosomes in eukaryotic organisms maintain similar information densities (Id) suggestive of common informational restraints.

  19. A. thaliana, human chromosomes and Rattus norvegicus (not shown) • Id values are also similar for the two arms of the chromosomes.

  20. What is the smallest unit of a chromosome that maintains a constant information density? Statistical similarity between the various chromosomes of yeast has been demonstrated previously (Li et al., 1998)

  21. Two hypotheses, ‘single common origin’ or ‘duplication/polyploidization of a limited set of chromosomes’, were suggested to explain the uniformity seen in the various chromosomes of an organism (Von Bertalanffy, 1975) • A few Id values, rather than one, should then be seen • Polyploidization of a few related sequences of a common origin, a mix of the two hypotheses, could explain the constancy amongst the chromosomes • Or is it a result of functional constraints imposed by the need to use common cellular machinery?

  22. Variation of |%AT-50| with information density (panels: prokaryotes, eukaryotes). The thinner line corresponds to the D1 values for the respective genomes. The inverse correlation of (RD2+RD3) with |%AT-50| and the trend of D1 with |%AT-50| illustrate the balance between scalar (variation of nucleotide composition) and vector (variation in the order of occurrence of nucleotides) strategies to combat error.

  23. Inverse correlation between the contributions of compositional redundancy (RD1) and Shannon redundancy (dinucleotide (RD2) and trinucleotide (RD3) frequency distributions) • (RD2+RD3) with |%AT-50|: correlation values of -0.93 and -0.83 for prokaryotes and eukaryotes, respectively • The trend of D1 with |%AT-50| follows that of Id except at compositional frequencies closer to the equiprobable (50%) • RD1 with RD2: -0.89 and -0.90 for prokaryotes and eukaryotes • RD1 with RD3: -0.84 and -0.58 for prokaryotes and eukaryotes
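The two quantities behind these figures, the compositional deviation |%AT-50| and a correlation coefficient, are simple to compute. The sketch below assumes the reported values are Pearson coefficients, which the slides do not state; the helper names `at_deviation` and `pearson` are illustrative.

```python
from math import sqrt


def at_deviation(seq: str) -> float:
    """Return |%AT - 50| for a DNA sequence (non-ACGT symbols ignored)."""
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    if at + gc == 0:
        raise ValueError("sequence contains no A, T, G or C")
    return abs(100.0 * at / (at + gc) - 50.0)


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return sxy / sqrt(sxx * syy)
```

With one |%AT-50| value and one (RD2+RD3) value per genome, `pearson` applied to the two lists would give a coefficient of the kind quoted above; a value near -1 means the dinucleotide and trinucleotide contribution falls as composition moves away from 50% AT.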

  24. This suggests that, to combat error, genomes balance strategies involving variation in nucleotide composition against strategies involving variation in the order of occurrence of nucleotides.

  25. Fidelity and error correction • E.g., the process of xerography depends on: hardware (the machine’s capability); compositional and letter arrangements of the text • Genome duplication and transmission require: mechanistic cellular molecular machinery; composition bias and arrangement of the nucleotide bases in the genome • The analysis looks at the messaging strategy built into the arrangement of the letters in the genome sequences.

  26. Possibly, the presence of a number of proof-reading mechanisms at various levels in living systems (both within organisms and, in evolution, in the form of natural selection) allows biological language strings to maintain higher potential information at the expense of retrievable information, thereby providing the possibility of higher message variety.

  27. Acknowledgements Preeti Mehta, Srividhya K.V Alaguraj V Hirendra Vikram Govind, M.K. Ramneek Gupta DBT BTIS Bioinformatics Thank you
