360 likes | 450 Views
Information theoretic. perspective of whole genome sequences. S. Krishnaswamy Centre of Excellence in Bioinformatics School of Biotechnology Madurai Kamaraj University Madurai 625 021. mkukrishna@rediffmail.com mkukrishna@gmail.com. Alphabet + Grammar + vocabulary Language 0, 1
E N D
Information theoretic perspective of whole genome sequences S. Krishnaswamy Centre of Excellence in Bioinformatics School of Biotechnology Madurai Kamaraj University Madurai 625 021 mkukrishna@rediffmail.com mkukrishna@gmail.com
Alphabet + Grammar + vocabulary Language 0, 1 a b c d e f . . m M , < . . A T C G A C D E F G H I K L M N P Q R S T V W Y a b . .
We are but aliens looking at the world of molecules. The question is: Can we learn the language of the molecules?
Hartley 1928 Shannon, 1948Quantifying information H = - pi log2 pi (bits per symbol) H is the average uncertainty, and pi is the probability of occurrence of the ith symbol in the set of alphabets, the summation is over all the symbols in the language
Information - the reduction in uncertainty caused when a string of symbols is received through a noisy channel Gatlin, 1972 extended the formalism to allow application to biological sequences Tom Schneider (http://www-lmmb.ncifcrf.gov/~toms/) Popularising application of information theory to molecular analysis
T. D. Schneider et al J. Mol. Biol., 188, 415-431, 1986 • R. M. Stephens and T. D. Schneider J. Mol. Biol., 228, 1124-1136 1992
Sub-classification of HNHc class of proteins and identification of commonality in the His-Me endonuclease superfamily Preeti Mehta, Krishnamohan Katta and S Krishnaswamy Sub-classification of HNHc class of proteins and identification of commonality in the His-Me endonuclease superfamily. Protein Science(2004), 13:295–300.
Subset classification of HNHc domain family • The HNHc domain family consists of a range of DNA cutting proteins (Homing endonucleases, recombinases, RE, toxins) • It belongs to the His-Me Endonuclease superfamily along with His-Cys box, Sm endonuclease and T4 endo VII proteins • Is characterized by presence of a central conserved Asn/His residue flanked by conservedHis(N-terminal) and His/Asn/Glu (C-terminal) residues at some distance. • The family could be sub-classified into atleast 35 subsets by iterative refinement of HMM profiles
McrA: GICENCGKNAPFYLNDGNPYLEVHHVIPLSSGGADTTDNCVALCPNCHRELHYS
Highlights of genome analysis • 141 prokaryotic chromosomes • 157 eukaryotic chromosomes • Provides a framework for understanding messaging strategies • Evolutionary aspects of genomes • Server to calculate Informationt content Preeti et al (submitted for publication)
Eukaryotes Prokaryotes • Despite size and compositional variations, both prokaryotic and eukaryotic genomes do not deviate significantly from an equiprobable and random situation. But their distributions are different.
Inter and intra-strand A=T and G=C rules of Chargaff are broadly adhered to in all genomes.
For prokaryotes : Variation of information density 0.022 bits to 0.263 bits (0.083±0.052).
Chromosomes in eukaryotic organisms maintain similar information densities (Id) suggestive of common informational restraints.
A. thaliana, human chromosomes and Rattus norvegicus (not shown) • Id values are similar also for the two arms of the chromosomes.
What is the smallest unit of a chromosome that maintains a constant information density? Statistical similarity between the various chromosomes of yeast has been demonstrated previously (Li et al., 1998)
Two hypotheses, ‘single common origin’ or ‘duplication/polyploidization of a limited set of chromosome’ were suggested to explain the uniformity seen in the various chromosomes of an organism (Von Bertalanffy, 1975) : few rather than one Id should be seen • Polyploidization of a few related sequences of a common origin, a mix of the two hypotheses, could explain the constancy amongst the chromosomes. • A result of functional constraints imposed by the need to use common cellular machinery ?
Variation of |%AT-50| with information density. The thinner line corresponds to the D1 values for the respective genomes. The inverse correlation of (RD2+RD3) with |%AT-50| and the trend of D1 with |%AT-50| illustrates the balance between scalar (variation of nucleotides composition) and vector (variation in the order of occurrence of nucleotides) strategies to combat error prokaryotes eukaryotes
Inverse correlation: contribution of compositional redundancy (RD1) and Shannon redundancy (dinucleotide (RD2) and trinucleotide (RD3) frequency distributions ) • (RD2+RD3) with | %AT-50 |. Correlation values -0.93 and -0.83 for prokaryotes and eukaryotes. • D1 with the |%AT-50| follows that of Id except at compositional frequencies closer to the equiprobable (50%) • RD1 with RD2: -0.89 and -0.90 prokaryotes and eukaryotes • RD1 with RD3: -0.84 and -0.58 for prokaryotes and eukaryotes
Suggests to combat error A balance between strategies involving variation in nucleotide composition and variation in the order of occurrence of nucleotides.
Fidelity and error correction • Eg: the process of xerography • hardware (the machine’s capability) • compositional and letter arrangements of the text • Genome duplication and transmission requires • Mechanistic cellular molecular machinery • Composition bias and arrangement of the nucleotide bases in the genome. • Analysis looks at the messaging strategy built into the arrangement of the letters in the genome sequences.
Possibly the presence of a number of proof-reading mechanisms at various levels in the living systems (both within organisms and in evolution in the form of natural selection) • allows biological language strings to maintain higher potential information at the expense of retrievable information thereby providing the possibility of higher message variety.
Acknowledgements Preeti Mehta, Srividhya K.V Alaguraj V Hirendra Vikram Govind, M.K. Ramneek Gupta DBT BTIS Bioinformatics Thank you