S. Krishnaswamy Centre of Excellence in Bioinformatics School of Biotechnology

Information theoretic perspective of whole genome sequences
S. Krishnaswamy
Centre of Excellence in Bioinformatics, School of Biotechnology
Madurai Kamaraj University, Madurai 625 021
mkukrishna@rediffmail.com, mkukrishna@gmail.com

Alphabet + Grammar + Vocabulary = Language
• 0, 1
• a b c d e f . . m M , < . .
• A T C G
• A C D E F G H I K L M N P Q R S T V W Y
• a b . .

We are but aliens looking at the world of molecules. The question is: Can we learn the language of the molecules?

Quantifying information (Hartley, 1928; Shannon, 1948)
$H = -\sum_i p_i \log_2 p_i$ (bits per symbol)
H is the average uncertainty and $p_i$ is the probability of occurrence of the ith symbol in the alphabet; the summation is over all the symbols in the language.
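As a minimal sketch of the formula on this slide, the snippet below estimates H from the observed symbol frequencies of a string. The function name `shannon_entropy` and the example strings are illustrative assumptions, not part of the talk.

```python
import math
from collections import Counter


def shannon_entropy(sequence):
    """Return H = -sum(p_i * log2(p_i)) in bits per symbol.

    p_i is estimated as the observed frequency of each symbol.
    """
    counts = Counter(sequence)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Four equally frequent symbols give the maximum for a 4-letter alphabet.
uniform = shannon_entropy("ATCG" * 10)  # 2.0 bits per symbol
# A strongly biased composition carries less uncertainty per symbol.
biased = shannon_entropy("AAAAAAAT")
```

The uniform case reaches log2(4) = 2 bits per symbol, which is the reference point against which deviations of real genomes are measured.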

Information: the reduction in uncertainty caused when a string of symbols is received through a noisy channel.
• Gatlin (1972) extended the formalism to allow application to biological sequences.
• Tom Schneider (http://www-lmmb.ncifcrf.gov/~toms/): popularising the application of information theory to molecular analysis.

Sequence logos

• T. D. Schneider et al., J. Mol. Biol., 188, 415-431, 1986
• R. M. Stephens and T. D. Schneider, J. Mol. Biol., 228, 1124-1136, 1992
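The quantity plotted as stack height in a sequence logo can be sketched as the information at each aligned position, log2(alphabet size) minus the entropy of that column. The version below is a simplified illustration: it omits the small-sample correction used in the published method, does not handle gaps, and the aligned sites in `toy_sites` are hypothetical, not data from the talk.

```python
import math
from collections import Counter


def column_information(aligned_sites, alphabet_size=4):
    """Information (bits) per aligned position: log2(alphabet_size) - H(column).

    Simplified sketch: no small-sample correction, no gap handling.
    """
    h_max = math.log2(alphabet_size)
    info = []
    for column in zip(*aligned_sites):
        counts = Counter(column)
        n = len(column)
        h = -sum((c / n) * math.log2(c / n) for c in counts.values())
        info.append(h_max - h)
    return info


# Hypothetical aligned DNA sites, all of the same length.
toy_sites = ["TATAAT", "TATGAT", "TACAAT", "TATACT"]
bits = column_information(toy_sites)
```

A fully conserved column scores the maximum of 2 bits for DNA, while a column with mixed bases scores less.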

Schneider (2000) Nucleic Acids Res 28:2794-2799

Sub-classification of HNHc class of proteins and identification of commonality in the His-Me endonuclease superfamily
Preeti Mehta, Krishnamohan Katta and S. Krishnaswamy. Protein Science (2004), 13:295–300.

Subset classification of the HNHc domain family
• The HNHc domain family consists of a range of DNA-cutting proteins (homing endonucleases, recombinases, RE, toxins).
• It belongs to the His-Me endonuclease superfamily, along with the His-Cys box, Sm endonuclease and T4 endo VII proteins.
• It is characterized by a central conserved Asn/His residue flanked by conserved His (N-terminal) and His/Asn/Glu (C-terminal) residues at some distance.
• The family could be sub-classified into at least 35 subsets by iterative refinement of HMM profiles.

McrA: GICENCGKNAPFYLNDGNPYLEVHHVIPLSSGGADTTDNCVALCPNCHRELHYS

Gatlin, 1972
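Gatlin's formalism is usually stated in terms of two divergences: D1, the divergence from equiprobability (log2 of the alphabet size minus the single-base entropy H1), and D2, the divergence from independence (H1 minus the first-order Markov entropy H_M), with information density Id = D1 + D2 and relative contributions RD1 = D1/Id and RD2 = D2/Id. The sketch below follows those commonly quoted definitions. The transcript does not say exactly how the talk's values were computed, nor how the trinucleotide term RD3 was defined, so the function name `gatlin_measures` and the details here are assumptions.

```python
import math
from collections import Counter


def gatlin_measures(seq, alphabet="ACGT"):
    """Return D1, D2, Id, RD1 and RD2 (bits) for a nucleotide string.

    Assumes Gatlin's commonly quoted first-order definitions; symbols
    outside `alphabet` (e.g. N) are dropped before counting.
    """
    bases = [b for b in seq.upper() if b in alphabet]
    n = len(bases)
    if n < 2:
        raise ValueError("need at least two valid bases")

    # H1: entropy of the single-base composition.
    mono = Counter(bases)
    h1 = -sum((c / n) * math.log2(c / n) for c in mono.values())

    # H_M = -sum_ij p(i, j) * log2 p(j | i), from overlapping dinucleotides.
    pairs = Counter(zip(bases, bases[1:]))
    first = Counter(bases[:-1])
    n_pairs = n - 1
    h_m = -sum(
        (c / n_pairs) * math.log2(c / first[i]) for (i, _j), c in pairs.items()
    )

    d1 = math.log2(len(alphabet)) - h1  # divergence from equiprobability
    d2 = h1 - h_m                       # divergence from independence
    i_d = d1 + d2                       # information density
    rd1 = d1 / i_d if i_d else 0.0
    rd2 = d2 / i_d if i_d else 0.0
    return {"D1": d1, "D2": d2, "Id": i_d, "RD1": rd1, "RD2": rd2}


# A perfectly periodic toy string: equiprobable bases, fully determined order.
measures = gatlin_measures("ACGT" * 50)
```

For the toy string the composition is uniform, so D1 is zero, while each base fully determines the next, so all of the information density comes from D2. Real genomes sit close to the opposite extreme, with small D1 and D2.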

Highlights of genome analysis
• 141 prokaryotic chromosomes
• 157 eukaryotic chromosomes
• Provides a framework for understanding messaging strategies
• Evolutionary aspects of genomes
• Server to calculate information content
Preeti et al. (submitted for publication)

Eukaryotes and prokaryotes
• Despite size and compositional variations, neither prokaryotic nor eukaryotic genomes deviate significantly from an equiprobable and random situation. But their distributions are different.

Inter and intra-strand A=T and G=C rules of Chargaff are broadly adhered to in all genomes.
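The intra-strand form of the rule can be checked on a single strand by comparing the counts of A with T and of G with C. The sketch below also returns |%AT-50|, the compositional measure used later in the talk. The function name `strand_composition` and the short test strings are illustrative assumptions.

```python
from collections import Counter


def strand_composition(strand):
    """Return A/T ratio, G/C ratio and |%AT - 50| for one DNA strand.

    Only A, C, G and T are counted; a ratio is NaN when its denominator is 0.
    """
    counts = Counter(b for b in strand.upper() if b in "ACGT")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no A, C, G or T found")
    at_ratio = counts["A"] / counts["T"] if counts["T"] else float("nan")
    gc_ratio = counts["G"] / counts["C"] if counts["C"] else float("nan")
    at_percent = 100.0 * (counts["A"] + counts["T"]) / total
    return {
        "A/T": at_ratio,
        "G/C": gc_ratio,
        "abs_AT_minus_50": abs(at_percent - 50.0),
    }


balanced = strand_composition("AATTGGCC")
skewed = strand_composition("AAATGC")
```

Ratios close to 1 on a single strand indicate adherence to the intra-strand rule; the inter-strand rule follows directly from base pairing.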

For prokaryotes, the information density varies from 0.022 bits to 0.263 bits (0.083 ± 0.052).

Chromosomes in eukaryotic organisms maintain similar information densities (Id) suggestive of common informational restraints.

A. thaliana, human chromosomes and Rattus norvegicus (not shown)
• Id values are also similar for the two arms of the chromosomes.

What is the smallest unit of a chromosome that maintains a constant information density?
• Statistical similarity between the various chromosomes of yeast has been demonstrated previously (Li et al., 1998).

• Two hypotheses, ‘single common origin’ or ‘duplication/polyploidization of a limited set of chromosomes’, were suggested to explain the uniformity seen in the various chromosomes of an organism (Von Bertalanffy, 1975).
• Expectation: a few Id values, rather than one, should be seen.
• Polyploidization of a few related sequences of a common origin, a mix of the two hypotheses, could explain the constancy amongst the chromosomes.
• A result of functional constraints imposed by the need to use common cellular machinery?

Figure: Variation of |%AT-50| with information density (panels: prokaryotes, eukaryotes). The thinner line corresponds to the D1 values for the respective genomes. The inverse correlation of (RD2+RD3) with |%AT-50| and the trend of D1 with |%AT-50| illustrate the balance between scalar (variation of nucleotide composition) and vector (variation in the order of occurrence of nucleotides) strategies to combat error.

Inverse correlation: contribution of compositional redundancy (RD1) and Shannon redundancy (dinucleotide (RD2) and trinucleotide (RD3) frequency distributions)
• (RD2+RD3) with |%AT-50|: correlation values of -0.93 and -0.83 for prokaryotes and eukaryotes, respectively.
• The trend of D1 with |%AT-50| follows that of Id, except at compositional frequencies closer to the equiprobable (50%).
• RD1 with RD2: -0.89 and -0.90 for prokaryotes and eukaryotes, respectively.
• RD1 with RD3: -0.84 and -0.58 for prokaryotes and eukaryotes, respectively.

This suggests that, to combat error, genomes balance strategies involving variation in nucleotide composition against strategies involving variation in the order of occurrence of nucleotides.

Fidelity and error correction
• E.g., the process of xerography:
  • hardware (the machine’s capability)
  • compositional and letter arrangements of the text
• Genome duplication and transmission requires:
  • mechanistic cellular molecular machinery
  • composition bias and arrangement of the nucleotide bases in the genome
• The analysis looks at the messaging strategy built into the arrangement of the letters in the genome sequences.

Possibly the presence of a number of proof-reading mechanisms at various levels in living systems (both within organisms and in evolution in the form of natural selection) allows biological language strings to maintain higher potential information at the expense of retrievable information, thereby providing the possibility of higher message variety.

Acknowledgements
Preeti Mehta, Srividhya K.V, Alaguraj V, Hirendra Vikram Govind, M.K. Ramneek Gupta
DBT BTIS Bioinformatics
Thank you
