CISC 841 Bio Informatics

Non-additivity in protein–DNA binding
R. A. O’Flanagan, G. Paillard, R. Lavery and A. M. Sengupta
(A Review)
Presentation: Manoj Pillay
13 April 2006

Research Orientation: GENOMIC ANNOTATION
Concentration: Identifying Protein Binding Sites

Introduction
Brushing up on the prerequisite molecular biology basics
• DNA-Binding Protein – any protein that binds to double- or single-stranded DNA
• Binding Site – a region on a protein, DNA or RNA at which chemical bonds are formed with another molecule

Protein Classification
• Enzymes
• Structural Proteins
• Receptors, Kinases etc.
• Transcription Factors

Transcription Factors
• A transcription factor is a protein that binds DNA at a specific promoter or enhancer region or site, where it regulates transcription
• They regulate the production of all proteins
• Classes of Transcription Factors: General, Upstream, Inducible

So, why are we interested in just transcription factors?

Why does this paper exist?
With current technology, it is impossible to predict exactly where transcription factors are going to bind.

Approaches to the problem in the past
• Using PWM (position weight matrices)
• Using PSSM (position-specific scoring matrices)
• Using HMM (hidden Markov models)

A significant limitation
All these methods assume that the overall binding affinity of a given protein is made up of additive contributions from interactions at each nucleotide position within the binding site.

Experimental results challenge the assumption
- Mnt Repressor
- EGR1 Zn-finger
Observation: Correlation exists between neighbouring nucleotide positions
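
To make the additivity assumption concrete, here is a minimal sketch of a mononucleotide weight matrix score. The aligned sites, the pseudocount and the uniform background are illustrative assumptions, not data or parameters from the paper; the point is only that the score is a sum of independent per-position terms.

```python
import math

BASES = "ACGT"

# Hypothetical aligned binding sites (illustrative only, not from the paper).
SITES = ["TATAAA", "TATATA", "TATAAG", "TTTAAA"]

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Return one {base: log2-odds weight} dict per position."""
    pwm = []
    for i in range(len(sites[0])):
        column = [s[i] for s in sites]
        total = len(column) + pseudocount * len(BASES)
        pwm.append({
            b: math.log2((column.count(b) + pseudocount) / total / background)
            for b in BASES
        })
    return pwm

def score(pwm, seq):
    """Additive score: each position contributes independently of its neighbours."""
    return sum(col[b] for col, b in zip(pwm, seq))

pwm = build_pwm(SITES)
print(score(pwm, "TATAAA"), score(pwm, "GGGCCC"))
```

Because every position is scored on its own, no choice of weights can express a preference for a particular pair of neighbouring bases, which is exactly the limitation the experimental results point at.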

So then, do we really need to formulate a new approach?
• Why don’t we modify one of the existing approaches instead of spending a lot of money and energy on a new research line?
• Why don’t we just adhere to the existing conventions?
• Why don’t we take care of those correlations by replacing mononucleotide PWM representations with those based on dinucleotides or longer sequence elements?
• Can’t we extend the HMM approach by adding hidden layers to HMM formulations?
• Alternatively, can we pioneer an SVM-based approach to generalize the problem and then resolve it?
• Why do we have to make the thought process complicated for people?
• Why?

The Answer to all the WHY’s
LACK OF EXPERIMENTAL DATA: for most transcription factors, only a few binding sites are experimentally characterized.

Big deal! So then, why don’t we collect all those experimental characterizations and results and then modify our existing approach? Do we really have to innovate?

Yes! Because the technology for studying a complete binding site comprising 10-20 nucleotide positions is still under development:
- DNA microarrays
- Genomic SELEX
- Microarray-based chromatin immunoprecipitation assays
- SELEX-SAGE

OUR APPROACH
• Theoretical
• Uses ADAPT, a methodology for analyzing protein–DNA recognition mechanisms
• Takes into account sequences of 10 to 20 nucleotide positions
• Only sequences corresponding to the most stable complexes (those that fall within 5 kcal/mol of the best sequence) are studied to generate a weight matrix
• ADAPT allows us to calculate:
  • Eint – protein–DNA interaction energy
  • Edef – DNA deformation energy, the energy necessary to deform a free DNA segment to the structure it adopts when bound to the protein
• BINDING ENERGY: Etot = Eint + Edef
• Eint: direct recognition
• Edef: indirect recognition

Direct and indirect components of protein–DNA recognition

Edef is important to us because
• in some cases, such as TBP, binding introduces severe deformation
• apart from Eint, it is the other component that influences interactions between neighboring nucleotide pairs
• correlation can occur only if there is degeneracy in sequence preference, i.e. in the presence of Edef

We therefore understand that Edef is a necessary but not sufficient condition for correlation to exist.

Investigation of this research
• Correlation effects on the binding specificity of some prototypical protein–DNA complexes
• How effectively can correlation be incorporated into binding site prediction?

METHODOLOGY IN DETAIL
CALCULATING protein–DNA binding energies
• Tests were carried out on 3 proteins: TBP, BamHI and GCN4
• An optimal sequence, which exhibits the best binding characteristics, is generated from the set of all sequences. Let us call the binding energy of this sequence Eopt.

[Figure: energy scale marked at Eopt, Eopt + 5 kcal/mol and Eopt + 10 kcal/mol, with the labels Optimal Sequence, True Binding Sites, Discarded Sequences and True Non-Binding Sites]
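
The energy-cutoff labelling can be sketched as follows. The slide only lists the labels and the two cutoffs, so the mapping used here (within 5 kcal/mol of Eopt is a binding site, between 5 and 10 is discarded, above 10 is a non-binding site) is an assumption, as are the function name and the example energies.

```python
def classify_by_energy(energies, bind_cut=5.0, nonbind_cut=10.0):
    """Label sequences by their energy gap to the optimum (kcal/mol)."""
    e_opt = min(energies.values())  # energy of the optimal sequence
    labels = {}
    for seq, energy in energies.items():
        gap = energy - e_opt
        if gap <= bind_cut:
            labels[seq] = "binding"
        elif gap <= nonbind_cut:
            labels[seq] = "discarded"      # assumed: ambiguous middle band
        else:
            labels[seq] = "non-binding"    # assumed: far above the optimum
    return labels

# Hypothetical binding energies in kcal/mol (illustrative only).
energies = {"AAAA": -50.0, "AAAT": -47.0, "AACT": -43.0, "GGCC": -30.0}
labels = classify_by_energy(energies)
print(labels)
```

Discarding the middle band keeps the two training classes well separated, at the cost of ignoring sequences whose status is unclear.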

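Binding Site Length
$L_{tot} = N - \log M / \log 4$
N = total length of the DNA fragment
M = number of sequences with energies < cutoff

Derivation
• N base pairs give $4^N$ possible sequences
• If the M sequences selected by the cutoff criterion correspond to B unconstrained base pairs, then $4^B = M$, so $B = \log M / \log 4$
• $L_{tot} = N - B = N - \log M / \log 4$, the effective length of the protein binding site

ANALYZING CORRELATION
• Analysis is limited to neighboring nucleotide positions.
• Correlation compares the dinucleotide quantity at positions (i, i+1) with the sum of the mononucleotide quantities at i and i+1. In practice it is calculated as a change in entropy, $S_{i,i+1} - (S_i + S_{i+1})$
• Mononucleotide entropy: $S_i = -\sum_{\alpha} p_i(\alpha) \log_2 p_i(\alpha)$, with $0 \le S_i \le 2$
• Dinucleotide entropy: $S_{i,i+1} = -\sum_{\alpha,\beta} p_{i,i+1}(\alpha,\beta) \log_2 p_{i,i+1}(\alpha,\beta)$, with $0 \le S_{i,i+1} \le 4$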
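
A short sketch of both calculations on this slide, assuming standard Shannon entropies in bits. The correlation is returned as S_i + S_{i+1} - S_{i,i+1} (the mutual information, i.e. the entropy change with the sign flipped so that it is non-negative); that sign convention, the function names and the example sites are assumptions for illustration.

```python
import math
from collections import Counter

def effective_length(n, m):
    """L_tot = N - log(M) / log(4): positions effectively constrained by binding."""
    return n - math.log(m) / math.log(4)

def entropy_bits(items):
    """Shannon entropy (bits) of the empirical distribution of items."""
    counts = Counter(items)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def neighbor_correlation(sites, i):
    """S_i + S_{i+1} - S_{i,i+1}; zero when positions i and i+1 are independent."""
    s_i = entropy_bits(s[i] for s in sites)
    s_next = entropy_bits(s[i + 1] for s in sites)
    s_pair = entropy_bits(s[i:i + 2] for s in sites)
    return s_i + s_next - s_pair

# If 4**4 of the 4**12 possible 12-mers pass the cutoff, 8 positions are fixed.
print(effective_length(12, 4 ** 4))

# Hypothetical site sets: fully coupled neighbours versus independent ones.
print(neighbor_correlation(["AT", "AT", "GC", "GC"], 0))
print(neighbor_correlation(["AT", "AC", "GT", "GC"], 0))
```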

We introduce sequence lengths … and …. These lengths yield a quantitative measure of the correlation. It may be noted that …

EXTRACTING WEIGHT MATRIX PARAMETERS
• w_iα = …
• C0 is chosen in such a way that the best binding site scores 0 and poorer sites therefore have positive scores
• Note: this assumes that the binding probability is proportional to the exponential of Wm
• The assumption may not hold for our distribution, as we sample training set sequences using a cutoff criterion
• Therefore, considering the sharp cutoffs, we use an SVM for our mononucleotides
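
The expression for the weights is not preserved on this slide, so the following is only a sketch under a stated assumption: energy-like weights of the common log-odds form -ln(f_iα / p_α), with C0 then chosen so that the best site scores exactly 0. The sites, pseudocount and uniform background are hypothetical.

```python
import math

BASES = "ACGT"

def weight_matrix(sites, pseudocount=1.0, background=0.25):
    """Return (rows, c0); rows[i][base] is an energy-like weight (lower = better)."""
    n = len(sites)
    rows = []
    for i in range(len(sites[0])):
        column = [s[i] for s in sites]
        rows.append({
            # Assumed form: negative log-odds of observed frequency vs background.
            b: -math.log((column.count(b) + pseudocount)
                         / (n + pseudocount * len(BASES)) / background)
            for b in BASES
        })
    # C0 cancels the best achievable sum, so the best site scores 0.
    c0 = -sum(min(row.values()) for row in rows)
    return rows, c0

def wm_score(rows, c0, seq):
    """Score of a candidate site; 0 for the best site, positive otherwise."""
    return c0 + sum(row[b] for row, b in zip(rows, seq))

# Hypothetical aligned binding sites (illustrative only).
rows, c0 = weight_matrix(["TATAAA", "TATATA", "TATAAG", "TTTAAA"])
print(wm_score(rows, c0, "TATAAA"), wm_score(rows, c0, "GGGCCC"))
```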

The binding energy of the protein to a sequence is then given by …, where ε_iα is the free energy contribution from base α at the ith position.
• ε_iα is chosen to minimize the variance of … over the background distribution of sequences, subject to the constraint … for binding sites, which means that sequences with …

Generalizing WM and SVMs for dinucleotides, we find the following equations: …, where … are chosen to minimize the variances.
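
The equations on this slide are not preserved, so the sketch below only illustrates the general shape of the two models: an additive energy built from per-position terms, and a dinucleotide generalization that adds a nearest-neighbour term per step. The names eps and coupling and all numerical values are hypothetical; how the parameters are fitted is not shown.

```python
def mono_energy(eps, seq):
    """Additive model: sum of per-position contributions eps[i][base]."""
    return sum(eps[i][b] for i, b in enumerate(seq))

def di_energy(eps, coupling, seq):
    """Dinucleotide model: additive part plus a term for each neighbouring pair."""
    pair_part = sum(coupling[i].get(seq[i:i + 2], 0.0)
                    for i in range(len(seq) - 1))
    return mono_energy(eps, seq) + pair_part

# Hypothetical parameters for a 2-bp site (illustrative only).
eps = [
    {"A": 0.0, "C": 1.0, "G": 1.0, "T": 0.5},
    {"A": 0.0, "C": 1.0, "G": 1.0, "T": 0.5},
]
coupling = [{"TA": -1.0}]  # extra stabilisation only for the TA step

# Cost of changing A to T at position 0, next to A and next to C.
print(mono_energy(eps, "TA") - mono_energy(eps, "AA"),
      mono_energy(eps, "TC") - mono_energy(eps, "AC"))
print(di_energy(eps, coupling, "TA") - di_energy(eps, coupling, "AA"),
      di_energy(eps, coupling, "TC") - di_energy(eps, coupling, "AC"))
```

In the additive model the cost of a substitution is the same whatever the neighbouring base is; in the dinucleotide model it depends on the neighbour, which is what non-additivity means here.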

EVIDENCE FOR NON-ADDITIVITY
The optimal sequence for TBP is AGTATAATTAAA.
C0 is now calculated as ( - + … )

A diagonal, implying a perfect correlation between Wm scores and binding energies, would have ruled out our hypothesis that dinucleotide (or higher) dependencies are considerable. The variation between Figures 3A and 3B confirms that non-additivity arises in the process of binding and is dominated by interactions between adjacent nucleotides. Figure 3C further reinforces this.

ANALYZING NON-ADDITIVITY WITHIN BINDING SITE - analysis by calculating binding site lengths - analysis by calculating entropies

PREDICTING BINDING SITES TAKING NON-ADDITIVITY INTO ACCOUNT
• 200 sequences among the 880 binding sites are used as inputs (TBP)
• The resulting weight matrices and energy matrices are used to assign information scores and predicted binding energies to every candidate site

So then, we have an SVM with the following characteristics:
• True Positives: the algorithm identifies a true binding site as such
• True Negatives: the algorithm identifies a true non-binding site as such
• False Positives: the algorithm declares a true non-binding site to be a binding site
• False Negatives: the algorithm declares a true binding site to be a non-binding site

• The probability of misclassification is much higher with a mononucleotide model, i.e. when the existence of non-additivity is overlooked. Recall the discussion on slide 12.
• The threshold parameter u can be adjusted to yield even better results
• Performance is reported as the positive predictive value, TP/(TP+FP), rather than the false positive rate, FP/(FP+TN), that is often reported for classifiers. This is because FPs tend to be very small for most reasonable values in our calculations.
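
The four outcome counts and the two ratios mentioned above can be sketched as follows. The rule "predict a binding site when the score is at or below the threshold u" is an assumption about the direction of the threshold, and the scores and labels are hypothetical.

```python
def confusion_counts(scores, truth, u):
    """Count TP, TN, FP, FN when sites with score <= u are predicted to bind."""
    tp = tn = fp = fn = 0
    for seq, value in scores.items():
        predicted = value <= u   # assumed direction: low score means binding
        actual = truth[seq]
        if predicted and actual:
            tp += 1
        elif predicted and not actual:
            fp += 1
        elif not predicted and actual:
            fn += 1
        else:
            tn += 1
    return tp, tn, fp, fn

def positive_predictive_value(tp, fp):
    """TP / (TP + FP): fraction of predicted sites that are true sites."""
    return tp / (tp + fp) if (tp + fp) else float("nan")

def false_positive_rate(fp, tn):
    """FP / (FP + TN): fraction of true non-sites wrongly predicted to bind."""
    return fp / (fp + tn) if (fp + tn) else float("nan")

# Hypothetical scores and labels (illustrative only).
scores = {"s1": 0.5, "s2": 1.5, "s3": 2.5, "s4": 4.0}
truth = {"s1": True, "s2": True, "s3": False, "s4": False}

for u in (1.0, 2.0, 3.0):
    tp, tn, fp, fn = confusion_counts(scores, truth, u)
    print(u, (tp, tn, fp, fn), positive_predictive_value(tp, fp))
```

Sweeping u in this way shows the trade-off: a stricter threshold loses true sites (more FN), a looser one admits non-sites (more FP).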

[Figure: two panels, Mononucleotide Model and Dinucleotide Model, for training sets ranging from 2 to 200 TBP binding sites]
Notice that the dinucleotide model outperforms the mononucleotide model as the size of the training set increases. WHY?

When the size of the training set increases, the fraction of misclassified sequences increases.
[Figure: FP+FN against size of training set, for GCN4 and TBP]
Unless nearest-neighbor interactions are taken into account, the increases are sharp!
- We are therefore in a position to say that a minimum number of binding sites is necessary before it becomes advantageous to introduce correlations.

- The point to note from this graph is that there is a significant improvement over the results in TABLE 2, where only 3 or 4 sequences were considered.

EXPERIMENTAL SIGNATURE OF CORRELATION
- A preliminary analysis of the dimeric protein CAP is conducted, as our proteins cannot be studied experimentally without high-throughput technology.
- 76 binding sites are confirmed for CAP using our cutoff criterion

CONCLUSION
• DNA deformation within a protein–DNA complex leads to significant non-additivity
• Effects are more or less limited to neighboring nucleotide interactions
• Non-additivity may be relevant only to a limited number of dinucleotide steps in the target site
• Both SVM and WM approaches may be used, although the SVM approach is found to slightly outperform the WM approach
• Improvement in predictive power depends upon the size of the training set
• Non-additivity should be taken into account only for those steps where it is really needed; otherwise the dinucleotide model will overfit, which means poor predictive power
• All findings rest on the reliability of ADAPT. ADAPT has not failed in the past, as experimental results generally do not vary much from its simulated results, but of course we do not overlook the possibility that one day there might be a protein which…

Thank you everyone for participation and patient attention to my presentation of the review on Non-additivity in protein-DNA binding.
