TF-DNA binding dependency A progress report. March 17, 2010 Hugo Willy. Outline. Re-Introduction of my problem Current state of affair Known dependency factor 1 – Rotamer Known dependency factor 2 – Water Known dependency factor 3 – DNA flexibility Some thoughts on what to do next.
TF-DNA binding dependency: A progress report • March 17, 2010 • Hugo Willy
Outline • Re-Introduction of my problem • Current state of affairs • Known dependency factor 1 – Rotamer • Known dependency factor 2 – Water • Known dependency factor 3 – DNA flexibility • Some thoughts on what to do next
Re-Introduction • I am working on finding a dependency model of TF-DNA binding. • What is TF-DNA binding? • If you ask this, you may be in the wrong room. • It is known that different TFs prefer to bind different DNA sequences. • Classic example: the TATA-box-binding protein binds the sequence “TATA”.
Re-Introduction (2) • It is commonly assumed that each position in T-A-T-A contributes independently to the binding energy. • That is to say, some guys (residues) from the TF bind the first “T”, some others bind the second position “A”, and so on. • If the sequence becomes CATA, the outcome depends on how much the guys who bind the 1st position like the new “C”. If they are OK with it, the binding energy may change a little but the TF still binds. • Otherwise, too bad.
Re-Introduction (3) • One such model, a very popular one, is the position-specific scoring matrix (PSSM). • It has been shown to be very good at estimating the real binding sites of many TFs. • However, some were curious whether the model holds for all TFs.
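To make the independence assumption concrete, here is a minimal sketch of a log-odds PSSM. The helper names (`build_pssm`, `score`), the pseudocount, the uniform background, and the toy alignment are all assumptions for illustration, not data or code from the talk.

```python
import math

BASES = "ACGT"

def build_pssm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PSSM (one dict per position) from aligned, equal-length sites."""
    pssm = []
    for pos in range(len(sites[0])):
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pssm.append({b: math.log2(counts[b] / total / background) for b in BASES})
    return pssm

def score(pssm, seq):
    """Additive score: every position contributes independently of the others."""
    return sum(column[base] for column, base in zip(pssm, seq))

# Hypothetical toy alignment, not real binding data.
toy_sites = ["TATA", "TATA", "TATA", "CATA", "TTTA"]
pssm = build_pssm(toy_sites)
print(score(pssm, "TATA"), score(pssm, "CATA"), score(pssm, "GGGG"))
```

Because the score is a plain sum over positions, changing TATA to CATA changes only the first term. That is exactly the assumption the dependency models question.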
Current state of affairs • There are quite a few publications which try to show that there are measurable dependencies among the positions. • RECOMB 2003-Modeling dependencies in Protein-DNA binding sites • Multi PSSM, Tree, Multi Tree. Bayesian network based training. • Bioinformatics 2004-Modeling within-motif dependence for transcription factor binding site predictions • PSSM with pairwise correlated positions using Bayes factors. Gibbs sampling based. • BIBE 2006-Discovering DNA Motifs with Nucleotide Dependency • PSSM with multi-positions, heuristic. • Bioinformatics 2007-Position dependencies in transcription factor binding sites • Checks dependencies within a set of aligned binding sites with different statistical measures.
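As an illustration of what a statistical measure of dependency between two positions can look like, the sketch below computes the mutual information between two columns of an alignment. Mutual information is one generic choice and is not claimed to be the measure used in any of the papers listed; the function name and the toy sites are assumptions.

```python
import math
from collections import Counter

def mutual_information(sites, i, j):
    """Mutual information (in bits) between columns i and j of aligned sites."""
    n = len(sites)
    pair_counts = Counter((s[i], s[j]) for s in sites)
    left_counts = Counter(s[i] for s in sites)
    right_counts = Counter(s[j] for s in sites)
    mi = 0.0
    for (a, b), c in pair_counts.items():
        # p(a,b) / (p(a) * p(b)) simplifies to c * n / (count_a * count_b)
        ratio = c * n / (left_counts[a] * right_counts[b])
        mi += (c / n) * math.log2(ratio)
    return mi

# Hypothetical toy sites: columns 0 and 1 covary perfectly, column 2 does not.
toy_sites = ["ATA", "ATC", "CGA", "CGC"]
print(mutual_information(toy_sites, 0, 1), mutual_information(toy_sites, 0, 2))
```

With real data the counts are small, so a raw value like this would normally be compared against a null distribution (for example, from shuffled columns) before calling a pair dependent.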
Current state of affairs (2) • Bioinformatics 2008-Context-dependent DNA recognition code for C2H2 zinc-finger transcription factors • Neural network based. • PLoS Comp Bio 2008-A Feature-Based Approach to Modeling Protein-DNA Interactions • Feature based; currently only considers pairwise position-dependency features. • NAR 2010-On the detection and refinement of transcription factor binding sites using ChIP-Seq data • Similar to Bioinformatics 2004.
Current state of affairs (3) • However, they all share a similar framework: • Start with a set of “known” binding sequences • Propose one model with dependencies and one without • Train the models on the dataset (possibly making gradual changes to the model during training) • Compare which model is better • They then list the positions with dependencies; most are consecutive positions, but some are quite far apart.
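The shared framework above can be sketched as follows. This is a deliberately simple stand-in, not the method of any cited paper: the "dependent" model here only conditions each position on its left neighbour, the comparison uses held-out log-likelihood, and the function names, pseudocount `pc`, and toy sequences are all assumptions.

```python
import math
from collections import Counter

BASES = "ACGT"

def fit_independent(sites, pc=0.5):
    """One smoothed base distribution per position (PSSM-like model)."""
    cols = []
    for i in range(len(sites[0])):
        counts = Counter(s[i] for s in sites)
        total = len(sites) + 4 * pc
        cols.append({b: (counts[b] + pc) / total for b in BASES})
    return cols

def loglik_independent(cols, seq):
    return sum(math.log(col[b]) for col, b in zip(cols, seq))

def fit_adjacent(sites, pc=0.5):
    """First position as above; each later position is conditioned on its left neighbour."""
    first = fit_independent(sites, pc)[0]
    trans = []
    for i in range(1, len(sites[0])):
        pairs = Counter((s[i - 1], s[i]) for s in sites)
        prev = Counter(s[i - 1] for s in sites)
        trans.append({(a, b): (pairs[(a, b)] + pc) / (prev[a] + 4 * pc)
                      for a in BASES for b in BASES})
    return first, trans

def loglik_adjacent(model, seq):
    first, trans = model
    ll = math.log(first[seq[0]])
    for i in range(1, len(seq)):
        ll += math.log(trans[i - 1][(seq[i - 1], seq[i])])
    return ll

# Hypothetical toy data in which positions 0 and 1 covary.
train = ["ATA", "CGA"] * 3
held_out = ["ATA", "CGA"]
ll_indep = sum(loglik_independent(fit_independent(train), s) for s in held_out)
ll_adj = sum(loglik_adjacent(fit_adjacent(train), s) for s in held_out)
print(ll_indep, ll_adj)
```

Scoring on held-out sites matters because the dependent model has more parameters and would otherwise win simply by overfitting the training set.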
Current state of affairs (4) • Well, these just fit a model to a set of sequences known to bind. The binding energy is not really taken into account. • So others, with more $$$ in their labs, ran large-scale biological experiments to see whether the experimental binding energies of some TFs exhibit a dependency pattern.
Current state of affairs (5) • Hence some more papers: • NAR 2002-Nucleotides of transcription factor binding sites exert interdependent effects on the binding affinities of transcription factors • NAR 2002-Additivity in protein-DNA interactions-how good an approximation is it? • Nature Biotechnology 2006-Compact, universal DNA microarrays to comprehensively determine transcription-factor binding site specificities • Science 2009-Diversity and Complexity in DNA Recognition by Transcription Factors • PLoS Comp Bio 2009-Inferring Binding Energies from Selected Binding Sites
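When binding energies are measured directly, the additivity question reduces to a double-mutant comparison: is the energy change of the double mutant equal to the sum of the two single-mutant changes? A minimal sketch, with a hypothetical function name and made-up energies (arbitrary units, not taken from any of the papers above):

```python
def coupling_energy(dg_wt, dg_mut1, dg_mut2, dg_double):
    """Deviation from additivity: ddG(double) - (ddG(mut1) + ddG(mut2)).

    Zero means the two positions contribute independently; a non-zero
    value means the effect of one mutation depends on the other.
    """
    ddg_double = dg_double - dg_wt
    ddg_sum = (dg_mut1 - dg_wt) + (dg_mut2 - dg_wt)
    return ddg_double - ddg_sum

# Made-up binding free energies for illustration only.
additive = coupling_energy(-10.0, -9.0, -8.5, -7.5)
coupled = coupling_energy(-10.0, -9.0, -8.5, -6.0)
print(additive, coupled)
```

In practice a non-zero coupling energy only counts as evidence of dependency if it exceeds the experimental error of the measurements.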
Current state of affairs (6) • From Science 2009, protein binding microarray experiment.
Current state of affairs (7) • Yet none of the publications I have read so far gives concrete evidence of HOW such dependencies could arise. • We are now trying to find out what happens at the physical level when two positions in the DNA are dependent.
Known dependency factor 1 – Rotamer • There is a recent experiment involving the zinc finger TF Zif268, which has been one of the most popular zinc finger modeling targets.
Known dependency factor 1 – Rotamer • They changed the wildtype DNA sequence GCG to ACG, CCG, AAG, and CAG. • We tried to see whether a program that adjusts the side chains of the TF to conform to the new DNA sequence can approximate the change in binding energy. • We tried FoldX. It does rotamer checks, but I am not sure its search is optimal.
Known dependency factor 1 – Rotamer • However, the rotamers that FoldX predicts do not coincide with the diagrams in the paper. • Either FoldX is not optimal, or the homology modeling done in the paper is not accurate. • But given the close agreement between the paper's predicted and experimental differences in binding affinity, the paper's models are most probably (more) correct. • I am still checking on that.
Known dependency factor 2 – Water • What the NAR paper computes explicitly are the solvation penalties (the circles, rectangles and triangles in the diagram). • They claim that the water-mediated H-bonds are not that crucial. • We can see that FoldX does compute hydration to a certain extent. Yet its rotamer search may not be good enough.
Known dependency factor 3 – DNA flexibility • DNA is not a rigid rod.
Known dependency factor 3 – DNA flexibility • G-C will have a higher roll angle, making it less stable (weaker stacking energy) and easier to “open”. • Several studies show that different dinucleotide steps have different bending and twisting energies.
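A rough sketch of why step-dependent flexibility produces position dependency: if the deformation cost is a sum over dinucleotide steps, each base takes part in two steps, so the cost of mutating one position depends on its neighbours. The step costs below are placeholder numbers in arbitrary units, NOT measured values, and the function names are assumptions.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(step):
    """Reverse complement, so that e.g. 'CT' maps to the same step as 'AG'."""
    return step.translate(COMPLEMENT)[::-1]

# Placeholder costs for the 10 unique dinucleotide steps (arbitrary units).
# These numbers are invented for illustration and are not experimental data.
STEP_COST = {"AA": 1.0, "AC": 1.4, "AG": 1.2, "AT": 1.1, "CA": 0.6,
             "CC": 1.3, "CG": 0.7, "GA": 1.2, "GC": 1.5, "TA": 0.5}

def step_cost(step):
    return STEP_COST[step] if step in STEP_COST else STEP_COST[revcomp(step)]

def deformation_cost(seq):
    """Sum of step costs over all overlapping dinucleotides in seq."""
    return sum(step_cost(seq[i:i + 2]) for i in range(len(seq) - 1))

# Same A->C change at position 1, in two different left-neighbour contexts.
delta_after_t = deformation_cost("TCTA") - deformation_cost("TATA")
delta_after_g = deformation_cost("GCTA") - deformation_cost("GATA")
print(delta_after_t, delta_after_g)
```

The two deltas differ even though the mutation is identical, which is something a per-position additive model cannot represent.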
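Known dependency factor 3 – DNA flexibility • The TATA-binding protein actually binds TATA not because that sequence gives the best direct binding energy. • Its contacts with the DNA are mostly non-specific, which points to how easily the sequence deforms rather than to base-specific contacts.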
Conclusion • Up to now, these 3 factors are the known or most probable sources of position dependency in TF-DNA binding. • The challenge is to combine all of them into one scoring function that is simple enough to run on large datasets.