Modeling Dependencies in Protein-DNA Binding Sites

Yoseph Barash (1), Gal Elidan (1), Nir Friedman (1), Tommy Kaplan (1,2)
(1) School of Computer Science & Engineering, (2) Hadassah Medical School
The Hebrew University, Jerusalem, Israel

Dependent positions in binding sites
Most approaches assume position independence. To model or not to model dependencies? [Man & Stormo 2001, Bulyk et al, 2002, Benos et al, 2002]
Pros: Biology suggests dependencies
• Single amino acid interacts with two nucleotides
• Change in conformation of protein or DNA
Cons: Modeling dependencies is harder
• Additional parameters
• Requires more data, not as robust
[Figure: binding site in the promoter of gene A; labels "?T" and "?C"]

Data driven approach
• Can we learn dependencies from available genomic data? Yes
• Do dependency models perform better? Yes
Outline
• Flexible models of dependencies
• Learning from (un)aligned sequences
• Systematic evaluation → Biological insights

How to model binding sites?
Each model represents a distribution of binding sites:
• Profile: Independence model
• Tree: Direct dependencies
• Mixture of Profiles: Global dependencies
• Mixture of Trees: Both types of dependencies
[Diagrams: graphical models over positions X1 to X5; the mixture models add a hidden variable T]
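
For reference, the four model classes named on this slide correspond to the following standard factorizations over the positions of a site. The notation is an assumption made here, not taken from the slides: x_1, ..., x_n are the nucleotides of a site, pa(i) is the tree parent of position i (empty for the root), and T is the hidden mixture component.

```latex
\begin{align*}
\text{Profile:} \quad
  & P(x_1,\dots,x_n) = \prod_{i=1}^{n} P_i(x_i) \\
\text{Tree:} \quad
  & P(x_1,\dots,x_n) = \prod_{i=1}^{n} P_i\bigl(x_i \mid x_{\mathrm{pa}(i)}\bigr) \\
\text{Mixture of Profiles:} \quad
  & P(x_1,\dots,x_n) = \sum_{t} P(T=t) \prod_{i=1}^{n} P_i(x_i \mid T=t) \\
\text{Mixture of Trees:} \quad
  & P(x_1,\dots,x_n) = \sum_{t} P(T=t) \prod_{i=1}^{n} P_i\bigl(x_i \mid x_{\mathrm{pa}(i)}, T=t\bigr)
\end{align*}
```

Read this way, the tree captures direct pairwise dependencies between positions, while the hidden variable T couples all positions at once, which matches the "global dependencies" label on the slide.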

Learning models: Aligned binding sites
Learning based on methods for probabilistic graphical models (Bayesian networks)
Aligned binding sites → Learning machinery (select maximum likelihood model) → Models
Example aligned binding sites: GCGGGGCCGGGC TGGGGGCGGGGT AGGGGGCGGGGG TAGGGGCCGGGC TGGGGGCGGGGT AAAGGGCCGGGC GGGAGGCCGGGA GCGGGGCGGGGC GAGGGGACGAGT CCGGGGCGGTCC ATGGGGCGGGGC
[Diagrams: the four model types over positions X1 to X5, with hidden variable T for the mixtures]
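
The slide does not name the learning algorithm. One standard way to select a maximum likelihood tree over positions is the Chow-Liu procedure: weight every pair of positions by its empirical mutual information and keep a maximum-weight spanning tree. The sketch below assumes that procedure; the function names are hypothetical, and the input is the list of aligned sites shown on the slide.

```python
import math
from collections import Counter
from itertools import combinations

# Aligned sites as listed on the slide.
SITES = [
    "GCGGGGCCGGGC", "TGGGGGCGGGGT", "AGGGGGCGGGGG", "TAGGGGCCGGGC",
    "TGGGGGCGGGGT", "AAAGGGCCGGGC", "GGGAGGCCGGGA", "GCGGGGCGGGGC",
    "GAGGGGACGAGT", "CCGGGGCGGTCC", "ATGGGGCGGGGC",
]

def mutual_information(sites, i, j):
    """Empirical mutual information (in bits) between columns i and j."""
    n = len(sites)
    count_i = Counter(s[i] for s in sites)
    count_j = Counter(s[j] for s in sites)
    count_ij = Counter((s[i], s[j]) for s in sites)
    return sum((c / n) * math.log2(c * n / (count_i[a] * count_j[b]))
               for (a, b), c in count_ij.items())

def chow_liu_edges(sites):
    """Maximum-weight spanning tree over positions (Kruskal with union-find)."""
    length = len(sites[0])
    weighted = sorted(((mutual_information(sites, i, j), i, j)
                       for i, j in combinations(range(length), 2)),
                      reverse=True)
    parent = list(range(length))

    def find(x):
        # Follow parent pointers to the component root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    edges = []
    for weight, i, j in weighted:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:          # the edge joins two components: keep it
            parent[root_i] = root_j
            edges.append((i, j, weight))
    return edges

edges = chow_liu_edges(SITES)
for i, j, weight in edges:
    print(f"X{i + 1} - X{j + 1}: {weight:.3f} bits")
```

With only eleven sites the mutual information estimates are noisy, so a practical learner would likely add regularization or a significance test before accepting an edge; that step is omitted here.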

Evaluation using aligned data
95 TFs with ≥ 20 binding sites from TRANSFAC database [Wingender et al, 2001]
Estimate generalization of each model:
• Cross-validation: split the data set into a training set and a test set
• Test: how probable is the site given the model?
[Figure: example sites (e.g. GCGGGGCCGGGC, TGGGGGCGGGGT, AGGGGGCGGGGG) split into training and test sets, with a test log-likelihood per site: -20.34, -23.03, -21.31, -19.10, -18.42, -19.70, -22.39, -23.54, -22.39, -23.54, -18.07, -19.18, -18.31, -21.43]
Test avg. LL = -20.77
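
The cross-validation test can be sketched for the simplest model, the profile. Everything below is an illustrative assumption rather than the authors' code: the function names, the pseudocount of 1, the base-2 logarithm, and the way folds are assigned. The sites are a sample of those shown on the slides.

```python
import math

ALPHABET = "ACGT"

SITES = [
    "GCGGGGCCGGGC", "TGGGGGCGGGGT", "AGGGGGCGGGGG", "TAGGGGCCGGGC",
    "TGGGGGCGGGGT", "AAAGGGCCGGGC", "GGGAGGCCGGGA", "GCGGGGCGGGGC",
    "GAGGGGACGAGT", "CCGGGGCGGTCC", "ATGGGGCGGGGC",
]

def fit_profile(sites, pseudocount=1.0):
    """Per-position nucleotide frequencies, smoothed with a pseudocount."""
    profile = []
    for i in range(len(sites[0])):
        counts = {b: pseudocount for b in ALPHABET}
        for s in sites:
            counts[s[i]] += 1
        total = sum(counts.values())
        profile.append({b: c / total for b, c in counts.items()})
    return profile

def log_likelihood(profile, site):
    """Log2-probability of one site under the profile."""
    return sum(math.log2(col[b]) for col, b in zip(profile, site))

def cross_validated_ll(sites, k=5):
    """Average test log-likelihood per site over k folds."""
    total = 0.0
    for fold in range(k):
        test = sites[fold::k]                                   # held-out sites
        train = [s for idx, s in enumerate(sites) if idx % k != fold]
        profile = fit_profile(train)
        total += sum(log_likelihood(profile, s) for s in test)
    return total / len(sites)

avg_ll = cross_validated_ll(SITES)
print(f"Test avg. LL = {avg_ll:.2f}")
```

The same loop applies to any of the model classes: only `fit_profile` and `log_likelihood` would need to be swapped for their tree or mixture counterparts.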

Example: Arabidopsis ABA binding factor 1
• Profile: Test LL per instance -19.93
• Dependency models (Tree over positions X4 to X12; Mixture of Profiles with component weights 76% and 24%):
  Test LL per instance -18.70 (+1.23) (improvement in likelihood > 2-fold)
  Test LL per instance -18.47 (+1.46) (improvement in likelihood > 2.5-fold)
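
The quoted fold-changes are consistent with test log-likelihoods measured in base-2 logarithms (bits). That reading is an inference from the numbers, not something stated on the slide. Under it, a difference in test LL per instance converts to a fold-change in likelihood as follows:

```latex
\begin{align*}
\Delta \mathrm{LL} &= \mathrm{LL}_{\text{model}} - \mathrm{LL}_{\text{profile}},
  \qquad \text{fold-change} = 2^{\Delta \mathrm{LL}} \\
-18.70 - (-19.93) &= 1.23 \;\Rightarrow\; 2^{1.23} \approx 2.35 > 2 \\
-18.47 - (-19.93) &= 1.46 \;\Rightarrow\; 2^{1.46} \approx 2.75 > 2.5
\end{align*}
```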

Likelihood improvement over profiles
TRANSFAC: 95 aligned data sets
[Plot: fold-change in likelihood (log scale, 0.5 to 128) for each data set, marked as significant (paired t-test) or not significant]
Significant improvement in generalization → Data often exhibits dependencies

Evaluation for unaligned data
Motif finding problem
Input: A set of potentially co-regulated genes
Output: A common motif in their promoters
Sources of data:
• Gene annotation (e.g. Hughes et al, 2000)
• Gene expression (e.g. Spellman et al, 1998; Tavazoie et al, 2000)
• ChIP (e.g. Simon et al, 2001; Lee et al, 2002)

Learning models: unaligned data
Use EM algorithm to simultaneously
• Identify binding site positions
• Learn a dependency model
Unaligned data → EM algorithm (alternate: learn a model, identify binding sites) → Models
[Diagrams: the four model types over positions X1 to X5, with hidden variable T for the mixtures]
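
The alternation between identifying site positions and learning a model can be sketched as follows. This is a simplified illustration, not the authors' method: it learns a plain profile rather than a dependency model, assumes exactly one site per sequence and a fixed uniform background, and uses made-up toy sequences. All names and parameter values are hypothetical.

```python
import math
import random

ALPHABET = "ACGT"
BACKGROUND = 0.25  # assumed uniform background frequency

def em_motif(seqs, width, iters=50, pseudocount=0.5, seed=0):
    """EM for a profile of the given width, assuming one site per sequence."""
    rng = random.Random(seed)
    # Random initial profile; each column is normalized to sum to 1.
    profile = []
    for _ in range(width):
        raw = [rng.random() + 0.5 for _ in ALPHABET]
        total = sum(raw)
        profile.append({b: r / total for b, r in zip(ALPHABET, raw)})

    posteriors = []
    for _ in range(iters):
        # E-step: posterior over the site start position in each sequence.
        posteriors = []
        for s in seqs:
            scores = [sum(math.log(profile[i][s[start + i]] / BACKGROUND)
                          for i in range(width))
                      for start in range(len(s) - width + 1)]
            top = max(scores)
            weights = [math.exp(x - top) for x in scores]  # stable softmax
            norm = sum(weights)
            posteriors.append([w / norm for w in weights])
        # M-step: re-estimate the profile from expected nucleotide counts.
        counts = [{b: pseudocount for b in ALPHABET} for _ in range(width)]
        for s, post in zip(seqs, posteriors):
            for start, p in enumerate(post):
                for i in range(width):
                    counts[i][s[start + i]] += p
        profile = []
        for col in counts:
            total = sum(col.values())
            profile.append({b: c / total for b, c in col.items()})

    # Most probable start per sequence according to the last E-step.
    starts = [max(range(len(post)), key=post.__getitem__) for post in posteriors]
    return profile, starts

# Toy promoters (hypothetical data, for illustration only).
SEQS = ["TTACGGGCGGATTA", "CAGGGCGGTTTACA", "ATATGGGCGGCATT", "GGGCGGTATACATA"]
profile, starts = em_motif(SEQS, width=6)
print(starts)
```

EM only finds a local optimum, so in practice it is usually run from several initializations and the highest-likelihood result is kept. Replacing the profile with a tree or mixture changes the scoring in the E-step and the estimation in the M-step, but not the overall loop.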

ChIP location analysis [Lee et al, 2002]
Yeast genome-wide location experiments
Target genes for 106 TFs in 146 experiments
[Table: one row per gene (YAL001C, YAL002W, YAL003W, YAL005C, ..., YAL010C, YAL012C, YAL013W, YPR201W; # genes ~ 6000) and one column per TF (ABF1 targets, ZAP1 targets, ...), each entry marked + or –]

Example: Models learned for ABF1 (YPD)
Autonomously replicating sequence-binding factor 1
[Sequence logos: known profile (from TRANSFAC), learned profile, and learned mixture of profiles; the numbers 43 and 492 appear beside the learned models]

Evaluating Performance
Detect target genes on a genomic scale
[Figure: promoter sequence spanning positions -1000 to 0, with a site marked at -473]
[Plot: p-value (log scale, 10^-1 to 10^-8) against promoter position (-180 to -60) for Profile and Mix of Trees; threshold: Bonferroni corrected p-value ≤ 0.01]
Example: Gal4 regulates Gal80 (biologically verified site)
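
A promoter scan with a Bonferroni-corrected threshold can be sketched as below. The slides do not say how the p-values were computed, so this is an assumption-heavy illustration: the model is a plain profile, the background is uniform, each window's p-value is obtained by brute-force enumeration of all 4^k windows (feasible only for short motifs), and the correction multiplies by the number of windows scanned. The profile and promoter are made up.

```python
import math
from itertools import product

ALPHABET = "ACGT"

def log_odds(profile, window, background=0.25):
    """Log2-odds score of a window: profile versus uniform background."""
    return sum(math.log2(col[b] / background) for col, b in zip(profile, window))

def window_p_value(profile, score, background=0.25):
    """P(a random background window scores >= score), by full enumeration."""
    k = len(profile)
    tail = 0.0
    for kmer in product(ALPHABET, repeat=k):
        if log_odds(profile, kmer, background) >= score - 1e-12:
            tail += background ** k
    return tail

def scan_promoter(profile, promoter, alpha=0.01):
    """Return (start, corrected p-value) for windows passing the threshold."""
    k = len(profile)
    n_windows = len(promoter) - k + 1
    hits = []
    for start in range(n_windows):
        score = log_odds(profile, promoter[start:start + k])
        # Bonferroni: multiply by the number of windows tested, cap at 1.
        corrected = min(1.0, window_p_value(profile, score) * n_windows)
        if corrected <= alpha:
            hits.append((start, corrected))
    return hits

# Hypothetical profile that strongly prefers the consensus ACGTAC.
consensus = "ACGTAC"
profile = [{b: (0.97 if b == c else 0.01) for b in ALPHABET} for c in consensus]
promoter = "TTTTTTT" + consensus + "TTTTTTT"
hits = scan_promoter(profile, promoter)
print(hits)
```

For realistic motif widths the enumeration would be replaced by a dynamic-programming or sampling estimate of the score distribution, and a genome-wide scan would also need to account for the number of promoters tested.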

Evaluation using ChIP location data [Lee et al, 2002]
Evaluate using a 5-fold cross-validation test: hold out a test set from the data set, predict its labels, and compare with the true labels
Gene:       YAL001C YAL002W YAL003W YAL005C YAL007C YAL008W YAL009W YAL010C YAL012C YAL013W YPR201W
True:       +  –  +  –  +  –  –  +  –  –  –
Prediction: +  –  +  –  –  –  –  +  +  –  –
Outcome:    √  √  √  √  FN √  √  √  FP √  √
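
The outcome of this comparison can be summarized with the two measures the talk uses in its later scatter plots: sensitivity = TP / True and specificity = TP / Predicted. Note that this "specificity" is what is more commonly called precision or positive predictive value. The sketch below uses the labels of this slide's example as far as they can be recovered from the transcript; the function name is hypothetical.

```python
def evaluate(true_labels, predicted_labels):
    """Count TP/FN/FP and compute the talk's sensitivity and specificity."""
    tp = sum(t and p for t, p in zip(true_labels, predicted_labels))
    n_true = sum(true_labels)        # genes that are real targets
    n_pred = sum(predicted_labels)   # genes predicted to be targets
    return {
        "TP": tp,
        "FN": n_true - tp,
        "FP": n_pred - tp,
        "sensitivity": tp / n_true if n_true else 0.0,   # TP / True
        "specificity": tp / n_pred if n_pred else 0.0,   # TP / Predicted
    }

# Labels for the eleven example genes, in the order listed on the slide.
true = [c == "+" for c in "+-+-+--+---"]
pred = [c == "+" for c in "+-+----++--"]
result = evaluate(true, pred)
print(result)
```

For this example there are three true positives, one false negative, and one false positive, so both measures come out at 3/4.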

Example: ROC curve of HSF1
[Plot: true positive rate (sensitivity, 0% to 90%) against false positive rate (0% to 5%), with one curve each for Profile, Tree, Mixture of Profiles, and Mixture of Trees; a marker on the false positive axis reads ~60 FP]

Improvement in sensitivity & specificity: Tree vs. Profile
105 unaligned data sets from Lee et al.
Sensitivity = TP / True; Specificity = TP / Predicted
[Scatter plot: Δ specificity (-25 to 20) against Δ sensitivity (-20 to 60); counts printed in the upper half: 3 and 30; in the lower half: 15 and 6]

Improvement in sensitivity & specificity: Mixture of Profiles vs. Profile
105 unaligned data sets from Lee et al.
Sensitivity = TP / True; Specificity = TP / Predicted
[Scatter plot: Δ specificity (-25 to 20) against Δ sensitivity (-20 to 60); counts printed in the upper half: 0 and 52; in the lower half: 18 and 17]

Improvement in sensitivity & specificity: Mixture of Trees vs. Profile
105 unaligned data sets from Lee et al.
Sensitivity = TP / True; Specificity = TP / Predicted
[Scatter plot: Δ specificity (-25 to 20) against Δ sensitivity (-20 to 60); counts printed in the upper half: 1 and 84; in the lower half: 2 and 16]

“Is it worthwhile to model dependencies?”
Evaluation clearly supports this
What about the underlying biology? (with Prof. Hanah Margalit, Hadassah Medical School)

Distance between dependent positions
Tree models learned from the aligned data sets
[Histogram: number of dependencies (0 to 50) by distance between the dependent positions (1 to 11), split by strength: Weak (< 0.3 bits), Medium (< 0.7 bits), Strong]
Callout: < 1/3 of the dependencies

Dependency models vs. Profile on aligned data sets
Structural families
[Plots: fold-change in likelihood (log scale, 0.5 to 128) for each data set, marked as significant (paired t-test) or not significant, and grouped by structural family: Zinc finger, bZIP, bHLH, Helix Turn Helix, β Sheet, others, ???]

Conclusions
• Flexible framework for learning dependencies
• Dependencies are found in many cases
• It is worthwhile to model them: better learning and binding site prediction
Future work
• Link to the underlying structural biology
• Incorporate as part of other regulatory mechanism models
http://compbio.cs.huji.ac.il/TFBN
