Modeling Dependencies in Protein-DNA Binding Sites

Yoseph Barash (1), Gal Elidan (1), Nir Friedman (1), Tommy Kaplan (1,2). (1) School of Computer Science & Engineering; (2) Hadassah Medical School. The Hebrew University, Jerusalem, Israel.


Presentation Transcript


  1. Modeling Dependencies in Protein-DNA Binding Sites. Yoseph Barash (1), Gal Elidan (1), Nir Friedman (1), Tommy Kaplan (1,2). (1) School of Computer Science & Engineering; (2) Hadassah Medical School. The Hebrew University, Jerusalem, Israel.

  2. Dependent positions in binding sites. [Figure: gene A with a binding site in its promoter.] Most approaches assume position independence. To model or not to model dependencies? [Man & Stormo 2001, Bulyk et al, 2002, Benos et al, 2002]
     Pros: Biology suggests dependencies
     • A single amino acid interacts with two nucleotides
     • Change in conformation of protein or DNA
     Cons: Modeling dependencies is harder
     • Additional parameters
     • Requires more data, not as robust

  3. Data driven approach
     • Can we learn dependencies from available genomic data? Yes
     • Do dependency models perform better? Yes
     Outline
     • Flexible models of dependencies
     • Learning from (un)aligned sequences
     • Systematic evaluation, leading to biological insights

  4. How to model binding sites? Each model represents a distribution of binding sites over the positions X1 ... X5.
     • Profile: independence model
     • Tree: direct dependencies
     • Mixture of Profiles: global dependencies (through a mixture variable T)
     • Mixture of Trees: both types of dependencies
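To make the difference between these model families concrete, here is a minimal sketch of how a site's probability could be computed under a profile, a tree, and a mixture of profiles. The function names and the dictionary-based data layout are illustrative assumptions, not taken from the talk, and all probabilities are assumed to be strictly positive (for example after smoothing).

```python
import math

def profile_log2_prob(site, profile):
    """log2 P(site) under a profile: positions are independent.
    profile[i][nt] = P(X_i = nt)."""
    return sum(math.log2(profile[i][nt]) for i, nt in enumerate(site))

def tree_log2_prob(site, root_dist, parents, cond):
    """log2 P(site) under a tree: each position depends on one parent.
    parents[i] is the parent position of i (None for the root);
    cond[i][parent_nt][nt] = P(X_i = nt | X_parent = parent_nt)."""
    total = 0.0
    for i, nt in enumerate(site):
        parent = parents[i]
        if parent is None:
            total += math.log2(root_dist[nt])
        else:
            total += math.log2(cond[i][site[parent]][nt])
    return total

def mixture_log2_prob(site, weights, profiles):
    """log2 P(site) under a mixture of profiles: a weighted sum over
    the values of the mixture variable T, one profile per value."""
    prob = sum(w * 2 ** profile_log2_prob(site, p)
               for w, p in zip(weights, profiles))
    return math.log2(prob)

# Toy two-position example (hypothetical numbers).
uniform = {nt: 0.25 for nt in "ACGT"}
# Position 1 copies position 0 with probability 0.7.
copy = {a: {b: (0.7 if a == b else 0.1) for b in "ACGT"} for a in "ACGT"}
profile_score = profile_log2_prob("GG", [uniform, uniform])
tree_score = tree_log2_prob("GG", uniform, [None, 0], [None, copy])
```

In this toy case the tree assigns "GG" a higher probability than the profile, because it can express that the two positions tend to agree.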

  5. Learning models: aligned binding sites. Learning is based on methods for probabilistic graphical models (Bayesian networks).
     Aligned binding sites: GCGGGGCCGGGC TGGGGGCGGGGT AGGGGGCGGGGG TAGGGGCCGGGC TGGGGGCGGGGT AAAGGGCCGGGC GGGAGGCCGGGA GCGGGGCGGGGC GAGGGGACGAGT CCGGGGCGGTCC ATGGGGCGGGGC
     Learning machinery: select the maximum likelihood model
     Models: Profile, Tree, Mixture of Profiles, Mixture of Trees
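For the simplest of these models, the profile, the maximum likelihood estimate is the per-position nucleotide frequency in the aligned sites. The sketch below illustrates this on three of the sites from the slide. The function name and the optional `pseudocount` parameter are assumptions added for illustration; with `pseudocount=0` the result is the plain maximum likelihood estimate.

```python
from collections import Counter

def learn_profile(sites, pseudocount=0.0):
    """Estimate a profile from aligned, equal-length sites.
    Returns one {nucleotide: probability} dict per position."""
    length = len(sites[0])
    total = len(sites) + 4 * pseudocount
    profile = []
    for i in range(length):
        # Count the nucleotides observed in column i of the alignment.
        counts = Counter(site[i] for site in sites)
        profile.append({nt: (counts[nt] + pseudocount) / total
                        for nt in "ACGT"})
    return profile

sites = ["GCGGGGCCGGGC", "TGGGGGCGGGGT", "AGGGGGCGGGGG"]
profile = learn_profile(sites)
```

A positive pseudocount keeps every probability above zero, which matters once log-probabilities of unseen sites are computed.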

  6. Evaluation using aligned data. 95 TFs with ≥ 20 binding sites from the TRANSFAC database [Wingender et al, 2001]. Estimate the generalization of each model by cross-validation: split the data set into a training set and a test set. Test: how probable is the site given the model?
     [Figure: example sites (GCGGGGCCGGGC, TGGGGGCGGGGT, ...) with per-site test log-likelihoods -20.34, -23.03, -21.31, -19.10, -18.42, -19.70, -22.39, -23.54, -22.39, -23.54, -18.07, -19.18, -18.31, -21.43; test avg. LL = -20.77]
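The evaluation loop described here can be sketched generically: hold out part of the sites, learn on the rest, and average the log-likelihood of the held-out sites. The function name, the interleaved fold assignment, and the default `k=5` are assumptions; the slide does not state how the folds were formed. `learn` and `log_prob` stand for whichever model family is being evaluated.

```python
import math

def cross_validated_avg_ll(sites, learn, log_prob, k=5):
    """Average held-out log-likelihood per site over k folds.
    learn(train_sites) -> model; log_prob(site, model) -> float."""
    scores = []
    for fold in range(k):
        # Every k-th site goes to the test set of this fold.
        test = sites[fold::k]
        train = [s for j, s in enumerate(sites) if j % k != fold]
        model = learn(train)
        scores.extend(log_prob(s, model) for s in test)
    return sum(scores) / len(scores)

# Toy usage with a uniform model that ignores its training data.
def learn_uniform(train):
    return {nt: 0.25 for nt in "ACGT"}

def log2_prob(site, model):
    return sum(math.log2(model[nt]) for nt in site)

toy_sites = ["GCGGGGCCGGGC", "TGGGGGCGGGGT", "AGGGGGCGGGGG",
             "TAGGGGCCGGGC", "AAAGGGCCGGGC", "GGGAGGCCGGGA"]
avg_ll = cross_validated_avg_ll(toy_sites, learn_uniform, log2_prob, k=3)
```

Comparing this average between a profile and a dependency model on the same folds gives the kind of per-instance difference reported in the following slides.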

  7. Example: Arabidopsis ABA binding factor 1. [Figure: learned Profile, Tree (over positions X4 ... X12) and Mixture of Profiles (components weighted 76% and 24%).]
     Profile: test LL per instance -19.93
     The two dependency models: test LL per instance -18.47 (+1.46, improvement in likelihood > 2.5-fold) and -18.70 (+1.23, improvement in likelihood > 2-fold)
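The fold-change figures follow from the log-likelihood differences by exponentiation. The sketch below assumes the log-likelihoods are in base 2, which the slide does not state but which is consistent with its numbers: a gain of 1.46 then corresponds to a bit more than 2.5-fold, and 1.23 to a bit more than 2-fold.

```python
def fold_change(delta_ll, base=2.0):
    """Likelihood ratio implied by a log-likelihood difference.
    The base of the logarithm is an assumption (base 2 here)."""
    return base ** delta_ll

profile_ll = -19.93
gain_a = round(-18.47 - profile_ll, 2)   # 1.46
gain_b = round(-18.70 - profile_ll, 2)   # 1.23
fold_a = fold_change(gain_a)
fold_b = fold_change(gain_b)
```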

  8. Likelihood improvement over profiles. TRANSFAC: 95 aligned data sets. Significant improvement in generalization, so the data often exhibit dependencies.
     [Figure: fold-change in likelihood (log scale, 0.5 to 128) for each data set, marked as significant (paired t-test) or not significant.]

  9. Evaluation for unaligned data. Motif finding problem. Input: a set of potentially co-regulated genes. Output: a common motif in their promoters.
     Sources of data:
     • Gene annotation (e.g. Hughes et al, 2000)
     • Gene expression (e.g. Spellman et al, 1998; Tavazoie et al, 2000)
     • ChIP (e.g. Simon et al, 2001; Lee et al, 2002)

  10. Learning models: unaligned data. Use the EM algorithm to simultaneously
     • Identify binding site positions
     • Learn a dependency model
     [Figure: unaligned data feed an EM loop that alternates between learning a model and identifying binding sites, producing the models.]
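The alternation between locating sites and re-estimating the model can be sketched as follows. This is a simplified illustration, not the authors' implementation: it uses a plain profile instead of a dependency model, assumes exactly one site per sequence, and assumes a uniform background. The function name and all parameters are hypothetical.

```python
import random

def em_motif(seqs, width, iters=50, pseudocount=0.5, seed=0):
    """Toy EM for motif finding with a profile model."""
    rng = random.Random(seed)
    # Random strictly positive starting profile.
    profile = []
    for _ in range(width):
        w = [rng.random() + 0.5 for _ in "ACGT"]
        profile.append({nt: x / sum(w) for nt, x in zip("ACGT", w)})
    posteriors = []
    for _ in range(iters):
        # E-step: posterior over the site's start position per sequence.
        posteriors = []
        for seq in seqs:
            scores = []
            for start in range(len(seq) - width + 1):
                ratio = 1.0  # motif likelihood over uniform background
                for i in range(width):
                    ratio *= profile[i][seq[start + i]] / 0.25
                scores.append(ratio)
            z = sum(scores)
            posteriors.append([s / z for s in scores])
        # M-step: re-estimate the profile from expected counts.
        counts = [{nt: pseudocount for nt in "ACGT"} for _ in range(width)]
        for seq, post in zip(seqs, posteriors):
            for start, p in enumerate(post):
                for i in range(width):
                    counts[i][seq[start + i]] += p
        profile = [{nt: c[nt] / sum(c.values()) for nt in "ACGT"}
                   for c in counts]
    return profile, posteriors

seqs = ["ACGTGGGCGGTA", "TTGGGCGGACGT", "CAGTCAGGGCGG"]
profile, posteriors = em_motif(seqs, width=6, iters=10)
```

Replacing the profile in the E-step and M-step with a tree or mixture model would give the dependency-aware variant the slide refers to. EM can converge to a local optimum, so in practice several starting points are usually tried.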

  11. ChIP location analysis [Lee et al, 2002]. Yeast genome-wide location experiments. Target genes for 106 TFs in 146 experiments.
     [Table: one row per gene (YAL001C, YAL002W, YAL003W, YAL005C, ..., YAL010C, YAL012C, YAL013W, YPR201W; # genes ~ 6000) and one column per TF (ABF1 targets, ZAP1 targets, ...), each cell marked + or –.]

  12. Example: models learned for ABF1 (YPD), autonomously replicating sequence-binding factor 1.
     [Figure: known profile (from TRANSFAC), learned profile, and learned Mixture of Profiles; the labels 43 and 492 appear in the figure.]

  13. Evaluating performance. Detect target genes on a genomic scale.
     [Figure: a promoter sequence ACGTAT...AGGGATGC spanning -1000 to 0, with the labels GAGC and -473.]

  14. Evaluating performance. Detect target genes on a genomic scale. Example: Gal4 regulates Gal80.
     [Figure: p-value (log scale, 10^-1 to 10^-8) of the Profile and Mix of Trees models along promoter positions -180 to -60, with the threshold "Bonferroni corrected p-value ≤ 0.01" and the biologically verified site marked.]

  15. Evaluation using ChIP location data [Lee et al, 2002]. Evaluate using a 5-fold cross-validation test.
     [Figure: the data set of genes (YAL001C, YAL002W, YAL003W, YAL005C, YAL007C, YAL008W, YAL009W, YAL010C, YAL012C, YAL013W, YPR201W) with +/– target labels; a test set (YAL001C, YAL002W, YAL003W) is held out and receives predictions.]

  16. Evaluation using ChIP location data [Lee et al, 2002]. Evaluate using a 5-fold cross-validation test.
     [Figure: true +/– labels and predicted +/– labels for the same genes; each gene is marked √ (prediction agrees with the true label), FN (false negative) or FP (false positive).]

  17. Example: ROC curve of HSF1.
     [Figure: true positive rate (sensitivity, 0% to 90%) against false positive rate (0% to 5%) for Profile, Tree, Mixture of Profiles and Mixture of Trees; annotation: ~60 FP.]

  18. Improvement in sensitivity & specificity: Tree vs. Profile. 105 unaligned data sets from Lee et al. With TP the overlap between the True and Predicted target sets: Sensitivity = TP / True, Specificity = TP / Predicted.
     [Figure: Δ specificity (-25 to 20) against Δ sensitivity (-20 to 60), one point per data set; counts shown in the plot: 3, 30, 15, 6.]
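The two measures defined on this slide can be computed directly from the sets of true and predicted target genes. Note that "specificity" here is TP / Predicted, as the slide defines it, rather than the true-negative rate. The function name and the gene identifiers in the example are hypothetical.

```python
def sensitivity_specificity(true_targets, predicted_targets):
    """Sensitivity = TP / True and specificity = TP / Predicted,
    following the definitions on the slide."""
    true_set = set(true_targets)
    pred_set = set(predicted_targets)
    tp = len(true_set & pred_set)  # genes both true and predicted
    sensitivity = tp / len(true_set) if true_set else 0.0
    specificity = tp / len(pred_set) if pred_set else 0.0
    return sensitivity, specificity

# Hypothetical gene sets, for illustration only.
true_targets = {"geneA", "geneB", "geneC", "geneD"}
predicted = {"geneB", "geneC", "geneE"}
sens, spec = sensitivity_specificity(true_targets, predicted)
```

Here two of the four true targets are recovered (sensitivity 0.5) and two of the three predictions are correct (specificity about 0.67). The Δ values in the plots are the differences in these measures between a dependency model and the profile.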

  19. Improvement in sensitivity & specificity: Mixture of Profiles vs. Profile. 105 unaligned data sets from Lee et al. Sensitivity = TP / True, Specificity = TP / Predicted.
     [Figure: Δ specificity (-25 to 20) against Δ sensitivity (-20 to 60), one point per data set; counts shown in the plot: 0, 52, 18, 17.]

  20. Improvement in sensitivity & specificity: Mixture of Trees vs. Profile. 105 unaligned data sets from Lee et al. Sensitivity = TP / True, Specificity = TP / Predicted.
     [Figure: Δ specificity (-25 to 20) against Δ sensitivity (-20 to 60), one point per data set; counts shown in the plot: 1, 84, 2, 16.]

  21. “Is it worthwhile to model dependencies?” Evaluation clearly supports this. What about the underlying biology? (with Prof. Hanah Margalit, Hadassah Medical School)

  22. Distance between dependent positions. Tree models learned from the aligned data sets.
     [Figure: number of dependencies (0 to 50) by distance (1 to 11) between the dependent positions, split into Weak (< 0.3 bits), Medium (< 0.7 bits) and Strong; annotation: < 1/3 of the dependencies.]
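The slide grades dependencies in bits but does not say how the strength is measured. One natural candidate, assumed here purely for illustration, is the empirical mutual information between two positions of the aligned sites. The function names are hypothetical; the 0.3 and 0.7 bit thresholds are the ones on the slide.

```python
import math
from collections import Counter

def mutual_information_bits(sites, i, j):
    """Empirical mutual information (in bits) between positions i and j."""
    n = len(sites)
    pair = Counter((s[i], s[j]) for s in sites)
    left = Counter(s[i] for s in sites)
    right = Counter(s[j] for s in sites)
    mi = 0.0
    for (a, b), c in pair.items():
        # p(a,b) * log2( p(a,b) / (p(a) * p(b)) ), written with counts
        mi += (c / n) * math.log2(c * n / (left[a] * right[b]))
    return mi

def strength(bits):
    """Bin a dependency using the thresholds shown on the slide."""
    if bits < 0.3:
        return "weak"
    if bits < 0.7:
        return "medium"
    return "strong"

dependent = ["AA", "CC", "AA", "CC"]      # positions always agree
independent = ["AA", "AC", "CA", "CC"]    # all combinations equally often
mi_dep = mutual_information_bits(dependent, 0, 1)
mi_ind = mutual_information_bits(independent, 0, 1)
```

With only a handful of sites, empirical mutual information is biased upward, so real use would need more data or a correction.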

  23. Structural families. Dependency models vs. Profile on aligned data sets.
     [Figure: fold-change in likelihood (log scale, 0.5 to 128) per data set, marked as significant (paired t-test) or not significant, grouped by structural family: Helix Turn Helix, bZIP, bHLH, β Sheet, Zinc finger, others, ???.]

  24. Conclusions
     • Flexible framework for learning dependencies
     • Dependencies are found in many cases
     • It is worthwhile to model them: better learning and binding site prediction
     Future work
     • Link to the underlying structural biology
     • Incorporate as part of other regulatory mechanism models
     http://compbio.cs.huji.ac.il/TFBN