Computational Analysis of Protein-DNA Interactions

Changhui (Charles) Yan
Department of Computer Science, Utah State University

Problem I: Identifying amino acid residues involved in protein-DNA interactions from sequence

Materials And Methods
• 56 double-stranded DNA-binding proteins previously used in the study of Jones et al. (2003)
• Encoding


Naïve Bayes Classifier
• Leave-one-out cross-validation

Leave-One-Out Cross-Validations
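The slides do not give the encoding or the exact cross-validation protocol, so the following is only a minimal sketch of how a sequence-based Naive Bayes predictor could be evaluated with leave-one-out cross-validation. The sliding-window encoding, the window size, the Laplace smoothing, and the choice to hold out one protein at a time are all assumptions made for illustration, as are the names `windows`, `NaiveBayes`, and `leave_one_out`.

```python
import math
from collections import Counter, defaultdict

def windows(seq, labels, size=9):
    """Yield (window, label) per residue; '-' pads the sequence ends.

    Assumed encoding: each residue is described by the `size` residues
    centred on it. `labels` is a string of '1' (binding) / '0' (non-binding).
    """
    half = size // 2
    padded = "-" * half + seq + "-" * half
    for i, lab in enumerate(labels):
        yield padded[i:i + size], lab

class NaiveBayes:
    def fit(self, examples):
        self.class_counts = Counter()
        self.feat_counts = defaultdict(Counter)  # (class, position) -> residue counts
        for win, lab in examples:
            self.class_counts[lab] += 1
            for pos, aa in enumerate(win):
                self.feat_counts[(lab, pos)][aa] += 1
        return self

    def predict(self, win):
        total = sum(self.class_counts.values())
        best, best_score = None, -math.inf
        for lab, n in self.class_counts.items():
            score = math.log(n / total)  # class prior
            for pos, aa in enumerate(win):
                # Laplace smoothing over 21 symbols (20 amino acids + pad)
                count = self.feat_counts[(lab, pos)][aa]
                score += math.log((count + 1) / (n + 21))
            if score > best_score:
                best, best_score = lab, score
        return best

def leave_one_out(proteins, size=9):
    """Hold out each protein in turn, train on the rest, predict its residues."""
    results = []
    for i, (seq, labels) in enumerate(proteins):
        train = [ex for j, (s, l) in enumerate(proteins) if j != i
                 for ex in windows(s, l, size)]
        model = NaiveBayes().fit(train)
        results.append("".join(model.predict(w) for w, _ in windows(seq, labels, size)))
    return results
```

Calling `leave_one_out` on a list of `(sequence, labels)` pairs returns one predicted label string per protein, which can then be compared residue by residue against the actual labels.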

Predictions in The Context of 3-D Structures
Pit-1, PDB 1au7
TP: 30, FP: 16, TN: 86, FN: 14
CC: 0.51 (2nd), Accuracy: 79%
[Figure panels: Actual, Predicted]
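The accuracy and correlation coefficient (CC) on this slide can be recomputed from the four counts. The sketch below assumes that CC is the Matthews correlation coefficient, which the slides do not state explicitly; the function names are illustrative.

```python
import math

def accuracy(tp, fp, tn, fn):
    """Fraction of residues whose predicted label matches the actual label."""
    return (tp + tn) / (tp + fp + tn + fn)

def correlation_coefficient(tp, fp, tn, fn):
    """Matthews correlation coefficient (assumed to be the CC on the slide)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Counts reported for Pit-1 (PDB 1au7)
pit1 = dict(tp=30, fp=16, tn=86, fn=14)
print(f"accuracy = {accuracy(**pit1):.2f}")
print(f"CC       = {correlation_coefficient(**pit1):.2f}")
```

With the Pit-1 counts this reproduces the reported accuracy of 79% and gives a CC of roughly 0.52, close to the 0.51 shown; the small gap may come from rounding or a slightly different definition.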

Predictions in The Context of 3-D Structures
λ-Cro, PDB 6cro
TP: 10, FP: 5, TN: 34, FN: 10
CC: 0.37 (19th), Accuracy: 73%
[Figure panels: Predicted, Actual]

Predictions Compared With PROSITE Motifs
• Predicted binding sites substantially overlap with 34 of the 37 “DNA-binding” PROSITE motifs
• In 52 of the 56 proteins, the predictor identifies at least 20% of the DNA-binding residues
• 28 of the 56 proteins contain no PROSITE motifs that are annotated as “DNA-binding”

Comparison With Previous Study
* Ahmad, S. and Sarai, A. (2005) PSSM-based prediction of DNA binding sites in proteins. BMC Bioinformatics, 6, 33.

Summary
• A simple sequence-based Naive Bayes classifier predicts interface residues in DNA-binding proteins with 75% accuracy, 37% specificity+, 53% sensitivity+ and a correlation coefficient of 0.29
• Predicted binding sites
  • correctly indicate the locations of actual binding sites
  • substantially overlap with known PROSITE motifs
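The slides do not define specificity+ and sensitivity+. A common reading, assumed here, is that the "+" marks measures computed on the positive (DNA-binding) class, in terms of the TP, FP, and FN counts used on the earlier slides:

```latex
\text{specificity}^{+} = \frac{TP}{TP + FP},
\qquad
\text{sensitivity}^{+} = \frac{TP}{TP + FN}
```

Under this reading, 37% specificity+ would mean that 37% of residues predicted as binding are actual binding residues, and 53% sensitivity+ would mean that 53% of actual binding residues are recovered.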

Problem II: Identification of Helix-Turn-Helix (HTH) DNA-binding motifs

HTH Motifs
• Sequences sharing low similarity can fold into a similar HTH structure
• Identifying HTH motifs from sequence is extremely challenging

Trick 1
• Including more information
  • Amino acid sequence
  • Secondary structure

Hidden Markov Model (HMM)
LQQITHIANQL-GLE----KDVVRVWF

Hidden Markov Model (HMM_AA_SS)
LQQITHIANQL-GLE----KDVVRVWF
HHHEEHEEEHMHE----HHEEMMEH
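One way to let an HMM use both the amino acid sequence and the secondary structure is to have each state emit a pair of symbols per position. The sketch below scores such paired observations with the forward algorithm in log space. It is not the HMM_AA_SS model itself: the two-state topology, the toy probabilities, and the assumption that the two symbols are emitted independently given the state are all hypothetical choices made for illustration.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of log-values."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))

def forward_loglik(obs, states, log_start, log_trans, log_emit_aa, log_emit_ss):
    """Log-likelihood of a list of (amino_acid, secondary_structure) pairs.

    Assumption: given the state, the two symbols are independent, so the
    joint emission log-probability is the sum of the two emission terms.
    """
    def emit(s, pair):
        aa, ss = pair
        return log_emit_aa[s].get(aa, -math.inf) + log_emit_ss[s].get(ss, -math.inf)

    alpha = {s: log_start[s] + emit(s, obs[0]) for s in states}
    for pair in obs[1:]:
        alpha = {
            s: emit(s, pair) + logsumexp([alpha[r] + log_trans[r][s] for r in states])
            for s in states
        }
    return logsumexp(list(alpha.values()))

# Toy two-state model with made-up parameters (hypothetical, not trained)
log = math.log
states = ["helix", "turn"]
log_start = {"helix": log(0.8), "turn": log(0.2)}
log_trans = {"helix": {"helix": log(0.9), "turn": log(0.1)},
             "turn": {"helix": log(0.3), "turn": log(0.7)}}
log_emit_aa = {"helix": {"L": log(0.5), "Q": log(0.3), "G": log(0.2)},
               "turn": {"L": log(0.2), "Q": log(0.2), "G": log(0.6)}}
log_emit_ss = {"helix": {"H": log(0.8), "E": log(0.2)},
               "turn": {"H": log(0.3), "E": log(0.7)}}

obs = list(zip("LQQGL", "HHHEH"))  # paired columns: (amino acid, secondary structure)
print(forward_loglik(obs, states, log_start, log_trans, log_emit_aa, log_emit_ss))
```

A candidate segment could then be ranked by comparing its log-likelihood under a motif model with its log-likelihood under a background model.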

Trick 2
• There are similarities among the 20 naturally occurring amino acids
• Reduced alphabets

Reduced Alphabets
Schemes for reducing the amino acid alphabet based on the BLOSUM50 matrix of Henikoff and Henikoff (1992), derived by grouping and averaging the similarity matrix elements as described in the text (Murphy et al. 2000).
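Applying a reduced alphabet amounts to replacing each amino acid with a representative of its group before the sequence is passed to the model. The sketch below uses a 10-group scheme of the kind attributed to Murphy et al. (2000). The exact groups listed here are an assumption and should be checked against that paper, and the slides do not say which alphabet size was used.

```python
# Assumed 10-group reduction; verify the groups against Murphy et al. (2000).
MURPHY_10_GROUPS = ["LVIM", "C", "A", "G", "ST", "P", "FYW", "EDNQ", "KR", "H"]

# Map every amino acid to the first letter of its group.
REDUCE = {aa: group[0] for group in MURPHY_10_GROUPS for aa in group}

def reduce_sequence(seq):
    """Re-encode a sequence in the reduced alphabet; gaps and unknowns pass through."""
    return "".join(REDUCE.get(ch, ch) for ch in seq)

print(reduce_sequence("LQQITHIANQL-GLE"))
```

Because similar residues collapse to one symbol, fewer emission parameters have to be estimated from the same amount of training data.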

Cross-Family Evaluations
• True positive: HTH motifs that are correctly identified as such.
• False positive: Non-HTH motifs that are identified as HTH motifs.
• Alphabet: the alphabet used to encode amino acid sequences.

Questions

Within-Family Three-Fold Cross-Validations

Comparisons of HMM_AA_SS with FFAS03 in Cross-Family Evaluations

Putative HTH motifs in Ureaplasma parvum
