Sequence-based predictors. Secondary structure and surface accessibility
Bent Petersen, 13 January 2011
IT og Sundhed 2010/11
NetSurfP
• Real-value solvent accessibility predictions with amino-acid-associated reliability
Objective
• Predict residues as being either buried or exposed (25 % threshold)
• Two states/classes: Buried/Exposed
• Predict the Relative Solvent Accessibility, RSA
• “Real” value
What is ASA?
• Accessible Surface Area, in Å²
• Surface area accessible to a rolling water molecule
RSA
• RSA = Relative Solvent Accessibility
• ACC = Accessible area in protein structure
• ASA = Accessible Surface Area in Gly-X-Gly or Ala-X-Ala
• RSA = ACC / ASA
• Classification networks: Buried = RSA < 25 %, Exposed = RSA > 25 %
• “Real” value networks: values 0 - 1, RSA > 1 set to 1
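The definitions on this slide can be sketched in a few lines of Python. This is a minimal illustration, assuming RSA is the ratio ACC / ASA as the definitions imply. The function names and the reference value `EXAMPLE_MAX_ASA` are hypothetical placeholders, not values from the talk, and treating an RSA of exactly 25 % as buried is an assumption, since the slide leaves that case open.

```python
def relative_solvent_accessibility(acc, max_asa):
    """Return RSA = ACC / ASA, with values above 1 set to 1 (as on the slide)."""
    if max_asa <= 0:
        raise ValueError("max_asa must be positive")
    return min(acc / max_asa, 1.0)


def classify(rsa, threshold=0.25):
    """Two-state label using the 25 % threshold.

    Assumption: an RSA exactly equal to the threshold is labelled Buried.
    """
    return "Exposed" if rsa > threshold else "Buried"


# Hypothetical reference ASA (in Å²) for some residue type X in Gly-X-Gly;
# a real predictor would use a per-residue table of reference values.
EXAMPLE_MAX_ASA = 200.0

for acc in (11, 115, 230):
    rsa = relative_solvent_accessibility(acc, EXAMPLE_MAX_ASA)
    print(acc, round(rsa, 3), classify(rsa))
```

In a real pipeline the ACC value would come from the ACC column of a DSSP file, and the reference ASA would depend on the residue type.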
Why predict RSA?
• Residues exposed on the surface can be:
• Involved in PTMs
• Potential epitopes
• Involved in protein-protein interactions
• Prediction of disease SNPs
How to start?
• What do we want?
  • We want to be able to predict the exposure of an amino acid
• What do we need?
  • A training dataset and an independent evaluation dataset
• What information do we need?
  • True structural information the neural network can train on
• Where do we get that?
  • PDB, DSSP
Protein Data Bank, PDB
Berman, H.M., et al., The Protein Data Bank. Nucl. Acids Res., 2000. 28(1): p. 235-242.
Define Secondary Structure of Proteins, DSSP
Kabsch, W. and C. Sander, Dictionary of Protein Secondary Structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers, 1983. 22(12): p. 2577-2637.

Example DSSP output (PDB entry 3BTA, first 18 residues):

==== Secondary Structure Definition by the program DSSP, updated CMBI version by ElmK / April 1,2000 ==== DATE=23-MAR-2009 .
REFERENCE W. KABSCH AND C.SANDER, BIOPOLYMERS 22 (1983) 2577-2637 .
HEADER TOXIN 12-AUG-98 3BTA .
COMPND 2 MOLECULE: PROTEIN (BOTULINUM NEUROTOXIN TYPE A); .
SOURCE 2 ORGANISM_SCIENTIFIC: CLOSTRIDIUM BOTULINUM; .
AUTHOR R.C.STEVENS,D.B.LACY .
1277 2 2 1 1 TOTAL NUMBER OF RESIDUES, NUMBER OF CHAINS, NUMBER OF SS-BRIDGES(TOTAL,INTRACHAIN,INTERCHAIN) .
55121.0 ACCESSIBLE SURFACE OF PROTEIN (ANGSTROM**2) .
815 63.8 TOTAL NUMBER OF HYDROGEN BONDS OF TYPE O(I)-->H-N(J) , SAME NUMBER PER 100 RESIDUES .
24 1.9 TOTAL NUMBER OF HYDROGEN BONDS IN PARALLEL BRIDGES, SAME NUMBER PER 100 RESIDUES .
198 15.5 TOTAL NUMBER OF HYDROGEN BONDS IN ANTIPARALLEL BRIDGES, SAME NUMBER PER 100 RESIDUES .
1 0.1 TOTAL NUMBER OF HYDROGEN BONDS OF TYPE O(I)-->H-N(I-5), SAME NUMBER PER 100 RESIDUES .
10 0.8 TOTAL NUMBER OF HYDROGEN BONDS OF TYPE O(I)-->H-N(I-4), SAME NUMBER PER 100 RESIDUES .
125 9.8 TOTAL NUMBER OF HYDROGEN BONDS OF TYPE O(I)-->H-N(I+2), SAME NUMBER PER 100 RESIDUES .
134 10.5 TOTAL NUMBER OF HYDROGEN BONDS OF TYPE O(I)-->H-N(I+3), SAME NUMBER PER 100 RESIDUES .
276 21.6 TOTAL NUMBER OF HYDROGEN BONDS OF TYPE O(I)-->H-N(I+4), SAME NUMBER PER 100 RESIDUES .
9 0.7 TOTAL NUMBER OF HYDROGEN BONDS OF TYPE O(I)-->H-N(I+5), SAME NUMBER PER 100 RESIDUES .
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 *** HISTOGRAMS OF *** .
0 0 0 0 0 3 3 1 2 1 0 3 1 1 0 1 0 0 1 0 1 0 1 1 0 0 0 0 0 2 RESIDUES PER ALPHA HELIX .
2 0 1 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 PARALLEL BRIDGES PER LADDER .
15 10 7 5 8 2 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ANTIPARALLEL BRIDGES PER LADDER .
3 3 0 0 1 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 LADDERS PER SHEET .
# RESIDUE AA STRUCTURE BP1 BP2 ACC N-H-->O O-->H-N N-H-->O O-->H-N TCO KAPPA ALPHA PHI PSI X-CA Y-CA Z-CA
1 1 A P 0 0 5 0, 0.0 2,-3.8 0, 0.0 3,-0.2 0.000 360.0 360.0 360.0 132.0 74.7 55.7 73.4
2 2 A F - 0 0 115 92,-0.4 93,-0.1 1,-0.1 36,-0.1 -0.206 360.0-142.1 55.7 -62.1 74.7 59.2 74.7
3 3 A V - 0 0 11 -2,-3.8 35,-0.2 91,-0.1 -1,-0.1 0.867 4.9-143.8 70.2 103.3 78.3 59.8 73.7
4 4 A N S S+ 0 0 127 33,-0.3 2,-0.5 -3,-0.2 33,-0.1 0.914 73.7 44.0 -67.5 -53.8 80.1 61.9 76.4
5 5 A K S S- 0 0 94 32,-0.1 2,-0.5 1,-0.0 -1,-0.1 -0.857 79.6-124.0-105.1 133.1 82.5 64.2 74.5
6 6 A Q - 0 0 192 -2,-0.5 2,-0.1 1,-0.1 82,-0.1 -0.568 35.9-150.4 -71.8 118.5 81.6 66.2 71.4
7 7 A F - 0 0 14 -2,-0.5 2,-0.3 80,-0.1 3,-0.1 -0.388 16.9-164.3 -91.4 166.8 84.2 65.3 68.7
8 8 A N > - 0 0 71 -2,-0.1 3,-0.9 1,-0.1 77,-0.0 -0.977 28.9-124.4-143.4 141.5 85.7 67.1 65.7
9 9 A Y T 3 S+ 0 0 17 -2,-0.3 -1,-0.1 1,-0.2 72,-0.1 0.908 109.3 50.7 -57.8 -43.3 87.5 65.3 62.9
10 10 A K T 3 S+ 0 0 141 -3,-0.1 -1,-0.2 70,-0.1 3,-0.1 0.650 77.9 122.5 -70.3 -17.2 90.7 67.4 63.3
11 11 A D S < S- 0 0 45 -3,-0.9 3,-0.1 1,-0.1 2,-0.1 -0.203 77.6 -91.4 -48.0 134.3 91.0 66.8 67.1
12 12 A P - 0 0 99 0, 0.0 -1,-0.1 0, 0.0 -2,-0.1 -0.246 38.0-108.3 -57.6 128.3 94.4 65.3 67.8
13 13 A V + 0 0 41 -3,-0.1 6,-0.2 1,-0.1 4,-0.1 -0.238 38.6 179.2 -51.8 138.5 94.8 61.5 67.8
14 14 A N - 0 0 67 4,-3.7 2,-1.4 2,-0.2 5,-0.2 -0.085 45.1-107.4-144.3 45.7 95.4 60.3 71.4
15 15 A G S S+ 0 0 0 122,-0.4 2,-0.3 3,-0.2 4,-0.2 0.248 100.3 58.5 54.3 -18.1 95.7 56.6 71.7
16 16 A V S S- 0 0 72 -2,-1.4 -2,-0.2 2,-0.5 20,-0.1 -0.996 116.3 -7.4-142.5 145.9 92.2 56.3 73.3
17 17 A D S S+ 0 0 22 -2,-0.3 19,-2.5 18,-0.1 2,-0.2 0.389 136.6 45.3 53.3 -7.2 88.7 57.3 72.3
18 18 A I E S+A 35 0A 6 17,-0.3 -4,-3.7 -11,-0.0 -2,-0.5 -0.649 85.9 128.7-161.1 96.3 90.4 59.0 69.2
Define Secondary Structure of Proteins, DSSP
• DSSP defines 8 types of secondary structure
• G = 3-turn helix (3-10 helix)
• H = 4-turn helix (α-helix)
• I = 5-turn helix (π-helix)
• T = Hydrogen-bonded turn (3, 4 or 5 turn)
• E = Extended strand
• B = Residue in isolated β-bridge
• S = Bend
• Rest is C = coil
Required datasets
• Training/test
  • Used for optimization of settings using 10-fold cross-validation
• Evaluation
  • Used for final evaluation; less than 25 % sequence identity to the training/test dataset
10-fold Cross Validation
• Break the dataset into 10 sets, each 1/10 of the data
• Train on 9 sets and test on 1
• Repeat 10 times and take the mean accuracy
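The procedure above can be sketched as follows. This is a minimal, library-free illustration and not the code used for NetSurfP. The names `k_fold_indices`, `cross_validate` and the `train_and_score` callback are hypothetical, and the items would presumably be whole sequences rather than single residues, which is an assumption.

```python
import random


def k_fold_indices(n_items, k=10, seed=0):
    """Shuffle the item indices and deal them into k disjoint folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(items, train_and_score, k=10, seed=0):
    """Train on k-1 folds, test on the held-out fold, and return the mean score.

    train_and_score(train_items, test_items) is a user-supplied callback
    that trains a model and returns its accuracy on test_items.
    """
    folds = k_fold_indices(len(items), k, seed)
    scores = []
    for held_out, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != held_out for j in fold]
        train = [items[j] for j in train_idx]
        test = [items[j] for j in test_idx]
        scores.append(train_and_score(train, test))
    return sum(scores) / len(scores)


# Dummy callback: reports the fraction of items used for training.
data = list(range(100))
print(cross_validate(data, lambda train, test: len(train) / len(data)))
```

With 100 items and 10 folds, every round trains on 90 items and tests on the remaining 10, so each item is used for testing exactly once.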
Learning / Training dataset
• Training set: Cull_1764
• Max. sequence identity: 25 %
• Resolution: ≤ 2.0 Å
• R-factor: ≤ 0.2
• Sequence length: 30-3000 amino acids
• X-ray entries only
Learning / Training dataset
• Homology-reduced against the evaluation set CB513 (302 sequences removed)
• Final training set:
• 1,764 sequences
• 417,978 amino acids
• Buried: 55.80 % (233,221 amino acids)
• Exposed: 44.20 % (184,757 amino acids)
Learning / Training dataset
---Sequence/residue statistics---
Number of seq.: 1764
Longest seq.: 1T3T.A (1283)
Shortest seq.: 1YTV.M (6)
Number of amino acids: 417978
---Assignment category statistics---
B 184757 (44.20%)
A 233221 (55.80%)
---Amino acid statistics---
H 10025 (2.40%)
G 31743 (7.59%)
Y 14927 (3.57%)
V 30171 (7.22%)
E 27774 (6.64%)
S 24430 (5.84%)
P 19589 (4.69%)
A 35658 (8.53%)
R 21435 (5.13%)
Q 15535 (3.72%)
C 5202 (1.24%)
K 23054 (5.52%)
L 38489 (9.21%)
N 17756 (4.25%)
T 22998 (5.50%)
F 17181 (4.11%)
D 24743 (5.92%)
I 23550 (5.63%)
W 6365 (1.52%)
M 7353 (1.76%)
Evaluation dataset
• Final evaluation dataset: CB513
• 513 non-homologous sequences
• Sequence length: 20-754 amino acids
• 84,119 amino acids
• Buried: 55.81 % (46,948 amino acids)
• Exposed: 44.19 % (37,171 amino acids)
Evaluation dataset
---Sequence/residue statistics---
Number of seq.: 513
Longest seq.: 6acn.all (754)
Shortest seq.: 1atpi-1 (20)
Number of amino acids: 84119
---Assignment category statistics---
B 37171 (44.19%)
A 46948 (55.81%)
---Amino acid statistics---
R 3812 (4.53%)
T 5015 (5.96%)
D 4973 (5.91%)
C 1381 (1.64%)
Y 3065 (3.64%)
G 6657 (7.91%)
N 3976 (4.73%)
V 5795 (6.89%)
I 4642 (5.52%)
A 7267 (8.64%)
S 5222 (6.21%)
K 4976 (5.92%)
P 3903 (4.64%)
E 5050 (6.00%)
L 7134 (8.48%)
Q 3108 (3.69%)
M 1710 (2.03%)
H 1865 (2.22%)
W 1236 (1.47%)
F 3268 (3.88%)
X 19 (0.02%)
B 31 (0.04%)
Z 14 (0.02%)
Neural Network - Input
• Position Specific Scoring Matrices, PSSM (4 iterations of PSI-BLAST against nr70)

               A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
B H 2BEM.A 1  -4  -3  -2  -4  -6  -2  -3  -5  11  -6  -5  -3  -4  -4  -5  -3  -4  -5  -1  -6
A G 2BEM.A 2  -2  -5  -3  -4  -5  -4  -5   7  -5  -7  -6  -4  -5  -6  -5  -3  -4  -5  -6  -6
A Y 2BEM.A 3  -1   1  -4  -3  -5  -4  -4  -4   1  -4  -1  -4  -1   2  -5   0  -1   4   7  -2
A V 2BEM.A 4  -1  -5  -5  -6  -4  -4  -5  -5  -5   4   1  -5   6  -3  -2  -2   0  -5  -4   4
B E 2BEM.A 5  -2  -4  -3   0  -4  -1   3  -2  -4   0  -3  -2   1  -2  -3   3   3  -5  -4   0

• Secondary structure predictions (secondary structure predictor by Pernille Andersen)

B H 2BEM.A 1  0.003  0.003  0.966
A G 2BEM.A 2  0.018  0.086  0.868
A Y 2BEM.A 3  0.020  0.199  0.752
A V 2BEM.A 4  0.021  0.271  0.679
B E 2BEM.A 5  0.020  0.199  0.752
Secondary structure predictor
• Developed by Pernille Andersen, incorporated in NetSurfP
• Trained on 2,085 sequences using DSSP
• Three-state mapping from DSSP: H = H, E = E, C = all other states (., G, I, B, S and T)
• H ~ 30 %, E ~ 20 %, C ~ 50 %
• Performance of ~80 %
• Maximum theoretical limit is ~88 %
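The reduction from the eight DSSP states to the three classes listed above can be written as a one-line mapping. A minimal sketch, where the function name `reduce_dssp_states` is made up for illustration:

```python
def reduce_dssp_states(dssp_string):
    """Map a string of DSSP states to three classes.

    H stays H, E stays E, and every other symbol
    (., G, I, B, S, T) becomes C, following the slide.
    """
    return "".join(s if s in ("H", "E") else "C" for s in dssp_string)


print(reduce_dssp_states("HHGGEEB.TSI"))
```

Other three-state reductions exist (for example, counting G as helix), so results from predictors that use a different mapping are not directly comparable.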
Neural Network - Settings
• Window size: 11-19
• Hidden units: 10, 20, 25, 30, 40, 50, 75, 150, (200)
• Learning rate: 0.01 / (0.005)
• Epochs (training rounds): 200
• 10-fold cross-validation
• 9/10 used for training, 1/10 for testing
Neural network window
Sliding window of 7

170 2BEM.A mol:aa CHITIN-BINDING PROTEIN
HGYVESPASRAYQCKLQLNTQCGSVQYEPQSVEGLKGFPQAGPADGHIASADKSTFFELDQQTPTRWNKLNLKTGPNSFTWKLTARHSTTSWRYFITKPNWDASQPLTRASFDLTPFCQFNDGGAIPAAQVTHQCNIPADRSGSHVILAVWDIADTANAFYQAIDVNLSK
BAAABBAAAAAAAABBBBABBABBAABBABAABABBBAABBBABBABAAAABBBBABAAABABBBAABABBABAABABAAABABBBBAABAAAAAAABBBABABBBAAABAABBBAAAAAABBBBBABBBABABABAABBABBBAAAAAAAAABBBBBAAAAAAAABABB

Successive window positions:
• Prediction on middle residue: Serine, buried
• Prediction on middle residue: Proline, exposed
• Prediction on middle residue: Alanine, exposed
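The sliding-window idea can be illustrated with a short sketch. The function name `sliding_windows` and the padding symbol `X` at the sequence ends are assumptions made for this example. The actual network input is built from PSSM columns and secondary structure predictions rather than raw letters.

```python
def sliding_windows(sequence, size=7, pad="X"):
    """Yield (middle_residue, window) for every position in the sequence.

    The window is centred on the residue being predicted; positions
    beyond either end of the sequence are filled with `pad`.
    """
    if size % 2 == 0:
        raise ValueError("window size must be odd so there is a middle residue")
    half = size // 2
    padded = pad * half + sequence + pad * half
    for i, residue in enumerate(sequence):
        yield residue, padded[i:i + size]


# First residues of 2BEM.A from the slide
for residue, window in sliding_windows("HGYVESPASRAY"):
    print(residue, window)
```

For the 2BEM.A fragment, the windows centred on positions 6, 7 and 8 are `YVESPAS`, `VESPASR` and `ESPASRA`, matching the Serine, Proline and Alanine predictions shown on the slides.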
Wisdom of the crowd
• Selecting the best performing network architectures based on test performance
• Better than choosing any single network
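One common way to combine the selected networks is to average their outputs. The slide does not say how the networks are combined, so the averaging step, the function name `select_and_average` and the numbers in the example are all assumptions for illustration.

```python
def select_and_average(networks, top_n):
    """Average the predictions of the top_n networks ranked by test score.

    networks: list of (test_score, predictions) tuples, where predictions
    is a list with one output value per residue.
    """
    if not networks or top_n < 1:
        raise ValueError("need at least one network and top_n >= 1")
    best = sorted(networks, key=lambda net: net[0], reverse=True)[:top_n]
    n_positions = len(best[0][1])
    return [
        sum(preds[i] for _, preds in best) / len(best)
        for i in range(n_positions)
    ]


# Hypothetical test scores and per-residue outputs from three architectures
networks = [
    (0.70, [0.25, 0.75]),
    (0.75, [0.50, 0.50]),
    (0.60, [1.00, 0.00]),
]
print(select_and_average(networks, top_n=2))
```

Here the third network has the lowest test score and is left out, and the ensemble output is the mean of the remaining two.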
Results - Classification networks
• Training
• Evaluation
Results • Evaluation
NetSurfP
/usr/cbs/bio/src/NetSurfP/NetSurfP -h
NetDiseaseSNP
• Disease-SNP prediction (Morten Bo Johansen)
• Without NetSurfP:
  • Cross-validation: MCC = 0.569
  • Cross-evaluation: MCC = 0.560
• With NetSurfP:
  • Cross-validation: MCC = 0.583
  • Cross-evaluation: MCC = 0.572
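MCC is the Matthews correlation coefficient, computed from the four cells of a two-class confusion matrix. A minimal sketch of the standard formula follows. The counts in the example are invented, and returning 0 when the denominator is zero is a convention chosen here, not something stated in the talk.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # convention used in this sketch for the undefined case
    return (tp * tn - fp * fn) / denominator


# Invented counts: perfect, random-like and fully inverted predictions
print(matthews_corrcoef(50, 50, 0, 0))
print(matthews_corrcoef(25, 25, 25, 25))
print(matthews_corrcoef(0, 0, 50, 50))
```

MCC ranges from -1 to 1, with 0 corresponding to random prediction, so the reported rise from 0.569 to 0.583 is a gain on that scale.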
Statistics
• Submissions to the webserver from the CBS website
• As of 12 Jan 2011: 136,003 sequences submitted from 13,494 unique IPs