Protein structure prediction

Einat Granot, Liron Atedgi

Protein folding
• Protein folding is determined by the amino acid (AA) sequence
Why is knowing the folding important?
• To determine the protein's function
• To find distant evolutionary relationships
• To design drugs

Protein structures
• Primary structure
• Secondary structure
• Tertiary structure

Two prediction methods
• PSI-PRED: secondary structure prediction based on PSI-BLAST
• GenTHREADER: tertiary structure prediction
Both were developed by the group of David T. Jones, University of Warwick

General format of the methods
Sequence alignment + additional data -> Neural networks -> Structure prediction

Neural networks

Neural networks
[Figure: numerical inputs -> units -> output]
Why do we call it a neural network?
• Every unit performs a weighted calculation
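As a rough illustration of the "weighted calculation" a unit performs, the sketch below computes a weighted sum of numerical inputs and squashes it into the range 0 to 1. The name `unit_output`, the bias term and the choice of a logistic activation are assumptions made for illustration; the slides do not specify them.

```python
import math

def unit_output(inputs, weights, bias=0.0):
    """Weighted sum of the inputs followed by a logistic squashing function."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-total))

# With all-zero weights the weighted sum is 0, so the unit outputs 0.5.
print(unit_output([0.2, 0.7, 0.1], [0.0, 0.0, 0.0]))
```

A network is then just many such units wired together, with the outputs of one layer serving as the inputs of the next.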

Neural network hidden layer
[Figure: hidden layer]
• As more layers are added, the mean square error decreases

Neural network training
• Network connections and weights are determined by a training process
• Training is performed with samples of inputs and their expected outputs
• The learning algorithm is called back propagation
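To make "training with samples of input and expected output" concrete, here is a minimal sketch that trains a single logistic unit by gradient descent on the squared error. With no hidden layer, this is the weight update that back propagation reduces to; a full multi-layer implementation is not shown. The toy data (logical OR), the learning rate and the epoch count are all assumptions chosen for illustration.

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_unit(samples, epochs=2000, rate=0.5):
    """Train one logistic unit by gradient descent on squared error."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, expected in samples:
            out = logistic(sum(w * x for w, x in zip(weights, inputs)) + bias)
            # Error signal: output error scaled by the logistic derivative.
            delta = (expected - out) * out * (1.0 - out)
            weights = [w + rate * delta * x for w, x in zip(weights, inputs)]
            bias += rate * delta
    return weights, bias

def predict(weights, bias, inputs):
    return logistic(sum(w * x for w, x in zip(weights, inputs)) + bias)

# Toy training set (hypothetical): logical OR as (inputs, expected output).
OR_SAMPLES = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
weights, bias = train_unit(OR_SAMPLES)
for inputs, expected in OR_SAMPLES:
    print(inputs, expected, round(predict(weights, bias, inputs), 3))
```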

Network training & testing
After training we perform testing
• Training and testing groups must be chosen very carefully
What problems can arise?
• Insufficient training or testing
• The testing group may be biased
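One way to reduce the risk of a biased testing group is to split the data by group rather than by individual protein, so that related proteins never land on both sides. The sketch below assumes each protein carries a hypothetical group label (for example a fold or family identifier); the function name, the labels and the 20% test fraction are illustrative, not taken from the slides.

```python
import random

def split_by_group(proteins, test_fraction=0.2, seed=0):
    """Split (protein_id, group) pairs so that no group spans both sets."""
    groups = sorted({group for _, group in proteins})
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_test = max(1, int(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train = [pid for pid, group in proteins if group not in test_groups]
    test = [pid for pid, group in proteins if group in test_groups]
    return train, test

# Hypothetical protein ids and group labels.
proteins = [("p1", "a"), ("p2", "a"), ("p3", "b"), ("p4", "c"),
            ("p5", "c"), ("p6", "d"), ("p7", "e")]
train, test = split_by_group(proteins)
print("train:", train)
print("test:", test)
```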

A neural network is a "black box"
• The specific algorithm of a working neural network is not known
• It is hard to deduce new biological principles about the solved problem

PSI-PRED Secondary structure prediction

Secondary structure prediction
• DSSP defines 8 secondary structure categories
• PSI-PRED merges them into 3: Strand (E), Helix (H) and Coil (C)
AA:   RLMPHIKRSAIPVNHGQCRWEDNVDERTNCMIQYVLIMRD
Pred: CCCCCHHHCCCCCCEEEEEECCCCCCHHHHEEEEEECCCC
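The merge from 8 DSSP categories to 3 can be sketched as a simple lookup. The slide does not say which DSSP codes go to which class, so the mapping below (H and G to Helix, E and B to Strand, everything else to Coil) is an assumption; other reductions are in use.

```python
# Assumed reduction: helices (H, G) -> H, strands (E, B) -> E, all others -> C.
DSSP_TO_3 = {"H": "H", "G": "H", "E": "E", "B": "E"}

def reduce_dssp(states):
    """Map a string of 8-state DSSP codes to the 3 states H, E and C."""
    return "".join(DSSP_TO_3.get(s, "C") for s in states)

print(reduce_dssp("HGIEBTS-"))
```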

PSI-PRED
Sequence alignment (find homologs) -> Create protein profile -> Feed into first neural network -> Feed into second neural network -> Final prediction

Sequence alignment
• Find homologs of the target protein using PSI-BLAST
Reminder: what is PSI-BLAST?
• Position-Specific Iterated BLAST, which outputs a PSSM (position-specific scoring matrix)

PSI-BLAST pros & cons
Pros:
• Sensitive to distant homologs
• Reliable
• Accessible from every workstation
Cons:
• Sensitive to distant homologs, so the result might be biased
• Sensitive to repetitive sequences

Solving PSI-BLAST problems
• A special DB of 340,000 sequences was constructed for PSI-PRED
• This DB contains only unique, non-repetitive sequences

Create protein profile
• PSI-PRED uses the PSSM produced by PSI-BLAST after 3 iterations
• The matrix is transformed with the logistic function f(x) = 1 / (1 + e^(-x)), so the final values lie between 0 and 1
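A minimal sketch of the scaling step, assuming the transformation is the standard logistic function f(x) = 1 / (1 + e^(-x)) applied to every PSSM score independently. The function names and the example scores are hypothetical.

```python
import math

def logistic(x):
    """Standard logistic function; maps any real score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def scale_pssm(pssm):
    """Apply the logistic function to every score of an M x 20 matrix."""
    return [[logistic(score) for score in row] for row in pssm]

# Hypothetical fragment of one PSSM row: a neutral, a high and a low score.
print(scale_pssm([[0, 5, -5]]))
```

A score of 0 maps to 0.5, strongly positive scores approach 1 and strongly negative scores approach 0.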

[Figure: PSSM output of PSI-BLAST -> transformation]

Create protein profile
• The matrix size is M x 20, where M is the sequence length
• An additional column is added that marks the N/C terminus -> M x 21 matrix

Networks training & testing
• 187 proteins were selected according to CATH and PSI-BLAST
• CATH filters proteins according to the configuration of their folding domains (T-level)
• This is considered a strict selection

First neural network
• Each time, a window of 15 consecutive AA is fed into the first network
• The outputs for the window form a 15 x 3 matrix: one value per AA for each of Helix, Strand and Coil
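The windowing step can be sketched as follows: for every residue, take the 15 profile rows centred on it and flatten them into one input vector of 15 x 21 = 315 values. How positions beyond the ends of the sequence are encoded is an assumption here (all-zero scores with the extra terminus column set to 1); the function name is hypothetical.

```python
def windows(profile, size=15):
    """Yield one flattened window per residue of an M x 20 profile.

    Each window row holds 21 values: the 20 profile scores plus a flag that
    is 1.0 when the row lies beyond the N or C terminus (assumed padding).
    """
    half = size // 2
    n_cols = len(profile[0])
    for center in range(len(profile)):
        flat = []
        for pos in range(center - half, center + half + 1):
            if 0 <= pos < len(profile):
                flat.extend(list(profile[pos]) + [0.0])
            else:
                flat.extend([0.0] * n_cols + [1.0])
        yield flat

# Hypothetical profile: 30 residues, every scaled score set to 0.5.
profile = [[0.5] * 20 for _ in range(30)]
inputs = list(windows(profile))
print(len(inputs), len(inputs[0]))
```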

Second neural network
• The input to the 2nd network is the output of the 1st one
• Again, another column is added to indicate the N/C terminus

Why do we need a second network?
Let's examine a possible prediction from the 1st network:
Seq:  VLFLNDNLDDVVIGRPKRTYTAITL
Pred: EEEECCCCHHHCCCHCCCEEEECC
What is the problem with this prediction?
• A helix consisting of a single AA does not exist
• The 2nd network maintains coherence between adjacent AA and improves the accuracy
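The problem in the example prediction can be detected mechanically. The sketch below is only a diagnostic, not what the second network does: it lists runs of a given state that are shorter than a minimum length. The function name and the threshold of 3 are assumptions for illustration.

```python
from itertools import groupby

def short_runs(prediction, state="H", min_len=3):
    """Return (start, length) for runs of `state` shorter than `min_len`."""
    runs = []
    pos = 0
    for symbol, group in groupby(prediction):
        length = len(list(group))
        if symbol == state and length < min_len:
            runs.append((pos, length))
        pos += length
    return runs

# The example prediction from the slide; start positions are 0-based.
print(short_runs("EEEECCCCHHHCCCHCCCEEEECC"))
```

On the slide's example this flags the isolated H at 0-based position 14, which is the kind of incoherent call the second network is meant to smooth out.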

Final prediction
[Figure: image of a prediction, showing the degree of confidence, the target sequence and the secondary structure]

PSI-PRED evaluation
• CASP: Critical Assessment of Techniques for Protein Structure Prediction experiments
• At CASP3, PSI-PRED achieved the best results of all participating methods

PSI-PRED evaluation
Q3 average:
• PSI-PRED: 76.3%
• JPRED: 72.4%
• DSC: 67.3%
Q3 score: the percentage of AA predicted correctly
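Following the definition above, the Q3 score is simply the fraction of positions where the predicted state equals the observed one, expressed as a percentage. A minimal sketch, with a hypothetical function name and toy strings:

```python
def q3(predicted, observed):
    """Percentage of residues whose 3-state prediction matches the observed state."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("sequences must be non-empty and of equal length")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)

# Toy example: 4 of 5 positions agree.
print(q3("HHHCC", "HHCCC"))
```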

Reasons for success
• The use of PSI-BLAST
  • More sensitive (iterative algorithm)
  • More accurate (pairwise local alignments)
• The use of neural networks
• Strict selection for training & testing

Possible improvements
• Larger databases (training & alignment)
• Combination with other methods (JPRED)
• Predicting more than 3 secondary structure classes

Bring out the food…

GenTHREADER Tertiary structure prediction

Threading methods
• Try to thread a target AA sequence onto a template 3D structure
[Figure: target sequence M Q S N I L D V R E R A Q T V L C N K threaded onto a template structure]

Templates collection
• The target sequence is compared against a collection of sequences with known folding
• The collection was taken from the Brookhaven Protein Data Bank and includes unique sequences

GenTHREADER
Sequence alignment -> Calculate threading potential -> Feed into neural network -> Final prediction

Sequence alignment
• The target sequence is aligned against each of the templates twice:
  • Target profile against template sequence
  • Target sequence against template profile
• The best result is taken

Creating a profile
Steps for creating a profile:
• Alignment against the OWL DB (a DB of protein sequences)
• Selection of sequences with an E-value lower than 0.01
• Construction of a profile using BLOSUM50

Creating a profile
[Figure: example alignment; residues A L M P H I K R S A I P V N H G Y V I M Q C R W E D N S T K V]

Calculate threading potential
The threading potential includes:
• Pairwise potential
• Solvation potential

Pairwise potential
• Potential for the interaction between two AA
• Takes into account analysis of known structures and favorable energy configurations
• A lower pairwise potential indicates a more favorable state

Solvation potential
• Calculated per AA and proportional to its degree of burial
• Degree of burial (DOB): the number of other AA located within a radius of 10 Å
• Hydrophobic amino acids: a high DOB is preferred
• Hydrophilic amino acids: a low DOB is preferred
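The degree of burial as defined above can be sketched as a neighbour count. The sketch assumes each AA is represented by a single 3D coordinate in angstroms (for example one representative atom per residue, which the slide does not specify) and that "within" includes the boundary; the function name and coordinates are hypothetical. `math.dist` requires Python 3.8 or later.

```python
import math

def degree_of_burial(coords, radius=10.0):
    """For each residue, count the other residues within `radius` angstroms."""
    counts = []
    for i, a in enumerate(coords):
        n = sum(1 for j, b in enumerate(coords)
                if j != i and math.dist(a, b) <= radius)
        counts.append(n)
    return counts

# Hypothetical coordinates: three residues close together, one far away.
coords = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (6.0, 0.0, 0.0), (30.0, 0.0, 0.0)]
print(degree_of_burial(coords))
```

This pairwise loop is quadratic in the number of residues, which is fine for a single protein chain in a sketch like this.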

Feed into neural network
• The prediction is very complex, therefore a neural network is used

Neural network
• Again, the 6 input parameters are converted to values between 0 and 1 using the function f(x) = 1 / (1 + e^(-x))
• The output is a value between 0 and 1 showing the confidence of the match

Network training & testing
• The network was trained using pairs of proteins with known folding patterns
• Again, the training and testing sets were kept separate to avoid bias
