Class 7: Protein Secondary Structure
Protein Structure • Amino-acid chains can fold to form 3-dimensional structures • Proteins are sequences that have a (more or less) stable 3-dimensional configuration
Why Is Structure Important? The structure a protein takes is crucial for its function • Forms “pockets” that can recognize an enzyme’s substrate • Situates side chains of specific groups so that they co-locate to form areas with desired chemical/electrical properties • Creates firm structures such as collagen, keratins, and fibroins
Determining Structure • X-ray and NMR methods allow us to determine the structure of proteins and protein complexes • These methods are expensive and difficult • It can take several months of work to process one protein • A centralized database (PDB) contains all solved protein structures • XYZ coordinates of atoms within a specified precision • ~23,000 proteins have solved structures
Structure Is Sequence Dependent • Experiments show that for many proteins, the 3-dimensional structure is a function of the sequence • Force the protein to lose its structure by introducing agents that change the environment • After the sequences are put back in water, the original conformation/activity is restored • However, for complex proteins, there are cellular processes that “help” in folding
Secondary Structure • α-helix • β-strands
α-Helix • Single protein chain • One turn every 3.6 amino acids • Shape maintained by intramolecular H-bonding between -C=O and H-N-
Amphipathic α-Helix • Hydrophilic residues on one side • Hydrophobic residues on the other side
β-Strands • Alternating 120° angles • Often form sheets
β-Strands Form Sheets • Can be anti-parallel or parallel • The sheets are held together by hydrogen bonds across strands
…which can form a β-barrel • Example: porin, a membrane transporter
Angular Coordinates • Secondary structures constrain the backbone (φ, ψ) angles between residues
Ramachandran Plot • We can relate the (φ, ψ) angles to types of secondary structure
Define "secondary structure" 3D protein coordinates may be converted to a 1D secondary structure representation using DSSP or STRIDE DSSP EEEE_SS_EEEE_GGT__EE_E_HHHHHHHHHHHHHHHGG_TT DSSP= Database of Secondary Structure in Proteins
DSSP symbols • H = α-helix: backbone angles near (φ, ψ) ≈ (-60°, -50°) and an i→i+4 H-bonding pattern • E = extended strand: backbone angles near (-120°, +120°) with β-sheet H-bonds (parallel/anti-parallel are not distinguished) • B = isolated β-bridge (isolated backbone H-bonds) • T = turn (specific sets of angles and one i→i+3 H-bond) • G = 3-10 helix (i→i+3 H-bonds) • I = π-helix (i→i+5 H-bonds; rare!) • S = bend • _ = unclassified, none of the above: a generic loop, or a β-strand with no regular H-bonding
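To make the labeling concrete, here is a minimal Python sketch (not part of the original slides) of collapsing DSSP symbols into the 3-state alphabet (H/E/C) used by the prediction methods below. The mapping is one common convention, not the only one.

```python
# One common 8-to-3-state reduction: helices (H, G, I) -> H,
# strands/bridges (E, B) -> E, everything else -> C (coil).
DSSP_TO_3STATE = {
    "H": "H", "G": "H", "I": "H",   # alpha-, 3-10, and pi-helices
    "E": "E", "B": "E",             # extended strands and isolated bridges
    "T": "C", "S": "C", "_": "C",   # turns, bends, unclassified loops
}

def to_three_state(dssp_string):
    """Map every DSSP symbol to H, E, or C."""
    return "".join(DSSP_TO_3STATE.get(symbol, "C") for symbol in dssp_string)

print(to_three_state("EEEE_SS_EEEE_GGT__EE_E_HHHHHHHHHHHHHHHGG_TT"))
```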
Labeling Secondary Structure • Using both hydrogen-bond patterns and angles, we can assign secondary structure labels from the XYZ coordinates of the amino acids • These rules do not lead to an absolute definition of secondary structure
Prediction of Secondary Structure Input: • amino-acid sequence Output: • an annotation sequence over three classes: • alpha • beta • other (sometimes called coil/turn) Measure of success: • percentage of residues that are correctly labeled
Accuracy of 3-state predictions • True SS: EEEE_SS_EEEE_GGT__EE_E_HHHHHHHHHHHHHHHGG_TT • Prediction: EEEELLLLHHHHHHLLLLEEEEEHHHHHHHHHHHHHHHHHHLL • Q3-score = % of 3-state symbols that are correct, measured on a "test set" • Test set = an independent set of cases (proteins) that were not used to train, or in any way derive, the method being tested • Best methods: PHD (Burkhard Rost): 72-74% Q3; Psi-pred (David T. Jones): 76-78% Q3
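As an illustration of the Q3 score defined above, here is a minimal sketch; the two label strings are made-up examples, not real predictions.

```python
def q3_score(true_ss, predicted_ss):
    """Percentage of residues whose 3-state label (H/E/C) is predicted correctly."""
    assert len(true_ss) == len(predicted_ss)
    correct = sum(t == p for t, p in zip(true_ss, predicted_ss))
    return 100.0 * correct / len(true_ss)

true_ss = "CEEEECCCCHHHHHHHHHC"   # illustrative true labels
pred_ss = "CEEEECCCHHHHHHHHCCC"   # illustrative predicted labels
print(f"Q3 = {q3_score(true_ss, pred_ss):.1f}%")
```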
What can you do with a secondary structure prediction? (1) Find out if a homolog of unknown structure is missing any of the SS (secondary structure) units, i.e. a helix or a strand. (2) Find out whether a helix or strand is extended/shortened in the homolog. (3) Model a large insertion or terminal domain (4) Aid tertiary structure prediction
Statistical Methods • From the PDB database, calculate the propensity of a given amino acid to adopt a certain secondary-structure type • Example: #residues = 20,000, #helix residues = 4,000, #Ala = 2,000, #Ala in helix = 500 • P(α, Ala) = 500/20,000, P(α) = 4,000/20,000, P(Ala) = 2,000/20,000 • Propensity = P(α, Ala) / [P(α) · P(Ala)] = 0.025 / (0.2 × 0.1) = 1.25 (500 Ala observed in helices vs. 400 expected) • Used in the Chou-Fasman algorithm (1974)
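The propensity calculation above can be written directly in code. This is a minimal sketch using the slide's Ala numbers; the function and variable names are illustrative.

```python
def propensity(n_aa_in_state, n_aa, n_state, n_total):
    """Observed frequency of (amino acid, state) divided by the frequency
    expected if amino acid and state were independent."""
    p_joint = n_aa_in_state / n_total   # P(state, aa)
    p_state = n_state / n_total         # P(state)
    p_aa = n_aa / n_total               # P(aa)
    return p_joint / (p_state * p_aa)

# Ala example from the slide: 500 observed vs. 400 expected -> 1.25
print(propensity(n_aa_in_state=500, n_aa=2_000, n_state=4_000, n_total=20_000))
```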
Chou-Fasman: Initiation • Identify regions where 4 of 6 consecutive residues have helix propensity P(H) > 1.00 • This forms an "alpha-helix nucleus"
Chou-Fasman: Propagation • Extend the helix in both directions until a set of four residues has an average P(H) < 1.00.
Chou-Fasman Prediction • Predict as α-helix a segment with • E[Pα] > 1.03 • E[Pα] > E[Pβ] • not including proline • Predict as β-strand a segment with • E[Pβ] > 1.05 • E[Pβ] > E[Pα] • Other residues are labeled as turns/loops. (Various extensions appear in the literature)
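A simplified sketch of the helix nucleation rule above (4 of 6 residues with P(H) > 1.00) follows; the propensity values are placeholders rather than the published Chou-Fasman table, and the extension and final-prediction rules are omitted for brevity.

```python
# Placeholder helix propensities (NOT the published Chou-Fasman table).
P_HELIX = {"A": 1.42, "E": 1.51, "L": 1.21, "M": 1.45,
           "G": 0.57, "P": 0.57, "S": 0.77, "V": 1.06}

def helix_nuclei(sequence, table=P_HELIX, default=1.0):
    """Start positions of 6-residue windows in which at least 4 residues
    have helix propensity above 1.00 (the nucleation rule)."""
    p = [table.get(aa, default) for aa in sequence]
    return [i for i in range(len(sequence) - 5)
            if sum(1 for x in p[i:i + 6] if x > 1.00) >= 4]

print(helix_nuclei("GPSAELLMAVGPS"))
```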
Achieved accuracy: around 50% • Shortcoming of this method: it ignores the sequence context, predicting from individual amino acids • We would like to use the sequence context as input to a classifier • There are many ways to address this • The most successful to date are based on neural networks
Artificial Neuron • [Diagram: inputs a1…ak with weights W1…Wk feeding a single output] • A neuron is a multiple-input, single-output unit: output = f(W1a1 + … + Wkak + b) • Wi = weights assigned to the inputs; b = internal "bias" • f = output function (linear, sigmoid)
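A minimal sketch of that computation, with made-up weights and a sigmoid output function:

```python
import math

def neuron(inputs, weights, bias):
    """Single neuron: output = f(sum_i W_i * a_i + b), here with f = sigmoid."""
    z = sum(w * a for w, a in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative values only.
print(neuron(inputs=[0.5, 1.0, -0.2], weights=[0.8, -0.3, 1.5], bias=0.1))
```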
Artificial Neural Network • [Diagram: input layer a1…ak connected through a hidden layer to output units o1…om] • Neurons in hidden layers compute "features" from the outputs of previous layers • Output neurons can be interpreted as a classifier
Example: Fruit Classifier • Features: Shape, Texture, Weight, Color • Apple: ellipse, hard, heavy, red • Orange: round, soft, light, yellow
Qian-Sejnowski Architecture • [Diagram: a window of residues S(i-w) … S(i) … S(i+w) is fed through a hidden layer to three output units oα, oβ, ocoil]
Neural Network Prediction • A neural network defines a function from inputs to outputs • Inputs can be discrete or continuous valued • In this case, the network defines a function from a window of size 2w+1 around a residue to a secondary structure label for it • The structure element is determined by max(oα, oβ, ocoil)
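A minimal sketch (not the original Qian-Sejnowski code) of this input/output convention: each residue in the 2w+1 window is one-hot encoded over the 20 amino acids plus a padding symbol for positions beyond the chain ends, and the three network outputs are reduced to a label by taking the maximum.

```python
import numpy as np

ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"   # 20 amino acids + padding symbol

def encode_window(sequence, center, w):
    """One-hot encode the 2w+1 residues around `center`, padding past the ends."""
    vec = np.zeros((2 * w + 1, len(ALPHABET)))
    for row, pos in enumerate(range(center - w, center + w + 1)):
        aa = sequence[pos] if 0 <= pos < len(sequence) else "-"
        vec[row, ALPHABET.index(aa)] = 1.0
    return vec.ravel()

def label_from_outputs(outputs):
    """outputs = (o_alpha, o_beta, o_coil); pick the structure with the largest value."""
    return "HEC"[int(np.argmax(outputs))]

x = encode_window("MKVLATGAND", center=4, w=6)    # 13 x 21 = 273 input values
print(x.shape, label_from_outputs([0.2, 0.7, 0.1]))
```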
Training Neural Networks • By modifying the network weights, we change the function • Training is performed by • defining an error score for training pairs <input, output> • performing gradient-descent minimization of the error score • The back-propagation algorithm allows us to compute the gradient efficiently • We have to be careful not to overfit the training data
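To make the training step concrete, here is a minimal sketch of one gradient-descent step for a single-layer softmax classifier with a cross-entropy error score; a full back-propagation implementation through hidden layers is omitted, so this is only an illustration of the idea.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def train_step(W, x, true_class, learning_rate=0.01):
    """One gradient-descent step on the cross-entropy error for one <input, output> pair."""
    p = softmax(W @ x)            # predicted class probabilities
    grad = np.outer(p, x)         # dError/dW = (p - onehot(true_class)) x^T ...
    grad[true_class] -= x         # ... subtract x from the true class's row
    return W - learning_rate * grad

# Illustrative shapes: 3 output classes, 273 input values (13-residue window).
W = np.zeros((3, 273))
x = np.random.default_rng(0).random(273)
W = train_step(W, x, true_class=0)
```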
Smoothing Outputs • The Qian-Sejnowski network assigns each residue a secondary structure by taking max(oα, oβ, ocoil) • Some sequences of secondary structure labels are impossible (e.g., a single-residue helix) • To smooth the output of the network, another layer is applied on top of the three output units for each residue: • a neural network • a Markov model
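As a toy illustration of the smoothing idea (real methods use a second neural network or a Markov model, as listed above), here is a minimal rule-based sketch that relabels implausibly short helix/strand runs as coil:

```python
def smooth(ss, min_run=3):
    """Relabel runs of H or E shorter than min_run as coil (C)."""
    out = list(ss)
    i = 0
    while i < len(ss):
        j = i
        while j < len(ss) and ss[j] == ss[i]:
            j += 1                      # j is the end of the current run
        if ss[i] in "HE" and (j - i) < min_run:
            out[i:j] = "C" * (j - i)    # run too short to be a real element
        i = j
    return "".join(out)

print(smooth("CCHHHHHCEHCCEEEECC"))   # isolated H/E runs become C
```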
Success Rate • Variants of the neural network architecture and other methods achieved accuracy of about 65% on unseen proteins • Depending on the exact choice of training/test sets
Breaking the 70% Threshold • An innovation that made a crucial difference uses evolutionary information to improve prediction Key idea: • Structure is preserved more than sequence • Surviving mutations are not random • Suppose we find homologues (same structure) of the query sequence • The types of replacements at position i during evolution provide information about the role of residue i in the secondary structure
Nearest Neighbor Approach • Select a window around the target residue • Perform local alignment to sequences with known structure • Choose an alignment weight matrix suited to remote homologies • The alignment weight takes into account the secondary structure of the aligned sequence • Predict using max(nα, nβ, ncoil) or max(sα, sβ, scoil) • Key: a scoring measure of evolutionary similarity
PHD Approach Multi-step procedure: • Perform BLAST search to find local alignments • Remove alignments that are “too close” • Perform multiple alignments of sequences • Construct a profile (PSSM) of amino-acid frequencies at each residue • Use this profile as input to the neural network • A second network performs “smoothing”
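A minimal sketch of the profile (PSSM-like) construction step described above: for each column of a multiple alignment, count amino-acid frequencies. Real PHD/PSI-BLAST profiles also use sequence weighting and pseudocounts, which are omitted here, and the toy alignment is made up.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def column_frequencies(alignment):
    """alignment: list of equal-length aligned sequences ('-' marks a gap)."""
    profile = []
    for column in zip(*alignment):
        counts = Counter(aa for aa in column if aa in AMINO_ACIDS)
        total = sum(counts.values()) or 1
        profile.append({aa: counts[aa] / total for aa in AMINO_ACIDS})
    return profile

alignment = ["MKV-LA", "MRVQLA", "MKIQLA"]      # toy multiple alignment
profile = column_frequencies(alignment)
print(round(profile[1]["K"], 2))                # fraction of K in column 2
```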
Psi-pred: same idea • (Step 1) Run PSI-BLAST → output a sequence profile • (Step 2) A 15-residue sliding window (315 inputs) is multiplied by the weights of the 1st neural net; the output is 3 values (one for each state H, E, or L) per position • (Step 3) 60 inputs (a window of the 1st net's per-position outputs) are multiplied by the weights of the 2nd neural network and summed; the output is the final 3-state prediction • Performs slightly better than PHD
Other Classification Methods • Neural networks were used as the classifier in the methods described above • We can apply the same idea with other classifiers, e.g., SVMs • Advantages: effectively avoids overfitting; supplies a prediction confidence
SVM-Based Approach • Suggested by S. Hua and Z. Sun (2001) • Multiple sequence alignment from the HSSP database (same as PHD) • Sliding window of w residues → 21·w input dimensions • Apply an SVM with an RBF kernel • Multiclass problem: trained one-against-others (e.g., H/~H, E/~E, L/~L) and binary (e.g., H/E); combined by • maximum output score • a decision tree method • a jury decision method
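The SVM formulation can be sketched with scikit-learn; this is an illustration under the window-encoding assumption above, not the Hua & Sun implementation. X would hold the window encodings and y the per-residue 3-state labels; one-against-others training is handled by OneVsRestClassifier.

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.random((200, 15 * 21))           # placeholder window encodings
y = rng.choice(list("HEC"), size=200)    # placeholder 3-state labels

# RBF-kernel SVM, trained one-against-others for the 3-class problem.
clf = OneVsRestClassifier(SVC(kernel="rbf", gamma="scale"))
clf.fit(X, y)
print(clf.predict(X[:5]))
```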
Decision tree • [Diagram: a tree of binary SVM decisions (H/~H, E/~E, C/~C, then H/E, E/C, C/H) whose leaves assign the final H, E, or C label]
State of the Art • Both PHD and nearest-neighbor methods get about 72%-74% accuracy • Both predicted well in CASP2 (1996) • PSI-Pred is slightly better (around 76%) • Recent trend: combining classification methods; such combinations gave the best predictions in CASP3 (1998) • Failures: • long-range effects: S-S bonds, parallel strands • chemical patterns • wrong predictions at the ends of helices/strands