PROTEOMICS 3D Structure Prediction

Contents • Protein 3D structure. • Basics • PDB • Prediction approaches • Protein classification.

Protein Structures: • Primary: amino acid sequence. • Secondary: alpha helices & beta sheets, loops. • Tertiary: packing of secondary elements. • Quaternary: packing of several polypeptide chains.

How Does a Protein Fold? • The classical nucleation-propagation model: • The first step (fast) is hydrophobic collapse accompanied by the formation of secondary structures. Domains are formed in this step. • The second step (slow) is the precise ordering of the secondary elements: packing of the hydrophobic core, domain arrangement, etc. • The 3D structure is assumed to be the most stable structure, i.e. the one with minimal free energy. • Open question: is it a local minimum or the global minimum?

Prions • Proteins found in mammals. • Responsible for mad cow disease (bovine spongiform encephalopathy). • There is no difference between the sequence of a normal prion and that of an abnormal prion. • The difference lies in the 3D structure. • The disease is assumed to propagate when an abnormal prion is introduced and converts normal prions to the abnormal configuration. • Conclusion: a single protein can have several stable configurations.

PDB - Protein Data Bank • http://www.rcsb.org/pdb/index.html • Contains proteins whose structure has been solved. • Number of solved proteins: 19,225. • Ratio of solved structures to known proteins: 1/7 (SwissProt) to 1/40 (TrEMBL). • The entry for each protein consists of the x, y, z coordinates of every atom. • Tutorial: http://www.rcsb.org/pdb/query_tut.html
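
A minimal sketch of what "x, y, z coordinates of every atom" looks like in practice, assuming the fixed-column ATOM record layout of the legacy PDB text format. The helper names `format_atom` and `parse_atom` and the sample atoms are hypothetical and exist only for this illustration; real entries carry more fields (occupancy, temperature factor, element) that are ignored here.

```python
import math


def format_atom(serial, name, resname, chain, resseq, x, y, z):
    """Build one fixed-width ATOM record (hypothetical helper for this demo).

    Columns: 1-6 record name, 7-11 serial, 13-16 atom name, 18-20 residue
    name, 22 chain, 23-26 residue number, 31-54 x, y, z (8 characters each).
    """
    return (f"ATOM  {serial:>5} {name:<4} {resname:>3} {chain}{resseq:>4}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}")


def parse_atom(line):
    """Extract the atom identity and its coordinates by column position."""
    return {
        "name": line[12:16].strip(),
        "resname": line[17:20].strip(),
        "chain": line[21],
        "resseq": int(line[22:26]),
        "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
    }


# Two made-up alpha-carbon records, not taken from any real entry.
records = [
    format_atom(1, " CA ", "GLY", "A", 124, 0.0, 0.0, 0.0),
    format_atom(2, " CA ", "LEU", "A", 125, 3.0, 4.0, 0.0),
]
atoms = [parse_atom(r) for r in records]

# Distance between the two atoms, in the same units as the coordinates.
print(math.dist(atoms[0]["xyz"], atoms[1]["xyz"]))  # 5.0
```

Parsing by column rather than by whitespace matters because adjacent coordinate fields can touch when values are large or negative.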

Prion Protein Domain from Mouse – Entry 1AG2: Ribbons Vs. Cylinders

Broad View of the protein world I • Estimation: ~1000-20,000 protein families composed of members that share detectable sequence similarity. • A new sequence is expected to be similar to other sequences in the database, and can be expected to share structural features with these proteins. • Structure prediction: • >50% sequence identity implies a similar structure. • >30% sequence identity implies common structural elements.
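
The two identity thresholds above can be sketched as a small calculation. This assumes the two sequences are already aligned (gaps written as "-") and that identity is counted over aligned, non-gap columns only; other denominators are in use, so treat the definition as an assumption. The function names and the example alignment are hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of identical residues over columns where neither side is a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b)
             if a != gap and b != gap]
    if not pairs:
        return 0.0
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)


def structural_expectation(identity):
    """Apply the slide's rule of thumb to a percent identity value."""
    if identity > 50:
        return "similar structure"
    if identity > 30:
        return "common structural elements"
    return "no prediction from identity alone"


# Made-up alignment: 4 identical residues out of 5 gap-free columns.
identity = percent_identity("ACDE-G", "ACDFHG")
print(identity, structural_expectation(identity))
```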

Broad View of the protein world II • There is a limited number of different 3D structures. • When newly determined structures are compared with previously found ones, the new structure often folds into alpha & beta elements in the same order and in the same spatial configuration as an already known structure. • Often there is no sequence similarity. • Totally different sequences can fold into similar structures.

Three Main Approaches for Structural Prediction: • Ab initio. • Comparative modeling. • Fold recognition. [Figure: example of a folding pathway for a 2-domain protein.]

http://www.pdg.cnb.uam.es/cursos/FVi2001DIA1/

The Ab Initio Method • The structure prediction problem: “Given a protein sequence, compute its structure”. • Computation is based on energy calculations derived from the position of each atom in space and its physico-chemical interactions with other atoms. • Theoretically possible. • Astronomical, highly under-constrained search space. • The biophysics is complex and incompletely understood. • In practice, next to impossible.

Comparative (Homology) Modeling • Evolutionarily related (homologous) proteins usually have similar structures. • Structural similarity is very high in core regions (helices & sheets). • However, loops may vary even between homologous structures with a high degree of sequence similarity. • Figure legend: thick backbone is the known structure; thin lines are the modeled structure. Some side chains are not positioned correctly, but some look good.

Modeling Performance • Structure similarity predicted from sequence similarity: • Sander & Schneider (1991) aligned all the sequences in PDB. • Developed a formula for structure similarity based on sequence similarity. • Structure similarity depends on the length of the protein.

Modeling Performance - Examples • A protein of 10 amino acids requires 80% identity for a similar structure. • A protein of length > 80 requires • ~30% identity for common sub-structures. • ~50% identity for a similar structure. • ~80% identity for a similar structure in a very good resolution.
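
The length dependence can be sketched with the commonly quoted form of the Sander & Schneider threshold curve. The coefficients below are recalled from the literature rather than taken from these slides, so they are an assumption to be checked against the original paper; the function name is hypothetical. The curve gives roughly 80% at length 10, in line with the first example above, and flattens to about 25% for long alignments, so the ~30% quoted for length > 80 is a slightly more conservative round figure.

```python
def identity_threshold(length):
    """Percent identity above which structural similarity is expected.

    Assumed form: t(L) = 290.15 * L**-0.562 for 10 <= L <= 80,
    and a constant 24.8 for longer alignments.
    """
    if length < 10:
        raise ValueError("the curve is not defined for alignments shorter than 10")
    if length <= 80:
        return 290.15 * length ** -0.562
    return 24.8


for n in (10, 20, 40, 80, 150):
    print(n, round(identity_threshold(n), 1))
```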

Fold Recognition Approaches • Fold: a combination of secondary structural units in the same configuration. • Protein structural classification uses the fold as a basic level of classification.

Fold<->Family Relations • Estimation 1: There are 1500-20,000 protein families, based on homology. Each family has roughly one fold. • Estimation 2: There are 700-1500 protein folds. • Conclusions: • Many protein families share the same fold. • Different sequences can fold similarly. • The common fold approach to structure prediction: use the collection of determined structures to predict the structure of a new protein.

How Condensed is a Fold? How many different sequences can result in the same fold for an average domain of 150 amino acids? • There are 20^150 ~ 10^200 different sequences. • About 10^38 are less than 20% identical. • Assume that only 1 in a million has a stable fold: 10^32. • The expected number of different folds is 1000. • About 10^29 different sequences fold similarly.
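
The estimate above is easiest to follow in log10 units. The sketch below only redoes the slide's arithmetic; the 10^38, one-in-a-million, and 1000-fold figures are the slide's own inputs, and the variable names are hypothetical. Note that 20^150 is closer to 10^195, so the round figure of 10^200 is an order-of-magnitude statement.

```python
import math

domain_length = 150
alphabet_size = 20

# Total sequence space: log10(20^150) = 150 * log10(20).
log10_total = domain_length * math.log10(alphabet_size)

log10_dissimilar = 38       # sequences less than 20% identical (slide's figure)
log10_stable_fraction = -6  # "1 in a million" has a stable fold
log10_folds = 3             # about 1000 distinct folds

log10_stable = log10_dissimilar + log10_stable_fraction
log10_per_fold = log10_stable - log10_folds

print(f"sequence space   ~ 10^{log10_total:.1f}")
print(f"stable sequences ~ 10^{log10_stable}")
print(f"per fold         ~ 10^{log10_per_fold}")
```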

Fold Recognition • A fold is shared by family members, both close and distant (distance is related to sequence similarity). Example: the globin fold. • For a query protein: if its family members are identified and their fold is known, we can assign it the same fold. • Method 1: Which alignment algorithm detects both close and distant relatives? PSI-BLAST.

Fold Recognition - Threading • Threading allows for identification of structure similarity without sequence similarity. • The amino acid (aa) sequence of a query protein is examined for compatibility with the structural core of a known protein. • The question asked: “Given a protein structure, what sequences fold into it?”

Threading • The protein core is a very compact environment composed of alpha and beta secondary structures. • Very hydrophobic, no place for water molecules, other aa, or aa with chemically different side chains. • Side chains have many contacts with neighboring aa for stability. • Threading matches the aa of the query with aa of a known structure: • If threading gives a good score, then the core of the query is assumed to fold similarly.

Threading • Two main methods: • Contact potential method. • Structural profile (environmental template). • Contact potential method: • The number of contacts and the proximity between aa are analyzed for every known structure. • The query is checked against all the interactions in the core and their contribution to the stability of the structure. • The fold that results in the most energetically stable structure is chosen.
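
The contact potential idea can be sketched as follows. This is a toy under strong assumptions: the query is placed on each template without gaps, each fold is reduced to a list of contacting position pairs, and the pair energies are a made-up hydrophobic/polar scheme rather than a potential derived from known structures. All names and numbers are hypothetical.

```python
HYDROPHOBIC = set("AVLIMFWC")  # crude two-class split, for illustration only


def pair_energy(a, b):
    """Toy contact energy: reward hydrophobic pairs, penalize mixed pairs."""
    a_hyd, b_hyd = a in HYDROPHOBIC, b in HYDROPHOBIC
    if a_hyd and b_hyd:
        return -1.0
    if a_hyd != b_hyd:
        return 0.5
    return 0.0


def threading_energy(seq, contacts):
    """Sum the pair energies over every core contact (i, j) of a template."""
    return sum(pair_energy(seq[i], seq[j]) for i, j in contacts)


def best_fold(seq, folds):
    """Pick the template whose contacts give the lowest (most stable) energy."""
    return min(folds, key=lambda name: threading_energy(seq, folds[name]))


# Two made-up templates, each described only by its core contacts.
folds = {
    "fold_A": [(0, 3), (1, 4), (0, 4)],
    "fold_B": [(0, 1), (2, 3), (4, 5)],
}
query = "LKELVK"

for name, contacts in folds.items():
    print(name, threading_energy(query, contacts))
print("chosen:", best_fold(query, folds))
```

A real implementation would also search over alignments of the query to each template, which is what makes threading computationally demanding.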

Threading - Structural Profile • The environment of every aa in known structures is determined, including the secondary structure, the area of the side chain that is buried by closeness to other atoms, the types of nearby side chains, etc. • Each position is classified into one of 18 types: • 6 classes representing increasing levels of residue burial and the fraction of surface covered by polar atoms, • combined with 3 classes of secondary structure. • Each aa is assessed for its ability to fit into that type of site in the structure. • For example, a buried position is matched well by a hydrophobic aa.
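
The 18 environment types are simply the product of the 6 burial/polarity classes and the 3 secondary-structure classes. The sketch below enumerates them; the labels (B1 to B3 for buried, P1 and P2 for partly buried, E for exposed) are an assumed naming convention that does not appear on the slide, and no burial thresholds are implied.

```python
from itertools import product

# Assumed labels for the 6 burial/polarity classes and 3 secondary-structure classes.
BURIAL_CLASSES = ["B1", "B2", "B3", "P1", "P2", "E"]
SECONDARY_CLASSES = ["helix", "strand", "other"]

# One environment type per (burial, secondary structure) combination.
ENVIRONMENTS = [f"{b}/{s}" for b, s in product(BURIAL_CLASSES, SECONDARY_CLASSES)]

# Row index of each environment type in a scoring table (18 rows by 20 aa).
ENV_INDEX = {env: i for i, env in enumerate(ENVIRONMENTS)}

print(len(ENVIRONMENTS))
print(ENVIRONMENTS[:3])
```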

Structural Profile • Profile rows are the residue positions in the structure, each assigned one of the 18 environment types. • Profile columns are the 20 aa plus insertion and deletion penalties. • If a residue position is inside a loop, many substitutions are allowed, as well as insertions and deletions. • The score for a given aa at a position estimates how well that aa fits the position's environment type. • How do we find the best-fitting region?

Structural Profile • Dynamic programming algorithm finds the best match of a query sequence to a specific fold. • Statistical significance can be computed by doing the above for all sequences in the database. • The same analysis will be repeated for each fold. • The fold with the best statistically significant score is chosen.
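
A minimal sketch of the dynamic programming step and the significance estimate, under stated assumptions: the alignment is local (Smith-Waterman style), gaps cost a flat penalty unless a profile row supplies its own under the hypothetical key "gap", and significance is expressed as a Z-score against the scores of other database sequences. The profile, the sequences, and all scores are made up.

```python
import statistics


def align_to_profile(seq, profile, gap=-2.0, default=-1.0):
    """Best local alignment score of a sequence against a structural profile.

    profile: one dict per structure position mapping aa -> fit score;
    residues missing from a row score `default`.
    """
    rows, cols = len(profile), len(seq)
    score = [[0.0] * (cols + 1) for _ in range(rows + 1)]
    best = 0.0
    for i in range(1, rows + 1):
        row = profile[i - 1]
        row_gap = row.get("gap", gap)  # position-specific gap penalty if given
        for j in range(1, cols + 1):
            fit = row.get(seq[j - 1], default)
            score[i][j] = max(
                0.0,                          # start a new local alignment
                score[i - 1][j - 1] + fit,    # place residue j at position i
                score[i - 1][j] + row_gap,    # skip structure position i
                score[i][j - 1] + row_gap,    # skip sequence residue j
            )
            best = max(best, score[i][j])
    return best


def z_score(query, profile, database):
    """How many standard deviations the query scores above the database."""
    background = [align_to_profile(s, profile) for s in database]
    spread = statistics.pstdev(background)
    if spread == 0:
        return float("inf")
    return (align_to_profile(query, profile) - statistics.mean(background)) / spread


# Made-up three-position profile and a tiny background "database".
toy_profile = [{"L": 3.0, "V": 2.0}, {"K": 3.0, "R": 2.0}, {"L": 3.0}]
toy_database = ["GGGGGG", "AAKAAA", "LGGGGG"]

print(align_to_profile("AALKLAA", toy_profile))
print(z_score("AALKLAA", toy_profile, toy_database))
```

Repeating `z_score` for each fold's profile and keeping the fold with the highest significant value mirrors the selection rule described above.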

Threading - Pros and Cons: • Pros: • Good results. • Environmental properties may be more accurate than amino acid similarity matrices. • Can lead to effective and fast implementations. • Able to discover structural similarities that are impossible to detect by sequence searching methods. • Cons: • Requires the existence of already known proteins with a similar structure.

CASP - Critical Assessment of Structure Prediction • Competition among different groups to predict the 3D structure of proteins that are about to be solved experimentally. • Current state: only fragments are “solved”: • Ab initio: the worst, but greatly improved in recent years. • Comparative modeling: performs very well when homologous sequences with known structures exist. • Fold recognition: PSI-BLAST is used for training the threading procedures. Performs well.

A Clickable Structure Prediction Flowchart: http://www.bmm.icnet.uk/people/rob/CCP11BBS/flowchart2.html

Protein Classification • Proteins are classified to reflect both structural and evolutionary relatedness. The principal levels are: • Family: clear evolutionary relationship. In general, > 30% pairwise residue identity between the proteins. • Superfamily: probable common evolutionary origin. Combines families whose member proteins have low sequence identities, but whose structural and functional features suggest a common evolutionary origin. • Structurally, superfamily members share a common fold.

SCOP - Structural Classification of Proteins • http://scop.mrc-lmb.cam.ac.uk/scop/ • Hierarchical classification of all proteins with known structures. • Classification, from the top level down: • Class - all alpha, all beta, alpha/beta (a/b), alpha + beta (a+b). • Fold - the major structural similarity unit. • Superfamily. • Family. • PDB entry for a protein.

CATH - Class, Architecture, Topology, Homologous Superfamily • http://www.biochem.ucl.ac.uk/bsm/cath_new/index.html • Another protein structure classification database. • Classification: • Class - mainly alpha, mainly beta, alpha-beta, few secondary structures. • Architecture - gross orientation of secondary structures, independent of connectivity. • Topology - clusters structures according to their topological connections and numbers of secondary structures. • Homologous superfamily - clusters proteins with highly similar structures and functions.

Pfam - Protein Families • http://www.sanger.ac.uk/Software/Pfam/ • Database that contains a large collection of multiple sequence alignments and profile hidden Markov models (profile HMMs). • A profile HMM is a probabilistic model that describes a set of sequences. • Widely used to describe related sequences. • Defines domains: regions of homology that have a 3D structure independent of the rest of the protein.

ProtoMap • http://protomap.cornell.edu/ • Classification of all the proteins in the SWISS-PROT and TrEMBL databases into groups of related proteins.
