Master Thesis

Master Thesis: Protein Folding Pathway Prediction, by Haitham Ahmad Gamal. Supervised by Prof. Ibrahim M. El-Henawy, Dr. Ahmed H. Kamal, Dr. Hisham Al-Shishiny

Agenda • Problem Statement • Motivation • Approach • Previous Work • Biological Background • What Affects Folding • Why is it difficult • Data Set • Methodology (the 4 stages) • Hypothesis (formally stated) • Results • Conclusion

Problem Statement • Proteins are the most vital agents in living bodies. • Their function is what concerns scientists. 3D Structure Hydrophobicity Function • Much effort has gone into structure prediction, but with limited success. Results are: • premature, due to the huge conformation search space; • or insufficiently accurate, due to simplifications.

Problem Statement cont. • In this study we try to limit this search space to the most likely conformations of a protein by answering the following questions: • (1) Do angle measures depend on the hydrophobicity of the (neighboring) amino acids? • (2) If the answer to question (1) is "yes", how many neighbors should be used? • (3) If the answer to question (1) is "yes", what are the most likely values of the angles in the protein's final structure?

Motivation • Knowing how a protein folds enables us to understand how it functions. • With this level of understanding we can affect a protein, either by enhancing or by suppressing it. • Drugs can be built to affect certain proteins directly, or through other proteins that interact with the protein under investigation.

Approach • The approach used in this study is a statistical, machine learning approach. • We use it to answer the previous questions. Distribution Fitting Clustering

Previous Work • In our study we are not developing a prediction algorithm. • We are proving hypotheses that can improve several types of prediction algorithms. • Prediction algorithms/techniques can be classified based on different criteria: Ab initio vs. Homology; On-lattice vs. Off-lattice; Heuristic vs. Statistics; Protein-based vs. Subsequence-based. • Our study fits in the coloured classes across all these criteria.

Biological Background (Primary Structure)

Biological Background

Biological Background • The tertiary structure is the minimum free energy structure of a protein (for single chain proteins)

What affects the folding process • It has been proven that the function of a protein depends on its 3D structure, not its primary structure. • The most influential factor in the folding of proteins (especially globular proteins) is the hydrophobicity of their constituent amino acids. • Amino acids are either charged (soluble) or contain aromatic groups (insoluble). • The hydrophobicity values of all the 20 known amino acids are called the hydrophobicity scale.

What affects the folding process

Why is it difficult to simulate folding exactly? • An exact simulation of the folding of a short peptide may take months on a supercomputer. • The number of possible conformations is huge. • Scientists proved that solving the problem for the HP model (a simplified model) is NP-Complete. • Current technologies cannot keep pace with this God-created miracle.

Data Set • A collection of more than 1000 proteins is taken randomly from the SCOP protein databank. • Each SCOP entry (file) represents one protein with all its features, including its exact atom coordinates. • Angles are extracted using the three-dimensional coordinates of each Cα atom.

Methodology • Angle Extraction • Chopping to Subsequences • K-means Clustering • Distribution Fitting

[Figure: annotated atom records of a protein coordinate file, labelling the Atom Serial Number, Residue Name, Residue Sequence Number, and the X, Y and Z coordinates, shown for the 3rd, 4th and 5th residues; the same fields are read for every residue until the end.]
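The fields labelled on this slide match the fixed-column ATOM records of the PDB file format. The sketch below reads Cα coordinates from such records, assuming the SCOP entries are PDB-style text files; the function name `read_ca_atoms` and the returned tuple layout are illustrative choices, not taken from the thesis.

```python
def read_ca_atoms(lines):
    """Collect C-alpha atoms from PDB-style ATOM records.

    Assumption: fixed-column PDB layout (atom name in columns 13-16,
    residue name in 18-20, residue number in 23-26, x/y/z in 31-54).
    Returns a list of (residue_name, residue_number, (x, y, z)).
    """
    atoms = []
    for line in lines:
        if not line.startswith("ATOM"):
            continue  # skip headers, HETATM, TER and other records
        if line[12:16].strip() != "CA":
            continue  # keep only the C-alpha atom of each residue
        residue_name = line[17:20].strip()
        residue_number = int(line[22:26])
        x = float(line[30:38])
        y = float(line[38:46])
        z = float(line[46:54])
        atoms.append((residue_name, residue_number, (x, y, z)))
    return atoms
```

Only the Cα records are kept because the angles in the next stage are defined on consecutive Cα atoms alone.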

First Stage (Angle extraction) • The angle that lies between each three consecutive Cα atoms is called angle θ. • Let a be the vector from Cα_i to Cα_{i-1} and b be the vector from Cα_i to Cα_{i+1}, where each Cα atom is given by its (x, y, z) coordinates. • θ can then be calculated using the cosine law: $\cos\theta = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$ • As shown in the figure, the angles θ1, θ2, θ3, ... are calculated at each Cα atom starting from Cα1 until CαL-1, where L is the protein length.
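The angle at a Cα atom can be computed from the two vectors pointing to its neighbours. The sketch below uses the dot-product form of the cosine law and returns degrees; the function names and the choice of degrees are assumptions made for illustration.

```python
import math


def ca_angle(prev_ca, ca, next_ca):
    """Angle in degrees at `ca`, between the vectors to its two neighbours."""
    a = [p - c for p, c in zip(prev_ca, ca)]  # vector from Ca_i to Ca_(i-1)
    b = [q - c for q, c in zip(next_ca, ca)]  # vector from Ca_i to Ca_(i+1)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cos_theta = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.degrees(math.acos(cos_theta))


def backbone_angles(ca_coords):
    """Angles at every interior C-alpha atom of a chain of (x, y, z) points."""
    return [
        ca_angle(ca_coords[i - 1], ca_coords[i], ca_coords[i + 1])
        for i in range(1, len(ca_coords) - 1)
    ]
```

A chain of L Cα atoms yields L - 2 angles, since the first and last atoms have only one neighbour.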

Second Stage (Chopping) • After the angles of all the proteins are extracted, each protein sequence is divided into subsequences of length n. • A subsequence must contain an odd number of residues. • A sliding window technique is used to chop the whole protein sequence into pieces. • The value of n is crucial in our study, as will be shown in the results section.

Second Stage (Chopping) Let's take n = 5 as an example. [Figure: backbone of residues aa0 to aa8 with the angles Θ0 to Θ7 marked between them.] The first subsequence runs from aa0 to aa4, and the effect of this subsequence on the central angle Θ1 is what concerns us in this study. Similarly, the effect of each following subsequence, running in general from aa_i to aa_{i+n-1}, on the measurement of the central angle Θ_{i+floor(n/2)-1} is studied.
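The sliding window can be sketched as follows, using the indexing given on the slide: window i covers aa_i to aa_{i+n-1} and is paired with the angle Θ_{i+floor(n/2)-1}. It is assumed here that `angles[k]` holds Θ_k, the angle at residue k+1; the function name `chop` is illustrative.

```python
def chop(residues, angles, n):
    """Pair every length-n window of residues with its central angle.

    Assumption: angles[k] is the angle at residue k + 1, so a chain of
    L residues has L - 2 angles.
    """
    if n < 3 or n % 2 == 0:
        raise ValueError("n must be an odd number of at least 3")
    half = n // 2
    pairs = []
    for i in range(len(residues) - n + 1):
        window = residues[i:i + n]      # aa_i .. aa_(i+n-1)
        central = angles[i + half - 1]  # Theta_(i + floor(n/2) - 1)
        pairs.append((window, central))
    return pairs
```

For n = 5 the first window aa0 to aa4 is paired with `angles[1]`, matching the Θ1 of the example.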

Third Stage (Clustering) Let's take n = 3 as an example. • Since hydrophobicity is the main factor affecting protein folding, the centroids were determined accordingly. • The choice of centroids is meant to cover all the possible hydrophobicity patterns of a subsequence of length n. • The number of initial centroids is $2^n$. [Figure: initial centroid patterns, each residue marked Hydrophobic or Hydrophilic, ranging from All Hydrophilic to All Hydrophobic.]
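Covering every hydrophobic/hydrophilic pattern of a length-n subsequence gives one centroid per binary pattern. The sketch below enumerates them and assigns a subsequence to its nearest centroid; encoding hydrophilic as 0.0 and hydrophobic as 1.0 is an assumption for illustration, and the actual values would depend on the hydrophobicity scale used in the study.

```python
from itertools import product


def initial_centroids(n, hydrophilic=0.0, hydrophobic=1.0):
    """All 2**n hydrophilic/hydrophobic patterns of length n.

    The two encoding values are placeholders, not values from the thesis.
    """
    return [list(p) for p in product((hydrophilic, hydrophobic), repeat=n)]


def nearest_centroid(vector, centroids):
    """Index of the centroid with the smallest squared Euclidean distance."""
    def squared_distance(centroid):
        return sum((v - c) ** 2 for v, c in zip(vector, centroid))
    return min(range(len(centroids)), key=lambda k: squared_distance(centroids[k]))
```

For n = 3 this yields 8 centroids, from all hydrophilic `[0.0, 0.0, 0.0]` to all hydrophobic `[1.0, 1.0, 1.0]`. A full k-means run would then alternate this assignment step with recomputing each centroid as the mean of its members.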

Fourth Stage (Distribution Fitting) • Both the clustered and the unclustered data are compared using the Kolmogorov-Smirnov test against 66 continuous probability distributions, which are: • Beta, Burr, Burr (4P), Cauchy, Chi-Squared, Chi-Squared (2P), Dagum, Dagum (4P), Erlang, Erlang (3P), Error, Error Function, Exponential, Exponential (2P), Fatigue Life, Fatigue Life (3P), Frechet, Frechet (3P), Gamma, Gamma (3P), Gen. Extreme Value, Gen. Gamma, Gen. Gamma (4P), Gen. Logistic, Gen. Pareto, Gumbel Max, Gumbel Min, Hypersecant, Inv. Gaussian, Inv. Gaussian (3P), Johnson SB, Johnson SU, Kumaraswamy, Laplace, Levy, Levy (2P), Log-Gamma, Log-Logistic, Log-Logistic (3P), Log-Pearson 3, Logistic, Lognormal, Lognormal (3P), Nakagami, Normal, Pareto, Pareto 2, Pearson 5, Pearson 5 (3P), Pearson 6, Pearson 6 (4P), Pert, Phased Bi-Exponential, Phased Bi-Weibull, Power Function, Rayleigh, Rayleigh (2P), Reciprocal, Rice, Student's t, Triangular, Uniform, Wakeby, Weibull and Weibull (3P).
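The Kolmogorov-Smirnov statistic is the largest gap between the empirical distribution of a sample and a candidate distribution's CDF. The sketch below computes it for any CDF passed as a function, with the normal CDF as one example candidate; it illustrates the statistic only and is not the fitting tool used in the study, which the slides do not name.

```python
import math


def normal_cdf(x, mu=0.0, sigma=1.0):
    """CDF of a normal distribution, used here as one example candidate."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def ks_statistic(sample, cdf):
    """One-sample Kolmogorov-Smirnov statistic D for `sample` against `cdf`."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # The empirical CDF jumps from (i - 1) / n to i / n at x, so the
        # gap is checked on both sides of the jump.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d
```

A smaller D means a closer fit; the fit is rejected at a given significance level when D exceeds the corresponding critical value for the sample size.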

Hypothesis (Formally Stated) In this study we argue for two assumptions. The first part of the hypothesis suggests that the angle measurements of a protein sequence follow a pattern based on the hydrophobicity of the surrounding local amino acid residues. The second part suggests that the reliability of these patterns increases as the number of neighboring amino acid residues taken into consideration increases.

Results (clustering results)

Results (Proving Hypothesis I) Tricky: KS-statistic values alone are not enough for a complete interpretation.

Results (Proving Hypothesis I) cont. The number of rejected critical values shows that the fits of the unclustered data are spurious fits. The number of tested critical values is 5.

Results (Proving Hypothesis II) The KS-statistic shows that the larger the value of n, the better the fit.

Results (Proving Hypothesis II) cont. Looking deeper at the rejected-values test, all 5 test values are rejected for n = 3, while n = 7 gives zero rejected values, which supports our hypothesis.

Conclusion • It is now clear that there is a direct relationship between the hydrophobicity of the residues of a subsequence (local neighbours) and the measurements of the backbone angles. Classifying a subsequence into one of the available clusters gives good insight into the angle measurements and, consequently, the structure of the subsequence. • The length of the subsequence is also an influential factor in the angle measurement prediction process. Longer subsequences achieve better fits to one of the standard continuous probability distributions.

Future Work • These results can be used to guide the search process in a complete protein structure prediction algorithm. • The local angle-hydrophobicity relationship can be combined with heuristic techniques such as genetic algorithms to restrict the initial population to statistically familiar conformations. • Approximations of our results can be applied to crystalline-lattice protein models such as the cube-octahedron lattice model, which allows several possible angles: 60°, 90°, 120° and 180°. • It is possible to investigate applying the same approach to subsequences longer than 7 residues and to try to minimize the required processing time.

Published Paper • Title: A CENTRAL-3-RESIDUES-BASED CLUSTERING APPROACH FOR STUDYING THE EFFECT OF HYDROPHOBICITY ON PROTEIN BACKBONE ANGLES • Authors: Prof. Ibrahim M. El-Henawy, Dr. Ahmed H. Kamal, Dr. Hisham Al-Shishiny, Haitham Gamal • Published in the Egyptian Computer Science Journal (ECS Journal), ISSN-1110-2586, Volume 32, Number 1, May 2009

Thank you
