
G54DMT – Data Mining Techniques and Applications http://www.cs.nott.ac.uk/~jqb/G54DMT. Dr. Jaume Bacardit jqb@cs.nott.ac.uk Topic 4: Applications Lecture 3: Protein Structure Prediction.



  1. G54DMT – Data Mining Techniques and Applications http://www.cs.nott.ac.uk/~jqb/G54DMT Dr. Jaume Bacardit jqb@cs.nott.ac.uk Topic 4: Applications Lecture 3: Protein Structure Prediction. Some material taken from Arthur Lesk, Introduction to Bioinformatics, 2nd edition, Oxford University Press, 2005, and from Livingstone, C.D., Barton, G.J.: Protein sequence alignments: a strategy for the hierarchical analysis of residue conservation. Computer Applications in the Biosciences 9 (1993) 745-756.

  2. Outline • Brief introduction to protein structure • Motivation and definition of PSP • PSP: A family of problems • Data Mining protein’s structural aspects • Dimensionality reduction for protein datasets • Summary

  3. Protein Structure: Introduction • Proteins are molecules of primary importance for the functioning of life • Structural proteins (collagen, nails, hair, etc.) • Enzymes • Transmembrane proteins • Proteins are polypeptide chains constructed by joining building blocks called amino acids in a linear way • The chain of amino acids, however, folds to create very complex 3D structures • There is a general consensus that the end state of the folding process depends on the amino acid composition of the chain

  4. Amino Acids

  5. Protein Structure: Introduction • Different amino acids have different properties • These properties will affect the protein structure and function • Hydrophobicity, for instance, is the main driving force (but not the only one) of the folding process

  6. Protein Structure: Hierarchical nature of protein structure • Primary structure = sequence of amino acids, e.g. MKYNNHDKIRDFIIIEAYMFRFKKKVKPEVDMTIKEFILLTYLFHQQENTLPFKKIVSDLCYKQSDLVQHIKVLVKHSYISKVRSKIDERNTYISISEEQREKIAERVTLFDQIIKQFNLADQSESQMIPKDSKEFLNLMMYTMYFKNIIKKHLTLSFVEFTILAIITSQNKNIVLLKDLIETIHHKYPQTVRALNNLKKQGYLIKERSTEDERKILIHMDDAQQDHAEQLLAQVNQLLADKDHLHLVFE • Secondary structure (driven by local interactions) • Tertiary structure (driven by global interactions)

  7. Motivation for PSP • The function of a protein depends greatly on its structure • The structure that a protein adopts is vital to its chemistry • Its structure determines which of its amino acids are exposed to carry out the protein's function • Its structure also determines what substrates it can react with • However, the structure of a protein is very difficult to determine experimentally, and in some cases almost impossible

  8. Protein Structure Prediction • That is why we have to predict it • PSP aims to predict the 3D structure of a protein based on its primary sequence

  9. Prediction types of PSP • There are several kinds of prediction problems within the scope of PSP • The main one, of course, is to predict the 3D coordinates of all atoms of a protein (or at least the backbone) based on its primary sequence • There are many structural properties of individual residues within a protein that can be predicted, for instance: • The secondary structure state of the residue • Whether a residue is buried in the core of the protein or exposed on the surface • Accurate predictions of these sub-problems can simplify the general 3D PSP problem

  10. Prediction types of PSP • There is an important distinction between these two classes of prediction • 3D PSP is generally treated as an optimisation problem • The prediction of structural aspects of protein residues is generally treated as a machine learning problem

  11. DATA MINING PROTEIN’S STRUCTURAL ASPECTS

  12. Prediction of structural aspects of protein residues • Many of these features are due to local interactions between an amino acid and its immediate neighbours • Can they be predicted using information from the closest neighbours in the chain? • In this simplified example, to predict the SS state of residue i we would use information from residues i-1, i and i+1. That is, a window of ±1 residues around the target: (Ri-1, Ri, Ri+1) → SSi; (Ri, Ri+1, Ri+2) → SSi+1; (Ri+1, Ri+2, Ri+3) → SSi+2
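The windowing scheme above can be sketched in a few lines. A minimal example, assuming positions past the chain ends are padded with the special symbol X (as in the ARFF example on the next slide); the function name make_windows is ours.

```python
def make_windows(sequence, half_width):
    """Return one fixed-width window string per residue.

    Positions that fall off either end of the chain are padded
    with the special symbol 'X'.
    """
    pad = "X" * half_width
    padded = pad + sequence + pad
    width = 2 * half_width + 1
    return [padded[i:i + width] for i in range(len(sequence))]

# A window of +/-1 residues around each position of a 5-residue chain:
make_windows("AEIKH", 1)  # -> ['XAE', 'AEI', 'EIK', 'IKH', 'KHX']
```

Each window then becomes one training instance, with the window's central residue supplying the class label.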

  13. ARFF file for a simple PSP dataset

  @relation AA+CN_Q2
  @attribute AA_-4 {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,X,Y}
  @attribute AA_-3 {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,X,Y}
  @attribute AA_-2 {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,X,Y}
  @attribute AA_-1 {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,X,Y}
  @attribute AA {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y}
  @attribute AA_1 {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,X,Y}
  @attribute AA_2 {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,X,Y}
  @attribute AA_3 {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,X,Y}
  @attribute AA_4 {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,X,Y}
  @attribute class {0,1}
  @data
  X,X,X,X,A,E,I,K,H,0
  X,X,X,A,E,I,K,H,Y,0
  X,X,A,E,I,K,H,Y,Q,0
  X,A,E,I,K,H,Y,Q,F,0
  A,E,I,K,H,Y,Q,F,N,0
  E,I,K,H,Y,Q,F,N,V,0
  I,K,H,Y,Q,F,N,V,V,0
  K,H,Y,Q,F,N,V,V,M,1
  H,Y,Q,F,N,V,V,M,T,0
  Y,Q,F,N,V,V,M,T,C,1

  14. What information do we include for each residue? • Early prediction methods used just the primary sequence: the AA types of the residues in the window • However, the primary sequence carries a limited amount of information • It does not contain any evolutionary information: it does not say which residues are conserved and which are not • Position-Specific Scoring Matrices, a product of a Multiple Sequence Alignment, provide this information • Some of these predictions can be used as input for other predictions

  15. Position-Specific Scoring Matrices (PSSM) • For each residue in the query sequence, compute the distribution of amino acids of the corresponding residues in all aligned sequences (discarding those too similar to the query) • These distributions will tell us which mutations are likely and which are less likely for each residue in the query sequence • A PSSM profile will also tell us which residues are more conserved (important) • In practical terms: instead of representing each AA as 1 discrete variable with 20(+1) values, we now represent it as 20 continuous variables
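The core idea above can be illustrated with a toy computation: one column of aligned residues becomes 20 continuous values. This is only a frequency profile; a real PSSM (e.g. as produced by PSI-BLAST) additionally applies log-odds scoring against background frequencies. The name column_profile is ours.

```python
from collections import Counter

AA = "ACDEFGHIKLMNPQRSTVWY"

def column_profile(aligned_column):
    """Frequency of each amino acid in one MSA column.

    Returns a 20-element vector, one value per amino acid type,
    ignoring gaps and unknown symbols.
    """
    counts = Counter(c for c in aligned_column if c in AA)
    total = sum(counts.values()) or 1
    return [counts[a] / total for a in AA]

# A well-conserved column: mostly alanine, one glycine mutation.
profile = column_profile("AAAG")
```

A highly conserved column concentrates its mass on one amino acid; a variable column spreads it out, which is exactly the conservation signal the slide describes.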

  16. Secondary Structure Prediction • The most usual way is to predict whether a residue belongs to an α helix, belongs to a β sheet, or is in coil state • Several programs can determine the actual SS state of a protein from a PDB file; the most common of them is DSSP • Typically, a window of ±7 amino acids (15 in total) is used. This means 300 attributes (when using PSSMs: 15 × 20) • A dataset with 1000 proteins at ~250 AA/protein would have ~250,000 instances

  17. Coordination Number Prediction • Two residues of a chain are said to be in contact if their distance is less than a certain threshold (e.g. 8Å) • CN of a residue: the count of contacts that the residue has • CN gives us a simplified profile of the packing density of the protein (figure: contacts in the native state mapped back onto the primary sequence)
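A minimal sketch of this definition, assuming we already have one coordinate (e.g. the Cα atom) per residue. The min_separation rule for excluding trivial chain neighbours is an assumption here; the exact exclusion convention varies between CN definitions.

```python
import math

def coordination_numbers(coords, threshold=8.0, min_separation=2):
    """Count contacts per residue from per-residue 3D coordinates.

    Two residues are in contact when their distance is below
    `threshold` angstroms; pairs closer than `min_separation`
    positions along the chain are skipped.
    """
    n = len(coords)
    cn = [0] * n
    for i in range(n):
        for j in range(i + min_separation, n):
            if math.dist(coords[i], coords[j]) < threshold:
                cn[i] += 1
                cn[j] += 1
    return cn
```

On a toy chain of four residues where only the first and third are within 8Å (and not adjacent), the result is [1, 0, 1, 0].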

  18. CN as a classification problem • The number of contacts, depending on the definition, can be either an integer or a continuous number • To treat this problem (and some others mentioned later) as a classification problem, we need to discretise the output • Unsupervised methods are applied, e.g. uniform-length (UL) and uniform-frequency (UF) discretisation
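The two unsupervised discretisation schemes can be sketched as follows. The helper names are ours, and cut-point conventions (e.g. interpolating between neighbouring values at interval boundaries) vary between implementations; this is the simplest variant.

```python
def uniform_length_cuts(values, k):
    """Cut points for k equal-width intervals over the observed range (UL)."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / k
    return [lo + step * i for i in range(1, k)]

def uniform_frequency_cuts(values, k):
    """Cut points putting (roughly) equal numbers of examples per interval (UF)."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(n * i) // k] for i in range(1, k)]

values = [0, 1, 1, 2, 2, 2, 3, 10]
ul = uniform_length_cuts(values, 2)     # midpoint of [0, 10] -> [5.0]
uf = uniform_frequency_cuts(values, 2)  # median-ish cut -> [2]
```

Note how the outlier value 10 drags the UL cut upwards while UF, which only looks at ranks, is unaffected; this is why the two schemes can produce very different class definitions for skewed CN distributions.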

  19. Example of a rule set generated by GAssist for CN prediction • All AA types associated with the central residue are hydrophobic (core of a protein) • D and E consistently do not appear in the predicates; they are negatively charged residues (surface of a protein)

  20. Other predictions • Other kinds of residue structural aspects can be predicted • Solvent accessibility: the amount of surface of each residue that is exposed to solvent • Recursive Convex Hull: a metric that models a protein as an onion and assigns each residue to a layer. Formally, each layer is a convex hull of points • These features (and others) are predicted in a similar way as done for SS or CN

  21. PSP datasets are good ML benchmarks • These problems can be modelled in many ways: • Regression or classification problems • Low/high number of classes • Balanced/unbalanced classes • Adjustable number of attributes • Ideal benchmarks! • http://www.infobiotic.net/PSPbenchmarks/

  22. Contact Map prediction • Predicting, given two residues from a chain, whether these two residues are in contact or not • This problem can be represented by a binary matrix: 1 = contact, 0 = non-contact • Plotting this matrix reveals many characteristics of the protein structure, such as helices and sheets
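The binary matrix described above can be built directly from per-residue coordinates, reusing the 8Å contact definition from the CN slide. A minimal sketch; whether the diagonal and near-diagonal pairs count as contacts is a convention, and here the diagonal is simply zeroed.

```python
import math

def contact_map(coords, threshold=8.0):
    """Binary contact matrix: 1 = contact, 0 = non-contact."""
    n = len(coords)
    return [[1 if i != j and math.dist(coords[i], coords[j]) < threshold
             else 0
             for j in range(n)]
            for i in range(n)]

# Three residues: the first two are close, the third is far away.
cm = contact_map([(0, 0, 0), (1, 0, 0), (100, 0, 0)])
```

The matrix is symmetric by construction, and in a real protein the patterns along and across the diagonal are what reveal helices and sheets.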

  23. Steps for CM prediction (Nottingham method) • Prediction of: • Secondary structure (using PSIPRED) • Solvent Accessibility • Recursive Convex Hull • Coordination Number (the last three using BioHEL [Bacardit et al., 09]) • Integration of all these predictions plus other sources of information • Final CM prediction (using BioHEL)

  24. Prediction of RCH, SA and CN • We selected a set of 3262 protein chains from PDB-REPRDB with: • A resolution less than 2Å • Less than 30% sequence identity • Without chain breaks or non-standard residues • 90% of this set was used for training (~490,000 residues) • 10% for test

  25. Prediction of RCH, SA and CN • All three features were predicted based on a window of ±4 residues around the target • Evolutionary information (as a Position-Specific Scoring Matrix) is the basis of this local information • Each residue is characterised by a vector of 180 values • The domain for all three features was partitioned into 5 states

  26. Characterisation of the contact map problem • Three types of input information were used: • (1) Detailed information of three different windows of residues, centered around the two target residues (2x) and the middle point between them • (2) Information about the connecting segment between the two target residues • (3) Global protein information

  27. Contact Map dataset • From the original set of 3262 proteins we kept all that had <250 AA and a randomly selected 20% of the larger proteins • Still, the resulting training set contained 32 million pairs of AA and 631 attributes • Less than 2% of those are actual contacts • Over 60GB of disk space

  28. Samples and ensembles • 50 samples of 660K examples each are generated from the training set with a ratio of 2:1 non-contacts/contacts • BioHEL is run 25 times for each sample • Prediction is done by a consensus of the 1250 rule sets • Confidence of prediction is computed based on the vote distribution in the ensemble • The whole training process took about 25K CPU hours (figure: training set → ×50 samples → ×25 rule sets → consensus → predictions)

  29. Contact Map prediction in CASP • CASP = Critical Assessment of Techniques for Protein Structure Prediction • Biennial community-wide experiment to assess the state of the art in PSP • Predictor groups are asked to submit a list of predicted contacts and a confidence level for each prediction • The assessors then rank the predictions for each protein and look at the top L/x ones, where L is the length of the protein and x = {5, 10} • From these L/x top-ranked contacts two measures are computed • Accuracy: TP/(TP+FP) • Xd: difference between the distribution of predicted distances and a random distribution
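The accuracy measure above can be sketched as follows: rank the submitted pairs by confidence, keep the top L/x, and compute TP/(TP+FP) over them. A simplified sketch; CASP's actual assessment adds further filters (e.g. minimum sequence separation) not modelled here, and the function name is ours.

```python
def top_lx_accuracy(predictions, true_contacts, L, x):
    """CASP-style accuracy on the top L/x ranked contact predictions.

    `predictions` is a list of ((i, j), confidence) pairs;
    `true_contacts` is the set of actual contact pairs.
    """
    ranked = sorted(predictions, key=lambda p: p[1], reverse=True)
    top = ranked[: max(1, L // x)]
    tp = sum(1 for pair, _ in top if pair in true_contacts)
    return tp / len(top)  # TP / (TP + FP)

preds = [((0, 5), 0.9), ((1, 6), 0.8), ((2, 7), 0.1)]
top_lx_accuracy(preds, {(0, 5), (2, 7)}, L=10, x=5)  # -> 0.5
```

For L=10, x=5 the top two predictions are evaluated: one is a true contact and one is not, giving 0.5.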

  30. CASP9 results These two groups derived contact predictions from 3D models http://www.predictioncenter.org/casp9/doc/presentations/CASP9_RR.pdf

  31. Understanding the rule sets • Each rule set has on average 135 rules • We have a total of 168,470 rules • It is impossible to read all of them individually, but we can extract useful statistics • For instance, how often was each attribute used in the rules?

  32. Distribution of frequency of use of attributes • All 631 attributes are actually used (min frequency=429) • However, some of them are used much more frequently than others
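The frequency-of-use statistic can be computed with a simple count over all rules in the ensemble. A sketch only: here each rule is represented as a list of the attribute names appearing in its predicates, a simplified stand-in for BioHEL's actual rule representation, and the function name is ours.

```python
from collections import Counter

def attribute_usage(rule_sets):
    """Count, across all rule sets, how many rules use each attribute."""
    usage = Counter()
    for rules in rule_sets:
        for rule in rules:
            usage.update(set(rule))  # count each attribute once per rule
    return usage.most_common()      # most frequently used attributes first

# One rule set with two rules: 'a' appears in both, 'b' in one.
attribute_usage([[["a", "b"], ["a"]]])  # -> [('a', 2), ('b', 1)]
```

Sorting by count is what produces rankings like the "Top 10 attributes" slide that follows.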

  33. Top 10 attributes • The four kinds of residue predictions (SS, SA, RCH, CN) are highly ranked

  34. DIMENSIONALITY REDUCTION FOR PROTEIN DATASETS

  35. Motivation • PSP is a very costly process • As an example, in CASP8 one of the best PSP methods, Rosetta@home, could dedicate up to 10^4 computing years to predict a single protein's 3D structure • One of the possible ways to alleviate this computational cost is to simplify the representation used to model the proteins

  36. Target for reduction: the primary sequence • The primary sequence of a protein is a usual target for such simplification • It is composed of a rather high-cardinality alphabet of 20 symbols, which share commonalities between them • One example of reduction widely used in the community is the hydrophobic-polar (HP) alphabet, reducing these 20 symbols to just two • The HP representation is usually too simple: too much information is lost in the reduction process [Stout et al., 06] • Can we automatically generate these reduced alphabets and tailor them to the specific problem at hand?
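The HP reduction mentioned above is just a 20-to-2 mapping applied symbol by symbol. A minimal sketch; the exact hydrophobic/polar split varies between authors, so the grouping below is one common choice, not necessarily the one used in the cited work.

```python
# One possible hydrophobic (H) / polar (P) split of the 20 amino acids.
HP = {aa: "H" for aa in "AVLIMFWCY"}
HP.update({aa: "P" for aa in "GSTPNQDEKRH"})

def reduce_sequence(sequence, mapping):
    """Rewrite a sequence under a reduced-alphabet mapping."""
    return "".join(mapping[aa] for aa in sequence)

reduce_sequence("AEIKH", HP)  # -> 'HPHPP'
```

Any candidate reduced alphabet (2 letters, 5 letters, ...) has this same shape, a many-to-one dictionary, which is what the automated method on the next slide searches over.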

  37. Automated Alphabet Reduction [Bacardit et al., 09] • We use an automated information-theory-driven method to optimize alphabet reduction policies for PSP datasets • An optimization algorithm clusters the AA alphabet into a predefined number of new letters • The fitness function of the optimization is based on the Mutual Information (MI) metric, which quantifies the interrelationship between two discrete variables • The aim is to find the reduced representation that maintains as much relevant information as possible for the feature being predicted • Afterwards we feed the reduced dataset into a learning method to verify whether the reduction was adequate
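The MI metric at the heart of the fitness function can be computed from joint and marginal counts. A sketch of the kind of score an optimiser would assign to a candidate reduced alphabet (MI between the reduced attributes and the predicted feature); the function name is ours and this is not the cited implementation.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """MI between two paired discrete variables, in bits."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        p_joint = c / n
        # p_joint * log2(p_joint / (p(x) * p(y)))
        mi += p_joint * math.log2(p_joint * n * n / (px[x] * py[y]))
    return mi

# Perfectly dependent variables: MI equals the entropy, 1 bit here.
mutual_information([0, 0, 1, 1], ["a", "a", "b", "b"])  # -> 1.0
```

A reduction that merges letters the class does not distinguish loses little MI; merging informative letters drops it sharply, which is exactly the signal the optimiser exploits.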

  38. Alphabet Reduction protocol (diagram) • ECGA searches for a reduction policy of size N, guided by Mutual Information • The original dataset (cardinality 20) is mapped to a reduced dataset (cardinality N) • BioHEL learns an ensemble of rule sets from the reduced dataset • Accuracy is measured on the test set

  39. Automated Alphabet Reduction • Competent 5-letter alphabet (similar performance to the AA alphabet) • Different alphabets for CN and SA domains • Unexpected explanations: Alphabet reduction clustered AA types that experts did not expect

  40. Automated Alphabet Reduction • Our method produces better reduced alphabets than both reduced alphabets from the literature and other expert-designed ones (tables: alphabets from the literature; expert-designed alphabets)

  41. Efficiency gains from the alphabet reduction • We have extrapolated the reduced alphabet to the much larger and richer Position-Specific Scoring Matrix (PSSM) representation • The accuracy difference is still less than 1% • The obtained rule sets are simpler and the training process is much faster • Performance levels are similar to recent works in the literature [Kinjo et al., 05][Dor and Zhou, 07] • Won the bronze medal of the 2007 Humies awards

  42. Summary: Data Mining in PSP • DM can greatly help improve the quality of PSP methods • A very important problem in biology • Protein datasets need substantial preprocessing before we can apply DM • Breaking sequences into windows • Defining the classes with discretisation • Using sub-predictions as input for other predictions • Generating new variables from the original primary sequence • Reducing alphabets • Very challenging datasets because of their size • Interpretability and visualisation are important • Find the best way of conveying the results of DM to the end users

  43. Resources • Books • Introduction to Bioinformatics. A. Tramontano • Structural Bioinformatics. An Algorithmic Approach. F.J. Burkowski • My papers on PSP • Coordination number prediction using Learning Classifier Systems: Performance and interpretability • Prediction of Recursive Convex Hull Class Assignments for Protein Residues • Automated Alphabet Reduction for Protein Datasets • Contact map prediction using a large-scale ensemble of rule sets and the fusion of multiple predicted structural features
