G54DMT – Data Mining Techniques and Applications cs.nott.ac.uk/~jqb/G54DMT

G54DMT – Data Mining Techniques and Applications http://www.cs.nott.ac.uk/~jqb/G54DMT Dr. Jaume Bacardit jqb@cs.nott.ac.uk Topic 4: Applications Lecture 3: Protein Structure Prediction Some material taken from Arthur Lesk, Introduction to Bioinformatics, 2nd edition, Oxford University Press, 2005, and from Livingstone, C.D., Barton, G.J.: Protein sequence alignments: a strategy for the hierarchical analysis of residue conservation. Computer Applications in the Biosciences 9 (1993) 745-756.

Outline • Brief introduction to protein structure • Motivation and definition of PSP • PSP: A family of problems • Data Mining protein’s structural aspects • Dimensionality reduction for protein datasets • Summary

Protein Structure: Introduction • Proteins are molecules of primary importance for the functioning of life • Structural proteins (collagen, nails, hair, etc.) • Enzymes • Transmembrane proteins • Proteins are polypeptide chains constructed by joining a certain kind of molecule, amino acids, in a linear way • The chain of amino acids, however, folds to create very complex 3D structures • There is a general consensus that the end state of the folding process depends on the amino acid composition of the chain

Amino Acids

Protein Structure: Introduction • Different amino acids have different properties • These properties will affect the protein structure and function • Hydrophobicity, for instance, is the main driving force (but not the only one) of the folding process

Protein Structure: Hierarchical nature of protein structure • Primary structure = sequence of amino acids, e.g. MKYNNHDKIRDFIIIEAYMFRFKKKVKPEVDMTIKEFILLTYLFHQQENTLPFKKIVSDLCYKQSDLVQHIKVLVKHSYISKVRSKIDERNTYISISEEQREKIAERVTLFDQIIKQFNLADQSESQMIPKDSKEFLNLMMYTMYFKNIIKKHLTLSFVEFTILAIITSQNKNIVLLKDLIETIHHKYPQTVRALNNLKKQGYLIKERSTEDERKILIHMDDAQQDHAEQLLAQVNQLLADKDHLHLVFE • Secondary structure (local interactions) • Tertiary structure (global interactions)

Motivation for PSP • The function of a protein depends greatly on its structure • The structure that a protein adopts is vital to its chemistry • Its structure determines which of its amino acids are exposed to carry out the protein’s function • Its structure also determines what substrates it can react with • However, the structure of a protein is very difficult to determine experimentally, and in some cases almost impossible

Protein Structure Prediction • That is why we have to predict it • PSP aims to predict the 3D structure of a protein based on its primary sequence

Prediction types of PSP • There are several kinds of prediction problems within the scope of PSP • The main one, of course, is to predict the 3D coordinates of all atoms of a protein (or at least the backbone) based on its primary sequence • There are many structural properties of individual residues within a protein that can be predicted, for instance: • The secondary structure state of the residue • Whether a residue is buried in the core of the protein or exposed on the surface • Accurate predictions of these sub-problems can simplify the general 3D PSP problem

Prediction types of PSP • There is an important distinction between the two classes of prediction • The 3D PSP is generally treated as an optimisation problem • The prediction of structural aspects of protein residues is generally treated as a machine learning problem

DATA MINING PROTEIN’S STRUCTURAL ASPECTS

Prediction of structural aspects of protein residues • Many of these features are due to local interactions of an amino acid and its immediate neighbours • Can they be predicted using information from the closest neighbours in the chain? • In this simplified example, to predict the SS state of residue i we would use information from residues i-1, i and i+1, that is, a window of ±1 residues around the target • [Diagram: a chain of residues R(i-5) ... R(i+5) with secondary structure states SS(i-5) ... SS(i+5); each instance pairs a window with the state of its central residue: (R(i-1), R(i), R(i+1)) → SS(i); (R(i), R(i+1), R(i+2)) → SS(i+1); (R(i+1), R(i+2), R(i+3)) → SS(i+2)]

ARFF file for a simple PSP dataset
@relation AA+CN_Q2
@attribute AA_-4 {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,X,Y}
@attribute AA_-3 {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,X,Y}
@attribute AA_-2 {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,X,Y}
@attribute AA_-1 {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,X,Y}
@attribute AA {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y}
@attribute AA_1 {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,X,Y}
@attribute AA_2 {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,X,Y}
@attribute AA_3 {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,X,Y}
@attribute AA_4 {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,X,Y}
@attribute class {0,1}
@data
X,X,X,X,A,E,I,K,H,0
X,X,X,A,E,I,K,H,Y,0
X,X,A,E,I,K,H,Y,Q,0
X,A,E,I,K,H,Y,Q,F,0
A,E,I,K,H,Y,Q,F,N,0
E,I,K,H,Y,Q,F,N,V,0
I,K,H,Y,Q,F,N,V,V,0
K,H,Y,Q,F,N,V,V,M,1
H,Y,Q,F,N,V,V,M,T,0
Y,Q,F,N,V,V,M,T,C,1
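The data rows of the ARFF example can be reproduced with a simple sliding window. The following is a minimal sketch, not the original dataset generator: the function name `window_instances` and its parameters are hypothetical, the sequence prefix and the ten class labels are read off the rows of the slide, and `X` is used as the padding symbol for positions beyond the chain ends.

```python
def window_instances(sequence, labels, half_width=4, pad="X"):
    """Build one instance per labelled residue: window residues + class label.

    Only the first len(labels) residues get an instance, so the sequence may
    be longer than the list of labels (as in the excerpt from the slide).
    """
    padded = pad * half_width + sequence + pad * half_width
    size = 2 * half_width + 1
    instances = []
    for i, label in enumerate(labels):
        # padded[i + half_width] is residue i, so this slice is centred on it
        window = padded[i : i + size]
        instances.append(list(window) + [label])
    return instances


# First 14 residues and first 10 class labels, as they appear in the ARFF rows
sequence = "AEIKHYQFNVVMTC"
labels = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1]

rows = window_instances(sequence, labels, half_width=4)
for row in rows:
    print(",".join(str(value) for value in row))
```

The central attribute never takes the value `X`, which is why the `AA` attribute in the ARFF header lists only the 20 amino acid letters.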

What information do we include for each residue? • Early prediction methods used just the primary sequence, i.e. the AA types of the residues in the window • However, the primary sequence contains a limited amount of information • It does not contain any evolutionary information: it does not say which residues are conserved and which are not • Evolutionary information can be added through Position-Specific Scoring Matrices, which are a product of a Multiple Sequence Alignment • The predictions of some structural features can also be used as input for other predictions

Position-Specific Scoring Matrices (PSSM) • For each residue in the query sequence, compute the distribution of amino acids of the corresponding residues in all aligned sequences (discarding those too similar to the query) • These distributions tell us which mutations are likely and which are less likely for each residue in the query sequence • A PSSM profile also tells us which residues are more conserved (important) • In practical terms: instead of representing each AA as 1 discrete variable with 20(+1) values, we now represent it as 20 continuous variables
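The per-position distribution described above can be illustrated with a small frequency profile. This is a simplified sketch under stated assumptions: the toy alignment and the name `column_profile` are made up for illustration, gaps are simply ignored, and real PSSMs (for instance those produced by PSI-BLAST) are log-odds scores with sequence weighting and pseudocounts rather than raw frequencies.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_profile(aligned_sequences):
    """Relative frequency of the 20 amino acids in each alignment column."""
    length = len(aligned_sequences[0])
    profile = []
    for pos in range(length):
        # Count only standard amino acids; gaps ('-') and other symbols are skipped
        counts = Counter(
            seq[pos] for seq in aligned_sequences if seq[pos] in AMINO_ACIDS
        )
        total = sum(counts.values())
        profile.append(
            [counts[aa] / total if total else 0.0 for aa in AMINO_ACIDS]
        )
    return profile


# Hypothetical toy alignment: four aligned sequences of length 4
toy_alignment = ["AEIK", "AEVK", "ADIK", "AE-R"]
profile = column_profile(toy_alignment)

# Column 0 is fully conserved (always A); column 2 tolerates I or V
print(profile[0][AMINO_ACIDS.index("A")])
print(profile[2][AMINO_ACIDS.index("I")], profile[2][AMINO_ACIDS.index("V")])
```

Each residue is now described by 20 continuous values, one per amino acid type, instead of a single discrete letter.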

Secondary Structure Prediction • The most usual way is to predict whether a residue belongs to an α helix, a β sheet, or is in coil state • Several programs can determine the actual SS state of a protein from a PDB file. The most common of them is DSSP • Typically, a window of ±7 amino acids (15 in total) is used. This means 300 attributes when using PSSM (15 residues × 20 values) • A dataset with 1000 proteins with ~250 AA/protein would have ~250000 instances
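The dataset dimensions quoted above follow from simple arithmetic, sketched here with hypothetical helper names. The only assumptions are the ones stated on the slides: 20 PSSM values per residue and one instance per residue.

```python
def n_attributes(half_width, values_per_residue=20):
    """Attributes for a window of +/- half_width residues around the target."""
    return (2 * half_width + 1) * values_per_residue


def n_instances(n_proteins, residues_per_protein):
    """One instance per residue of every protein."""
    return n_proteins * residues_per_protein


print(n_attributes(7))         # window of +/-7 with PSSM: 300
print(n_attributes(4))         # window of +/-4 with PSSM: 180
print(n_instances(1000, 250))  # 250000
```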

Coordination Number Prediction • Two residues of a chain are said to be in contact if their distance is less than a certain threshold (e.g. 8Å) • CN of a residue: count of contacts that a certain residue has • CN gives us a simplified profile of the density of packing of the protein • [Figure: a contact shown in the native state and on the primary sequence]
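The integer definition of CN can be sketched in a few lines. This is an illustration only: the coordinates are a made-up toy example with one point per residue, and details that vary between definitions (which atom represents a residue, whether chain neighbours are excluded) are left out or exposed as the hypothetical `min_separation` parameter.

```python
import math


def coordination_numbers(coords, threshold=8.0, min_separation=1):
    """Count, for each residue, how many other residues lie within threshold.

    coords: one (x, y, z) point per residue, in angstroms.
    min_separation: minimum distance along the chain for a pair to be counted.
    """
    n = len(coords)
    cn = [0] * n
    for i in range(n):
        for j in range(i + min_separation, n):
            if math.dist(coords[i], coords[j]) < threshold:
                # A contact is symmetric, so it counts for both residues
                cn[i] += 1
                cn[j] += 1
    return cn


# Toy chain: four residues on a straight line, 5 angstroms apart
toy_coords = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (10.0, 0.0, 0.0), (15.0, 0.0, 0.0)]
print(coordination_numbers(toy_coords))  # [1, 2, 2, 1]
```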

CN as a classification problem • The number of contacts, depending on the definition, can be either an integer or a continuous number • To treat this problem (and some others I will mention later) as a classification problem, we need to discretise the output • Unsupervised methods are applied • Uniform length (UL) and uniform frequency (UF) discretisation
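The two unsupervised discretisation schemes can be sketched as follows. The function names and the toy values are hypothetical; the sketch only shows the principle (equal-width bins versus bins with equal numbers of examples), not the exact cut points used for the CN datasets.

```python
def uniform_length_cuts(values, n_bins):
    """Cut points that split the value range into bins of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + width * k for k in range(1, n_bins)]


def uniform_frequency_cuts(values, n_bins):
    """Cut points that put (roughly) the same number of examples in each bin."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(n * k) // n_bins] for k in range(1, n_bins)]


def discretise(value, cuts):
    """Class index = number of cut points that the value reaches or exceeds."""
    return sum(value >= cut for cut in cuts)


# Toy CN values with one outlier
values = [1, 2, 3, 4, 5, 6, 7, 20]

ul_cuts = uniform_length_cuts(values, 2)     # [10.5]
uf_cuts = uniform_frequency_cuts(values, 2)  # [5]

print([discretise(v, ul_cuts) for v in values])  # [0, 0, 0, 0, 0, 0, 0, 1]
print([discretise(v, uf_cuts) for v in values])  # [0, 0, 0, 0, 1, 1, 1, 1]
```

In this toy case uniform length yields very unbalanced classes, while uniform frequency yields balanced ones, which is one reason the same property can give both balanced and unbalanced benchmark problems.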

Example of a rule set generated by GAssist for CN prediction • All AA types associated with the central residue are hydrophobic (core of a protein) • D and E consistently do not appear in the predicates. They are negatively charged residues (surface of a protein)

Other predictions • Other kinds of residue structural aspects that can be predicted • Solvent accessibility: amount of surface of each residue that is exposed to solvent • Recursive Convex Hull: a metric that models a protein as an onion and assigns each residue to a layer. Formally, each layer is a convex hull of points • These features (and others) are predicted in a similar way as done for SS or CN

PSP datasets are good ML benchmarks • These problems can be modelled in many ways: • Regression or classification problems • Low/high number of classes • Balanced/unbalanced classes • Adjustable number of attributes • Ideal benchmarks! • http://www.infobiotic.net/PSPbenchmarks/

Contact Map prediction • Prediction, given two residues from a chain, of whether these two residues are in contact or not • This problem can be represented by a binary matrix: 1 = contact, 0 = non-contact • Plotting this matrix reveals many characteristics of the protein structure (helices, sheets)
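The binary matrix representation can be sketched directly from residue coordinates. As before, this is a toy illustration: the coordinates are invented, the 8Å threshold is the example value given earlier for contacts, and placing 0 on the diagonal is an assumption of this sketch.

```python
import math


def contact_map(coords, threshold=8.0):
    """Symmetric binary matrix: entry [i][j] is 1 if residues i and j are in contact."""
    n = len(coords)
    return [
        [
            1 if i != j and math.dist(coords[i], coords[j]) < threshold else 0
            for j in range(n)
        ]
        for i in range(n)
    ]


# Toy chain: four residues on a straight line, 5 angstroms apart
toy_chain = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (10.0, 0.0, 0.0), (15.0, 0.0, 0.0)]
cmap = contact_map(toy_chain)
for row in cmap:
    print(" ".join(str(cell) for cell in row))
```

Summing a row of the matrix gives the coordination number of the corresponding residue, which is how the two prediction problems relate.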

Steps for CM prediction (Nottingham method) • Prediction of • Secondary structure (using PSIPRED) • Solvent Accessibility • Recursive Convex Hull • Coordination Number • Integration of all these predictions plus other sources of information • Final CM prediction (using BioHEL) [Bacardit et al., 09]

Prediction of RCH, SA and CN • We selected a set of 3262 protein chains from PDB-REPRDB with: • A resolution less than 2Å • Less than 30% sequence identity • No chain breaks or non-standard residues • 90% of this set was used for training (~490000 residues) • 10% for test

Prediction of RCH, SA and CN • All three features were predicted based on a window of ±4 residues around the target • Evolutionary information (as a Position-Specific Scoring Matrix) is the basis of this local information • Each residue is characterised by a vector of 180 values (9 window positions × 20 PSSM values) • The domain for all three features was partitioned into 5 states

Characterisation of the contact map problem • Three types of input information were used • Detailed information of three different windows of residues centered around • The two target residues (2x) • The middle point between them • Information about the connecting segment between the two target residues • Global protein information

Contact Map dataset • From the original set of 3262 proteins we kept all that had <250 AA and a randomly selected 20% of the larger proteins • Still, the resulting training set contained 32 million pairs of AA and 631 attributes • Less than 2% of those pairs are actual contacts • More than 60GB of disk space

Samples and ensembles • 50 samples of 660K examples are generated from the training set with a ratio of 2:1 non-contacts/contacts • BioHEL is run 25 times for each sample • Prediction is done by a consensus of 1250 rule sets (50 samples × 25 rule sets) • Confidence of prediction is computed based on the distribution of votes in the ensemble • The whole training process took about 25K CPU hours
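The sampling and consensus steps can be sketched as below. This is a hedged illustration, not the BioHEL pipeline: `sample_with_ratio` and `consensus` are hypothetical names, the examples are toy integers, and using the fraction of "contact" votes as the confidence value is an assumption, since the slides only state that confidence is derived from the vote distribution.

```python
import random


def sample_with_ratio(contacts, non_contacts, n_contacts, ratio=2, rng=None):
    """Draw a sample with `ratio` non-contact examples per contact example."""
    rng = rng or random.Random(0)
    positives = rng.sample(contacts, n_contacts)
    negatives = rng.sample(non_contacts, ratio * n_contacts)
    return positives + negatives


def consensus(votes):
    """Majority vote over 0/1 predictions, one per rule set.

    Returns (predicted label, fraction of rule sets voting 'contact').
    """
    fraction = sum(votes) / len(votes)
    label = 1 if fraction >= 0.5 else 0
    return label, fraction


# Toy pools: ids 0-9 are contacts, ids 100-199 are non-contacts
toy_sample = sample_with_ratio(list(range(10)), list(range(100, 200)), n_contacts=3)
print(len(toy_sample))  # 9 examples: 3 contacts + 6 non-contacts

# With a 2:1 ratio, a 660K sample would hold 220K contacts and 440K non-contacts
print(consensus([1, 1, 0, 1]))  # (1, 0.75)
print(50 * 25)                  # rule sets in the full ensemble: 1250
```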

Contact Map prediction in CASP • CASP = Critical Assessment of Techniques for Protein Structure Prediction • Biennial community-wide experiment to assess the state of the art in PSP • Predictor groups are asked to submit a list of predicted contacts and a confidence level for each prediction • The assessors then rank the predictions for each protein and look at the top L/x ones, where L is the length of the protein and x ∈ {5,10} • From these L/x top-ranked contacts, two measures are computed • Accuracy: TP/(TP+FP) • Xd: difference between the distribution of predicted distances and a random distribution
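The accuracy measure over the top L/x predictions can be sketched as follows. The function name and the toy predictions are hypothetical, and further filters applied by the assessors are not modelled; the Xd measure is omitted because its exact formula is not given here.

```python
def top_l_accuracy(predictions, true_contacts, length, x=5):
    """Accuracy TP/(TP+FP) over the L/x most confident predicted contacts.

    predictions: list of (i, j, confidence) tuples.
    true_contacts: set of (i, j) pairs with i < j.
    length: number of residues L in the protein.
    """
    n_top = max(1, length // x)
    ranked = sorted(predictions, key=lambda p: p[2], reverse=True)[:n_top]
    if not ranked:
        return 0.0
    true_positives = sum(
        (min(i, j), max(i, j)) in true_contacts for i, j, _ in ranked
    )
    # Every selected prediction is either a TP or an FP
    return true_positives / len(ranked)


# Toy protein of length 10, so L/5 = 2 predictions are evaluated
toy_predictions = [(2, 7, 0.3), (0, 9, 0.9), (1, 8, 0.8)]
toy_true = {(0, 9), (2, 7)}
print(top_l_accuracy(toy_predictions, toy_true, length=10, x=5))  # 0.5
```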

CASP9 results • [Results table not reproduced; two of the groups in it derived their contact predictions from 3D models] • Source: http://www.predictioncenter.org/casp9/doc/presentations/CASP9_RR.pdf

Understanding the rule sets • Each rule set has on average 135 rules • We have a total of 168470 rules • Impossible to read all of them individually, but we can extract useful statistics • For instance, how often was each attribute used in the rules?

Distribution of frequency of use of attributes • All 631 attributes are actually used (min frequency=429) • However, some of them are used much more frequently than others
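Counting attribute usage across an ensemble is a straightforward aggregation. In this sketch a rule is represented simply as the set of attribute names that appear in its predicates; that representation and the attribute names are hypothetical, chosen only to show the kind of statistic being extracted.

```python
from collections import Counter


def attribute_usage(rule_sets):
    """Number of rules, across all rule sets, in which each attribute appears."""
    usage = Counter()
    for rule_set in rule_sets:
        for rule in rule_set:
            # set() ensures an attribute is counted once per rule
            usage.update(set(rule))
    return usage


# Toy ensemble: two rule sets with hypothetical attribute names
toy_rule_sets = [
    [{"PredCN_r1", "PredSS_r2"}, {"PredCN_r1"}],
    [{"SeparationLength", "PredCN_r1"}],
]
usage = attribute_usage(toy_rule_sets)
print(usage.most_common(1))  # [('PredCN_r1', 3)]
print(min(usage.values()))   # least-used attribute appears in 1 rule
```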

Top 10 attributes • The four kinds of residue predictions are highly ranked

DIMENSIONALITY REDUCTION FOR PROTEIN DATASETS

Motivation • PSP is a very costly process • As an example, one of the best PSP methods in CASP8, Rosetta@Home, could dedicate up to 104 computing years to predict a single protein’s 3D structure • One of the possible ways to alleviate this computational cost is to simplify the representation used to model the proteins

Target for reduction: the primary sequence • The primary sequence of a protein is a usual target for such simplification • It is composed of a quite high cardinality alphabet of 20 symbols, which share commonalities between them • One example of reduction widely used in the community is the hydrophobic-polar (HP) alphabet, which reduces these 20 symbols to just two • The HP representation is usually too simple: too much information is lost in the reduction process [Stout et al., 06] • Can we automatically generate these reduced alphabets and tailor them to the specific problem at hand?

Automated Alphabet Reduction [Bacardit et al., 09] • We will use an automated information theory-driven method to optimize alphabet reduction policies for PSP datasets • An optimization algorithm will cluster the AA alphabet into a predefined number of new letters • The fitness function of the optimization is based on the Mutual Information (MI) metric, which quantifies the interrelationship between two discrete variables • The aim is to find the reduced representation that maintains as much relevant information as possible for the feature being predicted • Afterwards we will feed the reduced dataset into a learning method to verify if the reduction was proper
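The role of Mutual Information as a fitness function can be illustrated on a toy problem. This sketch makes strong simplifying assumptions: it scores a single residue against the class instead of a whole window, the data are invented, the candidate groupings are fixed by hand instead of being searched by an optimization algorithm, and the function names are hypothetical.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """MI in bits between two discrete variables, estimated from paired samples."""
    n = len(xs)
    count_x, count_y = Counter(xs), Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        # p(x,y) * log2( p(x,y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * math.log2((c * n) / (count_x[x] * count_y[y]))
    return mi


def reduce_alphabet(residues, groups):
    """Replace each amino acid letter by the index of the group containing it."""
    letter_of = {aa: idx for idx, group in enumerate(groups) for aa in group}
    return [letter_of[aa] for aa in residues]


# Toy data: A and V always have class 1, D and E always have class 0
residues = list("AAVVDDEE")
classes = [1, 1, 1, 1, 0, 0, 0, 0]

good = reduce_alphabet(residues, ["AV", "DE"])  # keeps the class signal
bad = reduce_alphabet(residues, ["AD", "VE"])   # mixes the two classes

print(mutual_information(residues, classes))  # full alphabet: 1.0 bit
print(mutual_information(good, classes))      # 1.0 bit, nothing lost
print(mutual_information(bad, classes))       # 0.0 bits, signal destroyed
```

A two-letter alphabet can therefore be as informative as the full one for a given target, or useless, depending on which letters are merged; this is what the optimization has to decide.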

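Alphabet Reduction protocol • [Diagram: the dataset with cardinality 20 is passed to ECGA, which uses Mutual Information to produce a dataset with a reduced alphabet of size N; BioHEL learns an ensemble of rule sets from the reduced dataset, and its accuracy is measured on the test set]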

Automated Alphabet Reduction • Competent 5-letter alphabet (similar performance to the AA alphabet) • Different alphabets for CN and SA domains • Unexpected explanations: Alphabet reduction clustered AA types that experts did not expect

Automated Alphabet Reduction • Our method produces better reduced alphabets than those from the literature and than expert-designed ones • [Comparison figure not reproduced: alphabets from the literature; expert-designed alphabets]

Efficiency gains from the alphabet reduction • We have extrapolated the reduced alphabet to the much larger and richer Position-Specific Scoring Matrices (PSSM) representation • Accuracy difference is still less than 1% • Obtained rule sets are simpler and the training process is much faster • Performance levels are similar to recent works in the literature [Kinjo et al., 05][Dor and Zhou, 07] • Won the bronze medal of the 2007 Humies awards

Summary: Data Mining in PSP • DM can greatly help improve the quality of PSP methods • Very important problem in biology • Protein datasets need to be greatly preprocessed before we can apply DM • Breaking sequences into windows • Defining the classes with discretisation • Using sub-predictions as input for other predictions • Generating new variables from the original primary sequence • Reducing alphabets • Very challenging datasets because of size • Interpretability and visualisation are important • Find the best way of conveying the results of DM to the end users

Resources • Books • Introduction to Bioinformatics. A. Tramontano • Structural Bioinformatics. An Algorithmic Approach. F.J. Burkowski • My papers on PSP • Coordination number prediction using Learning Classifier Systems: Performance and interpretability • Prediction of Recursive Convex Hull Class Assignments for Protein Residues • Automated Alphabet Reduction for Protein Datasets • Contact map prediction using a large-scale ensemble of rule sets and the fusion of multiple predicted structural features
