230 likes | 356 Views
Bayesian Classification of Protein Data. Thomas Huber huber@maths.uq.edu.au Computational Biology and Bioinformatics Environment ComBinE Department of Mathematics The University of Queensland. Today’s talk. Protein score functions from mining protein data Bayesian classification A toy example
E N D
Bayesian Classification of Protein Data Thomas Huberhuber@maths.uq.edu.auComputational Biology andBioinformatics EnvironmentComBinEDepartment of MathematicsThe University of Queensland
Today’s talk • Protein score functions from mining protein data • Bayesian classification • A toy example • A protein scoring function for fold recognition • Where are score/energy functions useful? • A few examples
Why do we care about Protein Structures/Prediction? • Academic curiosity? • Understanding how nature works • Urgency of prediction • 104 structures are determined • insignificant compared to all proteins • sequencing = fast & cheap • structure determination = hard & expensive TrEMBL sequences (computer annotated) Transistors in Intel processors SwissProt sequences (annotated) structures in PDB
Three basic choices in (molecular) modelling • Representation • Which degrees of freedom are treated explicitly • Scoring • Which scoring function (force field) • Searching • Which method to search or sample conformational space
Protein Scoring Functions from Mining Protein Data • Classification Theory • Find a set of classes and their descriptors (a classification) for n data q attributes (shape, amino acid type, etc.) • Theory of finite mixtures • Class attribute probability distribution of all members
Bayesian approach • Simplifications • Stating a simplified model • Assume attributes are independently distributed • P(Xicj|S) requires class description • Expectation Maximization (EM)
How many classes • Again Bayes’ rule • P(m) favours smaller number of classes • No over-fitting of data (like with maximum likelihood methods)
A Toy ExampleDihedral preference of Valine • Four interesting degrees of freedom • -,-dihedral • angle • Adjacent amino • acid types i-1 i+1 • Data:893 non-redundant proteins • 12074 four-dimensional data points
Valine Data Classification • AutoClass classification • Model: Gaussian distribution for /, discrete probabilities for amino acids • Total of 50 tries with #classes [2:11] • Each try refined until fully converged • Best classification has 5 classes
Amino Acid Attribute vectors of -helix Classes • Log-Preferences
Re-invention of the Wheel • Textbook secondary structure pattern • Helices are likely on outside of proteins • I, I+3 and I+4 hydrophobic interface • From C.-I. Branden and J. Tooze, Introduction to Protein Structure
Fragment-based Protein Scoring • Find classification for fragments of size 7 residues • 237566 fragments (1494 non-redundant protein chains) • 28 descriptors • 7 amino acid type • 14 -/-dihedral angles • 7 number of neighbours of each amino acid • 200 CPU hours on National Facility computers • 325 classes (modelling the probability distribution of native fragments) • Use this classification to evaluate likelihood of a fragment sequence-structure match • Total score = fragment scores
Fold Recognition = Computer Matchmaking • Structure Disco
Does it work? • Discrimination (TIM 1amk_) • Generalisation 1 3 2 4 1 2 5 3 5 4
Sequence-Structure MatchingThe search problem • Gapped alignment = combinatorial nightmare
Why is Fold Recognition better than Sequence Comparison? • Comparison is done in structure space not in sequence space
Finding Remote Homologueswith sausage • 572 sequence-structure pairs • Structures are similar (FSSP) • > 70% structurally aligned • < 20% sequence identity
A Real Case ExampleRNA-dependent RNA polymerases • Dengue virus • Bacteriophage 6
Is this Yet Another Profile Method? • Yes, but a much more general profile method • Profile is not residue based (like profile-like threading force fields) • Profiles not for protein families (like in HMMs or -Blast) • BUT local sequence profiles for optimally chosen classes of fragments • Local profiles can be arbitrarily assembled • Extreme flexibility • Sequence-structure alignment (=assembling best profile matches) • Deterministic, using dynamic programming
People • sausage • Andrew Torda (RSC) • Oliver Martin (RSC) • GlnB/GlnK, RdR polymerases • Subhash Vasudevan (JCU) • Sausage and Cassandra freely available • http://rsc.anu.edu.au/~torda • huber@maths.uq.edu.au