Bayesian Classification of Protein Data

Bayesian Classification of Protein Data Thomas Huberhuber@maths.uq.edu.auComputational Biology andBioinformatics EnvironmentComBinEDepartment of MathematicsThe University of Queensland

Today’s talk • Protein score functions from mining protein data • Bayesian classification • A toy example • A protein scoring function for fold recognition • Where are score/energy functions useful? • A few examples

Why do we care about Protein Structures/Prediction? • Academic curiosity? • Understanding how nature works • Urgency of prediction • 104 structures are determined • insignificant compared to all proteins • sequencing = fast & cheap • structure determination = hard & expensive TrEMBL sequences (computer annotated) Transistors in Intel processors SwissProt sequences (annotated) structures in PDB

Three basic choices in (molecular) modelling • Representation • Which degrees of freedom are treated explicitly • Scoring • Which scoring function (force field) • Searching • Which method to search or sample conformational space

Protein Scoring Functions from Mining Protein Data • Classification Theory • Find a set of classes and their descriptors (a classification) for n data q attributes (shape, amino acid type, etc.) • Theory of finite mixtures • Class  attribute probability distribution of all members

Bayesian approach • Simplifications • Stating a simplified model • Assume attributes are independently distributed • P(Xicj|S) requires class description • Expectation Maximization (EM)

How many classes • Again Bayes’ rule • P(m) favours smaller number of classes • No over-fitting of data (like with maximum likelihood methods)

A Toy ExampleDihedral preference of Valine • Four interesting degrees of freedom • -,-dihedral • angle • Adjacent amino • acid types   i-1 i+1 • Data:893 non-redundant proteins • 12074 four-dimensional data points

Valine Data Classification • AutoClass classification • Model: Gaussian distribution for /, discrete probabilities for amino acids • Total of 50 tries with #classes [2:11] • Each try refined until fully converged • Best classification has 5 classes

Amino Acid Attribute vectors of -helix Classes • Log-Preferences

Re-invention of the Wheel • Textbook secondary structure pattern • Helices are likely on outside of proteins • I, I+3 and I+4 hydrophobic interface • From C.-I. Branden and J. Tooze, Introduction to Protein Structure

Fragment-based Protein Scoring • Find classification for fragments of size 7 residues • 237566 fragments (1494 non-redundant protein chains) • 28 descriptors • 7 amino acid type • 14 -/-dihedral angles • 7 number of neighbours of each amino acid • 200 CPU hours on National Facility computers • 325 classes (modelling the probability distribution of native fragments) • Use this classification to evaluate likelihood of a fragment sequence-structure match • Total score =  fragment scores

Fold Recognition = Computer Matchmaking • Structure Disco

Does it work? • Discrimination (TIM 1amk_) • Generalisation 1 3 2 4 1 2 5 3 5 4

Sequence-Structure MatchingThe search problem • Gapped alignment = combinatorial nightmare

Why is Fold Recognition better than Sequence Comparison? • Comparison is done in structure space not in sequence space

Finding Remote Homologueswith sausage • 572 sequence-structure pairs • Structures are similar (FSSP) • > 70% structurally aligned • < 20% sequence identity

RNA-dependent RNA Polymerases

A Real Case ExampleRNA-dependent RNA polymerases • Dengue virus • Bacteriophage 6

Is this Yet Another Profile Method? • Yes, but a much more general profile method • Profile is not residue based (like profile-like threading force fields) • Profiles not for protein families (like in HMMs or -Blast) • BUT local sequence profiles for optimally chosen classes of fragments • Local profiles can be arbitrarily assembled • Extreme flexibility • Sequence-structure alignment (=assembling best profile matches) • Deterministic, using dynamic programming

People • sausage • Andrew Torda (RSC) • Oliver Martin (RSC) • GlnB/GlnK, RdR polymerases • Subhash Vasudevan (JCU) • Sausage and Cassandra freely available • http://rsc.anu.edu.au/~torda • huber@maths.uq.edu.au

Bayesian Classification of Protein Data

Bayesian Classification of Protein Data

Presentation Transcript

Protein Classification

Protein classification

Protein structure Classification

Protein Classification II

Bayesian Classification

Nonparametric Bayesian Classification

Protein structure classification

PROTEIN STRUCTURE CLASSIFICATION

Bayesian Nonparametric Classification and Applications

Protein Classification

Protein Classification

Bayesian Classification Using P-tree

Protein Classification

Integrating Protein-Protein Interactions: Bayesian Networks

Bayesian Classification

Bayesian Classification

Bayesian Refinement of Protein Functional Site Matching

Classification Techniques: Bayesian Classification

Classification Bayesian Classifiers

SCOP – Protein structure classification CATH – Protein structure classification

Protein classification

Bayesian Classification Using P-tree