Machine Learning for Protein Classification: Kernel Methods CS 374 Rajesh Ranganath 4/10/2008
Outline • Biological Motivation and Background • Algorithmic Concepts • Mismatch Kernels • Semi-supervised methods
The Protein Problem • Primary Structure can be easily determined • 3D structure determines function • Grouping proteins into structural and evolutionary families is difficult • Use machine learning to group proteins
How to look at amino acid chains • Smith-Waterman Idea • Mismatch Idea
Families • Proteins whose evolutionary relationship is readily recognizable from the sequence (>~25% sequence identity) • Families are further subdivided into Proteins • Proteins are divided into Species • The same protein may be found in several species (Hierarchy: Fold → Superfamily → Family → Proteins; Morten Nielsen, CBS, BioCentrum, DTU)
Superfamilies • Proteins which are (remotely) evolutionarily related • Sequence similarity is low • Share function • Share special structural features • Relationships between members of a superfamily may not be readily recognizable from the sequence alone (Hierarchy: Fold → Superfamily → Family → Proteins; Morten Nielsen, CBS, BioCentrum, DTU)
Folds • Proteins which have >~50% of their secondary structure elements arranged in the same order in the protein chain and in three dimensions are classified as having the same fold • No evolutionary relationship between the proteins is implied (Hierarchy: Fold → Superfamily → Family → Proteins; Morten Nielsen, CBS, BioCentrum, DTU)
Protein Classification • Given a new protein, can we place it in its “correct” position within an existing protein hierarchy (Fold → Superfamily → Family → Proteins)? Methods • BLAST / PSI-BLAST • Profile HMMs • Supervised machine learning methods
Machine Learning Concepts • Supervised Methods • Discriminative Vs. Generative Models • Transductive Learning • Support Vector Machines • Kernel Methods • Semi-supervised Methods
Discriminative and Generative Models • Discriminative models learn the decision boundary, i.e. p(y | x), directly • Generative models learn the joint distribution p(x, y) and classify via Bayes’ rule
Transductive Learning • Most learning is inductive • Given (x1, y1), …, (xm, ym), for any test input x* predict the label y* • Transductive learning • Given (x1, y1), …, (xm, ym) and all the test inputs {x1*, …, xp*}, predict the labels {y1*, …, yp*}
Support Vector Machines • Popular discriminative learning algorithm • Maximum-margin geometric classifier • Can be trained efficiently using the Sequential Minimal Optimization algorithm • Given training examples x1 … xn with labels y1 … yn, sign(Σi αi yi xiTx) “decides” where x falls • Train the αi to achieve the best margin
Support Vector Machines (2) • Kernelizable: the SVM solution can be written entirely in terms of dot products of the inputs • sign(Σi αi yi K(xi, x)) determines the class of x
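The kernelized decision rule above can be sketched directly. This is a minimal illustration, not a trained model: the support vectors, weights αi, labels yi, and linear kernel below are toy stand-ins chosen for the example.

```python
def decide(x, support, alpha, labels, kernel, b=0.0):
    """Kernel SVM decision rule: sign(sum_i alpha_i * y_i * K(x_i, x) + b)."""
    s = sum(a * y * kernel(xi, x) for xi, a, y in zip(support, alpha, labels)) + b
    return 1 if s >= 0 else -1

# Toy 1-D example with a linear kernel and two support vectors at -1 and +1.
linear = lambda u, v: u * v
support, alpha, labels = [-1.0, 1.0], [1.0, 1.0], [-1, 1]
print(decide(2.0, support, alpha, labels, linear))   # 1  (falls on the + side)
print(decide(-0.5, support, alpha, labels, linear))  # -1 (falls on the - side)
```

In a real SVM the αi come from solving the dual optimization problem; only the examples with αi > 0 (the support vectors) enter the sum.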
Kernel Methods • K(x, z) = f(x)Tf(z) • f is the feature mapping • x and z are input vectors • High-dimensional features never need to be computed explicitly • Think of the kernel function as a similarity measure between x and z • Example:
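One standard illustration of this idea (my example; the slide’s original figure is not preserved here) is the quadratic kernel K(x, z) = (xTz)², which equals the dot product of the explicit feature maps without ever constructing them:

```python
import math

def phi(v):
    # Explicit quadratic feature map for 2-D input: (x1^2, sqrt(2)*x1*x2, x2^2)
    return (v[0] ** 2, math.sqrt(2.0) * v[0] * v[1], v[1] ** 2)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def K(x, z):
    # Quadratic kernel: K(x, z) = (x . z)^2 -- same value, features never built
    return dot(x, z) ** 2

x, z = (1.0, 2.0), (3.0, 1.0)
print(K(x, z))               # 25.0
print(dot(phi(x), phi(z)))   # 25.0 (matches, up to float rounding)
```

For input dimension d, the explicit quadratic feature space has O(d²) coordinates, while the kernel costs only one d-dimensional dot product; this gap is what makes the kernel trick pay off in high dimensions.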
Mismatch Kernel • Regions of similar amino acid sequences yield a similar tertiary structure of proteins • Used as a kernel for an SVM to identify protein homologies
k-mer based SVMs • For a given word size k and mismatch tolerance l, define K(X, Y) = # distinct k-long word occurrence pairs, one from X and one from Y, matching with ≤ l mismatches • Define the normalized mismatch kernel K’(X, Y) = K(X, Y) / sqrt(K(X, X) K(Y, Y)) • An SVM can be learned by supplying this kernel function • Example: let k = 3, l = 1, X = ABACARDI, Y = ABRADABI; then K(X, Y) = 4 and K’(X, Y) = 4 / sqrt(7 · 7) = 4/7
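The worked example can be reproduced in a few lines. The counting convention here (distinct unordered pairs of k-mers within Hamming distance l, so that K(X, X) counts each self-pair once) is my reading of the slide, chosen because it reproduces the stated values K(X, Y) = 4 and K(X, X) = K(Y, Y) = 7:

```python
from itertools import product
from math import sqrt

def kmers(s, k):
    """All length-k substrings of s, in order of occurrence."""
    return [s[i:i + k] for i in range(len(s) - k + 1)]

def hamming(a, b):
    """Number of mismatched positions between two equal-length words."""
    return sum(x != y for x, y in zip(a, b))

def mismatch_kernel(X, Y, k=3, l=1):
    """Count distinct unordered k-mer pairs, one from each sequence,
    that differ in at most l positions."""
    pairs = {tuple(sorted((a, b)))
             for a, b in product(kmers(X, k), kmers(Y, k))
             if hamming(a, b) <= l}
    return len(pairs)

def normalized(X, Y, k=3, l=1):
    return mismatch_kernel(X, Y, k, l) / sqrt(
        mismatch_kernel(X, X, k, l) * mismatch_kernel(Y, Y, k, l))

X, Y = "ABACARDI", "ABRADABI"
print(mismatch_kernel(X, Y))  # 4
print(normalized(X, Y))       # 0.5714... = 4/7
```

The published mismatch kernel of Leslie et al. is defined through a feature map over all possible k-mers and is computed with a mismatch-tree data structure; this brute-force pair count is only a small-scale sketch of the same similarity idea.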
Disadvantages • Determining the 3D structure of proteins is practically impossible at scale • Primary sequences, by contrast, are cheap to determine • How do we use all this unlabeled data? • Use semi-supervised learning based on the cluster assumption
Semi-Supervised Methods • Some examples are labeled • Assume labels vary smoothly among all examples • SVMs and other discriminative methods may make significant mistakes due to lack of data
Semi-Supervised Methods • Some examples are labeled • Assume labels vary smoothly among all examples • Attempt to “contract” the distances within each cluster while keeping inter-cluster distances large
Cluster Kernels • Semi-supervised methods • Neighborhood kernel • For each X, run PSI-BLAST to get the set of similar sequences Nbd(X) • Define Φnbd(X) = (1/|Nbd(X)|) ΣX’ ∈ Nbd(X) Φoriginal(X’) — “counts of all k-mers matching with at most 1 difference, averaged over all sequences similar to X” • Knbd(X, Y) = (1/(|Nbd(X)|·|Nbd(Y)|)) ΣX’ ∈ Nbd(X) ΣY’ ∈ Nbd(Y) K(X’, Y’) • Next: bagged mismatch
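The neighborhood kernel is just the base kernel averaged over all pairs drawn from the two neighborhoods, which can be sketched directly. The toy neighborhood function and character-overlap “kernel” below are stand-ins for PSI-BLAST hits and the mismatch kernel, purely for illustration:

```python
def neighborhood_kernel(X, Y, nbd, base_k):
    """K_nbd(X, Y) = (1 / (|Nbd(X)| * |Nbd(Y)|)) *
    sum over X' in Nbd(X), Y' in Nbd(Y) of base_k(X', Y')."""
    nx, ny = nbd(X), nbd(Y)
    total = sum(base_k(a, b) for a in nx for b in ny)
    return total / (len(nx) * len(ny))

# Toy stand-ins: each sequence's neighborhood is just itself, so
# K_nbd reduces to the base kernel.
toy_nbd = lambda s: [s]
base = lambda a, b: float(len(set(a) & set(b)))  # crude character-overlap score
print(neighborhood_kernel("ABC", "ABD", toy_nbd, base))  # 2.0
```

With real PSI-BLAST neighborhoods each sequence is smoothed toward its cluster of homologs, which is exactly the “contract distances within a cluster” idea from the previous slide.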
Bagged Mismatch Kernel • Final method • Bagged mismatch • Run k-means clustering n times, giving assignments cp(X) for p = 1, …, n • For every X and Y, count the fraction of runs in which they are bagged together: Kbag(X, Y) = (1/n) Σp 1(cp(X) = cp(Y)) • Combine the “bag fraction” with the original comparison K(·,·): Knew(X, Y) = Kbag(X, Y) · K(X, Y)
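Given the per-run cluster assignments, the bagged kernel is a one-line fraction times the base kernel. The hand-written assignments and constant base kernel below are toy stand-ins for actual k-means runs over real sequences:

```python
def bagged_kernel(X, Y, assignments, base_k):
    """K_bag(X, Y) = fraction of clustering runs placing X and Y in the
    same cluster; K_new(X, Y) = K_bag(X, Y) * base_k(X, Y)."""
    frac = sum(1 for c in assignments if c[X] == c[Y]) / len(assignments)
    return frac * base_k(X, Y)

# Toy stand-in: n = 4 precomputed "k-means" runs over three sequences.
runs = [
    {"s1": 0, "s2": 0, "s3": 1},
    {"s1": 0, "s2": 0, "s3": 1},
    {"s1": 0, "s2": 1, "s3": 1},
    {"s1": 1, "s2": 1, "s3": 0},
]
base = lambda a, b: 1.0  # stand-in base kernel
print(bagged_kernel("s1", "s2", runs, base))  # 0.75 (together in 3 of 4 runs)
```

Because unlabeled sequences participate in the clustering runs, they influence Kbag even though they never carry labels; that is what makes the method semi-supervised.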
What works best? Transductive Setting
References • C. Leslie et al. Mismatch string kernels for discriminative protein classification. Bioinformatics, Advance Access, January 22, 2004. • J. Weston et al. Semi-supervised protein classification using cluster kernels. 2003. • Images from Wikimedia Commons