Machine Learning for Protein Classification: Kernel Methods CS 374 Rajesh Ranganath 4/10/2008
Outline • Biological Motivation and Background • Algorithmic Concepts • Mismatch Kernels • Semi-supervised methods
The Protein Problem • Primary Structure can be easily determined • 3D structure determines function • Grouping proteins into structural and evolutionary families is difficult • Use machine learning to group proteins
How to look at amino acid chains • Smith-Waterman Idea • Mismatch Idea
Families • Proteins whose evolutionary relationship is readily recognizable from the sequence (>~25% sequence identity) • Families are further subdivided into Proteins • Proteins are divided into Species • The same protein may be found in several species [Figure: hierarchy Fold, Superfamily, Family, Proteins. Credit: Morten Nielsen, CBS, BioCentrum, DTU]
Superfamilies • Proteins which are (remotely) evolutionarily related • Sequence similarity low • Share function • Share special structural features • Relationships between members of a superfamily may not be readily recognizable from the sequence alone [Figure: hierarchy Fold, Superfamily, Family, Proteins. Credit: Morten Nielsen, CBS, BioCentrum, DTU]
Folds • Proteins which have >~50% of their secondary structure elements arranged in the same order in the protein chain and in three dimensions are classified as having the same fold • No evolutionary relation between proteins [Figure: hierarchy Fold, Superfamily, Family, Proteins. Credit: Morten Nielsen, CBS, BioCentrum, DTU]
Protein Classification • Given a new protein, can we place it in its “correct” position within an existing protein hierarchy? Methods • BLAST / PSI-BLAST • Profile HMMs • Supervised Machine Learning methods [Figure: a new protein to be placed in the Fold, Superfamily, Family, Proteins hierarchy]
Machine Learning Concepts • Supervised Methods • Discriminative Vs. Generative Models • Transductive Learning • Support Vector Machines • Kernel Methods • Semi-supervised Methods
Discriminative and Generative Models [Figure: two panels, Discriminative and Generative]
Transductive Learning • Most Learning is Inductive • Given (x1,y1) …. (xm,ym), for any test input x* predict the label y* • Transductive Learning • Given (x1,y1) …. (xm,ym) and all the test input {x1*,…, xp*} predict label {y1*,…, yp*}
Support Vector Machines • Popular discriminative learning algorithm • Optimal geometric margin classifier • Can be solved efficiently using the Sequential Minimal Optimization algorithm • If $x_1, \dots, x_n$ are the training examples, $\mathrm{sign}(\sum_i \lambda_i x_i^T x)$ “decides” where $x$ falls • Train the $\lambda_i$ to achieve the best margin
Support Vector Machines (2) • Kernelizable: the SVM solution can be written entirely in terms of dot products of the inputs, so $\mathrm{sign}(\sum_i \lambda_i K(x_i, x))$ determines the class of $x$
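The decision rule on the two SVM slides can be sketched in a few lines. This is a minimal illustration, not a trained SVM: the per-example weights are hand-picked, they are assumed to be signed (the class label is folded into each weight), and there is no bias term, mirroring the formula as it appears on the slide. The names `decide`, `linear_kernel`, `train_points` and `weights` are hypothetical.

```python
from typing import Callable, Sequence

Vector = Sequence[float]


def linear_kernel(x: Vector, z: Vector) -> float:
    """Plain dot product x^T z."""
    return sum(a * b for a, b in zip(x, z))


def decide(
    weights: Sequence[float],
    train_points: Sequence[Vector],
    x: Vector,
    kernel: Callable[[Vector, Vector], float] = linear_kernel,
) -> int:
    """Return +1 or -1 from the sign of sum_i weight_i * K(x_i, x)."""
    score = sum(w * kernel(xi, x) for w, xi in zip(weights, train_points))
    return 1 if score >= 0 else -1


# Toy data (assumed for illustration): one positive and one negative example.
train_points = [(1.0, 1.0), (-1.0, -1.0)]
weights = [0.5, -0.5]  # signed weights, chosen by hand rather than trained

label_a = decide(weights, train_points, (2.0, 0.5))
label_b = decide(weights, train_points, (-1.0, -3.0))
```

Swapping `linear_kernel` for any other kernel function changes the classifier without touching `decide`, which is the practical meaning of "kernelizable".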
Kernel Methods • $K(x, z) = f(x)^T f(z)$ • $f$ is the feature mapping • $x$ and $z$ are input vectors • High-dimensional features do not need to be explicitly calculated • Think of the kernel function as a similarity measure between $x$ and $z$ • Example:
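One standard illustration (not necessarily the example shown on the slide) is the quadratic kernel $K(x, z) = (x^T z)^2$. Its feature map lists every pairwise product $x_i x_j$, so the explicit feature vector has $d^2$ entries, while the kernel itself needs only one $d$-dimensional dot product. The sketch below checks that both routes give the same number; the function names and the input vectors are assumptions chosen for the demonstration.

```python
def quadratic_kernel(x, z):
    """K(x, z) = (x^T z)^2, computed without building any features."""
    return sum(a * b for a, b in zip(x, z)) ** 2


def explicit_features(x):
    """Feature map f(x): all d*d pairwise products x_i * x_j."""
    return [a * b for a in x for b in x]


x = (1.0, 2.0, 3.0)
z = (4.0, 5.0, 6.0)

# Route 1: kernel evaluated directly in the 3-dimensional input space.
via_kernel = quadratic_kernel(x, z)

# Route 2: dot product of the explicit 9-dimensional feature vectors.
via_features = sum(p * q for p, q in zip(explicit_features(x), explicit_features(z)))
```

Both values equal 1024 here, since $x^T z = 32$.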
Mismatch Kernel • Regions of similar amino acid sequences yield a similar tertiary structure of proteins • Used as a kernel for an SVM to identify protein homologies
k-mer based SVMs • For a given word size k and mismatch tolerance l, define K(X, Y) = # distinct k-long word occurrences with ≤ l mismatches • Define the normalized mismatch kernel $K'(X, Y) = K(X, Y) / \sqrt{K(X, X)\,K(Y, Y)}$ • An SVM can be learned by supplying this kernel function • Example: let k = 3, l = 1, with the two sequences ABACARDI and ABRADABI; then K(X, Y) = 4 and $K'(X, Y) = 4/\sqrt{7 \cdot 7} = 4/7$
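The numbers in this example can be reproduced with a short counting sketch. It rests on one reading of the slide's informal definition, chosen because it matches the example: K(X, Y) counts pairs of k-mer occurrences, one from each sequence, that differ in at most l positions, and the self-comparison K(X, X) counts each unordered pair once. The mismatch kernel in Leslie et al. is defined through an explicit k-mer feature map, so its values need not match this count. All function names are hypothetical.

```python
from math import sqrt


def kmers(seq, k):
    """All k-long windows of seq, in order of occurrence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def hamming(a, b):
    """Number of positions at which two equal-length words differ."""
    return sum(c1 != c2 for c1, c2 in zip(a, b))


def cross_count(x, y, k, l):
    """Pairs (k-mer of x, k-mer of y) with at most l mismatches."""
    return sum(1 for a in kmers(x, k) for b in kmers(y, k) if hamming(a, b) <= l)


def self_count(x, k, l):
    """Unordered pairs of k-mers of x (including each k-mer with itself)
    with at most l mismatches. Assumed reading that yields the slide's 7."""
    words = kmers(x, k)
    return sum(
        1
        for i in range(len(words))
        for j in range(i, len(words))
        if hamming(words[i], words[j]) <= l
    )


def normalized(x, y, k, l):
    """K'(X, Y) = K(X, Y) / sqrt(K(X, X) * K(Y, Y))."""
    return cross_count(x, y, k, l) / sqrt(self_count(x, k, l) * self_count(y, k, l))


k_xy = cross_count("ABACARDI", "ABRADABI", 3, 1)
k_norm = normalized("ABACARDI", "ABRADABI", 3, 1)
```

The four matching pairs are ABA/ABR, ABA/ADA, ABA/ABI and ACA/ADA, giving `k_xy == 4` and `k_norm == 4/7`.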
Disadvantages • Determining the 3D structure of a protein is slow and expensive, so labeled examples are scarce • Primary sequences are cheap to determine • How do we use all this unlabeled data? • Use semi-supervised learning based on the cluster assumption
Semi-Supervised Methods • Some examples are labeled • Assume labels vary smoothly among all examples
Semi-Supervised Methods (2) • SVMs and other discriminative methods may make significant mistakes due to lack of labeled data
Semi-Supervised Methods (3) • Attempt to “contract” the distances within each cluster while keeping intercluster distances larger
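Cluster Kernels • Semi-supervised methods • Neighborhood kernel • For each X, run PSI-BLAST to get similar sequences Nbd(X) • Define $\Phi_{\mathrm{nbd}}(X) = \frac{1}{|\mathrm{Nbd}(X)|} \sum_{X' \in \mathrm{Nbd}(X)} \Phi_{\mathrm{original}}(X')$: “counts of all k-mers matching with at most 1 difference in all sequences that are similar to X” • $K_{\mathrm{nbd}}(X, Y) = \frac{1}{|\mathrm{Nbd}(X)|\,|\mathrm{Nbd}(Y)|} \sum_{X' \in \mathrm{Nbd}(X)} \sum_{Y' \in \mathrm{Nbd}(Y)} K(X', Y')$ • Next: bagged mismatch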
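The neighborhood kernel is an average of the base kernel over all pairs drawn from the two neighborhoods, which can be sketched as below. PSI-BLAST is not run here: the neighborhoods are hand-written lists, and the base kernel is a toy stand-in (the number of distinct 2-mers two strings share) rather than the mismatch kernel. The names `neighborhood_kernel`, `shared_kmers`, `nbd_x` and `nbd_y` are assumptions for the illustration.

```python
def shared_kmers(a, b, k=2):
    """Toy base kernel: number of distinct k-mers present in both strings."""
    kmers_a = {a[i:i + k] for i in range(len(a) - k + 1)}
    kmers_b = {b[i:i + k] for i in range(len(b) - k + 1)}
    return len(kmers_a & kmers_b)


def neighborhood_kernel(base_kernel, nbd_x, nbd_y):
    """Average of base_kernel(x', y') over every x' in nbd_x and y' in nbd_y."""
    if not nbd_x or not nbd_y:
        raise ValueError("each neighborhood must contain at least one sequence")
    total = sum(base_kernel(xp, yp) for xp in nbd_x for yp in nbd_y)
    return total / (len(nbd_x) * len(nbd_y))


# Hand-written neighborhoods standing in for PSI-BLAST hits (assumed data).
nbd_x = ["ABAB", "ABBA"]
nbd_y = ["ABAB", "BBBB"]

k_nbd = neighborhood_kernel(shared_kmers, nbd_x, nbd_y)
```

With these inputs the four pairwise values are 2, 0, 2 and 1, so `k_nbd` is 5/4 = 1.25. Because the average is taken over unlabeled neighbors, two sequences with no direct similarity can still score well when their neighborhoods overlap.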
Bagged Mismatch Kernel • Final method • Bagged mismatch • Run k-means clustering n times, giving p = 1,…,n assignments $c_p(X)$ • For every X and Y, count up the fraction of times they are bagged together: $K_{\mathrm{bag}}(X, Y) = \frac{1}{n} \sum_{p=1}^{n} \mathbf{1}(c_p(X) = c_p(Y))$ • Combine the “bag fraction” with the original comparison K(.,.): $K_{\mathrm{new}}(X, Y) = K_{\mathrm{bag}}(X, Y)\,K(X, Y)$
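The bag fraction and the combined kernel can be sketched as follows. The k-means runs themselves are not reproduced: the cluster assignments are assumed to be given as one list per run, and the base kernel values are made-up numbers rather than real mismatch-kernel scores. The names `bag_kernel`, `combined_kernel`, `runs` and `base` are hypothetical.

```python
def bag_kernel(assignments, i, j):
    """Fraction of clustering runs in which examples i and j share a cluster.

    assignments[p][i] is the cluster label of example i in run p. Labels are
    arbitrary within a run; only co-membership matters.
    """
    n_runs = len(assignments)
    together = sum(1 for run in assignments if run[i] == run[j])
    return together / n_runs


def combined_kernel(assignments, base, i, j):
    """Product of the bag fraction and the base kernel value for (i, j)."""
    return bag_kernel(assignments, i, j) * base[i][j]


# Four clustering runs over three examples (assumed data).
runs = [
    [0, 0, 1],
    [1, 1, 0],
    [0, 1, 1],
    [0, 0, 0],
]

# Made-up base kernel matrix, symmetric with unit diagonal.
base = [
    [1.0, 0.8, 0.4],
    [0.8, 1.0, 0.6],
    [0.4, 0.6, 1.0],
]

bag_01 = bag_kernel(runs, 0, 1)
new_01 = combined_kernel(runs, base, 0, 1)
new_02 = combined_kernel(runs, base, 0, 2)
```

Examples 0 and 1 land together in three of the four runs, so their base similarity of 0.8 is scaled by 0.75; examples 0 and 2 share a cluster only once, so their similarity is scaled down to a quarter of its base value.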
What works best? [Figure: Transductive Setting]
References • C. Leslie et al. Mismatch string kernels for discriminative protein classification. Bioinformatics Advance Access, January 22, 2004. • J. Weston et al. Semi-supervised protein classification using cluster kernels. 2003. • Images from Wikimedia Commons