1. Source: Gert Lanckriet’s slides. In this… KM / SDP
2. Project and Presentation Final project
Due on April 26th
Length: 20-30 pages
Hand in a hard copy, as well as an electronic copy of the report and any related source code (via Blackboard).
Presentation (April 26th and May 1st)
5-minute presentation for each student
Email the TA your slides one day before the presentation
jianhuic@gmail.com
3. Overview Find a mapping f such that, in the new feature space, the problem becomes easier to solve (e.g., linear).
SVM, PCA, LDA, CCA, etc.
The kernel is defined as the inner product between data points in this new feature space (see the sketch after this list).
Similarity measure
Valid kernels
Kernel construction
Kernels from pairwise similarities
Diffusion kernels for graphs
Kernels for vectors
Kernels for abstract data
Documents (string)
Protein Sequence
New kernels from existing kernels
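To make the "kernel = inner product in a new feature space" statement above concrete, here is a minimal Python sketch (not from the slides; the degree-2 polynomial map and function names are illustrative): the quadratic kernel k(x, z) = (x·z)^2 can be evaluated directly in input space, yet it equals the inner product under an explicit feature map.

    import numpy as np

    def quadratic_kernel(x, z):
        # Kernel evaluated directly in input space: k(x, z) = (x . z)^2.
        return float(np.dot(x, z)) ** 2

    def feature_map(x):
        # Explicit degree-2 feature map phi(x) with entries x_i * x_j.
        return np.outer(x, x).ravel()

    x = np.array([1.0, 2.0, 3.0])
    z = np.array([0.5, -1.0, 2.0])

    # Both give 20.25: the kernel is an inner product in the phi-space.
    print(quadratic_kernel(x, z))
    print(float(np.dot(feature_map(x), feature_map(z))))

In general (e.g., for Gaussian kernels) the feature space can be very high- or even infinite-dimensional, which is exactly why only the inner product is specified.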
4. How to choose the optimal kernel? Many different types of kernels for the same data.
Different kernels for protein sequences
Many different kernels from different types of data
Different data sources in bioinformatics
Question:
How to choose the optimal kernel?
Active research in machine learning
Simple approach: kernel alignment (against an ideal kernel built from the labels)
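A hedged sketch of that simple approach (the function names are mine, not from the lecture): kernel-target alignment scores a candidate kernel matrix K against the ideal kernel yy^T built from the labels, using the normalized Frobenius inner product; the kernel with the higher alignment is preferred.

    import numpy as np

    def alignment(K1, K2):
        # Kernel alignment A(K1, K2) = <K1, K2>_F / (||K1||_F * ||K2||_F).
        return np.sum(K1 * K2) / (np.linalg.norm(K1) * np.linalg.norm(K2))

    # Labels in {-1, +1} define the ideal target kernel y y^T.
    y = np.array([1.0, 1.0, -1.0, -1.0])
    K_target = np.outer(y, y)

    # Two candidate kernels on the same four points (toy values).
    X = np.array([[1.0, 1.0], [1.2, 0.9], [-1.0, -1.0], [-0.9, -1.1]])
    K_linear = X @ X.T          # respects the class structure
    K_noise = np.eye(4)         # ignores it

    print(alignment(K_linear, K_target))   # close to 1: well aligned with the labels
    print(alignment(K_noise, K_target))    # 0.5: much less aligned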
5. Outline of lecture Introduction
Kernel based learning
Kernel design for different data sources
Learning the optimal Kernel
Experiments
6. During the past decade, a heterogeneous spectrum of data became available describing the genome:
- Sequence data -> similarities between proteins / genes
- mRNA expression levels associated with a gene, measured under different experimental conditions
7. Membrane protein prediction
8. Different data sources are likely to contain different, and thus partly independent, information about the task at hand.
Protein-protein interactions are best expressed as graphs.
9. Kernel-based learning methods have already proven to be a very useful tool in bioinformatics.
10. Kernel methods work by embedding data items (genes, proteins, etc.) into a (possibly high-dimensional) Euclidean vector space. The embedding is performed implicitly: instead of giving explicit coordinates, the inner product is specified. This is done by defining a kernel function that specifies the inner product between any pair of data items (whichever are needed); this function can be regarded as a similarity measure between data items. When the amount of data is finite (e.g., here: a finite number of genes under consideration), the kernel values between all pairs of data points can be organized in a kernel matrix, the Gram matrix, which fully describes the embedding.
A matrix that is symmetric and positive semidefinite is a valid kernel matrix, in the sense that a corresponding mapping/embedding exists.
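A small Python sketch of the validity condition stated above (the helper name is illustrative): a Gram matrix of inner products is always symmetric positive semidefinite, while a symmetric matrix with a negative eigenvalue cannot be a valid kernel matrix.

    import numpy as np

    def is_valid_kernel_matrix(K, tol=1e-10):
        # Valid kernel matrix: symmetric with no (significantly) negative eigenvalues.
        return np.allclose(K, K.T) and np.linalg.eigvalsh(K).min() >= -tol

    # Inner products of explicit feature vectors always give a valid Gram matrix.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 3))              # 5 items embedded in R^3
    print(is_valid_kernel_matrix(X @ X.T))   # True

    # A symmetric matrix with eigenvalues 3 and -1 is not a valid kernel matrix.
    K_bad = np.array([[1.0, 2.0], [2.0, 1.0]])
    print(is_valid_kernel_matrix(K_bad))     # False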
14. Find a good hyperplane
(w, b) ∈ R^(d+1)
that classifies these and future data points as well as possible
16. Intuition (Vapnik, 1965) if linearly separable:
Separate the data
Place the hyperplane “far” from the data: large margin
17. If not linearly separable:
Allow some errors
Still, try to place the hyperplane “far” from each class
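For reference, the standard soft-margin SVM primal (not spelled out on the slides, but it is the formal version of slides 16-17): slack variables ξ_i allow some errors, while minimizing ||w|| keeps the hyperplane far from each class.

    \min_{w,\, b,\, \xi} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i
    \quad \text{s.t.} \quad y_i (w^\top x_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \quad i = 1, \dots, n.

The linearly separable case of slide 16 corresponds to forcing all ξ_i = 0; the resulting margin is 2/||w||.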
Handwriting recognition (e.g., USPS)
Computational biology (e.g., micro-array data)
Text classification
Face detection
Face expression recognition
Time series prediction (regression)
Drug discovery (novelty detection)
21. Kernel-based learning methods represent data by means of a kernel matrix or function, which defines similarities between pairs of genes, proteins, etc. Such similarities can be established using a broad spectrum of data (examples later on); as long as the corresponding kernel matrix is positive semidefinite, it is fine. In that case, we can interpret its entries as inner products in some high-dimensional space, in which we can train our favorite linear classification algorithm.
Kernel matrix <-> kernel function
So we can have a very heterogeneous set of data up here, and every kernel function/matrix is geared towards a specific type of data, thus extracting a specific type of information from a data set.
Just as each data set describes the genome partially, in a heterogeneous way, so does each kernel, but in a homogeneous way (all compatible matrices). So here we have a chance to fuse the many partial descriptions of the data: by combining/fusing/mixing those compatible kernel matrices in a way that is statistically optimal, computationally efficient, and robust, we can try to find a kernel K that best represents all of the information available for a given learning task. We’ll explain further on how this can be accomplished.
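A minimal Python sketch of this fusion idea (toy data; the helper name is mine), assuming the individual kernel matrices are computed on the same set of items: any nonnegative weighting of valid kernel matrices is again symmetric positive semidefinite, so the fused K can be plugged into the same SVM.

    import numpy as np

    def combine_kernels(kernels, weights):
        # Weighted sum K = sum_i mu_i * K_i; with mu_i >= 0 the result stays PSD.
        assert all(mu >= 0 for mu in weights), "weights must be nonnegative"
        return sum(mu * K for mu, K in zip(weights, kernels))

    rng = np.random.default_rng(1)
    n = 6
    X1 = rng.normal(size=(n, 4)); K_expr = X1 @ X1.T   # e.g., an expression kernel
    X2 = rng.normal(size=(n, 8)); K_seq = X2 @ X2.T    # e.g., a sequence kernel

    K = combine_kernels([K_expr, K_seq], [0.7, 0.3])
    print(np.linalg.eigvalsh(K).min() >= -1e-10)       # True: fused K is still valid

How to pick the weights in a principled way is exactly the kernel-learning problem discussed later in the lecture.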
23. Each matrix entry is an mRNA expression measurement.
Each column is an experiment.
Each row corresponds to a gene.
24. Normalized scalar product
Similar vectors receive high values, and vice versa.
25. Use general similarity measurement for vector data: Gaussian kernel
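A hedged sketch of the two vector kernels of slides 24-25, applied to the rows of a toy expression matrix (function names are mine): the normalized scalar product is the cosine similarity of expression profiles, and the Gaussian kernel is the general-purpose similarity measure for vector data.

    import numpy as np

    def normalized_linear_kernel(X):
        # K_ij = <x_i, x_j> / (||x_i|| ||x_j||): similar profiles get values near 1.
        K = X @ X.T
        norms = np.sqrt(np.diag(K))
        return K / np.outer(norms, norms)

    def gaussian_kernel(X, sigma=1.0):
        # K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)).
        sq = np.sum(X**2, axis=1)
        d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0)
        return np.exp(-d2 / (2.0 * sigma**2))

    # Toy expression matrix: rows are genes, columns are experiments (slide 23).
    X = np.array([[2.0, 0.5, 1.5],
                  [1.9, 0.6, 1.4],
                  [-1.0, 2.0, 0.1]])
    print(normalized_linear_kernel(X).round(2))
    print(gaussian_kernel(X, sigma=1.0).round(2))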
30. Pairwise interactions can be represented as a graph or a matrix.
The simplest kernel counts the number of shared interactions between each pair.
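A minimal sketch of that simplest interaction kernel (toy matrix; variable names are mine): with a binary interaction matrix A, the product A A^T counts, for each pair of proteins, how many interaction partners they share, and it is automatically a valid (positive semidefinite) kernel.

    import numpy as np

    # Toy interaction matrix: A[i, j] = 1 if protein i interacts with protein j.
    A = np.array([[0, 1, 1, 0],
                  [1, 0, 1, 1],
                  [1, 1, 0, 0],
                  [0, 1, 0, 0]])

    # K[i, j] = number of interaction partners shared by proteins i and j.
    K_interact = A @ A.T
    print(K_interact)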
31. A general method for establishing similarities between nodes of a graph.
Based upon a random walk.
Efficiently accounts for all paths connecting two nodes, weighted by path lengths.
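A hedged sketch of the diffusion kernel in its standard Kondor-Lafferty form (the toy graph and names are illustrative): exponentiating the negative graph Laplacian sums contributions from all paths between two nodes, with longer paths down-weighted by the diffusion parameter β.

    import numpy as np
    from scipy.linalg import expm

    # Toy undirected graph (a path on 4 nodes), given by its adjacency matrix.
    A = np.array([[0, 1, 0, 0],
                  [1, 0, 1, 0],
                  [0, 1, 0, 1],
                  [0, 0, 1, 0]], dtype=float)

    L = np.diag(A.sum(axis=1)) - A   # graph Laplacian D - A
    beta = 0.5                       # diffusion parameter: larger = wider spread

    K_diff = expm(-beta * L)         # diffusion kernel; symmetric and PSD
    print(K_diff.round(3))           # nodes connected by short paths get larger values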
32. Integral plasma membrane proteins serve several functions. Often, one divides them into four classes: transporters, linkers, enzymes, and receptors.
- Transporters serve as gates through the cell membrane, generally for charged or polar molecules that otherwise could not pass the hydrophobic lipid bilayer the plasma membrane consists of.
- Linkers have a structural function in the cell membrane.
- Some membrane proteins are merely enzymes, moderating biochemical reactions inside or outside the cell.
- Receptors are capable of receiving biochemical signals from inside or outside the cell, thus triggering a reaction on the other side of the membrane. In particular, inside the membrane, receptors often interact with kinases (kinase is a generic name for enzymes that attach a phosphate to a protein, opposite in action to phosphatases; these enzymes are important metabolic regulators), thus initiating a signaling pathway in the cell triggered by an extracellular stimulus.
33. We will develop a kernel motivated by the low-frequency alternation of hydrophobic and hydrophilic regions in membrane proteins. However, we also demonstrate that the hydropathy profile only provides partial information: additional information is gained from sequence homology and protein-protein interactions.
34. Dir. Inc. … -> known to be useful in identifying membrane proteins
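To make the hydropathy idea of slide 33 concrete, here is a hedged Python sketch (not the exact kernel of the paper): a smoothed hydropathy profile computed with the Kyte-Doolittle scale and a sliding-window average. A hydropathy-based kernel would then compare the low-frequency content of such profiles (e.g., their first Fourier coefficients), which captures the alternation of hydrophobic and hydrophilic regions.

    import numpy as np

    # Kyte-Doolittle hydropathy values for the 20 amino acids.
    KD = {'A': 1.8, 'R': -4.5, 'N': -3.5, 'D': -3.5, 'C': 2.5, 'Q': -3.5,
          'E': -3.5, 'G': -0.4, 'H': -3.2, 'I': 4.5, 'L': 3.8, 'K': -3.9,
          'M': 1.9, 'F': 2.8, 'P': -1.6, 'S': -0.8, 'T': -0.7, 'W': -0.9,
          'Y': -1.3, 'V': 4.2}

    def hydropathy_profile(seq, window=7):
        # Sliding-window average of per-residue hydropathy values.
        values = np.array([KD[a] for a in seq])
        return np.convolve(values, np.ones(window) / window, mode='valid')

    profile = hydropathy_profile("MKTLLILAVVAAALA" * 3)   # toy sequence
    low_freq = np.fft.rfft(profile)[:5]                   # low-frequency content
    print(np.round(np.abs(low_freq), 2))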
36. Let’s forget about everything else for a moment and consider learning the optimal kernel. How can we do this?
Convex: local optimum = global optimum
37. Let’s forget about everything else for a moment and consider learning the optimal kernel. How can we do this?
38. Learning the Optimal Kernel
39. - Convex subset: good for us. We want a subset obtained by mixing our kernels somehow; here we take a linear subspace in the cone, spanned by those kernels, where we want to learn the weights.
- For SVMs, maximum margin classifiers:
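For reference, a hedged transcription of the kernel-learning problem from Lanckriet et al.’s KM/SDP framework that these notes refer to (as I recall it from that work, not copied from the slides): restrict K to a combination of the given kernels, fix its trace so the problem stays bounded, and optimize the soft-margin SVM objective over both the dual variables and the kernel weights.

    \min_{K \in \mathcal{K},\ \mathrm{trace}(K) = c} \ \omega(K),
    \qquad
    \mathcal{K} = \Big\{ K = \sum_i \mu_i K_i \ :\ K \succeq 0 \Big\},

    \omega(K) = \max_{\alpha}\ 2\,\alpha^\top e - \alpha^\top \mathrm{diag}(y)\, K\, \mathrm{diag}(y)\, \alpha
    \quad \text{s.t.} \quad 0 \le \alpha \le C,\ \ \alpha^\top y = 0.

With the additional restriction μ_i ≥ 0, the PSD constraint is automatic and the optimization reduces to a quadratically constrained program; in the general case it is a semidefinite program, hence convex, which is why “local = global optimum” on slide 36.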
40. Learning the optimal Kernel
41. Learning the optimal Kernel
46. Next class Student presentation
Schedule is online
Send the PPT slides to the TA one day in advance
Email: jianhuic@gmail.com
47. Survey Clustering
Classification
Regression
Semi-supervised learning
Dimensionality reduction
Manifold learning
Kernel learning