A Transductive Framework of Distance Metric Learning by Spectral Dimensionality Reduction Fuxin Li1, Jian Yang2 and Jue Wang1 1 Institute of Automation, Chinese Academy of Sciences 2 Beijing University of Technology
Metric Learning: What does it do? • Given an a priori metric, metric learning tries to adapt it to the information coming from the training items. • For example, items from the same class can be pushed together, while items from different classes can be pulled apart.
What’s good? • Many pattern recognition and machine learning methods depend on a good choice of metric, which may not always be available. • As a compromise, the Euclidean metric is often used as the default one. Questionable! • Metric learning can be used to make things better.
Endless Learning Cycle • Moreover, the learned metric can in turn serve as the a priori metric the next time.
How to learn? • The basic idea in many metric learning algorithms is to balance the information from the training data against the a priori metric. • Meanwhile, many methods seek to learn a Mahalanobis metric in the original space or in the feature space induced by kernels. • (Xing, Ng, Jordan, Russell 2003) • (Kwok and Tsang 2003) • (Weinberger, Blitzer and Saul 2006)
Wait a minute… • We can obtain a Mahalanobis metric by linearly transforming the input data. • Basically, when we say we learn a Mahalanobis metric, we are looking for “good” directions in the input space and projecting onto them. • This reminds us of another field of research: dimensionality reduction.
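To make the connection explicit, here is the standard identity (a general fact, not specific to this paper): a Mahalanobis metric with positive semidefinite matrix M = L^T L is just the Euclidean metric after the linear map L,

$$ d_M(x_i, x_j)^2 = (x_i - x_j)^\top M\, (x_i - x_j) = \| L x_i - L x_j \|^2 , $$

so learning M amounts to finding a linear projection of the data.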
Dimensionality Reduction • Linear or nonlinear. • Tries to find a low-dimensional representation of the original data. • Nonlinear dimensionality reduction has become quite popular in recent years. • Mainly unsupervised.
And Metric Learning? • Dimensionality reduction learns metrics. • Metric learning = a supervised or semi-supervised version of dimensionality reduction? • Under certain assumptions and formulations, yes. • A combined view of these two topics can give something interesting.
A Metric Learning Formulation • c_ijk is positive if d_ij should be larger than d_ik, and x_j, x_k are neighbors of x_i. • Intuitively, minimizing the first part of the criterion moves labeled items with the same label together and items with different labels apart. • The second part ensures that the a priori structure is preserved.
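The criterion itself did not survive extraction; a schematic form consistent with the two bullets above (an illustrative sketch, not necessarily the authors' exact objective) combines relative-comparison terms with a structure-preserving penalty:

$$ \min_{d} \;\; \sum_{i,j,k} c_{ijk}\,\big( d_{ik} - d_{ij} \big) \;+\; \lambda\, \Omega_P(d), $$

where a positive c_{ijk} pushes d_{ij} above d_{ik}, and \Omega_P is the penalty built from the a priori structure P.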
Graph Transduction • A possible way to specify the second term is to link the items into a graph and use its Laplacian as the penalty P. • Intuitively, when an item is dragged, the items linked to it will follow.
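As a concrete illustration of such a penalty, here is a minimal sketch that builds a k-nearest-neighbor graph Laplacian and evaluates the smoothness term tr(Y^T L Y) on an embedding; the graph construction (binary k-NN weights) and all names are illustrative choices, not the paper's exact setup.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def laplacian_penalty(X, Y, k=5):
    """Smoothness penalty tr(Y^T L Y) = 0.5 * sum_ij w_ij ||y_i - y_j||^2.

    X : (n, d) original inputs, used only to build the neighborhood graph.
    Y : (n, p) learned low-dimensional coordinates of the same n items.
    """
    # Symmetrized k-NN adjacency with binary weights (an illustrative choice).
    W = kneighbors_graph(X, n_neighbors=k, mode='connectivity').toarray()
    W = np.maximum(W, W.T)
    L = np.diag(W.sum(axis=1)) - W          # unnormalized graph Laplacian
    return float(np.trace(Y.T @ L @ Y))
```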
The Euclidean Assumption • Further assume that the metric is Euclidean. • This means there exists a Euclidean space in which all the input items can be placed as points, and our metric is just the Euclidean metric of that space. • This assumption was first used in metric learning in (Zhang 2003), but they did not make a connection with spectral methods.
And Kernels • Being Euclidean implies that there exists an inner product in the space. • The distance is completely determined by the inner product, so it suffices to learn the inner product only. • For a finite sample, this means that it suffices to learn the Gram matrix. Thus the metric learning problem in this case is the same as the kernel learning problem.
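The link between distances and the Gram matrix is the standard one: with Gram matrix G, every pairwise squared distance is recovered as

$$ d_{ij}^2 = G_{ii} + G_{jj} - 2\, G_{ij}, $$

so fixing G fixes the whole metric on the sample.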
Learning a Kernel • A semi-supervised optimization problem learns a kernel matrix G of order n+m (a sketch of such a problem is given below). • C_i and P are penalty matrices from the training data and the a priori structure, respectively; with suitable choices of these matrices, this problem is the same as the previous one. • Writing G = XX^T, the rows of X give the coordinates of each item in a low-dimensional Euclidean space.
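The optimization problem itself did not survive extraction; as an illustration of the kind of trace-penalized problem the description suggests (not necessarily the authors' exact formulation), one could write

$$ \min_{G \succeq 0,\ \mathrm{tr}(G) = c} \;\; \sum_i \mathrm{tr}\!\left( C_i G \right) \;+\; \lambda\, \mathrm{tr}\!\left( P G \right), $$

whose rank-one minimizer is G = c\,vv^\top, with v the unit eigenvector of \sum_i C_i + \lambda P belonging to its smallest eigenvalue; this is what makes such a problem solvable as an eigenvalue problem.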
Dimensionality Reduction + (Semi-)Supervision = Metric Learning • Spectral dimensionality reduction methods such as Kernel PCA, Kernel MDS and Laplacian Eigenmaps (Bengio et al., 2004), augmented with the penalty matrices C_i and P, give metric learning under the Euclidean assumption. • With the particular loss function used here, the problem remains an eigenvalue problem in the semi-supervised case. • With a sparse graph Laplacian penalty, the algorithm has a time complexity of O((n+m)^2).
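Below is a minimal sketch of the eigenvalue computation this describes, assuming the supervision penalties and the structure penalty have already been assembled into symmetric matrices; the function and variable names are illustrative and this is not the paper's exact algorithm.

```python
import numpy as np

def learn_embedding(C_sum, P, lam, n_components=2):
    """Embed items using the bottom eigenvectors of M = sum_i C_i + lam * P.

    C_sum : (N, N) symmetric sum of the supervision penalty matrices C_i
    P     : (N, N) symmetric structure penalty (e.g. a graph Laplacian)
    Returns an (N, n_components) coordinate matrix X; the learned kernel is G = X X^T.
    """
    M = C_sum + lam * P
    eigvals, eigvecs = np.linalg.eigh(M)     # eigenvalues in ascending order
    return eigvecs[:, :n_components]         # bottom of the spectrum
```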
More to give: RKHS regularization • In the one-dimensional case, it is proved to be equivalent to an RKHS regularization problem. • This is interesting because, unlike common regularization problems, it penalizes pairwise differences: the loss is a function of (f(x_i), f(x_j)) instead of (f(x_i), y_i). • It also gives a natural out-of-sample extension.
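The regularization problem was shown as an equation on the slide; a generic pairwise-loss RKHS problem of the kind described (an illustrative form, not necessarily the paper's exact functional) is

$$ \min_{f \in \mathcal{H}} \;\; \sum_{i,j} w_{ij}\, \ell\big( f(x_i), f(x_j) \big) \;+\; \lambda\, \| f \|_{\mathcal{H}}^2 , $$

in contrast to the usual \sum_i \ell(f(x_i), y_i) + \lambda \|f\|_{\mathcal{H}}^2; since f lives in an RKHS, it can be evaluated at unseen points, which is what gives the out-of-sample extension.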
Moving y to the weights • The information in y is moved into the weights. • Setting w_ij = t(d(x_i, x_j), d(y_i, y_j)) makes it possible to solve problems whose output y lies in any metric space. • Example: multi-class classification (see the sketch below).
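The concrete example did not survive extraction; one natural instantiation for multi-class classification (an illustrative choice, not necessarily the authors') uses the discrete metric on labels,

$$ d(y_i, y_j) = \begin{cases} 0, & y_i = y_j \\ 1, & y_i \neq y_j, \end{cases} $$

so that w_ij depends on the input distance and on whether the two labels agree, giving attractive weights to same-class pairs and repulsive weights to different-class pairs.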
The parameter λ • λ controls the strength of our prior belief. • Intuitively, a “learned” person should have strong prior beliefs that are not easily changed by what he sees, while a newbie might have weaker beliefs that change rapidly with observations. • However, in semi-supervised settings it is difficult to decide a good value of λ. In our experiments it is decided by grid search.
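A minimal sketch of such a grid search, assuming the user supplies a training routine and a validation score (e.g. k-NN accuracy on held-out labeled points); both callables and the grid are placeholders.

```python
import numpy as np

def select_lambda(fit_metric, score_metric, lambdas=np.logspace(-3, 3, 7)):
    """Pick lambda by grid search over a log-spaced range.

    fit_metric(lam)  -> learned metric/embedding for a given lambda
    score_metric(m)  -> validation score of that metric (higher is better)
    """
    scores = [(score_metric(fit_metric(lam)), lam) for lam in lambdas]
    return max(scores)[1]    # the lambda achieving the best validation score
```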
Experiments: MNIST • In the MNIST experiments we used Tangent Distance (Simard et al. 1998) to see whether better a priori knowledge leads to better results. • The results show that Tangent Distance is much better than Euclidean distance, while our algorithm gives only marginal improvements over Laplacian Eigenmaps in this case.
Conclusion • The work is preliminary, but a framework for distance metric learning is useful. • Under the Euclidean assumption, distance metric learning can be done by adding label information to spectral dimensionality reduction methods. • It also gives a regularization problem that is different from others and quite interesting.
Ongoing Work • We are currently experimenting with loss functions other than the current one. • Particularly interesting losses include the hinge loss and the exponential loss. • However, optimizing these loss functions requires semidefinite programming or other convex optimization, and the scalability is not as good as that of the current algorithm. We are working on fast solvers for these optimization problems.
Beyond Euclidean • What happens when the Euclidean assumption is not satisfied? • A particularly interesting scenario is when the Euclidean assumption holds only locally. • Then the dataset is locally homeomorphic to a subset of a Euclidean space, which means it lies on a topological manifold. • Unlike the “locally linear” techniques in current manifold learning, it is possible to design “locally Euclidean” techniques grounded in the current framework.