Learning Instance Specific Distance Using Metric Propagation
De-Chuan Zhan, Ming Li, Yu-Feng Li, Zhi-Hua Zhou
LAMDA Group, National Key Lab for Novel Software Technology
Nanjing University, China
{zhandc, lim, liyf, zhouzh}@lamda.nju.edu.cn
Distance metric learners are introduced… Distance-based classification: K-nearest neighbor classification, SVM with Gaussian kernels. Is the distance reliable? Are there any more natural measurements?
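To make that dependence concrete, here is a minimal sketch (not from the slides) showing that both k-NN and a Gaussian-kernel SVM are driven entirely by whichever distance is plugged in; all function names below are illustrative.

```python
# Hedged illustration: both k-NN and a Gaussian-kernel SVM depend only on the
# chosen distance d(.,.); swap the distance and the classifier changes.
import numpy as np

def euclidean(a, b):
    return np.linalg.norm(a - b)

def knn_predict(x, X_train, y_train, dist=euclidean, k=3):
    """Classify x by majority vote among its k nearest training points."""
    d = np.array([dist(x, xi) for xi in X_train])
    nearest = np.argsort(d)[:k]
    labels, counts = np.unique(np.asarray(y_train)[nearest], return_counts=True)
    return labels[np.argmax(counts)]

def gaussian_kernel(a, b, dist=euclidean, sigma=1.0):
    """An SVM with this kernel likewise depends only on the distance."""
    return np.exp(-dist(a, b) ** 2 / (2.0 * sigma ** 2))
```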
Any more natural measurements? When the sky is compared to other pictures… color, and probably texture, features matter. When Phelps II is compared to other athletes… swimming speed, shape of the feet… Can we assign a specific distance measurement to each instance, both labeled and unlabeled? … our work
Outline • Introduction • Our Methods • Experiments • Conclusion
Introduction Distance Metric Learning • Many machine learning algorithms rely on the distance metric over the input data patterns: classification, clustering, retrieval. • Many metric learning algorithms have been developed [Yang, 2006]. Problem: they focus on learning a uniform Mahalanobis distance for ALL instances.
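For reference, the uniform Mahalanobis distance that such methods learn has the standard form below (textbook definition, not reproduced from the slide):

```latex
% Standard Mahalanobis distance learned by conventional metric learning:
% one positive semidefinite matrix M shared by ALL instances.
d_M(\mathbf{x}_i, \mathbf{x}_j) \;=\; \sqrt{(\mathbf{x}_i - \mathbf{x}_j)^{\top} M \,(\mathbf{x}_i - \mathbf{x}_j)},
\qquad M \succeq 0 .
```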
Introduction Other distance functions • Instead of applying a uniform distance metric to every example, it is more natural to measure distances according to the specific properties of the data. • Some researchers define the distance from the sample's own perspective: • QSim [Zhou and Dai, ICDM'06] [Athitsos et al., TDS'07] • Local distance functions [Frome et al., NIPS'06, ICCV'07]
Introduction Query sensitive similarity Actually, instance-specific or query-specific similarities have been studied in other fields before: in content-based image retrieval, there has been a study that computes query-sensitive similarities, where the similarities among different images are decided only after a query image is received [Zhou and Dai, ICDM'06]. The problem: the query-sensitive similarity is based on pure heuristics.
Introduction Local distance functions • [Frome et al. NIPS'06]: the distance from the j-th instance to the i-th instance should be larger than that from the j-th to the k-th, i.e., Dji > Djk. Issues: 1. it cannot generalize directly; 2. the local distances defined are not directly comparable. • [Frome et al. ICCV'07]: Dij > Dkj; all constraints can be tied together, but more heuristics are required for testing. The problem: local distance functions for unlabeled data are not available.
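For concreteness, one common hinge-loss way to encode such a triplet constraint is sketched below; this is a generic formulation, not necessarily the exact objective used by Frome et al.

```latex
% Generic hinge-loss encoding of a triplet constraint D_{ji} > D_{jk}:
% penalize the triplet whenever the "dissimilar" distance does not exceed
% the "similar" distance by a unit margin.
\ell(i, j, k) \;=\; \max\!\bigl(0,\; 1 - (D_{ji} - D_{jk})\bigr).
```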
Introduction Our Work Can we assign a specific distance measurement to each instance, both labeled and unlabeled? Yes, we learn an Instance Specific Distance (ISD) via Metric Propagation.
Outline • Introduction • Our Methods • Experiments • Conclusion
Our Methods Intuition • Focus on learning an instance specific distance for both labeled and unlabeled data. • For labeled data: pairs of examples from the same class should be closer to each other. • For unlabeled data: metric propagation on a relationship graph.
Our Methods The ISD Framework
• Instead of directly conducting metric propagation while learning the distances for labeled examples, we formulate the metric propagation within a regularized framework.
• The loss function for labeled data, induced by the labels of the instances, provides the side information: the j-th instance either belongs to a class other than that of the i-th instance, or is a neighbor of the i-th instance, i.e., all cannot-links and some of the must-links are considered. The loss is a convex function, such as the hinge loss in classification or the least-squares loss in regression.
• A regularization term is responsible for the implicit metric propagation; inspired by [Zhu 2003], it is defined over the relationship graph.
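In symbols, a plausible sketch of this regularized framework, under the assumption that each instance x_i carries its own weight vector w_i acting on the elementwise difference to x_j, and that the regularizer follows the graph-regularization style of [Zhu 2003] with graph similarities S_ij (the paper's exact notation may differ):

```latex
% Sketch of a regularized instance-specific-distance objective (assumed form):
%   - each instance x_i has its own weight vector w_i,
%   - \ell is a convex loss over the cannot-link / must-link side information,
%   - the second term propagates metrics along the graph with similarities S_{ij}.
\min_{\{\mathbf{w}_i\}} \;
\sum_{i \in \text{labeled}} \sum_{j} \ell\bigl(\mathbf{w}_i;\, \mathbf{x}_i, \mathbf{x}_j\bigr)
\;+\; \gamma \sum_{i, j} S_{ij}\, \bigl\lVert \mathbf{w}_i - \mathbf{w}_j \bigr\rVert^{2}.
```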
Our Methods The ISD Framework – relationship to FSM
• Although only pairwise side information is investigated in our work, the ISD framework is a general frame…
• If the pairwise side information is replaced with higher-order side information, such as triplet information, and L is set to the identity matrix, then FSM [Frome et al. NIPS'06] becomes a special case of ISD.
Our Methods The ISD Framework – update graph
(Flowchart: start from a predefined graph over the given structure, initialize the weights, learn the ISD weights, update the graph in the new ISD space, and repeat until the final ISD weights are obtained.)
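A minimal sketch of that loop follows; the helpers build_graph() and learn_isd_weights() are hypothetical placeholders for the paper's graph construction and solver, and the ISD form w_i-weighted elementwise difference is an assumption.

```python
# Hedged sketch of the graph-update loop from the flowchart.
import numpy as np

def instance_specific_distances(X, W):
    """Assumed ISD form: D[i, j] = w_i . |x_i - x_j| (elementwise)."""
    n = len(X)
    D = np.zeros((n, n))
    for i in range(n):
        D[i] = np.abs(X - X[i]).dot(W[i])
    return D

def metric_propagation(X, y, build_graph, learn_isd_weights, rounds=5):
    D = None                               # first graph comes from the predefined (e.g. Euclidean) structure
    W = np.ones((len(X), X.shape[1]))      # initial weights
    for _ in range(rounds):
        S = build_graph(X, D)              # graph in the current (ISD) space
        W = learn_isd_weights(X, y, S)     # solve the regularized problem
        D = instance_specific_distances(X, W)
    return W
```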
Our Methods ISD with L1-loss
• Introducing slack variables yields a convex problem.
• Solving it with respect to all w simultaneously is very challenging; the computational cost is prohibitive.
• We therefore employ the alternating descent method: we sequentially solve for one w (one instance) at a time while fixing the other ws, until convergence or the maximum number of iterations is reached.
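A schematic view of that alternating descent is given below; solve_subproblem() is a hypothetical stand-in for the per-instance convex subproblem, not the paper's exact QP.

```python
# Hedged sketch of alternating descent: fix all other weight vectors and
# solve for one w_i at a time until convergence or the iteration cap.
import numpy as np

def alternating_descent(X, y, solve_subproblem, max_iters=50, tol=1e-4):
    n, d = X.shape
    W = np.ones((n, d))                          # initial weights (e.g. Euclidean)
    for _ in range(max_iters):
        max_change = 0.0
        for i in range(n):                       # sequentially update one w_i
            w_new = solve_subproblem(i, X, y, W) # other ws held fixed
            max_change = max(max_change, np.max(np.abs(w_new - W[i])))
            W[i] = w_new
        if max_change < tol:                     # converged
            break
    return W
```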
Our Methods ISD with L1-loss (cont'd): the primal problem and its dual.
Our Methods Acceleration: ISD with L2-loss
• For acceleration: the alternating descent method is used to solve the problem, and the number of constraints is reduced by considering only some of the must-links.
• However, the number of inequality constraints may still be large.
• Inspired by nu-SVM, we can probably obtain a more efficient method:
Our Methods Acceleration: ISD with L2-loss (cont'd)
• In the dual, drop a linear equality constraint; after obtaining the optimization results, project the solution back onto the feasible region.
• Thus, the dual variables can be solved efficiently using Sequential Minimal Optimization.
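As a generic illustration of the drop-then-project idea (not the paper's exact dual), projecting a solution back onto a single linear equality constraint a·alpha = b is a closed-form step:

```python
# Hedged illustration of "drop the equality constraint, then project back".
# Projection of alpha onto the hyperplane {alpha : a . alpha = b}; whether any
# further clipping is needed depends on the actual dual, which is not shown here.
import numpy as np

def project_onto_hyperplane(alpha, a, b):
    """Return the closest point to alpha satisfying a @ alpha == b."""
    return alpha - a * (a @ alpha - b) / (a @ a)

# Example: after the relaxed (SMO-friendly) solve returns alpha, restore
# feasibility w.r.t. a dropped constraint such as sum(alpha) == 1.
alpha = np.array([0.4, 0.3, 0.5])
a = np.ones_like(alpha)
alpha_feasible = project_onto_hyperplane(alpha, a, 1.0)   # now sums to 1
```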
Outline • Introduction • Our Methods • Experiments • Conclusion
Experiments Configurations • Data sets: • 15 UCI data sets • COREL image dataset (20 classes, 100 images/class) • 2/3 labeled for training; 1/3 unlabeled for testing; 30 runs • Compared methods: • ISD-L1/L2 • FSM/FSSM (Frome et al., 2006 & 2007) • LMNN (Weinberger et al., 2005) • DNE (Zhang et al., 2007) • Parameters are selected via cross validation
Experiments Classification Performance: comparison of test error rates (mean±std.), with win/tie/loss counts of ISD vs. the other methods under a paired t-test at the 95% significance level.
Experiments Influence of the number of iteration rounds (updating rounds, starting from the Euclidean distance) • The error rates of ISD-L1 are reduced on most datasets as the number of updating rounds increases. • The error rates of ISD-L2 are reduced on some datasets; however, on others the performance degenerates. Overfitting: the L2-loss is more sensitive to noise.
Experiments Influence of the amount of labeled data • ISD is less sensitive to the amount of labeled data. • When the amount of labeled samples is limited, the superiority of ISD is more apparent.
Conclusion Main contribution: • A method for learning instance-specific distance for labeled as well as unlabeled instances. Future work: • The construction of the initial graph • Label propagation, metric propagation, … any more properties to propagate? Thanks!