A Survey on Distance Metric Learning (Part 2) Gerry Tesauro IBM T.J.Watson Research Center
Acknowledgement • Lecture material shamelessly adapted from the following sources: • Kilian Weinberger: • “Survey on Distance Metric Learning” slides • IBM summer intern talk slides (Aug. 2006) • Sam Roweis slides (NIPS 2006 workshop on “Learning to Compare Examples”) • Yann LeCun talk slides (CVPR 2005, 2006)
Outline – Part 2 • Neighbourhood Components Analysis (Goldberger et al.), Metric Learning by Collapsing Classes (Globerson & Roweis) • Metric Learning for Kernel Regression (Weinberger & Tesauro) • Metric learning for RL basis function construction (Keller et al.) • Similarity learning for image processing (LeCun et al.)
Neighbourhood Components Analysis: distance metric for visualization and kNN (Goldberger et al., 2004)
Metric Learning for Kernel Regression (Weinberger & Tesauro, AISTATS 2007)
Killing three birds with one stone: we construct a single method that (1) performs linear dimensionality reduction, (2) generates a meaningful distance metric, and (3) is optimally tuned for distance-based kernel regression
Kernel Regression • Given a training set {(x_j, y_j), j = 1,…,N}, where x is a d-dimensional vector and y is real-valued, estimate the value of a test point x_i by a weighted average of the samples: $\hat{y}_i = \frac{\sum_{j \neq i} y_j k_{ij}}{\sum_{j \neq i} k_{ij}}$, where k_ij = k_D(x_i, x_j) is a distance-based kernel function using distance metric D
Choice of Kernel • Many functional forms for k_ij can be used in MLKR; our empirical work uses the Gaussian kernel $k_{ij} = \exp\big(-D(x_i, x_j)/\sigma^2\big)$, where σ is a kernel width parameter (we can set σ = 1 W.L.O.G. since we learn D) • The resulting softmax regression estimate is similar to Roweis' softmax classifier
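To make the estimator concrete, here is a minimal NumPy sketch of the weighted-average prediction under this Gaussian kernel and a learned linear map A (the function and variable names are illustrative, not from the paper):

```python
import numpy as np

def kernel_regression_estimate(A, X, y, x_test):
    """Kernel regression under the learned metric D(x, x') = ||A(x - x')||^2."""
    diffs = X - x_test                           # (N, d) differences to the test point
    dists = np.sum((diffs @ A.T) ** 2, axis=1)   # squared distances under the metric
    k = np.exp(-dists)                           # Gaussian kernel weights (sigma = 1)
    return k @ y / k.sum()                       # weighted average of training targets
```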
Distance Metric for Nearest Neighbor Regression Learn a linear transformation that allows us to estimate the value of a test point from its nearest neighbors
Mahalanobis Metric The distance function is a pseudo-Mahalanobis metric $D(x_i, x_j) = (x_i - x_j)^\top M (x_i - x_j) = \|A(x_i - x_j)\|^2$ with $M = A^\top A$ (generalizes Euclidean distance, which is recovered when A = I)
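A quick numerical check of this equivalence between the M-form and the A-form of the distance (purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 5))          # linear map; M = A^T A is positive semidefinite
x_i, x_j = rng.standard_normal(5), rng.standard_normal(5)

M = A.T @ A
d_M = (x_i - x_j) @ M @ (x_i - x_j)      # Mahalanobis form
d_A = np.sum((A @ (x_i - x_j)) ** 2)     # equivalent form ||A(x_i - x_j)||^2
assert np.isclose(d_M, d_A)
```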
General Metric Learning Objective • Find a parameterized distance function D_θ that minimizes the total leave-one-out cross-validation loss $\mathcal{L} = \sum_i (\hat{y}_i - y_i)^2$ • e.g. params θ = elements A_ij of the A matrix • Since we're solving for A, not M, the optimization is non-convex ⇒ use gradient descent
Gradient Computation $\frac{\partial \mathcal{L}}{\partial A} = 4A \sum_i (\hat{y}_i - y_i) \sum_{j \neq i} (\hat{y}_i - y_j)\, \frac{k_{ij}}{\sum_{l \neq i} k_{il}}\, x_{ij} x_{ij}^\top$, where x_ij = x_i - x_j • For a fast implementation: • Don't sum over all i-j pairs; only go up to ~1000 nearest neighbors for each sample i • Maintain nearest neighbors in a heap-tree structure; update the heap tree every 15 gradient steps • Ignore sufficiently small values of k_ij (< e^{-34}) • Even better data structures: cover trees, k-d trees
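Below is a naive NumPy sketch of this gradient, written directly from the chain rule on the leave-one-out loss above; it omits the nearest-neighbor truncation and heap bookkeeping listed on this slide, and all names are illustrative:

```python
import numpy as np

def mlkr_loss_gradient(A, X, y):
    """Gradient of L = sum_i (yhat_i - y_i)^2 w.r.t. A; naive O(N^2) version."""
    Z = X @ A.T                                       # (N, r) projected points
    d2 = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=2)
    K = np.exp(-d2)                                   # Gaussian kernel matrix
    np.fill_diagonal(K, 0.0)                          # leave-one-out: drop k_ii
    W = K / K.sum(axis=1, keepdims=True)              # normalized weights
    yhat = W @ y                                      # leave-one-out predictions
    grad = np.zeros_like(A)
    for i in range(len(X)):
        x_ij = X[i] - X                               # (N, d) rows x_i - x_j
        coef = (yhat[i] - y[i]) * (yhat[i] - y) * W[i]
        grad += 4 * A @ ((x_ij.T * coef) @ x_ij)      # sum_j coef_j * x_ij x_ij^T
    return grad
```

In a full implementation, the inner sum over j would be restricted to each point's nearest neighbors, as the slide describes.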
Learned Distance Metric example [figure: points within distance < 1 under the original Euclidean metric vs. under the learned metric D]
“Twin Peaks” test • Training: n = 8000 • We added 3 dimensions with 1000% noise • We rotated 5 dimensions randomly
[figures: per-dimension input variance and output variance, separating signal from noise dimensions]
DimReduction with MLKR • FG-NET face data: 82 persons, 984 face images w/age
DimReduction with MLKR PowerManagement data (d=21) • Force A to be rectangular • Project onto eigenvectors of A • Allows visualization of data
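With a rectangular A (say 2 × d), the embedding used for these visualizations is simply the linear projection Ax; a minimal sketch, assuming matplotlib and illustrative names:

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_mlkr_embedding(A, X, y):
    """Scatter plot of the data projected by a rectangular A of shape (2, d)."""
    Z = X @ A.T                              # (N, 2) low-dimensional embedding
    plt.scatter(Z[:, 0], Z[:, 1], c=y)       # color points by regression target
    plt.colorbar(label="target y")
    plt.show()
```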
Robot arm results (8- and 32-dimensional) [figure: regression error]
Unity Data Center Prototype [architecture diagram: a Resource Arbiter allocates 8 xSeries servers among three application managers; two Trade3 applications (WebSphere 5.1 + DB2) driven by HTTP demand (req/sec) plus a Batch workload; each manager converts its SLA(Resp. Time) into a Value(#srvrs) estimate, and the arbiter reallocates every 5 sec to maximize total SLA revenue] • Objective: Learn long-range resource value estimates for each application manager • State variables (~48): • Arrival rate • ResponseTime • QueueLength • iatVariance • rtVariance • Action: # of servers allocated by Arbiter • Reward: SLA(Resp. Time) (Tesauro, AAAI 2005; Tesauro et al., ICAC 2006)
Power & Performance Management • Objective: manage systems toward multi-discipline objectives: minimize Resp. Time and minimize Power Usage • State variables (21): • Power Cap • Power Usage • CPU Utilization • Temperature • # of requests arrived • Workload intensity (# Clients) • Response Time • Action: Power Cap • Reward: SLA(Resp. Time) – Power Usage (Kephart et al., ICAC 2007)
IBM Regression Results [table: test-error comparison; MLKR: 14/47, 3/5, 10/22]
Metric Learning for RL basis function construction (Keller et al. ICML 2006) • RL Dataset of state-action-reward tuples {(si, ai, ri), i=1,…,N}
Value Iteration • Define an iterative “bootstrap” calculation: $V_{t+1}(s) = \max_a \big[ r(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, V_t(s') \big]$ • Each round of VI must iterate over all states in the state space • Try to speed this up using state aggregation (Bertsekas & Castanon, 1989) • Idea: Use NCA to aggregate states: • project states into a lower-dim representation; keep states with similar Bellman error close together • use projected states to define a set of basis functions {Φ_i} • learn a linear value function over the basis functions: $V = \sum_i \theta_i \Phi_i$
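For reference, a tabular sketch of the bootstrap update above, assuming a known transition model P and reward table R (both hypothetical inputs; the aggregation and basis-function machinery is not shown):

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-6):
    """P: (A, S, S) transition probabilities; R: (S, A) expected rewards."""
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    while True:
        # Bellman backup: Q(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) V(s')
        Q = R + gamma * np.einsum('ast,t->sa', P, V)
        V_new = Q.max(axis=1)                 # greedy over actions
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```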
Chopra et al. 2005 Similarity metric for image verification. Problem: Given a pair of face images, decide if they are from the same person. Too difficult for a linear mapping!