A Survey on Distance Metric Learning (Part 2)
Gerry Tesauro, IBM T.J. Watson Research Center
Acknowledgement
• Lecture material shamelessly adapted from the following sources:
• Kilian Weinberger: “Survey on Distance Metric Learning” slides; IBM summer intern talk slides (Aug. 2006)
• Sam Roweis slides (NIPS 2006 workshop on “Learning to Compare Examples”)
• Yann LeCun talk slides (CVPR 2005, 2006)
Outline – Part 2
• Neighbourhood Components Analysis (Goldberger et al.), Metric Learning by Collapsing Classes (Globerson & Roweis)
• Metric Learning for Kernel Regression (Weinberger & Tesauro)
• Metric learning for RL basis function construction (Keller et al.)
• Similarity learning for image processing (LeCun et al.)
Neighbourhood Components Analysis
• Distance metric for visualization and kNN (Goldberger et al., 2004); a sketch of the objective follows below
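Since the slide does not spell out the NCA details, here is a minimal sketch of its objective under the standard formulation from the 2004 paper: point i stochastically picks neighbor j with softmax probability p_ij over negative squared distances in the transformed space, and the learner maximizes the expected number of points whose stochastic neighbor shares their class. Function and variable names are my own.

```python
# Minimal sketch of the NCA objective (expected leave-one-out kNN accuracy);
# the softmax neighbor probabilities follow Goldberger et al. 2004, but the
# function and variable names here are illustrative.
import numpy as np

def nca_objective(X, labels, A):
    """X: (N, d) inputs, labels: (N,) class labels, A: (r, d) linear map."""
    Z = X @ A.T                                           # transformed points
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)   # squared distances
    np.fill_diagonal(d2, np.inf)                          # i never picks itself
    P = np.exp(-d2)
    P /= P.sum(axis=1, keepdims=True)                     # p_ij: row-wise softmax
    same_class = labels[:, None] == labels[None, :]
    return (P * same_class).sum()   # maximize w.r.t. A (e.g. by gradient ascent)
```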
Metric Learning for Kernel Regression (Weinberger & Tesauro, AISTATS 2007)
Killing three birds with one stone: we construct a single method that (1) performs linear dimensionality reduction, (2) generates a meaningful distance metric, and (3) is optimally tuned for distance-based kernel regression.
Kernel Regression
• Given a training set {(x_j, y_j), j = 1, …, N}, where each x is a d-dimensional vector and y is real-valued, estimate the value of a test point x_i by a weighted average of the samples:
$\hat{y}_i = \frac{\sum_{j \neq i} k_{ij}\, y_j}{\sum_{j \neq i} k_{ij}}$
where k_ij = k_D(x_i, x_j) is a distance-based kernel function using distance metric D
Choice of Kernel
• Many functional forms for k_ij can be used in MLKR; our empirical work uses the Gaussian kernel
$k_{ij} = \exp\!\big(-D(x_i, x_j)/\sigma^2\big)$
where σ is a kernel width parameter (can set σ = 1 w.l.o.g. since we learn D); see the sketch below
• The resulting softmax regression estimate is similar to Roweis’ softmax classifier
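A minimal sketch of this leave-one-out estimate, assuming the Mahalanobis-style distance D(x_i, x_j) = ‖A(x_i − x_j)‖² introduced on the next slide and σ = 1; the NumPy usage and all names are my own choices.

```python
# Leave-one-out kernel regression estimate with a Gaussian kernel; a sketch
# assuming D(x_i, x_j) = ||A (x_i - x_j)||^2 and sigma = 1 (names are mine).
import numpy as np

def mlkr_estimates(X, y, A):
    """Return y_hat_i for every training point, excluding j = i from each sum."""
    Z = X @ A.T                                           # transformed inputs
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)   # pairwise D(x_i, x_j)
    K = np.exp(-d2)                                       # k_ij
    np.fill_diagonal(K, 0.0)                              # leave-one-out: no j = i
    return (K @ y) / K.sum(axis=1)
```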
Distance Metric for Nearest Neighbor Regression
• Learn a linear transformation that allows us to estimate the value of a test point from its nearest neighbors
Mahalanobis Metric
• Distance function is a pseudo-Mahalanobis metric: $D(x_i, x_j) = (x_i - x_j)^\top M (x_i - x_j)$ with $M = A^\top A \succeq 0$ (sketched below)
• Generalizes Euclidean distance (the special case A = I)
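A tiny sketch of this distance with my own naming; the “pseudo” qualifier matters when A is rank-deficient, since distinct points can then sit at distance zero.

```python
# Squared (pseudo-)Mahalanobis distance D(x_i, x_j) = (x_i - x_j)^T M (x_i - x_j)
# with M = A^T A; a sketch with illustrative names.
import numpy as np

def mahalanobis_sq(xi, xj, A):
    diff = A @ (xi - xj)          # equivalent to using M = A^T A directly
    return float(diff @ diff)     # with A = identity: squared Euclidean distance
```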
General Metric Learning Objective
• Find a parameterized distance function D_θ that minimizes the total leave-one-out cross-validation loss $\mathcal{L} = \sum_i (y_i - \hat{y}_i)^2$
• e.g. params θ = elements A_ij of the A matrix
• Since we’re solving for A, not M, the optimization is non-convex: use gradient descent
Gradient Computation
• Differentiating the leave-one-out loss gives
$\frac{\partial \mathcal{L}}{\partial A} = 4A \sum_i (\hat{y}_i - y_i) \sum_{j \neq i} w_{ij}\,(\hat{y}_i - y_j)\, x_{ij} x_{ij}^\top$
where $w_{ij} = k_{ij} / \sum_{l \neq i} k_{il}$ and x_ij = x_i – x_j (a code sketch follows below)
• For fast implementation:
• Don’t sum over all i-j pairs; only go up to ~1000 nearest neighbors for each sample i
• Maintain nearest neighbors in a heap-tree structure; update the heap tree every 15 gradient steps
• Ignore sufficiently small values of k_ij (< e^{-34})
• Even better data structures: cover trees, k-d trees
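A compact sketch of the loss and this gradient, with my own vectorization: the double sum over x_ij x_ij^T is rewritten through a Laplacian-style identity. This is the naive O(N²) version, without any of the nearest-neighbor speedups listed above, and all names are illustrative.

```python
# MLKR loss L = sum_i (y_hat_i - y_i)^2 and its gradient w.r.t. A; a naive
# O(N^2) sketch without the nearest-neighbor speedups (names are mine).
import numpy as np

def mlkr_loss_and_grad(X, y, A):
    Z = X @ A.T
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    K = np.exp(-d2)
    np.fill_diagonal(K, 0.0)
    W = K / K.sum(axis=1, keepdims=True)        # w_ij = k_ij / sum_l k_il
    y_hat = W @ y
    err = y_hat - y                             # (y_hat_i - y_i)
    # c_ij = (y_hat_i - y_i) * w_ij * (y_hat_i - y_j)
    C = err[:, None] * W * (y_hat[:, None] - y[None, :])
    # sum_ij c_ij x_ij x_ij^T  ==  X^T (diag(row sums + col sums) - C - C^T) X
    S = np.diag(C.sum(axis=1) + C.sum(axis=0)) - C - C.T
    return float(err @ err), 4.0 * A @ (X.T @ S @ X)

# One plain gradient-descent step (step size purely illustrative):
# loss, g = mlkr_loss_and_grad(X, y, A); A = A - 1e-3 * g
```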
Learned Distance Metric example
[Figure: unit ball under the original Euclidean distance (D < 1) vs. under the learned distance (D < 1)]
“Twin Peaks” test
• Training: n = 8000
• We added 3 dimensions with 1000% noise
• We rotated 5 dimensions randomly
[Figures: “Input Variance” (noise dimensions dominate the signal) and “Output Variance” (signal dominates the noise after the learned transformation)]
DimReduction with MLKR
• FG-NET face data: 82 persons, 984 face images with age labels
DimReduction with MLKR
• PowerManagement data (d = 21)
• Force A to be rectangular
• Project onto eigenvectors of A
• Allows visualization of data (a projection sketch follows below)
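A hedged sketch of this visualization step; I read “eigenvectors of A” as the leading eigenvectors of M = A^T A, which is the natural object when A is rectangular. All names are illustrative.

```python
# Project data onto the top eigenvectors of M = A^T A for low-dim plotting;
# a sketch under the assumption that this is the intended projection.
import numpy as np

def project_for_viz(X, A, r=2):
    M = A.T @ A
    evals, evecs = np.linalg.eigh(M)             # eigenvalues in ascending order
    top = evecs[:, np.argsort(evals)[::-1][:r]]  # r leading eigenvectors
    return X @ top                               # (N, r) coordinates to plot
```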
Robot arm results (8- and 32-dim)
[Figure: regression error]
Unity Data Center Prototype
• Objective: learn long-range resource value estimates for each application manager; maximize total SLA revenue
• State variables (~48): arrival rate, ResponseTime, QueueLength, iatVariance, rtVariance
• Action: # of servers allocated by the Arbiter
• Reward: SLA(Resp. Time)
[Diagram: a Resource Arbiter allocating 8 xSeries servers among App Managers; labels include Demand (HTTP req/sec), Value(#srvrs), Value(RT), SLA, WebSphere 5.1, DB2, Trade3, Batch, 5 sec]
(Tesauro, AAAI 2005; Tesauro et al., ICAC 2006)
Power & Performance Management
• Objective: manage systems to multi-discipline objectives: minimize response time and minimize power usage
• State variables (21): Power Cap, Power Usage, CPU Utilization, Temperature, # of requests arrived, Workload intensity (# Clients), Response Time
• Action: Power Cap
• Reward: SLA(Resp. Time) – Power Usage
(Kephart et al., ICAC 2007)
IBM Regression Results
[Figure: test error; values reported for MLKR: 14/47, 3/5, 10/22]
Metric Learning for RL basis function construction (Keller et al., ICML 2006)
• RL dataset of state-action-reward tuples {(s_i, a_i, r_i), i = 1, …, N}
Value Iteration
• Define an iterative “bootstrap” calculation (a minimal code sketch follows below):
$V_{k+1}(s) = \max_a \big[\, r(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, V_k(s') \,\big]$
• Each round of VI must iterate over all states in the state space
• Try to speed this up using state aggregation (Bertsekas & Castanon, 1989)
• Idea: use NCA to aggregate states:
• project states into a lower-dimensional representation, keeping states with similar Bellman error close together
• use the projected states to define a set of basis functions {φ_i}
• learn a linear value function over the basis functions: V = Σ_i θ_i φ_i
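A minimal tabular sketch of this bootstrap update; the transition and reward arrays, discount, and tolerance are illustrative placeholders. Note that each sweep touches every state, which is exactly what motivates the aggregation idea above.

```python
# Tabular value iteration: V_{k+1}(s) = max_a [ r(s,a) + gamma * E[V_k(s')] ].
# P, R, gamma, and the convergence tolerance are illustrative placeholders.
import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-6):
    """P: (S, A, S) transition probabilities; R: (S, A) expected rewards."""
    n_states = P.shape[0]
    V = np.zeros(n_states)
    while True:
        Q = R + gamma * (P @ V)        # Q[s, a]; one sweep over ALL states
        V_new = Q.max(axis=1)          # greedy "bootstrap" backup
        if np.abs(V_new - V).max() < tol:
            return V_new
        V = V_new
```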
Chopra et al. 2005: Similarity Metric for Image Verification
• Problem: given a pair of face images, decide if they are from the same person
• Too difficult for a linear mapping!
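The slides do not reproduce the loss, so here is a hedged sketch of a contrastive-style pairwise loss in the spirit of this line of work (the precise energy-based loss in Chopra et al. 2005 differs); E would be the energy ‖G_W(x1) − G_W(x2)‖ produced by a learned, nonlinear (e.g. convolutional) mapping G_W, which addresses the “too difficult for a linear mapping” point.

```python
# Contrastive-style pairwise loss sketch: pull genuine pairs together, push
# impostor pairs beyond a margin. This is NOT the exact Chopra et al. loss.
import numpy as np

def contrastive_loss(E, same, margin=1.0):
    """E: (P,) pairwise energies; same: (P,) 1 for genuine pairs, 0 impostors."""
    genuine = E ** 2                               # shrink energy of true pairs
    impostor = np.maximum(0.0, margin - E) ** 2    # grow energy of false pairs
    return np.where(same == 1, genuine, impostor).mean()
```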