Low-Rank Kernel Learning with Bregman Matrix Divergences. Brian Kulis, Matyas A. Sustik and Inderjit S. Dhillon. Journal of Machine Learning Research 10 (2009) 341-376. Presented by: Peng Zhang, 4/15/2011.
Outline • Motivation • Major Contributions • Preliminaries • Algorithms • Discussions • Experiments • Conclusions
Motivation • Low-rank matrix nearness problems • Learning low-rank positive semidefinite (kernel) matrices for machine learning applications • Divergences (distances) between data objects • Find divergence measures suited to such matrices • Efficiency • Positive semidefinite (PSD), and often low-rank, kernel matrices are ubiquitous in machine learning with kernel methods • Existing learning techniques enforce the PSD constraint explicitly, which leads to expensive computations • Bypass that explicit constraint by using divergences that enforce positive semidefiniteness automatically
Major Contributions • Goal • Efficient algorithms that find a PSD (kernel) matrix as 'close' as possible to an input PSD matrix under equality or inequality constraints • Proposals • Use the LogDet divergence or the von Neumann divergence as the nearness measure in PSD matrix learning • Use Bregman projections to handle the constraints • Properties of the proposed algorithms • Range-space preserving: the rank of the output equals the rank of the input, so the rank is never increased or decreased • Computationally efficient: each iteration runs in time linear in the number of data points n and quadratic in the rank of the input kernel
Preliminaries • Kernel methods • Inner products in feature space • The only information needed is the kernel matrix K • K is always PSD • If K is low rank, use a low-rank decomposition to improve computational efficiency • This motivates low-rank kernel matrix learning (relations summarized below)
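For reference, the standard kernel-method relations behind these bullets (textbook facts, not specific to this paper; G and r are the usual factor notation):

K_{ij} = \phi(x_i)^\top \phi(x_j), \qquad K \succeq 0, \qquad \operatorname{rank}(K) = r \ll n \;\Rightarrow\; K = G G^\top,\; G \in \mathbb{R}^{n \times r}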
Preliminaries • Bregman vector divergences • Extension to Bregman matrix divergences • Intuitively, these can be thought of as the difference between the value of F at point x and the value of the first-order Taylor expansion of F around point y, evaluated at point x
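The definitions behind these bullets, restated here because the slide's formulas did not survive extraction (standard forms; \varphi denotes the generating function, the slide's F), for strictly convex and differentiable \varphi:

D_\varphi(x, y) = \varphi(x) - \varphi(y) - \nabla\varphi(y)^\top (x - y)

D_\varphi(X, Y) = \varphi(X) - \varphi(Y) - \operatorname{tr}\!\big(\nabla\varphi(Y)^\top (X - Y)\big)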
Preliminaries • Special Bregman matrix divergences • The von Neumann divergence (DvN) • The LogDet divergence (Dld) • Both are stated here for full-rank matrices
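For concreteness, the two special divergences in closed form (standard expressions for n x n positive definite X, Y):

D_{vN}(X, Y) = \operatorname{tr}(X \log X - X \log Y - X + Y)

D_{\ell d}(X, Y) = \operatorname{tr}(X Y^{-1}) - \log\det(X Y^{-1}) - n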
Preliminaries • Important properties of DvN and Dld • The divergences are defined over positive (semi)definite matrices, so positive semidefiniteness is enforced implicitly, with no explicit constraint • Range-space preserving property • Scale invariance of LogDet • Transformation invariance (identities below) • Others • Goes beyond the transductive setting: the learned kernel can be evaluated on new data points
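Two of the listed properties written out (standard identities for the LogDet divergence; \alpha > 0 a scalar, M any invertible matrix):

D_{\ell d}(\alpha X, \alpha Y) = D_{\ell d}(X, Y) \quad \text{(scale invariance)}

D_{\ell d}(M^\top X M, M^\top Y M) = D_{\ell d}(X, Y) \quad \text{(invariance under invertible transformations)}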
Preliminaries • Spectral Bregman matrix divergence • The generating convex function is a spectral function: it depends on the matrix only through its eigenvalues, via a convex scalar function • The resulting Bregman matrix divergence can be written in terms of the eigenvalues and eigenvectors of the two matrices (see below)
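A sketch of that eigen-expression, assuming \varphi(X) = \sum_i \hat\varphi(\lambda_i) for a convex scalar function \hat\varphi, with eigendecompositions X = V \Lambda V^\top and Y = U \Theta U^\top (a standard derivation, stated here for orientation):

D_\varphi(X, Y) = \sum_{i,j} (v_i^\top u_j)^2 \big( \hat\varphi(\lambda_i) - \hat\varphi(\theta_j) - \hat\varphi'(\theta_j)(\lambda_i - \theta_j) \big)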
Preliminaries • Kernel matrix learning problem of this paper (formal statement below) • With an explicit rank constraint the problem is non-convex • When using the LogDet or von Neumann divergence, the rank constraint is enforced implicitly, so it can be dropped and the problem becomes convex • The constraints of interest are squared Euclidean distances between points in feature space • Each such constraint matrix A is rank one, so the problem can be written with rank-one constraints • Overall: learn a kernel matrix over all data points from side information (labels or pairwise constraints)
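A hedged restatement of the problem, with K_0 the input kernel and (A_i, b_i) the constraint data (notation assumed to match the paper):

\min_{K \succeq 0} \; D_\varphi(K, K_0) \quad \text{s.t.} \quad \operatorname{tr}(K A_i) \le b_i,\; i = 1, \dots, c, \qquad \operatorname{rank}(K) \le r

A squared-distance constraint between points i and j takes the rank-one form

K_{ii} + K_{jj} - 2 K_{ij} \le b \;\Longleftrightarrow\; \operatorname{tr}\!\big(K (e_i - e_j)(e_i - e_j)^\top\big) \le b, \qquad A = (e_i - e_j)(e_i - e_j)^\top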
Preliminaries • Bregman projections • A method to solve the version of the previous problem without the rank constraint • Choose one constraint at a time and perform a Bregman projection so that the current solution satisfies that constraint (subproblem below) • With the LogDet and von Neumann divergences, the projections can be computed efficiently • Convergence is guaranteed, but may require many iterations
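The per-constraint subproblem, in symbols (shown for a single constraint; the corrections needed to handle inequality constraints exactly are omitted here):

K_{t+1} = \arg\min_{K} \; D_\varphi(K, K_t) \quad \text{s.t.} \quad \operatorname{tr}(K A_i) = b_i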
Preliminaries • Bregman divergences for low-rank matrices • Must deal with matrices that have zero eigenvalues • The divergence can become infinite for such matrices • A finite divergence therefore implies a constraint on the range space, and hence on the rank (conditions below)
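A hedged restatement of the finiteness conditions behind this bullet:

D_{vN}(X, Y) < \infty \;\Rightarrow\; \operatorname{range}(X) \subseteq \operatorname{range}(Y), \qquad D_{\ell d}(X, Y) < \infty \;\Rightarrow\; \operatorname{range}(X) = \operatorname{range}(Y)

so keeping the divergence to a rank-r input matrix finite implicitly constrains the rank of the solution.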
Preliminaries • Rank-deficient LogDet and von Neumann divergences • Rank-deficient Bregman projections for the von Neumann divergence and for the LogDet divergence
Algorithm Using LogDet • Cyclic projection algorithm using the LogDet divergence • The update for each projection can be simplified (sketch below): the range space is unchanged and no eigendecomposition is required • The dense update (Eq. (21) of the paper) costs O(n^2) operations per projection • Update efficiency is improved by working with a factored n x r matrix G • That update can be done with a Cholesky rank-one update, at O(r^3) complexity • It can be further improved to O(r^2) by combining the Cholesky rank-one update with the matrix multiplication
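A minimal sketch of the simplified projection update, derived here for a single equality constraint tr(K z z^T) = b and ignoring slack/dual bookkeeping (p and \beta are illustrative names, not necessarily the paper's):

p = z^\top K_t z, \qquad \beta = \frac{b - p}{p^2}, \qquad K_{t+1} = K_t + \beta\, K_t z z^\top K_t

Because the correction term lies in the range of K_t, the range space (and hence the rank) is preserved, and the dense update costs O(n^2).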
Algorithm Using LogDet • Factored updates: G = G0 B, where B is the product of the L matrices from every iteration and X0 = G0 G0^T is the input kernel • L can be determined implicitly (one way to see where L comes from is sketched below)
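One way to see where L comes from, assuming the kernel iterate is K_t = G_t G_t^\top with G_t = G_0 B (a hedged reconstruction consistent with the LogDet update above):

v = B^\top G_0^\top z, \qquad L L^\top = I + \beta\, v v^\top \;\text{(Cholesky rank-one update)}, \qquad B \leftarrow B L

Then G_0 B (I + \beta v v^\top) B^\top G_0^\top = K_t + \beta K_t z z^\top K_t, so updating the small r x r factor B reproduces the full kernel update.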
Algorithm Using LogDet • What are the constraints, and how is convergence determined? • One pass over all c constraints costs O(cr^2) • Convergence is checked by how much v has changed • The algorithm may require a large number of iterations • Recovering the final kernel factor from G0 and B costs O(nr^2)
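A minimal, illustrative Python/NumPy sketch of the cyclic-projection idea for the LogDet case, using the dense O(n^2)-per-projection update and equality constraints only; it omits the paper's factored O(r^2) implementation, slack variables, and dual corrections, and the function and variable names here are hypothetical:

```python
import numpy as np

def logdet_cyclic_projection(K0, constraints, n_cycles=50, tol=1e-6):
    """Cyclic Bregman projections under the LogDet divergence (equality constraints).

    K0          : initial PSD kernel matrix, shape (n, n)
    constraints : list of (i, j, b) tuples meaning K_ii + K_jj - 2*K_ij = b,
                  i.e. tr(K z z^T) = b with z = e_i - e_j
    """
    K = K0.copy()
    n = K.shape[0]
    for _ in range(n_cycles):
        max_violation = 0.0
        for (i, j, b) in constraints:
            z = np.zeros(n)
            z[i], z[j] = 1.0, -1.0
            Kz = K @ z                        # K z (column i minus column j of K)
            p = float(z @ Kz)                 # current value of tr(K z z^T)
            if p <= 1e-12:                    # z is (numerically) outside range(K); skip
                continue
            max_violation = max(max_violation, abs(b - p))
            beta = (b - p) / (p * p)          # projection parameter, equality case
            K += beta * np.outer(Kz, Kz)      # K <- K + beta * K z z^T K  (O(n^2))
        if max_violation < tol:
            break
    return K

# Toy usage: force the squared feature-space distance between points 0 and 1 to 0.5.
rng = np.random.default_rng(0)
G0 = rng.standard_normal((6, 3))
K0 = G0 @ G0.T                                # rank-3 initial kernel
K = logdet_cyclic_projection(K0, [(0, 1, 0.5)])
print(K[0, 0] + K[1, 1] - 2 * K[0, 1])        # ~0.5, constraint satisfied
print(np.linalg.matrix_rank(K))               # 3: range space (and rank) preserved
```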
Algorithm Using von Neumann • Cyclic projection algorithm using the von Neumann divergence • The update for each projection can be rewritten so that it is computed efficiently • The projection parameter is obtained by finding the unique root of a scalar function (the full-rank form of the update is sketched below)
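For orientation, the exponential form this projection takes in the full-rank case (a hedged restatement; the rank-deficient version is handled within the range space of the input):

K_{t+1} = \exp\!\big(\log K_t + \alpha\, z z^\top\big), \qquad \alpha \text{ the unique root of } f(\alpha) = \operatorname{tr}\!\big(\exp(\log K_t + \alpha z z^\top)\, z z^\top\big) - b

where exp and log denote the matrix exponential and logarithm.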
Algorithm Using von Neumann • Slightly slower than Algorithm 2 • The root-finding step slows down the process • Apart from root finding, each projection is O(r^2)
Discussions • Limitations of Algorithm 2 and Algorithm 3 • The initial kernel matrix must be low-rank, so the methods are not applicable for dimensionality reduction • The number of iterations may be large: this paper only optimizes the cost of each iteration, and reducing the total number of iterations is left as future work • Handling new data points • Transductive setting: all data points are available up front and some of them have labels or other supervision • When a new data point is added, the entire kernel matrix would have to be re-learned • Workaround: view B as a linear transformation and apply it to new points (sketched below)
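A hedged sketch of that workaround in symbols: the learned kernel has the form K = G_0 B B^\top G_0^\top, so if a new point x is given the factor-space representation g_x that G_0 would assign it, its learned similarity to another point y is

\kappa_{\text{learned}}(x, y) = g_x^\top B B^\top g_y

i.e. B acts as a learned linear transformation of the original low-rank features.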
Discussions • Generalizations to more constraints • Slack variables • When the number of constraints is large, the Bregman divergence minimization problem may have no feasible solution • Introducing slack variables allows constraints to be violated at a penalty • Similarity constraints and distance constraints (the usual forms are sketched below) • O(r^2) per projection • If arbitrary linear constraints are applied, the cost becomes O(nr)
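The constraint forms most likely meant here (the slide's formulas were lost in extraction, so this is a hedged reconstruction; b denotes the target value):

\text{Similarity:}\; K_{ij} \ge b \;\text{(similar pair)}\ \ \text{or}\ \ K_{ij} \le b \;\text{(dissimilar pair)}, \qquad \text{Distance:}\; K_{ii} + K_{jj} - 2 K_{ij} \le b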
Discussions • Special cases • DefiniteBoost optimization problem • Online PCA • Nearest correlation matrix problem • Minimizing the LogDet divergence and semidefinite programming (SDP) • The SDP relaxation of the minimum balanced cut problem can be solved via LogDet divergence minimization
Experiments • Transductive learning and clustering • Data sets • Digits: handwritten samples of digits 3, 8 and 9 from the UCI repository • GyrB: protein data set with three bacteria species • Spambase: 4601 email messages with 57 attributes and spam/not-spam labels • Nursery: 12960 instances with 8 attributes and 5 class labels • Classification: k-nearest neighbor classifier • Clustering: kernel k-means algorithm, evaluated with the normalized mutual information (NMI) measure
Experiments • Learn a kernel matrix using only constraints • Low-rank kernels learned by the proposed algorithms attain accurate clustering and classification • The original data are used to construct the initial kernel matrix • The more constraints used, the more accurate the results • Convergence • von Neumann divergence: convergence was attained in 11 cycles for 30 constraints and 105 cycles for 420 constraints • LogDet divergence: between 17 and 354 cycles
Simulation Results • Significant improvements • 0.948 classification accuracy • For DefiniteBoost, 3220 cycles were needed to reach convergence
Simulation Results • Rank 57 and rank 8 cases • LogDet needs fewer constraints • LogDet converges much more slowly (improving this is future work), but often has lower overall running time
Simulation Results • Metric learning and large-scale experiments • Learning a low-rank kernel with the same range space is equivalent to learning a linear transformation of the input data • The proposed algorithms are compared with metric learning algorithms • Metric learning by collapsing classes (MCML) • Large-margin nearest neighbor metric learning (LMNN) • Baseline: squared Euclidean distance
Conclusions • Developed LogDet/von Neumann divergence based algorithms for low-rank matrix nearness problems • Running times are linear in number of data points and quadratic in the rank of the kernel • The algorithms can be used in conjunction with a number of kernel-based learning algorithms