Evaluation of Distance Metrics for Recognition Based on Non-Negative Matrix Factorization
David Guillamet, Jordi Vitrià
Pattern Recognition Letters 24:1599-1605, June 2003
Presented by John Galeotti, Advanced Perception, March 23, 2004
Actually, Two ICPR’02 Papers
• Analyzing Non-Negative Matrix Factorization for Image Classification (David Guillamet, Bernt Schiele, Jordi Vitrià)
• Determining a Suitable Metric When Using Non-negative Matrix Factorization (David Guillamet, Jordi Vitrià)
Non-Negative Matrix Factorization
• TLA: NMF
• Used for dimensionality reduction
• V_{n×m} ≈ W_{n×r} H_{r×m}, with r < nm/(n+m) so that W and H together hold fewer entries than V (nr + rm < nm)
• V has non-negative training samples as its columns
• W contains the non-negative basis vectors
• H contains the non-negative coefficients to approximate each column of V using W
• Results similar in concept to PCA, but with non-negative “basis vectors”
NMF Distinguishing Properties
• Requires non-negative data
• Computationally expensive
• Part-based decomposition
  • Because only additive combinations of the original data are allowed
• Not an orthonormal basis
Different Decomposition Types
[Figure: PCA vs. NMF basis images for handwritten digits, at 20 and 50 dimensions]
Why not just use PCA?
• PCA is optimal for reconstruction
• PCA is not optimal for separation and recognition of classes
NMF Issues Addressed
• If/when is NMF better at dimensionality reduction than PCA for classification?
• Can combining PCA and NMF lead to better performance?
• What is the best distance metric to use with the non-orthonormal basis of NMF?
How NMF Works
• V_{n×m} ≈ W_{n×r} H_{r×m}, with r < nm/(n+m)
• Begin with an n×m matrix of training data V
  • Each column is a vectorized data point
• Randomly initialize W and H with positive values
• Iterate according to the update rules (the standard Lee & Seung multiplicative rules):
  H_{aμ} ← H_{aμ} Σ_i W_{ia} V_{iμ}/(WH)_{iμ}
  W_{ia} ← W_{ia} Σ_μ H_{aμ} V_{iμ}/(WH)_{iμ}, then normalize each column of W to sum to 1
How NMF Works
• In general, NMF requires the non-linear optimization of an objective function
• The update rules just given correspond to a popular objective function, and they are guaranteed to converge
• That objective function relates to the (Poisson) probability of generating the images in V from the bases W and encodings H:
  F = Σ_{i,μ} [ V_{iμ} log (WH)_{iμ} − (WH)_{iμ} ]
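To make the iteration concrete, here is a minimal numpy sketch of these multiplicative updates. This is my illustration, not the authors’ code; the names `nmf`, `n_iter`, and `eps` are chosen here.

```python
import numpy as np

def nmf(V, r, n_iter=200, eps=1e-9, seed=0):
    """Factor a non-negative (n x m) matrix V into W (n x r) and H (r x m)
    using Lee & Seung's multiplicative updates for the divergence objective."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, r)) + eps        # random positive initialization
    H = rng.random((r, m)) + eps
    for _ in range(n_iter):
        R = V / (W @ H + eps)           # elementwise ratios V_iu / (WH)_iu
        H *= W.T @ R                    # H_au <- H_au * sum_i W_ia R_iu
        R = V / (W @ H + eps)
        W *= R @ H.T                    # W_ia <- W_ia * sum_u R_iu H_au
        W /= W.sum(axis=0, keepdims=True)  # keep each basis column summing to 1
    return W, H
```

Calling `W, H = nmf(V, r=20)` on a matrix of vectorized digit images should produce parts-like bases in the spirit of the NMF panels shown earlier.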
NMF vs. PCA Experiments
• Dataset: 10 classes of natural textures
  • Clouds, grass, ice, trees, sand, sky, etc.
  • 932 color images total
• Each image tessellated into 10×10 patches
• 1000 patches for training, 1000 for testing
• Each patch classified as a single texture
• Raw feature vectors: color histograms
  • Each region histogrammed into 8 bins per color, 16 colors → 512-dimensional vectors (sketched below)
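For concreteness, here is one plausible way to compute such a histogram feature, assuming a joint 8×8×8 RGB binning (8³ = 512 bins); the paper’s exact binning may differ, and `rgb_histogram` is a name chosen here.

```python
import numpy as np

def rgb_histogram(patch, bins_per_channel=8):
    """Joint RGB histogram of an (h, w, 3) uint8 patch -> 512-d feature
    (8 bins per channel, 8**3 = 512 bins total), L1-normalized."""
    binned = (patch // (256 // bins_per_channel)).astype(np.int64)  # 0..7 per channel
    idx = (binned[..., 0] * bins_per_channel + binned[..., 1]) \
          * bins_per_channel + binned[..., 2]                       # flatten to 0..511
    hist = np.bincount(idx.ravel(), minlength=bins_per_channel ** 3)
    return hist / hist.sum()
```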
NMF vs. PCA Experiments
• Learn both NMF and PCA subspaces for each class of histogram
• For both NMF and PCA:
  • Project queries onto the learned subspaces of each class
  • Label each query by the subspace that best reconstructs it
• This seems like a poor scheme for NMF (other experiments allow better schemes); a sketch of the scheme follows
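A minimal sketch of this minimum-reconstruction-error labeling, assuming an orthonormal PCA basis; for NMF, encoding via non-negative least squares (`scipy.optimize.nnls`) is used here as a reasonable stand-in for whatever projection the authors used.

```python
import numpy as np
from scipy.optimize import nnls

def pca_error(U, mean, x):
    """Reconstruction error of x from an orthonormal PCA basis U (n x r)."""
    c = U.T @ (x - mean)
    return np.linalg.norm((x - mean) - U @ c)

def nmf_error(W, x):
    """Reconstruction error of x from an NMF basis W (n x r),
    encoding x with non-negative least squares."""
    h, _ = nnls(W, x)
    return np.linalg.norm(x - W @ h)

def classify(x, models):
    """models: class label -> ('pca', (U, mean)) or ('nmf', W).
    Returns the label whose subspace best reconstructs x."""
    def err(m):
        kind, params = m
        return pca_error(*params, x) if kind == 'pca' else nmf_error(params, x)
    return min(models, key=lambda label: err(models[label]))
```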
NMF vs. PCA Results
• NMF works best for dispersed classes
• PCA works best for compact classes
• Both seem useful… try combining them
• But why are fewer than half of the sky vectors best reconstructed by PCA, when for sky PCA’s mean reconstruction error is less than 1/4 that of NMF? Mistakes?
NMF+PCA Experiments
• During training, we learned whether NMF or PCA worked best for each class
• Project a query onto a class using only the method that works best for that class
• Result: a 2.3% improvement in recognition rate over NMF alone (5.8% over PCA alone); but is this significant when the overall rate is around 60%?
Hierarchy Experiments
• At level k of the hierarchy, project the query onto each original class’s NMF or PCA subspace
• But to choose the direction to descend the hierarchy, we only care about the level-k super-class containing the matching class
• Furthermore, for each class the choice of PCA vs. NMF can be set independently at each level of the hierarchy
Hierarchy Results
• 2% improvement in recognition rate
• I really suspect that this is insignificant and results only from the additional degrees of freedom
• They employ various additional neighborhood-based hacks to increase their accuracy further, but I don’t see any relevance to NMF specifically
Need for a Better Metric
• Want to classify based on nearest neighbors rather than reprojection error
• Unfortunately, NMF generates a non-orthonormal basis, so the relative distance to a basis vector depends on the uniqueness of that basis vector
• Basis vectors will share many pixels in common areas
Earth Mover’s Distance (EMD)
• Defined as the minimal amount of “work” that must be performed to transform one feature distribution into the other
• A special case of the “transportation problem” from linear optimization
  • Let I = the set of suppliers, J = the set of consumers, c_ij = the cost to ship from i ∈ I to j ∈ J, and f_ij = the amount shipped from i to j
• Distance = cost to make the two datasets equal
Earth Mover’s Distance (EMD)
• Based on finding a measure of correlation between bases to define its cost matrix
• The cost matrix weights the transition from one basis vector (b_i) to another (b_j)
• c_ij = distangle(b_i, b_j) = −(b_i · b_j)/(‖b_i‖ ‖b_j‖)
EMD: Transportation Problem
• f_ij = the quantity shipped from i to j
• Consumers don’t ship: f_ij ≥ 0
• Don’t exceed demand: Σ_i f_ij ≤ d_j
• Don’t exceed supply: Σ_j f_ij ≤ s_i
• Total demand must equal total supply for EMD to be a metric (see the sketch below)
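A minimal sketch of EMD as this transportation LP, solved with scipy’s `linprog` rather than Rubner’s C code; `distangle_cost` implements the angle-based cost from the previous slide, and `emd` assumes the two histograms have equal total mass, per the last bullet above.

```python
import numpy as np
from scipy.optimize import linprog

def distangle_cost(B):
    """Cost matrix over the columns of an NMF basis B (n x r):
    c_ij = -(b_i . b_j) / (||b_i|| ||b_j||), per the previous slide."""
    Bn = B / np.linalg.norm(B, axis=0, keepdims=True)
    return -(Bn.T @ Bn)

def emd(s, d, C):
    """EMD between non-negative histograms s and d (equal totals), solved as
    a transportation LP: minimize sum_ij C_ij f_ij subject to row sums = s,
    column sums = d, f >= 0."""
    k = len(s)
    A_eq, b_eq = [], []
    for i in range(k):                    # supply constraints (row sums)
        row = np.zeros(k * k)
        row[i * k:(i + 1) * k] = 1        # f_ij flattened row-major
        A_eq.append(row)
        b_eq.append(s[i])
    for j in range(k):                    # demand constraints (column sums)
        col = np.zeros(k * k)
        col[j::k] = 1
        A_eq.append(col)
        b_eq.append(d[j])
    res = linprog(C.ravel(), A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=(0, None))
    return res.fun / np.sum(s)            # cost per unit of mass moved
```

For example, `emd(h1, h2, distangle_cost(W))` compares two r-dimensional NMF encodings h1 and h2 over the basis W.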
EMD vs. “Other” Experiments
• Digit recognition from the MNIST digit database
  • 60,000 training images + 10,000 test images
• Classify by NN and 5-NN in the subspace
• Result: EMD works best in low-dimensional subspaces, but does not work well in high-dimensional subspaces
• More specifically, EMD works well when the bases contain some intersecting pixels
Occlusion Experiments
• Randomly occlude either 1 or 2 of the 4 quadrants of an image (25% or 50% occlusion)
• Why does distangle do so well?
Demo
• NMF difficulties
• EMD experiments instead
• Demonstrate using existing code within the desired framework of a cost matrix
• Their code: http://robotics.stanford.edu/~rubner/emd/default.htm
• My code: http://www.vialab.org/john/Pres9-code/
Conclusion
• NMF is a parts-based alternative to PCA
• NMF and PCA should be combined for minimum-reprojection-error classification
• For nearest-neighbor classification, NMF needs a better metric
• When the subspace dimensionality is chosen appropriately for good bases, NMF+EMD or NMF+distangle gives the highest recognition rates