190 likes | 599 Views
Explaining High-Dimensional Data. Hoa Nguyen, Rutgers University Mentors: Ofer Melnik Kobbi Nissim. High-Dimensional Data. A great deal of data from different domains (medicine, finance, science) is high-dimensional. High-dimensional data is hard to visualize and understand.
E N D
Explaining High-Dimensional Data Hoa Nguyen, Rutgers University Mentors: Ofer Melnik Kobbi Nissim
High-Dimensional Data • A great deal of data from different domains (medicine, finance, science) is high-dimensional. • High-dimensional data is hard to visualize and understand.
Example: Given a set of images (represented by 8x8 pixel matrices), we could consider each image as a point in 64-dimensional space.
Analyzing Data • Instead of visualizing, we find properties of data to describe it. • Statistics, Data Mining, Machine Learning: • Finding properties of data directly • Finding models to capture data
Classifier • Given a set of data points with assigned labels. • Build a model for the data. • Use this model to label new unclassified data points.
Example: Given a set of data points in 2 dimensional-space with a + or – label for each point: (xi,yi)± We are interested in classifiers that take all the points of a class, and enclose them in a convex shape. The Geometrical View of Classifiers: _ _ _ _ + + + _ + + + + _ _ _
Goal To understandthe properties of the data enclosed by a convex shape.
Convex Shapes The convex hull: The convex hull C of a set of points is the smallest convex set that includes all the points. Problem: It is difficult to study the convex hull directly. Our solution: Instead of looking at the convex hull, we use the simpler convex shape to approximate the hull – the ellipsoid.
Example: Advantage: Use an ellipsoid to approximate the convex region, or to bound the geometry of the convex hull. MVE (Minimum Volume Ellipsoid) + + + + + + + + + +
Example: Using MVE to approximate the convex hull [John 1948] has shown that if we shrink the minimum volume outer ellipsoid of a convex set C by a factor k about its center, we obtain an ellipsoid contained in C. (k is the dimension of the space) + + + + + + + + + +
Calculate the MVE The MVE is described by the equation: v’ -1 v = k V={v1,v2,…,vh} Rk. v is an exterior point : the scatter matrix = wiviviT the eigen vectors of correspond to the directions of the ellipsoid axes. the eigen values of correspond to the half-lengths or radii of the axes. k: a constant equal to the dimension of the space wi: the weight of a point vi. h i=1
Calculate the MVE (cont.) [Titterington 1978] An algorithm to calculate the weights of MVE: • A point has a positive weight if it lies on the surface of the ellipsoid. • At least k+1 points have non-zero weights. • At most k(k+3)/2+1 points have non-zero weights.
Use MVE for data analysis: • Finding extreme points by looking at points on the ellipsoid surface. • Finding the subspace of data by looking at the directions where the ellipsoid is thin.
Points on the surface of the ellipsoid i.e., points with non-zero weights. Example:In our hand-written digit file, there are 376 points which belong to class “0”. By using MVE, we find that there are about 178 points with non-zero weights, i.e., these points lie on the surface of the MVE. The mean-zero Some “0” points on the surface of the MVE
Directions where the ellipsoid is thin • The directions and size of an ellipsoid’s axes correspond to the eigen vectors and values of its scatter matrix. • Direction of thinness: A short axis defines a direction in which the data does not extend. • If V is a zero-valued eigen vector, then it defines a constraint for any data point x: Vx=0
A simple Null Space • Any basis for the Null Space of the scatter matrix is an equivalent set of constraints. • In order to understand the data, we would like to find constraints that are easy to interpret. • Goal: simplify the null space basis, e.g.,find a basis with many zeros.
The Null Space Problem (NSP) • The Null Space Problem is defined as finding the basis with the maximal number of zeros. • It is an NP-hard problem. [Pothen, Coleman 1986] • An approach: Find a heuristic algorithm to simplify an existing basis of the null space of Class “0”: e.g., using Gaussian elimination to get a null space basis with more 0 components.
The null space basis = the set of eigenvectors with 0-eigenvalues The null space basis after using Gaussian elimination
Summary Data analysis Classifier with convex shape MVE Points on the surface of the MVE Simple basis for the null space