Estimating Intrinsic Dimension
Justin Eberhardt, UMD, Mathematics and Statistics
Advisor: Dr. Kang James
Outline
• Introduction
• Nearest Neighborhood Estimators
• Regression Estimator
• Maximum Likelihood Estimator
• Revised Maximum Likelihood Estimator
• Comparison
• Summary
Intrinsic Dimension Definition
• The least number of parameters required to generate a dataset
• Minimum number of dimensions that describes a dataset without significant loss of features
Ex 1: Intrinsic Dimension [Figure: a surface in (x, y, z) coordinates is flattened (unrolled) into the (x, y) plane] Int Dim = 2
Ex 2: Intrinsic Dimension [Figure: 28 X 28 pixel image] One Image: 784 Dimensional
Ex 2: Intrinsic Dimension [Figure labels: No Loop; Top & Bottom Loop] [Isomap Project, J. Tenenbaum & J. Langford, Stanford] Int Dim = 2
Applications
• Biometrics
  • Facial Recognition, Fingerprints, Iris
• Genetics
Why do we need to reduce dimensionality?
• Low-dimensional datasets are more efficient to store and process
• Not even supercomputers can handle very high-dimensional matrices
• Data in 1, 2, and 3 dimensions can be visualized
Ex: Facial Recognition in MN
• 5 Million People
• 2 Images per Person (Front and Profile)
• 1028 X 1028 Pixels per Image (1 Megapixel)
• Total Memory Required:
  • n = 5,000,000
  • p = (2)(1028)(1028) = 2.11 Million Dimensions
  • Matrix Size: (5 x 10^6)(2.11 x 10^6) ≈ 10 x 10^12 = 10 trillion cells
  • Memory: 2(10 x 10^12) = 20 x 10^12 bytes = 20 Terabytes (taking the factor of 2 as 2 bytes per cell)
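The arithmetic on this slide can be checked with a short script. This is only a sketch: `bytes_per_cell = 2` is an assumption about what the factor of 2 in the memory line means, and the variable names are illustrative. The exact figures come out slightly above the rounded values on the slide (about 10.6 trillion cells and about 21 TB).

```python
# Back-of-the-envelope storage estimate for the facial recognition example.
n = 5_000_000            # people (rows of the data matrix)
images_per_person = 2    # front and profile
side = 1028              # pixels per image side
bytes_per_cell = 2       # assumption: the slide's factor of 2 is bytes per cell

p = images_per_person * side * side        # raw dimensions per person
cells = n * p                              # entries in the n x p matrix
terabytes = cells * bytes_per_cell / 1e12  # decimal terabytes

print(f"p = {p:,} dimensions")
print(f"cells = {cells:.3e}")
print(f"memory = {terabytes:.1f} TB")
```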
Intrinsic Dimension Estimators Objective: To find a simple formula that uses nearest neighbor (NN) information to quickly estimate intrinsic dimension
Intrinsic Dimension Estimators Project Description: Through simulation, we will compare the effectiveness of three proposed NN intrinsic dimension estimators.
Intrinsic Dimension Estimators Note: Traditional methods for estimating Intrinsic Dimension, such as PCA, fail on non-linear manifolds.
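A small experiment illustrates the note about PCA. The sketch below generates a Swiss roll (intrinsic dimension 2, raw dimension 3) and prints the share of variance along each principal direction. The generator `make_swiss_roll` and its parameter ranges are assumptions chosen for illustration, not the dataset used in the talk. All three components carry a comparable share of variance, so a variance-based cutoff would suggest 3 dimensions rather than 2.

```python
import numpy as np

def make_swiss_roll(n, seed=0):
    """Hypothetical Swiss roll generator: a 2-D sheet rolled up in 3-D."""
    rng = np.random.default_rng(seed)
    t = 1.5 * np.pi * (1 + 2 * rng.random(n))  # angle along the roll
    h = 21 * rng.random(n)                     # position along the roll's axis
    return np.column_stack((t * np.cos(t), h, t * np.sin(t)))

def pca_variance_ratios(X):
    """Fraction of total variance along each principal direction."""
    Xc = X - X.mean(axis=0)                  # center the data
    s = np.linalg.svd(Xc, compute_uv=False)  # singular values
    var = s ** 2
    return var / var.sum()

ratios = pca_variance_ratios(make_swiss_roll(2000))
print(ratios)  # no component is negligible
```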
Intrinsic Dimension Estimators: Nearest-Neighbor Methods
• Regression Estimator: K. Pettis, T. Bailey, A. Jain & R. Dubes, 1979
• Maximum Likelihood Estimator: E. Levina & P. Bickel, 2005; D. MacKay & Z. Ghahramani, 2005
Distance Matrix [Figure: distance matrix with one entry highlighted: the distance from $x_2$ to $x_3$] $D_{i,j}$: Euclidean distance from $x_i$ to $x_j$
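As a minimal sketch, the distance matrix can be computed with NumPy broadcasting. The function name `distance_matrix` and the three sample points are illustrative only. Indices in the code are 0-based, so the distance from x2 to x3 is `D[1, 2]`.

```python
import numpy as np

def distance_matrix(X):
    """Pairwise Euclidean distances: D[i, j] = ||x_i - x_j||."""
    diff = X[:, None, :] - X[None, :, :]  # shape (n, n, p)
    return np.sqrt((diff ** 2).sum(axis=-1))

X = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
D = distance_matrix(X)
print(D[1, 2])  # distance from x2 to x3
```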
Nearest Neighbor Matrix [Figure: nearest neighbor matrix with one entry highlighted: the distance between $x_2$ and the kth NN to $x_2$] $T_{i,k}$: Euclidean distance between $x_i$ and the kth NN to $x_i$
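One way to obtain the nearest neighbor matrix is to sort each row of the distance matrix and drop the leading zero (each point's distance to itself). This is a sketch that assumes no duplicate observations; `nn_distance_matrix` and `k_max` are illustrative names.

```python
import numpy as np

def nn_distance_matrix(X, k_max):
    """T[i, k-1] = distance from x_i to its kth nearest neighbor, k = 1..k_max."""
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))
    D_sorted = np.sort(D, axis=1)    # each row in ascending order
    return D_sorted[:, 1:k_max + 1]  # skip column 0 (distance to itself)

X = np.array([[0.0], [1.0], [3.0], [7.0]])
T = nn_distance_matrix(X, k_max=2)
print(T)
```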
Notation
• m: Intrinsic Dimension
• p: Dimension of the Raw Dataset
• n: Number of Observations
• f(x): density (pdf) at observation x
• $T_{x,k}$ or $T_k$: distance from observation x to its kth NN
• N(t,x): # obs within dist t of observation x
Notation [Figure: observation x with radius t and the distances t1, t2, t3 marked] p = 2, m = 1, N = 12, N(t,x) = 3
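The counting function N(t,x) from the notation can be sketched directly. The helper `count_within` is a hypothetical name, and the sketch assumes x is itself one of the observations and should not count as its own neighbor.

```python
import numpy as np

def count_within(X, i, t):
    """N(t, x_i): number of other observations within distance t of x_i."""
    d = np.linalg.norm(X - X[i], axis=1)  # distances from x_i to every point
    return int((d <= t).sum()) - 1        # subtract 1 to exclude x_i itself

X = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0], [10.0, 0.0]])
print(count_within(X, 1, 1.5))
print(count_within(X, 0, 3.0))
```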
NN Regression Estimator
1. Density of Distance to kth NN (Single Observation, approximated as Poisson)
2a. Expected Distance to kth NN (Single Observation)
2b. Sample-Averaged Distance to kth NN
3. Expected Distance to Sample-Averaged kth NN
Regression Estimator: Distance to Kth NN pdf
Trinomial Distribution, Binomial Distribution
• Assumptions
  • f(x) is constant
  • n is large
  • f(x)$V_t$ is small
Regression Estimator Approximate as Poisson Expected distance to Kth NN
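For reference, the Poisson approximation and the resulting expected distance can be written out as follows. This sketch uses the standard formulation associated with Pettis et al. (1979) in the notation of these slides. The symbols $V_m$ (volume of the unit m-ball), $g_k$ (density of the kth NN distance), and $\bar{T}_k$ (sample average of $T_{x,k}$) are introduced here for illustration.

```latex
% Under the assumptions (f constant near x, n large, f(x) V_t small), the number
% of observations within distance t of x is approximately Poisson with mean \lambda(t).
\begin{align*}
g_{k}(t) &= \frac{m}{t}\,\frac{\lambda(t)^{k} e^{-\lambda(t)}}{\Gamma(k)},
  \qquad \lambda(t) = n f(x) V_m t^{m} \\
E[T_{x,k}] &= \frac{\Gamma(k + 1/m)}{\Gamma(k)}\,\bigl(n f(x) V_m\bigr)^{-1/m}
  = \frac{1}{G_{k,m}}\, k^{1/m}\, \bigl(n f(x) V_m\bigr)^{-1/m},
  \qquad G_{k,m} = \frac{k^{1/m}\,\Gamma(k)}{\Gamma(k + 1/m)} \\
E[\bar{T}_{k}] &= \frac{1}{G_{k,m}}\, k^{1/m}\, C_n,
  \qquad C_n = \frac{1}{n}\sum_{i=1}^{n} \bigl(n f(x_i) V_m\bigr)^{-1/m} \\
\log G_{k,m} + \log E[\bar{T}_{k}] &= \frac{1}{m}\log k + \log C_n
\end{align*}
```

The last line is linear in $\log k$ with slope $1/m$, which is what makes a simple linear regression applicable.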
Terms $G_{k,m}$ and $C_n$. Estimate m using simple linear regression
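The regression step might be sketched as below. This is an illustrative implementation, not the code used for the results in this talk. It assumes the relation log G(k,m) + log(mean kth-NN distance) = (1/m) log k + log C_n with G(k,m) = k^(1/m) Γ(k) / Γ(k + 1/m), starts from G = 1, and re-fits a few times because G depends on m. The function names, `k_max`, and `n_iter` are all hypothetical choices.

```python
import numpy as np
from math import lgamma

def knn_distances(X, k_max):
    """n x k_max matrix of distances to the 1st..k_max-th nearest neighbors."""
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))
    return np.sort(D, axis=1)[:, 1:k_max + 1]

def log_G(k, m):
    """log G_{k,m} = (1/m) log k + log Gamma(k) - log Gamma(k + 1/m)."""
    return np.array([np.log(kk) / m + lgamma(float(kk)) - lgamma(kk + 1.0 / m)
                     for kk in k])

def regression_estimate(X, k_max=20, n_iter=10):
    """Estimate m from the slope of log mean kth-NN distance against log k."""
    k = np.arange(1, k_max + 1)
    log_k = np.log(k)
    log_rbar = np.log(knn_distances(X, k_max).mean(axis=0))
    slope = np.polyfit(log_k, log_rbar, 1)[0]  # first pass: treat G as 1
    m = 1.0 / slope
    for _ in range(n_iter):                    # re-fit with the G correction
        slope = np.polyfit(log_k, log_rbar + log_G(k, m), 1)[0]
        m = 1.0 / slope
    return m

rng = np.random.default_rng(0)
uv = rng.random((1000, 2))
# A flat 2-D sheet embedded in 3-D: raw dimension 3, intrinsic dimension 2.
X = np.column_stack((uv[:, 0], uv[:, 1], uv[:, 0] + uv[:, 1]))
m_hat = regression_estimate(X)
print(round(m_hat, 2))
```

On this flat test sheet the estimate should land near 2. Points near the edge of the sheet have slightly inflated NN distances, so some bias is expected.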
Ex: Swiss Roll Dataset m = 0.49 (regression slope; since the slope estimates 1/Int Dim, Int Dim ≈ 1/0.49 ≈ 2.0)
Datasets
• Gaussian Sphere: Raw Dim = 3, Int Dim = 3
• Swiss Roll: Raw Dim = 3, Int Dim = 2
• Dbl Swiss Roll: Raw Dim = 3, Int Dim = 2
• Faces: Raw Dimension = 4096, Int Dim ~ 3 to 5
Results: Regression Estimator (K = N / 100)
• Gaussian Sphere: ~ 3.0
• Swiss Roll: ~ 2.0
• Dbl Swiss Roll: ~ 2.0
• Faces: ~ 3.5
NN Maximum Likelihood Estimator
1. Counting Process: Binomial (approximated as Poisson)
2. Joint Counting Probability
3. Joint Occurrence Density
4. Log-likelihood Function
Maximum Likelihood Estimator
• N(t,x) = # Counts within Distance t of x
• # Counts btw Distance r and s is Binomial
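For reference, the likelihood step can be written as follows, using the formulation in Levina & Bickel (2005) with the notation above. The symbols $\theta = \log f(x)$, $V(m)$ (volume of the unit m-ball), and $R$ (the largest radius considered) are not on the slides and are introduced here for illustration. Any factor of the sample size is taken to be absorbed into the density term.

```latex
% The counting process N(t, x), 0 <= t <= R, is approximated by an
% inhomogeneous Poisson process with rate \lambda(t).
\begin{align*}
\lambda(t) &= f(x)\, V(m)\, m\, t^{m-1} \\
L(m, \theta) &= \int_{0}^{R} \log \lambda(t)\, dN(t, x) - \int_{0}^{R} \lambda(t)\, dt \\
\hat{m}_{R}(x) &= \left[ \frac{1}{N(R, x)} \sum_{j=1}^{N(R, x)} \log \frac{R}{T_{x,j}} \right]^{-1} \\
\hat{m}_{k}(x) &= \left[ \frac{1}{k-1} \sum_{j=1}^{k-1} \log \frac{T_{x,k}}{T_{x,j}} \right]^{-1}
\end{align*}
```

The last line fixes the number of neighbors k instead of the radius R, which gives one local estimate per observation.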
E. Levina & P. Bickel: Averaging over N observations
D. MacKay & Z. Ghahramani: Averaging inverses over N observations (Using MLE)
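The two ways of combining the local estimates can be compared side by side. In this sketch each observation has a local inverse estimate (1/(k-1)) Σ log(T_k / T_j) over j = 1..k-1. The Levina & Bickel version averages the local estimates of m, while the MacKay & Ghahramani version averages the inverses and inverts once at the end. The function names and the 1/(k-1) normalization are assumptions made for illustration, and the sketch assumes no tied distances.

```python
import numpy as np

def knn_distances(X, k):
    """n x k matrix of distances to the 1st..kth nearest neighbors."""
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))
    return np.sort(D, axis=1)[:, 1:k + 1]

def local_inverse_estimates(T):
    """Per observation: (1/(k-1)) * sum_{j<k} log(T_k / T_j), an estimate of 1/m."""
    k = T.shape[1]
    return np.log(T[:, -1:] / T[:, :-1]).sum(axis=1) / (k - 1)

def mle_levina_bickel(T):
    """Average the local estimates of m over all observations."""
    return float(np.mean(1.0 / local_inverse_estimates(T)))

def mle_mackay_ghahramani(T):
    """Average the local inverses over all observations, then invert."""
    return float(1.0 / np.mean(local_inverse_estimates(T)))

rng = np.random.default_rng(0)
uv = rng.random((1000, 2))
# A flat 2-D sheet embedded in 3-D: raw dimension 3, intrinsic dimension 2.
X = np.column_stack((uv[:, 0], uv[:, 1], uv[:, 0] + uv[:, 1]))
T = knn_distances(X, k=20)
lb = mle_levina_bickel(T)
mg = mle_mackay_ghahramani(T)
print(round(lb, 2), round(mg, 2))
```

Because the mean of reciprocals is never smaller than the reciprocal of the mean, the first estimate is always at least as large as the second on the same data.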
Results: MLE Estimator (Revised MacKay & Ghahramani) (K = N / 100)
• Gaussian Sphere: ~ 3.0
• Swiss Roll: ~ 2.0
• Dbl Swiss Roll: ~ 2.1
• Faces: ~ 3.5
Comparison [Figures: comparison plots on six slides]
Isomap
Summary
• The regression and revised MLE estimators share similar characteristics when intrinsic dimension is small
• As intrinsic dimension increases, the estimators become more dependent on K
• Distribution type does not appear to be highly influential when the intrinsic dimension is small
Thank You!
• Dr. Kang James & Dr. Barry James
• Dr. Steve Trogdon
Example Swiss Roll Data Int Dim = 2