NONLINEAR MAPPING: APPROACHES BASED ON OPTIMIZING AN INDEX OF CONTINUITY AND APPLYING CLASSICAL METRIC MDS TO REVISED DISTANCES • By Ulas Akkucuk & J. Douglas Carroll, Rutgers Business School – Newark and New Brunswick
Outline • Introduction • Nonlinear Mapping Algorithms • Parametric Mapping Approach • ISOMAP Approach • Other Approaches • Experimental Design and Methods • Error Levels • Evaluation of Mapping Performance • Problem of Similarity Transformations • Results • Discussion and Future Direction
Introduction • Problem: To determine a smaller set of variables necessary to account for a larger number of observed variables • PCA and MDS are useful when relationship is linear • Alternative approaches needed when the relationship is highly nonlinear
Shepard and Carroll (1966) • Locally monotone analysis of proximities: Nonmetric MDS treating large distances as missing • Worked well if the nonlinearities were not too severe (in particular if the surface is not closed such as a circle or sphere) • Optimization of an index of “continuity” or “smoothness” • Incorporated into a computer program called “PARAMAP” and tested on various sets of data
62 regularly spaced points on a sphere, and the azimuthal equidistant projection of the world
49 points regularly spaced on a torus embedded in four dimensions
In all cases the local structure is preserved, except at points where the shape is “cut open” or “punctured” • Results were successful, but a severe local minimum problem existed • Adding error to the regular spacing made the local minimum problem worse • The current work was stimulated by two articles on nonlinear mapping (Tenenbaum, de Silva, & Langford, 2000; Roweis & Saul, 2000)
Nonlinear Mapping Algorithms • n : number of objects • M : dimensionality of the input coordinates, in other words of the configuration for which we would like to find an underlying lower dimensional embedding • R : dimensionality of the space of the recovered configuration, where R < M • Y : n × M input matrix • X : n × R output matrix
The distances between point i and point j in the input and output spaces, respectively, are collected in the matrices Δ = [δij] and D = [dij]
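As a minimal sketch of these definitions, the same routine can produce both matrices: applied to Y it yields [δij], applied to X it yields [dij]. The function name `distanceMatrix` is illustrative, not from the original program.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Pairwise Euclidean distance matrix for an n x dim configuration;
// applied to Y it gives [delta_ij], applied to X it gives [d_ij].
std::vector<std::vector<double>> distanceMatrix(
    const std::vector<std::vector<double>>& pts) {
  const std::size_t n = pts.size();
  std::vector<std::vector<double>> d(n, std::vector<double>(n, 0.0));
  for (std::size_t i = 0; i < n; ++i)
    for (std::size_t j = i + 1; j < n; ++j) {
      double s = 0.0;
      for (std::size_t m = 0; m < pts[i].size(); ++m) {
        const double diff = pts[i][m] - pts[j][m];
        s += diff * diff;
      }
      d[i][j] = d[j][i] = std::sqrt(s);  // symmetric, zero diagonal
    }
  return d;
}
```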
Parametric Mapping Approach • Works by optimizing an index of “continuity” or “smoothness” • An early application was in the context of time-series data (von Neumann, Kent, Bellinson, & Hart, 1941; von Neumann, 1941)
A more general expression for the numerator is obtained by replacing successive differences with all pairwise distances • Generalizing to the multidimensional case, we reach the index to be optimized
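The equations on this slide did not survive extraction. As a hedged sketch from the cited sources (the exact normalizing constants appear on the next slide; verify against Shepard & Carroll, 1966): the von Neumann et al. (1941) statistic for a time series $x_1,\dots,x_n$ is the mean square successive difference relative to the variance,

```latex
\eta \;=\; \frac{\sum_{t=1}^{n-1}\,(x_{t+1}-x_t)^2}{\sum_{t=1}^{n}\,(x_t-\bar{x})^2},
```

and the multidimensional generalization optimized by PARAMAP is commonly written as

```latex
\kappa \;=\; \frac{\displaystyle\sum_{i<j} d_{ij}^{\,2}\,/\,\delta_{ij}^{\,4}}
                 {\left[\displaystyle\sum_{i<j} 1\,/\,\delta_{ij}^{\,2}\right]^{2}},
```

where small output distances $d_{ij}$ are forced wherever input distances $\delta_{ij}$ are small, which is what makes $\kappa$ an index of continuity.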
Several modifications are needed for the minimization procedure: • dij² + Ce² is substituted for dij², where C is a constant equal to 2/(n − 1) and e takes on values between 0 and 1 • e has the practical effect of accelerating the numerical process • e can be thought of as an extra “specific” dimension; as e approaches 0, points are made to approach the “common” part of the space
In the numerator the constant z is used, and in the denominator [2/(n(n − 1))]² • Final form of the function:
Implemented in C++ (GNU GCC compiler) • The program takes as input e, the number of repetitions, the dimensionality R to be recovered, and the number of random starts or a starting input configuration • 200 iterations each for 100 different random configurations yields reasonable solutions • The resulting best solution can then be further fine-tuned by performing more iterations
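The multi-start-then-fine-tune strategy above can be sketched generically. This is a hypothetical stand-in, not the actual PARAMAP code: the quadratic objective used in the test is a placeholder for the continuity index, and names such as `multiStart` and `descend` are invented for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <limits>
#include <random>
#include <vector>

using Vec = std::vector<double>;
using Objective = std::function<double(const Vec&)>;

// Central-difference numerical gradient of f at x.
Vec gradient(const Objective& f, Vec x) {
  const double h = 1e-6;
  Vec g(x.size());
  for (std::size_t i = 0; i < x.size(); ++i) {
    const double orig = x[i];
    x[i] = orig + h; const double fp = f(x);
    x[i] = orig - h; const double fm = f(x);
    x[i] = orig;
    g[i] = (fp - fm) / (2.0 * h);
  }
  return g;
}

// Straight gradient descent for a fixed number of iterations.
Vec descend(const Objective& f, Vec x, int iters, double step) {
  for (int t = 0; t < iters; ++t) {
    const Vec g = gradient(f, x);
    for (std::size_t i = 0; i < x.size(); ++i) x[i] -= step * g[i];
  }
  return x;
}

// Many random starts, a short descent from each, keep the best
// solution found, then fine-tune it with extra iterations.
Vec multiStart(const Objective& f, std::size_t dim, int starts, int iters,
               double step, unsigned seed = 42) {
  std::mt19937 rng(seed);
  std::uniform_real_distribution<double> u(-1.0, 1.0);
  Vec best;
  double bestVal = std::numeric_limits<double>::infinity();
  for (int s = 0; s < starts; ++s) {
    Vec x(dim);
    for (double& v : x) v = u(rng);
    x = descend(f, x, iters, step);
    const double val = f(x);
    if (val < bestVal) { bestVal = val; best = x; }
  }
  return descend(f, best, 10 * iters, step);  // fine-tune the winner
}
```

The random starts address the severe local minimum problem noted earlier: only the best of many descents is worth polishing.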
ISOMAP Approach • Tries to overcome difficulties in MDS by replacing the Euclidean metric with a new metric • Figure (Lee, Lendasse, & Verleysen, 2002)
To approximate the “geodesic” distances, ISOMAP constructs a neighborhood graph that connects the closer points • This is done by connecting the k closest neighbors, or points that are within a distance ε of each other • A shortest-path procedure is then applied to the resulting matrix of modified distances • Finally, classical metric MDS is applied to obtain the configuration in the lower dimensionality
Other Approaches • Nonmetric MDS: Minimizes a cost function • Needed to implement locally monotone MDS approach of Shepard (Shepard & Carroll, 1966)
Sammon’s mapping: Minimizes a mapping error function • Kruskal (1971) indicated that certain options used with nonmetric MDS programs would give the same results
Multidimensional scaling by iterative majorization (Webb, 1995) • Curvilinear Distance Analysis (CDA) (Lee et al., 2002), an analogue of ISOMAP that omits the MDS step, replacing it with a minimization step • Self-organizing map (SOM) (Kohonen, 1990, 1995) • Auto-associative feedforward neural networks (AFN) (Baldi & Hornik, 1989; Kramer, 1991)
Experimental Design and Methods • Primary focus: 62 points located at the intersections of 5 equally spaced parallels and 12 equally spaced meridians • Two types of error, A and B • A: 0%, 10%, 20% • B: ±0.00, ±0.01, ±0.05, ±0.10, ±0.20 • A and B control points being irregularly spaced and lying inside or outside the sphere, respectively
To evaluate mapping performance: We calculate a “rate of agreement in local structure,” abbreviated “agreement rate” or A • Similar to the Rand index used to compare partitions (Rand, 1971; Hubert & Arabie, 1985) • Let ai stand for the number of points that are in the k-nearest-neighbor list for point i in both X and Y; A will then equal (Σi ai)/(nk)
Example of calculating the agreement rate: k = 2, agreement rate = 2/10 or 20%
Problem of similarity transformations: We use standard software to rotate the different solutions into optimal congruence with a landmark solution (Rohlf & Slice, 1989) • We use the solution for the error-free and regularly spaced sphere as the landmark • We also report VAF (variance accounted for)
The VAF results may not be very good • The similarity transformation step alone is not enough • An alternating algorithm is needed that reorders the points on each of the five parallels and then finds the optimal similarity transformation • We also provide Shepard-like diagrams
Results • Agreement rate for the regularly spaced and errorless sphere: 82.9% (k = 5) • Over 1000 randomizations of the solution: average and standard deviation of the agreement rate are 8.1% and 1.9%, respectively; minimum and maximum are 3.5% and 16.7%
We can use Chebyshev’s inequality, P(|X − μ| ≥ tσ) ≤ 1/t² • 82.9 is about 40 standard deviations away from the mean, so an upper bound on the probability that this happens by chance is 1/40² = 0.000625, very low!
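A quick sanity check of this arithmetic, using the randomization mean and standard deviation from the previous slide (`chebyshevBound` is an illustrative helper name):

```cpp
#include <cassert>
#include <cmath>

// Chebyshev's inequality: P(|X - mu| >= t * sigma) <= 1 / t^2.
// With mean 8.1 and sd 1.9, the observed 82.9% agreement rate sits
// about 39.4 standard deviations above the mean.
double chebyshevBound(double observed, double mean, double sd) {
  const double t = std::fabs(observed - mean) / sd;  // sd units from the mean
  return 1.0 / (t * t);                              // tail-probability bound
}
```

The exact bound is about 0.000645; the slide's 1/40² = 0.000625 rounds t up to 40.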
A = 48.1% (ISOMAP)  A = 82.9% (PARAMAP)
Swiss roll data – 130 points • Agreement rate: ISOMAP 59.7%, PARAMAP 70.5%
Discussion and Future Direction • Disadvantage of PARAMAP: run time • Advantage of ISOMAP: noniterative procedure; can be applied to very large data sets with ease • Disadvantage of ISOMAP: poor performance on closed data sets such as the sphere
Improvements in the computational efficiency of PARAMAP should be explored: • Use of a conjugate gradient algorithm instead of the straight gradient algorithm • Use of a conjugate gradient with restarts algorithm • A possible combination of the straight gradient and conjugate gradient approaches • Improvements that could benefit both ISOMAP and PARAMAP: • A wise selection of landmarks, and an interpolation or extrapolation scheme to recover the rest of the data