
Self-Organizing Maps


Presentation Transcript


1. Self-Organizing Maps
• Projection of p-dimensional observations onto a two- (or one-) dimensional grid space
• Constrained version of K-means clustering
• Prototypes lie on a one- or two-dimensional manifold (constrained topological map; Teuvo Kohonen, 1993)
• K prototypes arranged on a rectangular or hexagonal grid
• Each prototype is indexed by an integer pair ℓ_j ∈ Q1 × Q2, where Q1 = {1, …, q1}, Q2 = {1, …, q2}, and K = q1 × q2
• High-dimensional observations are projected onto this two-dimensional coordinate system
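A minimal sketch (assuming NumPy; the sizes q1, q2, p are arbitrary example values) of how the grid of integer pairs and the K prototypes might be set up:

```python
import numpy as np

# Hypothetical setup: K = q1 * q2 prototypes indexed by integer pairs on a rectangular grid.
q1, q2, p = 5, 4, 10                                     # grid dimensions and data dimensionality
grid = np.array([(a, b) for a in range(1, q1 + 1)
                        for b in range(1, q2 + 1)])      # K x 2 integer coordinates l_j
rng = np.random.default_rng(0)
prototypes = rng.normal(size=(q1 * q2, p))               # K prototypes m_j in R^p
```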

2. SOM Algorithm
• Prototypes m_j, j = 1, …, K, are initialized
• Each observation x_i is processed one at a time to find the closest prototype m_j in Euclidean distance in the p-dimensional space
• All neighbors of m_j, say m_k, move toward x_i as m_k ← m_k + α (x_i − m_k)
• Neighbors are all m_k whose distance from m_j is smaller than a threshold r (m_j itself is included)
• Distance is defined on Q1 × Q2, not in the p-dimensional space
• SOM performance depends on the learning rate α and the threshold r
• Typically α is decreased from 1 to 0 and r from R (a predefined value) to 1 over, say, 3000 iterations
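A sketch of one pass of this online update, assuming NumPy, the grid/prototype arrays from the previous sketch, and that the neighborhood radius r is measured on the grid:

```python
import numpy as np

def som_update(X, grid, prototypes, alpha, r):
    """One pass of the online SOM update (illustrative sketch, not a full implementation)."""
    for x in X:
        # closest prototype in the p-dimensional data space
        j = np.argmin(np.sum((prototypes - x) ** 2, axis=1))
        # neighbors: prototypes whose *grid* distance to l_j is below r (includes j itself)
        grid_dist = np.sqrt(np.sum((grid - grid[j]) ** 2, axis=1))
        neighbors = grid_dist < r
        # move every neighbor toward x
        prototypes[neighbors] += alpha * (x - prototypes[neighbors])
    return prototypes

# alpha and r would typically be decayed between passes, e.g. over ~3000 iterations.
```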

3. SOM Properties
• If r is small enough, each neighborhood contains only one prototype → the spatial connection between prototypes is lost → the algorithm converges to a local minimum of K-means clustering
• To check whether the constraint is reasonable, compute and compare the reconstruction error e = Σ ||x − m||² (summed over all observations) for both methods (SOM's e is bigger, but it should be similar)
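One possible way to compute the reconstruction error for this comparison, assuming e is summed over all observations (the function and variable names below are illustrative):

```python
import numpy as np

def reconstruction_error(X, prototypes):
    """Sum of squared distances from each x_i to its closest prototype (assumed form of e)."""
    d2 = np.sum((X[:, None, :] - prototypes[None, :, :]) ** 2, axis=2)  # n x K squared distances
    return np.sum(np.min(d2, axis=1))

# e_som    = reconstruction_error(X, som_prototypes)      # hypothetical fitted SOM prototypes
# e_kmeans = reconstruction_error(X, kmeans_centroids)    # hypothetical K-means centroids
# e_som should be somewhat larger but comparable if the grid constraint is reasonable.
```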

4. Tamayo et al. (1999; GeneCluster)
• Self-organizing maps (SOM) on microarray data
  – Hematopoietic cell lines (HL60, U937, Jurkat, and NB4): 4x3 SOM
  – Yeast data in Eisen et al. reanalyzed by a 6x5 SOM

5. Principal Component Analysis
• Data x_i, i = 1, …, n, are from the p-dimensional space (n ≥ p)
• Data matrix X (n × p)
• Singular value decomposition X = U L Vᵀ, where
  – L is a non-negative diagonal matrix with decreasing diagonal entries l_i (the singular values),
  – U (n × p) has orthonormal columns (u_iᵀ u_j = 1 if i = j, 0 if i ≠ j), and
  – V (p × p) is an orthogonal matrix
• The principal components are the columns of XV (= UL)
• The rank of X equals the number of non-zero singular values, which is at most p
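A short NumPy illustration of these identities, using a placeholder data matrix and column-centering (an assumption; the slide does not state it):

```python
import numpy as np

# Sketch: principal components via the SVD X = U L V^T.
X = np.random.default_rng(0).normal(size=(100, 5))       # placeholder data matrix, n x p
Xc = X - X.mean(axis=0)                                   # center the columns
U, l, Vt = np.linalg.svd(Xc, full_matrices=False)         # l holds the singular values, decreasing
pcs = Xc @ Vt.T                                           # principal components XV ...
assert np.allclose(pcs, U * l)                            # ... which equal UL
```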

6. PCA Properties
• The first column of XV (= UL) is the 1st principal component, which represents the direction with the largest variance (the first singular value measures its magnitude)
• The second column represents the direction with the second largest variance, uncorrelated with the first, and so on
• The first q columns, q < p, of XV are the linear projection of X into q dimensions with the largest variance
• Let X̃ = U L_q Vᵀ, where L_q is L with all but the first q diagonal entries set to zero → X̃ is the best possible rank-q approximation of X
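A sketch of the rank-q truncation, with q = 2 chosen arbitrarily for illustration:

```python
import numpy as np

# Best rank-q approximation X_q = U L_q V^T, obtained by keeping only the first q singular values.
X = np.random.default_rng(0).normal(size=(100, 5))        # placeholder data matrix
U, l, Vt = np.linalg.svd(X, full_matrices=False)
q = 2
X_q = (U[:, :q] * l[:q]) @ Vt[:q]                          # rank-q approximation of X
```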

7. Traditional PCA
• Variance-covariance matrix S computed from the data matrix X (n × p)
• Eigenvalue decomposition: S = C D Cᵀ, with C an orthogonal matrix
• For column-centered X: (n−1) S = XᵀX = (U L Vᵀ)ᵀ (U L Vᵀ) = V L Uᵀ U L Vᵀ = V L² Vᵀ
• Thus D = L² / (n−1) and C = V
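A quick numerical check of D = L² / (n−1), again assuming column-centered data and NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
Xc = X - X.mean(axis=0)                       # column-centered data
n = Xc.shape[0]

U, l, Vt = np.linalg.svd(Xc, full_matrices=False)
S = np.cov(Xc, rowvar=False)                  # variance-covariance matrix, divides by n - 1
eigvals = np.linalg.eigvalsh(S)[::-1]         # eigenvalues of S, largest first
assert np.allclose(eigvals, l**2 / (n - 1))   # D = L^2 / (n - 1); C matches V up to column signs
```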

8. Principal Curves and Surfaces
• Let f(λ) be a parameterized smooth curve in the p-dimensional space
• For a data point x, let λ_f(x) be the parameter value of the point on the curve closest to x
• Then f(λ) is the principal curve for the random vector X if f(λ) = E[X | λ_f(X) = λ]
• Thus, f(λ) is the average of all data points that project onto it

9. Algorithm for Finding the Principal Curve
• Let f(λ) have coordinate functions f(λ) = [f_1(λ), f_2(λ), …, f_p(λ)], where the random vector X = [X_1, X_2, …, X_p]
• Then iterate the following alternating steps until convergence:
  (a) f_j(λ) ← E[X_j | λ_f(X) = λ], j = 1, …, p
  (b) λ_f(x) ← argmin_λ ||x − f(λ)||²
• The solution is the principal curve for the distribution of X
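An illustrative sketch of the alternating steps on a finite sample; the conditional expectation in step (a) is approximated here by a simple running-mean smoother, which is one possible choice rather than the method prescribed by the slide:

```python
import numpy as np

def principal_curve(X, n_iter=10, window=11):
    """Sketch of the alternating principal-curve algorithm on sample data (assumptions noted above)."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    # initialize lambda with the first-principal-component scores
    lam = Xc @ np.linalg.svd(Xc, full_matrices=False)[2][0]
    kernel = np.ones(window) / window
    for _ in range(n_iter):
        order = np.argsort(lam)                          # observations ordered along the curve
        # step (a): f_j(lambda) <- smoothed X_j as a function of lambda, for each coordinate j
        curve = np.column_stack(
            [np.convolve(X[order, j], kernel, mode="same") for j in range(p)])
        # step (b): lambda(x) <- position of the closest point on the fitted curve
        d2 = np.sum((X[:, None, :] - curve[None, :, :]) ** 2, axis=2)
        lam = np.argmin(d2, axis=1).astype(float)
    return curve, lam
```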

10. Multidimensional Scaling (MDS)
• Observations x_1, x_2, …, x_n in the p-dimensional space, with all pair-wise distances (or dissimilarity measures) d_ij
• MDS tries to preserve the structure of the original pair-wise distances as much as possible
• Seek vectors z_1, z_2, …, z_n in the k-dimensional space (k << p) by minimizing the "stress function"
  S_D(z_1, …, z_n) = [ Σ_{i≠j} (d_ij − ||z_i − z_j||)² ]^{1/2}
• This is Kruskal-Shephard scaling (or least-squares scaling)
• A gradient descent algorithm is used to find the solution
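A gradient-descent sketch for this least-squares criterion. It minimizes the squared stress (dropping the outer square root, which does not change the minimizer); the step size, iteration count, and initialization scale are arbitrary assumptions that may need tuning:

```python
import numpy as np

def kruskal_mds(D, k=2, n_iter=500, lr=0.05, seed=0):
    """Gradient descent on sum_{i != j} (d_ij - ||z_i - z_j||)^2 for an n x n distance matrix D."""
    n = D.shape[0]
    Z = np.random.default_rng(seed).normal(size=(n, k))
    for _ in range(n_iter):
        diff = Z[:, None, :] - Z[None, :, :]                  # pairwise differences z_i - z_j
        dist = np.sqrt(np.sum(diff ** 2, axis=2)) + 1e-12     # pairwise distances ||z_i - z_j||
        np.fill_diagonal(dist, 1.0)                           # avoid dividing by zero on the diagonal
        resid = D - dist
        np.fill_diagonal(resid, 0.0)                          # diagonal terms do not contribute
        grad = (-2.0 / n) * np.sum((resid / dist)[:, :, None] * diff, axis=1)
        Z -= lr * grad                                        # constant step size (an assumption)
    return Z
```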

11. Other MDS Approaches
• Sammon mapping minimizes
  Σ_{i≠j} (d_ij − ||z_i − z_j||)² / d_ij
• Classical scaling is based on a similarity measure s_ij
• Often the centered inner product s_ij = ⟨x_i − x̄, x_j − x̄⟩ is used, with x̄ the mean of the x_i
• Then minimize Σ_{i,j} (s_ij − ⟨z_i − z̄, z_j − z̄⟩)², with z̄ the mean of the z_i
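For classical scaling, the minimizer of the criterion above can be computed in closed form from the top-k eigenvectors of the centered inner-product matrix; a sketch under that standard result (k = 2 and the raw-data input are illustrative assumptions):

```python
import numpy as np

def classical_scaling(X, k=2):
    """Classical scaling: embed in R^k via the eigendecomposition of the centered Gram matrix."""
    Xc = X - X.mean(axis=0)
    S = Xc @ Xc.T                                        # s_ij = <x_i - mean, x_j - mean>
    w, V = np.linalg.eigh(S)                             # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:k]                        # indices of the k largest eigenvalues
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0))    # rows are the embedded points z_i
```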
