Lecture 07: Data Transform II September 28, 2010 COMP 150-12 Topics in Visual Analytics
Lecture Outline • Data Retrieval • Methods for increasing retrieval speed: • Pre-computation • Pre-fetching and Caching • Levels of Detail (LOD) • Hardware support • Data transform (pre-processing) • Aggregate (clustering) • Sampling (sub-sampling, re-sampling) • Simplification (dimension reduction) • Appropriate representation (finding underlying mathematical representation)
Dimension Reduction • Lots of possibilities, but can be roughly categorized into two groups: • Linear dimension reduction • Non-linear dimension reduction • Related to machine learning…
Dimension Reduction • Can think of clustering as a dimension reduction mechanism: • Assume the dataset has n dimensions • Clustering produces k clusters (k < n) • Instead of representing the data as an n-dimensional vector • Represent the data using those k dimensions (e.g., distances to the k cluster centers), as sketched below
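A minimal sketch of this idea, assuming scikit-learn's KMeans (the slides do not name a library); each n-dimensional point is re-expressed by its distances to the k cluster centers:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(200, 10)      # 200 data entries in n = 10 dimensions

k = 3                            # target number of clusters / dimensions
km = KMeans(n_clusters=k, n_init=10, random_state=0)

# fit_transform returns each point's distance to the k cluster centers,
# i.e. the data re-expressed in k dimensions instead of n
X_k = km.fit_transform(X)        # shape (200, 3)
print(X_k.shape)
```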
Some Common Techniques • Principal Component Analysis • demo • Multi-Dimensional Scaling • draw • Kohonen Maps / Self-Organizing Maps • demo • Isomap • draw
Principal Component Analysis • Quick refresher of PCA • Find the most dominant eigenvectors as principal components • Data points are re-projected into the new coordinate system • Used for reducing dimensionality • Used for finding clusters • Problem: PCA is easy to understand mathematically, but difficult to understand "semantically": e.g., what does a component like 0.5*GPA + 0.2*age + 0.3*height = ? actually mean? (figure: scatter plot over GPA, age, and height axes)
Principal Component Analysis • Pseudo code • Arrange the data such that each column is a dimension and each row is a data entry (an n x m matrix, n = rows, m = cols) • Subtract the mean of each dimension from its values • Compute the covariance matrix M • Compute the eigenvectors and eigenvalues of M • e.g., via singular value decomposition (SVD): M = U S V^T • where U and V are m x m orthogonal matrices and S is an m x m diagonal matrix of non-negative real numbers; since M is symmetric, the columns of U are the eigenvectors and the diagonal of S holds the eigenvalues • Sort the eigenvectors in U by their associated eigenvalues in S, from highest to lowest • Project your original (mean-centered) data onto the first k (highest-eigenvalue) eigenvectors
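A short NumPy sketch of the pseudo code above; the toy data and the choice of k are illustrative, not from the slides:

```python
import numpy as np

# n x m matrix: each row is a data entry, each column a dimension
X = np.random.rand(100, 5)

# Subtract each dimension's mean
Xc = X - X.mean(axis=0)

# Covariance matrix (m x m)
M = np.cov(Xc, rowvar=False)

# Eigenvectors / eigenvalues (eigh, since M is symmetric)
eigvals, eigvecs = np.linalg.eigh(M)

# Sort eigenvectors by eigenvalue, highest first
order = np.argsort(eigvals)[::-1]
eigvecs = eigvecs[:, order]

# Project the centered data onto the first k principal components
k = 2
X_reduced = Xc @ eigvecs[:, :k]   # n x k
print(X_reduced.shape)
```

(Equivalently, the right singular vectors from np.linalg.svd(Xc) give the same principal directions, matching the SVD step above.)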
Multi-Dimensional Scaling • Minimize the difference between distances in the low-d and high-d representations: • minimize the stress, sum over i &lt; j of ( ||xi − xj|| − δij )² • where xi is the position of point i in low-dimensional space, and δij is the distance between two points i and j in n dimensions
Multi-Dimensional Scaling Image courtesy of Jing Yang
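A rough sketch of metric MDS with scikit-learn, which minimizes a stress criterion like the one above; the library choice and the random toy data are assumptions, not from the slides:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS

X = np.random.rand(50, 8)            # 50 points in n = 8 dimensions
D = squareform(pdist(X))             # high-dimensional distances delta_ij

# Find 2-D positions x_i whose pairwise distances approximate D
mds = MDS(n_components=2, dissimilarity='precomputed', random_state=0)
X_low = mds.fit_transform(D)         # shape (50, 2)
print(X_low.shape)
```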
Self-Organizing Maps • Pseudo code • Assume input of n rows of m dimensional data • Define some number of nodes (e.g. 40x40 grid) • Give each node m values (vector of size m) • Randomize those values • Loop k number of times: • Select one of the n rows of data as the "input vector" • Find within the 40x40 grid nodes the one most similar to the input vector (call this node the Best Matching Unit – BMU) • Find the neighbors of the BMU on the grid • Update the BMU and its neighbors based on the following equation: • Wv(t+1) = Wv(t) + θ(v,t) · α(t) · (D(t) − Wv(t)) • where θ(v,t) is the Gaussian function of grid distance to the BMU (decays over time) • α(t) is the learning function (decays over time) • D(t) is the input vector, and Wv(t) is the grid node's vector
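A compact NumPy sketch of the loop above; the grid size, decay schedules, and iteration count are illustrative choices (the slides use a 40x40 grid):

```python
import numpy as np

n, m = 500, 3                       # n data rows, m dimensions
data = np.random.rand(n, m)

grid = 10                           # grid x grid map of nodes
W = np.random.rand(grid, grid, m)   # one m-vector per node, randomized

gy, gx = np.mgrid[0:grid, 0:grid]   # grid coordinates of every node

iters = 2000
sigma0, alpha0 = grid / 2.0, 0.5

for t in range(iters):
    frac = t / iters
    sigma = sigma0 * np.exp(-3 * frac)   # neighborhood radius, decays over time
    alpha = alpha0 * np.exp(-3 * frac)   # learning rate alpha(t), decays over time

    D = data[np.random.randint(n)]       # input vector D(t)

    # Best Matching Unit: node whose vector is most similar to D
    by, bx = np.unravel_index(
        np.argmin(np.linalg.norm(W - D, axis=2)), (grid, grid))

    # Gaussian neighborhood theta(v, t) around the BMU on the grid
    theta = np.exp(-((gy - by) ** 2 + (gx - bx) ** 2) / (2 * sigma ** 2))

    # Wv(t+1) = Wv(t) + theta(v, t) * alpha(t) * (D(t) - Wv(t))
    W += (theta * alpha)[:, :, None] * (D - W)

print(W.shape)   # trained map: grid x grid x m
```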
Isomap Image courtesy of Wikipedia: Nonlinear Dimensionality Reduction
Many Others! • To name a few: • Latent Semantic Indexing • Support Vector Machine • Linear Discriminant Analysis (LDA) • Locally Linear Embedding • "manifold learning" • Etc. • Consider the characteristics of the data, and choose the appropriate method. • e.g. are the data labeled? Apply supervised vs. unsupervised methods accordingly.