Data & Vis Basics: Visualization Pipeline
Data Definition
• A typical dataset in visualization consists of n records: (r1, r2, r3, …, rn)
• Each record ri consists of m (m >= 1) observations or variables: (v1, v2, v3, …, vm)
• A variable may be either independent or dependent
  • An independent variable (iv) is not controlled or affected by another variable
    • For example, time in a time-series dataset
  • A dependent variable (dv) is affected by variation in one or more associated independent variables
    • For example, temperature in a region
• Formal definition: ri = (iv1, iv2, iv3, …, ivmi, dv1, dv2, dv3, …, dvmd), where m = mi + md
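To make the record definition concrete, here is a minimal Python sketch (my own example, not from the slides) of a tiny time-series dataset: time is the independent variable and temperature the dependent one, so each record is ri = (iv1, dv1) and m = mi + md = 2.

```python
# Hypothetical example: a time-series dataset of temperature readings.
from collections import namedtuple

Record = namedtuple("Record", ["time", "temperature"])  # (iv1, dv1)

dataset = [                      # n = 3 records
    Record(time=0, temperature=21.5),
    Record(time=1, temperature=22.1),
    Record(time=2, temperature=21.8),
]

n = len(dataset)                 # number of records
m = len(Record._fields)          # variables per record (mi + md)
print(n, m)                      # -> 3 2
```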
Basic Data Types: Nominal, Ordinal, Scale / Quantitative (Interval, Ratio)

Nominal
• Def: a set of unordered, non-numeric values
• For example:
  • Categorical (finite) data: {apple, orange, pear}, {red, green, blue}
  • Arbitrary (infinite) data: {“12 Main St. Boston MA”, “45 Wall St. New York NY”, …}, {“John Smith”, “Jane Doe”, …}
Ordinal
• Def: a tuple (an ordered set)
• For example:
  • Numeric: <2, 4, 6, 8>
  • Binary: <0, 1>
  • Non-numeric: <G, PG, PG-13, R>
Scale / Quantitative
• Def: a numeric range
• Interval: ordered numeric elements on a scale that can be mathematically manipulated, but cannot be compared as ratios
  • For example: date, current time (Sept 14, 2010 cannot be described as a ratio of Jan 1, 2011)
• Ratio: a numeric range with an “absolute zero”
  • For example: height, weight
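As a small, hedged illustration (my own example; the column names and values are made up), the pandas Categorical type can represent nominal and ordinal data, while plain floats cover ratio data:

```python
import pandas as pd

df = pd.DataFrame({
    # nominal: unordered categories
    "fruit":  pd.Categorical(["apple", "orange", "pear"]),
    # ordinal: ordered categories (MPAA ratings)
    "rating": pd.Categorical(["PG", "R", "G"],
                             categories=["G", "PG", "PG-13", "R"],
                             ordered=True),
    # ratio: numeric values with an absolute zero
    "weight": [150.0, 180.0, 170.0],
})

print(df["rating"].min())                  # ordinal values can be compared -> 'G'
print(df["weight"] / df["weight"].sum())   # ratio values can form meaningful ratios
```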
Dimensionality
• Scalar: a single value
• Vector: a collection of scalars
• Matrix: a 2-dimensional array
• Tensor: a collection of matrices

Dimensionality (Programming)
• Scalar: a 0-dimensional array
• Vector: a 1-dimensional array
• Matrix: a 2-dimensional array
• Tensor: a 3-or-more-dimensional array

Dimensionality (Technically)
• Scalar: a 0th-order tensor
• Vector: a 1st-order tensor
• Matrix: a 2nd-order tensor
• Tensor: an nth-order tensor
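A quick numpy sketch (assumed example, not from the slides) of the same four levels as n-dimensional arrays / tensor orders:

```python
import numpy as np

scalar = np.array(3.0)                  # 0-d array, 0th-order tensor
vector = np.array([1.0, 2.0, 3.0])      # 1-d array, 1st-order tensor
matrix = np.ones((2, 3))                # 2-d array, 2nd-order tensor
tensor = np.zeros((2, 3, 4))            # 3-d array, 3rd-order tensor

for a in (scalar, vector, matrix, tensor):
    print(a.ndim, a.shape)
```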
General Concept
• If the data size is truly too large, we can find ways to trim the data:
  • By reducing the number of rows
    • Subsampling
    • Clustering
  • By reducing the number of columns
    • Dimension reduction
    • Fitting an underlying representation (linear or non-linear)
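A rough sketch of both row-trimming strategies, assuming scikit-learn is available; the data, sample size, and cluster count below are arbitrary:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
data = rng.normal(size=(10_000, 10))     # a "large" table: 10k rows, 10 columns

# (a) Reduce the number of rows by random subsampling.
sample = data[rng.choice(len(data), size=1_000, replace=False)]

# (b) Reduce the number of rows by clustering: keep one centroid per cluster.
kmeans = KMeans(n_clusters=20, n_init=10, random_state=0).fit(data)
centroids = kmeans.cluster_centers_

print(sample.shape, centroids.shape)     # (1000, 10) (20, 10)
```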
Challenge
• How do we maintain the general “characteristics” of the original data?
• How much can be trimmed?
• Does analysis based on the trimmed data still apply to the original raw data?
Keim’s visual analytics model. Image source: Visual Analytics: Definition, Process, and Challenges, Keim et al., LNCS vol. 4950, 2008.
Dirty Data
• Missing values, or data with uncertainty
• Options for handling them:
  • Discard bad records
  • Assign a sentinel value (e.g., -1)
  • Assign the average value
  • Assign a value based on nearest neighbors
  • Treat it as a matrix completion problem (e.g., assuming a low-rank matrix)
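A hedged sketch (not from the slides) of two of these strategies, mean imputation and nearest-neighbor imputation, using scikit-learn's imputers on a made-up array:

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [4.0, np.nan],
              [5.0, 6.0]])

mean_filled = SimpleImputer(strategy="mean").fit_transform(X)  # column averages
knn_filled = KNNImputer(n_neighbors=2).fit_transform(X)        # nearest neighbors

print(mean_filled)
print(knn_filled)
```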
8 Visual Variables • Position • Mark • Size • Brightness • Color • Orientation • Texture • Motion
Jacques Bertin “Semiology of Graphics” [1967]
Jacques Bertin. Mark types: point, line, area. Visual channels: position, size, grayscale value, texture, color, orientation, shape.
Why Dimension Reduction?
• Computation: complexity grows exponentially with the dimension.
• Visualization: projection of high-dimensional data to 2D or 3D.
• Interpretation: the intrinsic dimension may be small.
Dimension Reduction • Lots of possibilities, but can be roughly categorized into two groups: • Linear dimension reduction • Non-linear dimension reduction • Related to machine learning…
Principal Components Analysis (PCA): approximating a high-dimensional data set with a lower-dimensional linear subspace. [Figure: a scatter of data points with the original axes and the first and second principal components.]
Imagine a two-dimensional scatter of points (x, y) that shows a high degree of correlation; fitting the principal axes is akin to orthogonal regression. [Figure: the scatter with mean-centered axes x - x̄ and y - ȳ.]
Why bother?
• A more “efficient” description:
  • The 1st variable captures the maximum variance
  • The 2nd variable captures the maximum amount of residual variance, at right angles (orthogonal) to the first
• The 1st variable may capture so much of the information content in the original data set that we can ignore the remaining axis
Principal Components Analysis (PCA)
• Why: clarify relationships among variables; clarify relationships among cases
• When: significant correlations exist among variables
• How: define new axes (components); examine the correlation between axes and variables; find scores of cases on the new axes
Philosophy of PCA
• PCA is concerned with explaining the variance-covariance structure of a set of variables through a few linear combinations.
• We typically have a data matrix of n observations on p correlated variables x1, x2, …, xp.
• PCA looks for a transformation of the xi into p new variables yi that are uncorrelated.
• We want to represent x1, x2, …, xp with a few of the yi without losing much information.
PCA
• We look for a transformation of the data matrix X (n x p) such that
  Y = aᵀX = a1X1 + a2X2 + … + apXp,
  where a = (a1, a2, …, ap)ᵀ is a column vector of weights with a1² + a2² + … + ap² = 1.
• Maximize the variance of the projection of the observations onto the Y variable: find a so that Var(aᵀX) = aᵀ Var(X) a is maximal.
• Var(X) is the covariance matrix of the Xi variables.
PCA gives
• New variables Yi that are linear combinations of the original variables xi:
  Yi = ei1x1 + ei2x2 + … + eipxp, for i = 1, …, p
• The new variables Yi are derived in decreasing order of importance; they are called “principal components”.
Principal Component Analysis
• Pseudocode:
  • Arrange the data so that each column is a dimension and each row is a data entry (an n x m matrix, n = rows, m = cols)
  • Subtract each dimension’s mean from its values
  • Compute the covariance matrix M (m x m)
  • Compute the eigenvectors and eigenvalues of M
    • e.g., via singular value decomposition (SVD): M = U S Vᵀ, where U and V are m x m orthogonal matrices and S is a diagonal matrix of non-negative real numbers
  • Sort the eigenvectors (columns of V) by their associated eigenvalues (diagonal of S), from highest to lowest
  • Project the original data onto the first few (highest-eigenvalue) eigenvectors
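The pseudocode translates almost line-for-line into numpy; the sketch below is my own minimal version (using an eigendecomposition of the covariance matrix rather than an explicit SVD), with an arbitrary toy dataset and k = 2:

```python
import numpy as np

def pca(data, k=2):
    # data: n x m array, one row per record, one column per dimension
    centered = data - data.mean(axis=0)          # subtract each column's mean
    cov = np.cov(centered, rowvar=False)         # m x m covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)       # symmetric -> real eigenpairs
    order = np.argsort(eigvals)[::-1]            # highest eigenvalue first
    components = eigvecs[:, order[:k]]           # top-k principal directions
    return centered @ components                 # n x k projected data

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
print(pca(X, k=2).shape)                         # -> (200, 2)
```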
Multidimensional scaling (MDS)
Suppose we are given the distance structure of the following 10 cities, and we have no knowledge of the city locations or a map of the US. Can we map these cities to a 2D space that best presents their distance structure?
Multidimensional scaling (MDS)
MDS deals with the following problem: for a set of observed similarities (or distances) between every pair of N items, find a representation of the items in a few dimensions such that the inter-item proximities “nearly match” the original similarities (or distances). The numerical measure of the mismatch between the original distances and the distances in the lower-dimensional representation is called the stress.
MDS Mapping to 3D is possible but more difficult to visualize and interpret.
MDS
• MDS attempts to map objects to a visible 2D or 3D Euclidean space. The goal is to preserve the distance structure as well as possible after the mapping.
• The original data can live in a high-dimensional or even non-metric space; the method only depends on the distance (dissimilarity) structure.
• The resulting mapping is not unique: any rotation or reflection of a mapping solution is also a solution.
• It can be shown that the results of PCA are exactly those of classical MDS when the distances calculated from the data matrix are Euclidean.
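For reference, a compact sketch of classical (metric) MDS, assumed rather than taken from the slides: double-center the squared distance matrix and take the top eigenvectors. The toy distance matrix is made up:

```python
import numpy as np

def classical_mds(D, k=2):
    # D: n x n matrix of pairwise distances
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                  # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:k]        # largest eigenvalues first
    L = np.sqrt(np.maximum(eigvals[order], 0))   # clip tiny negative eigenvalues
    return eigvecs[:, order] * L                 # n x k coordinates

# Example: three points on a line, 1 and 2 units apart.
D = np.array([[0., 1., 3.],
              [1., 0., 2.],
              [3., 2., 0.]])
print(classical_mds(D, k=2))
```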
Self-Organizing Maps
• Pseudocode:
  • Assume an input of n rows of m-dimensional data
  • Define some number of nodes (e.g., a 40x40 grid)
  • Give each node m values (a weight vector of size m), and randomize those values
  • Loop k times:
    • Select one of the n rows of data as the “input vector” D(t)
    • Find, among the 40x40 grid nodes, the one whose weight vector is most similar to the input vector (call this node the Best Matching Unit, or BMU)
    • Find the neighbors of the BMU on the grid
    • Update the BMU and its neighbors: Wv(t+1) = Wv(t) + θ(t) α(t) (D(t) - Wv(t))
      • where θ(t) is the Gaussian neighborhood function of grid distance to the BMU (decays over time)
      • α(t) is the learning rate (decays over time)
      • D(t) is the input vector, and Wv(t) is grid node v’s weight vector
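A minimal numpy sketch of the pseudocode above (my own; the grid size, iteration count, and decay schedules are arbitrary choices):

```python
import numpy as np

def train_som(data, grid=(10, 10), iters=1000, lr0=0.5, sigma0=3.0, seed=0):
    rng = np.random.default_rng(seed)
    n, m = data.shape
    weights = rng.random((grid[0], grid[1], m))               # one m-vector per node
    gy, gx = np.mgrid[0:grid[0], 0:grid[1]]                   # node grid coordinates
    for t in range(iters):
        frac = t / iters
        lr = lr0 * (1 - frac)                                 # learning rate decays
        sigma = sigma0 * (1 - frac) + 1e-3                    # neighborhood shrinks
        x = data[rng.integers(n)]                             # pick an input vector
        dist = np.linalg.norm(weights - x, axis=2)            # distance to every node
        by, bx = np.unravel_index(dist.argmin(), grid)        # best matching unit (BMU)
        grid_d2 = (gy - by) ** 2 + (gx - bx) ** 2             # grid distance to BMU
        theta = np.exp(-grid_d2 / (2 * sigma ** 2))           # Gaussian neighborhood
        weights += lr * theta[..., None] * (x - weights)      # update BMU + neighbors
    return weights

weights = train_som(np.random.default_rng(1).random((500, 3)))
print(weights.shape)   # -> (10, 10, 3)
```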
Isomap Image courtesy of Wikipedia: Nonlinear Dimensionality Reduction
Many Others! • To name a few: • Latent Semantic Indexing • Support Vector Machine • Linear Discriminant Analysis (LDA) • Locally Linear Embedding • “manifold learning” • Etc.