Lecture 06: Data Transform I September 23, 2010 COMP 150-12 Topics in Visual Analytics
Lecture Outline • Data Retrieval • Methods for increasing retrieval speed: • Pre-computation • Pre-fetching and Caching • Levels of Detail (LOD) • Hardware support • Data transform (pre-processing) • Aggregate (clustering) • Sampling (sub-sampling, re-sampling) • Simplification (dimension reduction) • Appropriate representation (finding underlying mathematical representation)
Problem Statement • All tricks from lecture 5 have been implemented. • However, the sheer size of the data is so large that those tricks alone still cannot keep the system interactive (frame rates fall below 0.1 fps). • Example: all search queries at Google, all transactions at Bank of America.
General Concept • If the data size is truly too large, we can find ways to trim the data: • By reducing the number of rows • Subsampling • Clustering • By reducing the number of columns • Dimension reduction • Fit an underlying representation (linear or non-linear)
Challenge • How do we maintain the general “characteristics” of the original data? • How much can be trimmed? • Does analysis based on the trimmed data still apply to the original raw data?
Disclaimer • Many of these methods are related to or based on machine learning. • Often referred to as “automated analysis” • As opposed to “interactive analysis”
Keim’s visual analytics model • [Figure: Keim’s visual analytics model, showing pre-processing of the input data and user interactions feeding back into each stage] • Image source: Visual Analytics: Definition, Process, and Challenges, Keim et al., LNCS vol. 4950, 2008
Dirty Data • Missing values, or data with uncertainty • Discard bad records • Assign a sentinel value (e.g. -1) • Assign the average value • Assign value based on nearest neighbors • Matrix completion problem • e.g. assuming a low rank matrix
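As a minimal sketch (NumPy, with a made-up 3×3 matrix), here are two of the simpler strategies above: a sentinel value versus the column average.

```python
import numpy as np

# Toy data matrix: rows are records, columns are variables.
# np.nan marks the missing observations (values are invented).
X = np.array([[1.0, 2.0,    np.nan],
              [4.0, np.nan, 6.0],
              [7.0, 8.0,    9.0]])

# Strategy 1: assign a sentinel value (e.g. -1).
X_sentinel = np.where(np.isnan(X), -1.0, X)

# Strategy 2: assign the average value of each column.
col_means = np.nanmean(X, axis=0)         # per-column mean, ignoring NaNs
rows, cols = np.where(np.isnan(X))
X_mean = X.copy()
X_mean[rows, cols] = col_means[cols]      # fill each hole with its column mean

print(X_mean)
```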
From Lecture 3: Data Definition • A typical dataset in visualization consists of n records • (r1, r2, r3, …, rn) • Each record ri consists of m (m ≥ 1) observations or variables • (v1, v2, v3, …, vm) • A variable may be either independent or dependent • An independent variable (iv) is not controlled or affected by another variable • For example, time in a time-series dataset • A dependent variable (dv) is affected by variation in one or more associated independent variables • For example, temperature in a region • Formal definition: • ri = (iv1, iv2, …, ivmi, dv1, dv2, …, dvmd) • where m = mi + md (mi independent plus md dependent variables)
Rank vs. Dimensionality • How many dimensions are in your data? • What is its true rank? • Example: the pig-chewing animation
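To make the distinction concrete, here is a hedged sketch with synthetic data (sizes and seed invented): a matrix with 20 observed dimensions built from only 3 underlying sources, whose true rank NumPy recovers.

```python
import numpy as np

rng = np.random.default_rng(0)

# 1000 records, 20 observed variables, but only 3 independent sources:
sources = rng.normal(size=(1000, 3))   # the true, rank-3 structure
mixing = rng.normal(size=(3, 20))      # each observed column mixes the 3 sources
X = sources @ mixing                   # 1000 x 20: dimensionality 20, rank 3

print(np.linalg.matrix_rank(X))        # -> 3
# The singular values tell the same story: 3 large ones, the rest near zero.
print(np.linalg.svd(X, compute_uv=False)[:5])
```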
Example • Adobe Photoshop Content-Aware Fill • http://www.youtube.com/watch?v=NH0aEp1oDOI • Netflix challenge
Aggregation / Clustering • Very much related to LOD and its supporting structures. • The idea is to “group” similar data items together
Clustering Algorithms • There are numerous clustering algorithms out there… • Here we look at two popular ones • K-means • Agglomerative hierarchical • Clustering always needs a distance function
K-Means • Inputs: • K: number of clusters • distance function: d(xi, xj) • Steps: (1) pick K initial cluster centers (e.g., K random records); (2) assign each record to its nearest center; (3) recompute each center as the mean of its assigned records; (4) repeat steps (2) and (3) until the assignments stop changing
K-Means • http://www.youtube.com/watch?v=74rv4snLl70 • Notes about k-means: • Convergence could be slow (but it’s guaranteed to converge!) • Need to specify k • Adaptive k-means • Lots of variations
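A minimal NumPy sketch of the loop above (random initialization, Euclidean distance; a fuller implementation would use smarter seeding):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # (1) Initialize: pick k random records as the starting centers.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # (2) Assign each record to its nearest center (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # (3) Recompute each center as the mean of its assigned records
        #     (keeping the old center if a cluster happens to empty out).
        new_centroids = np.array([X[labels == j].mean(axis=0)
                                  if np.any(labels == j) else centroids[j]
                                  for j in range(k)])
        # (4) Stop once the centers (and hence assignments) no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```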
Agglomerative Hierarchical Clustering • Input: distance function: d(xi, xj) • Idea: start with every record in its own cluster, then repeatedly merge the two closest clusters until one cluster (or a desired number of clusters) remains
Variations • Agglomerative: a bottom-up approach • Divisive: a top-down approach • Linkage of two clusters A and B: • Complete Link: d(A, B) = max d(a, b) over a ∈ A, b ∈ B • Single Link: d(A, B) = min d(a, b) over a ∈ A, b ∈ B • Average Link: d(A, B) = mean of d(a, b) over all pairs a ∈ A, b ∈ B
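All three linkage rules are available in SciPy's hierarchical-clustering module; a short sketch on synthetic blobs (the data and the cluster count of 2 are made up for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)),    # one blob around (0, 0)
               rng.normal(5, 1, (20, 2))])   # another blob around (5, 5)

for method in ("single", "complete", "average"):
    Z = linkage(X, method=method)            # bottom-up (agglomerative) merge tree
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
    print(method, np.bincount(labels)[1:])   # cluster sizes under each linkage
```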
Examples and Intuitions • Shape of single-link vs. shape of complete-link clusters • What happens with single link? Only the closest pair of points matters, so clusters can “chain” into long, straggly shapes • What happens with complete link? The farthest pair matters, so clusters stay compact and roughly spherical
Sampling • Challenge • Can we find a smaller sample n' drawn from the original population n such that n' exhibits the same (or similar) characteristics as n? • Two approaches: • Re-sampling • Sub-sampling
Re-Sampling • Given the original data, create a new (smaller) dataset that replaces the original • Image Processing: • Linear Interpolation • Bilinear Interpolation • Nonlinear (cubic) Interpolation
Linear Interpolation • Example: samples 10, 2, 15, 8 at x = 0.0, 0.3333, 0.6666, 1.0 are re-sampled at the new positions x = 0.0, 0.25, 0.5, 0.75, 1.0, with each unknown value ?? filled in from its neighbors • Each new value is a weighted average of the two surrounding samples: f(x) = (1 − t)·f(x0) + t·f(x1), where t = (x − x0) / (x1 − x0)
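A sketch of this re-sampling step with np.interp, using the sample values from the example above:

```python
import numpy as np

old_x = np.array([0.0, 1/3, 2/3, 1.0])
old_v = np.array([10.0, 2.0, 15.0, 8.0])       # original samples from the slide

new_x = np.array([0.0, 0.25, 0.5, 0.75, 1.0])  # new sample positions
new_v = np.interp(new_x, old_x, old_v)         # blend the two nearest neighbors

print(new_v)  # e.g. at x = 0.5, halfway between 2 and 15 -> 8.5
```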
Bilinear Interpolation • Similar to linear interpolation, but on a 2-D grid: each new sampled value is a weighted average of the surrounding 4 vertices • Example: a small grid of known values (10, 2, 15, 8, 4, 25, 9, 18) with new samples ?? to be filled in between them
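A hand-rolled bilinear lookup as a minimal sketch; the 2×4 grid below reuses the numbers from the slide's example, but the exact layout is assumed.

```python
import numpy as np

# 2 x 4 grid of known values (layout assumed for illustration).
grid = np.array([[10.0,  2.0, 15.0,  8.0],
                 [ 4.0, 25.0,  9.0, 18.0]])

def bilinear(grid, r, c):
    """Sample grid at fractional (row, col) by blending the 4 surrounding vertices."""
    r0, c0 = int(np.floor(r)), int(np.floor(c))
    r1 = min(r0 + 1, grid.shape[0] - 1)
    c1 = min(c0 + 1, grid.shape[1] - 1)
    tr, tc = r - r0, c - c0
    top = (1 - tc) * grid[r0, c0] + tc * grid[r0, c1]  # blend along the top edge
    bot = (1 - tc) * grid[r1, c0] + tc * grid[r1, c1]  # blend along the bottom edge
    return (1 - tr) * top + tr * bot                   # then blend vertically

print(bilinear(grid, 0.5, 1.5))  # midpoint of 2, 15, 25, 9 -> 12.75
```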
Sub-Sampling • Use random sampling • Simple random sampling • Systematic sampling • Etc. • The key point is that each element must have an equal, non-zero chance of being selected • e.g. picking one individual per household gives members of large households a smaller chance of selection • Remember that there could still be potential sampling error
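A sketch of simple random vs. systematic sampling in NumPy (the population and sample sizes are invented):

```python
import numpy as np

rng = np.random.default_rng(0)
population = np.arange(1_000_000)     # stand-in for 1M records

# Simple random sampling: every record has an equal chance of selection.
simple = rng.choice(population, size=1_000, replace=False)

# Systematic sampling: a random start, then every k-th record.
k = len(population) // 1_000
start = rng.integers(k)
systematic = population[start::k][:1_000]
```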
Sub-Sampling • If we assume that the population follows a normal distribution • Further assume that the variability of the population is known (as measured by standard deviation σ) • Then the standard error of the sample mean is given by: SE = σ / √n • (where n = sampling size)
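The formula can be sanity-checked empirically: draw many samples of size n and compare the spread of their sample means against σ/√n. A small sketch (σ, n, and the number of trials are invented):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, n = 2.0, 100

# Draw 10,000 samples of size n and record each sample mean.
means = rng.normal(loc=0.0, scale=sigma, size=(10_000, n)).mean(axis=1)

print(means.std())         # empirical spread of the sample means, ~0.2
print(sigma / np.sqrt(n))  # predicted standard error sigma/sqrt(n) = 0.2
```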