Large Two-way Arrays Douglas M. Hawkins School of Statistics University of Minnesota doug@stat.umn.edu
What are ‘large’ arrays? • # of rows at least in the hundreds and/or • # of columns at least in the hundreds
Challenges/Opportunities • Logistics of handling data more tedious • Standard graphic methods work less well • More opportunity for assumptions to fail but • Parameter estimates more precise • Fewer model assumptions may be possible
Settings • Microarray data • Proteomics data • Spectral data (fluorescence, absorption…)
Common problems seen • Outliers/Heavy-tailed distributions • Missing data • Large # of variables hurts some methods
The ovarian cancer data • Data set as I have it: • 15154 variables (M/Z values), % relative intensity recorded • 91 controls (clinical normals) • 162 ovarian cancer patients
The normals • Give us an array of 15154 rows, 91 columns. • Qualifies as ‘large’ • Spectrum very ‘busy’
not to mention outlier-prone • Subtracting off a median for each M/Z and making a normal probability plot of the residuals shows the heavy tails
Comparing cases, controls • First pass at a rule to distinguish normal controls from cancer cases: • Calculate two-sample t between groups for each distinct M/Z
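As a rough illustration of this screening step (not code from the talk), the per-M/Z t statistics can be computed row-wise; the array shapes and the Welch variant here are assumptions.

```python
# Hedged sketch: two-sample t for each M/Z (row), cancers vs. controls.
import numpy as np
from scipy import stats

def per_mz_t(controls, cancers):
    """controls: (n_mz, 91) array, cancers: (n_mz, 162) array; returns per-row t and p."""
    # Welch's t computed along the case axis; equal_var=True would give the pooled t.
    return stats.ttest_ind(cancers, controls, axis=1, equal_var=False)

# Simulated stand-in for the 15154-row spectra, just to show the call pattern
rng = np.random.default_rng(0)
t, p = per_mz_t(rng.normal(size=(15154, 91)), rng.normal(size=(15154, 162)))
print("largest |t|:", np.abs(t).max())
```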
Good news / bad news • Several places in spectrum with large separation (t=24 corresponds to around 3 sigma of separation) • Visually seem to be isolated spikes • This is due to large # of narrow peaks
Big differences in mean and variability • These suggest conventional statistical tools: • Linear discriminant analysis • Logistic regression • Quadratic or regularized discriminant analysis • using a selected set of features. Off-the-shelf software doesn’t like 15K variables, but the methods are very do-able.
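A minimal sketch of one such rule, assuming the spectra have been transposed to a cases × M/Z matrix X and that the per-M/Z t statistics above drive the feature selection; the feature count and the choice of regularized logistic regression are illustrative, not the talk's.

```python
# Hedged sketch: keep the strongest-separating M/Z values, then fit a regularized
# logistic regression on just those columns.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_on_selected_features(X, y, t_stats, n_features=20):
    """X: cases x M/Z matrix; y: 0 = control, 1 = cancer; t_stats: per-M/Z t values."""
    top = np.argsort(-np.abs(t_stats))[:n_features]   # largest |t| first
    model = LogisticRegression(max_iter=1000).fit(X[:, top], y)
    return model, top
```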
Return to beginning • Are there useful tools for extracting information from these arrays? • Robust singular value decomposition (RSVD) is one that merits consideration (see our two NISS tech reports)
Singular value approximation • Some philosophy from Bradu (1984) • Write X for the n×p data array. • First remove structure you don’t want to see • The k-term SVD approximation is x_ij = Σ_{t=1..k} r_it c_jt + e_ij
The r_it are ‘row markers’. You could use them as plot positions for the proteins. • The c_jt are ‘column markers’. You could use them as plot positions for the cases. They match their corresponding row markers. • The e_ij are error terms. They should mainly be small.
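To fix the notation, here is a small sketch of the (non-robust) k-term SVD approximation in NumPy; absorbing the singular values into the column markers follows the normalization used later, and the function name is just illustrative.

```python
# Sketch: k-term SVD approximation X ~ r @ c.T with row markers r and column markers c.
import numpy as np

def svd_markers(X, k):
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    r = U[:, :k]              # row markers, unit-length columns
    c = Vt[:k].T * s[:k]      # column markers, singular values absorbed into c
    return r, c
```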
Fitting the SVD • Conventionally done by principal component analysis. • We avoid this for two reasons: • PCA is highly sensitive to outliers • It requires complete data (an issue in many large data sets, if not this one) • The standard approach would also use a 15K × 15K covariance matrix.
Alternating robust fit algorithm • Take trial values for the column markers. Fit the corresponding row markers using robust regression on available data. • Use resulting row markers to refine column markers. • Iterate to convergence. • For robust regression we use least trimmed squares (LTS) regression.
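A minimal sketch of the alternating idea for a one-term fit, under stated assumptions: the LTS step is approximated by a few concentration (C-) steps for a single-coefficient, no-intercept regression, and missing entries are handled by masking NaNs. This is illustrative only, not the authors' implementation.

```python
import numpy as np

def lts_slope(y, x, trim=0.75, n_csteps=5):
    """LTS-style slope for y ~ b*x, using only the best `trim` fraction of points."""
    ok = ~np.isnan(y) & (x != 0)
    y, x = y[ok], x[ok]
    if y.size == 0:
        return 0.0
    h = max(1, int(trim * y.size))
    b = np.median(y / x)                     # robust starting value
    for _ in range(n_csteps):                # concentration steps
        keep = np.argsort((y - b * x) ** 2)[:h]
        b = np.dot(x[keep], y[keep]) / np.dot(x[keep], x[keep])
    return b

def alternating_robust_rank1(X, n_iter=20):
    """One-term robust fit of X (n x p, may contain NaN); returns row/column markers."""
    n, p = X.shape
    c = np.ones(p)                           # trial column markers
    r = np.zeros(n)
    for _ in range(n_iter):
        for i in range(n):                   # refit each row marker given c
            r[i] = lts_slope(X[i, :], c)
        for j in range(p):                   # refit each column marker given r
            c[j] = lts_slope(X[:, j], r)
        c /= max(np.linalg.norm(c), 1e-12)   # fix the scale indeterminacy
    return r, c
```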
Result for the controls • First run, I just removed a grand median. • Plots of the first few row markers show fine structure like that of mean spectrum and of the discriminators
Uses for the RSVD • Instead of feature selection, we can use cases’ c scores as variables in discriminant rules. Can be advantageous in reducing measurement variability and avoids feature selection bias. • Can use as the basis for methods like cluster analysis.
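A hedged sketch of the first use: feed the cases' column-marker scores, rather than selected M/Z features, into an ordinary discriminant rule. sklearn's LDA stands in here for whichever rule is preferred.

```python
# Sketch: discriminant rule on the k column-marker scores instead of 15K raw features.
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def discriminant_on_scores(c_scores, labels):
    """c_scores: cases x k matrix of column markers; labels: 0 = control, 1 = cancer."""
    return LinearDiscriminantAnalysis().fit(c_scores, labels)
```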
Cluster analysis use • Consider methods based on Euclidean distance between cases (k-means / Kohonen follow similar lines)
The first term is the sum of squared differences in column markers, weighted by the squared Euclidean norm of the row markers. • The second term is noise: it adds no information and detracts from performance. • The third term, the cross-product, is approximately zero because of independence.
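In the r, c, e notation of the SVD slide, and assuming the columns of row markers are orthogonal, the decomposition this slide describes can be written as:

```latex
d^2(j,j') \;=\; \sum_i \bigl(x_{ij}-x_{ij'}\bigr)^2
        \;=\; \underbrace{\sum_t \bigl(c_{jt}-c_{j't}\bigr)^2 \sum_i r_{it}^2}_{\text{column markers, weighted}}
        \;+\; \underbrace{\sum_i \bigl(e_{ij}-e_{ij'}\bigr)^2}_{\text{noise}}
        \;+\; \underbrace{2\sum_i\sum_t r_{it}\bigl(c_{jt}-c_{j't}\bigr)\bigl(e_{ij}-e_{ij'}\bigr)}_{\text{cross-product}\;\approx\;0}
```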
This leads to… • The r, c scale is arbitrary; make the column lengths 1, absorbing the eigenvalue into c. • Replace the column Euclidean distance with the squared distance between column markers. This removes the random variability. • Similarly, for k-means/Kohonen, replace the column profile with its SVD approximation.
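A short sketch of the practical upshot, again with illustrative names: cluster the cases on their column-marker scores rather than on the raw 15K-dimensional profiles (a Kohonen map would be substituted along the same lines).

```python
# Sketch: k-means on the column markers instead of the raw spectra.
from sklearn.cluster import KMeans

def cluster_cases(c_scores, n_clusters=3):
    """c_scores: cases x k column-marker matrix from the (R)SVD."""
    return KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(c_scores)
```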
Special case • If a one term SVD suffices, we get an ordination of the rows and columns. • Row ordination doesn’t make much sense for spectral data • Column ordination orders subjects ‘rationally’.
The cancer group • Carried out an RSVD of just the cancer cases • But this time removed a row median first • Corrects for overall abundance at each M/Z • Robust singular values are 2800, 1850, 1200,… • suggesting more than one dimension.
No striking breaks in sequence. • We can cluster, but get more of a partition of a continuum. • Suggests that severity varies smoothly
Back to the two-group setting • An interesting question (suggested by the Mahalanobis-Taguchi strategy) – are the cancer cases all alike? • Can address this by an RSVD of the cancer cases and clustering on the column markers • Or use the controls to get a multivariate metric and place the cancer cases in this metric.
Do a new control RSVD • Subtract row medians. • Get canonical variates for all cases versus just the controls • (Or, as we have plenty of cancer cases, conventionally, of cancer versus controls) • Plot the two groups
Supports earlier comment re lack of big ‘white space’ in the cancer group – a continuum, not distinct subpopulations • Controls look a lot more homogeneous than cancer cases.
Summary • Large arrays – challenge and opportunity. • Hard to visualize or use graphs. • Many data sets show outliers / missing data / very heavy tails. • Robust-fit singular value decomposition can handle these and provides substantial condensation of large data.