Large Two-way Arrays Douglas M. Hawkins School of Statistics University of Minnesota doug@stat.umn.edu
What are ‘large’ arrays? • # of rows at least in the hundreds and/or • # of columns at least in the hundreds
Challenges/Opportunities • Logistics of handling data more tedious • Standard graphic methods work less well • More opportunity for assumptions to fail but • Parameter estimates more precise • Fewer model assumptions may be possible
Settings • Microarray data • Proteomics data • Spectral data (fluorescence, absorption…)
Common problems seen • Outliers/Heavy-tailed distributions • Missing data • Large # of variables hurts some methods
The ovarian cancer data • Data set as I have it: • 15154 variables (M/Z values), % relative intensity recorded • 91 controls (clinical normals) • 162 ovarian cancer patients
The normals • Give us an array of 15154 rows, 91 columns. • Qualifies as ‘large’ • Spectrum very ‘busy’
not to mention outlier-prone • Subtracting off a median for each M/Z and making a normal probability plot of the residuals shows the heavy tails
Comparing cases, controls • First pass at a rule to distinguish normal controls from cancer cases: • Calculate two-sample t between groups for each distinct M/Z
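As a rough illustration of this screening step (not code from the talk), the per-M/Z t statistics can be computed row-wise; the array shapes and the Welch variant here are assumptions.

```python
# Hedged sketch: two-sample t for each M/Z (row), cancers vs. controls.
import numpy as np
from scipy import stats

def per_mz_t(controls, cancers):
    """controls: (n_mz, 91) array, cancers: (n_mz, 162) array; returns per-row t and p."""
    # Welch's t computed along the case axis; equal_var=True would give the pooled t.
    return stats.ttest_ind(cancers, controls, axis=1, equal_var=False)

# Simulated stand-in for the 15154-row spectra, just to show the call pattern
rng = np.random.default_rng(0)
t, p = per_mz_t(rng.normal(size=(15154, 91)), rng.normal(size=(15154, 162)))
print("largest |t|:", np.abs(t).max())
```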
Good news / bad news • Several places in spectrum with large separation (t=24 corresponds to around 3 sigma of separation) • Visually seem to be isolated spikes • This is due to large # of narrow peaks
Big differences in mean and variability • These suggest conventional statistical tools: • Linear discriminant analysis • Logistic regression • Quadratic or regularized discriminant analysis • using a selected set of features. Off-the-shelf software doesn’t like 15K variables, but the methods are very do-able.
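A minimal sketch of one such rule, assuming the spectra have been transposed to a cases × M/Z matrix X and that the per-M/Z t statistics above drive the feature selection; the feature count and the choice of regularized logistic regression are illustrative, not the talk's.

```python
# Hedged sketch: keep the strongest-separating M/Z values, then fit a regularized
# logistic regression on just those columns.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_on_selected_features(X, y, t_stats, n_features=20):
    """X: cases x M/Z matrix; y: 0 = control, 1 = cancer; t_stats: per-M/Z t values."""
    top = np.argsort(-np.abs(t_stats))[:n_features]   # largest |t| first
    model = LogisticRegression(max_iter=1000).fit(X[:, top], y)
    return model, top
```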
Return to beginning • Are there useful tools for extracting information from these arrays? • Robust singular value decomposition (RSVD) is one that merits consideration (see our two NISS tech reports)
Singular value approximation • Some philosophy from Bradu (1984) • Write X for the n×p data array. • First remove structure you don’t want to see • The k-term SVD approximation is x_ij = Σ_{t=1..k} r_it c_jt + e_ij
The r_it are ‘row markers’. You could use them as plot positions for the proteins. • The c_jt are ‘column markers’. You could use them as plot positions for the cases. They match their corresponding row markers. • The e_ij are error terms. They should mainly be small.
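To fix the notation, here is a small sketch of the (non-robust) k-term SVD approximation in NumPy; absorbing the singular values into the column markers follows the normalization used later, and the function name is just illustrative.

```python
# Sketch: k-term SVD approximation X ~ r @ c.T with row markers r and column markers c.
import numpy as np

def svd_markers(X, k):
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    r = U[:, :k]              # row markers, unit-length columns
    c = Vt[:k].T * s[:k]      # column markers, singular values absorbed into c
    return r, c
```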
Fitting the SVD • Conventionally done by principal component analysis. • We avoid this for two reasons: • PCA is highly sensitive to outliers • It requires complete data (an issue in many large data sets, if not this one) • The standard approach would also use a 15K × 15K covariance matrix.
Alternating robust fit algorithm • Take trial values for the column markers. Fit the corresponding row markers using robust regression on available data. • Use resulting row markers to refine column markers. • Iterate to convergence. • For robust regression we use least trimmed squares (LTS) regression.
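A minimal sketch of the alternating idea for a one-term fit, under stated assumptions: the LTS step is approximated by a few concentration (C-) steps for a single-coefficient, no-intercept regression, and missing entries are handled by masking NaNs. This is illustrative only, not the authors' implementation.

```python
import numpy as np

def lts_slope(y, x, trim=0.75, n_csteps=5):
    """LTS-style slope for y ~ b*x, using only the best `trim` fraction of points."""
    ok = ~np.isnan(y) & (x != 0)
    y, x = y[ok], x[ok]
    if y.size == 0:
        return 0.0
    h = max(1, int(trim * y.size))
    b = np.median(y / x)                     # robust starting value
    for _ in range(n_csteps):                # concentration steps
        keep = np.argsort((y - b * x) ** 2)[:h]
        b = np.dot(x[keep], y[keep]) / np.dot(x[keep], x[keep])
    return b

def alternating_robust_rank1(X, n_iter=20):
    """One-term robust fit of X (n x p, may contain NaN); returns row/column markers."""
    n, p = X.shape
    c = np.ones(p)                           # trial column markers
    r = np.zeros(n)
    for _ in range(n_iter):
        for i in range(n):                   # refit each row marker given c
            r[i] = lts_slope(X[i, :], c)
        for j in range(p):                   # refit each column marker given r
            c[j] = lts_slope(X[:, j], r)
        c /= max(np.linalg.norm(c), 1e-12)   # fix the scale indeterminacy
    return r, c
```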
Result for the controls • First run, I just removed a grand median. • Plots of the first few row markers show fine structure like that of mean spectrum and of the discriminators
Uses for the RSVD • Instead of feature selection, we can use cases’ c scores as variables in discriminant rules. Can be advantageous in reducing measurement variability and avoids feature selection bias. • Can use as the basis for methods like cluster analysis.
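A hedged sketch of the first use: feed the cases' column-marker scores, rather than selected M/Z features, into an ordinary discriminant rule. sklearn's LDA stands in here for whichever rule is preferred.

```python
# Sketch: discriminant rule on the k column-marker scores instead of 15K raw features.
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def discriminant_on_scores(c_scores, labels):
    """c_scores: cases x k matrix of column markers; labels: 0 = control, 1 = cancer."""
    return LinearDiscriminantAnalysis().fit(c_scores, labels)
```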
Cluster analysis use • Consider methods based on Euclidean distance between cases (k-means / Kohonen follow similar lines)
The first term is the sum of squared differences in column markers, weighted by the squared Euclidean norm of the row markers. • The second term is noise: it adds no information and detracts from performance. • The third term, the cross-product, is approximately zero because of independence.
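In the r, c, e notation of the SVD slide, and assuming the columns of row markers are orthogonal, the decomposition this slide describes can be written as:

```latex
d^2(j,j') \;=\; \sum_i \bigl(x_{ij}-x_{ij'}\bigr)^2
        \;=\; \underbrace{\sum_t \bigl(c_{jt}-c_{j't}\bigr)^2 \sum_i r_{it}^2}_{\text{column markers, weighted}}
        \;+\; \underbrace{\sum_i \bigl(e_{ij}-e_{ij'}\bigr)^2}_{\text{noise}}
        \;+\; \underbrace{2\sum_i\sum_t r_{it}\bigl(c_{jt}-c_{j't}\bigr)\bigl(e_{ij}-e_{ij'}\bigr)}_{\text{cross-product}\;\approx\;0}
```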
This leads to… • The r, c scale is arbitrary; make the column lengths 1, absorbing the eigenvalue into c. • Replace the column Euclidean distance with the squared distance between column markers. This removes the random variability. • Similarly, for k-means/Kohonen, replace the column profile with its SVD approximation.
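A short sketch of the practical upshot, again with illustrative names: cluster the cases on their column-marker scores rather than on the raw 15K-dimensional profiles (a Kohonen map would be substituted along the same lines).

```python
# Sketch: k-means on the column markers instead of the raw spectra.
from sklearn.cluster import KMeans

def cluster_cases(c_scores, n_clusters=3):
    """c_scores: cases x k column-marker matrix from the (R)SVD."""
    return KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(c_scores)
```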
Special case • If a one term SVD suffices, we get an ordination of the rows and columns. • Row ordination doesn’t make much sense for spectral data • Column ordination orders subjects ‘rationally’.
The cancer group • Carried out an RSVD of just the cancer cases • But this time removed a row median first • Corrects for overall abundance at each M/Z • Robust singular values are 2800, 1850, 1200,… • suggesting more than one dimension.
No striking breaks in sequence. • We can cluster, but get more of a partition of a continuum. • Suggests that severity varies smoothly
Back to the two-group setting • An interesting question (suggested by the Mahalanobis-Taguchi strategy) – are the cancer cases all alike? • Can address this by an RSVD of the cancer cases and clustering on the column markers • Or use the controls to get a multivariate metric and place the cancer cases in this metric.
Do a new control RSVD • Subtract row medians. • Get canonical variates for all cases versus just the controls • (Or, as we have plenty of cancer cases, conventionally, of cancer versus controls) • Plot the two groups
Supports earlier comment re lack of big ‘white space’ in the cancer group – a continuum, not distinct subpopulations • Controls look a lot more homogeneous than cancer cases.
Summary • Large arrays – challenge and opportunity. • Hard to visualize or use graphs. • Many data sets show outliers / missing data / very heavy tails. • Robust-fit singular value decomposition can handle these and provides substantial condensation of large data.