A Robust Approach for Dealing with Missing Values in Compositional Data. Karel Hron, Matthias Templ, Peter Filzmoser. ICORS ’08, Antalya, 8. 9. 2008.
Compositional data (CoDa) • ... D-part composition • a composition and any positive multiple of it contain essentially the same information • simplex: sample space of D-part compositions • compositions are (D-1)-dimensional
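As a small illustration of why a composition and a positive multiple of it carry the same information, the sketch below rescales both to a constant sum, which places them at the same point of the simplex. The helper name `closure` and the example values are assumptions made for illustration; they do not come from the slides.

```python
import numpy as np

def closure(x, total=1.0):
    """Rescale strictly positive parts so that they sum to `total`."""
    x = np.asarray(x, dtype=float)
    if np.any(x <= 0):
        raise ValueError("all parts must be strictly positive")
    return total * x / x.sum()

x = np.array([2.0, 3.0, 5.0])   # hypothetical 3-part composition
print(closure(x))               # proportions of the three parts
print(closure(10 * x))          # same proportions after multiplying by c = 10
```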
Standard statistics and CoDa • standard statistical methods (such as correlation analysis and PCA) run into difficulties when applied to CoDa • the results can be completely useless • reason: the sample space of CoDa, the simplex, induces a different geometrical structure (Aitchison geometry) • solution: a family of logratio transformations from the simplex to real space (Aitchison, 1986) • when CoDa contain missing values, these transformations allow for a reasonable imputation
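Isometric logratio transformations • ilr for short (Egozcue et al., 2003); they map the simplex to (D-1)-dimensional real space • the transformed data are regular (their covariance matrix is non-singular), which is necessary for robust statistical methods • isometry: Aitchison distances between compositions equal Euclidean distances between their ilr coordinates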
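The isometry property can be checked numerically. The sketch below uses one particular ilr basis (often called pivot coordinates); the slides do not show which basis the authors used, so this choice, the function names and the example compositions are all assumptions.

```python
import numpy as np

def ilr_pivot(x):
    """One possible ilr transformation: coordinate i contrasts part i
    with the geometric mean of the parts that follow it."""
    logx = np.log(np.asarray(x, dtype=float))
    D = logx.shape[-1]
    z = np.empty(D - 1)
    for i in range(D - 1):
        m = D - 1 - i                      # number of parts after part i
        z[i] = np.sqrt(m / (m + 1)) * (logx[i] - logx[i + 1:].mean())
    return z

def aitchison_distance(x, y):
    """Euclidean distance between centred logratio (clr) coefficients."""
    d = np.log(np.asarray(x, dtype=float)) - np.log(np.asarray(y, dtype=float))
    return np.linalg.norm(d - d.mean())

x = np.array([0.2, 0.3, 0.5])              # hypothetical compositions
y = np.array([0.6, 0.1, 0.3])
print(aitchison_distance(x, y))
print(np.linalg.norm(ilr_pivot(x) - ilr_pivot(y)))   # agrees up to rounding
```

Any other orthonormal basis of the simplex would give different coordinates but the same distances.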
Ilr and balances • ilr coordinates (balances) cannot be interpreted in terms of the original compositional parts • reason: definition of CoDa • solution: split the parts into separate groups and order the balances • this construction is carried out by a special procedure called sequential binary partition
Ilr and balances • result of a special choice of sequential binary partition (SBP)
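A sequential binary partition can be encoded as a sign matrix: in each row, +1 marks the parts in one group, -1 the parts in the other group, and 0 the parts not involved at that step. A balance is then commonly written as sqrt(rs/(r+s)) times the log of the ratio of the two groups' geometric means, where r and s are the group sizes. The sketch below follows that formula; the partition, the function name `balances` and the example composition are hypothetical, since the partition shown on the original slide is not recoverable.

```python
import numpy as np

def balances(x, sbp):
    """Balances of composition x for a sign matrix sbp of shape (D-1, D)."""
    logx = np.log(np.asarray(x, dtype=float))
    z = []
    for signs in np.asarray(sbp):
        plus, minus = signs > 0, signs < 0
        r, s = plus.sum(), minus.sum()
        # difference of mean logs equals the log-ratio of geometric means
        z.append(np.sqrt(r * s / (r + s)) * (logx[plus].mean() - logx[minus].mean()))
    return np.array(z)

# Hypothetical partition of 4 parts: {1,2} vs {3,4}, then 1 vs 2, then 3 vs 4
sbp = [[1, 1, -1, -1],
       [1, -1, 0, 0],
       [0, 0, 1, -1]]
x = np.array([0.1, 0.2, 0.3, 0.4])
print(balances(x, sbp))
```

Each balance then has a direct reading: the first one, for instance, compares parts 1 and 2 as a group against parts 3 and 4.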
Outliers and CoDa 1) caused by Aitchison geometry: • it provides a measure of differences between compositions in a natural way, respecting their relative scale property • it distinguishes between the following two differences within compositional parts: 0.500 and 0.501 vs. 0.001 and 0.002 • consequence: the error term in the parts is not the same for values close to the barycentre as for values close to the border of the simplex
Outliers and CoDa • solution: apply the ilr transformation and then outlier detection (Filzmoser and Hron, 2008)
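The two differences quoted above (0.500 and 0.501 vs. 0.001 and 0.002) are identical in absolute terms but very different on the relative scale. A short check, with the helper name `compare` being an illustrative assumption:

```python
import math

def compare(a, b):
    """Return the absolute difference and the log-ratio of two part values."""
    return b - a, math.log(b / a)

for a, b in [(0.500, 0.501), (0.001, 0.002)]:
    diff, logratio = compare(a, b)
    print(f"{a:.3f} vs {b:.3f}: difference {diff:.3f}, log-ratio {logratio:.3f}")
```

The second pair doubles the part (log-ratio ln 2), while the first changes it by only 0.2 percent.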
Outliers and CoDa 2) caused by definition of CoDa: • each observed composition is a member of the corresponding equivalence class • any two compositions from the same class have zero Aitchison distance • at the same time, low and high values of the constant c can cause a high Euclidean distance between them
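The contrast between the two distances can be seen on a single composition and its multiples. This is a minimal sketch; the function name and the values of x and c are assumptions chosen for illustration.

```python
import numpy as np

def aitchison_distance(x, y):
    """Euclidean distance between centred logratio (clr) coefficients."""
    d = np.log(np.asarray(x, dtype=float)) - np.log(np.asarray(y, dtype=float))
    return np.linalg.norm(d - d.mean())

x = np.array([1.0, 2.0, 7.0])          # hypothetical composition
for c in (0.01, 100.0):                # low and high scaling constants
    euclidean = np.linalg.norm(x - c * x)
    print(f"c = {c}: Euclidean {euclidean:.2f}, Aitchison {aitchison_distance(x, c * x):.2e}")
```

A detector based on Euclidean distance would therefore flag the rescaled observation as an outlier although, as a composition, it is identical to x.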
Missing values in CoDa sets • most statistical methods cannot be directly applied to data sets with missing information • removing incomplete observations can cause an unacceptable loss of information • most imputation methods rely on assumptions such as missing at random (MAR) and normality of the data • outliers can have a dramatic influence on the estimation of missing values
Missing values in CoDa sets • with robust imputation methods, the estimation of missing values is based on the majority of the data • existing robust methods may not be suitable for compositional data (different geometry of the data and wrong identification of outliers) => a more effective way of imputing CoDa, one that respects the Aitchison geometry, is needed
Robust imputation of missing values for CoDa • we propose an iterative procedure to estimate the missing values • initialization of the missing values: fast kNN (Aitchison) • the compositional part with the highest number of missing values is chosen and the data are transformed using a suitable ilr transformation: missing values from the chosen part (x1) appear in only one ilr variable and do not contaminate the others
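The initialization step can be sketched as a nearest-neighbour search under the Aitchison distance. The slides give no details of the authors' kNN variant, so everything below is an assumption: distances are computed on the parts observed in both rows, each neighbour's value is rescaled by the ratio of the sums of those shared parts, and the median over the k nearest neighbours is imputed. The function name `knn_impute_aitchison` is hypothetical.

```python
import numpy as np

def aitchison_distance(x, y):
    """Euclidean distance between centred logratio (clr) coefficients."""
    d = np.log(x) - np.log(y)
    return np.linalg.norm(d - d.mean())

def knn_impute_aitchison(X, k=3):
    """Fill each NaN cell from the k nearest rows (assumed simplified variant)."""
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    for i, j in zip(*np.where(np.isnan(X))):
        observed_i = ~np.isnan(X[i])
        candidates = []
        for l in range(X.shape[0]):
            if l == i or np.isnan(X[l, j]):
                continue                      # neighbour must have part j observed
            shared = observed_i & ~np.isnan(X[l])
            if shared.sum() < 2:
                continue                      # need at least two parts for a distance
            dist = aitchison_distance(X[i, shared], X[l, shared])
            # bring the neighbour to the absolute scale of row i
            scale = X[i, shared].sum() / X[l, shared].sum()
            candidates.append((dist, scale * X[l, j]))
        if not candidates:
            raise ValueError(f"no usable neighbour for cell ({i}, {j})")
        candidates.sort(key=lambda pair: pair[0])
        filled[i, j] = np.median([value for _, value in candidates[:k]])
    return filled
```

The result only serves as a starting point; the regression steps of the iterative procedure then refine it.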
Robust imputation of missing values for CoDa • consequently, fast LTS regression (which can also deal with large data sets) of z1 on z2, …, zD-1 is preferred, but other robust methods can also be considered • missing values are imputed for each variable in turn (starting with the one with the highest number of missing values) • the procedure is repeated iteratively until convergence
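The loop described above can be sketched as follows. This is not the authors' implementation: the ilr basis (pivot coordinates, so that the chosen part enters only z1), the convergence criterion and all function names are assumptions, and fast LTS is replaced by a much cruder stand-in, an ordinary least-squares start followed by concentration steps on the rows with the smallest residuals.

```python
import numpy as np

def pivot_coordinates(X):
    """ilr coordinates in which the first column enters only coordinate 0."""
    logX = np.log(X)
    D = X.shape[1]
    Z = np.empty((X.shape[0], D - 1))
    for i in range(D - 1):
        m = D - 1 - i
        Z[:, i] = np.sqrt(m / (m + 1)) * (logX[:, i] - logX[:, i + 1:].mean(axis=1))
    return Z

def trimmed_ls(A, y, h_frac=0.75, n_steps=20):
    """Crude LTS-like fit (stand-in for fast LTS): refit on the h best rows."""
    A1 = np.column_stack([np.ones(len(y)), A])
    h = max(A1.shape[1], int(np.ceil(h_frac * len(y))))
    keep = np.arange(len(y))
    for _ in range(n_steps):
        beta, *_ = np.linalg.lstsq(A1[keep], y[keep], rcond=None)
        new_keep = np.sort(np.argsort(np.abs(y - A1 @ beta))[:h])
        if np.array_equal(new_keep, keep):
            break
        keep = new_keep
    return beta

def impute_iteratively(X, X_init, max_iter=50, tol=1e-6):
    """X holds NaN for missing cells; X_init is X with those cells pre-filled."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    filled = np.array(X_init, dtype=float)
    D = X.shape[1]
    # parts with missing cells, most affected part first
    order = [j for j in np.argsort(-missing.sum(axis=0)) if missing[:, j].any()]
    scale = np.sqrt((D - 1) / D)
    for _ in range(max_iter):
        previous = filled.copy()
        for j in order:
            cols = [j] + [c for c in range(D) if c != j]   # chosen part first
            Z = pivot_coordinates(filled[:, cols])
            obs = ~missing[:, j]
            beta = trimmed_ls(Z[obs, 1:], Z[obs, 0])        # z1 on z2, ..., zD-1
            z1_hat = beta[0] + Z[~obs, 1:] @ beta[1:]
            # invert the first coordinate, keeping the other parts unchanged
            rest_mean = np.log(filled[~obs][:, cols[1:]]).mean(axis=1)
            filled[~obs, j] = np.exp(z1_hat / scale + rest_mean)
        if np.max(np.abs(np.log(filled) - np.log(previous))) < tol:
            break
    return filled
```

Only the originally missing cells are overwritten, so the observed values stay on their original scale. Any robust regression routine could take the place of `trimmed_ls`.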
References • Aitchison, J., 1986, The statistical analysis of compositional data. Chapman and Hall, London. • Egozcue, J.J., Pawlowsky-Glahn, V., Mateu-Figueras, G., Barceló-Vidal, C., 2003, Isometric logratio transformations for compositional data analysis. Math. Geol., vol. 35, no. 3, p. 279-300. • Filzmoser, P., Hron, K., 2008, Outlier detection for compositional data using robust methods. Math. Geosci., vol. 40, no. 3, p. 233-248.