A Robust Approach for Dealing with Missing Values in Compositional Data

Karel Hron, Matthias Templ, Peter Filzmoser. ICORS ’08, Antalya, 8. 9. 2008.

Presentation Transcript


  1. A Robust Approach for Dealing with Missing Values in Compositional Data Karel Hron, Matthias Templ, Peter Filzmoser ICORS ’08, Antalya, 8. 9. 2008

  2. Compositional data (CoDa) • ... D-part composition • and contain essentially the same information • simplex: sample space of D-part compositions • dimensionality of compositions: D-1
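The scale-invariance bullet above can be illustrated with a minimal Python sketch. It assumes the symbols lost from this slide denote a composition x and a positive multiple c·x (the constant c is named on slide 9). The function name `closure` and the example values are illustrative and do not come from the presentation.

```python
def closure(x, total=1.0):
    """Rescale positive parts so that they sum to `total` (a point on the simplex)."""
    s = sum(x)
    return [total * v / s for v in x]

x = [1.0, 2.0, 7.0]            # a 3-part composition in arbitrary units
cx = [25.0 * v for v in x]     # the same composition multiplied by c = 25

closed_x = closure(x)
closed_cx = closure(cx)        # identical to closed_x up to rounding
```

Both vectors map to the same point of the simplex, which is why only the ratios between parts carry information.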

  3. Standard statistics and CoDa • difficulties arise when applying standard statistical methods (like correlation analysis and PCA) • the results can be completely useless • reason: the sample space of CoDa induces a different geometrical structure (Aitchison geometry) • solution: family of logratio transformations from the simplex to real space (Aitchison, 1986) • for CoDa with missing values, these transformations allow for a reasonable imputation

  4. Isometric logratio transformations • ilr for short (Egozcue et al., 2003); they result in a (D-1)-dimensional real space • regularity of the transformed data is ensured, which is necessary for robust statistical methods • isometry
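The isometry property can be checked numerically. The sketch below uses one particular ilr basis (often called pivot coordinates), which is an assumption because the slide's own formula did not survive extraction. The function names and test compositions are illustrative.

```python
import math

def ilr_pivot(x):
    """One possible ilr transformation: coordinate i compares part i
    with the geometric mean of all parts that follow it."""
    logs = [math.log(v) for v in x]
    z = []
    for i in range(len(x) - 1):
        rest = logs[i + 1:]
        k = len(rest)
        z.append(math.sqrt(k / (k + 1)) * (logs[i] - sum(rest) / k))
    return z

def aitchison_distance(x, y):
    """Aitchison distance written with all pairwise logratios."""
    D = len(x)
    total = sum((math.log(x[i] / x[j]) - math.log(y[i] / y[j])) ** 2
                for i in range(D) for j in range(D))
    return math.sqrt(total / (2 * D))

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

x = [0.1, 0.2, 0.3, 0.4]
y = [0.4, 0.1, 0.3, 0.2]
d_simplex = aitchison_distance(x, y)
d_real = euclidean(ilr_pivot(x), ilr_pivot(y))   # agrees with d_simplex up to rounding
```

A D-part composition yields D-1 coordinates, and the Euclidean distance between coordinate vectors matches the Aitchison distance between the compositions.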

  5. Ilr and balances • interpretation of ilr coordinates (balances) in terms of the original compositional parts is not possible • reason: definition of CoDa • solution: split the parts into separate groups and order the balances • this construction is provided by a special procedure called sequential binary partition

  6. Ilr and balances • result of a special choice of sequential binary partition (SBP)
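A sequential binary partition can be written as a sign matrix, with one balance computed per row. The sketch below is a hedged illustration: the partition shown (first part against the rest, then second part against the rest, and so on) is an assumed example and is not necessarily the special choice displayed on the original slide. The names `balance` and `sbp` are illustrative.

```python
import math

def balance(x, signs):
    """Balance between the parts marked +1 and the parts marked -1.
    Parts marked 0 do not take part in this step of the partition."""
    plus = [math.log(v) for v, s in zip(x, signs) if s > 0]
    minus = [math.log(v) for v, s in zip(x, signs) if s < 0]
    r, s = len(plus), len(minus)
    return math.sqrt(r * s / (r + s)) * (sum(plus) / r - sum(minus) / s)

# Assumed example partition for D = 4 parts: each row splits one group in two.
sbp = [
    [+1, -1, -1, -1],
    [0, +1, -1, -1],
    [0, 0, +1, -1],
]

x = [0.1, 0.2, 0.3, 0.4]
z = [balance(x, row) for row in sbp]                      # D-1 = 3 balances
z_scaled = [balance([10.0 * v for v in x], row) for row in sbp]
```

With this partition the first part enters only the first balance, and rescaling the composition leaves every balance unchanged.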

  7. Outliers and CoDa 1) caused by Aitchison geometry: • provides a measure of differences between the compositions in a natural way, respecting their relative scale property • distinguishes between the following two differences within compositional parts: 0.500 and 0.501 vs. 0.001 and 0.002 • consequence: the error term in the parts is not the same for values close to the barycentre as for values close to the border of the simplex
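The two pairs of values from this slide can be compared on the absolute and on the logratio scale. This is a small worked example under the assumption that the relative difference is measured by the log of the ratio; the variable names are illustrative.

```python
import math

# Absolute differences: both pairs differ by 0.001.
abs_centre = 0.501 - 0.500
abs_border = 0.002 - 0.001

# Relative (logratio) differences: the pair near the border is a doubling.
rel_centre = abs(math.log(0.501 / 0.500))   # about 0.002
rel_border = abs(math.log(0.002 / 0.001))   # ln 2, about 0.693
```

The same absolute change is negligible near the centre of the simplex and very large near its border, which is why outlier detection on raw parts can mislead.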

  8. Outliers and CoDa • solution: using ilr transformation and outlier detection (Filzmoser and Hron, 2008)

  9. Outliers and CoDa 2) caused by definition of CoDa: • each observed composition is a member of the corresponding equivalence class • every two compositions from the same class have zero Aitchison distance • low and high values of c can simultaneously cause high Euclidean distance
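The equivalence-class argument can be checked with a short sketch. It assumes that c is the positive scaling constant of a composition and computes the Aitchison distance through the centred logratio (clr) representation; the helper names and values are illustrative.

```python
import math

def clr(x):
    """Centred logratio coefficients: log parts minus their mean."""
    logs = [math.log(v) for v in x]
    mean = sum(logs) / len(logs)
    return [l - mean for l in logs]

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def aitchison(x, y):
    return euclidean(clr(x), clr(y))

x = [0.2, 0.3, 0.5]
low = [0.01 * v for v in x]      # small c
high = [100.0 * v for v in x]    # large c

d_aitchison = aitchison(low, high)   # zero up to rounding: same equivalence class
d_euclidean = euclidean(low, high)   # large, driven only by the choice of c
```

An outlier rule based on Euclidean distance would flag one of these observations even though both represent the same composition.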

  10. Outliers and CoDa

  11. Missing values in CoDa sets • most statistical methods cannot be directly applied to data sets with missing information • removing incomplete observations can cause an unacceptable loss of information • most imputation methods rely on assumptions like missing at random (MAR) and normality of the data • outliers could have a dramatic influence on the estimation of missing values

  12. Missing values in CoDa sets • with robust imputation methods the estimation of missing values is based on the majority of the data • existing robust methods may not be suitable for compositional data (different geometry of the data and wrong identification of outliers) => a more effective way of dealing with CoDa for imputation, one that respects the Aitchison geometry, is needed

  13. Robust imputation of missing values for CoDa • we propose an iterative procedure to estimate the missing values • initialization of the missing values: fast kNN (Aitchison) • the compositional part with the largest number of missing values is chosen and the data are transformed using a proper ilr transformation; missing values from the chosen part (x1) appear in one ilr variable and do not contaminate the others

  14. Robust imputation of missing values for CoDa • consequently, fast LTS regression (also able to deal with large data sets) of z1 on z2, …, zD-1 is preferred, but other robust methods can also be considered • missing values are imputed for every variable (starting with the one with the largest number of missing values) • the procedure is repeated iteratively until convergence
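The iterative procedure of slides 13 and 14 can be sketched in plain Python for the smallest case. This is not the authors' implementation and makes several loudly stated assumptions: it handles only 3-part compositions with at most one missing part per row, it replaces the kNN initialization with a crude median-ratio start, and it swaps LTS regression for the Theil-Sen estimator (a different robust method, used here because with D = 3 there is a single regressor z2). All function names and the synthetic data are hypothetical.

```python
import math
from statistics import median

C1, C2 = math.sqrt(2 / 3), math.sqrt(1 / 2)   # pivot-coordinate constants for D = 3

def theil_sen(x, y):
    """Robust line fit (stand-in for LTS): median pairwise slope, median intercept."""
    n = len(x)
    slopes = [(y[j] - y[i]) / (x[j] - x[i])
              for i in range(n) for j in range(i + 1, n) if x[j] != x[i]]
    slope = median(slopes)
    return median(yi - slope * xi for xi, yi in zip(x, y)), slope

def impute(data, max_iter=50, tol=1e-10):
    """Impute None entries in 3-part compositions (at most one None per row)."""
    X = [list(row) for row in data]
    miss = [[v is None for v in row] for row in X]
    others = {j: [k for k in range(3) if k != j] for j in range(3)}
    # Parts that contain missing values, the one with the most of them first.
    order = sorted((j for j in range(3) if any(m[j] for m in miss)),
                   key=lambda j: -sum(m[j] for m in miss))
    # Crude start (assumption, replaces kNN): median observed ratio of the
    # part to the geometric mean of the other two parts.
    complete = [r for r, m in zip(X, miss) if not any(m)]
    for j in order:
        a, b = others[j]
        ratio = median(r[j] / math.sqrt(r[a] * r[b]) for r in complete)
        for r, m in zip(X, miss):
            if m[j]:
                r[j] = ratio * math.sqrt(r[a] * r[b])
    for _ in range(max_iter):
        change = 0.0
        for j in order:
            a, b = others[j]
            # Part j enters only z1, so its missing values do not touch z2.
            z1 = [C1 * math.log(r[j] / math.sqrt(r[a] * r[b])) for r in X]
            z2 = [C2 * math.log(r[a] / r[b]) for r in X]
            b0, b1 = theil_sen(z2, z1)
            for r, m, z in zip(X, miss, z2):
                if m[j]:
                    # Back-transform the predicted z1 to the scale of the row.
                    new = math.sqrt(r[a] * r[b]) * math.exp((b0 + b1 * z) / C1)
                    change = max(change, abs(math.log(new / r[j])))
                    r[j] = new
        if change < tol:
            break
    return X

# Synthetic toy data with an exact relation z1 = 0.5 + 0.8 * z2 (not real data).
truth = []
for i in range(12):
    z2 = -1.0 + 0.2 * i
    xa, xb = math.exp(z2 / C2), 1.0
    truth.append([math.sqrt(xa * xb) * math.exp((0.5 + 0.8 * z2) / C1), xa, xb])
data = [row[:] for row in truth]
data[3][0] = None
data[8][0] = None
imputed = impute(data)
```

On this noise-free toy data the two deleted values are recovered, because the robust fit is determined by the majority of complete rows. For more than three parts the regression has several regressors, so a multivariate robust method such as the LTS regression named on the slide would be needed.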

  15. Simulation study

  16. Simulation study

  17. References • Aitchison, J., 1986, The statistical analysis of compositional data. Chapman and Hall, London. • Egozcue, J.J., Pawlowsky-Glahn, V., Mateu-Figueras, G., Barceló-Vidal, C., 2003, Isometric logratio transformations for compositional data analysis. Math. Geol., vol. 35, no. 3, p. 279-300. • Filzmoser, P., Hron, K., 2008, Outlier detection for compositional data using robust methods. Math. Geosci., vol. 40, no. 3, p. 233-248.
