Outline

Comparative analysis of missing value imputation methods to improve clustering and interpretation of microarray experiments. Magalie Celton 1 , 2 , Alain Malpertuy 3 , Gaëlle Lelandais 1 , 4 and Alexandre G de Brevern 1 , 4 , ∗


Presentation Transcript


  1. Comparative analysis of missing value imputation methods to improve clustering and interpretation of microarray experiments. Magalie Celton 1,2, Alain Malpertuy 3, Gaëlle Lelandais 1,4 and Alexandre G de Brevern 1,4,∗. 1 INSERM UMR-S 726, Equipe de Bioinformatique Génomique et Moléculaire (EBGM), DSIMB, Université Paris Diderot - Paris 7, 2, place Jussieu, 75005, France. MEI-HUEI CHU/NCKU

  2. Outline • Introduction • background • datasets & methods • experiments • Analysis • Discussion • Conclusion

  3. Introduction(background) • A previous study showed the value of the k-Nearest Neighbour (kNN) approach for restoring missing gene expression values, and its positive impact on gene clustering by hierarchical algorithms. • Numerous replacement methods have since been proposed to impute missing values (MVs) in microarray data. This study evaluates twelve usable methods and their influence on the quality of gene clustering.

  4. Introduction(datasets & methods) • datasets

  5. Introduction(datasets & methods) • methods

  6. Introduction(experiments) • experiments • missing value imputation • extreme value imputation • clustering
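
The missing value experiment can be sketched as: hide a fraction ζ of the known values, impute them, and compare the estimates with the hidden originals. The snippet below is only an illustration under stated assumptions. The function names are hypothetical, Row Mean is used because it is the simplest method named in the slides, and the root mean square error is an assumed error measure that the slides do not specify.

```python
import math
import random

def mask_values(matrix, rate, rng):
    """Return a copy of matrix with a fraction `rate` of entries set to None,
    plus the set of (row, column) positions that were hidden."""
    cells = [(i, j) for i, row in enumerate(matrix) for j in range(len(row))]
    hidden = set(rng.sample(cells, round(rate * len(cells))))
    masked = [[None if (i, j) in hidden else v for j, v in enumerate(row)]
              for i, row in enumerate(matrix)]
    return masked, hidden

def row_mean_impute(masked):
    """Replace each None with the mean of the observed values of its row (gene).
    A row with no observed value is filled with 0.0 (an arbitrary choice)."""
    filled = []
    for row in masked:
        observed = [v for v in row if v is not None]
        mean = sum(observed) / len(observed) if observed else 0.0
        filled.append([mean if v is None else v for v in row])
    return filled

def rmse(original, filled, hidden):
    """Root mean square error over the hidden positions only (assumed metric)."""
    if not hidden:
        return 0.0
    squared = sum((original[i][j] - filled[i][j]) ** 2 for i, j in hidden)
    return math.sqrt(squared / len(hidden))

# Toy expression matrix: rows are genes, columns are experimental conditions.
data = [[0.1, 0.4, -0.2, 0.3], [1.2, 1.0, 1.4, 0.9], [-0.8, -1.1, -0.7, -1.0]]
masked, hidden = mask_values(data, rate=0.25, rng=random.Random(0))
print(rmse(data, row_mean_impute(masked), hidden))
```

Repeating this loop for several rates ζ and several random seeds gives one error curve per imputation method.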

  7. Analysis • principle of the method

  8. Analysis(missing value imputation) • example 1

  9. Analysis(missing value imputation) • example 2

  10. Analysis(missing value imputation) • example 3

  11. Analysis(missing value imputation) • example 4

  12. Analysis(missing value imputation) • The methods are ranked in terms of efficiency. Roughly, they fall into three groups: • EM_array, LSI_array, LSI_combined and LSI_adaptative • BPCA, Row Mean, LSI_gene and LLSI • kNN, SkNN and EM_gene.
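
For readers unfamiliar with kNN imputation, here is a minimal sketch of the general idea. A missing value of a gene is estimated from the k genes whose observed profiles are closest to it. The function name `knn_impute`, the unweighted average, and the distance computed over shared observed conditions are assumptions made for this illustration, not details taken from the study.

```python
import math

def knn_impute(masked, k=2):
    """Fill each None with the mean value, in the same column, of the k rows
    closest to the incomplete row. Distance is the root mean squared
    difference over the columns observed in both rows."""
    filled = [row[:] for row in masked]
    for i, row in enumerate(masked):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for other_i, other in enumerate(masked):
                # A neighbour must be another row that has a value in column j.
                if other_i == i or other[j] is None:
                    continue
                shared = [(a, b) for a, b in zip(row, other)
                          if a is not None and b is not None]
                if not shared:
                    continue
                dist = math.sqrt(sum((a - b) ** 2 for a, b in shared) / len(shared))
                candidates.append((dist, other[j]))
            candidates.sort(key=lambda c: c[0])
            nearest = candidates[:k]
            if nearest:  # left as None when no usable neighbour exists
                filled[i][j] = sum(v for _, v in nearest) / len(nearest)
    return filled

genes = [
    [1.0, 2.0, None],   # gene with one missing value
    [1.0, 2.0, 3.0],
    [1.1, 2.1, 3.2],
    [9.0, 9.0, 9.0],    # distant profile, ignored when k=2
]
print(knn_impute(genes, k=2)[0])
```

Because the estimate is an average over neighbours, it tends to be pulled towards typical values, which is one intuitive reason such a method may struggle with extreme measurements.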

  13. Analysis(extreme value imputation) • Extreme values are the 1% of the microarray measurements with the highest absolute values. • Example: ζ = 10% means that 10% of the extreme values are set as missing, i.e. 0.1% of the values of the dataset.
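
The arithmetic behind this example (10% of the top 1% is 0.1% of the dataset) can be checked with a small sketch. The function names and the toy 100 × 100 matrix are hypothetical and serve only to make the proportions concrete.

```python
import random

def extreme_cells(matrix, top_fraction=0.01):
    """Positions of the `top_fraction` of entries with the highest absolute value."""
    cells = [(abs(v), i, j) for i, row in enumerate(matrix) for j, v in enumerate(row)]
    cells.sort(reverse=True)
    n_extreme = round(top_fraction * len(cells))
    return [(i, j) for _, i, j in cells[:n_extreme]]

def mask_extremes(matrix, zeta, rng, top_fraction=0.01):
    """Hide a fraction `zeta` of the extreme entries only."""
    extremes = extreme_cells(matrix, top_fraction)
    hidden = set(rng.sample(extremes, round(zeta * len(extremes))))
    masked = [[None if (i, j) in hidden else v for j, v in enumerate(row)]
              for i, row in enumerate(matrix)]
    return masked, hidden

# Toy dataset of 10,000 values with distinct absolute values and mixed signs.
toy = [[(i * 100 + j) * (-1 if (i + j) % 2 else 1) for j in range(100)]
       for i in range(100)]
masked, hidden = mask_extremes(toy, zeta=0.10, rng=random.Random(1))
n_values = sum(len(row) for row in toy)
print(len(hidden), len(hidden) / n_values)  # hidden count and share of the dataset
```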

  14. Analysis(extreme value imputation)

  15. Analysis(extreme value imputation) • All the replacement methods are less effective at estimating extreme values. • The performance of the methods also depends greatly on the dataset used, in agreement with the previous observations.

  16. Analysis(clustering) • The imputed datasets were clustered both by hierarchical clustering (with seven different algorithms) and by k-means. • Two indices are used to compare clustering results: • CPP • CAR

  17. Analysis(clustering) • Hierarchical clustering • using the normalized Euclidean distance • seven different algorithms: • average linkage • complete linkage • median linkage • McQuitty • centroid linkage • single linkage • Ward minimum variance
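
The linkage algorithms listed above differ only in how the distance between two clusters is derived from the distances between their members. The sketch below is a naive agglomerative procedure covering three of the seven criteria (single, complete and average linkage). The median, McQuitty, centroid and Ward criteria are not shown, and plain Euclidean distance is used because the slide does not say how the distance is normalized. All names are hypothetical.

```python
import math

def euclidean(a, b):
    """Plain Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Each criterion turns the list of member-to-member distances into one value.
LINKAGES = {
    "single": min,                             # closest pair of members
    "complete": max,                           # farthest pair of members
    "average": lambda ds: sum(ds) / len(ds),   # mean over all pairs
}

def agglomerate(profiles, n_clusters, linkage="average"):
    """Merge the two closest clusters until `n_clusters` remain.
    Returns clusters as sorted lists of row indices."""
    combine = LINKAGES[linkage]
    clusters = [[i] for i in range(len(profiles))]
    while len(clusters) > max(1, n_clusters):
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                dists = [euclidean(profiles[i], profiles[j])
                         for i in clusters[a] for j in clusters[b]]
                d = combine(dists)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]

profiles = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
for name in LINKAGES:
    print(name, agglomerate(profiles, 2, linkage=name))
```

This version recomputes every pairwise distance at each step, which is fine for a handful of genes but far too slow for real microarray data, where a dedicated clustering library would normally be used.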

  18. Analysis(clustering) • Conserved Pairs Proportion (CPP) • RC (reference clustering): the result of clustering the datasets without missing values. • GC (generated clustering): the result of clustering the datasets with missing values. • CPP: the maximal proportion of genes shared by two clusters, one from RC and one from GC. • Clustering Agreement Ratio (CAR)
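
The slide gives only a one-line definition of CPP and no formula for CAR, so the sketch below shows one plausible reading rather than the authors' exact indices. For CPP, each RC cluster is matched with the GC cluster sharing the most genes, and the shared genes are summed and divided by the number of genes. For CAR, a pair-counting agreement similar to the Rand index is assumed. Function names and the toy labels are hypothetical.

```python
from itertools import combinations

def cpp(reference, generated):
    """Assumed CPP: for each reference cluster, count the genes it shares with
    its best-matching generated cluster, then divide by the number of genes.
    Both arguments are lists of cluster labels, one label per gene."""
    n = len(reference)
    conserved = 0
    for ref_label in set(reference):
        counts = {}
        for gene in range(n):
            if reference[gene] == ref_label:
                counts[generated[gene]] = counts.get(generated[gene], 0) + 1
        conserved += max(counts.values())
    return conserved / n

def pair_agreement(reference, generated):
    """Assumed CAR-like index: share of gene pairs that are grouped the same
    way (together or apart) in both clusterings."""
    pairs = list(combinations(range(len(reference)), 2))
    if not pairs:
        return 1.0
    agree = sum((reference[a] == reference[b]) == (generated[a] == generated[b])
                for a, b in pairs)
    return agree / len(pairs)

rc = [0, 0, 0, 1, 1, 1]   # clustering of the complete dataset
gc = [5, 5, 7, 7, 7, 7]   # clustering after simulating and imputing MVs
print(cpp(rc, gc), pair_agreement(rc, gc))
```

Both functions depend only on which genes are grouped together, so renaming the cluster labels does not change the result.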

  19. Analysis(clustering)

  20. Analysis(clustering) • The CPP values differ between hierarchical clustering methods, but the general tendencies remain the same: • imputing even a small rate ζ of MVs always has a strong impact on the CPP values. • the CPP values decrease slowly as ζ increases. • common trends can be found between the quality of the imputation method and the gene cluster stability.

  21. Discussion • EM_array is clearly the most efficient method we tested. • As expected, the imputation quality is greatly affected by the rate of missing data, but surprisingly it is also related to the kind of data. BPCA is a perfect illustration. • The efficiency of Row_Mean (and Row_Average) is surprisingly good given the simplicity of the methodology used (with the exception of the Gheat dataset).

  22. Discussion • Although kNN is the most popular imputation method, it is one of the least efficient of the methods tested in this study. This is particularly striking when analyzing the extreme values.

  23. Discussion • Tuikkala and co-workers focused, interestingly, on GO term classes and used k-means. • They tested six different methods, but not the methods found most efficient by our approach. • They used fewer simulations per missing value rate and fewer missing value rates. • Their conclusion about the quality of BPCA.

  24. Conclusion • More than 6,000,000 independent simulations have assessed the quality of 12 imputation methods on five very different biological datasets. The EM_array approach is an efficient method for restoring missing gene expression values, with a lower estimation error level. Nonetheless, the presence of MVs, even at a low rate, is a major factor of gene cluster instability. This study highlights the need for a systematic assessment of imputation methods. A notable point is the specific influence of some biological datasets.

  25. The end Thank you for listening!