Comparative analysis of missing value imputation methods to improve clustering and interpretation of microarray experiments. Magalie Celton 1 , 2 , Alain Malpertuy 3 , Gaëlle Lelandais 1 , 4 and Alexandre G de Brevern 1 , 4 , ∗
Comparative analysis of missing value imputation methods to improve clustering and interpretation of microarray experiments Magalie Celton1,2, Alain Malpertuy3, Gaëlle Lelandais1,4 and Alexandre G de Brevern1,4,∗ 1INSERM UMR-S 726, Equipe de Bioinformatique Génomique et Moléculaire (EBGM), DSIMB, Université Paris Diderot - Paris 7, 2, place Jussieu, 75005, France MEI-HUEI CHU/NCKU
Outline • Introduction • background • datasets & methods • experiments • Analysis • Discussion • Conclusion
Introduction(background) • A previous study showed the value of the k-Nearest Neighbour (kNN) approach for restoring missing gene expression values, and its positive impact on gene clustering by hierarchical algorithms. • Numerous replacement methods have since been proposed to impute missing values (MVs) in microarray data. This study evaluates twelve usable methods and their influence on the quality of gene clustering.
Introduction(datasets & methods) • datasets
Introduction(datasets & methods) • methods
Introduction(experiments) • experiments • missing value imputation • extreme value imputation • clustering
Analysis • principle of the method
Analysis(missing value imputation) • example 1
Analysis(missing value imputation) • example 2
Analysis(missing value imputation) • example 3
Analysis(missing value imputation) • example 4
Analysis(missing value imputation) • The methods can be ranked in terms of efficiency. Roughly, they fall into three groups, from most to least efficient: • EM_array, LSI_array, LSI_combined and LSI_adaptative • BPCA, Row Mean, LSI_gene and LLSI • kNN, SkNN and EM_gene.
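To make the idea of ranking methods by efficiency concrete, here is a minimal sketch of one simulation round: hide a fraction ζ of known values, impute them, and measure the error on the hidden positions. It uses Row Mean because it is the simplest method named on the slide. The toy matrix, the function names, and the choice of root mean square error as the error measure are assumptions made for illustration; the slides do not show the actual protocol.

```python
import math
import random

def mask_random(matrix, rate, rng):
    """Copy `matrix` and set a fraction `rate` of its entries to None.

    Returns the damaged copy and the list of masked (row, column) positions.
    """
    cells = [(i, j) for i, row in enumerate(matrix) for j in range(len(row))]
    masked = rng.sample(cells, round(rate * len(cells)))
    damaged = [list(row) for row in matrix]
    for i, j in masked:
        damaged[i][j] = None
    return damaged, masked

def row_mean_impute(matrix):
    """Replace each None by the mean of the observed values of the same gene (row)."""
    imputed = []
    for row in matrix:
        observed = [v for v in row if v is not None]
        fill = sum(observed) / len(observed) if observed else 0.0
        imputed.append([fill if v is None else v for v in row])
    return imputed

def rmse(reference, imputed, positions):
    """Root mean square error, computed only over the masked positions."""
    if not positions:
        return 0.0
    squared = sum((reference[i][j] - imputed[i][j]) ** 2 for i, j in positions)
    return math.sqrt(squared / len(positions))

rng = random.Random(0)
# Made-up expression matrix: 6 genes (rows) x 5 conditions (columns).
data = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(6)]

damaged, masked = mask_random(data, 0.10, rng)   # zeta = 10%
restored = row_mean_impute(damaged)
error = rmse(data, restored, masked)
print(f"masked {len(masked)} values, RMSE = {error:.3f}")
```

Repeating this for each method, each rate ζ and many random masks would give the kind of error curves from which a ranking can be read.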
Analysis(extreme value imputation) • Extreme values: the 1% of microarray measurements with the highest absolute values. • Example: ζ = 10% means that 10% of the extreme values are set as missing, i.e. 0.1% of all values in the dataset.
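The arithmetic of the extreme value experiment can be sketched as follows: take the 1% of measurements with the largest absolute value, then hide a fraction ζ of them. The function names and the random toy matrix are hypothetical, and how ties are broken is an assumption (here, by sort order).

```python
import random

def extreme_positions(matrix, top_fraction=0.01):
    """Positions of the `top_fraction` of entries with the highest absolute value."""
    cells = [(abs(v), i, j) for i, row in enumerate(matrix) for j, v in enumerate(row)]
    cells.sort(reverse=True)
    n_extreme = round(top_fraction * len(cells))
    return [(i, j) for _, i, j in cells[:n_extreme]]

def mask_extremes(matrix, zeta, rng, top_fraction=0.01):
    """Set a fraction `zeta` of the extreme entries to None."""
    extremes = extreme_positions(matrix, top_fraction)
    masked = rng.sample(extremes, round(zeta * len(extremes)))
    damaged = [list(row) for row in matrix]
    for i, j in masked:
        damaged[i][j] = None
    return damaged, masked

rng = random.Random(1)
# Made-up dataset: 500 genes x 20 conditions = 10,000 values.
data = [[rng.gauss(0, 1) for _ in range(20)] for _ in range(500)]

extremes = extreme_positions(data)                # 1% of 10,000 values
damaged, masked = mask_extremes(data, 0.10, rng)  # zeta = 10% of the extremes
print(len(extremes), len(masked), len(masked) / 10_000)
```

With 10,000 values, 1% gives 100 extreme values, and ζ = 10% hides 10 of them, which is 0.1% of the dataset, as in the slide's example.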
Analysis(extreme value imputation)
Analysis(extreme value imputation) • All the replacement methods become less effective when estimating extreme values. • The performance of the methods also depends strongly on the dataset used, in agreement with the previous observations.
Analysis(clustering) • The imputed datasets were clustered both by hierarchical clustering (with seven different algorithms) and by k-means. • Two indices are used to compare clustering results: • CPP • CAR
Analysis(clustering) • Hierarchical clustering • using the normalized Euclidean distance • seven different algorithms: • average linkage • complete linkage • median linkage • McQuitty • centroid linkage • single linkage • Ward minimum variance
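As a rough illustration of how the linkage choice enters agglomerative clustering, the sketch below implements three of the seven criteria (single, complete, average) in plain Python. The slides do not define "normalized" Euclidean distance; here it is assumed to mean that each gene profile is standardised (zero mean, unit variance) before the Euclidean distance is taken. The function names and the toy profiles are hypothetical.

```python
import math
from itertools import combinations

def standardise(profile):
    """Centre a gene profile and scale it to unit variance (assumed normalisation)."""
    mean = sum(profile) / len(profile)
    sd = math.sqrt(sum((v - mean) ** 2 for v in profile) / len(profile))
    return [(v - mean) / sd if sd > 0 else 0.0 for v in profile]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Each linkage reduces the list of pairwise distances between two clusters
# to a single cluster-to-cluster distance.
LINKAGES = {
    "single": min,                               # closest pair
    "complete": max,                             # farthest pair
    "average": lambda ds: sum(ds) / len(ds),     # mean over all pairs
}

def agglomerate(profiles, n_clusters, linkage="average"):
    """Naive agglomerative clustering; returns clusters as lists of gene indices."""
    combine = LINKAGES[linkage]
    points = [standardise(p) for p in profiles]
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            distances = [euclidean(points[i], points[j])
                         for i in clusters[a] for j in clusters[b]]
            d = combine(distances)
            if best is None or d < best[0]:
                best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]   # merge the closest pair
        del clusters[b]                           # b > a, so index a is unaffected
    return [sorted(c) for c in clusters]

# Made-up profiles: genes 0-2 rise across conditions, genes 3-5 fall.
profiles = [
    [1, 2, 3, 4], [2, 4, 6, 8], [1.1, 2.0, 3.2, 3.9],
    [4, 3, 2, 1], [8, 6, 4, 2], [3.9, 3.1, 2.0, 1.2],
]
clusters_by_linkage = {name: agglomerate(profiles, 2, name) for name in LINKAGES}
print(clusters_by_linkage)
```

The other four criteria need cluster-level update rules rather than a simple reduction over pairwise distances. In practice a library would normally be used; for instance, SciPy's `scipy.cluster.hierarchy.linkage` accepts `single`, `complete`, `average`, `weighted` (the McQuitty / WPGMA rule), `centroid`, `median` and `ward`.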
Analysis(clustering) • Conserved Pairs Proportion (CPP) • RC (reference clustering): the clustering obtained from the dataset without missing values. • GC (generated clustering): the clustering obtained from the dataset with missing values. • CPP: the maximal proportion of genes shared between two clusters, one from RC and one from GC. • Clustering Agreement Ratio (CAR)
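One plausible reading of the CPP definition above can be sketched as follows: for each cluster of RC, find the GC cluster that shares the most genes with it, sum these maxima, and divide by the number of genes. This reading is an assumption, since the slide gives no formula. The slide also does not define CAR, so the second function is only a Rand-index-style stand-in (the share of gene pairs that are grouped together, or kept apart, in both clusterings), not the authors' formula. All names and the toy clusterings are hypothetical.

```python
from itertools import combinations

def group_by_label(assignment):
    """Turn {gene: cluster label} into {cluster label: set of genes}."""
    clusters = {}
    for gene, label in assignment.items():
        clusters.setdefault(label, set()).add(gene)
    return clusters

def conserved_pairs_proportion(reference, generated):
    """Assumed CPP: summed best overlap of each RC cluster with a GC cluster, per gene."""
    rc = group_by_label(reference)
    gc = group_by_label(generated)
    conserved = sum(max(len(r & g) for g in gc.values()) for r in rc.values())
    return conserved / len(reference)

def pair_agreement_ratio(reference, generated):
    """Rand-index-style stand-in for CAR: share of gene pairs treated alike in RC and GC."""
    pairs = list(combinations(reference, 2))
    agree = sum(
        (reference[a] == reference[b]) == (generated[a] == generated[b])
        for a, b in pairs
    )
    return agree / len(pairs)

# Made-up clusterings of six genes; gene g3 changes cluster after imputation.
reference = {"g1": "A", "g2": "A", "g3": "A", "g4": "B", "g5": "B", "g6": "B"}
generated = {"g1": "X", "g2": "X", "g3": "Y", "g4": "Y", "g5": "Y", "g6": "Y"}

cpp = conserved_pairs_proportion(reference, generated)
car_like = pair_agreement_ratio(reference, generated)
print(f"CPP = {cpp:.3f}, pair agreement = {car_like:.3f}")
```

Both functions return 1.0 when the two clusterings are identical up to a relabelling of the clusters, which is the behaviour expected of a stability index.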
Analysis(clustering)
Analysis(clustering) • The CPP values differ between hierarchical clustering methods, but the general tendencies remain the same: • imputing even a small rate ζ of MVs always has a strong impact on the CPP values. • the CPP values decrease slowly as ζ increases. • the quality of an imputation method and the stability of the gene clusters follow common trends.
Discussion • EM_array is clearly the most efficient method we tested. • As expected, the imputation quality is greatly affected by the rate of missing data; surprisingly, it also depends on the kind of data. BPCA is a perfect illustration. • Row_Mean (and Row_Average) is surprisingly efficient given the simplicity of the method (with the exception of the Gheat dataset).
Discussion • Although kNN is the most popular imputation method, it is one of the least efficient of the methods tested in this study. This is particularly striking for the extreme values.
Discussion • Tuikkala and co-workers took an interesting approach, focusing on GO term classes and using k-means. • They tested six different methods, but not the ones our approach found most efficient. • They ran fewer simulations per missing value rate, and used fewer missing value rates. • their conclusion about the quality of BPCA.
Conclusion • More than 6,000,000 independent simulations assessed the quality of 12 imputation methods on five very different biological datasets. The EM_array approach is an efficient method for restoring missing gene expression values, with a lower estimation error. Nonetheless, the presence of MVs, even at a low rate, is a major factor of gene cluster instability. This study highlights the need for a systematic assessment of imputation methods. A noticeable point is the specific influence of some biological datasets.
The end Thank you for listening!