Probability-based imputation method for fuzzy cluster analysis of gene expression microarray data
Thanh Le, Tom Altman and Katheleen Gardiner
University of Colorado Denver
April 16, 2012
Overview
• Introduction
  • Data clustering with missing values
  • Current approaches
• Proposed method: fzPBI
  • Data clustering using Fuzzy C-Means
  • Imputation using probability model
• Datasets
  • Artificial and real datasets for testing fzPBI
• Experimental results
• Discussion
Clustering with missing values
• Data points & missing values
  • Complete data point: x = (x1, x2, …, xn-1, xn)
  • Data point with missing values: x = (x1, ?, …, ?, xn)
  • XM = { ? } (the missing values); XP = X \ XM (the values that are present)
• Problem
  • Cluster analysis is based on dissimilarity.
  • Distance is computed using every attribute of the data objects.
  • Improper distance measurement produces incorrect clustering results.
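To make the problem concrete, here is a minimal sketch in plain Python. It is an illustration and not part of fzPBI: `MISSING`, `euclidean` and `partial_distance` are hypothetical names, and the partial distance shown is a well-known workaround from the literature, used here only to show how a distance can still be formed from the attributes in XP.

```python
import math

MISSING = None  # stands in for the "?" on the slide

def euclidean(a, b):
    """Plain Euclidean distance; undefined (NaN) if any attribute is missing."""
    if any(x is MISSING or y is MISSING for x, y in zip(a, b)):
        return math.nan
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def partial_distance(a, b):
    """Distance over attributes observed in both points, rescaled to all attributes."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not MISSING and y is not MISSING]
    if not pairs:
        return math.nan
    scale = len(a) / len(pairs)  # compensate for the attributes that were skipped
    return math.sqrt(scale * sum((x - y) ** 2 for x, y in pairs))

x_complete = (1.0, 2.0, 3.0)
x_incomplete = (1.5, MISSING, 2.5)

d_plain = euclidean(x_complete, x_incomplete)           # cannot be computed
d_partial = partial_distance(x_complete, x_incomplete)  # uses attributes 1 and 3 only
```

The rescaling assumes the skipped attributes would have contributed about as much as the observed ones, which is exactly the kind of assumption that can distort cluster assignments.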
Current approaches
• Data preprocessing to handle missing values
  • Remove data points with missing values
  • Impute missing values
• During the clustering process
  • A clustering model is applied
  • Missing values are estimated and used
• Popular clustering methods
  • Expectation-Maximization (EM): model-based clustering
  • K-Means: crisp membership
  • Fuzzy C-Means (FCM): fuzzy membership, soft cluster boundaries. Each data point can belong to multiple clusters, which provides more relationship information.
Issues with current approaches
• Heuristic methods
  • Imputation using nearest data points
  • Heuristic; the data distribution is not used
• EM-based methods
  • Model-based imputation of missing values
  • Model assumptions, slow convergence
  • Missing values impact parameter estimation
• FCM-based methods
  • Distance-based imputation of missing values
  • Fast convergence; possibly the best of the current approaches
  • The data distribution is ignored
Probability-based imputation: fzPBI
1. Data clustering using FCM
2. Possibility-to-probability transformation
3. Application of the central limit theorem to create the probability model of the data distribution
4. Application of the probability model to missing value imputation
5. Repeat steps 1-4 until convergence
Fuzzy C-Means algorithm
• Objective function
• Model parameter estimation
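The formulas for this slide did not survive in the transcript. As a reference point, the sketch below implements textbook FCM on complete data: it minimises J = Σk Σi (uki)^m ‖xi − vk‖² subject to each point's memberships summing to 1, alternating between the centre update and the membership update. The function name `fcm`, its parameters and the toy data are assumptions for illustration; the slides do not state which fuzzifier m or stopping rule the authors use.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def fcm(X, c, m=2.0, max_iter=100, tol=1e-6, seed=0):
    """Textbook Fuzzy C-Means on complete data.

    Returns (U, V): U[k][i] is the membership of point i in cluster k,
    V[k] is the centre of cluster k.
    """
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    # Random initial memberships, normalised so each point's memberships sum to 1.
    U = [[rng.random() + 1e-9 for _ in range(n)] for _ in range(c)]
    for i in range(n):
        s = sum(U[k][i] for k in range(c))
        for k in range(c):
            U[k][i] /= s
    V = []
    for _ in range(max_iter):
        # Centre update: v_k = sum_i u_ki^m x_i / sum_i u_ki^m
        V = []
        for k in range(c):
            w = [U[k][i] ** m for i in range(n)]
            total = sum(w)
            V.append([sum(w[i] * X[i][j] for i in range(n)) / total for j in range(p)])
        # Membership update: u_ki = 1 / sum_l (d_ki^2 / d_li^2)^(1 / (m - 1))
        change = 0.0
        for i in range(n):
            d2 = [max(sq_dist(X[i], V[k]), 1e-12) for k in range(c)]
            for k in range(c):
                new = 1.0 / sum((d2[k] / d2[l]) ** (1.0 / (m - 1.0)) for l in range(c))
                change = max(change, abs(new - U[k][i]))
                U[k][i] = new
        if change < tol:  # stop when memberships have stabilised
            break
    return U, V

# Two well-separated groups of three points each.
X = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.0], [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]]
U, V = fcm(X, c=2)
labels = [max(range(2), key=lambda k: U[k][i]) for i in range(len(X))]
```

Taking the cluster with the largest membership, as `labels` does, turns the fuzzy partition into a crisp one when a single label per point is needed.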
Distance measurement
• p: number of dimensions of the data space
• Each missing value, xij, is used with a confidence, wj, which is
  • 0 at the beginning
  • 1 at the end
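The distance formula itself is missing from the transcript, so the following is only one plausible reading of the bullets, not the authors' definition. It assumes the confidence scales the squared difference of an imputed attribute, and that the confidence grows linearly with the iteration count; `confidence`, `weighted_sq_distance` and the linear schedule are all hypothetical.

```python
def confidence(iteration, total_iterations):
    """Hypothetical linear ramp: 0 at the first iteration, 1 at the last."""
    if total_iterations <= 1:
        return 1.0
    return min(1.0, iteration / (total_iterations - 1))

def weighted_sq_distance(x, v, is_imputed, w):
    """Squared distance in which imputed attributes are scaled by confidence w.

    Observed attributes always count fully (weight 1); this is an assumption.
    """
    return sum((w if imp else 1.0) * (xj - vj) ** 2
               for xj, vj, imp in zip(x, v, is_imputed))

x = [1.0, 4.0, 3.0]                # the second attribute is a current imputed estimate
is_imputed = [False, True, False]
v = [0.0, 0.0, 0.0]                # a cluster centre

d_start = weighted_sq_distance(x, v, is_imputed, confidence(0, 10))  # estimate ignored
d_end = weighted_sq_distance(x, v, is_imputed, confidence(9, 10))    # estimate fully used
```

Under this reading, early iterations cluster on observed attributes only, and imputed values gain influence as they become more reliable.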
Probability model
• Application of the central limit theorem
  • A cluster is the mean of the different distributions that describe its members, so it can be approximated by a normal distribution.
• Possibility-to-probability transformation
  • {uki}i=1..n: possibility distribution of X at vk
  • {pki}i=1..n: probability distribution of X at vk
• Create the probability model at vk using {pki}
• Impute missing values using the probability model
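The slides do not give the transformation or the imputation formula, so the sketch below swaps in the simplest choices and should not be read as fzPBI itself. It assumes that {pki} is obtained by normalising the memberships {uki} of one cluster over all points, that the normal model at vk is the pki-weighted mean and standard deviation of each attribute, and that a missing value is estimated as the membership-weighted average of the cluster means. All function names are hypothetical.

```python
import math

def to_probability(u_k):
    """Normalise one cluster's memberships {u_ki} so they sum to 1 over the points."""
    total = sum(u_k)
    return [u / total for u in u_k]

def normal_model(values, probs):
    """Weighted mean and standard deviation of one attribute, skipping missing (None) values."""
    pairs = [(x, p) for x, p in zip(values, probs) if x is not None]
    mass = sum(p for _, p in pairs)
    mean = sum(p * x for x, p in pairs) / mass
    var = sum(p * (x - mean) ** 2 for x, p in pairs) / mass
    return mean, math.sqrt(var)

def impute(memberships_of_point, cluster_means):
    """Estimate a missing attribute as the membership-weighted average of cluster means."""
    total = sum(memberships_of_point)
    return sum(u * mu for u, mu in zip(memberships_of_point, cluster_means)) / total

# One attribute over four points; the last value is missing.
values = [1.0, 1.2, 5.0, None]
U = [[0.9, 0.8, 0.1, 0.7],   # memberships in cluster 0
     [0.1, 0.2, 0.9, 0.3]]   # memberships in cluster 1

models = [normal_model(values, to_probability(u_k)) for u_k in U]
means = [mean for mean, _ in models]
estimate = impute([U[0][3], U[1][3]], means)
```

Because the fourth point belongs mostly to cluster 0, its estimate lands closer to that cluster's mean than to the other one.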
Datasets
• Artificial datasets
  • A dataset generated using a finite mixture model
  • A non-uniform dataset, manually created
    • Clusters differ in size
    • Cluster distances are different
• Real datasets
  • Iris and Wine datasets from the UC Irvine Machine Learning Repository
  • RCNS (rat central nervous system), Serum, Yeast and Yeast-MIPS gene expression datasets
• Incomplete datasets were generated using different percentages of missing values
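One common way to build such incomplete datasets is to hide a fixed fraction of entries chosen uniformly at random. The sketch below assumes that scheme; the slides do not say how the missing entries were selected, and `make_incomplete` is a hypothetical helper.

```python
import random

def make_incomplete(X, missing_rate, seed=0):
    """Return a copy of X with a fraction of entries replaced by None.

    Entries are chosen uniformly at random over all cells (an assumption).
    """
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    cells = [(i, j) for i in range(n) for j in range(p)]
    k = round(missing_rate * len(cells))   # number of cells to hide
    masked = set(rng.sample(cells, k))
    return [[None if (i, j) in masked else X[i][j] for j in range(p)]
            for i in range(n)]

X = [[float(i + j) for j in range(4)] for i in range(10)]   # 10 points, 4 attributes
X_missing = make_incomplete(X, 0.10)
n_missing = sum(v is None for row in X_missing for v in row)
```

Keeping the original `X` untouched means the hidden values remain available as ground truth when scoring the imputation. This simple version can hide every attribute of a point; a practical generator may want to prevent that.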
Performance measures
• Root mean square error (RMSE)
• Misclassification error (ME)
  • Compare the cluster label of each data object with its actual class label
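Both measures can be sketched in a few lines. RMSE is taken here over the imputed entries against their true values. For ME, cluster labels are arbitrary, so each cluster is first mapped to the majority class among its members; that mapping is an assumption, since the slides do not say how clusters are matched to classes.

```python
import math
from collections import Counter

def rmse(true_values, imputed_values):
    """Root mean square error between true and imputed values."""
    n = len(true_values)
    return math.sqrt(sum((t - e) ** 2 for t, e in zip(true_values, imputed_values)) / n)

def misclassification_error(cluster_labels, class_labels):
    """Fraction of objects whose cluster's majority class differs from their own class."""
    majority = {}
    for c in set(cluster_labels):
        members = [cls for cl, cls in zip(cluster_labels, class_labels) if cl == c]
        majority[c] = Counter(members).most_common(1)[0][0]
    wrong = sum(majority[cl] != cls for cl, cls in zip(cluster_labels, class_labels))
    return wrong / len(class_labels)

err = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
me = misclassification_error([0, 0, 0, 1, 1, 1], ["a", "a", "b", "b", "b", "b"])
```

In the toy call, one of the six objects sits in a cluster dominated by another class, so ME is 1/6.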
Uniform dataset
[Figure: results on the uniform dataset for the compared methods]
• fzPBI: probability-based method
• OCS: optimal completion strategy
• NPS: nearest prototype strategy
• FCMimp: FCM-based imputation
• CIAO: Alternating Optimization
• FCMGOimp: FCM- and GO-based imputation
The advantages of fzPBI
• Approximates the data distribution using a probability model
• Applies the model to missing value imputation
• Inherits the advantages of FCM and model-based methods, and applies the central limit theorem
Future work
• Combine fzPBI with biological knowledge: protein-protein interaction, Gene Ontology
  • Internal measures using the data
  • External measures using the biological knowledge
  • Internal measures at missing values are adjusted using external measures
Thank you! Questions?
• We acknowledge the support of the Vietnamese Ministry of Education and Training, the 322 scholarship program.