Probability-based imputation method for fuzzy cluster analysis of gene expression microarray data

Thanh Le, Tom Altman and Katheleen Gardiner • University of Colorado Denver • April 16, 2012

Overview • Introduction • Data clustering with missing values • Current approaches • Proposed method: fzPBI • Data clustering using Fuzzy C-Means • Imputation using probability model • Datasets • Artificial and real datasets for testing fzPBI • Experimental results • Discussion

Clustering with missing values • Data points & missing values • A complete data point: x = (x1, x2, …, xn-1, xn) • A data point with missing values: x = (x1, ?, …, ?, xn) • XM = { ? } is the set of missing values; XP = X \ XM is the set of present values • Problem • Cluster analysis is based on dissimilarity • Distance is computed using every attribute of the data objects • An improper distance measurement produces incorrect clustering results

Current approaches • Data preprocessing to handle missing values • Remove data points with missing values • Impute the missing values • During the clustering process • Apply the clustering model • Missing values are estimated and used • Popular clustering methods • Expectation-Maximization (EM): model-based clustering • K-Means: crisp membership • Fuzzy C-Means (FCM): fuzzy membership with soft cluster boundaries; each data point can belong to multiple clusters, which provides more relationship information

Issues with current approaches • Heuristic methods • Imputation using the nearest data points • Purely heuristic; the data distribution is not used • EM-based methods • Model-based imputation of missing values • Rely on model assumptions and converge slowly • Missing values affect parameter estimation • FCM-based methods • Distance-based imputation of missing values • Fast convergence; arguably the best of the current approaches • The data distribution is ignored

Probability-based imputation - fzPBI • 1. Cluster the data using FCM • 2. Transform possibility to probability • 3. Apply the central limit theorem to build a probability model of the data distribution • 4. Use the probability model to impute the missing values • 5. Repeat steps 1-4 until convergence

Fuzzy C-Means algorithm • Objective function • Model parameters estimation:
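The equations on this slide did not survive in the transcript. As a rough illustration only, the sketch below implements the textbook Fuzzy C-Means (Bezdek) updates on complete data: it minimizes the sum over clusters k and points i of u_ki^m times the squared distance between x_i and v_k, with each point's memberships summing to 1. The function name `fcm`, the fuzzifier default `m=2.0`, and the stopping rule are assumptions, not the authors' formulation, which also has to handle missing values.

```python
import random


def fcm(X, c, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Textbook Fuzzy C-Means on complete data (illustrative sketch).

    X is a list of n points, each a list of p floats; c is the number of clusters.
    Returns (U, V): U[k][i] is the membership of point i in cluster k,
    V[k] is the centre of cluster k.
    """
    rng = random.Random(seed)
    n, p = len(X), len(X[0])

    # Random initial memberships, normalised so each point's memberships sum to 1.
    U = [[rng.random() for _ in range(n)] for _ in range(c)]
    for i in range(n):
        s = sum(U[k][i] for k in range(c))
        for k in range(c):
            U[k][i] /= s

    V = []
    for _ in range(n_iter):
        # Centre update: v_k = sum_i u_ki^m x_i / sum_i u_ki^m
        V = []
        for k in range(c):
            w = [U[k][i] ** m for i in range(n)]
            sw = sum(w)
            V.append([sum(w[i] * X[i][j] for i in range(n)) / sw for j in range(p)])

        # Membership update: u_ki is proportional to d2_ki^(-1/(m-1)),
        # where d2_ki is the squared Euclidean distance from x_i to v_k.
        change = 0.0
        for i in range(n):
            d2 = [max(sum((X[i][j] - V[k][j]) ** 2 for j in range(p)), 1e-12)
                  for k in range(c)]
            inv = [d ** (-1.0 / (m - 1.0)) for d in d2]
            s = sum(inv)
            for k in range(c):
                new = inv[k] / s
                change = max(change, abs(new - U[k][i]))
                U[k][i] = new

        if change < tol:  # stop once the memberships stop moving
            break
    return U, V


if __name__ == "__main__":
    points = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
    memberships, centres = fcm(points, c=2)
    print(centres)
```

Plain lists keep the sketch dependency-free; an array library would shorten the two update steps considerably.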

Distance measurement • p: the number of dimensions of the data space • Each missing value, xij, is used with a confidence weight, wj, which is • 0 at the beginning of the process • 1 at the end
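The distance formula itself is missing from the transcript, so the following is only one possible reading of the bullets above: a squared Euclidean distance in which attributes holding imputed values are scaled by a confidence between 0 and 1. The names `weighted_sq_distance` and `confidence`, the use of a single scalar weight, and the linear schedule are all hypothetical choices made for illustration.

```python
def confidence(iteration, total_iterations):
    """Hypothetical linear schedule: 0 at the first iteration, 1 at the last."""
    return iteration / total_iterations


def weighted_sq_distance(x, v, is_imputed, w):
    """Squared distance between point x and centre v.

    Observed attributes count fully; attributes flagged in is_imputed
    contribute with confidence w (an assumed weighting, see lead-in).
    """
    total = 0.0
    for xj, vj, imputed in zip(x, v, is_imputed):
        weight = w if imputed else 1.0
        total += weight * (xj - vj) ** 2
    return total


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0]          # second attribute was imputed
    v = [0.0, 0.0, 0.0]
    flags = [False, True, False]
    for t in (0, 5, 10):
        print(t, weighted_sq_distance(x, v, flags, confidence(t, 10)))
```

With a weight of 0 the imputed attribute is ignored entirely, which matches the idea that early estimates should not influence the clustering; with a weight of 1 it is treated like an observed value.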

Probability model • Application of the central limit theorem • A cluster is the mean of the different distribution models that describe its members, so it can be approximated by a normal distribution • Possibility to probability transformation • {uki}i=1..n - possibility distribution of X at vk • {pki}i=1..n - probability distribution of X at vk • Create the probability model at vk using {pki} • Impute the missing values using the probability model
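The transcript does not give the transformation or the imputation formula, so the sketch below fills them with simple stand-ins that are assumptions, not the authors' definitions: memberships of one cluster are normalised over the data points to obtain {pki}, a normal model (mean and variance) is fitted per cluster from the observed values of one attribute, and a missing value is replaced by the membership-weighted mix of the cluster means. All function names are hypothetical.

```python
def possibility_to_probability(u_k):
    """Assumed transform: normalise one cluster's memberships over all points."""
    total = sum(u_k)
    return [u / total for u in u_k]


def fit_normal(values, probs):
    """Weighted mean and variance of one attribute; None marks a missing value."""
    observed = [(x, p) for x, p in zip(values, probs) if x is not None]
    z = sum(p for _, p in observed)
    mean = sum(p * x for x, p in observed) / z
    var = sum(p * (x - mean) ** 2 for x, p in observed) / z
    return mean, var


def impute_attribute(values, U):
    """Fill missing entries of one attribute.

    values[i] is the attribute value of point i (None if missing);
    U[k][i] is the FCM membership of point i in cluster k.
    """
    models = [fit_normal(values, possibility_to_probability(u_k)) for u_k in U]
    filled = list(values)
    for i, x in enumerate(values):
        if x is None:
            # Expected value under a mixture of the per-cluster normal models.
            filled[i] = sum(U[k][i] * models[k][0] for k in range(len(U)))
    return filled


if __name__ == "__main__":
    attribute = [1.0, 1.2, None, 5.0, 5.2]
    memberships = [[0.9, 0.9, 0.9, 0.1, 0.1],
                   [0.1, 0.1, 0.1, 0.9, 0.9]]
    print(impute_attribute(attribute, memberships))
```

The variance is returned but not used here; it would matter if missing values were sampled from the fitted model rather than set to its expectation.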

Datasets • Artificial datasets • A dataset generated using a finite mixture model • A manually created non-uniform dataset • Clusters differ in size • Distances between clusters differ • Real datasets • Iris and Wine datasets from the UC Irvine Machine Learning Repository • RCNS (rat central nervous system), Serum, Yeast and Yeast-MIPS gene expression datasets • Incomplete datasets were generated using different percentages of missing values
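The transcript does not say how the missing values were placed. A minimal sketch of one common choice, removing a fixed fraction of cells completely at random, is shown below; the function name `mask_values`, the use of `None` as the missing marker, and the uniform selection are assumptions.

```python
import random


def mask_values(X, fraction, seed=0):
    """Return a copy of X with the given fraction of cells replaced by None.

    Cells are chosen uniformly at random (an assumed mechanism); the input
    is left unchanged. A row may by chance lose all of its values, which a
    real experiment might want to rule out.
    """
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    cells = [(i, j) for i in range(n) for j in range(p)]
    chosen = rng.sample(cells, round(fraction * len(cells)))
    Y = [row[:] for row in X]
    for i, j in chosen:
        Y[i][j] = None
    return Y


if __name__ == "__main__":
    complete = [[float(i + j) for j in range(4)] for i in range(10)]
    for pct in (0.05, 0.10, 0.20):
        incomplete = mask_values(complete, pct)
        missing = sum(v is None for row in incomplete for v in row)
        print(pct, missing)
```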

Performance measures • Root mean square error (RMSE) • Misclassification error (ME) • Compares the cluster label of each data object with its actual class label
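The two measures can be sketched as follows. RMSE is taken here between the true values that were removed and their imputed estimates. For ME, cluster labels are arbitrary, so each cluster is first mapped to the majority class among its members; that mapping is an assumption, since the transcript does not say how cluster labels were matched to class labels.

```python
import math
from collections import Counter


def rmse(true_values, imputed_values):
    """Root mean square error between removed true values and their estimates."""
    squared = [(t - e) ** 2 for t, e in zip(true_values, imputed_values)]
    return math.sqrt(sum(squared) / len(squared))


def misclassification_error(cluster_labels, class_labels):
    """Fraction of objects whose class differs from their cluster's majority class."""
    by_cluster = {}
    for c, y in zip(cluster_labels, class_labels):
        by_cluster.setdefault(c, Counter())[y] += 1
    correct = sum(counts.most_common(1)[0][1] for counts in by_cluster.values())
    return 1.0 - correct / len(class_labels)


if __name__ == "__main__":
    print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
    print(misclassification_error([0, 0, 0, 1, 1, 1],
                                  ["a", "a", "b", "b", "b", "b"]))
```

Majority mapping lets several clusters map to the same class; a one-to-one assignment would be stricter.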

Uniform dataset • fzPBI: probability-based method • OCS: optimal completion strategy • NPS: nearest prototype strategy • FCMimp: FCM-based imputation • CIAO: alternating optimization • FCMGOimp: FCM- and GO-based imputation

Non-uniform dataset

Iris dataset

RCNS gene expression dataset

Yeast gene expression dataset

Serum gene expression dataset

The advantages of fzPBI • Approximates the data distribution using a probability model • Applies the model to missing value imputation • Inherits the advantages of FCM and of model-based methods, together with the application of the central limit theorem

Future work • Combine fzPBI with biological knowledge: protein-protein interactions, Gene Ontology • Internal measures using the data • External measures using the biological knowledge • Internal measures at missing values are adjusted using external measures

Thank you! Questions? • We acknowledge the support of the Vietnamese Ministry of Education and Training through the 322 scholarship program.
