Probability-based imputation method for fuzzy cluster analysis of gene expression microarray data

Thanh Le, Tom Altman and Katheleen Gardiner, University of Colorado Denver. April 16, 2012.

Presentation Transcript


  1. Probability-based imputation method for fuzzy cluster analysis of gene expression microarray data • Thanh Le, Tom Altman and Katheleen Gardiner • University of Colorado Denver • April 16, 2012

  2. Overview • Introduction: data clustering with missing values; current approaches • Proposed method, fzPBI: data clustering using Fuzzy C-Means; imputation using a probability model • Datasets: artificial and real datasets for testing fzPBI • Experimental results • Discussion

  3. Clustering with missing values • Data points and missing values: a complete data point is x = (x1, x2, …, xn-1, xn); a data point with missing values is x = (x1, ?, …, ?, xn); XM = { ? } is the set of missing values and XP = X \ XM is the set of remaining values • Problem: cluster analysis is based on dissimilarity, and distance is computed using every attribute of the data objects, so an improper distance measurement produces incorrect clustering results.

  4. Current approaches • Data preprocessing before clustering: remove data points with missing values, or impute the missing values • During the clustering process: a clustering model is applied, and missing values are estimated and used • Popular clustering methods: Expectation-Maximization (EM), model-based clustering; K-Means, crisp membership; Fuzzy C-Means (FCM), fuzzy membership with soft cluster boundaries, so each data point can belong to multiple clusters, which provides more relationship information

  5. Issues with current approaches • Heuristic methods: imputation using the nearest data points; heuristic, so the data distribution is not used • EM-based methods: model-based imputation of missing values; rely on model assumptions and converge slowly; missing values affect parameter estimation • FCM-based methods: distance-based imputation of missing values; fast convergence, possibly the best current approach; the data distribution is ignored

  6. Probability-based imputation: fzPBI • (1) Cluster the data using FCM • (2) Transform possibility to probability • (3) Apply the central limit theorem to build the probability model of the data distribution • (4) Use the probability model to impute missing values • Repeat steps 1-4 until convergence

  7. Fuzzy C-Means algorithm • Objective function • Model parameters estimation:
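
The formulas for this slide are not present in the transcript. As a reference point only, standard FCM minimizes J = Σ_k Σ_i (u_ki)^m · ||x_i − v_k||², with the memberships of each data point summing to 1, by alternating a center update and a membership update. The sketch below is a minimal pure-Python version of that standard algorithm on complete data, not the authors' implementation. The function name `fcm`, the fuzzifier `m=2.0`, the random initialization and the stopping tolerance are all assumptions made for illustration. Memberships are stored as `U[i][k]` (data point i, cluster k).

```python
import random

def fcm(X, c, m=2.0, iters=100, tol=1e-6, seed=0):
    """Standard Fuzzy C-Means on complete data (X is a list of equal-length lists)."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    # Random initial memberships; each data point's memberships sum to 1.
    U = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(c)]
        s = sum(row)
        U.append([r / s for r in row])
    V = []
    for _ in range(iters):
        # Center update: v_k = sum_i u_ki^m * x_i / sum_i u_ki^m
        V = []
        for k in range(c):
            w = [U[i][k] ** m for i in range(n)]
            sw = sum(w)
            V.append([sum(w[i] * X[i][j] for i in range(n)) / sw for j in range(p)])
        # Membership update: u_ki = d_ki^(-2/(m-1)) / sum_l d_li^(-2/(m-1))
        shift = 0.0
        for i in range(n):
            d2 = [max(sum((X[i][j] - V[k][j]) ** 2 for j in range(p)), 1e-12)
                  for k in range(c)]
            inv = [d ** (-1.0 / (m - 1.0)) for d in d2]  # d2 is already squared
            s = sum(inv)
            new = [v / s for v in inv]
            shift = max(shift, max(abs(a - b) for a, b in zip(new, U[i])))
            U[i] = new
        if shift < tol:  # stop when memberships no longer change
            break
    return U, V
```

A call such as `U, V = fcm(data, 3)` returns the membership matrix and the cluster centers; the fuzzy memberships in `U` are what the later possibility-to-probability step operates on.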

  8. Distance measurement • p: number of dimensions of the data space • Each missing value xij is used with a confidence wj, which is 0 at the beginning and 1 at the end
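
The distance formula itself is not present in the transcript, so the following is only one plausible reading of the slide, not the authors' definition. It assumes that observed attributes always count with weight 1, that an imputed attribute counts with a confidence `w` between 0 and 1, and that the sum is rescaled by p / Σ weights. The function name `weighted_sq_distance` and the rescaling are hypothetical. Under these assumptions the distance ignores imputed values when `w = 0` and becomes the ordinary squared Euclidean distance when `w = 1`.

```python
def weighted_sq_distance(x, v, missing, w):
    """Squared distance between point x and center v.

    x       -- attribute values, with imputed values filled in at missing positions
    v       -- cluster center
    missing -- list of booleans, True where x[j] was originally missing
    w       -- confidence in imputed values, 0.0 (ignore) to 1.0 (trust fully)
    """
    p = len(x)
    # Assumed weighting: 1 for observed attributes, w for imputed ones.
    weights = [w if is_missing else 1.0 for is_missing in missing]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("no usable attributes for this data point")
    s = sum(wt * (xj - vj) ** 2 for wt, xj, vj in zip(weights, x, v))
    # Rescale so that distances stay comparable across points with
    # different numbers of missing attributes (an assumption).
    return p / total * s
```

With `x = [1.0, 5.0, 3.0]`, `v = [0.0, 0.0, 0.0]` and the second attribute imputed, `w = 0.0` gives 15.0 (the imputed 5.0 is ignored) and `w = 1.0` gives 35.0 (the full squared Euclidean distance).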

  9. Probability model • Application of the central limit theorem: a cluster is the mean of the different distribution models that describe its members, so it can be approximated using the normal distribution model • Possibility to probability transformation: {uki}i=1..n is the possibility distribution of X at vk; {pki}i=1..n is the probability distribution of X at vk • Create the probability model at vk using {pki} • Impute missing values using the probability model
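
The transformation and the imputation rule are not spelled out in the transcript. The sketch below shows one simple way the pieces could fit together, under loudly stated assumptions: the possibility-to-probability step is taken to be a plain normalization, p_ki = u_ki / Σ_i u_ki; each cluster's normal model is described per attribute by a probability-weighted mean and variance over observed values; and a missing value is filled with the membership-weighted average of the cluster means. The names `cluster_normal_params` and `impute_missing` are hypothetical, and fzPBI itself may use a different transformation and imputation rule.

```python
def cluster_normal_params(X, U, k, j):
    """Weighted mean and variance of attribute j in cluster k (observed values only).

    X uses None for missing values; U[i][k] is the membership of point i in cluster k.
    """
    obs = [(x[j], U[i][k]) for i, x in enumerate(X) if x[j] is not None]
    total = sum(u for _, u in obs)
    # Assumed possibility-to-probability step: p_ki = u_ki / sum_i u_ki
    probs = [(val, u / total) for val, u in obs]
    mean = sum(val * p for val, p in probs)
    var = sum(p * (val - mean) ** 2 for val, p in probs)
    return mean, var

def impute_missing(X, U):
    """Return a copy of X with each None replaced by a membership-weighted mean."""
    c, p = len(U[0]), len(X[0])
    means = [[cluster_normal_params(X, U, k, j)[0] for j in range(p)]
             for k in range(c)]
    filled = []
    for i, x in enumerate(X):
        filled.append([
            sum(U[i][k] * means[k][j] for k in range(c)) if xj is None else xj
            for j, xj in enumerate(x)
        ])
    return filled
```

The variance is returned but not used by `impute_missing`; a fuller model could use it to sample imputed values or to weight clusters by likelihood rather than by membership alone.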

  10. Datasets • Artificial datasets: a dataset generated using a finite mixture model; a manually created non-uniform dataset, in which clusters differ in size and the distances between clusters differ • Real datasets: the Iris and Wine datasets from the UC Irvine Machine Learning Repository; the RCNS (rat central nervous system), Serum, Yeast and Yeast-MIPS gene expression datasets • Incomplete datasets were generated using different percentages of missing values
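
The slide does not say how values were removed to create the incomplete datasets. A minimal sketch, assuming values are removed completely at random at a given fraction of all cells, could look like this. The function name `remove_values` and the use of `None` as the missing marker are assumptions, and the sketch does not guard against a data point losing all of its attributes.

```python
import random

def remove_values(X, fraction, seed=0):
    """Return a copy of X with a given fraction of cells replaced by None."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    cells = [(i, j) for i in range(n) for j in range(p)]
    k = round(fraction * len(cells))      # number of cells to blank out
    chosen = set(rng.sample(cells, k))    # sampled without replacement
    return [[None if (i, j) in chosen else X[i][j] for j in range(p)]
            for i in range(n)]
```

Keeping the original complete matrix alongside the masked copy makes it possible to compare imputed values with the true ones afterwards.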

  11. Performance measures • Root mean square error (RMSE) • Misclassification error (ME): compares the cluster label of each data object with its actual class label
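
Both measures can be sketched in a few lines. The RMSE below is assumed to be taken between the true values and the imputed values at the missing positions. For ME, cluster labels are arbitrary, so the sketch tries every one-to-one assignment of cluster labels to class labels and keeps the best match; this matching step is an assumption about how the comparison is done, it requires no more clusters than classes, and it is only practical for a small number of clusters.

```python
import math
from itertools import permutations

def rmse(true_vals, imputed_vals):
    """Root mean square error between true and imputed values."""
    sq = sum((t - e) ** 2 for t, e in zip(true_vals, imputed_vals))
    return math.sqrt(sq / len(true_vals))

def misclassification_error(cluster_labels, class_labels):
    """Fraction of objects whose cluster does not match its class,
    under the best one-to-one mapping from clusters to classes."""
    clusters = sorted(set(cluster_labels))
    classes = sorted(set(class_labels))
    best = 0
    for perm in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, perm))
        hits = sum(mapping[c] == y for c, y in zip(cluster_labels, class_labels))
        best = max(best, hits)
    return 1.0 - best / len(class_labels)
```

For example, cluster labels `[0, 0, 1, 1]` against class labels `[1, 1, 1, 0]` give an ME of 0.25: the best mapping sends cluster 0 to class 1 and cluster 1 to class 0, which leaves one of four objects mismatched.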

  12. Uniform dataset • Methods compared: fzPBI, the probability-based method • OCS, optimal completion strategy • NPS, nearest prototype strategy • FCMimp, FCM-based imputation • CIAO, alternating optimization • FCMGOimp, FCM- and GO-based imputation

  13. Non-uniform dataset

  14. Iris dataset

  15. RCNS gene expression dataset

  16. Yeast gene expression dataset

  17. Serum gene expression dataset

  18. The advantages of fzPBI • Approximates the data distribution using a probability model • Applies the model to missing value imputation • Inherits the advantages of FCM and of model-based methods, together with the application of the central limit theorem

  19. Future work • Combine fzPBI with biological knowledge: protein-protein interactions, Gene Ontology • Internal measures using the data • External measures using the biological knowledge • Internal measures at missing values are adjusted using the external measures.

  20. Thank you! Questions? • We acknowledge the support of the Vietnamese Ministry of Education and Training through the 322 scholarship program.
