Fuzzy Machine Learning Methods for Biomedical Data Analysis

Fuzzy Machine Learning Methods for Biomedical Data Analysis Yanqing Zhang Department of Computer Science Georgia State University Atlanta, GA 30302-5060 yzhang@gsu.edu Yan-Qing Zhang, Georgia State University

Outline
• Background
• Fuzzy Association Rule Mining for Decision Support (FARM-DS)
• FARM-DS on Medical Data
• FARM-DS on Microarray Expression Data
• Fuzzy-Granular Gene Selection on Microarray Expression Data
• Conclusion and Future Work

Background
• Theory
  • Computational Intelligence, Granular Computing, Fuzzy Sets
  • Knowledge Discovery and Data Mining (KDD)
  • Decision Support System (DS)
  • Rule-Based Reasoning (RBR), Association Rule Mining
• Application
  • Bioinformatics, Medical Informatics, etc.
• Concern
  • Accuracy
  • Interpretability

Motivation – deal with numeric data
• Traditional association rule mining algorithm
  • If X, then Y
  • Conf = Pr(Y|X), Supp = Pr(X and Y)
  • does not work on numeric data
• Fuzzy Logic
  • Feature transform
  • Fuzzy AR mining (Zadeh, 1965)
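
The two measures on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the item names (`f1_low`, `y=+1`) and the toy transactions are made up for the example.

```python
def support(transactions, itemset):
    """Supp = Pr(X and Y): fraction of transactions containing every item."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Conf = Pr(Y | X) = Supp(X and Y) / Supp(X)."""
    supp_x = support(transactions, antecedent)
    if supp_x == 0:
        return 0.0  # rule never fires; define its confidence as 0
    return support(transactions, set(antecedent) | set(consequent)) / supp_x


# Hypothetical discrete transactions (already discretized, which is exactly
# what raw numeric data does not give us for free).
transactions = [
    {"f1_low", "f2_high", "y=+1"},
    {"f1_low", "f2_low", "y=+1"},
    {"f1_high", "f2_high", "y=-1"},
    {"f1_low", "f2_high", "y=-1"},
]

rule_supp = support(transactions, {"f1_low", "y=+1"})       # 2 of 4 transactions
rule_conf = confidence(transactions, {"f1_low"}, {"y=+1"})  # 2 of the 3 with f1_low
print(rule_supp, rule_conf)
```

Both functions need items that can be tested for membership, which is why numeric features must first be mapped to labels such as "low" or "high".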

Motivation – decision support
• Fuzzy association rules (FARs) for classification
  • Accuracy vs. interpretability
• Very few works
  • Hu et al. 2002
    • Combinatorial rule explosion
  • Chatterjee et al. 2004
    • Human intervention

FARM-DS
• Target
  • Numeric data
  • Binary classification
• Effectiveness
  • Accuracy
  • Interpretability
• Modeling process
  • Training
  • Testing

Step 1: Fuzzy Interval Partition
• 1-in-1-out 0-order TSK model
• ANFIS for model optimization and parameter selection (Jang, 1993)
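
To make the idea of a fuzzy interval partition concrete, here is a minimal sketch. It does not reproduce the ANFIS-tuned TSK model named on the slide. As a stated assumption, it uses fixed, evenly spaced triangular membership functions, and the labels `low`, `medium`, `high` and all function names are illustrative.

```python
def triangular(x, left, peak, right):
    """Triangular membership; acts as a shoulder when left == peak or peak == right."""
    if x <= left:
        return 1.0 if left == peak else 0.0
    if x >= right:
        return 1.0 if right == peak else 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


def make_partition(lo, hi, labels=("low", "medium", "high")):
    """Evenly spaced fuzzy intervals over [lo, hi] (assumption: no ANFIS tuning)."""
    n = len(labels)
    step = (hi - lo) / (n - 1)
    peaks = [lo + i * step for i in range(n)]
    partition = {}
    for i, label in enumerate(labels):
        left = peaks[i - 1] if i > 0 else peaks[i]
        right = peaks[i + 1] if i < n - 1 else peaks[i]
        partition[label] = (left, peaks[i], right)
    return partition


def fuzzify(x, partition):
    """Membership degree of x in every fuzzy interval."""
    return {label: triangular(x, *abc) for label, abc in partition.items()}


def best_label(x, partition):
    """Label of the fuzzy interval in which x has the highest membership."""
    memberships = fuzzify(x, partition)
    return max(memberships, key=memberships.get)


partition = make_partition(0.0, 10.0)
print(fuzzify(2.5, partition))    # partly "low", partly "medium"
print(best_label(9.0, partition))
```

In the method described on the slide, the interval shapes and positions would come out of the ANFIS optimization rather than being fixed in advance.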

Step 2: Data Abstraction
• Clustering
  • K-Means
  • Fuzzy C-means
• Validation
  • Number of clusters
  • Optimal cluster
  • Silhouette value
[Figure: positive cluster and negative cluster]
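
The following sketch shows one way the clustering and validation bullets could fit together: run K-Means for several candidate cluster counts and keep the count with the highest mean silhouette value. It is a plain-Python illustration under assumptions. The slide does not say how candidates are enumerated or whether each class is clustered separately, and the function names and toy points are hypothetical.

```python
import math
import random


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's K-Means on a list of tuples; returns (centers, labels)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stable: converged
        labels = new_labels
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old center if a cluster went empty
                centers[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return centers, labels


def silhouette(points, labels):
    """Mean silhouette value; singleton clusters score 0 by convention."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    if len(clusters) < 2:
        return 0.0
    scores = []
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            scores.append(0.0)
            continue
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for m, other in clusters.items() if m != l)
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


def best_k(points, candidates=(2, 3, 4)):
    """Pick the candidate cluster count with the highest mean silhouette."""
    scored = {k: silhouette(points, kmeans(points, k)[1]) for k in candidates}
    return max(scored, key=scored.get), scored


# Toy data: two well-separated groups of samples.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
k, scores = best_k(points)
print(k, scores)
```

The cluster centers returned by `kmeans` are the abstracted data that the next step turns into transactions.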

Step 3: Generating Fuzzy Discrete Transactions
• Project the center of each cluster on each feature
• Create transactions
  • For a positive cluster, +1 is inserted
  • For a negative cluster, -1 is inserted
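
A minimal sketch of this step, under assumptions: each feature already has named fuzzy intervals, and a projected center value is assigned to the interval whose peak is nearest (a stand-in for picking the highest membership). The item naming scheme `f1_low` and the helper names are hypothetical.

```python
def nearest_interval(value, peaks):
    """peaks maps interval label -> peak position; return the closest label."""
    return min(peaks, key=lambda label: abs(value - peaks[label]))


def make_transactions(pos_centers, neg_centers, feature_peaks):
    """One transaction per cluster center: one item per feature plus the class."""
    transactions = []
    for label, centers in ((+1, pos_centers), (-1, neg_centers)):
        for center in centers:
            # Project the center on each feature and discretize the projection.
            items = [f"f{j + 1}_{nearest_interval(v, feature_peaks[j])}"
                     for j, v in enumerate(center)]
            items.append(f"y={label:+d}")  # +1 for positive, -1 for negative
            transactions.append(items)
    return transactions


# Hypothetical fuzzy interval peaks for two features.
feature_peaks = [
    {"low": 0.0, "high": 10.0},
    {"low": 0.0, "high": 1.0},
]
pos_centers = [(1.0, 0.9), (2.0, 0.8)]
neg_centers = [(9.0, 0.1)]

for t in make_transactions(pos_centers, neg_centers, feature_peaks):
    print(t)
```

Because there is exactly one transaction per cluster center, the size of the transaction set is bounded by the number of clusters rather than by the number of interval combinations.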

Step 3 - example
[Figure: example on features f1 and f2]
• 5-2 = 3 transactions
  • 1 f1_1
  • 1 f1_1
  • 1 f1_1
• Avoid combinatorial rule explosion
  • The number of distinct transactions is determined by the number of clusters

Step 4: Association Rule Mining
• Association rule mining on fuzzy discrete transactions
  • Traditional Apriori algorithm (Agrawal and Srikant 1994)
  • If f1 is low, f2 is high, …, fh is low, then y=1/-1
• Rule pruning:
  • For a pair of rules A and B, if B is more specific than A (that is, the conditions of A are contained in those of B) and B has the same support value as A, then A is eliminated.
  • A: If f1 is low, then y=1, sup=50%
  • B: If f1 is low and f2 is high, then y=1, sup=50%
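
The pruning rule can be sketched as follows. The rule representation (a frozenset of conditions, a class label, a support value) is an assumption made for the example, as is the requirement that A and B predict the same class. The Apriori mining itself is not shown.

```python
import math


def prune_rules(rules):
    """Drop rule A when a more specific rule B has the same class and support.

    Each rule is (conditions: frozenset, consequent, support).
    """
    kept = []
    for a_items, a_cons, a_sup in rules:
        subsumed = any(
            a_cons == b_cons
            and a_items < b_items            # A's conditions strictly inside B's
            and math.isclose(a_sup, b_sup)   # same support value
            for b_items, b_cons, b_sup in rules
        )
        if not subsumed:
            kept.append((a_items, a_cons, a_sup))
    return kept


# Rules A and B from the slide, plus an unrelated rule C (hypothetical).
rule_a = (frozenset({"f1=low"}), +1, 0.50)
rule_b = (frozenset({"f1=low", "f2=high"}), +1, 0.50)
rule_c = (frozenset({"f2=high"}), -1, 0.25)

pruned = prune_rules([rule_a, rule_b, rule_c])
for conditions, consequent, sup in pruned:
    print(sorted(conditions), consequent, sup)
```

With the slide's example, A is removed and B is kept: both cover the same 50% of transactions, so the shorter rule adds nothing that B does not already state more precisely.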

Testing Phase

Adaptive FARM-DS
• Train
  • Fuzzy interval partition
  • Data abstraction
  • Generate fuzzy discrete transactions
  • AR mining
• Test
(He, et al. 2006a, IJDMB)

Empirical Studies
• Classification algorithms
  • C4.5 decision trees (Quinlan, 1993)
  • Support vector machines (Vapnik, 1995)
  • FARM-DS (He, et al. 2006a, IJDMB)
• Accuracy estimation
  • 5-fold cross-validation
• Interpretability

Evaluation metrics
• Accuracy
  • Classification error
  • Area under ROC curve (future work)
• Interpretability
  • Number of rules
  • Average rule length
(Bradley, 1997)

Datasets
(Merz, et al. UCI repository of machine learning databases, 1998)

Result analysis on Accuracy
• FARM-DS ≈ SVM > C4.5
• SVM2 and C4.5 results from (Bennett et al. 1997)

Result analysis on Interpretability
• SVM: high accuracy, hard to interpret
• C4.5: low accuracy, easy to interpret
• FARM-DS: high accuracy, easy to interpret

Interpretability (1)
• FARs extracted by FARM-DS are short and compact, and hence easy to understand.
• 22 positive rules and 8 negative rules are extracted.
• On average,
  • the length of a positive rule is 2.6,
  • the length of a negative rule is 4.3,
  • and every sample activates
    • 3.3 positive rules and
    • 5.6 negative rules.

Interpretability (2)
• FARs may help human experts correct wrongly classified samples.

Interpretability (3)
• The larger support of the negative rules may help human experts make correct final decisions and find the underlying disease-causing mechanisms.

Interpretability (4)
• FARs help to select important features.
  • A higher activation frequency indicates a more important feature.

Microarray Expression Data
• Extremely high dimensionality
• Gene selection
• Cancer classification
• Rule-based reasoning

Empirical Studies
• Rule-Based Reasoning/Classification
  • CART for decision tree modeling (Breiman, et al. 1984)
  • ANFIS for fuzzy neural network modeling (Jang, 1993)
  • FARM-DS (He, et al. 2006a, IJDMB)

Evaluation metrics
• Accuracy
  • Classification error
  • Area under ROC curve
• Accuracy estimation
  • Leave-one-out cross-validation
• Interpretability
  • Number of rules
  • Average rule length
(Bradley, 1997)

AML/ALL leukemia dataset
(Tang, et al. 2006)

Result analysis: AML/ALL leukemia dataset
• Higher accuracy than CART
• Easier to interpret than ANFIS

Rules extracted by FARM-DS: AML/ALL leukemia dataset
• IF gene2 (Y12670), gene3 (D14659) and gene5 (M80254) are down-regulated,
• THEN the tissue is ALL (-1)

Prostate cancer dataset
(Tang, et al. 2006)

Result analysis: prostate cancer dataset
• Higher accuracy than CART
• Easier to interpret than ANFIS

Rules extracted by FARM-DS: prostate cancer dataset

Gene Selection and Cancer Classification on Microarray Expression Data
• Extremely high dimensionality
  • AML/ALL leukemia dataset: 72 samples × 7129 genes
  • no more than 10% relevant genes (Golub, et al. 1999)
• Gene selection
  • accurate classification
  • helpful for cancer study

Gene Categorization and Gene Ranking
• Informative genes
• Redundant genes
• Irrelevant genes
• Noisy genes

Information Loss
• Noise
  • Overfitting themselves
  • Complementary to redundant/irrelevant genes
  • Conflict with informative genes
• Imbalanced gene selection
• Inflexibility
How to decrease information loss? Granulation!

Coarse Granulation with Relevance Indexes
• Target: remove irrelevant genes
• Target: tune thresholds to select genes in balance
[Figure: panels labeled imbalance, imbalance, balance]

Fine Granulation with Fuzzy C-Means Clustering
• clustering in the training samples space
• genes with similar expression patterns have similar functions
• a gene may have multiple functions (Fuzzy works here!)
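
The point that a gene may belong to several clusters can be illustrated with the standard fuzzy C-means membership formula, u_i = 1 / Σ_k (d_i / d_k)^(2/(m-1)). This sketch only computes memberships for fixed centers. The fuzzifier `m = 2`, the membership threshold, and the helper names are assumptions for the example, not values from the slides.

```python
import math


def fcm_memberships(point, centers, m=2.0):
    """Standard fuzzy C-means membership of one point in each cluster."""
    dists = [math.dist(point, c) for c in centers]
    if 0.0 in dists:
        # The point sits exactly on one or more centers: split membership there.
        zeros = [d == 0.0 for d in dists]
        return [1.0 / sum(zeros) if z else 0.0 for z in zeros]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_k) ** power for d_k in dists) for d_i in dists]


def assign_clusters(memberships, threshold=0.3):
    """A gene joins every cluster where its membership reaches the threshold."""
    return [i for i, u in enumerate(memberships) if u >= threshold]


centers = [(0.0, 0.0), (4.0, 0.0)]           # hypothetical cluster centers
near_first = fcm_memberships((1.0, 0.0), centers)
in_between = fcm_memberships((2.0, 0.0), centers)

print(near_first, assign_clusters(near_first))
print(in_between, assign_clusters(in_between))
```

A gene lying between two centers gets comparable membership in both and is assigned to both clusters, which a hard clustering such as K-Means cannot express.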

Conquer with Correlation-based Ranking
• Lower-ranked genes are removed as redundant genes

Aggregation with Data Fusion
• Pick genes from different clusters in balance
• An informative gene is more likely to survive (due to fuzzy clustering)

[Flowchart]
Original Gene Set
→ Relevance-index-based pre-filtering
→ Relevant Gene Set
→ Fuzzy C-Means Clustering
→ Gene Cluster 1, Gene Cluster 2, …, Gene Cluster K
→ Correlation-based Gene Ranking 1, 2, …, K (one per cluster)
→ Final Gene Set

Empirical Study
• Comparison
  • Signal to Noise (S2N) (Furey, et al. 2000)
  • Fuzzy-Granular + S2N
  • Fisher Criterion (FC) (Pavlidis, et al. 2001)
  • Fuzzy-Granular + FC
  • T-Statistics (TS) (Duan, et al. 2004)
  • Fuzzy-Granular + TS
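
For readers unfamiliar with the three baseline ranking scores, the sketch below gives their commonly used textbook forms for a single gene, computed from its expression values in the positive and negative classes. The exact variants in the cited papers may differ, for example in using population rather than sample variance. The function names and toy values are illustrative.

```python
import math
from statistics import mean, stdev, variance


def s2n(pos, neg):
    """Signal to Noise: (mu+ - mu-) / (sigma+ + sigma-)."""
    return (mean(pos) - mean(neg)) / (stdev(pos) + stdev(neg))


def fisher(pos, neg):
    """Fisher Criterion: (mu+ - mu-)^2 / (sigma+^2 + sigma-^2)."""
    return (mean(pos) - mean(neg)) ** 2 / (variance(pos) + variance(neg))


def t_stat(pos, neg):
    """T-Statistic: (mu+ - mu-) / sqrt(sigma+^2 / n+ + sigma-^2 / n-)."""
    pooled = variance(pos) / len(pos) + variance(neg) / len(neg)
    return (mean(pos) - mean(neg)) / math.sqrt(pooled)


def rank_genes(expr_pos, expr_neg, score=s2n):
    """Rank gene names by absolute score, most discriminative first."""
    scores = {g: abs(score(expr_pos[g], expr_neg[g])) for g in expr_pos}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical expression values: geneA separates the classes, geneB does not.
expr_pos = {"geneA": [5.0, 6.0, 7.0], "geneB": [1.0, 5.0, 9.0]}
expr_neg = {"geneA": [1.0, 2.0, 3.0], "geneB": [2.0, 5.0, 8.0]}

print(s2n(expr_pos["geneA"], expr_neg["geneA"]))
print(fisher(expr_pos["geneA"], expr_neg["geneA"]))
print(t_stat(expr_pos["geneA"], expr_neg["geneA"]))
print(rank_genes(expr_pos, expr_neg))
```

In the fuzzy-granular variants listed on the slide, a score of this kind would presumably be applied within each gene cluster rather than once over all genes.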

Evaluation Methods
Metrics
• Accuracy
• Sensitivity
• Specificity
• Area under ROC curve
Estimation
• Leave-1-out CV
• .632 bootstrapping
  • .632 Perf = 0.368 * training perf + 0.632 * testing perf
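
The .632 estimate above combines two numbers from one bootstrap round. A minimal sketch, assuming the usual setup in which the training set is drawn with replacement and the samples never drawn form the test set; the function names are illustrative.

```python
import random


def bootstrap_split(n, rng):
    """Draw n training indices with replacement; the rest are test indices."""
    train = [rng.randrange(n) for _ in range(n)]
    chosen = set(train)
    test = [i for i in range(n) if i not in chosen]
    return train, test


def perf_632(train_perf, test_perf):
    """.632 Perf = 0.368 * training perf + 0.632 * testing perf."""
    return 0.368 * train_perf + 0.632 * test_perf


rng = random.Random(0)
train_idx, test_idx = bootstrap_split(100, rng)
print(len(train_idx), len(test_idx))

# Hypothetical performances from one bootstrap round.
print(perf_632(train_perf=1.0, test_perf=0.8))
```

The weights reflect that a bootstrap sample contains about 63.2% of the distinct original samples on average, so the optimistic training performance is down-weighted against the pessimistic out-of-sample performance.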

Prostate cancer dataset

Result analysis: prostate cancer dataset

Colon cancer dataset

Result analysis: colon cancer dataset

Conclusion
• High-level data abstraction
  • data clustering techniques
• Quantitative data transformed to fuzzy discrete transactions
  • Fuzzy interval partition
• Apriori algorithm for AR mining
• Strong decision support for biomedical study
  • High accuracy and easy to interpret
• More accurate cancer classification
  • Eliminate irrelevant/redundant genes to decrease noise
  • Select informative genes in balance
