
Recent Research and Development on Microarray Data Mining Shin-Mu Tseng 曾新穆 tsengsm@mail.ncku.edu.tw Dept. Computer Science and Information Engineering National Cheng Kung University Taiwan, R.O.C. August 13, 2001

Outline • Microarray Techniques • Goal of Microarray Data Mining • Clustering Methods • Efficient Microarray Data Mining • Conclusions

Current Status • The Human Genome Project is at its finishing stage, revealing about 30,000 functional genes in a human cell • For more than 90% of these genes, we know little about their real functions

Microarray Techniques • Main Advantage of Microarray Techniques • allow simultaneous studies of the expression of thousands of genes in a single experiment • Microarray Process • Arrayer • Experiments: Hybridization • Image Capturing of Results • Analysis

Goal of Microarray Mining • Multi-Conditions Expression Analysis • Expression matrix of tests (A, B, C, …) by genes (1, 2, 3, 4, …, 1000), with sample rows:
  gene:  1    2    3    4    …   1000
         0.4  0.9  0    0.5  …   0.8
         0.2  0.8  0.3  0.2  …   0.7
         0.6  0.2  0    0.7  …   0.3
         …    …    …    …    …   …

Sample Clustering Results

Clustering Methods • Types of Clustering Methods • Partitioning: K-Means, K-Medoids, PAM, CLARA, … • Hierarchical: HAC, BIRCH, CURE, ROCK • Density-based: CAST, DBSCAN, OPTICS, CLIQUE, … • Grid-based: STING, CLIQUE, WaveCluster, … • Model-based: COBWEB, SOM, CLASSIT, AutoClass, …

Clustering Methods (cont.) Partitioning Hierarchical

Clustering Methods (cont.) Density-based Grid-based

CAST Clustering • Input • S: a symmetric n × n similarity matrix, S(i, j) ∈ [0, 1] • t: affinity threshold (0 < t < 1) • Method 1. Choose a seed for generating a new cluster 2. ADD: add qualified items to the cluster 3. REMOVE: remove unqualified items from the stable cluster 4. Repeat Steps 1-3 until no more clusters can be generated
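The four steps above can be sketched in Python. This is a minimal illustration, not the presenter's implementation: it assumes the usual CAST affinity rule (an item's affinity is the sum of its similarities to the current cluster members, and an item qualifies when that sum is at least t times the cluster size). The seed choice, the `max_steps` safety cap, and the toy matrix are hypothetical.

```python
import numpy as np

def cast(S, t, max_steps=10_000):
    """Cluster items 0..n-1 from a symmetric similarity matrix S (assumed rule set)."""
    S = np.asarray(S, dtype=float)
    unassigned = set(range(S.shape[0]))
    clusters = []
    while unassigned:
        # Step 1: choose a seed (assumption: the lowest-index unassigned item).
        seed = min(unassigned)
        unassigned.remove(seed)
        cluster = {seed}
        # affinity[i] = sum of S[i, j] over all j currently in the cluster
        affinity = S[:, seed].copy()
        for _ in range(max_steps):  # cap guards against ADD/REMOVE oscillation
            changed = False
            # Step 2 (ADD): take the unassigned item with the highest affinity.
            if unassigned:
                best = max(unassigned, key=lambda i: affinity[i])
                if affinity[best] >= t * len(cluster):
                    unassigned.remove(best)
                    cluster.add(best)
                    affinity += S[:, best]
                    changed = True
            # Step 3 (REMOVE): drop the member with the lowest affinity.
            if len(cluster) > 1:
                worst = min(cluster, key=lambda i: affinity[i])
                if affinity[worst] < t * len(cluster):
                    cluster.remove(worst)
                    unassigned.add(worst)
                    affinity -= S[:, worst]
                    changed = True
            if not changed:
                break
        # Step 4: close this cluster and start the next one.
        clusters.append(sorted(cluster))
    return clusters

# Toy similarity matrix (hypothetical): items 0-2 and items 3-4 form two groups.
toy_S = np.full((5, 5), 0.1)
toy_S[:3, :3] = 0.9
toy_S[3:, 3:] = 0.9
np.fill_diagonal(toy_S, 1.0)
toy_clusters = cast(toy_S, t=0.5)
```

With the toy matrix, a threshold of 0.5 separates the two groups; raising or lowering t changes how many clusters emerge, which is why the threshold has to be tuned.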

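Similarity Measurements: Correlation Coefficients • The most popular correlation coefficient is the Pearson correlation coefficient (1892) • Correlation between X = {X1, X2, …, Xn} and Y = {Y1, Y2, …, Yn}: $r(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2}\,\sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}}$, where $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$ and $\bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i$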

Similarity Measurements: Correlation Coefficients (cont.) • It captures the similarity of the "shapes" of two expression profiles and ignores differences between their magnitudes.
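A short Python check illustrates the "shape, not magnitude" property. The expression values below are hypothetical and the function name `pearson` is chosen for this sketch only.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()  # center each profile on its own mean
    yc = y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

profile = np.array([0.4, 0.9, 0.0, 0.5, 0.8])  # hypothetical expression levels
scaled = 3.0 * profile + 2.0                   # same shape, larger magnitude
mirrored = -profile                            # opposite shape

r_scaled = pearson(profile, scaled)
r_mirrored = pearson(profile, mirrored)
```

Here `r_scaled` equals 1 up to rounding, because rescaling and shifting a profile leaves its shape unchanged, while `r_mirrored` equals -1.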

Problems in Microarray Mining • How to cluster microarray data with the following requirements met simultaneously? • Efficiency • Accuracy • Automation

Problems in Microarray Mining (cont.) • Meeting efficiency, accuracy, and automation simultaneously requires: Good Clustering Methods + Validation Techniques

Efficient Microarray Mining • Improved CAST algorithm for clustering • Hubert’s Γ statistic for validation • Iterative sampled computation for automatic clustering
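As an illustration of the validation step, the sketch below computes a normalized Hubert's Γ statistic. It assumes the common definition: the correlation, over all item pairs, between the similarity matrix and a 0/1 matrix marking whether the pair shares a cluster. The slides do not state which normalization was used, so treat this as one plausible reading; the function name and toy data are hypothetical.

```python
import numpy as np

def hubert_gamma(S, labels):
    """Normalized Hubert's Gamma between similarities and cluster co-membership."""
    S = np.asarray(S, dtype=float)
    labels = np.asarray(labels)
    upper = np.triu_indices(len(labels), k=1)  # each unordered pair once
    x = S[upper]
    y = (labels[:, None] == labels[None, :])[upper].astype(float)
    xc = x - x.mean()
    yc = y - y.mean()
    denom = np.sqrt((xc @ xc) * (yc @ yc))
    # Degenerate case (all pairs alike): report 0 rather than divide by zero.
    return float((xc @ yc) / denom) if denom > 0 else 0.0

# Toy similarity matrix (hypothetical): items 0-2 and items 3-4 form two groups.
demo_S = np.full((5, 5), 0.1)
demo_S[:3, :3] = 0.9
demo_S[3:, 3:] = 0.9
np.fill_diagonal(demo_S, 1.0)

gamma_good = hubert_gamma(demo_S, [0, 0, 0, 1, 1])  # matches the block structure
gamma_bad = hubert_gamma(demo_S, [0, 1, 0, 1, 0])   # cuts across the blocks
```

A clustering that agrees with the similarity structure scores close to 1, and a poor one scores lower, so Γ can rank the results obtained at different affinity thresholds.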

Reduce the Computation 1. Narrow down the threshold range 2. Split and Conquer: find the "nearly-best" result • Example: m = 4 • LM: Left Margin • RM: Right Margin • [Figure: threshold axis from 0 to 100%, with LM and RM marked]
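One way to read the split-and-conquer idea is sketched below. The slides give only the outline, so the details are assumptions: the range between LM and RM is split into m parts, the interior split points are scored, and the range is narrowed around the best point before repeating. In practice `score(t)` would run the clustering at threshold t and return its Γ statistic; here a stand-in function is used, and all names are hypothetical.

```python
def search_threshold(score, lm=0.0, rm=1.0, m=4, rounds=10):
    """Return a nearly-best threshold in [lm, rm] by repeated splitting (assumed scheme)."""
    best_t, best_s = None, float("-inf")
    for _ in range(rounds):
        step = (rm - lm) / m
        # Score the m - 1 interior split points of the current range.
        for k in range(1, m):
            t = lm + k * step
            s = score(t)
            if s > best_s:
                best_t, best_s = t, s
        # Narrow the margins to one step on either side of the best point so far.
        lm = max(lm, best_t - step)
        rm = min(rm, best_t + step)
    return best_t, best_s

# Stand-in for "cluster at threshold t, then compute the validation statistic".
def stand_in_score(t):
    return -(t - 0.37) ** 2

found_t, found_s = search_threshold(stand_in_score)
```

Each round scores only m - 1 thresholds instead of sweeping the whole range, so the number of clustering runs grows with the number of rounds rather than with the resolution of the search. The result is only nearly best: a score with several peaks can lead the search into the wrong interval.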

Experimental Results • Dataset • Source: Lawrence Berkeley National Lab (LBNL), Michael Eisen's Lab (http://rana.lbl.gov/EisenData.htm) • Microarray expression data of the yeast Saccharomyces cerevisiae, containing 6221 genes with 80 conditions • Similarity matrix was obtained in advance

Experimental Results (cont.) • Without range narrowing • Executions: 19 • Execution time: 246 sec • Γ statistic: 0.5138 • With range narrowing • Executions: 13 • Execution time: 27 sec • Γ statistic: 0.5137

Experimental Results (cont.)

Conclusions • Microarray data analysis is an emerging field needing support of data mining techniques • Accuracy • Efficiency • Automation
