Recent Research and Development on Microarray Data Mining Shin-Mu Tseng 曾新穆 tsengsm@mail.ncku.edu.tw Dept. Computer Science and Information Engineering National Cheng Kung University Taiwan, R.O.C. August 13, 2001
Outline • Microarray Techniques • Goal of Microarray Data Mining • Clustering Methods • Efficient Microarray Data Mining • Conclusions
Current Status • The Human Genome Project is in its finishing stage, revealing that there are about 30,000 functional genes in a human cell • For more than 90% of these genes, we know little about their real functions
Microarray Techniques • Main Advantage of Microarray Techniques • Allow the expression of thousands of genes to be studied simultaneously in a single experiment • Microarray Process • Arrayer • Experiments: Hybridization • Image Capturing of Results • Analysis
Goal of Microarray Mining • Multi-Conditions Expression Analysis
Expression matrix (rows: tests A, B, C, …; columns: genes 1, 2, 3, 4, …, 1000):
0.4  0.9  0    0.5  …  0.8
0.2  0.8  0.3  0.2  …  0.7
0.6  0.2  0    0.7  …  0.3
…    …    …    …    …  …
Clustering Methods • Types of Clustering Methods • Partitioning: K-Means, K-Medoids, PAM, CLARA, … • Hierarchical: HAC, BIRCH, CURE, ROCK • Density-based: CAST, DBSCAN, OPTICS, CLIQUE, … • Grid-based: STING, CLIQUE, WaveCluster, … • Model-based: COBWEB, SOM, CLASSIT, AutoClass, …
Clustering Methods (cont.) • [Figure panels: Partitioning, Hierarchical]
Clustering Methods (cont.) • [Figure panels: Density-based, Grid-based]
CAST Clustering • Input • S: a symmetric n × n similarity matrix, S(i, j) ∈ [0, 1] • t: affinity threshold (0 < t < 1) • Method 1. Choose a seed for generating a new cluster 2. ADD: add qualified items to the cluster 3. REMOVE: remove unqualified items from the cluster 4. Repeat ADD and REMOVE until the cluster is stable, then repeat Steps 1-3 until no more clusters can be generated
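The four steps above can be sketched in Python. This is a minimal illustration, not the implementation behind these slides. It assumes the usual CAST convention that an item's affinity is the sum of its similarities to the current cluster members, and that an item is "qualified" when its affinity is at least t times the cluster size. The function name `cast`, the lowest-index seed choice, and the `max_steps` safety cap are assumptions made for the example.

```python
def cast(S, t, max_steps=10_000):
    """Sketch of CAST: S is a symmetric n x n similarity matrix with
    values in [0, 1]; t is the affinity threshold (0 < t < 1)."""
    n = len(S)
    unassigned = set(range(n))
    clusters = []
    while unassigned:
        # Step 1: choose a seed (assumption: the lowest unassigned index).
        seed = min(unassigned)
        cluster = {seed}
        # affinity[i] = sum of similarities between item i and the cluster.
        affinity = [S[i][seed] for i in range(n)]
        for _ in range(max_steps):  # cap guards against oscillation
            # Step 2 (ADD): add the outside item with the highest affinity
            # if it reaches the threshold t * |cluster|.
            outside = unassigned - cluster
            if outside:
                best = max(outside, key=lambda i: affinity[i])
                if affinity[best] >= t * len(cluster):
                    cluster.add(best)
                    for i in range(n):
                        affinity[i] += S[i][best]
                    continue
            # Step 3 (REMOVE): drop the member with the lowest affinity
            # if it falls below the threshold.
            if len(cluster) > 1:
                worst = min(cluster, key=lambda i: affinity[i])
                if affinity[worst] < t * len(cluster):
                    cluster.remove(worst)
                    for i in range(n):
                        affinity[i] -= S[i][worst]
                    continue
            break  # neither ADD nor REMOVE applied: the cluster is stable
        clusters.append(sorted(cluster))
        unassigned -= cluster  # Step 4: repeat with the remaining items
    return clusters
```

Because every cluster keeps at least its seed or another member, the set of unassigned items shrinks on each outer pass, so the loop ends once every item belongs to a cluster.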
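Similarity Measurements: Correlation Coefficients • The most popular correlation coefficient is the Pearson correlation coefficient (1892) • Correlation between X = {X1, X2, …, Xn} and Y = {Y1, Y2, …, Yn}: $r = \frac{\sum_{i=1}^{n}(X_i-\bar{X})(Y_i-\bar{Y})}{\sqrt{\sum_{i=1}^{n}(X_i-\bar{X})^2}\,\sqrt{\sum_{i=1}^{n}(Y_i-\bar{Y})^2}}$, where $\bar{X}$ and $\bar{Y}$ are the means of X and Y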
Similarity Measurements: Correlation Coefficients (cont.) • It captures the similarity of the "shapes" of two expression profiles and ignores differences between their magnitudes.
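The shape-versus-magnitude point can be checked with a short Python sketch of the standard Pearson correlation coefficient. The function name `pearson` and the sample profiles are illustrative assumptions, not taken from the slides.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of centered products and centered squares.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical expression profiles: `scaled` has the same shape as `base`
# but a very different magnitude, so the correlation is still close to 1.
base = [0.4, 0.9, 0.0, 0.5]
scaled = [10 * v + 3 for v in base]
mirrored = [-v for v in base]
r_same_shape = pearson(base, scaled)
r_opposite = pearson(base, mirrored)
```

Note that the coefficient ranges over [-1, 1], while CAST expects similarities in [0, 1], so some rescaling is needed before the two are combined. The slides do not say which rescaling was used.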
Problems in Microarray Mining • How can microarray data be clustered while meeting the following requirements simultaneously? • Efficiency • Accuracy • Automation
Problems in Microarray Mining (cont.) • Good Clustering Methods + Validation Techniques
Efficient Microarray Mining • Improved CAST algorithm for clustering • Hubert’s Γ statistic for validation • Iterative sampled computation for automatic clustering
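As an illustration of the validation step, the sketch below computes the normalized form of Hubert's Γ statistic: the correlation, over all item pairs, between the similarity matrix and a 0/1 matrix that marks whether the pair shares a cluster. Whether the slides use this normalized form is an assumption, and the name `hubert_gamma` and the `labels` argument are hypothetical.

```python
import math
from itertools import combinations

def hubert_gamma(S, labels):
    """Normalized Hubert's Gamma: correlation between similarities S[i][j]
    and the indicator 'i and j are in the same cluster' over pairs i < j."""
    pairs = list(combinations(range(len(labels)), 2))
    if not pairs:
        raise ValueError("need at least two items")
    x = [S[i][j] for i, j in pairs]
    y = [1.0 if labels[i] == labels[j] else 0.0 for i, j in pairs]
    m = len(pairs)
    mean_x = sum(x) / m
    mean_y = sum(y) / m
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # Happens when all similarities are equal, or when the clustering
        # puts everything in one cluster or every item in its own cluster.
        raise ValueError("Gamma is undefined for a constant matrix")
    return cov / math.sqrt(var_x * var_y)
```

A higher value means the clustering agrees better with the similarity matrix, so clusterings produced with different affinity thresholds can be compared by their Γ values.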
Reduce the Computation 1. Narrow down the threshold range 2. Split and Conquer: find the "nearly-best" result • Example: m = 4 • LM: Left Margin • RM: Right Margin • [Figure: threshold axis from 0 to 100% with LM and RM marked]
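The slide does not spell out the procedure, so the following Python sketch is only one possible reading of "narrow down" plus "split and conquer": split the range [LM, RM] into m parts, score each split point, shrink the range to the neighbors of the best point, and repeat. The function name `split_and_conquer`, the `tol` stopping rule, and the `score` callback (which would cluster at threshold t and return its Γ statistic) are all assumptions.

```python
def split_and_conquer(score, lm=0.0, rm=1.0, m=4, tol=0.01):
    """Search [lm, rm] for a threshold with a nearly-best score.

    score(t) returns the quality of the clustering at threshold t.
    Returns (best_threshold, best_score, number_of_evaluations).
    """
    if m < 3:
        raise ValueError("m must be at least 3 so the range shrinks")
    cache = {}  # threshold -> score, so no threshold is scored twice

    def evaluate(t):
        if t not in cache:
            cache[t] = score(t)
        return cache[t]

    while rm - lm > tol:
        step = (rm - lm) / m
        points = [lm + k * step for k in range(m + 1)]
        best = max(points, key=evaluate)
        # Keep only the neighborhood of the best split point.
        lm, rm = max(lm, best - step), min(rm, best + step)
    evaluate((lm + rm) / 2)  # ensures at least one evaluation
    best_t = max(cache, key=cache.get)
    return best_t, cache[best_t], len(cache)
```

Each pass at least halves the range when m = 4, so only a handful of clustering runs are needed. The result is "nearly-best" rather than optimal: if the score has several peaks, the search can settle on a local one.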
Experimental Results • Dataset • Source: Michael Eisen's Lab, Lawrence Berkeley National Lab (LBNL) (http://rana.lbl.gov/EisenData.htm) • Microarray expression data of the yeast Saccharomyces cerevisiae, containing 6221 genes under 80 conditions • The similarity matrix was computed in advance
Experimental Results (cont.) • Without Range Narrowing • Executions: 19 • Execution Time: 246 sec • Γ statistic: 0.5138 • With Range Narrowing • Executions: 13 • Execution Time: 27 sec • Γ statistic: 0.5137
Conclusions • Microarray data analysis is an emerging field that needs the support of data mining techniques • Accuracy • Efficiency • Automation