Bioinformatics Cluster Analysis

Bioinformatics Cluster Analysis
Mentee: Joonoh Lim
Mentor: Sanketh Shetty

Background
• Cluster analysis is an unsupervised method for finding groupings (clusters) in data sets.
• In biology, cluster analysis is used to study genes and gene expression.
• Gene expression data clustering falls into three categories: gene-based, sample-based, and subspace clustering.
• The data set is usually obtained from DNA microarrays.

DNA Microarray

Establishing Data Set
• Gene-based: 15 x 15 x 8 → 225 x 8
• Sample-based: 15 x 15 x 8 → 8 x 225
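
The two layouts above can be sketched in a few lines of Python. This is only an illustration under an assumption the slide does not state: that the 15 x 15 axes index microarray spots (genes) and the third axis indexes 8 samples. The name `cube` and the random values are hypothetical placeholders, not real expression data.

```python
import random

random.seed(0)
ROWS, COLS, SAMPLES = 15, 15, 8  # assumed meaning: 15 x 15 spots, 8 samples

# Hypothetical expression cube: cube[r][c][s] is the value at spot (r, c) in sample s.
cube = [[[random.random() for _ in range(SAMPLES)]
         for _ in range(COLS)]
        for _ in range(ROWS)]

# Gene-based layout: one row per spot, one column per sample (225 x 8).
gene_based = [cube[r][c] for r in range(ROWS) for c in range(COLS)]

# Sample-based layout: the transpose, one row per sample (8 x 225).
sample_based = [list(column) for column in zip(*gene_based)]

print(len(gene_based), len(gene_based[0]))      # 225 8
print(len(sample_based), len(sample_based[0]))  # 8 225
```

In the gene-based layout each row is the object being clustered (a gene described by 8 values); in the sample-based layout each row is a sample described by 225 values.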

Types of Clustering Algorithms
• Partitional Methods
  • K-means clustering
  • Affinity propagation
  • Spectral clustering
  • Mean-shift clustering
  • Normalized cuts
  • Gaussian mixture models
• Hierarchical Methods
  • Single linkage
  • Complete linkage
  • Average linkage

Proximity Measure
• Defines the similarity (or dissimilarity) between data objects.
• Examples: Euclidean distance, Pearson’s correlation coefficient, jackknife correlation, Spearman’s rank-order correlation, city block distance (Manhattan distance), angular separation, etc.
• We use Euclidean distance. The Euclidean distance between points $p = (p_1, \dots, p_n)$ and $q = (q_1, \dots, q_n)$ is defined as:
$$d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$$
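
As a small worked example of the Euclidean distance, the sketch below computes it directly from the standard definition (the square root of the summed squared coordinate differences). The function name `euclidean_distance` is chosen here for illustration and is not taken from the slides.

```python
import math

def euclidean_distance(p, q):
    """Square root of the sum of squared coordinate differences."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of coordinates")
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

# A 3-4-5 right triangle: the distance from the origin to (3, 4) is 5.
print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))  # 5.0
```

For gene-based clustering, `p` and `q` would each be one gene's expression values across the samples.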

Hierarchical Clustering
• Single linkage: at each step, merge the two clusters whose closest members are separated by the minimum distance.
http://www.resample.com/xlminer/help/HClst/HClst_intro.htm
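
The single-linkage rule can be sketched as a naive agglomerative loop. This is a minimal illustration, not an efficient implementation: it assumes Euclidean distance, stops at a requested number of clusters instead of building a full dendrogram, and the function name `single_linkage` and the toy points are hypothetical.

```python
import math

def single_linkage(points, num_clusters):
    """Merge clusters until num_clusters remain; returns lists of point indices."""
    clusters = [[i] for i in range(len(points))]  # start with one cluster per point

    def cluster_distance(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > num_clusters:
        # Find the pair of clusters at minimum distance (ties go to the lowest indices).
        _, x, y = min(
            (cluster_distance(clusters[x], clusters[y]), x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
print(single_linkage(points, 3))  # [[0, 1], [2, 3], [4]]
```

Replacing `min` with `max` inside `cluster_distance` would give complete linkage, and taking the mean of the pairwise distances would give average linkage.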

Hierarchical Clustering Example: Colon Cancer Data
• Using complete linkage
• Dendrogram

K-means Clustering www.cs.cmu.edu/~awm
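
The slide gives only the title, so the following is a hedged sketch of the standard K-means procedure (Lloyd's algorithm): alternate between assigning each point to its nearest centroid and moving each centroid to the mean of its members. The function name `kmeans`, the random initialization from the data points, and the toy data are assumptions made for illustration.

```python
import math
import random

def kmeans(points, k, iterations=100, seed=0):
    """Lloyd's algorithm; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumed initialization: k distinct data points
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # converged: assignments no longer change
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centroids

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centroids = kmeans(points, k=2)
print(labels)
```

The result depends on the initial centroids in general, which is why practical implementations usually run several random restarts and keep the best solution.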

K-means Clustering Example: Colon Cancer Data
• K = 5

Determining the Number of Clusters
• One widely used method is the “elbow” method.
• The elbow method plots the percentage of variance explained against the number of clusters and picks the point beyond which adding more clusters yields little additional information.
• The percentage of variance explained is the ratio of the between-group variance to the total variance.
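
The ratio used by the elbow method can be computed as in the sketch below. It works with sums of squares instead of variances, which gives the same ratio because both are divided by the same normalizing constant. The function name `explained_variance`, the toy points, and the hand-picked labelings are hypothetical; in practice the labels would come from running a clustering algorithm for each candidate number of clusters.

```python
def explained_variance(points, labels):
    """Between-group sum of squares divided by total sum of squares."""
    n = len(points)
    dims = range(len(points[0]))
    grand_mean = [sum(p[d] for p in points) / n for d in dims]
    total = sum((p[d] - grand_mean[d]) ** 2 for p in points for d in dims)

    between = 0.0
    for label in set(labels):
        members = [p for p, lab in zip(points, labels) if lab == label]
        mean = [sum(p[d] for p in members) / len(members) for d in dims]
        # Each cluster contributes its size times its squared offset from the grand mean.
        between += len(members) * sum((mean[d] - grand_mean[d]) ** 2 for d in dims)
    return between / total

points = [(0.0,), (1.0,), (10.0,), (11.0,)]
for labels in ([0, 0, 0, 0], [0, 0, 1, 1], [0, 1, 2, 3]):
    k = len(set(labels))
    print(k, round(explained_variance(points, labels), 3))
```

On this toy data the ratio jumps from 0 with one cluster to about 0.99 with two, and the remaining gain up to four clusters is negligible, so the elbow is at two clusters.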

Elbow Method (Criterion)
Source: Wikipedia

Challenges and Future Research Directions
• There is no single “best” algorithm.
• The performance of a clustering algorithm depends strongly on both the data distribution and the application requirements.
• Clustering is generally an “unsupervised” learning problem.
• However, some “partial” knowledge is often available, such as the functions of some genes.
• If a clustering algorithm could integrate such partial knowledge as “clustering constraints”, we could expect more biologically meaningful and reliable results.

Questions? Thank you!
