Bioinformatics Cluster Analysis. Mentee: Joonoh Lim Mentor: Sanketh Shetty. Background. Cluster analysis is an unsupervised method of determining groupings (clusters) in data sets. In biology, cluster analysis is used to study genes and gene expressions.
Bioinformatics Cluster Analysis Mentee: Joonoh Lim Mentor: Sanketh Shetty
Background
• Cluster analysis is an unsupervised method of determining groupings (clusters) in data sets.
• In biology, cluster analysis is used to study genes and gene expression.
• There are three categories of gene expression data clustering: gene-based, sample-based, and subspace clustering.
• The data set is usually obtained from DNA microarrays.
Establishing Data Set
• Gene-based: 15 x 15 x 8 → 225 x 8
• Sample-based: 15 x 15 x 8 → 8 x 225
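The two reshapes above can be sketched in plain Python. This is only an illustration under an assumption the slide does not state: that the 15 x 15 axes index a grid of genes (microarray spots) and the last axis indexes 8 samples. The names `flatten_grid`, `transpose`, and `cube`, and the placeholder values, are hypothetical.

```python
def flatten_grid(cube):
    """Turn a rows x cols x samples nested list into a (rows*cols) x samples matrix."""
    return [profile for row in cube for profile in row]


def transpose(matrix):
    """Swap the rows and columns of a rectangular nested list."""
    return [list(column) for column in zip(*matrix)]


# Placeholder 15 x 15 x 8 data (assumed layout: gene row, gene column, sample).
cube = [[[r * 1000 + c * 10 + s for s in range(8)]
         for c in range(15)]
        for r in range(15)]

gene_based = flatten_grid(cube)       # 225 x 8: one row per gene
sample_based = transpose(gene_based)  # 8 x 225: one row per sample

print(len(gene_based), len(gene_based[0]))
print(len(sample_based), len(sample_based[0]))
```

In the gene-based layout each row is one gene's expression profile across samples, so genes are the objects being clustered; in the sample-based layout the roles are swapped.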
Types of Clustering Algorithms
• Partitional Methods
  • K-means Clustering
  • Affinity Propagation
  • Spectral Clustering
  • Mean-shift Clustering
  • Normalized-cuts
  • Gaussian Mixture Models
• Hierarchical Methods
  • Single linkage
  • Complete linkage
  • Average linkage
Proximity measure
• Defines the similarity between data objects.
• Examples: Euclidean distance, Pearson’s correlation coefficient, jackknife correlation, Spearman’s rank-order correlation, city block (Manhattan) distance, angular separation, etc.
• We use Euclidean distance. The Euclidean distance between points $p = (p_1, \dots, p_n)$ and $q = (q_1, \dots, q_n)$ is defined as: $d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$
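A few of the proximity measures listed above can be sketched as follows. This is a minimal illustration, not the code used in the project; the function names and the point names `p` and `q` are chosen here for the example.

```python
import math


def euclidean(p, q):
    """Euclidean distance: square root of the summed squared differences."""
    if len(p) != len(q):
        raise ValueError("points must have the same dimension")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def manhattan(p, q):
    """City block (Manhattan) distance: sum of absolute differences."""
    if len(p) != len(q):
        raise ValueError("points must have the same dimension")
    return sum(abs(a - b) for a, b in zip(p, q))


def pearson(p, q):
    """Pearson's correlation coefficient between two equal-length profiles."""
    n = len(p)
    mean_p, mean_q = sum(p) / n, sum(q) / n
    cov = sum((a - mean_p) * (b - mean_q) for a, b in zip(p, q))
    var_p = sum((a - mean_p) ** 2 for a in p)
    var_q = sum((b - mean_q) ** 2 for b in q)
    return cov / math.sqrt(var_p * var_q)


print(euclidean((0, 0), (3, 4)))  # 5.0
print(manhattan((0, 0), (3, 4)))  # 7
```

Note that Euclidean and Manhattan distance grow as objects become less similar, while Pearson's correlation grows as they become more similar, so a correlation is usually converted to a dissimilarity (for example 1 minus the correlation) before clustering.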
Hierarchical Clustering • Single linkage: at each step, merge the two clusters whose closest members are separated by the minimum distance http://www.resample.com/xlminer/help/HClst/HClst_intro.htm
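The single-linkage rule can be sketched as a naive agglomerative loop. This is an illustrative, unoptimized version assuming Euclidean distance; the function name `agglomerative_clusters` and its `link` parameter are hypothetical. Passing `link=max` instead of the default `min` gives complete linkage.

```python
import math


def agglomerative_clusters(points, num_clusters, link=min):
    """Merge clusters bottom-up until only num_clusters remain.

    link=min is single linkage (distance between closest members);
    link=max is complete linkage (distance between farthest members).
    Returns clusters as lists of indices into points.
    """
    clusters = [[i] for i in range(len(points))]

    def cluster_distance(a, b):
        return link(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > num_clusters:
        best = None  # (distance, index of first cluster, index of second)
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = cluster_distance(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]  # merge the closest pair
        del clusters[y]
    return clusters


points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
print(agglomerative_clusters(points, 3))  # [[0, 1], [2, 3], [4]]
```

Recording the distance at which each merge happens, rather than stopping at a fixed number of clusters, is what produces the tree drawn as a dendrogram.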
Hierarchical Clustering • Example: colon cancer data, clustered using complete linkage (figure: dendrogram)
K-means Clustering www.cs.cmu.edu/~awm
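The slide gives only the title and a figure credit. As a rough sketch of the standard iteration usually meant by k-means (assign each point to its nearest centroid, then move each centroid to the mean of its points), one possible version is shown here. The function name `kmeans` and the choice to initialize from the first k points are assumptions made to keep the example deterministic; practical implementations typically use random or k-means++ initialization.

```python
import math


def kmeans(points, k, max_iter=100):
    """Cluster points into k groups; returns (labels, centroids)."""
    # Assumption: use the first k points as initial centroids.
    centroids = [tuple(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: label each point with its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the loop has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(
                    sum(coords) / len(members) for coords in zip(*members)
                )
    return labels, centroids


points = [(0, 0), (0, 1), (10, 10), (10, 11)]
labels, centroids = kmeans(points, 2)
print(labels)     # [0, 0, 1, 1]
print(centroids)  # [(0.0, 0.5), (10.0, 10.5)]
```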
Determining cluster numbers
• One widely used method is the “elbow” method.
• The elbow method plots the percentage of variance explained against the number of clusters and picks the point beyond which adding more clusters no longer explains much additional variance.
• The percentage of variance explained is the ratio of the between-group variance to the total variance.
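The ratio described above can be computed for any given cluster assignment. The sketch below is one reading of it, using sums of squares for the between-group and total terms (an assumption, since the slide does not spell out the formula); the function name `variance_explained` is hypothetical.

```python
def variance_explained(points, labels):
    """Between-group sum of squares divided by total sum of squares."""
    n = len(points)
    dim = len(points[0])
    grand_mean = [sum(p[d] for p in points) / n for d in range(dim)]
    total = sum((p[d] - grand_mean[d]) ** 2 for p in points for d in range(dim))
    if total == 0:
        raise ValueError("all points are identical; the ratio is undefined")

    # Group the points by cluster label.
    groups = {}
    for p, label in zip(points, labels):
        groups.setdefault(label, []).append(p)

    # Each cluster contributes (size) * (squared distance of its mean
    # from the grand mean).
    between = 0.0
    for members in groups.values():
        mean = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        between += len(members) * sum(
            (mean[d] - grand_mean[d]) ** 2 for d in range(dim)
        )
    return between / total


points = [(0, 0), (0, 1), (10, 10), (10, 11)]
print(variance_explained(points, [0, 0, 0, 0]))  # one cluster explains nothing
print(variance_explained(points, [0, 0, 1, 1]))  # two clusters explain almost all
```

Evaluating this ratio for clusterings with 1, 2, 3, ... clusters and plotting the values gives the curve whose bend is read as the elbow.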
Elbow Method (Criterion) wikipedia
Challenges and Future Research Directions
• There is no single “best” algorithm.
• The performance of different clustering algorithms depends strongly on both the data distribution and the application requirements.
• Clustering is generally an “unsupervised” learning problem.
• However, some “partial” knowledge is often available, such as the functions of some genes.
• If a clustering algorithm could integrate such partial knowledge as “clustering constraints”, we could expect more biologically meaningful and reliable results.
Questions? Thank you!