Fuzzy K means • A gene can be assigned to several clusters • Each gene is assigned to a cluster with a membership value between 0 and 1 • The membership values of a gene add up to one • Genes with lower membership values are not well represented by the cluster centroid • The expression of genes with high membership values is close to the cluster centroid
Centroid • During centroid refinement in each clustering cycle, new centroids were calculated as the weighted mean of all the gene-expression patterns in the data set.
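The slide's own formula is not part of this text. As a sketch only, a weighted mean of the kind used in standard fuzzy clustering can be written as follows, assuming a squared membership m (fuzzifier 2), a per-gene weight W, N genes X_i and centroid V_j; the symbols and the exponent are assumptions, not taken from the slide:

```latex
V_j = \frac{\sum_{i=1}^{N} m_{X_i V_j}^{2} \, W_{X_i} \, X_i}
           {\sum_{i=1}^{N} m_{X_i V_j}^{2} \, W_{X_i}}
```

Genes with a high membership in cluster j pull the centroid toward their own expression pattern, while genes with a low membership barely move it.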
Membership Function • Each gene’s membership m in a cluster is a continuous variable from 0 to 1.
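The defining formula is not part of this text. One commonly used form, assumed here rather than taken from the slide, allocates membership by inverse squared distance, where d is the Pearson distance (1 minus the Pearson correlation) between gene X_i and centroid V_j, and K is the number of centroids:

```latex
m_{X_i V_j} = \frac{1 / d_{X_i V_j}^{2}}{\sum_{k=1}^{K} 1 / d_{X_i V_k}^{2}}
```

With this form every membership lies between 0 and 1 and the memberships of one gene sum to one, which matches the properties listed for fuzzy K means.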
Fuzzy K means • The gene weight (used only in the second and third rounds) is defined empirically in terms of the Pearson correlation between Xi and Xn and a correlation cutoff.
Fuzzy K means • In each clustering cycle, the centroids were iteratively refined until the average change was <0.001. • Around 85% of the centroids stabilized within approximately 15 iterations; some centroids required more, about 40-60 iterations, before stabilizing.
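The refinement loop can be sketched in Python with NumPy. This is a minimal illustration, not the original implementation: the inverse-squared-distance membership, the squared-membership weighting, the uniform default gene weights, and reading "average change" as the mean absolute change of the centroid entries are all assumptions, and the function names are hypothetical.

```python
import numpy as np

def pearson_distance(X, V):
    """Return the (genes x centroids) matrix of 1 - Pearson correlation."""
    Xc = X - X.mean(axis=1, keepdims=True)
    Vc = V - V.mean(axis=1, keepdims=True)
    Xn = Xc / np.linalg.norm(Xc, axis=1, keepdims=True)
    Vn = Vc / np.linalg.norm(Vc, axis=1, keepdims=True)
    return 1.0 - Xn @ Vn.T

def memberships(X, V, eps=1e-12):
    """Assumed rule: inverse squared distance, normalized so each row sums to 1."""
    inv = 1.0 / (pearson_distance(X, V) ** 2 + eps)
    return inv / inv.sum(axis=1, keepdims=True)

def refine(X, V, weights=None, tol=1e-3, max_iter=100):
    """Update centroids until the mean absolute change drops below tol."""
    w = np.ones(X.shape[0]) if weights is None else weights  # assumed default
    for it in range(1, max_iter + 1):
        m = memberships(X, V)
        coef = (m ** 2) * w[:, None]                 # genes x centroids
        V_new = (coef.T @ X) / coef.sum(axis=0)[:, None]
        change = np.mean(np.abs(V_new - V))          # assumed stopping measure
        V = V_new
        if change < tol:
            break
    return V, memberships(X, V), it
```

Calling `refine(X, V0)` with an expression matrix `X` (genes by conditions) and starting centroids `V0` returns the refined centroids, the membership matrix, and the number of iterations used.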
Fuzzy K means • After each clustering cycle, each centroid was compared to all other centroids in the set, and centroid pairs correlated >0.9 were replaced by their average.
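A minimal sketch of this merging step follows. The 0.9 cutoff comes from the slide; merging one pair at a time and then re-checking all pairs is an assumption about how chains of similar centroids are handled, and the function name is hypothetical.

```python
import numpy as np

def merge_correlated_centroids(V, cutoff=0.9):
    """Replace centroid pairs with Pearson correlation > cutoff by their average."""
    centroids = [np.asarray(v, dtype=float) for v in V]
    merged = True
    while merged and len(centroids) > 1:
        merged = False
        corr = np.corrcoef(np.vstack(centroids))
        for i in range(len(centroids)):
            for j in range(i + 1, len(centroids)):
                if corr[i, j] > cutoff:
                    avg = (centroids[i] + centroids[j]) / 2.0
                    # Drop the pair and keep only their average.
                    centroids = [c for k, c in enumerate(centroids)
                                 if k not in (i, j)]
                    centroids.append(avg)
                    merged = True
                    break
            if merged:
                break  # recompute correlations after every merge
    return np.vstack(centroids)
```

Because near-duplicates collapse into one centroid, the starting number of clusters can be chosen generously.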
Cells respond to environment • Various external messages, e.g. heat and food supply • The cell responds to environmental conditions
Genome is fixed – Cells are dynamic • A genome is static • Every cell in our body has a copy of the same genome • A cell is dynamic • It responds to external conditions • Saccharomyces cerevisiae cells follow a cell cycle of division and also budding. • Cells differentiate during development
Gene regulation • Gene regulation is responsible for the dynamic behavior of the cell • Gene expression varies according to: • Cell type • External conditions
Transcription Factors Binding to DNA • Transcription regulation: • Certain transcription factors bind DNA • Binding recognizes DNA substrings: • Regulatory motifs
Regulation of Genes • [Figure sequence, three frames; labels: Transcription Factor (Protein), RNA polymerase (Protein), DNA, Regulatory Element, Gene; the last frame adds: New protein]
The Challenges of Gene Expression Data • Many genes have expression data patterns that are similar to multiple, distinct gene groups.
Results of Clustering Gene Expression • CLUSTER is simple and easy to use • De facto standard for microarray analysis • Limitations: • Hierarchical clustering, like other clustering methods in general, is not robust • Genes may belong to more than one cluster
A gene can be co-expressed with different gene groups in response to different conditions.
Saccharomyces cerevisiae • The yeast Saccharomyces cerevisiae possesses sophisticated mechanisms to choreograph the expression of its 6200 genes in order to thrive, or at least to survive, in a wide range of environmental conditions.
Gene expression of 40 Yap1p targets: these genes were coordinately induced in response to a subset of the conditions shown here (labeled in red).
What is a microarray (2) • A 2D array of DNA sequences from thousands of genes • Each spot has many copies of the same gene • Allow mRNAs from a sample to hybridize • Measure the number of hybridizations per spot
Goal of Microarray Experiments • Measure level of gene expression across many different conditions: • Expression Matrix M: {genes} × {conditions}, with M_ij = |gene_i| in condition_j • Deduce gene function • Genes with similar function are expressed under similar conditions
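As a small illustration of the expression matrix, the sketch below builds a toy matrix and compares genes by Pearson correlation. The gene names, condition names, and values are hypothetical and chosen only to show the layout.

```python
import numpy as np

# Hypothetical toy data: rows are genes, columns are conditions.
genes = ["geneA", "geneB", "geneC"]
conditions = ["heat", "starvation", "oxidative", "control"]
M = np.array([
    [2.0, 0.5, 1.8, 0.1],
    [1.9, 0.4, 2.1, 0.0],
    [-1.0, 2.2, -0.8, 0.1],
])

# M[i, j] is the expression level of gene i in condition j.
# np.corrcoef treats each row as one variable, so corr[a, b] is the
# Pearson correlation between the expression patterns of genes a and b.
corr = np.corrcoef(M)
```

In this toy data geneA and geneB rise and fall together across conditions, so their correlation is high, while geneC follows the opposite pattern.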
Fuzzy K-Means clustering • Each gene can belong to many clusters • Soft (fuzzy) assignment of genes to clusters • Each gene has 1.0 membership units, allocated amongst clusters based on correlation with means • Cluster means are calculated by taking the weighted average of all the genes in the cluster
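A worked example of allocating one gene's 1.0 membership units is sketched below. The three correlation values are hypothetical, and the inverse-squared-distance rule is an assumed allocation scheme, not one stated on the slide.

```python
import numpy as np

# Hypothetical Pearson correlations of one gene with three cluster means.
corr = np.array([0.9, 0.5, -0.2])

# Pearson distance: 0 for perfectly correlated, 2 for perfectly anti-correlated.
dist = 1.0 - corr

# Assumed allocation rule: inverse squared distance, normalized to sum to 1.
inv = 1.0 / dist**2
membership = inv / inv.sum()

print(membership.round(3))
```

The memberships come out at about 0.955, 0.038 and 0.007: nearly all of the gene's membership goes to the cluster it correlates with best, but the other clusters still receive a small share.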
Fuzzy K-Means clustering Algorithm: • Use PCA to initialize cluster means • 3 iterations of fuzzy k-means clustering, find k/3 clusters per iteration • In each iteration, start with brand new clusters and initializations • And a few more heuristic tricks
Initialization • Use PCA to find a few eigenvectors for initialization • These features capture the directions of maximum variance • Must be orthonormal
Example Initialization • k/3 centroids defined from k/3 first eigenvectors
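The PCA-based initialization can be sketched with a singular value decomposition. This is an illustration under assumptions: centering each condition before the decomposition and using the leading right singular vectors directly as centroids are choices made here, and the function name is hypothetical.

```python
import numpy as np

def pca_initial_centroids(X, n_centroids):
    """Return the first n_centroids principal directions of X as starting centroids."""
    # Center each condition (column) before the decomposition.
    Xc = X - X.mean(axis=0, keepdims=True)
    # Rows of Vt are orthonormal directions in condition space,
    # ordered by decreasing variance explained.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:n_centroids]
```

For a target of k clusters over three rounds, each round would call this with `n_centroids = k // 3`.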
Example • First iteration of clustering
Iteration of the approach • Remove genes that have a Pearson correlation with a particular cluster greater than 0.7 • Intuition: The strong signal from these genes has already been accounted for • Repeat
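The removal step between rounds might look like the sketch below. The 0.7 cutoff is from the slide; comparing each gene against the cluster centroids and dropping it if any single correlation exceeds the cutoff is an assumed reading, and the function name is hypothetical.

```python
import numpy as np

def remove_explained_genes(X, V, cutoff=0.7):
    """Drop genes whose Pearson correlation with any centroid exceeds cutoff."""
    n_genes = X.shape[0]
    # Correlate every gene (rows of X) with every centroid (rows of V).
    corr = np.corrcoef(np.vstack([X, V]))[:n_genes, n_genes:]
    keep = ~(corr > cutoff).any(axis=1)
    return X[keep], keep
```

The returned boolean mask makes it possible to map the remaining rows back to the original gene identifiers for the next round.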
Removing Duplicate Centroids • Centroids with Pearson correlation > 0.9 will be averaged. • Allows selecting a large initial number of clusters, since duplicates will be removed
Repeat 3 times • Output: • Cluster means • Gene assignments to clusters
Regulatory systems that govern the expression of overlapping sets of genes in yeast.
Fuzzy K means ADVANTAGES • The method can present overlapping clusters, revealing distinct features of each gene’s function and regulation. • The resulting implications can be used to assign refined hypothetical functions to uncharacterized gene products and additional cellular roles to well-studied proteins.
Fuzzy K means ADVANTAGES • It presents more comprehensive groups of conditionally coregulated genes. • It elucidates the environmental conditions that trigger changes in gene expression. • It requires no a priori information about the dataset.
Fuzzy K means DISADVANTAGES • Assignment of genes to a cluster requires a user-defined cutoff, and selecting a meaningful cutoff is a challenge. • Fuzzy K means failed to identify a small number of groups that were identified by hierarchical clustering.
My opinion • The unique advantages of fuzzy K means clustering make the technique a valuable tool for gene expression analysis. Its flexibility can be used to reveal more complex correlations between gene expression patterns, promoting refined hypotheses of the role and regulation of gene expression changes.
To overcome these limitations, combining hierarchical clustering with fuzzy K means can be useful.
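The effect of the cutoff can be seen in a short sketch. The membership values below are hypothetical, and assigning a gene to every cluster whose membership reaches the cutoff is one assumed way of applying it; the function name is hypothetical as well.

```python
import numpy as np

def assign_by_cutoff(m, cutoff):
    """For each cluster (column of m), list the genes with membership >= cutoff."""
    return [np.flatnonzero(m[:, j] >= cutoff).tolist() for j in range(m.shape[1])]

# Hypothetical memberships of three genes in two clusters (rows sum to 1).
m = np.array([[0.6, 0.4],
              [0.1, 0.9],
              [0.5, 0.5]])

loose = assign_by_cutoff(m, 0.4)    # genes 0 and 2 land in both clusters
strict = assign_by_cutoff(m, 0.55)  # gene 2 lands in no cluster
```

A low cutoff produces large, overlapping clusters, while a high cutoff leaves ambiguous genes unassigned, which is why choosing a meaningful value is difficult.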