This research explores advanced hierarchical clustering algorithms, specifically the Self-Organizing Tree Algorithm (SOTA) and the Hierarchical Growing Self-Organizing Tree Algorithm (HGSOT), for analyzing gene expression data. The study aims to extract essential patterns from vast biological data to aid researchers in identifying hidden patterns. The SOTA utilizes a binary tree structure based on Kohonen's self-organizing map, while the HGSOT introduces vertical and horizontal growth strategies for hierarchical clustering. The algorithms involve steps for updating reference vectors and handling errors in the tree structure until convergence. Time and space complexities are also discussed. These algorithms enable effective distribution of input data among newly created cells, ensuring precise analysis of gene expression patterns. The study emphasizes the importance of accurate hierarchical relationships in gene expression data analysis.
Hierarchical Clustering of Gene Expression Data Authors: Feng Luo, Kun Tang, Latifur Khan Graduate: Chien-Ming Hsiao
Outline • Motivation • Objective • Introduction • Hierarchical Clustering • Self-Organizing Tree Algorithm • Hierarchical Growing Self-Organizing Tree Algorithm • Preliminary Result • Conclusions • Personal Opinion
Motivation • Rapid development of biological technologies generates a huge amount of data. • Analyzing and interpreting these massive data is a challenging task. • We are interested in data analysis tools that can help researchers detect patterns hidden behind these complex initial data.
Objective • To extract useful and rational fundamental patterns of gene expression inherent in these huge data.
Introduction • Current approaches for measuring gene expression profiles • SAGE, RT-PCR, cDNA and oligonucleotide microarrays • [Figure: sample of a microarray]
Introduction • Two classes of algorithms have been successfully used to analyze gene expression data. • hierarchical clustering • self-organizing tree
Self-Organizing Tree algorithm • Self-Organizing Tree Algorithm (SOTA) • based on Kohonen's self-organizing map (SOM) and Fritzke's growing cell structures • the output of SOTA is a neural network with a binary tree topology
Self-Organizing Tree algorithm • Step 1: Initialize the system as a binary tree with three nodes. • [Figure: (A) Initial architecture of SOTA. (B) Two different reference vector updating schemas.]
Self-Organizing Tree algorithm • Step 2: Present all data and compute the distance from each datum to every external cell (tree leaf) • Euclidean distance • cosine distance • Step 3: For each datum, select the winning output cell c with the minimum distance dij.
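Steps 2 and 3 can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the function names, the dictionary of leaf cells and the sample vectors are assumptions, and cosine distance is taken here to mean 1 minus the cosine similarity.

```python
import math

def euclidean(x, w):
    """Euclidean distance between a datum x and a reference vector w."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, w)))

def cosine_distance(x, w):
    """Assumed definition: 1 - cosine similarity (vectors must be non-zero)."""
    dot = sum(a * b for a, b in zip(x, w))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in w))
    return 1.0 - dot / norm

def winning_cell(x, cells, distance=euclidean):
    """Return the name of the leaf cell whose reference vector is closest to x."""
    return min(cells, key=lambda name: distance(x, cells[name]))

# Hypothetical leaf cells and one hypothetical expression profile.
cells = {"left": [0.1, 0.2, 0.1], "right": [1.0, 1.0, 1.0]}
profile = [0.9, 0.8, 1.0]
print(winning_cell(profile, cells))
```

Passing `distance=cosine_distance` switches the measure without changing the selection logic.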
Self-Organizing Tree algorithm • Step 4: Update the reference vector of the winning cell and its neighbors: $\Delta w_i = \varphi(t)\,(x - w_i)$, where $\varphi(t)$ is the learning function, $\varphi(t) = \alpha\,\eta(t)$. • $\eta(t)$ is the learning rate function, $\eta(t) = 1/t$, and $\alpha$ is a learning constant. • $\alpha$ takes a different value for the winning cell and for its neighbors.
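A minimal sketch of the Step 4 update follows. It assumes the usual SOM-style rule, in which a reference vector w moves toward the datum x by a fraction alpha * (1/t) of their difference; the function name and the sample vectors are hypothetical, and the two alpha values are the winner and sibling learning rates reported in the experiment setup.

```python
def update_reference(w, x, alpha, t):
    """Move reference vector w toward datum x.

    Assumed rule: w <- w + phi(t) * (x - w), with phi(t) = alpha * (1 / t).
    """
    phi = alpha / t
    return [wi + phi * (xi - wi) for wi, xi in zip(w, x)]

x = [1.0, 2.0]                       # hypothetical datum
winner = update_reference([0.0, 0.0], x, alpha=0.15, t=1)
sibling = update_reference([0.0, 0.0], x, alpha=0.015, t=1)
print(winner, sibling)
```

The winner moves ten times further than the sibling for the same datum, and both steps shrink as t grows.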
Self-Organizing Tree algorithm • Steps 2, 3 and 4 form a cycle. While the relative error of the entire tree is greater than a threshold, repeat the cycle. • Step 5: When a cycle has finished, increase the network size: two new cells are attached to the cell with the highest resource. This cell becomes a node. • Resource: the average distance between a cell and the input data associated with it. • Step 6: Repeat from Step 2 until convergence (all resources are below a threshold).
Self-Organizing Tree algorithm • Time complexity of SOTA is O(n log N) • Space complexity of SOTA is O(n)
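The resource used in Step 5 can be sketched as below, assuming Euclidean distance. The function names, the layout of the `cells` dictionary and the sample values are assumptions made for illustration only.

```python
import math

def resource(reference, members):
    """Average distance between a cell's reference vector and its associated data."""
    if not members:
        return 0.0
    return sum(math.dist(x, reference) for x in members) / len(members)

def cell_to_split(cells):
    """Return the name of the cell with the highest resource.

    cells maps a cell name to (reference vector, list of associated data).
    """
    return max(cells, key=lambda name: resource(*cells[name]))

# Hypothetical cells: "a" is more heterogeneous than "b".
cells = {
    "a": ([0.0, 0.0], [[0.0, 1.0], [0.0, -1.0]]),
    "b": ([5.0, 5.0], [[5.0, 5.0], [5.0, 6.0]]),
}
print(cell_to_split(cells))
```

Growth would stop once every cell's resource falls below the convergence threshold.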
hierarchical growing self-organizing tree algorithm • Hierarchical Growing Self-Organizing Tree algorithm (HGSOT) • The HGSOT grows in two directions • vertical growth • adds descendants • the same strategy used in SOTA • horizontal growth • adds more siblings • a level threshold controls growth in the sibling generation
hierarchical growing self-organizing tree algorithm • To determine horizontal growth
The pseudo code of HGSOT • Step 1: Initialization • Initially the tree has only a root node. Its reference vector is initialized with the centroid of the entire data set, and all data are associated with the root. • Step 2: Vertical Growing • Change each leaf cell to a node and add two children to it. The reference vector of a new cell is initialized with the node's reference vector. • Step 3: Distribution • Distribute each input datum between the two newly created cells: find the winning cell (using KLD, see 2.2.1), then update the reference vector of the winning cell and its neighbor. • Step 4: Error • While the error of the entire tree is larger than the error threshold (TE), repeat Step 3.
The pseudo code of HGSOT • Step 5: Horizontal Growing • When the difference between the minimum and maximum distance of all children cells of a node (x) is less than the level threshold (TL), a child is added to this node; if the difference is greater than TL, a child is deleted from this node and horizontal growth terminates. • Step 6: Distribution • Distribute the input data associated with x among its descendants along siblings: find the winning cell (using KLD, see 2.2.), then update the reference vector of the winning cell and its neighbor. • Step 7: Error • If the error of the entire tree is greater than TE, repeat Step 6. • If there are more levels to grow in the hierarchy, return to Step 2; otherwise, stop.
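The initialization and vertical growing steps of the pseudo code can be sketched as follows. The `Cell` class and the function names are hypothetical, and the horizontal growing test is deliberately left out because its exact criterion depends on details not given on these slides.

```python
def centroid(data):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(data)
    return [sum(column) / n for column in zip(*data)]

class Cell:
    """A tree element: a leaf (cell) until it receives children, then a node."""
    def __init__(self, reference, data=None):
        self.reference = list(reference)   # own copy of the reference vector
        self.data = list(data or [])       # input data associated with this cell
        self.children = []

def init_tree(data):
    """Initialization: a single root holding all data, placed at the centroid."""
    return Cell(centroid(data), data)

def grow_vertically(leaf):
    """Vertical growing: add two children that start at the parent's reference vector."""
    leaf.children = [Cell(leaf.reference), Cell(leaf.reference)]
    return leaf.children

root = init_tree([[0.0, 2.0], [2.0, 4.0]])   # hypothetical two-point data set
left, right = grow_vertically(root)
print(root.reference, left.reference, right.reference)
```

Each child keeps its own copy of the reference vector, so later updates to one child do not affect its sibling or the parent.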
Hierarchical Cluster Algorithms • How can we distribute the input data of a selected node among the newly created cells? • One option is similar to the SOTA approach: the input data of the selected node are distributed not only to its newly created cells but also to its neighbor cells. • Alternatively, we determine the ancestor node K levels above the selected node. • We then take the sub-tree rooted at that ancestor, and the input data of the selected cell are distributed among all cells (leaves) of this sub-tree. The latter approach is known as K-level distribution (KLD).
Hierarchical Cluster Algorithms • KLD: We need to distribute the data associated with node M to the newly created cells. For K = 1, the data of node M are distributed to cells B, C, D and E. For K = 0, the data of M are distributed between B and C.
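The K-level distribution example can be sketched as below: climb K levels from the selected node, then collect every leaf under that ancestor. The `Node` class, the function names and the tree layout (a root A with children M and N, where N holds D and E) are assumptions chosen to reproduce the B, C, D, E example; the original figure may arrange D and E differently.

```python
class Node:
    """Minimal tree node that registers itself with its parent."""
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

def leaves(node):
    """Return all leaf cells in the sub-tree rooted at node, left to right."""
    if not node.children:
        return [node]
    found = []
    for child in node.children:
        found.extend(leaves(child))
    return found

def kld_cells(node, k):
    """Candidate cells for K-level distribution of node's data."""
    ancestor = node
    for _ in range(k):
        if ancestor.parent is None:   # stop at the root
            break
        ancestor = ancestor.parent
    return leaves(ancestor)

# Assumed layout: A -> (M, N), M -> (B, C), N -> (D, E).
a = Node("A")
m, n = Node("M", a), Node("N", a)
for name, parent in (("B", m), ("C", m), ("D", n), ("E", n)):
    Node(name, parent)

print([c.name for c in kld_cells(m, 0)])
print([c.name for c in kld_cells(m, 1)])
```

Larger K widens the set of candidate cells, so a datum can move to a more distant part of the tree at the cost of more distance computations.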
Preliminary result • Experiment setup • Experiment data • Expression data for 112 genes of the rat central nervous system (CNS) • Four gene families: Neuro-Glial Markers Family (NGMs), Neurotransmitter Receptors Family (NTRs), Peptide Signaling Family (PepS) and Diverse • The expression data were measured by RT-PCR on mRNA from rat cervical spinal cord tissue at nine developmental time points: embryonic days 11 through 21, postnatal days 0 through 14, and adult. • For each gene, the data are normalized to the maximal expression level among the nine time points.
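The per-gene normalization can be sketched as a division by the gene's maximal value. The function name and the nine sample values are hypothetical, not taken from the data set.

```python
def normalize_to_max(profile):
    """Scale one gene's expression profile so that its maximum becomes 1.0."""
    peak = max(profile)
    if peak == 0:
        return list(profile)   # an all-zero profile is returned unchanged
    return [value / peak for value in profile]

# Hypothetical raw measurements for one gene at nine time points.
raw = [2.0, 4.0, 8.0, 6.0, 4.0, 2.0, 1.0, 1.0, 0.0]
print(normalize_to_max(raw))
```

After this step every gene lies in the same range, so distances between profiles reflect the shape of the time course and not the absolute expression level.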
Preliminary result • Experiment setup • The parameters of HGSOT • The winner learning rate $\alpha_w$ and the sibling learning rate $\alpha_s$ of HGSOT are 0.15 and 0.015, respectively. • The error threshold is 0.001. • The level threshold is 0.8, which means the minimum distance will not be less than 80% of the maximum distance. • The distribution level K is equal to 4. • Euclidean distance is used as the similarity measure.
Conclusions • HGSOT successfully recovers five clusters similar to Wen et al.'s original HAC result and gives a better hierarchical structure. • The algorithm can detect more subtle patterns at the lower hierarchical levels, and it gives a more suitable clustering than HAC for some genes.
Personal Opinion • We would like to do more experiments on different data sets.