Introduction to Bioinformatics - Tutorial no. 12 Expression Data Analysis: - Clustering - GEO - EPClust
Application of Microarrays • We only know the function of about 20% of the 30,000 genes in the Human Genome • Gene exploration • Faster and better • Applications: • Evolution • Behavior • Cancer Research
Microarray Analysis • Unsupervised Grouping: Clustering • Pattern discovery via grouping similarly expressed genes together • Three techniques most often used • k-Means Clustering • Hierarchical Clustering • Kohonen Self Organizing Feature Maps
Hierarchical Agglomerative Clustering Michael Eisen, 1998 • Cluster (algorithm) • TreeView (visualization) • Hierarchical Agglomerative Clustering • Step 1: Similarity score between all pairs of genes • Pearson Correlation • Euclidean distance • Step 2: Find the two most similar genes, replace with a node that contains the average • Builds a tree of genes • Step 3: Repeat
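The three steps above can be sketched in a few lines of Python. This is only an illustration, not the actual Cluster program: the names `pearson`, `euclidean`, `build_tree` and the toy `genes` data are assumptions made for this example, and the merged node is given the plain (unweighted) average of its two children.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation of two expression profiles (assumes neither is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

def euclidean(x, y):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def build_tree(profiles, similarity=pearson):
    """Merge the two most similar nodes until one tree (nested tuples) remains."""
    nodes = {name: list(values) for name, values in profiles.items()}
    while len(nodes) > 1:
        # Steps 1 and 2: score all pairs and pick the most similar one.
        a, b = max(combinations(nodes, 2),
                   key=lambda pair: similarity(nodes[pair[0]], nodes[pair[1]]))
        profile_a, profile_b = nodes.pop(a), nodes.pop(b)
        # Replace the pair with a node holding the average profile.
        nodes[(a, b)] = [(u + v) / 2 for u, v in zip(profile_a, profile_b)]
        # Step 3: repeat with the reduced set of nodes.
    return next(iter(nodes))

# Hypothetical toy data: two rising genes and two falling genes.
genes = {
    "g1": [1, 2, 3, 4],
    "g2": [2, 4, 6, 8.5],
    "g3": [4, 3, 2, 1],
    "g4": [8, 6, 4, 2.5],
}
tree = build_tree(genes)
print(tree)
```

To cluster by Euclidean distance instead, a similarity such as `lambda x, y: -euclidean(x, y)` can be passed, since a smaller distance means a higher similarity.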
Agglomerative Hierarchical Clustering: Distance between joined clusters • Need to define the distance between the new cluster and the other clusters • Single Linkage: distance between closest pair • Complete Linkage: distance between farthest pair • Average Linkage: average distance between all pairs, or distance between cluster centers • Dendrogram: the dendrogram induces a linear ordering of the data points [Figure: data points numbered 1 to 5 and the resulting dendrogram]
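A minimal sketch of the linkage rules, assuming Euclidean distance between points. The function name `cluster_distance` and the `"centroid"` label for the "distance between cluster centers" variant are choices made for this example only.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    return [sum(column) / len(cluster) for column in zip(*cluster)]

def cluster_distance(cluster_a, cluster_b, linkage="average"):
    """Distance between two clusters under the chosen linkage rule."""
    pair_distances = [euclidean(p, q) for p in cluster_a for q in cluster_b]
    if linkage == "single":      # closest pair
        return min(pair_distances)
    if linkage == "complete":    # farthest pair
        return max(pair_distances)
    if linkage == "average":     # average over all pairs
        return sum(pair_distances) / len(pair_distances)
    if linkage == "centroid":    # distance between cluster centers
        return euclidean(centroid(cluster_a), centroid(cluster_b))
    raise ValueError(f"unknown linkage: {linkage}")

# Hypothetical toy clusters on a line.
a = [[0, 0], [0, 1]]
b = [[0, 3], [0, 5]]
for rule in ("single", "complete", "average", "centroid"):
    print(rule, cluster_distance(a, b, rule))
```

The same pair of clusters can be near under single linkage and far under complete linkage, which is why the choice of rule changes the shape of the dendrogram.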
Results of Clustering Gene Expression • CLUSTER is simple and easy to use • De facto standard for microarray analysis • Limitations: • Hierarchical clustering in general is not robust • Genes may belong to more than one cluster
K-Means Clustering Algorithm • Randomly initialize k cluster means • Iterate: • Assign each gene to the nearest cluster mean • Recompute cluster means • Stop when clustering converges • Notes: • Really fast • Genes are partitioned into clusters • How do we select k?
K-Means Algorithm • Randomly Initialize Clusters
K-Means Algorithm • Assign data points to nearest clusters
K-Means Algorithm • Recalculate Clusters
K-Means Algorithm • Repeat
K-Means Algorithm • Repeat … until convergence
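The K-Means steps above can be sketched as follows. This is a minimal illustration, not the EPClust implementation: the function name `kmeans`, the `seed` and `max_iter` parameters, and the choice to initialize the means by sampling k data points are all assumptions of this example.

```python
import math
import random

def kmeans(points, k, seed=0, max_iter=100):
    """Partition points into k clusters; returns (assignment, means)."""
    rng = random.Random(seed)
    # Randomly initialize the k cluster means (here: k distinct data points).
    means = [list(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(max_iter):
        # Assign each point to the nearest cluster mean.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, means[c])) for p in points
        ]
        # Stop when the assignment no longer changes (convergence).
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Recompute each cluster mean from its current members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old mean if a cluster ends up empty
                means[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, means

# Hypothetical toy data: two well-separated groups of expression profiles.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
assignment, means = kmeans(points, k=2)
print(assignment, means)
```

Because the starting means are random, different seeds can give different partitions on less clearly separated data, so in practice the algorithm is often run several times.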
EPClust Input (1) • Expression data matrix • Extra annotation for gene rows • Method of tabulation • Name for further analysis
EPClust Input (2) • Method of measuring distance between gene rows • Cluster hierarchically • Number k of means • Cluster into k means
GEO: Gene Expression Omnibus • NCBI database for gene expression data • Founded at end of 2000
Querying GEO • Browse records • Search for entries containing a gene • Search for experiments • Search with Entrez
SGD – Expression database http://db.yeastgenome.org/cgi-bin/expression/expressionConnection.pl
Two labs are running experiments on the APO1 gene. Suggest a method that would allow them to compare their results. • Gene grouping • Relative values
Explain how microarrays can be used as a basis for diagnostics