Clustering
• Relevance to bioinformatics
• Array analysis
• Examples
• Principal Component Analysis
• Clustering algorithms

Lecture 13, CS567
Bioinformatic arrays
• Array = a view of a biological (sub)system in terms of a vector of attributes
• Arrays at different resolutions
  • Ecosystem: the spectrum of life in a pond
  • Population: characteristics of a population of a species
  • Organism:
    • “FBI profile” (multiple attribute types)
    • Single Nucleotide Polymorphism (SNP) profile (single attribute type)
Bioinformatic arrays
• Arrays at different resolutions (continued)
  • Cell: mRNA expression array in a cancer cell
  • Cell subcompartment: cytosolic protein array
• Multiple arrays
  • Trivially, replicates: increase the signal-to-noise ratio
  • Under different conditions: useful for analyzing relationships between conditions and relationships between attributes
Bioinformatic clusters
• Genes {g} that are turned on together under a condition x (e.g., a disease, or a phase of life)
• Conditions {x} under which a gene g is turned on
• Sequence/subsequence families
• Structure/substructure families
• Hierarchical classification of molecules/systems
• Groups of people sharing a disease
• Groups of diseases common to a people/species
Principal Component Analysis
• Not really clustering, but useful in the analysis of multidimensional data
• Grounds for PCA
  • Humans can view only about three dimensions at a time; PCA renders a lower-dimensional approximation of a high-dimensional space
  • PCA highlights the attributes showing maximum variation and helps in ignoring attributes showing minimal variation
Principal Component Analysis
• Project data from an N-dimensional space to an M-dimensional space such that M << N (see the sketch after this list)
• The first principal component is aligned with the “direction” of maximal variability
• The data remaining after removal of the projection along a principal component are the “residual” data, with “residual” variance
• Each subsequent component is derived in ranked order of the residual variance it captures
• All PCs are mutually orthogonal
• Limitations
  • Different clusterings may look alike under PCA
  • The combination of attributes that constitutes a principal component may not be biologically intuitive
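As an illustration, here is a minimal Python/NumPy sketch of the projection just described, assuming X is a samples-by-attributes matrix; the function name and interface are illustrative, not the lecture's code.

```python
import numpy as np

def pca_project(X, m):
    """Project an n-by-d data matrix X onto its top-m principal components."""
    Xc = X - X.mean(axis=0)            # center each attribute (column)
    cov = np.cov(Xc, rowvar=False)     # d-by-d covariance matrix
    vals, vecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
    order = np.argsort(vals)[::-1]     # rank directions by variance captured
    pcs = vecs[:, order[:m]]           # top-m mutually orthogonal directions
    return Xc @ pcs                    # n-by-m coordinates in PC space
```

Projecting, say, an expression matrix to m = 2 gives coordinates that can be plotted directly, which is the usual way PCA is used for visual inspection.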
Clustering
• Ultimate goal: “Birds of a feather stick together; we be brothers thicker than blood”
  • The minimum similarity within a cluster should exceed the maximum similarity between clusters
  • Equivalently, the maximum distance within a cluster should be less than the minimum distance between clusters
• Clustering: classifying data without class labels
  • Supervised: classes are known (“classifying Olympic athletes who are missing their country tags”)
  • Unsupervised: classes are not known (“classifying forms of life on an alien planet”)
  • Hybrid: a combination of the above (“find classes based on a sample of the data, then classify the remaining data into the newly defined classes”)
The notion of distance
• Definition: a true distance (metric) must satisfy
  • Positivity
  • Symmetry (why relative entropy is not a distance: “one-way traffic is not really distance!”)
  • The triangle inequality (the “Beam me up, Scotty” route is not a distance)
    • Getting from a to b via some c is never shorter than getting from a to b directly
  • Self-distance = 0 (“If you are trying to find yourself, just open your eyes – you are right here!”)
• Examples (see the sketch after this list)
  • Minkowski distances (déjà vu)
  • Pearson correlation coefficient (commonly converted into a dissimilarity as 1 − r)
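A minimal Python/NumPy sketch of both examples follows; note that 1 − r is one common convention for turning correlation into a dissimilarity, though it does not satisfy the triangle inequality in general.

```python
import numpy as np

def minkowski(a, b, p=2):
    """Minkowski distance; p=1 is Manhattan, p=2 is Euclidean."""
    return float(np.sum(np.abs(np.asarray(a) - np.asarray(b)) ** p) ** (1.0 / p))

def pearson_dissimilarity(a, b):
    """1 - Pearson correlation: 0 for perfectly correlated profiles."""
    r = np.corrcoef(a, b)[0, 1]
    return 1.0 - r
```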
Representing distances
• Absolute measures
  • Measures that can be defined without reference to other data points
  • Examples: height, weight
• Pairwise distances
  • Defined with reference to a pair of data points
  • May or may not be derivable from absolute measures
  • The inverse of the similarity between points
  • Example: the difference in height between two people
Distances in bioinformatics
• Absolute
  • (Sub)sequence length, molecular weight
• Pairwise
  • Between primary structures
    • Based on probabilistic similarity scores obtained from pairwise sequence alignment
    • Based on expectations of similarity scores
    • Example: clustering based on BLAST searches
  • Between secondary/tertiary structures
    • Root-mean-square (RMS) difference between pairs of 3D coordinate arrays (see the sketch below)
    • Example: structural classes in the CATH database
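For the structural case, here is a minimal Python/NumPy sketch of the RMS difference, assuming the two n-by-3 coordinate arrays are already superposed (a full comparison would first find the optimal rotation, e.g., via the Kabsch algorithm):

```python
import numpy as np

def rmsd(coords_a, coords_b):
    """RMS difference between two superposed n-by-3 coordinate arrays."""
    diff = np.asarray(coords_a) - np.asarray(coords_b)
    return float(np.sqrt((diff ** 2).sum(axis=1).mean()))
```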
Data types
• Real numbers
  • Interval-scaled: linear scale
  • Ratio-scaled: non-linear scale
  • Fine-grained resolution
• Nominal
  • Named categories
  • Distance is typically Boolean (“You are either a Muggle or a Wizard!”)
  • Coarse distance
• Ordinal
  • Named categories with an implicit ranking (“the distance from janitor to CEO > the distance from President to CEO, not counting the .com boom years”)
  • Distance depends on relative ranks (intermediate granularity); see the sketch after this list
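A minimal Python sketch of the two categorical cases; the [0, 1] normalization of the ordinal distance is one common convention, assumed here rather than taken from the lecture:

```python
def nominal_distance(a, b):
    """Boolean distance for named categories: 0 if identical, 1 otherwise."""
    return 0.0 if a == b else 1.0

def ordinal_distance(rank_a, rank_b, n_ranks):
    """Rank-based distance, normalized to [0, 1] across n_ranks levels."""
    return abs(rank_a - rank_b) / (n_ranks - 1)
```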
Clustering algorithms
• K-means (centroid-based); a sketch follows this list
• Algorithm
  1. Pick k points as initial cluster centers
  2. For each point i, find the nearest center, Dmin(i,k), and assign i to that cluster k
  3. Replace the k centers by the centroids (means) of the resulting clusters
  4. Calculate the mean squared distance of the points from their cluster centers
  5. Iterate steps 2-4 until the membership of points no longer changes, or the criterion in step 4 is minimized
• Scales well
• Limitations
  • The number of clusters is predetermined
  • Clusters are always spherical in shape
  • Adversely affected by outliers (because centroids are virtual data points, à la “Simone”)
  • A mean needs to be definable
  • Only a local minimum is found
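A minimal Python/NumPy sketch of these steps (Lloyd's algorithm); the random initialization and the iteration cap are assumptions made to keep the example runnable:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Lloyd's k-means on an n-by-d data matrix X; returns labels and centroids."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # step 1
    for _ in range(n_iter):
        # step 2: assign each point to its nearest centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # step 3: replace each centroid by its cluster's mean
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):   # step 5: memberships have stabilized
            return labels, centroids
        centroids = new
    return labels, centroids
```

Running this with different seeds typically yields different answers, which illustrates the last limitation: only a local minimum is found.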
Clustering algorithms
• K-means (centroid-based), EM variation; a sketch follows this list
• Algorithm
  1. Pick k points as initial cluster centers
  2. For each point i, compute all distances D(i,k) and assign i to every cluster k probabilistically
  3. Replace the k centers by the (probability-weighted) centroids of the clusters
  4. Calculate the mean squared distance from the cluster centers
  5. Iterate steps 2-4 until the expectation of the error function is minimized (the likelihood is maximized)
• A better minimum is found (why?)
• Limitations
  • The number of clusters is predetermined
  • Clusters are always spherical in shape
  • Adversely affected by outliers (because centroids are virtual data points)
  • A mean needs to be definable
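A minimal “soft k-means” sketch of the probabilistic assignment; the Gaussian-style weighting and the stiffness parameter beta are assumptions here (a full EM treatment would fit a Gaussian mixture model):

```python
import numpy as np

def soft_kmeans(X, k, beta=1.0, n_iter=100, seed=0):
    """EM-style k-means: each point is assigned to every cluster with a weight."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # E-step: responsibilities proportional to exp(-beta * squared distance)
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        resp = np.exp(-beta * d2) + 1e-12        # epsilon avoids division by zero
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: centroids become responsibility-weighted means
        centroids = (resp.T @ X) / resp.sum(axis=0)[:, None]
    return resp, centroids
```

Averaging over all points, rather than making hard assignments, smooths the error surface, which is one intuition for why a better minimum is often found.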
Clustering algorithms
• K-medoids (use medoids, i.e., representative actual data points, rather than means); a sketch follows this list
• Algorithm
  1. Pick k points as initial medoids
  2. For each point i, find the nearest medoid, Dmin(i,k), and assign i to that cluster
  3. Calculate the mean squared distance C_old from the cluster centers
  4. Replace one of the k medoids by a random point from its cluster, calculating C_new every time the cluster structure changes
  5. If C_new < C_old, accept the new clustering
  6. Iterate until the membership of points no longer changes, or C is minimized
• Handles outliers well
• Limitations
  • The number of clusters is predetermined
  • Clusters are always spherical in shape
  • Does not scale as well as k-means
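A minimal swap-based sketch in Python/NumPy, assuming a precomputed n-by-n distance matrix D; the random candidate selection and the iteration count are assumptions:

```python
import numpy as np

def kmedoids(D, k, n_iter=500, seed=0):
    """Swap-based k-medoids on a precomputed n-by-n distance matrix D."""
    rng = np.random.default_rng(seed)
    n = len(D)
    medoids = rng.choice(n, size=k, replace=False)
    cost = (D[:, medoids].min(axis=1) ** 2).mean()            # C_old
    for _ in range(n_iter):
        trial = medoids.copy()
        j = rng.integers(k)                                   # medoid to swap out
        trial[j] = rng.choice(np.setdiff1d(np.arange(n), medoids))  # candidate in
        new_cost = (D[:, trial].min(axis=1) ** 2).mean()      # C_new
        if new_cost < cost:                                   # accept improvements only
            medoids, cost = trial, new_cost
    return medoids, D[:, medoids].argmin(axis=1)              # medoids and labels
```

Because the centers are always real data points, a single extreme outlier cannot drag a center away from the bulk of its cluster, which is why outliers are handled well.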
Clustering algorithms
• Hierarchical clustering (tree building; average-linkage clustering); a sketch follows this list
• Algorithm
  1. Find the two points (or clusters) i and j that are closest and unite them under a new node
  2. Treat i and j as a single cluster by taking their average
  3. Iterate steps 1-2 until all points are included in the tree
• Biologically relevant results
• Limitations
  • The solution is frequently suboptimal
  • Cluster boundaries are not clear
  • The linear ordering of the leaves may not correspond to similarity
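In practice this is a one-liner with SciPy; a minimal sketch, where the data shape and the 4-cluster cut are illustrative assumptions:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).normal(size=(30, 5))  # e.g., 30 genes x 5 conditions

Z = linkage(X, method="average")                   # build the tree, average linkage
labels = fcluster(Z, t=4, criterion="maxclust")    # cut the tree into <= 4 clusters
```

Cutting the tree at a chosen level is what turns the hierarchy into discrete clusters; note that the cut level is a judgment call, reflecting the “cluster boundaries not clear” limitation above.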
Clustering algorithms
• Single-linkage clustering; a sketch follows this list
• Algorithm
  1. If D(i,j) < cut-off, include i and j in the same cluster
  2. If a point i in cluster x is within the cut-off distance of a point j in cluster y, merge x and y into a single cluster
• Handles non-spherical clusters well
• Limitations
  • Not the best way to go
  • Does not scale well
  • Essentially assumes transitivity of similarity (“Saudi princely clan, Indian joint family, the Godfather’s clan”)
  • Clusters are not compact, but scraggly
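A minimal union-find sketch of the cut-off version described above, on a precomputed n-by-n distance matrix D; the function name and interface are illustrative:

```python
import numpy as np

def single_linkage(D, cutoff):
    """Cut-off single linkage: merge clusters whenever any cross pair is close."""
    n = len(D)
    parent = list(range(n))

    def find(a):                            # representative (label) of a's cluster
        while parent[a] != a:
            parent[a] = parent[parent[a]]   # path compression
            a = parent[a]
        return a

    for i in range(n):                      # union every pair closer than the cut-off
        for j in range(i + 1, n):
            if D[i, j] < cutoff:
                parent[find(i)] = find(j)
    return [find(i) for i in range(n)]      # cluster label for each point
```

The chain of merges is exactly the transitivity the slide warns about: two mutually distant points end up in one cluster whenever a chain of close pairs connects them, which is also why the resulting clusters are scraggly rather than compact.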