Novel data clustering for microarrays and image segmentation. Andrew Knyazev Image from http://www.biosci.utexas.edu/mgm/people/faculty/profiles/VIresearch.jpg This presentation is at http://math.ucdenver.edu/~aknyazev/research/conf/.
We develop novel algorithms and software on parallel computers for clustering large datasets. We are interested in applying our approach, e.g., to the analysis of large datasets of microarrays or tiling arrays in molecular biology and to the segmentation of high-resolution images. Many data clustering codes are available, but for large datasets most of them are either prohibitively slow or give unreliable, spurious clusters because they are ad hoc rather than mathematically based. We use spectral clustering, which has mathematical foundations in the spectral theory of graph Laplacians, principal component analysis, reversible Markov random walks, and models of mechanical vibrations of mass-spring systems. Spectral clustering produces high-quality clusters, but requires the numerical solution of eigenvalue problems with millions of unknowns. Our main expertise, in developing parallel eigenvalue solvers, allows us to handle such problems efficiently, thus opening an opportunity for quality clustering of record-size datasets.

We have some experience analyzing Affymetrix microarrays, taking into account match and mismatch data, for clustering genes and experiments. Some of our ideas are already incorporated in MATLAB's Bioinformatics toolbox function PROBESETVALUES.

We have preliminary results of segmenting multi-megapixel 2D and 3D images on a number of computing systems, ranging from a modern desktop to one of the ten most powerful parallel systems in the world. We have direct access to, and run our tests on, an IBM BG/L system with several thousand processors. E.g., a 24 megapixel 2D image is segmented on IBM BG/L in a matter of seconds. The 3D segmentation can be applied in a variety of situations such as electron microscopy, 3D MRI scans, and tracking objects in movies. High-resolution 3D image segmentation is especially computationally challenging and requires both powerful computing resources and our sophisticated software.
• Microarrays: a massively parallel experiment
• Clustering: why?
• Clustering: how?
• Spectral clustering
• Connection to image segmentation
• Eigensolvers for spectral clustering
Microarrays: a massively parallel experiment 1/5
Affymetrix GeneChip DNA microarrays. Image courtesy: Affymetrix
Microarrays: a massively parallel experiment 2/5
• GeneChip: oligonucleotide sequences are photo-lithographed on a quartz wafer in a pattern of dots of ~10 micrometers.
• Oligonucleotide (oligo) probes: 25-nucleotide chains for selected parts of a gene, complementary to mRNA.
• GeneChips are manufactured to include all currently known and predicted genes of a particular organism, e.g., H. sapiens. The information about the physical locations of the oligo probes for each gene on the chip is contained in the *.cdf file.
• A sample of mRNA extracted from cells of an organism is, after pre-processing, hybridized with the GeneChip, giving PM and MM values which characterize gene expression in the cells. For every gene there are 11-20 (depending on chip design) different oligo probes called perfect matches (PM). In addition, for each PM there is a corresponding mismatch oligo (MM) that differs from it in the middle base.
Microarrays: a massively parallel experiment 3/5
• Labelled cRNA targets derived from the mRNA of an experimental sample are hybridized to oligo probes.
• During hybridization, complementary nucleotides line up and bind together via hydrogen bonds, in the same way as the two strands of DNA are bound together.
• The chip is then scanned with a laser, giving the amount of each mRNA species represented.
• Image courtesy: cnx.org
Microarrays: a massively parallel experiment 4/5
• A pool of mRNA is extracted from the cells of an organism and converted to a biotin-labelled strand (cRNA) that binds to the oligo probes on the GeneChip during hybridization.
• The higher the concentration of a particular mRNA in the testing pool, the greater the hybridization level of the PM probes and thus the amount of hybridized material on the processed GeneChip.
• A fluorescent stain that binds to the biotin is then applied, and the GeneChip is processed through a scanner that illuminates each dot of the GeneChip with a laser, causing the dots to fluoresce.
• The image data of the scanned probe array is stored in a *.dat file. The Affymetrix GCOS software processes the *.dat file and generates a *.cel file containing all numerical data of the GeneChip experiment, e.g., probe locations and PM and MM intensities. The processing involves computing a square grid locating the dots for the probes, intensity normalization using internal controls, and detecting outliers.
• More sophisticated algorithms for converting *.dat to *.cel, e.g., taking into account the cRNA saturation, are being developed elsewhere.
Microarrays: a massively parallel experiment 5/5
• The PM and MM values are not normally used directly for high-level statistical analysis. Instead, they are first converted into gene expression values, which involves:
   • Detecting unreliable data by comparing PM and MM
   • Adjusting for background and noise
   • Calculating the single-array gene expression intensities, basically by averaging the adjusted PM values for each probe set
• Alternatively, the Comparison Analysis (Experiment versus Baseline arrays) detects and quantifies changes in gene expression between two arrays, applying normalization of data and using the Signal Log Ratio algorithms.
• Either way, the absolute or comparison gene expression values are stored in a *.chp file, which serves as the input for high-level statistical analysis. Typically, multiple GeneChip tests are performed, giving multiple *.chp files with gene expression values.
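The three conversion steps above can be illustrated with a toy calculation. This is a deliberately simplified stand-in, not the Affymetrix algorithm: the function name `probe_set_expression`, the rule "a pair is unreliable when MM is at least as bright as PM", the subtraction of MM as part of the adjustment, and the constant `background` are all assumptions made for illustration.

```python
import numpy as np

def probe_set_expression(pm, mm, background=0.0):
    """Toy single-array expression value for one probe set (hypothetical rule)."""
    pm = np.asarray(pm, dtype=float)
    mm = np.asarray(mm, dtype=float)
    # Step 1: flag probe pairs as unreliable when MM is at least as bright as PM
    reliable = pm > mm
    if not reliable.any():
        return float("nan")
    # Step 2: adjust the reliable PM values (assumed: subtract MM and a constant background)
    adjusted = np.clip(pm[reliable] - mm[reliable] - background, 0.0, None)
    # Step 3: average the adjusted PM values over the probe set
    return float(adjusted.mean())

# Made-up intensities for a probe set with four probe pairs
pm = [1200.0, 950.0, 400.0, 1100.0]
mm = [300.0, 250.0, 500.0, 200.0]
value = probe_set_expression(pm, mm, background=50.0)
```

The third probe pair is discarded because its mismatch intensity exceeds the perfect-match intensity, and the remaining three adjusted values are averaged.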
Clustering: why?
• Microarray experiments typically involve multiple microarrays:
   • Studying a process over time, e.g., to measure the response to a drug or food.
   • Looking for differences between states, e.g., normal cells versus cancer cells.
• A typical goal is finding gene networks, i.e., groups of genes that change expression inter-dependently across samples. Given a sufficiently large number of microarrays, we want to reverse engineer the regulatory network that controls gene expression. We need computer clustering of the microarray data to select a (ideally small) number of co-expressed genes of a gene network. Separate experiments using gene knockout on the selected genes can then be performed to confirm the discovered regulatory network biologically.
Clustering: how? The overview
There is no good, widely accepted definition of clustering. The traditional graph-theoretical definition is combinatorial in nature and computationally infeasible. Heuristics rule! Good open-source software exists, e.g., METIS and CLUTO. Clustering can be performed hierarchically by agglomeration (bottom-up) or by division (top-down).
Figure: agglomerative clustering example
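As a small illustration of the bottom-up (agglomerative) approach, the sketch below uses SciPy's `scipy.cluster.hierarchy` module rather than METIS or CLUTO. The data matrix is hypothetical: six "genes" measured in two "experiments", and the choice of average linkage is an assumption.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical toy data: 6 "genes" (rows) measured in 2 "experiments" (columns)
data = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
                 [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]])

# Bottom-up merging: start from single points, repeatedly join the two closest clusters
tree = linkage(data, method="average")

# Cut the merge tree so that at most 2 clusters remain
labels = fcluster(tree, t=2, criterion="maxclust")
```

Each row of `tree` records one merge, so the same tree can be cut at different levels to obtain coarser or finer clusterings.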
Clustering: how? Co-clustering
Two-way clustering, co-clustering, or bi-clustering are clustering methods where not only the objects are clustered but also the features of the objects, i.e., if the data is represented in a data matrix, the rows and columns are clustered simultaneously!
Image courtesy http://www.gg.caltech.edu/~zhukov/research/spectral_clustering/spectral_clustering.htm
Clustering: how? Algorithms
Partitioning means determining clusters. Partitioning can be used recursively for hierarchical division. Many partitioning methods are known. Here we cover:
• Spectral partitioning using Fiedler vectors = Principal Components Analysis (PCA)
PCA/spectral partitioning is known to produce high-quality clusters, but is considered expensive because it requires the solution of large-scale eigenproblems. Our expertise in eigensolvers comes to the rescue!
Eigenproblems in mechanical vibrations
Free transversal vibration without damping of the mass-spring system is described by an ODE system. The standing wave assumption leads to an eigenproblem. Component x_i describes the up-down movement of mass i.
Images courtesy http://www.gps.caltech.edu/~edwin/MoleculeHTML/AuCl4_html/AuCl4_page.html
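The formulas on this slide are not preserved in the text. For orientation, the standard textbook form of the two systems is sketched below, assuming M denotes the mass matrix and K the stiffness matrix (the names used later in the presentation); the symbol ω for the vibration frequency is introduced here for illustration.

```latex
% Undamped free vibration of the mass-spring system (M: mass matrix, K: stiffness matrix)
M \ddot{x}(t) + K x(t) = 0
% Standing wave assumption: every mass oscillates with the same frequency \omega
x(t) = x \cos(\omega t)
% Substituting gives the generalized eigenproblem for the mode shape x
K x = \lambda M x, \qquad \lambda = \omega^2
```

Low frequencies correspond to small eigenvalues λ, which is why the smallest eigenpairs are the ones of interest for clustering.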
Spectral clustering in mechanics
The main idea comes from mechanical vibrations and is intuitive: in the spring-mass system, masses which are tightly connected tend to move together synchronously in low-frequency free vibrations. Analysing the signs of the components of the low-frequency vibration modes that correspond to different masses allows us to determine the clusters of the masses!
A 4-degree-of-freedom system has 4 modes of vibration and 4 natural frequencies; it is partitioned into 2 clusters using the second eigenvector.
Images courtesy: Russell, Kettering U.
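A minimal numerical sketch of this idea for a 4-mass chain follows. The spring constants are hypothetical (two stiff springs inside the pairs, one weak spring between the pairs), the masses are assumed to be equal to one, and the ends of the chain are assumed to be free.

```python
import numpy as np

# Hypothetical spring constants: stiff inside each pair of masses, weak between the pairs
k12, k23, k34 = 10.0, 0.1, 10.0

# Stiffness matrix of the chain 1-2-3-4 with free ends
K = np.array([[ k12, -k12,        0.0,  0.0],
              [-k12,  k12 + k23, -k23,  0.0],
              [ 0.0, -k23,  k23 + k34, -k34],
              [ 0.0,  0.0,       -k34,  k34]])

# Unit masses (M = I), so the modes are the eigenvectors of K, sorted by frequency
eigenvalues, modes = np.linalg.eigh(K)

# The first mode is the rigid shift; the signs of the second mode split the masses
second_mode = modes[:, 1]
clusters = second_mode > 0
```

In the second mode, masses 1 and 2 move in one direction while masses 3 and 4 move in the other, so the sign pattern recovers the two tightly connected pairs.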
Spectral clustering for simple graphs
Simple graphs are undirected graphs with no self-loops and no more than one edge between any two different vertices.
• A = symmetric adjacency matrix
• D = diagonal degree matrix
• Laplacian matrix L = D - A
L=K describes transversal vibrations of the spring-mass system (as well as Kirchhoff's law for electrical circuits of resistors)
Spectral clustering for simple graphs
The Laplacian matrix L is symmetric and its rows sum to zero, so its smallest eigenvalue is zero with a constant eigenvector (free boundary). The eigenvector for the second smallest eigenvalue, called the Fiedler vector, describes the partitioning.
• The signs of the Fiedler vector components give a bi-partitioning: vertices with positive components form one part, vertices with negative components the other
• By running K-means on the Fiedler vector one could find more than 2 partitions if the vector is close to piecewise-constant after reordering
• The same idea applies to more eigenvectors of Lx=λx
Example courtesy: Blelloch, CMU. See also www.cs.cas.cz/fiedler80/
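The construction can be tried on a small graph. The sketch below does not reproduce the 5-vertex example from the slide, whose edges are not given in the text; it uses a hypothetical 6-vertex graph made of two triangles joined by a single edge.

```python
import numpy as np

# Hypothetical simple graph: triangle 0-1-2, triangle 3-4-5, and one bridging edge 2-3
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
n = 6

A = np.zeros((n, n))                 # symmetric adjacency matrix
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
D = np.diag(A.sum(axis=1))           # diagonal degree matrix
L = D - A                            # graph Laplacian; every row sums to zero

# eigh returns eigenvalues in ascending order, so column 1 is the Fiedler vector
eigenvalues, eigenvectors = np.linalg.eigh(L)
fiedler = eigenvectors[:, 1]

# Bi-partitioning by the signs of the Fiedler vector components
part = fiedler > 0
```

The overall sign of an eigenvector is arbitrary, so only the grouping matters: the two triangles end up in different parts, and the cut removes just the bridging edge.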
PCA clustering for simple graphs
• The Fiedler vector is an eigenvector of Lx=λx; in the spring-mass system this corresponds to the stiffness matrix K=L and the mass matrix M=I (identity)
• Should not the masses with a larger adjacency degree be heavier? Let us take the mass matrix M=D, the degree matrix
• The so-called N-cut eigenvectors, for the smallest eigenvalues of Lx=λDx, are the eigenvectors for the largest eigenvalues of Ax=µDx with µ=1-λ, since L=D-A
• PCA for D^{-1}A computes the eigenvectors for the largest eigenvalues, which can then be used for clustering by K-means
• D^{-1}A is row-stochastic and describes the Markov random walk probabilities on the simple graph
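The relation µ=1-λ and the row-stochastic property can be checked numerically. The sketch below assumes a hypothetical 5-vertex adjacency matrix and uses SciPy's dense generalized symmetric eigensolver `scipy.linalg.eigh`, which is suitable only for small examples.

```python
import numpy as np
from scipy.linalg import eigh

# Hypothetical simple graph on 5 vertices (symmetric adjacency matrix, no self-loops)
A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)
degrees = A.sum(axis=1)
D = np.diag(degrees)
L = D - A

# Generalized eigenproblem L x = lambda D x (mass matrix M = D), eigenvalues ascending
lam, X = eigh(L, D)

# Random-walk matrix D^{-1} A: each row is divided by the degree of its vertex
P = A / degrees[:, None]

# Eigenvalues of D^{-1} A in descending order; they are real although P is not symmetric
mu = np.sort(np.linalg.eigvals(P).real)[::-1]
```

The eigenvectors in `X` for the smallest `lam` are at the same time eigenvectors of `P` for the largest `mu`, so either formulation yields the same vectors for the subsequent K-means step.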
Connection to image segmentation
Image pixels serve as graph vertices. Weighted graph edges are computed by comparing pixel colours.
Figure: an example displaying 4 Fiedler vectors of an image.
Here we generate a sparse Laplacian, because only neighboring pixels are compared when computing the weights for the edges. For microarrays, genes correspond to vertices, but we have to compare all genes with each other, possibly getting a much denser Laplacian (large fill-in).
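A sketch of how such a sparse Laplacian might be assembled is given below. The presentation does not specify the weight function, so the Gaussian weight on the intensity difference, the parameter `sigma`, the restriction to a grayscale image with 4-neighbor connectivity, and the function name `image_laplacian` are all assumptions.

```python
import numpy as np
import scipy.sparse as sp

def image_laplacian(image, sigma=0.1):
    """Sparse graph Laplacian of a 2D grayscale image on the 4-neighbor pixel grid."""
    h, w = image.shape
    idx = np.arange(h * w).reshape(h, w)          # vertex number of every pixel
    # Edges between horizontally adjacent and vertically adjacent pixels
    src = np.concatenate([idx[:, :-1].ravel(), idx[:-1, :].ravel()])
    dst = np.concatenate([idx[:, 1:].ravel(), idx[1:, :].ravel()])
    flat = image.ravel()
    # Assumed weight: close to 1 for similar pixels, close to 0 across a sharp edge
    weights = np.exp(-((flat[src] - flat[dst]) ** 2) / sigma ** 2)
    A = sp.coo_matrix((weights, (src, dst)), shape=(h * w, h * w))
    A = (A + A.T).tocsr()                         # symmetric weighted adjacency matrix
    D = sp.diags(np.asarray(A.sum(axis=1)).ravel())
    return (D - A).tocsr()

# Tiny synthetic image: dark left half, bright right half
image = np.zeros((4, 6))
image[:, 3:] = 1.0
L = image_laplacian(image)
```

Each pixel has at most four neighbors, so the number of nonzeros grows linearly with the number of pixels, which is what keeps multi-megapixel problems tractable.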
Eigensolvers for spectral clustering
• Our BLOPEX-LOBPCG software has proved to be efficient for large-scale eigenproblems for Laplacians from PDEs and for image segmentation, using the multiscale preconditioning of hypre
• The LOBPCG for massively parallel computers is available in our Block Locally Optimal Preconditioned Eigenvalue Xolvers (BLOPEX) package
• BLOPEX is built into hypre, see http://www.llnl.gov/CASC/hypre/, and is included as an external package in PETSc, see http://www-unix.mcs.anl.gov/petsc/
• On 1024 CPUs of BlueGene/L we can compute the Fiedler vector of a 24 megapixel image in seconds (including the hypre algebraic multigrid setup).
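For experimenting on a desktop, SciPy provides an implementation of the LOBPCG method as `scipy.sparse.linalg.lobpcg`. The sketch below uses it instead of BLOPEX, hypre, or PETSc, on a hypothetical 5-by-10 grid graph whose left and right halves are joined by weak edges; the helper `path_laplacian`, the weight 0.05, and the tolerance settings are assumptions, and no preconditioner is used because the problem is tiny.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import lobpcg

def path_laplacian(weights):
    """Laplacian of a path graph whose consecutive vertices are joined by the given weights."""
    weights = np.asarray(weights, dtype=float)
    degree = np.concatenate([[0.0], weights]) + np.concatenate([weights, [0.0]])
    return sp.diags([-weights, degree, -weights], offsets=[-1, 0, 1], format="csr")

h, w = 5, 10
col_weights = np.ones(w - 1)
col_weights[w // 2 - 1] = 0.05       # weak edges between the left and right halves

# Grid Laplacian as a Kronecker sum; pixel (r, c) is vertex r * w + c
L = (sp.kron(sp.identity(h), path_laplacian(col_weights))
     + sp.kron(path_laplacian(np.ones(h - 1)), sp.identity(w))).tocsr()

n = h * w
rng = np.random.default_rng(0)
X = rng.standard_normal((n, 1))      # random initial guess for one eigenvector
Y = np.ones((n, 1))                  # constraint: stay orthogonal to the constant eigenvector

# Smallest eigenpair in the complement of Y, i.e., the Fiedler pair
eigenvalue, eigenvector = lobpcg(L, X, Y=Y, largest=False, tol=1e-8, maxiter=200)
fiedler = eigenvector[:, 0].reshape(h, w)
segment = fiedler > 0
```

For large images one would additionally pass a preconditioner through the `M` argument of `lobpcg`; in the presentation that role is played by the algebraic multigrid of hypre.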
Work in Progress
• Segmentation of 3D images at the pixel level
• Multi-level/resolution segmentation
• Mathematical foundation of clustering
• Algorithm and software development for simultaneous clustering of genes and experiments
• Accurate clustering for large datasets
We need collaborators from the medical imaging and molecular biology communities to give us the data.