Correlation-based Data Processing and its Application to Biology

Schloss Dagstuhl. Correlation-based Data Processing and its Application to Biology. Marc Strickert. Pattern Recognition Group. stricker@ipk-gatersleben.de. Osnabrück, 14 January 2005. Leibniz Institute of Plant Genetics and Crop Plant Research Gatersleben. Goals. Attribute rating


Presentation Transcript


  1. Schloss Dagstuhl. Correlation-based Data Processing and its Application to Biology. Marc Strickert, Pattern Recognition Group, stricker@ipk-gatersleben.de. Osnabrück, 14 January 2005. Leibniz Institute of Plant Genetics and Crop Plant Research Gatersleben.

  2. Goals • Attribute rating • Clustering • Classification • Visualization of biological data, exploiting properties of Pearson correlation.

  3. Euclidean distances may be problematic: $d_1 = (x_1-y_1)^2 + \dots + (x_5-y_5)^2$ (pair 1) and $d_2 = (x_1-y_1)^2 + \dots + (x_5-y_5)^2$ (pair 2) are identical despite different shapes. [John Lee and Michel Verleysen]

  4. Pearson correlation: invariant to scaling (amplitude) and shifting (vertical offset). [Figure: up-regulated gene profiles. Raw data ('Euclidean' view); same profiles, aligned ('Pearson' view): same correlations as above!]
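
The invariance claimed on this slide can be checked numerically. The following is a minimal sketch, not code from the presentation: the profile values and the helper names `pearson` and `sq_euclidean` are made up for illustration.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation of two equal-length 1-D profiles."""
    xc = x - x.mean()
    yc = y - y.mean()
    return float(xc @ yc / np.sqrt((xc @ xc) * (yc @ yc)))

def sq_euclidean(x, y):
    """Squared Euclidean distance between two profiles."""
    return float(np.sum((x - y) ** 2))

# Hypothetical up-regulated profile, plus a copy with a different
# amplitude (factor 3) and vertical offset (+10).
profile = np.array([0.1, 0.4, 0.9, 1.6, 2.5])
rescaled = 3.0 * profile + 10.0

print(pearson(profile, rescaled))       # 1 up to rounding: same shape
print(sq_euclidean(profile, rescaled))  # large: Euclidean view sees them as far apart
```

Any positive scaling factor and any offset give the same correlation; a negative factor flips the sign of the correlation.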

  5. Derivatives of squared Euclidean and Pearson correlation Squared Euclidean: Pearson correlation:
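
The two formulas on this slide did not survive in the transcript. For orientation, the standard derivatives are given below; this is a reconstruction from the textbook definitions, not the slide's own notation. The vectors $\mathbf{x}, \mathbf{y}$ follow slide 3, while the shorthand $B$, $C$, $D$ and the means $\mu_x$, $\mu_y$ are assumptions introduced here.

```latex
\begin{align*}
d(\mathbf{x},\mathbf{y}) &= \sum_{i} (x_i - y_i)^2,
&
\frac{\partial d}{\partial x_k} &= 2\,(x_k - y_k), \\[4pt]
r(\mathbf{x},\mathbf{y}) &= \frac{B}{\sqrt{C \cdot D}},
&
\frac{\partial r}{\partial x_k} &= r \left( \frac{y_k - \mu_y}{B} - \frac{x_k - \mu_x}{C} \right),
\end{align*}
\[
B = \sum_i (x_i - \mu_x)(y_i - \mu_y), \quad
C = \sum_i (x_i - \mu_x)^2, \quad
D = \sum_i (y_i - \mu_y)^2 .
\]
```

The Pearson derivative follows from $\partial B / \partial x_k = y_k - \mu_y$ and $\partial C / \partial x_k = 2\,(x_k - \mu_x)$, with $D$ independent of $\mathbf{x}$; the form shown requires $B \neq 0$.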

  6. Applications for the derivative of a similarity measure: 1. Attribute rating (variance analogue). 2. Clustering (Neural Gas for Correlation, NG-C). 3. Classification (GRLVQ-C). 4. Visualization (High-Throughput MDS).

  7. Attribute rating. Variance as a double sum of derivatives. Squared Euclidean distance:
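
The slide's formula is missing from the transcript. The underlying identity is standard: since the derivative of the squared Euclidean distance $d(x_i, x_j)$ with respect to attribute $k$ of $x_i$ is $2(x_i^k - x_j^k)$, the population variance of attribute $k$ equals the double sum of squared derivatives over all pairs, divided by $8n^2$. The sketch below checks this numerically; the function name, the random data, and the normalisation constant as written are assumptions of this example, not taken from the slide.

```python
import numpy as np

def variance_from_derivatives(data):
    """Per-attribute variance as a double sum over all pairs of rows.

    data has shape (n_samples, n_attributes).
    """
    n = data.shape[0]
    # deriv[i, j, k] = derivative of d(x_i, x_j) with respect to x_i^k
    deriv = 2.0 * (data[:, None, :] - data[None, :, :])
    return (deriv ** 2).sum(axis=(0, 1)) / (8.0 * n ** 2)

rng = np.random.default_rng(0)
data = rng.normal(size=(20, 5))  # hypothetical data: 20 samples, 5 attributes

# np.var uses the population variance (ddof=0) by default.
print(np.allclose(variance_from_derivatives(data), data.var(axis=0)))
```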

  8. Correlation analogue to Euclidean variance. [Figure labels: W, X]
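
One way to read this slide is to keep the double sum over all pairs but replace the Euclidean derivative by the derivative of the Pearson correlation. The sketch below follows that reading. It is an assumption: the slide's formula is not preserved, and the function names, the missing normalisation, and the random data are illustrative only.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation of two 1-D profiles."""
    xc, yc = x - x.mean(), y - y.mean()
    return float(xc @ yc / np.sqrt((xc @ xc) * (yc @ yc)))

def pearson_gradient(x, y):
    """Gradient of r(x, y) with respect to the entries of x."""
    xc, yc = x - x.mean(), y - y.mean()
    norm = np.sqrt((xc @ xc) * (yc @ yc))
    r = (xc @ yc) / norm
    return yc / norm - r * xc / (xc @ xc)

def correlation_rating(data):
    """Unnormalised rating per attribute: double sum of squared dr/dx_i^k."""
    n, m = data.shape
    rating = np.zeros(m)
    for i in range(n):
        for j in range(n):
            if i != j:  # the gradient of r(x, x) is zero, so skipping is harmless
                rating += pearson_gradient(data[i], data[j]) ** 2
    return rating

rng = np.random.default_rng(0)
data = rng.normal(size=(15, 6))  # hypothetical: 15 profiles, 6 attributes
print(correlation_rating(data))
```

Attributes with a larger rating are those to which the pairwise correlations react most strongly.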

  9. Clustering: Neural Gas (NG revisited) NG-C:
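
The update rules on this slide are not preserved. As a rough illustration of the idea only, the sketch below takes the usual Neural Gas scheme (rank all prototypes, update each with a rank-dependent strength) and swaps in the Pearson correlation: prototypes are ranked by correlation with the sample and moved along the gradient of that correlation. The exact NG-C cost function and update may differ; `ngc_step`, `eps`, and `lam` are names and parameters assumed here.

```python
import numpy as np

def pearson(x, w):
    """Pearson correlation of two 1-D profiles."""
    xc, wc = x - x.mean(), w - w.mean()
    return float(xc @ wc / np.sqrt((xc @ xc) * (wc @ wc)))

def pearson_grad_w(x, w):
    """Gradient of r(x, w) with respect to the prototype w."""
    xc, wc = x - x.mean(), w - w.mean()
    norm = np.sqrt((xc @ xc) * (wc @ wc))
    r = (xc @ wc) / norm
    return xc / norm - r * wc / (wc @ wc)

def ngc_step(x, prototypes, eps=0.1, lam=1.0):
    """One stochastic update: rank by correlation, ascend r for every prototype."""
    r = np.array([pearson(x, w) for w in prototypes])
    ranks = np.argsort(np.argsort(-r))   # rank 0 = most correlated prototype
    strength = np.exp(-ranks / lam)      # neighbourhood function of the rank
    grads = np.array([pearson_grad_w(x, w) for w in prototypes])
    return prototypes + eps * strength[:, None] * grads

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])               # hypothetical sample profile
prototypes = np.array([[1.0, 0.0, 2.0, 1.0, 3.0],
                       [3.0, 1.0, 2.0, 0.0, 1.0]])    # hypothetical prototypes
updated = ngc_step(x, prototypes)
print([pearson(x, w) for w in prototypes])
print([pearson(x, w) for w in updated])
```

In a full training loop, `eps` and `lam` would typically be decreased over time, as in standard Neural Gas.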

  10. High centroid reproducibility with NG-C: 23 gene expression centroids, 10 independent runs. NG-C: crisp final states. k-means: indeterminate final states.

  11. Classification with relevance learning. Adaptive Pearson correlation: used, for example, in Generalized Relevance Learning Vector Quantization with Correlation (GRLVQ-C).
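
The formula for the adaptive Pearson correlation is missing from the transcript. A common way to make a correlation adaptive is to weight each attribute with a squared relevance factor, and the sketch below assumes that form; it should not be read as the exact GRLVQ-C definition. The function name, the choice of unweighted means, and the example vectors are assumptions.

```python
import numpy as np

def adaptive_pearson(x, w, relevance):
    """Pearson correlation with per-attribute relevance factors (assumed form).

    Each attribute's contribution is scaled by relevance**2.
    """
    lam2 = np.asarray(relevance, dtype=float) ** 2
    xc, wc = x - x.mean(), w - w.mean()
    num = np.sum(lam2 * xc * wc)
    den = np.sqrt(np.sum(lam2 * xc ** 2) * np.sum(lam2 * wc ** 2))
    return float(num / den)

x = np.array([1.0, 3.0, 2.0, 5.0, 4.0])   # hypothetical data vector
w = np.array([2.0, 2.5, 1.0, 4.0, 4.5])   # hypothetical prototype
uniform = np.ones(5)
focused = np.array([1.0, 1.0, 0.0, 1.0, 1.0])  # third attribute switched off

print(adaptive_pearson(x, w, uniform))   # reduces to the plain Pearson correlation
print(adaptive_pearson(x, w, focused))
```

With uniform relevance factors the measure reduces to the ordinary Pearson correlation; relevance learning would adapt the factors so that discriminative attributes gain weight.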

  12. Leukemia cancer data set: AML / ALL separation. GRLVQ-C: relevance factors → top 10 gene ranking. 1 prototype per class + relevance learning. Consistent with Golub et al.

  13. Visualization of high-dimensional data. [Figure: high-dimensional data points A, B, C (3D, constant source) with inter-point distances d12, d13, d23 are mapped by an “embedding” to low-dimensional points A', B', C' (2D, variable target) with their own distances d12, d13, d23.] Gradient-based stochastic optimization → HiT-MDS.

  14. Maximize distance correlations: source ≈ reconstruction. Source: original inter-point distance matrix. Reconstruction: reconstructed inter-point distance matrix, whose adaptive parameters are the point coordinates. Minimize the embedding stress function using negative Fisher's Z':
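
The stress formula itself is not preserved. Fisher's Z' transform of a correlation $r$ is $\tfrac{1}{2}\ln\frac{1+r}{1-r} = \operatorname{artanh}(r)$, so a stress of the kind described here can be sketched as the negative of that transform applied to the correlation between source and reconstructed distances. The function names and the test data below are assumptions for illustration.

```python
import numpy as np

def pairwise_distances(points):
    """Euclidean distance matrix of the rows of points."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def embedding_stress(source_dist, embedded_points):
    """Negative Fisher Z' of the correlation between the two distance sets."""
    upper = np.triu_indices(source_dist.shape[0], k=1)  # each pair once
    d_src = source_dist[upper]
    d_emb = pairwise_distances(embedded_points)[upper]
    r = np.corrcoef(d_src, d_emb)[0, 1]
    return float(-np.arctanh(r))

rng = np.random.default_rng(0)
high = rng.normal(size=(30, 5))
high[:, 2:] *= 0.1                 # hypothetical data that is almost 2-D
source = pairwise_distances(high)

good = high[:, :2]                 # keeps the two dominant coordinates
bad = rng.normal(size=(30, 2))     # unrelated random layout
print(embedding_stress(source, good), embedding_stress(source, bad))
```

A perfect reconstruction has $r = 1$, for which the stress diverges to minus infinity; real embeddings stay below that, and a lower stress means a more faithful layout.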

  15. Iterative gradient descent for stress function minimization. [Formula annotations: derivative of Fisher's Z'; for Euclidean spaces]

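  16. High-Throughput Multi-Dimensional Scaling (HiT-MDS). Input: $x_i \in X$ with distances $d_{ij}$. Embedding: $\hat{x}_i \in \hat{X}$ with distances $\hat{d}_{ij}$. Stress $s$ based on $r(d_{ij}, \hat{d}_{ij})$. HiT-MDS algorithm: (1) Initialize $\hat{X}$ by random projection (or smarter). (2) Calculate correlation $r(X, \hat{X})$ once. (3) Draw next pattern $\hat{x}_i$. (4) Minimize stress $s$ to all $\hat{x}_j$: $\Delta \hat{x}_{ik} \sim -\partial s / \partial \hat{x}_{ik}$; recalculate distances $\hat{d}_{ij}$; adapt $r$.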
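
The loop described on this slide can be sketched as follows. This is a simplified illustration, not the HiT-MDS implementation: it recomputes the correlation from scratch at every update instead of adapting it incrementally, it initialises with random coordinates rather than a random projection, and the function names, learning rate, and epoch count are assumptions. The gradient uses the chain rule through $s = -\operatorname{artanh}(r)$, the Pearson derivative, and the Euclidean distance derivative.

```python
import numpy as np

def distances(points):
    """Euclidean distance matrix of the rows of points."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def stress_and_gradient(d_src, emb):
    """Stress s = -artanh(r) and its gradient w.r.t. all embedded coordinates.

    Assumes no two embedded points coincide (distances are divided by).
    """
    n = emb.shape[0]
    upper = np.triu_indices(n, k=1)
    diff = emb[:, None, :] - emb[None, :, :]
    d_emb = np.sqrt((diff ** 2).sum(axis=-1))
    a = d_src[upper] - d_src[upper].mean()   # centred source distances
    b = d_emb[upper] - d_emb[upper].mean()   # centred embedded distances
    norm = np.sqrt((a @ a) * (b @ b))
    r = (a @ b) / norm
    stress = -np.arctanh(r)
    # ds/dr = -1/(1 - r^2); dr/d(d_emb) is the Pearson derivative per pair
    ds_dd = -(a / norm - r * b / (b @ b)) / (1.0 - r ** 2)
    g = np.zeros((n, n))
    g[upper] = ds_dd / d_emb[upper]          # includes 1/d from the distance derivative
    g = g + g.T
    grad = (g[:, :, None] * diff).sum(axis=1)
    return float(stress), grad

def hit_mds_sketch(d_src, dim=2, epochs=20, rate=0.05, seed=0):
    """Stochastic per-point gradient descent on the correlation-based stress."""
    rng = np.random.default_rng(seed)
    emb = rng.normal(size=(d_src.shape[0], dim))   # random initial layout
    for _ in range(epochs):
        for i in rng.permutation(d_src.shape[0]):  # draw next pattern
            _, grad = stress_and_gradient(d_src, emb)
            emb[i] -= rate * grad[i]               # move point i against ds/dx_i
    return emb

rng = np.random.default_rng(1)
d_src = distances(rng.normal(size=(6, 4)))   # hypothetical 4-D source data
embedding = hit_mds_sketch(d_src, epochs=5)
print(stress_and_gradient(d_src, embedding)[0])
```

Recomputing the full gradient for every single point costs quadratic time per update and is only acceptable for small examples; the incremental bookkeeping mentioned on the slide is what would make the method suitable for large data sets.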

  17. Applications of dimension reduction (visualization): 1. Gene space browser. 2. Macro-experiment grouping. [Figure labels: day 0, day 26]

  18. Embedding 12k genes (14 time points) in 2D. [Figure panels: EUC, COR, SRC; labels: U, I, D, FIT, orig, spline.] Legend: EUC = Euclidean distance, COR = Pearson correlation, SRC = Spearman rank correlation.

  19. Gene browser (4824 high-quality genes). [Figure: time axis from 0 to 26 DAF in steps of 2 …] [visualization: www.ggobi.org]

  20. Gene browser for powers of correlation: $(1-r)^8$

  21. Gene clustering (k=11), relevant genes in front

  22. 3D-View of 62 macroarrays (4824 genes)

  23. Data processing challenges in biology. Data sets from • metabolite measurements (2D-gels, HPLC), • QTL LOD-score pattern compression, • DNA-sequence arrangement. • Missing value imputation (→ probabilistic models) • Association studies (→ common latent space, CCA) • Rank-based data analysis (→ distribution models) • Faithful low-dimensional data representation • Proximity data handling. Common language: R / MATLAB / … ?

  24. Thanks: Pattern Recognition Group (IPK, headed by Udo Seiffert); Nese Sreenivasulu (IPK, Molecular Biology); Barbara Hammer (TU-Clausthal); Thomas Villmann (University of Leipzig). http://pgrc-16.ipk-gatersleben.de/~stricker/ http://hitmds.webhop.net/

  25. Some References
      Strickert, M.; Sreenivasulu, N.; Peterek, S.; Weschke, W.; Mock, H.-P.; Seiffert, U. Unsupervised Feature Selection for Biomarker Identification in Chromatography and Gene Expression Data. In F. Schwenker and S. Marinai (Eds.), Artificial Neural Networks in Pattern Recognition, LNAI 4087, pp. 274-285, 2006.
      Strickert, M.; Sreenivasulu, N.; Seiffert, U. Sanger-driven MDSLocalize - A Comparative study for Genomic Data. In M. Verleysen (Ed.), Proc. 14th European Symp. Artificial Neural Networks (ESANN 2006), Bruges, Belgium. D-Side publishers, Evere/Belgium, pp. 265-270, 2006.
      Strickert, M.; Seiffert, U.; Sreenivasulu, N.; Weschke, W.; Villmann, T.; Hammer, B. Generalized Relevance LVQ (GRLVQ) with Correlation Measures for Gene Expression Data. Neurocomputing 69 (2006), pp. 651-659, Springer, 2006.
      Strickert, M.; Sreenivasulu, N.; Usadel, B.; Seiffert, U. Correlation-maximizing surrogate gene space for visual mining of gene expression patterns in developing barley endosperm tissue. To appear in BMC Bioinformatics, 2007.
      Strickert, M.; Sreenivasulu, N.; Seiffert, U. Browsing temporally regulated gene expressions in correlation-maximizing space. Accepted presentation at conference on Analysis of Compatibility Pathways (March 4-6, 2007).
