Information Visualization, Nonlinear Dimensionality Reduction and Sampling for Large and Complex Data Sets
Misha Pesenson, Isaac Pesenson*, Bruce McCollum
California Institute of Technology, *Temple University
215th AAS Meeting, Washington DC
Acknowledgment
• We would like to thank Dr. Mike Egan for his support
• This work was carried out at the SSC, Caltech, and supported by the National Geospatial-Intelligence Agency, Grant # HM1582-08-1-0019
Motivation
• The Data Big Bang
• The Expanding Digital Universe
• Inflationary Epoch
Motivation (cont.)
• Data are now produced faster than they can be meaningfully analyzed
• Modern data are complex: dozens or hundreds of useful parameters are associated with each astronomical object
• LSST: the ten-year survey will produce tens of petabytes of image and catalog data and will require ~250 TFlops of processing to reduce
• A discussion related to LSST can be found in: T. Loredo, G. Babu, K. Borne, E. Feigelson, A. Gray, 2010, The Spectrum of LSST Data Analysis Challenges: Kiloscale to Petascale, 215th AAS
Motivation (cont.)
• To capitalize on the opportunities provided by these data sets, one needs to be able to organize, analyze and visualize them
• Traditional methods are often inadequate, not merely because of the size in bytes of the data sets but also because of their complexity
• To be successful, new approaches must extend beyond traditional scientific analysis and information visualization
Motivation (cont.)
• Moreover, detecting the expected and discovering the unexpected in massive data sets requires a synergistic approach that draws on recent advances in:
  • Statistics
  • Applied mathematics
  • Computer science
  • Artificial intelligence
  • Machine learning
  • Knowledge representation
  • Cognitive and perceptual sciences
  • Decision sciences, and more
Motivation (cont.)
• Valuable results pertaining to these problems are found mostly in publications outside astronomy
• There is a big gap between applied mathematics, artificial intelligence and computer science on the one side and astronomy on the other
Goals of This Presentation
• To draw the attention of the astronomical community to the aforementioned gap
• To help bridge this gap by briefly reviewing some of the advanced methods
• “To increase the general awareness and avoidance of unprincipled data analysis methods” (Xiao-Li Meng, 2009, Desired and Feared—What Do We Do Now and Over the Next 50 Years?, American Statistician, v. 63, 3, 202-210)
Complex Data: Spectral Imaging
224 spectral channels
Astronomical Data Types and Approaches to Their Representation and Processing
Scientific Visualization vs. Illustrative Visualization
• Scientific Visualization (SV) does not simply reproduce visible things; it makes things visible
• SV enables extraction of meaningful patterns from multiparametric data sets
The Curse of Dimensionality and Dimension Reduction (DR)
• Extraction and visualization of meaningful structures from multiparametric, high-dimensional data sets require an accurate low-dimensional representation of the data
• DR is motivated by the fact that the more we are able to reduce the dimensionality of a data set, the more regularities (correlations) we have found in it and, therefore, the more we have learned from the data
• Pesenson M., Pesenson I., McCollum B., 2010, “The Data Big Bang and the Expanding Digital Universe: High-Dimensional, Complex and Massive Data Sets in an Inflationary Epoch”, Advances in Astronomy, special issue on Robotic Astronomy (accepted)
Dimension Reduction (cont.)
• Greatly increases computational efficiency of machine learning algorithms
• Improves statistical inference
• Enables effective scientific visualization and classification
Dimension Reduction: “Linear” Data, PCA
If the data are mainly confined to an almost linear low-dimensional subspace, then simple linear methods such as principal component analysis (PCA) can be used to discover the subspace and estimate its dimensionality
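A minimal sketch of how PCA can estimate the dimensionality of nearly linear data, assuming synthetic points close to a 2D plane embedded in 10 dimensions. The helper name `pca_dimension`, the 99% variance threshold, and the data are illustrative assumptions, not taken from the presentation.

```python
import numpy as np

def pca_dimension(data, variance_threshold=0.99):
    """Return (number of components needed, explained-variance ratios).

    variance_threshold is an illustrative choice, not a universal rule.
    """
    centered = data - data.mean(axis=0)
    # Squared singular values of the centered data are proportional
    # to the variances along the principal axes.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values**2
    ratios = variances / variances.sum()
    cumulative = np.cumsum(ratios)
    n_components = int(np.searchsorted(cumulative, variance_threshold) + 1)
    return n_components, ratios

# Synthetic data: 500 points near a 2D plane embedded in 10 dimensions.
rng = np.random.default_rng(0)
basis, _ = np.linalg.qr(rng.normal(size=(10, 2)))   # orthonormal 10x2 basis
latent = rng.normal(size=(500, 2)) * np.array([3.0, 1.0])
data = latent @ basis.T + 0.01 * rng.normal(size=(500, 10))

estimated_dim, ratios = pca_dimension(data)
print("estimated dimensionality:", estimated_dim)
print("leading variance ratios:", np.round(ratios[:3], 4))
```

Almost all of the variance falls in the first two components, so the estimate recovers the dimension of the plane; with real data the threshold would need to be chosen with the noise level in mind.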
Limitations of Linear Methods
• Linear methods such as PCA have a serious drawback: they do not explicitly consider the structure of the manifold on which the data may reside
• PCA is intrinsically linear, so if the data points form a nonlinear manifold, no rotation and shift of the axes (which is all a linear transform like PCA provides) can “unfold” a manifold such as the one on the next slide
Data Lying on Manifolds
Formally applying geometrically linear methods would produce a complete misrepresentation of the data
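The limitation can be seen in a small sketch, assuming a helix as a stand-in for a nonlinear manifold (the curve, its pitch, and the 95% threshold are illustrative choices, not the example shown on the slide). The helix is described by a single parameter, yet PCA spreads its variance over all three axes.

```python
import numpy as np

# A helix: a one-dimensional curve (single parameter t) embedded in 3D.
t = np.linspace(0.0, 4.0 * np.pi, 2000)
helix = np.column_stack([np.cos(t), np.sin(t), 0.3 * t])

centered = helix - helix.mean(axis=0)
singular_values = np.linalg.svd(centered, compute_uv=False)
ratios = singular_values**2 / np.sum(singular_values**2)

# Number of principal components needed to keep 95% of the variance.
n_needed = int(np.searchsorted(np.cumsum(ratios), 0.95) + 1)
print("variance ratios:", np.round(ratios, 3))
print("components needed for 95% variance:", n_needed)
```

No single principal axis captures the curve, so a linear projection to one dimension would fold distinct parts of the helix onto each other.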
Data Lying on Manifolds + Noise (Balasubramanian, Schwartz 2002)
The practical use of dimension reduction demands:
• Representation of measurement errors in high-dimensional instrument calibration
  • Connors A., van Dyk D., Freeman P., Kashyap V., Siemiginowska A., et al. 2008
• Careful improvement of signal-to-noise ratio without smearing essential features
  • Pesenson M., Roby W., McCollum B., 2008
Handling Geometrically Nonlinear Data
• The modern approach to multidimensional images or data sets is to approximate them by graphs or Riemannian manifolds
• After constructing a weighted graph, one can introduce the corresponding combinatorial Laplace operator
  • Belkin M., Niyogi P., 2005; Coifman R., Lafon S., 2006
• Application to astronomy: Richards J., Freeman P., Lee A., & Schafer C., 2009
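The construction can be sketched as follows, assuming a fully connected graph with Gaussian-kernel edge weights (one common choice; the kernel, its width `sigma`, and the function names are assumptions, not details given in the presentation). The combinatorial Laplacian is L = D - W, where W holds the edge weights and D is the diagonal matrix of node degrees.

```python
import numpy as np

def gaussian_weights(points, sigma):
    """Edge weights w_ij = exp(-|x_i - x_j|^2 / (2 sigma^2)), no self-loops."""
    diff = points[:, None, :] - points[None, :, :]
    sq_dist = np.sum(diff**2, axis=-1)
    weights = np.exp(-sq_dist / (2.0 * sigma**2))
    np.fill_diagonal(weights, 0.0)
    return weights

def combinatorial_laplacian(weights):
    """L = D - W, with D the diagonal matrix of weighted node degrees."""
    degree = np.diag(weights.sum(axis=1))
    return degree - weights

# Hypothetical data set: 50 objects, each described by 5 parameters.
rng = np.random.default_rng(0)
points = rng.normal(size=(50, 5))

W = gaussian_weights(points, sigma=1.0)
L = combinatorial_laplacian(W)

print("symmetric:", np.allclose(L, L.T))
print("rows sum to zero:", np.allclose(L.sum(axis=1), 0.0))
print("smallest eigenvalue:", np.linalg.eigvalsh(L).min())
```

For large data sets a sparse k-nearest-neighbor graph would usually replace the dense weight matrix used here.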
Nonlinear Dimension Reduction as an Approach to Nonlinear Data
• The eigenfunctions of the Laplacian form a basis, with the eigenvalues playing the role of frequencies, thus allowing one to develop a harmonic or Fourier analysis on graphs
• This set of basis functions captures patterns intrinsic to a particular state space
• The approach finds a lower-dimensional representation of high-dimensional data without losing a significant amount of information
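A small sketch of Fourier analysis on a graph, assuming the simplest possible case: a cycle graph with unit weights, for which the Laplacian eigenvectors are the familiar sines and cosines. The graph, the test signal, and the choice to keep two coefficients are illustrative assumptions.

```python
import numpy as np

n = 64
idx = np.arange(n)

# Cycle graph: node k is joined to node k+1 (mod n) with unit weight.
W = np.zeros((n, n))
W[idx, (idx + 1) % n] = 1.0
W[(idx + 1) % n, idx] = 1.0
L = np.diag(W.sum(axis=1)) - W

# Eigenvectors (columns) form an orthonormal basis;
# the eigenvalues act as squared frequencies.
eigenvalues, eigenvectors = np.linalg.eigh(L)

# A smooth signal on the graph: three full oscillations around the cycle.
signal = np.cos(2.0 * np.pi * 3 * idx / n)

# Graph Fourier transform: expand the signal in the Laplacian eigenbasis.
coefficients = eigenvectors.T @ signal

# Keep only the two largest coefficients and transform back.
keep = np.argsort(np.abs(coefficients))[-2:]
truncated = np.zeros(n)
truncated[keep] = coefficients[keep]
reconstruction = eigenvectors @ truncated

print("max reconstruction error:", np.max(np.abs(reconstruction - signal)))
```

Here 64 samples are represented by two coefficients because the signal lies in a single two-dimensional eigenspace; on a general data graph the same expansion applies, but the number of coefficients worth keeping depends on how smooth the signal is with respect to the graph.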
Nonlinear Dimension Reduction and Harmonic Analysis on Manifolds and Graphs
• We have devised innovative algorithms for nonlinear dimension reduction and data compression that:
  • enable one to overcome PCA’s limitations in handling nonlinear data manifolds
  • allow one to deal effectively with: 1) missing observations, 2) partial sky coverage, 3) non-regular sampling
• For details:
  • Pesenson I., 2009, J. of Geometric Analysis, 19 (2), 390
  • Pesenson I., Pesenson M., 2010, J. of Math. Analysis and Applications, accepted
  • Pesenson I., Pesenson M., 2010, J. of Fourier Analysis and Applications, accepted
  • Pesenson M., Pesenson I., McCollum B., 2010, Advances in Astronomy, accepted
Visualization - Multispectral
From a set of images obtained at multiple wavebands, effective dimension reduction provides a single comprehensible, information-rich image with minimal loss of information and statistical detail, unlike simple coadding with arbitrary, empirical weights
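As a rough illustration of the contrast with coadding, the sketch below uses plain PCA as a linear stand-in for the dimension reduction step (the presentation does not specify the method used for this slide, and the synthetic cube, gains, and noise level are assumptions). Band weights are derived from the data, and variance retained serves as a crude proxy for information.

```python
import numpy as np

def fuse_bands_pca(cube):
    """Project a (bands, height, width) cube onto its first principal component."""
    n_bands, height, width = cube.shape
    pixels = cube.reshape(n_bands, -1).T            # one row per pixel
    centered = pixels - pixels.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    weights = vt[0]                                 # unit-norm band weights
    fused = (centered @ weights).reshape(height, width)
    return fused, weights, centered

# Synthetic cube: one scene seen in 5 bands with different gains plus noise.
rng = np.random.default_rng(1)
scene = rng.random((32, 32))
gains = np.array([1.0, 0.8, 0.5, 0.2, 0.1])
cube = gains[:, None, None] * scene + 0.05 * rng.normal(size=(5, 32, 32))

fused, weights, centered = fuse_bands_pca(cube)

# Equal-weight coadd with the same unit-norm normalization, for comparison.
equal_weights = np.ones(5) / np.sqrt(5)
coadd = (centered @ equal_weights).reshape(32, 32)

print("data-driven band weights:", np.round(weights, 3))
print("variance of PCA image:", fused.var())
print("variance of equal-weight coadd:", coadd.var())
```

Among all unit-norm weightings of the bands, the first principal component retains the most variance, so it can never do worse than the equal-weight coadd on this measure.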
Manifold-Valued Data and Data Lying on Manifolds
• Applications:
  • Cosmic Microwave Background (CMB)
  • Gorski K., et al. 2005
  • Solar Astrophysics
• A powerful approach to the problem is based on needlets, second-generation spherical wavelets
  • Geller D., & Marinucci D., 2008
Manifold-Valued Data and Data Lying on Manifolds (cont.)
• Important properties of needlets that are not shared by other spherical wavelet constructions:
  • they do not rely on any kind of tangent plane approximation
  • they have good localization properties in both pixel and harmonic space
  • needlet coefficients are asymptotically uncorrelated at any fixed angular distance (which makes their use in statistical procedures very promising)
• Pesenson I., 2006, Integral Geometry and Tomography, Contemporary Mathematics, 405, 135-148, American Mathematical Society
• Geller D., Pesenson I., 2010, Tight Frames and Besov Spaces on Compact Homogeneous Manifolds, J. of Geometric Analysis (accepted)
Unsupervised Manifold Learning and Information Visualization
• Manifold Learning and Visualization based on Nonlinear Dynamics
• One needs to distinguish between geometrically nonlinear data and nonlinear methods of analysis
Unsupervised Manifold Learning: A Nonlinear Approach
• Approximating a multidimensional image or a data set by a graph and associating a nonlinear dynamical system with each node enables us to unify three seemingly unrelated tasks:
  • image segmentation
  • unsupervised learning
  • data visualization
Testing the Algorithm: a Simulated 3D Set of 10^3 Uniformly Distributed Random Points with a Double-Diamond Pattern
• Left and middle: two screenshots from a running animation; each point in the set oscillates (in this case in 3 dimensions) with its own random frequency
• Right: synchronization makes the points that are connected by high-weight edges oscillate in phase, which reveals the pattern either visually or by automatically selecting the in-phase points and highlighting them in red
• Pesenson M., Pesenson I., McCollum B., 2010, Advances in Astronomy (accepted)
• Pesenson M., Pesenson I., 2010, Image Segmentation, Unsupervised Manifold Learning and Information Visualization: A Unified Approach Based on Nonlinear Dynamics (submitted)
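The synchronization idea can be illustrated with a small sketch that swaps in the standard Kuramoto phase-oscillator model; this is not the authors' algorithm, whose dynamical system is not specified here. The point set (a tight clump standing in for the pattern, plus scattered background points), the Gaussian edge weights, and all parameter values are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data: 20 "pattern" points in a tight clump and 30 background
# points scattered uniformly in a cube of side 10.
pattern = 5.0 + 0.2 * rng.normal(size=(20, 3))
background = rng.uniform(0.0, 10.0, size=(30, 3))
points = np.vstack([pattern, background])
is_pattern = np.arange(len(points)) < 20

# Gaussian-kernel edge weights: nearby points are joined by high-weight edges.
sq_dist = np.sum((points[:, None, :] - points[None, :, :])**2, axis=-1)
W = np.exp(-sq_dist / (2.0 * 0.5**2))
np.fill_diagonal(W, 0.0)

# Kuramoto model on the graph:
#   d(theta_i)/dt = omega_i + K * sum_j W_ij * sin(theta_j - theta_i)
omega = rng.uniform(-1.0, 1.0, size=len(points))   # each node's own frequency
theta = rng.uniform(0.0, 2.0 * np.pi, size=len(points))
K, dt, n_steps = 2.0, 0.01, 3000

for _ in range(n_steps):                            # forward-Euler integration
    phase_diff = theta[None, :] - theta[:, None]    # entry (i, j) = theta_j - theta_i
    theta = theta + dt * (omega + K * np.sum(W * np.sin(phase_diff), axis=1))

def coherence(phases):
    """Order parameter in [0, 1]; values near 1 mean the phases agree."""
    return np.abs(np.mean(np.exp(1j * phases)))

print("pattern coherence:", coherence(theta[is_pattern]))
print("background coherence:", coherence(theta[~is_pattern]))
```

Nodes joined by high-weight edges lock their phases, so grouping nodes by phase at the end of the run picks out the strongly connected structure without any labels.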
Conclusions
• Many important challenges have been identified by various authors and presentations
• Different groups have already been working on some of these problems:
  • The Center for Astrostatistics at PSU (E. Feigelson, G. Babu)
  • BIPS at Cornell (T. Loredo)
  • InCA at CMU (C. Schafer et al.)
  • SAMSI-SaFeDe Collaboration (V. Kashyap et al.)
  • Caltech (M. Pesenson et al.)
  • Caltech (G. Djorgovski et al.)
  • AstroNeural collaboration (G. Longo et al.)
  • Georgia Tech (A. Gray et al.)
  • GMU (K. Borne et al.)
  • IIC at Harvard (A. Goodman et al.)
Conclusions (cont.)
• The concepts and approaches described in this presentation also contribute to the actual steps in creating the needed novel approaches and algorithms
• All the described efforts, when combined, will enable effective automated analysis and processing of giant, complex data sets such as those from LSST