Data-Intensive Statistical Challenges in Astrophysics Alex Szalay The Johns Hopkins University Collaborators: T. Budavari, C-W Yip (JHU), M. Mahoney (Stanford), I. Csabai, L. Dobos (Hungary)
The Age of Surveys
Angular Galaxy Surveys (obj)
• 1970 Lick 1M
• 1990 APM 2M
• 2005 SDSS 200M
• 2011 PS1 1000M
• 2020 LSST 30000M
CMB Surveys (pixels)
• 1990 COBE 1000
• 2000 Boomerang 10,000
• 2002 CBI 50,000
• 2003 WMAP 1 Million
• 2008 Planck 10 Million
Time Domain
• QUEST
• SDSS Extension survey
• Dark Energy Camera
• Pan-STARRS
• LSST…
Galaxy Redshift Surveys (obj)
• 1986 CfA 3500
• 1996 LCRS 23000
• 2003 2dF 250000
• 2008 SDSS 1000000
• 2012 BOSS 2000000
• 2012 LAMOST 2500000
Petabytes/year …
Sloan Digital Sky Survey • “The Cosmic Genome Project” • Two surveys in one • Photometric survey in 5 bands • Spectroscopic redshift survey • Data is public • 2.5 Terapixels of images => 5 Tpx • 10 TB of raw data => 120TB processed • 0.5 TB catalogs => 35TB in the end • Started in 1992, finished in 2008 • Extra data volume enabled by • Moore’s Law • Kryder’s Law
Analysis of Galaxy Spectra • Sparse signal in large dimensions • Much noise, and very rare events • 4Kx1M SVD problem, perfect for randomized algorithms • Motivated our work on robust incremental PCA
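The "4Kx1M SVD problem" bullet can be illustrated with a randomized range finder followed by a small exact SVD. This is a minimal sketch, not the authors' code: the function name `randomized_svd`, the parameters `oversample` and `n_iter`, and the small synthetic matrix standing in for the spectra are all assumptions made for illustration.

```python
import numpy as np

def randomized_svd(A, k, oversample=10, n_iter=2, seed=0):
    """Approximate the top-k SVD of A via random projection (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n_cols = A.shape[1]
    # Project onto a random low-dimensional subspace to capture the range of A.
    omega = rng.standard_normal((n_cols, k + oversample))
    Q, _ = np.linalg.qr(A @ omega)
    # A few power iterations sharpen the subspace when singular values decay slowly.
    for _ in range(n_iter):
        Q, _ = np.linalg.qr(A.T @ Q)
        Q, _ = np.linalg.qr(A @ Q)
    # Solve the small projected problem exactly, then map back to the full space.
    B = Q.T @ A
    U_small, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ U_small)[:, :k], s[:k], Vt[:k]

# Synthetic stand-in (assumption): 2000 "spectra", 400 "wavelength bins", exact rank 5.
rng = np.random.default_rng(1)
A = rng.standard_normal((2000, 5)) @ rng.standard_normal((5, 400))
U, s, Vt = randomized_svd(A, k=5)
```

The expensive step is a handful of passes over the data matrix, while the exact SVD is only ever computed on the small projected matrix `B`.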
Galaxy Properties from Galaxy Spectra Spectral Lines Continuum Emissions
Galaxy Diversity from PCA
• 1st PC [Average Spectrum]
• 2nd PC [Stellar Continuum]
• 3rd PC [Finer Continuum Features + Age]
• 4th PC [Age]: Balmer series hydrogen lines
• 5th PC [Metallicity]: Mg b, Na D, Ca II Triplet
Streaming PCA • Initialization • Eigensystem of a small, random subset • Truncate at p largest eigenvalues • Incremental updates • Mean and the low-rank A matrix • SVD of A yields new eigensystem • Randomized algorithm! T. Budavari, D. Mishin 2011
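One way to read the "Mean and the low-rank A matrix" and "SVD of A yields new eigensystem" bullets is sketched below. It is an illustration under stated assumptions, not the Budavari and Mishin implementation: the names `init_eigensystem` and `update_eigensystem`, the forgetting factor `gamma`, and the exponentially weighted form of the update are all hypothetical choices.

```python
import numpy as np

def init_eigensystem(X, p):
    """Eigensystem of a small initial subset X (rows = observations), truncated at p."""
    mu = X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
    order = np.argsort(eigvals)[::-1][:p]  # indices of the p largest eigenvalues
    return mu, eigvecs[:, order], np.clip(eigvals[order], 0.0, None)

def update_eigensystem(mu, E, lam, x, gamma=0.99):
    """Fold one new observation x into the running mean and truncated eigensystem."""
    mu = gamma * mu + (1.0 - gamma) * x
    y = x - mu
    # Low-rank factor A with A @ A.T = gamma * E diag(lam) E.T + (1 - gamma) * y y.T
    A = np.column_stack([E * np.sqrt(gamma * lam), np.sqrt(1.0 - gamma) * y])
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    p = E.shape[1]
    return mu, U[:, :p], s[:p] ** 2  # new eigenvectors and eigenvalues

# Synthetic stream (assumption): 5-dimensional data dominated by one direction.
rng = np.random.default_rng(0)
direction = np.array([1.0, 0.0, 0.0, 0.0, 0.0])
stream = 3.0 * rng.standard_normal((520, 1)) * direction + 0.01 * rng.standard_normal((520, 5))
mu, E, lam = init_eigensystem(stream[:20], p=2)
for x in stream[20:]:
    mu, E, lam = update_eigensystem(mu, E, lam, x)
```

Each update costs one SVD of a matrix with only p + 1 columns, so the full covariance matrix is never formed after initialization.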
Robust PCA • PCA minimizes $\sigma_{\mathrm{RMS}}$ of the residuals $r = y - Py$ • Quadratic formula: $r^2$ extremely sensitive to outliers • We optimize a robust M-scale $\sigma^2$ (Maronna 2005) • Implicitly given by • Fits in with the iterative method! • Outliers can be processed separately
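The equation on the "Implicitly given by" bullet did not survive in this transcript. In the usual M-scale definition, $\sigma^2$ solves $\frac{1}{N}\sum_n \rho(r_n^2/\sigma^2) = \delta$ for a bounded loss $\rho$. The sketch below assumes that form, and the specific choices $\rho(t) = t/(1+t)$ and $\delta = 0.5$ are illustrative assumptions rather than the authors' settings. It also assumes more than half of the residuals are nonzero.

```python
import numpy as np

def rho(t):
    """Bounded loss on the scaled squared residual t = r^2 / sigma^2 (assumed form)."""
    return t / (1.0 + t)

def m_scale_squared(residuals, delta=0.5, n_iter=200):
    """Solve mean(rho(r^2 / sigma^2)) = delta for sigma^2 by bisection in log space."""
    r2 = np.asarray(residuals, dtype=float) ** 2
    lo, hi = 1e-12 * r2.max(), 1e12 * r2.max()
    for _ in range(n_iter):
        mid = np.sqrt(lo * hi)  # geometric midpoint of the bracket
        if np.mean(rho(r2 / mid)) > delta:
            lo = mid  # scale too small: scaled residuals look too large
        else:
            hi = mid
    return np.sqrt(lo * hi)

# One gross outlier among 100 residuals barely moves the robust scale.
residuals = np.concatenate([np.ones(99), [1000.0]])
robust_sigma2 = m_scale_squared(residuals)
classic_sigma2 = np.mean(residuals ** 2)
```

Because $\rho$ is bounded, a single outlier contributes at most 1/N to the left-hand side, whereas it dominates the mean of $r^2$ used by classic PCA.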
Eigenvalues in Streaming PCA (figure panels: Classic, Robust)
Examples with SDSS Spectra Built on top of the Incremental Robust PCA • Principal Component Pursuit (I. Csabai et al) • Importance sampling (C-W Yip et al)
Principal component pursuit * E. Candes, et al. “Robust Principal Component Analysis”. preprint, 2009. Abdelkefi et al. ACM CoNEXT Workshop (traffic anomaly detection) • Low rank approximation of data matrix: X • Standard PCA: • works well if the noise distribution is Gaussian • outliers can cause bias • Principal component pursuit • “sparse” spiky noise/outliers: try to minimize the number of outliers while keeping the rank low • NP-hard problem • The L1 trick: • numerically feasible convex problem (Augmented Lagrange Multiplier)
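The convex form referred to by "The L1 trick" is commonly written as minimizing $\|L\|_* + \lambda \|S\|_1$ subject to $X = L + S$. The sketch below solves it with a basic augmented Lagrange multiplier iteration. It is a sketch under assumptions, not the code used for the slides: the function names and the default values of `lam`, `mu`, `tol`, and `max_iter` are illustrative choices.

```python
import numpy as np

def soft_threshold(M, tau):
    """Shrink every entry of M toward zero by tau."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

def singular_value_threshold(M, tau):
    """Shrink the singular values of M toward zero by tau."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * soft_threshold(s, tau)) @ Vt

def principal_component_pursuit(X, lam=None, mu=None, tol=1e-6, max_iter=1000):
    """Split X into low-rank L plus sparse S (illustrative ALM iteration)."""
    X = np.asarray(X, dtype=float)
    m, n = X.shape
    lam = 1.0 / np.sqrt(max(m, n)) if lam is None else lam     # assumed default
    mu = m * n / (4.0 * np.abs(X).sum()) if mu is None else mu  # assumed default
    L, S, Y = np.zeros_like(X), np.zeros_like(X), np.zeros_like(X)
    x_norm = np.linalg.norm(X)
    for _ in range(max_iter):
        L = singular_value_threshold(X - S + Y / mu, 1.0 / mu)  # low-rank step
        S = soft_threshold(X - L + Y / mu, lam / mu)            # sparse step
        residual = X - L - S
        Y = Y + mu * residual                                   # multiplier update
        if np.linalg.norm(residual) <= tol * x_norm:
            break
    return L, S

# Synthetic check (assumption): rank-2 matrix with 5% large spiky outliers.
rng = np.random.default_rng(0)
L_true = rng.standard_normal((80, 2)) @ rng.standard_normal((2, 80))
S_true = np.zeros((80, 80))
spikes = rng.random((80, 80)) < 0.05
S_true[spikes] = 10.0 * rng.choice([-1.0, 1.0], size=spikes.sum())
L_hat, S_hat = principal_component_pursuit(L_true + S_true)
```

The two thresholding steps mirror the two terms of the objective: singular-value shrinkage keeps the rank low, and entrywise shrinkage keeps the outlier component sparse.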
Testing on Galaxy Spectra • Slowly varying continuum + absorption lines • Highly variable “sparse” emission lines • This is the simple version of PCP: the positions of the lines are known • but there are many of them, so automatic detection can be useful • spiky noise can bias standard PCA • DATA: • Streaming robust PCA implementation for galaxy spectrum catalog (L. Dobos et al.) • SDSS 1M galaxy spectra • Morphological subclasses • Robust averages + first few PCA directions
PCA (figure panels: PCA reconstruction, Residual)
Principal component pursuit (figure panels: Low rank, Sparse, Residual; λ = 0.6/sqrt(n), ε = 0.03)
Not Every Data Direction is Equal
A = C X (figure: matrix diagram with axes labeled Galaxy ID, Wavelength, and Selected Wavelengths)
Procedure:
1. Perform SVD of $A = U \Sigma V^T$
2. Pick number of eigenvectors = K
3. Calculate Leverage Score $= \sum_i \|V^T_{ij}\|^2 / K$
Mahoney and Drineas 2009
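The three-step procedure can be sketched in a few lines of NumPy. The function names `leverage_scores` and `sample_columns` are hypothetical, and drawing `c` distinct columns with probability proportional to the score is a simplification assumed here, not necessarily the sampling scheme used for the slides.

```python
import numpy as np

def leverage_scores(A, k):
    """Normalized leverage score of each column of A from the top-k right singular vectors."""
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    return (Vt[:k] ** 2).sum(axis=0) / k  # sums to 1 over the columns

def sample_columns(A, k, c, seed=0):
    """Draw c distinct column indices with probability proportional to leverage (assumed scheme)."""
    rng = np.random.default_rng(seed)
    p = leverage_scores(A, k)
    p = p / p.sum()  # guard against rounding before sampling
    cols = np.sort(rng.choice(A.shape[1], size=c, replace=False, p=p))
    return cols, A[:, cols]

# Synthetic stand-in (assumption): 50 "galaxies", 30 "wavelengths", one dominant column.
rng = np.random.default_rng(0)
A = 0.01 * rng.standard_normal((50, 30))
A[:, 7] += 10.0
scores = leverage_scores(A, k=1)
cols, C = sample_columns(A, k=3, c=5)
```

Since the rows of $V^T$ are orthonormal, the scores form a probability distribution over wavelengths, which is what makes them usable as sampling weights.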
Wavelength Sampling Probability (figure panels: k = 2, c = 7; k = 4, c = 16; k = 6, c = 25; k = 8, c = 29)
Ranking Astronomical Line Indices • Subspace Analysis of Spectra Cutouts: • Orthogonality • Divergence • Commonality (Worthey et al. 94; Trager et al. 98) (Yip et al. 2012 in prep.)
Identify Informative Regions
“NewMethod”
1. Pick the λ with largest Pλ
2. Define its region of influence using Σλ Pλ convergence. Mask λ’s from future selection.
3. Go back to Step 1, or quit.
“MahoneySecond”
1. Over-select λ’s from the targeted number.
2. Merge selected λ if two pixels lie within a certain distance
3. Quit.
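The “MahoneySecond” steps (over-select, then merge nearby pixels) can be sketched as follows. The function name and its parameters are hypothetical, and taking the top-scoring wavelengths deterministically instead of sampling them is an assumption made to keep the example reproducible.

```python
import numpy as np

def merge_selected_wavelengths(wavelengths, scores, n_target, oversample=10, max_gap=30.0):
    """Over-select high-score wavelengths, then merge neighbours closer than max_gap."""
    wavelengths = np.asarray(wavelengths, dtype=float)
    scores = np.asarray(scores, dtype=float)
    n_pick = min(len(wavelengths), n_target * oversample)
    top = np.argsort(scores)[::-1][:n_pick]  # indices of the highest scores
    picked = np.sort(wavelengths[top])
    regions = [[picked[0], picked[0]]]
    for w in picked[1:]:
        if w - regions[-1][1] < max_gap:
            regions[-1][1] = w      # extend the current region
        else:
            regions.append([w, w])  # start a new region
    return [(float(lo), float(hi)) for lo, hi in regions]

# Toy input (assumption): six pixels in Angstroms with made-up scores.
wl = [4000.0, 4010.0, 4020.0, 4100.0, 5000.0, 5005.0]
sc = [0.30, 0.20, 0.10, 0.05, 0.20, 0.15]
regions = merge_selected_wavelengths(wl, sc, n_target=1, oversample=5, max_gap=30.0)
```

With these toy inputs the five highest-scoring pixels collapse into two regions, 4000 to 4020 and 5000 to 5005, while the low-score pixel at 4100 is never selected.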
Identifying New Line Indices, Objectively (Yip et al. 2012 in prep.)
New Spectral Regions (MahoneySecond; k = 5; Overselecting 10 X; Combining if < 30 Å)
NewMethod vs MahoneySecond (figure panels: NM, M2)
Angle between Subspaces (figure panels: JHU, Lick)
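The slides do not say how the angle between subspaces is computed. A common choice is the set of principal angles, obtained from the singular values of the product of two orthonormal bases. The sketch below assumes that definition, and the name `principal_angles` is hypothetical.

```python
import numpy as np

def principal_angles(A, B):
    """Principal angles (radians, ascending) between the column spaces of A and B."""
    Qa, _ = np.linalg.qr(A)  # orthonormal basis for span(A)
    Qb, _ = np.linalg.qr(B)  # orthonormal basis for span(B)
    cosines = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    return np.arccos(np.clip(cosines, -1.0, 1.0))

# Toy subspaces (assumption): coordinate planes in 4 dimensions.
I4 = np.eye(4)
plane_01 = I4[:, [0, 1]]
plane_23 = I4[:, [2, 3]]
plane_02 = I4[:, [0, 2]]
angles_orthogonal = principal_angles(plane_01, plane_23)
angles_shared_axis = principal_angles(plane_01, plane_02)
```

Angles near π/2 indicate regions that carry independent information, which is the sense in which the informative regions are later described as orthogonal to each other.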
Σλ Pλ (figure panels: JHU, Lick)
Importance Sampling and Galaxies • Lick indices are ad hoc • The new indices are objective • Recover atomic lines • Recover molecular bands • Recover Lick indices • Informative regions are orthogonal to each other, in contrast to Lick • Future • Emission line indices • More accurate parameter estimation of galaxies
Summary • Astronomy has always been data-driven… now becoming more generally accepted • Non-incremental changes on the way • Science is moving increasingly from hypothesis-driven to data-driven discoveries • Need randomized, incremental algorithms • Best result in 1 min, 1 hour, 1 day, 1 week • New computational tools and strategies … not just statistics, not just computer science, not just astronomy, not just genomics…