Data-Intensive Statistical Challenges in Astrophysics

Alex Szalay, The Johns Hopkins University. Collaborators: T. Budavari, C-W Yip (JHU), M. Mahoney (Stanford), I. Csabai, L. Dobos (Hungary).


  1. Data-Intensive Statistical Challenges in Astrophysics Alex Szalay The Johns Hopkins University Collaborators: T. Budavari, C-W Yip (JHU), M. Mahoney (Stanford), I. Csabai, L. Dobos (Hungary)

  2. The Age of Surveys
     • Angular Galaxy Surveys (obj): 1970 Lick 1M; 1990 APM 2M; 2005 SDSS 200M; 2011 PS1 1000M; 2020 LSST 30000M
     • CMB Surveys (pixels): 1990 COBE 1000; 2000 Boomerang 10,000; 2002 CBI 50,000; 2003 WMAP 1 Million; 2008 Planck 10 Million
     • Time Domain: QUEST; SDSS Extension survey; Dark Energy Camera; Pan-STARRS; LSST…
     • Galaxy Redshift Surveys (obj): 1986 CfA 3500; 1996 LCRS 23000; 2003 2dF 250000; 2008 SDSS 1000000; 2012 BOSS 2000000; 2012 LAMOST 2500000
     • Petabytes/year …

  3. Sloan Digital Sky Survey • “The Cosmic Genome Project” • Two surveys in one • Photometric survey in 5 bands • Spectroscopic redshift survey • Data is public • 2.5 Terapixels of images => 5 Tpx • 10 TB of raw data => 120TB processed • 0.5 TB catalogs => 35TB in the end • Started in 1992, finished in 2008 • Extra data volume enabled by • Moore’s Law • Kryder’s Law

  4. Analysis of Galaxy Spectra • Sparse signal in large dimensions • Much noise, and very rare events • 4Kx1M SVD problem, perfect for randomized algorithms • Motivated our work on robust incremental PCA
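
As an illustration of why a tall, low-rank SVD problem suits randomized algorithms, here is a minimal sketch of a randomized range finder followed by a small exact SVD. It is a generic textbook-style scheme, not necessarily the one used in the talk; the function name and the `oversample` and `n_power` parameters are assumptions made for this example.

```python
import numpy as np

def randomized_svd(A, k, oversample=10, n_power=2, seed=0):
    """Approximate top-k SVD of A via a random projection (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # Project onto a random subspace slightly larger than the target rank.
    Omega = rng.normal(size=(A.shape[1], k + oversample))
    Q, _ = np.linalg.qr(A @ Omega)
    # A few power iterations sharpen the captured subspace for noisy spectra.
    for _ in range(n_power):
        Q, _ = np.linalg.qr(A.T @ Q)
        Q, _ = np.linalg.qr(A @ Q)
    # Exact SVD of the small projected matrix only.
    B = Q.T @ A
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :k], s[:k], Vt[:k]
```

The expensive full SVD is replaced by a few matrix products with the data and one SVD of a matrix with only `k + oversample` rows.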

  5. Galaxy Properties from Galaxy Spectra Spectral Lines Continuum Emissions

  6. Galaxy Diversity from PCA PC 1st [Average Spectrum] 2nd [Stellar Continuum] 3rd [Finer Continuum Features + Age] 4th [Age] Balmer series hydrogen lines 5th [Metallicity] Mg b, Na D, Ca II Triplet

  7. Streaming PCA • Initialization • Eigensystem of a small, random subset • Truncate at p largest eigenvalues • Incremental updates • Mean and the low-rank A matrix • SVD of A yields new eigensystem • Randomized algorithm! T. Budavari, D. Mishin 2011
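
The initialization and update steps on this slide can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the forgetting factor `gamma` and the function names are hypothetical, and the low-rank matrix A is built so that A Aᵀ equals the down-weighted old covariance plus the contribution of the new spectrum.

```python
import numpy as np

def init_eigensystem(X0, p):
    """Eigensystem of a small subset, truncated at the p largest eigenvalues."""
    mu = X0.mean(axis=0)
    # SVD of the centered subset yields the covariance eigenvectors.
    _, s, Vt = np.linalg.svd(X0 - mu, full_matrices=False)
    E = Vt[:p].T                  # (d, p), eigenvectors as columns
    lam = s[:p] ** 2 / len(X0)    # matching eigenvalues
    return mu, E, lam

def update(mu, E, lam, x, gamma=0.99):
    """One incremental step; gamma is an assumed forgetting factor."""
    mu = gamma * mu + (1.0 - gamma) * x
    y = x - mu
    # A @ A.T == gamma * E diag(lam) E.T + (1 - gamma) * outer(y, y)
    A = np.hstack([E * np.sqrt(gamma * lam),
                   np.sqrt(1.0 - gamma) * y[:, None]])
    # SVD of the thin (d, p + 1) matrix gives the new eigensystem.
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    p = E.shape[1]
    return mu, U[:, :p], s[:p] ** 2
```

Each update costs one SVD of a d × (p + 1) matrix, so the full data set never has to be held in memory.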

  8. Robust PCA • PCA minimizes σ_RMS of the residuals r = y – Py • Quadratic formula: r² extremely sensitive to outliers • We optimize a robust M-scale σ² (Maronna 2005) • Implicitly given by • Fits in with the iterative method! • Outliers can be processed separately
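
For reference, an M-scale is conventionally defined as the solution σ² of an implicit equation of the following form. The notation here (N residuals r_n, a bounded robust function ρ, and a constant δ between 0 and 1) is an assumption for illustration and may differ from what the slide showed.

```latex
\frac{1}{N} \sum_{n=1}^{N} \rho\!\left( \frac{r_n^2}{\sigma^2} \right) = \delta
```

Because ρ is bounded, a single large residual can contribute at most a fixed amount to the sum, which is what makes the scale insensitive to outliers.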

  9. Eigenvalues in Streaming PCA Classic Robust

  10. Examples with SDSS Spectra Built on top of the Incremental Robust PCA • Principal Component Pursuit (I. Csabai et al) • Importance sampling (C-W Yip et al)

  11. Principal component pursuit * E. Candes, et al. “Robust Principal Component Analysis”. preprint, 2009. Abdelkefi et al. ACM CoNEXT Workshop (traffic anomaly detection) • Low rank approximation of data matrix: X • Standard PCA: • works well if the noise distribution is Gaussian • outliers can cause bias • Principal component pursuit • “sparse” spiky noise/outliers: try to minimize the number of outliers while keeping the rank low • NP-hard problem • The L1 trick: • numerically feasible convex problem (Augmented Lagrange Multiplier)
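
The convex relaxation mentioned in the last bullet can be sketched as an alternating scheme that splits X into a low-rank part L and a sparse part S by minimizing the nuclear norm of L plus λ times the L1 norm of S. The code below is a minimal illustration in the style of the augmented Lagrange multiplier method, not the implementation used in the talk; the function names, the default λ = 1/sqrt(max(m, n)), and the fixed step parameter `mu` are assumptions.

```python
import numpy as np

def soft_threshold(M, tau):
    """Shrink every entry of M toward zero by tau."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

def svd_threshold(M, tau):
    """Shrink the singular values of M by tau (promotes low rank)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * soft_threshold(s, tau)) @ Vt

def pcp(X, lam=None, tol=1e-7, max_iter=500):
    """Split X into low-rank L and sparse S (illustrative sketch)."""
    m, n = X.shape
    lam = 1.0 / np.sqrt(max(m, n)) if lam is None else lam
    mu = m * n / (4.0 * np.abs(X).sum())   # assumed fixed penalty parameter
    S = np.zeros_like(X)
    Y = np.zeros_like(X)                   # Lagrange multipliers
    for _ in range(max_iter):
        L = svd_threshold(X - S + Y / mu, 1.0 / mu)
        S = soft_threshold(X - L + Y / mu, lam / mu)
        R = X - L - S                      # constraint residual
        Y = Y + mu * R
        if np.linalg.norm(R) <= tol * np.linalg.norm(X):
            break
    return L, S
```

For spectra, the rows of X would be galaxies and the columns wavelengths, so that S collects the spiky emission lines and L the smooth continuum.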

  12. Testing on Galaxy Spectra • Slowly varying continuum + absorption lines • Highly variable “sparse” emission lines • This is the simple version of PCP: the positions of the lines are known • but there are many of them, so automatic detection can be useful • spiky noise can bias standard PCA • DATA: • Streaming robust PCA implementation for galaxy spectrum catalog (L. Dobos et al.) • SDSS 1M galaxy spectra • Morphological subclasses • Robust averages + first few PCA directions

  13. PCA PCA reconstruction Residual

  14. Principal component pursuit Low rank Sparse Residual λ=0.6/sqrt(n), ε=0.03

  15. Not Every Data Direction is Equal [Figure: matrix factorization A = C X; axes: Galaxy ID, Wavelength, Selected Wavelengths] Procedure: 1. Perform SVD of A = U Σ V^T 2. Pick number of eigenvectors = K 3. Calculate Leverage Score = Σ_i ||V^T_ij||² / K (Mahoney and Drineas 2009)
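
The three-step procedure can be sketched as follows, assuming A has galaxies as rows and wavelengths as columns. The function names are hypothetical, and drawing columns without replacement in `sample_columns` is one simple choice made for this example rather than the scheme used in the talk.

```python
import numpy as np

def leverage_scores(A, k):
    """Normalized leverage score of each column of A from its top-k right singular vectors."""
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    # Sum of squared entries over the k leading rows of V^T, divided by k,
    # so the scores form a probability distribution over columns.
    return (Vt[:k] ** 2).sum(axis=0) / k

def sample_columns(p, c, seed=0):
    """Draw c distinct column indices with probabilities p (assumed sampling scheme)."""
    rng = np.random.default_rng(seed)
    return np.sort(rng.choice(len(p), size=c, replace=False, p=p))
```

Columns with high scores are the wavelengths that carry most of the structure in the leading k eigenvectors.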

  16. Wavelength Sampling Probability k = 2 c = 7 k = 4 c = 16 k = 6 c = 25 k = 8 c = 29

  17. Ranking Astronomical Line Indices • Subspace Analysis of Spectra Cutouts: • Orthogonality • Divergence • Commonality (Worthey et al. 94; Trager et al. 98) (Yip et al. 2012 in prep.)

  18. Identify Informative Regions “NewMethod” • Pick the λ with largest P_λ • Define its region of influence using Σ_λ P_λ convergence. Mask λ’s from future selection. • Go back to Step 1, or quit. “MahoneySecond” • Over-select λ’s from the targeted number. • Merge selected λ if two pixels lie within a certain distance • Quit.
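
The merge step of “MahoneySecond” can be sketched as a single pass over the sorted selected wavelengths. This is an illustrative reading of the slide, not the authors' code; the function name and the `max_gap` parameter (in the same units as the wavelengths) are assumptions.

```python
def merge_selected(wavelengths, max_gap):
    """Merge selected wavelengths into regions when neighbors are closer than max_gap."""
    regions = []
    for w in sorted(wavelengths):
        if regions and w - regions[-1][1] < max_gap:
            regions[-1][1] = w          # extend the current region
        else:
            regions.append([w, w])      # start a new region
    return [tuple(r) for r in regions]

# Example with made-up wavelengths in Å:
print(merge_selected([5175, 4000, 5190, 5895, 4010, 5200], max_gap=30))
# → [(4000, 4010), (5175, 5200), (5895, 5895)]
```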

  19. Identifying New Line Indices, Objectively (Yip et al. 2012 in prep.)

  20. New Spectral Regions (MahoneySecond; k = 5; Overselecting 10 X; Combining if < 30 Å)

  21. NewMethod vs MahoneySecond NM M2

  22. (Gunawan & Neswan 2000)

  23. Angle between Subspaces JHU Lick
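
One common way to quantify the angle between two subspaces is through principal angles, obtained from the singular values of the product of orthonormal bases. The sketch below uses that standard construction; it may differ from the specific definition used on the slide, and the function name is an assumption.

```python
import numpy as np

def principal_angles(A, B):
    """Principal angles (radians, ascending) between the column spaces of A and B."""
    Qa, _ = np.linalg.qr(A)   # orthonormal basis for span(A)
    Qb, _ = np.linalg.qr(B)   # orthonormal basis for span(B)
    # Singular values of Qa^T Qb are the cosines of the principal angles.
    s = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    return np.arccos(np.clip(s, -1.0, 1.0))
```

Angles near 90° indicate that two spectral regions carry nearly independent information, while angles near 0° indicate redundancy.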

  24. Σ_λ P_λ JHU Lick

  25. Line Indices for Galaxy Parameter Estimations

  26. Importance Sampling and Galaxies • Lick indices are ad hoc • The new indices are objective • Recover atomic lines • Recover molecular bands • Recover Lick indices • Informative regions are orthogonal to each other, in contrast to Lick • Future • Emission line indices • More accurate parameter estimation of galaxies

  27. Summary • Astronomy has always been data-driven… now becoming more generally accepted • Non-incremental changes on the way • Science is moving increasingly from hypothesis-driven to data-driven discoveries • Need randomized, incremental algorithms • Best result in 1 min, 1 hour, 1 day, 1 week • New computational tools and strategies … not just statistics, not just computer science, not just astronomy, not just genomics…
