Explore advanced machine learning methods for analyzing massive astrophysical datasets to uncover insights about galaxy evolution and the early universe. Collaborative research with leading astrophysicists.
Machine Learning on Massive Datasets
Alexander Gray, Georgia Institute of Technology, College of Computing
FASTlab: Fundamental Algorithmic and Statistical Tools Laboratory
The FASTlab: Fundamental Algorithmic and Statistical Tools Laboratory (www.fast-lab.org)
• Alexander Gray: Assoc. Prof., Applied Math + CS; PhD CS
• Arkadas Ozakin: Research Scientist, Math + Physics; PhD Physics
• Dong Ryeol Lee: PhD student, CS + Math
• Ryan Riegel: PhD student, CS + Math
• Sooraj Bhat: PhD student, CS
• Wei Guan: PhD student, CS
• Nishant Mehta: PhD student, CS
• Parikshit Ram: PhD student, CS + Math
• William March: PhD student, Math + CS
• Hua Ouyang: PhD student, CS
• Ravi Sastry: PhD student, CS
• Long Tran: PhD student, CS
• Ryan Curtin: PhD student, EE
• Ailar Javadi: PhD student, EE
• Abhimanyu Aditya: Affiliate, CS; MS CS
• Plus 5-10 MS students and undergraduates
Instruments and data: exponential growth in dataset sizes [Science, Szalay & J. Gray, 2001]
• CMB maps: 1990 COBE, 1,000; 2000 Boomerang, 10,000; 2002 CBI, 50,000; 2003 WMAP, 1 million; 2008 Planck, 10 million
• Local redshift surveys: 1986 CfA, 3,500; 1996 LCRS, 23,000; 2003 2dF, 250,000; 2005 SDSS, 800,000
• Angular surveys: 1970 Lick, 1M; 1990 APM, 2M; 2005 SDSS, 200M; 2010 LSST, 2B
Happening everywhere!
• Molecular biology (cancer): microarray chips
• Network traffic (spam): fiber optics, 300M/day
• Simulations (Millennium): microprocessors, 1B
• Particle events (LHC): particle colliders, 1M/sec
The astrophysicist's questions:
• How did galaxies evolve?
• What was the early universe like?
• Does dark energy exist?
• Is our model (GR+inflation) right?
Collaborating astrophysicists: R. Nichol, Inst. Cosmol. Gravitation; A. Connolly, U. Pitt Physics; C. Miller, NOAO; R. Brunner, NCSA; G. Djorgovsky, Caltech; G. Kulkarni, Inst. Cosmol. Gravitation; D. Wake, Inst. Cosmol. Gravitation; R. Scranton, U. Pitt Physics; M. Balogh, U. Waterloo Physics; I. Szapudi, U. Hawaii Inst. Astronomy; G. Richards, Princeton Physics; A. Szalay, Johns Hopkins Physics
The machine learning / statistics guy answers with methods, and their naive costs:
• Kernel density estimator O(N²)
• n-point spatial statistics O(Nⁿ)
• Nonparametric Bayes classifier O(N²)
• Support vector machine O(N³)
• Nearest-neighbor statistics O(N²)
• Gaussian process regression O(N³)
• Hierarchical clustering O(N³)
Astrophysicist: But I have 1 million points.
The challenge
State-of-the-art statistical methods give the best accuracy with the fewest assumptions, but we need them with orders-of-magnitude more efficiency, at large N (#data), D (#features), and M (#models).
Reduce the data? Use a simpler model? Approximate with poor or no error bounds? All lead to poor results.
What I like to think about… • The statistical problems and methods needed for answering scientific questions • The computational problems (bottlenecks) and methods involved in scaling all of them up to big datasets • The infrastructure-tailored software to make the application happen
How to do Stats/Machine Learning on Very Large Survey Datasets? • Choose the appropriate statistical task and method for the scientific question • Use the fastest algorithm and data structure for the statistical method • Apply method to the data!
10 data analysis problems, and scalable tools we'd like for them
1. Querying (e.g. characterizing a region of space):
• spherical range-search O(N)
• orthogonal range-search O(N)
• spatial join O(N²)
• k-nearest-neighbors O(N)
• all-k-nearest-neighbors O(N²)
[red: of high interest in astronomy]
2. Density estimation (e.g. comparing galaxy types):
• mixture of Gaussians
• kernel density estimation O(N²) (see the naive sketch below)
• kernel conditional density estimation O(N³)
• submanifold kernel density estimation O(N²) [Ozakin and Gray, NIPS 2009]
• hyperkernel density estimation O(N⁴) [Sastry and Gray 2010, to be submitted]
• L2 density estimation trees [Ram and Gray, under review]
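To make the O(N²) cost above concrete, here is a minimal naive kernel density estimation sketch (an illustrative toy with names of my choosing, not FASTlab's dual-tree code): every query point must visit every reference point.

```python
import numpy as np

def kde_naive(queries, references, h):
    """Naive Gaussian KDE: O(N_q * N_r) kernel evaluations.
    queries: (N_q, D); references: (N_r, D); h: bandwidth.
    Illustrative only -- dual-tree algorithms cut this to roughly O(N)."""
    nq, d = queries.shape
    nr = references.shape[0]
    norm = nr * (np.sqrt(2.0 * np.pi) * h) ** d
    densities = np.empty(nq)
    for i, q in enumerate(queries):                  # every query...
        sq = np.sum((references - q) ** 2, axis=1)   # ...visits every reference
        densities[i] = np.sum(np.exp(-0.5 * sq / h**2)) / norm
    return densities
```

On a 1-million-point survey this naive loop means on the order of 10¹² kernel evaluations, which is exactly the wall the talk is about.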
10 data analysis problems, and scalable tools we'd like for them
3. Regression (e.g. photometric redshifts):
• linear regression
• kernel regression O(N²)
• Gaussian process regression/kriging O(N³) (see the sketch below)
4. Classification (e.g. quasar detection, star-galaxy separation):
• decision tree
• k-nearest-neighbor classifier O(N²)
• nonparametric Bayes classifier O(N²), aka KDA
• support vector machine (SVM) O(N³)
• non-negative SVM O(N³) [Guan and Gray 2010, under review]
• isometric separation map O(N³) [Vasiloglou, Gray, and Anderson, MLSP 2008 (nominated for Best Paper)]
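The O(N³) in Gaussian process regression comes from factoring the N x N kernel matrix. A minimal posterior-mean sketch, assuming a squared-exponential kernel (illustrative names, not the authors' code):

```python
import numpy as np

def gp_predict(X, y, X_star, h=1.0, noise=1e-3):
    """GP regression posterior mean. The Cholesky factorization of the
    N x N kernel matrix is the O(N^3) bottleneck. Illustrative sketch."""
    def k(A, B):  # squared-exponential kernel matrix between point sets
        sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
        return np.exp(-0.5 * sq / h**2)
    K = k(X, X) + noise * np.eye(len(X))   # N x N kernel matrix
    L = np.linalg.cholesky(K)              # O(N^3) factorization
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return k(X_star, X) @ alpha            # posterior mean at test points
```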
10 data analysis problems, and scalable tools we'd like for them
5. Dimension reduction (e.g. galaxy or spectra characterization):
• principal component analysis (see the sketch below)
• non-negative matrix factorization
• kernel PCA (including LLE, IsoMap, et al.) O(N³)
• maximum variance unfolding O(N³)
• rank-based manifolds O(N³) [Ouyang and Gray, ICML 2008]
• isometric non-negative matrix factorization O(N³) [Vasiloglou, Gray, and Anderson, SDM 2009]
• density-preserving maps O(N³) [Ozakin and Gray 2010, to be submitted]
6. Outlier detection (e.g. new object types, data cleaning):
• by density estimation, by dimension reduction
• by robust Lp estimation [Ram, Riegel and Gray 2010, to be submitted]
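For contrast with the O(N³) kernel methods in this list, plain PCA needs only an SVD of the centered data matrix. A minimal sketch (illustrative, assuming D ≤ N):

```python
import numpy as np

def pca(X, k):
    """Project (N, D) data onto its top-k principal components via SVD.
    Cost is about O(N * D^2) for D <= N -- cheap next to O(N^3) kernel PCA."""
    Xc = X - X.mean(axis=0)                           # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                              # scores in k dimensions
```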
10 data analysis problems, and scalable tools we'd like for them
7. Clustering (e.g. automatic Hubble sequence):
• by dimension reduction, by density estimation
• k-means
• mean-shift segmentation O(N²)
• hierarchical clustering ("friends-of-friends") O(N³) (see the brute-force sketch below)
8. Time series analysis (e.g. asteroid tracking, variable objects):
• Kalman filter
• hidden Markov model
• trajectory tracking O(Nⁿ)
• functional independent component analysis [Mehta and Gray, SDM 2009]
• Markov matrix factorization [Tran, Wong, and Gray 2010, under review]
• generative mean map kernel [Mehta and Gray 2010, to be submitted]
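Friends-of-friends is essentially single-linkage clustering with a fixed linking length. A brute-force O(N²) sketch using union-find (my illustration, not the fast tree-based version):

```python
import numpy as np

def friends_of_friends(X, linking_length):
    """Brute-force friends-of-friends: link every pair closer than the
    linking length, then read off connected components via union-find.
    O(N^2) distance checks -- tree-based versions avoid most of them."""
    n = len(X)
    parent = list(range(n))
    def find(i):                           # root of i's group, path-halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i in range(n):
        d = np.linalg.norm(X[i + 1:] - X[i], axis=1)   # each pair once
        for j in np.nonzero(d < linking_length)[0] + i + 1:
            parent[find(i)] = find(j)      # merge the two groups
    return np.array([find(i) for i in range(n)])       # group label per point
```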
10 data analysis problems, and scalable tools we'd like for them
9. Feature selection and causality (e.g. which features predict star/galaxy):
• LASSO regression
• L1 SVM
• Gaussian graphical model inference and structure search
• discrete graphical model inference and structure search
• fractional-norm SVM [Guan and Gray 2010, under review]
• mixed-integer SVM [Guan and Gray 2010, to be submitted]
10. 2-sample testing and matching (e.g. cosmological validation, multiple surveys):
• minimum spanning tree O(N³)
• n-point correlation O(Nⁿ) (see the pair-counting sketch below)
• bipartite matching/Gaussian graphical model inference O(N³) [Riegel and Gray, in preparation]
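The O(Nⁿ) in n-point correlation is a loop over all n-tuples of points. For the 2-point case the core is just binned pair counting, as in this naive sketch (illustrative only; real estimators such as Landy-Szalay also need data-random and random-random counts):

```python
import numpy as np

def pair_counts(X, bin_edges):
    """Count data-data pairs falling in each radial bin: the O(N^2) core
    of a 2-point correlation estimate. n-point generalizes to O(N^n)."""
    n = len(X)
    counts = np.zeros(len(bin_edges) - 1, dtype=np.int64)
    for i in range(n):
        d = np.linalg.norm(X[i + 1:] - X[i], axis=1)   # each pair once
        counts += np.histogram(d, bins=bin_edges)[0]
    return counts
```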
Core methods of statistics / machine learning / mining
• Querying: spherical range-search O(N), orthogonal range-search O(N), spatial join O(N²), nearest-neighbor O(N), all-nearest-neighbors O(N²)
• Density estimation: mixture of Gaussians, kernel density estimation O(N²), kernel conditional density estimation O(N³)
• Regression: linear regression, kernel regression O(N²), Gaussian process regression O(N³)
• Classification: decision tree, nearest-neighbor classifier O(N²), nonparametric Bayes classifier O(N²), support vector machine O(N³)
• Dimension reduction: principal component analysis, non-negative matrix factorization, kernel PCA O(N³), maximum variance unfolding O(N³)
• Outlier detection: by density estimation or dimension reduction
• Clustering: by density estimation or dimension reduction, k-means, mean-shift segmentation O(N²), hierarchical clustering O(N³)
• Time series analysis: Kalman filter, hidden Markov model, trajectory tracking O(Nⁿ)
• Feature selection and causality: LASSO, L1 SVM, Gaussian graphical models, discrete graphical models
• 2-sample testing and matching: bipartite matching O(N³), n-point correlation O(Nⁿ)
Now pretty fast (2011)… made fast by us (* = fastest, ** = fastest in some settings):
• Querying: spherical range-search O(log N)*, orthogonal range-search O(log N)*, spatial join O(N)*, nearest-neighbor O(log N), all-nearest-neighbors O(N)
• Density estimation: mixture of Gaussians, kernel density estimation O(N), kernel conditional density estimation O(Nlog3)*
• Regression: linear regression, kernel regression O(N), Gaussian process regression O(N)*
• Classification: decision tree, nearest-neighbor classifier O(N), nonparametric Bayes classifier O(N)*, support vector machine O(N)/O(N²)
• Dimension reduction: principal component analysis, non-negative matrix factorization, kernel PCA O(N)*, maximum variance unfolding O(N)*
• Outlier detection: by density estimation or dimension reduction
• Clustering: by density estimation or dimension reduction, k-means, mean-shift segmentation O(N), hierarchical (FoF) clustering O(N log N)
• Time series analysis: Kalman filter, hidden Markov model, trajectory tracking O(Nlogn)*
• Feature selection and causality: LASSO, L1 SVM, Gaussian graphical models, discrete graphical models
• 2-sample testing and matching: bipartite matching O(N)**, n-point correlation O(Nlogn)*
How to do Stats/Machine Learning on Very Large Survey Datasets? • Choose the appropriate statistical task and method for the scientific question • Use the fastest algorithm and data structure for the statistical method • Apply method to the data!
Core computational problems
What are the basic mathematical operations, or bottleneck subroutines, that we can focus on developing fast algorithms for?
Core computational problems • Generalized N-body problems • Graphical model inference • Linear algebra • Optimization
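As a working template (my hedged paraphrase of the framework, not a verbatim definition from the talk), a generalized N-body problem aggregates a kernel over all pairs, or more generally all n-tuples, of points:

```latex
% Generalized N-body template (illustrative paraphrase):
% for each query point x_q, aggregate a kernel over all references x_r.
\forall q:\quad \Phi(x_q) \;=\; \bigoplus_{r=1}^{N} K(x_q, x_r)
% \bigoplus = \sum  gives kernel density estimation and kernel regression,
% \bigoplus = \min  gives nearest-neighbor search,
% \bigoplus = |\{r : \|x_q - x_r\| \le h\}|  gives range counting;
% n-point correlations aggregate over (n-1)-tuples of references instead,
% hence the O(N^n) naive cost.
```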
4 main computational bottlenecks: N-body, graphical models, linear algebra, optimization
• Querying: spherical range-search O(N), orthogonal range-search O(N), spatial join O(N²), nearest-neighbor O(N), all-nearest-neighbors O(N²)
• Density estimation: mixture of Gaussians, kernel density estimation O(N²), kernel conditional density estimation O(N³)
• Regression: linear regression, kernel regression O(N²), Gaussian process regression O(N³)
• Classification: decision tree, nearest-neighbor classifier O(N²), nonparametric Bayes classifier O(N²), support vector machine O(N³)
• Dimension reduction: principal component analysis, non-negative matrix factorization, kernel PCA O(N³), maximum variance unfolding O(N³)
• Outlier detection: by density estimation or dimension reduction
• Clustering: by density estimation or dimension reduction, k-means, mean-shift segmentation O(N²), hierarchical clustering O(N³)
• Time series analysis: Kalman filter, hidden Markov model, trajectory tracking O(Nⁿ)
• Feature selection and causality: LASSO, L1 SVM, Gaussian graphical models, discrete graphical models
• 2-sample testing and matching: bipartite matching O(N³), n-point correlation O(Nⁿ)
Mathematical/practical challenges • Return answers with error guarantees • Rigorous theoretical runtimes • Ensure small constants (small actual runtimes, not just good asymptotic runtimes)
Generalized N-body Problems I • How it appears: nearest-neighbor, sph range-search, ortho range-search • Common methods: locality sensitive hashing, kd-trees, metric trees, disk-based trees • Mathematical challenges: high dimensions, provable runtime, distribution-dependent analysis, parallel indexing • Mathematical topics: computational geometry, randomized algorithms
First example: spherical range search. How can we compute this efficiently?
kd-trees: the most widely-used space-partitioning tree [Bentley 1975], [Friedman, Bentley & Finkel 1977], [Moore & Lee 1995]
Range-count recursive algorithm: recurse down the tree, pruning whole nodes when possible. If the query ball entirely contains a node's bounding box, add the node's stored count without visiting its points (inclusion pruning); if the ball entirely misses the box, add nothing (exclusion pruning); otherwise recurse on the children. This is the fastest practical algorithm [Bentley 1975], and our algorithms can use any tree.
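A minimal sketch of that recursion (my illustration; the node class and leaf_size parameter are mine, and real implementations add many refinements):

```python
import numpy as np

class KDNode:
    """kd-tree node storing its bounding box and point count for pruning."""
    def __init__(self, points, leaf_size=16):
        self.lo, self.hi = points.min(axis=0), points.max(axis=0)
        self.count, self.points, self.left, self.right = len(points), None, None, None
        if len(points) <= leaf_size:
            self.points = points                   # leaf: keep the points
        else:
            dim = np.argmax(self.hi - self.lo)     # split the widest dimension
            order = points[:, dim].argsort()
            mid = len(points) // 2
            self.left = KDNode(points[order[:mid]], leaf_size)
            self.right = KDNode(points[order[mid:]], leaf_size)

def range_count(node, q, r):
    """Count points within distance r of q, pruning whole subtrees."""
    # min and max distance from q to the node's bounding box
    dmin = np.linalg.norm(np.maximum(node.lo - q, np.maximum(q - node.hi, 0)))
    dmax = np.linalg.norm(np.maximum(np.abs(q - node.lo), np.abs(q - node.hi)))
    if dmin > r:
        return 0                 # exclusion prune: box entirely outside ball
    if dmax <= r:
        return node.count        # inclusion prune: box entirely inside ball
    if node.points is not None:  # leaf: check its points directly
        return int(np.sum(np.linalg.norm(node.points - q, axis=1) <= r))
    return range_count(node.left, q, r) + range_count(node.right, q, r)
```

Usage is just `range_count(KDNode(data), query, radius)`; storing the count per node is what lets inclusion pruning answer for a whole subtree in O(1).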
Generalized N-body Problems I
• Interesting approach: cover trees [Beygelzimer et al 2004]: provable runtime; consistently good performance, even in higher dimensions
• Interesting approach: learning trees [Cayton et al 2007]: learning data-optimal data structures; improves performance over kd-trees
• Interesting approach: MapReduce [Dean and Ghemawat 2004]: brute-force, but makes HPC automatic for a certain problem form
• Interesting approach: approximation using spill-trees [Liu, Moore, Gray, and Yang, NIPS 2004]: faster/more accurate than LSH
• Interesting approach: approximation in rank [Ram, Ouyang and Gray, NIPS 2009]: approximating NN in terms of distance conflicts with known theoretical results, so is approximation in rank feasible? Faster/more accurate than LSH