Computational AstroStatistics
Bob Nichol (Carnegie Mellon)
• Motivation & Goals
• Multi-Resolutional KD-trees (examples)
• Npt functions (application)
• Mixture models (applications)
• Bayes network anomaly detection (application)
• Very high dimensional data
• NVO Problems
Collaborators
Pittsburgh Computational AstroStatistics (PiCA) Group
• Chris Miller, Percy Gomez, Kathy Romer, Andy Connolly, Andrew Hopkins, Mariangela Bernardi, Tomo Goto (Astro)
• Larry Wasserman, Chris Genovese, Woncheol Jang, Pierpaolo Brutti (Statistics)
• Andrew Moore, Jeff Schneider, Brigham Anderson, Alex Gray, Dan Pelleg (CS)
• Alex Szalay, Gordon Richards, Istvan Szapudi & others (SDSS)
(See http://www.picagroup.org)
First Motivation
• Cosmology is moving from a "discovery" science into a "statistical" science
• Drive for "high precision" measurements:
 • Cosmological parameters to a few percent
 • Accurate description of the complex structure in the universe
 • Control of observational and sampling biases
• New statistical tools, e.g. non-parametric analyses, are often computationally intensive. We also often want to re-sample or Monte Carlo the data.
Second Motivation
• The last decade was dedicated to building more telescopes and instruments, with more coming this decade (SDSS, Planck, LSST, 2MASS, DPOSS, MAP), plus larger simulations.
• We face a "data flood": SDSS is terabytes of data a night, while LSST is an SDSS every 5 nights. Petabytes by the end of the 00's.
• Highly correlated datasets and high dimensionality
• Existing statistics and algorithms do not scale into these regimes
• New paradigm: we must build new tools before we can analyze & visualize the data
SDSS Data
[Slide images: SDSS science highlights — the most distant object; 100,000 spectra; a "factor of 12,000,000".]
Goal: build new, fast & efficient statistical algorithms
Start with tree data structures: multi-resolutional kd-trees
• Scale to n dimensions (although for very high dimensions we use new tree structures)
• Use a cached representation: store summary sufficient statistics at each node and compute counts from these statistics
• Prune the tree, which is stored in memory
• See Moore et al. 2001 (astro-ph/0012333)
• Many applications; a suite of algorithms (a sketch of the node structure follows below)
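A minimal Python sketch (not the authors' code, which is described in Moore et al. 2001) of a kd-tree node that caches sufficient statistics; the class name, leaf size and splitting rule here are illustrative assumptions.

```python
# Sketch: a multi-resolutional kd-tree node with cached sufficient statistics.
import numpy as np

class KDNode:
    def __init__(self, points, leaf_size=32):
        points = np.asarray(points, dtype=float)
        # Cached sufficient statistics: count, sum, sum of outer products,
        # and the bounding box. The mean/covariance of any pruned subtree
        # can be recovered from these without touching the raw points.
        self.n = len(points)
        self.sum = points.sum(axis=0)
        self.sumsq = points.T @ points
        self.lo = points.min(axis=0)
        self.hi = points.max(axis=0)
        self.left = self.right = None
        if self.n > leaf_size:
            dim = np.argmax(self.hi - self.lo)      # split the widest dimension
            order = np.argsort(points[:, dim])
            mid = self.n // 2
            self.left = KDNode(points[order[:mid]], leaf_size)
            self.right = KDNode(points[order[mid:]], leaf_size)
        else:
            self.points = points                    # leaves keep the raw points

    def mean(self):
        return self.sum / self.n

    def cov(self):
        m = self.mean()
        return self.sumsq / self.n - np.outer(m, m)

# Usage: build once, then algorithms prune whole nodes using (n, sum, sumsq, lo, hi).
root = KDNode(np.random.rand(10000, 3))
print(root.n, root.mean())
```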
Range Searches
• Fast range searches and catalog matching
• Prune cells entirely outside the range
• Also prune cells entirely inside the range, counting them straight from the cached statistics: a greater saving in time
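A sketch of how the two prunings combine in a range count (points within radius r of a query), reusing the KDNode class from the sketch above; the function names and distance bounds are illustrative.

```python
# Sketch: pruned range count on the KDNode tree defined above.
import numpy as np

def min_max_dist(node, q):
    # Closest and farthest possible distance from q to the node's bounding box.
    nearest = np.clip(q, node.lo, node.hi)
    farthest = np.where(np.abs(node.lo - q) > np.abs(node.hi - q), node.lo, node.hi)
    return np.linalg.norm(q - nearest), np.linalg.norm(q - farthest)

def range_count(node, q, r):
    dmin, dmax = min_max_dist(node, q)
    if dmin > r:           # prune: box entirely outside the search ball
        return 0
    if dmax <= r:          # prune: box entirely inside -> use the cached count
        return node.n
    if node.left is None:  # leaf straddling the boundary: check its points
        return int(np.sum(np.linalg.norm(node.points - q, axis=1) <= r))
    return range_count(node.left, q, r) + range_count(node.right, q, r)

# Usage (with `root` built in the previous sketch):
# print(range_count(root, np.array([0.5, 0.5, 0.5]), 0.1))
```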
N-point correlation functions
The 2-point function has a long history in cosmology (Peebles 1980). It is the excess joint probability of finding a pair of points over that expected from a Poisson process; point processes also have a long history in statistics. The three-point function is defined similarly, and so on for higher orders.
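For reference, the standard definitions from Peebles (1980) that the slide alludes to, where dP is the joint probability of finding points in the volume elements dV_i, n is the mean number density, ξ the 2-point and ζ the (reduced) 3-point function:

```latex
\begin{align}
  dP_{12}  &= n^2 \,\bigl[\,1 + \xi(r_{12})\,\bigr]\, dV_1\, dV_2, \\
  dP_{123} &= n^3 \,\bigl[\,1 + \xi(r_{12}) + \xi(r_{23}) + \xi(r_{31})
              + \zeta(r_{12}, r_{23}, r_{31})\,\bigr]\, dV_1\, dV_2\, dV_3 .
\end{align}
```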
Same 2pt, very different 3pt: two point distributions can share the same 2-point function yet differ in their 3-point function. Naively an N-point count over n points is an O(n^N) computation, but it is really just a set of range searches.
Dual Tree Approach
Pair counts are usually binned into annuli rmin < r < rmax. For each annulus, traverse both trees and prune pairs of nodes with either dmax < rmin or dmin > rmax. Also, if dmin > rmin and dmax < rmax, all pairs between those nodes lie within the annulus and can be counted from the cached node counts. Therefore only node pairs cutting the annulus boundaries need their points examined. Extra speed-ups are possible by doing multiple annuli together and by controlled approximations.
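A minimal dual-tree pair-count sketch for a single annulus, again assuming the KDNode class sketched earlier; it prunes node pairs wholly outside the annulus, subsumes pairs wholly inside using the cached counts, and only opens node pairs that cut the annulus boundaries. It counts ordered pairs between two catalogues (e.g. data against randoms).

```python
# Sketch: dual-tree pair counting for one annulus [rmin, rmax].
import numpy as np

def node_dist_bounds(a, b):
    # Min/max possible distance between any point in box a and any point in box b.
    gap = np.maximum(0.0, np.maximum(a.lo - b.hi, b.lo - a.hi))
    span = np.maximum(a.hi - b.lo, b.hi - a.lo)
    return np.linalg.norm(gap), np.linalg.norm(span)

def pair_count(a, b, rmin, rmax):
    dmin, dmax = node_dist_bounds(a, b)
    if dmin > rmax or dmax < rmin:
        return 0                      # prune: every pair falls outside the annulus
    if dmin >= rmin and dmax <= rmax:
        return a.n * b.n              # subsume: every pair falls inside the annulus
    if a.left is None and b.left is None:
        d = np.linalg.norm(a.points[:, None, :] - b.points[None, :, :], axis=-1)
        return int(np.sum((d >= rmin) & (d <= rmax)))
    # Otherwise open the larger (non-leaf) node and recurse.
    if b.left is None or (a.left is not None and a.n >= b.n):
        return pair_count(a.left, b, rmin, rmax) + pair_count(a.right, b, rmin, rmax)
    return pair_count(a, b.left, rmin, rmax) + pair_count(a, b.right, rmin, rmax)
```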
[Timing figure: the naive N*N 2-point (and N*N*N 3-point) cost versus the roughly NlogN tree-based counts.] Time depends on the density of points, the bin size and the scale.
Fast Mixture Models
• Describe the data in N dimensions as a mixture of, say, Gaussians (the kernel shape matters less than the bandwidth)
• The parameters of the model are then a set of Gaussians, each with a mean and covariance
• Iterate, testing with BIC and AIC at each iteration; fast because of kd-trees (20 minutes for 100,000 points on a PC)
• A heuristic splitting algorithm is also employed
• Details in Connolly et al. 2000 (astro-ph/0008187)
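This is not the kd-tree-accelerated EM of Connolly et al. 2000, but the same model-selection idea can be sketched with scikit-learn: fit mixtures of increasing size and choose the number of Gaussians by BIC (AIC works the same way). The toy data here merely stands in for an N-dimensional colour catalogue.

```python
# Sketch: Gaussian mixture fitting with BIC model selection.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (500, 4)),        # toy "background" population
               rng.normal(4, 0.5, (300, 4))])     # toy second population

models = [GaussianMixture(n_components=k, covariance_type="full",
                          random_state=0).fit(X) for k in range(1, 8)]
bic = [m.bic(X) for m in models]
best = models[int(np.argmin(bic))]
print("components chosen by BIC:", best.n_components)
# best.means_ and best.covariances_ give the mean and covariance of each Gaussian.
```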
Applications
• Used in SDSS quasar selection, mapping the multi-colour stellar locus (Gordon Richards @ PSU)
• Anomaly detector: look for low-probability points in N dimensions
• Optimal smoothing of large-scale structure
SDSS QSO target selection in 4D colour space
• Cluster 9999 spectroscopically confirmed stars
• Cluster 8833 spectroscopically confirmed QSOs (33 Gaussians)
• Results: 99% for stars, 96% for QSOs
Bayes Net Anomaly Detector
• Instead of using a single joint probability function fitted to the data, factorize it into a smaller set of conditional probabilities
• The graph is directed and acyclic
• If we know the graph and the conditional probabilities, we have a valid probability function for the whole model
• Use 1.5 million SDSS sources (25 variables each) to learn the model
• Then evaluate the likelihood of each source being drawn from the model
• The lowest 1000 are anomalous; inspect them and follow them up at Keck
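A toy stand-in for this workflow (the real detector learns a 25-variable network from 1.5 million sources): discretized attributes, an assumed parent structure, Laplace-smoothed conditional tables, and a log-likelihood score with the lowest-scoring records flagged. Every variable name and probability below is illustrative.

```python
# Sketch: factorized (Bayes-net-style) likelihood scoring for anomaly detection.
import numpy as np
from collections import Counter

def fit_cpt(data, child, parent, alpha=1.0):
    """Laplace-smoothed P(child | parent) from discrete columns of `data`."""
    pair = Counter(zip(data[parent], data[child]))
    parent_tot = Counter(data[parent])
    child_vals = sorted(set(data[child]))
    return lambda c, p: ((pair[(p, c)] + alpha) /
                         (parent_tot[p] + alpha * len(child_vals)))

rng = np.random.default_rng(1)
n = 5000
data = {"type": rng.choice(["star", "galaxy"], n, p=[0.6, 0.4])}
data["colour"] = np.where(data["type"] == "star",
                          rng.choice(["blue", "red"], n, p=[0.8, 0.2]),
                          rng.choice(["blue", "red"], n, p=[0.2, 0.8]))
data["size"] = np.where(data["type"] == "star",
                        rng.choice(["point", "extended"], n, p=[0.95, 0.05]),
                        rng.choice(["point", "extended"], n, p=[0.1, 0.9]))

# Assumed structure: "type" is the parent of "colour" and "size".
prior = Counter(data["type"])
cpts = {"colour": fit_cpt(data, "colour", "type"),
        "size": fit_cpt(data, "size", "type")}

def log_likelihood(i):
    t = data["type"][i]
    ll = np.log(prior[t] / n)
    for var, cpt in cpts.items():
        ll += np.log(cpt(data[var][i], t))
    return ll

scores = np.array([log_likelihood(i) for i in range(n)])
anomalies = np.argsort(scores)[:10]   # the lowest-likelihood records
print(anomalies)
```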
• Unfortunately, many of the anomalies are simply data errors
• The advantage of the Bayes net is that it tells you why a source was anomalous: the most unusual conditional probabilities
• Therefore, iterate the loop: have a scientist highlight the obvious errors, then suppress those errors so they do not return again
• This is an issue of productivity
Will Only Get Worse
• LSST will do an SDSS every 5 nights looking for transient objects, producing petabytes of data (2007)
• VISTA will collect 300 terabytes of data (2005)
• Archival science is upon us: the HST database has 20 GB per day downloaded, 10 times more than goes in
Will Only Get Worse II
• Surveys spanning the electromagnetic spectrum
• Combining these surveys is hard: different sensitivities, resolutions and physics
• A mixture of imaging, catalogs and spectra
• The difference between continuum and point processes
• Thousands of attributes per source
What is the VO?
The "Virtual Observatory" must:
• Federate multi-wavelength data sources (interoperability)
• Empower everyone (democratise)
• Be fast, distributed and easy
• Allow input and output
Computer Science + Statistics!
• Scientists will need help through autonomous scientific discovery in large, multi-dimensional, correlated datasets
• Scientists will need fast databases
• Scientists will need distributed computing and fast networks
• Scientists will need new visualization tools
• CS and Statistics are looking for new challenges, and astronomy has no data-rights or privacy issues
• A new breed of students with IT skills is needed: a symbiotic relationship
VO Prototype
Ideally we would like all parts of the VO to be web services.
[Architecture diagram: HTTP requests connect a database (DB), a C#/.NET layer and the EM mixture-model code.]
Lessons We Learnt
• It is tough to marry research C code developed under Linux to MS (pointers to memory)
• .NET has "unsafe" memory
• The .NET server is hard to set up
• We are migrating to VOTables for all I/O, with the server running at CMU so we control the code
Very High Dimensions
Using LLE and Isomap to look for lower-dimensional manifolds in higher-dimensional spaces, e.g. a 500 x 2000 space built from SDSS spectra.
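A sketch of this kind of manifold search using scikit-learn's Isomap and LLE (not the authors' pipeline); the random matrix below merely stands in for a 500 x 2000 block of SDSS spectra.

```python
# Sketch: nonlinear dimensionality reduction of high-dimensional spectra.
import numpy as np
from sklearn.manifold import Isomap, LocallyLinearEmbedding

rng = np.random.default_rng(2)
spectra = rng.normal(size=(500, 2000))   # 500 spectra, 2000 wavelength bins

iso = Isomap(n_neighbors=10, n_components=3).fit_transform(spectra)
lle = LocallyLinearEmbedding(n_neighbors=10, n_components=3).fit_transform(spectra)
print(iso.shape, lle.shape)              # each embedding is (500, 3)
```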
Summary
• Era of new cosmology: massive data sources and the search for subtle features & high-precision measurements
• We need new methods that scale into these new regimes, "a virtual universe" (students will need different skills); a perfect synergy between Stats, CS and Physics
• Good algorithms are worth as much as faster and more numerous computers
• The "glue" needed to make a "virtual observatory" is hard and complex; don't under-estimate the job
Are the Features Real? (FDR)
This is an example of multiple hypothesis testing, e.g. is every point consistent with a smooth P(k)?
Let us first look at a simulated example: consider a 1000x1000 image with 40,000 sources. FDR makes 15 times fewer mistakes for the same power as a traditional 2-sigma threshold. Why? It controls a scientifically meaningful quantity:
FDR = (number of false discoveries) / (total number of discoveries)
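The thresholding behind this control is the Benjamini-Hochberg step-up procedure; a minimal sketch with illustrative p-values (the chosen threshold keeps the expected fraction of false discoveries among all discoveries at or below q):

```python
# Sketch: Benjamini-Hochberg FDR control over a set of p-values.
import numpy as np

def benjamini_hochberg(pvals, q=0.25):
    pvals = np.asarray(pvals)
    order = np.argsort(pvals)
    m = len(pvals)
    # Largest rank k with p_(k) <= k*q/m; reject all hypotheses up to that rank.
    below = np.nonzero(pvals[order] <= (np.arange(1, m + 1) * q / m))[0]
    rejected = np.zeros(m, dtype=bool)
    if below.size:
        rejected[order[:below[-1] + 1]] = True
    return rejected

pvals = np.concatenate([np.random.uniform(0, 1, 900),       # nulls
                        np.random.uniform(0, 0.001, 100)])  # real "features"
print(benjamini_hochberg(pvals, q=0.25).sum(), "points flagged as features")
```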
We used an FDR of 0.25, i.e. at most 25% of the circled points are expected to be in error. Therefore we can say, with statistical rigor, that most of these rejected points are real "features", even though no single point is a 3-sigma deviation. New statistics has enabled an astronomical discovery.