How to do Bayes-Optimal Classification with Massive Datasets: Large-scale Quasar Discovery. Alexander Gray, Georgia Institute of Technology, College of Computing. Joint work with Gordon Richards (Princeton), Robert Nichol (Portsmouth ICG), Robert Brunner (UIUC/NCSA), Andrew Moore (CMU)
What I do Often the most general and powerful statistical (or “machine learning”) methods are computationally infeasible. I design machine learning methods and fast algorithms to make such statistical methods possible on massive datasets (without sacrificing accuracy).
Quasar detection • Science motivation: use quasars to trace the distant/old mass in the universe • Thus we want lots of sky: SDSS DR1, 2099 square degrees, to g = 21 • Biggest quasar catalog to date: tens of thousands • Should be ~1.6M z<3 quasars to g=21
Classification • Traditional approach: look at 2-d color-color plot (UVX method) • doesn’t use all available information • not particularly accurate (~60% for relatively bright magnitudes) • Statistical approach: Pose as classification. • Training: Train a classifier on large set of known stars and quasars (‘training set’) • Prediction: The classifier will label an unknown set of objects (‘test set’)
Which classifier? • Statistical question: Must handle arbitrary nonlinear decision boundaries, noise/overlap • Computational question:We have 16,713 quasars from [Schneider et al. 2003] (.08<z<5.4), 478,144 stars (semi-cleaned sky sample) – way too big for many classifiers • Scientific question: We must be able to understand what it’s doing and why, and inject scientific knowledge
Which classifier? • Popular answers: • logistic regression: fast but linear only • naïve Bayes classifier: fast but quadratic only • decision tree: fast but not the most accurate • support vector machine: accurate but O(N³) • boosting: accurate but requires thousands of classifiers • neural net: reasonable compromise but awkward/human-intensive to train • The good nonparametric methods are also black boxes – hard/impossible to interpret
Main points of this talk • nonparametric Bayes classifier • can be made fast (algorithm design) • accurate and tractable science
Optimal decision theory [Figure: class-conditional densities f(x) for quasars and stars plotted against x; the optimal decision boundary lies where the two densities cross]
So how do you estimate an arbitrary density?
Kernel Density Estimation (KDE): f̂(x) = (1/N) Σᵢ Kₕ(x − xᵢ), for example (Gaussian kernel): Kₕ(u) ∝ exp(−‖u‖² / 2h²)
Kernel Density Estimation (KDE) • There is a principled way to choose the optimal smoothing parameter h • Guaranteed to converge to the true underlying density (consistency) • Nonparametric – distribution need not be known
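For concreteness, here is a minimal brute-force NumPy sketch of the estimator (function and argument names are my own; the fast tree-based version comes later in the talk):

```python
import numpy as np

def kde(queries, refs, h):
    """Gaussian kernel density estimate at each query point.
    Brute force O(M*N): queries is (M, d), refs is (N, d), h is the
    smoothing bandwidth."""
    d = refs.shape[1]
    norm = (2.0 * np.pi * h * h) ** (d / 2.0)       # Gaussian normalization
    # squared distances between every query and every reference point
    sq = ((queries[:, None, :] - refs[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq / (2.0 * h * h)).sum(axis=1) / (len(refs) * norm)
```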
Nonparametric Bayes Classifier (NBC) [1951] • Nonparametric – distribution can be arbitrary • This is Bayes-optimal, given the right densities • Very clear interpretation • Parameter choices are easy to understand, automatable • There's a way to enter prior information. Main obstacle: computational cost (naïvely O(N²) kernel evaluations).
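A minimal sketch of the classifier itself, reusing the `kde()` sketch above (per-class bandwidths are assumed here, since in practice they are tuned separately):

```python
import numpy as np  # reuses kde() from the sketch above

def nbc_predict(x, class_refs, priors, bandwidths):
    """Nonparametric Bayes classifier: return argmax_C P(C) * f_hat(x | C),
    with each class density estimated by KDE. Bayes-optimal when the
    density estimates are right."""
    best, best_score = None, -np.inf
    for c, refs in class_refs.items():
        score = priors[c] * kde(x[None, :], refs, bandwidths[c])[0]
        if score > best_score:
            best, best_score = c, score
    return best
```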
Main points of this talk • nonparametric Bayes classifier • can be made fast (algorithm design) • accurate and tractable science
kd-trees: most widely-used space-partitioning tree [Bentley 1975], [Friedman, Bentley & Finkel 1977] • Univariate axis-aligned splits • Split on widest dimension • O(N log N) to build, O(N) space
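A minimal sketch of building such a tree (my own simplified variant; `leaf_size` and the median split are illustrative choices):

```python
import numpy as np

class Node:
    def __init__(self, points, idx):
        self.points, self.idx = points, idx          # points in this node
        self.left = self.right = None
        self.lo = points.min(axis=0)                 # bounding-box corners
        self.hi = points.max(axis=0)

def build_kdtree(points, idx=None, leaf_size=32):
    """Axis-aligned kd-tree: split on the widest dimension at the median."""
    if idx is None:
        idx = np.arange(len(points))
    node = Node(points, idx)
    if len(points) > leaf_size:
        dim = np.argmax(node.hi - node.lo)           # widest dimension
        order = np.argsort(points[:, dim])
        mid = len(points) // 2
        node.left = build_kdtree(points[order[:mid]], idx[order[:mid]], leaf_size)
        node.right = build_kdtree(points[order[mid:]], idx[order[mid:]], leaf_size)
    return node
```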
For higher dimensions: ball-trees (computational geometry)
We have a fast algorithm for Kernel Density Estimation (KDE) • Generalization of N-body algorithms (multipole expansions optional) • Dual kd-tree traversal: O(N) • Works in arbitrary dimension • The fastest method to date [Gray & Moore 2003]
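To give the flavor of the dual-tree idea, here is a simplified recursion built on the `Node`/`build_kdtree` sketch above (the `box_gap` helper and the midpoint approximation are my own illustrative choices, not the exact algorithm of [Gray & Moore 2003]): whenever the kernel barely varies across all query-reference pairs under two nodes, one bounded update handles the whole pair of nodes at once.

```python
import numpy as np

def box_gap(a, b):
    """Min and max Euclidean distance between two nodes' bounding boxes."""
    gap = np.maximum(0.0, np.maximum(a.lo - b.hi, b.lo - a.hi))
    span = np.maximum(a.hi - b.lo, b.hi - a.lo)
    return np.linalg.norm(gap), np.linalg.norm(span)

def gauss(d, h):
    return np.exp(-d * d / (2.0 * h * h))

def dual_kde(q, r, h, eps, out):
    """Accumulate unnormalized Gaussian kernel sums for every query under
    node q against every reference under node r, pruning whole node-pairs
    whenever the kernel value is nearly constant across them."""
    dmin, dmax = box_gap(q, r)
    kmax, kmin = gauss(dmin, h), gauss(dmax, h)
    if kmax - kmin <= 2.0 * eps:                 # prune: one update, all pairs
        out[q.idx] += len(r.points) * 0.5 * (kmax + kmin)
    elif q.left is None and r.left is None:      # leaf-leaf: compute exactly
        for i, x in zip(q.idx, q.points):
            out[i] += gauss(np.linalg.norm(x - r.points, axis=1), h).sum()
    else:                                        # otherwise split and recurse
        for qc in ([q] if q.left is None else [q.left, q.right]):
            for rc in ([r] if r.left is None else [r.left, r.right]):
                dual_kde(qc, rc, h, eps, out)
```

Dividing `out` by N(2πh²)^(d/2) recovers the density estimates; `eps` controls the additive error per reference point.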
But we need a fast algorithm for the Nonparametric Bayes Classifier (NBC) We could just use the KDE algorithm for each class. But: • for the Gaussian kernel this is approximate • choosing the smoothing parameter to minimize (cross-validated) classification error is more accurate
Leave-one-out cross-validation Observations: • Doing bandwidth selection requires only prediction. • To predict class label, we don’t need to compute the full densities. Just which one is higher. We can make a fast exact algorithm for prediction
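A brute-force sketch of that observation (my own simplification; the tree machinery computes the same quantity fast). Only the sign of P(C₁)f − P(C₂)f matters, so the Gaussian normalization constant, shared by both classes when they share h, is omitted:

```python
import numpy as np

def loo_error(X, y, priors, h_grid):
    """Leave-one-out classification error for each candidate bandwidth h.
    Brute force O(N^2) per bandwidth."""
    classes = np.unique(y)
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)  # pairwise dists
    np.fill_diagonal(sq, np.inf)            # exp(-inf) = 0: leave self out
    errs = []
    for h in h_grid:
        K = np.exp(-sq / (2.0 * h * h))
        # per-class kernel sums (ignoring the N vs. N-1 leave-one-out
        # count, for brevity)
        scores = np.stack([priors[c] * K[:, y == c].sum(axis=1) / (y == c).sum()
                           for c in classes])
        pred = classes[np.argmax(scores, axis=0)]
        errs.append((pred != y).mean())
    return errs
```

Bandwidth selection is then just `h_grid[np.argmin(loo_error(X, y, priors, h_grid))]`.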
Fast NBC prediction algorithm 1. Build a tree for each class
Fast NBC prediction algorithm 2. For a query xq, obtain lower and upper bounds on P(C)f(xq|C) for each class [Figure: bound intervals on P(C1)f(xq|C1) and P(C2)f(xq|C2)]
Fast NBC prediction algorithm 3. Choose the next node-pair with priority = bound difference; tighten until the class bounds separate. Result: exact answers, 50-100x speedup.
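A sketch of the bound-tightening loop for a single query, reusing `Node` and `gauss` from the sketches above (the per-node priority here is one simple variant of the node-pair priority on the slide, and the shared kernel normalization is dropped since it cancels in the argmax):

```python
import heapq
from itertools import count
import numpy as np

def node_bounds(xq, node, w, h):
    """Bounds on this node's contribution to w * sum_i K(xq, x_i), from
    the min/max distance between xq and the node's bounding box."""
    dmin = np.linalg.norm(np.clip(xq, node.lo, node.hi) - xq)
    far = np.maximum(np.abs(xq - node.lo), np.abs(xq - node.hi))
    n = len(node.points)
    return n * w * gauss(np.linalg.norm(far), h), n * w * gauss(dmin, h)

def classify_query(xq, roots, priors, h):
    """Exact NBC label for one query: keep [lo, hi] bounds on P(C)f(xq|C)
    per class, refine the loosest node first, and stop as soon as one
    class's lower bound beats every other class's upper bound."""
    lo, hi, heap, tie = {}, {}, [], count()
    for c, root in roots.items():
        w = priors[c] / len(root.points)    # prior / N_C; kernel norm. cancels
        l, u = node_bounds(xq, root, w, h)
        lo[c], hi[c] = l, u
        heapq.heappush(heap, (l - u, next(tie), c, root, w, l, u))
    while heap:
        best = max(lo, key=lo.get)
        if all(lo[best] >= hi[c] for c in hi if c != best):
            break                           # bounds separated: label is exact
        _, _, c, node, w, l, u = heapq.heappop(heap)   # loosest node first
        lo[c] -= l; hi[c] -= u              # retract this node's coarse bounds
        if node.left is None:               # leaf: add its exact contribution
            exact = w * gauss(np.linalg.norm(xq - node.points, axis=1), h).sum()
            lo[c] += exact; hi[c] += exact
        else:                               # else: add the children's bounds
            for child in (node.left, node.right):
                cl, cu = node_bounds(xq, child, w, h)
                lo[c] += cl; hi[c] += cu
                heapq.heappush(heap, (cl - cu, next(tie), c, child, w, cl, cu))
    return max(lo, key=lo.get)
```

The label is returned as soon as the class intervals stop overlapping, usually long before the densities themselves are computed to full precision.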
Main points of this talk • nonparametric Bayes classifier • can be made fast (algorithm design) • accurate and tractable science
Resulting quasar catalog • 100,563 UVX quasar candidates • Of 22,737 objects with spectra, 97.6% are quasars; we estimate 95.0% efficiency overall (aka "purity": good/all) • 94.7% completeness w.r.t. g<19.5 UVX quasars from DR1 (good/all true) • Largest magnitude range ever: 14.2<g<21.0 • [Richards et al. 2004, ApJ] • More recently, 195k quasars
Cosmic magnification [Scranton et al. 2005] [Figure: lensing geometry; magnification means more area and more flux] 13.5M galaxies, 195,000 quasars. Most accurate measurement of cosmic magnification to date [Nature, April 2005]
Next steps (in progress) • better accuracy via coordinate-dependent priors • use all 5 magnitudes • use simulated quasars to push to higher redshift • use DR4 higher-quality data • faster bandwidth search • 500k quasars easily, then 1M
Bigger picture • nearest neighbor (1-, k-, all-, approx., clsf.) [Gray & Moore 2000], [Miller et al. 2003], etc. (fastest alg) • n-point correlation functions [Gray & Moore 2000], [Moore et al. 2000], [Scranton et al. 2003], [Gray & Moore 2004], [Nichol et al. 2005 in prep.] (fastest alg) • density estimation (nonparametric) [Gray & Moore 2000], [Gray & Moore 2003], [Balogh et al. 2003] (fastest alg) • Bayes classification (nonparametric) [Richards et al. 2004], [Gray et al. 2005 PhyStat] (fastest alg) • nonparametric regression • clustering: k-means and mixture models, others • support vector machines, maybe (we'll see…)
Take-home messages • Estimating a density? Use kernel density estimation (KDE). • Classification problem? Consider the nonparametric Bayes classifier (NBC). • Want to do these on huge datasets? Talk to us, use our software. • Different computational/statistical problem? Grab me after the talk! agray@cc.gatech.edu