
How to do Bayes-Optimal Classification with Massive Datasets: Large-scale Quasar Discovery



Presentation Transcript


  1. How to do Bayes-Optimal Classification with Massive Datasets: Large-scale Quasar Discovery
  Alexander Gray, Georgia Institute of Technology, College of Computing
  Joint work with Gordon Richards (Princeton), Robert Nichol (Portsmouth ICG), Robert Brunner (UIUC/NCSA), Andrew Moore (CMU)

  2. What I do
  Often the most general and powerful statistical (or "machine learning") methods are computationally infeasible. I design machine learning methods and fast algorithms to make such statistical methods possible on massive datasets, without sacrificing accuracy.

  3. Quasar detection
  • Science motivation: use quasars to trace the distant/old mass in the universe
  • Thus we want lots of sky → SDSS DR1, 2099 square degrees, to g = 21
  • Biggest quasar catalog to date: tens of thousands
  • Should be ~1.6M z < 3 quasars to g = 21

  4. Classification
  • Traditional approach: look at a 2-d color-color plot (the UVX method)
    - doesn't use all available information
    - not particularly accurate (~60% at relatively bright magnitudes)
  • Statistical approach: pose as classification
    - Training: train a classifier on a large set of known stars and quasars (the 'training set')
    - Prediction: the classifier labels an unknown set of objects (the 'test set')

  5. Which classifier?
  • Statistical question: must handle arbitrary nonlinear decision boundaries and noise/overlap
  • Computational question: we have 16,713 quasars from [Schneider et al. 2003] (0.08 < z < 5.4) and 478,144 stars (a semi-cleaned sky sample), way too big for many classifiers
  • Scientific question: we must be able to understand what it's doing and why, and inject scientific knowledge

  6. Which classifier?
  • Popular answers:
    - logistic regression: fast, but linear only
    - naïve Bayes classifier: fast, but quadratic only
    - decision tree: fast, but not the most accurate
    - support vector machine: accurate, but O(N^3)
    - boosting: accurate, but requires thousands of classifiers
    - neural net: a reasonable compromise, but awkward and human-intensive to train
  • The good nonparametric methods are also black boxes: hard or impossible to interpret

  7. Main points of this talk
  • nonparametric Bayes classifier
  • can be made fast (algorithm design)
  • accurate and tractable → science


  9. Optimal decision theory
  [Figure: quasar and star class densities f(x) plotted against x, with the optimal decision boundary marked]

  10. Bayes’ rule, for Classification

  11. So how do you estimate an arbitrary density?

  12. Kernel Density Estimation (KDE), for example with the Gaussian kernel:
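  A standard statement of the estimator with the Gaussian kernel the slide names (N training points x_i in d dimensions, bandwidth h):

```latex
\hat{f}(x) \;=\; \frac{1}{N} \sum_{i=1}^{N} K_h(x - x_i),
\qquad
K_h(u) \;=\; \frac{1}{(2\pi h^2)^{d/2}} \exp\!\left( -\frac{\|u\|^2}{2h^2} \right)
```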

  13. Kernel Density Estimation (KDE)
  • There is a principled way to choose the optimal smoothing parameter h
  • Guaranteed to converge to the true underlying density (consistency)
  • Nonparametric: the distribution need not be known

  14. Nonparametric Bayes Classifier (NBC) [1951]
  • Nonparametric: the distribution can be arbitrary
  • Bayes-optimal, given the right densities
  • Very clear interpretation
  • Parameter choices are easy to understand and automatable
  • There is a way to enter prior information
  • Main obstacle: computational cost
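  As a concrete illustration, a minimal (naive, O(N)-per-query) NBC sketch in Python; the class names, toy data, and bandwidth are invented for the example:

```python
import numpy as np

def kde(x, data, h):
    """Gaussian KDE estimate of f(x) from the rows of `data` (naive O(N) scan)."""
    d = data.shape[1]
    sq = ((data - x) ** 2).sum(axis=1)
    return np.exp(-sq / (2 * h * h)).sum() / (len(data) * (2 * np.pi * h * h) ** (d / 2))

def nbc_predict(x, classes, priors, h):
    """Bayes rule: pick the class maximizing P(C) * f_hat(x | C)."""
    scores = {c: priors[c] * kde(x, pts, h) for c, pts in classes.items()}
    return max(scores, key=scores.get)

# Toy usage with two synthetic 2-d "color" classes (illustrative only).
rng = np.random.default_rng(0)
classes = {"star": rng.normal(0.0, 1.0, (500, 2)),
           "quasar": rng.normal(2.0, 1.0, (200, 2))}
n = sum(len(p) for p in classes.values())
priors = {c: len(p) / n for c, p in classes.items()}   # priors from class counts
print(nbc_predict(np.array([1.8, 2.1]), classes, priors, h=0.5))
```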

  15. Main points of this talk
  • nonparametric Bayes classifier
  • can be made fast (algorithm design)
  • accurate and tractable → science

  16. kd-trees: the most widely-used space-partitioning tree [Bentley 1975], [Friedman, Bentley & Finkel 1977]
  • Univariate, axis-aligned splits
  • Split on the widest dimension
  • O(N log N) to build, O(N) space
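  A minimal construction sketch in Python following the slide's recipe (axis-aligned split on the widest dimension); the median split and leaf size here are illustrative choices:

```python
import numpy as np

class Node:
    """A kd-tree node: a bounding box plus the indices of the points it owns."""
    def __init__(self, pts, idx):
        self.idx = idx                                  # indices into pts
        self.lo = pts[idx].min(axis=0)                  # box lower corner
        self.hi = pts[idx].max(axis=0)                  # box upper corner
        self.left = self.right = None

def build(pts, idx=None, leaf_size=16):
    """Split each node on its widest dimension; O(N log N) build, O(N) space."""
    if idx is None:
        idx = np.arange(len(pts))
    node = Node(pts, idx)
    if len(idx) > leaf_size:
        d = int(np.argmax(node.hi - node.lo))           # widest dimension
        order = idx[np.argsort(pts[idx, d])]            # sort along that axis
        node.left = build(pts, order[:len(order) // 2], leaf_size)
        node.right = build(pts, order[len(order) // 2:], leaf_size)
    return node
```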

  17.-22. A kd-tree: levels 1-6 [figure sequence: each level splits the widest dimension of every node]

  23. For higher dimensions: ball-trees (computational geometry)
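  For intuition, a minimal ball-tree construction sketch (each node stores a center and covering radius instead of a box); the farthest-point split heuristic is one common choice, not necessarily the one used in the talk:

```python
import numpy as np

class Ball:
    """A ball-tree node: center and radius covering its points."""
    def __init__(self, pts, idx):
        self.idx = idx
        self.center = pts[idx].mean(axis=0)
        self.radius = np.sqrt(((pts[idx] - self.center) ** 2).sum(axis=1).max())
        self.left = self.right = None

def build_balltree(pts, idx=None, leaf_size=16):
    if idx is None:
        idx = np.arange(len(pts))
    node = Ball(pts, idx)
    if len(idx) > leaf_size:
        # Partition by distance to the point farthest from the center.
        far = idx[np.argmax(((pts[idx] - node.center) ** 2).sum(axis=1))]
        order = idx[np.argsort(((pts[idx] - pts[far]) ** 2).sum(axis=1))]
        node.left = build_balltree(pts, order[:len(order) // 2], leaf_size)
        node.right = build_balltree(pts, order[len(order) // 2:], leaf_size)
    return node
```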

  24. We have a fast algorithm for Kernel Density Estimation (KDE)
  • A generalization of N-body algorithms (multipole expansions optional)
  • Dual kd-tree traversal: O(N)
  • Works in arbitrary dimension
  • The fastest method to date [Gray & Moore 2003]
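  A simplified sketch of the dual-tree idea (absolute-error pruning, single bandwidth, no multipole expansions), not the authors' implementation; it reuses Node and build from the kd-tree sketch after slide 16:

```python
import numpy as np
# Node and build: as defined in the kd-tree sketch after slide 16.

def box_bounds(q, r):
    """Min/max distance between the bounding boxes of nodes q and r."""
    gap = np.maximum(0.0, np.maximum(q.lo - r.hi, r.lo - q.hi))
    span = np.maximum(q.hi - r.lo, r.hi - q.lo)
    return np.linalg.norm(gap), np.linalg.norm(span)

def dualtree_kde(qnode, rnode, qpts, rpts, h, out, eps=1e-3):
    """Accumulate into out[i] the sum of Gaussian kernel values between
    query i and every reference point."""
    dmin, dmax = box_bounds(qnode, rnode)
    kmax = np.exp(-dmin ** 2 / (2 * h * h))
    kmin = np.exp(-dmax ** 2 / (2 * h * h))
    if kmax - kmin < eps:                       # kernel nearly constant here:
        out[qnode.idx] += len(rnode.idx) * 0.5 * (kmin + kmax)   # prune
        return
    if qnode.left is None and rnode.left is None:                # two leaves
        d2 = ((qpts[qnode.idx][:, None, :] - rpts[rnode.idx][None, :, :]) ** 2).sum(-1)
        out[qnode.idx] += np.exp(-d2 / (2 * h * h)).sum(axis=1)
        return
    # Otherwise recurse (up to 4-way when both nodes have children).
    for qc in ([qnode.left, qnode.right] if qnode.left else [qnode]):
        for rc in ([rnode.left, rnode.right] if rnode.left else [rnode]):
            dualtree_kde(qc, rc, qpts, rpts, h, out, eps)

# Usage: out = np.zeros(len(qpts)); dualtree_kde(build(qpts), build(rpts), qpts, rpts, h, out)
# then densities = out / (len(rpts) * (2 * np.pi * h * h) ** (qpts.shape[1] / 2))
```

  With eps = 0 this reduces to the exact O(N^2) sum; a larger eps prunes more node-pairs and drives the observed cost toward linear.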

  25. But we need a fast algorithm for the Nonparametric Bayes Classifier (NBC)
  We could just use the KDE algorithm for each class. But:
  • for the Gaussian kernel, this is approximate
  • choosing the smoothing parameter to minimize (cross-validated) classification error is more accurate

  26. Leave-one-out cross-validation
  Observations:
  • Doing bandwidth selection requires only prediction.
  • To predict a class label, we don't need to compute the full densities, just which one is higher.
  → We can make a fast exact algorithm for prediction
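  What this buys us, in a naive O(N^2)-per-bandwidth sketch of the quantity the tree-based algorithm computes exactly but fast (the bandwidth grid `hs` is an assumed input):

```python
import numpy as np

def loo_classification_errors(X, y, hs):
    """Leave-one-out NBC classification error for each bandwidth in hs.
    With priors proportional to class counts, P(C) * f_hat(x|C) is
    proportional to the per-class kernel sum (up to a small leave-one-out
    correction), so we only compare raw kernel sums, never full densities."""
    labels = np.unique(y)
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise sq. distances
    errors = []
    for h in hs:
        K = np.exp(-sq / (2 * h * h))
        np.fill_diagonal(K, 0.0)                          # leave each point out
        votes = np.stack([K[:, y == c].sum(axis=1) for c in labels], axis=1)
        pred = labels[votes.argmax(axis=1)]
        errors.append((pred != y).mean())
    return errors

# Usage sketch: pick the h minimizing the error curve, e.g.
# hs = np.logspace(-1.5, 0.5, 20); h_best = hs[np.argmin(loo_classification_errors(X, y, hs))]
```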

  27. Fast NBC prediction algorithm
  1. Build a tree for each class

  28. Fast NBC prediction algorithm
  2. Obtain bounds on P(C)f(xq|C) for each class
  [Figure: query point xq with bound intervals on P(C1)f(xq|C1) and P(C2)f(xq|C2)]

  29. Fast NBC prediction algorithm
  3. Choose the next node-pair with priority = bound difference
  [Figure: the bound intervals on P(C1)f(xq|C1) and P(C2)f(xq|C2) tighten as node-pairs are expanded]

  30. Fast NBC prediction algorithm
  3. Choose the next node-pair with priority = bound difference
  • Exact answers, with a 50-100x speedup
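  A simplified two-class sketch of the idea behind steps 1-3 (expanding one node at a time rather than node-pairs, with priors taken proportional to class counts so they cancel against the KDE normalization and raw kernel sums can be compared); Node and build are as in the kd-tree sketch after slide 16:

```python
import heapq
import numpy as np
# Node and build: as defined in the kd-tree sketch after slide 16.

def kernel_bounds(x, node, h):
    """Bounds on the summed Gaussian kernel between x and node's points,
    from the min/max distance between x and the node's bounding box."""
    dmin = np.linalg.norm(np.maximum(0.0, np.maximum(node.lo - x, x - node.hi)))
    dmax = np.linalg.norm(np.maximum(np.abs(x - node.lo), np.abs(x - node.hi)))
    n = len(node.idx)
    return n * np.exp(-dmax ** 2 / (2 * h * h)), n * np.exp(-dmin ** 2 / (2 * h * h))

class ClassState:
    """Running lower/upper bounds on one class's kernel sum at x."""
    def __init__(self, x, pts, h):
        self.x, self.pts, self.h, self.tick = x, pts, h, 0
        root = build(pts)
        self.lo, self.hi = kernel_bounds(x, root, h)
        self.heap = [(-(self.hi - self.lo), 0, root, self.lo, self.hi)]

    def refine(self):
        """Expand the frontier node with the loosest bounds."""
        _, _, node, lo, hi = heapq.heappop(self.heap)
        self.lo -= lo; self.hi -= hi
        if node.left is None:                      # leaf: exact contribution
            d2 = ((self.pts[node.idx] - self.x) ** 2).sum(axis=1)
            s = np.exp(-d2 / (2 * self.h * self.h)).sum()
            self.lo += s; self.hi += s
        else:
            for child in (node.left, node.right):
                l, u = kernel_bounds(self.x, child, self.h)
                self.lo += l; self.hi += u
                self.tick += 1
                heapq.heappush(self.heap, (-(u - l), self.tick, child, l, u))

def predict(x, star_pts, quasar_pts, h):
    """Label x without computing either density exactly: stop as soon as
    the two classes' bound intervals separate."""
    a, b = ClassState(x, star_pts, h), ClassState(x, quasar_pts, h)
    while a.heap or b.heap:
        if a.lo > b.hi:
            return "star"
        if b.lo > a.hi:
            return "quasar"
        target = a if (a.hi - a.lo) >= (b.hi - b.lo) else b
        if not target.heap:
            target = b if target is a else a
        target.refine()
    return "star" if a.lo >= b.lo else "quasar"
```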

  31. Main points of this talk
  • nonparametric Bayes classifier
  • can be made fast (algorithm design)
  • accurate and tractable → science

  32. Resulting quasar catalog
  • 100,563 UVX quasar candidates
  • Of 22,737 objects with spectra, 97.6% are quasars; we estimate 95.0% efficiency overall (aka 'purity': good/all)
  • 94.7% completeness w.r.t. g < 19.5 UVX quasars from DR1 (good/all true)
  • Largest magnitude range ever: 14.2 < g < 21.0
  • [Richards et al. 2004, ApJ]
  • More recently, 195k quasars

  33. Cosmic magnification [Scranton et al. 2005]
  [Figure: magnification geometry, labeled 'more area' and 'more flux']
  • 13.5M galaxies, 195,000 quasars
  • The most accurate measurement of cosmic magnification to date [Nature, April 2005]

  34. Next steps (in progress)
  • better accuracy via coordinate-dependent priors
  • 5 magnitudes
  • use simulated quasars to push to higher redshift
  • use higher-quality DR4 data
  • faster bandwidth search
  • 500k quasars easily, then 1M

  35.-40. Bigger picture (successive slides tag each method for which ours is the fastest algorithm)
  • nearest neighbor (1-, k-, all-, approximate, classification) [Gray & Moore 2000], [Miller et al. 2003], etc. (fastest alg)
  • n-point correlation functions [Gray & Moore 2000], [Moore et al. 2000], [Scranton et al. 2003], [Gray & Moore 2004], [Nichol et al. 2005 in prep.] (fastest alg)
  • density estimation (nonparametric) [Gray & Moore 2000], [Gray & Moore 2003], [Balogh et al. 2003] (fastest alg)
  • Bayes classification (nonparametric) [Richards et al. 2004], [Gray et al. 2005 PhyStat] (fastest alg)
  • nonparametric regression
  • clustering: k-means and mixture models, others
  • support vector machines, maybe (we'll see…)

  41. Take-home messages
  • Estimating a density? Use kernel density estimation (KDE).
  • Classification problem? Consider the nonparametric Bayes classifier (NBC).
  • Want to do these on huge datasets? Talk to us, use our software.
  • Different computational/statistical problem? Grab me after the talk! agray@cc.gatech.edu
