Random Forest Photometric Redshift Estimation
Samuel Carliles¹, Tamas Budavari², Sebastien Heinis², Carey Priebe³, Alex Szalay²
Johns Hopkins University: ¹Dept. of Computer Science, ²Dept. of Physics & Astronomy, ³Dept. of Applied Mathematics & Statistics
Photometric Redshifts
• You know what they are
• I did it on SDSS DR6 colors
• zspec = f(u-g, g-r, r-i, i-z)
• ẑphot = f̂(u-g, g-r, r-i, i-z)
• ε = ẑphot - zspec
• I did it with Random Forests
Regression Trees
• A binary tree
• It partitions the input training data into clusters of similar objects
• Each new test object is matched with the cluster to which it is “closest” in the input space
• The output value is the mean of the output values of the training objects in its cluster
(Figure: example tree splitting the input space on dimensions x1, x2, x3)
Building a Regression Tree
• Starting at the root node, choose a dimension on which to split
• Choose the point which “best” distinguishes clusters in that dimension
• Points to the left go in the left child; points to the right go in the right child
• Repeat the process in each child node until every object is in its own leaf node (a sketch of this recursion follows)
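A minimal R sketch of this recursive construction and of the prediction rule from the previous slide; this is illustrative code, not the talk's implementation, and best_dimension() and best_split() are hypothetical helpers sketched after the resubstitution-error slide below:

# Illustrative sketch: grow a regression tree by recursive splitting,
# continuing until each node holds a single object (or a constant output).
grow_tree <- function(X, y) {
  if (length(y) <= 1 || length(unique(y)) == 1)
    return(list(leaf = TRUE, value = mean(y)))      # leaf: mean of output values
  dim <- best_dimension(X, y)                       # dimension with lowest best error
  s   <- best_split(X[[dim]], y)$split              # best split point in that dimension
  go_left <- X[[dim]] <= s
  if (all(go_left) || all(!go_left))                # guard against degenerate splits
    return(list(leaf = TRUE, value = mean(y)))
  list(leaf = FALSE, dim = dim, split = s,
       left  = grow_tree(X[go_left, , drop = FALSE],  y[go_left]),
       right = grow_tree(X[!go_left, , drop = FALSE], y[!go_left]))
}

# Prediction: route a test object down the splits, return the leaf mean.
predict_tree <- function(node, x) {
  if (node$leaf) return(node$value)
  if (x[[node$dim]] <= node$split) predict_tree(node$left, x)
  else predict_tree(node$right, x)
}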
How Do You Choose the Dimension and Split Point?
• The best split point in a dimension is the one which minimizes the resubstitution error in that dimension
• The best dimension is the one with the lowest best resubstitution error
What’s Resubstitution Error?
• For a candidate split point, there are points to the left and points to the right
• E = Σ_L (x - x̄_L)² / N_L + Σ_R (x - x̄_R)² / N_R, where x̄_L, x̄_R are the means and N_L, N_R the counts on each side
• That’s the resubstitution error
• Minimize it (a split-search sketch follows)
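A minimal R sketch of the split search, with the deviations taken on the output values (consistent with the leaf rule above, which averages outputs); all names are illustrative:

# Illustrative sketch: best split point in one dimension, found by
# minimizing the resubstitution error of the left/right sides.
best_split <- function(x, y) {
  ord <- order(x); x <- x[ord]; y <- y[ord]
  n <- length(y)
  best <- list(err = Inf, split = NA)
  for (i in 1:(n - 1)) {
    yl <- y[1:i]; yr <- y[(i + 1):n]
    err <- sum((yl - mean(yl))^2) / i + sum((yr - mean(yr))^2) / (n - i)
    if (err < best$err)
      best <- list(err = err, split = (x[i] + x[i + 1]) / 2)
  }
  best
}

# The best dimension is the one whose best split has the lowest error.
best_dimension <- function(X, y)
  names(X)[which.min(vapply(X, function(col) best_split(col, y)$err, 0))]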
Randomizing a Regression Tree
• Train it on a bootstrap sample
• This is a sample of N objects drawn uniformly at random with replacement from the complete training set
• Instead of choosing the best dimension to split on, choose the best from among a random subset of the input dimensions (see the sketch below)
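A minimal R sketch of both randomizations; the data frame train, the color column names, and mtry = 2 are assumptions for illustration:

# Illustrative sketch: the two randomizations applied to each tree.
n    <- nrow(train)                               # train: colors + zspec (assumed)
boot <- train[sample(n, n, replace = TRUE), ]     # bootstrap: N draws with replacement

# At each node, search only a random subset of the input dimensions:
mtry <- 2                                         # e.g. 2 of the 4 color dimensions
candidate_dims <- sample(c("u_g", "g_r", "r_i", "i_z"), mtry)
dim  <- best_dimension(boot[candidate_dims], boot$zspec)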
Random Forest
• An ensemble of “randomized” Regression Trees
• The ensemble estimate is the mean of the individual tree estimates
• This gives a distribution of iid estimation errors
• The Central Limit Theorem gives the distribution of their mean
• Their mean is exactly ẑphot - zspec
• That means we have the error distribution for that object! (in symbols, see below)
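In symbols, a sketch of that argument (notation introduced here, not from the slides): writing ε_i for tree i's estimation error, assumed iid with mean μ and variance σ²,

% Sketch of the CLT step: the forest's error is the mean of iid tree errors,
% so it is asymptotically Gaussian with variance shrinking as 1/n_tree.
\bar{\epsilon}
  = \frac{1}{n_{\rm tree}} \sum_{i=1}^{n_{\rm tree}} \epsilon_i
  = \hat{z}_{\rm phot} - z_{\rm spec}
  \ \xrightarrow{d}\
  \mathcal{N}\!\left(\mu,\, \frac{\sigma^2}{n_{\rm tree}}\right)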
Implemented in R
• More training data -> better estimates
• Forests converge pretty quickly in forest size
• Training set size and input space are constrained by memory in the R implementation (a usage sketch follows)
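A minimal usage sketch with the CRAN randomForest package, assuming data frames train and test with the color columns and zspec as above (column names and ntree are illustrative):

# Illustrative sketch using the CRAN randomForest package.
library(randomForest)

rf <- randomForest(zspec ~ u_g + g_r + r_i + i_z, data = train, ntree = 500)

pred  <- predict(rf, newdata = test, predict.all = TRUE)
zphot <- pred$aggregate                  # ensemble mean: the redshift estimate
sigma <- apply(pred$individual, 1, sd)   # per-object spread of tree estimates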
Results
• Training set size = 80,000
• RMS error = 0.023
(Figure: two panels, “Error Distribution” and “Standardized Error Distribution”)
Since we know the error distribution* for each object, we can standardize the errors, and the result should be standard normal over all test objects. Like in this plot! :)
If the standardized errors are standard normal, then we can predict how many of the errors fall between the tails of the distribution for different tail sizes. Like in this plot! (mostly)
(a sketch of this check follows)
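A minimal R sketch of that check, reusing the illustrative zphot, sigma, and test from the sketch above:

# Illustrative sketch: standardize the per-object errors, compare to N(0,1).
z <- (zphot - test$zspec) / sigma        # standardized errors

hist(z, breaks = 50, freq = FALSE, main = "Standardized Error Distribution")
curve(dnorm(x), add = TRUE)              # standard normal overlay

# If z is standard normal, ~95% of the errors should fall within +/- 1.96:
mean(abs(z) < 1.96)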
Summary
• Random Forest estimates come with Gaussian error distributions
• 0.023 RMS error is competitive with other methodologies
• This makes Random Forests good
Future Work
• The CRLB (Cramér-Rao lower bound) says bigger N gives better estimates from the same estimator
• 80,000 objects is good, but we have far more than that available
• Random Forests in R are extremely memory-inefficient (and therefore time-inefficient), I believe due to the FORTRAN implementation
• So I’m writing a C# implementation