Random Forest Photometric Redshift Estimation
Samuel Carliles¹, Tamas Budavari², Sebastien Heinis², Carey Priebe³, Alex Szalay²
Johns Hopkins University: ¹Dept. of Computer Science, ²Dept. of Physics & Astronomy, ³Dept. of Applied Mathematics & Statistics
Photometric Redshifts
• You know what they are
• I did it on SDSS DR6 colors
• zspec = f(u-g, g-r, r-i, i-z)
• ẑphot = f̂(u-g, g-r, r-i, i-z)
• ε = ẑphot - zspec
• I did it with Random Forests
Regression Trees
• A binary tree
• It partitions the input training data into clusters of similar objects
• Each new test object is matched with the cluster to which it is “closest” in the input space
• The output value is the mean of the output values of the training objects in its cluster
(Figure: example tree splitting the input space on dimensions x1, x2, x3)
Building a Regression Tree
• Starting at the root node, choose a dimension on which to split
• Choose the point which “best” distinguishes clusters in that dimension
• Points to the left go in the left child; points to the right go in the right child
• Repeat the process in each child node until every object is in its own leaf node (a sketch of this recursion follows)
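A minimal R sketch of this recursive construction and of the prediction rule from the previous slide; this is illustrative code, not the talk's implementation, and best_dimension() and best_split() are hypothetical helpers sketched after the resubstitution-error slide below:

# Illustrative sketch: grow a regression tree by recursive splitting,
# continuing until each node holds a single object (or a constant output).
grow_tree <- function(X, y) {
  if (length(y) <= 1 || length(unique(y)) == 1)
    return(list(leaf = TRUE, value = mean(y)))      # leaf: mean of output values
  dim <- best_dimension(X, y)                       # dimension with lowest best error
  s   <- best_split(X[[dim]], y)$split              # best split point in that dimension
  go_left <- X[[dim]] <= s
  if (all(go_left) || all(!go_left))                # guard against degenerate splits
    return(list(leaf = TRUE, value = mean(y)))
  list(leaf = FALSE, dim = dim, split = s,
       left  = grow_tree(X[go_left, , drop = FALSE],  y[go_left]),
       right = grow_tree(X[!go_left, , drop = FALSE], y[!go_left]))
}

# Prediction: route a test object down the splits, return the leaf mean.
predict_tree <- function(node, x) {
  if (node$leaf) return(node$value)
  if (x[[node$dim]] <= node$split) predict_tree(node$left, x)
  else predict_tree(node$right, x)
}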
How Do You Choose the Dimension and Split Point?
• The best split point in a dimension is the one which minimizes the resubstitution error in that dimension
• The best dimension is the one with the lowest best resubstitution error
What’s Resubstitution Error?
• For a candidate split point, there are points to the left and points to the right
• E = Σ_L (x - x̄_L)² / N_L + Σ_R (x - x̄_R)² / N_R, where x̄_L, x̄_R are the means and N_L, N_R the counts on each side
• That’s the resubstitution error
• Minimize it (a split-search sketch follows)
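A minimal R sketch of the split search, with the deviations taken on the output values (consistent with the leaf rule above, which averages outputs); all names are illustrative:

# Illustrative sketch: best split point in one dimension, found by
# minimizing the resubstitution error of the left/right sides.
best_split <- function(x, y) {
  ord <- order(x); x <- x[ord]; y <- y[ord]
  n <- length(y)
  best <- list(err = Inf, split = NA)
  for (i in 1:(n - 1)) {
    yl <- y[1:i]; yr <- y[(i + 1):n]
    err <- sum((yl - mean(yl))^2) / i + sum((yr - mean(yr))^2) / (n - i)
    if (err < best$err)
      best <- list(err = err, split = (x[i] + x[i + 1]) / 2)
  }
  best
}

# The best dimension is the one whose best split has the lowest error.
best_dimension <- function(X, y)
  names(X)[which.min(vapply(X, function(col) best_split(col, y)$err, 0))]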
Randomizing a Regression Tree
• Train it on a bootstrap sample
• This is a sample of N objects drawn uniformly at random with replacement from the complete training set
• Instead of choosing the best dimension to split on, choose the best from among a random subset of the input dimensions (see the sketch below)
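A minimal R sketch of both randomizations; the data frame train, the color column names, and mtry = 2 are assumptions for illustration:

# Illustrative sketch: the two randomizations applied to each tree.
n    <- nrow(train)                               # train: colors + zspec (assumed)
boot <- train[sample(n, n, replace = TRUE), ]     # bootstrap: N draws with replacement

# At each node, search only a random subset of the input dimensions:
mtry <- 2                                         # e.g. 2 of the 4 color dimensions
candidate_dims <- sample(c("u_g", "g_r", "r_i", "i_z"), mtry)
dim  <- best_dimension(boot[candidate_dims], boot$zspec)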
Random Forest
• An ensemble of “randomized” Regression Trees
• The ensemble estimate is the mean of the individual tree estimates
• This gives a distribution of iid estimation errors
• The Central Limit Theorem gives the distribution of their mean
• Their mean is exactly ẑphot - zspec
• That means we have the error distribution for that object! (in symbols, see below)
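In symbols, a sketch of that argument (notation introduced here, not from the slides): writing ε_i for tree i's estimation error, assumed iid with mean μ and variance σ²,

% Sketch of the CLT step: the forest's error is the mean of iid tree errors,
% so it is asymptotically Gaussian with variance shrinking as 1/n_tree.
\bar{\epsilon}
  = \frac{1}{n_{\rm tree}} \sum_{i=1}^{n_{\rm tree}} \epsilon_i
  = \hat{z}_{\rm phot} - z_{\rm spec}
  \ \xrightarrow{d}\
  \mathcal{N}\!\left(\mu,\, \frac{\sigma^2}{n_{\rm tree}}\right)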
Implemented in R
• More training data -> better estimates
• Forests converge pretty quickly in forest size
• Training set size and input space are constrained by memory in the R implementation (a usage sketch follows)
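A minimal usage sketch with the CRAN randomForest package, assuming data frames train and test with the color columns and zspec as above (column names and ntree are illustrative):

# Illustrative sketch using the CRAN randomForest package.
library(randomForest)

rf <- randomForest(zspec ~ u_g + g_r + r_i + i_z, data = train, ntree = 500)

pred  <- predict(rf, newdata = test, predict.all = TRUE)
zphot <- pred$aggregate                  # ensemble mean: the redshift estimate
sigma <- apply(pred$individual, 1, sd)   # per-object spread of tree estimates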
Results
• Training set size = 80,000
• RMS error = 0.023
(Figure: two panels, “Error Distribution” and “Standardized Error Distribution”)
Since we know the error distribution* for each object, we can standardize the errors, and the result should be standard normal over all test objects. Like in this plot! :)
If the standardized errors are standard normal, then we can predict how many of the errors fall between the tails of the distribution for different tail sizes. Like in this plot! (mostly)
(a sketch of this check follows)
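A minimal R sketch of that check, reusing the illustrative zphot, sigma, and test from the sketch above:

# Illustrative sketch: standardize the per-object errors, compare to N(0,1).
z <- (zphot - test$zspec) / sigma        # standardized errors

hist(z, breaks = 50, freq = FALSE, main = "Standardized Error Distribution")
curve(dnorm(x), add = TRUE)              # standard normal overlay

# If z is standard normal, ~95% of the errors should fall within +/- 1.96:
mean(abs(z) < 1.96)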
Summary
• Random Forest estimates come with Gaussian error distributions
• 0.023 RMS error is competitive with other methodologies
• This makes Random Forests good
Future Work
• The CRLB (Cramér-Rao lower bound) says bigger N gives better estimates from the same estimator
• 80,000 objects is good, but we have far more than that available
• Random Forests in R are extremely memory-inefficient (and therefore time-inefficient), I believe due to the FORTRAN implementation
• So I’m writing a C# implementation