AMERICAN ASTRONOMICAL SOCIETY
Continuous Probability Distribution as an Alternative to Binning of Survey Data
JANUARY 6, 2010
David J. Corliss
A Typical Example of Binned Data
Population of Hot DB White Dwarfs in the Sloan Digital Sky Survey
[Figure: binned chart; axis ticks 0 to 16 in steps of 2; bins < 30,000 K, 30 - 35,000 K, 35 - 40,000 K, 40 - 45,000 K, > 45,000 K]
Figure 1 – Population Distribution of hot DB white dwarfs described by Eisenstein et al. 2006
Some Amount of Information Is Lost as All Points in a Given Bin Are Treated the Same
There Is Also Some Uncertainty as to Which Bin a Given Point Belongs In
[Figure labels: LOWER DB GAP, MIDDLE DB GAP, UPPER DB GAP]
Figure 2A – Population Distribution of hot DB white dwarfs described by Eisenstein et al. 2006b
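The uncertainty about bin membership can be made concrete. The sketch below is illustrative only: it assumes each measurement is Gaussian with a known σ, and the temperature and uncertainty used (34,500 K ± 1,000 K) are hypothetical values, not survey data. The bin boundaries follow the axis labels of the binned figure. The function names are made up for this example.

```python
import math

def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def bin_probabilities(mu, sigma, edges):
    """Probability that the true value lies in each bin.

    Assumes a Gaussian measurement with mean mu and standard deviation sigma.
    `edges` holds the interior bin boundaries; the first and last bins are
    open-ended.
    """
    bounds = [-math.inf] + list(edges) + [math.inf]
    cdf = [normal_cdf(b, mu, sigma) for b in bounds]
    return [hi - lo for lo, hi in zip(cdf[:-1], cdf[1:])]

# Bin boundaries in kelvin, taken from the binned figure's axis labels.
EDGES_K = [30000.0, 35000.0, 40000.0, 45000.0]
LABELS = ["< 30", "30 - 35", "35 - 40", "40 - 45", "> 45"]

# Hypothetical measurement: 34,500 K with a 1,000 K uncertainty.
probs = bin_probabilities(34500.0, 1000.0, EDGES_K)
for label, p in zip(LABELS, probs):
    print(f"{label:>8} kK: {p:.3f}")
```

Under these assumptions, a point that binning would place entirely in the 30 - 35,000 K bin has roughly a 31% chance of belonging in the next bin up.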
Kernel Density Estimate (KDE)
Process: Represent Each Point as a Normal Distribution and Sum
Figure 2B – Population Distribution of hot DB white dwarfs described by Eisenstein et al. 2006b
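Written out, the process is a sum of one normal curve per data point. In the expression below, μ_i is the observed value of point i and σ_i is the uncertainty of that measurement. The 1/N factor is an assumption added here so that the result integrates to one; the slides describe only the sum.

```latex
\hat{f}(x) \;=\; \frac{1}{N} \sum_{i=1}^{N}
  \frac{1}{\sigma_i \sqrt{2\pi}}
  \exp\!\left[ -\frac{(x - \mu_i)^2}{2\sigma_i^2} \right]
```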
Summary and Conclusions: Kernel Density Estimation
• Creates a Continuous Probability Density Distribution by Summing over Gaussian Distributions for Each Data Point, Where μ is the Observed Value and σ is the Standard Deviation (Uncertainty) of the Individual Measurement
• Prevents Loss of Information From Relatively Accurate Measurements Being Placed into Larger Bins
• Incorporates the Uncertainty Associated with Measured Values into Population Distributions
• Provides a Viable Alternative to Binning in Developing Population Distributions for Survey and Other Data
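The first point above can be sketched in a few lines. This is a minimal illustration, not the SAS procedure cited in the references: the function name `kde_density` is made up for this example, the temperatures and uncertainties are hypothetical, and dividing by the number of points (so the curve integrates to one) is an assumption.

```python
import math

def kde_density(x, values, sigmas):
    """Density at x: the average of one Gaussian per data point.

    Each Gaussian is centred on the observed value (mu) and uses that
    point's own measurement uncertainty (sigma) as its width.
    """
    total = 0.0
    for mu, sigma in zip(values, sigmas):
        z = (x - mu) / sigma
        total += math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))
    return total / len(values)  # assumption: normalise by the number of points

# Hypothetical temperatures (K) and 1-sigma uncertainties; not survey data.
temps = [28500.0, 31000.0, 33500.0, 36000.0, 41000.0, 46500.0]
errs = [800.0, 1200.0, 1500.0, 2000.0, 2500.0, 3000.0]

# Evaluate the density on a grid from 20,000 K to 60,000 K in 100 K steps.
STEP_K = 100.0
grid = [20000.0 + STEP_K * i for i in range(401)]
density = [kde_density(t, temps, errs) for t in grid]

# Sanity check: a Riemann sum over the grid should be close to one.
area = sum(density) * STEP_K
print(f"area under curve = {area:.4f}")
```

Because each point keeps its own σ, a precise measurement contributes a narrow peak and an imprecise one a broad, low curve, which is how the measurement uncertainty enters the population distribution.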
References
Babu, G. Jogesh, Summer School in Statistics for Astronomers V, Lecture Notes, Pennsylvania State University, 2009
Barnes, George R., and Cerrito, Patricia B., The Visualization of Continuous Data Using PROC KDE and PROC CAPABILITY, SUGI 26, 2001
Corliss, David J., MS Thesis, Wayne State University, 2008
Eisenstein, D.J., et al., 2006, ApJS, 167, 40 (Eisenstein et al. 2006a)
Eisenstein, D.J., et al., 2006, AJ, 132, 676 (Eisenstein et al. 2006b)
Sall, John, Personal Communication re. the SAS KDE Procedure
A Final Thought
“Essentially, all models are wrong, but some are useful.”
George E. P. Box (Box, George E. P., and Draper, Norman R. (1987). Empirical Model-Building and Response Surfaces, p. 424, Wiley.)
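libname project 'C:\SAS\Conferences';

/* Sample of the monthly volume series. List input is used because
   the values are space-delimited. */
data work.kde;
    input month day year volume;
    cards;
1 1 1962 589
2 1 1962 561
3 1 1962 640
4 1 1962 656
5 1 1962 727
6 1 1962 697
7 1 1962 640
8 1 1962 599
;
run;

/* Keep the January observation of each year; T is the time index, Y the response.
   Note: WORK.CRYER is not created anywhere in this listing. */
DATA WORK.TSERIES;
    SET WORK.CRYER;
    IF MONTH = 1;
    DUMMY = 1;
    ATTRIB T INFORMAT=8.0 FORMAT=8.0;
    T = YEAR;
    ATTRIB Y INFORMAT=8.0 FORMAT=8.1;
    Y = VOLUME;
RUN;

/* The default OUTPUT data set has one row per statistic (_STAT_ = N, MIN, MAX, MEAN, STD). */
PROC MEANS DATA=WORK.TSERIES NOPRINT;
    VAR VOLUME;
    OUTPUT OUT=WORK.RANDOM_TERM;
RUN;

%GLOBAL LAMBDA SIGMA;

/* Store the mean and standard deviation of VOLUME in macro variables.
   CALL SYMPUTX replaces the original %LET, which stored the text VOLUME, not its value. */
%MACRO ASSIGNMENT;
    DATA _NULL_;
        SET WORK.RANDOM_TERM;
        IF _STAT_ = 'MEAN' THEN CALL SYMPUTX('LAMBDA', VOLUME);
        IF _STAT_ = 'STD' THEN CALL SYMPUTX('SIGMA', VOLUME);
    RUN;
%MEND ASSIGNMENT;
%ASSIGNMENT;
%PUT LAMBDA = &LAMBDA.;

DATA WORK.TEST;
    SET WORK.TSERIES;
    LAMBDA = &LAMBDA.;
    SIGMA = &SIGMA.;
RUN;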
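/* Extend the series by one step, using a linear trend fitted to the last N points. */
%MACRO AC(N);
    PROC SORT DATA=WORK.TSERIES;
        BY DUMMY;
    RUN;

    /* RECENT = row number of the first of the last N observations. */
    DATA WORK.LAST;
        SET WORK.TSERIES;
        BY DUMMY;
        IF LAST.DUMMY;
        RECENT = _N_ - &N. + 1;
        KEEP DUMMY RECENT;
    RUN;

    DATA WORK.RECENT;
        MERGE WORK.TSERIES WORK.LAST;
        BY DUMMY;
        IF _N_ GE RECENT;
        DROP RECENT;
    RUN;

    PROC REG DATA=WORK.RECENT NOPRINT;
        MODEL Y=T;
        OUTPUT OUT=WORK.TREND PREDICTED=FORECAST RESIDUAL=RESIDUAL;
    RUN;
    QUIT;

    /* OUTPUT precedes the assignments, so each row carries the previous row's values.
       Assumption: the random term is normal noise with mean 0 and standard deviation
       SIGMA; the original call, RAND(SIGMA,LAMBDA), is not valid RAND syntax. */
    DATA WORK.TREND;
        SET WORK.TREND;
        RETAIN T_PREVIOUS Y_PREVIOUS;
        OUTPUT;
        T_PREVIOUS = T;
        Y_PREVIOUS = FORECAST + RAND('NORMAL', 0, &SIGMA.);
    RUN;

    /* Build the new point from the last row and its predecessor. */
    DATA WORK.NEW;
        SET WORK.TREND;
        BY DUMMY;
        IF LAST.DUMMY;
        DELTA_T = T - T_PREVIOUS;
        T = T + DELTA_T;
        DELTA_Y = Y - Y_PREVIOUS + 1;
        Y = Y + DELTA_Y;
        KEEP T Y DUMMY;
    RUN;

    /* Append the new point to the series. */
    DATA WORK.TSERIES;
        SET WORK.TSERIES WORK.NEW;
    RUN;
%MEND AC;

%AC(5);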