Nonparametric Bayesian Classification

Presentation Transcript


  1. Nonparametric Bayesian Classification Marc A. Coram University of Chicago http://galton.uchicago.edu/~coram Persi Diaconis Steve Lalley

  2. Related Approaches • Chipman, George, McCulloch • Bayesian CART (1998 a,b) • Nested • CART-like • Coordinate aligned splits • Good “search” ability • Denison, Mallick, Smith • Bayesian CART • Bayesian splines and “MARS”

  3. Outline • Medical example • Theoretical framework • Bayesian proposal • Implementation • Simulation experiments • Theoretical results • Extensions to a general setting

  4. Example: AIDS Data (1-dimensional) • AIDS patients • Covariate of interest: viral resistance level in blood sample • Goal: estimate conditional probability of response

  5. Idealized Setting • (X, Y): iid pairs • X (covariate): X ∈ [0,1] • Y (response): Y ∈ {0,1} • f0 (true parameter): f0(x) = P(Y=1 | X=x) • What, then, is a straightforward way to proceed, thinking like a Bayesian?

  6. Prior on f: 1-dimension • Pick a non-negative integer M at random. Say, choose M=0 with prob 1/2, M=1 with prob 1/4, M=2 with prob 1/8, … • Conditional on M=m, randomly choose a step function from [0,1] into [0,1] with m jumps (i.e. locate the m jumps and the (m+1) values independently and uniformly)
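
A draw from the prior on this slide can be sketched in a few lines of Python. This is only an illustration: the function names `sample_prior_step_function` and `evaluate` are hypothetical, and the halving probabilities for M are taken from the "say" example on the slide.

```python
import random
from bisect import bisect_right

def sample_prior_step_function(rng):
    """Draw (jumps, values) from the step-function prior described on slide 6."""
    # P(M = m) = 2^-(m+1): count how many fair-coin "tails" precede the first "head".
    m = 0
    while rng.random() < 0.5:
        m += 1
    jumps = sorted(rng.random() for _ in range(m))   # m jump locations, uniform on [0,1]
    values = [rng.random() for _ in range(m + 1)]    # m+1 function values, uniform on [0,1]
    return jumps, values

def evaluate(jumps, values, x):
    """Value of the step function at x: look up the interval that contains x."""
    return values[bisect_right(jumps, x)]

rng = random.Random(0)
jumps, values = sample_prior_step_function(rng)
print(len(jumps), [round(evaluate(jumps, values, x), 3) for x in (0.1, 0.5, 0.9)])
```

Storing the jumps sorted makes evaluation a single binary search, which is convenient when many curves are averaged on a grid.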

  7. Perspective • Simple prior on stepwise functions • Functions are parameterized by: # regions, jump locations, function values • Goal: Get samples from the posterior; average to estimate posterior mean curve • Idea: Use MCMC, but prefer analytical calculations whenever possible

  8. Observations • The joint distribution of U, V, and the data has density proportional to: [formula not preserved in transcript] where: [count definitions not preserved in transcript] • Conditional on u, the counts are sufficient for v.

  9. Observations II • The marginal of the posterior on U has density proportional to: [formula not preserved in transcript] • Conditional on U=u and the data, the V’s are independent Beta random variables [parameters not preserved in transcript]
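
The expressions on slides 8 and 9 did not survive in this transcript. Under the prior of slide 6 (uniform jump locations, uniform values, P(M=m) = 2^-(m+1)) and conditioning on the observed covariates, one standard reconstruction is the following. The interval and count notation is an assumption introduced here and may differ from the slides' own notation.

```latex
% Assumed notation (not from the slides): I_0(u), ..., I_m(u) are the intervals cut by u,
%   n_j = \#\{ i : x_i \in I_j(u) \},   s_j = \#\{ i : x_i \in I_j(u),\ y_i = 1 \}.
\begin{align*}
p(u, v, \text{data}) &\propto 2^{-(m+1)} \prod_{j=0}^{m} v_j^{\,s_j} (1 - v_j)^{\,n_j - s_j}, \\
p(u \mid \text{data}) &\propto 2^{-(m+1)} \prod_{j=0}^{m} B(s_j + 1,\; n_j - s_j + 1), \\
V_j \mid u, \text{data} &\sim \mathrm{Beta}(s_j + 1,\; n_j - s_j + 1) \quad \text{independently}, \\
E[V_j \mid u, \text{data}] &= \frac{s_j + 1}{n_j + 2}.
\end{align*}
```

The second line follows from the first by integrating each v_j over [0,1], which is why only u needs to be sampled.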

  10. Consequently… • In principle: • We put a prior on piecewise constant curves • The curves are specified by • u, a vector in [0,1]^m • v, a vector in [0,1]^(m+1) • for some m • We sample curves from the posterior using MCMC • We take the posterior mean (pointwise) of the sampled curves • In practice: • We need only sample from the posterior on u • We can then compute the conditional mean of all the curves with this u.

  11. Implementation • Build a reversible base chain to sample U from the prior • E.g., start with an empty vector and add, delete, and move coordinates randomly • Apply Metropolis-Hastings to construct a new chain which samples from the posterior on U • Compute: [expression not preserved in transcript]
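
The steps on this slide can be sketched as follows. This is a minimal illustration, not the speaker's implementation: it assumes the halving prior on M from slide 6, a Beta-function marginal likelihood (values integrated out), and a hypothetical add/delete/move base chain in which an "add" is kept with probability 1/2 so that the base chain is reversible with respect to the prior. All function names are invented for the example.

```python
import math
import random
from bisect import bisect_right

def interval_counts(u, xs, ys):
    """Per-interval totals n_j and success counts s_j for sorted jump locations u."""
    n = [0] * (len(u) + 1)
    s = [0] * (len(u) + 1)
    for x, y in zip(xs, ys):
        j = bisect_right(u, x)
        n[j] += 1
        s[j] += y
    return n, s

def log_marginal(u, xs, ys):
    """log prod_j B(s_j + 1, n_j - s_j + 1): the likelihood with v integrated out."""
    n, s = interval_counts(u, xs, ys)
    return sum(math.lgamma(sj + 1) + math.lgamma(nj - sj + 1) - math.lgamma(nj + 2)
               for nj, sj in zip(n, s))

def base_step(u, rng):
    """One step of a chain that is reversible w.r.t. the prior with P(M=m) = 2^-(m+1)."""
    u = list(u)
    move = rng.choice(("add", "delete", "move"))
    if move == "add":
        if rng.random() < 0.5:          # prior ratio P(M=m+1) / P(M=m) = 1/2
            u.append(rng.random())
    elif u:                             # delete and move need at least one jump
        i = rng.randrange(len(u))
        if move == "delete":
            u.pop(i)
        else:
            u[i] = rng.random()
    return sorted(u)

def posterior_mean_curve(xs, ys, grid, n_iter=5000, burn_in=1000, seed=0):
    """Average E[f(g) | u, data] over sampled u, for each grid point g."""
    rng = random.Random(seed)
    u, ll = [], log_marginal([], xs, ys)
    total = [0.0] * len(grid)
    kept = 0
    for it in range(n_iter):
        cand = base_step(u, rng)
        cand_ll = log_marginal(cand, xs, ys)
        # Metropolis-Hastings: because the proposal chain is prior-reversible,
        # the acceptance ratio reduces to the marginal-likelihood ratio.
        if rng.random() < math.exp(min(0.0, cand_ll - ll)):
            u, ll = cand, cand_ll
        if it >= burn_in:
            n, s = interval_counts(u, xs, ys)
            for k, g in enumerate(grid):
                j = bisect_right(u, g)
                total[k] += (s[j] + 1) / (n[j] + 2)   # conditional mean of V_j
            kept += 1
    return [t / kept for t in total]

# Toy data (an assumption for illustration): f0 jumps from 0.2 to 0.8 at x = 0.5.
data_rng = random.Random(1)
xs = [data_rng.random() for _ in range(400)]
ys = [int(data_rng.random() < (0.2 if x < 0.5 else 0.8)) for x in xs]
curve = posterior_mean_curve(xs, ys, grid=[0.25, 0.75], n_iter=3000, burn_in=1000)
print(curve)
```

Averaging the conditional means rather than sampled values is the "prefer analytical calculations" idea from slide 7: it removes the Monte Carlo noise that sampling v would add.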

  12. Simulation Experiment (a) n=1024 • True • Posterior Mean

  13. n=1024 • True • Posterior Mean

  14. n=1024 • True • Posterior Mean

  15. n=1024 • True • Posterior Mean

  16. Predictive Probability Surface

  17. Posterior on #-jumps

  18. Stable w.r.t Prior

  19. Decomposition

  20. Classification and Regression Trees (CART) • Consider splitting the data into the set with X<x and the set with X>x • Choose x to maximize the fit • Recurse on each subset • “Prune” away splits according to a complexity criterion whose parameter is determined by cross-validation • Splits that do not “explain” enough variability get pruned off
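
The growing phase described above can be sketched in one dimension as below. This is a simplified illustration, not the CART software used in the talk: "fit" is taken to be squared error around the two side means, a hypothetical `min_leaf` parameter stands in for a stopping rule, and the cross-validated pruning step is omitted entirely.

```python
def best_split(xs, ys, min_leaf=1):
    """Threshold that most reduces squared error, or None if no split helps."""
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    total = sum(y for _, y in pairs)
    # Minimizing squared error is equivalent to maximizing sum_side (sum y)^2 / count.
    best_score = total ** 2 / n + 1e-12      # score of not splitting at all
    best_threshold = None
    left_sum = 0.0
    for i in range(1, n):
        left_sum += pairs[i - 1][1]
        if i < min_leaf or n - i < min_leaf or pairs[i - 1][0] == pairs[i][0]:
            continue                          # too small a side, or tied x values
        score = left_sum ** 2 / i + (total - left_sum) ** 2 / (n - i)
        if score > best_score:
            best_score = score
            best_threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
    return best_threshold

def grow_tree(xs, ys, min_leaf=20):
    """Recursively split; a leaf is the mean response, a node is (threshold, left, right)."""
    t = best_split(xs, ys, min_leaf)
    if t is None:
        return sum(ys) / len(ys)
    left = [(x, y) for x, y in zip(xs, ys) if x < t]
    right = [(x, y) for x, y in zip(xs, ys) if x >= t]
    return (t,
            grow_tree([x for x, _ in left], [y for _, y in left], min_leaf),
            grow_tree([x for x, _ in right], [y for _, y in right], min_leaf))

def predict(tree, x):
    """Walk down the tree until a leaf (a plain number) is reached."""
    while isinstance(tree, tuple):
        t, left, right = tree
        tree = left if x < t else right
    return tree

demo_xs = [i / 200 for i in range(200)]
demo_ys = [0 if x < 0.5 else 1 for x in demo_xs]
demo_tree = grow_tree(demo_xs, demo_ys)
print(best_split(demo_xs, demo_ys), predict(demo_tree, 0.2), predict(demo_tree, 0.8))
```

Unlike the Bayesian procedure, each split here is chosen greedily and never revisited, which is the contrast the later simulation slides explore.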

  21. Simulation Experiment (b) • True • Posterior Mean • CART

  22. Bagging • To “bag” an estimator you treat the estimator as a black box • Repeatedly, generate bootstrap resamples from the data set and run the estimator on these new “data sets.” • Average the resulting estimates
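
Bagging as described on this slide can be sketched generically. In this illustration the "black box" is any function that takes data and returns a fitted predictor; `median_stump` is a hypothetical toy estimator standing in for CART, and `n_boot` is an assumed parameter name.

```python
import random

def bag(estimator, xs, ys, n_boot=50, seed=0):
    """Fit `estimator` on bootstrap resamples and return the averaged predictor."""
    rng = random.Random(seed)
    n = len(xs)
    fits = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]   # resample n points with replacement
        fits.append(estimator([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: sum(f(x) for f in fits) / len(fits)

def median_stump(xs, ys):
    """Toy black-box estimator: mean response on each side of the sample median."""
    cut = sorted(xs)[len(xs) // 2]
    overall = sum(ys) / len(ys)
    left = [y for x, y in zip(xs, ys) if x < cut]
    right = [y for x, y in zip(xs, ys) if x >= cut]   # never empty: cut is a data point
    left_mean = sum(left) / len(left) if left else overall
    right_mean = sum(right) / len(right)
    return lambda x: left_mean if x < cut else right_mean

bag_xs = [i / 100 for i in range(100)]
bag_ys = [int(x >= 0.5) for x in bag_xs]
bagged = bag(median_stump, bag_xs, bag_ys)
print(bagged(0.1), bagged(0.5), bagged(0.9))
```

Because the cut point varies across resamples, the averaged predictor is smoother near the jump than any single stump.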

  23. Simulation Experiment (c) • True • Posterior Mean • CART • Bagged CART: Full Trees

  24. Simulation Experiment (d) • True • Posterior Mean • CART • Bagged CART: cp=0.005

  25. Simulation Experiment (e) • True • Posterior Mean • CART • Bagged CART: cp=0.01

  26. Simulations 2-10

  27. CART Bagged CART: cp=0.01

  28. Bagged Bayes??

  29. Smoothers?

  30. Boosting? (Lasso Stumps)

  31. Dyadic Bayes [Diaconis, Freedman]

  32. Monotone Invariance?

  33. Bayesian Consistency • Consistent at f0 if: the posterior probability of Nε tends to 1 a.s. for any ε > 0 • Since all f are bounded in L1, consistency implies a fortiori that: [formula not preserved in transcript]
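
The conclusion of this slide is missing from the transcript. A plausible reading is that the posterior mean converges to f0 in L1; the short argument below is a sketch under that assumption, with the neighborhood taken to be an L1 ball (the slide does not say which neighborhood is meant).

```latex
% Assumptions (not from the slide): N_\epsilon = \{ f : \| f - f_0 \|_1 < \epsilon \},
% \pi_n is the posterior, and \hat f_n = \int f \, d\pi_n(f) is the posterior mean.
\begin{align*}
\| \hat f_n - f_0 \|_1
  &\le \int \| f - f_0 \|_1 \, d\pi_n(f) \\
  &\le \epsilon \, \pi_n(N_\epsilon) + \pi_n(N_\epsilon^{c}),
\end{align*}
% using \| f - f_0 \|_1 \le 1 for functions from [0,1] into [0,1]. Hence
% \pi_n(N_\epsilon) \to 1 a.s. for every \epsilon > 0 gives \| \hat f_n - f_0 \|_1 \to 0 a.s.
```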

  34. Sample Size 8192

  35. Related Work • Diaconis and Freedman (1995) • DF: K ~ π; given K=k, split into 2^k equal pieces (figure shows k=3) • Similar hierarchical prior, but: • Aggressive splitting • Fixed split points • Strong Results: • If π dies off at a specific geometric rate • Consistency for all f0 • If π dies off just slower than this • Posterior will be inconsistent at f0=1/2 • Consistency results cannot be taken for granted

  36. Consistency Theorem: Thesis • If (Xi, Yi), i = 1..n, are drawn iid via X ~ U(0,1), Y | X=x ~ Bernoulli(f0(x)) • And if π is the specified prior on f, chosen so that the tails of the prior on hierarchy level M decay like exp(-n log(n)) • Then πn, the posterior, is a consistent estimate of f0, for any measurable f0.

  37. Method of Proof • Barron, Schervish, Wasserman (1999) • Need to show: • Lemma 1: Prior puts positive mass on all Kullback-Leibler information neighborhoods of f0 • Choose sieves: Fn = {f: f has no more than n/log(n) splits} • Lemma 2: The ε-upper metric entropy of Fn is o(n) • Lemma 3: π(Fn^c) decays exponentially

  38. New Result • Coram and Lalley 2004/5 (hopefully) • Consistency holds for any prior with infinite support, if the true function is not identically ½. • Consistency for the ½ case depends on the tail decay* • Proof revolves around a large-deviation question: • How does predictive probability behave as n → ∞ for a model with m = an splits? (0 < a < ∞) • Proof uses subadditive ergodic theorem to take advantage of self-similarity in the problem

  39. A Guessing Game • With probability 1/2: flip a fair coin repeatedly • With probability 1/2: pick p in [0,1] at random, then flip that p-coin repeatedly
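
Reading the slide as a two-hypothesis game (was the sequence produced by the fair coin or by a coin with uniformly random p?), the posterior probability of the fair coin after n flips has a closed form. The sketch below works it out; the function names are hypothetical and the equal prior weights are taken from the "1/2 1/2" labels on the slide.

```python
from math import exp, lgamma, log

def log_evidence_fair(n, heads):
    """log P(sequence | fair coin) = n * log(1/2), whatever the number of heads."""
    return -n * log(2.0)

def log_evidence_random_p(n, heads):
    """log of the integral of p^heads (1-p)^(n-heads) over [0,1], i.e. log B(heads+1, n-heads+1)."""
    return lgamma(heads + 1) + lgamma(n - heads + 1) - lgamma(n + 2)

def posterior_fair(n, heads):
    """P(fair coin | data) with prior weight 1/2 on each branch of the game."""
    a = log_evidence_fair(n, heads)
    b = log_evidence_random_p(n, heads)
    top = max(a, b)                       # subtract the max for numerical stability
    return exp(a - top) / (exp(a - top) + exp(b - top))

for n in (64, 1024, 8192):
    print(n, posterior_fair(n, n // 2), posterior_fair(n, (3 * n) // 4))
```

With exactly half heads the evidence for the fair coin grows only slowly with n, which is one way to see why the f0 = 1/2 case is delicate.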

  40. 64

  41. 128

  42. 256

  43. 512

  44. 1024

  45. 2048

  46. 4096

  47. 8192

  48. A Voronoi Prior for [0,1]^d [figure: five centers V1 to V5 and the Voronoi cells 1 to 5 they induce]

  49. A Modified Voronoi Prior for General Spaces • Choose M, as before • Draw V=(V1, V2, … Vk) • With each Vj drawn without replacement from an a-priori fixed set A • In practice, I take A={X1, …, Xn} • This approximates drawing the V’s from the marginal dist of X
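
The construction on this slide needs only pairwise distances, which the following sketch makes concrete. It is an illustration under stated assumptions: the relation between M and the number of centers k is not given on the slide, so k = M + 1 (capped at n) is assumed here, and the function name `sample_voronoi_partition` is hypothetical.

```python
import random

def sample_voronoi_partition(D, rng):
    """Draw centers from the data and assign every point to its nearest center.

    D is an n-by-n matrix (list of lists) of pairwise distances.
    """
    n = len(D)
    m = 0
    while rng.random() < 0.5:             # P(M = m) = 2^-(m+1), as in the 1-d prior
        m += 1
    k = min(m + 1, n)                     # assumption: M jumps correspond to M+1 cells
    centers = rng.sample(range(n), k)     # drawn without replacement from A = {X1..Xn}
    labels = [min(range(k), key=lambda c: D[i][centers[c]]) for i in range(n)]
    return centers, labels

points = [0.0, 0.1, 0.2, 0.8, 0.9, 1.0]
D = [[abs(a - b) for b in points] for a in points]
centers, labels = sample_voronoi_partition(D, random.Random(3))
print(centers, labels)
```

Since only `D` is consulted, the same code applies to any metric space for which a distance matrix can be computed.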

  50. Discussion • CON: • Not quite Bayesian • A depends on the data • PRO: • Only partitions the relevant subspace • Applies in general metric spaces • Only depends on D, the pairwise distance matrix • Intuitive Content
