
Entropy Estimation and Applications to Decision Trees



  1. Entropy Estimation and Applications to Decision Trees

  2. Estimation • Distribution over K=8 classes • Repeat 50,000 times: generate N samples, estimate the entropy from the samples • True entropy H=1.289 • [Figure: histograms of the entropy estimates for N=10, N=100, and N=50,000]
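
  A minimal sketch of this experiment in Python (the slide's exact 8-class distribution with H=1.289 is not given, so the probability vector below is a hypothetical stand-in); the printed mean, std, and bias connect to the goals on the next slide:

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical 8-class distribution; the slide's exact probabilities
    # (with true entropy H = 1.289) are not given in the transcript.
    p = np.array([0.35, 0.25, 0.15, 0.10, 0.06, 0.04, 0.03, 0.02])
    H_true = -np.sum(p * np.log(p))  # true entropy in nats

    repeats = 50_000
    for N in (10, 100, 50_000):
        counts = rng.multinomial(N, p, size=repeats)   # (repeats, 8) histograms
        q = counts / N
        with np.errstate(divide="ignore", invalid="ignore"):
            terms = np.where(counts > 0, q * np.log(q), 0.0)
        H_hat = -terms.sum(axis=1)                     # plug-in estimates
        print(f"N={N:6d}  mean={H_hat.mean():.3f}  std={H_hat.std():.3f}  "
              f"bias={H_hat.mean() - H_true:+.3f}")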

  3. Estimation • Estimating the true entropy • Goals: 1. Consistency: a large N guarantees the correct result 2. Low variance: variation of the estimates is small 3. Low bias: the expected estimate should be correct

  4. Discrete Entropy Estimators

  5. Experimental Results • UCI classification data sets • Accuracy on the test set • Plugin vs. Grassberger entropy estimators • Better estimators yield better trees Source: [Nowozin, “Improved Information Gain Estimates for Decision Tree Induction”, ICML 2012]
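
  A sketch of the two estimators compared above, assuming the Grassberger (2003) digamma-corrected form used in the Nowozin paper; function names are mine:

    import numpy as np
    from scipy.special import psi  # digamma function

    def entropy_plugin(counts):
        """Plug-in (maximum likelihood) entropy estimate; biased downwards."""
        counts = np.asarray(counts, dtype=float)
        q = counts[counts > 0] / counts.sum()
        return -np.sum(q * np.log(q))

    def entropy_grassberger(counts):
        """Grassberger (2003) bias-corrected entropy estimate.

        H ~ log N - (1/N) * sum_i h_i * G(h_i), with
        G(h) = psi(h) + 0.5 * (-1)**h * (psi((h+1)/2) - psi(h/2)).
        """
        h = np.asarray(counts, dtype=float)
        h = h[h > 0]
        N = h.sum()
        G = psi(h) + 0.5 * (-1.0) ** h * (psi((h + 1) / 2) - psi(h / 2))
        return np.log(N) - np.sum(h * G) / N

    counts = np.array([6, 2, 1, 1, 0, 0, 0, 0])  # small-sample class histogram
    print(entropy_plugin(counts), entropy_grassberger(counts))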

  6. Differential Entropy Estimation • In regression, the differential entropy H(q) = -∫ q(y) log q(y) dy • measures the remaining uncertainty about y • is a function of the distribution q • Problem: q is not from a parametric family • Solution 1: project onto a parametric family • Solution 2: non-parametric entropy estimation

  7. Solution 1: parametric family • Multivariate Normal distribution • Estimate the covariance matrix Σ̂ of all y vectors • Plugin estimate of the entropy: Ĥ = ½ log det(2πe Σ̂) • A uniform minimum variance unbiased estimator (UMVUE) also exists [Ahmed, Gokhale, “Entropy expressions and their estimators for multivariate distributions”, IEEE Trans. Inf. Theory, 1989]
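
  A minimal sketch of the plug-in estimate under the Normal model, using the closed-form Gaussian entropy H = ½ log det(2πe Σ); the UMVUE correction of Ahmed & Gokhale is not included, and function names are mine:

    import numpy as np

    def gaussian_entropy_plugin(Y):
        """Plug-in differential entropy under a multivariate Normal model.

        Fits the sample covariance of the rows of Y (shape (n, d)) and
        returns H = 0.5 * log det(2*pi*e * Sigma_hat), in nats.
        """
        Y = np.asarray(Y, dtype=float)
        Sigma = np.cov(Y, rowvar=False)  # sample covariance, shape (d, d)
        _, logdet = np.linalg.slogdet(2 * np.pi * np.e * np.atleast_2d(Sigma))
        return 0.5 * logdet

    rng = np.random.default_rng(0)
    Y = rng.multivariate_normal(mean=[0, 0], cov=[[2.0, 0.3], [0.3, 1.0]], size=5000)
    print(gaussian_entropy_plugin(Y))  # close to the true Gaussian entropy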

  8. Solution 1: parametric family

  9. Solution 1: parametric family

  10. Solution 2: Non-parametric entropy estimation • Minimal assumptions on the distribution • Nearest-neighbour estimate (Kozachenko–Leonenko): Ĥ = (d/n) Σᵢ log ρᵢ + log V_d + γ + log(n−1), where ρᵢ is the nearest-neighbour distance of sample i, γ is the Euler–Mascheroni constant, and V_d is the volume of the d-dimensional unit hypersphere • Other estimators: KDE, spanning tree, k-NN, etc. [Kozachenko, Leonenko, “Sample estimate of the entropy of a random vector”, Probl. Peredachi Inf., 1987] [Beirlant, Dudewicz, Győrfi, van der Meulen, “Nonparametric entropy estimation: An overview”, 2001] [Wang, Kulkarni, Verdú, “Universal estimation of information measures for analog sources”, FnT Comm. Inf. Th., 2009]
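
  A minimal sketch of the Kozachenko–Leonenko estimator in the form above (scipy's cKDTree is used for the nearest-neighbour search; function names are mine):

    import numpy as np
    from scipy.special import gammaln
    from scipy.spatial import cKDTree

    def entropy_kl(Y):
        """Kozachenko-Leonenko nearest-neighbour entropy estimate (in nats).

        H_hat = (d/n) * sum_i log rho_i + log V_d + gamma + log(n - 1),
        where rho_i is the distance from sample i to its nearest neighbour
        and V_d is the volume of the d-dimensional unit ball.
        """
        Y = np.asarray(Y, dtype=float)
        n, d = Y.shape
        tree = cKDTree(Y)
        # k=2: the first neighbour of each point is the point itself (distance 0).
        dist, _ = tree.query(Y, k=2)
        rho = dist[:, 1]
        log_Vd = (d / 2) * np.log(np.pi) - gammaln(d / 2 + 1)
        return (d / n) * np.sum(np.log(rho)) + log_Vd + np.euler_gamma + np.log(n - 1)

    rng = np.random.default_rng(0)
    Y = rng.normal(size=(5000, 2))  # true entropy: log(2*pi*e) ~ 2.838 nats
    print(entropy_kl(Y))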

  11. Solution 2: Non-parametric estimation

  12. Experimental Results [Nowozin, “Improved Information Gain Estimates for Decision Tree Induction”, ICML 2012]

  13. Streaming Decision Trees

  14. Streaming Data • “Infinite data” setting • [Figure: 10 possible splits and their scores as samples arrive] • When to stop and make a decision?

  15. Streaming Decision Trees • Score splits on a subset of samples only • Domingos/Hulten (Hoeffding Trees), 2000: compute the sample count n for a given precision; streaming decision tree induction; incorrect confidence intervals, but works well in practice • Jin/Agrawal, 2003: tighter confidence interval, asymptotic derivation using the delta method • Loh/Nowozin, 2013: racing algorithm (bad splits are removed early); finite-sample confidence intervals for entropy and Gini [Domingos, Hulten, “Mining High-Speed Data Streams”, KDD 2000] [Jin, Agrawal, “Efficient Decision Tree Construction on Streaming Data”, KDD 2003] [Loh, Nowozin, “Faster Hoeffding racing: Bernstein races via jackknife estimates”, ALT 2013]
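
  A minimal sketch of the Domingos/Hulten stopping rule: split once the gap between the best and second-best split scores exceeds the Hoeffding bound ε = √(R² ln(1/δ) / (2n)). Variable names and the example numbers are mine:

    import math

    def hoeffding_bound(R, delta, n):
        """Hoeffding bound: with prob. >= 1 - delta, the sample mean of n
        observations of a variable with range R is within eps of its true mean."""
        return math.sqrt(R * R * math.log(1.0 / delta) / (2.0 * n))

    def should_split(best_gain, second_gain, n, n_classes, delta=1e-6):
        """Split when the observed gap between the best and second-best split
        scores exceeds the Hoeffding bound, so the ranking is unlikely to flip.
        The information gain has range R = log2(n_classes) (in bits)."""
        R = math.log2(n_classes)
        eps = hoeffding_bound(R, delta, n)
        return (best_gain - second_gain) > eps

    # After n=50,000 samples, two candidate splits with estimated infogains:
    print(should_split(best_gain=0.32, second_gain=0.25, n=50_000, n_classes=8))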

  16. Multivariate Delta Method • Theorem. Let T_n be a sequence of k-dimensional random vectors such that √n (T_n − θ) →d N(0, Σ). Let g: R^k → R^m be once differentiable at θ with gradient matrix ∇g(θ). Then √n (g(T_n) − g(θ)) →d N(0, ∇g(θ)ᵀ Σ ∇g(θ)). [DasGupta, “Asymptotic Theory of Statistics and Probability”, Springer, 2008]

  17. Delta Method for the Information Gain • 8 classes, 2 choices (left/right) • π_{s,i}: probability of choice s, class i • I(π): the mutual information (infogain) between choice and class • Multivariate delta method: for g = I we have that √n (I(π̂) − I(π)) →d N(0, ∇I(π)ᵀ Σ ∇I(π)) • Derivation lengthy but not difficult; a slight generalization of Jin & Agrawal [Small, “Expansions and Asymptotics for Statistics”, CRC, 2010] [DasGupta, “Asymptotic Theory of Statistics and Probability”, Springer, 2008]
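
  As a concrete special case (my derivation, not from the slides), applying the theorem to the plug-in entropy of a single multinomial gives a closed-form asymptotic variance:

    % Delta method applied to the plug-in entropy of a single multinomial
    % (illustrative special case; the full infogain derivation is analogous).
    \[
      H(p) = -\sum_{i=1}^{K} p_i \log p_i,
      \qquad
      \frac{\partial H}{\partial p_i} = -(1 + \log p_i).
    \]
    % With \hat{p} the empirical frequencies of n draws,
    % \sqrt{n}(\hat{p} - p) \to N(0, \mathrm{diag}(p) - p p^\top), and the theorem gives
    \[
      \sqrt{n}\,\bigl(H(\hat{p}) - H(p)\bigr)
      \;\xrightarrow{d}\;
      N\!\Bigl(0,\ \sum_{i=1}^{K} p_i \log^2 p_i - H(p)^2\Bigr).
    \]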

  18. Delta Method Example • As n → ∞, with the true distribution π fixed, the sampling distribution of the infogain estimate I(π̂) approaches the normal delta-method limit above.
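
  A small simulation (my example, with a hypothetical 4-class distribution) checking that the empirical spread of the plug-in entropy matches the delta-method prediction:

    import numpy as np

    rng = np.random.default_rng(0)

    p = np.array([0.4, 0.3, 0.2, 0.1])            # hypothetical 4-class distribution
    H = -np.sum(p * np.log(p))
    sigma2 = np.sum(p * np.log(p) ** 2) - H ** 2  # delta-method asymptotic variance

    n, repeats = 10_000, 20_000
    counts = rng.multinomial(n, p, size=repeats)
    q = counts / n
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(counts > 0, q * np.log(q), 0.0)
    H_hat = -terms.sum(axis=1)

    print("empirical std :", H_hat.std())
    print("delta method  :", np.sqrt(sigma2 / n))  # should closely agree for large n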

  19. Conclusion on Entropy Estimation • Entropy estimation is a statistical problem • A large body of literature exists on entropy estimation • Better estimators yield better decision trees • The distribution of the estimate is relevant in the streaming setting
