Entropy Estimation and Applications to Decision Trees
Estimation
• Distribution over K=8 classes, true entropy H=1.289
• Repeat 50,000 times: generate N samples, estimate the entropy from the samples
• [Figure: histograms of the entropy estimates for N=10, N=100, N=50000]
Estimation
Estimating the true entropy. Goals:
1. Consistency: with large N, the estimate converges to the true entropy
2. Low variance: the variation of the estimates should be small
3. Low bias: the expected estimate should equal the true entropy
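A minimal sketch of the simulation described above, assuming the plugin (maximum-likelihood) entropy estimator; the class distribution, the number of repetitions, and all names are illustrative rather than taken from the slide.

```python
import numpy as np

def plugin_entropy(counts):
    """Plugin (maximum-likelihood) entropy estimate from class counts, in nats."""
    p = counts[counts > 0] / counts.sum()
    return -np.sum(p * np.log(p))

rng = np.random.default_rng(0)
K = 8
true_p = rng.dirichlet(np.ones(K))            # some fixed distribution over K classes
true_H = -np.sum(true_p * np.log(true_p))

for N in (10, 100, 50000):
    estimates = []
    for _ in range(2000):                     # fewer repetitions than the 50,000 on the slide
        samples = rng.choice(K, size=N, p=true_p)
        counts = np.bincount(samples, minlength=K)
        estimates.append(plugin_entropy(counts))
    estimates = np.asarray(estimates)
    print(f"N={N}: bias={estimates.mean() - true_H:+.4f}, std={estimates.std():.4f}")
```

The plugin estimator systematically underestimates the entropy for small N; the shrinking bias and variance as N grows is the pattern the histograms on the slide illustrate.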
Experimental Results
• UCI classification data sets
• Accuracy on the test set
• Plugin vs. Grassberger entropy estimators
• Better decision trees
Source: [Nowozin, “Improved Information Gain Estimates for Decision Tree Induction”, ICML 2012]
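For reference, a sketch of one common form of the Grassberger (2003) bias-corrected estimator that the Plugin vs. Grassberger comparison refers to; this is my reading of the correction term, not code from the cited paper, and the demo counts are made up.

```python
import numpy as np
from scipy.special import digamma

def grassberger_entropy(counts):
    """Grassberger-style bias-corrected entropy estimate from class counts, in nats."""
    counts = np.asarray(counts, dtype=float)
    counts = counts[counts > 0]               # empty classes contribute nothing
    n = counts.sum()
    # Correction term G(n_k) = psi(n_k) + (1/2) * (-1)^{n_k} * (psi((n_k+1)/2) - psi(n_k/2))
    g = digamma(counts) + 0.5 * (-1.0) ** counts * (
        digamma((counts + 1.0) / 2.0) - digamma(counts / 2.0))
    return np.log(n) - np.sum(counts * g) / n

# Tiny demo on made-up counts; it can be swapped in for plugin_entropy in the sketch above.
print(grassberger_entropy([3, 1, 1, 0, 2, 0, 1, 2]))
```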
Differential Entropy Estimation
• In regression, the differential entropy $H(q) = -\int q(y)\,\log q(y)\,dy$
  • measures the remaining uncertainty about y
  • is a function of the distribution q
• Problem: q is not from a parametric family
• Solution 1: project onto a parametric family
• Solution 2: non-parametric entropy estimation
Solution 1: parametric family
• Multivariate Normal distribution
• Estimate the covariance matrix $\hat{\Sigma}$ of all y vectors
• Plugin estimate of the entropy: $\hat{H} = \tfrac{1}{2}\log\!\big((2\pi e)^d\,|\hat{\Sigma}|\big)$
• Uniform minimum variance unbiased estimator (UMVUE)
[Ahmed, Gokhale, “Entropy expressions and their estimators for multivariate distributions”, IEEE Trans. Inf. Theory, 1989]
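A minimal sketch of the Gaussian plugin estimate described above (the sample covariance plugged into the closed-form Gaussian entropy); the UMVUE correction of Ahmed & Gokhale is not included, and the function name is made up.

```python
import numpy as np

def gaussian_plugin_entropy(Y):
    """Plugin differential entropy estimate in nats, assuming Y is multivariate normal.

    Y is an (N, d) array of regression targets; the sample covariance is plugged into
    H = 0.5 * log((2*pi*e)^d * det(Sigma)).
    """
    Y = np.atleast_2d(Y)
    n, d = Y.shape
    cov = np.cov(Y, rowvar=False).reshape(d, d)   # sample covariance matrix
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * (d * np.log(2.0 * np.pi * np.e) + logdet)

# Demo: a standard 2-D Gaussian has differential entropy log(2*pi*e) ≈ 2.838 nats.
rng = np.random.default_rng(0)
print(gaussian_plugin_entropy(rng.standard_normal((1000, 2))))
```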
Solution 2: Non-parametric entropy estimation
• Minimal assumptions on the distribution
• Nearest-neighbour estimate: $\hat{H} = \tfrac{d}{N}\sum_{i=1}^{N}\log\rho_i + \log V_d + \gamma + \log(N-1)$
  • $\rho_i$: nearest-neighbour distance of sample $i$
  • $\gamma$: Euler-Mascheroni constant
  • $V_d$: volume of the d-dimensional unit hypersphere
• Other estimators: KDE, spanning tree, k-NN, etc.
[Kozachenko, Leonenko, “Sample estimate of the entropy of a random vector”, Probl. Peredachi Inf., 1987]
[Beirlant, Dudewicz, Győrfi, van der Meulen, “Nonparametric entropy estimation: An overview”, 2001]
[Wang, Kulkarni, Verdú, “Universal estimation of information measures for analog sources”, FnT Comm. Inf. Th., 2009]
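A sketch of the 1-nearest-neighbour estimator in the Kozachenko-Leonenko style described above; natural-log units, the function name is made up, and duplicate points (zero nearest-neighbour distance) are not handled.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import gammaln

def nn_entropy(Y):
    """1-NN differential entropy estimate in nats, Kozachenko-Leonenko style."""
    Y = np.atleast_2d(Y)
    n, d = Y.shape
    # Distance from each point to its nearest neighbour (excluding the point itself).
    dist, _ = cKDTree(Y).query(Y, k=2)
    rho = dist[:, 1]                                              # assumes rho > 0
    log_vd = (d / 2.0) * np.log(np.pi) - gammaln(d / 2.0 + 1.0)   # log volume of unit d-ball
    euler_gamma = 0.5772156649015329                              # Euler-Mascheroni constant
    return d * np.mean(np.log(rho)) + log_vd + euler_gamma + np.log(n - 1)

# Demo: a standard 2-D Gaussian has differential entropy log(2*pi*e) ≈ 2.838 nats.
rng = np.random.default_rng(0)
print(nn_entropy(rng.standard_normal((2000, 2))))
```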
Experimental Results [Nowozin, “Improved Information Gain Estimates for Decision Tree Induction”, ICML 2012]
Streaming Data
• “Infinite data” setting
• 10 possible splits and their scores
• When to stop and make a decision?
Streaming Decision Trees
• Score splits on a subset of the samples only
• Domingos/Hulten (Hoeffding Trees), 2000:
  • Compute the sample count n needed for a given precision (see the sketch below)
  • Streaming decision tree induction
  • Confidence intervals are not strictly valid, but work well in practice
• Jin/Agrawal, 2003:
  • Tighter confidence interval, asymptotic derivation using the delta method
• Loh/Nowozin, 2013:
  • Racing algorithm (bad splits are removed early)
  • Finite-sample confidence intervals for entropy and Gini
[Domingos, Hulten, “Mining High-Speed Data Streams”, KDD 2000]
[Jin, Agrawal, “Efficient Decision Tree Construction on Streaming Data”, KDD 2003]
[Loh, Nowozin, “Faster Hoeffding racing: Bernstein races via jackknife estimates”, ALT 2013]
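A sketch of the Hoeffding-style stopping rule referenced above: after n samples, the gap between the best and second-best split score is compared to the Hoeffding bound for a quantity with range R at confidence 1 - delta. The function names are made up, and (as the slide notes) the resulting intervals are not strictly valid for the information gain, though they work well in practice.

```python
import math

def hoeffding_bound(value_range, delta, n):
    """Hoeffding epsilon: with prob. >= 1 - delta, a sample mean of a quantity
    with the given range deviates from its true mean by less than epsilon."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

def should_split(best_score, second_best_score, value_range, delta, n):
    """Commit to the current best split once the observed score gap exceeds epsilon."""
    return (best_score - second_best_score) > hoeffding_bound(value_range, delta, n)

# Example: the entropy over 8 classes has range log2(8) = 3 bits.
print(should_split(0.42, 0.35, value_range=3.0, delta=1e-6, n=5000))
```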
Multivariate Delta Method
Theorem. Let $\{T_n\}$ be a sequence of $k$-dimensional random vectors such that $\sqrt{n}\,(T_n - \theta) \xrightarrow{d} N(0, \Sigma)$. Let $g: \mathbb{R}^k \to \mathbb{R}^m$ be once differentiable at $\theta$ with gradient matrix $\nabla g(\theta)$. Then
$\sqrt{n}\,\big(g(T_n) - g(\theta)\big) \xrightarrow{d} N\!\big(0,\; \nabla g(\theta)^\top \Sigma\, \nabla g(\theta)\big)$.
[DasGupta, “Asymptotic Theory of Statistics and Probability”, Springer, 2008]
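A one-dimensional instance of the theorem, worked out as an illustration (the plugin binary entropy of a Bernoulli proportion); this example is added here and does not appear on the slides.

```latex
\[
  \sqrt{n}\,(\hat{p} - p) \xrightarrow{d} N\!\big(0,\; p(1-p)\big), \qquad
  g(p) = -p\log p - (1-p)\log(1-p), \qquad
  g'(p) = \log\tfrac{1-p}{p},
\]
\[
  \text{so}\quad
  \sqrt{n}\,\big(g(\hat{p}) - g(p)\big) \xrightarrow{d}
  N\!\Big(0,\; p(1-p)\,\log^{2}\tfrac{1-p}{p}\Big).
\]
```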
Delta Method for the Information Gain
• 8 classes, 2 choices (left/right)
• $\pi_{s,i}$: probability of choice $s$, class $i$
• Multivariate delta method: for $g: \mathbb{R}^{16} \to \mathbb{R}$ we have that $\sqrt{n}\,\big(g(\hat{\pi}) - g(\pi)\big) \xrightarrow{d} N\!\big(0,\; \nabla g(\pi)^\top \Sigma\, \nabla g(\pi)\big)$
• $g(\pi) = I(S; C)$, the mutual information between split choice and class (infogain)
• Derivation lengthy but not difficult; a slight generalization of Jin & Agrawal
[Small, “Expansions and Asymptotics for Statistics”, CRC, 2010]
[DasGupta, “Asymptotic Theory of Statistics and Probability”, Springer, 2008]
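As a simpler, self-contained check of the same technique (the plugin entropy of a single multinomial rather than the full 16-dimensional information gain), the sketch below compares the delta-method variance prediction with a Monte Carlo estimate; the distribution and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.4, 0.3, 0.2, 0.1])        # an arbitrary illustrative class distribution
H = -np.sum(p * np.log(p))
n = 1000

# Delta method: the gradient of H w.r.t. p_i is -(log p_i + 1) and the covariance of the
# sample proportions is (diag(p) - p p^T) / n, which simplifies to the variance below.
delta_var = (np.sum(p * np.log(p) ** 2) - H ** 2) / n

estimates = []
for _ in range(20000):
    counts = rng.multinomial(n, p)
    q = counts[counts > 0] / n
    estimates.append(-np.sum(q * np.log(q)))

print("Monte Carlo variance:", np.var(estimates))
print("Delta-method variance:", delta_var)
```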
Delta Method Example
As $N \to \infty$ (the true distribution is fixed). [Figure]
Conclusion on Entropy Estimation
• Entropy estimation is a statistical estimation problem
• A large body of literature exists on entropy estimation
• Better estimators yield better decision trees
• The distribution of the estimate is relevant in the streaming setting