
The horseshoe estimator for sparse signals


  1. The horseshoe estimator for sparse signals. CARLOS M. CARVALHO, NICHOLAS G. POLSON, JAMES G. SCOTT. Biometrika (2010). Presented by Eric Wang, 10/14/2010

  2. Overview • This paper proposes a highly analytically tractable horseshoe estimator that is more robust and adaptive to different sparsity patterns than existing approaches. • Two theorems are proved characterizing the proposed estimator's tail robustness and demonstrating a super-efficient rate of convergence to the correct estimate of the sampling density in sparse situations. • The proposed estimator's performance is demonstrated using both real and simulated data. The authors show its answers correspond quite closely to those obtained by Bayesian model averaging.

  3. The horseshoe estimator • Consider a p-dimensional vector y | θ ~ N(θ, σ²I), where θ is sparse. The authors propose the following model for estimation and prediction: θi | λi ~ N(0, λi²τ²) and λi ~ C⁺(0, 1), where C⁺(0, a) denotes a standard half-Cauchy distribution on the positive reals with scale parameter a. • The name horseshoe prior arises from the observation that, for fixed values τ = σ = 1, the posterior mean is E(θi | y) = {1 − E(κi | y)} yi, where κi = 1/(1 + λi²) is the amount of shrinkage toward zero, a posteriori, and κi has a horseshoe-shaped Beta(1/2, 1/2) prior.
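As a concrete illustration, the following minimal NumPy sketch (my addition, fixing τ = σ = 1) draws from this hierarchy; the resulting θ behaves like a sparse signal, with most entries tiny and a few very large:

```python
import numpy as np

rng = np.random.default_rng(0)
p = 100_000

# Horseshoe hierarchy with tau = sigma = 1:
#   lambda_i ~ C+(0, 1),  theta_i | lambda_i ~ N(0, lambda_i^2),  y_i | theta_i ~ N(theta_i, 1)
lam = np.abs(rng.standard_cauchy(p))   # half-Cauchy local scales
theta = rng.normal(0.0, lam)           # signal draws: mostly near zero, a few huge
y = rng.normal(theta, 1.0)             # noisy observations

# The spike at zero and the heavy tail show up in the quantiles of |theta|
print(np.quantile(np.abs(theta), [0.25, 0.5, 0.9, 0.99, 0.999]))
```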

  4. The horseshoe estimator • The meaning of κi is as follows: κi ≈ 0 yields virtually no shrinkage and describes signals, while κi ≈ 1 yields near-total shrinkage and (hopefully) describes noise. • At right is the prior on the shrinkage coefficient κi: the Beta(1/2, 1/2) density, unbounded at both 0 and 1, whose horseshoe shape gives the estimator its name.
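A quick Monte Carlo check (my addition, using scipy.stats) that κ = 1/(1 + λ²) with λ ~ C⁺(0, 1) really follows the horseshoe-shaped Beta(1/2, 1/2) law:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
lam = np.abs(rng.standard_cauchy(200_000))
kappa = 1.0 / (1.0 + lam**2)

# Compare the empirical CDF of kappa with the Beta(1/2, 1/2) CDF
for q in (0.05, 0.25, 0.50, 0.75, 0.95):
    print(f"P(kappa <= {q:.2f}): empirical {np.mean(kappa <= q):.3f}"
          f" vs Beta(1/2, 1/2) {stats.beta(0.5, 0.5).cdf(q):.3f}")

# Roughly 20% of the mass falls in each extreme decile: the horseshoe shape
print(np.mean(kappa < 0.1), np.mean(kappa > 0.9))
```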

  5. The horseshoe density function • The horseshoe prior p(θ) lacks an analytic form, but very tight bounds are available: Theorem 1. The univariate horseshoe density satisfies the following: (a) lim θ→0 p(θ) = ∞; (b) for θ ≠ 0, (K/2) log(1 + 4/θ²) < p(θ) < K log(1 + 2/θ²), where K = 1/√(2π³). • Alternatively, it is possible to integrate over τ instead, though the dependence this induces among the components of θ causes more issues. Therefore the authors do not take this approach.
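A small quadrature check of Theorem 1's bounds (my addition), writing the horseshoe density as its half-Cauchy scale mixture of normals with τ = 1:

```python
import numpy as np
from scipy.integrate import quad

def horseshoe_density(theta):
    # p(theta) = int_0^inf N(theta | 0, lam^2) * (2/pi) / (1 + lam^2) dlam
    f = lambda lam: (np.exp(-theta**2 / (2 * lam**2))
                     / np.sqrt(2 * np.pi * lam**2)
                     * 2.0 / (np.pi * (1.0 + lam**2)))
    return quad(f, 0, np.inf)[0]

K = 1.0 / np.sqrt(2 * np.pi**3)
for theta in (0.1, 0.5, 1.0, 2.0, 5.0):
    p_theta = horseshoe_density(theta)
    lower = (K / 2) * np.log(1 + 4 / theta**2)
    upper = K * np.log(1 + 2 / theta**2)
    print(f"theta={theta}: {lower:.5f} < {p_theta:.5f} < {upper:.5f}")
```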

  6. Horseshoe estimator for sparse signals

  7. Review of similar methods • Scott & Berger (2006) studied the discrete mixture θi ~ w g(θi) + (1 − w) δ0, where w is the prior inclusion probability, g is a heavy-tailed density, and δ0 is a point mass at zero. • Tipping (2001) studied the Student-t prior, which is defined by an inverse-gamma mixing density on the variances, λi² ~ IG(a, b). • The double-exponential prior (Bayesian lasso) has an exponential mixing density on the variances λi².
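To make these scale-mixture constructions concrete, here is a short Monte Carlo sketch (my addition, with illustrative parameter choices): mixing a zero-mean normal over an inverse-gamma variance yields a Student-t, and mixing over an exponential variance yields a double exponential.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 200_000

# Student-t via inverse-gamma mixing: lam2 ~ IG(a/2, a/2)  =>  theta ~ t_a
a = 3.0
lam2 = stats.invgamma(a / 2, scale=a / 2).rvs(n, random_state=rng)
theta_t = rng.normal(0.0, np.sqrt(lam2))
print(stats.kstest(theta_t, stats.t(df=a).cdf).statistic)        # close to 0

# Double exponential via exponential mixing: lam2 ~ Exp(rate b^2/2)
#   =>  theta ~ Laplace(0, 1/b)   (Andrews & Mallows, 1974)
b = 1.0
lam2 = rng.exponential(scale=2.0 / b**2, size=n)
theta_de = rng.normal(0.0, np.sqrt(lam2))
print(stats.kstest(theta_de, stats.laplace(scale=1.0 / b).cdf).statistic)  # close to 0
```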

  8. Review of similar methods • The normal-Jeffreys prior is an improper prior induced by placing Jeffreys' prior p(λi²) ∝ 1/λi² on each variance term, leading to p(θi) ∝ 1/|θi|. This choice is commonly used in the absence of a global scale parameter. • The Strawderman-Berger prior does not have an analytic form, but arises from assuming θi | κi ~ N(0, κi⁻¹ − 1), with κi ~ Beta(1/2, 1). • The normal-exponential-gamma family of priors generalizes the lasso specification by using a gamma density to mix over the exponential rate parameter.
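The Strawderman-Berger construction parallels the horseshoe's shrinkage-weight representation: both draw κ from a Beta law and set the prior variance to κ⁻¹ − 1. A minimal sketch of the two side by side (my addition):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

# Strawderman-Berger: kappa ~ Beta(1/2, 1), theta | kappa ~ N(0, 1/kappa - 1)
kappa_sb = rng.beta(0.5, 1.0, size=n)
theta_sb = rng.normal(0.0, np.sqrt(1.0 / kappa_sb - 1.0))

# Horseshoe: identical recipe, but kappa ~ Beta(1/2, 1/2)
kappa_hs = rng.beta(0.5, 0.5, size=n)
theta_hs = rng.normal(0.0, np.sqrt(1.0 / kappa_hs - 1.0))

# Both are heavy-tailed; the horseshoe puts more prior mass near total shrinkage
for name, th in (("Strawderman-Berger", theta_sb), ("horseshoe", theta_hs)):
    print(name, np.quantile(np.abs(th), [0.5, 0.9, 0.99]))
```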

  9. Review of similar methods [Table comparing the priors above on two criteria: tail robustness of the prior (handling of large signals) and shrinkage of noise.]

  10. Robustness to large signals • Theorem 2. Let p(y | θ) be the likelihood, and suppose that the prior p(θ) is a zero-mean scale mixture of normals, θ | λ² ~ N(0, λ²), with λ² having proper prior p(λ²). Assume further that the likelihood and p(λ²) are such that the marginal density m(y) = ∫ p(y | θ) p(θ) dθ is finite for all y. The theorem then expresses the predictive score d/dy log m(y) in terms of three pseudo-densities, which may be improper, constructed from the likelihood and the mixing density.

  11. Robustness to large signals • If p(y | θ) is a Gaussian likelihood, then the result of Theorem 2 reduces to E(θ | y) = y + d/dy log m(y). • A key consequence of Theorem 2 is that if the prior on θ is chosen such that the derivative of the log prior density is bounded, then the derivative of the log predictive density is also bounded and decays to 0 for large |y|. This happens for heavy-tailed priors, including the proposed horseshoe prior. This yields E(θ | y) ≈ y for large |y|: large signals are left essentially unshrunk.
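A numerical check of this identity for the horseshoe prior (my addition; conditioning on λ makes both the predictive density and the posterior mean one-dimensional quadratures, since y | λ ~ N(0, 1 + λ²) and E(θ | y, λ) = y λ²/(1 + λ²) when τ = σ = 1):

```python
import numpy as np
from scipy.integrate import quad

def m(y):
    # predictive density: m(y) = int N(y | 0, 1 + lam^2) * (2/pi)/(1 + lam^2) dlam
    f = lambda l: (np.exp(-y**2 / (2 * (1 + l**2))) / np.sqrt(2 * np.pi * (1 + l**2))
                   * 2.0 / (np.pi * (1.0 + l**2)))
    return quad(f, 0, np.inf)[0]

def post_mean(y):
    # E(theta | y) = E[ y * lam^2 / (1 + lam^2) | y ], averaging over p(lam | y)
    f = lambda l: (y * l**2 / (1 + l**2)
                   * np.exp(-y**2 / (2 * (1 + l**2))) / np.sqrt(2 * np.pi * (1 + l**2))
                   * 2.0 / (np.pi * (1.0 + l**2)))
    return quad(f, 0, np.inf)[0] / m(y)

# Tweedie-style identity: E(theta | y) = y + d/dy log m(y), checked by differences
for y in (0.5, 2.0, 5.0, 10.0):
    h = 1e-4
    score = (np.log(m(y + h)) - np.log(m(y - h))) / (2 * h)
    print(f"y={y:5.1f}: direct {post_mean(y):8.4f}, y + score {y + score:8.4f}")
```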

  12. The horseshoe score function • Theorem 3. Suppose y ~ N(θ, 1) and let m(y) denote the predictive density under the horseshoe prior for known scale parameter τ, i.e. m(y) = ∫ p(y | θ) pτ(θ) dθ with pτ the horseshoe prior with global scale τ. Then the predictive score d/dy log m(y) is bounded in absolute value by a constant that depends upon τ. • Corollary: lim |y|→∞ d/dy log m(y) = 0, and hence E(θ | y) − y → 0 for large |y|. • Although the horseshoe prior has no analytic form, it does lead to a closed-form posterior mean expressed through Φ1, a degenerate hypergeometric function of two variables.

  13. Estimating τ • The conditional posterior distribution of the global scale τ takes a simple approximate form when the dimensionality p is large. • This yields an approximate posterior distribution for τ given the data. • If most observations are shrunk toward 0, then τ will be small with high probability: the global scale adapts to the overall sparsity level.
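This adaptive behavior of τ is easy to exhibit numerically. Below is a minimal Gibbs-sampler sketch (my addition; it uses the inverse-gamma augmentation of the half-Cauchy due to Makalic & Schmidt (2016), a later device not used in this paper): on pure-noise data, the posterior of τ concentrates on small values.

```python
import numpy as np

def horseshoe_gibbs(y, n_iter=2000, rng=None):
    """Gibbs sampler for y_i ~ N(theta_i, 1), theta_i ~ N(0, lam_i^2 tau^2),
    lam_i ~ C+(0, 1), tau ~ C+(0, 1), via lam_i^2 | nu_i ~ IG(1/2, 1/nu_i),
    nu_i ~ IG(1/2, 1) and the analogous (tau^2, xi) augmentation."""
    if rng is None:
        rng = np.random.default_rng()
    p = len(y)
    lam2, nu = np.ones(p), np.ones(p)
    tau2, xi = 1.0, 1.0
    taus = []
    for _ in range(n_iter):
        v = 1.0 / (1.0 + 1.0 / (lam2 * tau2))                 # var of theta_i | rest
        theta = rng.normal(v * y, np.sqrt(v))
        lam2 = 1.0 / rng.gamma(1.0, 1.0 / (1.0 / nu + theta**2 / (2 * tau2)))
        nu = 1.0 / rng.gamma(1.0, 1.0 / (1.0 + 1.0 / lam2))
        tau2 = 1.0 / rng.gamma((p + 1) / 2,
                               1.0 / (1.0 / xi + np.sum(theta**2 / lam2) / 2))
        xi = 1.0 / rng.gamma(1.0, 1.0 / (1.0 + 1.0 / tau2))
        taus.append(np.sqrt(tau2))
    return np.array(taus)

rng = np.random.default_rng(6)
pure_noise = rng.normal(0.0, 1.0, size=200)    # no signals at all
taus = horseshoe_gibbs(pure_noise, rng=rng)
print("posterior median of tau:", np.median(taus[500:]))   # well below 1
```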

  14. Comparison to double exponential
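The slide's figure is not reproduced here, but the comparison is easy to make numerically (my addition): under a double-exponential prior the posterior mean stays a roughly constant distance below a large observation, while under the horseshoe the gap vanishes, as Theorem 3's corollary predicts.

```python
import numpy as np
from scipy.integrate import quad

# Posterior-mean gap E(theta | y) - y under a N(theta, 1) likelihood

def gap_laplace(y, b=1.0):
    # double-exponential prior, p(theta) proportional to exp(-b |theta|)
    w = lambda t: np.exp(-(y - t)**2 / 2 - b * abs(t))
    num = quad(lambda t: t * w(t), -10, y + 10, limit=200)[0]
    den = quad(w, -10, y + 10, limit=200)[0]
    return num / den - y

def gap_horseshoe(y):
    # conditionally on lambda: y | lam ~ N(0, 1 + lam^2) and
    # E(theta | y, lam) = y lam^2 / (1 + lam^2); constants cancel in the ratio
    w = lambda l: np.exp(-y**2 / (2 * (1 + l**2))) / (1 + l**2) ** 1.5
    num = quad(lambda l: (y * l**2 / (1 + l**2)) * w(l), 0, np.inf)[0]
    den = quad(w, 0, np.inf)[0]
    return num / den - y

for y in (1.0, 2.5, 5.0, 10.0):
    print(f"y={y:4.1f}: Laplace gap {gap_laplace(y):+.3f}, "
          f"horseshoe gap {gap_horseshoe(y):+.3f}")
```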

  15. Super-efficient convergence • Theorem 4. Suppose the true sampling model is y ~ N(θ0, 1). Then: (1) For the predictive density under the horseshoe prior, the optimal rate of convergence of the Kullback-Leibler risk is super-efficient when θ0 = 0, with a rate involving a constant b; when θ0 ≠ 0, the optimal rate is the standard one. (2) Suppose q(θ) is any other prior density that is continuous, bounded above, and strictly positive on a neighborhood of the true value θ0. For the predictive density under q, the optimal rate of convergence, regardless of θ0, is the standard rate.

  16. Example - simulated data • Data generated from the sparse normal-means model yi ~ N(θi, 1), with most entries of θ equal to zero.
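A minimal simulation in this spirit (my addition; the sparsity level and signal size below are illustrative, not the slide's actual settings). Posterior means are computed by self-normalized importance sampling, drawing λ from its half-Cauchy prior and weighting by the marginal likelihood N(yi | 0, 1 + λ²):

```python
import numpy as np

rng = np.random.default_rng(4)
p, n_signals, signal_size = 200, 10, 7.0

theta = np.zeros(p)
theta[:n_signals] = signal_size        # a few large signals, the rest exact zeros
y = rng.normal(theta, 1.0)

# Horseshoe posterior means with tau = sigma = 1: average the conditional
# shrinkage factor lam^2/(1 + lam^2) over p(lam | y_i) by importance sampling
lam = np.abs(rng.standard_cauchy(5_000))
s = lam**2 / (1 + lam**2)              # shrinkage factor for each lambda draw
logw = -0.5 * np.log(1 + lam**2) - y[:, None]**2 / (2 * (1 + lam**2))
w = np.exp(logw - logw.max(axis=1, keepdims=True))
est = y * (w * s).sum(axis=1) / w.sum(axis=1)

print("MSE, horseshoe posterior mean:", np.mean((est - theta)**2))
print("MSE, raw observations y      :", np.mean((y - theta)**2))   # about 1
```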

  17. Example - Vanguard mutual-fund data • Here, the authors show how the horseshoe can provide a regularized estimate of a large covariance matrix whose inverse may be sparse. • The Vanguard mutual-fund dataset contains n = 86 weekly returns for p = 59 funds. • Suppose the observation matrix is Y (n × p), with each p-dimensional row drawn from a zero-mean Gaussian with covariance matrix Σ. • We will model the Cholesky decomposition of Σ.

  18. Example - Vanguard mutual-fund data • The goal is to estimate the ensemble of regression models in the implied triangular system, in which yj, the j-th column of Y, is regressed on the preceding columns y1, ..., yj−1. • The regression coefficients are assumed to have a horseshoe prior, and posterior means were computed using MCMC.
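A structural sketch of this triangular system (my addition): each column of Y is regressed on the preceding columns, and the coefficients and residual variances reassemble into an estimate of Σ through T Σ Tᵀ = D. Ordinary least squares stands in below for the horseshoe-prior MCMC that the authors actually use.

```python
import numpy as np

def triangular_cov_estimate(Y):
    """Estimate Sigma via the triangular system
    y_j = sum_{k<j} t_jk y_k + eps_j, eps_j ~ N(0, d_j)."""
    n, p = Y.shape
    T = np.eye(p)                   # unit lower-triangular; off-diagonals hold -t_jk
    d = np.empty(p)
    d[0] = Y[:, 0].var()
    for j in range(1, p):
        X, yj = Y[:, :j], Y[:, j]
        beta, *_ = np.linalg.lstsq(X, yj, rcond=None)  # horseshoe posterior means in the paper
        T[j, :j] = -beta
        d[j] = (yj - X @ beta).var()
    Tinv = np.linalg.inv(T)
    return Tinv @ np.diag(d) @ Tinv.T   # T Sigma T' = D  =>  Sigma = T^-1 D T^-T

rng = np.random.default_rng(5)
Sigma = np.array([[2.0, 0.8, 0.0, 0.0],
                  [0.8, 1.0, 0.3, 0.0],
                  [0.0, 0.3, 1.0, 0.0],
                  [0.0, 0.0, 0.0, 1.0]])
Y = rng.multivariate_normal(np.zeros(4), Sigma, size=500)
print(np.round(triangular_cov_estimate(Y), 2))   # close to Sigma
```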

  19. Conclusions • This paper introduces the horseshoe prior as a good default prior for sparse problems. • Empirically, the model performs similarly to Bayesian model averaging, the current standard. • The model exhibits strong global shrinkage and robust local adaptation to signals.
