
Score Function for Data Mining Algorithms



Presentation Transcript


  1. Score Function for Data Mining Algorithms. Based on Chapter 7 of Hand, Mannila & Smyth. David Madigan

  2. Introduction • e.g., how to pick the “best” a and b in Y = aX + b • the usual score in this case is the sum of squared errors (see the sketch below) • Scores for patterns versus scores for models • Predictive versus descriptive scores • Typical scores are a poor substitute for utility-based scores
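A minimal sketch of this scoring idea, using made-up synthetic data (the slope, intercept, and noise level below are assumptions for illustration): the sum-of-squared-errors score is written out explicitly, and the least-squares fit is the (a, b) pair that minimizes it.

```python
import numpy as np

# Hypothetical data: Y roughly linear in X, with made-up coefficients and noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=1.5, size=x.size)

def sse(a, b):
    """Sum-of-squared-errors score for the line y = a*x + b (lower is better)."""
    return np.sum((y - (a * x + b)) ** 2)

# The least-squares fit is exactly the (a, b) that minimizes this score.
a_hat, b_hat = np.polyfit(x, y, deg=1)
print(a_hat, b_hat, sse(a_hat, b_hat))
```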

  3. Predictive Model Scores • Assume all observations equally important • Depend on differences rather than values • Symmetric • Proper scoring rules
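As a small illustration of the last bullet (not from the slides; the predictions and outcomes are made up): two common proper scoring rules for probabilistic predictions, whose expected value is optimized by reporting the true probabilities.

```python
import numpy as np

# Hypothetical predicted probabilities and 0/1 outcomes.
p = np.array([0.9, 0.2, 0.7, 0.5])
y = np.array([1, 0, 1, 0])

# Brier score (squared error on probabilities) and log score, both proper scoring rules.
brier = np.mean((p - y) ** 2)
log_score = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
print(brier, log_score)
```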

  4. Descriptive Model Scores • “Pick the model that assigns highest probability to what actually happened” • Many different scoring rules for non-probabilistic models
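A small sketch of “pick the model that assigns highest probability to what actually happened”: fit two candidate distributions to made-up data and compare log-likelihoods. The data-generating choice and both candidate families are assumptions for illustration.

```python
import numpy as np
from scipy import stats

# Made-up positive data; compare two candidate descriptive models.
rng = np.random.default_rng(1)
data = rng.normal(loc=10.0, scale=2.0, size=200)

# Log-likelihood of the data under each fitted model (higher = better score).
loglik_normal = np.sum(stats.norm.logpdf(data, loc=data.mean(), scale=data.std()))
loglik_expon = np.sum(stats.expon.logpdf(data, scale=data.mean()))
print(loglik_normal, loglik_expon)
```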

  5. Scoring Models with Different Complexities • Bias-variance tradeoff (MSE = Variance + Bias²) • score(model) = error(model) + penalty-function(model) • AIC: score(M) = 2 S_L(M) + 2 d(M), where S_L is the negative log-likelihood and d(M) is the number of parameters • Selects overly complex models?
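A hedged sketch of the penalized-score idea (synthetic data and polynomial models chosen only for illustration): the error term is the negative log-likelihood of a Gaussian fit, and the AIC-style penalty adds two per parameter.

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 1, 40)
y = 1.0 + 2.0 * x + rng.normal(scale=0.3, size=x.size)
n = x.size

def gaussian_neg_loglik(resid):
    """Negative log-likelihood at the MLE for i.i.d. Gaussian errors."""
    sigma2 = np.mean(resid ** 2)
    return 0.5 * n * (np.log(2 * np.pi * sigma2) + 1.0)

for degree in (1, 2, 5):
    coefs = np.polyfit(x, y, degree)
    resid = y - np.polyval(coefs, x)
    k = degree + 2                                   # coefficients plus the noise variance
    aic = 2 * gaussian_neg_loglik(resid) + 2 * k     # error term + complexity penalty
    print(degree, round(aic, 2))
```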

  6. Bayesian Criterion • Score a model by its posterior probability p(M|D) ∝ p(M) p(D|M), where the marginal likelihood is p(D|M) = ∫ p(D|θ, M) p(θ|M) dθ • Typically impossible to compute analytically • All sorts of Monte Carlo approximations

  7. Laplace Method for p(D|M) • p(D|M) = ∫ p(D|θ, M) p(θ|M) dθ = ∫ exp{n h(θ)} dθ, where h(θ) = (1/n) log[ p(D|θ, M) p(θ|M) ] (i.e., the log of the integrand divided by n) • Laplace’s Method: expand h about the posterior mode θ̃, giving p(D|M) ≈ (2π)^(d/2) |Σ|^(1/2) p(D|θ̃, M) p(θ̃|M), where d = dim(θ) and Σ is the inverse of the negative Hessian of log[ p(D|θ) p(θ) ] at θ̃
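A sketch of the Laplace approximation on a model where the exact answer is available, so the two can be compared. The Beta-Binomial setup and all parameter values below are assumptions for illustration, not from the slides.

```python
import numpy as np
from scipy.special import betaln, gammaln

# Hypothetical setup: Binomial likelihood with a Beta(a, b) prior, so the exact
# marginal likelihood is known in closed form and can check the approximation.
y, n = 7, 20          # observed successes out of n trials
a, b = 2.0, 2.0       # Beta prior parameters

def log_lik(theta):
    return gammaln(n + 1) - gammaln(y + 1) - gammaln(n - y + 1) \
        + y * np.log(theta) + (n - y) * np.log(1 - theta)

def log_prior(theta):
    return -betaln(a, b) + (a - 1) * np.log(theta) + (b - 1) * np.log(1 - theta)

# Posterior mode and curvature of the log integrand at the mode.
theta_mode = (y + a - 1) / (n + a + b - 2)
hess = -(y + a - 1) / theta_mode**2 - (n - y + b - 1) / (1 - theta_mode)**2

# Laplace: integral of exp{h(theta)} ~ exp{h(mode)} * sqrt(2*pi / (-h''(mode))).
log_pD_laplace = log_lik(theta_mode) + log_prior(theta_mode) \
    + 0.5 * np.log(2 * np.pi / -hess)

# Exact marginal likelihood for the Beta-Binomial model, for comparison.
log_pD_exact = gammaln(n + 1) - gammaln(y + 1) - gammaln(n - y + 1) \
    + betaln(y + a, n - y + b) - betaln(a, b)
print(log_pD_laplace, log_pD_exact)
```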

  8. Laplace cont. • Tierney & Kadane (1986, JASA) show the approximation is O(n^-1) • Using the MLE instead of the posterior mode is also O(n^-1) • Using the expected information matrix in Σ is O(n^(-1/2)) but convenient since it is often computed by standard software • Raftery (1993) suggested approximating the posterior mode by a single Newton step starting at the MLE • Note the prior is explicit in these approximations

  9. Monte Carlo Estimates of p(D|M) • Draw i.i.d. θ1, …, θm from the prior p(θ): p(D|M) ≈ (1/m) Σi p(D|θi, M) • In practice this estimator has large variance
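A minimal sketch of this estimator on the same hypothetical Beta-Binomial setup as above (all values made up): average the likelihood over draws from the prior.

```python
import numpy as np
from scipy.special import gammaln

rng = np.random.default_rng(3)
y, n = 7, 20
a, b = 2.0, 2.0
m = 100_000

theta = rng.beta(a, b, size=m)                 # theta_1, ..., theta_m ~ p(theta)
log_binom = gammaln(n + 1) - gammaln(y + 1) - gammaln(n - y + 1)
lik = np.exp(log_binom + y * np.log(theta) + (n - y) * np.log(1 - theta))

# Estimate of p(D|M); the variance is large when prior and likelihood disagree.
print(lik.mean())
```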

  10. Monte Carlo Estimates of p(D|M) (cont.) • Draw i.i.d. θ1, …, θm from an importance density g(θ) that approximates the posterior p(θ|D): p(D|M) ≈ (1/m) Σi p(D|θi, M) p(θi|M) / g(θi) • “Importance Sampling”
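A sketch of importance sampling for the same hypothetical setup. The importance density g (here a Beta chosen to roughly match the posterior) and all parameter values are assumptions for illustration.

```python
import numpy as np
from scipy import stats
from scipy.special import gammaln

rng = np.random.default_rng(4)
y, n = 7, 20
a, b = 2.0, 2.0
m = 100_000

# Draw from an importance density g that approximates p(theta | D).
g = stats.beta(y + a, n - y + b)
theta = rng.beta(y + a, n - y + b, size=m)

log_binom = gammaln(n + 1) - gammaln(y + 1) - gammaln(n - y + 1)
log_lik = log_binom + y * np.log(theta) + (n - y) * np.log(1 - theta)
log_w = stats.beta(a, b).logpdf(theta) - g.logpdf(theta)   # prior(theta) / g(theta)

print(np.mean(np.exp(log_lik + log_w)))   # estimate of p(D|M)
```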

  11. Monte Carlo Estimates of p(D|M) (cont.) • Newton and Raftery’s “Harmonic Mean Estimator”: draw θ1, …, θm from the posterior p(θ|D) and use p(D|M) ≈ [ (1/m) Σi 1 / p(D|θi, M) ]^(-1) • Unstable in practice and needs modification
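A sketch of the harmonic mean estimator for the same hypothetical setup; because the posterior is conjugate here, we can draw from it directly rather than via MCMC.

```python
import numpy as np
from scipy.special import gammaln, logsumexp

rng = np.random.default_rng(5)
y, n = 7, 20
a, b = 2.0, 2.0
m = 100_000

# Draws from the posterior, which is Beta(y+a, n-y+b) in this conjugate setup.
theta = rng.beta(y + a, n - y + b, size=m)
log_binom = gammaln(n + 1) - gammaln(y + 1) - gammaln(n - y + 1)
log_lik = log_binom + y * np.log(theta) + (n - y) * np.log(1 - theta)

# Harmonic mean: p(D|M) ~ [ (1/m) * sum_i 1/p(D|theta_i) ]^(-1), computed in log space.
log_pD_hm = -(logsumexp(-log_lik) - np.log(m))
print(np.exp(log_pD_hm))
```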

  12. p(D|M) from Gibbs Sampler Output • First note the following identity (for any θ*): p(D) = p(D|θ*) p(θ*) / p(θ*|D) • p(D|θ*) and p(θ*) are usually easy to evaluate. What about p(θ*|D)? • Suppose we decompose θ into (θ1, θ2) such that p(θ1|D, θ2) and p(θ2|D, θ1) are available in closed form… Chib (1995)

  13. p(D|M) from Gibbs Sampler Output • The Gibbs sampler gives (dependent) draws from p(θ1, θ2|D) and hence marginally from p(θ2|D)… • so p(θ1*|D) can be estimated by averaging the closed-form conditional over those draws, p(θ1*|D) ≈ (1/G) Σg p(θ1*|D, θ2(g)), while p(θ2*|D, θ1*) is available directly (see the sketch below)
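A hedged sketch of Chib’s (1995) estimator in the two-block case. The model, priors, and all numbers below are assumptions for illustration (not from the slides): y_i ~ N(mu, sig2) with a Normal prior on mu and an Inverse-Gamma prior on sig2, so both full conditionals are closed form.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
y = rng.normal(loc=1.0, scale=2.0, size=50)      # made-up data
n, ybar = y.size, y.mean()
mu0, tau0_sq = 0.0, 10.0                          # hypothetical prior on mu
a0, b0 = 2.0, 2.0                                 # hypothetical prior on sig2

def sample_mu(sig2):
    var = 1.0 / (1.0 / tau0_sq + n / sig2)
    mean = var * (mu0 / tau0_sq + n * ybar / sig2)
    return rng.normal(mean, np.sqrt(var))

def sample_sig2(mu):
    a = a0 + n / 2.0
    b = b0 + 0.5 * np.sum((y - mu) ** 2)
    return 1.0 / rng.gamma(a, 1.0 / b)            # Inverse-Gamma(a, b) draw

# Gibbs sampler for (mu, sig2): the two blocks theta1, theta2.
G, mu_draws, sig2_draws = 5000, [], []
mu, sig2 = ybar, y.var()
for _ in range(G):
    mu = sample_mu(sig2)
    sig2 = sample_sig2(mu)
    mu_draws.append(mu); sig2_draws.append(sig2)
mu_draws, sig2_draws = np.array(mu_draws), np.array(sig2_draws)

# Evaluate Chib's identity at a high-density point theta* = (mu*, sig2*).
mu_star, sig2_star = np.median(mu_draws), np.median(sig2_draws)

log_lik = np.sum(stats.norm.logpdf(y, mu_star, np.sqrt(sig2_star)))
log_prior = stats.norm.logpdf(mu_star, mu0, np.sqrt(tau0_sq)) \
    + stats.invgamma.logpdf(sig2_star, a0, scale=b0)

# p(mu*|D): average the closed-form conditional p(mu*|D, sig2) over the sig2 draws.
var_g = 1.0 / (1.0 / tau0_sq + n / sig2_draws)
mean_g = var_g * (mu0 / tau0_sq + n * ybar / sig2_draws)
log_p_mu_star = np.log(np.mean(stats.norm.pdf(mu_star, mean_g, np.sqrt(var_g))))

# p(sig2*|D, mu*): available exactly as the Inverse-Gamma full conditional.
a_star = a0 + n / 2.0
b_star = b0 + 0.5 * np.sum((y - mu_star) ** 2)
log_p_sig2_star = stats.invgamma.logpdf(sig2_star, a_star, scale=b_star)

# Chib's identity: log p(D) = log p(D|theta*) + log p(theta*) - log p(theta*|D).
log_pD = log_lik + log_prior - log_p_mu_star - log_p_sig2_star
print(log_pD)
```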

  14. What about 3 parameter blocks? • Decompose p(θ*|D) = p(θ1*|D) p(θ2*|D, θ1*) p(θ3*|D, θ1*, θ2*) • p(θ1*|D): OK (average its closed-form conditional over the main Gibbs draws) • p(θ2*|D, θ1*): ? • p(θ3*|D, θ1*, θ2*): OK (closed-form full conditional) • Estimate p(θ2*|D, θ1*) by averaging p(θ2*|D, θ1*, θ3) over draws of θ3 from p(θ3|D, θ1*); to get these draws, continue the Gibbs sampler, sampling in turn from p(θ2|D, θ1*, θ3) and p(θ3|D, θ1*, θ2)

  15. Bayesian Information Criterion • BIC = -2 log p(D|θ̂, M) + d log n is an O(1) approximation to -2 log p(D|M) • Circumvents an explicit prior • Approximation improves to O(n^(-1/2)) for a class of priors called “unit information priors” • No free lunch (Weakliem (1998) example)
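A minimal sketch of BIC alongside AIC in the same -2 log-likelihood convention; the fit summary passed in at the end is hypothetical.

```python
import numpy as np

def aic(neg_log_likelihood, n_params):
    return 2.0 * neg_log_likelihood + 2.0 * n_params

def bic(neg_log_likelihood, n_params, n_obs):
    # The penalty grows with sample size, unlike AIC's fixed 2 per parameter.
    return 2.0 * neg_log_likelihood + n_params * np.log(n_obs)

# Hypothetical fit: negative log-likelihood 120 with 5 parameters on 1000 observations.
print(aic(120.0, 5), bic(120.0, 5, 1000))
```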

  16. Score Functions on Hold-Out Data • Instead of penalizing complexity, look at performance on hold-out data • Note: even using hold-out data, performance results can be optimistically biased
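A minimal hold-out sketch (synthetic data, arbitrary split and model degrees, all chosen for illustration): fit each candidate model on the training split and score it on the held-out split instead of penalizing complexity.

```python
import numpy as np

rng = np.random.default_rng(7)
x = np.linspace(0, 1, 100)
y = 1.0 + 2.0 * x + rng.normal(scale=0.3, size=x.size)

# Random 70/30 split into training and hold-out indices.
idx = rng.permutation(x.size)
train, test = idx[:70], idx[70:]

for degree in (1, 3, 8):
    coefs = np.polyfit(x[train], y[train], degree)
    mse_test = np.mean((y[test] - np.polyval(coefs, x[test])) ** 2)
    print(degree, round(mse_test, 3))
```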

  17. Classification performance when the data are in fact noise...
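A hedged sketch of the point behind this slide: when many candidate rules are compared on the same hold-out set, the best-looking one can appear to classify pure noise well. The data, split, and one-feature threshold rules below are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(8)
n, n_features = 100, 500
X = rng.normal(size=(n, n_features))     # pure-noise features
y = rng.integers(0, 2, size=n)           # labels independent of the features

train, test = np.arange(50), np.arange(50, 100)
best_acc = 0.0
for j in range(n_features):
    threshold = np.median(X[train, j])            # simple one-feature rule
    pred = (X[test, j] > threshold).astype(int)
    acc = np.mean(pred == y[test])
    best_acc = max(best_acc, acc, 1.0 - acc)      # allow either direction

# Selecting the best of many rules on the same hold-out set looks much
# better than the 0.5 accuracy the noise actually supports.
print(best_acc)
```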
