Beyond MARLAP: New Statistical Tests For Method Validation



  1. Beyond MARLAP: New Statistical Tests for Method Validation NAREL – ORIA – US EPA Laboratory Incident Response Workshop at the 53rd Annual RRMC

  2. Outline • The method validation problem • MARLAP’s test • And its peculiar features • New approach – testing mean squared error (MSE) • Two possible tests of MSE • Chi-squared test • Likelihood ratio test • Power comparisons • Recommendations and implications for MARLAP

  3. The Problem • We’ve prepared spiked samples at one or more activity levels • A lab has performed one or more analyses of the samples at each level • Our task: Evaluate the results to see whether the lab and method can achieve the required uncertainty (uReq) at each level

  4. MARLAP’s Test • In 2003 the MARLAP work group developed a simple test for MARLAP Chapter 6 • Chose a very simple criterion • Original criterion was whether every result was within ±3uReq of the target • Modified slightly to keep false rejection rate ≤ 5 % in all cases

  5. Equations • Acceptance range is TV ± k·uReq where • TV = target value (true value) • uReq = required uncertainty at TV, and • k = z_p with p = (1 + (1 − α)^(1/n))/2, where n is the total number of measurements • E.g., for n = 21 measurements (7 reps at each of 3 levels), with α = 0.05, we get k = z_0.99878 = 3.03 • For smaller n we get slightly smaller k
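
A minimal sketch of that calculation, assuming the multiplier k is the normal quantile reconstructed above (the function name is mine):

    # Sketch: acceptance multiplier k such that n independent, unbiased, normally
    # distributed results all fall within TV ± k*uReq with probability 1 - alpha.
    from scipy.stats import norm

    def marlap_k(n: int, alpha: float = 0.05) -> float:
        p = (1 + (1 - alpha) ** (1 / n)) / 2
        return norm.ppf(p)

    print(round(marlap_k(21), 2))   # ≈ 3.03, matching k = z_0.99878 on the slide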

  6. Required Uncertainty • The required uncertainty, uReq, is a function of the target value: uReq = uMR for TV ≤ UBGR, and uReq = φMR · TV for TV > UBGR • Where uMR is the required method uncertainty at the upper bound of the gray region (UBGR) • φMR = uMR / UBGR is the corresponding relative method uncertainty
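
A short sketch of that rule, using the example numbers from slide 11 as defaults (the function name is mine):

    def u_req(tv: float, ubgr: float = 100.0, u_mr: float = 10.0) -> float:
        # Required uncertainty at target value tv (pCi/L in the slide-11 example):
        # absolute requirement below the UBGR, relative requirement above it.
        phi_mr = u_mr / ubgr                 # relative method uncertainty
        return u_mr if tv <= ubgr else phi_mr * tv

    print(u_req(50.0), u_req(100.0), u_req(300.0))   # 10.0 10.0 30.0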

  7. Alternatives • We considered a chi-squared (χ²) test as an alternative in 2003 • Accounted for uncertainty of target values using “effective degrees of freedom” • Rejected at the time because of complexity and lack of evidence for performance • Kept the simple test that now appears in MARLAP Chapter 6 • But we didn’t forget about the χ² test

  8. Peculiarity of MARLAP’s Test • Power to reject a biased but precise method decreases with number of analyses performed (n) • Because we adjusted the acceptance limits to keep false rejection rates low • Acceptance range gets wider as n gets larger

  9. Biased but Precise This graphic image was borrowed and edited for the RRMC workshop presentation. Please view the original now at despair.com. http://www.despair.com/consistency.html

  10. Best Use of Data? • It isn’t just about bias • MARLAP’s test uses data inefficiently – even to evaluate precision alone (its original purpose) • The statistic – in effect – is just the worst normalized deviation from the target value • Wastes a lot of useful information

  11. Example: The MARLAP Test • Suppose we perform a level D method validation experiment • UBGR = AL = 100 pCi/L • uMR = 10 pCi/L • φMR = 10/100 = 0.10, or 10 % • Three activity levels (L = 3) • 50 pCi/L, 100 pCi/L, and 300 pCi/L • Seven replicates per level (N = 7) • Allow 5 % false rejections (α = 0.05)

  12. Example (continued) • For 21 measurements, calculate k = z_0.99878 = 3.03 • When evaluating measurement results for target value TV, require for each result Xj: |Xj − TV| ≤ k · uReq • Equivalently, require |Zj| = |Xj − TV| / uReq ≤ k

  13. Example (continued) • We’ll work through calculations at just one target value • Say TV = 300 pCi/L • This value is greater than UBGR (100 pCi/L) • So, the required uncertainty is 10 % of 300 pCi/L • uReq = 30 pCi/L

  14. Example (continued) • Suppose the lab produces 7 results Xj shown at the right • For each result, calculate the “Z score” Zj = (Xj − TV) / uReq • We require |Zj| ≤ 3.0 for each j

  15. Example (continued) • Every |Zj| is less than 3.0 • The method is obviously biased (~15 % low) • But it passes the MARLAP test
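
The seven results are not reproduced in this transcript, so the values below are hypothetical, chosen to match the quoted behavior (roughly 15 % low, every |Zj| well under 3, and consistent with the χ² example later). A minimal check of the MARLAP criterion:

    # Hypothetical results in pCi/L; the presentation's actual values appear only
    # in the slide image.
    tv, u_req_tv, k = 300.0, 30.0, 3.0
    results = [245.0, 260.0, 250.0, 256.0, 250.0, 263.0, 248.0]

    z_scores = [(x - tv) / u_req_tv for x in results]
    print([round(z, 2) for z in z_scores])        # every |Z| < 3.0
    print(all(abs(z) <= k for z in z_scores))     # True -> the method passes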

  16. 2007 • In early 2007 we were developing the new method validation guide • Applying MARLAP guidance, including the simple test of Chapter 6 • Someone suggested presenting power curves in the context of bias • Time had come to reconsider MARLAP’s simple test

  17. Bias and Imprecision • Which is worse: bias or imprecision? • Either leads to inaccuracy • Both are tolerable if not too large • When we talk about uncertainty (à la GUM), we don’t distinguish between the two

  18. Mean Squared Error • When characterizing a method, we often consider bias and imprecision separately • Uncertainty estimates combine them • There is a concept in statistics that also combines them: mean squared error

  19. Definition of MSE • If X is an estimator for a parameter θ, the mean squared error of X is • MSE(X) = E((X − θ)²) by definition • It also equals • MSE(X) = V(X) + Bias(X)² = σ² + δ² • If X is unbiased, MSE(X) = V(X) = σ² • We tend to think in terms of the root MSE, which is the square root of MSE
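
A quick numeric illustration of the identity MSE = σ² + δ² (the bias, spread, and sample size below are arbitrary choices, not values from the presentation):

    import numpy as np

    rng = np.random.default_rng(1)
    theta, delta, sigma = 100.0, 5.0, 8.0            # true value, bias, std. deviation
    x = rng.normal(theta + delta, sigma, size=1_000_000)

    print(round(np.mean((x - theta) ** 2), 1))       # empirical MSE
    print(sigma ** 2 + delta ** 2)                   # 89.0 = variance + bias^2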

  20. New Approach • For the method validation guide we chose a new conceptual approach: A method is adequate if its root MSE at each activity level does not exceed the required uncertainty at that level • We don’t care whether the MSE is dominated by bias or imprecision

  21. Root MSE v. Standard Uncertainty • Are root MSE and standard uncertainty really the same thing? • Not exactly, but one can interpret the GUM’s treatment of uncertainty in such a way that the two are closely related • We think our approach – testing uncertainty by testing MSE – is reasonable

  22. Chi-squared Test Revisited • For the new method validation document we simplified the χ² test proposed (and rejected) in 2003 • Ignore uncertainties of target values, which should be small • Just use a straightforward χ² test • Presented as an alternative in App. E • But the document still uses MARLAP’s simple test

  23. The Two Hypotheses • We’re now explicitly testing the MSE • Null hypothesis (H0): MSE ≤ uReq² • Alternative hypothesis (H1): MSE > uReq² • In MARLAP the two hypotheses were not clearly stated • Assumed any bias (δ) would be small • We were mainly testing variance (σ²)

  24. A χ² Test for Variance • Imagine we really tested variance only • H0: σ² ≤ uReq² • H1: σ² > uReq² • We could calculate a χ² statistic: Σ(Xj − X̄)² / uReq² • Chi-squared with N − 1 degrees of freedom • Presumes there may be bias but doesn’t test for it

  25. MLE for Variance • The maximum-likelihood estimator (MLE) for σ² when the mean is unknown is: σ̂² = (1/N) Σ(Xj − X̄)² • Notice similarity to χ² from preceding slide

  26. Another χ² Test for Variance • We could calculate a different χ² statistic: Σ(Xj − TV)² / uReq² • N degrees of freedom • Can be used to test variance if there is no bias • Any bias increases the rejection rate

  27. MLE for MSE • The MLE for the MSE is: (1/N) Σ(Xj − TV)² • Notice similarity to χ² from preceding slide • In the context of biased measurements, χ² seems to assess MSE rather than variance
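
A quick check of why that estimator targets the MSE: the mean squared deviation from TV splits exactly into the squared estimated bias plus the MLE of the variance (reusing the hypothetical data introduced above):

    results, tv = [245.0, 260.0, 250.0, 256.0, 250.0, 263.0, 248.0], 300.0
    n = len(results)
    xbar = sum(results) / n

    mse_hat = sum((x - tv) ** 2 for x in results) / n      # MLE of the MSE
    bias_sq = (xbar - tv) ** 2                              # squared estimated bias
    var_hat = sum((x - xbar) ** 2 for x in results) / n     # MLE of the variance
    print(round(mse_hat, 1), round(bias_sq + var_hat, 1))   # identical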

  28. Our Proposed χ² Test for MSE • For a given activity level (TV), calculate a χ² statistic W: W = Σ(Xj − TV)² / uReq² • Calculate the critical value of W as follows: wC = the (1 − α) quantile of the chi-squared distribution with N degrees of freedom • N = number of replicate measurements • α = max false rejection rate at this level

  29. Multiple Activity Levels • When testing at more than one activity level, calculate the critical value as follows: wC = the (1 − α)^(1/L) quantile of the chi-squared distribution with N degrees of freedom • Where L is the number of levels and N is the number of measurements at each level • Now α is the maximum overall false rejection rate

  30. Evaluation Criteria • To perform the test, calculate Wi at each activity level TVi • Compare each Wi to wC • If Wi > wC for any i, reject the method • The method must pass the test at each spike activity level • Don’t allow bad performance at one level just because of good performance at another
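
A minimal sketch of the procedure on slides 28–30, assuming normally distributed results; the function and variable names are mine, not from the guide or App. E:

    from scipy.stats import chi2

    def critical_value(n: int, levels: int, alpha: float = 0.05) -> float:
        # w_C: the (1 - alpha)^(1/L) quantile of chi-squared with N degrees of freedom
        return chi2.ppf((1 - alpha) ** (1 / levels), df=n)

    def w_statistic(results, tv: float, u_req_tv: float) -> float:
        # W = sum((X_j - TV)^2) / u_req^2 for one activity level
        return sum((x - tv) ** 2 for x in results) / u_req_tv ** 2

    def mse_test(levels_data, alpha: float = 0.05) -> bool:
        # levels_data: list of (results, TV, u_req) tuples, one per activity level.
        # The method passes only if W_i <= w_C at every level.
        n = len(levels_data[0][0])
        w_c = critical_value(n, len(levels_data), alpha)
        return all(w_statistic(r, tv, u) <= w_c for r, tv, u in levels_data)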

  31. Lesson Learned • Don’t test at too many levels • Otherwise you must choose: • High false acceptance rate at each level, • High overall false rejection rate, or • Complicated evaluation criteria • Prefer to keep error rates low • Need a low level and a high level • But probably not more than three levels (L=3)

  32. Better Use of Same Data • The χ² test makes better use of the measurement data than the MARLAP test • The statistic W is calculated from all the data at a given level – not just the most extreme value

  33. Caveat • The distribution of W is not completely determined by the MSE • Depends on how MSE is partitioned into variance and bias components • Our test looks like a test of variance • As if we know δ = 0 and we’re testing σ² only • But we’re actually using it to test MSE

  34. False Rejections • If wC < N, the maximum false rejection rate (100 %) occurs when δ = ±uReq and σ = 0 • But you’ll never have this situation in practice • If wC ≥ N + 2, the maximum false rejection rate occurs when σ = uReq and δ = 0 • This is the usual situation • Why we can assume the null distribution is χ² • Otherwise the maximum false rejection rate occurs when both δ and σ are nonzero • This situation is unlikely in practice

  35. To Avoid High Rejection Rates • We must have wC ≥ N + 2 • This will always be true if α < 0.08, even if L = N = 1 • Ensures the maximum false rejection rate occurs when δ = 0 and the MSE is just σ² • Not stated explicitly in App. E, because: • We didn’t have a proof at the time • Not an issue if you follow the procedure • Now we have a proof
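
A quick numerical check of that claim at the boundary case (my verification, not part of the presentation):

    from scipy.stats import chi2

    # Worst case L = N = 1 at alpha = 0.08; smaller alpha only raises w_C further
    w_c = chi2.ppf((1 - 0.08) ** (1 / 1), df=1)
    print(round(w_c, 2), w_c >= 1 + 2)   # ≈ 3.06, True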

  36. Example: Critical Value • Suppose L = 3 and N = 7 • Let α = 0.05 • Then the critical value for W is wC = the (1 − 0.05)^(1/3) quantile of chi-squared with 7 degrees of freedom ≈ 17.1 • Since wC ≥ N + 2 = 9, we won’t have unexpectedly high false rejection rates • Since α < 0.08, we didn’t really have to check
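
Reproducing that number with the sketch defined after slide 30:

    from scipy.stats import chi2

    print(round(chi2.ppf((1 - 0.05) ** (1 / 3), df=7), 1))   # ≈ 17.1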

  37. Some Facts about the Power • The power always increases with |δ| • The power increases with σ if or if • For a given bias δ with , there is a positive value of σ that minimizes the power • If , even this minimum power exceeds 50 % • Power increases with N
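
A sketch of how the power of the χ² MSE test can be computed, assuming normally distributed results, so that W is a scaled noncentral chi-square; the example parameter values are illustrative choices of mine:

    from scipy.stats import chi2, ncx2

    def power(delta: float, sigma: float, u_req_tv: float, n: int, w_c: float) -> float:
        # W = (sigma^2 / u_req^2) * noncentral chi-square(N, N*delta^2/sigma^2),
        # so the power is the probability that variable exceeds w_C
        nc = n * delta ** 2 / sigma ** 2
        return ncx2.sf(w_c * u_req_tv ** 2 / sigma ** 2, df=n, nc=nc)

    w_c = chi2.ppf(0.95 ** (1 / 3), df=7)
    # Power grows with |delta|: the second call returns the larger value
    print(power(delta=30.0, sigma=30.0, u_req_tv=30.0, n=7, w_c=w_c))
    print(power(delta=60.0, sigma=30.0, u_req_tv=30.0, n=7, w_c=w_c))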

  38. Power Comparisons • We compared the tests for power • Power to reject a biased method • Power to reject an imprecise method • The χ² test outperforms the simple MARLAP test on both counts • Results of comparisons at end of this presentation

  39. False Rejection Rates • [Figure: H0 and H1 regions, labeled with rejection rate = α, rejection rate < α, and rejection rate = 0]

  40. Region of Low Power • [Figure: H0 and H1 regions with rejection rate = α, highlighting the region of low power]

  41. Region of Low Power (MARLAP) • [Figure: H0 and H1 regions with rejection rate = α, showing the region of low power for the MARLAP test]

  42. Example: Applying the χ² Test • Return to the scenario used earlier for the MARLAP example • Three levels (L = 3) • Seven measurements per level (N = 7) • 5 % overall false rejection rate (α = 0.05) • Consider results at just one level, TV = 300 pCi/L, where uReq = 30 pCi/L

  43. Example (continued) • Reuse the data from our earlier example • Calculate the χ² statistic W = Σ(Xj − 300)² / 30² = 17.4 • Since W > wC (17.4 > 17.1), the method is rejected • We’re using all the data now – not just the worst result
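
Using the sketch functions defined after slide 30 and the same hypothetical results as before (the presentation's actual values appear only in the slide image, so the match to W = 17.4 is by construction):

    results = [245.0, 260.0, 250.0, 256.0, 250.0, 263.0, 248.0]   # hypothetical, pCi/L
    w = w_statistic(results, tv=300.0, u_req_tv=30.0)
    print(round(w, 1), w > 17.1)   # ≈ 17.4 > w_C = 17.1 -> reject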

  44. Likelihood Ratio Test for MSE • We also discovered a statistical test published in 1999, which directly addressed MSE for analytical methods • By Danish authors Erik Holst and Poul Thyregod • It’s a “likelihood ratio” test, which is a common, well accepted approach to hypothesis testing

  45. Likelihood Ratio Tests • To test a hypothesis about a parameter θ, such as the MSE • First find a likelihood function L(θ), which tells how “likely” a value of θ is, given the observed experimental data • Based on the probability mass function or probability density function for the data

  46. Test Statistic • Maximize L(θ) over all possible values of θ and again over only the values of θ that satisfy the null hypothesis H0 • Can use the ratio Λ of these two maxima as a test statistic • The authors actually use λ = −2 ln(Λ) as the statistic for testing MSE
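
A rough numerical illustration of that idea for the MSE hypothesis, assuming normal data and maximizing the log-likelihood with a generic constrained optimizer; this is my sketch, not Holst & Thyregod's closed-form statistic, and the function names and data are mine:

    import numpy as np
    from scipy.optimize import minimize

    def neg_loglik(params, x, tv):
        # Negative normal log-likelihood with mean TV + delta and std. dev. sigma
        # (additive constants dropped; they cancel in the likelihood ratio)
        delta, sigma = params
        return 0.5 * np.sum((x - tv - delta) ** 2) / sigma ** 2 + len(x) * np.log(sigma)

    def lambda_statistic(x, tv, u_req_tv):
        x = np.asarray(x, dtype=float)
        # Unrestricted MLEs: delta_hat = xbar - TV, sigma_hat^2 = mean((x - xbar)^2)
        d_hat = x.mean() - tv
        s_hat = np.sqrt(np.mean((x - x.mean()) ** 2))
        free = neg_loglik((d_hat, s_hat), x, tv)
        # Restricted maximization over the null region delta^2 + sigma^2 <= u_req^2
        cons = {"type": "ineq", "fun": lambda p: u_req_tv ** 2 - p[0] ** 2 - p[1] ** 2}
        res = minimize(neg_loglik, x0=(0.0, u_req_tv / 2), args=(x, tv),
                       constraints=[cons], bounds=[(None, None), (1e-6, None)],
                       method="SLSQP")
        return 2 * (res.fun - free)      # lambda = -2 ln(Lambda); larger -> more evidence against H0

    results = [245.0, 260.0, 250.0, 256.0, 250.0, 263.0, 248.0]   # hypothetical data
    print(round(lambda_statistic(results, tv=300.0, u_req_tv=30.0), 2))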

  47. Critical Values • It isn’t simple to derive equations for λ, or to calculate percentiles of its distribution, but Holst and Thyregod did both • They used numerical integration to approximate percentiles of λ, which serve as critical values

  48. Equations • The two-sided test statistic λ is expressed in terms of the unique real root of a cubic polynomial • See Holst & Thyregod for details

  49. One-Sided Test • We actually need the one-sided test statistic: • This is equivalent to:

  50. Issues • The distribution of either λ or λ* is not completely determined by the MSE • Under H0 with MSE = uReq², the percentiles λ1−α and λ*1−α are maximized when σ → 0 and |δ| → uReq • To ensure the false rejection rate never exceeds α, use the maximum value of the percentile as the critical value • Apparently we improved on the authors’ method of calculating this maximum
