Beyond MARLAP: New Statistical Tests for Method Validation
NAREL – ORIA – US EPA
Laboratory Incident Response Workshop at the 53rd Annual RRMC
Outline • The method validation problem • MARLAP’s test • And its peculiar features • New approach – testing mean squared error (MSE) • Two possible tests of MSE • Chi-squared test • Likelihood ratio test • Power comparisons • Recommendations and implications for MARLAP
The Problem • We’ve prepared spiked samples at one or more activity levels • A lab has performed one or more analyses of the samples at each level • Our task: Evaluate the results to see whether the lab and method can achieve the required uncertainty (uReq) at each level
MARLAP’s Test • In 2003 the MARLAP work group developed a simple test for MARLAP Chapter 6 • Chose a very simple criterion • Original criterion was whether every result was within ±3uReq of the target • Modified slightly to keep false rejection rate ≤ 5 % in all cases
Equations • Acceptance range is TV ± k·uReq, where • TV = target value (true value) • uReq = required uncertainty at TV, and • k = z_p, the p-quantile of the standard normal distribution, with p = (1 + (1 − α)^(1/n)) / 2 • E.g., for n = 21 measurements (7 reps at each of 3 levels), with α = 0.05, we get k = z0.99878 = 3.03 • For smaller n we get slightly smaller k
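A quick check of this formula in Python (a minimal sketch using scipy; the function name is mine):

```python
from scipy.stats import norm

def marlap_k(n, alpha=0.05):
    # Per-result coverage probability p chosen so that all n unbiased,
    # normally distributed results land in TV +/- k*uReq with
    # probability 1 - alpha.
    p = (1 + (1 - alpha) ** (1 / n)) / 2
    return norm.ppf(p)

print(round(marlap_k(21), 2))  # 3.03, i.e. z_0.99878
print(round(marlap_k(7), 2))   # slightly smaller k for smaller n
```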
Required Uncertainty • The required uncertainty, uReq, is a function of the target value: • uReq = uMR when TV ≤ UBGR, and uReq = φMR · TV when TV > UBGR • Where uMR is the required method uncertainty at the upper bound of the gray region (UBGR) • φMR = uMR / UBGR is the corresponding relative method uncertainty
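In code, the piecewise definition might look like this sketch; the default values are taken from the level D example later in the deck:

```python
def u_req(tv, ubgr=100.0, u_mr=10.0):
    """Required uncertainty at target value tv.

    Constant below the upper bound of the gray region (UBGR),
    proportional to tv above it.
    """
    phi_mr = u_mr / ubgr            # relative method uncertainty
    return u_mr if tv <= ubgr else phi_mr * tv

print(u_req(50), u_req(100), u_req(300))   # 10.0 10.0 30.0
```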
Alternatives • We considered a chi-squared (χ²) test as an alternative in 2003 • Accounted for uncertainty of target values using “effective degrees of freedom” • Rejected at the time because of complexity and lack of evidence for performance • Kept the simple test that now appears in MARLAP Chapter 6 • But we didn’t forget about the χ² test
Peculiarity of MARLAP’s Test • Power to reject a biased but precise method decreases with number of analyses performed (n) • Because we adjusted the acceptance limits to keep false rejection rates low • Acceptance range gets wider as n gets larger
Biased but Precise • [Image: the “Consistency” parody motivational poster, borrowed and edited for the RRMC workshop presentation; original at http://www.despair.com/consistency.html]
Best Use of Data? • It isn’t just about bias • MARLAP’s test uses data inefficiently – even to evaluate precision alone (its original purpose) • The statistic – in effect – is just the worst normalized deviation from the target value • Wastes a lot of useful information
Example: The MARLAP Test • Suppose we perform a level D method validation experiment • UBGR = AL = 100 pCi/L • uMR = 10 pCi/L • φMR = 10/100 = 0.10, or 10 % • Three activity levels (L = 3) • 50 pCi/L, 100 pCi/L, and 300 pCi/L • Seven replicates per level (N = 7) • Allow 5 % false rejections (α = 0.05)
Example (continued) • For 21 measurements, calculate k = z_p with p = (1 + 0.95^(1/21))/2 = 0.99878, so k = 3.03 • When evaluating measurement results for target value TV, require for each result Xj: |Xj − TV| ≤ 3.03 uReq • Equivalently, require |Xj − TV| / uReq ≤ 3.03
Example (continued) • We’ll work through calculations at just one target value • Say TV = 300 pCi/L • This value is greater than UBGR (100 pCi/L) • So, the required uncertainty is 10 % of 300 pCi/L • uReq = 30 pCi/L
Example (continued) • Suppose the lab produces 7 results Xj (the slide’s data table is not reproduced here) • For each result, calculate the “Z score” Zj = (Xj − TV) / uReq • We require |Zj| ≤ 3.0 for each j
Example (continued) • Every |Zj| is less than 3.0 • The method is obviously biased (~15 % low) • But it passes the MARLAP test
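Putting the whole check together in code. The data here are hypothetical stand-ins, since the slide’s table isn’t reproduced; they were chosen to match the reported outcome (a method roughly 15 % low that still passes):

```python
import numpy as np

TV = 300.0     # target value, pCi/L
U_REQ = 30.0   # required uncertainty at TV
K = 3.03       # critical multiplier for n = 21, alpha = 0.05

# Hypothetical results (pCi/L), about 15 % low on average.
x = np.array([249.0, 243.0, 264.0, 273.0, 237.0, 258.0, 255.0])

z = (x - TV) / U_REQ
print(np.round(z, 1))          # [-1.7 -1.9 -1.2 -0.9 -2.1 -1.4 -1.5]
print(np.all(np.abs(z) <= K))  # True: the biased method passes
```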
2007 • In early 2007 we were developing the new method validation guide • Applying MARLAP guidance, including the simple test of Chapter 6 • Someone suggested presenting power curves in the context of bias • Time had come to reconsider MARLAP’s simple test
Bias and Imprecision • Which is worse: bias or imprecision? • Either leads to inaccuracy • Both are tolerable if not too large • When we talk about uncertainty (à la GUM), we don’t distinguish between the two
Mean Squared Error • When characterizing a method, we often consider bias and imprecision separately • Uncertainty estimates combine them • There is a concept in statistics that also combines them: mean squared error
Definition of MSE • If X is an estimator for a parameter θ, the mean squared error of X is • MSE(X) = E((X − θ)²) by definition • It also equals • MSE(X) = V(X) + Bias(X)² = σ² + δ² • If X is unbiased, MSE(X) = V(X) = σ² • We tend to think in terms of the root MSE, which is the square root of MSE
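A quick Monte Carlo sanity check of the decomposition (the values of θ, σ, and δ are illustrative, not from the presentation):

```python
import numpy as np

rng = np.random.default_rng(1)
theta, sigma, delta = 100.0, 10.0, 5.0          # true value, sd, bias
x = rng.normal(theta + delta, sigma, size=1_000_000)

mse = np.mean((x - theta) ** 2)                 # empirical MSE
print(round(mse, 1), sigma**2 + delta**2)       # ~125.0 and 125.0
```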
New Approach • For the method validation guide we chose a new conceptual approach A method is adequate if its root MSE at each activity level does not exceed the required uncertainty at that level • We don’t care whether the MSE is dominated by bias or imprecision
Root MSE v. Standard Uncertainty • Are root MSE and standard uncertainty really the same thing? • Not exactly, but one can interpret the GUM’s treatment of uncertainty in such a way that the two are closely related • We think our approach – testing uncertainty by testing MSE – is reasonable
Chi-squared Test Revisited • For the new method validation document we simplified the χ² test proposed (and rejected) in 2003 • Ignore uncertainties of target values, which should be small • Just use a straightforward χ² test • Presented as an alternative in App. E • But the document still uses MARLAP’s simple test
The Two Hypotheses • We’re now explicitly testing the MSE • Null hypothesis (H0): MSE ≤ uReq² • Alternative hypothesis (H1): MSE > uReq² • In MARLAP the 2 hypotheses were not clearly stated • Assumed any bias (δ) would be small • We were mainly testing variance (σ²)
A χ² Test for Variance • Imagine we really tested variance only • H0: σ² ≤ uReq² • H1: σ² > uReq² • We could calculate a χ² statistic: χ² = Σ(Xj − X̄)² / uReq², where X̄ is the sample mean • Chi-squared with N − 1 degrees of freedom • Presumes there may be bias but doesn’t test for it
MLE for Variance • The maximum-likelihood estimator (MLE) for σ² when the mean is unknown is σ̂² = (1/N) Σ(Xj − X̄)² • Notice similarity to χ² from preceding slide
Another χ² Test for Variance • We could calculate a different χ² statistic: χ² = Σ(Xj − TV)² / uReq² • N degrees of freedom • Can be used to test variance if there is no bias • Any bias increases the rejection rate
MLE for MSE • The MLE for the MSE is (1/N) Σ(Xj − TV)² • Notice similarity to χ² from preceding slide • In the context of biased measurements, χ² seems to assess MSE rather than variance
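Both estimators in code (a small sketch; the data array reuses the hypothetical values from the MARLAP example above):

```python
import numpy as np

def mle_variance(x):
    # Divides by N (not N - 1): the MLE when the mean is unknown.
    return np.mean((x - np.mean(x)) ** 2)

def mle_mse(x, tv):
    # Same form, but deviations are taken about the known target value.
    return np.mean((x - tv) ** 2)

x = np.array([249.0, 243.0, 264.0, 273.0, 237.0, 258.0, 255.0])
print(round(mle_variance(x), 1))    # ~130.4
print(round(mle_mse(x, 300.0), 1))  # ~2233.3 - bias dominates here
```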
Our Proposed χ² Test for MSE • For a given activity level (TV), calculate a χ² statistic W: W = Σ(Xj − TV)² / uReq² • Calculate the critical value of W as the (1 − α) quantile of the chi-squared distribution with N degrees of freedom • N = number of replicate measurements • α = max false rejection rate at this level
Multiple Activity Levels • When testing at more than one activity level, calculate the critical value wC as the (1 − α)^(1/L) quantile of the chi-squared distribution with N degrees of freedom • Where L is the number of levels and N is the number of measurements at each level • Now α is the maximum overall false rejection rate
Evaluation Criteria • To perform the test, calculate Wi at each activity level TVi • Compare each Wi to wC • If Wi > wC for any i, reject the method • The method must pass the test at each spike activity level • Don’t allow bad performance at one level just because of good performance at another
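A sketch of the whole procedure in Python (the function and argument names are mine; the defaults come from the level D example):

```python
import numpy as np
from scipy.stats import chi2

def chi2_mse_test(results_by_level, targets, alpha=0.05,
                  ubgr=100.0, u_mr=10.0):
    """Proposed chi-squared test of MSE, one W per activity level.

    results_by_level: list of result arrays, one per level
    targets: spike target values TV_i, one per level
    Returns (list of W_i, critical value w_c, overall pass flag).
    """
    L = len(targets)
    N = len(results_by_level[0])   # assumes N replicates at every level
    w_c = chi2.ppf((1 - alpha) ** (1 / L), N)
    ws = []
    for x, tv in zip(results_by_level, targets):
        u_req = u_mr if tv <= ubgr else (u_mr / ubgr) * tv
        ws.append(float(np.sum((np.asarray(x) - tv) ** 2) / u_req**2))
    return ws, w_c, all(w <= w_c for w in ws)
```

With L = 3 and N = 7 this gives wC ≈ 17.1; the method fails if any level’s Wi exceeds it.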
Lesson Learned • Don’t test at too many levels • Otherwise you must choose: • High false acceptance rate at each level, • High overall false rejection rate, or • Complicated evaluation criteria • Prefer to keep error rates low • Need a low level and a high level • But probably not more than three levels (L=3)
Better Use of Same Data • The χ² test makes better use of the measurement data than the MARLAP test • The statistic W is calculated from all the data at a given level – not just the most extreme value
Caveat • The distribution of W is not completely determined by the MSE • Depends on how MSE is partitioned into variance and bias components • Our test looks like a test of variance • As if we know δ = 0 and we’re testing σ² only • But we’re actually using it to test MSE
False Rejections • If wC < N, the maximum false rejection rate (100 %) occurs when δ = ±uReq and σ = 0 • But you’ll never have this situation in practice • If wC ≥ N + 2, the maximum false rejection rate occurs when σ = uReq and δ = 0 • This is the usual situation • Why we can assume the null distribution is χ² • Otherwise the maximum false rejection rate occurs when both δ and σ are nonzero • This situation is unlikely in practice
To Avoid High Rejection Rates • We must have wC ≥ N + 2 • This will always be true if α < 0.08, even if L = N = 1 • Ensures the maximum false rejection rate occurs when δ = 0 and the MSE is just σ² • Not stated explicitly in App. E, because: • We didn’t have a proof at the time • Not an issue if you follow the procedure • Now we have a proof
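This claim can be illustrated numerically. If the results are N(TV + δ, σ²), then W·uReq²/σ² follows a noncentral chi-squared distribution with N degrees of freedom and noncentrality Nδ²/σ², so rejection rates on the null boundary δ² + σ² = uReq² can be computed directly (a sketch using scipy; the scan points are mine):

```python
import numpy as np
from scipy.stats import chi2, ncx2

N, L, alpha, u_req = 7, 3, 0.05, 30.0
w_c = chi2.ppf((1 - alpha) ** (1 / L), N)   # ~17.1; note w_c >= N + 2

def reject_prob(delta, sigma):
    # W * u_req**2 / sigma**2 ~ noncentral chi2(N, N*delta**2/sigma**2)
    x = w_c * u_req**2 / sigma**2
    nc = N * delta**2 / sigma**2
    return chi2.sf(x, N) if nc == 0 else ncx2.sf(x, N, nc)

# Scan the null boundary delta**2 + sigma**2 = u_req**2:
for frac in (0.0, 0.3, 0.6, 0.9):
    delta = frac * u_req
    sigma = np.sqrt(u_req**2 - delta**2)
    print(f"delta = {delta:5.1f}  rejection rate = {reject_prob(delta, sigma):.4f}")
# The rate is largest at delta = 0, as the slide asserts.
```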
Example: Critical Value • Suppose L = 3 and N = 7 • Let α = 0.05 • Then the critical value for W is the 0.95^(1/3) ≈ 0.983 quantile of the chi-squared distribution with 7 degrees of freedom: wC = 17.1 • Since wC ≥ N + 2 = 9, we won’t have unexpectedly high false rejection rates • Since α < 0.08, we didn’t really have to check
Some Facts about the Power • The power always increases with |δ| • The power increases with σ when δ = 0, and in any case once σ is large enough • For a bias δ large enough that the test would reject even with σ = 0 (i.e., Nδ² > wC·uReq²), there is a positive value of σ that minimizes the power • For still larger |δ|, even this minimum power exceeds 50 % • Power increases with N
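The same noncentral chi-squared fact used above gives the power anywhere in the (δ, σ) plane. A self-contained sketch (the hard-coded critical value comes from the L = 3, N = 7 example):

```python
from scipy.stats import chi2, ncx2

N, u_req = 7, 30.0
w_c = chi2.ppf(0.95 ** (1 / 3), N)   # ~17.1

def power(delta, sigma):
    # W * u_req**2 / sigma**2 ~ noncentral chi2(N, N*delta**2/sigma**2)
    x = w_c * u_req**2 / sigma**2
    nc = N * delta**2 / sigma**2
    return chi2.sf(x, N) if nc == 0 else ncx2.sf(x, N, nc)

# Power rises steadily with |delta| at fixed sigma = u_req:
for d in (0.0, 15.0, 30.0, 45.0, 60.0):
    print(f"|delta| = {d:4.0f}  power = {power(d, u_req):.3f}")
```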
Power Comparisons • We compared the tests for power • Power to reject a biased method • Power to reject an imprecise method • The χ2 test outperforms the simple MARLAP test on both counts • Results of comparisons at end of this presentation
False Rejection Rates [figure: the (δ, σ) plane split into the H0 and H1 regions; the rejection rate equals α on the H0/H1 boundary, is below α inside H0, and approaches 0 deep inside H0]
Region of Low Power [figure: the χ² test’s region of low power, a band just inside H1 where the rejection rate stays near α]
Region of Low Power (MARLAP) [figure: the corresponding low-power band for the MARLAP test, which is larger]
Example: Applying the χ² Test • Return to the scenario used earlier for the MARLAP example • Three levels (L = 3) • Seven measurements per level (N = 7) • 5 % overall false rejection rate (α = 0.05) • Consider results at just one level, TV = 300 pCi/L, where uReq = 30 pCi/L
Example (continued) • Reuse the data from our earlier example • Calculate the χ² statistic: W = Σ(Xj − 300)² / 30² = 17.4 • Since W > wC (17.4 > 17.1), the method is rejected • We’re using all the data now – not just the worst result
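The same computation in code, reusing the hypothetical stand-in data from the MARLAP example (chosen so that W ≈ 17.4, matching the slide’s reported value):

```python
import numpy as np
from scipy.stats import chi2

x = np.array([249.0, 243.0, 264.0, 273.0, 237.0, 258.0, 255.0])
tv, u_req = 300.0, 30.0

W = np.sum((x - tv) ** 2) / u_req**2
w_c = chi2.ppf(0.95 ** (1 / 3), 7)
print(round(W, 1), round(w_c, 1), W > w_c)   # 17.4 17.1 True -> reject
```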
Likelihood Ratio Test for MSE • We also discovered a statistical test published in 1999, which directly addressed MSE for analytical methods • By Danish authors Erik Holst and Poul Thyregod • It’s a “likelihood ratio” test, which is a common, well accepted approach to hypothesis testing
Likelihood Ratio Tests • To test a hypothesis about a parameter θ, such as the MSE • First find a likelihood function L(θ), which tells how “likely” a value of θ is, given the observed experimental data • Based on the probability mass function or probability density function for the data
Test Statistic • Maximize L(θ) over all possible values of θ, and again over only those values of θ that satisfy the null hypothesis H0 • The ratio Λ of the constrained maximum to the unconstrained maximum can be used as a test statistic • The authors actually use λ = −2 ln(Λ) as the statistic for testing MSE
Critical Values • It isn’t simple to derive equations for λ, or to calculate percentiles of its distribution, but Holst and Thyregod did both • They used numerical integration to approximate percentiles of λ, which serve as critical values
Equations • Holst & Thyregod give closed-form expressions for the two-sided test statistic λ • The constrained MLE in those expressions is the unique real root of a cubic polynomial • See Holst & Thyregod for details
One-Sided Test • We actually need the one-sided test statistic, λ* • It rejects only when the estimated MSE exceeds uReq²; when the MLE of the MSE falls inside H0, λ* is zero
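Holst & Thyregod’s closed forms aren’t reproduced here, but the one-sided statistic can be approximated numerically. The sketch below is my construction of the generic one-sided LR recipe, not the authors’ exact formulas: it assumes λ* = 0 when the MLE of the MSE lies inside H0, and otherwise compares the unconstrained maximum of the log-likelihood with its maximum on the boundary δ² + σ² = uReq². Critical values must still come from the authors’ numerical percentiles.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def lr_stat_mse(x, tv, u_req):
    """One-sided LR statistic for H0: MSE <= u_req**2 (numerical sketch)."""
    x = np.asarray(x, dtype=float)
    n = len(x)

    def loglik(delta, sig2):
        return (-0.5 * n * np.log(2 * np.pi * sig2)
                - np.sum((x - tv - delta) ** 2) / (2 * sig2))

    d_hat = x.mean() - tv                    # unconstrained MLE of bias
    s2_hat = np.mean((x - x.mean()) ** 2)    # unconstrained MLE of variance
    if d_hat**2 + s2_hat <= u_req**2:
        return 0.0                           # MLE inside H0: lambda* = 0

    # Constrained maximum sits on the boundary delta**2 + sigma**2 = u_req**2.
    def neg_loglik_on_boundary(delta):
        return -loglik(delta, u_req**2 - delta**2)

    eps = 1e-9 * u_req
    res = minimize_scalar(neg_loglik_on_boundary,
                          bounds=(-u_req + eps, u_req - eps),
                          method="bounded")
    return 2.0 * (loglik(d_hat, s2_hat) + res.fun)

x = [249.0, 243.0, 264.0, 273.0, 237.0, 258.0, 255.0]
print(round(lr_stat_mse(x, 300.0, 30.0), 2))  # positive: data sit well outside H0
```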
Issues • The distribution of either λ or λ* is not completely determined by the MSE • Under H0 with MSE = uReq², the (1 − α) percentiles of λ and λ* are maximized when σ → 0 and |δ| → uReq • To ensure the false rejection rate never exceeds α, use the maximum value of the percentile as the critical value • Apparently we improved on the authors’ method of calculating this maximum