‘Good-looking’ statistics with nothing underneath: a warning tale (original Danish title: ‘Kønne’ formler uden noget nedenunder: en advarende historie). Jørgen Hilden, jhil@sund.ku.dk. Notes have been added in the NOTES field.
Statistical counselling
• Attic: higher-order refinements.
• 2nd floor: 2nd-moment issues. Standard errors, etc.
• 1st floor: 1st-moment issues. Bias?
• Ground floor: What do you really want to know / measure? Meaningless estimand? Nonsense arithmetic?
Ground floor examples
• ”The pH was doubled.” OOSH!
• How do I define / calculate the mean waiting time to liver transplantation in 2012? Tricky, or impossible.
• Dr. NN, otologist: #(consultations) / #(patients seen) = 2.7 in 2012, = 1.5 in JAN-MAR 2013. An interpretable change?
Statistical counselling
• Attic: higher-order refinements.
• 2nd floor: 2nd-moment issues. Standard errors, etc.
• 1st floor: 1st-moment issues. Bias?
…THEORY…
• Ground floor: What do you really want to know / measure? Meaningless estimand? Nonsense arithmetic?
…on the dangers of inventing and popularizing new statistical (epidemiological) measures which are based entirely on ’nice looks’ and have no proper theoretical underpinning
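Consider prognostics as to survival vs. death (D)
[Diagram: the ’old’ oracle uses standard clinical data and issues risk p; the new oracle also uses the new biochemical marker and issues risk q. Is the new oracle better?]
$\chi^2 = 2\sum_i \{(\ln q - \ln p)\,D + (\ln \bar q - \ln \bar p)\,\bar D\}_i$ is high; odds ratio or hazard ratio, etc., highly significant.
A bar denotes the complement: $\bar q = 1 - q$, $\bar p = 1 - p$, $\bar D = 1 - D$.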
The statistics IDI (integrated discrimination improvement) and its ‘little brother,’ the NRI (net reclassification index), were designed to measure the incremental prognostic impact that a new marker will have when added to a battery of prognostic markers for assessing the risk of a binary outcome. Intuitively plausible? Yes, they are. But their popularity is undeserved, nonetheless.
Setting as before: the ’old’ oracle (standard clinical data) issues risk p; the new oracle (standard clinical data plus the new biochemical marker) issues risk q.
Proposed ’measures’ of the superiority of the new oracle (Pencina & al., 2008+):
$\mathrm{NRI} \approx E\{\operatorname{sign}(q - p) \mid \text{Death}\} + E\{\operatorname{sign}(p - q) \mid \text{Survives}\}$
$\mathrm{IDI} \approx E\{q - p \mid \text{Death}\} + E\{p - q \mid \text{Survives}\}$
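The two definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the published estimators: the function names `nri` and `idi` and the four-patient toy data are assumptions made up for the example, and the NRI shown is the category-free form given on the slide.

```python
def sign(x):
    """Return -1, 0 or +1 according to the sign of x."""
    return (x > 0) - (x < 0)


def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


def nri(p, q, d):
    """NRI as on the slide: E{sign(q - p) | event} + E{sign(p - q) | no event}."""
    in_events = mean(sign(qi - pi) for pi, qi, di in zip(p, q, d) if di == 1)
    in_nonevents = mean(sign(pi - qi) for pi, qi, di in zip(p, q, d) if di == 0)
    return in_events + in_nonevents


def idi(p, q, d):
    """IDI as on the slide: E{q - p | event} + E{p - q | no event}."""
    in_events = mean(qi - pi for pi, qi, di in zip(p, q, d) if di == 1)
    in_nonevents = mean(pi - qi for pi, qi, di in zip(p, q, d) if di == 0)
    return in_events + in_nonevents


# Hypothetical toy data: old risks p, new risks q, outcome d (1 = death).
old_risk = [0.2, 0.4, 0.6, 0.8]
new_risk = [0.1, 0.5, 0.7, 0.9]
died = [0, 0, 1, 1]

print("NRI:", nri(old_risk, new_risk, died))
print("IDI:", idi(old_risk, new_risk, died))
```

In this toy sample both deaths are moved upwards, while one survivor is moved down and one up, so the NRI is 1 and the IDI is about 0.1.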
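Setting as before: the ’old’ oracle issues risk p, the new oracle issues risk q; a bar denotes the complement ($\bar q = 1 - q$, $\bar p = 1 - p$, $\bar D = 1 - D$), and $\chi^2 = 2\sum_i \{(\ln q - \ln p)\,D + (\ln \bar q - \ln \bar p)\,\bar D\}_i$.
Standard measures of prognostic gain:
$\Delta(\text{logarithmic score}) = \frac{1}{n}\sum_i \{\ln(q/p)\,D + \ln(\bar q/\bar p)\,\bar D\}_i = \chi^2/(2n)$;
$2\,\Delta(\text{Harrell's } C) = \dfrac{\sum_{ij} \operatorname{sign}(q_i - q_j)(D_i - D_j)}{\sum_{ij} |D_i - D_j|} - (\text{ditto with } p\text{'s})$.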
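As a rough illustration of these two standard measures, the sketch below computes the gain in logarithmic score and the change in the C index for a binary outcome. The function names and the toy data are hypothetical, and the C index is computed as the share of (event, non-event) pairs ranked correctly, with ties counted as one half.

```python
import math


def log_score_gain(p, q, d):
    """(1/n) * sum of ln(q/p) for events and ln((1-q)/(1-p)) for non-events."""
    total = 0.0
    for pi, qi, di in zip(p, q, d):
        if di == 1:
            total += math.log(qi / pi)
        else:
            total += math.log((1 - qi) / (1 - pi))
    return total / len(d)


def c_index(risk, d):
    """Share of (event, non-event) pairs in which the event has the higher risk."""
    event_risks = [r for r, di in zip(risk, d) if di == 1]
    nonevent_risks = [r for r, di in zip(risk, d) if di == 0]
    concordant = 0.0
    for r_event in event_risks:
        for r_nonevent in nonevent_risks:
            if r_event > r_nonevent:
                concordant += 1.0
            elif r_event == r_nonevent:
                concordant += 0.5  # a tie counts as half a concordant pair
    return concordant / (len(event_risks) * len(nonevent_risks))


# Hypothetical toy data: old risks p, new risks q, outcome d (1 = death).
old_risk = [0.2, 0.4, 0.6, 0.8]
new_risk = [0.1, 0.7, 0.5, 0.9]
died = [0, 1, 0, 1]

gain = log_score_gain(old_risk, new_risk, died)
chi2 = 2 * len(died) * gain  # the relation chi^2 = 2n * gain from the slide
delta_c = c_index(new_risk, died) - c_index(old_risk, died)

print("log-score gain:", gain, "chi2:", chi2)
print("C old:", c_index(old_risk, died), "C new:", c_index(new_risk, died))
```

Here the old risks rank three of the four (event, non-event) pairs correctly and the new risks rank all four correctly, so C rises from 0.75 to 1.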
IDI and NRI were proposed because the C index was regarded as the standard measure of prognostic performance, and it turned out to be “insensitive to new information”: “Look, the hazard ratio was as high as 2.5 and strongly significant (P = 0.0001), yet C only increased from 0.777 to 0.790!”
Main flaws of the NRI/IDI family of statistics, gradually uncovered by various investigators:
Attic:
• sampling distributions much farther from Gaussian than originally thought.
2nd floor:
• original SE formulae wrong and seriously off the mark (when training data = evaluation data).
1st floor:
• biased towards attributing prognostic power to uninformative predictors, at least in logistic regression models (Monte Carlo), so they may fool their users;
• bias otherwise undefined or irrelevant (see **).
Ground floor: …
Main flaws of the NRI/IDI family of statistics (cont’d)
Ground floor:
• NRI/IDI do reflect prognostic gain, but ** what do they measure? What optimality ideal do they portray?
• users may also be deliberately fooled by an opponent who wants to sell the q’s (i.e., sell the new marker equipment) and who already knows the p’s of patients in the sample [dishonesty pays; keyword: non-proper scoring rule].
Essence: they reward overconfidence, i.e., large risks are too large, small risks too small.
Deliberately fooled??
Recall: $p_i$ = patient’s ’old’ risk of ’event’, $q_i$ = ’new’ risk.
IDI (parameter) graphically defined:
$\mathrm{IDI} = E\{q - p \mid \text{event}\} - E\{q - p \mid \text{no event}\}$ = sum of arrows.
[Figure: risk axis from 0 to 1 = 100%; mean p and mean q marked for the Event group and for the No event group.]
Alas! The IDI is vulnerable to deliberate (or accidental) overconfidence…
The p rule can be “improved on” simply by making its predictions more extreme. For patient i, the cheater may report a fake $q_i$ (let’s call it Z) equal to either 100% or zero: zero to the left, 100% to the right of the red line.
[Figure: p and q for the Event and No event groups; the red line marks the marginal event frequency, approximately known to the cheater.]
Proof
Consider IDI. The cheater tries to ”optimize” Z. He expects the i’th patient to contribute to IDI:
• $+(Z - p_i)/\#D$ with prob. $p_i$ (event), and
• $-(Z - p_i)/(n - \#D)$ with prob. $(1 - p_i)$ (no event);
i.e., in expectation,
$(Z - p_i)\{\, p_i/\#D - (1 - p_i)/(n - \#D) \,\}$,
a linear function of Z, maximizable by setting Z := 1 (0) for $p_i$ > (<) $\#D/n$ = the marginal frequency approx. known to him. If in doubt, he may play safe by setting Z := $p_i$.
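The argument can be checked numerically with a small simulation. The sketch below is an illustration under stated assumptions, not part of the original talk: the ’old’ risks are taken to be the true, perfectly calibrated risks (drawn uniformly between 0.05 and 0.95), the cheater is assumed to know the sample event frequency #D/n, and all names are hypothetical.

```python
import random


def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


def idi(p, q, d):
    """IDI = E{q - p | event} - E{q - p | no event}."""
    in_events = mean(qi - pi for pi, qi, di in zip(p, q, d) if di == 1)
    in_nonevents = mean(qi - pi for pi, qi, di in zip(p, q, d) if di == 0)
    return in_events - in_nonevents


random.seed(1)  # fixed seed so that the run is reproducible
n = 2000

# Assumption: the old risks p are the true event probabilities.
true_risk = [random.uniform(0.05, 0.95) for _ in range(n)]
died = [1 if random.random() < r else 0 for r in true_risk]

# The cheater pushes every risk to the extreme on its side of #D/n.
marginal = sum(died) / n
fake_risk = [1.0 if r > marginal else 0.0 for r in true_risk]

idi_of_cheater = idi(true_risk, fake_risk, died)
print("marginal event frequency:", marginal)
print("IDI of the fake risks over the true risks:", idi_of_cheater)
```

Although the fake risks contain no information beyond the old ones, the IDI comes out clearly positive in this set-up, which is the overconfidence reward described above.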
Adoption of IDI →
• Spurious results may arise when risks are overconfident (instead of being well calibrated), as may happen with an unlucky choice of regression program. (Cheaters beat the best probabilistic model, so…)
• A supporter of a new lab test may sell it without ever doing it!*
• Simply by exploiting knowledge of the assessment machinery, a poor prognostician can outperform a good prognostician.
* cf. The Emperor’s New Clothes
Stepping back: what do we really want?
Ideally, clinical innovations should be rated in human utility terms. In particular, new information sources should be valued in terms of the clinical benefit that is expected to accrue from (optimized use of) the enlarged body of information: Value-of-Information (VOI) statistics.
All VOI-type, (quasi-)utility expectation statistics are Proper Scoring Rules (PSRs).
Key properties of a PSR:
• Good performance cannot be faked.
• It pays for a prognostician to strive to fully use the data at hand and to honestly report his assessment.
• He cannot increase his performance score by ‘strategic votes,’ not even by exploiting his knowledge of the scoring machinery.
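The ‘cannot be faked’ property can be illustrated by asking which reported risk maximizes the expected score when the true risk is known. The sketch below makes several assumptions for illustration: it uses the quadratic (Brier) score, which the slides do not name, and the logarithmic score as examples of proper scoring rules, and it contrasts them with the expected IDI contribution $(Z - p)\{p/\#D - (1 - p)/(n - \#D)\}$, rescaled by n so that only the marginal frequency m = #D/n appears. All function names are hypothetical.

```python
import math


def expected_brier(report, true_risk):
    """Expected quadratic (Brier) score, negated so that higher is better."""
    return -(true_risk * (1 - report) ** 2 + (1 - true_risk) * report ** 2)


def expected_log(report, true_risk):
    """Expected logarithmic score (higher is better)."""
    return true_risk * math.log(report) + (1 - true_risk) * math.log(1 - report)


def expected_idi_term(report, true_risk, marginal):
    """Expected IDI contribution times n: a linear function of the report."""
    slope = true_risk / marginal - (1 - true_risk) / (1 - marginal)
    return (report - true_risk) * slope


def best_report(score, true_risk, grid):
    """Reported risk on the grid that maximizes the expected score."""
    return max(grid, key=lambda report: score(report, true_risk))


grid = [k / 100 for k in range(1, 100)]  # candidate reports 0.01 ... 0.99
true_risk = 0.3
marginal = 0.5  # assumed marginal event frequency


def idi_score(report, risk):
    return expected_idi_term(report, risk, marginal)


print("Brier:", best_report(expected_brier, true_risk, grid))
print("log:  ", best_report(expected_log, true_risk, grid))
print("IDI:  ", best_report(idi_score, true_risk, grid))
```

Under both proper scoring rules the best report is the honest 0.3, whereas the IDI term is maximized at the most extreme report the grid allows on the low side of the marginal frequency.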
Conversely:
IDI can be faked
↓
IDI is not a PSR
↓
IDI is not a VOI criterion
↓
One cannot construct a decision scenario, not even a ridiculously artificial one, that has the IDI as its utility-expectation criterion.
Strengthened conclusion: Even in the absence of cheating, it cannot be claimed that IDI measures something arguably useful or constitutes a dependable yardstick.
Summing up the horror story: What went wrong in the Boston group?
(1) They knew no better than embracing the C index as their measure of prognostic power.
(2) C turns out disappointing owing to its unexpected resilience to ’well supported’ novel prognostic markers [they mix up weight of effect & weight of evidence].
(3) They [undeservedly] discard C as ’insensitive to new information.’
(4) They propose NRI, IDI and variants as being more sensitive to new information [overlooking that these are also sensitive to null or pseudo information].
(5) They rashly suggest SE formulae and make vague promises of Gaussian distribution in reasonably large samples [both wrong].