
Patient reported outcome measures and the Rasch model




Presentation Transcript


  1. Helen Parsons: Patient reported outcome measures and the Rasch model

  2. Outline • Patient reported outcome measures • Quick overview • Analysis problems • Rasch models • Simple Rasch formulation • Rasch extensions: polytomous data • Application of the Rasch model • Using the Oxford Hip Score • Model fit criteria • DIF checking • Summary

  3. Outcome measures • Outcome measures are widespread, with patient reported outcome measures (PROMs) increasingly used • Try to capture some latent trait of the respondent • i.e. some trait that is difficult to measure directly, such as “quality of life” or “anxiety” • Often in a self-report questionnaire format • e.g. EQ-5D • Some outcome measures are reported by clinicians • e.g. HoNOS • Sometimes incorporates clinical findings as well as questionnaire data • e.g. DAS28

  4. Outcome measures have a variety of usages • One-off assessment as a diagnostic tool • Comparative assessment • Such as measuring the outcome before and after an intervention • Longitudinal analysis • The NHS records and publishes1 the aggregated results from 4 PROMs as part of the quality assurance process 1: http://www.ic.nhs.uk/proms

  5. Analysis of outcome measures • As PROMs tend to be in a questionnaire format, results are often reported as a “total score” • i.e. a sum of ordinal scores • Often not “nice” distributions • Not normal • Bi-modal • Floor and ceiling effects • Analysis usually assumes linear relationships • That is, moving from 4/10 to 5/10 is treated as the same clinical gain as moving from 9/10 to 10/10

  6. Example of PROM baseline data2 • Here a low score denotes good function • Most patients sit at higher values • Tail is abruptly cut off on the RHS • Patients can have worse function than others yet score the same (a ceiling effect) 2: Oxford Hip Score (OHS) data from the WAT trial (see slide 15)

  7. Rasch Models • Part of Item Response Theory • Introduced by Georg Rasch (1901-1980) • Rasch, G. (1960/1980). Probabilistic models for some intelligence and attainment tests. • Developed in psychometrics, so was created to describe a participant’s ability as measured by item difficulties • Ability: the ‘latent trait’ of the participant • i.e. the “maths ability” of a student • Difficulty: which levels of the latent trait the question can discriminate • i.e. “easy” items identify poor students whilst “hard” items show the difference between good and excellent students

  8. Rasch formulation • Given a data matrix of (binary) scores on n persons (S1, S2, … Sn) to a fixed set of k items (I1, I2, … Ik) that measure the same latent trait, θ • Each subject Sv has a person parameter θv denoting their position on the latent trait (ability) • Each item Ii has an item parameter βi denoting its difficulty

  9. Let: • β represent the vector of item parameters • θ represent the vector of person parameters • X be the n x k data matrix with elements xvi equal to 0 or 1 • Then the probability that person Sv scores 1 on item Ii is: P(xvi = 1 | θv, βi) = exp(θv − βi) / (1 + exp(θv − βi)) • Also assume: • Independence of answers between persons • No group work, no cheating! • A person’s answers are stochastically independent • All dependent on ability only • No person subgroups • The latent trait is uni-dimensional • i.e. can be used to assess “shame” but not “anxiety and depression” Fischer, G. H. and Molenaar, I. W. (1995). Rasch Models: Foundations, Recent Developments and Applications. Springer.
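A minimal sketch of this formulation in R (the language of the eRm examples used later in the talk); the function and variable names here are illustrative, not from the talk:

    # Rasch model: P(x_vi = 1) depends only on ability minus difficulty
    rasch_prob <- function(theta, beta) {
      exp(theta - beta) / (1 + exp(theta - beta))
    }
    rasch_prob(theta = 0.5, beta = 0.5)  # ability = difficulty: p = 0.5
    rasch_prob(theta = 1.5, beta = 0.5)  # ability > difficulty: p > 0.5

    # Fitting the model to a binary data matrix with the eRm package;
    # sim.rasch() simulates responses from randomly drawn parameters
    library(eRm)
    X <- sim.rasch(persons = 200, items = 10)  # 200 x 10 matrix of 0/1 scores
    fit <- RM(X)                 # conditional ML estimates of item parameters
    pp <- person.parameter(fit)  # ability estimates, one per raw total score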

  10. This plot is called an “Item Characteristic Curve” (ICC) • [Figure: ICC, probability of a positive answer against ability] • When varying ability, the item response follows a logistic relationship • The probability of a positive answer is 0.5 when the person ability equals the item difficulty • For a given difficulty, larger abilities have a greater chance of affirming the item • i.e. better students score more! • Note: the latent dimension is rescaled to centre zero and measured in logistic units (logits)
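A sketch of how such curves could be drawn for the eRm fit above; plotICC() is the package's ICC plotting function, with item.subset and ask among its standard arguments:

    # ICCs for the first four items of the fitted model; each curve crosses
    # p = 0.5 where person ability equals that item's difficulty
    plotICC(fit, item.subset = 1:4, ask = FALSE)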

  11. Rasch advantages • A logistic model better captures a finite scale • Gives information on both persons and items • Model parameters are simple to obtain • The total score is a sufficient statistic for the person parameter • The item score across persons is a sufficient statistic for the item parameter • Extensions include • Polytomous data • 2 and 3 parameter IRT models • The 2nd parameter adds a “discrimination” (slope) parameter • The 3rd parameter allows for “guessing”

  12. Rasch extension: polytomous data • The Rasch model uses a pass/fail score • However, what happens when a respondent passes only part of an item? • E.g. exam marking: questions with multiple marks available • E.g. surveys: Likert format questions • Two model variants (see the sketch below) • Rating Scale Models • Items all have the same number of thresholds at identical difficulty levels • Partial Credit Models • Allows a different number of thresholds, each at a separate difficulty, for each item
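A sketch of fitting both variants with eRm, assuming the package's pcmdat example data set (whose items must share the same response categories for the rating scale model to apply):

    # Rating scale v. partial credit fits on the same polytomous data
    library(eRm)
    data(pcmdat)              # example data set shipped with eRm
    rsm_fit <- RSM(pcmdat)    # one threshold pattern shared by all items
    pcm_fit <- PCM(pcmdat)    # each item gets its own thresholds
    thresholds(rsm_fit)       # items shift left/right only
    thresholds(pcm_fit)       # thresholds differ item by item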

  13. Polytomous ICCs • [Figures: category probability plots for Questions 1, 2 and 3 under the rating scale model and the partial credit model] • RSM items only shift left and right • PCM items change shape as well as shift • The same data (eRm package example) was used to create each model

  14. Software and resources • Several payware packages available • WINSTEPS • www.winsteps.com/ • RUMM2020 • www.rummlab.com.au/ • Freeware becoming available • Several R packages now released • eRm is used throughout this talk • ltm and psych also have Rasch implementations • Growing literature base • But introductory books and courses are hard to find!

  15. Example: The Oxford Hip Score • Assesses hip function • Designed to assess patients undergoing hip replacement surgery • Patient reported measure • 12 questions; patients choose the statement (out of 5 possible) which best reflects their situation • Here, each item is marked 0-4 and the total score summed • Minimum of 0 indicates ‘perfect’ function • Maximum of 48 Dawson J, Fitzpatrick R, Carr A, Murray D. Questionnaire on the perceptions of patients about total hip replacement. J Bone Joint Surg Br. 1996 Mar;78(2):185-90.

  16. Data from the WAT trial • The Warwick Arthroplasty Trial • 126 participants at baseline • 2 intervention groups: hip replacement v. resurfacing • Analysed using the partial credit model • Where categories were not all used, the remaining categories were renumbered, starting from 0 (sketched below) • Data available from the same cohort longitudinally Costa ML, Achten J, Parsons N, Edlin RP, Foguet P, Prakash U, Griffin DR. A Randomised Controlled Trial of Total Hip Arthroplasty Versus Resurfacing Arthroplasty in the Treatment of Young Patients with Arthritis of the Hip Joint. BMJ 2012; 344:e2147.
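A sketch of the renumbering step described above; ohs is a hypothetical data frame with one column per OHS item scored 0-4:

    # Drop unused categories and renumber the rest from 0, as PCM() expects;
    # factor() keeps only the observed levels, in their original order
    renumber <- function(x) as.integer(factor(x)) - 1L
    ohs_recode <- as.data.frame(lapply(ohs, renumber))
    pcm_wat <- PCM(ohs_recode)   # partial credit model on the recoded items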

  17. [Figure: person-item map of the baseline data, showing the distribution of abilities against mean item difficulties and category thresholds; items in red indicate non-sequential categories]

  18. Item parameters • Question 9 has the lowest mean item parameter • Indicating best function • “Have you been limping when walking?” • Question 2 has the highest mean item parameter • Indicating worst function • “Have you had any trouble washing and drying yourself?” • Question 8 covers the widest set of difficulties • Most discriminating item • “After a meal (sat at a table), how painful has it been for you to stand up from a chair?” • 4 questions have non-sequential thresholds • Why does this happen?

  19. Non-sequential categories • Thresholds (0|1, 1|2, 2|3, 3|4) occur where adjacent category curves cross • [Figures: category probability curves for a non-sequential item (Question 5) and a sequential item (Question 11)]

  20. Non-sequential categories result from • Underused categories • Unexpected scoring patterns • Could suggest problems with the item • Fixed by • Removal of the item • Combining categories

  21. Person parameters • Can associate scores with abilities • Monotonically increasing relationship • Clear that a score increase of 1 is associated with different increases in ability • “Bigger” loss of function for low scorers • The middle of the score scale gives similar abilities • [Figure: estimated ability against baseline score]
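A sketch of how this score-to-ability plot could be reproduced from an eRm fit; pcm_wat is the hypothetical fit from the earlier sketch:

    # One ability estimate per person, plotted against the raw total score;
    # the relationship is monotone but not linear
    pp <- person.parameter(pcm_wat)
    abilities <- coef(pp)                        # person parameter estimates
    scores <- rowSums(pcm_wat$X, na.rm = TRUE)   # raw total scores
    plot(scores, abilities)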

  22. Model results • [Figures: total score distribution (heavy tail) and ability distribution (centred about zero)]

  23. Comparative use • Have several models at different time points • Could use the baseline model throughout • Could use new models at each time point • Have two treatment groups • A and B • Four follow-up points post-intervention • [Figure: baseline data, ability by treatment group]

  24. Change in function between baseline and 6 weeks follow-up • No significant differences between groups • Using intention-to-treat groups • [Figures: raw scores and calculated abilities, at baseline and at 6 weeks]

  25. Using the baseline model at 6 weeks follow-up • No significant differences between groups • [Figures: scores at 6 weeks and predicted abilities at 6 weeks]

  26. Using the baseline model at 12 months follow-up • Primary outcome of the trial • No significant differences in either rating • [Figures: scores at 12 months and differences between scores at 12 months and baseline; predicted abilities at 12 months and differences between abilities at 12 months (predicted) and baseline (calculated)]

  27. Items at 6 weeks • Very different to baseline • Question 4 now easiest (was Q9) • Question 3 now hardest (was Q2) • Double the number of reversed scales (8) • Suggests that patient function has changed greatly • [Figure: item map at 6 weeks, shown against the baseline model for reference]

  28. Items at 12 months • Notice wide range of abilities • Some patients now “recovered” • Some patients still with low function • Similar to baseline model • Q9 easiest • Q8 most discriminatory • Q2 second most difficult

  29. Model comparisons • [Figures: item parameter estimates from the baseline model, the model at 6 weeks and the model at 12 months]

  30. • A scale calibrated from the baseline data collection allows comparison of persons • A scale calibrated from the 6 week data collection allows comparison of items • [Figures: abilities using the baseline model v. abilities using the 6 week model]

  31. Problems • Because no responders at baseline used the lowest two categories, the model did not cover the full range of scores • Q1: “How would you describe the pain you usually had from your hip?” • This resulted in missing values at other collection points • At 6 weeks: 7 with no score, 14 missing in total • At 12 months: 3 with no score, 9 missing in total • Would need “calibration” data • From a “healthy” population? • All time points? • The Rasch model excludes maximum and minimum scores when fitting • Can calculate these post-hoc
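In eRm the post-hoc calculation mentioned above happens in person.parameter(); a sketch, again assuming the hypothetical pcm_wat fit:

    # Persons with extreme (minimum or maximum) total scores drop out of the
    # conditional estimation; person.parameter() interpolates abilities for
    # those raw scores afterwards
    pp <- person.parameter(pcm_wat)
    summary(pp)   # includes the interpolated values for extreme scores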

  32. Item fit statistics • Fit statistics are not standardised across software, so it’s hard to get a clear picture • Names, formulae and boundaries all differ • There doesn’t appear to be a standard approach • Using WINSTEPS nomenclature here • As the manual is available online • http://www.winsteps.com/winman/index.htm • But this is still work in progress! • Not clear which implementation the eRm package uses

  33. Common statistics • Chi-squared statistics • Observed v. model expected • Mean square residuals (MSQ) • t-statistics • A transformation of the MSQ • Not certain where useful cut-offs are • Two versions of each type • Infit (weighted by ability) • Outfit (overall sample)
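A sketch of obtaining these in eRm; itemfit() and personfit() take the person parameter object rather than the model fit itself:

    # Infit/outfit mean square residuals and t-statistics,
    # per item and per person
    pp <- person.parameter(pcm_wat)
    itemfit(pp)
    personfit(pp)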

  34. Sample size dependence varies by statistic • Most are defined in terms of the standard Rasch model only • Person fit statistics also available • Similar approach • When removing a misfitting item, the whole model must be recalculated • Which then finds new poorly fitting items, etc. • Removed over half of all items in the ME data set • May be problems due to the instrument not being designed for Rasch analysis • Subscales a major problem Smith et al. Rasch fit statistics and sample size considerations for polytomous data. BMC Medical Research Methodology 2008, 8:33.

  35. [Figure: WAT baseline data]

  36. Other problems to consider • 95% CIs for the ability thresholds overlap • Lots of variability in the item parameters • [Figure: item pathway map]

  37. Person pathway map at baseline • Adds a 95% CI for each person • Often have misfitting persons, but how to deal with this has not been looked into to date • [Figure: person pathway map at baseline]

  38. Differential item functioning • The Rasch model requires that item difficulty does not change between groups • E.g. a shoulder function questionnaire asks about the ability to brush and style hair • If (on average) women spend more effort on more elaborate hairstyles, it would not be surprising to see that women with the same level of function find doing their hair more difficult • Differential item functioning (DIF) checks whether this is indeed the case

  39. Confidence plot of thresholds • Overall differences using Andersen’s LR test: no difference (p = 0.645) • However: 6 questions were excluded as not all thresholds were used by both groups • [Figure: threshold estimates with confidence intervals for Group A and Group B; maybe something of interest here]
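A sketch of these DIF checks in eRm, with the treatment group as the split criterion (group here is a hypothetical factor, one entry per person):

    # Andersen's LR test: do item parameters differ between the subgroups?
    lr <- LRtest(pcm_wat, splitcr = group)
    lr                                   # overall test, as quoted on the slide
    Waldtest(pcm_wat, splitcr = group)   # item-by-item z-tests
    plotGOF(lr, conf = list())           # subgroup estimates with confidence ellipses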

  40. Summary • Rasch models • Give an alternative analysis approach to ordinal and binary scales • Less “bodging” of assumptions! • Give information on questions as well as respondents • The 1 parameter case of item response theory • Rasch models could potentially be used in PROM analysis • Have potential applications in the validation and construction of new PROMs

  41. Things I’m still working on • When is it a good fit? • Still working on model fit statistics • Then assess person fit statistics • Does it matter at all? • How do you compare different populations? • Is a calibration population the best way to go? • How can you find a clinically meaningful change? • How does item information affect the analysis? • Is it useful?! • Thanks for listening!
