Testing reliability and validity in medical research

Moon Seok Park, MD, Seoul National University Bundang Hospital

Reliability

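In my first year of residency, the professor gave the order: “Measure 1000 X-rays by tomorrow and reach a conclusion!!” • It was an angle I had never measured before, and I measured all night. It was so hard that I had the intern do some as well. I am not even sure it was done properly. • And yet the results came out significant. OK!!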

When measurements are made with two different methods, can we not use a paired t-test to assess reliability? • When is the paired t-test the appropriate method?
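A minimal sketch of why the paired t-test is not a reliability measure: it tests whether the mean difference between two methods is zero, not whether the individual measurements agree. The values are made up for illustration, and `paired_t` is a hypothetical helper, not something from the slides.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(a, b):
    """Paired t statistic: mean difference divided by its standard error."""
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))

# Hypothetical angles (degrees) from two methods on the same four subjects.
method_a = [10, 20, 30, 40]
method_b = [40, 30, 20, 10]

t_stat = paired_t(method_a, method_b)
mean_abs_diff = mean(abs(x - y) for x, y in zip(method_a, method_b))

print(t_stat)         # the methods do not differ on average
print(mean_abs_diff)  # yet the individual readings are far apart
```

In this constructed case the t statistic gives no hint of disagreement even though no pair of readings matches, so a non-significant paired t-test should not be read as evidence of reliability.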

Reliability • Extent to which scale items measure the same construct, free from random error • Are the measured values similar each time the measurement is taken? • Test-retest reliability, Inter-rater reliability, Intra-rater reliability, Alternative form reliability, Internal consistency.

Test-retest reliability • Mainly psychometric analysis: interviews, questionnaires, etc. • The same test is administered again after a set time interval. • Cohen’s kappa, weighted kappa, Pearson’s correlation, Intraclass correlation coefficient (ICC). • Cf) Intra-rater (intra-observer) reliability: radiographic measurements, etc. • Memory contamination

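Inter-rater reliability • Interviews by experts, scoring, physical measurements, radiographic measurements. • Several raters measure the same subject, and the results are compared for similarity. • Cf) Agreement: often used interchangeably, but applies especially to measurements made with different instruments, for example a comparison of MRI and CT. • In radiographic measurement, intra- and inter-observer (rater) reliability are reported as a set. • Cohen’s kappa, weighted kappa, Pearson’s correlation, Intraclass correlation coefficient (ICC)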

Internal consistency • A somewhat different meaning from the preceding kinds of reliability. Used mainly in psychometric analysis (questionnaires, interviews). • Homogeneity • For example, if there are 10 items, the items are similar to one another. • Item to item, Item to total, Cronbach’s alpha • Too high internal consistency = Item redundancy. • Cf) Uni-dimensionality, Item response theory, Rasch analysis (INFIT statistics)
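As an illustration of Cronbach’s alpha and of the item-redundancy point, here is a small sketch using the standard formula alpha = k/(k-1) × (1 − sum of item variances / variance of total scores). The questionnaire scores are invented, and `cronbach_alpha` is a hypothetical helper name.

```python
from statistics import variance

def cronbach_alpha(items):
    """items: one list of scores per item, same respondents in the same order."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]  # total score per respondent
    item_var_sum = sum(variance(scores) for scores in items)
    return k / (k - 1) * (1 - item_var_sum / variance(totals))

# Hypothetical 3-item questionnaire answered by 5 respondents.
items = [
    [1, 2, 3, 4, 5],
    [2, 2, 3, 5, 5],
    [1, 3, 3, 4, 4],
]
alpha = cronbach_alpha(items)

# Three copies of the same item: fully redundant, so alpha hits its ceiling.
redundant_alpha = cronbach_alpha([items[0]] * 3)

print(round(alpha, 3), round(redundant_alpha, 3))
```

The second call shows why a very high alpha is not automatically good news: duplicated items raise alpha without adding information.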

Question: which is reliable? (Figure: four options, numbered 1 to 4)

What are the main measures of reliability? • What if the data are dichotomous or polychotomous? • Kappa coefficient • What if the data are quantitative (interval or ratio scale)? • Intraclass Correlation Coefficient (ICC)
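For the categorical case, a minimal sketch of Cohen’s kappa for two raters, computed as (observed agreement − chance agreement) / (1 − chance agreement). The ratings are hypothetical, and `cohen_kappa` is an illustrative helper rather than a library call.

```python
from collections import Counter

def cohen_kappa(rater1, rater2):
    """Unweighted Cohen's kappa for two raters scoring the same subjects."""
    n = len(rater1)
    p_observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    c1, c2 = Counter(rater1), Counter(rater2)
    # Chance agreement: product of the raters' marginal proportions per category.
    p_expected = sum(c1[c] * c2[c] for c in set(rater1) | set(rater2)) / n**2
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical dichotomous ratings (1 = present, 0 = absent) on 10 subjects.
rater1 = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
rater2 = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]

kappa = cohen_kappa(rater1, rater2)
print(round(kappa, 2))
```

Here the raters agree on 8 of 10 subjects, but half of that agreement is expected by chance, so kappa is noticeably lower than the raw 80% agreement.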

ICC • Intraclass correlation coefficient • Reliability test for quantitative data

Models of ICC • One-way random effect model • Raters: a random effect • Two-way random effect model • Raters: a random effect • Subjects: a random effect • Two-way mixed effect model • Raters: a fixed effect • Subjects: a random effect

Types of ICC • Absolute agreement • Measures if raters assign the same absolute score • Consistency • Measures if raters’ scores are highly correlated even if they are not identical in absolute terms
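The difference between the two types can be sketched with a rater who reads a constant 5 units higher than a colleague. The code uses the two-way formulas of Shrout and Fleiss (1979): ICC(2,1) behaves as an absolute-agreement index and ICC(3,1) as a consistency index. The names JMS (between-rater mean square) and EMS (residual mean square) follow that paper's notation; the helper functions and the data are hypothetical.

```python
def two_way_mean_squares(table):
    """table: one row per subject, one column per rater.
    Returns (BMS, JMS, EMS): between-subject, between-rater, residual mean squares."""
    n, k = len(table), len(table[0])
    grand = sum(map(sum, table)) / (n * k)
    row_means = [sum(row) / k for row in table]
    col_means = [sum(col) / n for col in zip(*table)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in table for x in row)
    ss_error = ss_total - ss_rows - ss_cols
    return ss_rows / (n - 1), ss_cols / (k - 1), ss_error / ((n - 1) * (k - 1))

def icc_2_1(table):
    """Two-way random effects, single measures (penalizes rater offsets)."""
    n, k = len(table), len(table[0])
    bms, jms, ems = two_way_mean_squares(table)
    return (bms - ems) / (bms + (k - 1) * ems + k * (jms - ems) / n)

def icc_3_1(table):
    """Two-way mixed effects, single measures (ignores rater offsets)."""
    k = len(table[0])
    bms, _, ems = two_way_mean_squares(table)
    return (bms - ems) / (bms + (k - 1) * ems)

# Hypothetical data: rater B always reads 5 units higher than rater A.
table = [[a, a + 5] for a in (10, 20, 30, 40, 50)]

agreement = icc_2_1(table)
consistency = icc_3_1(table)
print(round(agreement, 3), round(consistency, 3))
```

The consistency form is perfect because the two raters rank and space the subjects identically, while the absolute-agreement form is lower because their scores never coincide.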

Measures of ICC • Single measures • Individual ratings constitute the unit of analysis • Average measures • The mean of all ratings is the unit of analysis

ICC • Affected by true subject variability as well as measurement error

Example • Measurement error • Data 1 = Data 2 • Subject variability • Data 1 < Data 2 (Figure: Data 1 and Data 2)

Shrout and Fleiss, 1979 • Proposed 6 ICC types • ICC(1,1), ICC(2,1), ICC(3,1): expected reliability of a single rater’s rating • ICC(1,k), ICC(2,k), ICC(3,k): expected reliability of the mean of a set of k raters

k (no. of observers), n (no. of targets)

Between-target mean square (BMS); within-target mean square (WMS); BMS represents true subject variability, and WMS represents measurement error
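For reference, the one-way forms in Shrout and Fleiss (1979) can be written with these two mean squares alone. The two-way forms need additional mean squares that are not defined on this slide, so they are left out here.

```latex
\mathrm{ICC}(1,1) = \frac{\mathrm{BMS} - \mathrm{WMS}}{\mathrm{BMS} + (k-1)\,\mathrm{WMS}},
\qquad
\mathrm{ICC}(1,k) = \frac{\mathrm{BMS} - \mathrm{WMS}}{\mathrm{BMS}}
```

With WMS held fixed, a larger BMS pushes both expressions toward 1, which is one way to see that the ICC depends on subject variability as well as on measurement error.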

Shrout and Fleiss, 1979 • Important issue in the choice of an appropriate index • Whether the ANOVA design should be one way or two way • Whether raters are considered fixed or random effects • Whether the unit of analysis is a single rater or the mean of several raters

Pitfalls and important issues in testing reliability using ICC in orthopaedic research

Literature review • PubMed database • Orthopaedic articles that used ICC • Of the 92 articles identified, 58 (63%) did not clarify the ICC model used. • The model, types, and measures used were clearly declared in only 5 (5%)

ICC of physical examinations • 30 patients with CP • Interobserver reliability of physical examinations using ICC • Popliteal angle • Thomas test • Staheli test • Same dimension!! (joint angle)

Simulated data

Conclusion • The ICC value could show a tendency opposite to the true measurement error (mean absolute difference), even when similar dimensions are measured. • The ICC could vary depending on the model used. • The ICC value was affected by measurement error, subject variability, and slopes.

In conclusion, this is what to do • ICC values were large when measurement errors were small, subject variability was large, and slopes were parallel. • Clinical context needs to be considered when interpreting ICC. • The ICC settings (model, type, and measures) should be declared.
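The dependence on subject variability can be sketched with two made-up data sets in which the two raters differ by exactly the same amounts, but the subjects are spread over a narrow or a wide range. The example uses the one-way ICC(1,1) of Shrout and Fleiss (1979); the data and helper names are hypothetical and are not the simulated data from the slides.

```python
from statistics import mean

def icc_1_1(table):
    """One-way random effects, single measures: (BMS - WMS) / (BMS + (k - 1) * WMS)."""
    n, k = len(table), len(table[0])
    row_means = [mean(row) for row in table]
    grand = mean(row_means)  # equals the grand mean because the design is balanced
    bms = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    wms = sum((x - m) ** 2 for row, m in zip(table, row_means) for x in row) / (n * (k - 1))
    return (bms - wms) / (bms + (k - 1) * wms)

def mean_abs_diff(table):
    """Mean absolute difference between two raters."""
    return mean(abs(a - b) for a, b in table)

# Rater B deviates from rater A by the same offsets in both data sets.
offsets = [2, -2, 2, -2, 2, -2]
narrow = [[a, a + d] for a, d in zip([30, 31, 32, 33, 34, 35], offsets)]
wide = [[a, a + d] for a, d in zip([10, 20, 30, 40, 50, 60], offsets)]

for name, data in (("narrow", narrow), ("wide", wide)):
    print(name, mean_abs_diff(data), round(icc_1_1(data), 3))
```

Both data sets have the same mean absolute difference, yet the wide-range set produces a much higher ICC, which is why the clinical context and the spread of the sample matter when interpreting an ICC value.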

Validity

Validity • Extent to which an instrument really measures what it purports to measure. • Usually referred to as internal validity. • Cf) external validity = generalisability

Validity • Face validity • Content validity • Criterion (concurrent, predictive) validity • Construct (convergent, discriminant) validity

Face validity • Can be confused with content validity, but is somewhat more abstract. • For example, if an English exam contains math questions, there is a problem with face validity. • Usually reported as no more than a screening by the authors.

Content validity • Similar to face validity, but analyzed more systematically. • A panel of a set size scores the content validity of each item, and an item is rejected if its mean score falls below the threshold.

Criterion validity • Concurrent validity: how close is it to the gold standard? • Example: a radiographic index is measured and compared with CT measurements, which are regarded as the gold standard. • Cf) convergent validity. • Predictive validity

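Construct validity • Convergent validity: is it correlated with a similar measure (one that is not a gold standard)? • An English test called TEPS was developed. To examine its validity, its correlation with TOEFL was assessed. (What is the gold standard for English proficiency?) • Is there a correlation between measurements made by a person and measurements made by a computer? • Pearson correlation.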
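A minimal sketch of the Pearson correlation used for convergent validity, applied to invented person-versus-computer measurements. `pearson_r` is a hypothetical helper written out from the textbook formula.

```python
from math import sqrt
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical angles (degrees): measured by a person vs. measured by software.
manual = [20, 25, 30, 35, 40]
computer = [22, 24, 33, 34, 42]

r = pearson_r(manual, computer)
print(round(r, 3))
```

A high r here says only that the two methods rise and fall together, which is the question convergent validity asks.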

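Construct validity • Discriminant validity: is it correlated with a measure of something entirely different? • Example: the correlation between a personality test and an intelligence test • Cf) Known-group validity: do clearly different groups obtain different scores?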

Others • Precision • Responsiveness • Sensitivity • Specificity • Sensitivity analysis • Item response theory • Rasch analysis

Introduction • Increased femoral anteversion and coxa valga are common deformities associated with intoeing gait and unstable hips in CP, which need surgical correction.

Introduction • Physical examination and the neck shaft angle measured on hip radiographs are the primary tools for evaluating femoral anteversion and coxa valga.

Introduction • Physical examinations measuring femoral anteversion include • Trochanteric prominence angle test (TPAT) • Hip internal rotation (IR) • Hip external rotation (ER)

Introduction • CT measurement is accurate, but expensive and involves radiation exposure.

Purpose of Study • To assess the validity and reliability of physical exams measuring femoral anteversion and neck shaft angle on hip X-ray • Concurrent validity • Intra- and interobserver reliability

(Figure panels) • Reliable and valid • Not reliable but valid • Not reliable and not valid • Reliable but not valid

Materials and Methods • Prospective study approved by IRB • 36 consecutive patients with CP • Mean age 11.0 years (SD 1.3) • M : F = 26 : 10 • GMFCS level I / II / III / IV / V = 5 / 11 / 11 / 7 / 2 • Exclusion • Previous operation, trauma, infection, etc.

Hip Internal Rotation • Prone position • Angle between vertical line & long axis of the leg • legs are rotated outward maximally
