
Reliability


Presentation Transcript


  1. Reliability

  2. Reliability - meanings • Everyday uses: • A reliable machine starts and runs continuously after we push the ON button. • A reliable employee arrives on time and is rarely absent. • A reliable source provides accurate information. • A reliable car dealer has been in business for many years and gives good customer service.

  3. Reliability – psychological testing • In testing theory reliability means: • Replicability – the score can be replicated. • Consistency – the same construct is assessed throughout the test. • Reliability is a feature of all measurement tools.

  4. Intuitive perspective • a measuring ruler • weighing scales

  5. True Score Theory • Frameworks for test reliability: • Classical test theory (CTT) • Item response theory (IRT) • Generalizability theory • Most popular: CTT • IRT is gaining popularity

  6. Classical test theory

  7. Observed score (X) • Observed score (X) is a person’s actual score on a test. • E.g. 15 correct answers out of 20 on an exam. • It may be affected by many factors, either positively or negatively. • Examples?

  8. True score (T) • True score is the score a person would get if all sources of unreliability were removed or cancelled out. • It is like the average score from an infinite number of administrations of the test. • Different conditions may introduce some unreliability. • In practice: unobservable.

  9. Error score (E) • Error score is the difference between the observed score and the true score. • E may add something to X or subtract something from X. • Transformation: X = T + E, so T = X − E and E = X − T

  10. Error score • Error has an unsystematic influence on the observed score. • Therefore it is random, which means that if we tested someone an infinite number of times: • All possible errors would be normally distributed with mean = 0. • E would not be in a systematic relationship with T. • The correlation between errors from different administrations would be 0.

  11. Variance of true score • When we consider groups of scores, the observed score variance decomposes as: • σ²(X) = σ²(T) + σ²(E) • i.e., observed variance = true variance + error variance.

  12. Definition of reliability • Using these symbols, we define the reliability of a test as: • r_XX = σ²(T) / σ²(X)

  13. Definition of reliability • Reliability is the proportion of observed score variance that is true score variance.
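
A minimal Python sketch of this definition, using made-up simulation parameters (the true-score SD of 15 and error SD of 5 are arbitrary choices): observed scores are generated as X = T + E, and the ratio of true variance to observed variance recovers the reliability.

import numpy as np

rng = np.random.default_rng(0)
n = 100_000

true_scores = rng.normal(loc=100, scale=15, size=n)   # T
errors = rng.normal(loc=0, scale=5, size=n)           # E: mean 0, independent of T
observed = true_scores + errors                       # X = T + E

print(observed.var(), true_scores.var() + errors.var())   # sigma^2(X) ~= sigma^2(T) + sigma^2(E)
print(true_scores.var() / observed.var())                  # reliability ~= 225 / 250 = 0.90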

  14. Variance components • [Diagram: total observed variance split into true variance and error variance; in the example shown, the error variance is relatively small.]

  15. Link to empirical studies • Alternatively, we can think about reliability in terms of the stability of a score over time and across different conditions. • We call a test reliable if the score obtained by an individual is repeatable. • Regardless of the situation, he or she will always obtain the same score.

  16. Definition with empirical link • Reliability – how stable a test score is when we repeat the measurement. • In practice there are many methods for determining reliability. • All adopt the above definition.

  17. Methods for assessing reliability • Test-retest reliability • Alternate form reliability • Internal consistency: • Split-half reliability • Kuder-Richardson formulas • Cronbach’s alpha • Inter-rater reliability

  18. Methods for assessing reliability • The analyses differ and yield different reliability coefficients. • Each method provides different information about the test. • It is recommended to use at least two methods.

  19. Test-retest reliability • Administering the same test to the same individuals on two separate occasions. • The two occasions are a week to a few months apart. • The reliability coefficient is the correlation between the two sets of scores. • The higher the correlation, the more reliable the test.
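
A short Python sketch of the computation, using invented scores for eight examinees (the numbers are illustrative only): the test-retest coefficient is simply the Pearson correlation between the two administrations.

from scipy.stats import pearsonr

time1 = [15, 12, 18, 9, 14, 17, 11, 16]   # scores at the first administration
time2 = [14, 13, 17, 10, 15, 18, 10, 15]  # scores a few weeks later

r, _ = pearsonr(time1, time2)
print(f"test-retest reliability: r = {r:.2f}")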

  20. Sources of unreliability • Changes in personal conditions: • Fatigue, emotional states • Learning • Motivation • Stability of the test across various situations: • Testing conditions • Climate (weather) conditions • Stability of the trait over time.

  21. Stability over time • How long should the interval between the two measurements be? • The content should be forgotten. • The construct shouldn’t change. • It depends on the test and the underlying construct. • The means of the two administrations should be controlled.

  22. Example

  23. Example

  24. Stability - problems • Children change fast. • Intelligence, knowledge – subjects can learn. • Trait vs. state. • It is difficult to test the same people twice. • It doesn’t say anything about the content. • The first measurement affects the second.

  25. Test-retest - practice • Types of tests it is used for: • Trait assessment • Personality questionnaires • Cognitive performance (excluding knowledge of facts) • Preferences, attitudes

  26. Alternate form reliability • Avoids the problem of learning and remembering the content. • Requires two forms of the test. • The two forms should be similar in terms of: • Number of items • Time limits • Content specification • Instructions

  27. Alternate form reliability • The two forms should also be similar in terms of statistics: • Equal means • Equal SDs • Equal intercorrelations • They should correlate equally with any external variable.

  28. Administration • The same group of examinees completes both forms. • Usually there is some time between the two administrations. • The interval depends on the test and the construct (as with test-retest). • The correlation between the forms is the reliability coefficient.

  29. Sources of unreliability • Stability over time. • Sensitivity to contextual factors (e.g. testing conditions). • Item content – similarity between the two forms.

  30. Where to use • All tools that measure traits. • E.g. personality questionnaires • Intelligence tests • Knowledge of facts

  31. Alternate forms - problems • In practice – rarely used. • It is difficult to construct two parallel tests. • It is difficult to find parallel items with very similar content but expressed differently.

  32. Internal consistency • One of the most frequently used approaches. • A few methods: • Split-half reliability • Kuder-Richardson formulas • Coefficient alpha

  33. Split-half Reliability • How to avoid two measurements? • Measure only once. • How is the assumption about the replicability of a score violated?

  34. Split-half Reliability • It is as if two alternate forms were administered in immediate succession. • Split-half means that we administer one test. • After we collect the tests, we split the items into halves. • Then we treat the halves as alternate forms.

  35. Split-half Reliability • How to split the test into halves? • It depends on the content of the test. • There are a few possibilities: • Randomly chosen items • Odd-even items • With respect to item content and item statistics

  36. Random items • Used if we can assume that all test items are equal in terms of content and statistics. • E.g. many personality questionnaires: we assume that items are equally important. • Then it doesn’t matter how we split the items.

  37. Random choice - example 1. Worry about things.   2. Fear for the worst.   3. Am afraid of many things.   4. Get stressed out easily.   5. Get caught up in my problems. 6. Feel threatened easily.

  38. Split-half: odd-even • E.g. some intelligence tests have items of increasing difficulty. • Examinees may be fatigued toward the end of the test. • A timing effect affects the second half. • A simple first-half/second-half split, or a random selection, is not appropriate.

  39. Split-half: odd-even • The two halves are balanced. • E.g., each contains items of similar difficulty.
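
A brief Python sketch of an odd-even split on a (people × items) matrix of 0/1 responses; the response matrix is invented for illustration. Only the correlation between the half scores is computed here; the Spearman-Brown correction introduced later is still required.

import numpy as np

responses = np.array([        # rows = examinees, columns = items 1..8
    [1, 1, 0, 1, 1, 0, 1, 1],
    [0, 1, 0, 0, 1, 0, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 0],
    [0, 0, 0, 1, 0, 0, 1, 0],
    [1, 0, 1, 1, 1, 0, 1, 1],
])

odd_half = responses[:, 0::2].sum(axis=1)    # items 1, 3, 5, 7
even_half = responses[:, 1::2].sum(axis=1)   # items 2, 4, 6, 8

r_half = np.corrcoef(odd_half, even_half)[0, 1]
print(f"correlation between halves: {r_half:.2f}")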

  40. Split-half: item content • Sometimes groups of items are distributed throughout the test. • We need to balance the halves with items from the different groups.

  41. Item content - example 1.Feel comfortable around people.   2. Love excitement. 3. Seek adventure. 4. Love action. 5. Make friends easily.   6. Willing to try anything once. 7. Am skilled in handling social situations.   8. Am the life of the party.  

  42. Formula • The simple correlation between the two halves does not give the reliability of the full-length test. • It gives the reliability of only one half. • A correction must be applied: the Spearman-Brown formula.

  43. Spearman-Brown formula • r_SB = 2·r_hh / (1 + r_hh) • r_SB – reliability of the entire test • r_hh – correlation between the two halves

  44. Example • Correlation between halves = 0.5 • r_SB = 2 × 0.5 / (1 + 0.5) = 1 / 1.5 ≈ 0.67

  45. Example • Correlation between halves = 0.9 • r_SB = 2 × 0.9 / (1 + 0.9) = 1.8 / 1.9 ≈ 0.95
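
A tiny Python helper implementing the correction; the two calls reproduce the worked examples above.

def spearman_brown(r_half: float) -> float:
    """Reliability of the full-length test from the correlation between its two halves."""
    return 2 * r_half / (1 + r_half)

print(round(spearman_brown(0.5), 2))   # 0.67
print(round(spearman_brown(0.9), 2))   # 0.95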

  46. Conclusions • The Spearman-Brown result is always higher than the initial correlation between the halves. • The higher the correlation between the halves, the higher the Spearman-Brown result.

  47. Split-half summary • The source of error: inconsistency between the halves. • Different content in the two halves. • A good method for tests with items of increasing difficulty, like many intelligence tests.

  48. Internal consistency – other methods • Problem with split-half: the result depends on how we split the test. • A few methods have been proposed that avoid this problem. • They give a result that equals the mean of the split-half coefficients from all possible splits of the test.

  49. Cronbach’s alpha • Very widely used. • Assumes that all items are equal in terms of: • The underlying theory – all items are good representations of the construct • Statistics – e.g. there are no big differences in difficulty • A well-suited method for most personality tests.

  50. Formula • α = (k / (k − 1)) × (1 − Σσ²(i) / σ²(X)) • k – number of items • σ²(X) – variance of the total scores • Σσ²(i) – sum of the variances of each item
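
A compact Python sketch of this formula for a (people × items) score matrix; the data below are invented (five examinees answering four items on a 1–5 scale).

import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Coefficient alpha: k/(k-1) * (1 - sum of item variances / variance of total scores)."""
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)        # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)    # variance of the total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

data = np.array([
    [4, 5, 4, 4],
    [2, 3, 2, 3],
    [5, 5, 4, 5],
    [3, 2, 3, 3],
    [1, 2, 2, 1],
])
print(f"alpha = {cronbach_alpha(data):.2f}")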
