General look at testing - taking a step backwards
Issues of importance within testing • Categories of Tests - different classification methods • Content vs non-content • Uses and Users of Tests • Assumptions and Questions behind the use of tests • Creating a test from scratch
Categories of Tests / content • Mental Abilities - intelligence tests, individually administered (e.g., Wechsler Adult Intelligence Scale - WAIS) or group administered; also tests of memory, spatial ability and creativity • Achievement Tests - batteries (a series of tests, e.g., reading & mathematics) or single subject (e.g., Scholastic Assessment Test - SAT); used for certification, vocational selection, government standardisation and individual diagnosis
Categories of Tests / content • Personality - objective (e.g., Minnesota Multiphasic Personality Inventory - MMPI) or projective (e.g., Rorschach Inkblot Test) • Interests & Attitudes - vocational interests, attitudes, values • Neuropsychological - e.g., Luria-Nebraska Neuropsychological Battery - LNNB
Categories of Tests / non-content • Paper and pencil vs. performance • Paper and pencil: the respondent selects between predefined answers • Performance: the examinee performs some action and is judged on it • Speed vs. power • A pure speed test is interested only in how quickly items are answered • A power test probes the limits of knowledge or ability - no time limit is imposed • In practice most tests tap both at the same time
Categories of Tests / non-content • Individual vs. group testing • Maximum vs. typical performance • Ability tests usually want to know about best possible performance • Personality tests ask about typical behaviour - e.g., how extroverted you typically are • Norm-referenced vs. criterion-referenced performance • Norm-referenced: only performance relative to other test-takers is considered • Criterion-referenced: how well you did relative to predefined criteria
Users of tests • Professional psychologists - time spent in assessment • Psychologists working in a mental health setting spend 15-18% of their time on assessment (Corrigan et al., '98) • Over 80% of neuropsychologists spend 5 or more hours/wk (Camara et al., '00) • Educational psychologists - half of the working week (Hutton et al., '00) • Two-thirds of counselling psychologists use objective measures regularly (Watkins et al., '98)
Other uses of tests • Within education • To measure performance or predict future success • Personnel • To select the appropriate person, or the task to which a person is most suited • Research • A test often serves as the operational definition of the dependent variable (DV)
Basic assumptions • Humans must possess recognisable traits which we consider to be important • Individuals must potentially differ on these traits • These traits must be quantifiable • Traits must be stable across time • Traits must bear a relationship to actual behaviour
Issues to be concerned about • How the test was developed • Reliability • Validity
Constructing a reliable test • Is a much more extensive process than the average user realises • Most personality constructs have already been established - tests to measure them are readily available - so a proliferation of new tests would seem pointless from a theoretical point of view
Writing test items • Question format was covered before - in addition: • All aspects of the construct need to be dealt with - for anxiety, e.g., every facet of the construct should be considered • The test needs to be long enough to be reliable - start with around 30 items and reduce to 20 • Each item should assess only one trait • Items should be culturally neutral • Items should not be the same item rephrased (as mentioned during factor analysis)
Establishing item suitability • There should not be too many items which are either very easy or very hard • More than 10% of items with mean scores below .2 or above .8 is questionable • Items should have an acceptable standard deviation - if it is too low the item is not tapping individual differences • If the test covers different constructs, it is important that an equal number of items refer to each construct
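A minimal sketch of these screening rules (not from the lecture), using simulated dichotomous responses; the .2/.8 cutoffs and the >10% rule follow the slide, while the SD cutoff and all data are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.uniform(0.05, 0.95, size=30)                 # true difficulty per item
responses = (rng.random((200, 30)) < p).astype(int)  # 200 simulated respondents

difficulty = responses.mean(axis=0)        # proportion endorsing each item
spread = responses.std(axis=0, ddof=1)     # item standard deviation

extreme = (difficulty < 0.2) | (difficulty > 0.8)
print(f"Very easy or very hard items: {extreme.sum()} of {responses.shape[1]}")
if extreme.mean() > 0.10:                  # the >10% rule from the slide
    print("More than 10% of items sit at the extremes - questionable item set")

low_sd = spread < 0.3                      # cutoff chosen for illustration only
print("Items with low SD (not tapping individual differences):",
      np.where(low_sd)[0])
```

Note that for dichotomous items the SD is tied directly to the difficulty (sd = sqrt(p(1-p))), so the two checks largely overlap; for rating-scale items they can diverge.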
Establishing item suitability • Criterion keying - choosing items based on their ability to differentiate between groups • Atheoretical • The groups must be well defined • Interpret liberally, since there will be overlap in the groups' response distributions • By factor analysis - items with a low loading (<.3) on the intended factor would be removed
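The criterion-keying step can be sketched as follows; the groups, data, and effect-size cutoff are all hypothetical, chosen only to illustrate retaining items that separate two well-defined groups:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical data: only the first 10 of 30 items truly separate the groups
shift = np.r_[np.full(10, 0.8), np.zeros(20)]
clinical = rng.normal(loc=shift, scale=1.0, size=(80, 30))
control = rng.normal(loc=0.0, scale=1.0, size=(80, 30))

# Cohen's d per item: standardised mean difference between the groups
pooled_sd = np.sqrt((clinical.var(axis=0, ddof=1)
                     + control.var(axis=0, ddof=1)) / 2)
d = (clinical.mean(axis=0) - control.mean(axis=0)) / pooled_sd

# Cutoff chosen for illustration; the response distributions overlap,
# so retained items are interpreted liberally, as the slide warns
keep = np.abs(d) > 0.5
print("Items retained by criterion keying:", np.where(keep)[0])
```

The factor-analytic screen follows the same pattern: fit a one-factor model and drop any item whose absolute loading falls below .3.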
Establishing item suitability • Classical item analysis • The correlation of each item's score with the score on the whole test (excluding that item) is calculated • Removing an item with a low item-total correlation improves reliability • But since reliability is also a product of the number of items, there is a balance • A point comes where removing a poor item decreases reliability, because reliability depends on both the average inter-item correlation and the number of items in the test • Each time an item is removed, the correlation of every remaining item with the total score must be recalculated, since these correlations change as items are removed
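A sketch of this loop with simulated data: compute each item's corrected item-total correlation, drop the weakest item, recompute everything, and stop once a removal no longer raises alpha - the balance point the slide describes. The 30-to-20 item counts follow the earlier slide; the data are illustrative.

```python
import numpy as np

def cronbach_alpha(items):
    """Alpha from the item variances and the variance of the total score."""
    k = items.shape[1]
    return (k / (k - 1)) * (1 - items.var(axis=0, ddof=1).sum()
                            / items.sum(axis=1).var(ddof=1))

def corrected_item_total(items):
    """Correlation of each item with the sum of the *other* items."""
    total = items.sum(axis=1)
    return np.array([np.corrcoef(items[:, j], total - items[:, j])[0, 1]
                     for j in range(items.shape[1])])

rng = np.random.default_rng(2)
construct = rng.normal(size=(300, 1))
good = construct + rng.normal(size=(300, 24))   # items tapping the construct
bad = rng.normal(size=(300, 6))                 # pure-noise items diluting it
items = np.hstack([good, bad])

while items.shape[1] > 20:                      # reduce from ~30 towards ~20
    candidate = np.delete(items, corrected_item_total(items).argmin(), axis=1)
    if cronbach_alpha(candidate) <= cronbach_alpha(items):
        break                                   # removal no longer helps: stop
    items = candidate                           # correlations recomputed next pass

print(f"{items.shape[1]} items kept, alpha = {cronbach_alpha(items):.2f}")
```

The pure-noise items are removed first and alpha rises; once only construct-relevant items remain, dropping another would lower alpha, so the loop stops.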
Revisiting reliability and validity • Each scale should assess one psychological construct • Measurement error means that, for any one item, the psychological construct accounts for only a low percentage of the variation in respondents' answers • Other factors cause most of the variation - age, religious beliefs, sociability, peer-group pressure • Use several items and this random variation should cancel out, so that the measured variance reflects the underlying construct
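A small simulation of this argument (all numbers are illustrative): any single item correlates only weakly with the underlying construct, but the random parts cancel in the 20-item total.

```python
import numpy as np

rng = np.random.default_rng(3)
construct = rng.normal(size=5000)                # true trait level per person
noise = rng.normal(scale=2.0, size=(5000, 20))   # error swamps any single item
items = construct[:, None] + noise

r_single = np.corrcoef(items[:, 0], construct)[0, 1]
r_total = np.corrcoef(items.sum(axis=1), construct)[0, 1]
print(f"one item: r = {r_single:.2f}; 20-item total: r = {r_total:.2f}")
```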
Reliability of a test • Does not mean temporal stability (test-retest reliability, measured through parallel forms) • Is a measure of the extent to which a scale measures one construct only • Split-half reliability • Cronbach's alpha • Influenced by the average correlation between the items and the number of items in the test • Artificially boosted by asking the 'same' question twice • A test should not be used if alpha is below .7
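Both indices can be computed in a few lines; a sketch with simulated data, including the .7 usability check from the slide:

```python
import numpy as np

rng = np.random.default_rng(4)
trait = rng.normal(size=(400, 1))
scores = trait + rng.normal(scale=1.2, size=(400, 20))  # 20 simulated items

# Split-half: correlate odd- and even-item half-scores, then step the
# correlation up to full test length with the Spearman-Brown formula
half_a = scores[:, ::2].sum(axis=1)
half_b = scores[:, 1::2].sum(axis=1)
r_half = np.corrcoef(half_a, half_b)[0, 1]
split_half = 2 * r_half / (1 + r_half)

# Cronbach's alpha: rises with the average inter-item correlation and with k
k = scores.shape[1]
alpha = (k / (k - 1)) * (1 - scores.var(axis=0, ddof=1).sum()
                         / scores.sum(axis=1).var(ddof=1))

print(f"split-half = {split_half:.2f}, alpha = {alpha:.2f}")
if alpha < 0.7:
    print("alpha below .7 - the test should not be used")
```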
Test Validity • Face Validity • Content Validity • Construct Validity • Predictive Validity