Characteristics of Successful Assessment Measures • Reliable • Valid • Efficient • - Time • - Money • - Resources • Don’t result in complaints
What Do We Mean by Reliability? • The extent to which a score from a test is consistent and free from errors of measurement
Methods of Determining Reliability • Test-retest (temporal stability) • Alternate forms (form stability) • Internal reliability (item stability) • Interrater Agreement
Reliability Test-Retest
Test-Retest Reliability • Measures Temporal Stability • Stable measures • Measures expected to vary • Administration • Same participants • Same test • Two testing periods
Test-Retest Reliability: Scoring • To obtain the reliability of an instrument, the scores at time one are correlated with the scores at time two • The higher the correlation, the more reliable the test
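A minimal sketch of this scoring step, assuming hypothetical scores and the scipy library: correlate each examinee's time-1 score with their time-2 score.

```python
# Test-retest reliability: correlate scores from the two administrations.
# Data are hypothetical; assumes scipy is installed.
from scipy.stats import pearsonr

time1 = [82, 75, 91, 68, 88, 79, 95, 73]  # first administration
time2 = [85, 72, 89, 70, 90, 77, 93, 75]  # second administration, same examinees

r, _ = pearsonr(time1, time2)
print(f"Test-retest reliability r = {r:.2f}")
```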
Test-Retest Reliability: Problems • Sources of measurement error: • - Characteristic or attribute being measured may change over time • - Reactivity • - Carry-over effects • Practical problems: • - Time consuming • - Expensive • - Inappropriate for some types of tests
Standard Error of Measurement • Provides a range of estimated accuracy • 1 SE = 68% confident • 1.96 SE = 95% confident • The higher the reliability of a test, the lower the standard error of measurement • Formula: SEM = SD × √(1 − reliability)
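A short sketch of how the standard error of measurement translates into a confidence band around an observed score, assuming SEM = SD × √(1 − reliability); the values below are illustrative.

```python
# Standard error of measurement and a 95% confidence band around a score.
import math

def standard_error_of_measurement(sd: float, reliability: float) -> float:
    return sd * math.sqrt(1 - reliability)

sd, reliability, observed = 15, 0.90, 100   # illustrative IQ-type values

sem = standard_error_of_measurement(sd, reliability)
low, high = observed - 1.96 * sem, observed + 1.96 * sem
print(f"SEM = {sem:.2f}; 95% band = {low:.1f} to {high:.1f}")
```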
Serial Killer IQ Exercise • Mean = 100, SD = 15, Reliability = .90 • IQ of 70 for death penalty
Serial Killer IQ: Answers • Mean = 100, SD = 15, Reliability = .90 • IQ of 70 for death penalty
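One way to work the exercise under the stated parameters (a sketch, not necessarily the slide's official answer): SEM = 15 × √(1 − .90) ≈ 4.74, so the 95% band around an observed IQ of 70 is 70 ± 1.96 × 4.74, or roughly 60.7 to 79.3. An observed score of 70 is therefore consistent with a true IQ anywhere from about 61 to 79.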
Reliability Alternate Forms
Alternate Forms Reliability • Establishes form stability • Used when there are two or more forms of the same test • Different questions • Same questions, but different order • Different administration or response method (e.g., computer, oral) • Why have alternate forms? • - Prevent cheating • Prevent carry over from people who take a test more than once • GRE or SAT • Promotion exams • Employment tests
Alternate Forms ReliabilityAdministration • Two forms of the same test are developed, and to the highest degree possible, are equivalent in terms of content, response process, and statistical characteristics • One form is administered to examinees, and at some later date, the same examinees take the second form
Alternate Forms ReliabilityScoring • Scores from the first form of test are correlated with scores from the second form • If the scores are highly correlated, the test has form stability
Alternate Forms ReliabilityDisadvantages • Difficult to develop • Content sampling errors • Time sampling errors
What the Research Shows • Computer vs. Paper-Pencil • - Few test score differences • - Cognitive ability scores are lower on the computer for speed tests but not power tests • Item order • - Few differences • Video vs. Paper-Pencil • - Little difference in scores • - Video reduces adverse impact
Reliability Internal
Internal Reliability • Defines measurement error strictly in terms of consistency or inconsistency in the content of the test • With this form of reliability the test is administered only once and measures item stability
Determining Internal Reliability: Split-Half Method • Test items are divided into two equal parts • Scores for the two parts are correlated to get a measure of internal reliability • Need to adjust for the smaller number of items • Spearman-Brown prophecy formula: • (2 × split-half reliability) ÷ (1 + split-half reliability)
Spearman-Brown Formula • Corrected reliability = (2 × split-half correlation) ÷ (1 + split-half correlation) • If we have a split-half correlation of .60, the corrected reliability would be: (2 × .60) ÷ (1 + .60) = 1.2 ÷ 1.6 = .75
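A sketch of the split-half procedure with the Spearman-Brown correction, using a made-up item-score matrix (rows = examinees, columns = items) and numpy:

```python
# Split-half reliability with the Spearman-Brown correction (hypothetical data).
import numpy as np

scores = np.array([
    [1, 0, 1, 1, 0, 1, 1, 0],
    [1, 1, 1, 1, 1, 1, 0, 1],
    [0, 0, 1, 0, 0, 1, 0, 0],
    [1, 1, 0, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0, 0, 1],
])

odd_half = scores[:, ::2].sum(axis=1)    # items 1, 3, 5, 7
even_half = scores[:, 1::2].sum(axis=1)  # items 2, 4, 6, 8

r_half = np.corrcoef(odd_half, even_half)[0, 1]
r_full = (2 * r_half) / (1 + r_half)     # Spearman-Brown prophecy formula
print(f"Split-half r = {r_half:.2f}, corrected reliability = {r_full:.2f}")
```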
Spearman-Brown Formula: Estimating the Reliability of a Longer Test • Estimated new reliability = (L × current reliability) ÷ (1 + (L − 1) × current reliability) • L = the number of times longer the new test will be
Example • Suppose you have a test with 20 items and it has a reliability of .50. You wonder if using a 60-item test would result in acceptable reliability. • L = 60 ÷ 20 = 3 • (3 × .50) ÷ (1 + (3 − 1) × .50) = 1.5 ÷ 2 = .75 • Estimated new reliability = .75
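The same calculation as a small function, for any lengthening factor L (a sketch of the formula above):

```python
def spearman_brown(reliability: float, L: float) -> float:
    # Estimated reliability when the test is made L times longer.
    return (L * reliability) / (1 + (L - 1) * reliability)

print(spearman_brown(0.50, 3))   # 20-item test tripled to 60 items -> 0.75
```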
Common Methods to Determine Internal Reliability • Cronbach’s Coefficient Alpha • - Used with ratio or interval data • Kuder-Richardson Formula • - Used for tests with dichotomous items • yes-no • true-false • right-wrong
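A sketch of Cronbach's coefficient alpha computed directly from an item-score matrix (the ratings below are hypothetical; with 0/1 items the same formula reduces to KR-20):

```python
# Cronbach's alpha from an item-score matrix (rows = examinees, columns = items).
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    k = items.shape[1]                          # number of items
    item_vars = items.var(axis=0, ddof=1)       # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)   # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

ratings = np.array([
    [4, 5, 4, 3],
    [2, 3, 2, 2],
    [5, 5, 4, 5],
    [3, 4, 3, 3],
    [1, 2, 2, 1],
])
print(f"alpha = {cronbach_alpha(ratings):.2f}")
```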
Interrater Reliability • Used when human judgment of performance is involved in the selection process • Refers to the degree of agreement between 2 or more raters • 3 common methods used to determine interrater reliability • Percent agreement • Correlation • Cohen’s Kappa
Interrater Reliability Methods: Percent Agreement • Determined by dividing the total number of agreements by the total number of observations • Problems • - Exact match? • - Very high or very low frequency behaviors can inflate agreement
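A minimal sketch of percent agreement with hypothetical ratings from two raters:

```python
# Percent agreement: exact matches divided by total observations (hypothetical data).
rater_a = ["pass", "fail", "pass", "pass", "fail", "pass"]
rater_b = ["pass", "fail", "fail", "pass", "fail", "pass"]

matches = sum(a == b for a, b in zip(rater_a, rater_b))
print(f"Percent agreement = {matches / len(rater_a):.0%}")   # 83%
```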
Interrater Reliability Methods: Correlation • Ratings of two judges are correlated • Pearson for interval or ratio data and Spearman for ordinal data (ranks) • Problems • - Shows pattern similarity but not similarity of the actual ratings
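A quick illustration of the problem noted above: two raters whose ratings move in lockstep correlate perfectly even though they never assign the same score (hypothetical data).

```python
# Perfect correlation with zero exact agreement between two raters.
import numpy as np

rater_a = [1, 2, 3, 4, 5]
rater_b = [3, 4, 5, 6, 7]   # always 2 points higher than rater A

print(np.corrcoef(rater_a, rater_b)[0, 1])   # 1.0 despite no identical ratings
```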
Interrater Reliability Methods: Cohen’s Kappa • Allows one to determine not only the level of agreement, but also the level of agreement that would be expected by chance • A kappa of .70 or higher is considered acceptable agreement
[Example agreement table: Forensic Examiner A vs. Forensic Examiner B]
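A sketch of Cohen's kappa with hypothetical calls from two examiners (the data below are illustrative, not the slide's):

```python
# Cohen's kappa: observed agreement corrected for chance agreement.
from collections import Counter

examiner_a = ["match", "match", "no match", "match", "no match", "match", "match", "no match"]
examiner_b = ["match", "no match", "no match", "match", "no match", "match", "match", "no match"]

n = len(examiner_a)
observed = sum(a == b for a, b in zip(examiner_a, examiner_b)) / n

# Chance agreement: product of each rater's marginal proportions, summed over categories.
count_a, count_b = Counter(examiner_a), Counter(examiner_b)
expected = sum((count_a[c] / n) * (count_b[c] / n) for c in set(examiner_a) | set(examiner_b))

kappa = (observed - expected) / (1 - expected)
print(f"kappa = {kappa:.2f}")   # 0.75 for these hypothetical ratings
```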
Increasing Rater Reliability • Have clear guidelines regarding various levels of performance • Train raters • Practice rating and provide feedback
Scorer Reliability • Allard, Butler, Faust, & Shea (1995) • - 53% of hand-scored personality tests contained at least one error • - 19% contained enough errors to alter a clinical diagnosis
Validity The degree to which inferences from scores on tests or assessments are justified by the evidence
Validity refers to the degree to which evidence and theory support the interpretations of test scores entailed by proposed uses of tests. ... The process of validation involves accumulating evidence to provide a sound scientific basis for the proposed score interpretations. It is the interpretations of test scores required by proposed uses that are evaluated, not the test itself. When test scores are used or interpreted in more than one way, each intended interpretation must be validated. Sources of validity evidence include, but are not limited to: evidence based on test content, evidence based on response processes, evidence based on internal structure, evidence based on relations to other variables, and evidence based on consequences of testing. Standards for Educational and Psychological Testing (1999)
Common Methods of Determining Validity • Content Validity • Criterion Validity • Construct Validity • Known Group Validity • Face Validity
Validity Content Validity
Content Validity • The extent to which test items sample the content that they are supposed to measure • In industry the appropriate content of a test or test battery is determined by a job analysis • Considerations • The content that is actually in the test • The content that is not in the test • The knowledge and skill needed to answer the questions
Test of Logic • Stag is to deer as ___ is to human • Butch is to Sundance as ___ is to Sinatra • Porsche is to cars as Gucci is to ____ • Puck is to hockey as ___ is to soccer What is the content of this exam?
Messick (1995)Sources of Invalidity • Construct underrepresentation • Construct-irrelevant variance • Construct-irrelevant difficulty • Construct-irrelevant easiness
[Diagram: overlap between Domain Content and Test Content]
Validity Criterion Validity
Criterion Validity • Criterion validity refers to the extent to which a test score is related to some measure of job performance called a criterion • Established using one of the following research designs: • - Concurrent Validity • - Predictive Validity • - Validity Generalization