Selecting Effective Early Reading Assessments Natalie Rathvon, Ph.D.
What We’ll Cover • A research-based framework for selecting early reading assessments • Application of the framework to selected early reading instruments • Early reading assessment case examples • Resources for early reading assessment and intervention
So many tests, so few guidelines . . . • Growing number of print and online tests that claim to assess or predict reading • Standards for Educational and Psychological Testing (AERA, APA, & NCME, 1999) • Provides general guidelines, not specific criteria, for evaluating psychometric quality
Myths about Early Reading Assessments • All claims that a reading measure is “scientifically based” are equally valid. • A valid and reliable measure is equally valid and reliable for all examinees. • All measures of the same reading component yield similar results for the same examinee.
Why does this happen? • Tests vary widely in their psychometric characteristics and overall soundness. • The framework presented here is drawn from Early Reading Assessment: A Practitioner’s Handbook (Rathvon, 2004).
Early Reading Assessment Models • Traditional: “standard battery” (one size fits all); assumes reading problems arise from internal child deficits; designed to provide a categorical label for programming purposes • Component-based: targets domains related to the identified deficits; assumes most reading problems arise from experiential and/or instructional deficits; designed to provide information for guiding instruction
10 Key Reading Components • 4 Cognitive-linguistic variables: phonological processing, rapid naming, orthographic processing, oral language • 6 Literacy skills: print awareness, alphabet knowledge, single word reading, contextual reading, reading comprehension, written language
Considerations in Selecting Early Reading Assessments • Technical adequacy: Psychometric soundness • Usability: Degree to which practitioners can actually use a measure in applied settings
Five Key Technical Adequacy Characteristics • Norms • Test floors • Item gradients • Reliability • Validity
How can we examine a test’s technical characteristics? • Test manuals? Tremendous variation in quality and quantity of the psychometric information provided • WJ III: 2 examiner manuals, separate 209-page technical manual • Dyslexia Early Screening Test: 7 pages in 45-page manual • Research literature? • Continuing stream of validation data
Norms: How do we interpret performance? • Norm-referenced measures: Comparisons with age/grade peers • Criterion-referenced measures: Comparisons with predetermined performance standards • Nonstandardized measures: Research norms or examiner judgment
Evaluating the Adequacy of Norms • Are they representative? • Criteria: Should match a national or appropriate reference population • Are they recent? • Criteria: No more than 7-12 years old • Are subgroup and sample sizes large enough? • Criteria: At least 100 (subgroup size) & 1,000 (sample size)
Evaluating Norms, II • Are norm table intervals small enough to reflect small changes in skill development and small differences among examinees? • Criteria: • No more than 6 months for students aged 7-11 and younger • No more than 1 year for students aged 8-0 to 18
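The recency, sample-size, and interval criteria above can be applied mechanically. A minimal Python sketch with illustrative values only; the function name and the 12-year outer bound on recency are assumptions for this example, not part of any published standard, and representativeness still has to be judged against the reference population:

```python
# Sketch of the norm-adequacy checks from the two preceding slides.
# All names and example values are illustrative.
def norms_meet_criteria(norm_age_years: int, subgroup_n: int, total_n: int,
                        interval_months: int, examinee_age_years: float) -> bool:
    recent = norm_age_years <= 12                       # "no more than 7-12 years old"
    large_enough = subgroup_n >= 100 and total_n >= 1000
    max_interval = 6 if examinee_age_years < 8 else 12  # 6 months below age 8-0; 1 year for 8-0 to 18
    return recent and large_enough and interval_months <= max_interval

# Hypothetical test: norms 5 years old, 150 examinees per subgroup,
# 1,800 in the total sample, 3-month norm table intervals, 7-year-old examinee.
print(norms_meet_criteria(5, 150, 1800, 3, 7))  # True
```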
Norms example 1: Expressive Vocabulary Test (AGS, 1997) • Norm data collected 1995-1996 (age norms only) • Total norm group = 2,725 examinees • 5-0 to 6-11 group = 119-122 examinees tested in each 6-month interval • Derived scores = 2-month increments • Derived scores for the 5-0 to 6-11 age group are therefore based on only 39-56 examinees, well below the 100-examinee subgroup criterion.
Reliability: Are scores consistent and accurate? • Alternate-form: Form A vs. Form B • Internal consistency: Item A vs. Item B • Test-retest: Time A vs. Time B • Interscorer: Scorer A vs. Scorer B • Criteria: ≥ .80 for screening measures; ≥ .90 for diagnostic measures
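A minimal sketch of checking a reliability coefficient against these criteria. The scores below are made up; in practice the coefficient comes from the test manual or from correlating two sets of actual scores (Time A vs. Time B here, but the same check applies to alternate-form or interscorer data):

```python
# Pearson correlation between two administrations of the same measure,
# checked against the screening (.80) and diagnostic (.90) criteria.
from statistics import correlation  # Python 3.10+

time_a = [92, 85, 101, 78, 110, 95, 88, 104]  # hypothetical Time A standard scores
time_b = [90, 88, 99, 80, 108, 97, 85, 106]   # hypothetical Time B standard scores

r = correlation(time_a, time_b)
print(f"Test-retest r = {r:.2f}")
print("Meets screening criterion (>= .80):", r >= 0.80)
print("Meets diagnostic criterion (>= .90):", r >= 0.90)
```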
Hidden Threat to Reliability • Examiner variance: Differences among assessors in administering tasks and recording responses • Especially likely on: • Live-voice tasks (phoneme blending) • Fluency-based tasks (rapid naming) • Tasks with complex administration or scoring systems (LAC–3)
Reliability Example: TOWRE (PRO-ED, 1999) • Internal consistency = .93 and above • Alternate form = .90 and above • Test-retest = .90 and above in a study with examinees ages 6-9 (n = 29) • Interscorer = .99, based on the agreement of 2 independent scorers across 30 completed protocols
Test Floors: Can the Test Detect Poor Readers? • Test floor: Lowest possible standard score when a student answers 1 item correctly • Adequate floors: Permit identification of students with very weak skills • Inadequate floors: Overestimate students’ level of skills
Test Floor Criteria • A subtest raw score of 1 should yield a standard score more than 2 standard deviations below the subtest mean. • SS of 3 or less for a subtest mean of 10 (SD = 3) • SS of 69 or less for a subtest mean of 100 (SD = 15)
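A minimal sketch of the floor check, using the two standard score metrics above (M = 10, SD = 3; M = 100, SD = 15). The floor values passed in are illustrative, not taken from any real test:

```python
def floor_is_adequate(floor_ss: float, mean: float, sd: float) -> bool:
    """A floor is adequate if a raw score of 1 yields a standard score
    more than 2 standard deviations below the subtest mean."""
    return floor_ss < mean - 2 * sd

print(floor_is_adequate(3, mean=10, sd=3))     # True: 3 is below 10 - 2(3) = 4
print(floor_is_adequate(72, mean=100, sd=15))  # False: 72 is not below 100 - 2(15) = 70
```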
Which Tests Are Likely to Display Floor Effects? • “Cradle-to-grave” tests • Phonemic manipulation tasks (deletion, substitution, reversal) • Oral reading fluency tests • Pseudoword reading tests • Spelling tests • Reading comprehension tests
Item Gradients: Can the Test Detect Small Differences? • Item gradient: Steepness with which standard scores change from 1 raw score unit to another • Adequate gradient: Sensitive to small differences in performance • Steep gradient: Obscures differences among performance levels
Item Gradient Criteria • 6 or more items between subtest floor and mean (M = 10) or • 10 or more items between subtest floor and mean (M = 100) • Caution: Item gradients should be evaluated in the context of test floors.
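A minimal sketch of the gradient check, assuming a hypothetical raw-score-to-standard-score conversion table and one straightforward way of counting the items between the subtest floor and the mean:

```python
# Hypothetical conversion table for a subtest with M = 10: raw score -> standard score.
conversion = {1: 4, 2: 5, 3: 6, 4: 7, 5: 8, 6: 9, 7: 10, 8: 11}

MEAN = 10
raw_at_floor = 1                                                      # floor = raw score of 1
raw_at_mean = min(raw for raw, ss in conversion.items() if ss >= MEAN)
items_between = raw_at_mean - raw_at_floor

# Criterion from the slide: 6 or more items between floor and mean when M = 10.
print(f"Items between floor and mean: {items_between}")
print("Adequate gradient" if items_between >= 6 else "Gradient too steep")
```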
Test Floors and Item Gradients: Special Cases • Screening tests • Critical issue is cutoff score accuracy, not floor/gradient violations • Tests not yielding standard scores • Deciles, percentiles, quartiles, stanines • Rasch-model tests • Preclude direct inspection of raw score-standard score relationships • WJ family: WJ III, WRMT-R/NU, WDRB
Floor & Gradient Example: GORT-4 (PRO-ED, 2001) • Item gradients = adequate • Floors • Rate = inadequate below 8-0 for both forms • Accuracy = inadequate below 7-6 for Form A and below 8-0 for Form B • Comprehension = inadequate below 8-0 for Form A and below 9-0 for Form B • ORQ = inadequate below 6-6 for Form A and below 7-6 for Form B
Validity: Are the Results Meaningful? • Content validity: Effectiveness in assessing the relevant domain • Criterion-related validity: Effectiveness in predicting performance now (concurrent validity) or later (predictive validity) • Construct validity: Effectiveness in measuring the construct the test is intended to measure • Criteria: Evidence of all three types of validity for the target population
Validity Example: WJ III ACH • Content validity: remarkably little content validity evidence • Criterion-related validity: correlates .63 to .82 with the WIAT • WJ III Written Expression mean standard scores = more than 10 points higher than WIAT Written Expression mean standard scores
WJ III ACH Validity Example, Cont. • Diagnostic utility: a study of 48 students with ADHD, ages 6-17 • The ADHD group scored significantly lower than the norm group on 3 of 8 WJ III ACH tests (Oral Comprehension, Passage Comprehension, and Calculation)
The Untold Story: Usability Considerations • Usability often has more influence on test selection and use than technical adequacy does. • There is virtually no research on the impact of usability on test selection and use.
Do these comments sound familiar? • “I know how to give it.” • “It doesn’t take long to give.” • “It’s easy to carry around.” • “I think I saw one in the storage closet.” • “I think that test kit has all the parts.”
Key Practical Characteristics • Test construction • Administration • Accommodations and adaptations • Scores and scoring • Interpretation • Links to intervention
Usability Example: DEST (PsyCorp, 1996) • Inexpensive ($130.00) • Has numerous stimulus materials to manage, increasing administration time • Letter Naming subtest: 4 cards for 12 items • Digit Naming subtest: 3 cards for 9 items • Requires calibrating a postural stability balance tester • Manual is not spiral bound, so it doesn’t lie flat during administration.
Increasing the Effectiveness of Early Reading Assessments • Begin with measures that target domains directly related to the referral problem. • Supplement norm-referenced measures with criterion-referenced measures to ensure adequate coverage and increase instructionally relevant information. • Know the psychometric strengths and limitations of each measure you use.
Increasing Effectiveness, II • Evaluate the presence of attentional, behavioral, and motivational problems. • Key predictors of response to intervention • See The Unmotivated Child • Assess environmental and instructional variables.
The Golden Rule of Assessment • The best-designed assessment with the most reliable and valid measures, administered by the best-trained examiner, won’t change a child’s reading trajectory . . . unless someone in the child’s life does something different. (Effective School Interventions: Strategies for Enhancing Academic Achievement and Social Competence)
Early Reading Assessment and Intervention Resources • AERA, APA, & NCME. (1999). Standards for educational and psychological testing. Washington, DC: AERA. www.apa.org • Buros Institute of Mental Measurements. www.unl.edu/buros • Center for Equity and Excellence in Education Test Database. http://ceee.gwu.edu/standards_assessments/sa.htm • ERIC Clearinghouse on Assessment and Evaluation. http://www.ericae.net • Florida Center for Reading Research. http://www.fcrr.org
More Resources • Rathvon, N. (2004). Early Reading Assessment: A Practitioner’s Handbook. New York: Guilford. www.guilford.com • Rathvon, N. (1999). Effective School Interventions: Strategies for Enhancing Academic Achievement and Social Competence. New York: Guilford. www.guilford.com • Rathvon, N. (1996). The Unmotivated Child: How to Help Your Underachiever Become a Successful Student. New York: Simon & Schuster. www.simonsays.com • Southwest Educational Development Laboratory (SEDL). www.sedl.org/reading/rad