Presentation at Universidad Finis Terrae, Santiago, Chile, January 7, 2014
Large-scale testing: Uses and abuses Richard P. Phelps Universidad Finis Terrae, Santiago, Chile January 7, 2014
Large-scale testing: Uses and abuses
1. Three types of large-scale tests
2. Judging test quality
3. A chronology of mistakes
4. Economists misunderstand testing
5. How SIMCE is affected
Achievement tests Historically, achievement tests were larger versions of classroom tests ~ 1900 - “scientific” achievement tests developed (Germany & USA) J.M. Rice - systematically analyzed test structures & effects E.L. Thorndike - developed scoring scales SOURCE: Phelps, Standardized Testing Primer, 2007
Achievement tests Purpose: to measure how much you know and can recall Developed using: content coverage analysis How validated: retrospective or concurrent validity (correlation with past measures, such as high school grades) Requires a mastery of content prior to test. Fairness assumes that all have same opportunity to learn content Coachable – specific content is known in advance SOURCE: Phelps, Standardized Testing Primer, 2007
Aptitude tests 1890s – A. Binet & T. Simon (France) - Worked with pre-school children with mental disabilities - an achievement test was not possible - developed content-free test of mental abilities (association, attention, memory, motor skills, reasoning) 1917 – Adapted by U.S. Army to select and assign soldiers in World War I 1930s – Harvard University president J. Conant wanted a new admission test to identify students from lower social classes with the potential to succeed at Harvard Developed the first Scholastic Aptitude Test (SAT) SOURCE: Phelps, Standardized Testing Primer, 2007
Aptitude tests Purpose: predict how much can be learned Developed using: skills/job analysis How validated: predictive validity, correlation with future activity (e.g., university or job evaluations) Content independent. Measures: … what student does with content provided … how student applies skills & abilities developed over a lifetime Not easily coachable – the content is either… … not known in advance, … basic, broad, commonly known by all, curriculum-free; … less dependent on the quality of schools SOURCE: Phelps, Standardized Testing Primer, 2007
Aptitude tests Aptitude tests can identify: students who are bored in school but study what interests them on their own students not well adapted to high school, but well adapted to university students of high ability stuck in poor schools
Non-cognitive tests More recently developed – measure values, attitudes, preferences Types: integrity tests career exploration matchmaking employment “fit”
Non-cognitive tests Purpose: to identify “fit” with others or a situation Developed using: surveys, personal interviews How validated: success rate in future activities Content is personal, not learned “Faking” can be an issue (e.g., “honesty” tests)
2. Judging test quality Test reports can be “data dumps.” Three measures in particular are important: 1. Predictive validity 2. Content coverage 3. Sub-group differences
Predictive validity (values from -1.0 to +1.0) …measures how well higher scores on an admission test match better outcomes at university (e.g., grades, completion) A test with low predictive validity provides a university with little information.
A positive correlation between two measures Source: NIST, Engineering Statistics Handbook
A negative correlation between two measures Source: NIST, Engineering Statistics Handbook
No correlation between two measures Source: NIST, Engineering Statistics Handbook
How is predictive capacity measured? By the correlation coefficient, which ranges from -1 (perfect negative) through 0 (no relationship) to +1 (perfect positive).
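The correlation coefficient behind predictive validity can be computed directly from paired data. A minimal Python sketch; the test scores and first-year grades below are invented for illustration, not drawn from any study cited here:

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical admission-test scores and first-year university grades
scores = [450, 520, 580, 610, 700, 740]
grades = [4.2, 4.8, 5.1, 5.0, 5.9, 6.3]
r = pearson(scores, grades)  # closer to +1.0 means stronger predictive validity
```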
Predictive validities: SAT and PSU SOURCE: Pearson, Final Report Evaluation of the Chile PSU, January 2013
Predictive validities: SAT and PSU (faculty: Administracion) SOURCE: Pearson, Final Report Evaluation of the Chile PSU, January 2013
Predictive validities: SAT and PSU (faculty: Arquitectura) SOURCE: Pearson, Final Report Evaluation of the Chile PSU, January 2013
Predictive validities: SAT and PSU (faculty: Educacion) SOURCE: Pearson, Final Report Evaluation of the Chile PSU, January 2013
Predictive validities of the PSU (CTA v Pearson estimates) SOURCE: Pearson, Final Report Evaluation of the Chile PSU, January 2013; CTA
Incremental predictive validities (PAA + PCEs vs. PSU) SOURCE: S.A. Prado, Estudio de Validez Predictiva de la PSU y Comparacion con el Sistema PAA, Universidad de Chile
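Incremental predictive validity asks how much a second predictor adds beyond the first. For two standardized predictors there is a standard closed-form expression for the squared multiple correlation in terms of the three pairwise correlations. A sketch; the correlation values are hypothetical, not taken from the Prado study:

```python
def multiple_r2(r_y1, r_y2, r_12):
    """Squared multiple correlation of an outcome on two predictors,
    given the three pairwise correlations (standard two-predictor formula)."""
    return (r_y1**2 + r_y2**2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12**2)

# Hypothetical: admission test alone vs. test plus high school record
r_test = 0.40   # test score vs. university grades
r_rank = 0.35   # high school record vs. university grades
r_both = 0.50   # test score vs. high school record
r2_test_alone = r_test**2
r2_combined = multiple_r2(r_test, r_rank, r_both)
incremental = r2_combined - r2_test_alone  # validity added by the second predictor
```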
Content coverage (values from 0% to 100%) …how much of the content domain of a test has been taught in the schools. It is not fair to expect students to master content to which they have not been exposed, or to compare students who have been exposed with others who have not.
Two problems: there are two curricula, and the PSU covers only one of them.
Subgroup differences Differences in test scores among subgroups (e.g., gender, ethnic, school type) should be due only to differences in the attribute measured by the test and not to systematic biases in the test.
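One common way to examine subgroup differences is the standardized mean difference (Cohen's d) between two groups' score distributions. A minimal sketch with invented scores; a large standardized gap is a flag for review, not by itself proof of bias:

```python
from math import sqrt

def cohens_d(group_a, group_b):
    """Standardized mean difference between two score distributions."""
    na, nb = len(group_a), len(group_b)
    ma, mb = sum(group_a) / na, sum(group_b) / nb
    va = sum((x - ma) ** 2 for x in group_a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in group_b) / (nb - 1)
    pooled_sd = sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (ma - mb) / pooled_sd

# Hypothetical scores by school type
municipal = [480, 510, 495, 520, 505]
private   = [540, 560, 530, 555, 545]
gap = cohens_d(private, municipal)  # positive = private group scored higher
```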
3. A chronology of mistakes 2000 (Comision Nuevo Curriculum de la Ensenanza Media y Pruebas del Sistema de Admision a la Educacion Superior) SOURCE: Informe Sometido en Consulta Previa a la Ministra de Educacion, Noviembre 2000
A chronology of mistakes (cont.) 2001 (World Bank & MINEDUC) …the Academic Aptitude Test for entry to the university system is under revision, together with the universities belonging to the Council of Rectors. This instrument of entry selection, needs also to be aligned with the new curriculum and may become an exit exam from the secondary education system. SOURCE: World Bank,Implementation Completion Report on a Loan in the Amount of $35 million to the Republic of Chile for Secondary Education, 2001
A chronology of mistakes (cont.) 2005 (World Bank) …The new law adopted in May 2005 (Bulletin 3223-04) established a system of student loans available to all students achieving a threshold score in the University Admission Exam (PSU). …the new system does not impede students unable to provide collateral from financing their studies. The new system promises to improve equity further by increasing options for talented students from non-affluent families to access higher education. SOURCE: World Bank, Implementation Completion Report (TF-25378 SCL-44040 PPFB-P3360) on a Loan in the Amount of US$145.45 Million to the Republic of Chile for the Higher Education Improvement Project, December 2005
A chronology of mistakes (cont.) 2010 (OECD) Over time the government should consider replacing the university entry exam with a national school leaving exam as the prime criterion for entry into tertiary education institutions. This could establish a closer link between test results and the school that is responsible for them, making it easier to reach the goal that has been pursued with the introduction of the PSU. There is evidence that central curriculum based exit exams are strongly and positively related to student academic performance (Wößmann, 2005; Bishop, 2006). To allow students to show in more detail their knowledge and their ability to apply it, the school exit exam could be a bit more in-depth than the multiple-choice PSU, including verbal and nonverbal reasoning. SOURCE: N. Brandt, Chile: Climbing on Giants' Shoulders: Better Schools for All Chilean Children, OECD Economics Department Working Papers No. 784
A chronology of mistakes (cont.) 2010 (OECD) The Catholic University of Chile and some partners have recently designed a complementary university entry exam and first evaluations revealed that this has the potential to reduce the socio-economic gap in university admission, while being a good predictor for later success at the university (Santelices, 2009). This suggests that it could be possible to develop adequate exams that make access to university easier for more disadvantaged children. SOURCE: N. Brandt, Chile: Climbing on Giants' Shoulders: Better Schools for All Chilean Children, OECD Economics Department Working Papers No. 784
Testing & Measurement PhD program (University of Massachusetts, USA, 2013-2014)
EDUC 501 Classroom Assessment
EDUC 553 Construction, Validation, and Uses of Criterion-Referenced Tests
EDUC 555 Introduction to Statistics & Computer Analysis I
EDUC 632 Principles of Educational & Psychological Testing
EDUC 637 Non-Parametric Statistics Analysis
EDUC 656 Introduction to Statistical & Computer Analysis II
EDUC 661 Educational Research Methods I
EDUC 727 Scale and Instrument Development
EDUC 731 Structural Equation Modeling
EDUC 735 Advanced Theory & Practice of Testing I
EDUC 736 Advanced Theory & Practice of Testing II
EDUC 771 Application of Applied Multivariate Statistics I
EDUC 772 Application of Applied Multivariate Statistics II
EDUC 821 Advanced Validity Theory & Test Validation
What economists do not seem to understand about testing - 1 Increasing an admission test’s correlation with high school work can decrease its correlation with university work
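This trade-off can be illustrated with a toy simulation, under invented modeling assumptions: let university performance depend on general ability, and high school grades depend on ability plus curriculum-specific knowledge. A test redesigned to load on the curriculum-specific component then correlates better with high school work but worse with university work:

```python
import random
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

random.seed(1)
n = 5000
ability  = [random.gauss(0, 1) for _ in range(n)]  # general academic ability
specific = [random.gauss(0, 1) for _ in range(n)]  # curriculum-specific knowledge
hs_grades  = [a + c for a, c in zip(ability, specific)]
uni_grades = [a + random.gauss(0, 1) for a in ability]

# Aptitude-style test: loads on general ability only (plus noise)
aptitude = [a + random.gauss(0, 0.5) for a in ability]
# Curriculum-aligned test: loads heavily on the curriculum-specific component
aligned = [0.5 * a + c + random.gauss(0, 0.5) for a, c in zip(ability, specific)]

r_hs_aligned   = pearson(aligned, hs_grades)    # higher than the aptitude test's
r_hs_aptitude  = pearson(aptitude, hs_grades)
r_uni_aligned  = pearson(aligned, uni_grades)   # lower than the aptitude test's
r_uni_aptitude = pearson(aptitude, uni_grades)
```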
What economists do not seem to understand about testing - 2 Incentives are not all that matters in improving efficiency; also important: better information, classification, & allocation
What economists do not seem to understand about testing - 3 Incentives generally work best when applied to the actor responsible for the target behavior – currently, students bear the consequences when schools do not teach the curriculum tested on the PSU
What economists do not seem to understand about testing - 4 Many useful and successful tests serve multiple purposes. But some purposes are compatible and some are not. The PSU has been expected to: Measure the implementation of a new curriculum; Incentivize high schools to implement the new curriculum; Incentivize high school students to study more; Predict success in university; …