Innovation and Growth of Large Scale Assessments
Irwin Kirsch, Educational Testing Service
February 18, 2013
Overview • Setting a context • Growth in Large Scale Assessments (LSA) • Features of Large Scale Assessments (LSA) • Growing importance of CBA • Innovations in recent LSA (PIAAC and PISA) • Future areas for innovation
Setting a Context • Until relatively recently, educational data were not collected in a consistent or standardized manner. • In 1958, a group of scholars representing various disciplines met at UNESCO in Hamburg, Germany, to discuss issues surrounding the evaluation of schools and students through the systematic collection of data on knowledge, skills and attitudes. • Their meeting led to a feasibility study of 13-year-olds in 12 countries covering 5 content areas, and in 1967 to the establishment of the legal entity known as the IEA.
Setting a Context • Back in the United States, the Commissioner of Education, Francis Keppel, invited Ralph Tyler in 1963 to develop a plan for the periodic assessment of student learning. • Planning meetings were held in 1963 and 1964, and a technical advisory committee was formed in 1965. • In April 1969, NAEP first assessed in-school 17-year-olds in citizenship, science and writing.
Setting a Context • Tyler’s vision for NAEP was that it would focus on what groups of students know and can do rather than on what score an individual might receive on a test. • The assessment would be based on identified objectives whose specifications would be determined by subject matter experts. • Reports would describe the performance of selected groups, not individuals, on the exercises and would not rely on grade-level norms.
Setting a Context • Prior to IEA and NAEP, there were no assessment programs to measure students or adults as a group. • The primary focus of educational testing had been on measuring individual differences in achievement rather than on students’ learning. • And the data that were collected dealt primarily with the inputs to education rather than its yield.
Setting a Context • Interpretations would be limited to the set of items used in each assessment. This basic approach to large scale assessments remained in place throughout the 1970s. • In the 1980s, programs beginning with NAEP began to use item response theory (IRT) to create scales and to broaden inferences to include items not administered in the assessment. • New methodology involving marginal estimation was developed to optimize the reporting of proficiency distributions based on complex designs such as BIB spiraling; this approach remains in use today.
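As a minimal sketch of this methodological shift (the notation below is assumed for illustration, not taken from the slides), the IRT and marginal-estimation machinery can be written as:

```latex
% Two-parameter logistic (2PL) IRT model: the probability that a
% respondent with proficiency \theta answers item i correctly,
% given item discrimination a_i and difficulty b_i.
P(x_i = 1 \mid \theta) = \frac{1}{1 + e^{-a_i(\theta - b_i)}}

% Marginal estimation: the group proficiency distribution g(\theta)
% is estimated by integrating the latent trait out of the likelihood,
% using only the items each respondent was actually administered.
L(\mathbf{x}) = \int \Bigl[\, \prod_{i \in \text{administered}}
  P(x_i \mid \theta)^{x_i} \bigl(1 - P(x_i \mid \theta)\bigr)^{1 - x_i}
  \Bigr]\, g(\theta)\, d\theta
```

Because the likelihood integrates over θ, precise individual scores are not required: the group distribution can be recovered even when each respondent sees only a small, rotated subset of items.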
Growth and Expansion … not being satisfied with assertions or self-reports … in response to policy makers and researchers wanting to know more … asking more challenging questions … and creating both the need and opportunity for new methodological and technological developments
Growth and Expansion • Number of assessments • Participation of countries • Populations who are surveyed • Domains / Constructs that are measured • Methodology • Modes
Growth and Expansion Overview
Growth and Expansion [Diagram: Life skills, Curriculum, Measurement]
Features of LSA • LSA are primarily concerned with accurately estimating the distribution of proficiency in a group of respondents rather than measuring individuals. • In this way, the focus is on providing information that can inform policy and further research. • LSA differ from individual testing in key ways.
Features of LSA • Extensive framework development • Sampling • Weighting (see the sketch following this list) • Use of complex assessment designs • IRT modeling • Population modeling • Connection to background variables • Increasing reliance on CBA
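To illustrate the sampling-and-weighting machinery, here is a minimal, hypothetical sketch of how a group's mean proficiency and its sampling error might be estimated with survey weights and replicate weights. The function names and toy numbers are invented for illustration; operational surveys use country-specific replication schemes and many more replicates.

```python
import numpy as np

def weighted_mean(scores, weights):
    """Estimate a group's mean proficiency using survey weights."""
    return np.sum(weights * scores) / np.sum(weights)

def replicate_se(scores, full_weights, replicate_weights, factor=1.0):
    """Sampling error via replicate weights (jackknife-style).

    replicate_weights is an (R, n) array: one perturbed weight vector
    per replicate. The scaling factor depends on the replication scheme.
    """
    full_est = weighted_mean(scores, full_weights)
    rep_ests = np.array([weighted_mean(scores, w) for w in replicate_weights])
    return np.sqrt(factor * np.sum((rep_ests - full_est) ** 2))

# Toy data: 5 respondents, 3 replicate weight sets (all invented).
scores = np.array([270.0, 295.0, 310.0, 250.0, 288.0])
weights = np.array([1.2, 0.8, 1.0, 1.5, 0.9])
reps = np.array([[0.0, 1.0, 1.1, 1.6, 1.0],
                 [1.3, 0.0, 1.2, 1.7, 1.1],
                 [1.4, 0.9, 0.0, 1.8, 1.2]])
print(weighted_mean(scores, weights), replicate_se(scores, weights, reps))
```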
Growing Importance of Computer Based Assessments • Until very recently, all large scale national and international assessments were paper-based, with some optional computer-based components. • PIAAC (2012) was the first large scale survey of adult skills in which the primary mode of delivery was the computer, with paper and pencil becoming the option. • In 2015, PISA will also use computers as the primary mode of delivery, with paper and pencil becoming an option for countries.
Why is Computer Based Assessment Important for Surveys such as PIAAC and PISA? • Better reflects the ways in which students and adults access, use and communicate information • Enables surveys like PIAAC and PISA to broaden the range of skills that can be measured • Allows these surveys to take better advantage of both the operational and measurement efficiencies that technology can provide
Goals of the PIAAC 2012 and PISA 2015 Assessment Designs • Establish the comparability of inferences across countries, across assessments and across modes • Broaden what can be measured, both by extending the existing constructs and by introducing new constructs • Reduce random and systematic error through the use of more complex designs, automated scoring, timing information and adaptive testing
PIAAC Main Study Cognitive Assessment Design [Flow diagram, summarized:]
• The background questionnaire (BQ) asks about ICT use. Respondents with no computer experience are routed to the paper branch; those with computer experience proceed to the computer branch.
• Computer branch: CBA-Core Stage 1 (ICT core); those who fail are routed to the paper branch. Those who pass take CBA-Core Stage 2 (3 literacy + 3 numeracy tasks); those who fail go to the paper branch, while those who pass receive random assignment to two computer-based modules drawn from: Literacy (Stage 1: 9 tasks; Stage 2: 11 tasks), Numeracy (Stage 1: 9 tasks; Stage 2: 11 tasks) and Problem Solving in Technology-Rich Environments (PS-TRE).
• Paper branch: a paper core (4 literacy + 4 numeracy tasks); those who pass receive random assignment to a paper Literacy booklet (20 tasks) or a paper Numeracy booklet (20 tasks), followed by the Reading Components; those who fail take the Reading Components only.
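A compact way to read the routing above is as a small decision function. This sketch is illustrative only: the names and the simplified pass/fail arguments are assumptions, and the operational assignment probabilities are not shown.

```python
import random

def route_respondent(has_computer_experience: bool,
                     passes_ict_core: bool,
                     passes_cba_core: bool,
                     passes_paper_core: bool) -> list:
    """Simplified sketch of the PIAAC routing shown in the diagram."""
    # Computer branch: both CBA core stages must be passed.
    if has_computer_experience and passes_ict_core and passes_cba_core:
        modules = ["CBA literacy (stage 1: 9 tasks, stage 2: 11 tasks)",
                   "CBA numeracy (stage 1: 9 tasks, stage 2: 11 tasks)",
                   "PS-TRE"]
        # Random assignment to two computer-based modules.
        return random.sample(modules, 2)
    # Paper branch: a paper core gates the full paper booklets.
    if passes_paper_core:
        booklet = random.choice(["paper literacy (20 tasks)",
                                 "paper numeracy (20 tasks)"])
        return [booklet, "reading components"]
    return ["reading components"]

# Example: an experienced respondent who passes both CBA core stages.
print(route_respondent(True, True, True, True))
```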
Average Proficiency Scores by Domain and Subgroups [chart]
Cumulative Distribution of Numeracy Proficiency by Subgroups [chart]
Percentage of Item-by-Country Interactions* [chart] *Literacy and numeracy interactions go across modes and time
Maintaining and Improving Measurement of Trends • The proposal for PISA 2015 is to enhance and stabilize the measurement of trend data • Refocus the balance between random and systematic errors
Maintaining and Improving Measurement of Trends [Two figures, summarized:]
• Construct Coverage in the Current PISA Design by Major and Minor Domains: the height of the bars represents the proportion of items measured in each assessment cycle by domain, and the width conveys the relative number of students who respond to each item within the domain. The reduced height of the bars for the minor domains represents the reduction of items in those domains, and therefore the degree to which construct coverage has been reduced.
• Recommended Approach for Measuring Trends in PISA 2015 and Beyond: the recommended approach stabilizes trend through reducing bias, by including all items in each minor domain while reducing the number of students responding to each item (see the sketch below).
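As an illustration of that trade (all items administered, but fewer students per item), here is a minimal, hypothetical sketch of rotating item blocks into booklets. The block names and pairing rule are invented; operational PISA designs additionally balance block position and pairing (e.g., BIB spiraling).

```python
import itertools

def assign_booklets(item_blocks, blocks_per_booklet=2):
    """Rotate item blocks into booklets: every block is administered,
    but each student responds to only a subset (matrix sampling)."""
    return list(itertools.combinations(item_blocks, blocks_per_booklet))

# Invented block names for a cycle with one major and two minor domains.
blocks = ["science-1", "science-2", "reading-1", "math-1"]
booklets = assign_booklets(blocks)

# Spiral the booklets across sampled students: all four blocks are
# covered by the sample, yet each student sees only two of them.
for i, booklet in zip(range(6), itertools.cycle(booklets)):
    print(f"student-{i}:", booklet)
```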
Maintaining and Improving Measurement of Trends [Figure: Impact Over Cycles, summarized:]
• Across cycles (2006 major, 2009 minor, 2012 minor, 2015 major, 2018 minor, 2021 minor), each domain rotates between major and minor status.
• The item pool in each cycle combines trend items, new items reflecting the old construct and new items reflecting a new construct.
• When Scientific Literacy is a major domain, new items are introduced; when it then becomes a minor domain, a new trend line begins from a construct point of view.
Future Innovations • Introduction of new item types • Use of fully automated scoring • More flexible use of languages • Development of research around the process information contained in log files (a sketch follows) • Introduction of more complex psychometric models • Development of derivative products
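To make the log-file idea concrete, here is a minimal, hypothetical sketch of deriving one simple process indicator (time on item) from event records. The event names and record structure are invented; real assessment log files are far richer, capturing clicks, keystrokes and navigation.

```python
from dataclasses import dataclass

@dataclass
class LogEvent:
    item_id: str
    event: str        # e.g., "item_start", "response", "item_end"
    timestamp: float  # seconds since session start

def time_on_item(events, item_id):
    """Total time a respondent spent on one item, from start/end events."""
    starts = [e.timestamp for e in events
              if e.item_id == item_id and e.event == "item_start"]
    ends = [e.timestamp for e in events
            if e.item_id == item_id and e.event == "item_end"]
    # Pair each visit's start with its end and sum the durations.
    return sum(end - start for start, end in zip(starts, ends))

log = [LogEvent("L1", "item_start", 0.0),
       LogEvent("L1", "response", 42.5),
       LogEvent("L1", "item_end", 45.0)]
print(time_on_item(log, "L1"))  # 45.0
```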
Summary • Large scale international assessments continue to grow in importance • Computer based assessments are now feasible and will become the standard for development and delivery … • better reflect the ways in which people now access, use and communicate information • add efficiency and quality to the data • introduce innovation that broadens what can be measured and reported
The design for PIAAC was able to … • Broaden what was measured • Demonstrate high comparability among countries, over time and across modes • Introduce multi-stage adaptive testing • Include the use of timing information to better distinguish between omitted and not-reached items (see the sketch below) • Demonstrate an improvement in the quality of the data that were collected
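As a final illustration, here is a minimal, hypothetical sketch of how timing can separate the two kinds of missing responses. The threshold and rules are invented for illustration; the operational PIAAC procedures are more refined.

```python
def classify_missing(responses, times, min_time=5.0):
    """Label each item as answered, omitted, or not reached.

    An unanswered item followed by answered items is treated as omitted;
    trailing unanswered items with negligible time are not reached.
    """
    # Index of the last item the respondent actually answered.
    last_answered = max((i for i, r in enumerate(responses) if r is not None),
                        default=-1)
    labels = []
    for i, (resp, t) in enumerate(zip(responses, times)):
        if resp is not None:
            labels.append("answered")
        elif i <= last_answered or t >= min_time:
            labels.append("omitted")      # skipped, but the item was engaged
        else:
            labels.append("not reached")  # trailing item with near-zero time
    return labels

print(classify_missing([1, None, 0, None, None],
                       [30.0, 12.0, 25.0, 1.0, 0.5]))
# ['answered', 'omitted', 'answered', 'not reached', 'not reached']
```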