1. Could the BILC Benchmark Advisory Tests (BATs) be delivered as Computer Adaptive Tests (CATs)? 2. Does computer scoring of Speaking proficiency work? BILC Professional Seminar, Monterey, CA. Ray Clifford, 13 June 2011
These topics have two things in common. • Both topics are related to proficiency testing. • Both projects attempt to “push the envelope”. • In technical settings, “pushing the envelope” means pushing the limits of an aircraft or technology system. • An “envelope” is also the name of the container used to mail or protect documents. • Some envelopes are stationery and others are stationary.
#1: Could a BILC Benchmark Advisory Test (BAT) be delivered as a Computer Adaptive Test (CAT)?
Benchmark Advisory Tests • The BATs follow the Criterion-Referenced scoring model used by the human-adaptive Oral Proficiency Interview (OPI): • To earn any specific proficiency rating, the test taker has to satisfy all of the level-specific Task, Conditions/Contexts, and Accuracy (TCA) criteria associated with that level. • Note 1: When researchers tried assigning ratings based on the sum of component scores, they found that total scores didn’t accurately predict human, Criterion-Referenced OPI ratings. • Note 2: The same non-alignment occurred when they used multiple-regression analyses.
Why use Criterion-Referenced scoring rather than total scores? • Proficiency ratings are “criterion” ratings, and they require non-compensatory rating judgments at each level. • Total and average scores, even when weighted, are compensatory scores. • “Floor and ceiling” level-specific score comparisons are needed to assign a final rating. • Raters can’t apply “floor and ceiling” rating criteria using a single or composite score.
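A minimal sketch may make the contrast concrete. In the Python below, the function names, data shapes, and pass/fail judgments are illustrative assumptions, not the BAT’s actual implementation: the first function assigns a rating non-compensatorily from per-level “sustained ability” judgments, while a total score would let strength at one level compensate for weakness at another.

# A minimal sketch of non-compensatory "floor and ceiling" rating,
# contrasted with a compensatory total score. All names, data shapes,
# and judgments here are illustrative assumptions, not the BAT itself.

def floor_ceiling_rating(sustained):
    """Assign the highest level at which ability is sustained at that
    level and at every level below it. `sustained` maps level -> bool,
    e.g. {1: True, 2: True, 3: False}."""
    rating = 0
    for level in sorted(sustained):
        if sustained[level]:
            rating = level   # the "floor" holds; raise the rating
        else:
            break            # non-compensatory: stop at the first failed level
    return rating

def compensatory_total(section_scores):
    # A total score lets strength at one level mask weakness at another,
    # which is why it cannot support floor-and-ceiling judgments.
    return sum(section_scores)

# Sustained at Levels 1 and 2 but not at 3 -> rating of 2, no matter
# how many Level 3 items happened to be answered correctly.
print(floor_ceiling_rating({1: True, 2: True, 3: False}))  # prints 2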
Why do Speaking tests work? • Defined a primary construct for each proficiency level, and a secondary construct: that the primary constructs form a hierarchy. • Converted these proficiency constructs into test specifications. • Created a test delivery system, the OPI, based on those test specifications. • Applied Criterion-Referenced (C-R), “floor and ceiling” scoring procedures.
And OPI Speaking Tests work well. • The primary, level-specific constructs are supported by inter-rater agreement statistics: • Pearson’s r = 0.978 • Cohen’s weighted kappa = 0.920 (See Foreign Language Annals, Vol. 36, No. 4, 2003, p. 512) • The secondary, hierarchical construct is supported by the fact that the “floor and ceiling” rating system does not produce “inversions” in assigned ratings.
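For readers who want to reproduce agreement statistics of this kind on their own rating data, a short sketch follows. The rater arrays are invented, and the quadratic weighting is an assumption; the cited study may have used a different weighting scheme.

# Illustrative inter-rater agreement computation on made-up ratings;
# the figures quoted above come from the cited study, not this data.
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import cohen_kappa_score

rater_1 = np.array([1, 1, 2, 2, 2, 3, 3, 2, 1, 3])  # hypothetical OPI ratings
rater_2 = np.array([1, 1, 2, 2, 3, 3, 3, 2, 1, 3])

r, _ = pearsonr(rater_1, rater_2)                    # linear association
kappa = cohen_kappa_score(rater_1, rater_2, weights="quadratic")
print(f"Pearson's r = {r:.3f}, weighted kappa = {kappa:.3f}")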
We used the same steps to create Reading and Listening BATs. • Defined level-specific primary constructs and a secondary hierarchical construct. • Converted the constructs into test specifications. • Created a test delivery system based on those test specifications. • Applied Criterion-Referenced, “floor and ceiling” scoring procedures.
Definition of Proficient Reading • Proficient reading: The active, automatic process of using one’s internalized language and culture expectancy system to comprehend an authentic text for the purpose for which it was written.
Benefits of Aligning Reading (and Listening) Test Factors • Complexity is greatly reduced. • Each level becomes a separate “Task, Condition, and Accuracy” ability criterion based on typical language patterns found in the targeted society. • When TCA criteria are aligned, raters can: • Check for sustained ability at each level. • Assign general proficiency ratings using a floor and ceiling approach. • Assign progress ratings toward the next higher level.
Warning! Multiple-choice tests may not be aligned with the trait to be tested.
Predicted Development Stages [Figure: expected score patterns across Level X and Level X+1]
Counter-Model Inversions [Figure: inverted score patterns across Level X and Level X+1]
Initial BAT Test Results • 187 NATO personnel from 12 nations took the English listening test. • Sustained Level 3: 50 • Sustained 2, most of 3 (2+): 42 • Sustained Level 2: 28 • Sustained 1, most of 2 (1+): 34 • Sustained Level 1: 14 • Most of Level 1 (0+): 3 • No pattern or random ability: 16
Initial BAT Test Results (Continued) • The number of counter-model inversions: 0 • The “floor and ceiling” Criterion-Referenced ratings gave more accurate results than assigning ratings based on the total score. • In fact, the Criterion-Referenced rating process ranked 70 (37%) of the test takers differently than they would have been ranked by their total scores.
Example A: Total score = 37 (62%). C-R assigned Proficiency Level = 1+ (Level 1 with developing abilities at Level 2).
Example B: Total score = 35 (58%). C-R assigned Proficiency Level = 2 (Level 2 with random abilities at Level 3).
Thanks to the BILC Secretariat and ATC • “Permissive” BAT research has continued using English language learners interested in applying for admittance to a U.S. university. • A diversity of first languages was represented among the test takers. • The number who have taken the BAT Reading test now exceeds 600. • With 600+ test takers, we have done the IRT analyses needed for adaptive testing.
Preparing a Computer Adaptive Test(or in this case, a CAT BAT) • WinSteps IRT Analyses confirmed that the BAT test items were “clustering” by level. • Clustered items were then assembled into testlets of 5 items each. • The logit values for each level were separated by more than 1.0 logits. • For any given level, the testlets were of comparable difficulty – within 0.02 logits. • The logit standard error of measurement for each group of testlets was 0.06 or less.
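Behind Winsteps-style analyses is the Rasch (one-parameter IRT) model, in which each item or testlet is reduced to a single difficulty value on the logit scale. The sketch below uses invented difficulty values, chosen only to mirror the greater-than-1.0-logit separation reported above.

# Sketch of the Rasch (1-parameter IRT) model underlying the analyses:
# the probability of a correct answer depends only on the difference
# between person ability and item difficulty, both in logits.
import math

def rasch_p_correct(ability, difficulty):
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# Hypothetical mean testlet difficulties, one per proficiency level,
# with adjacent levels separated by more than 1.0 logits.
testlet_difficulty = {1: -1.5, 2: 0.0, 3: 1.6}

ability = 0.4  # a test taker slightly above the Level 2 testlets
for level, b in testlet_difficulty.items():
    print(f"Level {level}: P(correct) = {rasch_p_correct(ability, b):.2f}")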
#1: Could the BILC Benchmark Advisory Tests (BATs) be delivered as Computer Adaptive Tests (CATs)? Yes! And simulations using actual student data show that testing time would be reduced by an average of 50%.
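One plausible way a testlet-based CAT BAT could route examinees is sketched below: administer a 5-item testlet, move up a level on a pass and down on a fail, and stop once a floor and ceiling have been located. The routing rules, pass mark, and success probabilities are all assumptions for illustration, not the actual delivery algorithm.

# Sketch of adaptive testlet routing under invented assumptions.
import random

def administer_testlet(true_level, level, n_items=5, pass_mark=4):
    # Simulate one 5-item testlet: examinees usually pass testlets at
    # or below their true level and usually fail those above it.
    p = 0.9 if level <= true_level else 0.25
    return sum(random.random() < p for _ in range(n_items)) >= pass_mark

def adaptive_rating(true_level, start=2, levels=(1, 2, 3)):
    # Route upward on a pass and downward on a fail until each needed
    # level has been tried; the floor is the highest sustained level.
    level, floor, items_used, tried = start, 0, 0, set()
    while level in levels and level not in tried:
        tried.add(level)
        items_used += 5
        if administer_testlet(true_level, level):
            floor = max(floor, level)
            level += 1
        else:
            level -= 1
    return floor, items_used

print(adaptive_rating(true_level=2))  # e.g. (2, 10): far fewer items than a fixed form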
Types of Speaking Tests • Direct Tests • Oral Proficiency Interview (OPI): human administered, human scored. • Semi-direct Tests • OPIc: computer administered, human scored. • “OPIc2”: computer administered and scored. • “Elicited Speech”: computer administered and scored. • Indirect Tests • Elicited Imitation: computer administered and scored.
“OPIc2” Experiment • Found relationships between proficiency levels and composite scores based on “verbosity” and 1-gram lexical matching. • The scores could distinguish Level 1 speakers from Level 2 and Level 3 speakers. • But the scoring process took hours. • And the voice-to-text conversion process was imprecise.
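A rough sketch of what a “verbosity plus 1-gram lexical matching” composite might look like follows; the reference word set, weights, and normalization are invented for illustration and are not the experiment’s actual scoring formula.

# Sketch of a composite built from "verbosity" (how much was said) and
# 1-gram lexical matching (overlap with a reference vocabulary).
# The reference word set and the weights are invented assumptions.

LEVEL_WORDS = {"because", "yesterday", "compromise", "although"}  # hypothetical

def composite_score(transcript, reference=LEVEL_WORDS,
                    w_verbosity=0.5, w_lexical=0.5):
    tokens = transcript.lower().split()
    verbosity = len(tokens)                      # raw token count
    match_rate = sum(t in reference for t in tokens) / max(1, len(tokens))
    return w_verbosity * verbosity + w_lexical * match_rate

print(composite_score("yesterday we reached a compromise because the schedule changed"))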
3 Voice-to-Text Output Examples (From carefully enunciated voicemail messages) • < The meeting was held on Thursday at 3:15 PM. > • < Discussions that took place last Thursday late into a compromise and they shut down was avoided. > • < Hey the concept been more collegial more could've been accomplished ending in pass would've been avoided. >
1st Voice-to-Text Output Example (Original statement and output) • The original message “The meeting was held on Thursday at 3:15 pm.” was transcribed as: < The meeting was held on Thursday at 3:15 PM. >
2nd Voice-to-Text Output Example (Original statement and output) • The original message “The discussions that took place last Thursday led to a compromise, and a shutdown was avoided.” was transcribed as: < Discussions that took place last Thursday late into a compromise and they shut down was avoided. >
3rd Voice-to-Text Output Example (Original statement and output) • The original message “Had the confab been more collegial, more could have been accomplished, and an impasse would have been avoided.” was transcribed as: < Hey the concept been more collegial more could've been accomplished ending in pass would've been avoided. >
Voice-to-Text Output Examples (Representative of proficiency levels?) • Attempt at Level 1: < The meeting was held on Thursday at 3:15 PM. > • Attempt at Level 2: < Discussions that took place last Thursday late into a compromise and they shut down was avoided. > • Attempt at Level 3: < Hey the concept been more collegial more could've been accomplished ending in pass would've been avoided. >
Next we tried EI and found: • The optimum number of syllables in a prompt depended on the speakers’ proficiency. • Low-frequency words were more difficult. • Language features that contrast between the L1 and L2 were more difficult. • Providing user control of prompt timing had no significant impact on EI scores. • Low-ability learners showed a positive practice effect with repeated exposure to identical prompts.
Elicited Speech (ES) Tests • EI findings led to the creation of new ES tests that force “chunking” at the meaning level rather than at the phoneme or word level. • The new ES tests include prompts with… • Complex sentences that exceed the syllable counts previously recommended for EI tests. • Level-specific language features drawn from the ILR “grammar grids”. • Thus, the ES prompts should be aligned with the targeted proficiency levels.
ES Test Goal: Measure the Speaker’s Language Expectancy System (LES) • It is hypothesized that our language comprehension and our language production depend on an internalized Language Expectancy System (LES). • The more developed one’s target-language LES, the more accurately s/he understands and produces the target language. • ES tests are designed to access the LES twice: once for comprehension and once for production.
Is an ES test a Listening or Speaking Test? • To some extent it doesn’t matter, because the same LES is involved in both activities. • Being able to say things one can’t understand is not a valuable skill. • If one can’t regenerate a sentence, then s/he would not have been able to say it without the benefit of the model prompt.
The EI versus the ES Response Process (EI: Elicited Imitation; ES: Elicited Speech; SM: Sensory Memory; STM: Short-Term Memory; LTM: Long-Term Memory) • EI: Hear the EI prompt. Form a representation of the sound chunks in SM. Store that sound representation in STM. Recall the sound representation from STM. Reproduce the prompt. • ES: Hear the ES prompt. Form a representation of the meaning chunks in SM. Store that meaning representation in STM. Recall the meaning representation from STM. Use the Language Expectancy System stored in one’s LTM, plus the meaning retrieved from STM, to regenerate the prompt.
Innovations • The ES prompts should be aligned with ILR syntax, vocabulary, and text-type expectations. • The Automated Speech Recognition (ASR) engine employs a forced-alignment scoring strategy that uses the ES prompts as its model. • This approach improves accuracy and avoids the multimillion-dollar development of a full natural-language corpus as a model for the ASR processor.
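True forced alignment operates on the audio, aligning the recognizer’s acoustic hypothesis against the known prompt. As a rough text-level analogue of the scoring idea, the standard-library sketch below scores an ASR transcript against the prompt it should match; it illustrates why a known prompt simplifies scoring and is not the system’s actual algorithm.

# Because each ES prompt is known in advance, recognizer output can be
# scored directly against it rather than against an open-vocabulary
# language model. Text-level analogue only; real forced alignment
# works on the audio signal.
from difflib import SequenceMatcher

def prompt_match_score(prompt, asr_output):
    # Fraction of prompt tokens reproduced, in order, by the response.
    ref, hyp = prompt.lower().split(), asr_output.lower().split()
    matcher = SequenceMatcher(a=ref, b=hyp, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(ref)

prompt = "the discussions that took place last thursday led to a compromise"
response = "discussions that took place thursday led to compromise"
print(f"match = {prompt_match_score(prompt, response):.2f}")  # ~0.73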
Progress • Initial research results using a Spanish ES test have been very promising. • About 100 persons have been double-tested with the revised version and the OPI. • The correlation between human scoring of the ES tests and official OPI ratings was r = 0.91. • Automated Speech Recognition (ASR) scoring predicted the exact OPI rating about 2 out of 3 times.
Next Steps for ES Testing • Create a version of the test that can be: • Computer/Internet delivered. • Computer scored in near real time. • Equated to proficiency ratings. • Add a fluency assessment module to the existing ASR accuracy scoring measures. • Try C-R “floor and ceiling” scoring. • Conduct alpha testing with DOD personnel.
#2: Does computer scoring of Speaking proficiency work? Somewhat; and it will get better.