
Are all questions created equal:



Presentation Transcript


    Slide 1:Are all questions created equal?: Factors that influence cloze question difficulty.

    Brooke Soden Hensler, Carnegie Mellon University (starting graduate school at Florida Center for Reading Research this Fall)
    Joseph E. Beck, Carnegie Mellon University
    Funding: National Science Foundation
    Society for the Scientific Study of Reading – July 2006

    Slide 2:Why Look at Multiple Choice Cloze Questions?

    Multiple-choice cloze questions are widely used assessments of comprehension
    Problem: outcome measure is typically binary (little information about student)
    Goal: use multiple-choice cloze questions to…
        More accurately assess students
        Track student reading development
        Better understand what makes cloze questions hard
    Speaker notes: 2 probs. 1- q diff 2-

    Slide 3:Project LISTEN’s Computer Reading Tutor (Mostow & Aist, 2001)

    Automated
    Students use throughout year
    Accompanying paper standardized test scores (pre & post)

    Slide 4:Student is reading a story aloud to the Reading Tutor…

    Speaker notes: (say) “Only displays one sentence at a time.”

    Slide 5:A question appears… *Reading Tutor reads both Question and Response Choices. (Mostow, et al., 2004)

    Speaker notes: Read out prompt, pausing for space. “RT automatically & randomly generates Q’s.” “RT reads both question and response choices to student.”

    Slide 6:Student resumes reading story aloud to the Reading Tutor…

    Speaker notes: Student reads next sentence in story (mccq)

    Slide 7:Reading Tutor Advantages

    Well-specified & unbiased question construction (randomly generated)
    Questions automatically administered, scored, & recorded
    Longitudinal collection over school year
    Large N (students & questions)
    Speaker notes: Collection methodology makes RT ideal tool to collect & analyze data.

    Slide 8:How many Q’s from Whom? Data Description

    81,175 questions
    1042 students
    11 = median number of questions answered (many students were infrequent users of the tutor)
    2001-02 & 2002-03 school years
    Diverse population in Pittsburgh area
    Speaker notes: SES & ethnicity, urban & suburban

    Slide 9:Research Questions

    Is a particular part of speech (e.g., nouns, verbs, etc.) more difficult for students?
        If nouns are learned first (Gentner, 1982; Golinkoff, et al., 2000), might students be more proficient at answering noun questions?
    Which factors influence question difficulty?
    How can we better assess students using multiple choice cloze questions?
        Vocabulary researchers have given partial credit for correct part of speech (e.g., Schwanenflugel, et al., 1997)
    Speaker notes: Student identity is included in the model to account for variance in individual student proficiency.

    Slide 10:Approach

    Build logistic regression model to predict individual question performance
        Terms in model: student identity, part of speech of answer, properties of question (e.g., question length)
    Advantages of modeling approach
        Simultaneously estimates impact of question properties and student proficiency on question performance
        Makes use of all ~80k questions
    Speaker notes: Student identity is included in the model to account for variance in individual student proficiency.
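The modeling approach on this slide can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the data are synthetic, the effect sizes are invented, and the plain gradient-descent fitter (`fit_logistic`, a hypothetical helper) stands in for whatever statistics package the authors used. It only shows the idea of estimating one coefficient per student alongside part-of-speech and question-length terms in a single logistic regression.

```python
import numpy as np

rng = np.random.default_rng(0)
n_students, n_pos, n_items = 20, 4, 4000

# Synthetic "ground truth" (all values invented for this illustration).
true_prof = rng.normal(0.0, 1.0, n_students)   # per-student proficiency
true_pos = np.array([0.0, -0.3, -0.6, -0.9])   # POS effects, POS 0 = reference
true_len = -0.5                                # longer question -> less likely correct

student = rng.integers(0, n_students, n_items)
pos = rng.integers(0, n_pos, n_items)
q_len = rng.normal(0.0, 1.0, n_items)          # standardized question length

logit = true_prof[student] + true_pos[pos] + true_len * q_len
correct = (rng.random(n_items) < 1.0 / (1.0 + np.exp(-logit))).astype(float)

# Design matrix: one column per student (acts as a per-student intercept),
# POS dummies with POS 0 dropped as the reference level, and question length.
X = np.hstack([
    np.eye(n_students)[student],
    np.eye(n_pos)[pos][:, 1:],
    q_len[:, None],
])


def fit_logistic(X, y, lr=1.0, steps=3000, l2=1e-3):
    """Fit logistic regression weights by batch gradient descent (small L2 penalty)."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        w -= lr * (X.T @ (p - y) / len(y) + l2 * w)
    return w


w = fit_logistic(X, correct)
student_beta = w[:n_students]                    # one Beta per student
pos_beta = w[n_students:n_students + n_pos - 1]  # POS effects vs. reference level
len_beta = w[-1]                                 # question-length effect

print("length coefficient:", round(float(len_beta), 3))
print("corr(student Beta, true proficiency):",
      round(float(np.corrcoef(student_beta, true_prof)[0, 1]), 3))
```

Because student identity is in the model, each student's coefficient is estimated while the question-level terms are held fixed, which is the sense in which the slide says both are estimated simultaneously.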

    Slide 11:Effect of Parts of Speech

    Nouns < Verbs < Adverbs < Adjectives
    Significance of comparisons: (p < 0.001) (p < 0.001) (p < 0.05)
    Speaker notes: OVERALL – MAIN EFFECTS. Changes: 0. just say verbally, all reliably differ from 0 and 0.001 (or at bottom) and put in p-vals for comparison with next lowest. Remove difficulty. Kai-min’s graphic approach of showing > for statistically reliable relationships (i.e., put > between nouns and verbs, verbs and adj, etc.)

    Slide 12:Effect of Parts of Speech

    easier → harder: Nouns < Verbs < Adverbs < Adjectives
    Significance of comparisons: (p < 0.001) (p < 0.001) (p < 0.05)

    Slide 13:Impact of other Part of Speech terms

    Term                              Difficulty               Significance
    Most common part of speech        more common = easier     p < 0.01
    # of choices with answer’s POS    more choices = easier    p < 0.001

    Examples:
        “Sally had to _______ her lips when she heard the news.” (cloud, purse, holds, magnificent); a second version on the slide lists (lamp, purse, beautiful, magnificent)
        “Henry read his _______ under the tree.” (cup, dog, book, hair)

    Speaker notes: OVERALL – MAIN EFFECTS. Example sentence, most common POS: purse → handbag (N); purse → “purse your lips” (V). Most common = TAG_PR_M; Q length = Q_RC_LEN; Del location = PERC_Q_PREBLANK; Syntactic guess rate = TPOS_INT. Originally… Predicting correct = 0, so changed signs on all variables, BUT t_pos_int is an exception so it should keep its neg sign??? -.060 .008 -.329 -.141

    Slide 14:Impact of other Part of Speech terms

    Term                              Difficulty               Significance
    Most common part of speech        more common = easier     p < 0.01
    # of choices with answer’s POS    more choices = easier    p < 0.001

    Examples:
        “Henry read his _______ under the tree.” (cup, dog, book, hair)
        “Sally had to _______ her lips when she heard the news.” (lamp, purse, beautiful, magnificent)

    Less common POS = harder
    More common POS = easier

    Slide 15:Impact of other Part of Speech terms

    Fewer choices with correct POS = harder (verb example)
    More choices with correct POS = easier (noun example)
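As a rough illustration of the "number of choices with the answer's POS" term, the sketch below counts how many response choices can take the part of speech of the correct answer. The mini-lexicon and its tags are hand-made assumptions for this example only, and `count_choices_with_pos` is a hypothetical helper; the slides do not describe how the Reading Tutor tagged words or exactly how the variable was computed.

```python
def count_choices_with_pos(choices, target_pos, lexicon):
    """Count response choices that can take the target part of speech."""
    return sum(1 for word in choices if target_pos in lexicon.get(word, set()))


# Hypothetical mini-lexicon: tags assigned by hand for this illustration only.
LEXICON = {
    "cloud": {"N", "V"},
    "purse": {"N", "V"},
    "holds": {"V"},
    "magnificent": {"ADJ"},
    "cup": {"N"},
    "dog": {"N", "V"},
    "book": {"N", "V"},
    "hair": {"N"},
}

# "Sally had to _______ her lips ..." (answer "purse", used as a verb)
verb_item = ["cloud", "purse", "holds", "magnificent"]
# "Henry read his _______ under the tree." (answer "book", used as a noun)
noun_item = ["cup", "dog", "book", "hair"]

verb_count = count_choices_with_pos(verb_item, "V", LEXICON)
noun_count = count_choices_with_pos(noun_item, "N", LEXICON)

print("verb item, choices with answer's POS:", verb_count)
print("noun item, choices with answer's POS:", noun_count)
```

Under these assumed tags the verb item has fewer choices sharing the answer's part of speech than the noun item, matching the contrast the slide draws between the two examples.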

    Slide 16:Impact of other terms

    Term                 Difficulty               Significance
    Question length      longer = harder          p < 0.001
    Deletion location    blank later = easier     p < 0.001

    Examples:
        “We can _______ the stars in the sky despite the bright city lights around us.” (at, with, most, see)
        “They rode their _______ .” (farmer, bikes, play, blue)

    Speaker notes: OVERALL – MAIN EFFECTS. Most common = TAG_PR_M; Q length = Q_RC_LEN (in chars [so that we can include response choice lengths]); Del location = PERC_Q_PREBLANK (in chars, consistent w Q_RC_LEN and correlates w counting words r~=.98); Syntactic guess rate = TPOS_INT. Originally… Predicting correct = 0, so changed signs on all variables, BUT t_pos_int is an exception so it should keep its neg sign??? -.060 .008 -.329 -.141. Mention verbally that betas aren’t directly comparable with each other (unstandardized)
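The two question properties on this slide can be sketched as simple character counts. This is an assumed reading of the speaker notes, not the authors' code: `q_rc_len` is taken to be the prompt length plus the lengths of the response choices, and `perc_preblank` is taken to be the share of prompt characters (excluding the blank itself) that come before the blank. The function name `length_features` and the `BLANK` marker are hypothetical.

```python
BLANK = "_______"


def length_features(prompt, choices):
    """Return (q_rc_len, perc_preblank) for one cloze item, measured in characters."""
    blank_start = prompt.index(BLANK)
    # Question length including response choices (assumed meaning of Q_RC_LEN).
    q_rc_len = len(prompt) + sum(len(choice) for choice in choices)
    # Fraction of non-blank prompt characters that precede the blank
    # (assumed meaning of PERC_Q_PREBLANK).
    perc_preblank = blank_start / (len(prompt) - len(BLANK))
    return q_rc_len, perc_preblank


long_item = length_features(
    "We can _______ the stars in the sky despite the bright city lights around us.",
    ["at", "with", "most", "see"],
)
short_item = length_features(
    "They rode their _______ .",
    ["farmer", "bikes", "play", "blue"],
)

print("long item  (length, share before blank):", long_item)
print("short item (length, share before blank):", short_item)
```

The first example is both longer and has its blank near the start, while the second is short with the blank near the end, which is why the pair works as a contrast for both terms.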

    Slide 17:Impact of other terms

    Term                 Difficulty               Significance
    Question length      longer = harder          p < 0.001
    Deletion location    blank later = easier     p < 0.001

    Examples:
        “We can _______ the stars in the sky despite the bright city lights around us.” (at, with, most, see)
        “They rode their _______ .” (farmer, bikes, play, blue)

    Longer = harder
    Shorter = easier

    Slide 18:Impact of other terms

    Term                 Difficulty               Significance
    Question length      longer = harder          p < 0.001
    Deletion location    blank later = easier     p < 0.001

    Examples:
        “We can _______ the stars in the sky despite the bright city lights around us.” (at, with, most, see)
        “They rode their _______ .” (farmer, bikes, play, blue)

    Blank earlier = harder
    Blank later = easier

    Slide 19:Using model to assess student reading comprehension

    Model estimates Beta parameter for each student
        Represents how well student did at answering cloze questions (controlling for difficulty factors)
        Should correlate with external comprehension measure
    Compare Beta vs. percent correct for predicting WRMT comprehension composite*
        Student Beta: r = .644, p < .001
        Percent correct: r = .507, p < .001
        Reliability of difference in correlations, p < .01
    Also provides check on validity of regression model
    *N = 465, 1 extreme outlier was eliminated from analyses.
    Speaker notes: Reliability of difference in correlations, p ~= .0018
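The slide does not say which test was used for the difference between the two correlations. As a hedged illustration only, the sketch below applies Fisher's r-to-z comparison for two independent correlations to the reported values (r = .644, r = .507, N = 465). Since both correlations come from the same students, a test for dependent correlations would be the stricter choice, but it needs the correlation between Beta and percent correct, which the slides do not give. `fisher_z_compare` is a hypothetical helper name.

```python
import math


def fisher_z_compare(r1, r2, n1, n2):
    """Two-sided Fisher r-to-z test for two correlations (treated as independent)."""
    z1, z2 = math.atanh(r1), math.atanh(r2)          # Fisher transform
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))  # standard error of z1 - z2
    z = (z1 - z2) / se
    p = math.erfc(abs(z) / math.sqrt(2.0))           # two-sided normal tail area
    return z, p


z_stat, p_value = fisher_z_compare(0.644, 0.507, 465, 465)
print("z =", round(z_stat, 3), "p =", round(p_value, 4))
```

With the slide's numbers this sketch gives a p-value below .01, in line with the reported reliability of the difference, though that agreement does not establish that this is the test the authors ran.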

    Slide 20:Conclusions

    Length of question, location of deleted word, and part of speech of correct answer affect question difficulty.
    Logistic regression is a strong choice for analyzing cloze data.
    Multiple-choice cloze questions can assess a student at a more accurate level than current practice.

    Slide 21:Questions?

    Nominated for Best Paper Award:
        Soden Hensler, B., & Beck, J. E. (2006). Better student assessing by finding difficulty factors in a fully automated comprehension measure. Intelligent Tutoring Systems.
    Brooke Soden Hensler: bsodenhensler@gmail.com
    Joseph E. Beck: joseph.beck@gmail.com
    Project LISTEN & The Reading Tutor: http://www.cs.cmu.edu/~listen/

    Slide 22:References

    Gentner, D. (1981). Some interesting differences between verbs and nouns. Cognition and Brain Theory, 4(2).
    Golinkoff, R. M., Hirsh-Pasek, K., Bloom, L., Smith, L. B., Woodward, A. L., Akhtar, N., Tomasello, M., & Hollich, G. (2000). Becoming a word learner: A debate on lexical acquisition. New York: Oxford University Press.
    Mostow, J., & Aist, G. (2001). Evaluating tutors that listen: An overview of Project LISTEN. In K. Forbus & P. Feltovich (Eds.), Smart Machines in Education (pp. 169-234). Menlo Park, CA: MIT/AAAI Press.
    Mostow, J., Beck, J. E., Bey, J., Cuneo, A., Sison, J., Tobin, B., & Valeri, J. (2004). Using automated questions to assess reading comprehension, vocabulary, and effects of tutorial interventions. Technology, Instruction, Cognition and Learning, 2, 97-134.
    Schwanenflugel, P. J., Stahl, S. A., & McFalls, E. L. (1997). Partial word knowledge and vocabulary growth during reading comprehension. Journal of Literacy Research, 29(4).

    Slide 23:Additional Slides


    Slide 24:Terms in Model

    Slide 25:Developmental Trends in Learning Parts of Speech

    Speaker notes: INTERACTION EFFECTS. Explain where these numbers came from. Disaggregated by student proficiency and rebuilt model.

    Slide 26:Developmental Trends in Learning Parts of Speech

    p < .001, p = .71, p = .99, p = .52, p = .64
    Speaker notes: p-values represent reliability of difference between nouns & verbs in each proficiency level’s model. INTERACTION EFFECTS. Explain where these numbers came from. Disaggregated by student proficiency and rebuilt model.

    Slide 27:Syntactic Awareness

    p = .48, p = .73, p = .01, p = .02, p < .001
    Speaker notes: p-values represent relative impact in each proficiency level’s model.

    Slide 28:Effect of Part of Speech

    *Interpretation: positive Beta means student is more likely to answer question correctly
    Speaker notes: OVERALL – MAIN EFFECTS. Changes: 0. just say verbally, all reliably differ from 0 and 0.001 (or at bottom) and put in p-vals for comparison with next lowest. Remove difficulty. Kai-min’s graphic approach of showing > for statistically reliable relationships (i.e., put > between nouns and verbs, verbs and adj, etc.)
