An Overview of Automated Essay Scoring (AES) Research: What are the Implications for ESL/EFL Writers? Semire Dikli, Ph.D. 2nd Untested Ideas International Research Conference, June 27, 2014
What is AES? • AES is a technology that is capable of scoring essays within seconds. • This aspect of AES has made it a critical component of large-scale writing assessments. • E.g., Educational Testing Service (ETS) used e-rater to score the Graduate Management Admission Test (GMAT) AWA (Analytical Writing Assessment) between 1999 and 2006 (Burstein, 2003) and has used it to rate TOEFL iBT writing since 2009 (Haberman, 2011). • AES technology has been widely used in school systems, including middle and high schools as well as colleges and universities, for low-stakes assessment purposes, as its instructional applications are capable of providing instant feedback on various categories of writing in addition to scoring.
AES Systems • Currently, there are a number of AES systems on the market, including Project Essay Grader (PEG), Intelligent Essay Assessor (IEA) and its instructional application WriteToLearn, IntelliMetric and its instructional application MY Access, e-rater and its instructional application Criterion, BETSY (Bayesian Essay Test Scoring sYstem), CRASE, and Benchmark-SkillWriter. • This presentation, however, will focus on the first four AES systems (and their instructional applications), as they are widely used for large-scale assessment purposes and/or in classrooms. Before exploring research regarding these AES systems, it is useful to briefly describe each one.
PEG • Project Essay Grader (PEG) was developed by Ellis Page in 1966 at the request of the College Board in order to make large-scale essay scoring more practical and effective (Rudner & Gagne, 2001; Page, 2003). • PEG uses proxy measures (surface features of writing such as average word length, essay length, number of semicolons or commas, etc.) to predict the intrinsic quality of essays (Kukich, 2000; Chung & O'Neil, 1997; Rudner & Gagne, 2001).
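To make the proxy-measure idea concrete, here is a minimal sketch, assuming a small set of human-scored training essays; the specific features and the linear-regression model are illustrative assumptions, not Page's actual implementation.

```python
# Illustrative sketch of a proxy-measure scorer in the spirit of PEG:
# surface features (proxies) are extracted from each essay and a simple
# regression maps them to human scores. Feature set and model choice are
# assumptions for illustration only.
import re
import numpy as np
from sklearn.linear_model import LinearRegression

def proxy_features(essay: str) -> list[float]:
    words = re.findall(r"[A-Za-z']+", essay)
    sentences = [s for s in re.split(r"[.!?]+", essay) if s.strip()]
    return [
        len(words),                                          # essay length
        np.mean([len(w) for w in words]) if words else 0.0,  # average word length
        essay.count(",") + essay.count(";"),                 # punctuation count
        len(words) / max(len(sentences), 1),                 # average sentence length
    ]

# Hypothetical training data: essays already scored by human raters (1-6 scale).
train_essays = [
    "A short, plainly written response to the prompt ...",
    "A longer, more elaborated response with varied sentences ...",
    "A mid-length response with some development of ideas ...",
]
train_scores = [2, 5, 4]

X = np.array([proxy_features(e) for e in train_essays])
model = LinearRegression().fit(X, train_scores)

new_essay = "An unscored essay submitted for automated scoring ..."
predicted = model.predict(np.array([proxy_features(new_essay)]))[0]
print(f"Predicted score: {predicted:.1f}")
```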
IEA • Intelligent Essay Assessor (IEA) and its instructional application WriteToLearn are produced by Pearson and rely on its Knowledge Analysis Technologies (KAT) engine (Pearson Assessments, n.d.). • IEA analyzes and scores an essay using a semantic text analysis method called Latent Semantic Analysis (LSA) (Lemaire & Dessus, 2001), a matrix-algebra-based approach that draws on methods used in automatic speech recognition, computational linguistics, and other forms of statistical artificial intelligence (Pearson Assessments, n.d.).
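The LSA idea can be illustrated with a short sketch, assuming a handful of human-scored reference essays on the same prompt; the TF-IDF weighting, SVD dimensionality, and similarity-weighted scoring below are illustrative choices, not the proprietary KAT engine.

```python
# Minimal illustration of the LSA idea behind IEA-style scoring: project
# essays into a low-dimensional semantic space via SVD, then estimate a new
# essay's score from its semantic similarity to human-scored reference essays.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical human-scored reference essays on the same prompt (1-6 scale).
reference_essays = [
    "Renewable energy reduces emissions and long-term costs ...",
    "Solar and wind power are clean but depend on the weather ...",
    "I like pizza and weekends ...",
]
reference_scores = np.array([6, 5, 1])

vectorizer = TfidfVectorizer(stop_words="english")
term_doc = vectorizer.fit_transform(reference_essays)

# Reduce to a small latent semantic space (dimension chosen for illustration).
svd = TruncatedSVD(n_components=2, random_state=0)
latent_refs = svd.fit_transform(term_doc)

new_essay = ["Wind turbines and solar panels cut pollution ..."]
latent_new = svd.transform(vectorizer.transform(new_essay))

# Weight the reference scores by semantic similarity to the new essay.
sims = cosine_similarity(latent_new, latent_refs).ravel()
weights = np.clip(sims, 0, None)
predicted = float(np.dot(weights, reference_scores) / weights.sum())
print(f"Predicted score: {predicted:.1f}")
```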
IntelliMetric and MY Access • IntelliMetric, an AES system developed by Vantage Learning, is an automated essay-scoring tool that uses artificial intelligence (AI), natural language processing (NLP), and statistical technologies (Elliot, 2003a, 2003d; Shermis & Barrera, 2002; Shermis, Raymat, & Barrera, 2003). • MY Access is known as the instructional application of IntelliMetric (Vantage Learning, n.d.).
E-rater and Criterion • E-rater and its instructional application, Criterion, have been developed by Educational Testing Service (ETS). • E-rater uses natural language processing (NLP) techniques (Kukich, 2000; Burstein, 2003), and Criterion relies on the e-rater scoring engine to evaluate essays.
Research on AES • AES systems have been designed primarily to resolve issues with large-scale writing assessments, including time, cost, and reliability (Bereiter, 2003; Burstein, 2003; Chung & O'Neil, 1997; Myers, 2003; Page, 2003; Rudner & Gagne, 2001; Rudner & Liang, 2002). • The developing companies are chiefly concerned with demonstrating the accuracy of AES systems and their high correlations with human raters • (e.g., Attali, 2004; Burstein & Chodorow, 1999; Landauer, Laham, & Foltz, 2003; Landauer, Laham, Rehder, & Schreiner, 1997; Lee, Gentile, & Kantor, 2008; Monaghan & Bridgeman, 2005; Nichols, 2004; Page, 2003; Rudner, Garcia, & Welch, 2006; Vantage Learning, 2000a, 2000b, 2001c, 2002a, 2003a, 2003b; Wang & Brown, 2007).
Concerns with validity and reliability • The results of AES research do suggest satisfactory agreement rates between AES systems and human scorers, yet the majority of these studies were conducted in large-scale assessment contexts rather than in classroom settings. • Agreement rates are less likely to be as high in classroom assessments (Keith, 2003); therefore, the results may not generalize to the instructional applications of AES. • Chen and Cheng (2008) argue that assessment validation is a more complex process, so comparing scores from different raters or measures is unlikely to be adequate.
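Because this body of research turns on human-machine agreement, the sketch below shows the statistics such studies typically report (exact agreement, adjacent agreement, and quadratic weighted kappa); the score pairs are invented for illustration.

```python
# Sketch of the agreement statistics commonly reported in AES validation
# studies: exact agreement, adjacent agreement (within one score point),
# and quadratic weighted kappa. The score pairs below are made up.
import numpy as np
from sklearn.metrics import cohen_kappa_score

human = np.array([4, 3, 5, 2, 4, 3, 6, 4, 3, 5])    # human rater scores
machine = np.array([4, 3, 4, 2, 5, 3, 5, 4, 2, 5])  # AES scores for the same essays

exact = np.mean(human == machine)
adjacent = np.mean(np.abs(human - machine) <= 1)
qwk = cohen_kappa_score(human, machine, weights="quadratic")

print(f"Exact agreement:          {exact:.0%}")
print(f"Adjacent agreement:       {adjacent:.0%}")
print(f"Quadratic weighted kappa: {qwk:.2f}")
```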
Concerns with validity and reliability (cont'd) • Another concern regarding the validity of AES technology is the possibility that users can defeat the systems, which are likely to fail to flag essays of sufficient length and familiar vocabulary but poor content (Chung & O'Neil, 1997; Kukich, 2000; Powers et al., 2001; Rudner & Gagne, 2001). • The developers, however, have been making efforts to eliminate this problem by incorporating algorithms that aim to identify such essays.
Limitations with AES research • One limitation of AES research is that AES studies have mostly been conducted or sponsored by the developing companies • (e.g., Attali, 2004; Attali, 2007; Attali & Burstein, 2006; Burstein & Chodorow, 1999; Burstein et al., 1997; Chodorow & Burstein, 2004; Cushing-Weigle, 2011; Elliott, 2003; Elliott & Mikulas, 2004; Edelblut & Vantage Learning, 2003; Landauer, Laham, & Foltz, 2003; Landauer et al., 1997; Lee, Gentile, & Kantor, 2008; Nichols, 2004; Powers et al., 2000; Powers et al., 2001; Rudner, Garcia, & Welch, 2006; Page, 2003; Rock, 2007; Vantage Learning, 2000a, 2001b, 2002a, 2003a, 2003b; Wang & Brown, 2007). • Only in recent years have independent researchers started showing interest in exploring AES technology • (e.g., Chen & Cheng, 2008; Choi & Lee, 2010; Dikli, 2010; Grimes & Warschauer, 2010; Warschauer & Grimes, 2008).
Limitations with AES research (cont'd) • Prior research on AES and ESL is insufficient, as the majority of these studies use writing data from native English-speaking writers, mostly in large-scale writing assessments • (e.g., Attali, 2004; Attali & Burstein, 2006; Burstein et al., 1997; Cushing-Weigle, 2011; Landauer, Laham, & Foltz, 2003; Landauer et al., 1997; Nichols, 2004; Powers et al., 2000; Powers et al., 2001; Rudner, Garcia, & Welch, 2006; Page, 2003; Rock, 2007; Vantage Learning, 2000a, 2001b, 2002a, 2003a, 2003b; Wang & Brown, 2007). • A handful of studies have analyzed essays from non-native English speakers • (e.g., Attali, 2007; Attali & Burstein, 2006; Dikli, 2010; Burstein & Chodorow, 1999; Chodorow & Burstein, 2004; Edelblut & Vantage Learning, 2003; Chen & Cheng, 2008; Choi & Lee, 2010; Elliott & Mikulas, 2004; Lee, Gentile, & Kantor, 2008; Vantage Learning, 2001b). • Only a few have investigated the use of an AES system in an ESL or EFL classroom • (e.g., Chen & Cheng, 2008; Choi & Lee, 2010; Dikli, 2010; Dikli & Bleyle, 2014). • This gap is particularly important because AES systems are widely used by non-native English speakers not only in the U.S. but also around the globe.
Limitations with AES research (cont'd) • The developers assert that AES systems are capable of evaluating numerous essays and providing individualized feedback at the same time. This is particularly important for non-native English speakers because they vary tremendously in their English proficiency. • However, researchers have found that AES feedback was generic and that it missed or misidentified many student errors, most of which were accurately identified by the instructor (e.g., Dikli, 2010; Dikli & Bleyle, 2014). • These are important limitations that ESL/EFL teachers should recognize, since non-native English speakers, particularly those with lower English proficiency levels, are likely to make many L2 errors in their writing.
Conclusion • Despite many concerns, AES technology can be a convenient tool that may encourage L2 teachers to use process writing, a widely used approach in writing classrooms that provides students with opportunities to plan, draft, and revise their essays based on feedback. • Classroom-based AES systems such as Criterion, MY Access, and WriteToLearn indeed promote the process approach through their pre-writing and portfolio features and can therefore be viewed as another source of feedback in addition to that of the teacher and/or peers. • However, both teachers and students should be aware of their drawbacks and avoid relying heavily on AES feedback. • Because research in ESL/EFL classroom contexts is limited, there is an immediate need for more studies that include writing samples from non-native English speakers.
References • Attali, Y. (2004, April). Exploring the feedback and revision features of Criterion. Paper presented at the National Council on Measurement in Education (NCME), San Diego, CA. • Attali, Y. (2007). Construct validity of e-rater in scoring TOEFL essays (RR-07-21). Princeton, NJ: Educational Testing Service (ETS). • Attali, Y., & Burstein, J. (2006). Automated essay scoring with e-rater v.2. Journal of Technology, Learning, and Assessment, 4(3), 1-31. • Bereiter, C. (2003). Foreword. In M. D. Shermis & J. C. Burstein (Eds.), Automated essay scoring: A cross-disciplinary approach (pp. vii-ix). Mahwah, NJ: Lawrence Erlbaum Associates.
References (cont'd) • Burstein, J. (2003). The e-rater scoring engine: Automated essay scoring with natural language processing. In M. D. Shermis & J. C. Burstein (Eds.), Automated essay scoring: A cross-disciplinary approach (pp. 113-121). Mahwah, NJ: Lawrence Erlbaum Associates. • Burstein, J., & Chodorow, M. (1999, June). Automated essay scoring for nonnative English speakers. Proceedings of the ACL99 Workshop on Computer-Mediated Language Assessment and Evaluation of Natural Language Processing, College Park, MD. • Burstein, J., Braden-Harder, L., Chodorow, M., Hua, S., Kaplan, B., Kukich, K., Lu, C., Nolan, J., Rock, D., & Wolff, S. (1998). Computer analysis of essay content for automated score prediction: A prototype automated scoring system for GMAT Analytical Writing Assessment essays (RR-98-15). Princeton, NJ: Educational Testing Service (ETS). • Burstein, J., & Marcu, D. (2000). Benefits of modularity in an automated essay scoring system (ERIC reproduction service no. TM 032 010). • Burstein, J., Leacock, C., & Swartz, R. (2001). Automated evaluation of essays and short answers. Proceedings of the 5th International Computer Assisted Assessment Conference (CAA 01), Loughborough University.
References (cont'd) • Chen, C. E., & Cheng, W. E. (2008). Beyond the design of automated writing evaluation: Pedagogical practices and perceived learning effectiveness in EFL writing classes. Language Learning & Technology, 12(2), 94-112. • Chodorow, M., & Burstein, J. (2004). Beyond essay length: Evaluating e-rater's performance on TOEFL essays (Research Report No. 73). Princeton, NJ: Educational Testing Service (ETS). • Choi, J., & Lee, Y. (2010). The use of feedback in the ESL writing class integrating automated essay scoring (AES). In D. Gibson & B. Dodge (Eds.), Proceedings of Society for Information Technology & Teacher Education International Conference (pp. 3008-3012). Chesapeake, VA: AACE. • Chung, K. W. K., & O'Neil, H. F. (1997). Methodological approaches to online scoring of essays (ERIC reproduction service no. ED 418 101). • Cushing-Weigle, S. (2011). Validation of automated scores of TOEFL iBT tasks against nontest indicators of writing ability (RR-11-24). Princeton, NJ: Educational Testing Service (ETS).
References (cont'd) • Dikli, S. (2006). An overview of automated scoring of essays. Journal of Technology, Learning, and Assessment, 5(1). • Dikli, S. (2010). Nature of automated essay scoring feedback. CALICO Journal, 28(1). • Dikli, S., & Bleyle, S. (in press). Automated essay scoring feedback for second language writers: How does it compare to instructor feedback? Assessing Writing. • Edelblut, P., & Vantage Learning. (2003, November). An analysis of the reliability of computer automated essay scoring by IntelliMetric of essays written in Malay language. Paper presented at TechEX 03, Ruamrudee International School. • Educational Testing Service (ETS). (n.d.). Retrieved March 5, 2009, from http://www.ets.org
References (cont'd) • Elliott, S. (2003). IntelliMetric: From here to validity. In M. D. Shermis & J. C. Burstein (Eds.), Automated essay scoring: A cross-disciplinary approach. Mahwah, NJ: Lawrence Erlbaum Associates. • Elliot, S., & Mikulas, C. (2004, April). A summary of studies demonstrating the educational impact of the MY Access online writing instructional application. Paper presented at the National Council on Measurement in Education (NCME), San Diego, CA. • Grimes, D., & Warschauer, M. (2010). Utility in a fallible tool: A multi-site case study of automated writing evaluation. Journal of Technology, Learning, and Assessment, 8(6). • Haberman, S. J. (2011). Use of e-rater in scoring of the TOEFL iBT writing test (RR-11-15). Princeton, NJ: Educational Testing Service (ETS).
References (cont'd) • Keith, T. Z. (2003). Validity and automated essay scoring systems. In M. D. Shermis & J. C. Burstein (Eds.), Automated essay scoring: A cross-disciplinary perspective (pp. 147-167). Mahwah, NJ: Lawrence Erlbaum Associates. • Kukich, K. (2000, September/October). Beyond automated essay scoring. In M. A. Hearst (Ed.), The debate on automated essay grading. IEEE Intelligent Systems, 27-31. • Landauer, T. K., Laham, D., & Foltz, P. W. (2003). Automated scoring and annotation of essays with the Intelligent Essay Assessor. In M. D. Shermis & J. C. Burstein (Eds.), Automated essay scoring: A cross-disciplinary perspective (pp. 87-112). Mahwah, NJ: Lawrence Erlbaum Associates. • Landauer, T. K., Laham, D., Rehder, B., & Schreiner, M. E. (1997). How well can passage meaning be derived without using word order? A comparison of Latent Semantic Analysis and humans. Proceedings of the 19th Annual Conference of the Cognitive Science Society (pp. 412-417). Mahwah, NJ: Erlbaum.
References (cont'd) • Lemaire, B., & Dessus, P. (2001). A system to assess the semantic content of student essays. Journal of Educational Computing Research, 23(3), 305-306. • Lee, W. Y., Gentile, C., & Kantor, R. (2008). Analytic scoring of TOEFL CBT essays: Scores from humans and e-rater (RR-08-01). Princeton, NJ: Educational Testing Service (ETS). • Nichols, P. D. (2004, April). Evidence for the interpretation and use of scores from an automated essay scorer. Paper presented at the Annual Meeting of the American Educational Research Association (AERA), San Diego, CA. • Page, E. B. (2003). Project Essay Grade: PEG. In M. D. Shermis & J. Burstein (Eds.), Automated essay scoring: A cross-disciplinary perspective (pp. 43-54). Mahwah, NJ: Lawrence Erlbaum Associates. • Pearson Assessments. (n.d.). Retrieved May 1, 2013, from www.pearsonassessments.com
References (cont'd) • Powers, D. E., Burstein, J. C., Chodorow, M., Fowles, M. E., & Kukich, K. (2000). Comparing the validity of automated and human essay scoring (RR-00-10). Princeton, NJ: Educational Testing Service (ETS). • Powers, D. E., Burstein, J. C., Chodorow, M., Fowles, M. E., & Kukich, K. (2001). Stumping e-rater: Challenging the validity of automated essay scoring (RR-01-03). Princeton, NJ: Educational Testing Service (ETS). • Rudner, L., & Gagne, P. (2001). An overview of three approaches to scoring written essays by computer (ERIC Digest number ED 548 290). • Rudner, L. M., Garcia, V., & Welch, C. (2006). An evaluation of the IntelliMetric essay scoring system. Journal of Technology, Learning, and Assessment, 4(4). • Shermis, M. D., & Barrera, F. (2002). Exit assessments: Evaluating writing ability through automated essay scoring (ERIC document reproduction service no. ED 464 950).
References (cont'd) • Shermis, M. D., Burstein, J., & Bliss, L. (2004). The impact of automated essay scoring in high stakes writing assessments. Proceedings of the Annual Meetings of the American Educational Research Association (AERA) and the National Council on Measurement in Education (NCME), San Diego, CA. • Shermis, M. D., Raymat, M. V., & Barrera, F. (2003). Assessing writing through the curriculum with automated essay scoring (ERIC document reproduction service no. ED 477 929). • Rock, J. L. (2007). The impact of short-term use of Criterion on writing skills in ninth grade (RR-07-07). Princeton, NJ: Educational Testing Service (ETS). • Vantage Learning. (n.d.). Retrieved May 12, 2007, from www.vantagelearning.com • Vantage Learning. (2000a). A study of expert scoring and IntelliMetric scoring accuracy for dimensional scoring of Grade 11 student writing responses (RB-397). Newtown, PA: Vantage Learning. • Vantage Learning. (2000b). A true score study of IntelliMetric accuracy for holistic and dimensional scoring of college entry-level writing program (RB-407). Newtown, PA: Vantage Learning.
References (cont'd) • Vantage Learning. (2001a). Applying IntelliMetric technology to the scoring of 3rd and 8th grade standardized writing assessments (RB-524). Newtown, PA: Vantage Learning. • Vantage Learning. (2001b). A preliminary study of the efficacy of IntelliMetric for use in scoring Hebrew assessments (RB-561). Newtown, PA: Vantage Learning. • Vantage Learning. (2002). A study of expert scoring, standard human scoring and IntelliMetric scoring accuracy for statewide eighth grade writing responses (RB-726). Newtown, PA: Vantage Learning. • Vantage Learning. (2003a). Assessing the accuracy of IntelliMetric for scoring a district-wide writing assessment (RB-806). Newtown, PA: Vantage Learning. • Vantage Learning. (2003b). How does IntelliMetric score essay responses? (RB-929). Newtown, PA: Vantage Learning. • Wang, J., & Brown, M. S. (2007). Automated essay scoring versus human scoring: A comparative study. Journal of Technology, Learning, and Assessment, 6(2). • Warschauer, M., & Grimes, D. (2008). Automated writing assessment in the classroom. Pedagogies, 3, 22-36. • Warschauer, M., & Ware, P. (2006). Automated writing evaluation: Defining the classroom research agenda. Language Teaching Research, 10(2), 1-24.
Thanks! • Dr. Semire Dikli • sdikli@ggc.edu