Automatic Identification of Discourse Moves in Scientific Article Introductions NICK PENDAR AND ELENA COTOS IOWA STATE UNIVERSITY THE 3RD WORKSHOP ON INNOVATIVE USE OF NLP FOR BUILDING EDUCATIONAL APPLICATIONS JUNE 19, 2008
Outline • Background and motivation • Discourse move identification • Data and annotation scheme • Feature selection • Sentence representation • Classifier • Evaluation • Inter-annotator agreement • Further work
Automated evaluation: Background • Automated essay scoring (AES) in performance-based and high-stakes standardized tests (e.g., ACT, GMAT, TOEFL, etc.) • Automated error detection in L2 output (Burstein and Chodorow, 1999; Chodorow et al., 2007; Han et al., 2006; Leacock and Chodorow, 2003) • Assessment of various constructs, e.g., topical content, grammar, style, mechanics, syntactic complexity, and deviance or plagiarism (Burstein, 2003; Elliott, 2003; Landauer et al., 2003; Mitchell et al., 2002; Page, 2003; Rudner and Liang, 2002) • Text organization limited to recognizing the five-paragraph essay format, thesis, and topic sentences • AntMover (Anthony and Lashkia, 2003)
Automated evaluation: CALI Motivation • Wide range of possibilities for high-quality evaluation and feedback (Criterion; Burstein, Chodorow, & Leacock, 2004) • Potential in formative assessment, but the effects of intelligent formative feedback have not been fully investigated • Warschauer and Ware (2006) call for the development of a classroom research agenda that would help evaluate and guide the application of AES in writing pedagogy • “the potential of automated essay evaluation for improving student writing is an empirical question, and virtually no peer-reviewed research has yet been published” (Hyland and Hyland, 2006, p. 109)
Automated evaluation: EAP Motivation • EAP pedagogical approaches (Cortes, 2006; Levis & Levis-Muller, 2003; Vann & Myers, 2001) fail to provide NNSs with sufficient academic writing practice and remedial guidance • Problem of disciplinarity • An NLP-based academic discourse evaluation application could address this gap • Such an application has not yet been developed
Automated evaluation: Research Motivation • Long-term research goals: • design and implementation of IADE (Intelligent Academic Discourse Evaluator) • analysis of IADE effectiveness for formative assessment purposes
Evaluates students’ research article introductions in terms of moves/steps (Swales, 1990, 2004) • Draws from • SLA models: interactionist views (Carroll, 1999; Gass, 1997; Long, 1996; Long & Robinson, 1998; Mackey, Gass, & McDonough, 2000; Swain, 1993) and Systemic Functional Linguistics (Martin, 1992; Halliday, 1985) • Skill Acquisition Theory of learning (DeKeyser, 2007) • Is informed by empirical research on the provision of feedback • Is informed by Evidence Centered Design principles (Mislevy et al., 2006)
Discourse Move Identification • Approached as a classification problem (similar to Burstein et al., 2003) • given a sentence and a finite set of moves and steps, what move/step does the sentence signify? • ISUAW corpus: 1,623 articles; 1,322,089 words; average length of articles 814.09 words • Stratified sampling of 401 introduction sections representative of 20 academic disciplines • Sub-corpus: 267,029 words; average length 665.91 words; 11,149 sentences • Manual annotation
Discourse Move Identification • Annotation scheme (Swales, 1990; Swales, 2004)
Discourse Move Identification • Multiple layers of annotation for cases when the same sentence signified more than one move or more than one step
Feature Selection • Features that reliably indicate a move/step • Text-categorization approach (see Sebastiani, 2002) • Each sentence treated as a data item to be classified and represented as an n-dimensional vector in the Euclidean space • The task of the learning algorithm is to find a function F : S → M that would map the sentences in the corpus S to classes in M = {m1,m2,m3} • Identification of moves, not yet steps
Feature Selection • Extraction of word unigrams, bigrams, and trigrams from the annotated corpus • Preprocessing: • All tokens stemmed using the NLTK port of the Porter Stemmer algorithm (Porter, 1980) • All numbers in the texts replaced by the string _number_ • The tokens inside each n-gram alphabetized in case of bigrams and trigrams • All n-grams with a frequency of less than five excluded
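A minimal sketch of the preprocessing pipeline described above, assuming NLTK is available; function and variable names are illustrative, not the authors’ code.

```python
# Sketch of the n-gram extraction and preprocessing steps listed on the slide.
from collections import Counter
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

def preprocess(tokens):
    """Stem every token; replace numeric tokens with the string _number_."""
    out = []
    for tok in tokens:
        if tok.replace(".", "", 1).isdigit():
            out.append("_number_")
        else:
            out.append(stemmer.stem(tok.lower()))
    return out

def extract_ngrams(tokens, n):
    """Extract n-grams; bigrams and trigrams are alphabetized internally."""
    grams = []
    for i in range(len(tokens) - n + 1):
        gram = tokens[i:i + n]
        if n > 1:
            gram = sorted(gram)  # order-insensitive, as described on the slide
        grams.append(tuple(gram))
    return grams

def ngram_counts(sentences, n, min_freq=5):
    """Count n-grams over tokenized sentences, dropping those below min_freq."""
    counts = Counter()
    for sent in sentences:
        counts.update(extract_ngrams(preprocess(sent), n))
    return {g: c for g, c in counts.items() if c >= min_freq}
```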
Feature Selection • Odds ratio • Conditional probabilities are calculated as maximum likelihood estimates • N-grams with maximum odds ratios selected as features
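The odds-ratio formula itself appears to have been lost in conversion. Below is a sketch using the usual text-categorization definition (cf. Sebastiani, 2002), with conditional probabilities as maximum likelihood estimates; whether this matches the paper’s exact variant is an assumption, and the count arguments are illustrative.

```python
# Odds-ratio scoring of candidate n-grams for one move class.
# OR(t, m) = P(t|m)(1 - P(t|~m)) / ((1 - P(t|m)) P(t|~m))
def odds_ratio(n_term_in_move, n_move, n_term_out_move, n_rest, eps=1e-9):
    p_pos = n_term_in_move / n_move      # MLE of P(t | m)
    p_neg = n_term_out_move / n_rest     # MLE of P(t | not m)
    return (p_pos * (1 - p_neg) + eps) / ((1 - p_pos) * p_neg + eps)

def top_features(term_stats, k=3000):
    """Keep the k n-grams with the highest odds ratio for a given move.
    term_stats maps each n-gram to its four count arguments above."""
    scored = {t: odds_ratio(*stats) for t, stats in term_stats.items()}
    return sorted(scored, key=scored.get, reverse=True)[:k]
```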
Sentence Representation • Each sentence represented as a vector • Presence or absence of terms in sentences recorded as Boolean values (0 for the absence of the corresponding term or a 1 for its presence)
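A small sketch of the Boolean presence/absence encoding, assuming the selected n-gram features are given as an ordered list; names are illustrative.

```python
# Each sentence becomes a Boolean vector over the selected feature n-grams.
def sentence_vector(sentence_ngrams, features):
    """1 if the feature n-gram occurs in the sentence, 0 otherwise."""
    present = set(sentence_ngrams)
    return [1 if f in present else 0 for f in features]
```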
Classifier • Support Vector Machines (SVM) (Basu et al., 2003; Burges, 1998; Cortes and Vapnik, 1995; Joachims, 1998; Vapnik, 1995) • Five-fold cross validation • Machine learning environment RAPIDMINER (Mierswa et al., 2006) • RBF kernel parameters selected by trying a set of different settings on the feature set with 3,000 unigrams • Parameters not necessarily optimal; exhaustive searches will be performed on the other feature sets
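The experiments were run in RapidMiner; as a hedged stand-in, an analogous setup in scikit-learn is sketched below. The C and gamma values are placeholders, not the paper’s settings.

```python
# Analogous RBF-kernel SVM with five-fold cross-validation (not the authors' setup).
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def evaluate_feature_set(X, y):
    """Five-fold cross-validation of an RBF-kernel SVM on one feature set."""
    clf = SVC(kernel="rbf", C=1.0, gamma="scale")   # placeholder parameters
    scores = cross_val_score(clf, X, y, cv=5)       # accuracy per fold
    return scores.mean()
```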
Evaluation • Five-fold cross validation was performed on 14 different feature sets
Evaluation • Accuracy - the proportion of classifications that agreed with the manually assigned labels
Evaluation • Precision - what proportion of the items assigned to a given category actually belonged to it • Recall - what proportion of the items actually belonging to a category were labeled correctly
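In standard notation, these verbal definitions correspond, per move m, to:

```latex
% Standard definitions matching the descriptions above (per move m)
\mathrm{Accuracy} = \frac{\text{correctly labeled sentences}}{\text{all sentences}},
\qquad
\mathrm{Precision}(m) = \frac{TP_m}{TP_m + FP_m},
\qquad
\mathrm{Recall}(m) = \frac{TP_m}{TP_m + FN_m}
```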
Evaluation • Trigram models result in the best precision • Unigram models result in the best recall
Evaluation • Move 2 is the most difficult to identify: error analysis shows Move 2 is often misclassified as Move 1 • Possible remedy: use the relative position of the sentence in the text to disambiguate the move • Check what percentage of Move 2 sentences identified as Move 1 by the system were also labeled Move 1 by the annotator • Extracted features are not discipline-dependent
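One simple way to realize the relative-position idea above would be to append a normalized position value to each sentence vector; this helper is hypothetical, not part of the reported system.

```python
# Hypothetical extension: add the sentence's normalized position as a feature.
def add_relative_position(vector, sentence_index, n_sentences):
    """Append position in [0, 1]: 0 = first sentence, 1 = last sentence."""
    return vector + [sentence_index / max(n_sentences - 1, 1)]
```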
This just in… • Built a model with top 3000 unigrams and top 3000 trigrams • Precision: 91.14% • Recall: 82.98% • Kappa: 87.57
Inter-annotator agreement • Second annotations on a sample of files across all 20 disciplines (487 sentences) • κ - inter-annotator agreement • P(A) - observed probability of agreement • P(E) - expected probability of agreement • Average κ = 0.945 over the three moves
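The kappa formula itself appears to have been dropped in conversion; the P(A) and P(E) defined above correspond to the standard Cohen’s kappa:

```latex
% Cohen's kappa, using the P(A) and P(E) defined on the slide
\kappa = \frac{P(A) - P(E)}{1 - P(E)}
```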
Further work on IADE • Ongoing experiments with different kernel parameters to find optimal models and improve accuracy • More annotation • Inter-annotator agreement (3 annotators) • Identification of steps • Development of intelligent feedback • Web interface design
Further research with IADE • Evaluation of IADE effectiveness • Learning potential • Learner fit • Meaning focus • Authenticity • Impact • Practicality (Chapelle, 2001) • Process/product research direction - interaction between use and outcome (Warschauer & Ware, 2006) • Target for evaluation - “what is taught through technology” (Chapelle, 2007, p. 30)
Questions? Suggestions? THANK YOU!