This project aims to evaluate, train, and improve an English error correction system for Hong Kong students, focusing on areas such as grammar, collocation, meaning, and style. The project utilizes deep learning techniques and data-driven approaches to provide more accurate and context-aware corrections. The system includes a database and interface for data collection and analysis.
English project: More detail and the data collection system 2018-08-24 David Ling
Contents • Project background • Evaluation • Training • Data collection system
Background • In charge of the project: • Holly Chung, Amy Kwok, Anora Wong (ENG) • Target: • English error correction for HK students • Highlight good practices (not well defined yet) • More than traditional grammar checkers: • Chinglish, collocation, meaning, and style • Math lessons use English. → Math lessons are conducted in English. • He can say Chinese. → He can speak Chinese.
Background
• Old methods:
  • Rule based, e.g. Microsoft Word, LanguageTool
  • Statistical methods
• New method:
  • Deep learning, translation, data driven
  • Chollampatt, 2018 (National University of Singapore)
  • Fairseq (Facebook) + language model (KenLM)
[Figure: An error rule extracted from LanguageTool on subject-verb agreement]
• Sentence start + determiner + plural noun + "is" (e.g. "The dogs is …", "The teachers is …")
• Pattern matched → trigger correction
• About 1.7k handcrafted error patterns
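The rule pattern on this slide (sentence start + determiner + plural noun + "is") can be illustrated with a minimal sketch. This is not LanguageTool's actual rule format or API; the function name `fix_plural_is` and the tiny word lists are hypothetical stand-ins for a real part-of-speech tagger.

```python
# Hypothetical toy lexicons; a real rule-based checker would rely on POS tags.
DETERMINERS = {"the", "these", "those"}
PLURAL_NOUNS = {"dogs", "teachers", "students"}


def fix_plural_is(sentence: str) -> str:
    """Apply one handcrafted rule: <start> determiner + plural noun + 'is' -> 'are'."""
    tokens = sentence.split()
    if (
        len(tokens) >= 3
        and tokens[0].lower() in DETERMINERS
        and tokens[1].lower() in PLURAL_NOUNS
        and tokens[2] == "is"
    ):
        tokens[2] = "are"  # pattern matched: trigger the correction
    return " ".join(tokens)


print(fix_plural_is("The dogs is barking ."))
print(fix_plural_is("The dog is barking ."))
```

Each rule of this kind covers exactly one pattern, which is why a rule-based checker needs a large set of handcrafted patterns and still misses errors that depend on wider context.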
Which set is by deep learning?
Input sentences
• He go to schol tomorrow.
• "I go to school by bus.", said David yesterday.
• … she did not want another mother would also feeling it.
• It can make the audiences having the same feeling on it.
A: Deep learning
• He will go to school tomorrow.
• "I go to school by bus," said David yesterday.
• … she did not want another mother to feel it.
• It can make the audience feel the same way.
• Correction based on the context
• Recalls more errors
• Not just correcting errors, but also improving style
B: Grammarly
• He goes to school tomorrow.
• "I go to school by bus.", said, David, yesterday.
• … she did not want another mother would also feel it.
• [No change]
Evaluation – Four main steps
INPUT: He go to schol tomorrow.
1. Tokenize + byte pair encoding
   He go to scho@@ l tomorrow .
2. Fairseq (beam search, 12 sentences)
   He will go to school tomorrow . ||| F0= -0.053
   He goes to school tomorrow . ||| F0= -0.375
   He is going to school tomorrow . ||| F0= -0.397
   …
3. Language model (KenLM, 150 GB)
   He will go to school tomorrow . ||| LM0= -24.9241
   He goes to school tomorrow . ||| LM0= -25.8588
   He is going to school tomorrow . ||| LM0= -25.1118
   …
4. Reweighted with number of edit operations and sentence length
OUTPUT: He will go to school tomorrow.
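The reweighting in step 4 can be sketched as a weighted sum over the features named on this slide: the Fairseq score (F0), the language-model score (LM0), the number of edit operations, and the sentence length. This is a minimal sketch, not the project's actual code. The weight values are placeholders chosen for illustration (the slides do not give them), and the helper names `undo_bpe`, `edit_ops`, and `rerank` are hypothetical. The scores are the ones shown on the slide.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str   # tokenized hypothesis from beam search
    f0: float   # Fairseq score, as shown on the slide
    lm0: float  # language-model score, as shown on the slide


def undo_bpe(text: str) -> str:
    """Join byte-pair-encoded pieces marked with '@@ ' back into words."""
    return text.replace("@@ ", "")


def edit_ops(src: list, hyp: list) -> int:
    """Token-level Levenshtein distance between source and hypothesis."""
    prev = list(range(len(hyp) + 1))
    for i, s in enumerate(src, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (s != h)))
        prev = curr
    return prev[-1]


def rerank(source: str, candidates: list, weights: dict) -> list:
    """Sort candidates by a weighted sum of the four features, best first."""
    src = source.split()

    def score(c: Candidate) -> float:
        hyp = c.text.split()
        return (
            weights["f0"] * c.f0
            + weights["lm0"] * c.lm0
            + weights["edits"] * edit_ops(src, hyp)
            + weights["length"] * len(hyp)
        )

    return sorted(candidates, key=score, reverse=True)


source = undo_bpe("He go to scho@@ l tomorrow .")
candidates = [
    Candidate("He will go to school tomorrow .", -0.053, -24.9241),
    Candidate("He goes to school tomorrow .", -0.375, -25.8588),
    Candidate("He is going to school tomorrow .", -0.397, -25.1118),
]
# Placeholder weights (assumption): in practice these would be tuned on held-out data.
weights = {"f0": 1.0, "lm0": 1.0, "edits": -0.1, "length": 0.1}
best = rerank(source, candidates, weights)[0]
print(best.text)
```

With these placeholder weights the first candidate stays on top, matching the OUTPUT on the slide; different weights could change the ranking.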
Training – data sets • NUCLE (National University of Singapore), 2014 • ~60k sentences (1,500 essays) • Lang-8 (Japanese social website), 2012 • ~2,000k sentences • Coverage of topics and error types is far from sufficient, e.g. eSports
Training – with additional training sentences
[Figure: system output before and after adding the sentences; recoverable labels: "Before", "After", "are conducted in", "be used in", "math"]
• Recalled successfully
• Chinese and Science are not in training data
Five additional training sentences (source → target)
1. Math lessons used English . → Math lessons were conducted in English .
2. Physics lessons used English . → Physics lessons were conducted in English .
3. Biology classes use English . → Biology classes are conducted in English .
4. History lessos used English . → History lessons was conducted in English .
5. Philosophy lessons often used English . → Philosophy lessons are conducted in English often
Grammar correction data set • Building a data set for Hong Kong students • Improvement of the checker • Different sentence styles • Different error types • Literature value • Statistical analysis of HK students' English
Data collection system • Four tables in the database • System = database + interface (PHP + JS) • System • http://10.244.0.191/annotation/login.php • Contains about 40 computer-corrected essays • English teachers have been asked to try it • SQLite • Easily compatible with Python and PHP • Stored as a single file
Data collection system • Table: ESSAYS • Table: ANNOTATIONS • Stored in JSON format
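A minimal sketch of how an ESSAYS table and an ANNOTATIONS table could be stored in SQLite and read from Python. The column names (`body`, `essay_id`, `data`) and the annotation fields are assumptions for illustration, since the slides do not show the actual schema, and the sketch assumes each annotation is saved as a JSON string. An in-memory database is used here, whereas the real system is stored as a single file.

```python
import json
import sqlite3

# In-memory database for the sketch; pass a file path to keep data in one file.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE essays (
        id   INTEGER PRIMARY KEY,
        body TEXT NOT NULL
    );
    CREATE TABLE annotations (
        id       INTEGER PRIMARY KEY,
        essay_id INTEGER NOT NULL REFERENCES essays(id),
        data     TEXT NOT NULL  -- annotation serialized as JSON (assumed layout)
    );
    """
)

# Insert one essay and one annotation that refers to it.
essay_id = conn.execute(
    "INSERT INTO essays (body) VALUES (?)", ("Math lessons use English.",)
).lastrowid
annotation = {
    "original": "use English",
    "correction": "are conducted in English",
    "type": "collocation",
}
conn.execute(
    "INSERT INTO annotations (essay_id, data) VALUES (?, ?)",
    (essay_id, json.dumps(annotation)),
)
conn.commit()

# Read the annotation back and decode the JSON payload.
row = conn.execute(
    "SELECT data FROM annotations WHERE essay_id = ?", (essay_id,)
).fetchone()
loaded = json.loads(row[0])
print(loaded["correction"])
```

Keeping the annotation as a JSON string lets the PHP + JS interface and Python analysis scripts share one format without changing the table layout when new annotation fields are added.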
END • Thank you