English project: More detail and the data collection system

This project aims to evaluate, train, and improve an English error correction system for Hong Kong students, focusing on areas such as grammar, collocation, meaning, and style. The project utilizes deep learning techniques and data-driven approaches to provide more accurate and context-aware corrections. The system includes a database and interface for data collection and analysis.

sraymond

Presentation Transcript


  1. English project: More detail and the data collection system
     2018-08-24, David Ling

  2. Contents
     • Project background
     • Evaluation
     • Training
     • Data collection system

  3. Background
     • Project in charge: Holly Chung, Amy Kwok, Anora Wong (ENG)
     • Target:
       • English error correction for HK students
       • Highlight good practices (not well defined yet)
     • More than traditional grammar checkers: Chinglish, collocation, meaning, and style
       • Math lessons use English. → Math lessons are conducted in English.
       • He can say Chinese. → He can speak Chinese.

  4. Background
     • Old methods:
       • Rule based, e.g. Microsoft Word, LanguageTool
       • Statistical methods
     • New method:
       • Deep learning, translation, data driven
       • Chollampatt, 2018 (National University of Singapore)
       • Fairseq (Facebook) + language model (KenLM)
     • An error rule extracted from LanguageTool on subject-verb agreement:
       • Sentence start + determiner + plural noun + "is" (e.g. "The dogs is …", "The teachers is …")
       • Pattern matched → trigger correction
       • About 1.7k handcrafted error patterns
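The handcrafted rule on slide 4 (sentence start + determiner + plural noun + "is") can be illustrated with a minimal sketch. This is not LanguageTool's actual rule format or code: the word lists, the function name `check_sva`, and the choice of "are" as the replacement are assumptions made only for illustration.

```python
# Toy lexicons (hypothetical); a real rule-based checker would rely on a
# part-of-speech tagger rather than fixed word lists.
DETERMINERS = {"the", "these", "those", "some", "many"}
PLURAL_NOUNS = {"dogs", "teachers", "students", "lessons"}


def check_sva(sentence):
    """Apply one subject-verb-agreement rule at the start of a sentence.

    Pattern: determiner + plural noun + "is". When it matches, the verb is
    replaced with "are" (assumed correction); otherwise the sentence is
    returned unchanged.
    """
    tokens = sentence.split()
    if (
        len(tokens) >= 3
        and tokens[0].lower() in DETERMINERS
        and tokens[1].lower() in PLURAL_NOUNS
        and tokens[2].lower() == "is"
    ):
        tokens[2] = "are"
        return " ".join(tokens)
    return sentence


print(check_sva("The dogs is barking ."))
print(check_sva("The dog is barking ."))
```

Each rule of this kind covers only one surface pattern, which helps explain why a rule-based checker needs on the order of a thousand handcrafted patterns.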

  5. Input sentences
     • He go to schol tomorrow.
     • "I go to school by bus.", said David yesterday.
     • … she did not want another mother would also feeling it.
     • It can make the audiences having the same feeling on it.
     Set A: Deep learning
     • He will go to school tomorrow.
     • "I go to school by bus," said David yesterday.
     • … she did not want another mother to feel it.
     • It can make the audience feel the same way.
     Set B: Grammarly
     • He goes to school tomorrow.
     • "I go to school by bus.", said, David, yesterday.
     • … she did not want another mother would also feel it.
     • [No change]
     Which set is by deep learning? A
     • Correction based on the context
     • Recall more errors
     • Not just correcting errors, but also improving styles

  6. Evaluation – Four main steps
     INPUT: He go to schol tomorrow.
     1. Tokenize + byte pair encoding
        He go to scho@@ l tomorrow .
     2. Fairseq (beam search, 12 sentences)
        He will go to school tomorrow . ||| F0= -0.053
        He goes to school tomorrow . ||| F0= -0.375
        He is going to school tomorrow . ||| F0= -0.397
        …
     3. Language model (KenLM, 150GB)
        He will go to school tomorrow . ||| LM0= -24.9241
        He goes to school tomorrow . ||| LM0= -25.8588
        He is going to school tomorrow . ||| LM0= -25.1118
        …
     4. Reweighted with number of edit operations and sentence length
     OUTPUT: He will go to school tomorrow.
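The reweighting in step 4 of slide 6 can be sketched as a linear combination of the four signals the slide names: the Fairseq score (F0), the language model score (LM0), the number of edit operations, and the sentence length. The slides do not give the actual formula or weights, so the linear form, the `WEIGHTS` values, and all function names below are hypothetical. The scores are copied from the slide.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    f0: float   # Fairseq score, as listed on the slide
    lm0: float  # KenLM score, as listed on the slide


def edit_ops(src, hyp):
    """Word-level Levenshtein distance between source and hypothesis."""
    a, b = src.split(), hyp.split()
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        cur = [i]
        for j, wb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (wa != wb)))  # substitution or match
        prev = cur
    return prev[-1]


# Hypothetical weights; in practice they would be tuned on held-out data.
WEIGHTS = {"f0": 1.0, "lm0": 0.1, "edits": -0.05, "length": 0.02}


def score(src, cand):
    """Weighted sum of the four features for one candidate."""
    return (WEIGHTS["f0"] * cand.f0
            + WEIGHTS["lm0"] * cand.lm0
            + WEIGHTS["edits"] * edit_ops(src, cand.text)
            + WEIGHTS["length"] * len(cand.text.split()))


def rerank(src, candidates):
    """Return candidates sorted from best to worst combined score."""
    return sorted(candidates, key=lambda c: score(src, c), reverse=True)


SRC = "He go to schol tomorrow ."
CANDS = [
    Candidate("He will go to school tomorrow .", -0.053, -24.9241),
    Candidate("He goes to school tomorrow .", -0.375, -25.8588),
    Candidate("He is going to school tomorrow .", -0.397, -25.1118),
]
best = rerank(SRC, CANDS)[0]
print(best.text)
```

With these assumed weights the top-ranked candidate matches the output shown on the slide; different weights could change the order, which is why they would normally be tuned rather than fixed by hand.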

  7. Training – data sets
     • NUCLE (National University of Singapore), 2014
       • ~60k sentences (1500 essays)
     • Lang-8 (Japanese social networking website), 2012
       • ~2000k sentences
     • Coverage of topics and error types is far from sufficient, e.g. eSports

  8. Training - with additional training sentences
     • Before / After [comparison figure; surviving labels: "are conducted in", "be used in", "math"]
     • Recalled successfully
     • Chinese and Science are not in training data
     Five additional training sentences
     1. Math lessons used English . → Math lessons were conducted in English .
     2. Physics lessons used English . → Physics lessons were conducted in English .
     3. Biology classes use English . → Biology classes are conducted in English .
     4. History lessos used English . → History lessons was conducted in English .
     5. Philosophy lessons often used English . → Philosophy lessons are conducted in English often

  9. Grammar correction data set
     • Building a data set for Hong Kong students
     • Improvement on the checker
     • Different sentence style
     • Different error types
     • Literature value
     • Statistical analysis on HK students' English

  10. Data collection system
     • Four tables in the database
     • System = database + interface (PHP + JS)
     • System
       • http://10.244.0.191/annotation/login.php
       • Contains about 40 computer-corrected essays
       • English teachers have been asked to try it
     • SQLite
       • Easily compatible with Python and PHP
       • Stored as a single file

  11. Data collection system
     • Table -- ESSAYS
     • Table -- ANNOTATIONS
     • Stored in JSON format
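A minimal sketch of how the ESSAYS and ANNOTATIONS tables could be stored in SQLite with annotations serialized as JSON follows. The slides name the two tables and the JSON format but show no schema, so every column name, the annotation fields, and the helper functions here are assumptions. An in-memory database stands in for the single SQLite file.

```python
import json
import sqlite3

# In-memory database for the sketch; the deck describes a single file on disk.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE essays (
    essay_id INTEGER PRIMARY KEY,
    body     TEXT NOT NULL
);
CREATE TABLE annotations (
    annotation_id INTEGER PRIMARY KEY,
    essay_id      INTEGER NOT NULL REFERENCES essays(essay_id),
    data          TEXT NOT NULL  -- annotation serialized as JSON (assumed layout)
);
""")


def add_annotation(conn, essay_id, annotation):
    """Store one annotation dict as a JSON string; return its row id."""
    cur = conn.execute(
        "INSERT INTO annotations (essay_id, data) VALUES (?, ?)",
        (essay_id, json.dumps(annotation)),
    )
    return cur.lastrowid


def get_annotations(conn, essay_id):
    """Load all annotations of an essay back into Python dicts."""
    rows = conn.execute(
        "SELECT data FROM annotations WHERE essay_id = ? ORDER BY annotation_id",
        (essay_id,),
    )
    return [json.loads(row[0]) for row in rows]


essay_id = conn.execute(
    "INSERT INTO essays (body) VALUES (?)", ("He can say Chinese.",)
).lastrowid
add_annotation(conn, essay_id, {
    "start": 7, "end": 10,  # character span of "say" in the essay body
    "original": "say", "correction": "speak", "type": "collocation",
})
print(get_annotations(conn, essay_id))
```

Keeping the annotation as a JSON string lets the PHP + JS interface and Python analysis scripts share one record format without changing the table layout when new annotation fields are added.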

  12. END • Thank you