
Stephen Doherty, CNGL/SALIS stephen.doherty2@mail.dcu.ie



Presentation Transcript


  1. Current Research A comparative investigation of the readability and comprehensibility of SMT and RBMT output for controlled and uncontrolled input Stephen Doherty, CNGL/SALIS stephen.doherty2@mail.dcu.ie

  2. Overview • Past Research • Readability & Comprehensibility • Controlled Language • Research Proposal (Methodology) • Evaluation (Eye Tracking) • Conclusion

  3. Past Research • Translating Versus Post-Editing: A Segmentation Comparison Based on Pauses (B.A. Dissertation) • Think-Aloud Protocols in Translation Studies (Interessen der kognitiv orientierten Translationswissenschaft)

  4. Research Proposal • CNGL Work Package: ILT1.8 Controlled Language: • Supervisors – Dr. Sharon O’Brien, Dr. Dorothy Kenny • “adapt the systems developed by other ILT WPs to deal with in-house data which conforms to both source and target controlled language guidelines”

  5. Readability & Comprehensibility • What is readability? • (Gray 1935: “In the reader, those features affecting readability are 1. prior knowledge, 2. reading skill, 3. interest, and 4. motivation. In the text, those features are 1. content, 2. style, 3. design, and 4. structure”.) • What is comprehensibility?

  6. Readability & Comprehensibility • Metrics: (Reading scores, recall tests...) • E.g. Flesch Reading Ease: • Gunning-Fog Index – SMOG (Simple Measure of Gobbledygook) (McLaughlin 1969)
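The three metrics named on this slide have widely published formulas. The slide's own formula did not survive in this transcript, so the Python sketch below uses the commonly cited constants rather than anything taken from the presentation. The function names, the vowel-group syllable heuristic, and the "three or more syllables" cut-off for complex words are illustrative assumptions.

```python
import math
import re


def count_syllables(word):
    """Naive heuristic (an assumption): one syllable per run of vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def text_counts(text):
    """Rough counts for a text; real tools tokenize far more carefully."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = [count_syllables(w) for w in words]
    return {
        "words": len(words),
        "sentences": len(sentences),
        "syllables": sum(syllables),
        # Words with three or more syllables, used by Gunning-Fog and SMOG.
        "polysyllables": sum(1 for s in syllables if s >= 3),
    }


def flesch_reading_ease(words, sentences, syllables):
    """Higher scores indicate easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def gunning_fog(words, sentences, complex_words):
    """Approximate years of schooling needed to follow the text."""
    return 0.4 * ((words / sentences) + 100.0 * (complex_words / words))


def smog_grade(polysyllables, sentences):
    """SMOG grade, normalised to a 30-sentence sample."""
    return 1.0430 * math.sqrt(polysyllables * (30.0 / sentences)) + 3.1291
```

All three scores depend only on surface counts (sentence length, word length), which is why the slide pairs them with recall tests as a separate measure.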

  7. Controlled Language • What is controlled language? “an explicitly defined restriction of a natural language that specifies constraints on lexicon, grammar, and style” (Huijsen, 1998)

  8. Controlled Language • Types of CL: • Human-Orientated Controlled Language (HOCL): readability & comprehensibility e.g. AECMA Simplified English • Machine-Orientated Controlled Language (MOCL): improved translatability, MT system specific (Huijsen, 1998)

  9. Controlled Language • Examples of CLs: AECMA Simplified English, Sun Microsystems' Controlled English, IBM Easy English, Caterpillar Technical English, GM... • Usage (mostly English, but…) • Symantec (CNGL Industry Partner)

  10. Controlled Language • Roturier (2006): • Consistent spelling (54) • Do not use pronouns that have no specific referent (19) • Avoid unusual punctuation (35) • Avoid embedded clauses introduced by commas or dashes (41) • Do not use more than 25 words per sentence (5) • Use a question mark only at the end of a direct question (48)
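Two of the rules listed on this slide are mechanical enough to check automatically. The Python sketch below is a hypothetical illustration of that idea only. It is not Acrocheck and not Symantec's rule set, and the function names and the sentence splitter are assumptions.

```python
import re

MAX_WORDS = 25  # from the slide: "Do not use more than 25 words per sentence"


def split_sentences(text):
    """Rough splitter (an assumption): break after ., ! or ? plus whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def check_sentence(sentence):
    """Return descriptions of the rules that one sentence violates."""
    problems = []
    if len(sentence.split()) > MAX_WORDS:
        problems.append("more than 25 words")
    # A question mark anywhere except the final character is flagged.
    # Whether the sentence is a *direct* question is not checked here.
    if "?" in sentence[:-1]:
        problems.append("question mark not at end of sentence")
    return problems


def check_text(text):
    """Return (sentence, problems) pairs for every sentence with a violation."""
    report = []
    for sentence in split_sentences(text):
        problems = check_sentence(sentence)
        if problems:
            report.append((sentence, problems))
    return report
```

Rules such as "do not use pronouns that have no specific referent" need linguistic analysis and cannot be handled by surface checks like these.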

  11. Controlled Language • O’Brien (2003) - three types of rule categories: • Lexical (e.g. Rules that allow or rule out the use of specific acronyms or abbreviations) • Syntactic (e.g. specifying when and where past participles can be used and avoiding the present participle) • Textual: • Text Structure (e.g. Specifying admissible sentence length) • Pragmatic (e.g. Using certain verb forms for specific text purposes – imperative for instructions)

  12. Research Proposal A comparative investigation of the readability and comprehensibility of SMT and RBMT output for controlled and uncontrolled input

  13. Hypotheses I. Controlled input to an MT system results in a higher level of readability and comprehensibility than uncontrolled input. II. The above is true regardless of whether the MT system is rule-based or statistics-based.

  14. Proposed Methodology • A corpus will be gathered to train the MT system (DCU School of Computing) • A set of CL rules (Symantec) • Four corpora (Symantec): 1. Uncontrolled English – IT security domain 2. Same corpus but with Symantec CL rules applied using Acrocheck, an authoring control tool 3. RBMT output in French for corpus one 4. RBMT output in French for corpus two

  15. Proposed Methodology • Most of the uncontrolled and controlled bilingual corpora (the training data) will then be used to train the SMT system. • The remaining subset of the source-language side of corpora one and two (the test data) will then be translated using the resulting MT system (exact size/composition to be decided).
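The train/test division described on this slide can be sketched in Python as follows. The slide leaves the exact size and composition undecided, so the `test_fraction` default, the random shuffling, and the function name are placeholder assumptions rather than the project's actual procedure.

```python
import random


def split_parallel_corpus(pairs, test_fraction=0.1, seed=0):
    """Split (source, target) sentence pairs into training and test sets.

    test_fraction is a hypothetical value; the slide does not fix one.
    """
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(pairs)
    # A seeded generator keeps the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


if __name__ == "__main__":
    corpus = [(f"English sentence {i}", f"French sentence {i}") for i in range(20)]
    train, test = split_parallel_corpus(corpus, test_fraction=0.25, seed=1)
    # Only the source-language side of the held-out pairs is sent to the MT system.
    test_sources = [source for source, _ in test]
    print(len(train), len(test_sources))
```

Holding the test sentences out of training matters here because the comparison between controlled and uncontrolled input would be skewed if the SMT system had already seen the test material.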

  16. Evaluation • Both automatic and human evaluation (focus) • Automatic evaluation (BLEU…) • Human evaluation: eye tracking & retrospective protocols (recall tests & interviews)
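Assuming the automatic metric named on this slide is BLEU, the Python sketch below shows the core of that score for one candidate translation against one reference: clipped n-gram precision for n = 1 to 4, combined by a geometric mean and scaled by a brevity penalty. It is a simplified, unsmoothed, sentence-level illustration with assumed function names. Standard BLEU is computed over a whole test corpus, and this is not the evaluation setup of the project.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of length n in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Unsmoothed BLEU for one tokenized candidate and one tokenized reference."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = ngram_counts(candidate, n)
        ref = ngram_counts(reference, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
        total = sum(cand.values())
        if total == 0 or overlap == 0:
            return 0.0  # without smoothing, one empty order zeroes the score
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: candidates shorter than the reference are penalised.
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1.0 - r / c)
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)
```

A score of this kind measures n-gram overlap with a reference translation, not how easily a reader processes the output, which is consistent with the slide placing the focus on human evaluation.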

  17. Evaluation • Eye Tracking: • What is it exactly? (background) • Successful application in this research area • Tobii Eye Tracker & ClearView software • Additional video recording, keystroke & mouse logging

  18. Tobii 1750 Eye Tracker (www.tobii.se)

  19. Evaluation • Recall tests (comprehensibility) • Retrospective interviews (generation of additional data & resolving possible issues)

  20. In Conclusion… • What: SMT & RBMT output given controlled and uncontrolled input • How: Automatic and human evaluation (eye tracking) • Why (Future): Success of application of CL, comparison of MT systems with & without CL usage, Controlled Translation, implementing new technology & methodologies in research area, commercial benefits...

  21. Thanks for your attention! Questions?
