LanguageTool - Part A 23-10-2017 David Ling
LanguageTool
• LanguageTool: open-source Java program
• Language_check: Python wrapper of LanguageTool; supports only up to v3.5 (LanguageTool is currently at v3.9)
• To use, you can
  • double-click 'languagetool.jar', or
  • run it as a localhost HTTP server via cmd
• Main papers
  • Daniel Naber, A Rule-Based Style and Grammar Checker, Diploma Thesis, University of Bielefeld, 2003
  • Marcin Miłkowski, Developing an open-source, rule-based proofreading tool, Software – Practice and Experience 2010, 40 (7), pp. 543-566. DOI: 10.1002/spe.971
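The localhost HTTP server option can also be queried from Python without the wrapper. The following is a minimal sketch, assuming the server listens on port 8081 and exposes the `/v2/check` endpoint with `language` and `text` form parameters. The port, the URL constant, and the helper names `build_check_request`, `summarize_matches` and `check` are assumptions for illustration and do not come from the slides.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumed address of a locally started LanguageTool server; adjust as needed.
SERVER_URL = "http://localhost:8081/v2/check"


def build_check_request(text, language="en-US", url=SERVER_URL):
    """Build (but do not send) a POST request asking the server to check text."""
    body = urlencode({"language": language, "text": text}).encode("utf-8")
    return Request(url, data=body, method="POST")


def summarize_matches(response_json):
    """Reduce the JSON reply to (offset, length, message, suggestions) tuples."""
    reply = json.loads(response_json)
    return [
        (m["offset"], m["length"], m["message"],
         [r["value"] for r in m.get("replacements", [])])
        for m in reply.get("matches", [])
    ]


def check(text, language="en-US"):
    """Send the request to a running server and summarize its reply."""
    with urlopen(build_check_request(text, language)) as resp:
        return summarize_matches(resp.read().decode("utf-8"))
```

Calling `check("It can by found.")` only works while a server is actually running at the assumed address; the other two helpers can be used offline.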
Rules in LanguageTool
• XML rules
  • grammar.xml (collaborative)
• Java rules
  • Rules that cannot be handled by XML rules (e.g. a missing closing parenthesis, a space after a comma)
  • Spell checking
  • n-gram frequency for potential homophones (like there - their)
  • There are only a few Java rules (according to Marcin's paper in 2010)
• XML rules use the following input features:
  • word token
  • part of speech of the token: postag (from dictionary)
  • chunk tag of the token (by OpenNLP)
XML rules
Total: 1704
XML rules – possible typo
Notes:
  MD: modal verb
  JJ.?: adjective
  VBN: verb, past participle
  DT: determiner: a, an, all, …
• rule name = "'as follow' (as follows)"
  • as
  • follow
  • [\.:,—\-–]
  suggests "as follows"
• rule name = "'by' + passive participle (be)"
  • postag = "MD"
  • by
  • postag = "JJ.?|VBN", except postag = "DT"
  suggests "be"
  Example: This can by consistent with… → This can be consistent with…
  Example: It can by found. → It can be found.
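To make the token-by-token matching concrete, here is a minimal Python sketch of the "'by' + passive participle" rule. It is not LanguageTool's implementation: the `Token` and `Slot` classes, the hand-assigned POS tags, and the assumption that a token may carry several readings (with an exception firing if any reading matches) are all illustrative.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Token:
    text: str
    postags: tuple  # possible readings; hand-assigned here, not from a dictionary


@dataclass
class Slot:
    """One pattern element; each regex that is set must fully match."""
    text: Optional[str] = None
    postag: Optional[str] = None
    except_postag: Optional[str] = None

    def matches(self, tok):
        if self.text is not None and not re.fullmatch(self.text, tok.text, re.IGNORECASE):
            return False
        if self.postag is not None and not any(
                re.fullmatch(self.postag, p) for p in tok.postags):
            return False
        # Assumption: the exception blocks the match if any reading fits it.
        if self.except_postag is not None and any(
                re.fullmatch(self.except_postag, p) for p in tok.postags):
            return False
        return True


# 'by' + passive participle (be): MD, "by", JJ.?|VBN except DT
BY_RULE = [Slot(postag="MD"), Slot(text="by"),
           Slot(postag="JJ.?|VBN", except_postag="DT")]


def find_matches(tokens, pattern):
    """Return start indices where every slot matches consecutive tokens."""
    n = len(pattern)
    return [i for i in range(len(tokens) - n + 1)
            if all(s.matches(t) for s, t in zip(pattern, tokens[i:i + n]))]


def apply_by_rule(tokens):
    """Replace the 'by' of each match with the suggested 'be'."""
    words = [t.text for t in tokens]
    for i in find_matches(tokens, BY_RULE):
        words[i + 1] = "be"
    return " ".join(words)


sentence = [Token("It", ("PRP",)), Token("can", ("MD",)), Token("by", ("IN",)),
            Token("found", ("VBN",)), Token(".", (".",))]
print(apply_by_rule(sentence))
```

The sketch joins tokens with spaces, so punctuation ends up separated from the preceding word; a real checker works on character offsets instead.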
XML rules – possible typo
Notes:
  VB[DNPZ]?: verb
  inflected: use, uses, used, …
• rule name = "miss use (misuse)"
  • miss
  • understand|spell|use|place|lead|…|dial, inflected, postag = "VB[DNPZ]?"
  suggests "mis" + token
  Example: These words are miss used. → These words are misused.
• Other randomly selected rules:
  • land lover (landlubber)
    <correction="landlubber"> The sailors considered John to be a serious land lover.
  • I/you/... thing (think)
    <correction="think|thinks"> I thing that's a good idea.
  • to get ride (rid) of
    <correction="rid"> Let's get ride of that broken chair.
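The "miss use (misuse)" rule combines a literal token, an inflected-form match, and a POS constraint. A rough Python sketch follows, under loud assumptions: `LEMMAS` is a hypothetical toy table standing in for LanguageTool's dictionary, the POS tags are supplied by hand, and only the verbs legible on the slide are listed (the elided ones stay elided).

```python
import re

# Only the base verbs legible on the slide; those hidden behind "…" are omitted.
BASE_VERBS = {"understand", "spell", "use", "place", "lead", "dial"}

# Hypothetical toy lemma table; a real checker looks lemmas up in its dictionary.
LEMMAS = {"use": "use", "uses": "use", "used": "use"}

VERB_TAG = re.compile(r"VB[DNPZ]?")


def fix_miss_verb(tagged):
    """tagged: list of (word, postag). Joins 'miss' + inflected verb into one word."""
    out = []
    i = 0
    while i < len(tagged):
        word, _ = tagged[i]
        if word.lower() == "miss" and i + 1 < len(tagged):
            nxt, tag = tagged[i + 1]
            lemma = LEMMAS.get(nxt.lower(), nxt.lower())
            if lemma in BASE_VERBS and VERB_TAG.fullmatch(tag):
                out.append("mis" + nxt)  # suggestion: "mis" + token
                i += 2
                continue
        out.append(word)
        i += 1
    return out


example = [("These", "DT"), ("words", "NNS"), ("are", "VBP"),
           ("miss", "VB"), ("used", "VBN")]
print(" ".join(fix_miss_verb(example)))
```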
XML rules - Grammar
Notes:
  WP: wh-pronoun: that, whatever, what, …
  WRB: wh-adverb: however, how, …
  VB.*: verb
  MD: modal verb
  inflected: be, is, am, are
• rule name = "will follows be ('he is would')"
  • postag = "W(RB|P)"
  • be, inflected
  • will|must, inflected
  message: redundant
  Example: How is would this approach be useful? → How is this … or How would this…
• rule name = "missing verb after 'if there'"
  • if, <exception scope="previous">as</exception>
  • there
  • <exception postag="VB.*|MD" /> <exception>[´`'’]</exception>
  message: missing verb
  Example: If there one who has … → If there is one who has …
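The exceptions in the "missing verb after 'if there'" rule can be read as guards on an otherwise simple three-token pattern. The sketch below is one possible reading in Python, not the actual rule engine: it assumes the third slot matches any token unless one of its readings is a verb or modal, or the token is an apostrophe, and it uses hand-assigned POS tags.

```python
import re

VERB_OR_MODAL = re.compile(r"VB.*|MD")
APOSTROPHE = re.compile(r"[´`'’]")


def missing_verb_after_if_there(tagged):
    """tagged: list of (word, tuple_of_postags).

    Returns the indices of 'if' where 'if there' is not followed by a verb.
    """
    hits = []
    for i in range(len(tagged) - 2):
        first, second = tagged[i][0].lower(), tagged[i + 1][0].lower()
        third_word, third_tags = tagged[i + 2]
        if first != "if" or second != "there":
            continue
        # <exception scope="previous">as</exception>: skip "as if there ..."
        if i > 0 and tagged[i - 1][0].lower() == "as":
            continue
        # Third token must be neither an apostrophe nor a verb/modal reading.
        if APOSTROPHE.fullmatch(third_word):
            continue
        if any(VERB_OR_MODAL.fullmatch(t) for t in third_tags):
            continue
        hits.append(i)
    return hits


bad = [("If", ("IN",)), ("there", ("EX",)), ("one", ("CD",)),
       ("who", ("WP",)), ("has", ("VBZ",))]
good = [("If", ("IN",)), ("there", ("EX",)), ("is", ("VBZ",)), ("one", ("CD",))]
print(missing_verb_after_if_there(bad), missing_verb_after_if_there(good))
```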
Randomly selected XML rules in Grammar
• some faculty... (some faculty members...)
  <correction="faculty members"> Three faculty support the change.
• all/most/some (of) + noun
  <correction="All students|All of the students"> All of students like mathematics.
• both... as well as (and)
  <correction="and"> He is both very rich as well as handsome.
• Use of past form with 'going to ...'
  <correction="write"> I'm going to wrote him.
• Who + verb (who know's/knows)
  <correction="Who cares"> Who care's?
• inspired with (by)
  <correction="inspired by"> The artist was inspired with the beauty of the mountains.
• beware PREPOSITION
  <correction="Beware of"> Beware about malware.
• objective case after with(out)/at/to/...
  <correction="to me|to her|to him|to us|to them"> Give it to I.
XML rules – commonly confused words
• rule name = "and than (then)"
  • and|since
  • than
  suggests: then
• rule name = "rather/other/different then (than)"
  • rather|other
  • then
  suggests: than
• Other rule names:
  • turned of (off)
  • 'economical (economic) growth' etc.
  • in the passed (in the past)
  • too go (to go)
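The two than/then rules are plain two-token patterns, so they can be approximated with regular expressions over the raw sentence. This is a simplification for illustration: LanguageTool matches on tokens rather than on strings, and the function name and test sentences here are made up.

```python
import re

# (pattern, replacement) pairs mirroring the two rules on the slide.
CONFUSION_RULES = [
    # "and than (then)": and|since followed by "than" -> "then"
    (re.compile(r"\b(and|since)\s+than\b", re.IGNORECASE), r"\1 then"),
    # "rather/other/different then (than)": rather|other followed by "then" -> "than"
    (re.compile(r"\b(rather|other)\s+then\b", re.IGNORECASE), r"\1 than"),
]


def apply_confusion_rules(sentence):
    """Apply each replacement in turn and return the corrected sentence."""
    for pattern, replacement in CONFUSION_RULES:
        sentence = pattern.sub(replacement, sentence)
    return sentence


print(apply_confusion_rules("He left and than came back."))
print(apply_confusion_rules("It is other then expected."))
```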
XML rules – redundant phrases & punctuation
Redundant phrases
• absolutely essential/necessary (essential/necessary)
  <correction="essential"> This is absolutely essential.
• established fact (fact)
  <correction="a fact"> This is an established fact.
• there are also other (also)
  <correction="there are other|there are also"> However, there are also other marbles in the jar.
Punctuation
• extraneous apostrophes before 'are'
  <correction="cars"> The car's are cheap.
• Comma after a month
  <correction="October 1958"> The store closed its doors for good in October, 1958.
• Missing comma between day of month and year
  <correction="October 18,"> My birthday is October 18 1983.
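The two date-punctuation rules lend themselves to a short regular-expression sketch. The patterns below are assumptions written for this illustration (they are not taken from grammar.xml) and only cover fully spelled-out English month names.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")

# "October, 1958" -> "October 1958"
COMMA_AFTER_MONTH = re.compile(rf"\b({MONTHS}),\s+(\d{{4}})\b")
# "October 18 1983" -> "October 18, 1983"
MISSING_COMMA = re.compile(rf"\b({MONTHS})\s+(\d{{1,2}})\s+(\d{{4}})\b")


def fix_date_punctuation(sentence):
    """Apply both date rules and return the corrected sentence."""
    sentence = COMMA_AFTER_MONTH.sub(r"\1 \2", sentence)
    sentence = MISSING_COMMA.sub(r"\1 \2, \3", sentence)
    return sentence


print(fix_date_punctuation("The store closed its doors for good in October, 1958."))
print(fix_date_punctuation("My birthday is October 18 1983."))
```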
N-gram data rule
• Resolves commonly confused word pairs, like their and there
• Given a confusion list (currently ~600 pairs), e.g. (their, there; adapting, adopting)
• Input sentence: This is there last chance to escape.
• The system compares the 3-gram frequencies for 'there' with those for 'their':
  • This is there, is there last, there last chance
  • This is their, is their last, their last chance
• Recommends 'their' if the probability ratio is greater than a threshold ratio
Remarks:
• n-gram data is from the Google Books Ngram Viewer
• Someone is developing a word2vec approach to calculate the probability instead of using 3-grams (context: {this, is, last, chance}, guessing {there, their})
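The comparison above can be sketched in a few lines of Python. Everything numeric here is an assumption: the trigram counts are made-up toy values (not Google Books data), the add-one smoothing and the product of counts as a stand-in for probability are simplifications, and the threshold `min_ratio=10.0` is hypothetical.

```python
from math import prod

# Made-up toy counts for illustration only; real counts come from n-gram data.
TRIGRAM_COUNTS = {
    ("this", "is", "there"): 50,
    ("is", "there", "last"): 1,
    ("there", "last", "chance"): 1,
    ("this", "is", "their"): 40,
    ("is", "their", "last"): 30,
    ("their", "last", "chance"): 60,
}


def trigrams_around(tokens, i):
    """All 3-grams of tokens that contain position i."""
    first = max(0, i - 2)
    last = min(i, len(tokens) - 3)
    return [tuple(tokens[s:s + 3]) for s in range(first, last + 1)]


def score(tokens, i, word):
    """Product of add-one-smoothed counts of the 3-grams covering position i."""
    candidate = tokens[:i] + [word] + tokens[i + 1:]
    return prod(TRIGRAM_COUNTS.get(g, 0) + 1 for g in trigrams_around(candidate, i))


def suggest(tokens, confusion_pairs, min_ratio=10.0):
    """Return (index, original, alternative) wherever the alternative wins clearly."""
    suggestions = []
    for i, tok in enumerate(tokens):
        for a, b in confusion_pairs:
            if tok in (a, b):
                alt = b if tok == a else a
                ratio = score(tokens, i, alt) / score(tokens, i, tok)
                if ratio > min_ratio:
                    suggestions.append((i, tok, alt))
    return suggestions


tokens = "this is there last chance to escape".split()
print(suggest(tokens, [("their", "there")]))
```

Requiring the ratio to exceed a threshold, rather than merely 1, keeps the rule from flagging cases where both words are about equally plausible.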
Next time
• other XML rules
• spell check
• chunking by OpenNLP
• references:
  • http://wiki.languagetool.org
  • https://community.languagetool.org/rule/list