Sketch Engine for Chinese: discussion notes
Wordsketch, subsequently Sketch Engine
• Developed by Kilgarriff et al. at Brighton
• Gives automatic, corpus-based summaries of a word’s grammatical and collocational behaviour
• Captures information in a more accessible way than hundreds of KWIC lines
• Uses a salience algorithm based on mutual information (MI)
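To make the MI-based salience idea concrete, here is a minimal Python sketch. It assumes salience is pointwise mutual information weighted by the log of the pair frequency. That weighting, the function name `mi_salience`, and the toy counts are illustrative assumptions, not necessarily the formula Sketch Engine itself uses.

```python
import math
from collections import Counter

def mi_salience(pair_counts):
    """Score (word, collocate) pairs: pointwise MI times log frequency.

    The log-frequency weighting is an assumption for illustration; it damps
    the tendency of plain MI to over-reward very rare pairs.
    """
    total = sum(pair_counts.values())
    word_freq = Counter()
    colloc_freq = Counter()
    for (word, colloc), n in pair_counts.items():
        word_freq[word] += n
        colloc_freq[colloc] += n

    scores = {}
    for (word, colloc), n in pair_counts.items():
        # MI compares the observed pair count with the count expected
        # if word and collocate were independent.
        mi = math.log2(n * total / (word_freq[word] * colloc_freq[colloc]))
        scores[(word, colloc)] = mi * math.log(n + 1)
    return scores

# Hypothetical toy counts of (noun, verb taking it as object)
pair_counts = {
    ("bank", "rob"): 8,
    ("bank", "see"): 2,
    ("house", "see"): 18,
    ("house", "rob"): 2,
}
scores = mi_salience(pair_counts)
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 2))
```

With these counts, rob comes out as a salient collocate of bank, while see, which is frequent with every noun, scores below zero.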
Other corpus query tools do collocational salience too, but…
• Sketch Engine uses lemmata, not word-forms
  • So that eat and eats are treated the same
• And it takes account of grammatical relations
  • So that The plane banks and The investment banks are treated separately
  • And (if the corpus is appropriately parsed) He robs banks and He robbed the bank would be accorded similar treatment
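The two points above, counting over lemmata and keeping grammatical relations apart, can be sketched in Python as follows. The `LEMMAS` table, the relation names, and the hand-written triples are hypothetical stand-ins for the output of a real lemmatizer and parser.

```python
from collections import Counter

# Hypothetical toy lemma table; a real system would use a lemmatizer.
LEMMAS = {
    "eats": "eat", "ate": "eat",
    "robs": "rob", "robbed": "rob",
    "banks": "bank",
}

def lemma(word_form):
    """Map a word-form to its lemma, defaulting to the lowercased form."""
    form = word_form.lower()
    return LEMMAS.get(form, form)

def count_relations(triples):
    """Count (relation, head lemma, dependent lemma) triples."""
    counts = Counter()
    for relation, head, dependent in triples:
        counts[(relation, lemma(head), lemma(dependent))] += 1
    return counts

# Hand-written triples standing in for parser output (relation names assumed)
parsed = [
    ("object", "robs", "banks"),          # He robs banks
    ("object", "robbed", "bank"),         # He robbed the bank
    ("subject", "banks", "plane"),        # The plane banks
    ("modifier", "banks", "investment"),  # The investment banks
]
counts = count_relations(parsed)
print(counts[("object", "rob", "bank")])  # both "rob ... bank" sentences pool here
```

Because the relation is part of the key, the verbal use in The plane banks and the nominal use in The investment banks land in different buckets, while the two rob sentences are pooled.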
Grammatical relations example
• Unary relations: Word2 and Prep are not specified
• Binary relations: Prep is not specified
• Binary relations: Word2 is not specified
• Trinary relations
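One way to hold these relation types in a single record is sketched below in Python. The class name `Relation`, its field names, and the example relation names are assumptions for illustration. The sketch also assumes that a trinary relation is one in which both Word2 and Prep are filled.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Relation:
    """A grammatical relation instance; word2 and prep may be left unspecified."""
    name: str
    word1: str
    word2: Optional[str] = None
    prep: Optional[str] = None

    @property
    def arity(self):
        # Count how many of the optional slots are filled.
        filled = sum(slot is not None for slot in (self.word2, self.prep))
        return ("unary", "binary", "trinary")[filled]

# Hypothetical examples, one per relation type
examples = [
    Relation("passive", "rob"),
    Relation("object", "rob", word2="bank"),
    Relation("pp", "look", prep="at"),
    Relation("pp_obj", "look", word2="picture", prep="at"),
]
for rel in examples:
    print(rel.arity, rel)
```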
Sketch Engine modules
• Concordance
  • KWIC or sentence context
• Thesaurus
  • A list of “similar” words
• Sketch differences, for distinguishing near-synonyms
  • If both lemmata x and y have strong collocational salience with a, then they are near-synonyms
• Wordsketch
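The near-synonym criterion above can be illustrated with a small Python sketch: two lemmata count as similar to the extent that they share salient collocates, and a sketch difference separates the shared collocates from the distinctive ones. The scoring rule (summing the smaller salience over shared collocates), the function names, and all salience values are assumptions made for this example.

```python
def shared_salience(sketch_x, sketch_y):
    """Sum the smaller salience over collocates that both lemmata share."""
    common = sketch_x.keys() & sketch_y.keys()
    return sum(min(sketch_x[c], sketch_y[c]) for c in common)

def sketch_difference(sketch_x, sketch_y):
    """Split collocates into those shared and those specific to each lemma."""
    return {
        "shared": sorted(sketch_x.keys() & sketch_y.keys()),
        "only_x": sorted(sketch_x.keys() - sketch_y.keys()),
        "only_y": sorted(sketch_y.keys() - sketch_x.keys()),
    }

# Hypothetical word sketches: (relation, collocate) -> salience
sketches = {
    "clever": {("modifies", "idea"): 5.0, ("modifies", "trick"): 6.0,
               ("modifies", "girl"): 3.0},
    "intelligent": {("modifies", "girl"): 4.0, ("modifies", "idea"): 2.0,
                    ("modifies", "life"): 7.0},
    "wooden": {("modifies", "table"): 8.0, ("modifies", "floor"): 6.5},
}

similar = shared_salience(sketches["clever"], sketches["intelligent"])
unrelated = shared_salience(sketches["clever"], sketches["wooden"])
diff = sketch_difference(sketches["clever"], sketches["intelligent"])
print(similar, unrelated)
print(diff["only_x"], diff["only_y"])
```

A thesaurus entry for a lemma would then be the other lemmata ranked by this shared score. The sketch difference shows which collocates tell a pair of near-synonyms apart.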
Sample of grammatical relation definitions script (M4 language)
• define(`wh_word',`[tag="AVQ"|tag="DTQ"|tag="PNQ"]')
• define(`whether_if',`[tag="PNQ" & word="if" |word="whether"]')
• define(`determiner',`[tag="AT."|tag="DT."|tag=poss_pro]')
• define(`conjunction',`"CJC"')
• define(`simple_neg',`"XX."')
• define(`rel_start',`[tag="DTQ"|tag="PNQ"|tag=that_comp]')
• define(`adv_neg',`[tag=any_adv|tag=simple_neg]')
• define(`number',`"[OC]RD"')
• define(`goal_adv',`[word="back"|word="over"|word="home"|word="away"|word="out"]')
• define(`long_np',`[tag="AT."|tag="DT."|tag=poss_pro|tag=number|tag=any_adv|tag=any_adj|tag=genitive]{0,3} any_noun{0,2} 2:any_noun [tag!=any_noun & tag != genitive]')
• define(`np_start',`[tag="AT."|tag="DT."|tag=poss_pro|tag=number|tag=any_adj|tag=any_noun]')
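The tag values in these definitions are regular expressions over part-of-speech tags, so "AT." matches any tag beginning with AT and "[OC]RD" matches both ORD and CRD. A minimal Python sketch of that matching step follows. The table `TAG_CLASSES` and the function `tag_in_class` are hypothetical, and the value used for poss_pro is an assumption, since its definition is not part of the sample.

```python
import re

# Tag-level regexes taken from the sample definitions. "DPS" stands in for
# poss_pro, whose real definition is not shown in the sample (assumption).
TAG_CLASSES = {
    "determiner": ["AT.", "DT.", "DPS"],
    "conjunction": ["CJC"],
    "simple_neg": ["XX."],
    "number": ["[OC]RD"],
}

def tag_in_class(tag, class_name):
    """True if the whole tag matches any regex in the named class."""
    return any(re.fullmatch(pattern, tag) for pattern in TAG_CLASSES[class_name])

# A hand-tagged toy sentence (tags assumed for illustration)
tagged = [("the", "AT0"), ("two", "CRD"), ("banks", "NN2")]
classes = [
    (word, [name for name in TAG_CLASSES if tag_in_class(tag, name)])
    for word, tag in tagged
]
print(classes)
```

A ruleset for Chinese would need the same kind of table, written against whatever tagset the POS tagger for the Chinese corpus produces.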
Applications
• Intended as an aid to lexicographers
• At least one paper on an MT application
• Could be used in pedagogical applications
  • An earlier NSF grant aimed at a complete Chinese learning platform, with Wordsketch as a module
• Comparison of similar lexemes cross-linguistically
  • Yiching is publishing about express vs biaoshi, and this work may use Wordsketch
Chinese Wordsketch
• Kilgarriff et al. report that Wordsketch can be ported to any language
• Pavel Rychly in the Czech Republic has implemented concordancing at the Chinese character level only
• AS has acquired Chinese Gigaword and POS-tagged it automatically
• No parsing has been attempted so far
• A grammatical relations ruleset for Chinese is needed
• I would plan to
  • contribute to the writing of this ruleset
  • collaborate on cross-linguistic lexical analyses, using Wordsketch where possible
Links
• http://nlp.fi.muni.cz/projects/bonito2/chinese/
  • test chin
• http://www.sketchengine.co.uk/sampler/
  • ssmith ssmith