Chinese WordSketch Online, corpus-based summaries of word usage
Participants • Adam Kilgarriff, Lexical Computing, UK • David Tugwell, Tech University Budapest • Pavel Rychly, Brno University • Simon Smith, Ming Chuan University (Academia Sinica) • 黃居仁 (Chu-Ren Huang), Academia Sinica • 巫宜靜, Tsing Hua University (Academia Sinica)
Facing the problem: lexical choice • “You shall know a word by the company it keeps” (Firth, 1957) • The meaning of face depends on the collocation (詞語搭配) • 學漢語的外國人要面對詞語選擇的問題 (Foreigners learning Chinese have to face the problem of lexical choice) • 許多種動物正在面臨絕種 (Many animal species are facing extinction) • Similarly with save • Save money • Save life • Save a seat for me
Look in a dictionary? A corpus? • Some modern English dictionaries give some collocation (詞語搭配) information • Chinese dictionaries give very limited help • Since the 1980s, corpus KWIC (KeyWord In Context) concordances have been available
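As a rough illustration of what a KWIC concordance is, the sketch below lists every occurrence of a keyword with a few words of context on each side. The function name `kwic`, the `width` parameter, and the toy sentence (built from the save examples above) are assumptions made for illustration; they are not part of any tool described in these slides.

```python
def kwic(tokens, keyword, width=3):
    """Return one (left context, keyword, right context) triple per hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

# Toy text, not real corpus data.
text = "please save a seat for me while we save money to save life"
concordance = kwic(text.split(), "save")

# Print the hits with the keyword aligned in a central column.
for left, kw, right in concordance:
    print(f"{left:>20}  [{kw}]  {right}")
```

With a real corpus the same loop would produce hundreds or thousands of lines per word, which is the scaling problem the next slides address.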
Pre-computer corpus! • Oxford English Dictionary: 20 million index cards
The coloured pens method • 1 political association • 2 social event • 3 group of people • 4 person in an agreement/dispute • 5 to be party to something...
Limitations of KWIC analysis • As corpora get bigger: too much data • 50 lines for a word: read all • 500 lines: could read all, but it takes a long time • 5000 lines: not feasible • Instead, create a statistical summary of word usage • Show the most salient (最有顯著性) collocates, ranked by Mutual Information
Mutual Information • Church and Hanks 1989 • MI: how much more often does a word pair occur than one would expect by chance: MI(x, y) = log2 [ P(x, y) / (P(x) P(y)) ]
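A minimal sketch of the Mutual Information score, MI(x, y) = log2 of P(x, y) divided by P(x) P(y), may make the idea concrete. It assumes the probabilities are estimated as simple relative frequencies over a list of observed (word, collocate) pairs; the function name and the toy counts are invented for illustration and are not taken from the corpus.

```python
import math
from collections import Counter

def mutual_information(pairs):
    """Score each observed (x, y) pair with log2(P(x, y) / (P(x) * P(y)))."""
    n = len(pairs)
    pair_freq = Counter(pairs)
    x_freq = Counter(x for x, _ in pairs)
    y_freq = Counter(y for _, y in pairs)
    scores = {}
    for (x, y), f_xy in pair_freq.items():
        p_xy = f_xy / n          # joint relative frequency
        p_x = x_freq[x] / n      # relative frequency of the word
        p_y = y_freq[y] / n      # relative frequency of the collocate
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores

# Toy observations (hypothetical counts).
pairs = ([("save", "money")] * 2
         + [("save", "the")] * 2
         + [("read", "the")] * 4)
scores = mutual_information(pairs)
```

In this toy data save + money scores above zero, because money occurs only with save, while save + the scores below zero, because the is frequent everywhere. That is the sense in which MI picks out salient collocates rather than merely frequent ones.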
Collocation listing • Right collocates of save (more than 5 hits)
Limitations of collocation listing • Some items are not genuine collocates • yours appears only because it is adjacent to save • The collocates can belong to any part of speech • It would be better if they were classified by POS and by the role they play in the sentence • Thus, for arrest in “The police were quick to arrest a number of suspects on the spot”, we would like to see • Keyword: arrest • Subject: police • Object: suspect(s) • Modifier: on the spot
Wordsketch • Attempts to meet these requirements • A corpus-derived one-page summary of a word’s grammatical and collocational behaviour • Implemented for English and Czech • Chinese and Irish implementations in progress
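The one-page summary can be pictured as collocates grouped by grammatical relation. The sketch below assumes the grammatical relations have already been extracted as (keyword, relation, collocate) triples, and it ranks collocates by raw frequency as a stand-in for a salience score. The function name and the triples are hypothetical; only the arrest example comes from the slides.

```python
from collections import Counter, defaultdict

def word_sketch(triples, keyword):
    """Group the collocates of keyword by relation, most frequent first."""
    by_relation = defaultdict(Counter)
    for word, relation, collocate in triples:
        if word == keyword:
            by_relation[relation][collocate] += 1
    return {rel: counts.most_common() for rel, counts in by_relation.items()}

# Hypothetical triples; a real system would derive them from a parsed
# or pattern-matched corpus.
triples = [
    ("arrest", "subject", "police"),
    ("arrest", "object", "suspect"),
    ("arrest", "modifier", "on the spot"),
    ("arrest", "subject", "police"),
    ("arrest", "object", "man"),
    ("save", "object", "money"),
]
sketch = word_sketch(triples, "arrest")

for relation, collocates in sketch.items():
    print(relation, collocates)
```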
The corpus: Chinese Gigaword • A Linguistic Data Consortium corpus • Very large: over 1 billion characters • Compiled by David Graff & Ke Chen in 2003 • Minimally tagged • 286 newswire stories, half from each of: • CNA Taiwan (740 million traditional characters) • Xinhua PRC (380 million simplified characters) • Corpus was segmented and tagged using Academia Sinica tools
http://corpora.fi.muni.cz/chinese/ • 逮捕 (arrest) • 教 (teach) • 學習 (study) • 銀行 (bank) • 捉 (catch)
Functions • KWIC concordance • Sorting, filtering etc • Word sketch • Automatic thesaurus • Sketch difference • discriminate near-synonyms • In development • key words in a subcorpus / text type • how word varies with text type
Grammar writing • Uses CQL (Corpus Query Language) • Christ and Schulze, U. Stuttgart, 1994 • Defining an object: v (adj|n|det|num|adv)* n • Rewritten in CQL with BNC/CLAWS-5 tags: [tag="VV.*"] [tag="(A[JTV]|D|O).*"]* [tag="NN.*"]
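To show how the object pattern behaves, here is a small Python emulation of the CQL query over a list of (word, tag) pairs. It is a sketch, not the Sketch Engine implementation: it assumes each tag regular expression must match the whole tag, and the example sentence and its CLAWS-5 tags were assigned by hand for illustration.

```python
import re

# One compiled regex per token position in the CQL query.
VERB = re.compile(r"VV.*")               # [tag="VV.*"]
MIDDLE = re.compile(r"(A[JTV]|D|O).*")   # [tag="(A[JTV]|D|O).*"]*
NOUN = re.compile(r"NN.*")               # [tag="NN.*"]

def find_objects(tagged):
    """Return (verb, noun) pairs matching: verb, optional modifiers, noun."""
    pairs = []
    for i, (word, tag) in enumerate(tagged):
        if not VERB.fullmatch(tag):
            continue
        j = i + 1
        # Skip any run of adjectives, articles, adverbs, determiners, ordinals.
        while j < len(tagged) and MIDDLE.fullmatch(tagged[j][1]):
            j += 1
        # The next token must be a noun for the pattern to match.
        if j < len(tagged) and NOUN.fullmatch(tagged[j][1]):
            pairs.append((word, tagged[j][0]))
    return pairs

# Hand-tagged example sentence (hypothetical tagging).
tagged = [("The", "AT0"), ("police", "NN2"), ("arrested", "VVD"),
          ("the", "AT0"), ("main", "AJ0"), ("suspects", "NN2")]
objects = find_objects(tagged)
print(objects)
```

Because no noun tag can match the middle expression, skipping modifiers greedily gives the same result as the starred CQL element would.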
Further work • Improve grammatical relations, especially sentence objects, to account for • topicalization (啤酒,葡萄酒,他都愛喝: “Beer, wine, he loves to drink them all”) • 把 fronting (請把啤酒喝完: “Please finish the beer”) • Create a “Dr Eye” style interface to show common collocations online, within a text
English version available • For personal use • www.sketchengine.co.uk • You are welcome to register and make good use of it!