Using corpora to study Classifiers in Mandarin Chinese • Richard Xiao • z.xiao@lancaster.ac.uk
Chinese corpus linguistics • In relation to English, Chinese has a much shorter history of using corpora • Sinica Balanced Corpus of Chinese • The first annotated corpus of Mandarin • Freely accessible online since the mid-1990s • Rapid progress over the last decade • Corpus building and exploration technology • Publicly available corpus resources COST Action A31 WG1 Meeting
Chinese text processing • Computational processing of Chinese text is more complex than that of English • Chinese text is encoded in double-byte native encodings • Potential confusion of bytes in running text • GB2312 for Simplified Chinese (SC) and Big5 for Traditional Chinese (TC) • The advent of Unicode has facilitated Chinese computing • But most existing data and tools are based on native encodings • Word tokenization is an essential first step in serious Chinese computing • Defining legitimate “words” in running text • Involving dictionary matching and the use of statistical models • Part-of-speech tagging depends on the results of tokenization • Accuracy of tokenization: 98% • Accuracy of POS tagging: 96%
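The dictionary-matching step mentioned on this slide can be illustrated with a minimal sketch of forward maximum matching, one common dictionary-based segmentation heuristic. The slides do not say which algorithm or lexicon was used for the corpora, so the function name, the toy lexicon, and the `max_len` parameter are all assumptions made for illustration; the statistical models that real tokenizers add on top are not shown.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Segment text greedily, taking the longest lexicon entry at each position."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until the lexicon matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback token.
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Toy lexicon (hypothetical); a real system would load a large dictionary.
LEXICON = {"三", "本", "书", "一", "个", "照相机", "电话", "我", "买", "了"}

print(forward_max_match("我买了一个照相机", LEXICON))
```

Greedy matching of this kind is fast but cannot resolve overlapping ambiguities on its own, which is one reason dictionary matching is usually combined with statistical models.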
Concordancers for Chinese • Many concordancers designed for English do not work well with Chinese data • There are presently three types of tools for Chinese • Unicode-based tools • WordSmith version 4 (Commercial product) • Xaira (open source freeware) • Concordancers dependent on language support packs (or in WinXP, default non-Unicode font set as Chinese) • AntConc (freeware) • ConcApp (freeware) • MonoConc Pro (commercial product) • Concordance (shareware) • Web-based query systems bundled with specific online corpora
Chinese corpus resources • Sinica Balanced Corpus • http://www.sinica.edu.tw/SinicaCorpus/ • Sinica Tagged Corpus of Early Mandarin • http://www.sinica.edu.tw/Early_Mandarin/ • Modern Chinese Language Corpus • http://219.238.40.213:8080/CpsQrySv.srf • PKU-CCL Chinese Corpus • http://ccl.pku.edu.cn/YuLiao_Contents.Asp • BLCU Modern Chinese Corpus • http://202.112.195.8:8089/ccir_login?input=* • Chinese Internet Corpus • http://corpus.leeds.ac.uk/query-zh.html • Lancaster Corpus of Mandarin Chinese • http://www.ling.lancs.ac.uk/corplang/lcmc/ • Lancaster Los Angeles Spoken Chinese Corpus • http://www.ling.lancs.ac.uk/corplang/llscc/ • Details of further corpora, in Chinese and other languages, are on the handout
Lancaster Corpus of Mandarin Chinese (LCMC) • Designed as a Chinese match for FLOB and Frown • Representing written Mandarin as used in mainland China in the early 1990s • A balanced corpus of one million words in 500 samples proportionally taken from 15 text categories • Marked up in XML and encoded in Unicode • Tokenized and POS tagged • Freely searchable online • http://www.ling.lancs.ac.uk/corplang/cgi-bin/conc.pl • Released by ELRA and OTA free of charge for academic and educational purposes • An indexed version for use with Xaira is available • V1.2 incorporates validated details of classifier use
Lancaster Los Angeles Spoken Chinese Corpus (LLSCC) • One million words of spoken Mandarin • Both dialogues (55%) and monologues (45%) • Both spontaneous (57%) and scripted (43%) speech • Seven spoken registers • face-to-face conversation, telephone conversation, play/movie scripts, TV talk show transcripts, formal debates, spontaneous oral narrative, edited oral narrative • Marked up in XML and encoded in Unicode • Tokenised and POS tagged • The Telephone Conversation part is tagged with details of classifier use • The unannotated version of this part is available from the LDC as CallHome Mandarin Transcripts • More information • http://www.ling.lancs.ac.uk/corplang/llscc/
Annotation scheme for classifiers (q)
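Once classifiers carry their own tag, pulling them out of an XML-encoded corpus is mechanical. The sketch below counts word forms tagged `q` in a small made-up fragment. The element name `w`, the attribute name `POS`, the sentence element `s`, and the tags `m` and `n` are assumptions for illustration and are not taken from the corpus documentation; only the classifier tag `q` comes from the slide title.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical fragment; the markup conventions here are assumed, not quoted.
SAMPLE = """<text>
  <s n="1"><w POS="m">三</w><w POS="q">本</w><w POS="n">书</w></s>
  <s n="2"><w POS="m">一</w><w POS="q">条</w><w POS="n">线</w></s>
  <s n="3"><w POS="m">两</w><w POS="q">本</w><w POS="n">书</w></s>
</text>"""


def classifier_counts(xml_string, tag="q"):
    """Count the word forms whose POS attribute equals the given tag."""
    root = ET.fromstring(xml_string)
    return Counter(w.text for w in root.iter("w") if w.get("POS") == tag)


counts = classifier_counts(SAMPLE)
print(counts.most_common())
```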
Why classifiers are necessary (1) • Grammatically mandatory • san ben shu (three CL book) ‘three books’ • *san shu (three book), intended: ‘three books’ • Distinguishing between word senses • yi tiao xian (one CL line) ‘a line’ • yi gen xian (one CL thread) ‘a thread’
Why classifiers are necessary (2) • Resolving syntactic ambiguity • Example A) Ho laozong gei-le ta yi-ba shouqiang (Ho general give-Asp him one-CL pistol) ‘General Ho gave him a pistol.’ • Example B) Ho laozong gei-le ta yi shouqiang (Ho general give-Asp him one pistol (CL)) ‘General Ho shot him once with a pistol.’
Use and name of classifiers • The use of “classifiers” dates back more than 3,300 years • Oracle bone inscriptions excavated from the Yin Ruins (1300-1100 B.C.) • Classifiers became established as a separate word class in Chinese only in the 1950s • Ding et al (1952): A Talk on Grammar in Modern Chinese • Different terms had been used for classifiers • But mainly treated as a subclass of nouns
Syntactic features of classifiers • Classifiers were the last of the 11 word classes in Chinese to be recognised, because they cannot be used independently as sentential constituents • Typically following a numeral or a demonstrative pronoun: zhe (这) ‘this’, na (那) ‘that’, or na (哪) ‘which’ • Monosyllabic classifiers can be reduplicated to function as different sentential constituents, expressing a general grammatical meaning with different situational variants (Guo 1999) • Co-existence or repetition of entities or events • “All around”, “many”, “one by one”, “continuous”
Levels of grammaticalization • Specialised classifiers • Fully grammaticalized • Functioning as classifiers only • Bleaching of lexical meaning, difficult to find translation equivalents in a non-classifier language • E.g. (n) 个,件,块,颗,辆,枚,匹,幢; (v) 次,遍,场,顿,番,回,通,趟,下,阵 • Concurrent classifiers • Mainly derived from nouns and verbs • Can be used as nouns/verbs and classifiers • The classifier use is semantically related to the lexical meaning of the original noun/verb • E.g. 口,头,台;瓶,碗; 包,封,卷,捆,束 • Temporary borrowings • Mainly borrowed from nouns, verbs, and adjectives • Functioning as classifiers only on an ad hoc basis • Full lexical meaning • E.g. 脸 (face),屋子 (house); 刀 (knife),枪 (gun),脚 (foot),拳 (fist)
Semantic types of classifiers (1) • Nominal classifiers (6 types): Quantifying nouns • Unit classifiers • Count individual entities • E.g. 个 (63.5% of unit classifiers, 38.8% of all classifiers),位,条,张,名,件,句,家,项,封,只,片,步,块,部,份,座,届,口,支 • Collective classifiers • Provide a collective reference for separate entities • E.g. 套 ‘set’,批 ‘batch’,双 ‘pair’,系列 ‘series’,副 ‘pair’,群 ‘group’,代 ‘generation’,组 ‘group’,对 ‘pair’,队 ‘team’ • Arrangement classifiers • Also refer to a collection, but focus on constellation aspect (shape), i.e. how entities are arranged or grouped together • E.g. 层 ‘layer’,堆 ‘pile’,团 ‘ball’,沓 ‘pad’,串 ‘string’,丝 ‘thread’,排 ‘row’,把 ‘handful’,滴 ‘drop’,束 ‘bunch’,缕 ‘thread’,行 ‘row’
Semantic types of classifiers (2) • Nominal classifiers: Quantifying nouns • Standard measure classifiers • Express exact measures of various kinds, in local or international units • E.g. 元,块,米,吨,克,美元,里,厘米,亩,度,平方米,斤,公里,公斤,分,尺,升,丈,℃ • Container classifiers • Denote types of containers, which are borrowed temporarily to provide an inexact measure of mass or entities usually associated with such containers • E.g. 杯,碗,盒,袋,桶,脸,瓶,壶,盆,盘,锅,瓢,箱,筐,包,匙,罐,腔,坛,锹,盅,车,斗,肚子 • Special container classifiers can only take yi (interpreted as ‘full’) and are more descriptive than quantifying • Species classifiers • Denote the type of entities grouped together • E.g. 种 (kind, over 90%),类 (sort),级 (grade),样 (type),等 (grade),品 (class)
Semantic types of classifiers (3) • Verbal classifiers: quantifying verbs • 9 specialised verbal classifiers • E.g. 次(times, 40.8% of all verbal classifiers),下(stroke),场(course of action),番(once over),阵(step of action),趟(return journey),回(times),遍(once through),顿(criticising, abusing) • Borrowed verbal classifiers • An open set, mostly nouns denoting tools and related items • E.g. 声,眼,口,刀,脚,拳,巴掌,枪,棒 • Temporal classifiers: measuring time • Exact measures • 年,天,岁,分钟,小时,夜,周,周年,日,周岁,月,载,星期,昼夜,刻,宿,宵,礼拜,旬 • Inexact measures • E.g. 会儿,段,辈子,阵子,会,阵,瞬间
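Several classifier forms appear under more than one semantic type in the three slides above, which is why the type of a classifier has to be decided in context rather than looked up by form. A minimal sketch of such a lookup follows. It uses only a subset of the single-character examples from the slides; the dictionary layout and the function name are illustrative assumptions, not the annotation scheme itself.

```python
from collections import defaultdict

# A subset of the single-character examples listed on the slides, by type.
EXAMPLES = {
    "unit": "个位条张名件句家项封只片步块部份座届口支",
    "arrangement": "层堆团沓串丝排把滴束缕行",
    "standard measure": "元块米吨克里亩度斤分尺升丈",
    "verbal (specialised)": "次下场番阵趟回遍顿",
    "verbal (borrowed)": "声眼口刀脚拳枪棒",
    "temporal (inexact)": "段会阵",
}


def types_by_classifier(examples):
    """Invert the type -> forms table into form -> set of types."""
    index = defaultdict(set)
    for sem_type, forms in examples.items():
        for form in forms:
            index[form].add(sem_type)
    return index


INDEX = types_by_classifier(EXAMPLES)

# Forms listed under more than one type need context to be classified.
ambiguous = {form: sorted(types) for form, types in INDEX.items() if len(types) > 1}
for form, types in ambiguous.items():
    print(form, types)
```

In this subset, 块, 口 and 阵 each fall under two types, so a form-only lookup would not be enough to annotate them.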
Classifiers in writing and speech • Unit classifiers by far most common, in speech and writing • Because of the weight of generalised classifier ge, unit classifiers are particularly frequent in speech • Other common types: temporal, verbal • Infrequent types: container, arrangement, collective
Variation across genres • Apart from the speech-writing difference, various genres also differ in classifier use • Most frequent in news reportage (A), humour (R), and speech (S): over 3,000 per 100,000 words • Least common in news review (B), news editorial (C), religious writing (D), and academic prose (J): below 2,000 per 100,000 words • Generally more common in imaginative writing (K-R) and speech (S) than in informative writing (A-J)
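Genre sections differ in size, so the comparison above rests on normalised rather than raw frequencies. A minimal sketch of the normalisation follows; the counts are hypothetical placeholders chosen for illustration and are not figures from LCMC or LLSCC.

```python
def per_100k(count, total_words):
    """Normalise a raw count to a rate per 100,000 words."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return count * 100_000 / total_words


# Hypothetical (classifier count, section size) pairs, for illustration only.
genre_counts = {
    "genre_1": (620, 20_000),
    "genre_2": (380, 20_000),
}

rates = {genre: per_100k(c, n) for genre, (c, n) in genre_counts.items()}
for genre, rate in rates.items():
    print(f"{genre}: {rate:.0f} classifiers per 100K words")
```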
Distribution of classifier types • Distribution of different types of classifiers also varies across genres • Unit classifiers are the most common type in all genres (2/3 of all classifiers) • Container, arrangement, and collective classifiers are relatively rare in all genres • Standard measure classifiers are most frequent in news reportage (A) and official documents (H) • Species classifiers are more common in informative than in imaginative writing
Cognitive basis of classifier use • Allan (1977): number of dimensions • Adams and Conklin (1973): elasticity, hardness, discreteness • Shi (2001): ratio between different dimensions, and materiality • Dimensions and use of classifiers • 0-D: point, e.g. yi dian (点) mo ‘a point of ink’ • 1-D: line, e.g. yi xian (线) xiwang ‘a thread of hope’ • 2-D: area (Y being the longer dimension) • Y/X >> 1 –> zhang (张): e.g. yi zhang zhaopian ‘a photo’ • Y/X >> 0 –> tiao (条): e.g. yi tiao malu ‘a road’ • 3-D: block (Q=Y/X) • Z/Q >> 0 –> pian (片): e.g. yi pian shuye ‘a leaf’ • Z/Q >> 1 –> kuai (块): e.g. yi kuai tang ‘a lump of sugar’ • Z/Q >> sufficiently large –> gen (根): e.g. yi gen dianxian ‘a cable’ • While the use of nominal classifiers is closely associated with shape, shape is not the only criterion by which nouns and classifiers co-select each other • Five co-selection criteria
Co-selection by similarity • Classifiers are closely related to shapes which are historically associated with the nouns that have given rise to these classifiers, e.g. tiao (条) and kuai (块) • tiao (条): ‘small branch/twig’ –> ‘long, narrow, flexible’: jie (街) ‘street’, tui (腿) ‘leg’, lu (路) ‘road’, xian (线) ‘line; thread’, he (河) ‘river’, yu (鱼) ‘fish’, etc.; ‘bamboo slips for writing’ –> guiding (规定) ‘regulation’, jianyi (建议) ‘suggestion’, falu (法律) ‘law’, xinwen (新闻) ‘news’, etc. • kuai (块): ‘soil lump/block’ –> something of a lumpy/blocky shape, e.g. a wrist watch; ‘territory soil’ –> something with a boundary, e.g. a scar
Co-selection by metonymy • The original lexical meanings of classifiers refer to the most salient features of the entities being classified, e.g. • kou (口) ‘mouth’ (for pigs), tou (头) ‘head’ (for cattle), wei (尾) ‘tail’ (for fish), ding (顶) ‘top’ (for hats, sedan chairs, etc.) • But long-term linguistic conventions are always important in language use • *tou: rabbit, cat • *wei: peacock, squirrel
Co-selection by relatedness • The original lexical meanings of classifiers refer to actions closely related to the entities being classified, e.g. • bao (包) ‘wrap -> pack (the result of packing)’ • chuan (串) ‘string together -> string, bunch’ • kun (捆) ‘tie up, fasten -> bundle’ • peng (捧) ‘hold in both hands -> a double handful’
Co-selection by association • The original lexical meanings of classifiers refer to tools, containers, and places, etc closely associated with the entities being classified, e.g. • dao (刀) ‘knife -> a cut of (meat)’ • wan (碗) ‘bowl -> a bowl of (rice)’ • chuang (床) ‘bed -> a bed of (quilt/sheet etc)’ • mu (幕) ‘curtain -> an act of (play)’
Co-selection by conventions • Sometimes, co-selection has to be interpreted by following linguistic conventions, because it is not always possible to track the grammaticalization path of a classifier to ascertain the relationship between its original lexical meaning and the entities being classified • In what way is tiao historically related to renming ‘human life’? • Why is tou used for pigs and cattle but not for rabbits or cats? • Why is wei used for fish but not for peacocks or squirrels, even though their tails are as salient as, if not more salient than, those of fish? • Such missing links have to be accounted for by the linguistic conventions of the speech community
Collocates • Let’s now have a look at the noun collocates of some common classifiers in Chinese to see how well the proposed co-selection criteria work • Defining collocates (in 2 million words) • Window span of L5-R5 • z>3.0 • Minimum co-occurrence frequency of 5
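The three criteria on this slide (an L5-R5 window, z > 3.0, and a minimum co-occurrence frequency of 5) can be sketched as a small filter over a tokenized corpus. The slides do not say which z-score formula was used, so the sketch assumes a widely used formulation (often attributed to Berry-Rogghe) in which the expected co-occurrence is E = p · F(node) · span, with p = F(collocate) / (N − F(node)), and z = (O − E) / sqrt(E · (1 − p)). The function name and the toy data are hypothetical, and the restriction to noun collocates, which would need the POS tags, is left out.

```python
import math
from collections import Counter


def collocates(tokens, node, left=5, right=5, min_freq=5, min_z=3.0):
    """Return {collocate: z-score} for words co-occurring with node."""
    n_tokens = len(tokens)
    freq = Counter(tokens)
    node_freq = freq[node]

    # Observed co-occurrences within the window around each node occurrence.
    observed = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        lo, hi = max(0, i - left), min(n_tokens, i + right + 1)
        observed.update(
            t for j, t in enumerate(tokens[lo:hi], start=lo) if j != i
        )

    span = left + right
    results = {}
    for word, obs in observed.items():
        if word == node or obs < min_freq:
            continue
        # Chance of meeting the word at any position outside the node slots.
        p = freq[word] / (n_tokens - node_freq)
        if p >= 1:
            continue  # degenerate case in tiny corpora
        expected = p * node_freq * span
        z = (obs - expected) / math.sqrt(expected * (1 - p))
        if z > min_z:
            results[word] = z
    return results


# Toy corpus (hypothetical): six classifier phrases followed by filler tokens.
TOKENS = ["一", "张", "照片"] * 6 + ["的"] * 200
result = collocates(TOKENS, "张")
print(sorted(result, key=result.get, reverse=True))
```

In the toy data the filler token meets the frequency threshold but scores far below chance, so it is dropped, while 照片 is kept.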
Collocates of zhang (张)
Collocates of tiao (条) - 1
Collocates of tiao (条) - 2
Collocates of kuai (块)
Collocates of ge (个) • Generalised classifier ge (个): bamboo (竹) split into halves, initially as a counter for bamboos and arrows; when a bamboo chip is used for counting, it becomes a symbol of the entity being counted. In other words, the entity loses its shape, colour, function or any other attribute and becomes a unit of counting, ge. • Ge can be used for any noun (people or things, large or small) that does not have a specific classifier, and it can be used to replace specific classifiers of many nouns. • A total of 115 noun collocates • 29 refer to human beings, 86 to non-human entities • 66 refer to concrete entities, 49 to abstract entities • 12 related to time • Top 20 noun collocates (z>8.8, F>5, in the order of z-scores) • 月 ‘month’, 星期 ‘week’, 人 ‘person’, 小时 ‘hour’, 电话 ‘phone call’, 礼拜 ‘week’, 字 ‘character’, 百分点 ‘percentage’, 地方 ‘place’, 角落 ‘corner’, 项目 ‘project’, 钟头 ‘hour’, 问题 ‘problem, question’, 电饭锅 ‘rice cooker’, 女人 ‘woman’, 字儿 ‘character’, 例子 ‘example’, 盒子 ‘box’, 照相机 ‘camera’, 东西 ‘stuff’
Classifiers for dongxi (东西) • A noun with a rather general and vague referent; it can refer to anything, but not to human beings • It is an insult to say someone is a dongxi, or is not a dongxi • The vagueness in reference makes it possible to use a nominal classifier of any type for dongxi • Unit classifier • (General) ge (个), jian (件) ‘piece’, fen (份) ‘portion’ • (Shape) tiao (条), zhang (张), and kuai (块) • (Book/paper) ben (本 for books), pian (篇 for a piece of writing) • Collective classifier • tao (套) ‘set’ • Arrangement classifier • dui (堆) ‘pile’ • Container classifier • xiangzi (箱子) ‘box’, bao (包) ‘pack’ • Standard measure classifier • dun (吨) ‘ton’ • Species classifier • yang (样) ‘type’, zhong (种) ‘kind’, lei (类) ‘class’
Variations • Not all instances of classifier use are in line with these co-selection criteria • Regional variation • dao (刀) ‘knife’ • Mandarin: yi-ba (把) dao ‘a knife’ • Cantonese: yi-zhang (张) dao ‘a knife’ • niu (牛) ‘cattle’ • Mandarin: yi-tou (头) niu ‘a cow’ • Wu: yi-zhi (只) niu ‘a cow’ • ren (人) ‘person’ • Mandarin: yi-ge (个) ren • Fuzhou: yi-zhi (只) ren • Unconventional, creative use of classifiers often found in literary works • Diachronic variation
Thank you!