PNC2013 Kyoto University December 10-11 2013 New Language Resources for Cantonese Linguistics Research: A Linguistic Corpus of Mid-20th Century Hong Kong Cantonese Andy C. Chin The Hong Kong Institute of Education andychin@ied.edu.hk
Outline • Why “Cantonese”? • Research on early Cantonese (19th - mid-20th C) – Diachronic development • The corpus • Source of data • Demonstration of search engine
Cantonese • One of the dialects of the Chinese language family • Despite being a dialect, Cantonese serves as a lingua franca in Hong Kong, Macau and most parts of Guangdong Province, China
“Cantonese” in early Hong Kong • A fishing village • Population in 1851: ~33,000 • Four major ethnic groups: • Guangfu 廣府 (本地) • Danjia 蛋家 (seafaring people) • Hakka 客家 • Min 閩語 (鶴佬/潮州) • Their languages are mutually unintelligible
Given the long history of Cantonese in HK • We are interested in understanding its development over the past 200 years • Are there any differences between early Cantonese and modern Cantonese? • How can we capture these differences?
Diachronic studies of Cantonese • Two approaches • Apparent time approach • Real time approach
Apparent time approach • Age-stratified variation in a linguistic form is often indicative of a change in progress • Comparing speakers aged 75, 50 and 25 captures change over about 50 years • What about the language of 200 years ago? • Language change: can we assume a speaker still speaks the language of his time? • If two speakers show no difference with respect to a linguistic feature, does it mean that there has been no change?
Real time approach • Samples the population over an extended period of time (longitudinal study) • Requires data produced in the period concerned
Limitations on Research in Cantonese • Cantonese is a vernacular language • Spoken data is needed • Are there any records of early 19th-century Cantonese? • Spoken data vs. written records
With these early materials, • We are able to reconstruct the early stage of the Cantonese language (about 200 years ago) • Some of the linguistic features are very different from those in modern Cantonese
Previous research on Cantonese • Neutral questions • Directional complements • Aspect markers • Demonstratives • Phonology • Verb complements • Comparative construction • Lexicon (sociolinguistics) • Dative verb GIVE • Sentence-final particles • Grammar of the late Qing period • …
Furthermore, • Some linguistic changes took place or were completed around the mid-20th century • Dative marker: 過 → 畀 (送本書過/畀佢) • Neutral Q: 你去睇戲唔呀 → 你去唔去睇戲呀 • … • New and old features might co-exist in the mid-20th C
Timeline: Morrison (1828) → about 120 years → Chao (1947) → ~66 years → 2013
Existing Cantonese corpora • The Hong Kong Cantonese Child Language Corpus • The Hong Kong Bilingual Child Language Corpus • Hong Kong Cantonese Corpus • The Hong Kong Cantonese Adult Language Corpus • 19th Century Cantonese Corpus
Source of corpus data • Real time vs. apparent time • Naturally occurring data • HK Cantonese movies (粵語長片)
HK Movie Industry in mid-20th C.
Year          No. of Cantonese movies   No. of PTH (Putonghua) movies
1952 - 1955   627                       222
1956 - 1960   963                       314
1961 - 1965   928                       206
1966 - 1970   361                       286
Total         2879                      1028
Source of data: Chung (2004:177)
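The figures above can be turned into proportions to show how strongly Cantonese dominated production in each period. The following is a minimal sketch only: the counts are copied from Chung (2004:177) as quoted on this slide, while the function name `cantonese_share` and the dictionary layout are assumptions made for illustration.

```python
# Counts copied from the slide (Chung 2004:177): (Cantonese, Putonghua) per period.
movies = {
    "1952-1955": (627, 222),
    "1956-1960": (963, 314),
    "1961-1965": (928, 206),
    "1966-1970": (361, 286),
}


def cantonese_share(cantonese: int, putonghua: int) -> float:
    """Fraction of the combined Cantonese + Putonghua output that was Cantonese."""
    return cantonese / (cantonese + putonghua)


# The column totals should reproduce the "Total" row of the slide.
total_cantonese = sum(c for c, _ in movies.values())
total_putonghua = sum(p for _, p in movies.values())

for period, (c, p) in movies.items():
    print(f"{period}: {cantonese_share(c, p):.1%} Cantonese")
print(f"Total: {cantonese_share(total_cantonese, total_putonghua):.1%} Cantonese")
```

The per-period totals computed here match the slide's "Total" row (2879 and 1028), which also confirms the 1956-1960 figures are 963 and 314.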
About the corpus • 21 movies have been transcribed in Chinese characters: ~200k characters • Word segmentation • Search engine (14 movies, since Apr 2012) • http://corpus.ied.edu.hk/hkcc/ • 350+ registered users
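The slides do not say how word segmentation was carried out for the corpus. Purely as an illustration of what segmenting a character transcript involves, here is a sketch of forward maximum matching against a word list. The function `segment`, the toy lexicon and the `max_len` limit are all hypothetical and are not taken from the project.

```python
def segment(text: str, lexicon: set[str], max_len: int = 4) -> list[str]:
    """Split text into words by forward maximum matching.

    At each position the longest lexicon entry is taken; if nothing
    matches, a single character is emitted as its own unit.
    """
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


# Toy lexicon invented for this example (not the corpus's word list).
toy_lexicon = {"千金", "讀書"}
print(segment("你位千金有讀書冇呀", toy_lexicon))
```

Greedy matching of this kind is simple but mis-segments ambiguous strings, which is one reason colloquial vocabulary and segmentation decisions are non-trivial for a spoken corpus.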
Search criteria • Characters or words (segmented units) • Cantonese pronunciation • Movie names • Names of speakers • Gender of speakers • …
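The search criteria listed above amount to filtering utterance records by several optional fields at once. The sketch below shows that idea under stated assumptions: the `Utterance` record, the `search` function, the speaker labels, the gender values and the segmentation of the sample sentences are all invented for illustration and do not describe the actual search engine. The pronunciation criterion is left out because the slides do not name a romanization scheme.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Utterance:
    movie: str        # movie name
    speaker: str      # name of the speaker (hypothetical labels below)
    gender: str       # gender of the speaker (hypothetical values below)
    words: list[str]  # segmented units of the utterance


def search(
    utterances: list[Utterance],
    word: Optional[str] = None,
    movie: Optional[str] = None,
    speaker: Optional[str] = None,
    gender: Optional[str] = None,
) -> list[Utterance]:
    """Return utterances matching every criterion that is not None."""
    results = []
    for u in utterances:
        if word is not None and word not in u.words:
            continue
        if movie is not None and u.movie != movie:
            continue
        if speaker is not None and u.speaker != speaker:
            continue
        if gender is not None and u.gender != gender:
            continue
        results.append(u)
    return results


# Sentences from 契爺艷史 (1952); speakers, genders and segmentation are made up.
sample = [
    Utterance("契爺艷史", "A", "F", ["重", "要", "畀", "錢", "過", "人"]),
    Utterance("契爺艷史", "B", "M", ["咪", "可以", "快啲", "還清", "啲", "債", "畀", "人"]),
]

for hit in search(sample, word="畀", gender="M"):
    print("".join(hit.words))
```

Treating each criterion as optional lets the same function serve both broad queries (one word across all movies) and narrow ones (one word, one speaker, one movie).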
契爺艷史(1952) • Yes-No question • VP-Neg: 你位千金有讀書冇呀? • V-Neg-VO: 呢道係咪有位黃小姐? • Dative marker • 重要畀錢過人? • 咪可以快啲還清啲債畀人?
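The two yes-no question patterns illustrated above can be told apart by surface form in simple cases. The following rough sketch assumes that the V-Neg-VO pattern shows up as a character repeated around 唔 (or the contracted form 係咪) and that the VP-Neg pattern ends in 唔 or 冇 plus an optional final 呀. The function `classify` and both regular expressions are illustrative assumptions, not the project's annotation scheme, and they will miss other negators, particles and contracted forms.

```python
import re

# V-Neg-VO: a character repeated around 唔 (e.g. 去唔去), or contracted 係咪.
V_NEG_VO = re.compile(r"(.)唔\1|係咪")
# VP-Neg: sentence ends in a negator, optionally followed by 呀 and a question mark.
VP_NEG = re.compile(r"[唔冇]呀?[?？]?$")


def classify(question: str) -> str:
    """Label a question as 'V-Neg-VO', 'VP-Neg' or 'other' by surface pattern."""
    if V_NEG_VO.search(question):
        return "V-Neg-VO"
    if VP_NEG.search(question):
        return "VP-Neg"
    return "other"


# Example sentences taken from the slides.
for q in ["你位千金有讀書冇呀?", "呢道係咪有位黃小姐?", "你去睇戲唔呀", "你去唔去睇戲呀"]:
    print(q, classify(q))
```

Counting the two labels per movie or per year would be one way to trace how the older and newer patterns co-exist in the mid-20th-century data.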
Some challenges • Quality of speech • Overlap of speech • Representations of colloquial vocabulary • Parts-of-speech: How many types? • Discourse features • …
Acknowledgments • ECS research grants, RGC: • Linguistic Analysis of Mid-20th Century Hong Kong Cantonese by Constructing an Annotated Spoken Corpus (2013/2015) • HKIEd Internal Research Grants: • RG41/2010-2011: Spoken Corpus Construction and Linguistic Analysis of Mid-20th Century Cantonese • RG62/12-13R: A Preliminary Linguistic Analysis of Mid-20th Century Cantonese from a Corpus-based Approach
Demonstration • http://corpus.ied.edu.hk/hkcc/