
Synopsis

Word segmentation is important in developing a text-to-speech (TTS) system for Cantonese for several reasons. (1) For any type of synthesis, words must be identified in the text in order to model positional effects such as the fusion of coda /p/ with initial /h/ into aspirated [pʰ] within a word. (2) Concatenative synthesis also requires a list of words large enough to identify all the word-internal sequences that must be recorded in order to model such positional effects. The only way to get such a list is to segment a large corpus. (3) Concatenative synthesis with a fixed inventory of units also requires such a word list to identify the best basic units and to determine the optimal inventory of such units. This paper describes our use of the Segmentation Corpus (a lexicon of 33k words extracted from a large corpus of Cantonese newspapers) to define and constrain an inventory of concatenative units.

1. Some facts about Cantonese

• The term “Cantonese” is used here to refer to standard Hong Kong Cantonese (i.e. not to the original Canton City standard or other regional varieties spoken in neighboring counties).
• Cantonese is written with Chinese characters, which poses the usual problems for text analysis, plus some additional ones (see section 3).
• Cantonese morphology and word-level phonology are not well studied relative to Mandarin varieties (see section 2), and there are no standard dictionaries of polysyllabic words comparable to the Xiandai Hanyu Cidian 現代漢語詞典. However, ...
• There are newspapers written in Hong Kong Cantonese, which provide a basis for developing a segmented word list, with compiled text frequencies (see section 2).

2. The Segmentation Corpus

• Created using word-segmentation criteria developed by researchers at the Chinese Language Centre and the Dept. of Chinese and Bilingual Studies, Hong Kong Polytechnic University. The Cantonese corpus that we used is part of this larger corpus of segmented Chinese texts.
• The Cantonese corpus is an electronic database of around 33k Cantonese word types extracted from a 1.7 million character corpus of Hong Kong newspapers, along with a tokenized record of the text.

An example of segmented text:

^^此外^^, ^^ ^^**懲教署^^人員^^亦^^將^^會^^加強^^在^^營^^內^^搜索^^武器^^的^^行動^^, ^^ ^^在^^有^^需要^^時^^會^^有^^警方^^支援^^。^^ ^^政府^^亦^^將^^儘快^^安排^^加強^^圍欄^^的^^穩堅性^^, ^^ ^^及^^加強^^船民^^中心^^外圍^^的^^保安^^。^^

A snippet of the resulting word list, where each word entry is a string of Chinese characters followed by a pronunciation field and the token frequency (glosses added):

有 jau5 9292 ‘have’
警方 ging2 fong1 493 ‘police’
支援 zi1 wun4 45 ‘support’
政府 zing3 fu2 2051 ‘government’
亦 jik6 2716 ‘also’
將 zoeng1 4097 ‘will (aux.)’
儘快 zeon6 faai3 86 ‘as soon as possible’
安排 on1 paai4 364 ‘arrange’
加強 gaa1 koeng4 305 ‘strengthen’
圍欄 wai4 laan4 3 ‘fence’
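As an illustration of how token frequencies like those above could be compiled from the segmented text, here is a minimal Python sketch. It assumes that `^^` is the word delimiter shown in the example, that `**` is stray markup, and that punctuation-only tokens are not counted as words. The function names are hypothetical and this is not the tool used to build the Segmentation Corpus.

```python
from collections import Counter

DELIM = "^^"               # assumed word delimiter, as in the segmented example
PUNCTUATION = set(",，。、")  # assumed set of tokens that are not words


def tokenize(segmented: str) -> list[str]:
    """Split segmented text on the delimiter and keep only word tokens."""
    tokens = []
    for piece in segmented.split(DELIM):
        piece = piece.strip().strip("*")  # drop spaces and assumed stray '**'
        if piece and not all(ch in PUNCTUATION for ch in piece):
            tokens.append(piece)
    return tokens


def token_frequencies(segmented: str) -> Counter:
    """Count how often each word type occurs in the segmented text."""
    return Counter(tokenize(segmented))


sample = ("^^政府^^亦^^將^^儘快^^安排^^加強^^圍欄^^的^^穩堅性^^, ^^ "
          "^^及^^加強^^船民^^中心^^外圍^^的^^保安^^。^^")
freqs = token_frequencies(sample)
for word, count in freqs.most_common(3):
    print(word, count)
```

Run over the whole corpus rather than one sentence, the same counter would yield the token-frequency column of the word list.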

Segmentation criteria: A “word” is a string of Chinese characters that:
(1) is an independent part of speech, e.g. 盒子 ‘box’ (noun)
(2) has a meaning that is not simply the sum of its parts, e.g. 火車 ‘train’ (noun) ≠ 火 ‘fire’ and 車 ‘vehicle’
(3) consists of no more than four characters
(4) either is listed in Xiandai Hanyu Cidian 現代漢語詞典 or Zhongguo Chengyu Da Cidian 中國成語大辭典, or meets a predetermined frequency threshold (for strings of text not listed in these two dictionaries).

However, segmentation is only half the work of developing the word list, because of the nature of the writing system ...

3. The Cantonese writing system

Multiple readings of a character pose a problem:
• Orthographic forms where the variation is stylistic, e.g. 支援 tsi:1wu:n4 ~ tsi:1jyu:n4 ‘support’
• Orthographic forms where the variation in pronunciation corresponds to different words, with different meanings, e.g. 正當 tse3t:1 ‘while’ (function word) vs. 正當 tse3t:3 ‘proper’ (content word)
• Particles can be written with special Cantonese characters, e.g. 囉 l:1, 咩 m:1, 喎 w:3, or, in more formal writing, they may be left to the reader to interpolate from a character “borrowed” from some other morpheme: e.g. 的 writes the second morpheme in 目的 ‘aim’, but suggests k:3 ‘genitive particle’, because it also writes a genitive particle de (in Pinyin) in Mandarin.

Therefore, to use the Segmentation Corpus word list for TTS, the first author:
• examined each entry in the word list of the Corpus;
• corrected many transliterations; and
• adjusted frequencies when a single orthographic form writes more than one (phonological) word.

As a result, approximately 90 original entries were split into separate entries by this processing. That is, 32,840 entries became 33,037 entries.
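The entry-splitting step can be pictured with the following Python sketch. The `Entry` type, the `split_entry` helper and the token counts are all hypothetical and only illustrate the bookkeeping; the Jyutping readings for 正當 are our transliteration of the two words in the section 3 example. The actual adjustments were made by hand by the first author.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    chars: str     # orthographic form
    jyutping: str  # pronunciation field
    freq: int      # token frequency


def split_entry(entry: Entry, counts_by_reading: dict[str, int]) -> list[Entry]:
    """Replace one entry by one entry per reading, preserving total frequency."""
    if sum(counts_by_reading.values()) != entry.freq:
        raise ValueError("per-reading counts must add up to the original frequency")
    return [Entry(entry.chars, reading, count)
            for reading, count in counts_by_reading.items()]


# Illustrative numbers only: the real frequency of this entry is not given here.
original = Entry("正當", "zing3 dong1", 100)
split = split_entry(original, {"zing3 dong1": 60, "zing3 dong3": 40})
for e in split:
    print(e.chars, e.jyutping, e.freq)
```

Checking that the per-reading counts add up to the original frequency keeps the total token count of the word list unchanged while the number of entries grows.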

4. Cantonese phonology

Syllable structure
Syllabic nasals: m

19 Consonants
Consonants marked in red can occur in syllable-final position.

11 Vowels

6 Tones
[Figure labels: tone 1, tone 3, tone 6, tone 4, tone 2, tone 5]
F0 contours for six words [wj] with different tones. Numbers to the right identify the endpoints of the two rising tones (in grey) and numbers to the left identify starting points of the other four tones (in black). The discontinuities in [wj4] are where the speaker breaks into creaky voice.
• HK Cantonese has five tones (i.e. all tones except tone 5) in contrast on syllables closed with [p, t, k].

Onset and rhyme counts: If there are no phonotactic restrictions on VC combinations

The simplicity of the syllable structure and the small number of phonotactically possible syllable types make the syllable an attractive candidate basic unit for TTS (cf. Chu & Ching 1997). However, ...

E.g.1. 集 tsa:p6 ‘to collect’ ([p] unreleased)
集合 tsa:p6hp6 ‘to assemble’ ([p] “fuses” with [h] to become released & aspirated)

E.g.2. Syllable fusion and phrase-final effects
An utterance of the sentence o5 jyun4loi4 hai6 wai3 ‘Oh, I get it. It was the character 慰!’ (The context is a dictation task.) The labelling window above the signal view shows a partial transcription in the annotation conventions proposed by Wong, Chan & Beckman (in press), with a syllable-by-syllable Jyutping (粵拼) transliteration (top tier), a transcription of the (canonical) lexical tones and boundary tone, and a phonetic transcription of fused forms (lowest tier). Notice the fused form [jy:n21la:212] for the phrase 原來係 jyun4loi4 hai6 ‘was’ (with the verb cliticized onto the preceding tense adverb). The HL% boundary tone is a pragmatic morpheme, which we have translated with the ‘Oh, I get it.’ phrase.

5. Choosing a basic unit for concatenative TTS

Compare 3 strategies of unit selection for 經濟學家 ‘economist’ (Jyutping: ging1 zai3 hok6 gaa1):

Model          Units for ‘economist’              Basic units (except. units)
Chu & Ching    ke tsj h:k ka:#                    1042 (1042)
Law & Lee      #k e$ts j$h :k$k a:#               1801
diphones       #ke e$ts tsj j j$h: :k ka: a:#     1097

The table above illustrates the string of basic units and exceptional units (underlined) that would be needed to synthesize an utterance of the word ‘economist’. (Tones ignored; last column shows the theoretically possible number of basic units.)
• Chu & Ching (1997) use the syllable as the basic concatenative unit.
• Law & Lee (2000) replace the syllable with a necessarily cross-syllabic unit, the “final-initial combination”, as the basic unit, augmented with word-initial onsets and word-final rhymes for the transitions out of and into a pause.
• Our diphone model uses positionally sensitive diphones as the basic concatenative units.

The counts
(Rhyme counts in all three models follow the standard syllabary of Jyutping: 52 rhyme types + 2 syllabic nasals = 54 rhymes.)

Chu & Ching model:
(19 onsets * 52 rhymes) + 52 rhymes + 2 syllabic nasals = 1042 syllable types

Law & Lee model:
onsets = 19 in initial position
rhymes = 54 in final position
cross-syllabic units = 54 rhymes * 32 ways to start a syllable [i.e. 19 initial onsets + 11 vowels + 2 syllabic nasals] = 1728
SUM(subtotals) = 19 + 54 + 1728 = 1801 unit types
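The arithmetic for these two models can be checked with a short Python sketch. The constants are the inventory sizes stated above; the function names are ours. The Law & Lee total comes out at 1801, the figure listed in the comparison table in section 5.

```python
ONSETS = 19          # initial consonants
RHYMES = 52          # rhyme types in the Jyutping syllabary
SYLLABIC_NASALS = 2
VOWELS = 11


def chu_ching_units() -> int:
    """Syllable types: onset+rhyme, bare rhyme, and syllabic nasals."""
    return ONSETS * RHYMES + RHYMES + SYLLABIC_NASALS


def law_lee_units() -> int:
    """Initial onsets + final rhymes + cross-syllabic final-initial units."""
    final_rhymes = RHYMES + SYLLABIC_NASALS               # 54
    syllable_starts = ONSETS + VOWELS + SYLLABIC_NASALS   # 32
    cross_syllabic = final_rhymes * syllable_starts       # 1728
    return ONSETS + final_rhymes + cross_syllabic


print(chu_ching_units())  # 1042
print(law_lee_units())    # 1801
```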

The counts (cont’d)

Our diphone model:
#(C)V = 209 combinations of cons. onsets followed by a vowel + 13 ways to begin a word with zero onset [2 syll. nasals included] = 222
word-final rhymes = 54 rhymes * 2 positions (non- vs. phrase/word-final) = 108
cross-syllabic diphones after open syllables = 13 ways to end a syllable w/out a coda cons. * 42 onset types [i.e. 18 initial cons. other than /h/ + 11 qualities to /h/ before the different vowels + 11 vowels when zero onset + 2 syllabic nasals] = 546
cross-syllabic diphones where 1st syll. has a sonorant coda cons. = 5 sonorant coda cons. * 42 onset types [see above] = 210
p-fusion = /p/ coda * 11 vowel qualities to initial /h/ = 11
SUM(subtotals) = 222 + 108 + 546 + 210 + 11 = 1097 unit types
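The subtotals of the diphone model can be reproduced in the same way. This Python sketch only restates the arithmetic above; the dictionary keys and variable names are ours, and the figure of 13 open-syllable endings is taken as given.

```python
ONSETS = 19
VOWELS = 11
SYLLABIC_NASALS = 2
RHYMES_TOTAL = 54         # 52 rhyme types + 2 syllabic nasals
OPEN_ENDINGS = 13         # ways to end a syllable without a coda consonant
SONORANT_CODAS = 5


def diphone_subtotals() -> dict[str, int]:
    """Return each subtotal of the diphone inventory by category."""
    # 18 consonants other than /h/, 11 vowel-coloured /h/, 11 zero-onset
    # vowels, and 2 syllabic nasals
    onset_types = (ONSETS - 1) + VOWELS + VOWELS + SYLLABIC_NASALS  # 42
    return {
        "word_initial": ONSETS * VOWELS + (VOWELS + SYLLABIC_NASALS),  # 222
        "final_rhymes": RHYMES_TOTAL * 2,                              # 108
        "after_open_syllable": OPEN_ENDINGS * onset_types,             # 546
        "after_sonorant_coda": SONORANT_CODAS * onset_types,           # 210
        "p_fusion": VOWELS,                                            # 11
    }


subtotals = diphone_subtotals()
print(subtotals)
print(sum(subtotals.values()))  # 1097
```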

Advantages of our diphone model

• It differentiates codas from onset consonants. I.e. rhyme aak$ ≠ cross-syllabic diphone aa$k.
• Spectral continuity between the initial and rhyme is captured in the CV diphones (e.g. #gi and zai).
• The diphones capture the dependency between the quality of the [h] and that of the following vowel (i.e. one records separate cross-syllable diphones for i$ho, i$hi, i$haa, and so on).
• The number of theoretically possible units is smaller compared with Law & Lee’s model, because we do not record consonant sequences that abut silence with silence. E.g. aak$ can be combined directly with $ka or $ta, so no cross-syllabic units need to be recorded for k$k and k$t.

Segmentation Corpus Attested Diphone Types Using Tones: 2292
For comparison, the number of attested diphones ignoring tone: 634

Recording each diphone in a disyllabic carrier word, a Cantonese speaker could speak all of the words needed to make a new voice in a single recording session.

Why use tones? For naturalness.
• In Cantonese, every syllable bears a (full) tone; tones are rarely deleted in running speech.
• Voice quality is part of the tonal specification, as suggested by the contour for tone 4. Recording different units for rhymes with different tones should be desirable.
• We need to ensure tonal continuity when sonorant segments of different tone sequences abut at syllable edges in different cross-syllabic units.
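To show how attested unit types can be counted from the word list with and without tone, here is a simplified Python sketch. It counts only word-internal cross-syllabic junctures (the rhyme of one syllable plus the start of the next), not the full positionally sensitive diphone inventory described above. The regular expression is our assumption about Jyutping syllable shape, and the helper names are hypothetical.

```python
import re

# Assumed Jyutping syllable shape: optional onset, then a rhyme that begins
# with a vowel letter or is a syllabic nasal, then a tone digit 1-6.
SYLLABLE = re.compile(
    r"^(ng|gw|kw|[bpmfdtnlgkhwzcsj])?([aeiouy][a-z]*|m|ng)([1-6])$"
)


def parse(syllable: str) -> tuple[str, str, str]:
    """Split one Jyutping syllable into (onset, rhyme, tone)."""
    match = SYLLABLE.match(syllable)
    if match is None:
        raise ValueError(f"not a Jyutping syllable: {syllable!r}")
    onset, rhyme, tone = match.groups()
    return onset or "", rhyme, tone


def junctures(jyutping: str, with_tone: bool) -> list[str]:
    """List the word-internal rhyme$start junctures of one word."""
    syllables = [parse(s) for s in jyutping.split()]
    units = []
    for (_, rhyme, tone1), (onset, rhyme2, tone2) in zip(syllables, syllables[1:]):
        # With zero onset, the next syllable starts with its vowel or nasal.
        start = onset or (rhyme2 if rhyme2 in ("m", "ng") else rhyme2[0])
        if with_tone:
            units.append(f"{rhyme}{tone1}${start}{tone2}")
        else:
            units.append(f"{rhyme}${start}")
    return units


def attested_types(words: list[str], with_tone: bool) -> set[str]:
    """Collect the distinct juncture types attested in a list of words."""
    return {unit for word in words for unit in junctures(word, with_tone)}


# Pronunciations of the disyllabic entries in the word-list snippet in section 2.
words = ["ging2 fong1", "zi1 wun4", "zing3 fu2", "zeon6 faai3",
         "on1 paai4", "gaa1 koeng4", "wai4 laan4"]
print(len(attested_types(words, with_tone=False)))
print(len(attested_types(words, with_tone=True)))
```

Even in this small sample, 警方 and 政府 share the toneless juncture ing$f but yield two distinct toned junctures, which is the effect behind the larger count of attested types when tones are used.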

6. Conclusion

• We have shown one way of using a segmented database to inform the design of a unit inventory for TTS.
• We have augmented the Segmentation Corpus with transliterations that would let us predict more accurately the pronunciation that a Cantonese speaker adopting a careful speaking style would be likely to produce for a character sequence.
• Judgements about the phonology of Cantonese, in combination with the new word list and the associated word frequency data, can be used to assess the costs and likely benefits of different strategies for unit selection in Cantonese TTS.
• We present data indicating the feasibility of a new diphone selection strategy that finesses some of the problems in modelling the interactions between tone and segmental identity.
• It remains to be demonstrated that this strategy can actually deliver the results which it appears to promise.

7. References

• Chan S. D. and Tang Z. X. (1999) Quantitative Analysis of Lexical Distribution in Different Chinese Communities in the 1990’s. Yuyan Wenzi Yingyong [Applied Linguistics], No. 3, 10-18.
• Chu M. and Ching P. C. (1997) A Cantonese synthesizer based on TD-PSOLA method. Proceedings of the 1997 International Symposium on Multimedia Information Processing. Academia Sinica, Taipei, Taiwan, Dec. 1997.
• Law K. M. and Lee Tan (2000) Using cross-syllable units for Cantonese speech synthesis. Proceedings of the 2000 International Conference on Spoken Language Processing, Beijing, China, Oct. 2000.
• Wong W. Y. P., Chan M. K-M., and Beckman M. E. (in press) An autosegmental-metrical analysis and prosodic conventions for Cantonese. To appear in S-A. Jun, ed. Prosodic Models and Transcription: Towards Prosodic Typology. Oxford University Press.
