Investigating Information Distribution in Chinese and Chinese English

Ting Qian • Human Language Processing Lab • Brain and Cognitive Sciences

Acknowledgements • Dr. T. Florian Jaeger • My father • My friends who have voluntarily given me their Chinglish essays • People at HLP lab

Reorder the sentences • Meanwhile, Brent crude hit an all-time peak of $112.73 before falling back. • Prices initially rose when the report was released with traders reacting to news that inventories were lower than expected. • US light, sweet crude oil rose to a fresh high of $114.95 before slipping back to $112.63.

Correct order - 312 • US light, sweet crude oil rose to a fresh high of $114.95 before slipping back to $112.63. • Meanwhile, Brent crude hit an all-time peak of $112.73 before falling back. • Prices initially rose when the report was released with traders reacting to news that inventories were lower than expected.

Why are we able to do that? • If humans try to communicate in the most efficient way, they should produce language: Humans as rational agents who optimize the flow of information in language production

Prediction • Uniform Information Density (UID)

Good Encoding: An engineering perspective • The most efficient way of communicating through a noisy channel is to send information at a constant rate (Information Theory, Shannon 1948).

But… • No good models of the information of a sentence in context exist However… • Methods from natural language processing provide reasonably good estimates of out-of-context information of sentences

Revised Prediction • Intuitively, less contextual information is available at the beginning of a discourse. • If speakers/writers communicate efficiently, early sentences should be made more predictable (easier for listeners). • The out-of-context information at the beginning of a discourse should be lower than later in the discourse.
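The step from UID to this revised prediction can be written out with a standard entropy identity, in the spirit of Genzel & Charniak (2002). The symbols are notation assumed here for illustration, not taken from the slides: S_i is the i-th sentence of a discourse and C_i is the context that precedes it.

```latex
\begin{align}
  \underbrace{H(S_i \mid C_i)}_{\text{in-context information}}
  &= \underbrace{H(S_i)}_{\text{out-of-context information}}
   - \underbrace{I(S_i ; C_i)}_{\text{information supplied by context}}
\end{align}
```

If UID holds, the left-hand side stays roughly constant across i. The context C_i grows as the discourse proceeds, so I(S_i ; C_i) is expected to grow with i, which means the out-of-context term H(S_i) has to grow as well. That out-of-context term is the quantity an n-gram model can estimate.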

Revised Prediction

Good Encoding: Evidence • Genzel & Charniak (2002) provided evidence for the hypothesis of uniform information by analyzing English discourse. • They found that: • The out-of-context information of a sentence increases with its sentence number in a discourse. • The increase is due to both lexical (what words are used) and non-lexical (how words are used) factors.

Outline: The current work • Evaluate UID on Chinese written corpora by measuring information content. • Evaluate UID on a Chinese English (Chinglish) corpus • Ultimately: why is Chinese English harder to understand for native English speakers, but relatively easy for native Chinese speakers?

Study #1 – UID in Chinese • Four corpora are used • XIN – Beijing Xinhua News • SINO – Taiwan Sinorama Magazine • HK – Hong Kong News (too little data) • VOA – Voice of America Chinese News • We build n-gram language models to measure the (un)predictability of written Chinese sentences.

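N-gram Language Model • Example sentence: 二十年前，许多中国家庭的梦想是拥有一部电话。 • Word-by-word gloss: Twenty year ago, many Chinese family 's dream is have a piece telephone. ("Twenty years ago, the dream of many Chinese families was to own a telephone.") • Trigrams: 二十年前 / 年前， / 前，许多 / ，许多中国 / …... / 部电话。 • Example probability: P(二十年前) = 0.1%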
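A minimal sketch of how a trigram model can turn a segmented sentence into an information estimate in bits per word. The add-one smoothing, the `<s>` padding symbols, and the names `trigrams` and `TrigramModel` are assumptions made for this illustration; the slides do not say which smoothing method or toolkit was used.

```python
import math
from collections import Counter

def trigrams(tokens):
    """Return one trigram per token, padding the start with two <s> symbols."""
    padded = ["<s>", "<s>"] + list(tokens)
    return [tuple(padded[i:i + 3]) for i in range(len(padded) - 2)]

class TrigramModel:
    """Add-one smoothed trigram model (the smoothing choice is an assumption)."""

    def __init__(self, sentences):
        self.tri = Counter()   # counts of (w1, w2, w3)
        self.bi = Counter()    # counts of the context (w1, w2)
        self.vocab = set()
        for tokens in sentences:
            self.vocab.update(tokens)
            for gram in trigrams(tokens):
                self.tri[gram] += 1
                self.bi[gram[:2]] += 1

    def prob(self, gram):
        """P(w3 | w1, w2), reserving one extra vocabulary slot for unseen words."""
        v = len(self.vocab) + 1
        return (self.tri[gram] + 1) / (self.bi[gram[:2]] + v)

    def bits_per_word(self, tokens):
        """Average information content of a sentence, in bits per word."""
        grams = trigrams(tokens)
        if not grams:
            return 0.0
        return -sum(math.log2(self.prob(g)) for g in grams) / len(grams)
```

Trained on a corpus of segmented sentences, `bits_per_word` gives the per-sentence (un)predictability score that can then be related to sentence position. A sentence whose trigrams were seen in training scores lower than the same words in an unseen order.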

N-gram Language Model • Lexicalized part-of-speech n-gram 二十_CD 年_M 前_LC ，_PU 许多_CD 中国_NR 家庭_NN 的_DEG 梦想_NN 是_VC 拥有_VV 一_CD 部_M 电话_NN 。_PU
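As a rough illustration of the input format above, the tagged sentence can be split into (word, tag) pairs, from which n-grams can be built over tags alone or over the joint word/tag units. The helper names `parse_tagged` and `ngrams` are hypothetical, and the slides do not specify exactly how words and tags were combined in the model.

```python
def parse_tagged(line):
    """Split 'word_TAG word_TAG ...' into a list of (word, tag) pairs."""
    pairs = []
    for item in line.split():
        # Split on the last underscore so a word containing '_' stays intact.
        word, _, tag = item.rpartition("_")
        pairs.append((word, tag))
    return pairs

def ngrams(seq, n):
    """Return all contiguous n-grams of a sequence as tuples."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]

line = "二十_CD 年_M 前_LC ，_PU 许多_CD 中国_NR 家庭_NN"
pairs = parse_tagged(line)

# Non-lexical view: trigrams over part-of-speech tags only.
tag_trigrams = ngrams([tag for _, tag in pairs], 3)

# Lexicalized view: trigrams over the joint (word, tag) units.
unit_trigrams = ngrams(pairs, 3)
```

Comparing models built on the two views is one way to separate lexical from non-lexical contributions to information content.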

Global effects • With respect to an entire document • Sentence effect in a document • Paragraph effect in a document

Sentence effect

Paragraph effect

Local effects • With respect to the immediate containing domain of the linguistic unit in question. • Predictors 1. Sentence position in paragraph 2. Paragraph position in document 3. Word position in sentence Multiple regression on the above three predictors
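A minimal sketch of a multiple regression on the three positional predictors, written in plain Python via the normal equations. It assumes one observation per word, each carrying its sentence position in the paragraph, paragraph position in the document, and word position in the sentence; the function names and this data layout are assumptions for illustration, not the analysis code behind the slides.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def fit_position_effects(rows, info):
    """Least-squares fit of info on (sentence_pos, paragraph_pos, word_pos).

    Returns [intercept, b_sentence, b_paragraph, b_word].
    """
    design = [[1.0] + [float(v) for v in row] for row in rows]
    k = 4
    xtx = [[sum(x[i] * x[j] for x in design) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * y for x, y in zip(design, info)) for i in range(k)]
    return solve_linear(xtx, xty)
```

A positive coefficient on a predictor means that information content rises with that position while the other two are held fixed, which is the pattern UID predicts for local effects.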

Sentences & Paragraphs Sentence position in paragraph

Words in Sentences Information goes up and converges (after removal of early words) • Limited amount of context information available.

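N-gram Language Model • Example sentence: 二十年前，许多中国家庭的梦想是拥有一部电话。 • Word-by-word gloss: Twenty year ago, many Chinese family 's dream is have a piece telephone. ("Twenty years ago, the dream of many Chinese families was to own a telephone.") • Trigrams: 二十年前 / 年前， / 前，许多 / ，许多中国 / …... / 部电话。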

Summary • We replicated Genzel & Charniak’s study on Chinese corpora. • Sentence effect within documents is not found. • However: • Paragraph effect within documents is consistent with UID. • Sentence effect within paragraphs is also found. • Due to the size of data, effects are observable only early in discourse (viable cut-offs are low).

Summary • We are the first to look at the effect of word position within sentences. • Information content increases with word position. • Context estimation leads to early convergence. • Does increase of information only occur locally in Chinese? • Current data seem to support this idea.

Discussion • Writing style? Could be. • Chinese – Summarization & Expansion • English – Narrative style

Study #2 – UID in Chinese English • A collection of English essays written by native Chinese speakers. • Corpus of English as a Second Language (CESL) • We trained a language model on the Brown Corpus (American English) and used that model to measure the information content of Chinese English sentences.

Preliminary Results XIN: - p<0.001*** CESL: - p=0.0167 *

Observations • The average information content is much higher in Chinese English (8.2~8.4) than in Chinese (4.5~5.0). • It is also higher than the information content of English, which converges at 7.0 bits (Piantadosi, CUNY 2008).

Summary • Chinese, English, and Chinglish • Globally, Chinglish essays also fail to exhibit the information distribution predicted by UID. • Further studies are needed to discover more properties of Chinglish. • Possible reasons why Chinglish is harder to understand • Higher information content • Again, writing style

Discussion Questions?
