Investigating Information Distribution in Chinese and Chinese English Ting Qian Human Language Processing Lab Brain and Cognitive Sciences
Acknowledgements • Dr. T. Florian Jaeger • My father • My friends who have voluntarily given me their Chinglish essays • People at HLP lab
Reorder the sentences • Meanwhile, Brent crude hit an all-time peak of $112.73 before falling back. • Prices initially rose when the report was released, with traders reacting to news that inventories were lower than expected. • US light, sweet crude oil rose to a fresh high of $114.95 before slipping back to $112.63.
Correct order: 3, 1, 2 • US light, sweet crude oil rose to a fresh high of $114.95 before slipping back to $112.63. • Meanwhile, Brent crude hit an all-time peak of $112.73 before falling back. • Prices initially rose when the report was released, with traders reacting to news that inventories were lower than expected.
Why are we able to do that? • If humans try to communicate in the most efficient way, they should produce language as rational agents who optimize the flow of information in language production.
Prediction • Uniform Information Density (UID)
Good Encoding: An engineering perspective • The most efficient way of communicating through a noisy channel is to send information at a constant rate (Information Theory, Shannon 1948).
But… • No good models of the information of a sentence in context exist However… • Methods from natural language processing provide reasonably good estimates of out-of-context information of sentences
Revised Prediction • Intuitively, less contextual information is available at the beginning of a discourse. • If speakers/writers communicate efficiently, early sentences should be made more predictable (easier for listeners). • The out-of-context information at the beginning of a discourse should therefore be lower than later in the discourse.
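The step from UID to the revised prediction can be written out as a short derivation. This is a sketch using the standard entropy identity; the symbols $Y_i$ (the $i$-th sentence), $C_i$ (its preceding context), and the trigram estimator in the last line are notation assumed here, not taken from the slides.

```latex
% Information of sentence Y_i given its context C_i:
H(Y_i \mid C_i) = H(Y_i) - I(Y_i ; C_i)

% UID: the in-context information is roughly constant across positions i.
% The context grows with i, so I(Y_i ; C_i) should increase with i.
% Hence the out-of-context term H(Y_i) must also increase with i.

% Out-of-context information of a sentence s = w_1 \dots w_n,
% estimated per word with a trigram model (in bits):
\hat{H}(s) = -\frac{1}{n} \sum_{k=1}^{n} \log_2 P(w_k \mid w_{k-2}, w_{k-1})
```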
Good Encoding: Evidence • Genzel & Charniak (2002) provided evidence for the hypothesis of uniform information by analyzing English discourse. • They found that: • The information of a sentence increases with its sentence number in a discourse. • The increase is due to both lexical (what words are used) and non-lexical (how words are used) factors.
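The kind of analysis described here, information as a function of sentence number, can be sketched as follows. This is a minimal illustration, not the procedure of Genzel & Charniak; the function name `mean_bits_by_position`, the `cutoff` parameter, and the input format (one list of per-sentence bit values per document) are assumptions.

```python
from collections import defaultdict

def mean_bits_by_position(documents, cutoff):
    """Average per-sentence information at each sentence number (1-based).

    documents: list of documents, each a list of per-sentence information
               values in bits, in document order (hypothetical input format).
    cutoff:    only the first `cutoff` sentences of each document are used.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for doc in documents:
        for position, bits in enumerate(doc[:cutoff], start=1):
            sums[position] += bits
            counts[position] += 1
    return {pos: sums[pos] / counts[pos] for pos in sorted(sums)}
```

Under UID, the returned averages would be expected to rise with sentence number; the cutoff keeps late positions, which only a few long documents reach, from dominating the picture.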
Outline: The current work • Evaluate UID on Chinese written corpora by measuring information content. • Evaluate UID on a Chinese English (Chinglish) corpus • Ultimately: why is Chinese English harder to understand for native English speakers, but relatively easy for native Chinese speakers?
Study #1 – UID in Chinese • Four corpora are used • XIN – Beijing Xinhua News • SINO – Taiwan Sinorama Magazine • HK – Hong Kong News (too little data) • VOA – Voice of America Chinese News • We build n-gram language models to measure the (un)predictability of written Chinese sentences.
N-gram Language Model • Sentence: 二十 年 前 ,许多 中国 家庭 的 梦想 是 拥有 一 部 电话 。 • Word-by-word gloss: Twenty year ago, many Chinese family 's dream is have a piece telephone. • Trigrams: 二十 年 前 | 年 前 , | 前 ,许多 | ,许多 中国 | …... | 部 电话 。 • Example: P(二十 年 前) = 0.1%
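The trigram windows shown on this slide, and a per-word information estimate built from them, can be sketched as below. This is a toy illustration: the add-one smoothing, the `<s>`/`</s>` padding symbols, and the class and method names are assumptions, and the slides do not say which smoothing or toolkit was actually used.

```python
import math
from collections import Counter

def trigrams(tokens):
    """Return every window of three consecutive tokens."""
    return [tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)]

class TrigramModel:
    """Toy trigram model with add-one smoothing (an assumed choice)."""

    def __init__(self, sentences):
        self.tri = Counter()   # counts of (w1, w2, w3)
        self.bi = Counter()    # counts of the context (w1, w2)
        self.vocab = set()
        for sent in sentences:
            padded = ["<s>", "<s>"] + sent + ["</s>"]
            self.vocab.update(padded)
            for a, b, c in trigrams(padded):
                self.tri[(a, b, c)] += 1
                self.bi[(a, b)] += 1

    def prob(self, a, b, c):
        """Smoothed estimate of P(c | a, b)."""
        return (self.tri[(a, b, c)] + 1) / (self.bi[(a, b)] + len(self.vocab))

    def bits_per_word(self, sentence):
        """Average surprisal, -log2 P, over the sentence's trigrams."""
        grams = trigrams(["<s>", "<s>"] + sentence + ["</s>"])
        return -sum(math.log2(self.prob(*g)) for g in grams) / len(grams)

tokens = "二十 年 前 , 许多 中国 家庭 的 梦想 是 拥有 一 部 电话 。".split()
model = TrigramModel([tokens])
```

A sentence the model has seen gets a lower `bits_per_word` than the same words in reversed order, which is the sense in which the model measures (un)predictability.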
N-gram Language Model • Lexicalized part-of-speech n-gram 二十_CD 年_M 前_LC ,_PU 许多_CD 中国_NR 家庭_NN 的_DEG 梦想_NN 是_VC 拥有_VV 一_CD 部_M 电话_NN 。_PU
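One way to read the tagged sentence above is as a sequence of `word_TAG` units from which both tag-only and lexicalized n-grams can be drawn. The sketch below assumes that reading; the helper names `parse_tagged` and `ngrams` are hypothetical, and the slides do not specify how the lexicalized model was actually built.

```python
def parse_tagged(line):
    """Split whitespace-separated 'word_TAG' units into (word, tag) pairs."""
    return [tuple(unit.rsplit("_", 1)) for unit in line.split()]

def ngrams(items, n):
    """Return every window of n consecutive items."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

tagged = ("二十_CD 年_M 前_LC ,_PU 许多_CD 中国_NR 家庭_NN 的_DEG "
          "梦想_NN 是_VC 拥有_VV 一_CD 部_M 电话_NN 。_PU")

pairs = parse_tagged(tagged)
tags = [tag for _, tag in pairs]

# Non-lexical view: n-grams over part-of-speech tags only.
tag_trigrams = ngrams(tags, 3)
# Lexicalized view: n-grams over the combined word_TAG units.
lexicalized_trigrams = ngrams(tagged.split(), 3)
```

Comparing models built on the two views is one way to separate lexical effects (which words are used) from non-lexical ones (how they are used).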
Global effects • With respect to an entire document • Sentence effect in a document • Paragraph effect in a document
Local effects • With respect to the immediate containing domain of the linguistic unit in question. • Predictors 1. Sentence position in paragraph 2. Paragraph position in document 3. Word position in sentence Multiple regression on the above three predictors
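A multiple regression on the three predictors can be sketched with ordinary least squares. This is a plain-Python illustration that solves the normal equations directly; the function name `ols` and the input layout are assumptions, and the data in the usage example is synthetic, not drawn from the corpora in this study.

```python
def ols(rows, y):
    """Least-squares fit of y on an intercept plus the given predictors.

    rows: list of predictor tuples, e.g. (sentence position in paragraph,
          paragraph position in document, word position in sentence).
    y:    list of observed information values, one per row.
    Returns [intercept, coef_1, coef_2, ...].
    """
    X = [[1.0] + [float(v) for v in r] for r in rows]
    k = len(X[0])
    # Augmented normal equations [X'X | X'y].
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * yi for x, yi in zip(X, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

# Synthetic, noise-free data generated from known coefficients.
rows = [(s, p, w) for s in range(1, 4) for p in range(1, 4) for w in range(1, 5)]
y = [5.0 + 0.2 * s + 0.1 * p + 0.05 * w for s, p, w in rows]
coefficients = ols(rows, y)
```

A positive coefficient on a position predictor would indicate that information increases with that position once the other two predictors are held fixed. A real analysis would also need standard errors and significance tests, which this sketch omits.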
Sentences & Paragraphs • [Figure: information plotted against sentence position in paragraph]
Words in Sentences • Information goes up and converges (after removal of early words). • Limited amount of context information available.
N-gram Language Model • Sentence: 二十 年 前 ,许多 中国 家庭 的 梦想 是 拥有 一 部 电话 。 • Word-by-word gloss: Twenty year ago, many Chinese family 's dream is have a piece telephone. • Trigrams: 二十 年 前 | 年 前 , | 前 ,许多 | ,许多 中国 | …... | 部 电话 。
Summary • We replicated Genzel & Charniak's study on Chinese corpora. • A sentence effect within documents is not found. • However: • The paragraph effect within documents is consistent with UID. • A sentence effect within paragraphs is also found. • Because of the limited size of the data, effects are observable only early in the discourse (viable cut-offs are low).
Summary • We are the first to look at the effect of word position within sentences. • Information content increases with word position. • Context estimation leads to early convergence. • Does increase of information only occur locally in Chinese? • Current data seem to support this idea.
Discussion • Writing style? Could be. • Chinese – Summarization & Expansion • English – Narrative style
Study #2 – UID in Chinese English • A collection of English essays written by native Chinese speakers. • Corpus of English as a Second Language (CESL) • We trained a language model on the Brown Corpus (American English) and used the model to measure the information content of Chinese English sentences.
Preliminary Results XIN: - p<0.001*** CESL: - p=0.0167 *
Observations • The average information content is much higher in Chinese English (8.2~8.4 bits) than in Chinese (4.5~5.0 bits). • It is also higher than the information content of English, which converges at 7.0 bits (Piantadosi, CUNY 2008).
Summary • Chinese, English, and Chinglish • Globally, Chinglish essays also fail to exhibit the information distribution predicted by UID. • Further studies are needed to discover more properties of Chinglish. • Possible reasons why Chinglish is harder to understand • Higher information content • Again, writing style
Discussion Questions?