
Investigating Information Distribution in Chinese and Chinese English

Investigating Information Distribution in Chinese and Chinese English. Ting Qian, Human Language Processing Lab, Brain and Cognitive Sciences. Acknowledgements: Dr. T. Florian Jaeger; my father; my friends who have voluntarily given me their Chinglish essays; people at HLP lab.


Presentation Transcript


  1. Investigating Information Distribution in Chinese and Chinese English Ting Qian Human Language Processing Lab Brain and Cognitive Sciences

  2. Acknowledgements • Dr. T. Florian Jaeger • My father • My friends who have voluntarily given me their Chinglish essays • People at HLP lab

  3. Reorder the sentences • Meanwhile, Brent crude hit an all-time peak of $112.73 before falling back. • Prices initially rose when the report was released, with traders reacting to news that inventories were lower than expected. • US light, sweet crude oil rose to a fresh high of $114.95 before slipping back to $112.63.

  4. Correct order: 3, 1, 2 • US light, sweet crude oil rose to a fresh high of $114.95 before slipping back to $112.63. • Meanwhile, Brent crude hit an all-time peak of $112.73 before falling back. • Prices initially rose when the report was released, with traders reacting to news that inventories were lower than expected.

  5. Why are we able to do that? • If humans try to communicate in the most efficient way, they should produce language: Humans as rational agents who optimize the flow of information in language production

  6. Prediction • Uniform Information Density (UID)

  7. Good Encoding: An engineering perspective • The most efficient way of communicating through a noisy channel is to send information at a constant rate (Information Theory, Shannon 1948).

  8. But… • No good models of the information of a sentence in context exist However… • Methods from natural language processing provide reasonably good estimates of out-of-context information of sentences

  9. Revised Prediction • Intuitively, less contextual information is available at the beginning of a discourse. • If speakers/writers communicate efficiently, early sentences should be made more predictable (easier for listeners). • The out-of-context information at the beginning of a discourse should therefore be lower than later in the discourse.

  10. Revised Prediction

  11. Good Encoding: Evidence • Genzel & Charniak (2002) provided evidence for the hypothesis of uniform information by analyzing English discourse. • They found that: • Information of sentences increases with sentence numbers in a discourse. • The effect of increase is due to both lexical (what words are used) and non-lexical (how words are used) factors.

  12. Outline: The current work • Evaluate UID on Chinese written corpora by measuring information content. • Evaluate UID on a Chinese English (Chinglish) corpus • Ultimately: why is Chinese English harder to understand for native English speakers, but relatively easy for native Chinese speakers?

  13. Study #1 – UID in Chinese • Four corpora are used • XIN – Beijing Xinhua News • SINO – Taiwan Sinorama Magazine • HK – Hong Kong News (too little data) • VOA – Voice of America Chinese News • We build n-gram language models to measure the (un)predictability of written Chinese sentences.

  14. N-gram Language Model • Sentence: 二十 年 前 , 许多 中国 家庭 的 梦想 是 拥有 一 部 电话 。 • Gloss: Twenty year ago, many Chinese family 's dream is have a piece telephone. • Trigrams: 二十 年 前 | 年 前 , | 前 , 许多 | , 许多 中国 | … | 部 电话 。 • P(二十 年 前) = 0.1%
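The trigram windows on this slide can be reproduced with a few lines of Python. This is a minimal sketch, not the talk's actual pipeline: the helper names `trigrams` and `relative_frequency` are hypothetical, and the frequency is computed over this single sentence only, so it will not match the corpus-level figure of 0.1% quoted on the slide.

```python
from collections import Counter

def trigrams(tokens):
    """Return every window of three consecutive tokens."""
    return [tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)]

# The word-segmented example sentence from the slide (15 tokens).
sentence = "二十 年 前 , 许多 中国 家庭 的 梦想 是 拥有 一 部 电话 。".split()

grams = trigrams(sentence)
counts = Counter(grams)
total = sum(counts.values())

def relative_frequency(gram):
    """Share of all observed trigrams that are exactly `gram` (toy estimate)."""
    return counts[gram] / total

print(grams[0], grams[-1], len(grams))
```

In a real model the counts would come from a whole corpus such as XIN, and unseen trigrams would need smoothing.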

  15. N-gram Language Model • Lexicalized part-of-speech n-gram 二十_CD 年_M 前_LC ,_PU 许多_CD 中国_NR 家庭_NN 的_DEG 梦想_NN 是_VC 拥有_VV 一_CD 部_M 电话_NN 。_PU
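A tagged line in the `word_TAG` format shown above can be split into words and part-of-speech tags before n-grams are collected. The sketch below is illustrative only: `parse_tagged` is a hypothetical helper, and the talk does not say how words and tags are combined inside its model, so the tag-only trigrams are just one possible view of the data.

```python
def parse_tagged(line):
    """Split whitespace-separated 'word_TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # rpartition splits on the last underscore, so a word that itself
        # contains an underscore would still keep its tag intact.
        word, _, tag = token.rpartition("_")
        pairs.append((word, tag))
    return pairs

tagged = ("二十_CD 年_M 前_LC ,_PU 许多_CD 中国_NR 家庭_NN 的_DEG "
          "梦想_NN 是_VC 拥有_VV 一_CD 部_M 电话_NN 。_PU")

pairs = parse_tagged(tagged)
tags = [tag for _, tag in pairs]

# Trigrams over tags only (non-lexical view of the sentence).
tag_trigrams = list(zip(tags, tags[1:], tags[2:]))
print(tag_trigrams[0])
```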

  16. Global effects • With respect to an entire document • Sentence effect in a document • Paragraph effect in a document

  17. Sentence effect

  18. Paragraph effect

  19. Local effects • With respect to the immediate containing domain of the linguistic unit in question. • Predictors 1. Sentence position in paragraph 2. Paragraph position in document 3. Word position in sentence Multiple regression on the above three predictors
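A multiple regression on the three predictors can be sketched in plain Python by solving the normal equations. Everything here is an assumption for illustration: the function `ols` is hypothetical, the rows are synthetic, and the coefficients used to generate the response are made up, not results from the talk.

```python
def ols(rows, y):
    """Ordinary least squares; returns [intercept, b1, b2, ...]."""
    X = [[1.0] + [float(v) for v in r] for r in rows]
    k = len(X[0])
    # Normal equations (X^T X) b = X^T y as an augmented k x (k+1) matrix.
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * yi for x, yi in zip(X, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

# Synthetic rows: (sentence position in paragraph,
#                  paragraph position in document,
#                  word position in sentence).
rows = [(1, 1, 1), (2, 1, 3), (3, 1, 2), (1, 2, 4),
        (2, 2, 1), (3, 2, 5), (1, 3, 2), (2, 3, 6)]
# Made-up information values generated from known coefficients.
y = [4.0 + 0.3 * s + 0.1 * p + 0.05 * w for s, p, w in rows]

coef = ols(rows, y)
print([round(c, 3) for c in coef])
```

A positive fitted coefficient for a predictor would correspond to information rising with that position, which is the direction UID predicts.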

  20. Sentences & Paragraphs Sentence position in paragraph

  21. Words in Sentences Information goes up and converges (after removal of early words) • Limited amount of context information available.

  22. N-gram Language Model • Sentence: 二十 年 前 , 许多 中国 家庭 的 梦想 是 拥有 一 部 电话 。 • Gloss: Twenty year ago, many Chinese family 's dream is have a piece telephone. • Trigrams: 二十 年 前 | 年 前 , | 前 , 许多 | , 许多 中国 | … | 部 电话 。

  23. Summary • We replicated Genzel & Charniak’s study on Chinese corpora. • Sentence effect within documents is not found. • However: • Paragraph effect within documents is consistent with UID. • Sentence effect within paragraphs is also found. • Due to the size of data, effects are observable only early in discourse (viable cut-offs are low).

  24. Summary • We are the first to look at the effect of word position within sentences. • Information content increases with word position. • Context estimation leads to early convergence. • Does increase of information only occur locally in Chinese? • Current data seem to support this idea.

  25. Discussion • Writing style? Could be. • Chinese – Summarization & Expansion • English – Narrative style

  26. Study #2 – UID in Chinese English • A collection of English essays written by native Chinese speakers. • Corpus of English as a Second Language (CESL) • We trained a language model based on the Brown Corpus (American English) and use the model to measure information content of Chinese English sentences.
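The idea of training a model on one corpus and using it to score sentences from another can be sketched as follows. This is a toy under stated assumptions: three short sentences stand in for the Brown Corpus, the model is a bigram model with add-one smoothing (the talk does not specify the order or smoothing used in this study), and `train_bigram` and `bits_per_word` are hypothetical names.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Count contexts and bigrams; return them with the smoothed vocab size."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for s in sentences:
        tokens = ["<s>"] + s.lower().split()
        contexts.update(tokens[:-1])            # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))
        vocab.update(tokens[1:])
    return contexts, bigrams, len(vocab) + 1    # +1 slot for unseen words

def bits_per_word(sentence, contexts, bigrams, vocab_size):
    """Average -log2 P(word | previous word) under add-one smoothing."""
    tokens = ["<s>"] + sentence.lower().split()
    total = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        p = (bigrams[(prev, word)] + 1) / (contexts[prev] + vocab_size)
        total -= math.log2(p)
    return total / (len(tokens) - 1)

# Toy stand-in for the reference (training) corpus.
reference = ["the report was released", "the prices rose", "the report was short"]
model = train_bigram(reference)

familiar = bits_per_word("the report was released", *model)
unusual = bits_per_word("report the released was", *model)
print(familiar < unusual)
```

Word sequences that the reference corpus rarely or never contains receive more bits per word, which is how a model trained on native English can assign higher information content to Chinese English sentences.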

  27. Preliminary Results XIN: - p<0.001*** CESL: - p=0.0167 *

  28. Observations • The average information content is much higher in Chinese English (8.2~8.4) than in Chinese (4.5~5.0). • It is also higher than the information content of English, which converges at 7.0 bits (Piantadosi, CUNY 2008).

  29. Summary • Chinese, English, and Chinglish • Globally, Chinglish essays also fail to exhibit the information distribution predicted by UID. • Further studies are needed to discover more properties of Chinglish. • Possible reasons why Chinglish is harder to understand • Higher information content • Again, writing style

  30. Discussion Questions?
