This paper discusses the back-transliteration problem in representing Chinese place names in English texts using Pinyin codes. It proposes a GeoName system and evaluates its approach. The paper also highlights issues and ambiguities involved in cross-language referencing of Chinese place names.
GeoName: a system for back-transliterating Pinyin place names
Kui-Lam Kwok & Qiang Deng, Computer Science Dept., Queens College, City University of New York
email: kwok@ir.cs.qc.edu, peterqc@yahoo.com
Or: issues involving cross language referencing of a Chinese place by name
Content: 1. Back-transliteration problem 2. GeoName system - a proposed approach 3. Evaluation 4. Observation/conclusion
Transliteration:
• 'alphabet mismatch' when expressing Chinese (place) names in English texts
• names represented by PRC Pinyin code, e.g. Beijing, Shenzhen
Back-Transliteration: given the Pinyin code, what are the original Chinese characters?
Back-Transliteration: why are Chinese characters needed?
• remove ambiguity of the referenced Pinyin place
• reconcile names in English & Chinese texts
• may assist alignment in E/C parallel texts
• necessary for E-C Cross Language IR (when translating English queries containing Pinyin place, person, organization names)
4 Possible Ambiguities in English–Chinese cross language place name references
Ambiguity #3: Back-transliteration --> which character string is correct?
e.g. China's capital in Chinese: 北京
• PRC Pinyin (1 char, 1 syllable): 北 --> bei, 京 --> jing
• map back from Pinyin to characters:
  bei --> {北,贝,被,背,碑,杯,备,鐾, …} (total 23)
  jing --> {京,景,井,静,敬,竞,精,荆, …} (total 20)
• ambiguous candidates: 北井, 贝京, 贝荆, 北京, …
• which one?
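The candidate explosion on this slide can be sketched as a Cartesian product over per-syllable character sets. This is only an illustration: the table `PINYIN_TO_CHARS` and the function `candidates` are hypothetical names, and the table holds just the eight characters per syllable that the slide prints, not the full 23 and 20.

```python
from itertools import product

# Toy excerpt of a Pinyin -> character table. The slide reports 23 characters
# for "bei" and 20 for "jing"; only the eight printed there are used here.
PINYIN_TO_CHARS = {
    "bei": list("北贝被背碑杯备鐾"),
    "jing": list("京景井静敬竞精荆"),
}

def candidates(syllables):
    """Return every character string obtainable from the syllable sequence."""
    pools = [PINYIN_TO_CHARS[s] for s in syllables]
    return ["".join(chars) for chars in product(*pools)]

cands = candidates(["bei", "jing"])
print(len(cands))          # 8 * 8 combinations with this excerpt
print("北京" in cands)      # the correct name is only one of many candidates
```

With the full sets the same product would give 23 x 20 = 460 two-character candidates, which is why a ranking step is needed.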
Ambiguity #4: Name Reference --> same name, different places
Suppose the result of back-transliteration is beijing --> 贝荆; then which 贝荆? (longitude, latitude)
Ambiguity #1: E/C Pinyin Systems --> which Pinyin system was used?
e.g. 'Hong Kong' in characters: 香港
• PRC Pinyin: 香 -> xiang, 港 -> gang
• Wade-Giles: 香 -> hsiang, 港 -> kang
• Hong Kong: 香 -> hong, 港 -> kong
• …
Back-transliterating 'hong kong' using PRC Pinyin:
  hong -> {红洪鸿宏虹弘泓闳烘项黉哄 …} (19)
  kong -> {孔空恐崆控箜倥} (7)
The original '香港' is NOT one of these 7 x 19 = 133 combinations!
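A small sketch of why the choice of romanization table matters, using the Hong Kong example. The dictionary `TABLES`, its keys, and the function `expand` are hypothetical; the "PRC" entries list only the characters printed on the slide (12 of the 19 for hong), and the "HK" table contains only the two mappings the slide gives.

```python
from itertools import product

# Hypothetical, partial tables built only from the characters on the slide.
TABLES = {
    "PRC": {
        "hong": list("红洪鸿宏虹弘泓闳烘项黉哄"),
        "kong": list("孔空恐崆控箜倥"),
    },
    "HK": {"hong": ["香"], "kong": ["港"]},
}

def expand(syllables, system):
    """All character strings for the syllables under the chosen system."""
    pools = [TABLES[system][s] for s in syllables]
    return {"".join(chars) for chars in product(*pools)}

name = ["hong", "kong"]
print("香港" in expand(name, "PRC"))  # not reachable through the PRC table
print("香港" in expand(name, "HK"))   # reachable only with the matching system
```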
Ambiguity #2: Syllable Segmentation --> which segmentation is correct?
e.g. 秦皇岛, possible Pinyin writing styles:
• Qin Huang Dao
• QinHuangDao
• Qinhuangdao <-- most common, used in NYT
--> how many syllables?
• Qin huang dao: 3 char
• Qin huang da o: 4 char
• Qin hu ang dao: 4 char
• Qin hu ang da o: 5 char
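The four segmentations above can be reproduced with a short recursive enumeration. This is a minimal sketch: `SYLLABLES` is a hypothetical, tiny inventory containing only the syllables needed for this example, whereas a real system would load the full Pinyin syllable list. Sorting by length puts the fewest-syllable reading first.

```python
# Hypothetical, tiny syllable inventory for this one example only.
SYLLABLES = {"qin", "huang", "hu", "ang", "dao", "da", "o"}

def segmentations(s, syllables=SYLLABLES):
    """Enumerate every way to split s into known syllables."""
    if not s:
        return [[]]
    results = []
    for end in range(1, len(s) + 1):
        head = s[:end]
        if head in syllables:
            for tail in segmentations(s[end:], syllables):
                results.append([head] + tail)
    return results

# Fewest syllables first.
ranked = sorted(segmentations("qinhuangdao"), key=len)
for seg in ranked:
    print(len(seg), " ".join(seg))
```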
Summary: given a Pinyin geographic name
1. Pinyin system -- which?
2. segmentation -- how many syllables?
3. back-transliterate -- which candidate character string?
4. resolve same name, different places.
GeoName: a system for back-transliterating Pinyin place names
GeoName: E-C cross language place reference
1. which Pinyin system? -- user chooses; or allow both PY & WG
2. how many segmented syllables? -- fewest syllables ranked first
3. back-transliterate: which candidate? -- a) bi-list; b) confirm by web/Chinese place list; c) rank candidates by frequency
4. resolve same name, different places -- not considered
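GeoName – Given an English Pinyin place name $E = e_1 e_2 \dots e_n$ ($n$ syllables), there are many possible Chinese character string candidates $C = c_1 c_2 \dots c_n$. Choose

$$C^* = \arg\max_C P(C \mid E) = \arg\max_C P(E \mid C)\,P(C) \approx \arg\max_C P(C),$$

by assuming

$$P(E \mid C) \approx \prod_i P(e_i \mid C) \quad \text{(i.e. } e_i, e_k \text{ independent)}$$
$$\approx \prod_i P(e_i \mid c_i) \quad \text{(i.e. } e_i, c_k \text{ independent)}$$
$$\approx 1 \quad \text{(i.e. all } c_i \text{ map to unique } e_i\text{)}$$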
GeoName – P(C) = language model of Chinese place names
• training data: obtained by processing the TREC and NTCIR Chinese collections with BBN IdentiFinder, giving approximately 80K unique place names
• use P(C) to sort candidates; candidates with fewer syllables are ranked earlier
• a bigram model, P(c2|c1)P(c3|c2)…, did not work very well
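One simple way to read "use P(C) to sort candidates" is a whole-string relative-frequency model over a list of known place names. The sketch below is an assumption, not the authors' implementation: `place_name_model` is a hypothetical helper, and the five-item `training` list is made-up data standing in for the ~80K extracted names.

```python
from collections import Counter

def place_name_model(names):
    """Relative-frequency estimate of P(C) over whole place-name strings."""
    counts = Counter(names)
    total = sum(counts.values())
    return lambda c: counts[c] / total

# Made-up training list standing in for names extracted from a corpus.
training = ["北京", "北京", "北京", "深圳", "北井"]
p = place_name_model(training)

candidates = ["贝京", "北井", "北京", "贝荆"]
# Shorter strings (fewer syllables) first, then higher P(C).
ranked = sorted(candidates, key=lambda c: (len(c), -p(c)))
print(ranked)
```

Unseen strings get probability zero here; the frequency formula on the next slide addresses that by also using bigram and character counts.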
GeoName – A heuristic weighting formula based on whole-string, bigram and character frequencies:

$$g(C) = a_1 \log[f(C) + a_1] + a_2 \log[f(c_i c_j) + a_2] + a_3 \log[f(c_i) + a_3]$$

• a factor is ignored if $f(\cdot) = 0$
• $a_1 > a_2 > a_3$
• the term $a_1 \log[f(C) + a_1]$ means a string seen before is probably correct
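A rough sketch of how such a score could be computed. Several things here are assumptions, not taken from the slides: the weights (3, 2, 1) are hypothetical values that merely satisfy a1 > a2 > a3, the bigram and character terms are summed over all adjacent pairs and all characters of the candidate, and the frequency tables are toy numbers.

```python
import math

def g(candidate, f_string, f_bigram, f_char, a=(3.0, 2.0, 1.0)):
    """Heuristic score from whole-string, bigram and character frequencies.

    Assumptions: weights `a` are hypothetical; bigram and character terms
    are summed over the candidate; a term is skipped when its frequency is 0.
    """
    a1, a2, a3 = a
    score = 0.0
    f = f_string.get(candidate, 0)
    if f > 0:
        score += a1 * math.log(f + a1)
    for i in range(len(candidate) - 1):
        f = f_bigram.get(candidate[i:i + 2], 0)
        if f > 0:
            score += a2 * math.log(f + a2)
    for ch in candidate:
        f = f_char.get(ch, 0)
        if f > 0:
            score += a3 * math.log(f + a3)
    return score

# Toy frequency tables (made up for illustration).
f_string = {"北京": 50}
f_bigram = {"北京": 60}
f_char = {"北": 200, "京": 150, "贝": 30}

for cand in ["北京", "贝京", "鐾荆"]:
    print(cand, round(g(cand, f_string, f_bigram, f_char), 3))
```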
Evaluation: use the frequency formula only, on 162 Pinyin city names from a bilingual map (no bilingual pair list was employed)
Examples of Correct Names ranked #1
• Daqiu (大丘), Wanbi (湾碧), Gongzhuling (公主岭), ..
Examples of Failed Names
• Non-Pinyin: Qarqi (察尔齐), Yengisar (阳霞), Jorra (觉拉), Dongkar (洞嘎), ..
• mainly longer names: Tuolu (驮芦), Fenglingguan (枫岭关), Qingguandu (清官渡), Dating (大亭), Shasonggang (杉松岗), Denglonghe (灯笼河), ..
GeoName – further improvement
Hypothesis: prefer candidate strings that have been seen before as location names.
Confirm candidates on:
• a bilingual list (~4K), tag: 100
  ftp://ftpserver.ciesin.columbia.edu/pub/data/China/CITAS/gb_code/
• a Chinese monolingual place name list (~80K+4K), tag: 010
• web data via Google search, tag: 001
GeoName – flowchart
1. Pinyin place name input; user indicates PRC or WG system.
2. Pinyin segmentation; map to all possible GB character strings (tag 000).
3. Bilingual table (4K) lookup (tag 100).
4. Merge GB candidates.
5. Monolingual name list (84K) confirmation (tags 110, 010).
6. WWW confirmation (tags 101, 001, 111, 011).
7. Evaluate weight g(C); rank according to: (1) tag, (2) name character length, (3) g(C).
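The final ranking step can be sketched as a three-part sort key. The helpers `make_tag` and `rank` are hypothetical, and two readings are assumed rather than stated in the slides: a higher tag (more confirmations, read as a binary number) ranks first, and shorter names rank before longer ones. The three sample rows are taken from the Chagugang example later in the deck.

```python
def make_tag(in_bilingual, in_monolingual, on_web):
    """Build the 3-bit tag string: bilingual list, monolingual list, web."""
    bits = (int(in_bilingual) << 2) | (int(in_monolingual) << 1) | int(on_web)
    return format(bits, "03b")

def rank(cands):
    """cands: list of (tag, g_score, name) tuples.

    Assumed order: higher tag first, then shorter name, then higher g(C).
    """
    return sorted(cands, key=lambda c: (-int(c[0], 2), len(c[2]), -c[1]))

rows = [
    ("000", 15.68423107, "查古港"),
    ("000", 9.24647942, "诧古港"),
    (make_tag(False, False, True), 1.38629436, "汊沽港"),
]
for tag, score, name in rank(rows):
    print(tag, score, name)
```

Under these assumptions the web-confirmed candidate 汊沽港 comes first even though its g(C) is the lowest, which matches the ordering shown in that example.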
GeoName – Evaluation
Evaluate system result using:
• tag=000: rank by g(C)
• tag=001: web confirmation + g(C)
• tag=010: mono-list confirmation + g(C)
• tag=111: bi-list + all above
Example of back-transliteration: web & no-web

Tag = 111 (with web confirmation), input: Chagugang
tag   g(C)          candidate
001   1.38629436    汊沽港
000   15.68423107   查古港
000   9.24647942    诧古港
000   9.24647942    岔古港
000   8.55333224    锸古港
000   8.55333224    槎古港
000   8.55333224    楂古港
000   8.55333224    汊古港
000   8.55333224    嚓古港
000   8.55333224    刹古港

Tag = 110 (without web confirmation), input: Chagugang
tag   g(C)          candidate
000   15.68423107   查古港
000   9.24647942    诧古港
000   9.24647942    岔古港
000   8.55333224    锸古港
000   8.55333224    槎古港
000   8.55333224    楂古港
000   8.55333224    汊古港
000   8.55333224    嚓古港
000   8.55333224    刹古港
000   8.55333224    差古港
Examples:

Luliangqu (district/region)
tag   g(C)          candidate
010   40.02587171   吕梁区
000   9.24647942    吕梁瞿
000   9.24647942    吕梁衢
000   9.24647942    吕梁渠
000   9.24647942    吕梁曲
000   9.24647942    陆良瞿
000   9.24647942    陆良衢
000   9.24647942    陆良渠
000   9.24647942    陆良曲
000   9.24647942    陆良区

Xiaoyishi (city)
tag   g(C)          candidate
110   40.18588115   孝义市
000   9.24647942    孝尾市
000   9.24647942    萧尾市
000   8.55333224    箫尾市
000   8.55333224    筱尾市
000   8.55333224    骁尾市
000   8.55333224    潇尾市
000   8.55333224    崤尾市
000   8.55333224    哓尾市
000   8.55333224    效尾市

Yimaxiang (village)
tag   g(C)          candidate
000   15.68423107   义马乡
000   9.24647942    义马缃
000   9.24647942    义马巷
000   9.24647942    义马祥
000   9.24647942    义马湘
000   9.24647942    义马襄
000   9.24647942    义马香
000   9.24647942    伊玛缃
000   9.24647942    伊玛巷
000   9.24647942    伊玛祥

Mengnanzhuang (place)
tag   g(C)          candidate
000   14.95494484   蒙南庄
000   8.51719319    懵南庄
000   8.51719319    孟南庄
000   8.51719319    盟南庄
000   8.51719319    萌南庄
000   7.82404601    虻南庄
000   7.82404601    勐南庄
000   7.82404601    梦南庄
000   7.82404601    猛南庄
000   7.82404601    锰南庄
Conclusion:
• reasonable back-transliteration results for map cities
• longer names (>2 char): more errors
• non-Pinyin names: does not work
Future Work:
• increase training data
• improve ranking function
• direct translation (not just confirmation) using the web
• better/more realistic evaluation
If interested: can demonstrate GeoName (needs Linux re-boot) Try GeoName at: http://post.cs.qc.edu/spell2gb/ (needs Chinese character display) feedback appreciated Thank You!