A Method for Enhancing Search Using Transliteration of Mandarin Chinese
Vijay John
vijayjohn@mail.utexas.edu
Transliterated Mandarin Search • Google suggests a spelling correction
Alternate Transliterations? • Want to say “Did you mean Peiching?”
Transliteration Problems • “Beijing” provides many results • Google doesn’t find “Peiching,” “Peking,” “Bukgyeong,” etc. • Many pages use a variety of transliterations • Transliterations are unorganized • This paper organizes them for Mandarin Chinese
The Problem (Cont’d) • Why variety of transliterations? • Web content: 82% Romanized • Majority’s native languages: other scripts • Standard keyboards • Non-Romanized sources normally transliterated (esp. on Web) • Transliteration variations
Example 1: Tibetan • Four languages: transliteration problems • Hello in Tibetan • Wylie (bkra shis bde legs) • Tibetan Pinyin • Several unofficial systems based on pronunciation • Spelled/transcribed in several ways (with some guidelines)
Example 2: Malayalam • No official transliteration system • Transliteration based on personal preference (many unorganized variations) • Script conversion programs: more consistent systems • /maleja:m/ usu. transcribed “Malayalam” • malayaaLam (Maya), Malajal- (Slavic)
Example 3: Romani • Vlax Romani standard • Literacy → few adopt standard • Different countries, different official languages → different spellings • No official systems (government) • Several transliteration systems exist (often inconsistent)—as in last 2 languages
Example 4: Mandarin • Hànyŭ Pīnyīn • Tōngyòng Pīnyīn • Wade-Giles • Gwoyeu Romatzyh • (Yóuzhèngshì Pīnyīn) (etc.)
Prior Work • In Mandarin: geared towards Chinese users searching for information from West • Western names → Hànzĭ → Hànyŭ Pīnyīn → Hànzĭ • Algorithms designed for Arabic & Japanese transliteration • Google • This method designed for Western users searching for Chinese information
Initial Effort on Mandarin • Practical first step: increased trade with China • Simple transliteration problem (relatively) • Modifications for Tibetan, Romani, Hindustani, etc. • Intact for some other languages? (e.g. Russian, Arabic, Japanese, Korean) • Input = Hànyŭ Pīnyīn; output = other systems
Initial Program • Combined many systems • Ying – yink – yenk – yenk’ – yemk’ – yermk’ – yarmk’ • Instead of “victory,” it searched for the “Yarmuk” River in the Middle East • Transliteration systems organized by row but not by column
Organize into Transliteration Table • Entries for “beijing” in two systems • (Purpose is to go from one column to another)
Decomposition • Search for “Beijing” in table • Not found: delete the last letter; search for “Beijin” • Then Beiji, Beij … B, until a match (“B”) • Search for “eijing” (beijing – b) similarly • “Ei” found, search for “jing” • “J” found, search for “ing”
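The decomposition steps above amount to a greedy longest-prefix match against the table's first column. A minimal Java sketch, assuming a tiny hypothetical fragment of that column (the class name `Decomposer`, the set `HP_COMPONENTS`, and its extra entries are illustrative, not taken from the original program):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class Decomposer {
    // Hypothetical fragment of the Pinyin column; the real table is larger.
    static final Set<String> HP_COMPONENTS = Set.of("b", "ei", "j", "ing", "in", "i", "e");

    // Repeatedly take the longest prefix found in the table, shrinking the
    // candidate one letter at a time from the right, as on the slide.
    static List<String> decompose(String word, Set<String> components) {
        List<String> parts = new ArrayList<>();
        String rest = word.toLowerCase(Locale.ROOT);
        while (!rest.isEmpty()) {
            int end = rest.length();
            while (end > 0 && !components.contains(rest.substring(0, end))) {
                end--; // delete one letter and search again
            }
            if (end == 0) {
                throw new IllegalArgumentException("No table entry matches: " + rest);
            }
            parts.add(rest.substring(0, end));
            rest = rest.substring(end); // continue with the remainder
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(decompose("Beijing", HP_COMPONENTS)); // prints [b, ei, j, ing]
    }
}
```

Greedy matching never backtracks, so this sketch fails on a word whose longest first match leaves an unmatchable remainder. Whether the original program handled that case is not stated.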
Composing new search terms • Components: b, ei, j, ing • b → b, p • ei → ei • j → j, ch • ing → ing
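One way to read the composition step is as a Cartesian product over the alternatives for each component. The Java sketch below uses only the four mappings listed on the slide; the class name `Composer` and the map `ALTERNATIVES` are hypothetical, and the original program may combine components differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class Composer {
    // Alternatives per component, copied from the slide.
    static final Map<String, List<String>> ALTERNATIVES = Map.of(
        "b", List.of("b", "p"),
        "ei", List.of("ei"),
        "j", List.of("j", "ch"),
        "ing", List.of("ing"));

    // Build every combination by extending each partial term with each
    // alternative of the next component.
    static List<String> compose(List<String> components, Map<String, List<String>> table) {
        List<String> results = new ArrayList<>();
        results.add("");
        for (String component : components) {
            // Assumption: a component missing from the table is kept unchanged.
            List<String> options = table.getOrDefault(component, List.of(component));
            List<String> extended = new ArrayList<>();
            for (String prefix : results) {
                for (String option : options) {
                    extended.add(prefix + option);
                }
            }
            results = extended;
        }
        return results;
    }

    public static void main(String[] args) {
        // prints [beijing, beiching, peijing, peiching]
        System.out.println(compose(List.of("b", "ei", "j", "ing"), ALTERNATIVES));
    }
}
```

The full product also yields mixed forms such as “beiching” that belong to no single system. Composing within one table column at a time would avoid them, at the cost of missing spellings that mix conventions.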
Implementation • Java program • After composition, how does algorithm search? • Connects to Google via Google API (Application Programming Interface) • Google searches • 1-2 second delay (due to Google)
Transliteration Patterns • Transliterations organized into table • {"üe", "yue", "yue", "ue", "ve", "üeh", "üeh", "üeh"} • lüe, lyue, lue, lve, lüeh • 3 transliteration systems; at most 5 patterns • First column is Hànyŭ Pīnyīn (e.g. “ing”, “b”, “ei”)
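The row above can be read column by column to produce the five patterns listed for “lüe”. The Java sketch below rests on two assumptions: the eight columns follow the order HP, TP1, TP2, MHP1, MHP2, WG1, WG2, WG3, and the initial “l” is spelled “l” in every column. Only the “üe” row comes from the slide, and the class and method names are hypothetical. The letter ü is written as `\u00fc` so the source compiles regardless of file encoding.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class TransliterationTable {
    // Assumed column order; the first column is the Pinyin lookup key.
    static final String[] COLUMNS = {"HP", "TP1", "TP2", "MHP1", "MHP2", "WG1", "WG2", "WG3"};

    static final String[][] ROWS = {
        // Assumed row: the initial "l" is the same in every column.
        {"l", "l", "l", "l", "l", "l", "l", "l"},
        // Row from the slide; u-umlaut is written with Unicode escapes.
        {"\u00fce", "yue", "yue", "ue", "ve", "\u00fceh", "\u00fceh", "\u00fceh"},
    };

    // Find the row whose first (Pinyin) column equals the component.
    static String[] findRow(String component) {
        for (String[] row : ROWS) {
            if (row[0].equals(component)) {
                return row;
            }
        }
        throw new IllegalArgumentException("Not in table: " + component);
    }

    // Spell the whole word once per column, dropping duplicate spellings
    // while keeping column order.
    static Set<String> variants(List<String> components) {
        Set<String> out = new LinkedHashSet<>();
        for (int col = 0; col < COLUMNS.length; col++) {
            StringBuilder word = new StringBuilder();
            for (String component : components) {
                word.append(findRow(component)[col]);
            }
            out.add(word.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        // Five distinct spellings remain from the eight columns.
        System.out.println(variants(List.of("l", "\u00fce")));
    }
}
```

Because each word is built from a single column, this variant produces no spellings that mix systems, and the set removes the duplicates that arise when variants of one system agree.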
Transliteration Systems By Column • Only 3 systems (in effect) • Hànyŭ Pīnyīn (HP) • Tōngyòng Pīnyīn #1 (TP1) & Tōngyòng Pīnyīn #2 (TP2) • Modified Hànyŭ Pīnyīn #1 (MHP1) & Modified Hànyŭ Pīnyīn #2 (MHP2) • Wade-Giles #1 (WG1), Wade-Giles #2 (WG2), & Wade-Giles #3 (WG3)
Differences Between Transliteration System Variants • TP1: iu, ui, ‘ • TP2: iou, uei, - • WG2: h’ung (not hung) • WG3: ts’u (not tz’u) • WG1: szu (not ssu)
What is the effect? • Search for 130 Pinyin cities/regions • 16 – no other transliteration • 60 – at least two others • 6 – three or more • How much did Xiaozhi find? (8% more) • 5 min. 12 sec. – entire search
Further Work 1 • Include Yale, GR (Gwoyeu Romatzyh), etc. • YZSPY (Yóuzhèngshì Pīnyīn) • Accents • Hanja- and Kanji-based transliterations • Application to research archives
Further Work 2 • Improvements in accuracy of transliteration • Search in other transliterations • Japanese version of current paper • Hindustani version • Romani with Indic cognates • Extension to translation (transliterated Mandarin-Cantonese characters)
Solutions for Tibetan • Start with Wylie • Xiaozhi with adjustments • Dzongkha • Dzongkha-based variations? • Analysis of common transliteration patterns (usu. based on closest pronunciation)
Solutions for Malayalam • Start with Maya (script conversion program) • Include minor variations from other script conversion programs • Analysis of transliterations used
Solutions for Romani • Start with Vlax Romani Standard • Regional variations • Some transliterations easier to use on computers • e.g. chh, sh to omit hacek
Conclusions • Enhances search by finding alternate transliterations • Applied to Mandarin • Applicable to other languages • Applicable to lesser-studied (& other) languages • Language- (or script-) specific