Advisor : Dr. Hsu Student : Sheng-Hsuan Wang Department of Information Management

Acquisition of English-Japanese proper nouns from noisy-parallel newswire articles using KATAKANA matching • Toshiba Corp. R&D Center

Outline • Motivation • Objective • Introduction • Background • Method • Simulations • Discussion • Conclusion

Motivation • Limitation of statistical approaches

Objective • Superiority of linguistic approaches

Introduction • A tool for extracting bilingual knowledge from noisy-parallel English-Japanese text • Dynamic programming • Phonetic similarities • Partial matching of English-Japanese • Extract a small reliable bilingual lexicon of anchor points • Establish further bilingual correspondences

Introduction • Types of bilingual knowledge acquisition from parallel corpora • Statistical: internal distributional evidence of bilingual word pairs • Linguistic: external evidence provided by bilingual lexicons to establish anchor points between pairs of bilingual phrases

Background • The challenge of establishing a bilingual correspondence between English and Katakana • Information is lost going from English to Katakana • `r' and `l', or `b' and `v', are written the same way • Redundant vowel sounds appear going from Katakana to English • `fra' in “Frankfurt” is written `フラ', which transliterates as `fura'
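The redundant-vowel problem can be illustrated with a minimal sketch. The `ROMAJI` table and the `romanize` helper are hypothetical names introduced here; the table covers only the symbols needed for “フランクフルト” (“Frankfurt”), with Hepburn-style readings, and is not part of the presented system.

```python
# Hypothetical, deliberately tiny kana-to-romaji table (Hepburn-style readings).
# It only covers the symbols needed for this one example.
ROMAJI = {"フ": "fu", "ラ": "ra", "ン": "n", "ク": "ku", "ル": "ru", "ト": "to"}

def romanize(katakana: str) -> str:
    """Transliterate a Katakana word symbol by symbol using ROMAJI."""
    return "".join(ROMAJI[ch] for ch in katakana)

print(romanize("フラ"))            # fura
print(romanize("フランクフルト"))  # furankufuruto
```

Compared with “Frankfurt”, the romanized form carries extra vowels (`fura`, `ku`, `ru`, `to`), so a plain string comparison against the English spelling fails.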

Background • How previous research dealt with these problems • Transcribe both words into intermediate representations and match these. • The matching knowledge may be biased towards English pronunciation: “Chirac” => “シラク”, where `シ' is pronounced `shi'.

Background • A neutral intermediate representation allows for partial matching • “パレスチナ” can match “Palestine”, “Palestinian” and “Palestinians” • When two intermediate representations match above a certain threshold, the words are taken to be in a translation relation.

Method • NPT (Nearest Phonetic Transliteration) • Takes each Katakana word and converts it to a phonetic string representing all English spelling combinations of the word. • Example rule: `ル' -> `rloue' • “ブルンジ” (“Burundi” in English) becomes “buorlouenmgesdjgiou”
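The conversion step can be sketched as a per-symbol table lookup followed by concatenation. This is a minimal sketch, not the authors' rule set: `NPT_TABLE` and `to_npt` are hypothetical names, only the entry for ル appears on the slide, and the other three entries are inferred by splitting the slide's example output around it.

```python
# Hypothetical fragment of an NPT rule table.
# Only the entry for ル is given on the slide; the entries for ブ, ン and ジ
# are assumptions obtained by segmenting the example string "buorlouenmgesdjgiou".
NPT_TABLE = {
    "ブ": "buo",
    "ル": "rloue",
    "ン": "nm",
    "ジ": "gesdjgiou",
}

def to_npt(katakana: str) -> str:
    """Concatenate the candidate English spellings of each Katakana symbol."""
    return "".join(NPT_TABLE[ch] for ch in katakana)

print(to_npt("ブルンジ"))  # buorlouenmgesdjgiou
```

Each entry packs the letters an English spelling might use for that symbol, so `rloue` covers both `r` and `l`, which Katakana does not distinguish.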

Method – NPT_score • Example: “Burundi” scored against “buorlouenmgesdjgiou” • npt: NPT string • e: English string • md: maximum depth • d: depth count • s: score
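The slide names the variables of NPT_score but does not reproduce the procedure itself. The following is therefore only one plausible reading, stated as an assumption: walk through the English string, look ahead at most `md` characters in the NPT string for each letter, and report the fraction of letters found. The default `md=5` and the normalization by the length of `e` are choices made for this sketch, not values from the slides.

```python
def npt_score(npt: str, e: str, md: int = 5) -> float:
    """Assumed scoring: fraction of letters of e found, in order, in npt.

    npt: NPT string, e: English string, md: maximum look-ahead depth.
    This is an illustrative guess at the procedure, not the published one.
    """
    e = e.lower()
    if not e:
        return 0.0
    pos = 0  # current position in the NPT string
    s = 0    # s: number of English letters matched so far
    for ch in e:
        # d: depth count, i.e. how far ahead of pos we look (0 .. md - 1)
        for d in range(md):
            if pos + d >= len(npt):
                break
            if npt[pos + d] == ch:
                s += 1
                pos += d + 1  # continue after the matched character
                break
    return s / len(e)

print(npt_score("buorlouenmgesdjgiou", "Burundi"))  # 1.0
print(npt_score("buorlouenmgesdjgiou", "Tokyo"))    # 0.4
```

Under this reading, a pair would be accepted as a translation when the score exceeds a chosen threshold, which is also what makes partial matches possible.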

Method • Heuristics to save search time and detect substrings • Only English words whose first letter is in upper case are taken as candidate proper nouns. • A minimum length is imposed on the Katakana words available for matching. • Example: “クリスマス” (= “Christmas”) and “Mass”
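Both heuristics can be sketched with regular expressions. The function names and the `min_len=3` default are assumptions made for illustration; the slides do not state the actual minimum length.

```python
import re

# Runs of characters from the Katakana Unicode block (includes the long-vowel mark ー).
KATAKANA_RUN = re.compile(r"[\u30A0-\u30FF]+")
# ASCII words whose first letter is upper case.
CAPITALIZED = re.compile(r"\b[A-Z][A-Za-z]*\b")

def katakana_words(text: str, min_len: int = 3) -> list[str]:
    """Return Katakana words of at least min_len symbols (min_len is an assumed value)."""
    return [w for w in KATAKANA_RUN.findall(text) if len(w) >= min_len]

def candidate_proper_nouns(text: str) -> list[str]:
    """Return capitalized words; sentence-initial words are included as well."""
    return CAPITALIZED.findall(text)

print(katakana_words("ブルンジの大統領はマスを"))                    # ['ブルンジ']
print(candidate_proper_nouns("The president of Burundi visited Tokyo."))
# ['The', 'Burundi', 'Tokyo']
```

The capitalization filter over-generates (it keeps “The”), which is acceptable here because the phonetic matching step is what decides whether a candidate is kept.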

Simulations • Two corpora of English and Japanese headline newswire articles. • The test corpus had 150 aligned articles • 1730 English paragraphs • 771 Japanese paragraphs • 871 Katakana words • 9742 potential English proper nouns • 65 comparisons for each Katakana word in each article.
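The slide does not say how the figure of 65 was obtained. One plausible reading, offered only as an assumption, is the average number of candidate English proper nouns per aligned article:

```python
articles = 150             # aligned articles in the test corpus
english_candidates = 9742  # potential English proper nouns

# Average number of English candidates per article, i.e. the number of
# comparisons one Katakana word would need within its aligned article.
per_article = english_candidates / articles
print(round(per_article))  # 65
```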

Simulations • Baselines • Soundex algorithm • K&H: converts the Katakana and the English word to a simplified disjunctive phonetic form; does not allow either partial matches or matching of substrings.
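For reference, a minimal sketch of standard American Soundex on Latin-alphabet strings follows. How the baseline applied Soundex to Katakana words is not stated on the slide, so this shows only the English-side encoding; the name `soundex` and the table layout are choices made here.

```python
# Letter groups of standard American Soundex.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit

def soundex(word: str) -> str:
    """Return the four-character Soundex code of word ('' if it has no letters)."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    out = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue              # h and w do not separate equal codes
        code = CODES.get(ch, "")  # vowels map to "" and reset prev
        if code and code != prev:
            out += code
        prev = code
    return (out + "000")[:4]      # pad or truncate to four characters

print(soundex("Robert"))   # R163
print(soundex("Burundi"))  # B653
print(soundex("Bulundi"))  # B453
```

Soundex keeps `r` and `l` in different classes, so an `r`/`l` confusion introduced by Katakana changes the code, which is one reason it is a weak baseline for this task.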

Results F-measure 81% 58% 39%

Discussion • NPT yielded the best result overall. • A higher threshold gives higher precision. • K&H cannot handle partial matches, and its intermediate form may lose information. • Partial matching • Finding substrings • Identifying cognatively connected translation pairs: “インドネシア” => “Indonesia”, “Indonesian”, “Indonesians”, “Indonesias”

Conclusion • Back-transliterating from Katakana to English is unexpectedly difficult. • The set of matching rules is quite small and could be improved. • Future research • Induce the rules automatically from a corpus of examples.
