Automated Acquisition of Japanese Katakana Variants Using Web Data

Web-based acquisition of Japanese katakana variants 2005, SIGIR

Outline • Motivation • Objective • Introduction • ACQUISITION OF STRING PENALTY WITH WEB DATA • EXTRACTION OF KATAKANA VARIANT PAIRS • CONCLUSIONS AND FUTURE WORK • Personal Opinion

Motivation • Previous work relied on manual effort: • Katakana rewrite rules were defined by hand, e.g. ベ (be) and ヴェ (ve) are interchangeable. • The weight of each edit operation used to transform one string into another, and so detect variants, was also assigned by hand, e.g. the weight of substituting ベ (be) with ヴェ (ve) is 0.8. • However, these approaches have not kept up with the ever-increasing number of loanwords and their variants.
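A minimal sketch of what a manually weighted edit distance might look like, assuming words are already split into kana units and that the 0.8 figure is a substitution cost. Only that one weight comes from the slide; the function name, the tokenisation, and the default cost of 1.0 are illustrative assumptions.

```python
# Assumed weight table: only the 0.8 entry is taken from the slide.
SUBSTITUTION_WEIGHTS = {frozenset(("ベ", "ヴェ")): 0.8}


def weighted_edit_distance(a, b, weights=SUBSTITUTION_WEIGHTS, default=1.0):
    """Edit distance over token sequences with per-pair substitution weights.

    `a` and `b` are lists of kana units; insertions and deletions cost
    `default`, substitutions cost the table entry or `default` if absent.
    """
    prev = [j * default for j in range(len(b) + 1)]
    for i, ta in enumerate(a, start=1):
        curr = [i * default]
        for j, tb in enumerate(b, start=1):
            sub = 0.0 if ta == tb else weights.get(frozenset((ta, tb)), default)
            curr.append(min(prev[j] + default,      # delete ta
                            curr[j - 1] + default,  # insert tb
                            prev[j - 1] + sub))     # substitute ta -> tb
        prev = curr
    return prev[-1]


benechia = ["ベ", "ネ", "チ", "ア"]
venechia = ["ヴェ", "ネ", "チ", "ア"]
distance = weighted_edit_distance(benechia, venechia)
```

Every new loanword pattern would need another hand-written entry in the weight table, which is the maintenance burden the slide points at.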

Objective • Acquire the weights of edit operations automatically. • Keep up with new Katakana loanwords using only text data collected from the Web.

ACQUISITION OF STRING PENALTY WITH WEB DATA • Collect candidate Katakana variant pairs (threshold of edit distance: 2), e.g. (ウォッカ (wholtuka), ウォトカ (wholtoka)), (ウォッカ (wholtuka), ウオッカ (uoltuka)), (ウォッカ (wholtuka), ヴォッカ (voltuka)) • Google: “vodka” and ウォッカ (wholtuka); threshold: 0.00006 • Calculate the string penalty (SP); stop-words • Extract Katakana variant pairs • CLC: character-level context, e.g. f(oltuka) = 2, f(oltuka, w↔u) = 1, f(oltuka, w↔v) = 1
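The candidate-collection step can be sketched as follows, assuming the edit distance is the usual unit-cost Levenshtein distance over characters. The function names and the small word list are illustrative, not taken from the paper.

```python
from itertools import combinations


def edit_distance(a, b):
    """Unit-cost Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute
        prev = curr
    return prev[-1]


def candidate_pairs(words, threshold=2):
    """Return all word pairs whose edit distance is at most `threshold`."""
    return [(a, b) for a, b in combinations(words, 2)
            if edit_distance(a, b) <= threshold]


# Illustrative vocabulary: four spellings of "vodka" plus an unrelated word.
words = ["ウォッカ", "ウォトカ", "ウオッカ", "ヴォッカ", "スパゲッティ"]
pairs = candidate_pairs(words)
```

Comparing all pairs is quadratic in the vocabulary size, so a real corpus would likely need some form of blocking or indexing before this step.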

EXTRACTION OF KATAKANA VARIANT PAIRS • We collect Katakana words from the corpus by pattern matching over the Katakana character set and related symbols, e.g. ・ (“bullet”), ー (“macron-1”), − (“macron-2”), ― (“macron-3”), so that words such as ミネラルウォーター (mineraruwho-ta- for “mineral water”) are collected. • Extract candidate Katakana variant pairs (threshold of string penalty (SP): 4), e.g. ミネラルウォーター (mineraruwho-ta-) and ミネラルウオータ (mineraruuo-ta), both for “mineral water” • Extract Katakana variant pairs (threshold of cosine similarity: 0.05)
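The two ingredients of this step, collecting Katakana words and comparing their contexts, might look like the sketch below. The regular expression covers the Unicode Katakana letters plus the middle dot and long-vowel mark only; other dash-like characters would have to be added to the character class. The context vectors are hypothetical bag-of-words counts, since the slides do not specify how contexts are built.

```python
import math
import re
from collections import Counter

# U+30A1..U+30FA are Katakana letters, U+30FB is the middle dot,
# U+30FC is the long-vowel mark.
KATAKANA_WORD = re.compile(r"[\u30A1-\u30FC]+")


def collect_katakana_words(text):
    """Return every maximal run of Katakana characters in `text`."""
    return KATAKANA_WORD.findall(text)


def cosine_similarity(ctx_a, ctx_b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(ctx_a[t] * ctx_b[t] for t in ctx_a.keys() & ctx_b.keys())
    norm_a = math.sqrt(sum(v * v for v in ctx_a.values()))
    norm_b = math.sqrt(sum(v * v for v in ctx_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


found = collect_katakana_words("私はミネラルウォーターを飲む")

# Made-up context counts for two spellings of the same word.
context_a = Counter({"水": 2, "飲む": 1})
context_b = Counter({"水": 1, "買う": 1})
is_variant = cosine_similarity(context_a, context_b) >= 0.05  # slide threshold
```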

Experiment • We conducted a paired t-test (significance level: 5%) for the cases SP = 1, 2, and 3, and no significant difference was detected.

Introduction • The pronunciation of loanwords does not necessarily coincide with that in their original language.

Introduction (cont.) • We counted how many documents Google retrieved when each Katakana variant of “spaghetti” was used as a query.

Introduction (cont.) • We will first describe methods based on rewrite rules, which are listed in Table 3. Henceforth, ↔ denotes substitution, ∅ denotes an empty string,… • For example, when ベネチア (benechia for “Venezia”) is input to their system, which applies rewrite rules, the following variants are produced: • ベネツィア (benetsia) • ヴェネチア (venechia) • ヴェネツィア (venetsia)
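To illustrate how a rule-based system could expand one spelling into its variants, here is a minimal sketch. The two rules are inferred from the Venezia example only; they are not the contents of Table 3, and the function name is an assumption.

```python
from itertools import product

# Hypothetical rule set: each tuple lists spellings treated as interchangeable.
REWRITE_RULES = [("ヴェ", "ベ"), ("ツィ", "チ")]


def generate_variants(word, rules=REWRITE_RULES):
    """Return every spelling reachable by swapping rule alternatives in `word`."""
    segments = []
    i = 0
    while i < len(word):
        for alternatives in rules:
            # Find which alternative of this rule, if any, starts at position i.
            match = next((a for a in alternatives if word.startswith(a, i)), None)
            if match is not None:
                segments.append(alternatives)
                i += len(match)
                break
        else:
            # No rule applies here: keep the character unchanged.
            segments.append((word[i],))
            i += 1
    return {"".join(choice) for choice in product(*segments)}


variants = generate_variants("ベネチア")
```

The returned set includes the input spelling itself along with the three variants from the slide. Each new substitution pattern requires a new hand-written rule.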

Introduction (cont.) • Previous methods define rewrite rules or assign edit-distance weights manually, so it is difficult for them to keep up with the ever-increasing number of loanwords and their variants. • To overcome this problem, we propose a method of mechanically determining the weights of the string penalty.

Calculation of a string penalty • We used the following five types as character-level contexts (CLC) of each character targeted by the edit operation. • The preceding two characters of the target character, • The preceding character of the target character, • The succeeding two characters of the target character, • The succeeding character of the target character, and • The preceding character and the succeeding character of the target character.
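The five context types can be sketched as a small helper. How the original method treats positions near the start or end of a word is not stated in the slides, so the boundary marker `#` used for padding here is an assumption, as are the function and key names.

```python
PAD = "#"  # assumed word-boundary marker, not taken from the slides


def character_level_contexts(word, index):
    """Return the five CLC types for the character at `index` in `word`."""
    padded = PAD * 2 + word + PAD * 2
    i = index + 2  # position of the target character in the padded string
    return {
        "prev2": padded[i - 2:i],                         # preceding two characters
        "prev1": padded[i - 1],                           # preceding character
        "next2": padded[i + 1:i + 3],                     # succeeding two characters
        "next1": padded[i + 1],                           # succeeding character
        "prev1_next1": (padded[i - 1], padded[i + 1]),    # one on each side
    }


# Contexts of the second character (ォ) in ウォッカ.
contexts = character_level_contexts("ウォッカ", 1)
```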

Experimental evaluation of a string penalty • Table 6: Correlation of the mechanically determined SP and the manually determined SP. • Cov(X, Y) = E(XY) − E(X)E(Y) • The correlation coefficient computed from Table 6 was 0.76, indicating a strong correlation.
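For reference, the correlation coefficient built from the covariance formula on this slide can be computed as below. The sample values are made up for illustration and are not the data of Table 6.

```python
import math


def pearson_correlation(xs, ys):
    """Correlation coefficient using Cov(X, Y) = E(XY) - E(X)E(Y)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum(x * y for x, y in zip(xs, ys)) / n - mean_x * mean_y
    var_x = sum(x * x for x in xs) / n - mean_x ** 2
    var_y = sum(y * y for y in ys) / n - mean_y ** 2
    return cov / math.sqrt(var_x * var_y)


# Illustrative values only: mechanically vs. manually determined penalties.
mechanical_sp = [1.0, 2.0, 3.0, 4.0]
manual_sp = [0.8, 2.1, 2.9, 4.5]
r = pearson_correlation(mechanical_sp, manual_sp)
```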

Experimental evaluation of Katakana variant pairs (cont.)

Comparative results for task of detecting Katakana variants • Table 10 compares the results for Mechanical, Word, Google, and Yahoo! in terms of detecting Katakana variants of “spaghetti.”

Error Analyses • Mechanical could not extract the variant pair グリズリーベア (gurizuri-bea) and グリズリー・ベア (gurizuri-・bea), both of which denote “grizzly bear,” since their document-level contexts were completely different.

CONCLUSIONS AND FUTURE WORK • We proposed a method of mechanically determining the weight of each edit operation for identifying Katakana variants, based on Web data. • Unlike methods presented in previous work, ours can easily keep up with the increasing number of loanwords. • We also proposed a method of extracting Japanese Katakana variant pairs from a large corpus based on similarities in spelling and context. • In future work, we plan to calculate SP with a list of words in other languages and Katakana loanwords.

Personal Opinion • Strength • Automatic method • Application • Transliteration variants in Chinese, e.g. spellings of “Clinton”: • 柯林頓 • 科林頓 • 克林頓
