
Automated Acquisition of Japanese Katakana Variants Using Web Data

Explore the automatic acquisition of Katakana variant pairs leveraging web-based data. Learn about the methodology for extracting Katakana variants and the future prospects. Compare mechanical, manual, and web-based approaches. Discover the potential of this research for cross-language applications.


Presentation Transcript


  1. Web-based acquisition of Japanese katakana variants (SIGIR 2005)

  2. Outline • Motivation • Objective • Introduction • ACQUISITION OF STRING PENALTY WITH WEB DATA • EXTRACTION OF KATAKANA VARIANT PAIRS • CONCLUSIONS AND FUTURE WORK • Personal Opinion

  3. Motivation • Previous work relied on manual effort: • Katakana rewrite rules were defined by hand, e.g. ベ (be) and ヴェ (ve) being replaceable with each other. • The weight of each operation needed to edit one string into another was defined by hand in order to detect these variants, e.g. the weight of substituting ベ (be) with ヴェ (ve) is 0.8. • However, these manual approaches have not been able to keep up with the ever-increasing number of loanwords and their variants.

  4. Objective • Acquire new weights of edit operations automatically. • Keep up with new Katakana loanwords only by collecting text data from the Web and.

  5. ACQUISITION OF STRING PENALTY WITH WEB DATA • Collect candidate Katakana variant pairs (threshold of edit distance: 2), e.g. (ウォッカ (wholtuka), ウォトカ (wholtoka)), (ウォッカ (wholtuka), ウオッカ (uoltuka)), (ウォッカ (wholtuka), ヴォッカ (voltuka)) • Vodka and ウォッカ (wholtuka); Google threshold: 0.00006 • Calculate the string penalty (SP); stop-words • Extract Katakana variant pairs • CLC: character-level context, e.g. f(oltuka) = 2, f(oltuka, w↔u) = 1, f(oltuka, w↔v) = 1
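
The candidate-collection step on this slide can be illustrated with a small sketch. It assumes a plain Levenshtein distance with unit costs and reads the threshold as "distance of at most 2"; the slide does not say whether the bound is inclusive, and the names `edit_distance` and `candidate_pairs` are hypothetical, not taken from the paper.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance with unit cost for insert, delete, substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def candidate_pairs(words, max_distance=2):
    """Return all distinct word pairs whose edit distance is 1..max_distance.

    Assumption: the threshold is inclusive; the slide only states the value 2.
    """
    return [(a, b) for a, b in combinations(words, 2)
            if 0 < edit_distance(a, b) <= max_distance]


words = ["ウォッカ", "ウォトカ", "ウオッカ", "ヴォッカ", "スパゲッティ"]
pairs = candidate_pairs(words)
```

With this word list, the four spellings of "vodka" pair up with one another, while スパゲッティ is too far from all of them to form a candidate pair. Comparing every pair is quadratic in the vocabulary size, so a real corpus would probably need some form of indexing or blocking first.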

  6. EXTRACTION OF KATAKANA VARIANT PAIRS • Example: ミネラルウォーター (mineraruwho-ta- for “mineral water”) and ミネラルウオータ (mineraruuo-ta for “mineral water”) • We collect Katakana words from the corpus. We used pattern matching against a Katakana character set that includes, e.g., ・ (“bullet”), ー (“macron-1”), − (“macron-2”), and ― (“macron-3”), to collect Katakana words such as ミネラルウォーター (mineraruwho-ta- for “mineral water”). • Extract candidate Katakana variant pairs (threshold of string penalty (SP): 4) • Extract Katakana variant pairs (threshold of cosine similarity: 0.05)
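
The two mechanical pieces of this slide, collecting Katakana words by pattern matching and comparing contexts by cosine similarity, can be sketched as follows. The exact character set and the form of the context vectors are not given in the transcript, so the regular expression (Katakana letters plus the four marks listed on the slide) and the bag-of-words `Counter` vectors are assumptions, and `katakana_words` and `cosine` are hypothetical names.

```python
import math
import re
from collections import Counter

# Assumed character set: Katakana letters U+30A1..U+30FA, the middle dot
# (U+30FB), the long-vowel mark (U+30FC), and two look-alike marks
# (U+2212 minus sign, U+2015 horizontal bar).
KATAKANA_WORD = re.compile(r"[\u30A1-\u30FA\u30FB\u30FC\u2212\u2015]+")


def katakana_words(text: str) -> list:
    """Return maximal runs of characters from the assumed Katakana set."""
    return KATAKANA_WORD.findall(text)


def cosine(c1: Counter, c2: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(c1[k] * c2[k] for k in c1.keys() & c2.keys())
    norm1 = math.sqrt(sum(v * v for v in c1.values()))
    norm2 = math.sqrt(sum(v * v for v in c2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)


found = katakana_words("私はミネラルウォーターを飲む")

# Illustrative context vectors (made-up counts, not data from the paper).
ctx_a = Counter({"水": 3, "飲む": 2, "ボトル": 1})
ctx_b = Counter({"水": 2, "飲む": 1, "買う": 1})
is_variant_candidate = cosine(ctx_a, ctx_b) >= 0.05
```

Because the middle dot is part of the character set, a spelling such as グリズリー・ベア is collected as one word rather than split in two.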

  7. Experiment • We conducted a paired t-test (rejection region: 5%) for the cases of SP = 1, 2, and 3; no significant difference was detected.

  8. Introduction • The pronunciation of loanwords does not necessarily coincide with that in their original language.

  9. Introduction (cont.) • We tried to find how many documents were retrieved by Google when each Katakana variant for spaghetti was used as a query.

  10. Introduction (cont.) • We will first describe methods based on rewrite rules, which are described in Table 3. Henceforth, ↔ denotes substitution, ∅ denotes an empty string,… • For example, when they input ベネチア (benechia for “Venezia”) into their system, which applies rewrite rules, the resulting variants are: • ベネツィア (benetsia) • ヴェネチア (venechia) • ヴェネツィア (venetsia)
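
A rewrite-rule system of the kind described on this slide can be sketched in a few lines. The two rules below (ベ ↔ ヴェ and チ ↔ ツィ) are only the ones implied by the Venezia example, not the contents of Table 3, and `expand_variants` is a hypothetical name.

```python
# Bidirectional rewrite rules inferred from the slide's example only.
RULES = [("ベ", "ヴェ"), ("チ", "ツィ")]


def expand_variants(word: str, rules=RULES) -> set:
    """Return the word plus every spelling reachable by applying the rules.

    Each rule is applied in both directions, one occurrence at a time,
    until no new spelling appears. Note: this terminates for the rules
    above, but a rule whose right side contains its left side
    (e.g. ア <-> アー) would grow forever and would need a size limit.
    """
    seen = {word}
    queue = [word]
    while queue:
        current = queue.pop()
        for left, right in rules:
            for src, dst in ((left, right), (right, left)):
                start = current.find(src)
                while start != -1:
                    candidate = current[:start] + dst + current[start + len(src):]
                    if candidate not in seen:
                        seen.add(candidate)
                        queue.append(candidate)
                    start = current.find(src, start + 1)
    return seen


venezia = expand_variants("ベネチア")
```

For the input ベネチア this yields the input itself plus the three variants listed on the slide.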

  11. Introduction (cont.) • It is difficult for previous methods to keep up with the ever-increasing number of loanwords and their variants, since they define rewrite rules manually or assign weights to the edit distance manually. • We propose a method of mechanically determining the weights of the string penalty to overcome this problem.

  12. Calculation of a string penalty • We used the following five types as character-level contexts (CLC) of each character targeted by the edit operation. • The preceding two characters of the target character, • The preceding character of the target character, • The succeeding two characters of the target character, • The succeeding character of the target character, and • The preceding character and the succeeding character of the target character.
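
The five context types listed above can be read off a word with simple slicing. The sketch below assumes the target is an existing character at a known position and that contexts are simply truncated at word boundaries; the transcript does not say how boundaries, insertions, or deletions are handled, and `character_level_contexts` and its dictionary keys are hypothetical names.

```python
def character_level_contexts(word: str, index: int) -> dict:
    """Return the five character-level contexts (CLC) of word[index].

    Assumption: near a word boundary the context is shortened rather
    than padded.
    """
    before = word[:index]
    after = word[index + 1:]
    return {
        "preceding_two": before[-2:],
        "preceding_one": before[-1:],
        "succeeding_two": after[:2],
        "succeeding_one": after[:1],
        "preceding_and_succeeding": (before[-1:], after[:1]),
    }


# Contexts of ッ (position 2) in ウォッカ.
clc = character_level_contexts("ウォッカ", 2)
```

Counting how often each context co-occurs with a given edit operation across the candidate pairs would then give frequencies of the kind shown on slide 5, such as f(oltuka, w↔u).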

  13. Experimental evaluation of a string penalty • Table 6: Correlation of the mechanically determined SP and the manually determined SP. • Cov(X, Y) = E(XY) - E(X)E(Y) • The correlation coefficient calculated from Table 6 was 0.76, which indicates a strong correlation.
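
The covariance identity on this slide leads directly to the correlation coefficient once it is divided by the two standard deviations. The sketch below assumes the Pearson coefficient is the one meant; the sample values are made up for illustration and are not the contents of Table 6, and `pearson` is a hypothetical name.

```python
import math


def mean(values):
    return sum(values) / len(values)


def pearson(xs, ys):
    """Pearson correlation via Cov(X, Y) = E(XY) - E(X)E(Y).

    Variances use the same identity with Y = X. Inputs with zero
    variance are not handled here and would raise ZeroDivisionError.
    """
    ex, ey = mean(xs), mean(ys)
    cov = mean([x * y for x, y in zip(xs, ys)]) - ex * ey
    var_x = mean([x * x for x in xs]) - ex * ex
    var_y = mean([y * y for y in ys]) - ey * ey
    return cov / math.sqrt(var_x * var_y)


# Illustrative values only: mechanically vs. manually determined penalties.
mechanical_sp = [1, 2, 3, 4]
manual_sp = [2, 4, 6, 8]
r = pearson(mechanical_sp, manual_sp)
```

A perfectly linear relation like the toy data above gives a coefficient of 1; the value of 0.76 reported on the slide is lower than that but is read there as strong agreement between the two sets of penalties.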

  14. Experimental evaluation of Katakana variant pairs (cont.)

  15. Comparative results for task of detecting Katakana variants • Table 10 compares the results for Mechanical, Word, Google, and Yahoo! in terms of detecting Katakana variants of “spaghetti.”

  16. Error Analyses • Mechanical could not extract the variant pair グリズリーベア (gurizuri-bea) and グリズリー・ベア (gurizuri-・bea), both of which denote “grizzly bear,” since their document-level contexts were completely different.

  17. CONCLUSIONS AND FUTURE WORK • We proposed a method of mechanically determining the weight of each edit operation for identifying Katakana variants, based on Web data. • Unlike methods presented in previous work, ours could easily keep up with the increasing number of loanwords. • We also proposed a method of extracting Japanese Katakana variant pairs from a large corpus based on similarities in spelling and context. • In our future work, we are planning to calculate SP with a list of words in other languages and Katakana loanwords.

  18. Personal Opinion • Strength • Automatic method • Application • Chinese transliteration variants of “Clinton”: • 柯林頓 • 科林頓 • 克林頓
