
Support Vector Machine Based Orthographic Disambiguation

Support Vector Machine Based Orthographic Disambiguation. Eiji ARAMAKI, Takeshi IMAI, Kengo MIYO, Kazuhiko OHE (Hospital). Are “center” and “centre” equivalent? We focus on Japanese, but the proposed method does not depend on the language.


Presentation Transcript


  1. Support Vector Machine Based Orthographic Disambiguation. Eiji ARAMAKI, Takeshi IMAI, Kengo MIYO, Kazuhiko OHE (Hospital). Are “center” and “centre” equivalent? We focus on Japanese, but the proposed method does not depend on the language.

  2. Background
     • Japanese in particular contains many orthographic variants because it has a very large number of transliterations.
     • Example: "Avogadro" is transliterated both as アボガドロ (A BO GA DO RO) and as アヴォガドロ (A VO GA DO RO). Are the two spellings equivalent or not?
     • Approach: an SVM-based classifier, which requires (1) building training sets and (2) defining features.

  3. (1) Training set
     • Source: multiple translation dictionaries, in which, for example, both アボガドロ and アヴォガドロ are translated as "Avogadro".
     • Positive example: a pair of terms that are spelled differently but have the same meaning (same English translation).
     • Negative example: a pair of terms that are spelled differently and have different meanings (different English translations).
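The pairing rule on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function name `build_training_pairs`, the `(japanese, english)` input format, the case-insensitive matching of English translations, and the toy entry アボカド ("avocado") are all hypothetical. The sketch pairs every cross-translation combination as a negative example; the slides do not say whether negatives were filtered (for instance, to similarly spelled terms).

```python
from itertools import combinations


def build_training_pairs(entries):
    """Build (positives, negatives) from (japanese_term, english_translation) tuples.

    Assumption: two Japanese terms "have the same meaning" when they share
    an English translation, compared case-insensitively.
    """
    by_english = {}
    for japanese, english in entries:
        by_english.setdefault(english.strip().lower(), set()).add(japanese)

    # One sorted list of Japanese spellings per English translation.
    groups = [sorted(terms) for _, terms in sorted(by_english.items())]

    # Positive: differently spelled terms with the same English translation.
    positives = [pair for terms in groups for pair in combinations(terms, 2)]
    positive_keys = {frozenset(pair) for pair in positives}

    # Negative: differently spelled terms with different English translations.
    negatives = []
    for terms_a, terms_b in combinations(groups, 2):
        for a in terms_a:
            for b in terms_b:
                if a != b and frozenset((a, b)) not in positive_keys:
                    negatives.append((a, b))
    return positives, negatives


# Toy dictionary entries (hypothetical data for illustration only).
toy_entries = [
    ("アボガドロ", "Avogadro"),
    ("アヴォガドロ", "Avogadro"),
    ("アボカド", "avocado"),
]
positives, negatives = build_training_pairs(toy_entries)
```

With the toy entries, the two spellings of "Avogadro" form the single positive pair, and each of them paired with アボカド forms a negative pair.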

  4. (2) Features for SVM
     • The features are the differing characters and their surrounding characters (window size = 1: pre-context and post-context), together with their combinations.
     • Example: for term1 = アヴォガドロ and term2 = アボガドロ (label: True), the diff is ヴォ⇔ボ, the pre-context is ア, and the post-context is ガ.
     • Resulting features, each with value 1: (ア, ヴォ⇔ボ, ガ), (ア, ヴォ⇔ボ), (ヴォ⇔ボ, ガ), (ヴォ⇔ボ).
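As a rough illustration of the feature extraction on this slide, the sketch below finds the differing region by stripping the longest common prefix and suffix, then emits the diff alone and in combination with its pre-context and post-context. The function name `diff_features` is hypothetical, and the prefix/suffix approach assumes the two terms differ in a single contiguous region; the slides do not say how the authors aligned terms with several differing regions.

```python
def diff_features(term1, term2, window=1):
    """Return the diff and its context combinations for a term pair.

    Assumption: the terms differ in one contiguous region, located by
    stripping the longest common prefix and suffix.
    """
    shortest = min(len(term1), len(term2))

    # Length of the common prefix.
    prefix = 0
    while prefix < shortest and term1[prefix] == term2[prefix]:
        prefix += 1

    # Length of the common suffix, not overlapping the prefix.
    suffix = 0
    while suffix < shortest - prefix and term1[-1 - suffix] == term2[-1 - suffix]:
        suffix += 1

    end1 = len(term1) - suffix
    end2 = len(term2) - suffix
    diff = f"{term1[prefix:end1]}⇔{term2[prefix:end2]}"
    pre = term1[max(0, prefix - window):prefix]   # characters just before the diff
    post = term1[end1:end1 + window]              # characters just after the diff

    return [(pre, diff, post), (pre, diff), (diff, post), (diff,)]


features = diff_features("アヴォガドロ", "アボガドロ")
```

For the slide's example pair this yields the same four features as the slide: (ア, ヴォ⇔ボ, ガ), (ア, ヴォ⇔ボ), (ヴォ⇔ボ, ガ) and (ヴォ⇔ボ). Each one would become a binary dimension of the SVM input vector.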

  5. Experiments
     • Test set: 883 medical term pairs (312 positive)
     • Methods:
       (1) EDIT DISTANCE (th): IF SIM > th THEN +1
       (2) BYHAND: SVM + handmade training set (4,130)
       (3) AUTOMATIC: SVM + automatically built training set (68,608)
       (4) COMBINATION: SVM + all training sets (BYHAND + AUTOMATIC)
     • Evaluation and results: not preserved in this transcript.
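The EDIT DISTANCE baseline (method 1) can be sketched as below. The slide gives only the rule "IF SIM > th THEN +1"; the similarity used here, 1 minus the Levenshtein distance divided by the longer term's length, is an assumption, as are the function names and the threshold values in the example.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution or match
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Assumed SIM: 1 - distance / length of the longer term."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def classify(a, b, threshold):
    """Return +1 (equivalent) if SIM > threshold, otherwise -1."""
    return 1 if similarity(a, b) > threshold else -1


label_loose = classify("アヴォガドロ", "アボガドロ", threshold=0.5)
label_strict = classify("アヴォガドロ", "アボガドロ", threshold=0.7)
```

The example pair has edit distance 2 over 6 characters, so it is labeled +1 at a threshold of 0.5 and -1 at 0.7. The outcome depends entirely on the threshold, whereas the SVM variants instead learn from training pairs which specific differences matter.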

  6. Conclusion
     • Discussion: why does AUTOMATIC score lower than BYHAND? Because of corpus-specific styles (e.g., "hepatitis-B" vs. "Hepatitis=B"): the BYHAND corpus is the same as the test-set corpus, whereas the AUTOMATIC corpus is a different one.
     • Conclusion:
       We proposed a discriminative orthographic disambiguation method.
       We also proposed a method for collecting both positive and negative examples.
       Experiments yielded a high level of accuracy (87.8%), demonstrating the feasibility of the proposed approach.
     • Unfortunately, Bergsma [ACL2007] proposed similar methods.
     • In the future, we will employ more features to improve accuracy.

  7. Support Vector Machine Based Orthographic Disambiguation. Eiji ARAMAKI, Takeshi IMAI, Kengo MIYO, Kazuhiko OHE (Hospital). Are “term1” and “term2” equivalent? We focus on Japanese, but the proposed method does not depend on the language.
