Automatic Name Transliteration via OCR and NLP

Yu Cao, Tao Wang. Topics covered: integration; optical character recognition (OCR) on the ICDAR 2011 dataset of characters embedded in natural scenes, using histogram of oriented gradients (HOG) features from an 8x8 window sliding at a step of 2 and a linear-kernel SVM.

Presentation Transcript


  1. Automatic Name Transliteration via OCR and NLP. Yu Cao, Tao Wang

  2. Integration

  3. Optical Character Recognition (OCR)
     • ICDAR 2011 dataset
     • characters embedded in natural scenes
     • histogram of oriented gradients (HOG) features
     • 8x8 window sliding across the image at a step of 2
     • linear-kernel SVM
     • 52 classes, i.e. capital and small letters
     • overall character-level accuracy: 74%
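The HOG step on this slide can be sketched in plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the 32x32 patch size, the 9 unsigned orientation bins, the per-window L2 normalisation, and all function names are hypothetical. Only the 8x8 window and the step of 2 come from the slide. The resulting vector is what a linear-kernel SVM would take as input; the SVM itself is not shown.

```python
import math

def window_origins(height, width, win=8, step=2):
    """Top-left corners of every win x win window that fits in the image."""
    return [(r, c)
            for r in range(0, height - win + 1, step)
            for c in range(0, width - win + 1, step)]

def orientation_histogram(img, r0, c0, win=8, bins=9):
    """Gradient-orientation histogram (0-180 degrees) for one window."""
    hist = [0.0] * bins
    h, w = len(img), len(img[0])
    for r in range(r0, r0 + win):
        for c in range(c0, c0 + win):
            # Central differences, clamped at the image border.
            gx = img[r][min(c + 1, w - 1)] - img[r][max(c - 1, 0)]
            gy = img[min(r + 1, h - 1)][c] - img[max(r - 1, 0)][c]
            mag = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            # Each pixel votes for its orientation bin, weighted by magnitude.
            hist[min(int(angle / (180.0 / bins)), bins - 1)] += mag
    norm = math.sqrt(sum(v * v for v in hist)) or 1.0  # avoid dividing by 0
    return [v / norm for v in hist]

def hog_descriptor(img, win=8, step=2, bins=9):
    """Concatenate the histograms of all sliding windows into one vector."""
    feats = []
    for r0, c0 in window_origins(len(img), len(img[0]), win, step):
        feats.extend(orientation_histogram(img, r0, c0, win, bins))
    return feats

# Hypothetical 32x32 patch with a single vertical edge at column 16.
img = [[0 if c < 16 else 1 for c in range(32)] for r in range(32)]
features = hog_descriptor(img)
```

With these assumed sizes, an 8x8 window at step 2 fits 13 positions per axis, so the patch yields 169 windows and a 1521-value descriptor.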

  4. Bayesian Correction
     • character-level bigram language model
     • character-level accuracy improved to 75.3%
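One way to combine per-character classifier scores with a character-level bigram model is Viterbi decoding over the character sequence. The slide does not say how the correction was computed, so the sketch below is an assumption; the function name, the probability floor, and every number in the example are made up for illustration.

```python
import math

def viterbi_correct(emissions, bigram, start, floor=1e-6):
    """Pick the most probable character string.

    emissions: one dict per position, char -> classifier probability (> 0)
    bigram:    dict (prev, cur) -> P(cur | prev)
    start:     dict char -> P(char at the first position)
    Unseen bigram or start events fall back to `floor`.
    """
    # best[c] = (log score, best string ending in c)
    best = {c: (math.log(p) + math.log(start.get(c, floor)), c)
            for c, p in emissions[0].items()}
    for scores in emissions[1:]:
        nxt = {}
        for c, p in scores.items():
            s, path = max(
                (s + math.log(bigram.get((prev, c), floor)), path)
                for prev, (s, path) in best.items())
            nxt[c] = (s + math.log(p), path + c)
        best = nxt
    return max(best.values())[1]

# Illustrative numbers only: the classifier slightly prefers 'b' over 'h'.
emissions = [{"t": 0.9, "f": 0.1},
             {"b": 0.55, "h": 0.45},
             {"e": 0.8, "c": 0.2}]
bigram = {("t", "h"): 0.3, ("t", "b"): 0.001,
          ("h", "e"): 0.4, ("b", "e"): 0.1}
start = {"t": 0.1, "f": 0.05}

greedy = "".join(max(d, key=d.get) for d in emissions)
corrected = viterbi_correct(emissions, bigram, start)
```

Here the per-character argmax reads "tbe", while the bigram prior makes "the" the highest-scoring sequence.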

  5. Named Entity Recognition (NER)
     • essentially two types of labels: "PERSON" and "NONPERSON"
     • MUC-7 corpora
     • maximum entropy Markov model
     • set of features: "CUR_WORD", "PREV_LABEL", "MID_INITIAL", "IN_DICT", "IN_NAME_DATABASE", "NEXT_WORD"
     • F1 score of 77.5% (precision 76.9%, recall 78.1%)
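A feature extractor for the six features named on this slide might look like the sketch below. The slide gives only the feature names (written here with underscores throughout), so their meanings are guesses: MID_INITIAL is assumed to flag tokens shaped like a middle initial, IN_DICT a lookup in a common-word list, and IN_NAME_DATABASE a lookup in a list of known names. The function name, arguments, and sample word lists are hypothetical. In a maximum entropy Markov model, a dictionary like this would feed a classifier that predicts the current label given the features; that classifier is not shown.

```python
import re

def extract_features(tokens, i, prev_label, dictionary, name_db):
    """Features for token i, given the label predicted for token i - 1."""
    word = tokens[i]
    nxt = tokens[i + 1] if i + 1 < len(tokens) else "</s>"  # end marker
    return {
        "CUR_WORD": word.lower(),
        "PREV_LABEL": prev_label,
        # Assumed meaning: a single capital letter followed by a period.
        "MID_INITIAL": bool(re.fullmatch(r"[A-Z]\.", word)),
        # Assumed meaning: the word appears in a common-word dictionary.
        "IN_DICT": word.lower() in dictionary,
        # Assumed meaning: the word appears in a list of known names.
        "IN_NAME_DATABASE": word.lower() in name_db,
        "NEXT_WORD": nxt.lower(),
    }

# Hypothetical inputs for illustration.
tokens = ["John", "F.", "Kennedy", "spoke"]
dictionary = {"spoke", "the"}
name_db = {"john", "kennedy"}
feats = extract_features(tokens, 1, "PERSON", dictionary, name_db)
```

Conditioning on PREV_LABEL is what makes the model Markov: the label chosen for one token becomes an input feature for the next.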

  6. Transliteration
     • example: Francisco → 弗朗西斯科
     • character-level translation model
     • training data: 4,256 English-Chinese name pairs obtained online
     • trigram Chinese language model
     • alignment models: IBM Models 1, 3, and 4
     • human evaluation
     • 120 English names obtained by NER for testing
     • acceptance score: 100 ± 2 out of 120 (about 83%)
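Of the alignment models listed, IBM Model 1 is the simplest, and its EM training loop fits in a few lines. The sketch below is a generic Model 1 trainer, not the authors' system. The three training pairs are invented and pre-segmented into syllable-like units to keep the example small, whereas the slide describes a character-level model trained on 4,256 real name pairs. The function name and iteration count are assumptions.

```python
from collections import defaultdict

def train_ibm1(pairs, iterations=10):
    """Estimate t[(s, c)] = P(source unit s | target unit c) with EM."""
    src_vocab = {s for src, _ in pairs for s in src}
    # Start from a uniform translation table.
    t = defaultdict(lambda: 1.0 / len(src_vocab))
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        # E-step: share each source unit among the target units in its pair.
        for src, tgt in pairs:
            for s in src:
                z = sum(t[(s, c)] for c in tgt)
                for c in tgt:
                    frac = t[(s, c)] / z
                    count[(s, c)] += frac
                    total[c] += frac
        # M-step: renormalise the expected counts per target unit.
        for (s, c), v in count.items():
            t[(s, c)] = v / total[c]
    return t

# Invented toy pairs, for illustration only.
pairs = [(["ma", "li"], ["马", "丽"]),
         (["ma"], ["马"]),
         (["li", "sa"], ["丽", "萨"])]
t = train_ibm1(pairs)
```

After training on these toy pairs, the table favours "ma" for 马, "li" for 丽, and "sa" for 萨, because the one-unit pair anchors "ma" to 马 and EM propagates that evidence to the other pairs.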
