Automatic Name Transliteration via OCR and NLP
Yu Cao, Tao Wang
Optical Character Recognition (OCR)
• ICDAR 2011 dataset
• characters embedded in natural scenes
• histogram of oriented gradients (HOG) features
• 8x8 window slid across the image with a step of 2
• linear-kernel SVM
• 52 classes, i.e. uppercase and lowercase letters
• overall character-level accuracy: 74%
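The feature step above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the 9 orientation bins, the central-difference gradients, the per-window L2 normalization, and the function names `window_origins`, `hog_histogram`, and `hog_features` are all assumptions; only the 8x8 window and the step of 2 come from the slide. The resulting vector would then be passed to a linear SVM, which is not shown here.

```python
import math

def window_origins(height, width, win=8, step=2):
    """Top-left corners of every win x win window, moved by `step` pixels."""
    return [(r, c)
            for r in range(0, height - win + 1, step)
            for c in range(0, width - win + 1, step)]

def hog_histogram(patch, bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations.

    The bin count (9) and the L2 normalization are assumptions.
    """
    h, w = len(patch), len(patch[0])
    hist = [0.0] * bins
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            gx = patch[r][c + 1] - patch[r][c - 1]   # horizontal gradient
            gy = patch[r + 1][c] - patch[r - 1][c]   # vertical gradient
            mag = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[min(int(angle / (180.0 / bins)), bins - 1)] += mag
    norm = math.sqrt(sum(v * v for v in hist)) or 1.0  # avoid dividing by zero
    return [v / norm for v in hist]

def hog_features(image, win=8, step=2, bins=9):
    """Concatenate the histograms of all sliding windows into one vector."""
    feats = []
    for r, c in window_origins(len(image), len(image[0]), win, step):
        patch = [row[c:c + win] for row in image[r:r + win]]
        feats.extend(hog_histogram(patch, bins))
    return feats

# Toy 16x16 image with a vertical edge between columns 7 and 8.
image = [[0 if c < 8 else 1 for c in range(16)] for r in range(16)]
features = hog_features(image)
```

For a 16x16 character crop this gives 5 x 5 = 25 window positions, so the step of 2 produces heavily overlapping windows and a feature vector of 25 x 9 values under the assumed bin count.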
Bayesian Correction
• character-level bigram language model
• character-level accuracy improved to 75.3%
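One common way to combine per-character classifier scores with a bigram language model is Viterbi decoding over the character sequence. The sketch below assumes that formulation; the slide does not say how the correction was actually computed. The function name `viterbi_correct`, the start symbol `"^"`, the probability floor, and every number in the toy example are hypothetical.

```python
import math

def viterbi_correct(emissions, bigram, alphabet, floor=1e-6):
    """Most probable character sequence under OCR scores and a bigram model.

    emissions: one dict per position, char -> P(char | image patch)
    bigram:    dict (prev, cur) -> P(cur | prev); "^" marks the word start
    Missing entries are replaced by `floor` (an assumed smoothing choice).
    """
    def lp(p):
        return math.log(max(p, floor))

    best = {ch: lp(bigram.get(("^", ch), 0)) + lp(emissions[0].get(ch, 0))
            for ch in alphabet}
    back = []
    for em in emissions[1:]:
        nxt, ptr = {}, {}
        for cur in alphabet:
            prev = max(alphabet,
                       key=lambda p: best[p] + lp(bigram.get((p, cur), 0)))
            nxt[cur] = (best[prev] + lp(bigram.get((prev, cur), 0))
                        + lp(em.get(cur, 0)))
            ptr[cur] = prev
        best = nxt
        back.append(ptr)

    # Follow the back-pointers from the best final character.
    out = [max(alphabet, key=lambda ch: best[ch])]
    for ptr in reversed(back):
        out.append(ptr[out[-1]])
    return "".join(reversed(out))

# Toy example (all probabilities invented): the classifier slightly
# prefers "b" over "h" in the middle position.
emissions = [
    {"t": 0.90, "b": 0.05, "h": 0.05},
    {"b": 0.50, "h": 0.40, "t": 0.10},
    {"e": 0.90, "b": 0.05, "h": 0.05},
]
bigram = {("^", "t"): 0.5, ("t", "h"): 0.6, ("t", "b"): 0.01,
          ("h", "e"): 0.5, ("b", "e"): 0.3}

greedy = "".join(max(em, key=em.get) for em in emissions)
corrected = viterbi_correct(emissions, bigram, "tbhe")
```

In this toy case the per-position argmax reads "tbe", while the bigram model pulls the decoded sequence to "the", because "th" is far more likely than "tb".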
Named Entity Recognition (NER)
• essentially two types of labels: “PERSON” and “NONPERSON”
• MUC-7 corpus
• maximum entropy Markov model (MEMM)
• set of features: “CUR_WORD”, “PREV_LABEL”, “MID_INITIAL”, “IN_DICT”, “IN_NAME_DATABASE”, “NEXT_WORD”
• F1 score of 77.5% (precision 76.9%, recall 78.1%)
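A feature extractor for such a model might look like the sketch below. The feature names are taken from the slide (written here with underscores), but what each one tests is a guess: the slide does not define them. In particular, the reading of “MID_INITIAL” as a capitalized single-letter token followed by a period in the middle of the sentence, the lowercasing, the `"<END>"` marker, and the function name `extract_features` are assumptions.

```python
def extract_features(tokens, i, prev_label, dictionary, name_db):
    """Features for token i, given the label predicted for token i - 1.

    dictionary: set of ordinary lowercase words (hypothetical resource)
    name_db:    set of known lowercase personal names (hypothetical resource)
    """
    word = tokens[i]
    next_word = tokens[i + 1].lower() if i + 1 < len(tokens) else "<END>"
    # Assumed meaning of MID_INITIAL: a token like "F." that is neither
    # the first nor the last token of the sentence.
    looks_like_initial = len(word) == 2 and word[0].isupper() and word[1] == "."
    return {
        "CUR_WORD": word.lower(),
        "PREV_LABEL": prev_label,
        "MID_INITIAL": looks_like_initial and 0 < i < len(tokens) - 1,
        "IN_DICT": word.lower() in dictionary,
        "IN_NAME_DATABASE": word.lower() in name_db,
        "NEXT_WORD": next_word,
    }

tokens = ["John", "F.", "Kennedy", "spoke"]
feats = extract_features(tokens, 1, "PERSON",
                         dictionary={"spoke"},
                         name_db={"john", "kennedy"})
```

Because “PREV_LABEL” depends on the previous prediction, a MEMM tagger would call such a function left to right, feeding each predicted label into the features of the next token.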
Transliteration
• example: Francisco → 弗朗西斯科
• character-level translation model
• training data: 4,256 English-Chinese name pairs obtained online
• trigram Chinese language model
• alignment models: IBM Models 1, 3, and 4
• human evaluation
• 120 English names obtained by NER for testing
• acceptance score: 100 ± 2 out of 120
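The combination of a translation model and a trigram language model can be illustrated with a noisy-channel scorer: pick the Chinese character sequence that maximizes translation probability times language-model probability. The sketch below is a deliberate simplification and not the authors' system: it assumes the English name is already split into chunks, skips the IBM alignment step that would normally learn the translation table, and searches exhaustively. The names `trigram_logprob` and `transliterate`, the chunking of "Francisco", and every probability are invented for illustration.

```python
import itertools
import math

def trigram_logprob(chars, lm, floor=1e-4):
    """Log-probability of a character sequence under a trigram model.

    lm maps (c1, c2, c3) -> P(c3 | c1, c2); unseen trigrams get `floor`.
    """
    padded = ["<s>", "<s>"] + list(chars)
    return sum(math.log(lm.get(tuple(padded[i - 2:i + 1]), floor))
               for i in range(2, len(padded)))

def transliterate(chunks, table, lm):
    """Exhaustive search over one Chinese character per English chunk.

    table maps an English chunk to {Chinese char: P(chunk | char)}.
    """
    best, best_score = None, -math.inf
    candidates = [table[chunk].items() for chunk in chunks]
    for combo in itertools.product(*candidates):
        chars = [ch for ch, _ in combo]
        score = (sum(math.log(p) for _, p in combo)
                 + trigram_logprob(chars, lm))
        if score > best_score:
            best, best_score = chars, score
    return "".join(best)

# Hypothetical chunking and probabilities for the slide's example name.
chunks = ["f", "ran", "ci", "s", "co"]
table = {
    "f":   {"弗": 0.8, "福": 0.2},
    "ran": {"朗": 0.7, "兰": 0.3},
    "ci":  {"西": 0.6, "希": 0.4},
    "s":   {"斯": 0.7, "思": 0.3},
    "co":  {"可": 0.55, "科": 0.45},   # translation model alone prefers 可
}
lm = {
    ("<s>", "<s>", "弗"): 0.3,
    ("<s>", "弗", "朗"): 0.5,
    ("弗", "朗", "西"): 0.5,
    ("朗", "西", "斯"): 0.5,
    ("西", "斯", "科"): 0.5,
}
result = transliterate(chunks, table, lm)
```

In this toy setup the translation table on its own would choose 可 for the last chunk, and the trigram model overrides it in favor of 科. A real decoder would replace the exhaustive product with beam search, since the number of combinations grows exponentially with name length.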