Re-organization of IR/CSC team

Re-organization of IR/CSC team • Hongchao He • Conf. follow up TREC-10, NTCIR • Paper follow up ICCLP, SIGIR paper • Guihong Cao • MSKK-III – Clustering for technique transfer • Yang Wen • MSKK-III – Distance word dependency • Min Zhang • MSKK/CSC – Entropy based pruning for applications of (Pinyin/Hiragana) input system

Chinese Spelling Checking (or, the Big CSC) • Jianfeng Gao • NLC Group, MSRCN

Outline • Introduction • Chinese spelling checking • Our approach • Key techniques and experiments • Milestones

Introduction • Goal: Automatically correct Chinese spelling errors made with the MS-Pinyin (MSPY) input system • Chinese spelling errors using MS-Pinyin input system • Chinese spelling error patterns • English spelling checking • Why CSC is difficult?

Chinese spelling errors using MSPY • Text in the brain → [Pinyin (phonetic) errors] → Syllable → [Typographic errors] → Key stroke (typing) → [System errors] → Converted text

Chinese spelling error patterns • Substitution errors • Pinyin error • System error (includes Pinyin errors in some systems) • Non-substitution errors → word segmentation errors • Typographic errors – insertion/deletion/transposition

English spelling checking • Non-word error detection (“the” → “hte”) • N-gram (letter) analysis • Dictionary lookup • Real-word error detection (“from” → “form”) • NLP – parser driven • Statistical approach – data/error driven • Local – n-gram language model, depends on a pre-defined confusion set • Global – Winnow, Bayesian, TBL, etc. • Problem – lack of error detection
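
The two detection modes on this slide can be illustrated with a minimal Python sketch. The lexicon and the confusion set below are toy values invented for illustration; they are not taken from any real spelling checker.

```python
# Toy lexicon and confusion set (hypothetical, for illustration only).
LEXICON = {"the", "cat", "came", "from", "form", "a", "farm"}
CONFUSION_SETS = [{"from", "form"}]


def non_word_errors(tokens, lexicon=LEXICON):
    """Dictionary lookup: return positions of tokens missing from the lexicon."""
    return [i for i, tok in enumerate(tokens) if tok.lower() not in lexicon]


def real_word_candidates(tokens, confusion_sets=CONFUSION_SETS):
    """For each token in a confusion set, list the alternatives to consider."""
    candidates = {}
    for i, tok in enumerate(tokens):
        for cset in confusion_sets:
            if tok.lower() in cset:
                candidates[i] = sorted(cset - {tok.lower()})
    return candidates


tokens = "hte cat came form a farm".split()
flagged = non_word_errors(tokens)          # "hte" is not in the lexicon
alternatives = real_word_candidates(tokens)  # "form" might be "from"
```

Dictionary lookup only catches "hte". The real-word case ("form" for "from") passes the lookup, so it can only be proposed from a confusion set and then resolved by a language model or a classifier.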

Why CSC is difficult? • Word segmentation • Ambiguity • OOV – proper noun detection (personal name, location, organization, etc.) • Segmentation error propagation • Non-word errors (in the English sense) do not exist • MSPY already makes good use of a word trigram language model

Chinese spelling checking • CSC – related work • Template matching – long distance, e.g. <之所以> <是因为> • Pattern matching – long words (n>=3), e.g. 一文不明 → 一文不名, 忠耿耿 → 忠心耿耿 • N-gram models – substitution errors • CSC – challenges • Long distance, coverage issue of template/pattern set • Frequently used confusion sets, e.g. {像，象} {在，再} • OOV, especially proper nouns • N-gram models have already been fully exploited by MSPY

Chinese spelling error patterns in MSPY • Proper noun • Personal name • Location • Organization • Non-word errors: context independent • Insertion/deletion/transposition/substitution • E.g. 一文不明 → 一文不名, 忠耿耿 → 忠心耿耿 • Real-word errors: context sensitive • E.g. 像 → 象, 在 → 再, 实施 → 事实

Flowchart of our approach • Text with errors • Proper noun detection • Word segmentation • Word fuzzy matching • Trigger: single-character string, low probability • Non-word error correction • Context sensitive disambiguation • Real-word error correction

Word segmentation and proper noun detection • Language model based word segmentation • Class-based language model • $P(W) = P_{\text{outside}}(W)\,P_{\text{inside}}(W \mid \texttt{<PN>})^{a}$, a = ? • Outside probability – PN tagged training data • Using NLPWIN to tag the corpus • Filtering, rule base • EM? • Inside probability – PN list training data • Using cache (or, dynamic dictionary)
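
A minimal sketch of how the class-based score might be combined in the log domain. It reads the "a" on the slide as an exponent on the inside probability, which is an assumption; the slide itself leaves the value of a open. The function name and arguments are hypothetical.

```python
import math


def class_lm_logprob(outside_logprob, inside_logprobs, a=1.0):
    """Combine outside and inside scores of one segmentation hypothesis.

    outside_logprob: log P_outside(W), with each proper noun replaced by <PN>.
    inside_logprobs: one log P_inside(w | <PN>) per proper noun in W.
    a: assumed exponent (weight) on the inside probability.
    """
    return outside_logprob + a * sum(inside_logprobs)


# Example: a hypothesis with one proper-noun slot.
score = class_lm_logprob(math.log(0.01), [math.log(0.1)], a=0.5)
```

Under this reading, a larger a penalizes unlikely proper-noun strings more strongly, so a would have to be tuned on held-out data.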

Experiments and Findings • Measure: precision/recall – definition • Training data – People's Daily • Tagging tool – NLPWIN • Test data – spec. • Results and Findings
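
The slide leaves the precision/recall definition open. One common choice, sketched here as an assumption, is to compare the set of items proposed by the system (for example, detected proper nouns or corrections, keyed by position) with a gold-standard set.

```python
def precision_recall(proposed, gold):
    """Precision = correct proposals / all proposals; recall = correct proposals / all gold items."""
    proposed, gold = set(proposed), set(gold)
    correct = len(proposed & gold)
    precision = correct / len(proposed) if proposed else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical items keyed by (position, label).
system_output = {(0, "a"), (1, "b"), (2, "c"), (3, "d")}
gold_standard = {(0, "a"), (5, "e")}
p, r = precision_recall(system_output, gold_standard)
```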

Long word fuzzy matching • Definition of Distance(s1, s2) • Long word, n>=3 • Sum of character deletions/insertions/substitutions • Fast fuzzy matching • Global – Lei Zhang’s ACL • Local – trigger (single char, or low n-gram probability) • Search – error detection/correction • Viterbi • Simplified version • Long word + Local matching
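
Distance(s1, s2) as described above matches the standard character-level edit distance. The sketch below assumes unit cost for each deletion, insertion, and substitution, and scans the lexicon linearly; it is not the fast matching method referred to on the slide. The function names and the `max_dist` threshold are illustrative.

```python
def edit_distance(s1, s2):
    """Minimum number of character deletions, insertions, and substitutions."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        cur = [i]
        for j, c2 in enumerate(s2, 1):
            cur.append(min(prev[j] + 1,                # delete c1
                           cur[j - 1] + 1,             # insert c2
                           prev[j - 1] + (c1 != c2)))  # substitute or match
        prev = cur
    return prev[-1]


def fuzzy_match(s, lexicon, max_dist=1, min_len=3):
    """Return long lexicon words (length >= min_len) within max_dist of s."""
    return sorted(w for w in lexicon
                  if len(w) >= min_len and edit_distance(s, w) <= max_dist)


lexicon = {"一文不名", "忠心耿耿"}
suggestions = fuzzy_match("忠耿耿", lexicon)
```

In the local variant, such a search would only run on spans flagged by the trigger (a single-character string or a low n-gram probability), which keeps the number of distance computations small.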

Experiments and Findings • Contact: 100 persons, 3000-5000 characters/person • Error analysis • Algorithm … • Measure: precision/recall • Large lexicon, acquisition • Trigger/threshold? • Results and Findings

Context sensitive disambiguation • Building confusion set – specific to MSPY • Feature selection – Context vector • Collocation – contiguous POS or words/characters • Context words – words/characters within a K-size window • Triple? • Weighting scheme and Classifier • Context Vector, TF-IDF • Winnow, Bayesian, TBL, etc. • Scaling up • Enlarge confusion set • Feature pruning • Adaptation
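
A minimal sketch of the context-word features combined with a Bayesian classifier, one of the options listed above. It uses a naive Bayes model with add-one smoothing over words in a K-size window. The class name, the toy training sentences, and the smoothing choice are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def context_features(tokens, i, k=2):
    """Context words: up to k tokens on each side of position i."""
    return tokens[max(0, i - k):i] + tokens[i + 1:i + 1 + k]


class NaiveBayesDisambiguator:
    def __init__(self, confusion_set):
        self.confusion_set = list(confusion_set)
        self.word_counts = Counter()
        self.feat_counts = defaultdict(Counter)
        self.vocab = set()

    def train(self, sentences, k=2):
        """Count context words around every confusion-set member in the data."""
        for tokens in sentences:
            for i, tok in enumerate(tokens):
                if tok in self.confusion_set:
                    self.word_counts[tok] += 1
                    for feat in context_features(tokens, i, k):
                        self.feat_counts[tok][feat] += 1
                        self.vocab.add(feat)

    def predict(self, tokens, i, k=2):
        """Pick the confusion-set member with the highest smoothed log score."""
        feats = context_features(tokens, i, k)
        total = sum(self.word_counts.values())
        best, best_score = None, -math.inf
        for w in self.confusion_set:
            score = math.log((self.word_counts[w] + 1) / (total + len(self.confusion_set)))
            # Add-one smoothing; the extra +1 reserves mass for unseen features.
            denom = sum(self.feat_counts[w].values()) + len(self.vocab) + 1
            for feat in feats:
                score += math.log((self.feat_counts[w][feat] + 1) / denom)
            if score > best_score:
                best, best_score = w, score
        return best


model = NaiveBayesDisambiguator(["在", "再"])
model.train([["我", "在", "家"], ["他", "在", "学校"],
             ["请", "再", "说", "一遍"], ["明天", "再", "见"]])
choice = model.predict(["我们", "再", "见"], 1)
```

A word is flagged as a likely real-word error when the predicted member differs from the one in the text. In practice a confidence margin would be needed before correcting, since the feature set grows quickly as the confusion set is enlarged.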

Experiments and Findings • Measure: precision/recall • Training data • Test data (XXX confusion set) • Results and Findings

Experiments and Findings • Current Work • Pseudo-training set based on MSPY IME • Preliminary data processing (400M PD) • Unigram error model (10,000 useful words) • 使是/69484 市/10289 诗/2394 …… • Trigram error pattern (980,000 useful) • 共[度]难关=>渡 / 不够[英]，=>硬 • Experiments based on basic approaches • Pseudo-test set from 南方周末 • Continuous pair (Recall = 50%, Precision = 25%) • Pattern Matching (??) • Future Work • Hybrid approaches • Pattern Clustering + Continuous pair • Function word error detection
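
A minimal sketch of applying error patterns written in the notation shown above. It assumes that the bracketed character is the erroneous one and that the character after "=>" is its correction; this reading of the notation and the function names are assumptions.

```python
import re

# context-left [wrong char] context-right => corrected char
PATTERN_RE = re.compile(r"^(.*)\[(.)\](.*)=>(.)$")


def parse_pattern(spec):
    """Turn '共[度]难关=>渡' into an (erroneous string, corrected string) pair."""
    match = PATTERN_RE.match(spec)
    if match is None:
        raise ValueError(f"unrecognized pattern: {spec!r}")
    left, wrong, right, fix = match.groups()
    return left + wrong + right, left + fix + right


def apply_patterns(text, specs):
    """Replace every occurrence of each erroneous string with its correction."""
    for spec in specs:
        erroneous, corrected = parse_pattern(spec)
        text = text.replace(erroneous, corrected)
    return text


fixed = apply_patterns("我们一起共度难关。", ["共[度]难关=>渡", "不够[英]，=>硬"])
```

Exact string replacement of this kind only covers errors already seen in the pattern set, which is consistent with the coverage concern raised for template and pattern matching.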

System evaluation – put it all together • Evaluation toolset • Measure: precision/recall • Training data • Test data • Results and Findings

Prototype • Demo … • Online & offline CSC • Right click • Spelling error detection/correction • Proper noun detection/correction

Assignment • Jianfeng Gao – overall, fuzzy matching • Mu Li – context sensitive disambiguation • Jian Sun – PN detection • Yang Wen – system evaluation • Yulin Kang – demo • Lei Zhang – senior consultant

Milestones • Oct. 2001, Ming says “Yes” (TAB demo) • Dec. 2001, Dong says “Yes” (Transfer) • Aug. 2002, HJ says “Yes” (Party)

Information • Access at \\msrcn4p3\rootD\gaojf\spell • Contact me if any problems • Jianfeng Gao, Tel: 86-10-62617711-5778, Email: jfgao@microsoft.com
