Applying Conditional Random Fields to Japanese Morphological Analysis Taku Kudo 1*, Kaoru Yamamoto 2, Yuji Matsumoto 1 1 Nara Institute of Science and Technology 2 CREST, Tokyo Institute of Technology * Currently, NTT Communication Science Labs.
Background • Conditional Random Fields [Lafferty 01] • A variant of Markov Random Fields • Many applications: POS tagging [Lafferty 01], shallow parsing [Sha 03], NE recognition [McCallum 03], IE [Pinto 03, Peng 04] • Japanese Morphological Analysis • Must cope with word segmentation • Must incorporate many features • Must minimize the influence of the length bias
Japanese Morphological Analysis • word segmentation (no explicit spaces in Japanese) • POS tagging • lemmatization, stemming
INPUT: 東京都に住む (I live in the Tokyo Metropolis.)
OUTPUT: 東京 / 都 / に / 住む
東京 (Tokyo) NOUN-PROPER-LOC-GENERAL
都 (Metro.) NOUN-SUFFIX-LOC
に (in) PARTICLE-GENERAL
住む (live) VERB BASE-FORM
Simple approach for JMA • Character-based begin / inside (B/I) tagging • non-standard method in JMA • cannot directly reflect a lexicon • over 90% accuracy can be achieved even with naïve longest-prefix matching against a lexicon • decoding is slow
Example: 東 京 / 都 / に / 住 む → B I B B B I
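A minimal sketch of the naïve longest-prefix-match baseline mentioned above, assuming a toy lexicon (the lexicon entries, function name, and fallback behaviour are illustrative, not from the paper):

```python
# Naive longest-prefix-match segmentation against a toy lexicon (illustrative only).
LEXICON = {"東京", "京都", "東", "京", "都", "に", "住む"}

def longest_prefix_segment(sentence, lexicon, max_len=8):
    """Greedily take the longest dictionary word starting at each position."""
    i, words = 0, []
    while i < len(sentence):
        match = sentence[i]  # fall back to a single character (unknown word)
        for length in range(min(max_len, len(sentence) - i), 0, -1):
            if sentence[i:i + length] in lexicon:
                match = sentence[i:i + length]
                break
        words.append(match)
        i += len(match)
    return words

print(longest_prefix_segment("東京都に住む", LEXICON))
# -> ['東京', '都', 'に', '住む']
```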
Our approach for JMA • Assume that a lexicon is available • word lattice • represents all candidate outputs • reduces redundant outputs • Unknown word processing • invoked when no matching word can be found in the lexicon • based on character types, e.g., Chinese characters (kanji), hiragana, katakana, numbers, etc.
Problem Setting
GOAL: select the optimal path out of all candidates
Input: "東京都に住む" (I live in the Tokyo Metropolis)
Lexicon (excerpt): に [particle, verb], 東 [noun], 京 [noun], 東京 [noun], 京都 [noun], …
Lattice (candidate tokens between BOS and EOS): 東 (east) [noun], 京 (capital) [noun], 東京 (Tokyo) [noun], 京都 (Kyoto) [noun], 都 (Metro.) [suffix], に (in) [particle], に (resemble) [verb], 住む (live) [verb]
NOTE: the number of tokens #Y varies from path to path
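The following sketch shows how such a candidate lattice can be enumerated from a lexicon; the data structures and names are assumptions for illustration, not the authors' implementation (in the real system, unknown-word candidates based on character types would be added wherever the lexicon has no match):

```python
from collections import defaultdict

# Toy lexicon mapping surface forms to candidate POS tags (illustrative only).
LEXICON = {"東": ["noun"], "京": ["noun"], "東京": ["noun"], "京都": ["noun"],
           "都": ["suffix"], "に": ["particle", "verb"], "住む": ["verb"]}

def build_lattice(sentence, lexicon, max_len=8):
    """Collect every (word, POS) candidate that starts at each character position."""
    lattice = defaultdict(list)
    for i in range(len(sentence)):
        for j in range(i + 1, min(i + max_len, len(sentence)) + 1):
            word = sentence[i:j]
            for pos in lexicon.get(word, []):
                lattice[i].append((word, pos))
    return lattice

for start, candidates in sorted(build_lattice("東京都に住む", LEXICON).items()):
    print(start, candidates)
# 0 [('東', 'noun'), ('東京', 'noun')]
# 1 [('京', 'noun'), ('京都', 'noun')]
# 2 [('都', 'suffix')]
# 3 [('に', 'particle'), ('に', 'verb')]
# 4 [('住む', 'verb')]
```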
Complex tagset • Example: 京都 (Kyoto) → Noun-Proper-Loc-General, base form: Kyoto • Hierarchical tagset • HMMs cannot capture such hierarchies • How to select the hidden classes? • TOP level → lack of granularity • Bottom level → data sparseness • Some functional particles should be lexicalized • Semi-automatic hidden class selection [Asahara 00]
Complex tagset, cont. • Must capture a variety of features (a feature-extraction sketch follows this slide)
京都 (Kyoto): noun / proper / loc / general, lexical form: Kyoto
に (in): particle / general / φ / φ, lexical form: に
住む (live): verb / independent / φ / φ, lexical form: 住む (live), inflection: base-form
Relevant feature types: overlapping features, POS hierarchy, character types, prefixes / suffixes, lexicalization, inflections — these features are important to JMA
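A rough sketch of the kinds of overlapping unigram features listed above (POS hierarchy prefixes, lexicalization, character type, inflection); the template names and token representation here are hypothetical, not the authors' feature set:

```python
def unigram_features(token):
    """token: dict with keys 'surface', 'pos' (list of POS hierarchy levels),
    'ctype', 'inflection'; all names are illustrative."""
    feats = []
    pos = token["pos"]
    for depth in range(1, len(pos) + 1):      # every prefix of the POS hierarchy
        feats.append("POS=" + "-".join(pos[:depth]))
    feats.append("WORD=" + token["surface"])   # lexicalization
    feats.append("CTYPE=" + token["ctype"])    # e.g. kanji / hiragana / katakana
    if token.get("inflection"):
        feats.append("INFL=" + token["inflection"])
    return feats

print(unigram_features({"surface": "京都",
                        "pos": ["noun", "proper", "loc", "general"],
                        "ctype": "kanji", "inflection": None}))
# ['POS=noun', 'POS=noun-proper', 'POS=noun-proper-loc',
#  'POS=noun-proper-loc-general', 'WORD=京都', 'CTYPE=kanji']
```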
JMA with MEMMs [Uchimoto 00-03] • Use a discriminative model, e.g., a maximum entropy model, to capture a variety of features • sequential application of ME models (see the factorization below)
Lattice example (BOS → 東 (east) [noun] / 東京 (Tokyo) [noun], 都 (capital) [suffix], に [particle] / に (resemble) [verb]): P(東 | BOS) < P(東京 | BOS); P(に, particle | 都, suffix) > P(に, verb | 都, suffix)
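To make "sequential application of ME models" concrete, the MEMM scores a path as a chain of locally normalized maximum entropy models, roughly as follows (the notation is assumed, not copied from the slides):

```latex
% MEMM path score as a chain of locally normalized ME models (sketch).
P_{\mathrm{MEMM}}(Y \mid X) \;=\;
  \prod_{i=1}^{\#Y} p_{\mathrm{ME}}\big(\langle w_i, t_i\rangle \,\big|\,
     \langle w_{i-1}, t_{i-1}\rangle, X\big)
```

Each factor is normalized only over the candidates available at that step, which is what produces the label and length biases shown on the next two slides.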
Problems of MEMMs • Label bias [Lafferty 01]
[Figure: transition diagram BOS→A (0.6), BOS→B (0.4); A→C (0.4), A→D (0.6); B→E (1.0); C, D, E→EOS (1.0)]
P(A, D | x) = 0.6 * 0.6 * 1.0 = 0.36
P(B, E | x) = 0.4 * 1.0 * 1.0 = 0.4
P(A, D | x) < P(B, E | x) → paths through low-entropy states are preferred
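A toy reproduction of the arithmetic on this slide (numbers taken from the slide, state names illustrative): with locally normalized transition probabilities, the path through the low-entropy state B wins even though each step through A looks reasonable.

```python
# Transition probabilities from the label-bias example above.
trans = {("BOS", "A"): 0.6, ("BOS", "B"): 0.4,
         ("A", "C"): 0.4, ("A", "D"): 0.6,
         ("B", "E"): 1.0, ("C", "EOS"): 1.0,
         ("D", "EOS"): 1.0, ("E", "EOS"): 1.0}

def path_prob(path):
    """Multiply the locally normalized transition probabilities along a path."""
    prob = 1.0
    for prev, cur in zip(path, path[1:]):
        prob *= trans[(prev, cur)]
    return prob

print(path_prob(["BOS", "A", "D", "EOS"]))  # 0.36
print(path_prob(["BOS", "B", "E", "EOS"]))  # 0.4 -> the low-entropy path wins
```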
Problems of MEMMs in JMA • Length bias
[Figure: the same diagram, but B is a single long word covering the span of A and D, so the path BOS→B→EOS has one fewer transition]
P(A, D | x) = 0.6 * 0.6 * 1.0 = 0.36
P(B | x) = 0.4 * 1.0 = 0.4
P(A, D | x) < P(B | x) → longer words are preferred; the length bias has been ignored in JMA!
Can CRFs solve these problems? Yes! Long-standing problems • must incorporate a variety of features • overlapping features, POS hierarchy, lexicalization, character-types • HMMs are not sufficient • must minimize the influence of length bias • another bias observed especially in JMA • MEMMs are not sufficient
CRFs for word lattice
[Figure: the word lattice for 東京都に住む between BOS and EOS, defining the set of all candidate paths]
• encodes a variety of uni-gram and bi-gram features along a path, e.g., BOS - noun, noun - suffix, noun / Tokyo
• Global feature vector F(Y, X) = (… 1 … 1 … 1 …), parameter vector Λ = (… 3 … 20 … 20 …) (model equation below)
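A sketch of the lattice CRF model this slide describes, consistent with the global feature vector F(Y, X) and parameter vector Λ above (the exact notation is assumed):

```latex
% Lattice CRF: one globally normalized exponential model over whole paths.
P_\Lambda(Y \mid X) \;=\; \frac{1}{Z_X}\,
  \exp\big(\Lambda \cdot F(Y, X)\big)
  \;=\; \frac{1}{Z_X}\,
  \exp\Big(\sum_i \sum_k \lambda_k\,
     f_k\big(\langle w_{i-1}, t_{i-1}\rangle, \langle w_i, t_i\rangle\big)\Big),
\qquad
Z_X \;=\; \sum_{Y' \in \mathcal{Y}(X)} \exp\big(\Lambda \cdot F(Y', X)\big)
```

Here 𝒴(X) denotes the set of all candidate paths in the lattice, so normalization is over whole paths rather than per step.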
CRFs for word lattice, cont. • a single exponential model over entire paths • fewer restrictions in feature design • can incorporate a variety of features • can solve the problems of HMMs
Encoding (training) • Maximum Likelihood estimation • all candidate paths are taken into account in encoding • the influence of the length bias is minimized • can solve the problems of MEMMs • A variant of the Forward-Backward algorithm [Lafferty 01] can also be applied to the word lattice (gradient sketched below)
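To make "all candidate paths are taken into account" concrete, the standard CRF log-likelihood gradient (standard form, not copied from the slides) contrasts the observed features with an expectation over every path in the lattice:

```latex
% Gradient of the log-likelihood for training pairs (X_j, Y_j):
% observed features minus the model expectation over all lattice paths.
\frac{\partial L_\Lambda}{\partial \lambda_k} \;=\;
  \sum_j \Big( F_k(Y_j, X_j)
  \;-\; \sum_{Y \in \mathcal{Y}(X_j)} P_\Lambda(Y \mid X_j)\, F_k(Y, X_j) \Big)
```

The expectation term is what the lattice forward-backward algorithm computes efficiently.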
MAP estimation • L2-CRF (Gaussian prior) • non-sparse solution (all features have non-zero weights) • good if most of the given features are relevant • unconstrained optimizers, e.g., L-BFGS, are used • L1-CRF (Laplacian prior) • sparse solution (most features have zero weight) • good if most of the given features are irrelevant • constrained optimizers, e.g., L-BFGS-B, are used • C is a hyper-parameter (objectives sketched below)
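One common way to write the two MAP objectives contrasted above, with C as the hyper-parameter (this exact form is an assumption, not copied from the slides):

```latex
% L2-CRF (Gaussian prior) and L1-CRF (Laplacian prior) objectives, C > 0.
\hat{\Lambda}_{L2} \;=\; \arg\max_\Lambda \sum_j \log P_\Lambda(Y_j \mid X_j)
   \;-\; \frac{\lVert \Lambda \rVert_2^2}{2C}
\qquad
\hat{\Lambda}_{L1} \;=\; \arg\max_\Lambda \sum_j \log P_\Lambda(Y_j \mid X_j)
   \;-\; \frac{\lVert \Lambda \rVert_1}{C}
```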
Decoding • Viterbi algorithm • essentially the same architecture as HMMs and MEMMs
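A minimal Viterbi sketch over a word lattice, assuming hypothetical node/edge structures and a per-edge score function (not the authors' implementation):

```python
def viterbi(nodes, predecessors, score):
    """
    nodes: lattice nodes in topological order; nodes[0] is BOS, nodes[-1] is EOS
    predecessors: dict mapping each non-BOS node to its left-adjacent nodes
    score: function(prev_node, node) -> bigram feature score for that edge
    Returns the highest-scoring path from BOS to EOS.
    """
    best = {nodes[0]: (0.0, None)}           # node -> (best score, back-pointer)
    for node in nodes[1:]:
        candidates = [(best[prev][0] + score(prev, node), prev)
                      for prev in predecessors.get(node, []) if prev in best]
        if candidates:
            best[node] = max(candidates)
    path, node = [], nodes[-1]               # backtrack from EOS
    while node is not None:
        path.append(node)
        node = best[node][1]
    return list(reversed(path))
```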
Data KC and RWCP, widely-used Japanese annotated corpora
Features
京都 (Kyoto): noun / proper / loc / general, lexical form: Kyoto
に (in): particle / general / φ / φ, lexical form: に
住む (live): verb / independent / φ / φ, lexical form: 住む (live), inflection: base-form
Feature types: overlapping features, POS hierarchy, character types, prefixes / suffixes, lexicalization, inflections
Evaluation • three criteria of correctness • seg: word segmentation only • top: word segmentation + top level of POS • all: all information
F = 2 · recall · precision / (recall + precision)
recall = # correct tokens / # tokens in test corpus
precision = # correct tokens / # tokens in system output
Results • L1/L2-CRFs outperform HMM and MEMM • L2-CRFs outperform L1-CRFs • Significance tests: McNemar’s paired test on the labeling disagreements
Influence of the length bias • HMM, CRFs: relative error ratios are not much different across word lengths • MEMM: the number of long-word errors is large → influenced by the length bias
L1-CRFs vs. L2-CRFs • L2-CRFs > L1-CRFs • most of the given features are relevant (POS hierarchies, suffixes/prefixes, character types) • L1-CRFs produce a compact model • # of active features — L2: 791,798 vs. L1: 90,163 (about 11%) • L1-CRFs are worth examining when practical constraints exist
Conclusions • An application of CRFs to JMA • uses a word lattice built from a lexicon instead of character-based begin / inside tags • CRFs offer an elegant solution to the problems with HMMs and MEMMs • can use a wide variety of features (hierarchical POS tags, inflections, character types, etc.) • can minimize the influence of the length bias (the length bias has been ignored in JMA!)
Future work • Tri-gram features • Using all tri-grams is impractical, as they make decoding significantly slower • need a practical feature selection method, e.g., [McCallum 03] • Apply to other non-segmented languages, e.g., Chinese or Thai
CRFs encoding
[Figure: forward scores α accumulated from left-adjacent tokens ⟨w′,t′⟩ and backward scores β from right-adjacent tokens, combined through exp(·) on the lattice between BOS and EOS]
• A variant of the Forward-Backward algorithm [Lafferty 01] can also be applied to the word lattice (recursions sketched below)
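One way to write the α/β recursions the original figure shows, with LT(·)/RT(·) denoting the left- and right-adjacent tokens of a lattice node (this notation is assumed, read off the α, β, and exp(·) symbols in the figure):

```latex
% Forward/backward scores over the word lattice (LT/RT: left/right-adjacent tokens).
\alpha_{\langle w,t\rangle} \;=\;
  \sum_{\langle w',t'\rangle \in LT(\langle w,t\rangle)}
    \alpha_{\langle w',t'\rangle}\,
    \exp\big(\Lambda \cdot \mathbf{f}(\langle w',t'\rangle, \langle w,t\rangle)\big)
\qquad
\beta_{\langle w,t\rangle} \;=\;
  \sum_{\langle w',t'\rangle \in RT(\langle w,t\rangle)}
    \exp\big(\Lambda \cdot \mathbf{f}(\langle w,t\rangle, \langle w',t'\rangle)\big)\,
    \beta_{\langle w',t'\rangle}
```

so that the normalization constant satisfies Z_X = α_EOS = β_BOS.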
Influence of the length bias, cont.
Example 1: "The romance they bet on the sea is …" — MEMMs select the long word ロマンは (romanticist) where the correct analysis is ロマン (romance) / は (particle); context: 海 (sea) に (particle) かけた (bet)
Example 2: "A heart which beats rough waves is …" — MEMMs select the long word ない心 (one's heart) where the correct analysis is ない (not) / 心 (heart); context: 荒波 (rough waves) に (particle) 負け (lose)
• these errors are caused by the influence of the length bias (CRFs analyze these sentences correctly)
Cause of label and length bias
[Figure: the word lattice again — 東 [noun], 京 [noun], 東京 [noun], 京都 [noun], 都 [suffix], に [particle] from BOS — with only the correct path observed in training]
• MEMMs use only the correct path in encoding (training)
• transition probabilities out of unobserved paths are in effect distributed uniformly