
Applying Conditional Random Fields to Japanese Morphological Analysis


Presentation Transcript


  1. Applying Conditional Random Fields to Japanese Morphological Analysis Taku Kudo 1*, Kaoru Yamamoto 2, Yuji Matsumoto 1 1 Nara Institute of Science and Technology 2 CREST, Tokyo Institute of Technology * Currently, NTT Communication Science Labs.

  2. Background
  • Conditional Random Fields [Lafferty 01]
    • a variant of Markov Random Fields
    • many applications: POS tagging [Lafferty 01], shallow parsing [Sha 03], NE recognition [McCallum 03], IE [Pinto 03, Peng 04]
  • Japanese Morphological Analysis (JMA)
    • must cope with word segmentation
    • must incorporate many features
    • must minimize the influence of the length bias

  3. Japanese Morphological Analysis
  • word segmentation (no explicit spaces in Japanese)
  • POS tagging
  • lemmatization, stemming
  INPUT: 東京都に住む (I live in the Tokyo metropolis.)
  OUTPUT: 東京 / 都 / に / 住む
    東京 (Tokyo)  NOUN-PROPER-LOC-GENERAL
    都 (Metro.)   NOUN-SUFFIX-LOC
    に (in)       PARTICLE-GENERAL
    住む (live)   VERB, BASE-FORM

  4. Simple approach for JMA
  • Character-based begin/inside (B/I) tagging, e.g., 東 京 / 都 / に / 住 む → B I B B B I (see the sketch below)
    • a non-standard method in JMA
    • cannot directly reflect lexicons
    • over 90% accuracy can already be achieved with naïve longest-prefix matching against a lexicon
    • decoding is slow
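  The B/I encoding itself is straightforward. A minimal sketch (an illustration, not the paper's implementation) that converts a segmented sentence into character-level tags:

    def to_bi_tags(words):
        """Convert a word-segmented sentence into character-level B/I tags:
        B marks the first character of a word, I marks the rest."""
        chars, tags = [], []
        for word in words:
            for i, ch in enumerate(word):
                chars.append(ch)
                tags.append("B" if i == 0 else "I")
        return chars, tags

    # "東京 / 都 / に / 住む" -> (['東','京','都','に','住','む'], ['B','I','B','B','B','I'])
    print(to_bi_tags(["東京", "都", "に", "住む"]))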

  5. Our approach for JMA
  • Assume that a lexicon is available
  • Word lattice
    • represents all candidate outputs
    • reduces redundant outputs
  • Unknown word processing
    • invoked when no matching word can be found in the lexicon
    • based on character types, e.g., Chinese characters (kanji), hiragana, katakana, numbers, etc. (see the character-type sketch below)
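  As an illustration of character-type based unknown-word handling, a rough sketch follows. The Unicode ranges are standard; the grouping heuristic in the comment is a simplification, not the paper's exact rule.

    def char_type(ch):
        """Classify a character by script type using standard Unicode ranges."""
        code = ord(ch)
        if 0x3040 <= code <= 0x309F:
            return "hiragana"
        if 0x30A0 <= code <= 0x30FF:
            return "katakana"
        if 0x4E00 <= code <= 0x9FFF:
            return "kanji"        # CJK unified ideographs (Chinese characters)
        if ch.isdigit():
            return "number"
        return "other"

    # Runs of characters with the same type can be proposed as unknown-word
    # candidates when no lexicon entry matches at a position.
    print([char_type(c) for c in "東京2024カフェ"])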

  6. Problem Setting
  GOAL: select the optimal path out of all candidates.
  Input: "東京都に住む" (I live in the Tokyo metropolis.)
  Lexicon (excerpt): 東 [noun], 京 [noun], 東京 [noun], 京都 [noun], 都 [suffix], に [particle, verb], 住む [verb], …
  Lattice nodes: BOS, 東 (east) [noun], 京 (capital) [noun], 東京 (Tokyo) [noun], 京都 (Kyoto) [noun], 都 (Metro.) [suffix], に (in) [particle], に (resemble) [verb], 住む (live) [verb], EOS
  NOTE: the number of tokens #Y varies from path to path.
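  A minimal sketch of how such a lattice can be enumerated from a lexicon; the dictionary entries and data layout are illustrative, not the paper's implementation.

    # Hypothetical lexicon: surface form -> list of POS tags.
    LEXICON = {
        "東": ["noun"], "京": ["noun"], "東京": ["noun"], "京都": ["noun"],
        "都": ["suffix"], "に": ["particle", "verb"], "住む": ["verb"],
    }

    def build_lattice(sentence):
        """Enumerate every lexicon word that matches at every character position."""
        nodes = []  # (start, end, surface, pos)
        for start in range(len(sentence)):
            for end in range(start + 1, len(sentence) + 1):
                surface = sentence[start:end]
                for pos in LEXICON.get(surface, []):
                    nodes.append((start, end, surface, pos))
        return nodes

    for node in build_lattice("東京都に住む"):
        print(node)   # e.g., (0, 2, '東京', 'noun'), (1, 3, '京都', 'noun'), ...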

  7. Long-standing Problems in JMA

  8. Complex tagset
  Example: 京都 (Kyoto) → Noun / Proper / Loc / General / Kyoto
  • Hierarchical tagset
    • HMMs cannot capture it: how should the hidden classes be selected?
    • TOP level → lack of granularity
    • Bottom level → data sparseness
  • Some functional particles should be lexicalized
  • Semi-automatic hidden-class selection [Asahara 00]

  9. Complex tagset, cont.
  • Must capture a variety of features: overlapping features, POS hierarchy, character types, prefixes/suffixes, lexicalization, inflections
  Example:
    京都 (Kyoto): noun / proper / loc / general / Kyoto
    に (in): particle / general / φ / φ / に
    住む (live): verb / independent / φ / φ / live / base-form
  These features are important to JMA.

  10. JMA with MEMMs [Uchimoto 00-03]
  • Use a discriminative model, e.g., a maximum entropy model, to capture a variety of features
  • Sequential application of ME models over the lattice, e.g.:
    P(東 | BOS) < P(東京 | BOS)
    P(に, particle | 都, suffix) > P(に, verb | 都, suffix)

  11. Problems of MEMMs
  • Label bias [Lafferty 01]
  Example (locally normalized transition probabilities):
    BOS → A: 0.6, A → C: 0.4, A → D: 0.6, D → EOS: 1.0
    BOS → B: 0.4, B → E: 1.0, E → EOS: 1.0
    P(A, D | x) = 0.6 × 0.6 × 1.0 = 0.36
    P(B, E | x) = 0.4 × 1.0 × 1.0 = 0.40
    P(A, D | x) < P(B, E | x): paths through low-entropy states are preferred

  12. Problems of MEMMs in JMA
  • Length bias
  Example: the single word B spans the same input as the two words A + D:
    P(A, D | x) = 0.6 × 0.6 × 1.0 = 0.36
    P(B | x) = 0.4 × 1.0 = 0.40
    P(A, D | x) < P(B | x): long words are preferred
  • The length bias has been ignored in JMA!
  (The arithmetic for both biases is reproduced in the sketch below.)
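  A small script reproducing the numbers from slides 11 and 12; the transition probabilities are the ones given there.

    from math import prod

    # Label bias (slide 11): the B/E path wins only because B and E each have a
    # single outgoing transition with probability 1.0.
    p_AD = prod([0.6, 0.6, 1.0])   # BOS->A, A->D, D->EOS  = 0.36
    p_BE = prod([0.4, 1.0, 1.0])   # BOS->B, B->E, E->EOS  = 0.40

    # Length bias (slide 12): one long word B spans the same input as A + D but
    # pays for one fewer locally normalized transition, so it is preferred.
    p_B = prod([0.4, 1.0])         # BOS->B, B->EOS        = 0.40

    print(p_AD, p_BE, p_B)          # 0.36 < 0.40 in both comparisons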

  13. Can CRFs solve these problems? Yes!
  Long-standing problems:
  • must incorporate a variety of features
    • overlapping features, POS hierarchy, lexicalization, character types
    • HMMs are not sufficient
  • must minimize the influence of the length bias
    • another bias observed especially in JMA
    • MEMMs are not sufficient

  14. Use of CRFs for JMA

  15. CRFs for word lattice
  • The word lattice (BOS / 東 / 京 / 東京 / 京都 / 都 / に / 住む / EOS) defines the set of all candidate paths.
  • A CRF encodes a variety of uni- or bi-gram features along a path, e.g., BOS-noun, noun-suffix, noun/東京 (Tokyo).
  • Global feature vector F(Y, X) = (… 1 … 1 … 1 …) marks which features fire on the path; parameter vector Λ = (… 3 … 20 … 20 …) holds their weights. (A scoring sketch follows below.)
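  A minimal sketch of scoring a path with the global feature vector: the score is the dot product of Λ with the counts of unigram and bigram features fired along the path. The feature names and weights here are made up for illustration.

    from collections import Counter

    WEIGHTS = {                                   # Λ (hypothetical values)
        ("bigram", "BOS", "noun"): 3.0,
        ("bigram", "noun", "suffix"): 20.0,
        ("unigram", "noun", "東京"): 20.0,
    }

    def global_features(path):
        """path: list of (surface, pos); count unigram and bigram features = F(Y, X)."""
        feats = Counter()
        prev = "BOS"
        for surface, pos in path:
            feats[("bigram", prev, pos)] += 1      # POS bigram feature
            feats[("unigram", pos, surface)] += 1  # lexicalized unigram feature
            prev = pos
        return feats

    def score(path):
        return sum(WEIGHTS.get(f, 0.0) * c for f, c in global_features(path).items())

    print(score([("東京", "noun"), ("都", "suffix")]))   # 3 + 20 + 20 = 43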

  16. CRFs for word lattice, cont.
  • A single exponential model over entire paths
  • Fewer restrictions in feature design
    • can incorporate a variety of features
    • can solve the problems of HMMs

  17. Encoding (training)
  • Maximum Likelihood estimation
    • all candidate paths are taken into account in encoding
    • the influence of the length bias is minimized
    • can solve the problems of MEMMs
  • A variant of Forward-Backward [Lafferty 01] can also be applied to the word lattice (sketch below; see also slide 29)
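  A minimal sketch of the lattice Forward-Backward pass used during encoding. It assumes the lattice is stored as an edge list in topological order; this illustrates the idea and is not the authors' code.

    import math

    def forward_backward(edges, n_nodes):
        """edges: (src, dst, score) tuples ordered so every src precedes its dst.
        Node 0 is BOS, node n_nodes - 1 is EOS."""
        alpha = [0.0] * n_nodes
        beta = [0.0] * n_nodes
        alpha[0] = 1.0                        # sum of exp(score) over BOS -> v paths
        for src, dst, score in edges:
            alpha[dst] += alpha[src] * math.exp(score)
        beta[n_nodes - 1] = 1.0               # sum of exp(score) over v -> EOS paths
        for src, dst, score in reversed(edges):
            beta[src] += beta[dst] * math.exp(score)
        Z = alpha[n_nodes - 1]                # partition function Z(X)
        # Edge marginal needed for the gradient: alpha[src] * exp(score) * beta[dst] / Z
        return alpha, beta, Z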

  18. MAP estimation
  • L2-CRF (Gaussian prior)
    • non-sparse solution (all features have non-zero weights)
    • good if most of the given features are relevant
    • unconstrained optimizers, e.g., L-BFGS, are used
  • L1-CRF (Laplacian prior)
    • sparse solution (most features have zero weight)
    • good if most of the given features are irrelevant
    • constrained optimizers, e.g., L-BFGS-B, are used
  • C is a hyper-parameter (a sketch of both objectives follows below)
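  A small sketch of the two MAP objectives, written as penalty terms added to the negative log-likelihood; the exact scaling by C follows common practice and is an assumption here, not taken verbatim from the slides.

    import numpy as np

    def l2_objective(neg_log_likelihood, weights, C):
        """L2-CRF (Gaussian prior): dense solution, unconstrained optimization."""
        return neg_log_likelihood + np.dot(weights, weights) / (2.0 * C)

    def l1_objective(neg_log_likelihood, weights, C):
        """L1-CRF (Laplacian prior): sparse solution, constrained optimization
        (e.g., L-BFGS-B) because |w| is not differentiable at zero."""
        return neg_log_likelihood + np.sum(np.abs(weights)) / C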

  19. Decoding
  • Viterbi algorithm (see the sketch below)
  • essentially the same architecture as HMMs and MEMMs
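  A minimal Viterbi sketch over the same edge-list lattice representation assumed above; illustrative only.

    def viterbi(edges, n_nodes):
        """edges: (src, dst, score), topologically ordered; node 0 = BOS,
        node n_nodes - 1 = EOS. Returns the best path (node ids) and its score."""
        best = [float("-inf")] * n_nodes
        back = [None] * n_nodes
        best[0] = 0.0
        for src, dst, score in edges:
            if best[src] + score > best[dst]:
                best[dst] = best[src] + score
                back[dst] = src
        path, node = [], n_nodes - 1
        while node is not None:               # follow back-pointers from EOS
            path.append(node)
            node = back[node]
        return list(reversed(path)), best[n_nodes - 1]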

  20. Experiments

  21. Data
  • KC and RWCP, widely-used Japanese annotated corpora

  22. Features
  • overlapping features, POS hierarchy, character types, prefixes/suffixes, lexicalization, inflections
  Example:
    京都 (Kyoto): noun / proper / loc / general / Kyoto
    に (in): particle / general / φ / φ / に
    住む (live): verb / independent / φ / φ / live / base-form

  23. Evaluation
  • Three criteria of correctness:
    • seg: word segmentation only
    • top: word segmentation + top level of POS
    • all: all information
  • Metrics (sketch below):
    recall    = # correct tokens / # tokens in test corpus
    precision = # correct tokens / # tokens in system output
    F         = 2 · recall · precision / (recall + precision)
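  A small sketch of the token-based metrics: a token counts as correct only when its span and its annotation both match the gold standard. The data layout is illustrative.

    def evaluate(gold_tokens, system_tokens):
        gold, system = set(gold_tokens), set(system_tokens)
        correct = len(gold & system)
        recall = correct / len(gold)
        precision = correct / len(system)
        f = 2 * recall * precision / (recall + precision) if correct else 0.0
        return recall, precision, f

    gold = {(0, 2, "東京", "noun"), (2, 3, "都", "suffix"), (3, 4, "に", "particle")}
    sys_ = {(0, 3, "東京都", "noun"), (3, 4, "に", "particle")}
    print(evaluate(gold, sys_))   # recall 1/3, precision 1/2, F = 0.4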

  24. Results
  • L1/L2-CRFs outperform HMMs and MEMMs
  • L2-CRFs outperform L1-CRFs
  • Significance tests: McNemar's paired test on the labeling disagreements

  25. Influence of the length bias
  • HMMs, CRFs: the relative ratios are not much different
  • MEMMs: the number of long-word errors is large → influenced by the length bias

  26. L1-CRFs vs. L2-CRFs
  • L2-CRFs > L1-CRFs
    • most of the given features are relevant (POS hierarchies, suffixes/prefixes, character types)
  • L1-CRFs produce a compact model
    • # of active features: L2: 791,798 vs. L1: 90,163 (about 11%)
  • L1-CRFs are worth examining when practical constraints exist

  27. Conclusions
  • An application of CRFs to JMA
    • uses a word lattice built from a lexicon rather than character-based begin/inside tags
  • CRFs offer an elegant solution to the problems with HMMs and MEMMs
    • can use a wide variety of features (hierarchical POS tags, inflections, character types, etc.)
    • can minimize the influence of the length bias (which has been ignored in JMA!)

  28. Future work
  • Tri-gram features
    • using all tri-grams is impractical, as it makes decoding significantly slower
    • a practical feature selection is needed, e.g., [McCallum 03]
  • Application to other non-segmented languages, e.g., Chinese or Thai

  29. CRF encoding
  • A variant of Forward-Backward [Lafferty 01] can also be applied to the word lattice:
    α(w, t) = sum over (w', t') ending where w begins of  α(w', t') · exp(Λ · f(⟨w', t'⟩, ⟨w, t⟩))
    β(w, t) = sum over (w', t') beginning where w ends of  β(w', t') · exp(Λ · f(⟨w, t⟩, ⟨w', t'⟩))
    with α(BOS) = β(EOS) = 1 and Z(X) = α(EOS) = β(BOS).

  30. Influence of the length bias, cont.
  • 海 (sea) / に (particle) / かけた (bet) / ロマン (romance) / は (particle): "The romance they bet on the sea is …"
    MEMMs instead select the single long word ロマンは (romanticist).
  • 荒波 (rough waves) / に (particle) / 負け (lose) / ない (not) / 心 (heart): "A heart which beats rough waves is …"
    MEMMs instead select the single long word ない心 (one's heart).
  • These errors are caused by the influence of the length bias; CRFs analyze both sentences correctly.

  31. Cause of the label and length bias
  • MEMMs use only the correct path in encoding (training)
  • Transition probabilities for unobserved paths in the lattice (e.g., 東 [noun], 京 [noun], 東京 [noun], 京都 [noun], 都 [suffix], に [particle]) are distributed uniformly
