Improving Statistical Parsing Using Cross-Corpus Data Xiaoqiang Luo IBM T.J. Watson Research Center (joint work with Min Tang of MIT)
NLP Technologies • Statistical parsing • Natural language understanding in spoken dialog systems • Information extraction, and translingual question answering • Automatic extraction of entities and relations from text • Statistical machine translation • Chinese => English, Arabic => English • Cross-lingual search • Topic detection and tracking • Text categorization • Multilingual and translingual taxonomies • Audio-Indexing • Combine speech recognition and search (mono- and cross-lingual)
Content • Motivation • Cross Domain Data • EM Algorithm • Experiments • Future Work
Unsupervised: related work • Charniak ’97 • Blum & Mitchell ’98: co-training • WS02: co-training • McCallum and Nigam ’98: document classification • What we did • Unsupervised adaptation: ASRU’99, ICASSP’00 • Active learning (ACL’02)
Goal • Active learning: select what to annotate • This work: make use of cross-domain (cross-corpus) data -- labeled, but for other purposes
Content • Motivation • Cross Domain Data • EM Algorithm • Experiments • Future Work
Cross Domain/Corpus Data • Claim: cross domain data provides some but NOT all information
AP vs. WSJ Upenn TB • Cross-bracketing:
PKU POS data • 1M words free (of 50M words), from Peking University
PKU -> UPenn Mapping
PKU: 在_p 这_r 辞旧迎新_l 的_u 美好_a 时刻_n ,_w 我_r ….
UPenn: 在_P 这_BNDRY 辞旧迎新_RM 的_BNDRY 美好_VA 时刻_NN ,_PU 我_BNDRY ….
English: At this goodbye-old-welcome-new ’s beautiful moment , I … [at this beautiful moment when we say good-bye to the old year and welcome the new year, I..]
• Mapping: • Map 1-1, m-1 tags • Frequent 1-n: limited context; otherwise untagged • m-n: keep word boundary, untagged • Style difference: drop word boundary • Result: • 93% of words with UPenn tags • 6% of words: keep boundary • 1%: no tag, no word boundary
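The mapping rules above can be sketched as a small lookup. This is only an illustration: the table holds just the tag pairs visible in the example sentence, the names `DIRECT`, `BOUNDARY_ONLY`, `map_sentence`, and `coverage` are hypothetical, and the context-dependent 1-n rules and the "drop word boundary" case are not modeled.

```python
# Hypothetical PKU -> UPenn tag table, limited to pairs seen in the example.
DIRECT = {"p": "P", "n": "NN", "w": "PU", "a": "VA"}
# Tags with no reliable single target: keep the word boundary, leave the tag open.
BOUNDARY_ONLY = {"r", "u"}
BNDRY = "BNDRY"


def map_token(word, pku_tag):
    """Map one (word, PKU tag) pair to (word, UPenn tag, BNDRY, or None)."""
    if pku_tag in DIRECT:
        return word, DIRECT[pku_tag]
    if pku_tag in BOUNDARY_ONLY:
        return word, BNDRY
    return word, None  # unknown tag: left untagged


def map_sentence(tagged):
    """Apply map_token to a list of (word, PKU tag) pairs."""
    return [map_token(word, tag) for word, tag in tagged]


def coverage(mapped):
    """Return (fraction with a UPenn tag, fraction with boundary only)."""
    total = len(mapped)
    if total == 0:
        return 0.0, 0.0
    tagged = sum(1 for _, t in mapped if t is not None and t != BNDRY)
    boundary = sum(1 for _, t in mapped if t == BNDRY)
    return tagged / total, boundary / total
```

Calling `coverage(map_sentence(...))` on a converted corpus would give the kind of breakdown reported in the Result bullets (tagged words vs. boundary-only words).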
Utilize Cross Domain Data • Existing information • Convert into appropriate format • Properties: granularity, reliability, etc • Missing information • EM algorithm
Content • Motivation • Cross Domain Data • EM Algorithm • Experiments • Discussion & Future Work
Definitions • Incomplete data (partial parse trees): $t_p \in T_p$ • Complete data (full parse trees): $t \in T$, where $t = \langle t_m, t_p \rangle$ and $t_m$ is the missing part • $F: T \to T_p$, where $F(t) = t_p$, is a many-to-one mapping • $P(t)$: distribution on $T$ for a given sentence • $P(t_p)$: the distribution that $P(t)$ induces on $T_p$
Algorithm • Find the model parameters $\theta$ that maximize $P_\theta(t_p)$ given $t_p$ • $\theta_0$ initialized by “seed” data
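For reference, the objective can be written out with the definitions from the previous slide. The block below is the textbook EM form, writing $\theta$ for the model parameters and $k$ for the iteration index; it is an assumption that the talk used exactly this update.

```latex
% Likelihood of a partial tree: sum over all full trees that project onto it
P_\theta(t_p) = \sum_{t \in T:\, F(t) = t_p} P_\theta(t)

% E-step: posterior over full trees consistent with t_p under the current model
P_{\theta_k}(t \mid t_p) = \frac{P_{\theta_k}(t)}{P_{\theta_k}(t_p)}, \qquad F(t) = t_p

% M-step: re-estimate parameters from the posterior-weighted full trees
\theta_{k+1} = \arg\max_\theta \sum_{t:\, F(t) = t_p} P_{\theta_k}(t \mid t_p) \, \log P_\theta(t)
```

In a corpus setting the M-step sum also runs over all sentences, and $\theta_0$ comes from the seed treebank.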
Implementation • Constrained decoding • Treat partial tree labels (tp) as constraints • Find missing labels (tm) consistent with tp • Pruning: top-N training • Speed up of the decoder • 2x~16x speed up
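A minimal sketch of constrained decoding with top-N pruning, assuming a parse can be represented as a set of labeled spans and that candidate parses with scores are already available. A real decoder would enforce the constraints during search rather than filter afterwards; the names `is_consistent` and `constrained_top_n` are hypothetical.

```python
from typing import FrozenSet, List, Tuple

Span = Tuple[str, int, int]             # (label, start, end), end exclusive
Parse = Tuple[float, FrozenSet[Span]]   # (model score, labeled spans)


def is_consistent(parse_spans: FrozenSet[Span], constraints: FrozenSet[Span]) -> bool:
    """A candidate is consistent when it contains every constraint span (tp)."""
    return constraints <= parse_spans


def constrained_top_n(candidates: List[Parse],
                      constraints: FrozenSet[Span],
                      n: int) -> List[Parse]:
    """Keep the n best candidates consistent with tp and renormalize their scores."""
    kept = [c for c in candidates if is_consistent(c[1], constraints)]
    kept.sort(key=lambda c: c[0], reverse=True)
    top = kept[:n]
    z = sum(score for score, _ in top)
    if z == 0:
        return []
    # Renormalized scores serve as weights when the parses are used for training.
    return [(score / z, spans) for score, spans in top]
```

The renormalized weights play the role of the posterior over missing labels, restricted to the N surviving parses.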
The Recipe: Cross Domain Data → Pre-processing → Partial Trees → Constrained Decode (using Model) → Full Trees → Update Model → Model
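The recipe can be read as a loop, sketched below with the decoder and trainer passed in as functions. Everything here is illustrative: `run_recipe`, `decode`, and `train` are hypothetical names, and keeping the seed trees (with weight 1.0) in every retraining round is an assumption.

```python
def run_recipe(model, seed_trees, partial_trees, decode, train, iterations=1):
    """Alternate constrained decoding and retraining.

    decode(model, partial) -> list of (weight, full_tree)
    train(weighted_trees)  -> new model
    """
    for _ in range(iterations):
        completed = []
        for partial in partial_trees:
            # Fill in the missing labels of each partial tree with the current model.
            completed.extend(decode(model, partial))
        # Assumption: seed trees are kept at full weight in every round.
        seed = [(1.0, tree) for tree in seed_trees]
        model = train(seed + completed)
    return model
```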
Content • Motivation • Cross Domain Data • EM Algorithm • Experiments • Discussion & Future Work
Experiment Setup • MaxEnt parser (Ratnaparkhi ’97) • Chinese • UPenn Treebank (100K + 120K) • PKU (1M): POS • English • UPenn Treebank (1M) • AP treebank (1M)
Experiment Settings * Improved baseline ** in-domain
CE-2: Relative Error Reduction (% relative error reduction before/after 100K PKU POS data)
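Relative error reduction is conventionally computed as below. This is the usual definition, and it is an assumption that the error here is measured as 100 minus the parser's accuracy or F-measure.

```latex
% E_before: error rate of the baseline model
% E_after:  error rate after adding the PKU POS data
\mathrm{RER} = \frac{E_{\text{before}} - E_{\text{after}}}{E_{\text{before}}} \times 100\%
```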
CE-2: PKU data • A large amount of partially labeled data helped the 100K model a little
EE-2 • AP data: use all brackets, or only the brackets of the highest constituent • Results: • Not helpful to the small model • Hurt performance if the initial model is well trained • Reason: • Information is under-used • Style differences: some constraints are wrong
Semi-supervised training • Cross-domain data • Noisy decoding output as training data • Training with noisy data • Constrain Model -- parameter tying
Parameter Tying • Decoding results are noisy • Constrain model • Features classified: $f_i \in C_j$ • Parameter: $p_i' = p_i + d_j$ for all $f_i \in C_j$ • Idea: change $p_i$ to $p_i'$ only if evidence is strong
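A minimal sketch of the tied update: every feature in a class shares one offset, so the noisy data can only move a whole class at a time. How "strong evidence" is measured is not stated on the slide; the count threshold `min_evidence` and the function name `tied_update` are assumptions for illustration.

```python
def tied_update(params, feature_class, class_delta, class_evidence, min_evidence=50):
    """Return p_i' = p_i + d_j for each feature f_i in class C_j.

    The shared offset d_j is applied only when class C_j has at least
    min_evidence supporting observations (a hypothetical criterion).
    """
    updated = {}
    for feat, p in params.items():
        cls = feature_class[feat]
        strong = class_evidence.get(cls, 0) >= min_evidence
        updated[feat] = p + class_delta.get(cls, 0.0) if strong else p
    return updated
```

With tying, the number of free parameters fitted on the noisy data equals the number of classes rather than the number of features.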
Preliminary Result • Baseline: 200K-word char parser • EM data: Chinese NE data
Result Summary • Semi-supervised learning • most helpful when initial model is insufficiently trained • Useful in early stage of system development
Content • Motivation • Cross Domain Data • EM Algorithm • Experiments • Future Work
Future Work • More on constraining model • Induce feature • Cross-domain data: new features • Sample selection • Voting (Multiple models) • Train on partial trees
Acknowledgements • Todd Ward (AP data) • Fei Xia (PKU -> Upenn mapping) • Brian (Chinese NE data) • Salim and Todd: ideas, discussions
PKU POS data (1M words) • PKU -> UPenn POS mapping (with help of Fei Xia) -- most are 1-1 -- m-1: vn, n -> NN -- 1-m: 的/u -> DEG/DEC (context dependent) -- m-n: r -> DT, PN; Rg -> DT, PN • Other issues: -- Word segmentation style: “lname fname” vs. “lnamefname”
Recipe: an example
PKU: 输入/v 中文/n 是/v 轻而易举/i 的/u 事情/n 。/b
UPenn: 输入/VV 中文/NN 是/VC 轻而易举/VA 的/DEG 事情/NN 。/PU
Char: [VV 输_vvb 入_vve ] [NN 中_nnb 文_nne ] ….
Decode:
0.7 [IP [IP [VP [VV 输_vvb 入_vve VV] [NN 中_nnb 文_nne NN] VP] IP] …
0.3 [IP [NP [VV 输_vvb 入_vve VV] [NN 中_nnb 文_nne NN] NP] …
Retraining!
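The word-to-character step in the example can be sketched as follows. The `b`/`e` suffixes for the first and last character follow the example; the `m` suffix for middle characters and the `u` suffix for single-character words are assumptions, since the example shows only two-character words. The function names are hypothetical.

```python
def word_to_char_tags(word, tag):
    """Split a tagged word into (char, char-level tag) pairs."""
    base = tag.lower()
    if len(word) == 1:
        return [(word, base + "u")]  # assumed suffix for single-character words
    # First char gets "b", last gets "e", any in between get "m" (assumed).
    suffixes = ["b"] + ["m"] * (len(word) - 2) + ["e"]
    return [(ch, base + s) for ch, s in zip(word, suffixes)]


def format_char_chunk(word, tag):
    """Render a word as a bracketed char-level chunk, e.g. [VV 输_vvb 入_vve ]."""
    body = " ".join(f"{ch}_{t}" for ch, t in word_to_char_tags(word, tag))
    return f"[{tag} {body} ]"
```

For instance, `format_char_chunk("输入", "VV")` yields the first chunk of the Char line above.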
Experiment Settings * No chunk labels † Include chunk labels
CE-2: Use NE and Word Model • NE information alone does not help (so far) • Word sense information is important (as shown in CE-1) • 1.1% relative improvement with tags from a word model
Improvement of Baseline * Chinese Char Parser: 5-10% relative
Constrained Decode on Cross Domain Data * Results on the decodable portion of the data (66.7%) using beam width 500. † Include chunk labels.
WSJ and AP Treebanks • Similarities: • WSJ: 23.8 words/sentence, depth 9, 76.6k PPs • AP: 23.7 words/sentence, depth 8, 68.9k PPs • Differences: • WSJ: 26 labels / 36 tags, ADVP, [NP->NP+PP], 17.3k [V+S] or [V+SBAR], 22.0k [S->VP] • AP: 46 labels / 222 tags, Fn/Fr/S, Ti/Tg/Tn, 7.3k [V+Fn] or [V+S] • Word-level differences: upper, regarding, etc.