Improving Statistical Parsing Using Cross-Corpus Data

Xiaoqiang Luo, IBM T.J. Watson Research Center (joint work with Min Tang of MIT)

NLP Technologies • Statistical parsing • Natural language understanding in spoken dialog systems • Information extraction, and translingual question answering • Automatic extraction of entities and relations from text • Statistical machine translation • Chinese => English, Arabic => English • Cross-lingual search • Topic detection and tracking • Text categorization • Multilingual and translingual taxonomies • Audio-Indexing • Combine speech recognition and search (mono- and cross-lingual)

Content • Motivation • Cross Domain Data • EM Algorithm • Experiments • Future Work

Impact of Training Data

Unsupervised: related work • Charniak ’97 • Blum & Mitchell ’98: co-training • WS02: co-training • McCallum and Nigam ’98: document classification • What we did • Unsupervised adaptation: ASRU ’99, ICASSP ’00 • Active learning (ACL ’02)

Goal • Active learning: select what to annotate • This work: make use of cross-domain (cross-corpus) data that is labeled, but for other purposes

Content • Motivation • Cross Domain Data • EM Algorithm • Experiments • Future Work

Cross Domain/Corpus Data • Claim: cross domain data provides some but NOT all information

AP Treebank (example parse tree; figure not preserved)

AP Treebank – WSJ Style

AP vs. WSJ Upenn TB • Cross-bracketing:
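The cross-bracketing comparison between the two treebank styles can be illustrated with a small sketch. It assumes each constituent is represented as a `(start, end)` word span with an exclusive end index; the function names are hypothetical and the slide's actual counting procedure is not shown in the source.

```python
from typing import Iterable, Tuple

Span = Tuple[int, int]  # (start, end) word indices, end exclusive


def crosses(a: Span, b: Span) -> bool:
    """Two spans cross if they overlap but neither contains the other."""
    (s1, e1), (s2, e2) = a, b
    return s1 < s2 < e1 < e2 or s2 < s1 < e2 < e1


def count_crossing(test_spans: Iterable[Span], ref_spans: Iterable[Span]) -> int:
    """Count spans in test_spans that cross at least one reference span."""
    ref = list(ref_spans)
    return sum(any(crosses(t, r) for r in ref) for t in test_spans)
```

For instance, an AP-style bracket over words 0..3 and a WSJ-style bracket over words 2..5 cross, while nested or adjacent brackets do not.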

PKU POS data • 1M words freely available (out of 50M words), from Peking University (PKU)

PKU -> UPenn Mapping • PKU: 在_p 这_r 辞旧迎新_l 的_u 美好_a 时刻_n ，_w 我_r … • UPenn: 在_P 这_BNDRY 辞旧迎新_RM 的_BNDRY 美好_VA 时刻_NN ，_PU 我_BNDRY … • English gloss: At this goodbye-old-welcome-new 's beautiful moment , I … [at this beautiful moment when we say good-bye to the old year and welcome the new year, I ...] • Mapping: • Map 1-1 and m-1 tags • Frequent 1-n: use limited context; otherwise leave untagged • m-n: keep word boundary, leave untagged • Style difference: drop word boundary • Result: • 93% of words get UPenn tags • 6% of words: boundary kept, no tag • 1%: no tag, no word boundary
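A minimal sketch of the tag-mapping step, under stated assumptions: the table below contains only tag pairs that appear in the slide examples and is illustrative, not the actual mapping table; `UNAMBIGUOUS`, `BOUNDARY`, and `map_pku_to_upenn` are hypothetical names. Tags without a single target (such as `r` and `u`) are left untagged but keep their word boundary. The context rules for frequent 1-n tags and the boundary-dropping case are not modeled.

```python
from typing import Dict, List, Tuple

# Illustrative 1-1 / m-1 entries taken from the slide examples only (assumption).
UNAMBIGUOUS: Dict[str, str] = {
    "p": "P", "l": "RM", "a": "VA", "n": "NN", "vn": "NN", "w": "PU",
}
BOUNDARY = "BNDRY"  # word boundary kept, POS tag left unspecified


def map_pku_to_upenn(tokens: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Map PKU tags to UPenn tags; ambiguous tags become boundary-only tokens."""
    return [(word, UNAMBIGUOUS.get(tag, BOUNDARY)) for word, tag in tokens]


pku = [("在", "p"), ("这", "r"), ("辞旧迎新", "l"), ("的", "u"),
       ("美好", "a"), ("时刻", "n"), ("，", "w"), ("我", "r")]
print(" ".join(f"{w}_{t}" for w, t in map_pku_to_upenn(pku)))
```

The boundary-only tokens are what later become partial constraints: the parser must respect the word boundary but is free to choose the tag.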

Utilize Cross Domain Data • Existing information • Convert into appropriate format • Properties: granularity, reliability, etc • Missing information • EM algorithm

Content • Motivation • Cross Domain Data • EM Algorithm • Experiments • Discussion & Future Work

Definitions • Incomplete data (partial parse trees): $t_p \in T_p$ • Complete data (full parse trees): $t \in T$, where $t = \langle t_m, t_p \rangle$ and $t_m$ is the missing part • $F : T \to T_p$, where $F(t) = t_p$, is a many-to-one mapping • $P(t)$: distribution on $T$ for a given sentence • $P(t_p)$: the distribution that $P(t)$ induces on $T_p$

Figure legend: solid = Tp (given partial structure); dashed = Tm (missing structure)

Algorithm • Find the model parameters $\theta$ that maximize $P(t_p)$ given $t_p$ • $\theta_0$ is initialized from “seed” data
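For reference, the objective and one EM iteration can be written out in the generic textbook form, using the definitions of $T$, $T_p$ and $F$ from the Definitions slide. The symbol $\theta$ for the model parameters and the iteration index $k$ are assumed notation; the exact update shown on the original slide was not preserved, so this is a sketch rather than the authors' formula.

```latex
% Assumed notation: \theta = model parameters, k = EM iteration index.
\begin{align*}
P_\theta(t_p) &= \sum_{t \in T:\, F(t) = t_p} P_\theta(t) \\
\text{E-step:}\quad P_{\theta_k}(t \mid t_p) &= \frac{P_{\theta_k}(t)}{P_{\theta_k}(t_p)}
  \quad \text{for } F(t) = t_p \\
\text{M-step:}\quad \theta_{k+1} &= \arg\max_{\theta} \sum_{t:\, F(t) = t_p}
  P_{\theta_k}(t \mid t_p)\, \log P_\theta(t)
\end{align*}
```

The sum runs only over full trees consistent with the observed partial tree, which is why decoding has to be constrained by $t_p$.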

Implementation • Constrained decoding • Treat partial tree labels (tp) as constraints • Find missing labels (tm) consistent with tp • Pruning: top-N training • Speed up of the decoder • 2x~16x speed up
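The constraint check and top-N pruning can be sketched as follows. This is a simplification under stated assumptions: it filters an already generated candidate list instead of constraining the parser's search, trees are represented as sets of labeled spans, and all names (`Bracket`, `consistent`, `top_n_weights`) are hypothetical.

```python
from typing import FrozenSet, List, Tuple

Bracket = Tuple[int, int, str]  # (start, end, label), end exclusive
Tree = FrozenSet[Bracket]


def consistent(tree: Tree, partial: Tree) -> bool:
    """A full tree is consistent if it contains every bracket of the partial tree."""
    return partial <= tree


def top_n_weights(candidates: List[Tuple[Tree, float]],
                  partial: Tree, n: int) -> List[Tuple[Tree, float]]:
    """Keep the n best consistent trees and renormalize their scores to sum to 1."""
    kept = sorted((c for c in candidates if consistent(c[0], partial)),
                  key=lambda c: c[1], reverse=True)[:n]
    total = sum(score for _, score in kept)
    if total == 0:
        return []  # nothing decodable under the constraints
    return [(tree, score / total) for tree, score in kept]
```

The renormalized weights play the role of the posterior over the missing labels, and the weighted trees can then serve as training data for the next model.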

The Recipe (flow diagram, reconstructed from its labels) • Cross Domain Data → Pre-processing → Partial Trees → Constrained Decode (using the current Model) → Full Trees → Update Model → Model

Content • Motivation • Cross Domain Data • EM Algorithm • Experiments • Discussion & Future Work

Experimental Setup • MaxEnt Parser (Ratnaparkhi ’97) • Chinese • UPenn (100K + 120K) Treebank • PKU (1M): POS • English • UPenn (1M) Treebank • AP treebank (1M)

Experiment Settings (table not preserved) • * Improved baseline • ** In-domain

EE-1: amount of supervision

CE-2: with PKU data

CE-2: Relative Error Reduction (% relative error reduction before/after 100K PKU POS data)
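For readers unfamiliar with the metric, relative error reduction is conventionally computed as the drop in error divided by the error before the change. The sketch below assumes error rates expressed as fractions; the function name and the numbers in the usage line are hypothetical and are not results from this work.

```python
def relative_error_reduction(err_before: float, err_after: float) -> float:
    """Percent relative error reduction: 100 * (before - after) / before."""
    if err_before <= 0:
        raise ValueError("err_before must be positive")
    return 100.0 * (err_before - err_after) / err_before


# Hypothetical numbers for illustration only.
print(relative_error_reduction(0.20, 0.18))
```

A negative value means the error went up, which is how a harmful data source would show in such a chart.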

CE-2: PKU data • A large amount of partially labeled data helped the 100K model a little

EE-2 • AP data: use all brackets, or only the brackets of the highest constituent • Results: • Not helpful to the small model • Hurts performance if the initial model is well trained • Reasons: • Information is under-used • Style differences: some constraints are wrong

Semi-supervised training • Cross-domain data • Noisy decoding output as training data • Training with noisy data • Constrain Model -- parameter tying

Parameter Tying • Decoding results are noisy • Constrain the model • Features are grouped into classes: $f_i \in C_j$ • Parameters: $p_i' = p_i + d_j$ for all $f_i \in C_j$ • Idea: change $p_i$ to $p_i'$ only if the evidence is strong
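The tied update can be sketched in a few lines. The sketch assumes parameters are stored in a list indexed by feature and that each feature carries the index of its class; `apply_tying` is a hypothetical name, and how the class offsets are estimated from the noisy data is not specified in the slides, so it is left out.

```python
from typing import List, Sequence


def apply_tying(p: Sequence[float], feature_class: Sequence[int],
                d: Sequence[float]) -> List[float]:
    """Return p_i' = p_i + d_j, where j = feature_class[i] is the class of feature i."""
    if len(p) != len(feature_class):
        raise ValueError("one class index is required per parameter")
    return [p_i + d[j] for p_i, j in zip(p, feature_class)]


# Three features in two classes: only two offsets are free to move.
print(apply_tying([0.5, -1.0, 2.0], [0, 0, 1], [0.1, -0.2]))
```

Because all features in a class share one offset, a shift happens only when the pooled evidence for the whole class supports it, which limits the damage from individual noisy decodes.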

Preliminary Result • Baseline: 200K-word char parser • EM data: Chinese NE data

Result Summary • Semi-supervised learning • most helpful when initial model is insufficiently trained • Useful in early stage of system development

Content • Motivation • Cross Domain Data • EM Algorithm • Experiments • Future Work

Future Work • More on constraining the model • Feature induction • Cross-domain data: new features • Sample selection • Voting (multiple models) • Train on partial trees

Acknowledgements • Todd Ward (AP data) • Fei Xia (PKU -> Upenn mapping) • Brian (Chinese NE data) • Salim and Todd: ideas, discussions

The End


Syntactic Parsing Problem

PKU POS data (1M words) • PKU -> UPenn POS mapping (with help of Fei Xia) • Most are 1-1 • m-1: vn, n -> NN • 1-m: 的/u -> DEG/DEC (context dependent) • m-n: r -> DT, PN; Rg -> DT, PN • Other issues: • Word segmentation style: “lname fname” vs. “lnamefname”

Named Entity Data

Recipe: an example • PKU: 输入/v 中文/n 是/v 轻而易举/i 的/u 事情/n 。/b • UPenn: 输入/VV 中文/NN 是/VC 轻而易举/VA 的/DEG 事情/NN 。/PU • Char: [VV 输_vvb 入_vve ] [NN 中_nnb 文_nne ] … • Decode: 0.7 [IP [IP [VP [VV 输_vvb 入_vve VV] [NN 中_nnb 文_nne NN] VP] IP] … ; 0.3 [IP [NP [VV 输_vvb 入_vve VV] [NN 中_nnb 文_nne NN] NP] … • Retraining!
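The word-to-character conversion in the "Char" step can be sketched as below. The `b` (begin) and `e` (end) suffixes follow the slide's example; the `m` suffix for word-internal characters and the `u` suffix for single-character words are assumptions made for this sketch, since the slide shows only two-character words. Function names are hypothetical.

```python
from typing import List, Tuple


def word_to_char_tags(word: str, pos: str) -> List[str]:
    """Split a POS-tagged word into characters tagged with position suffixes."""
    tag = pos.lower()
    if len(word) == 1:
        return [f"{word}_{tag}u"]                  # assumed single-character suffix
    chars = [f"{word[0]}_{tag}b"]                  # first character: begin
    chars += [f"{ch}_{tag}m" for ch in word[1:-1]]  # assumed word-internal suffix
    chars.append(f"{word[-1]}_{tag}e")             # last character: end
    return chars


def sentence_to_chars(tagged: List[Tuple[str, str]]) -> List[str]:
    """Flatten a tagged sentence into its character-level tag sequence."""
    return [c for word, pos in tagged for c in word_to_char_tags(word, pos)]


print(" ".join(sentence_to_chars([("输入", "VV"), ("中文", "NN")])))
```

The resulting character tags, together with the word-level brackets, form the partial tree that constrains decoding.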

Experiment Settings (table not preserved) • * No chunk labels • † Includes chunk labels

CE-2: Use Raw Text

CE-2: Use NE Information

CE-2: Use NE and Word Model • NE information alone does not help (so far) • Word sense information is important (as shown in CE-1) • 1.1% relative improvement with tags from a word model

Improvement of Baseline * Chinese Char Parser: 5-10% relative

Constrained Decode on Cross Domain Data (table not preserved) • * Results on the decodable data (66.7%) using beam width 500 • † Includes chunk labels

WSJ and AP Treebanks • Similarities: • WSJ: 23.8 words/sentence, depth 9, 76.6k PPs • AP: 23.7 words/sentence, depth 8, 68.9k PPs • Differences: • WSJ: 26 labels / 36 tags, ADVP, [NP->NP+PP], 17.3k [V+S] or [V+SBAR], 22.0k [S->VP] • AP: 46 labels / 222 tags, Fn/Fr/S, Ti/Tg/Tn, 7.3k [V+Fn] or [V+S] • Word-level differences: upper, regarding, etc.
