
Semi-supervised Training of Statistical Parsers

Semi-supervised Training of Statistical Parsers. CMSC 35100 Natural Language Processing January 26, 2006. Roadmap. Motivation: Resource Bottleneck Co-training Co-training with different parsers CFG & LTAG Experiments: Initial seed set size Parse selection Domain porting





  1. Semi-supervised Training of Statistical Parsers CMSC 35100 Natural Language Processing January 26, 2006

  2. Roadmap • Motivation: • Resource Bottleneck • Co-training • Co-training with different parsers • CFG & LTAG • Experiments: • Initial seed set size • Parse selection • Domain porting • Results and discussion

  3. Motivation: Issues • Current statistical parsers • Many grammatical models • Significant progress: F-score ~ 93% • Issues: • Trained on ~1M words Penn WSJ treebank • Annotation: significant investment: time & money • Portability: • Single genre – business news • Later treebanks – smaller, still news • Training resource bottleneck

  4. Motivation: Approach • Goal: • Enhance portability, performance without large amounts of additional training data • Observations: • “Self-training”: Train parser on own output • Very small improvement (better counts for heads) • Limited to slightly refining current model • Ensemble methods, voting: useful • Approach: Co-training

  5. Co-Training • Co-Training (Blum & Mitchell 1998) • Weakly supervised training technique • Successful for basic classification • Materials • Small “seed” set of labeled examples • Large set of unlabeled examples • Training: Evidence from multiple models • Optimize degree of agreement b/t models on unlabeled data • Train several models on seed data • Run on unlabeled data • Use new “reliable” labeled examples to train others • Iterate
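The loop on this slide can be sketched in a few lines. This is a minimal illustration, not the procedure from Blum & Mitchell or Steedman et al.: the `ViewClassifier` class, the `co_train` function, and the `rounds` and `per_round` parameters are all hypothetical, and toy classifiers that each look at one feature (one "view") stand in for real parsers.

```python
class ViewClassifier:
    """Toy model (an assumption for illustration): predicts from one feature only."""

    def __init__(self, view):
        self.view = view      # index of the feature this model is allowed to see
        self.counts = {}      # feature value -> {label: count}

    def train(self, examples):
        # Retrain from scratch on (features, label) pairs.
        self.counts = {}
        for x, y in examples:
            by_label = self.counts.setdefault(x[self.view], {})
            by_label[y] = by_label.get(y, 0) + 1

    def label(self, x):
        # Return (label, confidence); (None, 0.0) if the feature value is unseen.
        by_label = self.counts.get(x[self.view])
        if not by_label:
            return None, 0.0
        best = max(by_label, key=by_label.get)
        return best, by_label[best] / sum(by_label.values())


def co_train(model_a, model_b, seed, unlabeled, rounds=5, per_round=2):
    """Each model labels the pool; its most confident labels train the OTHER model."""
    data_a, data_b = list(seed), list(seed)
    pool = list(unlabeled)
    for _ in range(rounds):
        model_a.train(data_a)
        model_b.train(data_b)
        if not pool:
            break
        for teacher, student_data in ((model_a, data_b), (model_b, data_a)):
            candidates = []
            for x in pool:
                label, confidence = teacher.label(x)
                if label is not None:
                    candidates.append((confidence, x, label))
            # "Reliable" here simply means highest teacher confidence.
            candidates.sort(key=lambda c: c[0], reverse=True)
            for confidence, x, label in candidates[:per_round]:
                student_data.append((x, label))
                pool.remove(x)
    # Final retraining so both models see everything they were given.
    model_a.train(data_a)
    model_b.train(data_b)
    return model_a, model_b
```

With a seed such as `[(("red", "round"), "apple"), (("yellow", "long"), "banana")]`, a model that sees only colour can label `("red", "oval")` for a model that sees only shape, which is the sense in which evidence flows between views. Picking examples purely by teacher confidence is the simplest possible choice.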

  6. Co-training Issues • Challenge: • Picking reliable novel examples • No guaranteed, simple approach • Rely on heuristics • Intersection: Highly ranked by other; low by self • Difference: Score by other exceeds self by some margin • Possibly employ parser confidence measures
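The two heuristics named on this slide can be written down directly. The sketch below is illustrative only: the function names, the dictionary-of-scores representation, and the assumption that the two parsers' scores are on a comparable scale are not taken from the slides.

```python
def select_top_n(teacher_scores, n):
    """Ids of the n sentences the teacher scores highest."""
    ranked = sorted(teacher_scores, key=lambda i: teacher_scores[i], reverse=True)
    return sorted(ranked[:n])


def select_intersect(teacher_scores, student_scores, n):
    """Intersection: in the teacher's top n AND in the student's bottom n."""
    top_teacher = set(select_top_n(teacher_scores, n))
    bottom_student = set(sorted(student_scores, key=lambda i: student_scores[i])[:n])
    return sorted(top_teacher & bottom_student)


def select_diff(teacher_scores, student_scores, margin):
    """Difference: teacher's score exceeds the student's by at least `margin`.

    Assumes both score dictionaries use the same sentence ids and a
    comparable scale (an assumption of this sketch).
    """
    return sorted(
        i for i in teacher_scores
        if teacher_scores[i] - student_scores[i] >= margin
    )
```

Both heuristics look for sentences the teacher handles confidently and the student does not, on the reasoning that those are the sentences from which the student has the most to learn.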

  7. Experimental Structure • Approach (Steedman et al., 2003) • Focus here: Co-training with different parsers • Also examined reranking, supertaggers & parsers • Co-train CFG (Collins) & LTAG • Data: Penn Treebank WSJ, Brown, NA News • Questions: • How to select reliable novel samples? • How does labeled seed size affect co-training? • How effective is co-training within and across genres?

  8. System Architecture • Two “different” parsers • “Views” – can be different by feature space • Here Collins CFG & LTAG • Comparable performance, different formalisms • Cache Manager • Draws unlabeled sentences for parsers to label • Selects subset of newly labeled sentences to add to training set

  9. Two Different Parsers • Both train on treebank input • Lexicalized, head information percolated • Collins-CFG • Lexicalized CFG parser • “Bi-lexical”: each pair of non-terminals leads to bigram relation b/t pair of lexical items • Ph = head percolation; Pm = modifiers of head daughter • LTAG: • Lexicalized TAG parser • Bigram relations b/t trees • Ps = substitution probability; Pa = adjunction probability • Different in tree creation and depth of lexical relations

  10. Selecting Labeled Examples • Scoring the parse • Ideal – true – score impossible • F-prob: trust the parser; F-norm-prob: norm by len • F-entropy: Diff b/t parse score distr and uniform • Baseline: # of parses, sentence length • Selecting (newly labeled) sentences • Goal: minimize noise, maximize training utility • S-base: n highest scores (both parsers use same) • Asymmetric: teacher/student • S-topn: teacher’s top n • S-intersect: sentences in teacher’s top n, student’s bottom n • S-diff: teacher’s score higher than student’s by some amount
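The scoring functions on this slide are only named, not defined, so the following is one plausible reading rather than the formulas used in the experiments. It assumes each parser returns log probabilities for its n-best parses of a sentence; the per-word normalization in `f_norm_prob` and the "log k minus entropy" form of `f_entropy` are assumptions of this sketch.

```python
import math


def f_prob(log_probs):
    """Trust the parser: score is the log probability of its best parse."""
    return max(log_probs)


def f_norm_prob(log_probs, sentence_length):
    """Best-parse log probability per word, so long sentences are not penalized.

    Dividing by length is an assumed normalization, not taken from the slides.
    """
    if sentence_length <= 0:
        raise ValueError("sentence_length must be positive")
    return max(log_probs) / sentence_length


def f_entropy(log_probs):
    """Gap between the uniform distribution's entropy and the n-best entropy.

    0 means the parser finds all candidate parses equally likely (no confidence);
    larger values mean probability mass is concentrated on few parses.
    A sentence with a single candidate parse also scores 0 under this definition.
    """
    peak = max(log_probs)
    weights = [math.exp(lp - peak) for lp in log_probs]  # shift for numeric stability
    total = sum(weights)
    probs = [w / total for w in weights]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return math.log(len(probs)) - entropy
```

Any of these scores could be plugged into a selection method such as S-topn or S-diff as the value by which teacher and student rank each newly parsed sentence.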

  11. Experiments: Initial Seed Size • Typically evaluate after all training • Consider convergence rate • Initial rapid growth – tailing off w/more • Largest improvement: 500-1000 instances • Collins-CFG plateaus at 40K (89.3) • LTAG still improving • Will benefit from additional training • Co-training w/500 vs 1000 instances • Less data, greater benefit • Enhance coverage • However, 500 seed doesn’t reach level of 1000 seed

  12. Experiments: Parse Selection • Contrast: • Select-all newly labeled vs S-intersect (67%) • Co-training experiments: • 500 seed set • LTAG performs better w/S-intersect • Reduces noise, LTAG sensitive to noisy trees • CFG performs better w/S-select-all • CFG needs to increase coverage, more samples

  13. Experiments: Cross-domain • Train on Brown corpus – 1000 seed • Co-train on WSJ • CFG w/S-intersect improves, 76.6 -> 78.3 • Mostly in first 5 iterations • Lexicalizing for new domain vocab • Train on Brown + 100 WSJ seed • Co-train on other WSJ • Base improves to 78.7, co-train to 80 • Gradual improvement, new constructs?

  14. Summary • Semi-supervised parser training • Co-training • Two different parse formalisms provide diff’t views • Enhances effectiveness • Biggest gains with small seed sets • Cross-domain enhancement • Selection methods dependent on • Parse model, amount of seed data

  15. Findings • Co-training enhances parsing when trained on small datasets: 500-10,000 sentences • Co-training aids genre porting w/o labels • Co-training improved w/ANY labels for genre • Approaches for crucial sample selection
