
In Search of a More Probable Parse: Experiments with DOP* and the Penn Chinese Treebank



  1. In Search of a More Probable Parse: Experiments with DOP* and the Penn Chinese Treebank Aaron Meyers Linguistics 490 Winter 2009

  2. Syntax 101 • Given a sentence, produce a syntax tree (parse) • Example: ‘Mary likes books’ • Software which does this is known as a parser

  3. Grammars • Context-Free Grammar (CFG) • Simple rules describing potential configurations • From the example: • S → NP VP • NP → Mary • VP → V NP • V → likes • NP → books • Problems with ambiguity
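As a minimal sketch of how such a grammar can be run (assuming Python and NLTK, neither of which the slides name), the fragment below encodes the rules above and parses the example sentence:

```python
# Toy grammar from this slide, written for NLTK (an assumed toolkit;
# the slides do not name one). Terminals are quoted.
import nltk

grammar = nltk.CFG.fromstring("""
    S  -> NP VP
    NP -> 'Mary' | 'books'
    VP -> V NP
    V  -> 'likes'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("Mary likes books".split()):
    print(tree)  # (S (NP Mary) (VP (V likes) (NP books)))
```

With a larger grammar the same call returns one tree per analysis, which is the ambiguity problem the slide notes.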

  4. Tree Substitution Grammar (TSG) • Incorporates larger tree fragments • Substitution operator (◦) combines fragments • A context-free grammar is a trivial TSG • [slide figure: two fragments combined via ◦ into a complete parse tree]
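A hedged sketch of the substitution operator, which the slide shows only graphically (`substitute` is a name introduced here, not from the slides): a fragment is attached at the leftmost open substitution site whose label matches the fragment's root.

```python
from nltk import Tree

def substitute(host, fragment):
    """host ◦ fragment: attach fragment at the leftmost frontier
    nonterminal whose label matches the fragment's root."""
    host = host.copy(deep=True)
    for pos in host.treepositions():      # preorder = leftmost first
        node = host[pos]
        if isinstance(node, Tree) and len(node) == 0 \
                and node.label() == fragment.label():
            host[pos] = fragment
            return host
    raise ValueError("no open substitution site for " + fragment.label())

# (S (NP ) (VP (V likes) (NP ))) with two open NP sites
host = Tree("S", [Tree("NP", []),
                  Tree("VP", [Tree("V", ["likes"]), Tree("NP", [])])])
result = substitute(substitute(host, Tree("NP", ["Mary"])),
                    Tree("NP", ["books"]))
print(result)  # (S (NP Mary) (VP (V likes) (NP books)))
```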

  5. Treebanks • Database of sentences and corresponding syntax trees • Trees are hand-annotated • Penn Treebanks among most commonly used • Grammars can be created automatically from a treebank (training) • Extract rules (CFG) or fragments (TSG) directly from trees
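Rule extraction from a treebank tree is mechanical; a minimal sketch with NLTK (the tree string is illustrative, not taken from the treebank files):

```python
from nltk import Tree

t = Tree.fromstring(
    "(S (NP (NNP Mary)) (VP (VBZ likes) (NP (NNS books))))")
for prod in t.productions():  # one CFG rule per internal node
    print(prod)
# S -> NP VP
# NP -> NNP
# NNP -> 'Mary'
# VP -> VBZ NP
# ...
```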

  6. Learning Grammar from Treebank • Many rules or fragments will occur repeatedly • Incorporate frequencies into grammar • Probabilistic Context-Free Grammar (PCFG), Stochastic Tree Substitution Grammar (STSG) • Data-Oriented Parsing (DOP) model • DOP1 (1992): Type of STSG • Describes how to extract fragments from a treebank for inclusion in grammar (model) • Generally limit fragments to a certain max depth
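A simplified sketch of DOP1-style fragment extraction (the helper names are mine, and depth bookkeeping is approximate): at each child, a fragment either cuts there, leaving an open substitution site, or keeps expanding, down to a maximum depth. Counting repeated fragments then gives the grammar its frequencies.

```python
from collections import Counter
from itertools import product
from nltk import Tree

def fragments(node, depth):
    """All fragments rooted at `node` with at most `depth` levels."""
    if not isinstance(node, Tree):   # terminal words always attach
        yield node                   # to their preterminal
        return
    if depth == 0:
        return
    options = []
    for child in node:
        # a Tree child may be cut into an open substitution site
        cut = [Tree(child.label(), [])] if isinstance(child, Tree) else []
        options.append(cut + list(fragments(child, depth - 1)))
    for combo in product(*options):
        yield Tree(node.label(), list(combo))

corpus = [Tree.fromstring(
    "(S (NP (NNP Mary)) (VP (VBZ likes) (NP (NNS books))))")]
counts = Counter()
for tree in corpus:
    for pos in tree.treepositions():
        if isinstance(tree[pos], Tree):
            counts.update(str(f) for f in fragments(tree[pos], 3))
# Relative frequency per root label gives the STSG probabilities.
print(counts.most_common(3))
```

Even on this toy tree the fragment count grows quickly with depth, which is exactly the exponential blow-up the next slide attributes to DOP1.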

  7. Penn Chinese Treebank • Latest version 6.0 (2007) • Xinhua newswire (7339 sentences) • Sinorama news magazine (7106 sentences) • Hong Kong news (519 sentences) • ACE Chinese broadcast news (9246 sentences)

  8. Penn Chinese Treebank and DOP • Corpus sections as on the previous slide • Previous experiments (2004) with the Penn Chinese Treebank and DOP1 (Hearne & Way 2004) • 1473 trees selected from the Xinhua newswire • Fragment depth limited to three levels or less

  9. An improved DOP model: DOP* • Challenges with DOP1 model • Computationally inefficient (exponential increase in number of fragments extracted) • Statistically inconsistent • A new estimator: DOP* (2005) • Limits fragment extraction by estimating optimal fragments using subsets of training corpus • Linear rather than exponential increase in fragments • Statistically consistent (accuracy increases as size of training corpus increases)

  10. Research Question & Hypothesis • Will a DOP* parser applied to the Penn Chinese Treebank show significant improvement in accuracy for a model incorporating fragments up to depth five compared to a model incorporating only fragments up to depth three? • Hypothesis: Yes, accuracy will significantly increase • Deeper fragments allow parser to capture non-local dependencies in syntax usage/preference

  11. Selecting training and testing data • Subset of Xinhua newswire (2402 sentences) • Includes only IP trees (no headlines or fragments) • Excluded sentences of average length or longer • Remaining 1402 sentences divided three times into random training/test splits • Each test split has 140 sentences • The other 1262 sentences are used for training
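A sketch of the three random splits described above (the function name and fixed seed are mine; the slide gives only the sizes):

```python
import random

def make_splits(trees, n_splits=3, test_size=140, seed=0):
    """Return n_splits (train, test) pairs drawn at random."""
    rng = random.Random(seed)
    splits = []
    for _ in range(n_splits):
        shuffled = trees[:]
        rng.shuffle(shuffled)
        splits.append((shuffled[test_size:], shuffled[:test_size]))
    return splits

# With the 1402 retained sentences this yields three pairs of
# 1262 training and 140 test sentences.
```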

  12. Preparing the trees • Penn Treebank converted to dopdis format • Chinese characters converted to alphanumeric codes • Standard tree normalizations • Removed empty nodes • Removed A over A and X over A unaries • Stripped functional tags • Original: (IP (NP-PN-SBJ (NR 上海) (NR 浦东)) (VP … • Converted: (ip,[(np,[(nr,[(hmeiahodpp_,[])]),(nr,[(hodoohmejc_,[])])]),(vp, …
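A sketch of the conversion step. The (label,[children]) nesting and the functional-tag stripping follow the example above; the character encoding used here (hex of the UTF-8 bytes) is a stand-in, not the actual alphanumeric code scheme, and the empty-node and unary removals are omitted. The VP in the sample tree is filled in for illustration.

```python
from nltk import Tree

def encode(tok):
    # Placeholder transliteration; NOT the real dopdis code scheme.
    return "x" + tok.encode("utf-8").hex() + "_"

def to_dopdis(node):
    if not isinstance(node, Tree):              # terminal word
        return "(%s,[])" % encode(node)
    label = node.label().split("-")[0].lower()  # strip functional tags
    return "(%s,[%s])" % (label, ",".join(to_dopdis(c) for c in node))

t = Tree.fromstring("(IP (NP-PN-SBJ (NR 上海) (NR 浦东)) (VP (VV 开发)))")
print(to_dopdis(t))
# (ip,[(np,[(nr,[(x..._,[])]),(nr,[(x..._,[])])]),(vp,[(vv,[(x..._,[])])])])
```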

  13. Training & testing the parser • A DOP* parser is created by training a model on the training trees • The parser is then tested by processing the test sentences • Parse trees returned by the parser are compared with the original parse trees from the treebank • Standard evaluation metrics computed: labeled recall, labeled precision, and f-score (the harmonic mean of the two) • Repeated for each depth level and test/training split
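A sketch of this PARSEVAL-style comparison (helper names are mine): collect each tree's labeled constituent spans and score the overlap.

```python
from nltk import Tree

def labeled_spans(tree):
    """Set of (label, start, end) over the tree's constituents.
    A set is safe here because unary chains were removed during
    normalization (slide 12)."""
    spans = set()
    def walk(node, start):
        if not isinstance(node, Tree):
            return start + 1          # a word covers one slot
        end = start
        for child in node:
            end = walk(child, end)
        spans.add((node.label(), start, end))
        return end
    walk(tree, 0)
    return spans

def score(gold, test):
    g, t = labeled_spans(gold), labeled_spans(test)
    match = len(g & t)
    precision = match / len(t)
    recall = match / len(g)
    f = 2 * precision * recall / (precision + recall) if match else 0.0
    return precision, recall, f

# score(gold_tree, parser_tree) -> (labeled precision, labeled recall, f)
```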

  14. Parsing Results

  15. Other interesting statistics • Training time at depth-3 and depth-5 is similar, even though depth-5 has a much higher fragment count • Testing time at depth-5, however, is ten times that at depth-3!

  16. Conclusion • Parsing results for the other two test/training splits still to be obtained; if they are similar: • Increasing fragment extraction depth from three to five does not significantly improve accuracy for a DOP* parser on the Penn Chinese Treebank • Statistical significance still to be determined • Any practical benefit is negated by the increased parsing time

  17. Future Work • Increase the size of the training corpus • DOP* estimator consistency: accuracy should increase as a larger training corpus is used • Perform the experiment with the DOP1 model • Accuracy obtained with DOP* was lower than in previous experiments using DOP1 (Hearne & Way 2004) • Qualitative analysis • Which constructions are captured more accurately?

  18. Future Work • Perform experiments with other corpora • Other sections of the Chinese Treebank • Other treebanks: Penn Arabic Treebank, … • Increase the capacity and stability of the dopdis system • Encountered various failures on larger runs, with crashes after as long as 36 hours • Efficiency could be improved by larger memory support (a 64-bit architecture) and by storing and indexing fragments in a relational database system
