In Search of a More Probable Parse: Experiments with DOP* and the Penn Chinese Treebank Aaron Meyers Linguistics 490 Winter 2009
Syntax 101 • Given a sentence, produce a syntax tree (parse) • Example: ‘Mary likes books’ • Software which does this is known as a parser
Grammars • Context-Free Grammar (CFG) • Simple rules describing potential configurations • From example: • S → NP VP • NP → Mary • VP → V NP • V → likes • NP → books • Problem: ambiguity (a grammar may license many different trees for the same sentence)
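The toy grammar above can be tried directly. A minimal sketch using NLTK (the toolkit choice is mine; the talk does not name one):

    import nltk

    # The toy grammar from the slide, in NLTK's CFG notation
    grammar = nltk.CFG.fromstring("""
        S  -> NP VP
        VP -> V NP
        NP -> 'Mary' | 'books'
        V  -> 'likes'
    """)

    # A chart parser enumerates every parse licensed by the grammar
    parser = nltk.ChartParser(grammar)
    for tree in parser.parse(['Mary', 'likes', 'books']):
        print(tree)   # (S (NP Mary) (VP (V likes) (NP books)))

With one rule per configuration the sentence gets a single parse; ambiguity shows up as soon as the grammar licenses more than one tree for the same string.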
Tree Substitution Grammar (TSG) • Incorporates larger tree fragments • Substitution operator (◦) combines fragments • Context-free grammar is a trivial TSG • [Diagram: two fragments combined by substitution into a complete parse tree]
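A toy sketch of the substitution operator, using a homemade fragment representation (the slide only shows it diagrammatically): a fragment is a [label, children] list, and children == None marks an open substitution site on the frontier.

    def substitute(tree, fragment):
        """Plug `fragment` into the leftmost open site whose category
        matches the fragment's root label; modifies `tree` in place."""
        children = tree[1]
        if children is None:
            return None
        for i, child in enumerate(children):
            if child[1] is None and child[0] == fragment[0]:
                children[i] = fragment      # categories match: substitute here
                return tree
            if substitute(child, fragment) is not None:
                return tree
        return None

    # S fragment with two open NP sites, echoing the slide's example
    s = ['S', [['NP', None], ['VP', [['V', [['likes', []]]], ['NP', None]]]]]
    substitute(s, ['NP', [['Mary', []]]])    # fills the subject slot
    substitute(s, ['NP', [['books', []]]])   # fills the object slot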
Treebanks • Database of sentences and corresponding syntax trees • Trees are hand-annotated • Penn Treebanks among most commonly used • Grammars can be created automatically from a treebank (training) • Extract rules (CFG) or fragments (TSG) directly from trees
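Extracting CFG rules from a treebank is mechanical. A sketch with NLTK (the file name is a placeholder, not a path from the experiments; it assumes one bracketed tree per line):

    from collections import Counter
    from nltk import Tree

    rule_counts = Counter()
    with open('treebank_trees.txt', encoding='utf-8') as f:
        for line in f:                              # one bracketed tree per line
            tree = Tree.fromstring(line)
            rule_counts.update(tree.productions())  # the CFG rules this tree uses

    for rule, n in rule_counts.most_common(5):
        print(n, rule)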
Learning Grammar from Treebank • Many rules or fragments will occur repeatedly • Incorporate frequencies into grammar • Probabilistic Context-Free Grammar (PCFG), Stochastic Tree Substitution Grammar (STSG) • Data-Oriented Parsing (DOP) model • DOP1 (1992): Type of STSG • Describes how to extract fragments from a treebank for inclusion in grammar (model) • Generally limit fragments to a certain max depth
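For a PCFG the standard estimator is relative frequency: each rule's probability is its count divided by the total count of rules sharing its left-hand side; DOP models weight tree fragments analogously. Continuing from the rule_counts sketch above:

    from collections import defaultdict

    lhs_totals = defaultdict(int)
    for rule, n in rule_counts.items():
        lhs_totals[rule.lhs()] += n        # total expansions of each nonterminal

    # P(A -> alpha) = count(A -> alpha) / count(A)
    pcfg = {rule: n / lhs_totals[rule.lhs()] for rule, n in rule_counts.items()}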
Penn Chinese Treebank • Latest version 6.0 (2007) • Xinhua newswire (7339 sentences) • Sinorama news magazine (7106 sentences) • Hong Kong news (519 sentences) • ACE Chinese broadcast news (9246 sentences)
Penn Chinese Treebank and DOP • Previous experiments (2004) with Penn Chinese Treebank and DOP1 • 1473 trees selected from Xinhua newswire • Fragment depth limited to three levels or less
An improved DOP model: DOP* • Challenges with DOP1 model • Computationally inefficient (exponential increase in number of fragments extracted) • Statistically inconsistent • A new estimator: DOP* (2005) • Limits fragment extraction by estimating optimal fragments using subsets of training corpus • Linear rather than exponential increase in fragments • Statistically consistent (accuracy increases as size of training corpus increases)
Research Question & Hypothesis • Will a DOP* parser trained on the Penn Chinese Treebank show a significant improvement in accuracy when the model incorporates fragments up to depth five rather than only up to depth three? • Hypothesis: Yes, accuracy will significantly increase • Deeper fragments allow the parser to capture non-local dependencies in syntax usage/preference
Selecting training and testing data • Subset of Xinhua newswire (2402 sentences) • Includes only IP trees (no headlines or fragments) • Excluded sentences of average or greater length • Remaining 1402 sentences divided three times into random training/test splits • Each test split has 140 sentences • Other 1262 sentences used for training
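A minimal sketch of the splitting procedure (seed values and the loader are illustrative; the talk does not specify how the random splits were drawn):

    import random

    trees = load_trees()                  # hypothetical loader for the 1402 trees
    for seed in (1, 2, 3):                # three independent splits
        rng = random.Random(seed)
        shuffled = trees[:]
        rng.shuffle(shuffled)
        test, train = shuffled[:140], shuffled[140:]   # 140 test / 1262 training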
Preparing the trees • Penn Treebank converted to dopdis format • Chinese characters converted to alphanumeric codes • Standard tree normalizations • Removed empty nodes • Removed A over A and X over A unaries • Stripped functional tags • Original: (IP (NP-PN-SBJ (NR 上海) (NR 浦东)) (VP … • Converted: (ip,[(np,[(nr,[(hmeiahodpp_,[])]),(nr,[(hodoohmejc_,[])])]),(vp, …
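The normalizations can be sketched with NLTK trees (illustrative only; the real pipeline produced the dopdis format and also re-encoded the Chinese characters, which is omitted here, and handled X-over-A unaries similarly):

    from nltk import Tree

    def normalize(t):
        """Strip functional tags, drop empty nodes, collapse A-over-A unaries."""
        if isinstance(t, str):
            return t
        kids = [normalize(c) for c in t
                if isinstance(c, str) or c.label() != '-NONE-']  # drop empty nodes
        kids = [k for k in kids if k is not None]
        if not kids:
            return None                        # subtree contained only empty nodes
        label = t.label().split('-')[0]        # NP-PN-SBJ -> NP
        if len(kids) == 1 and isinstance(kids[0], Tree) and kids[0].label() == label:
            return kids[0]                     # collapse A-over-A unary
        return Tree(label, kids)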
Training & testing the parser • DOP* parser is created by training a model with the training trees • The parser is then tested by processing the test sentences • Parse trees returned by the parser are compared with the original parse trees from the treebank • Standard evaluation metrics computed: labeled recall, labeled precision, and f-score (the harmonic mean of the two) • Repeated for each depth level and test/training split
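The metrics themselves are straightforward once each parse is reduced to its multiset of labeled brackets; a PARSEVAL-style sketch, assuming NLTK-style trees:

    from collections import Counter

    def brackets(tree, i=0, out=None):
        """Collect a (label, start, end) span for every constituent."""
        if out is None:
            out = []
        if isinstance(tree, str):
            return i + 1, out                  # a word advances the position
        j = i
        for child in tree:
            j, _ = brackets(child, j, out)
        out.append((tree.label(), i, j))
        return j, out

    def evaluate(gold_tree, test_tree):
        gold = Counter(brackets(gold_tree)[1])
        test = Counter(brackets(test_tree)[1])
        matched = sum((gold & test).values())  # brackets found in both parses
        recall = matched / sum(gold.values())
        precision = matched / sum(test.values())
        f_score = 2 * precision * recall / (precision + recall)
        return precision, recall, f_score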
Other interesting statistics • Training time at depth-3 and depth-5 is similar, even though depth-5 has a much higher fragment count • Testing time at depth-5, however, is ten times higher than at depth-3!
Conclusion • Still to do: obtain parsing results for the other two test/training splits and determine statistical significance • If those results are similar: increasing fragment extraction depth from three to five does not significantly improve accuracy for a DOP* parser over the Penn Chinese Treebank • Any practical benefit of deeper fragments is negated by the increased parsing time
Future Work • Increase size of training corpus • By DOP*'s statistical consistency, accuracy should increase as a larger training corpus is used • Perform experiment with DOP1 model • Accuracy obtained with DOP* was lower than in previous experiments using DOP1 (Hearne & Way 2004) • Qualitative analysis • Which constructions are captured more accurately?
Future Work • Perform experiments with other corpora • Other sections of Chinese Treebank • Other treebanks: Penn Arabic Treebank, … • Increase capacity and stability of the dopdis system • Encountered various failures on larger runs, some crashing after as long as 36 hours • Efficiency could be improved by larger memory support (64-bit architecture) and by storing and indexing fragments in a relational database system