Portability, Parallelism and Efficiency in Parsing
Dan Bikel
University of Pennsylvania
March 11th, 2002
Parsing: Where are we now?
• Pounding away at Penn Treebank, §23
  • Collins (1999): LR 88.0, LP 88.3
  • Charniak (2000): LR 89.6, LP 89.5
  • Collins (2000): LR 89.6, LP 89.9
  • Henderson & Brill (1999) on §22: LR 90.1, LP 92.4
• Room to grow: new domains, better performance
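LR and LP in the figures above are labeled recall and labeled precision over constituents. The following is a minimal sketch of how the two scores relate to bracket matches. It is not the standard scoring tool, which applies further normalizations; the `(label, start, end)` tuple representation and the function name are assumptions made for illustration.

```python
from collections import Counter

def labeled_precision_recall(gold, predicted):
    """Score one sentence; gold and predicted are lists of (label, start, end)."""
    gold_counts = Counter(gold)
    pred_counts = Counter(predicted)
    # Multiset intersection: a bracket matches only if label and span both agree.
    matched = sum((gold_counts & pred_counts).values())
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(gold) if gold else 0.0
    return precision, recall

# Toy example: the parser mislabels one of four constituents.
gold = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("NP", 3, 5)]
pred = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("PP", 3, 5)]
lp, lr = labeled_precision_recall(gold, pred)
```

Corpus-level scores are normally computed by summing matched, gold, and predicted bracket counts over all sentences before dividing, rather than averaging per-sentence scores.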
The Right Architecture for Parallel Parsing
[Diagram; components labeled: Language package, DecoderServer 1…N, ModelCollection, CKY Client 1…N, Switchboard, Object server]
Architecture for Parallel Parsing II
• Highly parallel, multi-threaded
• New cluster about to come on-line; poised to take advantage
• Fully fault-tolerant
• Significant flexibility: layers of abstraction
• Optimized for speed
• Highly portable for new domains, including new languages
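The coordination pattern suggested by the architecture slides (a switchboard handing sentences to several parsing clients, with failed work handed out again) can be sketched as follows. This is an illustrative sketch only, not the engine's actual design: the class and method names, the in-process threads, and the simulated decoder failure are all assumptions, and a real system would distribute the work across machines and bound the number of retries.

```python
import queue
import threading

class Switchboard:
    """Hypothetical coordinator: hands out sentences, collects parses."""
    def __init__(self, sentences):
        self._todo = queue.Queue()
        for index, sentence in enumerate(sentences):
            self._todo.put((index, sentence))
        self._results = {}
        self._lock = threading.Lock()

    def next_sentence(self):
        try:
            return self._todo.get_nowait()
        except queue.Empty:
            return None  # nothing left to hand out

    def report(self, index, parse):
        with self._lock:
            self._results[index] = parse

    def requeue(self, index, sentence):
        # Fault tolerance in miniature: failed work goes back on the queue.
        self._todo.put((index, sentence))

    def results(self):
        with self._lock:
            return [self._results[i] for i in sorted(self._results)]

def client(switchboard, decode):
    """Hypothetical parsing client: pull a sentence, decode it, report back."""
    while True:
        item = switchboard.next_sentence()
        if item is None:
            return
        index, sentence = item
        try:
            switchboard.report(index, decode(sentence))
        except Exception:
            switchboard.requeue(index, sentence)

_failed_once = set()
_failed_lock = threading.Lock()

def flaky_decode(sentence):
    """Stand-in for a decoder call; fails once on one sentence to exercise requeueing."""
    with _failed_lock:
        if sentence == "b c" and sentence not in _failed_once:
            _failed_once.add(sentence)
            raise RuntimeError("simulated decoder failure")
    return "(S " + sentence + ")"

def parse_all(sentences, decode, n_clients=3):
    switchboard = Switchboard(sentences)
    threads = [threading.Thread(target=client, args=(switchboard, decode))
               for _ in range(n_clients)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return switchboard.results()

parses = parse_all(["a", "b c", "d e f"], flaky_decode)
```

Because a failing client requeues its sentence and keeps looping, every sentence is eventually parsed even when a decode call fails, and results come back in input order regardless of which client handled them.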
Layer of Abstraction: Probability Structure
[Diagram; node labels: P(t_h, w_h), H(t_h, w_h), L, M_i(t_i, w_i), M_{i-1}(t_{i-1}, w_{i-1}); models shown: Collins, BBN]
Plug-’n’-play Probability Models
• New engine capable of implementing a wide variety of models, including Collins, BBN
• Have meticulously replicated Collins’ model and performance
• Cleaned up probabilistic “oddities”
• Code is thoroughly documented
• Will release to public
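One way to picture a plug-in probability model is to separate a "structure" object, which says what is predicted and what it is conditioned on at each back-off level, from a generic estimator that only counts and interpolates. The sketch below is a hypothetical illustration of that separation, not the engine's API: the class names, the event dictionary layout, and the fixed interpolation weights are all assumptions (real models typically use count-dependent smoothing weights).

```python
from collections import Counter

class ProbabilityStructure:
    """Hypothetical plug-in: maps an event to a future and back-off histories."""
    def future(self, event):
        raise NotImplementedError

    def histories(self, event):
        """Return conditioning contexts, most specific first."""
        raise NotImplementedError

class LexicalizedModifier(ProbabilityStructure):
    # Illustrative structure: predict a modifier label from the parent label,
    # head label and head word, backing off by dropping the head word.
    def future(self, event):
        return event["modifier"]

    def histories(self, event):
        return [
            (event["parent"], event["head"], event["head_word"]),
            (event["parent"], event["head"]),
        ]

class Model:
    """Generic estimator; it knows nothing about the layout of an event."""
    def __init__(self, structure, weights=(0.7, 0.3)):
        self.structure = structure
        self.weights = weights  # fixed weights: a simplifying assumption
        self.joint = Counter()
        self.context = Counter()

    def observe(self, event):
        f = self.structure.future(event)
        for level, h in enumerate(self.structure.histories(event)):
            self.joint[(level, h, f)] += 1
            self.context[(level, h)] += 1

    def prob(self, event):
        # Linear interpolation of relative frequencies across back-off levels.
        # Unseen contexts contribute nothing; the sketch does not renormalize.
        f = self.structure.future(event)
        total = 0.0
        for level, h in enumerate(self.structure.histories(event)):
            denom = self.context[(level, h)]
            if denom:
                total += self.weights[level] * self.joint[(level, h, f)] / denom
        return total

model = Model(LexicalizedModifier())
for head_word, modifier in [("ate", "NP"), ("ate", "PP"), ("slept", "PP")]:
    model.observe({"parent": "VP", "head": "VBD",
                   "head_word": head_word, "modifier": modifier})

p = model.prob({"parent": "VP", "head": "VBD",
                "head_word": "ate", "modifier": "NP"})
```

Swapping in a different model then amounts to writing another `ProbabilityStructure` subclass; the counting and interpolation code stays untouched.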
Fast Portability to New Data Sets
• Parsers operate over augmented tree space, T+
• Generative models define joint probability P(S,T,T+)
• Chiang & Bikel (2002, in submission) provide
  • New, portable syntax for augmenting tree nodes
  • Method for reestimating parser models in the augmented space such that P(S,T) is maximized
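The link between the joint probability over augmented trees and the quantity being maximized can be written out explicitly. The block below is a sketch in notation that is not from the slides: $\theta$ for the model parameters and $\mathcal{D}$ for the treebank of (sentence, tree) pairs are assumed symbols, and the sum is taken over the augmented trees $T^+$ compatible with the observed tree $T$.

```latex
\begin{align}
P_\theta(S, T) &= \sum_{T^+} P_\theta(S, T, T^+) \\
\hat{\theta}   &= \arg\max_{\theta} \prod_{(S,T) \in \mathcal{D}} \sum_{T^+} P_\theta(S, T, T^+)
\end{align}
```

The first line is ordinary marginalization over the unobserved augmentation; the second states the reestimation objective as maximizing the probability of the observed sentences and trees.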
Rapid Portability to New Languages with High Accuracy
• Bikel & Chiang (2000) described porting two parsing models developed for English to Chinese
  • BBN: LR 69.0, LP 74.8 (≤ 40 words)
  • Chiang: LR 76.8, LP 77.8 (≤ 40 words)
• New engine designed from ground up for multi-lingual processing: language package
• Original design goal for new parsing engine: develop new language packages in 1–2 weeks
• Developed Chinese language package for new engine in one and a half days
• Compared to other known Chinese parsers on the CTB, recall is equivalent and precision is significantly superior
  • LR 77.0, LP 81.6 (≤ 40 words)
What’s in store…
• Incorporating richer lexical information into parsing/language processing, specifically…
  • Incorporating word sense information into a parsing model, building on both
    • previous work extending BBN parsing model to include word sense
    • recent work with David Chiang, viewing word sense as yet another component of “hidden” data in a Treebank