Thoughts on Treebanks
Christopher Manning, Stanford University
Q1: What do you really care about when you're building a parser?
• Completeness of information
  • There's not much point in having a treebank if you end up having to do unsupervised learning anyway
  • You want the annotation to provide human value-add
  • Classic bad example: noun compound structure in the Penn English Treebank (left flat rather than analyzed)
• Consistency of information
  • If things are annotated inconsistently, you lose both in training (if the inconsistency is widespread) and in evaluation
  • Bad example: "long ago" constructions: as long ago as …; not so long ago
• Mutual information
  • Categories should be as mutually informative as possible
Q3: What info (e.g., function tags, empty categories, coindexation) is useful, what is not?
• Information on function is definitely useful
  • We should move to always having typed dependencies
  • Clearest example in the Penn English Treebank: temporal NPs
• Empty categories don't necessarily give much value in the dumbed-down world of Penn English Treebank parsing work
  • Though it should be tried again/more
  • But they're definitely useful if you want to know this stuff!
    • Subcategorization/argument structure determination
    • Natural Language Understanding!!
  • Cf. the work of Johnson, Levy and Manning, etc. on long-distance dependencies
• I'm sceptical that there is a categorical argument/adjunct distinction to be made
  • Leave it to the real numbers
  • This means that subcategorization frames can only be statistical (see the sketch after this list)
  • Cf. Manning (2003)
  • I've got some more slides on this from another talk if you want…
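To make the "statistical frames" point concrete, here is a minimal sketch (my own heuristic, not from the talk), assuming Penn-style bracketed trees and NLTK's Tree class: for each verb, tally the relative frequency of the VP expansions it heads, rather than assigning the verb one categorical frame.

```python
from collections import Counter, defaultdict
from nltk import Tree

def vp_frames(trees):
    """For each verb, tally the VP expansions it heads: a statistical frame."""
    frames = defaultdict(Counter)
    for tree in trees:
        for vp in tree.subtrees(lambda t: t.label() == "VP"):
            frame = tuple(c.label() for c in vp if isinstance(c, Tree))
            # Heuristic: take the first verbal tag in the VP as the head verb.
            for c in vp:
                if isinstance(c, Tree) and c.label().startswith("VB"):
                    frames[c.leaves()[0].lower()][frame] += 1
                    break
    return frames

trees = [
    Tree.fromstring("(S (NP (NNS students)) (VP (VBD gave) (NP (NN blood))))"),
    Tree.fromstring("(S (NP (NNS students)) (VP (VBD gave) (NP (NN blood)) (PP (IN at) (NP (NN noon)))))"),
]
for verb, dist in vp_frames(trees).items():
    total = sum(dist.values())
    for frame, n in dist.items():
        print(verb, frame, n / total)
# gave ('VBD', 'NP') 0.5
# gave ('VBD', 'NP', 'PP') 0.5
```

On real data the counts become a genuine probability distribution over frames per verb, which is exactly what a categorical argument/adjunct split cannot express.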
Q3: What info (e.g., function tags, empty categories, coindexation) is useful, what is not?
• Do you prefer a more refined tagset for parsing?
  • Yes. I mightn't always use it, but I often do
• The transform-detransform framework:
  • RawInput → TransformedInput → Parser → TransformedOutput → DesiredOutput
  • I think everyone does this to some extent
  • Some, like Johnson and Klein & Manning, have exploited it very explicitly: NN-TMP, IN^T, NP-Poss, VP-VBG, NP-v, …
  • Everyone else should think about it more
• It's easy to throw away overly precise information, or to move information around deterministically (tag to phrase or vice versa), if it's represented completely and consistently! (A sketch of the idea follows.)
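A minimal sketch of the transform-detransform loop, assuming NLTK's Tree. The transform here is parent annotation (a category split in the spirit of the Johnson and Klein & Manning NP^S-style labels); any parser can sit in the middle, and detransforming restores the original treebank categories.

```python
from nltk import Tree

def transform(tree, parent="ROOT"):
    """Annotate each nonterminal with its parent's label, e.g. NP -> NP^S."""
    if isinstance(tree, str):
        return tree
    label = tree.label()
    return Tree(f"{label}^{parent}", [transform(child, label) for child in tree])

def detransform(tree):
    """Strip the ^parent annotation to recover the original categories."""
    if isinstance(tree, str):
        return tree
    return Tree(tree.label().split("^")[0], [detransform(child) for child in tree])

t = Tree.fromstring("(S (NP (DT the) (NN dog)) (VP (VBD barked)))")
print(transform(t))
# (S^ROOT (NP^S (DT^NP the) (NN^NP dog)) (VP^S (VBD^VP barked)))
assert str(detransform(transform(t))) == str(t)   # round-trips losslessly
```

Because the transform is deterministic and invertible, it can be applied to training trees and undone on parser output without ever touching the treebank itself: this only works if the underlying annotation is complete and consistent.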
Q4: How does grammar writing interact with treebanking?
• In practice, they often haven't interacted much
• I'm a great believer that they should
  • Having a grammar is a huge guide to how things should be parsed, and a way to check parsing consistency (see the sketch below)
  • It also allows opportunities for updating analyses, etc.
  • Cf. the Redwoods Treebank, and subsequent efforts
• The inability to automatically update treebanks is a growing problem
  • Current English treebanking isn't having much impact because of annotation differences with the original PTB
• Feedback from users has only rarely been harvested
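One concrete way a grammar can check treebanking consistency, sketched with an illustrative toy CFG (not a real treebank grammar), assuming NLTK: flag any production used in an annotated tree that the grammar does not license.

```python
from nltk import CFG, Tree

# Toy grammar standing in for a hand-written grammar of the language.
grammar = CFG.fromstring("""
S -> NP VP
NP -> DT NN
VP -> VBD | VBD NP
DT -> 'the'
NN -> 'dog' | 'cat'
VBD -> 'barked' | 'saw'
""")
licensed = set(grammar.productions())

tree = Tree.fromstring(
    "(S (NP (DT the) (NN dog)) (VP (VBD saw) (NP (DT the) (NN cat) (NN cat))))")
for prod in tree.productions():
    if prod not in licensed:
        print("unlicensed:", prod)   # flags NP -> DT NN NN for annotator review
```

Unlicensed productions are either annotation errors to fix or gaps in the grammar to extend, which is precisely the feedback loop between grammar writing and treebanking.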
Q5: What methodological lessons can be drawn for treebanking?
• Good guidelines (loosely, a grammar!)
• Good, trained people
• Annotator buy-in
• Ann Bies said all this … I strongly agree!
• I think there has been a real underexploitation of technology for treebank validation
  • Doing vertical searches/checks almost always turns up inconsistencies
  • Either these checks or a grammar should give vertical review (a sketch follows)
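A minimal sketch of such a vertical check, in the spirit of variation n-gram error detection (cf. Dickinson & Meurers), assuming NLTK trees: group every occurrence of the same word string across the treebank and flag strings that received more than one constituent label.

```python
from collections import defaultdict
from nltk import Tree

def variation_report(trees):
    """Map each word sequence to the set of constituent labels it received."""
    labels = defaultdict(set)
    for tree in trees:
        for sub in tree.subtrees():
            labels[tuple(sub.leaves())].add(sub.label())
    return {words: ls for words, ls in labels.items() if len(ls) > 1}

# The "long ago" example from Q1: the same string bracketed two ways.
trees = [Tree.fromstring("(S (NP (RB not) (RB so) (RB long) (RB ago)))"),
         Tree.fromstring("(S (ADVP (RB not) (RB so) (RB long) (RB ago)))")]
for words, ls in variation_report(trees).items():
    print(" ".join(words), "->", sorted(ls))
# not so long ago -> ['ADVP', 'NP', 'S']
```

('S' appears only because these toy sentences consist of exactly this string; a real check would also condition on surrounding context before flagging.)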
Q6: What are the advantages and disadvantages of pre-processing the data to be treebanked with an automatic parser?
• The economics are clear: you reduce annotation costs
• The costs are also clear:
  • The parser places a large bias on the trees produced
  • Humans are lazy/reluctant to correct mistakes
  • A clear example: I think it is fair to say that many POS errors in the Penn English Treebank can be traced to the POS tagger
    • E.g., sentence-initial capitalized Separately, Frankly, Currently, Hopefully analyzed as NNP
    • That doesn't look like a human being's mistakes to me
• The answer: more use of technology to validate and check humans (see the sketch below)
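A minimal sketch of the kind of automated check meant here (the heuristic is mine, not from the talk), assuming NLTK trees: flag a sentence-initial capitalized token tagged NNP when the same word, lowercased, is tagged RB elsewhere in the treebank.

```python
from collections import defaultdict
from nltk import Tree

def suspicious_nnps(trees):
    """Find sentence-initial NNP tokens that are tagged RB elsewhere."""
    tags = defaultdict(set)
    for tree in trees:
        for word, tag in tree.pos():
            tags[word.lower()].add(tag)
    hits = []
    for tree in trees:
        word, tag = tree.pos()[0]       # first token of the sentence
        if tag == "NNP" and word[0].isupper() and "RB" in tags[word.lower()]:
            hits.append(word)
    return hits

trees = [Tree.fromstring("(S (NNP Frankly) (, ,) (NP (PRP I)) (VP (VBP agree)))"),
         Tree.fromstring("(S (NP (PRP He)) (VP (VBD spoke) (ADVP (RB frankly))))")]
print(suspicious_nnps(trees))   # ['Frankly']
```

Checks like this cost almost nothing to run and directly target the systematic, tagger-shaped error pattern that human correctors tend to wave through.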
Q7: What are the advantages of a phrase-structure and/or a dependency treebank for parsing?
• The current split in the literature between "phrase-structure" and "dependency" parsing is largely bogus (in my opinion)
  • The Collins/Bikel parser operates largely in the manner of a dependency parser
  • The Stanford parser contains a strict (untyped) dependency parser
• Phrase-structure parsers have the advantage of phrase-structure labels
  • A dependency parser is just a phrase-structure parser where you cannot refer to phrasal types or condition on phrasal span
  • This extra information is useful; it's silly not to use it
• Labeling phrasal heads (= dependencies) is useful; silly not to do it
  • Automatic "head rules" should have had their day by now!! (a sketch follows)
• Scoring based on dependencies is much better than Parseval!!!
• Labeling dependency types is useful
  • Especially in languages with freer word order
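A minimal sketch of head-rule dependency extraction, assuming NLTK's Tree; the two-entry head table is illustrative, not a real Magerman/Collins head table, and heads are compared by word string rather than token position to keep the sketch short.

```python
from nltk import Tree

# Illustrative head table: which child label to percolate the head from.
HEAD_RULES = {"S": ["VP", "NP"], "NP": ["NN", "NNS", "NP"], "VP": ["VBD", "VBZ", "VP"]}

def head_word(tree):
    """Percolate the lexical head up the tree using HEAD_RULES."""
    if isinstance(tree, str):
        return tree
    if all(isinstance(c, str) for c in tree):
        return tree[0]                      # preterminal: its word is the head
    for label in HEAD_RULES.get(tree.label(), []):
        for child in tree:
            if isinstance(child, Tree) and child.label() == label:
                return head_word(child)
    return head_word(tree[0])               # fallback: leftmost child

def dependencies(tree, deps=None):
    """Each non-head child's head word depends on the phrase's head word."""
    if deps is None:
        deps = []
    if isinstance(tree, str) or all(isinstance(c, str) for c in tree):
        return deps
    h = head_word(tree)
    for child in tree:
        ch = head_word(child)
        if ch != h:
            deps.append((ch, h))            # (dependent, head)
        dependencies(child, deps)
    return deps

t = Tree.fromstring(
    "(S (NP (DT the) (NN dog)) (VP (VBD chased) (NP (DT a) (NN cat))))")
print(dependencies(t))
# [('dog', 'chased'), ('the', 'dog'), ('cat', 'chased'), ('a', 'cat')]
```

The fact that the dependencies fall out of a hand-written approximation like HEAD_RULES is the point of the complaint above: heads should be annotated in the treebank, not reconstructed by rule tables.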