MASC: The Manually Annotated Sub-Corpus of American English Nancy Ide, Collin Baker, Christiane Fellbaum, Charles Fillmore, Rebecca Passonneau
MASC • Manually Annotated Sub-Corpus • NSF-funded project to provide a sharable, reusable annotated resource with rich linguistic annotations • Vassar, ICSI, Columbia, Princeton • texts from diverse genres • manual annotations or manually validated annotations for multiple levels • WordNet senses • FrameNet frames and frame elements • shallow parses • named entities • Enables linking WordNet senses and FrameNet frames into more complex semantic structures • Enriches semantic and pragmatic information • detailed inter-annotator agreement measures
Contents • Texts drawn from the Open ANC • Several genres • Written (travel guides, blogs, fiction, letters, newspaper, non-fiction, technical, journal, government documents) • Spoken (face-to-face, academic, telephone) • Free of license restrictions, redistributable • Download from ANC website • All MASC data and annotations will be freely downloadable
Annotation Process • Smaller portions of the sub-corpus manually annotated for specific phenomena • Maintain representativeness • Include as many annotations of different types as possible • Apply (semi-)automatic annotation techniques to determine the reliability of their results • Study inter-annotator agreement on manually produced annotations • Determine a benchmark of accuracy • Fine-tune annotator guidelines • Consider whether accurate annotations for one phenomenon can improve performance of automatic annotation systems for another • e.g., validated WN sense tags and noun chunks may improve automatic semantic role labeling
Process (continued) • Apply an iterative process to maximize performance of automatic taggers: • Manual annotation • Retrain automatic annotation software • Improved annotation software can later be applied to the entire ANC • Provide more accurate automatically produced annotation of the full corpus
[Chart: Composition Relative to Whole OANC — labeled segments: training examples; FrameNet and WordNet full annotation; genre-representative core with validated entity and shallow parse annotations; WSJ with PropBank, NomBank, PTB, TimeBank, and PDTB annotations; WordNet annotations]
MASC Core • Includes • 25K words fully annotated (“all words”) for FrameNet frames and WordNet senses • ~40K-word portion annotated by the Unified Linguistic Annotation project • PropBank, NomBank, Penn Treebank, Penn Discourse Treebank, TimeBank • Small subset of WSJ with many annotations • Other annotations rendered into GrAF for compatibility
Representation • ISO TC37 SC4 Linguistic Annotation Framework • Graph of feature structures (GrAF) • isomorphic to other feature structure-based representations (e.g. UIMA CAS) • Each annotation in a separate stand-off document linked to primary data or other annotations • Merge annotations with ANC API • Output in any of several formats • XML • non-XML for use with systems such as NLTK and concordancing tools • UIMA CAS • Input to GraphViz • …
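The stand-off idea behind GrAF can be illustrated with a minimal sketch. This is illustrative only: the real GrAF format is an XML graph of nodes, edges, and feature structures, not the Python lists and dicts used here, and the sample text and spans are invented.

```python
# Stand-off annotation sketch: the primary text is never modified;
# each annotation layer refers to it only by character offsets.
primary = "Nancy Ide presented MASC."

# Two hypothetical stand-off layers over the same primary data.
tokens = [(0, 5), (6, 9), (10, 19), (20, 24), (24, 25)]
entities = [{"start": 0, "end": 9, "type": "person"},
            {"start": 20, "end": 24, "type": "corpus"}]

def merge_layers(text, spans):
    """Resolve each stand-off span back to the text it annotates."""
    return [(text[s["start"]:s["end"]], s["type"]) for s in spans]

print(merge_layers(primary, entities))
# → [('Nancy Ide', 'person'), ('MASC', 'corpus')]
```

Because every layer points into the same immutable primary data, new annotation layers can be added, merged, or distributed independently, which is the property the ANC API exploits when converting to formats such as UIMA CAS or NLTK input.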
WordNet annotation • Updating WSD systems to use WordNet version 3.0 • Pedersen’s SenseRelate • Mihalcea et al.’s SenseLearner • Apply to automatically assign WN sense tags to all content words (nouns, verbs, adjectives, and adverbs) in the entire OANC • Manually validate a set of words from the whole OANC • Manually validate all words in the 25K FN-annotated subset
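Gloss-overlap is one simple family of WSD heuristics, related in spirit to the relatedness measures used by systems such as SenseRelate. The toy sketch below uses a hypothetical two-sense inventory for "bank" (not actual WordNet 3.0 glosses) to show the basic idea:

```python
def simplified_lesk(context_words, sense_inventory):
    """Pick the sense whose gloss shares the most words with the context.

    A toy version of gloss-overlap WSD; real systems use WordNet
    relatedness measures over the full sense inventory.
    """
    context = set(context_words)
    def overlap(sense):
        return len(context & set(sense_inventory[sense].split()))
    return max(sense_inventory, key=overlap)

# Hypothetical two-sense inventory for "bank"
senses = {
    "bank.n.01": "financial institution that accepts deposits and lends money",
    "bank.n.02": "sloping land beside a body of water such as a river",
}
print(simplified_lesk(["the", "river", "water", "flowed"], senses))
# → bank.n.02
```

Manual validation then amounts to checking the automatically chosen sense against an annotator's choice, token by token, which is where the inter-annotator agreement measures discussed later come in.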
FrameNet Annotation • Full manual annotation of 25K words in the FrameNet full-text manner • Application of automatic semantic role labeling software over the entire MASC • Improve automatic semantic role labeling (ASRL) • Use active learning • ASRL system results evaluated to determine where the most errors occur • Extra manual annotation done to improve performance • Draw from the entire OANC, possibly even other sources, for examples
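Active learning of this kind is often implemented as uncertainty sampling: send the examples the automatic labeler is least sure about to the human annotators. A minimal sketch, assuming the ASRL system can report a per-sentence confidence score (the sentence IDs and scores below are made up):

```python
def select_for_annotation(sentences, confidence, batch_size=2):
    """Uncertainty sampling: return the sentences the automatic role
    labeler is least confident about, for extra manual annotation."""
    ranked = sorted(sentences, key=confidence)
    return ranked[:batch_size]

# Hypothetical confidence scores from an ASRL system
scores = {"s1": 0.95, "s2": 0.40, "s3": 0.75, "s4": 0.30}
print(select_for_annotation(list(scores), scores.get))
# → ['s4', 's2']
```

After the selected sentences are manually annotated, the labeler is retrained and the loop repeats, which matches the iterative manual-annotate/retrain cycle described on the process slide.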
Alignment of Lexical Resources • Concurrent project investigating how and to what extent WordNet and FrameNet can be aligned • MASC annotations of 25K for FrameNet frames and frame elements and WordNet senses provide a ready-made testing ground
Interannotator agreement • Use a suite of metrics that measure different characteristics • Interannotator agreement coefficients such as Cohen’s Kappa • Average F-measure to determine proportion of the annotated data all annotators agree on
IAA • Determine the impact of these two measures • Consider the relation between the agreement coefficient values / F-measure and potential users of the planned annotations • Simultaneous investigations of inter-annotator agreement and measurable results of using different annotations of the same data provide a stronger picture of the integrity of annotated data (Passonneau et al. 2005; Passonneau et al. 2006)
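Cohen's kappa corrects raw observed agreement for the agreement two annotators would reach by chance given their individual label distributions. A minimal two-annotator implementation (the label sequences below are invented for illustration):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators over the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items labeled identically.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

a = ["N", "N", "V", "N", "V", "N"]
b = ["N", "V", "V", "N", "V", "N"]
print(round(cohens_kappa(a, b), 3))
# → 0.667
```

Here the annotators agree on 5 of 6 items (observed = 0.833) but chance alone predicts 0.5, giving kappa = 0.667; kappa of 1 is perfect agreement and 0 is chance-level.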
Overall Goal • Continually augment MASC with contributed annotations from the research community • Discourse structure, additional entities, events, opinions, etc. • Distribution of effort and integration of currently independent resources such as the ANC, WordNet, and FrameNet will enable progress in resource development • Less cost • No duplication of effort • Greater degree of accuracy and usability • Harmonization
Conclusion • MASC will provide a much-needed resource for computational linguistics research aimed at the development of robust language processing systems • MASC’s availability should have a major impact on the speed with which similar resources can be reliably annotated • MASC will be the largest semantically annotated corpus of English in existence • WN and FN annotation of the MASC will immediately create a massive multi-lingual resource network • Both WN and FN linked to corresponding resources in other languages • No existing resource approaches this scope