Nancy Ide • Vassar College Catherine Macleod • New York University
Why we need an ANC • Brown Corpus of American English • Too small to provide representative examples • Pre-1960 only • No spoken data • British National Corpus • Not representative of American English • Texts up to 1993 only
British vs. American English • Lexical Items • Bobby vs. cop, underground vs. subway, lorry vs. truck, pavement vs. sidewalk, football vs. soccer… • Grammatical structures • “She could not endure to live with him” vs. “She could not endure living with him.” • “Have you a pen?” vs. “Do you have a pen?” • Modals • “shall” vs. “should” vs. “ought” vs. “will” vs. “would” • Adverbial Usage • “Immediately I get home” vs. “As soon as I get home” • Support Verbs • “take a decision” vs. “make a decision”
ANC Background • June 1998 • ANC proposed at LREC’98 by Charles Fillmore, Nancy Ide, Daniel Jurafsky, Catherine Macleod • May 1998 • Publisher’s Day in Berkeley in conjunction with DSNA • November 1999 • Organizational meeting, New York University
ANC Consortium • Pearson Education • Random House Publishers • Langenscheidt Publishing Group • HarperCollins Publishers • Cambridge University Press • LexiQuest • Microsoft Corporation • Shogakukan, Inc. • Associated Liberal Creators Press • Taishukan Publishers • Oxford University Press • Kenkyusha Publishers • IBM Corporation
Contributors • “Founding” consortium members • $21,000 over 3 years • Texts • Linguistic Data Consortium • Management and distribution of the ANC • Manpower and expertise to create initial version • NYU and Vassar • Expertise and manpower for corpus creation and annotation
ANC Makeup • Core “static” corpus • Texts and transcriptions of spoken data • 1990 onwards • Comparable in balance to BNC • Enables comparative studies • At least 100 million words • Snapshot of American English at the end of the millennium
“Dynamic” component • Not necessarily balanced • Dictated by availability • Includes email, ephemera, rap lyrics, newsgroups, etc. plus historically important works from various time periods • Add 10% every five years • Layered organization • Dynamic component layered chronologically as added
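The "add 10% every five years" bullet leaves open how the increment is measured. As a rough sketch, the Python below projects total corpus size under two readings: each increment is 10% of the original core, or 10% of the corpus as it stands at that point. The function name `projected_size`, its parameters, and the 100-million-word starting size (the stated minimum for the core) are assumptions made for illustration, not part of the ANC plan.

```python
def projected_size(core_words, increments, compound=False):
    """Estimate total words after a number of five-year increments.

    Assumption: each increment adds 10% of either the original core
    (compound=False) or the current total (compound=True).
    """
    total = core_words
    for _ in range(increments):
        base = total if compound else core_words
        total += base // 10  # integer arithmetic keeps word counts exact
    return total


core = 100_000_000  # assumed starting size: the stated minimum for the core
for years in (5, 10, 20):
    steps = years // 5
    print(years, projected_size(core, steps), projected_size(core, steps, compound=True))
```

Under either reading the two estimates only begin to diverge after the second increment, so the distinction matters mainly for long-range storage and distribution planning.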
Eventual components • annotated and aligned speech data • dialects of American and Canadian English • other major languages of North America • Spanish, French Canadian • aligned to parallel translations in English • High costs of production prevent inclusion at this stage
Encoding and annotation • Markup compliant with the XML Corpus Encoding Standard (XCES) • Annotation • part of speech • Sub-paragraph elements • E.g., tokens, names, dates, numbers • Produced in a two-stage process
Stage 1: Base level corpus • Produced after year 1, using limited resources • XML markup compliant with XCES level 0 • Markup produced by automatic transduction from original formats • Automatically tagged for part of speech • Only spot checking for validity • Minimal header • hand-produced • Includes domain information • Useful for concordance generation, collocation analysis
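To illustrate the kind of concordance generation the base-level corpus is meant to support, here is a minimal keyword-in-context (KWIC) sketch in Python. It works on plain text rather than on XCES markup, and the function name `kwic`, its `width` parameter, and the sample sentence are all hypothetical; the ANC project's own search software is not shown here.

```python
import re


def kwic(text, keyword, width=30):
    """Return one keyword-in-context line per whole-word match of keyword."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        # Take up to `width` characters of context on each side of the match.
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        lines.append(f"{left:>{width}} [{match.group(0)}] {right:<{width}}")
    return lines


sample = "She took the subway to work. The subway was late, so she took a cab."
for line in kwic(sample, "subway", width=15):
    print(line)
```

Right-aligning the left context and left-aligning the right context puts every hit in the same column, which is what makes a concordance easy to scan.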
Stage 2: Final corpus • Available after year 3 • XML markup conformant to XCES level 1 • Full header • Markup for major structural divisions, paragraphs, sentence boundaries • Markup for some sub-paragraph elements, where this can be done automatically • E.g., tokens, names, dates, numbers • 10% markup and annotation hand-validated • “gold standard” corpus
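As a rough idea of what automatic markup of sub-paragraph elements can look like, the Python sketch below wraps simple dates and numbers in XML elements using regular expressions. The element names `date` and `num`, the pattern itself, and the function `tag_elements` are assumptions chosen for illustration; the actual XCES element set and the ANC tools may differ, and real date and number recognition needs far broader patterns.

```python
import re
from xml.sax.saxutils import escape

MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)

# Dates are tried first so that "March 31, 2001" is not split into numbers.
PATTERN = re.compile(
    rf"(?P<date>\b(?:{MONTHS}) \d{{1,2}}, \d{{4}}\b)"
    r"|(?P<num>\b\d+(?:[.,]\d+)*\b)"
)


def tag_elements(text):
    """Wrap recognized dates and numbers in (hypothetical) XML elements."""
    out = []
    pos = 0
    for match in PATTERN.finditer(text):
        out.append(escape(text[pos:match.start()]))  # untouched text, XML-escaped
        kind = match.lastgroup  # "date" or "num", whichever alternative matched
        out.append(f"<{kind}>{escape(match.group(0))}</{kind}>")
        pos = match.end()
    out.append(escape(text[pos:]))
    return "".join(out)


print(tag_elements("Memberships closed March 31, 2001 with 13 members & more."))
```

Output produced this way would still need the hand validation described on the slide, since pattern-based tagging both misses and over-matches.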
Data architecture • Follow XCES specifications for “stand-off” markup • Annotations in separate XML documents, linked to original • Easy to modify and/or add to • Enables a distributed development model • Different sites independently add annotation • Suitable for delivery over the WWW
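The stand-off idea can be sketched in a few lines of Python: the primary text is left untouched, and annotations live in a separate XML document that points back into it by character offsets. The element and attribute names (`annotations`, `ann`, `target`, `start`, `end`, `pos`), the file name, and the part-of-speech labels are hypothetical and are not taken from the XCES specification, which defines its own linking mechanism.

```python
import xml.etree.ElementTree as ET

primary_text = "Do you have a pen?"

# (start offset, end offset, part-of-speech label); labels are illustrative.
pos_spans = [
    (0, 2, "VBP"), (3, 6, "PRP"), (7, 11, "VB"),
    (12, 13, "DT"), (14, 17, "NN"), (17, 18, "."),
]


def build_standoff(doc_id, spans):
    """Build a separate annotation document that refers to doc_id by offsets."""
    root = ET.Element("annotations", {"target": doc_id})
    for start, end, pos in spans:
        ET.SubElement(root, "ann", {"start": str(start), "end": str(end), "pos": pos})
    return root


def resolve(root, text):
    """Recover (token, label) pairs by following offsets into the primary text."""
    return [
        (text[int(a.get("start")):int(a.get("end"))], a.get("pos"))
        for a in root.iter("ann")
    ]


standoff = build_standoff("sample.txt", pos_spans)
print(ET.tostring(standoff, encoding="unicode"))
print(resolve(standoff, primary_text))
```

Because the annotation document only refers to the text, another site could add a second layer, for example named entities, as its own document without modifying either the primary text or the part-of-speech layer.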
Software • ANC project will provide search and access software • Encoding via XML and layered architecture enables exploiting the evolving XML environment for search, access, manipulation of ANC data • XML Transformation Language (XSLT) • Resource Description Framework (RDF)
Availability • Freely available to non-profit educational and research organizations from the outset • No restrictions on obtaining the corpus based on geographical location • Consortium members have exclusive access for commercial exploitation for 5 years • Distributed by LDC
Licensing • LDC • obtains licenses from text providers • issues licenses to users • no redistribution without publisher’s permission • “open sub-corpus” portion of the ANC • licensed on the model of open-source software
ANC Status • Founding memberships closed March 31, 2001 • Consortium membership now $40K • Text gathering, format transduction, header production underway • Base corpus due March 31, 2002 • Preparing production of level 1 corpus • Gathering technical input from research community • ANLP/NAACL workshop (Seattle, April 2000) • LREC workshop (Athens, June 2000) • Seeking major funding • Final core corpus due March 31, 2004
Information • ANC: • http://AmericanNationalCorpus.org • Project Director: • Catherine Macleod <macleod@cs.nyu.edu> • Technical Director: • Nancy Ide <ide@cs.vassar.edu> • XCES: • http://www.cs.vassar.edu/XCES