Nancy Ide • Vassar College Catherine Macleod • New York University

Why we need an ANC • Brown Corpus of American English • Too small to provide representative examples • Pre-1960 only • No spoken data • British National Corpus • Not representative of American English • Texts up to 1993 only

British vs. American English • Lexical Items • Bobby vs. cop, underground vs. subway, lorry vs. truck, pavement vs. sidewalk, football vs. soccer… • Grammatical structures • “She could not endure to live with him” vs. “She could not endure living with him.” • “Have you a pen?” vs. “Do you have a pen?” • Modals • “shall” vs. “should” vs. “ought” vs. “will” vs. “would” vs. “should” • Adverbial Usage • “Immediately I get home” vs. “As soon as I get home” • Support Verbs • “take a decision” vs. “make a decision”

ANC Background • June 1998 • ANC proposed at LREC’98 by Charles Fillmore, Nancy Ide, Daniel Jurafsky, Catherine Macleod • May 1998 • Publisher’s Day in Berkeley in conjunction with DSNA • November 1999 • Organizational meeting, New York University

ANC Consortium • Pearson Education • Random House Publishers • Langenscheidt Publishing Group • Harper Collins Publishers • Cambridge University Press • LexiQuest • Microsoft Corporation • Shogakukan, Inc. • Associated Liberal Creators Press • Taishukan Publishers • Oxford University Press • Kenkyusha Publishers • IBM Corporation

Contributors • “Founding” consortium members • $21,000 over 3 years • Texts • Linguistic Data Consortium • Management and distribution of the ANC • Manpower and expertise to create initial version • NYU and Vassar • Expertise and manpower for corpus creation and annotation

ANC Makeup • Core “static” corpus • Texts and transcriptions of spoken data • 1990 onwards • Comparable in balance to BNC • Enables comparative studies • At least 100 million words • Snapshot of American English at the end of the millennium

“Dynamic” component • Not necessarily balanced • Dictated by availability • Includes email, ephemera, rap lyrics, newsgroups, etc. plus historically important works from various time periods • Add 10% every five years • Layered organization • Dynamic component layered chronologically as added
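The slide does not say whether "10% every five years" means 10% of the core corpus each time or 10% of the running total. A minimal sketch of both readings, assuming a core of exactly 100 million words (the slides say "at least"); the function name and the `compound` flag are illustrative only:

```python
CORE_WORDS = 100_000_000  # assumed: the slides give "at least 100 million words"

def size_after(increments, compound=False):
    """Corpus size in words after a number of five-year increments.

    compound=False: each increment adds 10% of the core corpus.
    compound=True:  each increment adds 10% of the current total.
    """
    total = CORE_WORDS
    for _ in range(increments):
        base = total if compound else CORE_WORDS
        total += base // 10  # integer arithmetic avoids float rounding
    return total

ten_years_fixed = size_after(2)
ten_years_compound = size_after(2, compound=True)
```

Under these assumptions the two readings differ by one million words after ten years (120 million vs. 121 million).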

Eventual components • annotated and aligned speech data • dialects of American and Canadian English • other major languages of North America • Spanish, French Canadian • aligned to parallel translations in English. High costs of production prevent inclusion at this stage

Encoding and annotation • Markup compliant with the XML Corpus Encoding Standard (XCES) • Annotation • part of speech • Sub-paragraph elements • E.g., tokens, names, dates, numbers • Produced in a two-stage process

Stage 1: Base level corpus • Produced after year 1, using limited resources • XML markup compliant with XCES level 0 • Markup produced by automatic transduction from original formats • Automatically tagged for part of speech • Only spot checking for validity • Minimal header • hand-produced • Includes domain information • Useful for concordance generation, collocation analysis
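As an illustration of the concordance generation mentioned above, here is a minimal keyword-in-context (KWIC) sketch over plain text. It is not ANC software: the `kwic` function, its `width` parameter, and the sample sentences are assumptions made for this example, and tokenization is a simple regular expression rather than the corpus's own token markup.

```python
import re

def kwic(text, keyword, width=3):
    """Return (left context, keyword, right context) for each hit."""
    tokens = re.findall(r"\w+", text.lower())
    target = keyword.lower()
    hits = []
    for i, tok in enumerate(tokens):
        if tok == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits

# Invented sample text, echoing the "make a decision" support-verb example.
SAMPLE = "We must make a decision today. They will make a decision tomorrow."
lines = kwic(SAMPLE, "decision", width=2)
for left, word, right in lines:
    print(f"{left:>12} [{word}] {right}")
```

Counting the words in the left-context column across a large corpus is one simple route from concordance lines to collocation statistics, such as how often "make" versus "take" precedes "decision".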

Stage 2: Final corpus • Available after year 3 • XML markup conformant to XCES level 1 • Full header • Markup for major structural divisions, paragraphs, sentence boundaries • Markup for some sub-paragraph elements, where this can be done automatically • E.g., tokens, names, dates, numbers • 10% markup and annotation hand-validated • “gold standard” corpus

Data architecture • Follow XCES specifications for “stand-off” markup • Annotations in separate XML documents, linked to original • Easy to modify and/or add to • Enables a distributed development model • Different sites independently add annotation • Suitable for delivery over the WWW
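The stand-off idea can be sketched as follows: the primary text is left untouched, and a separate XML document carries the annotations and points back into it. This is a simplified illustration, not the XCES format: the element and attribute names (`annotations`, `tok`, `start`, `end`, `pos`) and the use of character offsets for linking are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Primary text: never modified by annotators.
PRIMARY_TEXT = "She could not endure living with him."

# Hypothetical stand-off document; names are illustrative, not the XCES schema.
# Each <tok> points into the primary text by character offsets.
STANDOFF_XML = """
<annotations target="primary.txt" layer="pos">
  <tok start="0" end="3" pos="PRP"/>
  <tok start="4" end="9" pos="MD"/>
  <tok start="10" end="13" pos="RB"/>
  <tok start="14" end="20" pos="VB"/>
  <tok start="21" end="27" pos="VBG"/>
</annotations>
"""

def resolve(primary, standoff_xml):
    """Pair each stand-off annotation with the text span it points to."""
    root = ET.fromstring(standoff_xml.strip())
    pairs = []
    for tok in root.iter("tok"):
        start, end = int(tok.get("start")), int(tok.get("end"))
        pairs.append((primary[start:end], tok.get("pos")))
    return pairs

tagged = resolve(PRIMARY_TEXT, STANDOFF_XML)
```

Because the annotation layer lives in its own document, another site could add a second layer (for example, named entities) pointing at the same offsets without touching either the primary text or the part-of-speech layer.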

Software • ANC project will provide search and access software • Encoding via XML and layered architecture enables exploiting the evolving XML environment for search, access, manipulation of ANC data • XML Transformation Language (XSLT) • Resource Description Framework (RDF)

Availability • Freely available to non-profit educational and research organizations from the outset • No restrictions on obtaining the corpus based on geographical location • Consortium members have exclusive access for commercial exploitation for 5 years • Distributed by LDC

Licensing • LDC • obtains licenses from text providers • issues licenses to users • no redistribution without publisher’s permission • “open sub-corpus” portion of the ANC • licensed on the model of open-source software

ANC Status • Founding memberships closed March 31, 2001 • Consortium membership now $40K • Text gathering, format transduction, header production underway • Base corpus due March 31, 2002 • Preparing production of level 1 corpus • Gathering technical input from research community • ANLP/NAACL workshop (Seattle, April 2000) • LREC workshop (Athens, June 2000) • Seeking major funding • Final core corpus due March 31, 2004

Information • ANC: • http://AmericanNationalCorpus.org • Project Director: • Catherine Macleod <macleod@cs.nyu.edu> • Technical Director: • Nancy Ide <ide@cs.vassar.edu> • XCES: • http://www.cs.vassar.edu/XCES
