The 360 million word BYU Corpus of American English (1990-2007)

The 360 million wordBYU Corpus of American English (1990-2007) Mark Davies Brigham Young Universitywww.americancorpus.org

Why have a new corpus of American English? • Already have the web (e.g. via Google), but: • No POS tagging, lemmatization, etc • Very difficult to determine genre and date • American National Corpus • Only 22 million words; hasn’t been updated in last two years • Not very balanced in terms of genres and sources (1/2 million words fiction, 2 magazines, 1 newspaper, 2 academic journals) • Need something comparable to the British National Corpus (BNC)

BYU Corpus of American English(www.americancorpus.org) • 360+ million words • From nearly 150,000 texts • 20 million words each year from 1990-2007 • Will be updated twice a year; unique linguistic history • Tagged by CLAWS (same tagger as for the BNC) • Uses the same architecture and interface as BYU-BNC, TIME Corpus, Corpus del Español, Corpus do Português, etc. (See corpus.byu.edu) • For each year (and therefore overall, as well), evenly divided between Spoken, Fiction, Popular Magazines, Newspapers, and Academic Journals

Composition of the corpus • Spoken: (76+ million words) Transcripts of unscripted conversation from more than 150 different TV and radio programs (examples: All Things Considered (NPR), Newshour (PBS), Good Morning America (ABC), Today Show (NBC), 60 Minutes (CBS), Hannity and Colmes (Fox), Jerry Springer, etc). • Fiction: (70 million words) Short stories and plays from literary magazines, children’s magazines, popular magazines, first chapters of first edition books 1990-present, and movie scripts. • Popular Magazines: (78+ million words) Nearly 100 different magazines, with a good mix (overall, and by year) between specific domains (news, health, home and gardening, women, financial, religion, sports, etc). A few examples are Time, Men’s Health, Good Housekeeping, Cosmopolitan, Fortune, Christian Century, Sports Illustrated, etc. • Newspapers: (73+ million words) Ten newspapers from across the US, including: USA Today, New York Times, Atlanta Journal Constitution, San Francisco Chronicle, etc. In most cases, there is a good mix between different sections of the newspaper, such as local news, opinion, sports, financial, etc. • Academic Journals: (73+ million words) Nearly 100 different peer-reviewed journals. These were selected to cover the entire range of the Library of Congress classification system (e.g. a certain percentage from B (philosophy, psychology, religion), D (world history), K (education), T (technology), etc.), both overall and by number of words per year

Interface: same as for (our version of the) BNC, and the TIME Corpus (100m words)

Charts: five main genres + time: perfect storm

KWIC: Normal display

KWIC: Expanded display and full source information

Charts: five main genres + time: [end] up [vvg]

Charts: click to see matching strings in any genre or time period: [end] up [vvg]

Charts: sub-genres (bling)

Frequency of each matching string in each genre and time period: [vv*] * ground

Collocates (up to 10 words L / R): [nn*] collocates of chip.[nn*]

Collocates: sort by MI score / view by genre: [vvg] collocates of [feel] like

Sections: frequency by genre and time period

Sections: frequency by genre and time period (compared to second section)

Sections: frequency by genre and time period: Verbs in Magazine: Sports

Sections: frequency by genre and time period: Magazine: Sports vs Magazines

Sections: by time period: *dom in 2000s vs 1990s

Sections: by time period: [vvi] in 2007 vs 1990s (neologisms)

Sections: frequency by genre and time period: we [vv*] that in ACADEMIC

Sections: frequency by genre and time period: collocates of chair (ACAD / FIC)

Word comparisons

Word comparisons: utter vs sheer + [NN*]

Word comparisons: Democrats vs Republicans + [AJ*]

Synonyms (“a thesaurus on steroids”): by genre and time periods: 60,000+ entries

Synonyms: [[=clean]].[v*] the [n*]

Synonyms: [=strong] in ACADEMIC vs MAGAZINES

Customized word lists (semantic, syntactic, etc)

Customized wordlists: [davies:clothes]].[nn*] NEAR [davies:colors]

BYU Corpus of American English • Offers an extremely wide range of queries • Only large corpus of contemporary American English • Only American corpus with texts from a wide variety of genres and sources • Almost four times as big as BNC, more recent, and will be updated • Largest, most diverse, publically-available corpus of any language • Freely available !

The 360 million word BYU Corpus of American English (1990-2007)

The 360 million word BYU Corpus of American English (1990-2007)

Presentation Transcript

Psychosocial Aspects of Obesity

GDP (Trillions of Chained 2005 Dollars)

The Diachronic Electronic Corpus of Tyneside English

The European Union: 493 million people – 27 countries

Italian American 1960-1990

Estimated number of people living with HIV and adult HIV prevalence Global HIV epidemic, 1990–2007; and, HIV epidemic i

Million Word Challenge

Million Word Challenge

English 360

History of American Journalism- The 1990’s

Word/Definition of English

Salient features of the forthcoming ICRP recommendations-2007

Review of the 2007 Cotton Crop County Cotton Production Meeting 2008

32.6 million people in 1990 34.8 million people in 2000

Canadian English

English Corpus Linguistics Introducing the Diachronic Corpus of Present-Day Spoken English (DCPSE)

The Vowels of American English

Improved Word Alignments Using the Web as a Corpus

The Future of Phishing

Beginning (1990-2007)

The Sawa Corpus A Parallel Corpus English - Swahili

Evolution of the American Worker