The ‘London Corpora’ projects

The ‘London Corpora’ projects - the benefits of hindsight - some lessons for diachronic corpus design
Sean Wallis, Survey of English Usage, University College London, s.wallis@ucl.ac.uk

Motivating questions • What is meant by the phrase ‘a balanced corpus’? • How do sampling decisions made by corpus builders affect the type of research questions that may be asked of the data? • Reviewing ICE-GB and DCPSE: • Should the data have been more sociolinguistically representative, by social class and region? • Should texts have been stratified: sampled so that speakers of all categories of gender and age were (equally) represented in each genre?

ICE-GB • British Component of ICE • Corpus of speech and writing (1990-1992) • 60% spoken, 40% written; 1 million words; orthographically transcribed speech, marked up, tagged and fully parsed • Sampling principles • International sampling scheme, including broad range of spoken and written categories • But: • Adults who had completed secondary education • ‘British corpus’ geographically limited • speakers mostly from London / SE UK (or sampled there)

DCPSE • Diachronic Corpus of Present-day Spoken English (late 1950s - early 1990s) • 800,000 words (nominal) • London-Lund component annotated as ICE-GB • orthographically transcribed and fully parsed • Created from subsamples of LLC and ICE-GB • Matching numbers of texts in text categories • Not sampled over equal duration • LLC (1958-1977) • ICE-GB (1990-1992) • Text passages in LLC larger than ICE-GB • LLC (5,000 words) • ICE-GB (2,000 words) • But text passages may include subtexts • telephone calls and newspaper articles are frequently short

DCPSE • Representative? • Text categories of unequal size • Broad range of text types sampled • Not balanced by speaker demography

A balanced corpus? • Corpora are reusable experimental datasets • Data collection (sampling) should avoid limiting future research goals • Samples should be representative • What are they representative of? • Quantity vs. quality • Large/lighter annotation vs. small/richer • Are larger corpora more (easily) representative? • Problems for historical corpora • Can we add samples to make the corpus more representative?

“Representativeness” • Do we mean representative... • of the language? (“random sample”) • A sample in the corpus is a genuine random sample of the type of text in the language • of text types? (“broad”) • Effort made to include examples of all types of language (“text types”), including speech contexts • of speaker types? (“stratified”) • Sampling decisions made to include equal numbers (by gender, age, geography, etc.) of participants in each text category • Should subdivide data independently (stratification)

Stratified sampling • Ideal • Corpus independently subdivided by each variable • Equal subdivisions? • Not required • Independent variables = constant probability in each subset • e.g. proportion of words spoken by women not affected by text genre • e.g. same ratio of women:men in age groups, etc.
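
The independence criterion on this slide can be checked numerically. What follows is a minimal Python sketch, not the method used for ICE-GB or DCPSE: the genre names and word counts are invented for illustration, and `female_proportions` and `chi_square` are hypothetical helper names. It computes the proportion of words spoken by women in each genre and a Pearson chi-square statistic for the genre by gender table. Under ideal stratification the proportions are constant and the statistic is zero, whatever the size of each genre.

```python
# Hypothetical (female, male) word counts per genre, invented for illustration.
balanced = {
    "conversation": (6000, 4000),
    "broadcast": (3000, 2000),
    "lectures": (1500, 1000),
}
# Same genre sizes, but one genre has no female speakers at all.
skewed = {
    "conversation": (6000, 4000),
    "broadcast": (3000, 2000),
    "lectures": (0, 2500),
}

def female_proportions(table):
    """Proportion of words spoken by women in each genre."""
    return {genre: f / (f + m) for genre, (f, m) in table.items()}

def chi_square(table):
    """Pearson chi-square statistic for a genre x gender table.

    Assumes both gender totals are non-zero (otherwise an expected
    count would be zero and the division below would fail).
    """
    total_f = sum(f for f, _ in table.values())
    total_m = sum(m for _, m in table.values())
    total = total_f + total_m
    stat = 0.0
    for f, m in table.values():
        row = f + m
        for observed, col_total in ((f, total_f), (m, total_m)):
            expected = row * col_total / total
            stat += (observed - expected) ** 2 / expected
    return stat

props = female_proportions(balanced)
stat_balanced = chi_square(balanced)
stat_skewed = chi_square(skewed)
print(props, stat_balanced, stat_skewed)
```

Note that the genres in `balanced` differ in size: equal subdivisions are not needed for the proportions to agree.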

Stratified sampling • What is the reality?

ICE-GB: gender / written-spoken • Proportion of words in each category spoken by women and men • The authors of some texts are unspecified • Some written material may be jointly authored • female/male ratio varies slightly (=0.02) [Figure: proportion p (0 to 1) of words by female and male contributors in the written, spoken and TOTAL categories]

ICE-GB: gender / spoken genres • Gender variation in spoken subcategories [Figure: proportion p (0 to 1) of words by female and male speakers in each spoken category. Private dialogue: direct conversations, telephone calls. Public dialogue: broadcast discussions, broadcast interviews, business transactions, classroom lessons, legal cross-examinations, parliamentary debates. Mixed: broadcast news. Scripted monologue: broadcast talks, non-broadcast speeches. Unscripted monologue: demonstrations, legal presentations, spontaneous commentaries, unscripted speeches. TOTAL spoken.]

ICE-GB: gender / written genres • Gender variation in written genres [Figure: proportion p (0 to 1) of words by female, male and <author unknown/joint> writers in each written category. Non-printed: correspondence (business letters, social letters); non-professional writing (student examination scripts, untimed student essays). Printed: academic writing (humanities, natural sciences, social sciences, technology); creative writing (novels/stories); instructional writing (administrative/regulatory, skills/hobbies); non-academic writing (humanities, natural sciences, social sciences, technology); persuasive writing (press editorials); reportage (press news reports). TOTAL written.]

ICE-GB • Sampling was not stratified across variables • Women contribute 1/3 of corpus words • Some genres are all male (where specified) • speech: spontaneous commentary, legal presentation • academic writing: technology, natural sciences • non-academic writing: technology, social science • Is this representative? • When we compare • technology writing with creative writing • academic writing with student essays • are we also finding gender effects? • Difficult to compensate for absent data in analysis!
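
Before comparing two genres, it can help to list the cells of the genre by gender table that hold no data, since a contrast involving such a genre cannot be separated from a gender effect. The following Python sketch illustrates the idea under stated assumptions: the counts are invented (they are not ICE-GB figures) and `empty_strata` is a hypothetical helper name.

```python
# Hypothetical (female, male) word counts per genre, invented for illustration.
counts = {
    "creative writing": (12000, 8000),
    "technology writing": (0, 20000),
    "student essays": (9000, 11000),
}

def empty_strata(table, labels=("female", "male")):
    """Return (genre, label) pairs for cells that contain no words."""
    return [
        (genre, label)
        for genre, cell in table.items()
        for label, words in zip(labels, cell)
        if words == 0
    ]

missing = empty_strata(counts)
print(missing)
```

With these invented counts, any comparison of technology writing with creative writing would be flagged, because the first has no female-authored words to compare.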

DCPSE: gender / genre • DCPSE has a simpler genre categorisation • also divided by time [Figure: proportion p (0 to 1) of words by female and male speakers in each genre: formal face-to-face conversations, informal face-to-face conversations, telephone conversations, broadcast discussions, broadcast interviews, spontaneous commentary, parliamentary language, legal cross-examination, assorted spontaneous, prepared speech, TOTAL]

DCPSE: gender / time • DCPSE has a simpler genre categorisation • also divided by time • note the gap [Figure: proportion p (0 to 1) by gender plotted against time, by year from 1958 to 1992]

DCPSE: genre / time • Proportion in each spoken genre, over time • sampled by matching LLC and ICE-GB overall • this is a ‘stratified sample’ (but only LLC:ICE-GB) • uneven sampling over 5-year periods (within LLC) [Figure: proportion p (0 to 0.6) of each spoken genre against time, 1960 to 1990 in 5-year steps; series: formal face-to-face, informal face-to-face, prepared speech, spontaneous commentary, telephone conversations; the ICE-GB values are marked as the target for LLC]
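
Uneven sampling over 5-year periods can be made visible by binning texts by period and computing each genre's share of words within the period. The Python sketch below assumes a simple list of (year, genre, words) records. The records are invented for illustration and are not DCPSE data, and `genre_proportions_by_period` is a hypothetical helper name.

```python
from collections import defaultdict

# Hypothetical (year, genre, words) records, invented for illustration.
texts = [
    (1961, "informal face-to-face", 5000),
    (1963, "prepared speech", 5000),
    (1976, "informal face-to-face", 5000),
    (1976, "telephone conversations", 5000),
    (1991, "informal face-to-face", 2000),
]

def genre_proportions_by_period(records, width=5):
    """Share of words per genre within each period of `width` years."""
    words = defaultdict(lambda: defaultdict(int))
    for year, genre, n in records:
        period = year - year % width  # e.g. 1963 -> 1960
        words[period][genre] += n
    result = {}
    for period, by_genre in sorted(words.items()):
        total = sum(by_genre.values())
        result[period] = {g: n / total for g, n in by_genre.items()}
    return result

shares = genre_proportions_by_period(texts)
for period, by_genre in shares.items():
    print(period, by_genre)
```

Periods with no texts at all simply do not appear in the result, which is one way to spot gaps in the sampling.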

DCPSE • LLC sampling not stratified • Issue not considered, data collected over extended period • Some data was surreptitiously recorded • DCPSE matched samples by ‘genre’ • Same text category sizes in ICE-GB and LLC • But problems in LLC (and ICE) percolate • No stratification by speaker • Result: difficult and sometimes impossible to separate out speaker-demographic effects from text category

Conclusions • Ideal would be that: • the corpus was “representative” in all 3 ways: • a genuine random sample • a broad range of text types • a stratified sampling of speakers • But these principles are unlikely to be compatible • e.g. speaker age and utterance context • Some compensatory approaches may be employed at research (data analysis) stage • what about absent or atypical data? • what if we have few speakers/writers? • So...

Conclusions • …pay attention to stratification in deciding which texts to include in subcategories • consider replacing texts in outlying categories • …justify and document the non-inclusion of a stratum by evidence • e.g. “there are no published articles attributable to authors of this age in this time period”
