The ‘London Corpora’ projects
- the benefits of hindsight
- some lessons for diachronic corpus design
Sean Wallis
Survey of English Usage, University College London
s.wallis@ucl.ac.uk

Motivating questions
• What is meant by the phrase ‘a balanced corpus’?
• How do sampling decisions made by corpus builders affect the type of research questions that may be asked of the data?
• Reviewing ICE-GB and DCPSE:
  • Should the data have been more sociolinguistically representative, by social class and region?
  • Should texts have been stratified: sampled so that speakers of all categories of gender and age were (equally) represented in each genre?

ICE-GB
• British Component of ICE
  • Corpus of speech and writing (1990-1992)
  • 60% spoken, 40% written; 1 million words; orthographically transcribed speech, marked up, tagged and fully parsed
• Sampling principles
  • International sampling scheme, including broad range of spoken and written categories
• But:
  • Adults who had completed secondary education
  • ‘British corpus’ geographically limited
    • speakers mostly from London / SE UK (or sampled there)

DCPSE
• Diachronic Corpus of Present-day Spoken English (late 1950s - early 1990s)
  • 800,000 words (nominal)
  • London-Lund component annotated as ICE-GB
    • orthographically transcribed and fully parsed
• Created from subsamples of LLC and ICE-GB
  • Matching numbers of texts in text categories
  • Not sampled over equal duration
    • LLC (1958-1977)
    • ICE-GB (1990-1992)
  • Text passages in LLC larger than ICE-GB
    • LLC (5,000 words)
    • ICE-GB (2,000 words)
  • But text passages may include subtexts
    • telephone calls and newspaper articles are frequently short

DCPSE
• Representative?
  • Text categories of unequal size
  • Broad range of text types sampled
  • Not balanced by speaker demography

A balanced corpus?
• Corpora are reusable experimental datasets
  • Data collection (sampling) should avoid limiting future research goals
  • Samples should be representative
• What are they representative of?
• Quantity vs. quality
  • Large/lighter annotation vs. small/richer
  • Are larger corpora more (easily) representative?
• Problems for historical corpora
  • Can we add samples to make the corpus more representative?

“Representativeness”
• Do we mean representative...
  • of the language? (“random sample”)
    • A sample in the corpus is a genuine random sample of the type of text in the language
  • of text types? (“broad”)
    • Effort made to include examples of all types of language (“text types”), including speech contexts
  • of speaker types? (“stratified”)
    • Sampling decisions made to include equal numbers (by gender, age, geography, etc.) of participants in each text category
    • Should subdivide data independently (stratification)
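
The “stratified” option above can be sketched in code. This is a minimal illustration only, not the procedure used for ICE-GB or DCPSE: the candidate pool, the `stratified_sample` helper and its parameters are all hypothetical. It draws the same number of texts at random from every (genre, gender) cell and fails loudly when a cell cannot be filled.

```python
import random
from collections import defaultdict

def stratified_sample(candidates, key, per_stratum, seed=0):
    """Draw `per_stratum` items at random from every stratum defined by `key`.

    Hypothetical helper for illustration; raises if a stratum is too small,
    so that a missing stratum is noticed rather than silently skipped.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    strata = defaultdict(list)
    for item in candidates:
        strata[key(item)].append(item)
    sample = []
    for stratum in sorted(strata):
        pool = strata[stratum]
        if len(pool) < per_stratum:
            raise ValueError(f"stratum {stratum!r} has only {len(pool)} candidates")
        sample.extend(rng.sample(pool, per_stratum))
    return sample

# Hypothetical pool of candidate texts: 5 per (genre, gender) cell.
cells = [
    (genre, gender)
    for genre in ("conversation", "lecture")
    for gender in ("female", "male")
    for _ in range(5)
]
candidates = [
    {"id": f"t{i}", "genre": genre, "gender": gender}
    for i, (genre, gender) in enumerate(cells)
]

sample = stratified_sample(
    candidates,
    key=lambda text: (text["genre"], text["gender"]),
    per_stratum=2,
)
```

Raising an error for an under-filled stratum is a design choice in this sketch: it forces the corpus builder to decide explicitly whether to find more texts or to leave the stratum out.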

Stratified sampling
• Ideal
  • Corpus independently subdivided by each variable
• Equal subdivisions?
  • Not required
  • Independent variables = constant probability in each subset
    • e.g. proportion of words spoken by women not affected by text genre
    • e.g. same ratio of women:men in age groups, etc.
• What is the reality?
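
The “constant probability in each subset” criterion can be checked descriptively. The sketch below assumes a table of word counts per (genre, gender); the figures are invented for illustration and are not ICE-GB or DCPSE data. It compares the proportion of words produced by women in each genre with the overall proportion. If genre and gender were sampled independently, the per-genre proportions would all be close to the overall one.

```python
from collections import defaultdict

# Hypothetical word counts per (genre, gender); NOT taken from any real corpus.
word_counts = {
    ("conversation", "female"): 4000,
    ("conversation", "male"): 6000,
    ("broadcast", "female"): 2000,
    ("broadcast", "male"): 3000,
    ("lecture", "female"): 1000,
    ("lecture", "male"): 4000,
}

def female_share_by_genre(counts):
    """Proportion of words produced by women within each genre."""
    totals = defaultdict(int)
    female = defaultdict(int)
    for (genre, gender), n in counts.items():
        totals[genre] += n
        if gender == "female":
            female[genre] += n
    return {genre: female[genre] / totals[genre] for genre in totals}

def overall_female_share(counts):
    """Proportion of words produced by women across the whole sample."""
    total = sum(counts.values())
    female = sum(n for (_, gender), n in counts.items() if gender == "female")
    return female / total

shares = female_share_by_genre(word_counts)
overall = overall_female_share(word_counts)

# Largest departure from the overall proportion: 0 would mean the gender
# split is the same in every genre, i.e. the two variables are independent.
max_deviation = max(abs(p - overall) for p in shares.values())
```

The deviation is a description of the sample, not a significance test: words within a text are not independent observations, so a word-level test would overstate the evidence.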

ICE-GB: gender / written-spoken
• Proportion of words in each category spoken by women and men
  • The authors of some texts are unspecified
  • Some written material may be jointly authored
• female/male ratio varies slightly (=0.02)
[Figure: bar chart of the proportion p (0 to 1) of words by female and male contributors in the written, spoken and TOTAL categories]

ICE-GB: gender / spoken genres
• Gender variation in spoken subcategories
[Figure: bar chart of the proportion p (0 to 1) of words by female and male speakers in each spoken genre. Categories shown: private dialogue (direct conversations, telephone calls); public dialogue (broadcast discussions, broadcast interviews, business transactions, classroom lessons, legal cross-examinations, parliamentary debates); mixed (broadcast news); scripted monologue (broadcast talks, non-broadcast speeches); unscripted monologue (demonstrations, legal presentations, spontaneous commentaries, unscripted speeches); TOTAL spoken]

ICE-GB: gender / written genres
• Gender variation in written genres
[Figure: bar chart of the proportion p (0 to 1) of words by female, male and <author unknown/joint> writers in each written genre. Categories shown: non-printed: correspondence (business letters, social letters), non-professional writing (student examination scripts, untimed student essays); printed: academic writing (humanities, natural sciences, social sciences, technology), creative writing (novels/stories), instructional writing (administrative/regulatory, skills/hobbies), non-academic writing (humanities, natural sciences, social sciences, technology), persuasive writing (press editorials), reportage (press news reports); TOTAL written]

ICE-GB
• Sampling was not stratified across variables
  • Women contribute 1/3 of corpus words
  • Some genres are all male (where specified)
    • speech: spontaneous commentary, legal presentation
    • academic writing: technology, natural sciences
    • non-academic writing: technology, social science
• Is this representative?
• When we compare
  • technology writing with creative writing
  • academic writing with student essays
  are we also finding gender effects?
• Difficult to compensate for absent data in analysis!
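
Absent data of this kind can at least be detected before any comparison is run. The following sketch assumes a hypothetical table of text counts per (genre, gender); the numbers are invented and do not describe ICE-GB. It lists every cell with no texts at all: a genre comparison that involves such a cell cannot be separated from a gender effect.

```python
from itertools import product

# Hypothetical counts of texts per (genre, gender); NOT real ICE-GB figures.
text_counts = {
    ("technology writing", "male"): 10,
    ("creative writing", "female"): 8,
    ("creative writing", "male"): 12,
    ("student essays", "female"): 5,
    ("student essays", "male"): 5,
}

def empty_cells(counts):
    """Return every (genre, gender) combination with no texts at all."""
    genres = sorted({genre for genre, _ in counts})
    genders = sorted({gender for _, gender in counts})
    return [
        (genre, gender)
        for genre, gender in product(genres, genders)
        if counts.get((genre, gender), 0) == 0
    ]

# In this invented table, technology writing has no female-authored texts,
# so technology vs. creative writing is confounded with gender.
missing = empty_cells(text_counts)
```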

DCPSE: gender / genre
• DCPSE has a simpler genre categorisation
  • also divided by time
[Figure: bar chart of the proportion p (0 to 1) of words by female and male speakers in each DCPSE genre: face-to-face conversations (formal, informal), telephone conversations, broadcast discussions, broadcast interviews, spontaneous commentary, parliamentary language, legal cross-examination, assorted spontaneous, prepared speech; TOTAL]

DCPSE: gender / time
• DCPSE has a simpler genre categorisation
  • also divided by time
  • note the gap
[Figure: proportion p (0 to 1) of words by gender, plotted against time (year), 1958 to 1992]

DCPSE: genre / time
• Proportion in each spoken genre, over time
  • sampled by matching LLC and ICE-GB overall
  • this is a ‘stratified sample’ (but only LLC:ICE-GB)
  • uneven sampling over 5-year periods (within LLC)
[Figure: proportion p of words in each spoken genre by 5-year period, 1960 to 1990. Series shown: formal face-to-face, informal face-to-face, prepared speech, spontaneous commentary, telephone conversations. The ICE-GB proportions are marked as the target for LLC]
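
A tabulation of genre by 5-year period, like the one plotted on this slide, could be produced along the following lines. This is a sketch under stated assumptions: the records (year, genre, word count) are invented, and `period_start` and `genre_proportions_by_period` are hypothetical helper names, not part of any DCPSE tooling.

```python
from collections import Counter, defaultdict

# Hypothetical (year, genre, words) records; NOT actual LLC or ICE-GB texts.
texts = [
    (1961, "informal face-to-face", 5000),
    (1963, "prepared speech", 5000),
    (1971, "informal face-to-face", 5000),
    (1972, "informal face-to-face", 5000),
    (1974, "telephone conversations", 5000),
    (1976, "prepared speech", 5000),
]

def period_start(year, width=5):
    """First year of the `width`-year period containing `year` (1963 -> 1960)."""
    return year - year % width

def genre_proportions_by_period(records, width=5):
    """Share of words in each genre, computed separately for each period."""
    words = defaultdict(Counter)
    for year, genre, n in records:
        words[period_start(year, width)][genre] += n
    result = {}
    for period, counter in sorted(words.items()):
        total = sum(counter.values())
        result[period] = {genre: n / total for genre, n in counter.items()}
    return result

proportions = genre_proportions_by_period(texts)
```

Periods with no texts (1965 to 1969 in this invented data) simply do not appear in the result, which is itself a signal of uneven sampling over time.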

DCPSE
• LLC sampling not stratified
  • Issue not considered, data collected over extended period
  • Some data was surreptitiously recorded
• DCPSE matched samples by ‘genre’
  • Same text category sizes in ICE-GB and LLC
  • But problems in LLC (and ICE) percolate
• No stratification by speaker
• Result: difficult and sometimes impossible to separate out speaker-demographic effects from text category

Conclusions
• Ideal would be that:
  • the corpus was “representative” in all 3 ways:
    • a genuine random sample
    • a broad range of text types
    • a stratified sampling of speakers
• But these principles are unlikely to be compatible
  • e.g. speaker age and utterance context
• Some compensatory approaches may be employed at research (data analysis) stage
  • what about absent or atypical data?
  • what if we have few speakers/writers?
• So...

Conclusions
• …pay attention to stratification in deciding which texts to include in subcategories
  • consider replacing texts in outlying categories
• …justify and document non-inclusion of stratum by evidence
  • e.g. “there are no published articles attributable to authors of this age in this time period”