The ‘London Corpora’ projects

The ‘London Corpora’ projects: the benefits of hindsight, some lessons for diachronic corpus design. Sean Wallis, Survey of English Usage, University College London (s.wallis@ucl.ac.uk).


Presentation Transcript

  1. The ‘London Corpora’ projects - the benefits of hindsight - some lessons for diachronic corpus design. Sean Wallis, Survey of English Usage, University College London, s.wallis@ucl.ac.uk

  2. Motivating questions • What is meant by the phrase ‘a balanced corpus’? • How do sampling decisions made by corpus builders affect the type of research questions that may be asked of the data?

  3. Motivating questions • What is meant by the phrase ‘a balanced corpus’? • How do sampling decisions made by corpus builders affect the type of research questions that may be asked of the data? • Reviewing ICE-GB and DCPSE: • Should the data have been more sociolinguistically representative, by social class and region?

  4. Motivating questions • What is meant by the phrase ‘a balanced corpus’? • How do sampling decisions made by corpus builders affect the type of research questions that may be asked of the data? • Reviewing ICE-GB and DCPSE: • Should the data have been more sociolinguistically representative, by social class and region? • Should texts have been stratified: sampled so that speakers of all categories of gender and age were (equally) represented in each genre?

  5. ICE-GB • British Component of ICE • Corpus of speech and writing (1990-1992) • 60% spoken, 40% written; 1 million words; orthographically transcribed speech, marked up, tagged and fully parsed • Sampling principles • International sampling scheme, including broad range of spoken and written categories • But: • Adults who had completed secondary education • ‘British corpus’ geographically limited • speakers mostly from London / SE UK (or sampled there)

  6. DCPSE • Diachronic Corpus of Present-day Spoken English (late 1950s - early 1990s) • 800,000 words (nominal) • London-Lund component annotated as ICE-GB • orthographically transcribed and fully parsed • Created from subsamples of LLC and ICE-GB • Matching numbers of texts in text categories • Not sampled over equal duration • LLC (1958-1977) • ICE-GB (1990-1992) • Text passages in LLC larger than ICE-GB • LLC (5,000 words) • ICE-GB (2,000 words) • But text passages may include subtexts • telephone calls and newspaper articles are frequently short

  7. DCPSE • Representative? • Text categories of unequal size • Broad range of text types sampled • Not balanced by speaker demography

  8. A balanced corpus? • Corpora are reusable experimental datasets • Data collection (sampling) should avoid limiting future research goals • Samples should be representative • What are they representative of? • Quantity vs. quality • Large/lighter annotation vs. small/richer • Are larger corpora more (easily) representative? • Problems for historical corpora • Can we add samples to make the corpus more representative?

  9. “Representativeness” • Do we mean representative... • of the language? • A sample in the corpus is a genuine random sample of the type of text in the language

  10. “Representativeness” • Do we mean representative... • of the language? • A sample in the corpus is a genuine random sample of the type of text in the language • of text types? • Effort made to include examples of all types of language “text types” (including speech contexts)

  11. “Representativeness” • Do we mean representative... • of the language? • A sample in the corpus is a genuine random sample of the type of text in the language • of text types? • Effort made to include examples of all types of language “text types” (including speech contexts) • of speaker types? • Sampling decisions made to include equal numbers (by gender, age, geography, etc.) of participants in each text category • Should subdivide data independently (stratification)

  12. “Representativeness” • Do we mean representative... • of the language? • A sample in the corpus is a genuine random sample of the type of text in the language • of text types? • Effort made to include examples of all types of language “text types” (including speech contexts) • of speaker types? • Sampling decisions made to include equal numbers (by gender, age, geography, etc.) of participants in each text category • Should subdivide data independently (stratification) • Labels for the three senses, respectively: “random sample”, “broad”, “stratified”

  13. Stratified sampling • Ideal • Corpus independently subdivided by each variable

  15. Stratified sampling • Ideal • Corpus independently subdivided by each variable • Equal subdivisions?

  16. Stratified sampling • Ideal • Corpus independently subdivided by each variable • Equal subdivisions? • Not required • Independent variables = constant probability in each subset • e.g. proportion of words spoken by women not affected by text genre • e.g. same ratio of women:men in age groups, etc.

  17. Stratified sampling • Ideal • Corpus independently subdivided by each variable • Equal subdivisions? • Not required • Independent variables = constant probability in each subset • e.g. proportion of words spoken by women not affected by text genre • What is the reality?
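The independence condition on the slide above (a constant proportion of words spoken by women in every genre) can be checked with a contingency table. The following is a minimal sketch, assuming word counts per genre and gender are available; the genre names and counts are invented for illustration and are not ICE-GB figures, and the function names are hypothetical. It reports the female proportion per genre and a Pearson chi-square statistic. Words within a text are not independent observations, so the statistic is only a rough indicator of imbalance.

```python
# Hypothetical word counts (female, male) per genre; NOT taken from ICE-GB.
counts = {
    "conversation": (6000, 4000),
    "broadcast": (3000, 7000),
    "lecture": (1000, 9000),
}


def female_proportions(table):
    """Proportion of words spoken by women in each genre."""
    return {genre: f / (f + m) for genre, (f, m) in table.items()}


def chi_square(table):
    """Pearson chi-square statistic for a genre x gender table.

    A value near zero means the female:male ratio is close to constant
    across genres, i.e. gender and genre look independent in the sample.
    """
    total_f = sum(f for f, _ in table.values())
    total_m = sum(m for _, m in table.values())
    grand = total_f + total_m
    stat = 0.0
    for f, m in table.values():
        row = f + m
        for observed, column_total in ((f, total_f), (m, total_m)):
            expected = row * column_total / grand
            stat += (observed - expected) ** 2 / expected
    return stat


if __name__ == "__main__":
    for genre, p in female_proportions(counts).items():
        print(f"{genre}: p(female) = {p:.2f}")
    print(f"chi-square = {chi_square(counts):.1f}")
```

Note that the subsets need not be the same size: a table such as {"a": (2000, 4000), "b": (1000, 2000)} has unequal genres but the same ratio in each, and gives a statistic of zero.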

  18. ICE-GB: gender / written-spoken • Proportion of words in each category spoken by women and men • The authors of some texts are unspecified • Some written material may be jointly authored • female/male ratio varies slightly (=0.02) • [Figure: bars for written, spoken and TOTAL, split into female and male; axis p from 0 to 1]

  19. ICE-GB: gender / spoken genres • Gender variation in spoken subcategories • [Figure: bars split into female and male, axis p from 0 to 1, for: unscripted monologue (unscripted speeches, spontaneous commentaries, legal presentations, demonstrations); scripted monologue (non-broadcast speeches, broadcast talks); mixed (broadcast news); public dialogue (parliamentary debates, legal cross-examinations, classroom lessons, business transactions, broadcast interviews, broadcast discussions); private dialogue (telephone calls, direct conversations); TOTAL spoken]

  20. ICE-GB: gender / written genres • Gender variation in written genres • [Figure: bars split into female, male and <author unknown/joint>, axis p from 0 to 1, for: printed: reportage (press news reports); persuasive writing (press editorials); non-academic writing (technology, social sciences, natural sciences, humanities); instructional writing (skills/hobbies, administrative/regulatory); creative writing (novels/stories); academic writing (technology, social sciences, natural sciences, humanities); non-printed: non-professional writing (untimed student essays, student examination scripts); correspondence (social letters, business letters); TOTAL written]

  21. ICE-GB • Sampling was not stratified across variables • Women contribute 1/3 of corpus words • Some genres are all male (where specified) • speech: spontaneous commentary, legal presentation • academic writing: technology, natural sciences • non-academic writing: technology, social science

  22. ICE-GB • Sampling was not stratified across variables • Women contribute 1/3 of corpus words • Some genres are all male (where specified) • speech: spontaneous commentary, legal presentation • academic writing: technology, natural sciences • non-academic writing: technology, social science • Is this representative?

  23. ICE-GB • Sampling was not stratified across variables • Women contribute 1/3 of corpus words • Some genres are all male (where specified) • speech: spontaneous commentary, legal presentation • academic writing: technology, natural sciences • non-academic writing: technology, social science • Is this representative? • When we compare • technology writing with creative writing • academic writing with student essays • are we also finding gender effects?

  24. ICE-GB • Sampling was not stratified across variables • Women contribute 1/3 of corpus words • Some genres are all male (where specified) • speech: spontaneous commentary, legal presentation • academic writing: technology, natural sciences • non-academic writing: technology, social science • Is this representative? • When we compare • technology writing with creative writing • academic writing with student essays • are we also finding gender effects? • Difficult to compensate for absent data in analysis!
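Absent data of the kind listed above (genres with no words from one gender) can at least be detected before analysis begins. The sketch below is illustrative only: the table layout, the function name and the word counts are assumptions, with zeros placed to mirror the all-male genres named on the slide. It lists every genre/gender cell that is empty, which are the cells no later analysis can compensate for.

```python
# Hypothetical word counts by genre and gender; the zero cells mirror the
# all-male genres named on the slide, but the figures are invented.
words = {
    "spontaneous commentary": {"female": 0, "male": 40000},
    "legal presentation": {"female": 0, "male": 20000},
    "creative writing": {"female": 18000, "male": 22000},
}


def empty_strata(table):
    """Return (genre, gender) pairs for which the corpus has no words."""
    return [
        (genre, gender)
        for genre, by_gender in table.items()
        for gender, count in by_gender.items()
        if count == 0
    ]


if __name__ == "__main__":
    for genre, gender in empty_strata(words):
        print(f"no {gender} data in genre: {genre}")
```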

  25. DCPSE: gender / genre • DCPSE has a simpler genre categorisation • also divided by time • [Figure: bars split into female and male, axis p from 0 to 1, for: prepared speech, assorted spontaneous, legal cross-examination, parliamentary language, spontaneous commentary, broadcast interviews, broadcast discussions, telephone conversations, informal face-to-face conversations, formal face-to-face conversations, TOTAL]

  26. DCPSE: gender / time • DCPSE has a simpler genre categorisation • also divided by time • note the gap • [Figure: p (0 to 1) by gender plotted against time, with ticks from 1958 to 1992]

  27. DCPSE: genre / time • Proportion in each spoken genre, over time • sampled by matching LLC and ICE-GB overall • this is a ‘stratified sample’ (but only LLC:ICE-GB) • uneven sampling over 5-year periods (within LLC) • [Figure: p (0 to 0.6) against time, 1960 to 1990 in 5-year steps, with lines for formal face-to-face, informal face-to-face, prepared speech, spontaneous commentary and telephone conversations; ICE-GB marked as the target for LLC]
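Uneven sampling over 5-year periods, as noted on the slide above, can be made visible by totalling words per period. This is a minimal sketch, assuming each text is recorded as a (year, word count) pair; the sample values and the function name are invented for illustration and are not LLC data.

```python
from collections import Counter

# Hypothetical (recording year, word count) pairs; NOT actual LLC texts.
texts = [
    (1958, 5000),
    (1961, 5000),
    (1964, 10000),
    (1969, 5000),
    (1971, 15000),
    (1976, 20000),
]


def words_per_period(samples, span=5):
    """Total the words falling in each `span`-year period.

    Periods are labelled by their first year, e.g. 1960 covers 1960-1964.
    """
    totals = Counter()
    for year, count in samples:
        totals[year - year % span] += count
    return dict(sorted(totals.items()))


if __name__ == "__main__":
    for start, total in words_per_period(texts).items():
        print(f"{start}-{start + 4}: {total} words")
```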

  28. DCPSE • LLC sampling not stratified • Issue not considered, data collected over extended period • Some data was surreptitiously recorded

  29. DCPSE • LLC sampling not stratified • Issue not considered, data collected over extended period • Some data was surreptitiously recorded • DCPSE matched samples by ‘genre’ • Same text category sizes in ICE-GB and LLC • But problems in LLC (and ICE) percolate

  30. DCPSE • LLC sampling not stratified • Issue not considered, data collected over extended period • Some data was surreptitiously recorded • DCPSE matched samples by ‘genre’ • Same text category sizes in ICE-GB and LLC • But problems in LLC (and ICE) percolate • No stratification by speaker • Result: difficult and sometimes impossible to separate out speaker-demographic effects from text category
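The difficulty of separating speaker-demographic effects from text category can be illustrated with a toy calculation. In the sketch below all figures are invented: the rate of some linguistic feature depends only on genre, yet because women and men are unevenly spread across genres, the pooled figures suggest a gender difference. The record layout and function name are assumptions made for the example.

```python
# Hypothetical records: (genre, gender, words, occurrences of a feature).
# Within each genre both genders use the feature at the same rate.
records = [
    ("conversation", "female", 8000, 160),
    ("conversation", "male", 2000, 40),
    ("lecture", "female", 2000, 10),
    ("lecture", "male", 8000, 40),
]


def rate_per_thousand(rows, gender, genre=None):
    """Feature occurrences per 1,000 words for one gender.

    If `genre` is given, only texts of that genre are counted.
    """
    selected = [
        (words, hits)
        for g, s, words, hits in rows
        if s == gender and (genre is None or g == genre)
    ]
    total_words = sum(words for words, _ in selected)
    total_hits = sum(hits for _, hits in selected)
    return 1000 * total_hits / total_words


if __name__ == "__main__":
    for gender in ("female", "male"):
        print(gender, "pooled:", rate_per_thousand(records, gender))
        for genre in ("conversation", "lecture"):
            print(" ", genre, rate_per_thousand(records, gender, genre))
```

Here the pooled rates differ by gender although the within-genre rates are identical. With real data, where some genre/gender cells are empty or hold few speakers, the within-genre comparison may not be available at all.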

  31. Conclusions • Ideal would be that: • the corpus was “representative” in all 3 ways: • a genuine random sample • a broad range of text types • a stratified sampling of speakers • But these principles are unlikely to be compatible • e.g. speaker age and utterance context • Some compensatory approaches may be employed at research (data analysis) stage • what about absent or atypical data? • what if we have few speakers/writers? • So...

  32. Conclusions • …pay attention to stratification in deciding which texts to include in subcategories • consider replacing texts in outlying categories • …justify and document non-inclusion of stratum by evidence • e.g. “there are no published articles attributable to authors of this age in this time period”