Corpus design
See G. Kennedy, Introduction to Corpus Linguistics, Ch. 2, and C. F. Meyer, English Corpus Linguistics, Ch. 2
What is a corpus?
• Corpus (pl. corpora) = ‘body’
• Collection of written text or transcribed speech
• Usually but not necessarily purposefully collected
• Usually but not necessarily structured (see the sketch below)
• Usually but not necessarily annotated
• (Usually stored on and accessible via computer)
• Corpus ~ text archive
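To make “structured” and “annotated” a little more concrete, here is a minimal sketch (not from the lecture) of a corpus held as a small collection of texts with metadata; the field names and sample values are invented purely for illustration.

```python
# A corpus as a structured, annotated collection: each sample carries
# metadata alongside its text. Field names and values are illustrative only.
corpus = [
    {"id": "A01", "mode": "written", "genre": "news", "year": 1991,
     "text": "Share prices rose sharply yesterday ..."},
    {"id": "S01", "mode": "spoken", "genre": "conversation", "year": 1992,
     "text": "well I mean it was er quite a long way ..."},
]

# Structure makes selective queries possible, e.g. only the spoken samples:
spoken = [sample for sample in corpus if sample["mode"] == "spoken"]
print(len(spoken), "spoken sample(s)")
```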
Issues in corpus design
• General purpose vs specialized
• Dynamic (monitor) vs static
• Representativeness and balance
• Size
• Storage and access
• Permission
• Text capture and markup
• Organizations
General purpose vs specialized
• Probably obvious how to assemble a specialized corpus: the appropriateness of texts for inclusion is self-defined
• A general-purpose corpus implies very careful planning to ensure balance
• Implies making some assumptions about the nature of language, even though (as corpus linguists) that may go against the grain
Dynamic vs static
• A static corpus gives a snapshot of language use at a given time
• Easier to control the balance of content
• May limit usefulness, esp. as time passes (e.g. the Brown corpus is now of historical interest; in some respects the BNC is already out of date)
• A dynamic corpus is ever-changing
• Called a “monitor” corpus because it allows us to monitor language change over time
• But more or less impossible to ensure balance
Planned balance: the example of the BNC
• Sampling and representativeness are very difficult to ensure
• The BNC designers were very explicit about their assumptions
• They acknowledge that many decisions are, in the end, subjective
• 100 million words of contemporary spoken and written British English
• Representative of BrE “as a whole”
• Balanced with regard to genre, subject matter and style
• Also designed to be appropriate for a variety of uses: lexicography, education, research, commercial applications (computational tools)
BNC
• 4,124 texts: 90% written, 10% spoken
• The largest collection of spoken English ever assembled (10 million words), but it reflects the typical imbalance in favour of written text (for understandable practical reasons)
• Written portion: 75% informative, 25% imaginative
• The amount of fiction is slightly disproportionately high compared with the amount published during the sampling period, justified by the cultural importance of fiction and creative writing
Subject coverage
• Planned to reflect the pattern of book publishing in the UK over the last 20 years

Subject            Number of texts   % of total written
Imaginative        625               22
World affairs      453               18
Social science     510               15
Leisure            374               11
Applied science    364               8
Commerce           284               8
Arts               259               8
Natural science    144               4
Belief & thought   146               3
Unclassified       50                3
Sources of written material
• 60% books
• 25% periodicals
• 5% brochures and other ephemera (e.g. bus tickets, produce containers, junk mail)
• 5% unpublished letters, essays, minutes
• 5% plays, speeches (written to be spoken)
Register “levels”
• 30% literary or technical (“high”)
• 45% “middle”
• 25% informal (“low”)
• Obvious difficulty of how to judge levels a priori
Spoken corpus
• Context-governed material:
  • Lectures, tutorials, classrooms
  • News reports
  • Product demonstrations, consultations, interviews
  • Sermons, political speeches, public meetings, parliamentary debates
  • Sports commentaries, phone-ins, chat shows
• Samples from 12 different regions
Spoken corpus
• Ordinary conversation:
  • 2,000 hours from 124 volunteers in 38 different regions
  • Four different socio-economic groupings
  • Equal numbers of male and female speakers, age range 15 to 60+
  • All conversations over a 2-day period recorded
  • No secret recording, and participants allowed to erase
  • Systematic records kept of time, location, details of participants (sex, age, race, occupation, education, social group), topic, etc.
• Transcription issues:
  • include false starts, hesitations, etc.
  • some paralinguistic features (shouting, whispering)
  • use of dialect words/grammar
  • but no phonetic information
Another example: ICE
• A collection of samples of English as spoken/written around the world
• Common design (as well as a common annotation scheme and shared tools for exploitation)
• 500 texts of approximately 2,000 words each
• 60% spoken, 40% written
• Specific domains and genres prescribed
• Prescribing a common design in this way makes the corpora comparable
ICE text categories
• Each sample should be 2,000 words
Length of corpus
• The resources available to create and manage a corpus determine how large it can be: funding, researchers, computing facilities
• Speech is easy to capture, but much more time-consuming to process than written language
• Transcription and annotation require 6 person-hours per 1 minute of speech (Santa Barbara Corpus of Spoken American English)
• 4 person-hours per 1,000 words of a written sample, but between 5 and 10 person-hours per 1,000 words of speech (more for dialogues, due to overlapping speech) (International Corpus of English)
• On this basis, the American component of ICE would take one researcher working 40 hrs/week about 3 years to complete (see the sketch below)
• The BNC is 100 times bigger than that
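As a rough check of the ICE effort figure quoted above, the person-hour rates can be turned into a back-of-the-envelope calculation; the assumption of a roughly 50-working-week year is mine, not from the lecture.

```python
# Back-of-the-envelope check of the ICE effort estimate, using the rates above.
# Assumption (not from the slides): one researcher works ~50 weeks/year at 40 hrs/week.
texts, words_per_text = 500, 2000            # ICE design: 500 samples of ~2,000 words
total_words = texts * words_per_text         # ~1,000,000 words
written_words = 0.4 * total_words            # 40% written
spoken_words = 0.6 * total_words             # 60% spoken

hours_written = 4 * written_words / 1000     # 4 person-hours per 1,000 written words
hours_spoken = (5 * spoken_words / 1000,     # 5-10 person-hours per 1,000 spoken words
                10 * spoken_words / 1000)

hours_per_year = 40 * 50
low, high = ((hours_written + h) / hours_per_year for h in hours_spoken)
print(f"roughly {low:.1f} to {high:.1f} person-years")  # ~2.3-3.8, consistent with "about 3 years"
```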
Length of corpus
• Length is also determined by the use to which the corpus will be put
• Corpora for lexicographic use need to be (much) bigger
• Early corpora (1 million words) seemed huge, mainly because of the limited ability of computers to process them
• Sinclair (1991) described a 20-million-word corpus as “small but nevertheless useful”
• Even in a billion-word corpus, data for some words/constructions would be sparse
• How many tokens of a linguistic item are needed for descriptive adequacy?
• Typically 40-50% of all word types occur only once in a given text (or corpus) (see the sketch below)
• For polysemous words, at least half of the possible meanings will occur only once (if at all)
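The “types occurring only once” figure (hapax legomena) is easy to check on any sample. Here is a minimal sketch, assuming `tokens` is a list of word tokens from some corpus sample; the NLTK Brown corpus call in the comment is just one way to obtain such a list, and assumes nltk and its data are installed.

```python
# Sketch of checking the "word types occurring only once" claim on any sample.
from collections import Counter

def hapax_proportion(tokens):
    """Fraction of word types (case-folded) that occur exactly once in the sample."""
    freqs = Counter(t.lower() for t in tokens)
    hapaxes = sum(1 for count in freqs.values() if count == 1)
    return hapaxes / len(freqs)

# Example (assumes NLTK and the Brown corpus data are available):
# from nltk.corpus import brown
# print(hapax_proportion(brown.words()))
```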
“Type” and “token”
• “Token” means an individual occurrence of a word
• “Type” means a distinct word (of which each token is an instance)
• The man saw the girl with the telescope
  • 8 tokens, 6 types (see the sketch below)
• “Type” may refer to a lexeme, or to an individual word form
  • run, runs, ran, running: 1 or 4 types?
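A minimal sketch of the count for the example sentence, treating types as distinct word forms (case-folded); collapsing run/runs/ran/running into one lexeme would need lemmatisation, which is not shown here.

```python
# Count tokens and types for the example sentence above.
sentence = "The man saw the girl with the telescope"
tokens = sentence.lower().split()   # individual occurrences
types = set(tokens)                 # distinct word forms
print(len(tokens), "tokens,", len(types), "types")  # 8 tokens, 6 types
```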
Some attempts to base corpus size on known statistics of existing corpora
• Biber (1993): “reliable information” on frequently occurring linguistic items such as nouns can be obtained from a 120,000-word sample, while an infrequently occurring construction such as the conditional clause would need 2.4 million words
• How are such figures arrived at?
  • Observe the point at which the measures stabilise (see the sketch below)
  • Also: how much data can a lexicographer absorb?
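The “observe where the measure stabilises” idea can be sketched as follows: track the relative frequency of an item over growing samples and note when successive estimates stop changing much. This is a much-simplified illustration of the general idea, not a reconstruction of Biber’s (1993) method; the step size and tolerance are arbitrary illustrative values.

```python
# Simplified sketch: find the sample size at which an item's rate per 1,000 words
# stops changing by more than a chosen relative tolerance between growing samples.
def stabilisation_point(tokens, item, step=10_000, tol=0.05):
    """Return the sample size at which the estimate stabilises, or None."""
    prev_rate = None
    for n in range(step, len(tokens) + 1, step):
        rate = 1000 * tokens[:n].count(item) / n
        if prev_rate and abs(rate - prev_rate) / prev_rate < tol:
            return n
        prev_rate = rate
    return None  # never stabilised within this corpus
```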