Introduction : corpora, corpus use, and the British National Corpus Dr. Ylva Berglund Prytz ylva.berglund@oucs.ox.ac.uk http://www.natcorp.ox.ac.uk/
Outline • Presentation: Corpora, corpus use, and the BNC • Demonstration: How to use BNC with Xaira • Hands-on: BNC with Xaira • Presentation: Using the BNC for teaching and research • More hands-on: exploring more • Questions and answers
At the end of today you should • have a basic working knowledge of • corpora and corpus use • the BNC • Xaira • feel confident using Xaira • be able to explore the area on your own • know where to turn for help and advice
Approaches to linguistic study • Intuition: “feel” what is right/wrong/possible; one person’s language; subjective • Study of usage: examine what is actually said/written; several people; objective
How do you study usage? • Need a sample of language, produced by different people in various contexts • Examine naturally occurring language • Draw conclusions • Find a corpus!
What is a corpus? • A collection of naturally occurring language data compiled to mirror a language/language variety • (Usually) computer-readable • (Usually) contains more than text (annotation, meta-data)
What is a corpus? – some definitions • A corpus is a collection of naturally-occurring language text, chosen to characterise a state or variety of language. (Sinclair 1991: 171) • A corpus can be defined as a collection of texts assumed to be representative of a given language. (Tognini-Bonelli 2001: 2) • All the material included in a corpus, whether spoken, written […] is assumed to be taken from genuine communications of people going about their normal business. (ibid: 55)
How can a corpus help? • Look for patterns to see regularities • Quantify • See several examples • Real language – language in use • Based on a variety of sources
Types of corpora • Balanced corpora (= Reference or general corpora) • Specialised corpora • Genre-specific, LSP (e.g. English for Academic Purposes) … • Varieties (dialectal, social, historical) • Learner language, English as a Lingua Franca • Multilingual corpora • Parallel corpora (translations; alignable) • Comparable corpora (similar texts) • Fixed size / monitor corpora • Mode and medium • Written, spoken and transcribed, spoken with audio, video
Famous corpora • Brown family (Brown, LOB, FLOB) • 1 million words, different text categories • Bank of English • Monitor corpus, grows with time • International Corpus of English (ICE) • Different national varieties of English. 1 million words written and spoken • British National Corpus • Reference corpus, fixed, 100 million words, written and spoken
What is the BNC? • A snapshot of British English, taken at the end of the 20th century • 100 million words in approx 4,000 different text samples, both spoken (10%) and written (90%) • Synchronic (1960-93), sampled, general purpose corpus • Available under licence; latest edition is BNC XML edition (March 2007)
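The figures on this slide imply rough component sizes, which matter when counts from speech and writing are compared. The following is a minimal sketch in Python using only the approximate figures above. The raw hit counts are invented for illustration, and per-million normalisation is a common convention assumed here, not something the slide prescribes.

```python
# Approximate figures taken from the slide (not exact corpus counts).
TOTAL_WORDS = 100_000_000
TEXT_SAMPLES = 4_000
SPOKEN_SHARE = 0.10

spoken_words = round(TOTAL_WORDS * SPOKEN_SHARE)
written_words = TOTAL_WORDS - spoken_words
mean_words_per_text = TOTAL_WORDS / TEXT_SAMPLES


def per_million(hits: int, component_size: int) -> float:
    """Normalise a raw hit count to occurrences per million words."""
    return hits / component_size * 1_000_000


# Hypothetical raw counts, invented for illustration only.
spoken_rate = per_million(500, spoken_words)
written_rate = per_million(900, written_words)

print(spoken_words, written_words, mean_words_per_text)
print(spoken_rate, written_rate)
```

Because the written component is about nine times the size of the spoken one, raw counts from the two parts are not directly comparable; a normalised rate is.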
More than text • Metadata • About text, author/speaker, audience • Structural & typographical information • Paragraph, sentence, heading, list, bolds • Extra-linguistic information • Voice quality, noise, pauses, overlap • Linguistic information • Part-of-speech
Who produced the BNC and why? • a consortium of dictionary publishers and academic researchers • OUP, Longman, Chambers • OUCS, UCREL, BL R&D • with funding from DTI/SERC under JFIT 1990-1994 • Lexicographers, NLP researchers • But not language teachers!
Stated Project Goals • A synchronic (1990-4) corpus of samples both spoken and written from the full range of British English language production • of non-opportunistic design, for generic applicability • with word class annotation • and contextual information
Actual (?) project goals • Better ELT dictionaries • authoritative • both speech and writing • A model for European corpus work • design, and encoding • Industrial-academic co-operation • A REALLY BIG corpus
Production of the BNC • took three years (at least) • cost GBP 1.6 million (at least) • came about through an unusual coincidence of interests amongst: • Lexicographical publishers • Government (DTI) • Science and Engineering Research Council (SERC)
Project consequences • The BNC looks back to Brown and LOB in its design and markup, and forward to the Web in its scope and indeterminacy • industrial-scale text production system • necessary compromises? • technically over-ambitious? • IPR and profitability
How was the corpus created? • Corpus design • Text selection • Clearance • Capture • Add additional information • Merge • (documentation) • Distribution
The BNC “sausage machine” • Selection, clearance, and capture: written (OUP/Chambers), spoken (Longman) • OUP: enrichment and encoding • Initial CDIF conversion and validation (OUCS) • Word class annotation (UCREL) • Header generation and final validation (OUCS) • Documentation, distribution, maintenance
Text selection • Design criteria • Types of texts • Sources • Number of samples • Size of samples • Descriptive criteria • Additional information where available
Selection criteria: written texts • Domain: imaginative (c. 25%), informative • Medium: books, periodicals, misc. published, unpublished, written to be spoken • Time: 1985-1993 (1960-75, 1975-84)
“Descriptive” criteria: written texts • Sample size (number of words) and extent (start and end points) • Topic or subject of the text • Author's name, age, gender, region of origin, and domicile • Target age group and gender • "Level" of writing (reading difficulty) : the more literary or technical a text, the "higher" its level
Selection criteria: spoken texts • Demographic (spoken conversation) • transcriptions of spontaneous natural conversations made by recruited volunteers • original recordings are available from the British Library • Context-governed (other spoken material) • transcriptions of recordings made at specific types of meeting and event
Spoken texts: context-governed Four broad categories of social context: • Educational and informative events, such as lectures, news broadcasts, classroom discussion, tutorials • Business events such as sales demonstrations, trades union meetings, consultations, interviews • Institutional and public events, such as sermons, political speeches, council meetings • Leisure events, such as sports commentaries, after-dinner speeches, club meetings, radio phone-ins
Descriptive criteria: spoken texts • Features relating to the speaker (age, sex, social class, dialect) • Context of recording (place, time) • Features of the recording (non-verbal events, paralinguistic phenomena, unclear instances) • Included when known • Sometimes provided by respondent
What is the BNC? • 4,000+ texts • Ca. 100,000,000 words • 10% spoken • Information about • the texts • the speakers/writers • the words • Delivered with a search tool: XAIRA
Format
<corpus>
  <corpusHeader></corpusHeader>      Corpus header (1)
  <corpusText>                       Corpus texts (4,000+)
    <textHeader></textHeader>        Text header
    <text></text>                    Text
  </corpusText>
  <corpusText>
    <textHeader></textHeader>
    <text></text>
  </corpusText>
  …
</corpus>
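The header-plus-texts layout above can be walked with any XML parser. The sketch below uses Python's standard `xml.etree.ElementTree` on a toy document. The element names are the schematic ones from this slide and may differ from the element names in the released files, and the header and text contents are invented placeholders.

```python
import xml.etree.ElementTree as ET

# Toy document following the schematic on the slide. Element names are the
# slide's; the contents are invented placeholders, not BNC data.
SCHEMATIC = """
<corpus>
  <corpusHeader>about the whole corpus</corpusHeader>
  <corpusText>
    <textHeader>about text A</textHeader>
    <text>first sample text</text>
  </corpusText>
  <corpusText>
    <textHeader>about text B</textHeader>
    <text>second sample text</text>
  </corpusText>
</corpus>
"""

root = ET.fromstring(SCHEMATIC.strip())

# One header describes the corpus as a whole.
corpus_header = root.findtext("corpusHeader")

# Each corpus text pairs its own header with the text itself.
texts = [
    (ct.findtext("textHeader"), ct.findtext("text"))
    for ct in root.findall("corpusText")
]

print(corpus_header)
for header, body in texts:
    print(header, "->", body)
```

Keeping each text's header next to its text is what lets a search be restricted by text-level information such as medium or date.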
Annotation, encoding, markup • A means of making explicit, and thus processable: • structure • texts, sections, paragraphs, turns, sentences, words... • metadata • text-type, situational parameters, context • analysis • morphology, syntactic function, translation
Word class annotation • CLAWS (Leech, Garside et al.) approach • What counts as a word? This isn't prima facie obvious, in spite of spelling conventions. • In BNC-XML, each word is explicitly marked and annotated with • a root form or lemma • an automatically assigned C5 word class code • a simplified POS code
Example: word class annotation <s n="11"><w c5="NN1" hw="difficulty" pos="SUBST">Difficulty </w><w c5="VBZ" hw="be" pos="VERB">is </w><w c5="VBG" hw="be" pos="VERB">being </w><w c5="VVN" hw="express" pos="VERB">expressed </w><w c5="PRP" hw="with" pos="PREP">with </w><w c5="AT0" hw="the" pos="ART">the </w><w c5="NN1" hw="method" pos="SUBST">method </w><w c5="TO0" hw="to" pos="PREP">to </w><w c5="VBI" hw="be" pos="VERB">be </w><w c5="VVN" hw="use" pos="VERB">used </w><w c5="TO0" hw="to" pos="PREP">to </w><w c5="VVI" hw="launch" pos="VERB">launch </w><w c5="AT0" hw="the" pos="ART">the </w><w c5="NN1" hw="scheme" pos="SUBST">scheme</w><c c5="PUN">.</c></s>
c5 = detailed part-of-speech • hw = head word (new) • pos = simple part-of-speech (new)
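To make the three attributes concrete, here is a minimal sketch that reads the annotated sentence from the example slide with Python's standard `xml.etree.ElementTree`. The function name `read_tokens` is made up for this illustration; it is not part of Xaira or of any BNC tool.

```python
import xml.etree.ElementTree as ET

# The annotated sentence from the example slide, copied as one <s> element.
SENTENCE = (
    '<s n="11">'
    '<w c5="NN1" hw="difficulty" pos="SUBST">Difficulty </w>'
    '<w c5="VBZ" hw="be" pos="VERB">is </w>'
    '<w c5="VBG" hw="be" pos="VERB">being </w>'
    '<w c5="VVN" hw="express" pos="VERB">expressed </w>'
    '<w c5="PRP" hw="with" pos="PREP">with </w>'
    '<w c5="AT0" hw="the" pos="ART">the </w>'
    '<w c5="NN1" hw="method" pos="SUBST">method </w>'
    '<w c5="TO0" hw="to" pos="PREP">to </w>'
    '<w c5="VBI" hw="be" pos="VERB">be </w>'
    '<w c5="VVN" hw="use" pos="VERB">used </w>'
    '<w c5="TO0" hw="to" pos="PREP">to </w>'
    '<w c5="VVI" hw="launch" pos="VERB">launch </w>'
    '<w c5="AT0" hw="the" pos="ART">the </w>'
    '<w c5="NN1" hw="scheme" pos="SUBST">scheme</w>'
    '<c c5="PUN">.</c>'
    '</s>'
)


def read_tokens(sentence_xml):
    """Return (form, c5, hw, pos) for every <w> and <c> inside one <s>."""
    sentence = ET.fromstring(sentence_xml)
    tokens = []
    # iter() also descends into nested elements such as <mw>.
    for el in sentence.iter():
        if el.tag in ("w", "c"):
            form = (el.text or "").strip()
            tokens.append((form, el.get("c5"), el.get("hw"), el.get("pos")))
    return tokens


tokens = read_tokens(SENTENCE)
verbs = [form for form, _, _, pos in tokens if pos == "VERB"]
lemmas = [hw for _, _, hw, _ in tokens if hw is not None]

for token in tokens:
    print(token)
```

Punctuation (`<c>`) carries only a `c5` code, so its `hw` and `pos` come back as `None`.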
Some BNC-XML elements • <wtext> = written text or <stext> = spoken text • <div> = section • <p> = paragraph or <u> = utterance • <s> = “sentence” • <w> = word and <c> = punctuation • <mw> = multiword unit
What is the markup for? • It makes it possible for you to • distinguish aids=SUBST from aids=VERB • distinguish occurrences in writing from ones in speech • distinguish occurrences in headings from ones in paragraphs • identify contextual units like sentences and paragraphs • FACTSHEET WHAT IS AIDS? • AIDS (Acquired Immune Deficiency Syndrome) is a condition caused by a virus called HIV (Human Immuno Deficiency Virus).
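The first point, telling aids=SUBST apart from aids=VERB, can be sketched as a count of one word form grouped by its `pos` attribute. The two sentences below are invented for this illustration and are not taken from the BNC; the tags follow the C5 scheme as I understand it, and `count_by_pos` is a made-up helper name.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Invented sample in BNC-XML style (NOT real BNC data): the same word form
# "aids" occurs once as a noun and once as a verb.
SAMPLE = (
    '<p>'
    '<s n="1"><w c5="AT0" hw="the" pos="ART">The </w>'
    '<w c5="NN2" hw="aid" pos="SUBST">aids </w>'
    '<w c5="VVD" hw="arrive" pos="VERB">arrived</w><c c5="PUN">.</c></s>'
    '<s n="2"><w c5="PNP" hw="it" pos="PRON">It </w>'
    '<w c5="VVZ" hw="aid" pos="VERB">aids </w>'
    '<w c5="NN1" hw="recovery" pos="SUBST">recovery</w><c c5="PUN">.</c></s>'
    '</p>'
)


def count_by_pos(xml_text, form):
    """Count occurrences of a word form, grouped by simplified POS code."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for w in root.iter("w"):
        if (w.text or "").strip().lower() == form.lower():
            counts[w.get("pos")] += 1
    return counts


aids_counts = count_by_pos(SAMPLE, "aids")
print(aids_counts)
```

A plain string search would return both hits together; the `pos` attribute is what separates them.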
Who uses the BNC (and how?) • Linguists • Research on (English) language • Teachers • Reference, Generate teaching materials, In classroom • Publishers • Dictionaries, EFL text books • Language engineers • Language + computer tools, AI, NLP • Students/language learners • Computer scientists • Information retrieval • Psychologists/neurologists • General ‘norm’ or reference Lexicographers NLP researchers
What makes the BNC so special? • Size • Design • General availability • Standardized markup system • Structural annotation • Word class annotation • Contextual information • Model for other projects • In these respects, the BNC remains distinctive, twenty years on!
The BNC can be used in different ways and with different tools • User needs to know • What information is available • Where/how information is coded • XAIRA can help
Search for • Words or phrases • Word class information • Annotation/mark-up • or a combination of them
Display • Search term with context • with or without mark-up • Information about text • Collocations (co-occurring words) • Distribution across parts of the corpus • and much more
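Two of the displays listed here, a search term shown with its context and a list of co-occurring words, can be sketched in a few lines of plain Python. This is an illustration of the idea only, not a description of how Xaira works internally. The function names and the window size of three tokens are assumptions, and the sample is the example sentence from the word class annotation slide, split on spaces.

```python
from collections import Counter

# Example sentence from the word class annotation slide, pre-tokenised.
TEXT = ("Difficulty is being expressed with the method "
        "to be used to launch the scheme .")


def kwic(tokens, node, window=3):
    """Return (left context, hit, right context) for each match of node."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == node.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, tok, right))
    return lines


def collocates(tokens, node, window=3):
    """Count the words found within `window` tokens of each match of node."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok.lower() == node.lower():
            neighbours = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            counts.update(t.lower() for t in neighbours)
    return counts


tokens = TEXT.split()
hits = kwic(tokens, "to")
coll = collocates(tokens, "to")

# Align the search term in a column, as a concordance display does.
for left, hit, right in hits:
    print(f"{left:>20}  {hit}  {right}")
print(coll.most_common(3))
```

On a corpus of realistic size, raw co-occurrence counts like these are dominated by very frequent words, which is one reason dedicated tools offer further options.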
XAIRA – XML-aware retrieval application • Searches an index of the corpus • Uses information in the headers and the texts • Often more than one way to make a search • Can be used with other corpora (if they are indexed first)