
Introduction: corpora, corpus use, and the British National Corpus

Dr. Ylva Berglund Prytz • ylva.berglund@oucs.ox.ac.uk • http://www.natcorp.ox.ac.uk/


Presentation Transcript


  1. Introduction: corpora, corpus use, and the British National Corpus • Dr. Ylva Berglund Prytz • ylva.berglund@oucs.ox.ac.uk • http://www.natcorp.ox.ac.uk/

  2. Outline • Presentation: Corpora, corpus use, and the BNC • Demonstration: How to use BNC with Xaira • Hands-on: BNC with Xaira • Presentation: Using the BNC for teaching and research • More hands-on: exploring more • Questions and answers

  3. At the end of today you should • have a basic working knowledge of • corpora and corpus use • the BNC • Xaira • feel confident using Xaira • be able to explore the area on your own • know where to turn for help and advice

  4. Approaches to linguistic study • Intuition: “feel” what is right/wrong/possible; one person’s language; subjective • Study of usage: examine what is actually said/written; several people; objective

  5. How do you study usage? • Need a sample of language, produced by different people in various contexts • Examine naturally occurring language • Draw conclusions • Find a corpus!

  6. What is a corpus? • A collection of naturally occurring language data compiled to mirror a language/language variety • (Usually) computer-readable • (Usually) contains more than text (annotation, meta-data)

  7. What is a corpus? – some definitions A corpus is a collection of naturally-occurring language text, chosen to characterise a state or variety of language. (Sinclair 1991: 171) A corpus can be defined as a collection of texts assumed to be representative of a given language. (Tognini-Bonelli 2001: 2) All the material included in a corpus, whether spoken, written […] is assumed to be taken from genuine communications of people going about their normal business. (ibid: 55)

  8. How can a corpus help? • Look for patterns to see regularities • Quantify • See several examples • Real language – language in use • Based on a variety of sources
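The "quantify" point can be illustrated with a minimal Python sketch. The three sample sentences and the function names are invented for illustration; a real study would read its texts from a corpus such as the BNC.

```python
import re
from collections import Counter

# Hypothetical mini-sample standing in for corpus texts.
SAMPLE_TEXTS = [
    "The cat sat on the mat.",
    "A dog sat by the door.",
    "The dog and the cat sat together.",
]

def tokenize(text):
    """Lower-case the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def word_frequencies(texts):
    """Count how often each word form occurs across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts

frequencies = word_frequencies(SAMPLE_TEXTS)
# Print the two most frequent word forms with their counts.
for word, count in frequencies.most_common(2):
    print(word, count)
```

Counting over several texts rather than one is what turns individual examples into a pattern that can be compared across sources.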

  9. Types of corpora • Balanced corpora (= Reference or general corpora) • Specialised corpora • Genre-specific, LSP (e.g. English for Academic Purposes) … • Varieties (dialectal, social, historical) • Learner language, English as a Lingua Franca • Multilingual corpora • Parallel corpora (translations; alignable) • Comparable corpora (similar texts) • Fixed size / monitor corpora • Mode and medium • Written, spoken and transcribed, spoken with audio, video

  10. Famous corpora • Brown family (Brown, LOB, FLOB) • 1 million words, different text categories • Bank of English • Monitor corpus, grows with time • International Corpus of English (ICE) • Different national varieties of English. 1 million words written and spoken • British National Corpus • Reference corpus, fixed, 100 million words, written and spoken

  11. British National Corpus (BNC)

  12. What is the BNC? • A snapshot of British English, taken at the end of the 20th century • 100 million words in approx 4,000 different text samples, both spoken (10%) and written (90%) • Synchronic (1960-93), sampled, general purpose corpus • Available under licence; latest edition is BNC XML edition (March 2007)

  13. More than text • Metadata • About text, author/speaker, audience • Structural & typographical information • Paragraph, sentence, heading, list, bolds • Extra-linguistic information • Voice quality, noise, pauses, overlap • Linguistic information • Part-of-speech

  14. Who produced the BNC and why? • a consortium of dictionary publishers and academic researchers • OUP, Longman, Chambers • OUCS, UCREL, BL R&D • with funding from DTI/SERC under JFIT 1990-1994 • Lexicographers, NLP researchers • But not language teachers!

  15. Stated Project Goals • A synchronic (1990-4) corpus of samples both spoken and written from the full range of British English language production • of non-opportunistic design, for generic applicability • with word class annotation • and contextual information

  16. Actual (?) project goals • Better ELT dictionaries • authoritative • both speech and writing • A model for European corpus work • design, and encoding • Industrial-academic co-operation • A REALLY BIG corpus

  17. Production of the BNC • took three years (at least) • cost GBP 1.6 million (at least) • came about through an unusual coincidence of interests amongst: • Lexicographical publishers • Government (DTI) • Science and Engineering Research Council (SERC)

  18. Project consequences The BNC looks back to Brown and LOB in its design and markup, and forward to the Web in its scope and indeterminacy • industrial-scale text production system • necessary compromises? • technically over-ambitious? • IPR and profitability

  19. How was the corpus created?

  20. How was the corpus created? • Corpus design • Text selection • Clearance • Capture • Add additional information • Merge • (documentation) • Distribution

  21. The BNC “sausage machine” • Selection, clearance, and capture • Written (OUP/Chambers) • Spoken (Longman) • OUP • Enrichment and encoding • Initial CDIF Conversion and Validation (OUCS) • Word Class Annotation (UCREL) • Header generation and final validation (OUCS) • Documentation, distribution, maintenance

  22. Text selection • Design criteria • Types of texts • Sources • Number of samples • Size of samples • Descriptive criteria • Additional information where available

  23. Selection criteria: written texts • Domain: imaginative (c. 25%), informative • Medium: book, periodicals, misc. published, unpublished, written to be spoken • Time: 1985-1993 (1960-75, 1975-84)

  24. “Descriptive” criteria: written texts • Sample size (number of words) and extent (start and end points) • Topic or subject of the text • Author's name, age, gender, region of origin, and domicile • Target age group and gender • "Level" of writing (reading difficulty) : the more literary or technical a text, the "higher" its level

  25. Selection criteria: spoken texts • Demographic (spoken conversation) • transcriptions of spontaneous natural conversations made by recruited volunteers • original recordings are available from the British Library • Context-governed (other spoken material) • transcriptions of recordings made at specific types of meeting and event

  26. Spoken texts: context-governed Four broad categories of social context: • Educational and informative events, such as lectures, news broadcasts, classroom discussion, tutorials • Business events such as sales demonstrations, trades union meetings, consultations, interviews • Institutional and public events, such as sermons, political speeches, council meetings • Leisure events, such as sports commentaries, after-dinner speeches, club meetings, radio phone-ins

  27. Descriptive criteria: spoken texts • Features relating to the speaker (age, sex, social class, dialect) • Context of recording (place, time) • Features of the recording (non-verbal events, paralinguistic phenomena, unclear instances) • Included when known • Sometimes provided by respondent

  28. What is the result?

  29. What is the BNC? • 4,000+ texts • Ca. 100,000,000 words • 10% spoken • Information about • the texts • the speakers/writers • the words • Delivered with a search tool: XAIRA

  30. What's in the BNC?

  31. What topics?

  32. Post-hoc text-type classification

  33. Format • Corpus header (1) • Corpus texts (4,000+), each consisting of a text header and a text • Schematic structure: <corpus> <corpusHeader></corpusHeader> <corpusText> <textHeader></textHeader> <text></text> </corpusText> <corpusText> <textHeader></textHeader> <text></text> </corpusText> … </corpus>
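The nesting on this slide can be checked mechanically. The sketch below parses a two-text skeleton with Python's standard `xml.etree.ElementTree`. The element names are taken from the slide's schematic and may differ from the element names in the distributed BNC files; `summarise` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Skeleton following the slide's schematic, with two empty placeholder texts.
SKELETON = """
<corpus>
  <corpusHeader></corpusHeader>
  <corpusText><textHeader></textHeader><text></text></corpusText>
  <corpusText><textHeader></textHeader><text></text></corpusText>
</corpus>
"""

def summarise(corpus_xml):
    """Count corpus headers and texts, and check each text has a header and a body."""
    root = ET.fromstring(corpus_xml.strip())
    texts = root.findall("corpusText")
    complete = all(
        t.find("textHeader") is not None and t.find("text") is not None
        for t in texts
    )
    return {
        "corpus_headers": len(root.findall("corpusHeader")),
        "texts": len(texts),
        "every_text_complete": complete,
    }

summary = summarise(SKELETON)
print(summary)
```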

  34. Annotation, encoding, markup • A means of making explicit, and thus processable: • structure • texts, sections, paragraphs, turns, sentences, words... • metadata • text-type, situational parameters, context • analysis • morphology, syntactic function, translation

  35. Word class annotation • CLAWS (Leech, Garside et al.) approach • What counts as a word? This isn't prima facie obvious, in spite of spelling conventions. • In BNC-XML, each word is explicitly marked and annotated with • a root form or lemma • an automatically assigned C5 word class code • a simplified POS code

  36. Example: word class annotation <s n="11"><w c5="NN1" hw="difficulty" pos="SUBST">Difficulty </w><w c5="VBZ" hw="be" pos="VERB">is </w><w c5="VBG" hw="be" pos="VERB">being </w><w c5="VVN" hw="express" pos="VERB">expressed </w><w c5="PRP" hw="with" pos="PREP">with </w><w c5="AT0" hw="the" pos="ART">the </w><w c5="NN1" hw="method" pos="SUBST">method </w><w c5="TO0" hw="to" pos="PREP">to </w><w c5="VBI" hw="be" pos="VERB">be </w><w c5="VVN" hw="use" pos="VERB">used </w><w c5="TO0" hw="to" pos="PREP">to </w><w c5="VVI" hw="launch" pos="VERB">launch </w><w c5="AT0" hw="the" pos="ART">the </w><w c5="NN1" hw="scheme" pos="SUBST">scheme</w><c c5="PUN">.</c></s>

  37. <s n="11"><w c5="NN1" hw="difficulty" pos="SUBST">Difficulty </w><w c5="VBZ" hw="be" pos="VERB">is </w><w c5="VBG" hw="be" pos="VERB">being </w><w c5="VVN" hw="express" pos="VERB">expressed </w><w c5="PRP" hw="with" pos="PREP">with </w><w c5="AT0" hw="the" pos="ART">the </w><w c5="NN1" hw="method" pos="SUBST">method </w><w c5="TO0" hw="to" pos="PREP">to </w><w c5="VBI" hw="be" pos="VERB">be </w><w c5="VVN" hw="use" pos="VERB">used </w><w c5="TO0" hw="to" pos="PREP">to </w><w c5="VVI" hw="launch" pos="VERB">launch </w><w c5="AT0" hw="the" pos="ART">the </w><w c5="NN1" hw="scheme" pos="SUBST">scheme</w><c c5="PUN">.</c></s> • c5 = detailed part-of-speech • hw = head word (new) • pos = simple part-of-speech (new)
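As a rough illustration of how these attributes can be read by a program, the sketch below parses an abridged copy of the sentence (the first four words plus the final punctuation mark) with Python's standard library. `read_tokens` is a hypothetical helper and the dictionary keys are illustrative choices; this is not how Xaira itself works.

```python
import xml.etree.ElementTree as ET

# Abridged from the slide: first four words and the closing punctuation.
SENTENCE = (
    '<s n="11">'
    '<w c5="NN1" hw="difficulty" pos="SUBST">Difficulty </w>'
    '<w c5="VBZ" hw="be" pos="VERB">is </w>'
    '<w c5="VBG" hw="be" pos="VERB">being </w>'
    '<w c5="VVN" hw="express" pos="VERB">expressed </w>'
    '<c c5="PUN">.</c>'
    '</s>'
)

def read_tokens(sentence_xml):
    """Return one dict per <w> or <c> element, in document order."""
    sentence = ET.fromstring(sentence_xml)
    tokens = []
    for element in sentence.iter():
        if element.tag in ("w", "c"):
            tokens.append({
                "form": (element.text or "").strip(),
                "lemma": element.get("hw"),   # None for punctuation
                "c5": element.get("c5"),
                "pos": element.get("pos"),    # None for punctuation
            })
    return tokens

tokens = read_tokens(SENTENCE)
for token in tokens:
    print(token["form"], token["lemma"], token["c5"], token["pos"])
```

Because "is" and "being" share the head word "be", a search on `hw` retrieves both forms at once, which a search on the surface form cannot do.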

  38. Some BNC-XML elements • <wtext> or <stext> • <div> = section • <p> = paragraph or <u> = utterance • <s> = “sentence” • <w> = word and <c> = punctuation • <mw> = multiword unit

  39. What is the markup for? • It makes it possible for you to • distinguish aids=SUBST from aids=VERB • distinguish occurrences in writing from ones in speech • distinguish occurrences in headings from ones in paragraphs • identify contextual units like sentences and paragraphs • Example (heading, then paragraph): • FACTSHEET WHAT IS AIDS? • AIDS (Acquired Immune Deficiency Syndrome) is a condition caused by a virus called HIV (Human Immuno Deficiency Virus).
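A small sketch of the first and third distinctions, under stated assumptions: the XML sample below is invented for illustration (it is not taken from the BNC), and the use of a `<head>` element for headings is an assumption, since the element list on the previous slide does not mention it. `find_word` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Invented sample: "aids" as a noun in a heading and a paragraph, and as a verb.
SAMPLE = """
<wtext>
  <div>
    <head>
      <s n="1"><w c5="NN1" hw="aids" pos="SUBST">AIDS </w><w c5="NN1" hw="factsheet" pos="SUBST">factsheet</w></s>
    </head>
    <p>
      <s n="2"><w c5="NN1" hw="exercise" pos="SUBST">Exercise </w><w c5="VVZ" hw="aid" pos="VERB">aids </w><w c5="NN1" hw="recovery" pos="SUBST">recovery</w><c c5="PUN">.</c></s>
      <s n="3"><w c5="NN1" hw="aids" pos="SUBST">AIDS </w><w c5="VVZ" hw="spread" pos="VERB">spreads</w><c c5="PUN">.</c></s>
    </p>
  </div>
</wtext>
"""

def find_word(root, form, pos):
    """Return (sentence number, container tag) for each match of form + pos."""
    hits = []
    for container in root.iter():
        if container.tag not in ("head", "p"):
            continue
        for sentence in container.iter("s"):
            for word in sentence.iter("w"):
                same_form = (word.text or "").strip().lower() == form
                if same_form and word.get("pos") == pos:
                    hits.append((sentence.get("n"), container.tag))
    return hits

root = ET.fromstring(SAMPLE.strip())
noun_hits = find_word(root, "aids", "SUBST")
verb_hits = find_word(root, "aids", "VERB")
print(noun_hits)
print(verb_hits)
```

Without the `pos` attribute all three occurrences would look identical; without the structural elements the heading hit could not be separated from the paragraph hit.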

  40. Who uses the BNC (and how?) • Linguists • Research on (English) language • Teachers • Reference, generate teaching materials, in classroom • Publishers • Dictionaries, EFL text books • Language engineers • Language + computer tools, AI, NLP • Students/language learners • Computer scientists • Information retrieval • Psychologists/neurologists • General ‘norm’ or reference • Lexicographers • NLP researchers

  41. What makes the BNC so special? • Size • Design • General availability • Standardized markup system • Structural annotation • Word class annotation • Contextual information • Model for other projects • In these respects, the BNC remains distinctive, twenty years on!

  42. How to use the BNC (with Xaira)

  43. The BNC can be used in different ways and with different tools • User needs to know • What information is available • Where/how information is coded • XAIRA can help

  44. Search for • Words or phrases • Word class information • Annotation/mark-up • or a combination of them

  45. Display • Search term with context • with or without mark-up • Information about text • Collocations (co-occurring words) • Distribution across parts of the corpus • and much more
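Two of these displays, the search term in context and collocations, can be sketched in a few lines of Python. This is a generic illustration, not Xaira's implementation; the sample sentence, the function names and the window sizes are assumptions made for the example.

```python
from collections import Counter

# Invented token list standing in for a stretch of corpus text.
TOKENS = "strong tea is nice but strong coffee is better than weak tea".split()

def kwic(tokens, node, width=2):
    """Key word in context: (left context, match, right context) per hit."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines

def collocates(tokens, node, window=2):
    """Count the words occurring within `window` tokens of each hit."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token.lower() == node:
            span = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            counts.update(word.lower() for word in span)
    return counts

concordance = kwic(TOKENS, "strong")
neighbours = collocates(TOKENS, "strong")
for left, match, right in concordance:
    print(f"{left:>12} [{match}] {right}")
print(neighbours.most_common(1))
```

Raw co-occurrence counts favour frequent function words such as "is"; corpus tools usually offer statistical measures on top of such counts to highlight the more informative collocates.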

  46. XAIRA – XML-aware retrieval application • Searches an index of the corpus • Uses information in the headers and the texts • Often more than one way to make a search • Can be used with other corpora (if they are indexed first)
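Why searching an index is faster than scanning the texts can be shown with a toy inverted index. This is a minimal sketch of the general technique only; Xaira's actual index format is not described in these slides, and the text identifiers and function names below are hypothetical.

```python
from collections import defaultdict

# Hypothetical mini-corpus: text identifier -> text.
TEXTS = {
    "T1": "the scheme was launched",
    "T2": "the method used to launch the scheme",
}

def build_index(texts):
    """Map each word form to the (text id, token position) pairs where it occurs."""
    index = defaultdict(list)
    for text_id, text in texts.items():
        for position, token in enumerate(text.lower().split()):
            index[token].append((text_id, position))
    return index

def lookup(index, word):
    """Return all recorded positions of a word without rereading the texts."""
    return index.get(word.lower(), [])

index = build_index(TEXTS)
print(lookup(index, "scheme"))
print(lookup(index, "corpus"))
```

The index is built once; each later query is a dictionary lookup, which is why a corpus has to be indexed before a tool of this kind can search it.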

  47. Introduction: corpora, corpus use, and the British National Corpus • Dr. Ylva Berglund Prytz • ylva.berglund@oucs.ox.ac.uk • http://www.natcorp.ox.ac.uk/
