Paul Thompson Applied Linguistics (p.a.thompson@reading.ac.uk)


Presentation Transcript


  1. Corpora: Resources for the study of language Paul Thompson Applied Linguistics (p.a.thompson@reading.ac.uk)

  2. British Academic Spoken English corpus (BASE)
     • 160 lectures, 39 seminars
     • Transcripts, video and audio
     • 199 XML files:
     • Transcripts with detailed annotation
     • Metadata included in header
     • 160 lecture transcripts are tagged for Part-of-Speech
     • www.reading.ac.uk/AcaDepts/ll/base_corpus/
     • Funded by AHRB, Euralex, BALEAP and university sources

  3. British Academic Written English corpus (BAWE)
     • A corpus of assessed student writing at university level
     • Texts collected at Warwick, Reading and Oxford Brookes University
     • Funded by the Economic and Social Research Council (ESRC), grant RES-000-23-0800

  4. BAWE figures
     • 6.5 million words
     • 2,896 texts
     • 2,761 assignments
     • XML files, POS-tagged
     • 30+ disciplines
     • 4 levels of study

  5. Query interface: Sketch Engine
     • Commercial service: Applied Linguistics pays annual subscription

  6. BAWE: it BE ADJ that (e.g. ‘it is important that’)
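The pattern on this slide (the word "it", a form of BE, an adjective, then "that") can be illustrated with a small search over POS-tagged tokens. This is only a sketch, not the Sketch Engine query used in the talk: the function name `find_it_be_adj_that`, the list of BE forms, and the Penn-style tags (`JJ` for adjectives) are assumptions for illustration, and the tagset actually used in BAWE may differ.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word, POS tag); Penn-style tags are assumed here

# Assumed, non-exhaustive set of BE forms for the illustration.
BE_FORMS = {"be", "is", "are", "was", "were", "been", "being"}


def find_it_be_adj_that(tokens: List[Token]) -> List[str]:
    """Return every four-token sequence matching: it + BE + adjective + that."""
    hits = []
    for i in range(len(tokens) - 3):
        (w1, _), (w2, _), (w3, t3), (w4, _) = tokens[i:i + 4]
        if (
            w1.lower() == "it"
            and w2.lower() in BE_FORMS
            and t3.startswith("JJ")  # adjective tag in the assumed tagset
            and w4.lower() == "that"
        ):
            hits.append(" ".join(word for word, _ in tokens[i:i + 4]))
    return hits


# Hypothetical tagged sample, not taken from BAWE.
sample = [
    ("It", "PRP"), ("is", "VBZ"), ("important", "JJ"), ("that", "IN"),
    ("students", "NNS"), ("revise", "VBP"), (".", "."),
    ("It", "PRP"), ("was", "VBD"), ("clear", "JJ"), ("that", "IN"),
    ("he", "PRP"), ("left", "VBD"), (".", "."),
]

for hit in find_it_be_adj_that(sample):
    print(hit)
```

A corpus query tool would normally express this as a single query over word, lemma and tag attributes; the loop above only makes the matching logic explicit.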

  7. Further possibilities
     • BASE: Linking audio and video to the transcripts, either online or on hard drives
     • Insertion of timestamp data into transcripts
     • Example
     • Why?
     • Access to temporal, spatial, paralinguistic, phonological information
     • Studies of speech rate, for example
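To show why timestamps in transcripts would matter for speech-rate studies, here is a minimal sketch that computes words per minute from timestamped segments. The segment layout (start time, end time, text) and the function name `words_per_minute` are assumptions for illustration; the BASE transcripts do not necessarily store timing in this form.

```python
from typing import List, Tuple

# Assumed segment layout: (start_seconds, end_seconds, transcribed text)
Segment = Tuple[float, float, str]


def words_per_minute(segments: List[Segment]) -> float:
    """Speech rate over the timed stretches only; gaps between segments are ignored."""
    total_words = sum(len(text.split()) for _, _, text in segments)
    total_seconds = sum(end - start for start, end, _ in segments)
    if total_seconds <= 0:
        raise ValueError("segments must cover a positive duration")
    return total_words / total_seconds * 60.0


# Hypothetical timestamped segments, not taken from BASE.
segments = [
    (0.0, 4.0, "so today we are going to look at corpora"),
    (4.5, 8.5, "a corpus is a collection of texts"),
]

print(f"{words_per_minute(segments):.1f} words per minute")
```

Whether pauses between segments should count towards the duration is a design choice; this sketch leaves them out, which gives an articulation-style rate rather than an overall speaking rate.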

  8. Uses of corpora
     • Comparison between languages
     • Historical linguistics
     • Stylistics
     • Studies of language in use
     • Specialised language use [e.g. doctor-patient interactions]
     • Investigations of multimodality

  9. Projects in mind
     • PhD thesis corpus
     • Electronic submission
     • Academic speech events
     • Seminars, tutorials, etc.
     • Student use of computers in preparing assignments [video and text]
     • Reading and writing of undergraduates

  10. Desiderata
     • Hosting corpus resources at Reading or another university – preferably on Linux servers – with customisable interfaces
     • BASE, BAWE, and other corpora that Reading possesses
     • For use by all departments at Reading and also elsewhere
     • Varied levels of user access
     • Centralised support needed – lack of continuity with project staff
