Corpora: Resources for the study of language
Paul Thompson, Applied Linguistics (p.a.thompson@reading.ac.uk)
British Academic Spoken English corpus (BASE)
• 160 lectures, 39 seminars
• Transcripts, video and audio
• 199 XML files:
  • Transcripts with detailed annotation
  • Metadata included in header
• 160 lecture transcripts are tagged for part of speech
• www.reading.ac.uk/AcaDepts/ll/base_corpus/
• Funded by AHRB, Euralex, BALEAP and university sources
British Academic Written English corpus (BAWE)
• A corpus of assessed student writing at university level
• Texts collected at Warwick, Reading and Oxford Brookes universities
• Funded by the Economic and Social Research Council (ESRC), grant RES-000-23-0800
BAWE figures
• 6.5 million words
• 2,896 texts
• 2,761 assignments
• XML files, POS-tagged
• 30+ disciplines
• 4 levels of study
Query interface: Sketch Engine
• Commercial service: Applied Linguistics pays an annual subscription
Further possibilities
• BASE: Linking audio and video to the transcripts, either online or on hard drives
• Insertion of timestamp data into transcripts
• Example
• Why?
  • Access to temporal, spatial, paralinguistic, phonological information
  • Studies of speech rate, for example
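To illustrate why timestamps matter, a speech-rate study could be sketched roughly as below. This is only a minimal sketch: the `<u>` element and its `start`/`end` attributes (in seconds) are hypothetical markup invented for the example, not the actual BASE annotation scheme, and the sample text is made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: each utterance <u> carries start/end times in seconds.
# This is NOT the actual BASE schema, only an illustration.
SAMPLE = """
<transcript>
  <u who="lecturer" start="0.0" end="6.0">so today we are going to look at corpora</u>
  <u who="lecturer" start="6.0" end="12.0">and how they help us study language in use</u>
</transcript>
"""


def speech_rate_wpm(xml_text):
    """Return the overall speech rate in words per minute across timed utterances."""
    root = ET.fromstring(xml_text)
    total_words = 0
    total_seconds = 0.0
    for u in root.iter("u"):
        start = float(u.get("start"))
        end = float(u.get("end"))
        if end <= start:
            continue  # skip utterances with inconsistent timing
        # Count whitespace-separated tokens in the utterance text.
        total_words += len("".join(u.itertext()).split())
        total_seconds += end - start
    if total_seconds == 0:
        raise ValueError("no timed utterances found")
    return total_words * 60.0 / total_seconds


if __name__ == "__main__":
    print(round(speech_rate_wpm(SAMPLE), 1))
```

With real data, the same loop could be grouped by speaker or by discipline to compare rates, provided the transcripts carry reliable timing information.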
Uses of corpora
• Comparison between languages
• Historical linguistics
• Stylistics
• Studies of language in use
• Specialised language use (e.g. doctor-patient interactions)
• Investigations of multimodality
Projects in mind
• PhD thesis corpus
  • Electronic submission
• Academic speech events
  • Seminars, tutorials, etc.
• Student use of computers in preparing assignments (video and text)
• Reading and writing of undergraduates
Desiderata
• Hosting corpus resources at Reading or another university, preferably on Linux servers, with customisable interfaces
  • BASE, BAWE, and other corpora that Reading possesses
  • For use by all departments at Reading and also elsewhere
  • Varied levels of user access
• Centralised support needed, given the lack of continuity with project staff